pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	314a502eb0	Revert "Reland "[C10] PG observability hooks. (#108815 )" (#110907 )" This reverts commit `7678cd22af`. Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this `7678cd22af` ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))	2023-10-11 00:23:42 +00:00
Will Constable	7678cd22af	Reland "[C10] PG observability hooks. (#108815)" (#110907) This reverts commit `ff0358b038`. (original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below) Expose a set of observability hooks into C10D so that our users can detect collective failures both faster and more easily. The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread. This PR introduces a new module, torch.distributed.hooks, that exposes the following methods: register_collective_start_hook, register_collective_end_hook, register_process_group_hook. The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread, which reads events from a C++ queue and dispatches them. Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown and run it as a background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907 Approved by: https://github.com/fduwjj	2023-10-10 20:09:40 +00:00
Edward Z. Yang	de3ae93e9b	Include rank of default PG in C++ log messages (#110623 ) I tested by adding some warning logs in C++, run a distributed program and show that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging so I couldn't easily add a unit test. The implementation strategy is to setup a global variable in C++, and then poke it when we initialize a process group. This was the simplest thing I could think of that would work. This PR only works for non-glog logging. Probably need to come up with some other strategy for glog, e.g., a custom prefix, but need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, will leave as follow up work. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623 Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj	2023-10-10 00:26:52 +00:00
Kazuaki Ishizaki	b5f9696d81	Fix typo under torch directory (#110824) This PR fixes the typo `the the` in comments and exception messages in files under the `torch` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824 Approved by: https://github.com/H-Huang	2023-10-09 19:16:43 +00:00
PyTorch MergeBot	ff0358b038	Revert "[C10] PG observability hooks. (#108815 )" This reverts commit `0c7a877745`. Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))	2023-10-06 19:49:49 +00:00
Rodrigo Kumpera	0c7a877745	[C10] PG observability hooks. (#108815) Expose a set of observability hooks into C10D so that our users can detect collective failures both faster and more easily. The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread. This PR introduces a new module, torch.distributed.hooks, that exposes the following methods: register_collective_start_hook, register_collective_end_hook, register_process_group_hook. The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread, which reads events from a C++ queue and dispatches them. Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown and run it as a background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2023-10-06 18:52:46 +00:00
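The pipe-based wake-up described in this commit message can be illustrated with a minimal pure-Python sketch. It is not the PyTorch implementation: `EventDispatcher`, `post`, and `shutdown` are hypothetical names, and a `queue.Queue` stands in for the C++ event queue. The only point is to show how one byte written to a pipe can wake a background thread, and how a sentinel byte lets it exit cleanly at shutdown.

```python
import os
import queue
import threading


class EventDispatcher:
    """Hypothetical dispatcher: a background thread woken through a pipe."""

    def __init__(self, hook):
        self._hook = hook
        self._events = queue.Queue()  # stand-in for the C++ event queue
        self._read_fd, self._write_fd = os.pipe()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, event):
        # Enqueue first, then write one byte so the reader always finds an event.
        self._events.put(event)
        os.write(self._write_fd, b"e")

    def shutdown(self):
        # The pipe is FIFO, so every earlier event is dispatched before the quit byte.
        os.write(self._write_fd, b"q")
        self._thread.join()
        os.close(self._read_fd)
        os.close(self._write_fd)

    def _run(self):
        while True:
            token = os.read(self._read_fd, 1)  # blocks until a byte arrives
            if token == b"q":
                return
            self._hook(self._events.get())


seen = []
dispatcher = EventDispatcher(seen.append)
dispatcher.post("allreduce:start")
dispatcher.post("allreduce:end")
dispatcher.shutdown()
print(seen)
```

Because the hook runs on the background thread, a slow hook in this sketch delays later events but never blocks the thread that posts them.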
Howard Huang	0949d97c16	fix batch_isend_irecv example incorrect usage (#110408 ) mismatched dtypes silently leads to wrong outputs in nccl ``` 1:recv_tensor=tensor([0., 0.], device='cuda:1') 0:recv_tensor=tensor([2.8026e-45, 0.0000e+00], device='cuda:0') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/110408 Approved by: https://github.com/awgu, https://github.com/Neilblaze	2023-10-04 22:57:03 +00:00
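One plausible reading of the `2.8026e-45` value above is byte reinterpretation. The sketch below assumes, without confirmation from the commit message, that the sender held int64 values and the receiver read the same bytes into a two-element float32 buffer on a little-endian host. It uses only the standard `struct` module, not NCCL.

```python
import struct

# Assumption: int64 on the sending side, float32 buffer on the receiving side.
sent_bytes = struct.pack("<q", 2)            # the int64 value 2 as 8 raw bytes
received = struct.unpack("<2f", sent_bytes)  # the same 8 bytes read as two float32 values

# The first float32 has bit pattern 0x00000002, a tiny subnormal number;
# the second is all zero bits.
print(received)
```

No error is raised at any point, which is why a dtype mismatch of this kind can go unnoticed.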
Rohan Varma	40be6b72e1	[ez] Type function in distributed_c10d (#110435 ) This function returns a `torch.device`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110435 Approved by: https://github.com/awgu	2023-10-03 17:54:04 +00:00
Rodrigo Kumpera	c26270c733	[C10D] Even more store scalability work. (#109218) Fix a bug in the socket.cpp timeout detection that only shows up with 10k ranks. Make the minimum wait time in _store_based_barrier adaptive, based on the number of ranks. Longer timeouts give more room for the store to do productive work when swamped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218 Approved by: https://github.com/XilunWu ghstack dependencies: #109217	2023-09-22 21:27:09 +00:00
Howard Huang	600d0d0284	Add "cuda" to MPI backend capabilities (#109614 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/109543 Test Plan: We need to run CUDA aware MPI in PyTorch to actually test this change, we currently have no MPI tests. Differential Revision: D49420438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109614 Approved by: https://github.com/XilunWu	2023-09-21 13:34:58 +00:00
Rodrigo Kumpera	881bfbf21d	[c10d] Add tests for using libuv through init_process_group. (#108661) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108661 Approved by: https://github.com/XilunWu, https://github.com/fduwjj	2023-09-20 16:02:20 +00:00
Rodrigo Kumpera	2bca5f2af7	[C10D] Track pg name in c++. (#108813 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108813 Approved by: https://github.com/wconstab	2023-09-15 01:10:29 +00:00
Brian Vaughan	bb14805bcd	fix an incorrect indent in documentation (#108273 ) doc for `torch.distributed.send(tensor, dst, group=None, tag=0)` was rendering incorrectly here: https://pytorch.org/docs/stable/distributed.html due to lack of indent (it was interpreting the continuation as a new argument). Pull Request resolved: https://github.com/pytorch/pytorch/pull/108273 Approved by: https://github.com/awgu, https://github.com/kit1980	2023-09-11 21:27:52 +00:00
Pritam Damania	704b0b3c67	[RESUBMIT] Standardize on error types for distributed errors. (#108191) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy error-handling code, somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, in this PR I've added these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191 Approved by: https://github.com/H-Huang	2023-08-30 21:47:39 +00:00
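A small sketch of how a typed hierarchy could replace the string matching quoted above. The classes below are stand-ins that only mirror the names listed in the commit message. Deriving `DistError` from `RuntimeError` is an assumption made for illustration, and `classify` is a hypothetical helper, not part of any library.

```python
# Stand-in classes mirroring the names in the commit message (not the real ones).
class DistError(RuntimeError):  # assumption: base class chosen for illustration
    """Base type of all distributed errors."""


class DistBackendError(DistError):
    """Errors from a process-group backend."""


class DistStoreError(DistError):
    """Errors originating from the store."""


class DistNetworkError(DistError):
    """General network errors coming from the socket library."""


def classify(exc):
    """Hypothetical helper: map an exception to a category by type, not by message text."""
    if isinstance(exc, DistBackendError):
        return "backend"
    if isinstance(exc, DistStoreError):
        return "store"
    if isinstance(exc, DistNetworkError):
        return "network"
    if isinstance(exc, DistError):
        return "distributed"
    return "other"


try:
    raise DistStoreError("timed out waiting for key")
except DistError as exc:  # one except clause covers every distributed error
    print(classify(exc))
```

Catching the base type handles every subtype at once, while `isinstance` checks replace the fragile substring tests on the exception message.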
PyTorch MergeBot	d4ff06ec84	Revert "Standardize on error types for distributed errors. (#107651 )" This reverts commit `0e2317479b`. Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))	2023-08-28 23:58:33 +00:00
Pritam Damania	0e2317479b	Standardize on error types for distributed errors. (#107651) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy error-handling code, somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, in this PR I've added these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651 Approved by: https://github.com/H-Huang	2023-08-28 21:58:15 +00:00
wz337	264df88a2d	[C10D][Logger]Add more info to c10d logger (#107331 ) This PR adds pg_name and world_size to c10d logging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107331 Approved by: https://github.com/kumpera	2023-08-28 15:10:56 +00:00
Codle	42738c56a0	Skip the extra copy operation in broadcast_object_list if tensor_list has only one element (#107509) The `broadcast_object_list` function can easily broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` roughly doubles the memory used. This means that only objects occupying at most half the device capacity can be broadcast. This PR improves usability by skipping the `torch.cat` operation on object_lists with only a single element. Before (30G tensor): (image) After (46G tensor): (image) Test Code: ```python if __name__ == "__main__": dist.init_process_group(backend='nccl') torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count()) fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4) if dist.get_rank() == 0: state_dict = {"fake_tensor": fake_tensor} else: state_dict = {} object_list = [state_dict] dist.broadcast_object_list(object_list, src=0) print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys()) dist.barrier() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509 Approved by: https://github.com/awgu	2023-08-23 17:19:10 +00:00
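The single-element shortcut can be sketched without PyTorch. In this illustration `bytes` objects stand in for serialized tensors and `b"".join` stands in for `torch.cat`. The function name `flatten_buffers` is hypothetical and does not appear in the PR.

```python
def flatten_buffers(buffers):
    """Return one contiguous buffer, copying only when more than one buffer is given."""
    if len(buffers) == 1:
        # Single element: reuse the existing buffer, so no second copy is allocated.
        return buffers[0]
    # Several elements: concatenation allocates a new buffer holding all the data.
    return b"".join(buffers)


single = bytes(1024)
merged = flatten_buffers([single])
print(merged is single)                       # the original object is reused
print(len(flatten_buffers([b"ab", b"cd"])))   # concatenated length
```

With one large object, the shortcut keeps peak memory near the size of that object instead of roughly twice its size.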
Aaron Gokaslan	660e8060ad	[BE]: Update ruff to 0.285 (#107519) This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings. I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I am enabling it so that it stays that way. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-22 23:16:38 +00:00
PyTorch MergeBot	d59a6864fb	Revert "[BE]: Update ruff to 0.285 (#107519)" This reverts commit `88ab3e4322`. Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was probably accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))	2023-08-22 19:53:32 +00:00
Aaron Gokaslan	88ab3e4322	[BE]: Update ruff to 0.285 (#107519) This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings. I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I am enabling it so that it stays that way. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-20 01:36:18 +00:00
Rodrigo Kumpera	bbf03561a9	[functional collectives] Move back to registering finalizers on wrappers. (#107250) We cannot use inner tensors for finalizers as they are uncollective until waited. This PR adds a bunch of tests for the observable behavior we want, including the necessary scaffold for us to test code for their waitiness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250 Approved by: https://github.com/wconstab	2023-08-17 21:08:28 +00:00
Shen Li	45128ab67c	[Reland] Add OnCompletion Hook to ProcessGroup (#106988) (#107233) This allows infra/trainers to get detailed stats about communication efficiencies without knowing anything about what model or distributed training paradigms have been used. This is helpful as infra/trainer packages usually prefer to be as model/algorithm agnostic as possible. Therefore, we cannot assume that infra/trainer can have access to all collectives used by the model authors. This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which will be fired on every work completion event. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233 Approved by: https://github.com/kumpera	2023-08-15 17:35:14 +00:00
PyTorch MergeBot	fd214aa8be	Revert "Add OnCompletion Hook to ProcessGroup (#106988 )" This reverts commit `ba1da47e8f`. Reverted https://github.com/pytorch/pytorch/pull/106988 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing Windows build with some linker error. The Windows failures on PR looks legit ([comment](https://github.com/pytorch/pytorch/pull/106988#issuecomment-1678580899))	2023-08-15 08:24:33 +00:00
Shen Li	ba1da47e8f	Add OnCompletion Hook to ProcessGroup (#106988) This allows infra/trainers to get detailed stats about communication efficiencies without knowing anything about what model or distributed training paradigms have been used. This is helpful as infra/trainer packages usually prefer to be as model/algorithm agnostic as possible. Therefore, we cannot assume that infra/trainer can have access to all collectives used by the model authors. This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which will be fired on every work completion event. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988 Approved by: https://github.com/kumpera, https://github.com/H-Huang ghstack dependencies: #107140, #107141, #107160	2023-08-15 04:32:23 +00:00
Bruce Jiang	2624da638d	Support third-party devices to use the init_process_group method without specifying the Backend (#107113) When init_process_group has not been called beforehand, DeviceMesh calls it automatically without specifying the backend. This causes a problem when a third-party device wants to use DeviceMesh without calling init_process_group first. This PR adds a default_device_backend_map, to which third-party device users can add their backends when they first register them with PyTorch. When init_process_group is called without the backend parameter, it initializes the backends in this map. Thus, a third-party user can use the init_process_group method without specifying the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113 Approved by: https://github.com/wanchaol	2023-08-15 03:46:07 +00:00
Jirka	858b465d74	fix str splits in single line (#106005) Simple formatting improvement and two spelling fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/106005 Approved by: https://github.com/H-Huang	2023-08-14 23:07:38 +00:00
Michael Voznesensky	42660015b4	[Dynamo x FSDP][2/x] Small changes to distributed to make it dynamo friendly (#106886 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/106886 Approved by: https://github.com/awgu, https://github.com/wconstab ghstack dependencies: #106884	2023-08-11 22:35:50 +00:00
Louis Feng	3a01c056f5	[PyTorch][ET] Collect Process Groups Mapping Info (#104373 ) Summary: Add the logics and interface to log ProcessGroup comms configuration (unique ID, type, and ranks info). Test Plan: Testing in HPC: ``` TORCH_LOGS=all ../buck-out/v2/gen/fbcode/c8344b52091f4f7f/hpc/models/ads/__ads_10x_launcher__/ads_10x_launcher.par +launcher=local launcher.num_trainers=4 +data_loader=random data_loader.num_batches=2000 ``` Example output in ET: ``` { "name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "", "inputs": ["[{'pg_id': 140538064364672, 'backend_id': 140538060772480, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}, {'pg_id': 140538064363904, 'backend_id': 140538042628864, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}]"], "input_shapes": [[]], "input_types": ["String"], "outputs": [], "output_shapes": [], "output_types": [] }, ``` Differential Revision: D46321690 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104373 Approved by: https://github.com/kwen2501	2023-07-25 03:34:53 +00:00
Howard Huang	0ab74044c2	[BE] remove deprecated attributes from distributed_c10d (#105753) Removing these attributes as they were introduced 5 years ago, before PyTorch 1.0. `Backend` is the only supported usage now. Differential Revision: [D47683717](https://our.internmc.facebook.com/intern/diff/D47683717) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105753 Approved by: https://github.com/rohan-varma	2023-07-24 16:35:08 +00:00
Justin Chu	232b96b6e2	[BE] Enable ruff's UP rules and autoformat distributed/ (#105433 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433 Approved by: https://github.com/albanD	2023-07-19 14:27:11 +00:00
Ke Wen	22e8a61d9b	Implement coalesced reduce_scatter_tensor (#103561 ) Map of #101157. This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax: Sync communication style: ``` with dist._coalescing_manager(): for i in range(num_coll): dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i]) ``` Async communication style: ``` with dist._coalescing_manager(async_ops=True) as cm: for i in range(num_coll): dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i]) # do a bunch of other things cm.wait() # do things that depend on the reduce-scatters' results ``` Each `reduce_scatter_tensor` call can be independent in terms of their data and buffer locations. But could be executed in parallel by supported backends (like NCCL). Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561 Approved by: https://github.com/fegin	2023-06-15 20:11:12 +00:00
zhuhong61	50c972bfd2	[c10d] Add xpu to the default device supported by user specified backend (#103410 ) Motivation: For collective dispatching, we want to provide a more user friendly usage for xpu device and CCL backend (user specified backend) mapping. Solution: We add xpu to the default device list, and it can construct the mapping between xpu and the user specified backend directly. Usage: When using xpu device, user can specify backend name only: `dist.init_process_group(backend='ccl')` Pull Request resolved: https://github.com/pytorch/pytorch/pull/103410 Approved by: https://github.com/jgong5, https://github.com/ezyang	2023-06-12 19:46:33 +00:00
Ke Wen	07104ca99c	[c10d] Make it default that PG do not perform barrier after init (#103033 ) Both internal and OSS users trying https://github.com/pytorch/pytorch/pull/99937 report that their workloads perform normally even with the barrier removed and see a scalability win. Thus in this PR, we decide to make it default that PG do not perform a barrier after init. In the discussion of #99937, people point out that such barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC or RPC user's responsibility to deal with. That is, with other functions/libraries, it can happen too. So the need for c10d to do so big a favor is not justified IMO. Also good to remove it before users become reliant on this barrier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103033 Approved by: https://github.com/XilunWu	2023-06-07 06:11:14 +00:00
Ashwin Hari	cf0aa38005	Allow ORT backend for DTensor (#101914 ) fixes #101911 Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend. * `Backend.NAME` attribute now has value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl. * remove unused `_check_for_nccl_backend` function * add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914 Approved by: https://github.com/wanchaol	2023-06-01 22:37:09 +00:00
shaoyf42	8d7e082300	[c10d] Add is_backend_available for c10d backend. (#101945 ) Add is_backend_available for c10d backend, either the built-in backends or third-party backends through function ``Backend.register_backend``. There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553 > For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101 to also add their own is_available property It is a natural choice for users to add their own `is_available` when they create a backend. We think it might be a possible way for the user to use `is_X_available` in the same way as the native, for example by dynamically adding`torch.distributed.is_dummpy_available()` function. This is why we want to dynamically add the `is_X_available` to `torch.distributed` in `register_backend`. > Or we could add an Is_available(backend) function, that checks for the backend. Providing a public function is indeed another good approach. We have implemented an `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945 that supports both built-in backends and third-party backends. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945 Approved by: https://github.com/H-Huang	2023-05-31 22:51:51 +00:00
Wanchao Liang	3ef4d697df	[c10d] default backend need to check for nccl availability (#102470 ) As titled, we can only initialize nccl backend when NCCL is available Pull Request resolved: https://github.com/pytorch/pytorch/pull/102470 Approved by: https://github.com/Skylion007, https://github.com/XilunWu	2023-05-30 19:22:37 +00:00
Wanchao Liang	7b47cd0a6c	[c10d] add fake pg necessary collectives (#102238) This PR adds the collectives the fake pg needs to enable an e2e FSDP run without multiprocessing or multithreading Pull Request resolved: https://github.com/pytorch/pytorch/pull/102238 Approved by: https://github.com/ezyang	2023-05-25 05:01:16 +00:00
Wanchao Liang	9a19262556	[c10d] consolidate barrier after init logic (#102237) This PR consolidates the barrier-after-init logic to allow a custom backend to set the env var when creating the pg, so that `init_process_group` skips the barrier Pull Request resolved: https://github.com/pytorch/pytorch/pull/102237 Approved by: https://github.com/ezyang	2023-05-25 05:01:16 +00:00
Edward Z. Yang	c903b12cb8	Add fake process group (#102180 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/102180 Approved by: https://github.com/wanchaol	2023-05-24 23:27:40 +00:00
Iris	ee95e37a69	[c10d] Record time spent for init_process_group, new_group, _store_based_barrier (#101912 ) 1. Record time spent for init_process_group, new_group, _store_based_barrier 2. Rename c10d_error_logger to c10d_logger for generalization. 3. Refactor to move logger wrappers in distributed_c10d.py to logger to c10d_logger.py. 4. Rename the logger wrappers (bc breaking). Exception_handler is renamed to exception_logger to avoid confusion with logging handler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912 Approved by: https://github.com/fduwjj	2023-05-24 09:36:34 +00:00
Aaron Gokaslan	3e2ea32dab	[BE]: Enable ruff rule TRY302 and apply fixes (#101874 ) Removes useless try statements and unreachable code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874 Approved by: https://github.com/malfet	2023-05-19 17:30:52 +00:00
shaoyf42	97180aca5e	Enables barrier to support the specified device (#99589) Enables barrier to support the specified device, e.g. cuda/custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919 Today, there are two limitations of barrier: The first is that barrier does not support custom devices: `fbdb86c174/torch/csrc/distributed/c10d/ProcessGroup.hpp (L512-L522)` The second is that there is a special validation for nccl when device_id is not None, which is an assumption for cuda and nccl bindings, and also hinders custom devices. `789070986c/torch/distributed/distributed_c10d.py (L3504-L3508)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589 Approved by: https://github.com/kwen2501	2023-05-17 05:26:04 +00:00
Ke Wen	daed3bf8f9	Implement coalesced all_gather_into_tensor (#101157 ) This PR adds support for the following use cases: - Sync style: ``` with dist._coalescing_manager(): for i in range(num_coll): dist.all_gather_into_tensor(output_tensors[i], input_tensors[i]) ``` - Async style: ``` with dist._coalescing_manager(async_ops=True) as cm: for i in range(num_coll): dist.all_gather_into_tensor(output_tensors[i], input_tensors[i]) # do a bunch of other things cm.wait() # do things that depend on the all-gather's ``` Each `all_gather_into_tensor` would be independent in terms of data and their buffer location. But could be executed in parallel by supported backends (like NCCL). Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157 Approved by: https://github.com/kumpera, https://github.com/wanchaol	2023-05-11 20:58:47 +00:00
Ke Wen	0848ed21b8	[c10d] Figure out device to use for object collectives (#100954) Fixes https://github.com/pytorch/pytorch/issues/97938 This PR is cloned from https://github.com/pytorch/pytorch/pull/100238, which is important to me, but @kwen2501 has not resolved the conflict, so this PR is submitted to resolve it. The only conflict is at `distributed_c10d.py:2653` Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954 Approved by: https://github.com/kwen2501	2023-05-11 01:49:09 +00:00
Rodrigo Kumpera	a204f7f518	[c10d] Fix subprocess group handling in scatter_object_list. (#100552) scatter_object_list assumed src was a group rank while all collectives use global ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100552 Approved by: https://github.com/fduwjj	2023-05-04 10:04:21 +00:00
Xiaodong Wang	c29ab84115	Fix bug in process_group_name when there is duplicate pgs (#100518 ) Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group. Reviewed By: xunnanxu, eeggl Differential Revision: D45315615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518 Approved by: https://github.com/kumpera	2023-05-04 02:12:28 +00:00
Animesh Jain	5fbb40669f	[dynamo][moco] Disallow_in_graph distributed APIs (#100071 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/100071 Approved by: https://github.com/jansel, https://github.com/H-Huang	2023-05-02 20:09:25 +00:00
Ke Wen	ae0eb2342d	[Experimental] Remove store barrier after PG init (#99937 ) Store based barrier is not scalable. Experimenting to see if removing it breaks any CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/99937 Approved by: https://github.com/kumpera, https://github.com/H-Huang	2023-04-27 17:23:10 +00:00
Rodrigo Kumpera	ad21890f8f	[c10d] Scalable PG initiation. (#99931) Add a use_local_synchronization argument to new_group. When this argument is True, it changes new_group to do a store_barrier only on the ranks that are part of the group and not the whole cluster. This addresses both scalability and composability problems associated with new_group. Fixes #81291. This is relanding #84224. As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both functions, and perf is the following. new_group use_local_synchronization=False, world size: time (in secs): 4: 0.12; 8: 0.25; 16: 0.51; 32: 0.87; 64: 1.50; 128: 2.87. new_group use_local_synchronization=True, world size: time (in secs): 4: 0.05; 8: 0.04; 16: 0.03; 32: 0.03; 64: 0.04; 128: 0.04. Scaling for `use_local_synchronization=False` is sublinear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128. Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3. Setup: 1 AWS host, backend gloo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931 Approved by: https://github.com/xw285cornell	2023-04-27 13:44:02 +00:00

259 Commits