pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
cyy	265acd4bea	Clean up CMake target linking (#109959 ) This PR cleans up more CMake target linking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109959 Approved by: https://github.com/ezyang	2023-09-25 01:37:14 +00:00
Pritam Damania	704b0b3c67	[RESUBMIT] Standardize on error types for distributed errors. (#108191) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError`, etc. This results in messy code during error handling, somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, in this PR I've added these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191 Approved by: https://github.com/H-Huang	2023-08-30 21:47:39 +00:00
Rodrigo Kumpera	fe2cda64dc	[C10D] Implement new libuv backend for TCPStore. (#108066 ) The new backend is currently under a flag 'use_libuv' in TCPStore constructor to reduce the impact on existing users as we test it. This is a reland of #105870 with a fix for a bad test. Differential Revision: [D48742554](https://our.internmc.facebook.com/intern/diff/D48742554) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108066 Approved by: https://github.com/H-Huang, https://github.com/fduwjj	2023-08-29 14:55:14 +00:00
PyTorch MergeBot	d4ff06ec84	Revert "Standardize on error types for distributed errors. (#107651 )" This reverts commit `0e2317479b`. Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))	2023-08-28 23:58:33 +00:00
Pritam Damania	0e2317479b	Standardize on error types for distributed errors. (#107651) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError`, etc. This results in messy code during error handling, somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, in this PR I've added these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651 Approved by: https://github.com/H-Huang	2023-08-28 21:58:15 +00:00
PyTorch MergeBot	d3f92ca9e9	Revert "[C10D] Implement new libuv backend for TCPStore. (#105870 )" This reverts commit `3c841163ce`. Reverted https://github.com/pytorch/pytorch/pull/105870 on behalf of https://github.com/huydhn due to I think the distributed failure is related as this is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/105870#issuecomment-1683117192))	2023-08-17 23:41:00 +00:00
Rodrigo Kumpera	3c841163ce	[C10D] Implement new libuv backend for TCPStore. (#105870 ) The new backend is currently under a flag 'use_libuv' in TCPStore constructor to reduce the impact on existing users as we test it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870 Approved by: https://github.com/H-Huang	2023-08-17 20:40:32 +00:00
Rodrigo Kumpera	174b0c22cb	[C10D] Remove watchKey functionality from the Store. (#105014) The feature was never fully finished and never got any adoption, but TCPStore pays the cost of twice the number of TCP connections anyway. While the cost of all those idle connections is minimal, it doesn't come for free: - It increases the likelihood of a connection-refused failure during the initialization stampede. - TCPStore uses poll to check socket availability, which scales linearly with the number of sockets regardless of their status. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105014 Approved by: https://github.com/fduwjj	2023-07-21 21:18:55 +00:00
Howard Huang	9165d46b89	DDP + C10D sparse all_reduce changes (#103916 ) (#104256 ) Summary: reland of https://github.com/pytorch/pytorch/pull/103916 ## Changes prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function. prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...` ## Example script ```python # python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py import torch import torch.distributed as dist def main(): dist.init_process_group(backend="nccl") rank = dist.get_rank() a = torch.tensor([[0, 2.], [3, 0]]).to(rank) a = a.to_sparse() print(f"rank {rank} - a: {a}") dist.all_reduce(a) if __name__ == "__main__": main() ``` output: ``` rank 1 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse rank 0 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse ``` Test Plan: Testing commands (OSS): ``` # python pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops # c++ build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Testing commands (internal, ondemand GPU): ddp tests: ``` buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output # Get the .par file from the previous command and use it below TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata ``` c10d tests: ``` # build tests and run with log output (python) buck build mode/opt -c 
hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops # python NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)' # c++ NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Differential Revision: D47056695 Pulled By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/104256 Approved by: https://github.com/rohan-varma	2023-06-28 00:37:52 +00:00
PyTorch MergeBot	436d035dc7	Revert "DDP + C10D sparse all_reduce changes (#103916 )" This reverts commit `fed5fba6e4`. Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))	2023-06-26 22:37:58 +00:00
Howard Huang	fed5fba6e4	DDP + C10D sparse all_reduce changes (#103916 ) Summary: ## Changes prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function. prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...` ## Example script ```python # python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py import torch import torch.distributed as dist def main(): dist.init_process_group(backend="nccl") rank = dist.get_rank() a = torch.tensor([[0, 2.], [3, 0]]).to(rank) a = a.to_sparse() print(f"rank {rank} - a: {a}") dist.all_reduce(a) if __name__ == "__main__": main() ``` output: ``` rank 1 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse rank 0 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse ``` Test Plan: Testing commands (OSS): ``` # python pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops # c++ build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Testing commands (internal, ondemand GPU): ddp tests: ``` buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output # Get the .par file from the previous command and use it below TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata ``` c10d tests: ``` # build tests and run with log output (python) buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output NCCL_DEBUG=WARN 
/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops # python NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)' # c++ NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Differential Revision: D46724856 Pulled By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916 Approved by: https://github.com/rohan-varma	2023-06-26 20:42:17 +00:00
Howard Huang	a206e8b027	[small BE] update NcclTest dim size (#101127 ) Previously input dimensions are fixed to 3x3, this is a small change to make that configurable. Will be used in future additions to nccl tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/101127 Approved by: https://github.com/rohan-varma	2023-05-15 23:05:10 +00:00
Ke Wen	f89af60183	Rewrite NCCL watchdog to more reliably throw timeout (#97066 ) Fixes #97191 This PR aims to propagate collective exceptions (async error or timeout) up to the program, so as to avoid silent stuck job. ### Previous output in #97191 ``` Rank 0 is the problematic rank Rank 4 completed Rank 5 completed Rank 3 completed Rank 6 completed Rank 2 completed Rank 7 completed Rank 1 completed [E ProcessGroupNCCL.cpp:464] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10917 milliseconds before timing out. Rank 0 completed [E ProcessGroupNCCL.cpp:478] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:483] To avoid data inconsistency, we are taking the entire process down. ``` Although it says that it is taking the process down, it sometimes fails to do so. ### New output after this PR: ``` ... [E ProcessGroupNCCL.cpp:459] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out. [E ProcessGroupNCCL.cpp:473] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:479] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:818] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out. 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 194470) of binary: /data/home/kw2501/repos/pytorch-dev-env/bin/python Traceback (most recent call last): File "/pytorch-dev-env/bin/torchrun", line 33, in <module> sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')()) File "/pytorch-dev/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper return f(args, *kwargs) File "/pytorch-dev/torch/distributed/run.py", line 794, in main run(args) File "/pytorch-dev/torch/distributed/run.py", line 785, in run elastic_launch( File "/pytorch-dev/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/pytorch-dev/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ hang.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-03-20_22:00:42 host : node0 rank : 0 (local_rank: 0) exitcode : -6 (pid: 194470) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 194470 ============================================================ ``` The log suggests that TorchX monitor is triggered, and job is torn down. ### Major changes in this PR: 1. Merge ncclWatchDog thread and workCleanupLoop thread into one so that the watch action and the throw action are streamlined. Previously, ncclWatchDog is responsible for watching comm error and timeout, and workCleanupLoop is responsible for watching Work item error and throwing exception. This two-thread design is not streamlined, raising the chance of missing the throw. Also, it is duplicated to watch at multiple level. 2. Rethrow exception at watchdog thread. 3. 
Clean up a bunch of duplicated functions, e.g. `checkAndThrowException` and `handleNcclException`. 4. Turn on ASYNC_ERROR_HANDLING by default Pull Request resolved: https://github.com/pytorch/pytorch/pull/97066 Approved by: https://github.com/rohan-varma	2023-03-25 04:30:20 +00:00
Pruthvi Madugundu	baf71a8aad	[ROCm] Update clock intrinsic handling for AMD gfx11 family (#97005 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97005 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2023-03-24 18:29:49 +00:00
PyTorch MergeBot	f25cdf8aeb	Revert "Rewrite NCCL watchdog to more reliably throw timeout (#97066)" This reverts commit `95e8d0c39e`. Reverted https://github.com/pytorch/pytorch/pull/97066 on behalf of https://github.com/clee2000 due to sorry but I think this broke periodic multigpu tests `416bac5b81` https://github.com/pytorch/pytorch/actions/runs/4505085943/jobs/7930826040	2023-03-24 06:27:00 +00:00
Ke Wen	95e8d0c39e	Rewrite NCCL watchdog to more reliably throw timeout (#97066 ) Fixes #97191 This PR aims to propagate collective exceptions (async error or timeout) up to the program, so as to avoid silent stuck job. ### Previous output in #97191 ``` Rank 0 is the problematic rank Rank 4 completed Rank 5 completed Rank 3 completed Rank 6 completed Rank 2 completed Rank 7 completed Rank 1 completed [E ProcessGroupNCCL.cpp:464] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10917 milliseconds before timing out. Rank 0 completed [E ProcessGroupNCCL.cpp:478] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:483] To avoid data inconsistency, we are taking the entire process down. ``` Although it says that it is taking the process down, it sometimes fails to do so. ### New output after this PR: ``` ... [E ProcessGroupNCCL.cpp:459] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out. [E ProcessGroupNCCL.cpp:473] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:479] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:818] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out. 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 194470) of binary: /data/home/kw2501/repos/pytorch-dev-env/bin/python Traceback (most recent call last): File "/pytorch-dev-env/bin/torchrun", line 33, in <module> sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')()) File "/pytorch-dev/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper return f(args, *kwargs) File "/pytorch-dev/torch/distributed/run.py", line 794, in main run(args) File "/pytorch-dev/torch/distributed/run.py", line 785, in run elastic_launch( File "/pytorch-dev/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/pytorch-dev/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ hang.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-03-20_22:00:42 host : node0 rank : 0 (local_rank: 0) exitcode : -6 (pid: 194470) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 194470 ============================================================ ``` The log suggests that TorchX monitor is triggered, and job is torn down. ### Major changes in this PR: 1. Merge ncclWatchDog thread and workCleanupLoop thread into one so that the watch action and the throw action are streamlined. Previously, ncclWatchDog is responsible for watching comm error and timeout, and workCleanupLoop is responsible for watching Work item error and throwing exception. This two-thread design is not streamlined, raising the chance of missing the throw. Also, it is duplicated to watch at multiple level. 2. Rethrow exception at watchdog thread. 3. 
Clean up a bunch of duplicated functions, e.g. `checkAndThrowException` and `handleNcclException`. 4. Turn on ASYNC_ERROR_HANDLING by default Pull Request resolved: https://github.com/pytorch/pytorch/pull/97066 Approved by: https://github.com/rohan-varma	2023-03-23 21:31:21 +00:00
Howard Huang	7a0f29b776	Allow Process Group to support multiple backends (#88330) (#90997) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330 ### Implementation Move backend-specific (NCCL, Gloo, etc) collective implementations to corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type. ### Changes #### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`) - Update pybind definitions for new process group base class and new backend class - Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests - Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class. - Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type - Update internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched. - Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122. #### python changes (`distributed_c10d.py`, test files) - Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API - `get_backend()` deprecation warning - `init_process_group` now returns a generic `ProcessGroup` object that contains a list of backends (the ones stated above) which it will dispatch operations to.
- `new_group` updated to return the same as above - Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options - Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group` - Specific tests updated: `test_Backend_enum_class` ### Changes missing - lazy initialization of backends - support parsing of BackendConfig ### open questions - Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338) # Example This is a basic script (using 2 backends within a process group) ```python # python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py import torch.distributed as dist import torch import os if __name__ == "__main__": rank = os.environ.get("RANK") # initialize with both gloo and nccl dist.init_process_group() # with gloo dist.all_reduce(torch.tensor([1.0])) print(f"Rank {rank} finished") # with nccl dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}")) ``` Test Plan: Imported from OSS Differential Revision: D42069829 Pulled By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997 Approved by: https://github.com/awgu, https://github.com/fduwjj	2022-12-16 23:15:00 +00:00
Kazuaki Ishizaki	088f2fa567	Fix typos in messages under test (#89121 ) This PR fixes typos of messages in `.cpp` and `.py` files under test directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89121 Approved by: https://github.com/mruberry, https://github.com/kit1980	2022-11-17 01:55:03 +00:00
Chengqi Deng	b43ae1c411	Add reference counter in FileStore (#85601 ) Fixes #67566. This diff added a reference counter in the FileStore object. The underlying file would be removed only if the reference counter became 0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85601 Approved by: https://github.com/H-Huang	2022-10-07 17:59:29 +00:00
Min Si	1ad0048b64	Refactor distribuetd to use absolute header path (#85780) Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be gracefully handled by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is standard practice in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute. See D39835774 for more details about the Meta internal complication. How to test: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options. Use it to verify that we don't miss any relative-path use of torch/csrc/distributed headers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780 Approved by: https://github.com/kumpera, https://github.com/huydhn	2022-09-30 05:13:50 +00:00
PyTorch MergeBot	a50d8864fc	Revert "Refactor distribuetd to use absolute header path (#85780 )" This reverts commit `668082718a`. Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks build due to a missing file <c10d/Store.hpp>	2022-09-30 02:04:29 +00:00
Min Si	668082718a	Refactor distribuetd to use absolute header path (#85780) Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be gracefully handled by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is standard practice in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute. See D39835774 for more details about the Meta internal complication. How to test: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options. Use it to verify that we don't miss any relative-path use of torch/csrc/distributed headers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780 Approved by: https://github.com/kumpera	2022-09-30 00:27:24 +00:00
Howard Huang	74ead61944	[2/N] [Dispatchable Collectives] Extract ProcessGroup::Work into a separate class and update references (#83680) ### Changes - Move ProcessGroup::Work into its own class and update all the references to it / header includes. #### Motivation In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. This change prevents a circular dependency with ProcessGroup depending on Backend and Backend depending on ProcessGroup::Work. Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680 Approved by: https://github.com/kwen2501	2022-09-14 13:05:58 +00:00
Xiang Gao	a4a55f5ea6	New TORCH_UCC_BLOCKING_WAIT env variable (#81791 ) Cherry-pick of https://github.com/facebookresearch/torch_ucc/pull/95. I recommend waiting until https://github.com/pytorch/pytorch/pull/81583 is merged first, so the CI is checking if this PR compiles correctly. Marking this as a draft for now, will change to "ready for review" once https://github.com/pytorch/pytorch/pull/81583 merged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81791 Approved by: https://github.com/kwen2501	2022-08-25 21:33:17 +00:00
Richard Barnes	67f0940cdd	Check all CUDA API calls for errors in test/ (#74921 ) (#83954 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74921 Test Plan: Sandcastle Reviewed By: ezyang, malfet, ngimel Differential Revision: D35194966 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83954 Approved by: https://github.com/ezyang	2022-08-24 20:12:25 +00:00
Howard Huang	9d228fe517	[Small] Remove using c10d::ProcessGroup directive from c10d test (#82681 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82681 Approved by: https://github.com/awgu	2022-08-03 17:23:35 +00:00
Sergii Dymchenko	b0aaefb50f	Build example_allreduce only for GLOO (#81062 ) `example/allreduce.cpp` is GLOO-specific and will not compile with USE_GLOO=0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81062 Approved by: https://github.com/malfet	2022-07-08 02:25:54 +00:00
Michael Suo	30fb2c4aba	[lint] autoformat test/cpp and torch/csrc Let's have some fun. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828 Approved by: https://github.com/ezyang	2022-06-11 21:11:16 +00:00
Nikita Shulga	80ea6955af	Add cuda-11.3+clang9 build workflow (take 2) To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic) Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu` Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293 Approved by: https://github.com/atalman, https://github.com/seemethere	2022-04-11 17:13:01 +00:00
PyTorch MergeBot	8fe43d76d5	Revert "Add cuda-11.3+clang9 build workflow" This reverts commit `709fcc862e`. Reverted https://github.com/pytorch/pytorch/pull/75293 on behalf of https://github.com/janeyx99	2022-04-11 15:24:59 +00:00
Nikita Shulga	709fcc862e	Add cuda-11.3+clang9 build workflow To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic) Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu` Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293 Approved by: https://github.com/atalman, https://github.com/seemethere	2022-04-11 14:10:57 +00:00
Xiang Gao	6e16c9bb1d	Add support for deleteKey for FileStore (#69953 ) Summary: torch_ucc uses `deleteKey`, and trying to run PyTorch tests with torch_ucc leads to failure about `deleteKey not implemented for FileStore`. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/69953 Reviewed By: ngimel Differential Revision: D33458457 Pulled By: H-Huang fbshipit-source-id: f46afd59f950722ae594d9aafb8843f14019e930	2022-01-07 06:20:59 -08:00
Nikita Shulga	3bb20ae49f	Make c10d tests -Werror clean (#69703 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69703 Test Plan: Imported from OSS Reviewed By: seemethere Differential Revision: D32997001 Pulled By: malfet fbshipit-source-id: 38b5f195c04f2b3b920e6883a96fe9a36345b9d2	2021-12-09 22:10:04 -08:00
Peter Bell	e279963eef	Remove remaining THC code (#69039 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69039 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D32872476 Pulled By: ngimel fbshipit-source-id: 7972aacc24aef9450fb59b707ed6396c501bcb31	2021-12-08 12:18:08 -08:00
Hongyi Jia	146a7f68e2	Enable desync root cause analysis for NCCL (#68310) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68310 Enable desync root cause analysis by recording the last footprint of collective calls. On timeout, we parse the store trace to determine the root cause of the desync issue. This feature is built on top of async error handling. Test Plan: Standalone test * Typical desync - P467288969 * Mismatched collectives - P467288916 * Mismatched broadcast size - P467288873 DDP benchmark * DDP benchmark desync - P467433483, P467520195 No perf regression: * w/o this diff https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs * w/ this diff https://www.internalfb.com/intern/fblearner/details/308534088?tab=Outputs Reviewed By: mingzhe09088 Differential Revision: D32348647 fbshipit-source-id: 43e7e96e3fa2be0ac66c1325bceb639b461a8b3a	2021-11-17 20:29:03 -08:00
Rohan Varma	885da61d7d	[PG NCCL] Disable NCCL health check (#67668 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67668 This adds an env var to enable NCCL health check, which when left unspecified, results in the check not being run. Unit tests that need to test this functionality have the env variable set. Please see internal diff for more details. Test Plan: CI Reviewed By: yuguo68, mrshenli Differential Revision: D32089763 fbshipit-source-id: dff5664a5e607f711515cd1042089ca769914fbb	2021-11-02 16:21:59 -07:00
Richard Barnes	e0643fa3fc	use irange for loops 5 (#66744 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66744 Modified loops in files under fbsource/fbcode/caffe2/ from the format `for(TYPE var=x0;var<x_max;x++)` to the format `for(const auto var: irange(xmax))` This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand. Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D31705358 fbshipit-source-id: d6ea350cbaa8f452fc78f238160e5374be637a48	2021-10-18 21:59:50 -07:00
Xue Li	2f099c7555	Revert D30652629: use irange for loops Test Plan: revert-hammer Differential Revision: D30652629 (`687c2267d4`) Original commit changeset: 0ae6c4bbbb55 fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3	2021-10-15 15:23:10 -07:00
Richard Barnes	687c2267d4	use irange for loops (#66234 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234 Modified loops in files under fbsource/fbcode/caffe2/ from the format `for(TYPE var=x0;var<x_max;x++)` to the format `for(const auto var: irange(xmax))` This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand. bypass_size_limit allow-large-files Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D30652629 fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e	2021-10-15 13:50:33 -07:00
Rohan Varma	06fa6c15c0	Back out "Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor"" (#66393 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66393 Third try! Fixes: - test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed https://github.com/pytorch/pytorch/issues/66354 to track that. - ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed. ghstack-source-id: 140560113 Test Plan: CI Reviewed By: mrshenli Differential Revision: D31534735 fbshipit-source-id: 8b7e0f4fed3972b7a77cbcda28876c9eefb0c7e2	2021-10-14 22:23:22 -07:00
Nikita Shulga	c373387709	Update CMake and use native CUDA language support (#62445 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445 PyTorch currently uses the old style of compiling CUDA in CMake which is just a bunch of scripts in `FindCUDA.cmake`. Newer versions support CUDA natively as a language just like C++ or C. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D31503350 fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55	2021-10-11 09:05:48 -07:00
Jane Xu	0a48f56318	Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor" Test Plan: revert-hammer Differential Revision: D31299350 (`f1f3bd8c36`) Original commit changeset: 9ad5c8fa17f7 fbshipit-source-id: d63d889922f507a4a0e2e042e451b95b9591c317	2021-10-08 17:55:28 -07:00
Rohan Varma	f1f3bd8c36	Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor" (#65883 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65883 Original commit changeset: d8e962b8aab6 ghstack-source-id: 139836954 Test Plan: ci Reviewed By: zhaojuanmao Differential Revision: D31299350 fbshipit-source-id: 9ad5c8fa17f7038ba579cb1eda6d9271ac07a130	2021-10-08 16:04:20 -07:00
Mike Ruberry	91f8755b0e	Revert D31005792: [NCCL] Init dummy NCCL comms in constructor Test Plan: revert-hammer Differential Revision: D31005792 (`2b22a5dde2`) Original commit changeset: c2c582dee25a fbshipit-source-id: d8e962b8aab6fda8a6c013e8577492dff9568c27	2021-09-29 20:46:38 -07:00
Rohan Varma	2b22a5dde2	[NCCL] Init dummy NCCL comms in constructor (#65173 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65173 Initializes dummy NCCL communicators in constructor for a basic health check that communicators can be initialized prior to launching the first collective. After successful init, we immediately use `ncclCommAbort` to destroy these communicators to ensure they don't interfere with regular communicator creation during collectives. Test Plan: CI Reviewed By: pritamdamania87 Differential Revision: D31005792 fbshipit-source-id: c2c582dee25a098361ead6ef03f541e7833c606b	2021-09-29 15:36:54 -07:00
Kimish Patel	54f2eb6e7e	[Pytorch Profiler] Add support for adding module hierarchy to (#61792 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61792 KinetoEvent This PR adds module hierarchy information to events. What is module hierarchy information attached to events? During profiling a TorchScript module, when events are added, we ask JIT what is the module hierarchy associated with the node being executed. At the time of execution of that node, there might be multiple frames in the stack of interpreter. For each frame, we find corresponding node and the corresponding module hierarchy is queried. Module hierarchy corresponding to the node is associated with node's InlinedCallStack. InlinedCallStack of node tracks the path via which the node is inlined. Thus during the inlining process we annotate module information corresponding to the CallMethod nodes being inlined. With this PR, chrome trace will contain additional metadata: "Module Hierarchy". This can look like this: TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward It contains module instance, type name and the method name in the callstack. Test Plan: test_profiler Imported from OSS Reviewed By: raziel, ilia-cher Differential Revision: D29745442 fbshipit-source-id: dc8dfaf7c5b8ab256ff0b2ef1e5ec265ca366528	2021-08-13 21:39:10 -07:00
tktrungna	2f5ac9c0ba	update test distributed (#62796 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62796 Fixes #62380 * update test functions to call wheel install folder {sitepackages}/torch instead of build/ folder * add symbolic link for shared libraries which are called by the tests (this is a bit hacky and should be fixed the rpath before compiling -- similar to https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L204-L208). ### Test plan check if all ci workflows pass Test Plan: Imported from OSS Reviewed By: driazati Differential Revision: D30193142 Pulled By: tktrungna fbshipit-source-id: 1247f9eda1c11c763c31c7383c77545b1ead1a60	2021-08-10 16:29:47 -07:00
guyang3532	4ed8858817	Exclude time of waiting in queue from gloo communication prof… (#61342 ) Summary: Background: The gloo communication implementation works as follows: 1. Construct communication workers and push them into a queue. 2. Initialize a thread pool in which each thread runs a loop that takes a worker from the queue and executes it. Issue: The recorded profiling time span starts at worker construction and ends when the worker finishes. It therefore includes the time the worker spends waiting in the queue, so multiple gloo communication time spans overlap each other on the same thread in the timeline: ![image](https://user-images.githubusercontent.com/62738430/124867273-5bc95b80-dff0-11eb-8664-6e5d4166fc39.png) This happens because the previous work has not finished while the next work is waiting in the queue. Solution: This PR delays the profiling start time of gloo communication from worker construction to the point where the worker is actually executed, so the profiling span no longer includes the time spent waiting in the queue. The implementation is as follows: 1. Disable the original record function by passing 'nullptr' as the 'profilingTitle' argument of ProcessGroup::Work. 2. Construct a 'recordFunctionBeforeCallback_' and a 'recordFunctionEndCallback_' and save them as members of the worker. 3. When the worker is executed, invoke 'recordFunctionBeforeCallback_'. 4. 'recordFunctionEndCallback_' is invoked at finish, as before. After this modification, the gloo profiling spans in the timeline no longer overlap each other: ![image](https://user-images.githubusercontent.com/62738430/124868716-bb286b00-dff2-11eb-9cf0-d0494a356d0c.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/61342 Reviewed By: albanD Differential Revision: D29811656 Pulled By: gdankel fbshipit-source-id: ff07e8906d90f21a072049998400b4a48791e441	2021-07-28 22:24:26 -07:00
Luca Wehrstedt	a016150163	Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543 Since now c10d is part of libtorch, it would also be nice if the sources lived all in one place. ghstack-source-id: 132306292 Test Plan: It builds Reviewed By: cbalioglu Differential Revision: D29062002 fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6	2021-06-24 12:38:51 -07:00

49 Commits