Change `logging.error` to `logging.exception` to log additional information when relevant. A few `logging.error` calls have slipped into try/except blocks since I last did a cleanup here, and the rule has stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC calls and distributed, and tried to fix a few spots based on whether we immediately reraised the exception or failed to print exception information where it could be useful.
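Not from the diff itself; just a minimal illustration of the pattern the rule enforces, using only the standard-library `logging` module:
```
import json
import logging

logger = logging.getLogger(__name__)

def load_config(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except OSError:
        # Before: logger.error("failed to load %s", path)   # message only, no traceback
        logger.exception("failed to load %s", path)          # message plus full traceback
        return {}
```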
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
Using `logging.basicConfig` to set the root logger's level is not good behavior, because it affects downstream third-party testing plugins. Fix common_distributed.py to set the level for the current logger only.
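A minimal sketch of the distinction, assuming the standard-library `logging` module:
```
import logging

# Discouraged: basicConfig configures the *root* logger, so the level change
# leaks into every library and third-party test plugin in the process.
logging.basicConfig(level=logging.DEBUG)

# Preferred (what the fix does conceptually): scope the level to this module's logger.
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
```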
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152319
Approved by: https://github.com/Skylion007
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`.
This is still a draft PR, as there is one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first; it depends on https://github.com/facebookincubator/gloo/pull/427 landing.
This also updates the gloo submodule to include the required changes.
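A hedged single-process sketch of enabling lazy init through the environment variable; the exact point at which `TORCH_GLOO_LAZY_INIT` is read is an assumption based on the description above:
```
import os
import torch.distributed as dist

# Ask Gloo to defer pairwise connection setup until first use (lazy init).
os.environ["TORCH_GLOO_LAZY_INIT"] = "1"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", rank=0, world_size=1)
```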
Test plan:
added lazy init test variants
```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801
Approved by: https://github.com/fduwjj
Allows test classes using MPCT to set their own timeout as a class
property, which is good enough since the process group is shared across
test instances and the timeout is set at process group init.
Also sets a default timeout of 2 minutes, which is probably (?) long
enough for reasonable tests, but can be changed if it causes flakiness.
It's preferable to keep the default timeout as short as possible, since
getting a timeout quickly helps when debugging tests.
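A hypothetical sketch of what overriding the timeout might look like; the attribute name and type here are assumptions, not the exact API:
```
from datetime import timedelta

from torch.testing._internal.common_distributed import MultiProcContinousTest

class SlowCollectivesTest(MultiProcContinousTest):
    # Hypothetical override of the 2-minute default for a known-slow suite.
    timeout = timedelta(minutes=5)
```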
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145099
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
ghstack dependencies: #145010, #145011
Adds a `destroy_pg_upon_exit` property to allow derived test classes to control whether automatic destroy is desired.
(Otherwise, derived test classes would need to rewrite the `_run()` method, leading to duplicated code, and if anything needs to be added to `_run()` in the future, more code changes would be needed.)
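A hedged sketch of how a derived class might opt out; only the property name comes from this PR, the rest is illustrative:
```
from torch.testing._internal.common_distributed import MultiProcessTestCase

class DestroySemanticsTest(MultiProcessTestCase):
    @property
    def destroy_pg_upon_exit(self) -> bool:
        # This suite exercises destroy/abort itself, so skip the automatic cleanup.
        return False
```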
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192
Approved by: https://github.com/wconstab
# Motivation
This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.
It demonstrates a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784).
This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase which helps abstract out process-group creation/deletion and other functionality for a given device.
The new generalized content can be added by deriving from this base class.
Also includes other misc changes for Gaudi support.
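A hypothetical sketch of the kind of base class described above; the class name, method names, and backend mapping here are assumptions, not the API added by this PR:
```
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase

_BACKEND_FOR_DEVICE = {"cpu": "gloo", "cuda": "nccl", "hpu": "hccl"}

class DeviceAgnosticTestCase(MultiProcessTestCase):
    device_type = "cpu"  # derived classes override this per accelerator

    def create_pg(self, store) -> None:
        dist.init_process_group(
            backend=_BACKEND_FOR_DEVICE[self.device_type],
            store=store,
            rank=self.rank,
            world_size=self.world_size,
        )

    def destroy_pg(self) -> None:
        dist.destroy_process_group()
```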
The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.
The following changes have been made to test_functional_api.py:
- Functionality has been added to test non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow general calls to accelerators, adding test skips instead of explicitly skipping for multiple GPUs
- skipHPU flags have been added to enable skipping a few multi-threaded test cases that are not yet supported on HPUs
NOTE: Within test_functional_api, there are tests that require the use of some multithreading functions which are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator.
I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.
This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this one as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">
My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.
Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (otherwise, option (3) would be asymmetric and could
result in a double destroy).
But it doesn't feel worth it to add a destroy call manually to each
test, and a try/except around a possible second destroy call seems like
a happy middle ground.
Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
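A minimal sketch of the base-class cleanup idea, written with an `is_initialized()` guard rather than the exact try/except that landed:
```
import torch.distributed as dist

def _cleanup_process_group() -> None:
    # Destroy whatever default process group a test left behind; tests that
    # already called destroy_process_group() themselves are unaffected.
    if dist.is_initialized():
        dist.destroy_process_group()
```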
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  # <-- no additional context yet
del work       # <-- additional context shows up
```
### Debug process
The issue was chased down to the destruction of a `Future` object -- a member variable of `Work` --
and then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we end up in:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for the Python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where ranks 1, 2, ... can create an extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, each event is destroyed under a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
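A hedged sketch of the pynvml-based detection idea (not the test's actual code): count the processes holding a CUDA context on device 0; only rank 0 should appear.
```
import pynvml  # optional dependency; the test falls back to a memory check without it

def processes_with_context_on_device0() -> int:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        # Every process that created a CUDA context on device 0 shows up here.
        return len(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
    finally:
        pynvml.nvmlShutdown()
```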
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
skip_if_rocm is used only in the multiprocess case (when the UT test class is a child of MultiProcessTestCase); each individual process can exit with a skip code. If used for a single-process UT, it will cause the UT to fail because the process returns a non-zero exit code; use skipIfRocm in single-process UTs instead.
To avoid this confusion, this PR renames skip_if_rocm to skip_if_rocm_multiprocess.
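A hedged usage sketch of where each decorator belongs after the rename (import paths are my understanding, not copied from the diff):
```
from torch.testing._internal.common_distributed import (
    MultiProcessTestCase,
    skip_if_rocm_multiprocess,
)
from torch.testing._internal.common_utils import TestCase, skipIfRocm

class CollectiveTests(MultiProcessTestCase):
    @skip_if_rocm_multiprocess  # each spawned child process exits with the skip code
    def test_allreduce(self):
        ...

class HelperTests(TestCase):
    @skipIfRocm  # plain unittest skip; no child processes involved
    def test_helper(self):
        ...
```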
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)
This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").
The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.
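A hypothetical sketch of the combination described above; the derived-class name and the inductor TestCase import path are assumptions:
```
from torch._inductor.test_case import TestCase as InductorTestCase
from torch.testing._internal.common_distributed import MultiProcessTestCase

class MultiProcessInductorTestCase(MultiProcessTestCase, InductorTestCase):
    """MultiProcessTestCase whose inductor base ensures a fresh cache dir per test."""
```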
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">
We hope this improvement will give users a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.
**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` unless users explicitly specify using the old TCPStore server. This change should not be noticeable to users, except for significantly faster TCPStore startup in large-scale jobs.
One thing to note: we do not yet support the initialization approach where the user passes in a socket for the libuv backend. We plan to support it as a next step, but we chose to disable it until it is fully tested. If you initialize TCPStore this way, see the next section for how to keep using the old TCPStore server.
**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there are three ways:
1. If you are directly instantiating a TCPStore object, pass the argument `use_libuv=False` to use the old TCPStore server backend, e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling the default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`.
3. Or, set the environment variable `USE_LIBUV` to `"0"` when launching.
These three approaches are listed in order of precedence. That is, if a user specifies `use_libuv=0` in `init_method` and also sets the environment variable `USE_LIBUV="1"`, the former takes effect and the old (non-libuv) TCPStore backend will be instantiated.
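A hedged sketch of options 1 and 2 above (host and port values are placeholders):
```
import torch.distributed as dist

# Option 1: direct TCPStore construction, keeping the old (non-libuv) backend.
store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True, use_libuv=False)

# Option 2: opt out through the init_method query string.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29501?use_libuv=0",
    rank=0,
    world_size=1,
)

# Option 3 is environment-based, e.g. launching the job with USE_LIBUV=0 set.
```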
**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.
**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.
`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">
**TODO**
1. Update the doc at
- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization
2. Make torch elastic rendezvous use the libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test whether the libuv backend is okay with initialization from a socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.
Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
This CI showcases that FSDP2 works with a `torch.compile`d root model, since FSDP1 can do the same:
- compiling the root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`
- compiling the root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832
Approved by: https://github.com/awgu
## Problem this PR resolves
Today, most distributed tests are arranged like this:
```
def test_allreduce(self):
    pg = self._create_process_group_nccl(store, self.opts())
    pg.allreduce(tensor)
    ...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?
If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
    self._spawn_processes()

def tearDown(self):
    for p in self.processes:
        p.terminate()
```
Since `setUp` and `tearDown` are **"test-scope fixtures"**, meaning they are called per test, each test gets brand-new processes. Of course we'd have to recreate the ProcessGroup every time.
## How we are fixing it
First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called **"class-scope fixtures."** It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class. Therefore, we would do:
```
@classmethod
def setUpClass(cls):
    dist.init_process_group(...)

@classmethod
def tearDownClass(cls):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**
Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...` or
2. use `mp.spawn()` under `if __name__ == "__main__":`.
Point is, launch the processes only once.
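A hedged sketch of a test written against the new template; the `rank`/`world_size` attributes and base-class behavior are assumptions based on the description above, not the exact template API:
```
import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcContinousTest

class AllReduceOpsTest(MultiProcContinousTest):
    def test_allreduce_sum(self):
        # The ProcessGroup was created once in setUpClass and is reused here.
        t = torch.ones(4, device=f"cuda:{self.rank}")
        dist.all_reduce(t)
        torch.testing.assert_close(t, torch.full_like(t, float(self.world_size)))
```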
## Result
We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.
## Limitation
For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout.
## Next step
Migrate the distributed tests that fit this test style!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648
Approved by: https://github.com/wconstab