pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Howard Huang	aa3eab2ce6	Fix tcp init when using port 0 (#154156 ) I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 due to the bug in the conditional and will get error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156 Approved by: https://github.com/d4l3k, https://github.com/Skylion007	2025-05-23 21:41:58 +00:00
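The commit message does not show the conditional itself. A truthiness check on the parsed port is one kind of bug that would reject port 0, so the following is a minimal sketch under that assumption. The `parse_port_buggy` and `parse_port_fixed` helpers are hypothetical and built on the standard library's `urllib.parse.urlparse`; they are not PyTorch's rendezvous code.

```python
from urllib.parse import urlparse


def parse_port_buggy(url: str) -> int:
    # Hypothetical helper: `not port` treats the valid port 0 as if it were missing.
    port = urlparse(url).port
    if not port:
        raise ValueError("port number missing")
    return port


def parse_port_fixed(url: str) -> int:
    # Compare against None so that port 0 is accepted and only an absent port fails.
    port = urlparse(url).port
    if port is None:
        raise ValueError("port number missing")
    return port


if __name__ == "__main__":
    print(parse_port_fixed("tcp://localhost:0"))
```

With this sketch, `parse_port_buggy("tcp://localhost:0")` raises `ValueError`, while `parse_port_fixed` returns the port and still rejects a URL with no port at all.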
Tristan Rice	df4e5294a6	Reapply "ProcessGroupGloo: support lazy_init (#150801 )" (#151031 ) This reverts commit `73f3d6d9aa`. Reapplies #150801 Test plan: See #150801 submodule Pull Request resolved: https://github.com/pytorch/pytorch/pull/151031 Approved by: https://github.com/fduwjj	2025-04-11 01:58:35 +00:00
PyTorch MergeBot	73f3d6d9aa	Revert "ProcessGroupGloo: support lazy_init (#150801 )" This reverts commit `f237ee54bf`. Reverted https://github.com/pytorch/pytorch/pull/150801 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150801#issuecomment-2793161239))	2025-04-10 13:44:31 +00:00
Tristan Rice	f237ee54bf	ProcessGroupGloo: support lazy_init (#150801 ) This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)` This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first This also updates the gloo submodule to include the required changes. Test plan: added lazy init test variants ``` pytest -v test/distributed/test_c10d_gloo.py -k Lazy ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801 Approved by: https://github.com/fduwjj	2025-04-09 19:29:50 +00:00
Tristan Rice	159e97cbcf	ProcessGroupGloo: support reduce_scatter + update support chart (#149869 ) This adds a `reduce_scatter` implementation for ProcessGroupGloo. This is a pretty naive implementation as it does 1 allreduce per rank but may be useful for testing in FSDP etc. There was an existing implementation of reduce_scatter_tensor/reduce_scatter_tensor_coalesced that has a very similar implementation but requires a fixed tensor size per rank. If users find these functions to be too slow we can address them as issues arise. Gloo now supports all major distributed operations. Quite a few of these were added by @rohan-varma and @yifuwang but they didn't update the support chart. We also have `CUDAWork` variants of most operations so those were also added to the chart. Test plan: ``` pytest -v test/distributed/test_c10d_gloo.py -k reduce_scatter ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149869 Approved by: https://github.com/fduwjj	2025-03-25 01:16:12 +00:00
Tristan Rice	b248edd7cc	ProcessGroupGloo: support ReduceOp::AVG (#149781 ) This adds AVG support to ProcessGroupGloo to better support FSDP on CPU. I expect there will be more issues but this is easy enough to support in a naive fashion. This applies to both reduce and allreduce. This is a simple SUM + division and may not be the most numerically stable but that's expected. FSDP for low precision data types implements pre/post divide and uses SUM instead. Test plan: ``` pytest -v test/distributed/test_c10d_gloo.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149781 Approved by: https://github.com/fduwjj	2025-03-24 20:29:30 +00:00
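The "SUM + division" approach described above can be sketched without any distributed runtime. This is an illustrative simulation only: each rank's tensor is modelled as a plain Python list, and `allreduce_sum` and `allreduce_avg` are hypothetical helper names, not Gloo or PyTorch functions.

```python
def allreduce_sum(per_rank):
    """Simulate all-reduce SUM: every rank receives the elementwise sum."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def allreduce_avg(per_rank):
    """Simulate AVG as a SUM followed by division by the world size."""
    world_size = len(per_rank)
    return [[v / world_size for v in out] for out in allreduce_sum(per_rank)]


if __name__ == "__main__":
    # Two simulated ranks, each holding a two-element tensor.
    print(allreduce_avg([[1.0, 2.0], [3.0, 6.0]]))
```

Because the division happens after the full sum, intermediate values can grow large for low-precision types, which is the numerical-stability caveat the commit message mentions.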
Dmitry Nikolaev	d4871750d9	[ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673 ) This PR * makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners * skips and fixes several tests, failed on MI300, observed in https://github.com/pytorch/pytorch/pull/140989 Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300): - distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\_gather_dim_\ (24 tests across inductor/distributed configs) - distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\_scatter_dim_\ (12 tests across inductor/distributed configs)) - inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t - inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2 Skipped due to AssertionError on MI300: - inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16 - distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1 Skipped: - test_cuda.py::TestCudaMallocAsync::test_clock_speed - test_cuda.py::TestCudaMallocAsync::test_power_draw - test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda Skipped flaky tests on MI300: - distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda - inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests) Fixed: - test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda Features: - inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm supported FP8 datatypes. 
It keeps test names for CUDA and ROCm and allows enabling Inductor FP8 tests on CPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony Co-authored-by: saienduri <saimanas.enduri@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-09 05:18:57 +00:00
Xuehai Pan	b77406a9ec	[BE][CI] bump `ruff` to 0.8.4 (#143753 ) Changes: 1. Bump `ruff` from 0.7.4 to 0.8.4 2. Change `%`-formatted strings to f-string 3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753 Approved by: https://github.com/Skylion007	2024-12-24 12:24:10 +00:00
Tom Ritchford	d25e6e623f	Fix unused Python variables in test/[a-d]* (#134665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665 Approved by: https://github.com/albanD	2024-12-13 22:13:12 +00:00
Dmitry Rogozhkin	5872a8c6b0	Use task submitter TLS in gloo working threads (#142184 ) Fixes: #86830 CC: @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/142184 Approved by: https://github.com/albanD	2024-12-06 17:03:17 +00:00
Xuehai Pan	758a0a88a2	[BE][Easy] enable `ruff` rule `PIE790`: unnecessary `pass` statement (#133200 ) This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change. Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980	2024-08-15 15:50:19 +00:00
Howard Huang	0f90ffe94a	Remove ProcessGroupRoundRobin (#132888 ) `_round_robin_process_groups` is deprecated and should be removed. `258f47fc0b/torch/csrc/distributed/c10d/ProcessGroupRoundRobin.cpp (L10-L12)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132888 Approved by: https://github.com/Skylion007, https://github.com/wanchaol, https://github.com/c-p-i-o, https://github.com/fduwjj	2024-08-08 01:07:40 +00:00
Oguz Ulgen	920f0426ae	Add None return type to init -- tests rest (#132376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376 Approved by: https://github.com/jamesjwu ghstack dependencies: #132335, #132351, #132352	2024-08-01 15:44:51 +00:00
Wanchao Liang	12434504a2	[c10d] remove non-necessary tests (#131212 ) as titled, comm tensor is not being actively used as we approached the functional collectives as our collective tracing approach Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212 Approved by: https://github.com/XilunWu	2024-07-23 03:48:55 +00:00
PyTorch MergeBot	c74396e890	Revert "[c10d] remove non-necessary tests (#131212 )" This reverts commit `0c074352ab`. Reverted https://github.com/pytorch/pytorch/pull/131212 on behalf of https://github.com/atalman due to sorry need to revert breaks OSS CI, module 'test_c10d_common' has no attribute 'CompilerTest' ([comment](https://github.com/pytorch/pytorch/pull/131212#issuecomment-2243961785))	2024-07-22 23:11:44 +00:00
Wanchao Liang	0c074352ab	[c10d] remove non-necessary tests (#131212 ) as titled, comm tensor is not being actively used as we approached the functional collectives as our collective tracing approach Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212 Approved by: https://github.com/XilunWu	2024-07-22 19:52:44 +00:00
Xuehai Pan	db3290846e	[BE][Easy][10/19] enforce style for empty lines in import segments in `test/d*/` (#129761 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761 Approved by: https://github.com/fegin	2024-07-17 16:57:39 +00:00
Xuehai Pan	26f4f10ac8	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980	2024-05-27 14:49:57 +00:00
PyTorch MergeBot	55c0ab2887	Revert "[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 )" This reverts commit `7763c83af6`. Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))	2024-05-27 09:22:08 +00:00
Xuehai Pan	7763c83af6	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980 ghstack dependencies: #127122, #127123, #127124, #127125	2024-05-27 04:22:18 +00:00
Yuanhao Ji	e3effa5855	Enable UFMT on all of `test/distributed` (#123539 ) Partially addresses #123062 Ran lintrunner on: - `test/distributed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539 Approved by: https://github.com/ezyang	2024-04-17 06:46:02 +00:00
PyTorch MergeBot	52be63eb2c	Revert "Enable UFMT on all of `test/distributed` (#123539 )" This reverts commit `89ac37fe91`. Reverted https://github.com/pytorch/pytorch/pull/123539 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123539#issuecomment-2058329471))	2024-04-16 06:33:21 +00:00
Yuanhao Ji	89ac37fe91	Enable UFMT on all of `test/distributed` (#123539 ) Partially addresses #123062 Ran lintrunner on: - `test/distributed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539 Approved by: https://github.com/ezyang	2024-04-16 03:23:56 +00:00
Yifu Wang	372e9550bd	ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911 ) ### Motivation Despite our plan to reduce gloo usage, it is still being widely used as testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real world scenario. There's some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](`b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)`) and [this](`b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272)`)). For native funcol I ran into the same issues but I'd rather just fix the coverage. ### This PR We already have a fallback impl for `_reduce_scatter_base`, which is composed from all-reduce + scatter. The scatter was not necessary. It introduces extra communication, sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following: - Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]`. This is still 2x communication than a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now. - By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`. - The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with single input/output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911 Approved by: https://github.com/shuqiangzhang ghstack dependencies: #118910	2024-02-03 02:42:47 +00:00
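The `allreduce(inp).chunk(world_size)[rank]` idea can be illustrated with a small simulation. This is a sketch only: ranks are modelled as plain Python lists rather than tensors, the input length is assumed to be divisible by the world size, and `allreduce_sum` and `reduce_scatter_via_allreduce` are hypothetical names rather than the ProcessGroupGloo implementation.

```python
def allreduce_sum(per_rank):
    """Simulate all-reduce SUM: every rank receives the elementwise sum."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter_via_allreduce(per_rank):
    """Simulate reduce-scatter: all-reduce first, then each rank keeps its own chunk."""
    world_size = len(per_rank)
    reduced = allreduce_sum(per_rank)
    # Assumption: the input length divides evenly by world_size.
    chunk = len(reduced[0]) // world_size
    return [
        reduced[rank][rank * chunk:(rank + 1) * chunk]
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    # Two simulated ranks, four elements each; rank r keeps chunk r of the sum.
    print(reduce_scatter_via_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
```

Every rank still receives the full reduced result before discarding most of it, which is why the message describes this as twice the communication of a real reduce-scatter.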
Yifu Wang	fd000340fd	ProcessGroupGloo::allgather_into_tensor_coalesced (#118910 ) ### Motivation Despite our plan to reduce gloo usage, it is still being widely used as testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real world scenario. There's some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](`b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)`) and [this](`b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272)`)). For native funcol I ran into the same issues but I'd rather just fix the coverage. I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage. ### This PR This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`. This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910 Approved by: https://github.com/shuqiangzhang	2024-02-02 17:53:28 +00:00
Ke Wen	58c4bc62bb	[c10d] Deprecate Work.result() (#117565 ) Work.result() returns a vector of tensors. This signature is problematic as some collectives may just return one tensor (e.g all-reduce), while some others may return multiple tensors (e.g. all-gather). It would be clearer/easier for users to directly access the result via the tensor/tensorlist passed to the collective APIs. Deprecating work.result() would also allow us to remove the `outputs_` field in the Work class, avoiding an "artificial" reference to the tensor, which could potentially hold up the tensor's memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117565 Approved by: https://github.com/wconstab	2024-01-18 01:22:37 +00:00
fduwjj	40ce9a4cfb	[c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2023-12-12 20:52:43 +00:00
Howard Huang	7a3c3d63bf	fix gloo cuda sparse_allreduce dispatch (#111485 ) Fixes #111422 allreduce_sparse_cuda gets dispatched to allreduce_sparse, which doesn't exist for gloo. However, gloo has an existing implementation so this is just fixing the dispatching to that. The reason CI didn't catch this is that we are calling the backend directly. Added a test which calls the public API (dist.XYZ) and goes through the dispatcher Pull Request resolved: https://github.com/pytorch/pytorch/pull/111485 Approved by: https://github.com/fduwjj	2023-10-19 21:15:45 +00:00
Pritam Damania	704b0b3c67	[RESUBMIT] Standardize on error types for distributed errors. (#108191 ) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy code during error handling somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, in this PR I've added these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191 Approved by: https://github.com/H-Huang	2023-08-30 21:47:39 +00:00
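A rough illustration of why a typed hierarchy is easier to handle than string matching. The classes below are local stand-ins that mirror the four names listed in the commit message; they are not imported from PyTorch, and deriving the base from `RuntimeError` is an assumption made for this sketch. The `classify` helper is hypothetical.

```python
class DistError(RuntimeError):
    """Stand-in for the base type of all distributed errors."""


class DistBackendError(DistError):
    """Stand-in for process-group backend errors."""


class DistStoreError(DistError):
    """Stand-in for errors originating from the store."""


class DistNetworkError(DistError):
    """Stand-in for general network errors from the socket library."""


def classify(exc: Exception) -> str:
    # Dispatch on the exception type instead of searching the message text.
    if isinstance(exc, DistBackendError):
        return "backend"
    if isinstance(exc, DistStoreError):
        return "store"
    if isinstance(exc, DistNetworkError):
        return "network"
    if isinstance(exc, DistError):
        return "distributed"
    return "other"


if __name__ == "__main__":
    try:
        raise DistNetworkError("Connection reset by peer")
    except DistError as exc:
        print(classify(exc))
```

A single `except DistError` clause catches every subtype, so callers that do not care about the origin need only one handler.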
PyTorch MergeBot	d4ff06ec84	Revert "Standardize on error types for distributed errors. (#107651 )" This reverts commit `0e2317479b`. Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))	2023-08-28 23:58:33 +00:00
Pritam Damania	0e2317479b	Standardize on error types for distributed errors. (#107651 ) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy code during error handling somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, in this PR I've added these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651 Approved by: https://github.com/H-Huang	2023-08-28 21:58:15 +00:00
Justin Chu	232b96b6e2	[BE] Enable ruff's UP rules and autoformat distributed/ (#105433 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433 Approved by: https://github.com/albanD	2023-07-19 14:27:11 +00:00
Rodrigo Kumpera	9e1b07e692	[C10d] Handle bool tensors in gloo. Fixes #103585 . (#105354 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105354 Approved by: https://github.com/wanchaol	2023-07-18 20:42:58 +00:00
mantaionut	e3ee5b00be	Enable test sparse allreduce basics Windows (#103317 ) The test was marked as flaky in #59965. However, it is not failing anymore so it can be enabled. This PR enables only one test, but it will only run in local tests because the test suite is disabled in CI. #94495 is a superset of this PR which enables the full test suite. The CI run there shows this test passing. Fixes #59965 Pull Request resolved: https://github.com/pytorch/pytorch/pull/103317 Approved by: https://github.com/kit1980	2023-06-14 07:37:50 +00:00
shaoyf42	97180aca5e	Enables barrier to support the specified device (#99589 ) Enables barrier to support the specified device, e.g cuda/custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919 Today, there are two limitations of barrier: One is that barrier does not support custom #device: `fbdb86c174/torch/csrc/distributed/c10d/ProcessGroup.hpp (L512-L522)` The second is that there is a special valid for nccl when device_id is not None, which is an assumption for cuda and nccl bindings, and also hinders custom device. `789070986c/torch/distributed/distributed_c10d.py (L3504-L3508)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589 Approved by: https://github.com/kwen2501	2023-05-17 05:26:04 +00:00
Xiaodong Wang	c29ab84115	Fix bug in process_group_name when there are duplicate pgs (#100518 ) Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just calls new_group 3 times, with a local barrier with the members within the group. Reviewed By: xunnanxu, eeggl Differential Revision: D45315615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518 Approved by: https://github.com/kumpera	2023-05-04 02:12:28 +00:00
Rodrigo Kumpera	ad21890f8f	[c10d] Scalable PG initiation. (#99931 ) Add use_local_synchronization argument to new_group. When this argument is True, it changes new_group to do a store_barrier only on the ranks that are part of the group and not the whole cluster. This addresses both scalability and composability problems associated with new_group. Fixes #81291. This is relanding #84224 As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both functions and perf is the following:

new_group use_local_synchronization=False:

| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

new_group use_local_synchronization=True:

| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sublinear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128. Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3. Setup: 1 AWS host, backend gloo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931 Approved by: https://github.com/xw285cornell	2023-04-27 13:44:02 +00:00
Huy Do	5bcbb9bca7	Skip testing distributed backend if the backend (UCC, NCCL, Gloo) is not available (#98576 ) After the recent change on https://github.com/pytorch/pytorch/pull/88110 to add a new c10d test for UCC backend, the test starts to fail on ROCm distributed job. I guess ROCm doesn't support that backend yet, so I go ahead and disable the test there. Please let me know if the support on ROCm is coming, I will close this PR accordingly. But it's now failing in ROCm trunk with `AssertionError: Unknown c10d backend type UCC`, for example `4adba70cc6` Pull Request resolved: https://github.com/pytorch/pytorch/pull/98576 Approved by: https://github.com/Fuzzkatt, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/ZainRizvi	2023-04-10 20:04:40 +00:00
Sergii Dymchenko	35bf5bac26	Fix "sandcastle_skip_if decorator name is confusing" (#95649 ) Fixes https://github.com/pytorch/pytorch/issues/89473 See the issue https://github.com/pytorch/pytorch/issues/89473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/95649 Approved by: https://github.com/atalman, https://github.com/malfet	2023-03-03 09:29:40 +00:00
fduwjj	a88bfc60c7	[2/N][ST deprecate][BE] Remove Replicate Tensor convert from DDP and PTD (#95450 ) No use is found for this ST/Replicated Tensor based DDP. As part of ShardedTensor migration, let's remove this logic. Trying to undo everything in https://github.com/pytorch/pytorch/pull/75753. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95450 Approved by: https://github.com/wanchaol	2023-02-26 03:03:37 +00:00
Xuehai Pan	046e88a291	[BE] [3/3] Rewrite `super()` calls in test (#94592 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-12 22:20:53 +00:00
Aaron Gokaslan	67d9790985	[BE] Apply almost all remaining flake8-comprehension checks (#94676 ) Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676 Approved by: https://github.com/ezyang	2023-02-12 01:01:25 +00:00
Howard Huang	f8b348c1fc	Update ProcessGroupRoundRobin (#91172 ) Summary: Temporary fix to unblock jobs in https://fb.workplace.com/groups/300451907202972/permalink/906337097050850/ Real fix would be to remove use of _round_robin_process_group API and update corresponding references (e.g. PyText) Test Plan: sandcastle Differential Revision: D42169592 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91172 Approved by: https://github.com/awgu	2022-12-20 19:53:34 +00:00
Howard Huang	7a0f29b776	Allow Process Group to support multiple backends (#88330 ) (#90997 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330 ### Implementation Move backend-specific (NCCL, Gloo, etc) collective implementations to corresponding `Backend` class. Update ProcessGroup to support multiple backends and use dispatcher to call backends based on tensor device type. ### Changes #### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`) - Update pybind definitions for new process group base class and new backend class - Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests - Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class. - Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type - Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched. - Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122. #### python changes (`distributed_c10d.py`, test files) - Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API - `get_backend()` deprecation warning - `init_process_group` now returns a generic `ProcessGroup` object, it contains a list of backends (the ones stated above) which it will dispatch operations to. 
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example
This is a basic script (using 2 backends within a process group)
```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```
Test Plan: Imported from OSS Differential Revision: D42069829 Pulled By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997 Approved by: https://github.com/awgu, https://github.com/fduwjj	2022-12-16 23:15:00 +00:00
Sergii Dymchenko	0ac0af02d5	Reland Fix issue 38095 TODO in test_multiprocessing.py (#90741 ) Fix TODO related to https://github.com/pytorch/pytorch/issues/38095 Reland of https://github.com/pytorch/pytorch/pull/90335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90741 Approved by: https://github.com/clee2000	2022-12-15 05:32:27 +00:00
Howard Huang	80150788bc	[21/N] Add alltoall_base custom op with CPU/CUDA implementations (#89813 ) Differential Revision: [D41812670](https://our.internmc.facebook.com/intern/diff/D41812670) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89813 Approved by: https://github.com/kwen2501	2022-12-08 23:39:26 +00:00
Sergii Dymchenko	f99f239531	Fix issue 38095 TODOs in gloo tests (#89985 ) Fix TODOs related to https://github.com/pytorch/pytorch/issues/38095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89985 Approved by: https://github.com/ZainRizvi	2022-12-08 01:12:37 +00:00
Howard Huang	5797f74924	[19/N] Add monitored_barrier custom op with CPU implementation (#89318 ) Differential Revision: [D41415324](https://our.internmc.facebook.com/intern/diff/D41415324) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89318 Approved by: https://github.com/kwen2501	2022-11-22 14:18:40 +00:00
Howard Huang	be22b5d39f	[18/N] Add allgather_coalesced custom op with CPU/CUDA implementations (#89317 ) Differential Revision: [D41415321](https://our.internmc.facebook.com/intern/diff/D41415321) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89317 Approved by: https://github.com/kwen2501	2022-11-22 14:14:17 +00:00
Colin Taylor	24b9890f03	[torchrec] [composable] update ShardedEmbeddingBagCollection to be use registered EBCs with shardedTensors as registered modules (#758 ) (#88026 ) Summary: X-link: https://github.com/pytorch/torchrec/pull/758 This PR fixes a bug in FSDP/DDP, where ShardedTensors are not supported even if passed in as params to ignore. this is important for composability because TorchRec named_parameters() will return FQN of shardedTensors (as defined in goals) It defines device of ShardedTensor to be None when local_tensor() does not exist on rank update ShardedEmbeddingBagCollection to be composable according to https://docs.google.com/document/d/1TBJSd5zgEg6cRcXv3Okuj7bBkqQwGS2IPh4TLWNNzFI/edit Differential Revision: D40458625 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88026 Approved by: https://github.com/wanchaol, https://github.com/rohan-varma	2022-11-17 04:26:13 +00:00


101 Commits