pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Hari Krishna Sai Kodali	9d184bda2f	add device generalization support for distributed tests (#156796 ) MOTIVATION To generalize Distributed test cases for non-CUDA devices CHANGES - test/distributed/checkpoint/test_fsspec.py - test/distributed/checkpoint/test_state_dict.py - test/distributed/test_multi_threaded_pg.py Replaced hard coded device names with torch.accelerator.current_accelerator - torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py support for hccl backend Pull Request resolved: https://github.com/pytorch/pytorch/pull/156796 Approved by: https://github.com/guangyey, https://github.com/ezyang	2025-07-16 09:37:03 +00:00
Xuehai Pan	db3290846e	[BE][Easy][10/19] enforce style for empty lines in import segments in `test/d*/` (#129761 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761 Approved by: https://github.com/fegin	2024-07-17 16:57:39 +00:00
Yuanhao Ji	e3effa5855	Enable UFMT on all of `test/distributed` (#123539 ) Partially addresses #123062 Ran lintrunner on: - `test/distributed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539 Approved by: https://github.com/ezyang	2024-04-17 06:46:02 +00:00
PyTorch MergeBot	52be63eb2c	Revert "Enable UFMT on all of `test/distributed` (#123539 )" This reverts commit `89ac37fe91`. Reverted https://github.com/pytorch/pytorch/pull/123539 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123539#issuecomment-2058329471))	2024-04-16 06:33:21 +00:00
Yuanhao Ji	89ac37fe91	Enable UFMT on all of `test/distributed` (#123539 ) Partially addresses #123062 Ran lintrunner on: - `test/distributed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539 Approved by: https://github.com/ezyang	2024-04-16 03:23:56 +00:00
Tristan Rice	358ace1a1b	functional_collectives: add first differentiable collective -- all_to_all_single_grad (#123599 ) This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs. This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions. This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering. To avoid cudaStream issues this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can be alleviated potentially using `compile`. Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py Test plan: ``` pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile pytest test/distributed/test_functional_api.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599 Approved by: https://github.com/yifuwang	2024-04-12 01:48:49 +00:00
Andrew Gu	4cf6d1172b	[FSDP2] Used `ReduceOp.AVG` if fp32 reduce-scatter (#120919 ) This PR uses `ncclAvg` op (via `ReduceOp.AVG`) if doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save extra memory read/write for dividing. This yields ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅 ). Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919 Approved by: https://github.com/yifuwang, https://github.com/wanchaol ghstack dependencies: #120238, #120910	2024-03-02 00:39:16 +00:00
Rodrigo Kumpera	02cd971e95	[C10D] Improve MTPG autograd test. Fixes #105106 (#105356 ) Explicitly asserts that bwd is running from the same thread as fwd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105356 Approved by: https://github.com/rohan-varma, https://github.com/wanchaol, https://github.com/fduwjj	2023-07-20 13:51:21 +00:00
Rodrigo Kumpera	246dc0d9f2	[MTPG] Use TLS propagation to enable MTPG from bwd. (#104735 ) We use PyTorch's built-in tls propagation in ThreadLocalState to forward the world object from the fwd thread to the bwd thread. This further closes the gap on enabling FSDP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104735 Approved by: https://github.com/rohan-varma	2023-07-12 18:47:02 +00:00
Iris	15eed5b73e	[Oncall][MTPG] Fix flaky test multi_threaded - test_broadcast_object_list (#103568 ) This test(`8340762211/test/distributed/test_multi_threaded_pg.py (L133)` ) is failing on internal sandbox with the following error msg: ``` File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll raise Exception( Exception: world not ready, only 3 PG's registered but world has 4 ranks exiting thread 1 ERROR ``` Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0 We believe this is because we no longer perform barrier after init (see https://github.com/pytorch/pytorch/pull/99937). This PR temporarily turn back on ```TORCH_DIST_INIT_BARRIER``` to avoid flaky test for the time being, but we should look into it to find a way to properly do this. cc. @kumpera @kwen2501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568 Approved by: https://github.com/H-Huang	2023-06-18 07:05:28 +00:00
Rodrigo Kumpera	5b4a523583	Add all_reduce_coalesced to functional collectives (#98640 ) This adds all_reduce_coalesced to MTPG to ease testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98640 Approved by: https://github.com/wanchaol	2023-04-26 17:05:54 +00:00
Xilun Wu	89894115ab	[MTPG] add all_to_all collective to MTPG (#98791 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98791 Approved by: https://github.com/kumpera	2023-04-11 21:35:45 +00:00
Rodrigo Kumpera	8177081848	Add gather to MTPG (#97555 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/97555 Approved by: https://github.com/H-Huang	2023-03-27 19:37:02 +00:00
Rodrigo Kumpera	1d3c394d5e	[MTPG] Improve all_reduce and handle bwd thread support (#95524 ) This implements all reduce ops in all_reduce and a PG being used from a thread different than the one that created it. We should be this >< close to getting complex training tests working. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95524 Approved by: https://github.com/H-Huang	2023-03-03 18:53:36 +00:00
Wanchao Liang	e16979c9a0	[threaded_pg] full rewrite of MultiThreadedTestCase to enable device_type tests (#91650 ) This PR did a full rewrite of MultiThreadedTestCase, to make it more aligned with the MultiProcessTestCase, also changed how it do spawning and testing, so that we could embed thread local states when running tests. This PR enables device_type tests to work with MultiThreadedTestCase Pull Request resolved: https://github.com/pytorch/pytorch/pull/91650 Approved by: https://github.com/XilunWu	2023-01-17 03:26:36 +00:00
Wanchao Liang	9942ddd5b3	[threaded_pg] enable subpg creation and concurrent collective (#91649 ) This PR refactors the threaded PG logic to enable multiple sub pg creation under the world threaded pg, and allow the case where we can call collectives together on different subpgs Pull Request resolved: https://github.com/pytorch/pytorch/pull/91649 Approved by: https://github.com/XilunWu	2023-01-17 03:26:34 +00:00
Xilun Wu	a6dcebf997	[threaded pg] make exception handling consistent with MultiProcessTestCase (#90712 ) Differential Revision: [D42153661](https://our.internmc.facebook.com/intern/diff/D42153661) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90712 Approved by: https://github.com/wanchaol	2022-12-20 23:37:40 +00:00
Xilun Wu	34da446072	[threaded pg] add assertion util to MultiThreadedTestCase (#90595 ) Differential Revision: [D42153662](https://our.internmc.facebook.com/intern/diff/D42153662) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90595 Approved by: https://github.com/wanchaol	2022-12-20 23:37:40 +00:00
Xilun Wu	3759777edc	[threaded PG] fix long hang issue in testing (#90515 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90515 Approved by: https://github.com/wanchaol	2022-12-09 05:24:08 +00:00
Xilun Wu	91dcef41ae	Thread PG: add allreduce to threaded pg (#89043 ) Summary: Goal Add `all_reduce` collective to multi-threaded ProcessGroup added in D40236769 (`6663ae5537`). Code Motion Added `allreduce` collective to ProcessLocalGroup (a subclass of c10d ProcessGroup). What's Next Add a DDP test utilizing the new allreduce op. Generalize `allreduce` to allow other `ReduceOp`s besides `SUM`. Test Plan: cd fbcode/caffe2 buck2 test mode/dev //caffe2/test/distributed:multi_threaded Differential Revision: D41046606 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89043 Approved by: https://github.com/wanchaol	2022-11-23 19:43:30 +00:00
Wanchao Liang	821ba6b51b	[4/n] Thread PG: add reduce_scatter to threaded pg (#89442 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89442 Approved by: https://github.com/yhcharles, https://github.com/fduwjj	2022-11-21 22:36:44 +00:00
Wanchao Liang	3e99d4db76	[3/n] Thread PG: add scatter to threaded pg (#89441 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89441 Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj	2022-11-21 22:36:44 +00:00
Wanchao Liang	3876f94c3d	[2/n] Thread PG: add test for broadcast (#89440 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89440 Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj	2022-11-21 22:36:42 +00:00
Wanchao Liang	deae450899	[1/n] Thread PG: add test for allgather (#89439 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89439 Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj	2022-11-21 22:36:41 +00:00
Charlie Yan	ee05f47bdd	Rebase and re-land thread PG (#88795 ) The previous PR (https://github.com/pytorch/pytorch/pull/88627) has been reverted due to a failed check. After rebasing and rerun, all checks passed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88795 Approved by: https://github.com/huydhn, https://github.com/wanchaol	2022-11-15 21:58:58 +00:00
PyTorch MergeBot	c7fc710459	Revert "[3/n] Thread PG: add threaded PG implementation (#88627 )" This reverts commit `6dd081846e`. Reverted https://github.com/pytorch/pytorch/pull/88627 on behalf of https://github.com/huydhn due to This breaks one macos m1 test `6dd081846e` in trunk. PR also fails with the same issue so I think trymerge code has a bug here letting this one merged	2022-11-09 22:38:41 +00:00
Charlie Yan	6dd081846e	[3/n] Thread PG: add threaded PG implementation (#88627 ) Summary: After the previous 2 diffs, finally we can add the threaded ProcessGroup implementation. Test Plan: TBD Reviewed By: XilunWu Differential Revision: D40992593 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88627 Approved by: https://github.com/XilunWu, https://github.com/H-Huang	2022-11-09 20:51:11 +00:00
PyTorch MergeBot	f451e824f3	Revert " C10D extension to enable per-thread PG (#86348 )" This reverts commit `97abc21f2b`. Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests `97abc21f2b`	2022-10-14 01:26:46 +00:00
Rodrigo Kumpera	97abc21f2b	C10D extension to enable per-thread PG (#86348 ) Move a bunch of globals to instance methods and replace all use to them. We move all PG related globals under World and use a singleton instance under _world. This creates an undocumented extension point to inject full control of how how c10d state behaves. One simple hack is to change _world to an implementation that uses a threadlocal and enable per-thread PGs. It almost get DDP working and the PG is missing an implementation of all_reduce. This enables notebook usage of PTD, which is a big deal for learning it: https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68 This change ensures BC by keeping the global variables around and have the default _World wrap it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348 Approved by: https://github.com/rohan-varma	2022-10-13 22:23:28 +00:00
PyTorch MergeBot	6fae62b35f	Revert "C10D extension to enable per-thread PG (#84153 )" This reverts commit `5cbffbbac9`. Reverted https://github.com/pytorch/pytorch/pull/84153 on behalf of https://github.com/kumpera due to broke internal stuff	2022-09-29 13:51:05 +00:00
Rodrigo Kumpera	5cbffbbac9	C10D extension to enable per-thread PG (#84153 ) Move a bunch of globals to instance methods and replace all use to them. We move all PG related globals under World and use a singleton instance under _world. This creates an undocumented extension point to inject full control of how how c10d state behaves. One simple hack is to change _world to an implementation that uses a threadlocal and enable per-thread PGs. It almost get DDP working and the PG is missing an implementation of all_reduce. This enables notebook usage of PTD, which is a big deal for learning it: https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153 Approved by: https://github.com/rohan-varma	2022-09-27 21:42:31 +00:00

31 Commits