pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Prachi Gupta	f2b44424a1	[ROCm] Skip _stress_cuda and test_ddp_apply_optim_in_backward (#155724 ) These tests are flaky on ROCm and have been skipped via Github issues, but the bot keeps closing the issues after not observing the failures for these tests in the rerun_disabled_tests runs (not sure why they don't fail there), and we have to keep reopening them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155724 Approved by: https://github.com/jeffdaily Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>	2025-06-12 21:18:04 +00:00
Tsung-Hsien Lee	cae25ef4e5	[c10d] Enhance Error Logging in `new_subgroups()` for Non-Divisible World Sizes (#154124 ) Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness. Test Plan: contbuild & OSS CI Differential Revision: D75226925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124 Approved by: https://github.com/wz337	2025-05-23 17:12:43 +00:00
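The idea in the `new_subgroups()` commit above, reporting the actual numbers when the world size is not divisible by the group size, can be sketched roughly as follows. This is a minimal illustration only: `check_group_size` and its message wording are hypothetical and are not the text PyTorch emits.

```python
def check_group_size(world_size: int, group_size: int) -> int:
    """Return the number of subgroups, or raise with the concrete numbers.

    Hypothetical helper for illustration; not part of torch.distributed.
    """
    if group_size <= 0 or group_size > world_size:
        raise ValueError(
            f"group size ({group_size}) must be between 1 and "
            f"the world size ({world_size})"
        )
    if world_size % group_size != 0:
        # Printing both values helps when the caller uses a different
        # variable name (e.g. num_trainers_per_group) for group_size.
        raise ValueError(
            f"world size ({world_size}) must be divisible by "
            f"group size ({group_size})"
        )
    return world_size // group_size
```

Calling `check_group_size(8, 3)` raises a `ValueError` whose message contains both 8 and 3, so the user can map the failure back to their own configuration.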
PyTorch MergeBot	59c5fff2aa	Revert "[DDP] rebuilt bucket order when find_unused_parameters=true (#153404 )" This reverts commit `a79e621c1c`. Reverted https://github.com/pytorch/pytorch/pull/153404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153404#issuecomment-2902741300))	2025-05-22 22:26:59 +00:00
Yanli Zhao	a79e621c1c	[DDP] rebuilt bucket order when find_unused_parameters=true (#153404 ) Differential Revision: D72437251 Enable to rebuild bucket order when find_unused_parameters=true. It should be always better than not rebuilding bucket order when find_unused_parameters=True: 1. for cases where bucket order in the first iteration is the same as the parameter order, rebuilding bucket order will not change anything 2. for cases where bucket order in the first iteration is not the same as the parameter order, there could be two cases: a. bucket order will not change after 1st iteration even the graph is dynamic and there is unused parameter, in this case, rebuilding bucket order will have performance gain b. bucket order change after 1st iteration due to dynamic graph, in this case, both parameter order and bucket order in 1st iteration are not ideal, so rebuilding bucket order or not does not matter it can help case 2.a if enabling to rebuild bucket order when find_unused_parameters=true. meanwhile it will not hurt other cases in 1 and 2.b. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404 Approved by: https://github.com/rohan-varma, https://github.com/fegin	2025-05-20 02:45:01 +00:00
Tsung-Hsien Lee	6487ea30b3	[c10d] Fix `new_subgroups(group=)` bug (#153798 ) Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to the `get_rank()` function, which caused the function to return the rank of the entire group instead of the rank of the current process. The fix involves removing the `group` parameter from the `get_rank()` function call. Test Plan: contbuild & OSS CI Differential Revision: D74964213 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798 Approved by: https://github.com/Skylion007	2025-05-19 17:01:10 +00:00
Aby Mathew C	6a28cc826f	Add TEST_HPU flag to set device type (#153461 ) MOTIVATION: This PR includes a minor change to check for the TEST_HPU flag as well before falling back to CPU. Without this flag, some tests were falling back to CPU, causing them to fail. Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66 CHANGES: add the TEST_HPU flag to some of the conditions checking the environment; use the DEVICE_COUNT variable instead of the torch.accelerator.device_count() API since the latter is not supported on out-of-tree devices like Intel Gaudi. @ankurneog , @EikanWang , @cyyever , @guangyey Pull Request resolved: https://github.com/pytorch/pytorch/pull/153461 Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD	2025-05-14 19:31:40 +00:00
Xilun Wu	cbb03e6971	[BE][DTensor] move torch.distributed._tensor import to torch.distributed.tensor in test files (#153225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153225 Approved by: https://github.com/kwen2501, https://github.com/fegin	2025-05-09 20:40:54 +00:00
Xilun Wu	0f9821d0e3	[BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153114 Approved by: https://github.com/cyyever, https://github.com/fegin, https://github.com/H-Huang, https://github.com/Skylion007	2025-05-08 14:01:49 +00:00
Wanchao Liang	3380a46b44	Fix DTensorTestBase to barrier with device ids (#150896 ) try to get rid of the below annoying warnings when running the unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/150896 Approved by: https://github.com/fegin	2025-04-22 20:22:55 +00:00
Yanli Zhao	d8bafd23ab	[DDP] add one option to allow skipping all reduce unused parameters (#151503 ) Summary: add one option to allow skipping all reduce unused parameters, this could help improve training throughput significantly when the number of unused parameters is large in the model. Test Plan: unit tests, CI Differential Revision: D72282069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151503 Approved by: https://github.com/mrshenli	2025-04-17 23:30:19 +00:00
Ke Wen	35c45a4a31	[Reland] Launch kernel on current stream & remove `record_stream` entirely (#150398 ) Relanding #148590 due to merge conflict. This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Squashed contents: * [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820) PTD current workflow: - PTD creates its own dedicated `ncclStream` for comm operation - it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective such stream synchronization become expensive in Inference world (cpu overhead: 70us vs GPU kernel time: 160us). 
This diff: - async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead - async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready - pass down async from c10d down to NCCL-PG this helps shave off 50% CPU overhead (70us -> 35us), which reduce total CPU/GPU from 230us to 195us by 15% * [PGNCCL] Make avoid-record-stream default * [c10d] Add asyncOp argument to Ops * Change python side wait * Pass asyncOp at ProcessGroup level * Watchdog unstashing tensors as a safety net * Stash tensors for reduce_scatter_v and all_gather_v Pull Request approved: https://github.com/pytorch/pytorch/pull/149753 * [c10d] Move unstashing from watchdog to main thread Pull Request approved: https://github.com/pytorch/pytorch/pull/150079 * [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation Pull Request approved: https://github.com/pytorch/pytorch/pull/150130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398 Approved by: https://github.com/atalman	2025-04-01 16:46:07 +00:00
PyTorch MergeBot	afa1eda901	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `ef6296e7f2`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))	2025-03-17 22:43:15 +00:00
Ke Wen	ef6296e7f2	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-11 18:36:12 +00:00
PyTorch MergeBot	a95eb0c0a7	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `2149f6c684`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))	2025-03-10 22:38:40 +00:00
Ke Wen	2149f6c684	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-09 07:32:23 +00:00
PyTorch MergeBot	9cb25f0ea2	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `17dbeb11db`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))	2025-03-09 03:01:55 +00:00
Ke Wen	17dbeb11db	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-08 20:00:12 +00:00
lzhang2	84b58bd63e	Enable FSDP tests on XPU device (#147518 ) Motivation: Enable FSDP tests on XPU device Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518 Approved by: https://github.com/weifengpy	2025-03-04 23:49:37 +00:00
cyy	b0dfd242fa	Remove NO_MULTIPROCESSING_SPAWN checks (#146705 ) py 3.9 has spawn. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705 Approved by: https://github.com/colesbury	2025-02-28 05:53:19 +00:00
PyTorch MergeBot	926b7b5027	Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705 )" This reverts commit `40ad5e01df`. Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint?, I guess land race with ruff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))	2025-02-28 03:04:38 +00:00
cyyever	40ad5e01df	Remove NO_MULTIPROCESSING_SPAWN checks (#146705 ) py 3.9 has spawn. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705 Approved by: https://github.com/colesbury	2025-02-28 00:15:32 +00:00
Aaron Orenstein	db4ce78d46	PEP585: More UP006 fixes (#146392 ) This should be the final PR before we can enable RUFF UP006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392 Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007	2025-02-20 06:18:13 +00:00
mori360	a21a123fd5	Add fqn_modifier at loading_state_dict and unit test (#146557 ) In a Fusion model, users might change the state_dict keys with a state_dict_hook. The load_state_dict APIs here won't call model.state_dict(), so the hooks won't be called to change the keys, causing a mismatch between fqn and state_dict keys. This PR suggests that users declare how they would change the state_dict key prefix (they can name it; here we call it "fqn_modifiers"). During loading of the state_dict, we apply the prefix change while getting the fqn so that the keys are processed the same way as through the state_dict hook. For example: There's a state_dict_hook: ``` def _state_dict_hook(self, destination, prefix, keep_vars): """Remove "embedding" from the original embedding in the state_dict name. This keeps the original state dict name for the embedding from before fusing with the FusionEmbedding. [!Note] This update changes the order of the OrderedDict """ key = prefix + "embedding.weight" new_key = prefix + "weight" destination[new_key] = destination[key] del destination[key] ``` In the dsd after this PR, we would skip "embedding." before "weight" if we find the "fqn_modifiers" attribute on that module ``` def fqn_modifiers(self) -> Dict[str, str]: return { "weight": "embedding", } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557 Approved by: https://github.com/fegin	2025-02-18 22:54:41 +00:00
Aaron Gokaslan	e738f7ba23	[BE]: Enable ruff rule SIM113 (#147290 ) Lint rules that tells the user to avoid keeping track of their own counter and use the builtin enumerate when possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147290 Approved by: https://github.com/jansel	2025-02-16 22:41:16 +00:00
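As a rough illustration of what the SIM113 rule in the commit above targets, the sketch below contrasts a hand-maintained counter with the builtin `enumerate`. The list contents and variable names are made up for the example.

```python
names = ["rank0", "rank1", "rank2"]

# Before: a manually tracked counter, the pattern SIM113 flags.
manual = []
idx = 0
for name in names:
    manual.append((idx, name))
    idx += 1

# After: enumerate supplies the index alongside each item.
with_enumerate = []
for idx, name in enumerate(names):
    with_enumerate.append((idx, name))
```

Both loops build the same list of `(index, name)` pairs; the second cannot forget to increment the counter.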
Haifeng Jin	8177fc4d33	Make regex error catching compatible with Python 3.12+. (#145945 ) In Python 3.12, the error message has changed from "Can't pickle local object" to "Can't get local object". The old regex would no longer catch the error. This PR make it compatible with Python 3.12 and backward compatible as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145945 Approved by: https://github.com/H-Huang	2025-02-05 00:57:36 +00:00
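One way to match both wordings quoted in the commit above is a single alternation in the pattern. This is a sketch under that assumption; the constant and function names are hypothetical and the exact regex used in the PyTorch test suite may differ.

```python
import re

# Accepts the pre-3.12 wording ("pickle") and the 3.12+ wording ("get").
LOCAL_OBJECT_ERROR = re.compile(r"Can't (?:pickle|get) local object")


def is_local_object_error(message: str) -> bool:
    """Return True if the message matches either known wording."""
    return LOCAL_OBJECT_ERROR.search(message) is not None
```

A pattern like this could be passed to `assertRaisesRegex` so the same test passes on older and newer interpreters.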
Wei Wang	3a4e7a589b	[CI][Distributed] Fix edge case: One rank case (Rank 0) should get [False, False] (#146099 ) To match the expected tensor (i.e. 2nd element in the array). Making rank0 receive [False, False] Fixes one of the issues reported in #146094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146099 Approved by: https://github.com/eqy	2025-01-31 20:31:13 +00:00
Marc Horowitz	efebec5ef5	[dcp] Add ZStandard transformer (#143360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360 Approved by: https://github.com/saumishr, https://github.com/albanD ghstack dependencies: #145528	2025-01-25 00:14:07 +00:00
Aaron Orenstein	dea7ad3371	PEP585 update - torch/testing (#145200 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200 Approved by: https://github.com/bobrenjc93	2025-01-20 22:42:42 +00:00
Marc Horowitz	ba3f1c29ee	[dcp] Add extension mechanism (#143358 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143358 Approved by: https://github.com/saumishr	2025-01-17 01:51:37 +00:00
RAHUL SINGH	95b41d2aa4	Tests Generalization for multiple accelerator devices (#139749 ) Motivation: Generalize unit tests so that they can be executed for cuda and non cuda devices. Changes: There are general changes in the common_dtensor module for device type generalization so that tests can be executed on non cuda devices too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139749 Approved by: https://github.com/kwen2501	2025-01-14 08:52:46 +00:00
bobrenjc93	3b6b306b71	Migrate from Tuple -> tuple in torch/testing (#144256 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144256 Approved by: https://github.com/aorenste	2025-01-10 06:37:55 +00:00
Xuehai Pan	dcc3cf7066	[BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415 ) The fixes are generated by: ```bash ruff check --fix --preview --unsafe-fixes --select=E226 . lintrunner -a --take "RUFF,PYFMT" --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2025-01-08 21:55:00 +00:00
Xuehai Pan	b77406a9ec	[BE][CI] bump `ruff` to 0.8.4 (#143753 ) Changes: 1. Bump `ruff` from 0.7.4 to 0.8.4 2. Change `%`-formatted strings to f-string 3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753 Approved by: https://github.com/Skylion007	2024-12-24 12:24:10 +00:00
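Item 3 in the commit above replaces the `__`-prefix naming convention with the `/` separator. The sketch below shows the difference with two made-up functions; only the `/` form is enforced by the interpreter (Python 3.8+), while the `__` prefix is a convention understood by type checkers.

```python
def old_style(__x: int, __y: int) -> int:
    # Double-underscore names signal "positional-only" to type checkers,
    # but the interpreter itself does not enforce it.
    return __x + __y


def new_style(x: int, y: int, /) -> int:
    # Parameters before "/" cannot be passed by keyword.
    return x + y
```

Calling `new_style(x=1, y=2)` raises `TypeError`, whereas `new_style(1, 2)` works as expected.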
Tom Ritchford	f1cbf4b1b5	Enable ruff's unused variable checking everywhere in pytorch (#136965 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965 Approved by: https://github.com/cyyever, https://github.com/albanD	2024-12-22 02:33:11 +00:00
Tsung-Hsien Lee	84f818f359	[DTensorTestbase] Fix `TestFunc` typing issue (#141513 ) Summary: `TestFunc` is annotated as `Callable[[object], object]`, which represents a callable that takes a single argument of any type (`object`) and returns a value of any type (`object`). However, in reality, `TestFunc` could take any number of arguments; as a result, the correct typing should be `Callable[..., object]` instead, which represents a callable that takes any number of arguments (including zero) and returns a value of any type (`object`). Test Plan: Contbuild & OSS CI Differential Revision: D66463705 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141513 Approved by: https://github.com/wz337, https://github.com/Skylion007	2024-11-26 19:48:34 +00:00
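The typing distinction in the commit above can be illustrated with a small sketch. In standard `typing` syntax the any-arguments form is written with a bare ellipsis, `Callable[..., object]`. The `run` helper and the sample functions are hypothetical.

```python
from typing import Callable

# Accepts exactly one positional argument.
SingleArgFunc = Callable[[object], object]

# Accepts any number of arguments, including zero.
TestFunc = Callable[..., object]


def run(func: TestFunc, *args: object, **kwargs: object) -> object:
    """Call func with whatever arguments were supplied."""
    return func(*args, **kwargs)


def no_args() -> str:
    return "none"


def two_args(a: int, b: int) -> int:
    return a + b
```

A type checker accepts both `run(no_args)` and `run(two_args, 1, 2)` under `TestFunc`, while annotating `func` as `SingleArgFunc` would reject both.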
Ke Wen	66476617bf	[Dist][CI] Easier override of destroy-upon-exit setting (#141192 ) Adding `destroy_pg_upon_exit` property to allow derived Test classes to control whether auto destroy is desired. (Otherwise, derived test classes will need to rewrite the `_run()` method, leading to duplicated code of `_run()` and if one needs to add things to `_run` in the future, more code change is needed.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192 Approved by: https://github.com/wconstab	2024-11-21 07:32:56 +00:00
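The design in the commit above, a property that subclasses override instead of rewriting `_run()`, might look roughly like this. The classes are simplified stand-ins: the real `_run()` does much more, and the returned step list exists only to make the control flow visible.

```python
class BaseTestCase:
    @property
    def destroy_pg_upon_exit(self) -> bool:
        # Default: clean up the process group automatically.
        return True

    def _run(self) -> list:
        steps = ["run test"]
        if self.destroy_pg_upon_exit:
            # Stand-in for the real teardown call.
            steps.append("destroy process group")
        return steps


class KeepGroupTestCase(BaseTestCase):
    @property
    def destroy_pg_upon_exit(self) -> bool:
        # Opt out without duplicating _run().
        return False
```

Because the subclass overrides only the property, any later change to `_run()` in the base class is inherited automatically.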
Aaron Gokaslan	12e95aa4ee	[BE]: Apply PERF401 autofixes from ruff (#140980 ) * Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables. * list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize. * Manually went back and made mypy happy after the change. * Also fixed style lints in files covered by flake8 but not by pyfmt Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-11-20 17:52:07 +00:00
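The kind of rewrite PERF401 performs, as described in the commit above, can be sketched with a toy example. The data and names are invented for illustration.

```python
values = [1, 2, 3, 4, 5, 6]

# Before: build a list with a loop and append.
squares_loop = []
for v in values:
    if v % 2 == 0:
        squares_loop.append(v * v)

# After: the equivalent list comprehension.
squares_comp = [v * v for v in values if v % 2 == 0]
```

Both forms produce the same list; in Python 3 the comprehension additionally keeps its loop variable out of the enclosing scope.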
Will Constable	1d5a8ee8fb	[C10D] call destroy_process_group after MultiProcess tests (#140820 ) Faced with an annoying string of warnings like this when running tests, <img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212"> My choices seem to be (1) call destroy_process_group() at the end of each test fn, (2) do this in some wrapper, (3) do it in the base test class. Since tests in MultiProcessTestCase are responsible for calling init_process_group themselves, they should also be responsible for calling destroy (or at least method (3) would be asymmetric and may result in double-destroy). But it doesn't feel worth it to go add a destroy call manually to each test, and try/except for a possible second destroy call seems like a happy middle ground. Note: tests that want to ensure that destroy runs cleanly can and should still call destroy _inside_ the test, and this change does not affect that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820 Approved by: https://github.com/fegin	2024-11-18 04:26:21 +00:00
zeshengzong	cb71bcc542	Replace clone.detach with detach.clone (#140264 ) Fixes #64532 As stated in the issue, replace `clone.detach` with `detach.clone` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264 Approved by: https://github.com/soulitzer	2024-11-13 07:01:02 +00:00
Tsung-Hsien Lee	953286b850	[DTensorTestbase] Fix `@with_comms` inactive problem (#139637 ) Summary: `with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem of a decorator with input argument is that it has to be used with invocation always, i.e., you have to use as `with_comms()` rather than `with_comms` which majority of the existing usages. This diff tries to provide a solution such that we could use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`. Test Plan: Contbuild & OSS CI Differential Revision: D65385700 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637 Approved by: https://github.com/wz337	2024-11-13 02:45:02 +00:00
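A common way to let a decorator work both bare and with arguments, as the commit above describes for `with_comms`, is to make the decorated function an optional first parameter. The sketch below uses a hypothetical `with_comms_sketch`; it is not the PyTorch implementation and performs no communicator setup.

```python
import functools


def with_comms_sketch(func=None, *, eager_init=False):
    """Usable as @with_comms_sketch, @with_comms_sketch(), or with eager_init=..."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # A real helper would initialize communicators here
            # and tear them down after the call.
            return fn(*args, **kwargs)

        # Recorded only so the chosen option is observable in this sketch.
        wrapper.eager_init = eager_init
        return wrapper

    if func is None:
        # Called with parentheses: return the actual decorator.
        return decorator
    # Used bare: func is the decorated function itself.
    return decorator(func)


@with_comms_sketch
def test_a():
    return "a"


@with_comms_sketch()
def test_b():
    return "b"


@with_comms_sketch(eager_init=True)
def test_c():
    return "c"
```

Making `eager_init` keyword-only keeps the bare form unambiguous, since the only positional argument the decorator can receive is the function being decorated.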
wz337	5b39734a0a	[DTensor][Test] Fix gloo backend failure when eager_init is turned on (#139097 ) We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error: ``` RuntimeError: No backend for the parent process group or its backend does not support splitting ``` This also fixes test failure is not asserted when using `with_comms()` or `with_comms(eager_init=False)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097 Approved by: https://github.com/XilunWu	2024-10-29 00:04:06 +00:00
Will Feng	e4ad02892f	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-10-20 23:48:54 +00:00
Tom Ritchford	c0582fd0f8	Remove unused Python variables in torch/[b-z]* (#136963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963 Approved by: https://github.com/ezyang	2024-10-19 16:45:22 +00:00
wz337	ff598f2f4d	[DTensorTestbase] Add an optional `eager_init` flag to `with_comms()` to support eager init nccl communicator for DeviceMesh test case (#138108 ) Add an optional `eager_init` flag to `with_comms`. When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization. Otherwise, `device_id` is still `None` and this goes through the normal lazy call. Default for `eager_init` is False. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108 Approved by: https://github.com/kwen2501	2024-10-19 01:04:55 +00:00
PyTorch MergeBot	24ee4af86b	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit `2b7c7a20b9`. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))	2024-10-16 20:05:38 +00:00
Ke Wen	2b7c7a20b9	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy	2024-10-16 16:42:57 +00:00
PyTorch MergeBot	78632b97b1	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit `f43c4d28b8`. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))	2024-10-16 07:26:34 +00:00
Ke Wen	f43c4d28b8	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy	2024-10-16 05:03:08 +00:00
Ke Wen	35fc24fbed	[PGNCCL] Fix bugs in non-blocking mode (#137741 ) ### Fix 1: Throw async error during init wait Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`. ### Fix 2: Add wait after comm split ``` // After calling ncclCommSplit in non-blocking mode, we should wait for the // source communicator to be out of ncclInProgress state. // Reason 1: // it's unsafe to call new operations on the parent comm while it's in // ncclInProgress state. // Reason 2: // as of NCCL 2.23, the ptr value of child comm will not be filled until the // state of parent comm is ncclSuccess. This may change in the future. See: // https://github.com/NVIDIA/nccl/issues/1472 ``` This wait does not mean the child comm is ready for use, neither does it block till that point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741 Approved by: https://github.com/shuqiangzhang	2024-10-15 20:35:39 +00:00
iupaikov-amd	c09b567a91	Fixed error string assertion in test_invalid_devices (#137772 ) ROCm distribution returns different error string for this operation so the test fails this assertion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137772 Approved by: https://github.com/Skylion007	2024-10-13 18:10:07 +00:00


939 Commits