pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Yifu Wang	6e1ba79b7f	[re-land] Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 ) (#116125 ) This is an attempt to re-land https://github.com/pytorch/pytorch/pull/114001. The previous attempt used `std::array` in cuda kernels which wasn't compatible with Meta's internal build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116125 Approved by: https://github.com/yf225	2023-12-20 07:13:50 +00:00
Will Constable	3747aca49a	[C10D] Make all PGNCCL LOG usages use logPrefix() (#116060 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116060 Approved by: https://github.com/fduwjj ghstack dependencies: #116059	2023-12-20 04:19:45 +00:00
Will Constable	4f02cc0670	[C10D] Add logPrefix to abortCommsFromMap (#116059 ) Prints additional info such as PG ID/Rank. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116059 Approved by: https://github.com/fduwjj	2023-12-20 02:17:04 +00:00
PyTorch MergeBot	91e184fd74	Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 )" This reverts commit `4edc921857`. Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/jeanschmidt due to Breaking multiple internal tests, might be flakiness but multiple retries did not elicit an improvement, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1863036417))	2023-12-19 16:01:19 +00:00
Aaron Gokaslan	647f14e70b	[BE]: Enable clang-tidy check for readability-string-compare (#115994 ) Adds a clang-tidy check to ensure string compare is not used unnecessarily in a way that is less efficient and less readable if an equality overload exists. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115994 Approved by: https://github.com/albanD	2023-12-18 16:13:00 +00:00
Nikita Shulga	d7caef7996	[CI] Update clang-format (#116002 ) To 17.0.6 build using https://github.com/pytorch/test-infra/blob/main/.github/workflows/clang-tidy-linux.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/116002 Approved by: https://github.com/suo	2023-12-18 14:58:46 +00:00
Will Constable	9fcf6fb6fe	[C10D] Add waitForDumpOrTimeout to log on dump abandonment (#115876 ) Helps call attention to any cases where the dump actually times out. The timeout is likely to hit if we run into slow stacktrace processing. Log any exceptions encountered in the background thread, but don't raise them: we're already willing to abandon the debug dump, and want to proceed with our normal execution (in the case of dumppipe) or shutdown process (when dumping happens on timeout and shutdown is already initiated). Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876 Approved by: https://github.com/zdevito ghstack dependencies: #115807	2023-12-15 22:13:06 +00:00
Will Constable	82e0d00da9	[c10d] Polish NCCL PG monitor thread log message (#115888 ) We turned on monitor thread by default in https://github.com/pytorch/pytorch/pull/112518, and we want the error message that is displayed when the monitor kills the process to be more informative. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115888 Approved by: https://github.com/wconstab	2023-12-15 22:00:29 +00:00
Jun Luo	2d43e31aa9	Fix wrong behavior of is_alias_of and c10d::reducer on MTIA (#115553 ) Reviewed By: kirteshpatil Differential Revision: D51860023 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115553 Approved by: https://github.com/fduwjj	2023-12-15 11:14:41 +00:00
Yifu Wang	4edc921857	Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 ) ## Summary This PR added 3 intra-node GPU allreduce algorithms to PyTorch: - One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks. - Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather). - Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology. ## Micro Benchmarks ![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e) ![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e) ![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c) ## Details The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for: - Managing handshaking and cuda IPC handle exchange among ranks. - Querying NVLink connection and detecting topology. - Performing algo selection based on available info. - Launching the selected allreduce kernel. `c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows: - When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks. - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently. - `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` on whether to use intra-node allreduce and carries out the communication accordingly. 
We currently detect two types of topologies from the NVLink connection mesh: - Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or fully connected sub-set of hybrid cube mesh) - `msg <= 256KB`: one-shot allreduce. - `256KB < msg <= 10MB`: two-shot allreduce. - `msg > 10MB`: instructs the caller to fall back to NCCL. - Hybrid cube mesh - `msg <= 256KB`: one-shot allreduce. - `msg > 256KB`: instructs the caller to fall back to NCCL. ## Next Steps - Fine-tune algo selection based on GPU model, topology, link speed. - Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints: - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access. - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001 Approved by: https://github.com/yf225	2023-12-15 08:17:35 +00:00
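The size thresholds listed in the entry above for #114001 can be read as a small decision table. The following is a minimal Python sketch of that table only, not the `c10d::IntraNodeComm` implementation. The names `Topology`, `Algo`, and `select_allreduce_algo` are hypothetical, and treating 1 KB as 1024 bytes is an assumption the commit message does not confirm.

```python
from enum import Enum

# Assumption: the commit message does not say whether KB means 1000 or 1024 bytes.
KB = 1024
MB = 1024 * KB


class Topology(Enum):
    FULLY_CONNECTED = "fully_connected"
    HYBRID_CUBE_MESH = "hybrid_cube_mesh"


class Algo(Enum):
    ONE_SHOT = "one_shot"            # on HCM this stands for the HCM one-shot variant
    TWO_SHOT = "two_shot"
    NCCL_FALLBACK = "nccl_fallback"  # caller is told to use NCCL instead


def select_allreduce_algo(topology: Topology, msg_bytes: int) -> Algo:
    """Map (topology, message size) to an algorithm per the listed thresholds."""
    if msg_bytes <= 256 * KB:
        # Both detected topologies use a one-shot allreduce for small messages.
        return Algo.ONE_SHOT
    if topology is Topology.FULLY_CONNECTED and msg_bytes <= 10 * MB:
        return Algo.TWO_SHOT
    # Larger messages, or mid-sized messages on a hybrid cube mesh.
    return Algo.NCCL_FALLBACK
```

For example, under these assumptions a 1 MB message selects `Algo.TWO_SHOT` on a fully connected topology and `Algo.NCCL_FALLBACK` on a hybrid cube mesh.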
Will Constable	8e1cff96e3	[C10D] Log PG size in init log (#115807 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115807 Approved by: https://github.com/XilunWu	2023-12-15 02:38:54 +00:00
zdevito	66b04e3cb7	[nccl flight recorder] nullptr profiling name (#115851 ) Sometimes profiling name can be a nullptr, which throws on conversion to std::string. This adds a check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115851 Approved by: https://github.com/wconstab	2023-12-14 23:40:54 +00:00
Will Constable	04ef21f5dd	[C10D] Make dumpDebuggingInfo share a mutex across PGs (#115803 ) The mutex was originally added to avoid racing to dump debuginfo, where a race in this case would result in a corrupted dump file. The reason a mutex helps is that it forces all dump requests to be serialized, so that an observer would either see an in-progress file, a complete file, or no file. Without a mutex, a fourth state is possible (a file that has been written to by multiple threads and is invalid). Because the mutex was a ProcessGroupNCCL class member, and each PG instance has its own watchdog thread that can launch a dump, it was not doing its job. Making the mutex static shares it between instances of the class and ensures serialization of dumps triggered by any PG. (Note: dumps triggered by different PGs have the same, global contents anyway: there is only one global flight recorder, so it doesn't matter who triggers it.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803 Approved by: https://github.com/kwen2501 ghstack dependencies: #115771, #115798, #115800, #115801	2023-12-14 21:17:44 +00:00
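The difference between a per-instance mutex and one shared across instances, as described in the entry above for #115803, can be illustrated outside of C++. This is a rough Python analogy, not the ProcessGroupNCCL code: a class attribute plays the role of the `static` mutex, and the class name `ProcessGroupSketch` and its method are hypothetical.

```python
import threading


class ProcessGroupSketch:
    # Class attribute: a single lock shared by every instance, analogous to
    # a static mutex. A lock created in __init__ would instead be
    # per-instance and would not serialize dumps across instances.
    _dump_lock = threading.Lock()

    def __init__(self, uid: int) -> None:
        self.uid = uid

    def dump_debug_info(self, sink: list) -> None:
        """Append a begin/end pair to the shared sink without interleaving."""
        with ProcessGroupSketch._dump_lock:
            sink.append(("begin", self.uid))
            sink.append(("end", self.uid))
```

With the shared lock, every `begin` record in the sink is immediately followed by the matching `end`, whichever instance's thread triggered the dump.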
PyTorch MergeBot	7ecddaef23	Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 )" This reverts commit `adfbd2b219`. Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/atalman due to OSSCI oncall, breaks periodic jobs ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1856539040))	2023-12-14 20:33:10 +00:00
Will Constable	0fe014bd8a	[C10D] Change PGNCCL logs to prefix [PG {} Rank {}] (#115801 ) Adds a PG {process group uid} prefix component to logs. This is helpful in situations where there are multiple processgroups, and rank information by itself is confusing. (For example, rank0 on PG1 may correspond to rank3 on PG0. People may assume 'rank0' references the global (PG0) world, but it may reference a sub-pg.) Prefacing the PG helps clarify this. Does NOT change logs from inside WorkNCCL functions, since WorkNCCL doesn't know what PG ID it corresponds to. Will address these logs separately. Example: ``` [I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801 Approved by: https://github.com/fduwjj ghstack dependencies: #115771, #115798, #115800	2023-12-14 18:17:16 +00:00
Will Constable	e94267587b	[C10D] Refactor NCCL logs to use common prefix helper (#115800 ) Put the repeated code that string formats [Rank {rank}] in one place. Sets up for the next PR that also adds more info to this prefix. (Does not change exception messages, which could be done as well. Exception messages are not formatted quite the same way.) Tries instead to keep from changing log behavior (in this PR) and only refactor code. Did limited testing (some logs were observed OK). Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800 Approved by: https://github.com/fduwjj ghstack dependencies: #115771, #115798	2023-12-14 18:13:24 +00:00
Will Constable	eb6e70cf66	[C10D] Only open NCCL dump pipe file once per process (#115798 ) The NCCL flight recorder is per-process (it is shared by all processgroups), but individual process groups used to construct their own pipe for being signaled to dump the flight recorder. This ensures that only one pipe per process is created, by only creating the pipe on the first ProcessGroup (uid_ == 0) which should be the world group. Filenames are still keyed off of rank, but this should now be global rank instead of sub-pg rank, making the filenames unique across the whole trainer process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798 Approved by: https://github.com/zdevito ghstack dependencies: #115771	2023-12-14 17:48:26 +00:00
Will Constable	74d2b9dd15	[C10D] Make DumpPipe disabled when FlightRecorder disabled (#115771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115771 Approved by: https://github.com/fduwjj	2023-12-14 17:42:46 +00:00
Yifu Wang	adfbd2b219	Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 ) ## Summary This PR added 3 intra-node GPU allreduce algorithms to PyTorch: - One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks. - Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather). - Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology. ## Micro Benchmarks ![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e) ![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e) ![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c) ## Details The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for: - Managing handshaking and cuda IPC handle exchange among ranks. - Querying NVLink connection and detecting topology. - Performing algo selection based on available info. - Launching the selected allreduce kernel. `c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows: - When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks. - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently. - `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` on whether to use intra-node allreduce and carries out the communication accordingly. 
We currently detect two types of topologies from the NVLink connection mesh: - Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or fully connected sub-set of hybrid cube mesh) - `msg <= 256KB`: one-shot allreduce. - `256KB < msg <= 10MB`: two-shot allreduce. - `msg > 10MB`: instructs the caller to fall back to NCCL. - Hybrid cube mesh - `msg <= 256KB`: one-shot allreduce. - `msg > 256KB`: instructs the caller to fall back to NCCL. ## Next Steps - Fine-tune algo selection based on GPU model, topology, link speed. - Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints: - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access. - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001 Approved by: https://github.com/yf225	2023-12-14 08:13:08 +00:00
Will Constable	f5458f8f00	[C10D] Make DumpPipe pipe file configurable (#115770 ) Add TORCH_NCCL_DEBUG_INFO_PIPE_FILE env, allowing separate pipe file location from dump file location. Defaults PIPE_FILE to empty, meaning disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115770 Approved by: https://github.com/zdevito	2023-12-14 03:54:43 +00:00
Pavan Balaji	ffc826bf10	[nccl-pg] Store PG global rank information in tracing logs (#115730 ) Storing the list of global ranks associated with each PG allows us to correlate traces across different ranks. Test Plan: OSS CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/115730 Approved by: https://github.com/fduwjj	2023-12-14 00:59:17 +00:00
Pavan Balaji	afa62d6237	[nccl-pg] Pass group global rank information to NCCL PG (#114736 ) We were only passing a subset of the group creation information to the NCCL PG. We are specifically missing the information on which global ranks belong to a particular PG. This allows the NCCL PG to use this additional information for things like better trace logging. Test Plan: OSS CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736 Approved by: https://github.com/kwen2501	2023-12-13 18:02:51 +00:00
fduwjj	40ce9a4cfb	[c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2023-12-12 20:52:43 +00:00
Xiaodong Wang	7553c49514	[S382174] Fix distributed debug w/ non-equal split (#115483 ) Summary: In collectives, it's possible to have non-equal split that has a different implementation and the output tensor size will be different, e.g. https://www.internalfb.com/code/fbsource/[460afb1172b5]/fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp?lines=3104. However, TORCH_DISTRIBUTED_DEBUG=DETAIL will assume the output tensor size is the same and does the check and will fail the job if they don't: https://fburl.com/code/mhte9ty8. c10d code should handle this. Ideally we should check the input size across ranks and make sure they're the same. Maybe for next diff. Test Plan: Test torchrec's TWRW w/ non-even split and it's working now. Reviewed By: zhangruiskyline Differential Revision: D52010942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115483 Approved by: https://github.com/kwen2501, https://github.com/fegin, https://github.com/XilunWu	2023-12-12 18:02:05 +00:00
fduwjj	0379c11248	[c10d] Enable PG NCCL monitor thread by default (#115577 ) We added a monitor thread in NCCL PG in https://github.com/pytorch/pytorch/pull/112518. To summarize what the monitor thread does: it listens to the heartbeat from the watchdog thread and detects an unhealthy NCCL watchdog hang (due to several reasons such as nccl/cuda API bugs or unexpected blocking behaviors). This is the last resort to ensure that we don't silently keep the training job running for hours. We didn't enable this feature by default at first, since we wanted to perform more due diligence and have some customers try it out. So far, we haven't seen any obstacle that blocks turning on this feature, and we have received positive feedback from users. We have now decided to turn it on by default in this PR. If this feature turns out not to work as expected and disturbs one's training process, one can set `TORCH_NCCL_ENABLE_MONITORING=0` to disable it. Please kindly file an issue with us so that we can see if we missed any corner cases during the design. Differential Revision: [D52045911](https://our.internmc.facebook.com/intern/diff/D52045911) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115577 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2023-12-12 00:45:54 +00:00
fduwjj	8c1567d021	[c10d] Change watchdog inner loop function name to make it more accurate (#115404 ) Function `workCleanupLoop` does not affect all things we did in watchdog thread, so proposing a new name here to reflect what we are actually doing in the watchdog thread. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115404 Approved by: https://github.com/kwen2501, https://github.com/wconstab	2023-12-11 22:00:06 +00:00
fduwjj	03ff44c958	[c10d] Fix Store check condition in NCCL PG watchdog (#115475 ) In https://github.com/pytorch/pytorch/pull/115449/, after turning on `DUMP_ON_TIMEOUT=1`, some existing tests failed. Upon checking, the failure is caused by the TCPStore check call within the watchdog thread. 1. It's not because TCPStore creation has not completed, because if I put it to sleep for a long time, the test still fails. Rather, it's because we query TCPStore after we shut down the PG. 2. The reason for that is: The `std::chrono::steady_clock::now()` function in C++ returns a `time_point` object representing the current point in time according to the steady clock. The default unit of this time_point is not directly specified in terms of seconds or nanoseconds; rather, it is dependent on the internal representation of the steady clock, which can vary between implementations. In reality it's actually nanoseconds, which makes the delta so big that we check the store every time the watchdog thread wakes up. To make things even worse, `terminateProcessGroup_` might be turned to `True` before the next check for the outermost while but before the TCPStore check, so the watchdog gets stuck because we are checking a TCPStore which is already deleted. And the main thread is still waiting for the watchdog to join. The solution here is: 1. Add back `std::chrono::duration_cast` to ensure the delta is indeed in milliseconds, so that the timeout check logic works as expected. 2. Check `terminateProcessGroup_` as well so that we don't do any dump when the main thread has already marked the process as exited. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115475 Approved by: https://github.com/wconstab	2023-12-11 21:06:05 +00:00
fduwjj	5f41fc7619	[c10d] Change NCCL PG watchdog error msg and test comments (#115403 ) Address the nit comments in https://github.com/pytorch/pytorch/pull/115226/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/115403 Approved by: https://github.com/wconstab ghstack dependencies: #115226	2023-12-11 17:55:28 +00:00
Deepak Seshadri	1c1f2bbe8a	Add a space in the error message (#115465 ) Summary: As title says Created from CodeHub with https://fburl.com/edit-in-codehub Test Plan: waitforsandcastle Sandcastle run Reviewed By: eeggl Differential Revision: D52000286 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115465 Approved by: https://github.com/kwen2501	2023-12-09 04:35:51 +00:00
Will Constable	317486edb0	[C10D] Decouple flight recorder from enableTiming (#115358 ) RE #115301 Decoupling gives us a path to disable timing without disabling the flight recorder. Flight recorder is still useful for stuckness analysis without 'timing'. Disabling timing makes it miss the 'started' state that comes from using an extra nccl event at the start of each collective. It will also be missing 'duration_ms' of collectives, which hasn't been landed yet, but is useful for timing/perf work more than stuckness analysis. Hopefully we can enable timing by default and leave both on, but it's nice to have the flexibility for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115358 Approved by: https://github.com/fduwjj	2023-12-08 19:44:45 +00:00
fduwjj	4d70802133	[c10d] Use TCPStore to record NCCL timeout and dump debug info (#115226 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115226 Approved by: https://github.com/wconstab	2023-12-08 06:19:40 +00:00
Will Constable	784e20e3d7	[C10D] Make dumpPipe use async launcher (#115375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115375 Approved by: https://github.com/fduwjj ghstack dependencies: #115332	2023-12-08 00:16:22 +00:00
Will Constable	7562b45454	Reland "[C10D] Use future for flight recorder dump (#115176 )" (#115332 ) Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec for the future to complete then abort". The difference in this case is the abort happens as soon as the dump finishes up to a maximum, instead of always waiting the maximum. Allows multiple calls to dump, which will be serialized. Renames tryWriteDebugInfo to launchAsyncDebugDump in spirit of the change to support more than one launch and to always launch rather than only launching on the first call. Adds a test for dumping on timeout. This reverts commit `ac7d14baad`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332 Approved by: https://github.com/fduwjj	2023-12-07 21:20:58 +00:00
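The change from "always sleep 30 sec before abort" to "wait up to 30 sec for the future to complete then abort", described in the entry above for #115332, amounts to a bounded wait on a future. The following Python sketch shows that pattern with `concurrent.futures`; it is an illustration under assumptions, not the C++ code. The function name `wait_for_dump` and its string return values are hypothetical, and logging the exception rather than raising it follows the behaviour described for #115876.

```python
import concurrent.futures


def wait_for_dump(future: concurrent.futures.Future, timeout_s: float) -> str:
    """Wait at most timeout_s for a dump future; return how the wait ended."""
    try:
        # Returns as soon as the dump completes, not after the full timeout.
        future.result(timeout=timeout_s)
        return "finished"
    except concurrent.futures.TimeoutError:
        # The dump is abandoned; the caller proceeds with abort or shutdown.
        return "timed_out"
    except Exception as exc:
        # Log and continue rather than raising from the waiting thread.
        print(f"debug dump failed: {exc!r}")
        return "failed"
```

A caller would submit the dump to an executor, pass the returned future and a 30-second timeout to this helper, and then abort regardless of which value comes back.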
Howard Huang	3e66385ddd	Add Work to distributed docs (#115172 ) Summary: Documenting the `Work` object. For a collective (broadcast, all_reduce, etc.) when async_op=True we return a `Work` object on which users can call `.wait()`, `.is_success()`, among other things, but this class is not documented. Test Plan: Preview the docs build in OSS Differential Revision: D51854974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172 Approved by: https://github.com/wconstab	2023-12-07 18:12:10 +00:00
Chip Turner	78b945484b	[c10d] Extend NCCL communicator splitting to more use cases (#114916 ) Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world. This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank. This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number). And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916 Approved by: https://github.com/kwen2501	2023-12-07 15:13:01 +00:00
PyTorch MergeBot	ac7d14baad	Revert "[C10D] Use future for flight recorder dump (#115176 )" This reverts commit `0e07e3dbe4`. Reverted https://github.com/pytorch/pytorch/pull/115176 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test_timeout_dumps is failing in trunk `0e07e3dbe4` ([comment](https://github.com/pytorch/pytorch/pull/115176#issuecomment-1844076455))	2023-12-07 02:09:58 +00:00
Will Constable	0e07e3dbe4	[C10D] Use future for flight recorder dump (#115176 ) Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec for the future to complete then abort". The difference in this case is the abort happens as soon as the dump finishes up to a maximum, instead of always waiting the maximum. Allows multiple calls to dump, which will be serialized. Renames `tryWriteDebugInfo` to `launchAsyncDebugDump` in spirit of the change to support more than one launch and to always launch rather than only launching on the first call. Adds a test for dumping on timeout. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115176 Approved by: https://github.com/zdevito	2023-12-06 23:42:19 +00:00
fduwjj	2bff36bb0e	[c10d] Change set timeout API name to _set_default_timeout (#115197 ) Somehow the feedback does not show up, this PR is to address the comment in https://github.com/pytorch/pytorch/pull/115141. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115197 Approved by: https://github.com/XilunWu, https://github.com/wconstab	2023-12-06 03:38:39 +00:00
zdevito	259a99669d	[NCCL flight recorder] Dump when writing to pipe (#115139 ) If TORCH_NCCL_DUMP_ON_TIMEOUT is set, then along with producing a dump file when a timeout happens, you can trigger a dump by writing to local pipe `<TORCH_NCCL_DEBUG_INFO_TEMP_FILE>_<rank>.pipe` (by default /tmp/nccl_trace_{rank}_<rank>.pipe). Pull Request resolved: https://github.com/pytorch/pytorch/pull/115139 Approved by: https://github.com/wconstab	2023-12-05 20:44:23 +00:00
fduwjj	a8bd593252	[c10d] Add _reset_nccl_collective_timeout so users can change timeout of a NCCL PG (#115141 ) There are some use cases when users want to change the timeout for a NCCL process group in the middle of training. This PR enables it by adding a pybind api. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115141 Approved by: https://github.com/wconstab	2023-12-05 19:55:28 +00:00
Ke Wen	c9853ccadc	Relax tensor contiguity requirement for P2P ops (#114982 ) I hit the following error when performing pipeline parallel for T5: ``` return default_pg.send([tensor], dst, tag) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ValueError: Tensors must be contiguous ``` In theory, we shouldn't require the tensors to be contiguous, especially for P2P ops, because we are just doing bit-wise "copy". Thus, this PR relaxes the requirement and instead calls out that it would be user responsibility to guarantee the source and destination tensors have the same contiguity setting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114982 Approved by: https://github.com/H-Huang	2023-12-05 18:25:42 +00:00
Pavan Balaji	94faba5224	[nccl-pg] Revert accidental renaming of env variables (#115082 ) Summary: In [`9cc040fef6`], we accidentally changed some of the environment variable names to the non-deprecated form. The intent was to support both the deprecated and the new form of the env variables (with a warning thrown for the deprecated form). Test Plan: OSS CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/115082 Approved by: https://github.com/zdevito	2023-12-05 14:52:30 +00:00
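The stated intent in the entry above for #115082, supporting both the new and the deprecated form of an environment variable with a warning for the deprecated one, can be sketched as a lookup over an ordered list of names. This Python version is only an illustration of that idea; the helper `get_env_int` is hypothetical and is not the `getCvarInt` function from `Utils.hpp`, whose exact signature is not shown in this log.

```python
import os
import warnings


def get_env_int(names: list, default: int) -> int:
    """Read an integer setting; names[0] is preferred, the rest are deprecated."""
    for index, name in enumerate(names):
        value = os.environ.get(name)
        if value is None:
            continue
        if index > 0:
            # A deprecated alias was used: still honor it, but warn.
            warnings.warn(
                f"Environment variable {name} is deprecated; "
                f"use {names[0]} instead",
                DeprecationWarning,
            )
        return int(value)
    return default
```

Calling `get_env_int(["TORCH_NCCL_BLOCKING_WAIT", "NCCL_BLOCKING_WAIT"], 0)` would, under this sketch, prefer the `TORCH_`-prefixed name and warn only when the value comes from the old one.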
PyTorch MergeBot	f101426790	Revert "Move class definition of DebugInfoWriter to TraceUtil as well (#114901 )" This reverts commit `fb325bbd46`. Reverted https://github.com/pytorch/pytorch/pull/114901 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114901#issuecomment-1838815178))	2023-12-04 14:55:39 +00:00
Will Constable	8a51845b38	[C10D] Add filename to dump finished log (#114957 ) Just shows you where to look.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114957 Approved by: https://github.com/fduwjj	2023-12-01 20:38:02 +00:00
Chip Turner	9cc040fef6	Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880 ) Previously: ``` [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt) [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt) ``` With this PR, those warnings disappear. They were introduced in #114077. This change was generated with this sed script, applied with `sed -i -f /tmp/x */.{py,hpp,cpp,cc}` and hand inspected. ``` s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880 Approved by: https://github.com/kwen2501	2023-12-01 20:08:23 +00:00
fduwjj	25b83521be	[c10d] Log NCCL trace buffer size (#114926 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114926 Approved by: https://github.com/zdevito ghstack dependencies: #114901	2023-12-01 08:06:10 +00:00
Pavan Balaji	aa390cec21	[profiler] Fix description to use nelems rather than size (#114735 ) We were storing the number of elements in the tensor, rather than the actual bytes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114735 Approved by: https://github.com/aaronenyeshi, https://github.com/yoyoyocmu, https://github.com/kwen2501, https://github.com/fduwjj	2023-12-01 06:21:47 +00:00
fduwjj	fb325bbd46	Move class definition of DebugInfoWriter to TraceUtil as well (#114901 ) Since we moved the implementation of the class to TraceUtils in https://github.com/pytorch/pytorch/pull/114367, maybe we also want to move the implementation here as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114901 Approved by: https://github.com/XilunWu	2023-12-01 03:28:16 +00:00
Shengbao Zheng	1d95644740	[Execution Trace] record root rank for broadcast/gather/reduce/scatter (#113828 ) Summary: Collectives like broadcast/gather/reduce/scatter need root rank info in order to be replayed in PARAM benchmarks. Log root rank instead of local rank in RECORD_PARAM_COMMS_DATA. Reference: distributed/c10d/Types.hpp Test Plan: Tested in HPC Differential Revision: D51381196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113828 Approved by: https://github.com/fduwjj	2023-12-01 01:28:49 +00:00
Will Constable	92cd78b1df	[C10D] logging/comment clean ups (#114625 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114625 Approved by: https://github.com/fduwjj, https://github.com/XilunWu ghstack dependencies: #114810	2023-11-30 07:46:32 +00:00

1601 Commits