In this PR, we are implementing functionalization on the pre-dispatch graph. Today, every dispatch key except for DispatchKey.Python has a dedicated mode stack in Python. PreDispatch tracing relies on this behavior by pushing ProxyTorchDispatchMode onto the DispatchKey.PreDispatch mode stack and handling the dispatching logic in Python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode onto the DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode (this is very similar to how post-dispatch tracing works). Here are some design decisions we made for this flow to work:
1. FunctionalTensorMode internally calls into the C++ Functionalize key. Since C++ functionalization runs after PreDispatch, if we are not careful we will keep re-entering the PreDispatch key. We solve this by dispatching directly to the C++ Functionalize key.
2. We delete the mode_stack_per_key logic because the only realistic place it is exercised is PreDispatch, and a plain list is not safe in general: the ordering of FunctionalTensorMode and ProxyTorchDispatchMode matters and is hard to enforce on a plain list. Instead, we now have a private class that tracks the PreDispatch mode stack (see the sketch after this list).
3. We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.
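A minimal Python sketch of decision (2), with hypothetical names (the actual private class in PyTorch may differ): dedicated slots fix the FunctionalTensorMode-before-ProxyTorchDispatchMode ordering by construction instead of relying on insertion order in a plain list.
```python
# Hypothetical sketch (not the actual PyTorch class) of a PreDispatch mode
# stack that enforces ordering via dedicated slots instead of a plain list.
class _PreDispatchModeStackSketch:
    def __init__(self):
        self._functional_mode = None  # FunctionalTensorMode slot
        self._proxy_mode = None       # ProxyTorchDispatchMode slot

    def set_functional_mode(self, mode):
        assert self._functional_mode is None, "FunctionalTensorMode already set"
        self._functional_mode = mode

    def set_proxy_mode(self, mode):
        assert self._proxy_mode is None, "ProxyTorchDispatchMode already set"
        self._proxy_mode = mode

    def top(self):
        # FunctionalTensorMode must run before ProxyTorchDispatchMode,
        # mirroring how post-dispatch tracing orders the two modes.
        return self._functional_mode if self._functional_mode is not None else self._proxy_mode
```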
Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
Summary: During inference, the intermediate graphs produced for optimization are not used, so with these two flags the Executor's graph is the only graph we need to keep around.
Test Plan:
The flags are all off by default.
Baseline:
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb
```
Differential Revision: D52081631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
Summary:
Refactor the inactive constant buffer update to also allow updating with the active buffer.
Test Plan:
Existing tests cover inactive buffer updates.
UpdateConstantsCuda in the C++ test covers active buffer updates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
Helps call attention to any cases where the dump actually times out.
The timeout is most likely to be hit if we run into slow stacktrace processing.
Log any exceptions encountered in the background thread, but don't raise
them; we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of dumppipe) or the shutdown
process (when dumping happens on timeout and shutdown is already
initiated).
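A hedged Python sketch of the pattern described above (the actual implementation lives in ProcessGroupNCCL's C++ code; the names here are hypothetical): run the dump in a background thread, warn if it exceeds the timeout, and log but do not raise any exception from the thread.
```python
import logging
import threading

def dump_debug_info():
    ...  # placeholder for the real flight-recorder dump

def dump_with_timeout(timeout_sec=30.0):
    def _worker():
        try:
            dump_debug_info()
        except Exception:
            # Don't raise: we're willing to abandon the dump and continue with
            # normal execution or with the shutdown that is already underway.
            logging.exception("Debug dump failed in background thread")

    t = threading.Thread(target=_worker, daemon=True)
    t.start()
    t.join(timeout_sec)
    if t.is_alive():
        # Call attention to the case where the dump actually times out,
        # e.g. because stacktrace processing is slow.
        logging.warning("Debug dump timed out after %.0fs; abandoning it", timeout_sec)
```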
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` of the data from other ranks. Then all ranks read the accumulated data from other ranks (effectively a one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.
## Micro Benchmarks
(Micro-benchmark figures omitted.)
## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.
`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks (see the usage sketch after this list).
- If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` on whether to use intra-node allreduce and carries out the communication accordingly.
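A usage sketch under the assumptions above: the only knob taken from this PR is the `ENABLE_INTRA_NODE_COMM` environment variable; everything else is standard `torch.distributed` usage, launched with e.g. `torchrun --nproc-per-node=8`.
```python
import os
import torch
import torch.distributed as dist

# Opt in before the process group is created so ProcessGroupNCCL can set up
# c10d::IntraNodeComm during initialization.
os.environ["ENABLE_INTRA_NODE_COMM"] = "1"

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# A small message (<= 256KB) would be eligible for one-shot allreduce; larger
# messages use two-shot or fall back to NCCL per the thresholds listed below.
t = torch.ones(1024, device="cuda")
dist.all_reduce(t)

dist.destroy_process_group()
```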
We currently detect two types of topologies from the NVLink connection mesh (a selection sketch follows this list):
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
- `msg <= 256KB`: one-shot allreduce.
- `256KB < msg <= 10MB`: two-shot allreduce.
- `msg > 10MB`: instructs the caller to fall back to NCCL.
- Hybrid cube mesh
- `msg <= 256KB`: one-shot allreduce.
- `msg > 256KB`: instructs the caller to fall back to NCCL.
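A minimal sketch of the size-based selection just described, using the thresholds from the list (the constant and function names are made up; the real selection logic lives in C++ inside `c10d::IntraNodeComm`):
```python
ONE_SHOT_MAX_BYTES = 256 * 1024        # 256KB
TWO_SHOT_MAX_BYTES = 10 * 1024 * 1024  # 10MB

def select_allreduce_algo(msg_bytes: int, fully_connected: bool) -> str:
    """Pick an intra-node algo, or signal a fallback to NCCL."""
    if msg_bytes <= ONE_SHOT_MAX_BYTES:
        return "one_shot"
    if fully_connected and msg_bytes <= TWO_SHOT_MAX_BYTES:
        return "two_shot"
    # Hybrid cube mesh only supports one-shot; larger messages go to NCCL.
    return "nccl_fallback"
```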
## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
- FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
- PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.
The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file. Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).
Because the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job. Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.
(Note: dumps triggered by different PGs have the same global contents
anyway; there is only one global flight recorder, so it doesn't matter
who triggers it.)
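A Python analogue of the fix (the real member is a static C++ mutex; this sketch just illustrates why a per-instance lock fails to serialize dumps while a class-level lock shared by all instances does):
```python
import threading

class FlightRecorderDumperSketch:
    # Class-level lock: shared by every instance, like a static C++ member, so
    # dumps triggered from any PG instance's watchdog are serialized.
    _dump_lock = threading.Lock()

    def dump(self, path: str) -> None:
        with FlightRecorderDumperSketch._dump_lock:
            # An observer now sees an in-progress file, a complete file, or no
            # file -- never a file interleaved by multiple writers.
            with open(path, "w") as f:
                f.write("flight recorder contents placeholder\n")
```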
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
Adds a PG {process group uid} prefix component to logs.
This is helpful in situations where there are multiple process groups,
and rank information by itself is confusing. (For example, rank0 on PG1
may correspond to rank3 on PG0. People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-PG. Prefixing the PG
helps clarify this.)
Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doesn't know what PG ID it corresponds to. Will address these logs
separately.
Example:
```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
Puts the repeated code that string-formats the [Rank {rank}] prefix in one place.
Sets up for the next PR that adds more info to this prefix.
(Does not change exception messages, which could be done as well;
exception messages are not formatted quite the same way. This PR tries
to avoid changing log behavior and only refactors code.)
Did limited testing (some logs were observed to be OK).
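A tiny Python sketch of the refactor's shape (the real helper is C++ inside ProcessGroupNCCL and the names here are made up): one function owns the prefix format so every log site shares it.
```python
def log_prefix(rank: int) -> str:
    # Single place that owns the "[Rank {rank}]" formatting, so the next PR can
    # add more fields (e.g. the PG id) without touching every log call site.
    return f"[Rank {rank}] "

def log_info(rank: int, msg: str) -> None:
    print(log_prefix(rank) + msg)

log_info(0, "ProcessGroupNCCL initialization ...")
```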
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
The NCCL flight recorder is per-process (it is shared by all
process groups), but individual process groups used to construct their
own pipe through which they could be signaled to dump the flight recorder.
This change ensures that only one pipe per process is created, by creating
the pipe only on the first ProcessGroup (uid_ == 0), which should be the world
group.
Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.
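An illustrative Python sketch of the rule described above (the function name and pipe path are hypothetical; the real implementation is C++): only the uid-0 group creates the pipe, and the filename is keyed off the global rank.
```python
import os
from typing import Optional

def maybe_create_dump_pipe(pg_uid: int, global_rank: int) -> Optional[str]:
    """Create the dump-trigger pipe only for the first PG (the world group)."""
    if pg_uid != 0:
        return None  # sub-PGs do not get their own pipe
    path = f"/tmp/flight_recorder_dump_rank{global_rank}.pipe"  # hypothetical scheme
    if not os.path.exists(path):
        os.mkfifo(path)  # one named pipe per trainer process, unique by global rank
    return path
```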
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
We were only passing a subset of the group creation information to the
NCCL PG; specifically, we were missing the information about which global
ranks belong to a particular PG.
This allows the NCCL PG to use this additional information for things
like better trace logging.
Test Plan:
OSS CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
This implements an optional alternate interface to the AOTI-generated
DSO, intended to increase efficiency for models running on CPU and
requiring minimal overhead. See the comment in config.py for more
explanation.
This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.
Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78
Fixes #50051.
This PR is based on #50320 and addresses the last round of feedback.
On Windows it is enabled by default; it can be enabled or disabled via the USE_CUSTOM_TERMINATE env variable.
This PR adds support for overriding the terminate handler in order to log uncaught exceptions in the threads.
If an exception is thrown and not caught, it will print <Unhandled exception caught in c10/util/AbortHandler.h>
The point of doing this is that in issue #50051, exceptions were thrown but not logged. With this logging system it will be easier to debug such issues in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD, https://github.com/malfet
This changes cached thread_local tensors to stack-allocated buffers. Since we were incidentally caching the output in a thread_local, I had to add manual thread_local caching of outputs, which I implemented by caching a buffer and a Tensor whose storage is that buffer, and then memcpying the result into the cached buffer every time. Ideally, memory planning would be able to identify allocations that are the backing storage for outputs, but this should be good enough in the absence of planning.
Differential Revision: [D50416438](https://our.internmc.facebook.com/intern/diff/D50416438/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112116
Approved by: https://github.com/jansel, https://github.com/desertfire
We added a monitor thread to the NCCL PG in https://github.com/pytorch/pytorch/pull/112518. To summarize what the monitor thread does: it listens to the heartbeat from the watchdog thread and detects unhealthy NCCL watchdog hangs (due to several reasons such as nccl/cuda API bugs or unexpected blocking behaviors). This is the last resort to ensure that we don't silently keep the training job running for hours.
We didn't enable this feature by default at first, since we wanted to perform more due diligence and have some customers try it out. So far, we haven't seen any obstacle that blocks turning on this feature, and we have received positive feedback from users. We have now decided to turn it on by default in this PR.
If this feature turns out not to work as expected and disturb one's training process, one can set `TORCH_NCCL_ENABLE_MONITORING=0` to disable this feature. Please kindly file an issue with us so that we can see if we missed any corner cases during the design.
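A minimal sketch of the opt-out mentioned above (inside a normal torchrun launch; the only assumption is that the environment variable is read when the NCCL process group is created):
```python
import os
import torch.distributed as dist

# Disable the watchdog monitor thread (on by default after this PR)
# before the NCCL process group is initialized.
os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"
dist.init_process_group(backend="nccl")
```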
Differential Revision: [D52045911](https://our.internmc.facebook.com/intern/diff/D52045911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115577
Approved by: https://github.com/wconstab, https://github.com/kwen2501
Prerequisite for adding more complex type support and FFT operations.
The check uses the `conjugateWithTensor:name:` selector, defined as follows:
```objc
/// Returns the complex conjugate of the input tensor elements.
///
/// - Parameters:
/// - tensor: The input tensor.
/// - name: An optional string which serves as an identifier for the operation..
/// - Returns: A valid `MPSGraphTensor` object containing the elementwise result of the applied operation.
-(MPSGraphTensor *) conjugateWithTensor:(MPSGraphTensor *) tensor
name:(NSString * _Nullable) name
MPS_AVAILABLE_STARTING(macos(14.0), ios(17.0), tvos(17.0))
MPS_SWIFT_NAME( conjugate(tensor:name:) );
```
- Rename `isOnMacOS13orNewer(unsigned minor)` hook to `isOnMacOSorNewer(major, minor)`
- Replace `torch._C.__mps_is_on_macos_13_or_newer` with `torch._C._mps_is_on_macos_or_newer`
- Add `torch.backends.mps.is_macos_or_newer` public API (see the usage sketch below)
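A usage sketch of the new public API, assuming it mirrors the `(major, minor)` signature of the renamed C++ hook above:
```python
import torch

if torch.backends.mps.is_available():
    # True on macOS 14 (Sonoma) or newer, where conjugateWithTensor:name: exists.
    print("macOS >= 14.0:", torch.backends.mps.is_macos_or_newer(14, 0))
```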
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115512
Approved by: https://github.com/albanD
In https://github.com/pytorch/pytorch/pull/115449/, after turning on `DUMP_ON_TIMEOUT=1`, some existing tests failed. Upon checking, the failure is caused by the TCPStore check call within the watchdog thread.
1. It's not because TCPStore creation has not completed; even if I make the thread sleep for a long time, the test still fails. Rather, it's because we query the TCPStore after we shut down the PG.
2. The reason is that `std::chrono::steady_clock::now()` in C++ returns a `time_point` whose unit is not directly specified in terms of seconds or nanoseconds; it depends on the internal representation of the steady clock, which can vary between implementations. In reality it is nanoseconds, which makes the delta so big that we check the store every time the watchdog thread wakes up. To make things even worse, `terminateProcessGroup_` might be set to `true` after the outermost while condition was last checked but before the TCPStore check, so the watchdog gets stuck checking a TCPStore which has already been deleted, while the main thread is still waiting for the watchdog to join.
The solution here is:
1. Add back `std::chrono::duration_cast` to ensure the delta is indeed in milliseconds, so that the timeout check logic works as expected.
2. Check `terminateProcessGroup_` as well, so that we don't do any dump when the main thread has already marked the process as exited.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115475
Approved by: https://github.com/wconstab