RE #115301
Decoupling gives us a path to disable timing without disabling the
flight recorder.
The flight recorder is still useful for stuckness analysis without 'timing'.
Disabling timing makes it miss the 'started' state, which comes from using an
extra NCCL event at the start of each collective. It will also be missing the
'duration_ms' of collectives, which hasn't landed yet and is more useful for
timing/perf work than for stuckness analysis.
Hopefully we can enable timing by default and leave both on, but it's nice to
have the flexibility for now.
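For illustration, a minimal sketch of the configuration this decoupling enables; the flight-recorder buffer-size variable name used here (TORCH_NCCL_TRACE_BUFFER_SIZE) is an assumption, while TORCH_NCCL_ENABLE_TIMING appears elsewhere in this log:
```python
import os

# Assumed variable names; values are illustrative.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # keep the flight recorder enabled
os.environ["TORCH_NCCL_ENABLE_TIMING"] = "0"         # but skip the extra per-collective timing events
```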
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115358
Approved by: https://github.com/fduwjj
This PR proposes a new approach to solving the problem of nn and optim being linked only by Python object identity.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap `model.weight` with a new Tensor (which can be any subclass of Tensor and use any TensorImpl; xla, sparse, and nested TensorImpls would work). The use within nn will be done in a follow-up.
This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.
The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the usage of slots in subclasses.
This is a draft right now to see what @colesbury thinks about doing this.
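A hypothetical usage sketch (the exact name/namespace of the swap function is an assumption, something like `torch.utils.swap_tensors`):
```python
import torch

t1 = torch.randn(3)
t2 = torch.zeros(3)
ref_to_t1 = t1  # a pre-existing reference to t1

# Assumed entry point; swaps the contents of the two tensors in place.
torch.utils.swap_tensors(t1, t2)

# All old references are preserved: ref_to_t1 is still the same Python object
# as t1, but now holds the content that used to live in t2, and vice versa.
```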
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
Summary:
GraphFunction internally stores the optimized graph after generating it; the graph is then passed into the executor, which makes a copy of it. So we effectively store the optimized graph twice.
This diff adds a flag to not store the optimized graph inside the GraphFunction.
The code is a no-op until the flag is enabled.
Test Plan:
I ran SL with this on raas, with good memory savings on the raas server. From the command line:
Example model run:
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362
I1207 11:04:58.657143 3556226 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 255646 Kb
```
then with flag enabled:
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true
I1207 11:06:25.245779 3577383 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 165167 Kb
```
Collectively, with this flag and the flag from D51950418:
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true --torch_jit_enable_profiling_graph_executor=false
I1207 11:09:17.502743 3592345 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 114848 Kb
```
Differential Revision: D51931895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115381
Approved by: https://github.com/malfet
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.
Allows multiple calls to dump, which will be serialized.
Renames tryWriteDebugInfo to launchAsyncDebugDump in the spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.
Adds a test for dumping on timeout.
This reverts commit ac7d14baad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332
Approved by: https://github.com/fduwjj
Summary:
Documenting the `Work` object
For a collective (broadcast, all_reduce, etc.), when async_op=True we return a `Work` object on which users can call `.wait()`, `.is_success()`, and other methods, but this class is not documented.
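A minimal sketch of the handle being documented, assuming an already-initialized process group (e.g. via torchrun):
```python
import torch
import torch.distributed as dist

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)  # returns a Work object instead of blocking
work.wait()                               # block until the collective completes
assert work.is_completed()
```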
Test Plan: Preview the docs build in OSS
Differential Revision: D51854974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172
Approved by: https://github.com/wconstab
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.
This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank.
This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number). And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed.
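A sketch of the intended call site; I believe the keyword landed as `device_id`, but treat the exact name as an assumption:
```python
import torch
import torch.distributed as dist

local_rank = 0  # typically int(os.environ["LOCAL_RANK"]) under torchrun

# Binding the device up front lets the backend connect eagerly.
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))

# Subgroups created later can then use ncclCommSplit instead of a full re-init.
subgroup = dist.new_group(ranks=[0, 1])
```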
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
In certain edge cases when using lazy tensors, the base tensor stored in the `FunctionalStorageImpl` and the `value_` tensor stored in the `FunctionalTensorWrapper` diverge. For instance, take this simple example:
```python
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 2, bias=False)

    def forward(self, x):
        return x @ self.fc1.weight.transpose(0, 1)

with torch.device("lazy"):
    model = Model()
    x = torch.ones(4)
    out = model(x)
```
The call to `transpose` on the lazily initialized weight `fc1.weight` applies a view op on the functional tensor, which only gets propagated to the functional tensor wrapper and not to the base tensor in the storage, causing them to diverge.
To fix this behaviour, we need to reset the functional tensor's storage. To facilitate this, we add a `reset_storage` method to `FunctionalTensorWrapper` which clears away the old storage and view metas.
CC: @behzad-a @GlebKazantaev @wconstab @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115235
Approved by: https://github.com/bdhirsh
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.
Allows multiple calls to dump, which will be serialized.
Renames `tryWriteDebugInfo` to `launchAsyncDebugDump` in the spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.
Adds a test for dumping on timeout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115176
Approved by: https://github.com/zdevito
Summary:
This adds functions to the model container for doing weight swapping with double buffering.
There are two parts to double buffering:
a) Write constants into the inactive buffer
b) Swap the active buffer
For (a), we write the constants into the buffer that's currently not in use, and store the information in both the constants map and the corresponding constant array to read from.
For (b), we obtain the lock, activate the constant map/constant array that is inactive, and flag the one that's currently in use as inactive.
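A Python sketch of the scheme (the real implementation is C++ inside the AOTInductor model container; the names here are illustrative):
```python
import threading

class DoubleBufferedConstants:
    def __init__(self):
        self._buffers = [{}, {}]  # two constant maps
        self._active = 0          # index of the buffer currently used for inference
        self._lock = threading.Lock()

    def write_inactive(self, name, tensor):
        # (a) write new constants into the buffer that is not in use
        self._buffers[1 - self._active][name] = tensor

    def swap_active(self):
        # (b) under the lock, activate the freshly written buffer and
        # mark the previously active one as inactive
        with self._lock:
            self._active = 1 - self._active

    def read(self, name):
        with self._lock:
            return self._buffers[self._active][name]
```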
Test Plan:
test/cpp/aot_inductor/test.cpp
Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446
Approved by: https://github.com/chenyang78, https://github.com/eellison
If TORCH_NCCL_DUMP_ON_TIMEOUT is set, then along with producing a dump
file when a timeout happens, you can trigger a dump by writing to the local pipe
`<TORCH_NCCL_DEBUG_INFO_TEMP_FILE>_<rank>.pipe` (by default
/tmp/nccl_trace_rank_<rank>.pipe).
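For example, an on-demand dump could be triggered from another process on the same host roughly like this (illustrative; path shown for rank 0 with the default prefix above):
```python
import os

fd = os.open("/tmp/nccl_trace_rank_0.pipe", os.O_WRONLY | os.O_NONBLOCK)
os.write(fd, b"1")  # any write wakes the watcher and triggers a dump
os.close(fd)
```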
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115139
Approved by: https://github.com/wconstab
I hit the following error when running pipeline parallelism for T5:
```
return default_pg.send([tensor], dst, tag)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Tensors must be contiguous
```
In theory, we shouldn't require the tensors to be contiguous, especially for P2P ops, because we are just doing a bit-wise "copy".
Thus, this PR relaxes the requirement and instead calls out that it is the user's responsibility to guarantee that the source and destination tensors have the same contiguity setting.
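A sketch of the relaxed behavior under that contract, assuming an initialized 2-rank group (matching contiguity on both sides is the user's responsibility):
```python
import torch
import torch.distributed as dist

x = torch.arange(16.0).reshape(4, 4).t()  # non-contiguous (transposed) view
if dist.get_rank() == 0:
    dist.send(x, dst=1)          # previously raised "Tensors must be contiguous"
else:
    out = torch.empty(4, 4).t()  # same non-contiguous layout as the sender
    dist.recv(out, src=0)
```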
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114982
Approved by: https://github.com/H-Huang
Summary:
In [9cc040fef6], we accidentally changed some of the environment variable names to the non-deprecated form. The intent was to support both the deprecated and the new form of the env variables (with a warning thrown for the deprecated form).
Test Plan:
OSS CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115082
Approved by: https://github.com/zdevito
Previously:
```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```
With this PR, those warnings disappear. They were introduced in #114077.
This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.
```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
Summary:
Collectives like broadcast/gather/reduce/scatter need root rank info in order to be replayed in PARAM benchmarks. Log the root rank instead of the local rank in RECORD_PARAM_COMMS_DATA.
Reference: distributed/c10d/Types.hpp
Test Plan: Tested in HPC
Differential Revision: D51381196
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113828
Approved by: https://github.com/fduwjj
time_created_us is the cpu-side epoch_time (in usec) when a flight-recorder
event was created. It loosely corresponds to the time the c10d collective
API was called and a work object was created. It does NOT correspond to
the time the collective started on the GPU.
We follow the precedent of microsecond epoch time from this PR, which added
timestamps to the CUDA caching allocator:
https://github.com/pytorch/pytorch/pull/112266
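For reference, an epoch timestamp in microseconds looks like this (illustrative only; the real value is captured on the C++ side when the work object is created):
```python
import time

time_created_us = time.time_ns() // 1_000
```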
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114810
Approved by: https://github.com/zdevito
Because isCompleted() returns true on an exception, a timeout exception
will cause the flight recorder to consider the event completed even though it timed out.
This changes the logic to explicitly query the completion events on "retirement"
when the work item leaves the workMetaList. We mark events as retired so
we can distinguish between an event still in the queue but not completed and one
that timed out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114804
Approved by: https://github.com/wconstab
Summary:
`Layernorm` has two arguments, weight and bias, which are stored as constant tensors on the CPU and transferred to the GPU at every inference call. We create a context for this op to avoid this repeated transfer. Specifically, we
- created `create_layernorm_context` and `run_layernorm_context` in `Layernorm.h` and `Layernorm.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`
Test Plan:
## Numerical test
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (b6ccc956c)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm*"
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.
If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa
Targets matching .buckconfig buck2.supported_projects:
{'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}
To suppress this warning: touch ~/.config/.dont_hint_buck2
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN ] VulkanAPITest.packed_layer_norm_2d
[ OK ] VulkanAPITest.packed_layer_norm_2d (342 ms)
[ RUN ] VulkanAPITest.packed_layer_norm_3d
[ OK ] VulkanAPITest.packed_layer_norm_3d (284 ms)
[ RUN ] VulkanAPITest.packed_layer_norm_4d
[ OK ] VulkanAPITest.packed_layer_norm_4d (5 ms)
[ RUN ] VulkanAPITest.layer_norm_invalid_inputs
[ OK ] VulkanAPITest.layer_norm_invalid_inputs (28 ms)
[ RUN ] VulkanAPITest.layer_norm_2d
[ OK ] VulkanAPITest.layer_norm_2d (1 ms)
[ RUN ] VulkanAPITest.layer_norm_3d
[ OK ] VulkanAPITest.layer_norm_3d (2 ms)
[ RUN ] VulkanAPITest.layer_norm_4d
[ OK ] VulkanAPITest.layer_norm_4d (4 ms)
[ RUN ] VulkanAPITest.native_layer_norm_2d
[ OK ] VulkanAPITest.native_layer_norm_2d (1 ms)
[ RUN ] VulkanAPITest.native_layer_norm_3d
[ OK ] VulkanAPITest.native_layer_norm_3d (2 ms)
[ RUN ] VulkanAPITest.native_layer_norm_4d
[ OK ] VulkanAPITest.native_layer_norm_4d (6 ms)
[----------] 10 tests from VulkanAPITest (679 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (679 ms total)
[ PASSED ] 10 tests.
```
Full test result in P888496077, summary as below
```
[----------] 419 tests from VulkanAPITest (21652 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (21652 ms total)
[ PASSED ] 418 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `layer_norm` and traced it as below
```
import torch

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layer_norm = torch.nn.LayerNorm(normalized_shape=10)

    def forward(self, x):
        return self.layer_norm(x)

# Create an instance of the model
model = MyModel()

# Create a dummy input tensor for tracing
input_tensor = torch.randn(1, 10)

# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to the Vulkan backend using `optimize_for_mobile`:
```
from torch.utils import mobile_optimizer
vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Then we can print the graph of the `vulkan_model` with `print(vulkan_model.graph)`.
- Before this diff
```
%4 : bool = prim::Constant[value=1](), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
%5 : float = prim::Constant[value=1.0000000000000001e-05](), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
%14 : int[] = prim::Constant[value=[10]]()
%33 : Tensor = aten::to(%x, %53, %30, %31, %31)
%10 : Tensor = aten::layer_norm(%33, %14, %self.layer_norm.weight, %self.layer_norm.bias, %5, %4), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
```
- after this diff
```
%14 : int[] = prim::Constant[value=[10]]()
%47 : Tensor = aten::to(%x, %78, %44, %45, %45)
%16 : Tensor = vulkan_prepack::run_layernorm_context(%47, %14, %17)
```
Reviewed By: SS-JIA
Differential Revision: D51530478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114701
Approved by: https://github.com/yipjustin
Quick recap of events:
(1) https://github.com/pytorch/pytorch/pull/111347, which fixed a perf regression in 2.1 compared to 2.0, introduced a correctness problem around input mutations on inputs that require grad and show up in an inference-only graph (the specific case where this can happen is rare, and nobody reported the issue, but it was fixed a few weeks later)
(2) That fix happened here: https://github.com/pytorch/pytorch/pull/113584, which makes sure to keep input mutations outside of the graph, so the autograd engine can set metadata properly on them
(3) That in turn caused a slight regression compared to (1), which is what this PR attempts to fix. In particular, code like the below is safe to keep the mutations in the graph for:
```
@torch.compile
def f(x):
    x.mul_(2)

x = torch.ones(2, requires_grad=True).clone()
# x requires_grad, so the input mutation will change some autograd metadata, like the version counter
# However, the mutation is under no_grad, so we don't have to worry about e.g. aliases of x having their .grad_fn fields changed
with torch.no_grad():
    f(x)
```
This particular case is pretty important to the shampoo optimizer code, which is run under `torch.compile`, and mutates parameters (which require grad).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114646
Approved by: https://github.com/zou3519
Adds a TORCH_NCCL_DUMP_DEBUG_INFO env var to control dumping independently
of the desync debug feature.
It currently defaults to disabled (so there is no behavior change by default),
but the plan is to default it to true after validation.
Moves the 'sleep for 30 sec' that used to be after desync debug to before
it. In my view, sleeping before desync debug is equivalent since we always
sleep the same duration, and it keeps the code simpler this way.
Fixes #114433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:
(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)
(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.
(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.
(4) Made AOTAutograd's detection of metadata mutations / set_() mutations smarter so it detects no-ops (if the storage and metadata are all the same).
I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_data_mutation()`, which can more accurately distinguish between data mutations vs. `set_()` calls vs. metadata mutations.
**This PR is still silently correct in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```
If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:
(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input
(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```
def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output

def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```
There are two things that make this difficult to do though:
(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.
(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.
It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
C++ stacktrace processing (the symbolizer) takes a long time on some systems
using a particular version of addr2line. On slow systems, this makes
flight-recorder dumping slow enough to time out on even toy programs.
TORCH_NCCL_TRACE_CPP_STACK=True will re-enable CPP stacktrace collection
as part of the flight recorder.
C++ stacktrace collection is fast enough for use on certain OS configurations. We
can investigate moving to LLVM's symbolizer as a replacement.
On devserver with C++ stacktraces disabled/enabled:
```
python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 12.175s
TORCH_NCCL_TRACE_CPP_STACK=1 python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 53.338s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114651
Approved by: https://github.com/zdevito
https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular, if an error occurred during backward, DDP would end up in an incorrect state.
As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`
Fixes the cause of the revert of the original PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
Summary: When a weight tensor is 0-size, no device memory should be allocated for it. This PR fixes the weight loading logic for such a case. This problem was found when running the 14K model test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114280
Approved by: https://github.com/chenyang78
I'm not sure why we needed two overloads previously; let's find out! Removing the int overload is load bearing because it now forces specialization on SymInt arguments instead of falling through to the SymInt overload; see the new test.
I decided NOT to allow storage offset simultaneously with None strides.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114236
Approved by: https://github.com/albanD
Summary:
The FORMAT opcode for the mobile TorchScript interpreter contained an out of bounds read issue leading to memory corruption.
This change adds an explicit check that the number of inputs passed to the format method called when handling the FORMAT opcode is valid and within bounds of the stack.
Test Plan: contbuild + OSS signals
Differential Revision: D49739095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110303
Approved by: https://github.com/malfet
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful, as it creates non-shared pairs
of endpoint queues and costs time to re-establish communication.
This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.
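Call-site sketch: nothing changes for users; whether the split path is taken depends on the NCCL version and on the world process group being connected and healthy (assumes a torchrun-style launch):
```python
import torch.distributed as dist

dist.init_process_group("nccl")
evens = dist.new_group(ranks=[0, 2])  # may now be created via a split of the world comm
```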
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
Summary:
In some cases (especially those involving collective calls), we would want to always kick off a collective call first before going down another path.
For example:
```
tbe lookup -> a2a ->
                     overarch
dense -------------->
```
if the forward code is written as
a2a_out = a2a
dense = dense_net
out = overarch(a2a_out, dense)
out.backward()
The current default is to run backward in the opposite order of the forward calls. However, there is no data dependency between a2a and dense, so in reality either of them could be run first. We would like the a2a to run first because it provides optimal (on average) overlap.
Changing the seq_nr of a2a_out to something large enough would allow the autograd engine to kick it off first.
Test Plan: Tests incoming
Differential Revision: D51445261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114120
Approved by: https://github.com/ezyang, https://github.com/albanD
The NCCL_ prefix should only be used for the NCCL library's environment variables. We currently use a few environment variables in PyTorch with the NCCL_ prefix that the NCCL library does not understand.
This patch renames such environment variables to use the TORCH_NCCL_ prefix instead. We still maintain the old NCCL_ variables, but throw a warning when they are used.
The following env changes have been made:
`NCCL_BLOCKING_WAIT` -> `TORCH_NCCL_BLOCKING_WAIT`
`NCCL_ENABLE_TIMING` -> `TORCH_NCCL_ENABLE_TIMING`
`NCCL_DESYNC_DEBUG` -> `TORCH_NCCL_DESYNC_DEBUG`
`NCCL_ASYNC_ERROR_HANDLING` -> `TORCH_NCCL_ASYNC_ERROR_HANDLING`
`ENABLE_NCCL_HEALTH_CHECK` -> `TORCH_ENABLE_NCCL_HEALTH_CHECK`
`NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK` -> `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114077
Approved by: https://github.com/fduwjj
Document torch._C._jit_set_fusion_strategy, which can control how many static-shape compilation attempts are made before falling back to dynamic shapes, and then to uncompiled graph execution.
Would be good to keep all the graph executor settings documented in one place.
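For example (to my knowledge `torch.jit.set_fusion_strategy` is the public wrapper around this; the numbers are illustrative):
```python
import torch

# Allow up to 2 static-shape specializations, then up to 10 dynamic-shape ones,
# before falling back to running the graph without compilation.
torch.jit.set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 10)])
```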
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113964
Approved by: https://github.com/eellison
I have been seeing the below warning even though I did not set `TORCH_NCCL_AVOID_RECORD_STREAMS` to 1.
```
Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
```
Turns out that `TORCH_WARN_ONCE` is unconditional, so the original code below would print out both the value of `avoidRecordStreams_` and the error message:
```
TORCH_WARN_ONCE(
    avoidRecordStreams_,
    "TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point "
    "collectives.");
```
That's also where the "0" in the message came from.
Cc: @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114168
Approved by: https://github.com/eqy, https://github.com/fduwjj, https://github.com/H-Huang
Summary: Scatter fallback calls `at::scatter` in the C++ wrapper codegen. This doesn't work in the ABI compatibility mode, as the latter requires a shim function. One is added in this PR.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_scatter_fallback
s...
----------------------------------------------------------------------
Ran 4 tests in 52.713s
OK (skipped=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114027
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #114024
Summary:
The getCvar* functions allow us to provide multiple environment variables for the same value. This lets us deprecate some variables in favor of others, while still allowing users to use the old variables for some time.
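A Python analogue of the behavior (the real getCvar* helpers are C++ in c10d; the precedence convention and the default below are illustrative):
```python
import os
import warnings

def get_cvar_int(env_vars, default):
    # env_vars lists the preferred name first, then deprecated aliases.
    for i, name in enumerate(env_vars):
        if name in os.environ:
            if i > 0:
                warnings.warn(
                    f"Environment variable {name} is deprecated; use {env_vars[0]} instead"
                )
            return int(os.environ[name])
    return default

value = get_cvar_int(
    ["TORCH_NCCL_ASYNC_ERROR_HANDLING", "NCCL_ASYNC_ERROR_HANDLING"], 0
)
```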
Test Plan: OSS CI
Reviewed By: fduwjj, XilunWu
Differential Revision: D51225487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113797
Approved by: https://github.com/fduwjj
This PR also contains a basket of fixes that were turned up by now testing more arguments with SymInt. I fixed as many of the easy ones as I could easily get earlier in this stack and a bunch here, but there are some more annoying ones I xfailed.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113452
Approved by: https://github.com/Chillee
ghstack dependencies: #113877, #113911
Support for bfloat16 scalars was missing. When I use the gloo backend
`torch.distributed.init_process_group(backend='gloo')`
and run
`torch.nn.parallel.DistributedDataParallel(model)`
and _model_ has bfloat16 features, I receive the following error:
`RuntimeError: Invalid scalar type`
This change fixes the issue.
c10::BFloat16 defines conversions from/to float, so calculations for bfloat16 are performed in float.
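Minimal shape of the scenario, assuming a torchrun-style multi-process launch:
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")
model = torch.nn.Linear(8, 8).to(torch.bfloat16)
# Before this change, gradient allreduce over bfloat16 parameters failed with
# "RuntimeError: Invalid scalar type"; gloo now computes in float for bfloat16.
ddp_model = DDP(model)
```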
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113557
Approved by: https://github.com/XilunWu, https://github.com/jgong5
The existing try-catch doesn't work because it doesn't call err.persist(). This is in contrast to the try-catch for evaluate_function, which does work because it calls into python_engine's thread_on_exception, which calls persist.
Calling persist on a python_error stashes the PyErr state from the thread-local PyThreadState onto the python_error object, so that when this error object is stored onto the future and passed back to the calling cpu thread, python_engine's execute try-catch can then err.restore() the error state. Finally, the python_engine's execute would re-raise so that this is re-caught by the HANDLE_TH_ERRORS macro.
Fixes https://github.com/pytorch/pytorch/issues/75750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113702
Approved by: https://github.com/albanD