Commit Graph

13130 Commits

Author SHA1 Message Date
fduwjj
5f41fc7619 [c10d] Change NCCL PG watchdog error msg and test comments (#115403)
Address the nit comments in https://github.com/pytorch/pytorch/pull/115226/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115403
Approved by: https://github.com/wconstab
ghstack dependencies: #115226
2023-12-11 17:55:28 +00:00
Nikita Shulga
8ddc549c0f [BE][JIT] Do not wrap shared_ptr with optional (#115473)
While reviewing https://github.com/pytorch/pytorch/pull/115381 I noticed that `torch::jit::GraphFunction::optimized_graph_` is an `std::array<c10::optional<std::shared_ptr<Graph>>, N>`, which feels excessive as `shared_ptr` is already nullable and has `operator bool()`. Looking at https://github.com/pytorch/pytorch/pull/26488, which introduced the change, also does not hint that this indirection is necessary.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115473
Approved by: https://github.com/davidberard98, https://github.com/Skylion007
2023-12-09 20:43:40 +00:00
Deepak Seshadri
1c1f2bbe8a Add a space in the error message (#115465)
Summary:
As title says

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
waitforsandcastle

Sandcastle run

Reviewed By: eeggl

Differential Revision: D52000286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115465
Approved by: https://github.com/kwen2501
2023-12-09 04:35:51 +00:00
cyy
516bd4a72c [1/N] Use std::in_place (#115170)
It is time to gradually replace c10::in_place with std::in_place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115170
Approved by: https://github.com/colesbury
2023-12-09 03:52:39 +00:00
Will Constable
317486edb0 [C10D] Decouple flight recorder from enableTiming (#115358)
RE #115301

Decoupling gives us a path to disable timing without disabling the
flight recorder.

Flight recorder is still useful for stuckness analysis without 'timing'.

Disabling timing means the recorder misses the 'started' state, which comes
from using an extra NCCL event at the start of each collective. It will also
be missing 'duration_ms' of collectives, which hasn't landed yet but is useful
for timing/perf work more than for stuckness analysis.

Hopefully we can enable timing by default and leave both on, but it's
nice to have the flexibility for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115358
Approved by: https://github.com/fduwjj
2023-12-08 19:44:45 +00:00
Scott Wolchok
494cb28231 [PyTorch] AOTI: add ArrayRefTensor (#112115)
This adds a shim for AOTI generated code to pretend a raw array works like an AtenTensorHandle. This allows parts of AOTI that generate uses of tensors to continue to be unaware of how those tensors are allocated. See the following diff/PR for usage.

Differential Revision: [D50570252](https://our.internmc.facebook.com/intern/diff/D50570252/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112115
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-12-08 19:31:50 +00:00
albanD
a2b89154bf New swap function (#111747)
This PR proposes a new approach to solving the problem of nn/optim being linked only by Python object identity.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.

This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.

The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the usage of slots in subclasses.
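A minimal usage sketch, assuming the function lands as `torch.utils.swap_tensors` (the exact name is not stated in this message):

```python
import torch

t1 = torch.randn(3)
t2 = torch.zeros(5)
alias = t1  # a pre-existing reference to t1

# Hypothetical entry point; name/module are assumptions for illustration.
torch.utils.swap_tensors(t1, t2)

# Old references observe the swapped content without being re-bound.
assert alias.shape == torch.Size([5])
```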

This is a draft for now, to see what @colesbury thinks about the approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
Wongboo
68f74dd162 Add python and C++ support for LPPool3d (#114199)
Add Python and C++ support for LPPool3d. Fixes #114114
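A small illustration of the new module, assuming its signature mirrors `nn.LPPool2d`:

```python
import torch
import torch.nn as nn

# Power-average (LP) pooling over 3D inputs; norm_type=2 gives an L2-style average.
pool = nn.LPPool3d(norm_type=2, kernel_size=3, stride=2)
x = torch.randn(1, 8, 16, 16, 16)  # (N, C, D, H, W)
print(pool(x).shape)               # expected: torch.Size([1, 8, 7, 7, 7])
```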

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114199
Approved by: https://github.com/mikaylagawarecki
2023-12-08 18:18:44 +00:00
Behrang Javaherian
b3b5bd51ea [raas][torch][jit] Allow not storing the optimized graph (#115381)
Summary:
GraphFunction internally stores the optimized graph after generating it, and the graph is then passed into the executor, which makes its own copy. So we effectively store the optimized graph twice.

This diff adds a flag to not store the optimized graph inside the GraphFunction.

The change is a no-op until the flag is enabled.

Test Plan:
I ran SL with this on raas, with good memory savings on the raas server. From the command line:

Example model run:
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362

I1207 11:04:58.657143 3556226 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 255646 Kb
```

then with flag enabled:
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true
I1207 11:06:25.245779 3577383 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 165167 Kb
```
Combined, with this flag and the flag from D51950418:
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true --torch_jit_enable_profiling_graph_executor=false

I1207 11:09:17.502743 3592345 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 114848 Kb
```

Differential Revision: D51931895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115381
Approved by: https://github.com/malfet
2023-12-08 16:29:13 +00:00
fduwjj
4d70802133 [c10d] Use TCPStore to record NCCL timeout and dump debug info (#115226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115226
Approved by: https://github.com/wconstab
2023-12-08 06:19:40 +00:00
Will Constable
784e20e3d7 [C10D] Make dumpPipe use async launcher (#115375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115375
Approved by: https://github.com/fduwjj
ghstack dependencies: #115332
2023-12-08 00:16:22 +00:00
Will Constable
7562b45454 Reland "[C10D] Use future for flight recorder dump (#115176)" (#115332)
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.

Allows multiple calls to dump, which will be serialized.

Renames tryWriteDebugInfo to launchAsyncDebugDump in the spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.

Adds a test for dumping on timeout.

This reverts commit ac7d14baad.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332
Approved by: https://github.com/fduwjj
2023-12-07 21:20:58 +00:00
youkaichao
16373bbc1f fix error message in pytorch (#115349)
Fixes https://dev-discuss.pytorch.org/t/typo-in-error-message/1709 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115349
Approved by: https://github.com/Skylion007
2023-12-07 19:27:29 +00:00
Howard Huang
3e66385ddd Add Work to distributed docs (#115172)
Summary:
Documenting the `Work` object

For a collective (broadcast, all_reduce, etc.) with async_op=True, we return a `Work` object on which users can call `.wait()`, `.is_success()`, and other methods, but this class is not documented
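A minimal sketch of the documented usage (assumes a process group is already initialized, e.g. under `torchrun`):

```python
import torch
import torch.distributed as dist

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)  # returns a Work handle
work.wait()                               # block until the collective completes
print(work.is_success())
```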

Test Plan: Preview the docs build in OSS

Differential Revision: D51854974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172
Approved by: https://github.com/wconstab
2023-12-07 18:12:10 +00:00
Shaltiel Shmidman
ee8b33f7d5 Fixed crash when calling pad_packed_tensor when packed with cuda tensors and ensure_sorted=false due to indexing with tensors on different devices (#115028)
Fixes #115027

Fix in csrc as done in the python code [here](https://github.com/pytorch/pytorch/blob/main/torch/nn/utils/rnn.py#L338).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115028
Approved by: https://github.com/drisspg
2023-12-07 18:09:18 +00:00
Tobias Ringwald
43f42bf3cb Updated docs for deprecated torch.set_default_tensor_type (#115041)
Added deprecation note for torch.set_default_tensor_type. Updated docs that referenced this method.
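For illustration, the replacements the updated docs point to are roughly the following (a sketch, not the exact wording of the note):

```python
import torch

# Deprecated:
#   torch.set_default_tensor_type(torch.cuda.FloatTensor)

# Preferred equivalents:
torch.set_default_dtype(torch.float32)   # controls the default floating-point dtype
torch.set_default_device("cuda")         # controls the default device for new tensors
```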

Fixes #113646.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115041
Approved by: https://github.com/janeyx99
2023-12-07 16:17:36 +00:00
Chip Turner
78b945484b [c10d] Extend NCCL communicator splitting to more use cases (#114916)
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.

This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank.

This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number). And by doing the early connect, we have the guarantee that ranks are connected and can perform nocolor splits when needed.
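A sketch of the new call pattern; the keyword name `device_id` is an assumption, since this message only describes "a bound device id":

```python
import torch
import torch.distributed as dist

# Bind this rank to exactly one device so the backend can connect eagerly
# and later create subgroups via ncclCommSplit.
local_rank = 0  # typically derived from LOCAL_RANK under torchrun
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
)
```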

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
2023-12-07 15:13:01 +00:00
FFFrog
e1f159e6b2 Remove redundant API named is_int_list (#115136)
Fixes #114933

As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115136
Approved by: https://github.com/zou3519
2023-12-07 04:55:13 +00:00
PyTorch MergeBot
ac7d14baad Revert "[C10D] Use future for flight recorder dump (#115176)"
This reverts commit 0e07e3dbe4.

Reverted https://github.com/pytorch/pytorch/pull/115176 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test_timeout_dumps is failing in trunk 0e07e3dbe4 ([comment](https://github.com/pytorch/pytorch/pull/115176#issuecomment-1844076455))
2023-12-07 02:09:58 +00:00
Antonio Kim
73c0035160 Add reset_storage method to FunctionalTensorWrapper (#115235)
In certain edge cases when using lazy tensors, the base tensor stored in the `FunctionalStorageImpl` and the `value_` tensor stored in the `FunctionalTensorWrapper` diverge. For instance, take this simple example
```python
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 2, bias=False)

    def forward(self, x):
        return x @ self.fc1.weight.transpose(0, 1)

with torch.device("lazy"):
    model = Model()

    x = torch.ones(4)
    out = model(x)
```
The call to `transpose` on the lazily initialized weight `fc1.weight` applies a view op on the functional tensor, which only gets propagated to the functional tensor wrapper and not to the base tensor in the storage, thus causing them to diverge.

To fix this behaviour, we need to reset the functional tensor's storage. To facilitate this, we add a `reset_storage` method to `FunctionalTensorWrapper` which clears away the old storage and view metas.

CC: @behzad-a @GlebKazantaev @wconstab @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115235
Approved by: https://github.com/bdhirsh
2023-12-07 01:32:01 +00:00
Will Constable
0e07e3dbe4 [C10D] Use future for flight recorder dump (#115176)
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort".  The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.

Allows multiple calls to dump, which will be serialized.

Renames `tryWriteDebugInfo` to `launchAsyncDebugDump` in the spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.

Adds a test for dumping on timeout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115176
Approved by: https://github.com/zdevito
2023-12-06 23:42:19 +00:00
y-sq
233ce0d24b Support GPU annotations for auto-trace jobs, similar to on-demand support (#114638)
Summary: When using auto_trace, gpu_user_annotation is not shown in the results. Fixing this by including `GPU_USER_ANNOTATION` in `kCudaTypes`.

Differential Revision: D51597995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114638
Approved by: https://github.com/aaronenyeshi
2023-12-06 09:38:13 +00:00
cyy
d250b2158e [4/N] Fixes clang-tidy warnings in header files (#115163)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115163
Approved by: https://github.com/Skylion007
2023-12-06 05:00:01 +00:00
fduwjj
2bff36bb0e [c10d] Change set timeout API name to _set_default_timeout (#115197)
Somehow the feedback did not show up on the original PR; this PR addresses the comment in https://github.com/pytorch/pytorch/pull/115141.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115197
Approved by: https://github.com/XilunWu, https://github.com/wconstab
2023-12-06 03:38:39 +00:00
Hongtao Yu
01ec71e466 [NFC][Autotune] Use device_prop.regsPerMultiprocessor instead of hardcoded reg number. (#115094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115094
Approved by: https://github.com/jansel
2023-12-05 23:49:46 +00:00
Mu-Chu Lee
80527c0cf2 [AOTInductor] Double buffering for Weights (#114446)
Summary:
This adds a function to the model container that does weight swapping with double buffering.

There are 2 parts for double buffering
a) Write constants into inactive buffer
b) Swap active buffer

For (a), we write the constants into the buffer that is currently not in use, and record them in both the constants map and the corresponding constant array used for reads.
For (b), we take the lock, activate the currently inactive constant map/constant array, and flag the one currently in use as inactive.
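A conceptual sketch of the scheme in Python (illustration only; the real implementation is the C++ model container described above):

```python
import threading

class DoubleBufferedConstants:
    """Two constant maps: writes go to the inactive one, reads to the active one."""

    def __init__(self):
        self.buffers = [{}, {}]
        self.active = 0
        self.lock = threading.Lock()

    def write_inactive(self, name, value):
        # (a) write constants into the buffer that is not currently in use
        self.buffers[1 - self.active][name] = value

    def swap(self):
        # (b) take the lock and flip which buffer is active
        with self.lock:
            self.active = 1 - self.active

    def read(self, name):
        return self.buffers[self.active][name]
```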

Test Plan:
test/cpp/aot_inductor/test.cpp

Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446
Approved by: https://github.com/chenyang78, https://github.com/eellison
2023-12-05 22:31:56 +00:00
zdevito
259a99669d [NCCL flight recorder] Dump when writing to pipe (#115139)
If TORCH_NCCL_DUMP_ON_TIMEOUT is set, then along with producing a dump
file when a timeout happens, you can trigger a dump by writing to local pipe
`<TORCH_NCCL_DEBUG_INFO_TEMP_FILE>_<rank>.pipe` (by default
/tmp/nccl_trace_rank_<rank>.pipe).
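A sketch of triggering a dump from another process by writing to the per-rank pipe (the path pattern follows the message above; the default prefix from TORCH_NCCL_DEBUG_INFO_TEMP_FILE is an assumption here):

```python
# Writing anything to a rank's pipe asks that rank's watchdog to produce a dump.
rank = 0
prefix = "/tmp/nccl_trace_rank_"  # assumed default prefix, for illustration only
with open(f"{prefix}{rank}.pipe", "w") as pipe:
    pipe.write("\n")
```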

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115139
Approved by: https://github.com/wconstab
2023-12-05 20:44:23 +00:00
fduwjj
a8bd593252 [c10d] Add _reset_nccl_collective_timeout so users can change timeout of a NCCL PG (#115141)
There are some use cases where users want to change the timeout of an NCCL process group in the middle of training. This PR enables that by adding a pybind API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115141
Approved by: https://github.com/wconstab
2023-12-05 19:55:28 +00:00
Ke Wen
c9853ccadc Relax tensor contiguity requirement for P2P ops (#114982)
I hit the following error when running pipeline parallelism for T5:
```
    return default_pg.send([tensor], dst, tag)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Tensors must be contiguous
```

In theory, we shouldn't require the tensors to be contiguous, especially for P2P ops, because we are just doing bit-wise "copy".

Thus, this PR relaxes the requirement and instead calls out that it is the user's responsibility to guarantee that the source and destination tensors have the same contiguity setting.
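A minimal sketch of the relaxed behavior (assumes an initialized 2-rank process group); keeping the same contiguity on both ends is now on the user:

```python
import torch
import torch.distributed as dist

t = torch.arange(16.0).reshape(4, 4).t()  # non-contiguous view
if dist.get_rank() == 0:
    dist.send(t, dst=1)
else:
    buf = torch.empty(4, 4).t()            # matching non-contiguous layout
    dist.recv(buf, src=0)
```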

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114982
Approved by: https://github.com/H-Huang
2023-12-05 18:25:42 +00:00
Xia, Weiwen
daf89b4101 Update oneDNN submodule to v3.3.2 (#112700)
Update oneDNN submodule to v3.3.2.
Add a macro to check the version of `third_party/ideep`.
Since we have versioning now, the changes won't break any pipeline even if `third_party/ideep` is not updated at the same time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112700
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2023-12-05 17:51:55 +00:00
PyTorch MergeBot
ee96399bb4 Revert "[Reland2] Update NVTX to NVTX3 (#109843)"
This reverts commit dcb486232d.

Reverted https://github.com/pytorch/pytorch/pull/109843 on behalf of https://github.com/atalman due to Diff broke internal builds and tests ([comment](https://github.com/pytorch/pytorch/pull/109843#issuecomment-1841105398))
2023-12-05 16:10:20 +00:00
Pavan Balaji
94faba5224 [nccl-pg] Revert accidental renaming of env variables (#115082)
Summary:

In [9cc040fef6], we accidentally changed some of the environment variable names to the non-deprecated form.  The intent was to support both the deprecated and the new form of the env variables (with a warning thrown for the deprecated form).

Test Plan:

OSS CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115082
Approved by: https://github.com/zdevito
2023-12-05 14:52:30 +00:00
cyyever
1224acc018 [3/N] Fixes clang-tidy warnings in header files (#114431)
This PR series tries to enable clang-tidy for headers in torch/csrc and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114431
Approved by: https://github.com/Skylion007
2023-12-05 12:58:27 +00:00
PyTorch MergeBot
62df4f3428 Revert "Update oneDNN submodule to v3.3.2 (#112700)"
This reverts commit afbaa0c165.

Reverted https://github.com/pytorch/pytorch/pull/112700 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/112700#issuecomment-1839350284))
2023-12-04 19:41:12 +00:00
cyyever
dcb486232d [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10
2023-12-04 19:02:07 +00:00
PyTorch MergeBot
f101426790 Revert "Move class definition of DebugInfoWriter to TraceUtil as well (#114901)"
This reverts commit fb325bbd46.

Reverted https://github.com/pytorch/pytorch/pull/114901 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114901#issuecomment-1838815178))
2023-12-04 14:55:39 +00:00
FFFrog
541591dd79 Add the appropriate check on div_value to the cpp frontend (#114671)
Fixes #114334

As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114671
Approved by: https://github.com/mikaylagawarecki
2023-12-04 01:28:11 +00:00
Yang Chen
4d8b9964e1 [aotinductor] support at::convolution for AOTInductor (#114961)
This PR adds support for at::convolution to AOTInductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114961
Approved by: https://github.com/desertfire
2023-12-03 07:52:28 +00:00
Kwanghoon An
13410d0eda Moving target/code path to non-pytorch repo (#114095)
Differential Revision: D51460806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114095
Approved by: https://github.com/digantdesai
2023-12-02 19:27:09 +00:00
Jez Ng
f1fd02503b Reland #113487 and #112527 (sdpa shim & fp8 AOTInductor support) (#114974)
This is a backout of #113747 which reverted the above two commits. Now that
#113997 has landed, this diff can be landed safely without breaking ABI compatibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114974
Approved by: https://github.com/chenyang78
2023-12-02 03:25:51 +00:00
Mu-Chu Lee
fb806f487f [AOTInductor] Add method to get storage size in shim (#114976)
Summary:
Add a method to get storage size.

Test Plan:
N/A; for FC, tests will come after packaging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114976
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-12-02 01:54:18 +00:00
Will Constable
8a51845b38 [C10D] Add filename to dump finished log (#114957)
Just shows you where to look.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114957
Approved by: https://github.com/fduwjj
2023-12-01 20:38:02 +00:00
Chip Turner
9cc040fef6 Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880)
Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in #114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
2023-12-01 20:08:23 +00:00
Xia, Weiwen
afbaa0c165 Update oneDNN submodule to v3.3.2 (#112700)
Update oneDNN submodule to v3.3.2.
Add a macro to check the version of `third_party/ideep`.
Since we have versioning now, the changes won't break any pipeline even if `third_party/ideep` is not updated at the same time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112700
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2023-12-01 18:40:07 +00:00
Nikita Shulga
76362cc9a0 [BE] Do not use AT_ERROR (#114883)
As the latter is just an alias for `TORCH_CHECK(false, ...)`.

Proposed as a suggestion in https://github.com/pytorch/pytorch/pull/110303, but it wasn't noticed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114883
Approved by: https://github.com/atalman
2023-12-01 13:44:17 +00:00
fduwjj
25b83521be [c10d] Log NCCL trace buffer size (#114926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114926
Approved by: https://github.com/zdevito
ghstack dependencies: #114901
2023-12-01 08:06:10 +00:00
Pavan Balaji
aa390cec21 [profiler] Fix description to use nelems rather than size (#114735)
We were storing the number of elements in the tensor, rather than the actual bytes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114735
Approved by: https://github.com/aaronenyeshi, https://github.com/yoyoyocmu, https://github.com/kwen2501, https://github.com/fduwjj
2023-12-01 06:21:47 +00:00
fduwjj
fb325bbd46 Move class definition of DebugInfoWriter to TraceUtil as well (#114901)
Since we moved the implementation of the class to TraceUtils in https://github.com/pytorch/pytorch/pull/114367, we may as well move the class definition there too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114901
Approved by: https://github.com/XilunWu
2023-12-01 03:28:16 +00:00
Shengbao Zheng
1d95644740 [Execution Trace] record root rank for broadcast/gather/reduce/scatter (#113828)
Summary:
Collectives like broadcast/gather/reduce/scatter need root rank info in order to be replayed in PARAM benchmarks. Log the root rank instead of the local rank in RECORD_PARAM_COMMS_DATA.

Reference: distributed/c10d/Types.hpp

Test Plan: Tested in HPC

Differential Revision: D51381196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113828
Approved by: https://github.com/fduwjj
2023-12-01 01:28:49 +00:00
Will Constable
92cd78b1df [C10D] logging/comment clean ups (#114625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114625
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
ghstack dependencies: #114810
2023-11-30 07:46:32 +00:00
Will Constable
4ed9e65038 [C10D] Add time_created_us to flight recorder (#114810)
time_created_us is the cpu-side epoch_time (in usec) when a flight-recorder
event was created. It loosely corresponds to the time the c10d collective
API was called and a work object was created.  It does NOT correspond to
the time the collective started on the GPU.

We follow the precedent of microsecond (us) epoch time from this PR, which added
timestamps to the CUDA caching allocator:
https://github.com/pytorch/pytorch/pull/112266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114810
Approved by: https://github.com/zdevito
2023-11-30 04:15:56 +00:00
Tae Kyung Heo
1f5726708b [PyTorch][ET] Collect Execution Traces in Chakra schema (#114753)
Summary:
Collect execution traces in the Chakra schema

Created a new diff to change email address: D48030418

Test Plan:
```
$ cd ~/fbcode
$ binary_path=$(buck2 build //param_bench/train/compute/python:pytorch_run_benchmark --show-output | tail -1 | awk '{print $2}')
$ cd ~/fbsource
$ $binary_path -c ~/fbcode/param_bench/train/compute/python/examples/pytorch/configs/alex_net.json --et

$ cat ~/is_json.py
import json
import sys

def is_json_file(filename):
    try:
        with open(filename, 'r') as f:
            json.load(f)
        return True
    except Exception as e:
        return False

if len(sys.argv) != 2:
    print("Usage: python check_json.py [filename]")
    sys.exit(1)

filename = sys.argv[1] # get filename from command-line argument
print(is_json_file(filename))

$ python3 ~/is_json.py ~/fbsource/benchmark_result_2244333_1691065899_et.json
True
```

Differential Revision: D51662384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114753
Approved by: https://github.com/aaronenyeshi
2023-11-30 04:07:11 +00:00
zdevito
d5544125a0 [distributed] NCCL flight recorder timeout fix (#114804)
Because isCompleted() returns true on an exception, a timeout exception
will cause the flight recorder to consider the event completed even though it timed out.

This changes the logic to explicitly query the completion events on "retirement"
when the work item leaves the workMetaList. We mark events as retired so
we can distinguish between an event still in the queue but not completed and one
that timed out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114804
Approved by: https://github.com/wconstab
2023-11-30 03:46:48 +00:00
Scott Wolchok
165f4f6ccf [PyTorch] Redirect c10::optional to std::optional (#101995)
We have C++17 now!

I am intentionally dropping the `c10::optional<c10::ArrayRef>` size optimization. It was intended to improve dispatch, but thanks to D34602980 / #70864 we don't use `optional<ArrayRef>` in function arguments anymore anyway.

Differential Revision: [D46079028](https://our.internmc.facebook.com/intern/diff/D46079028/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101995
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/ezyang
2023-11-30 02:46:41 +00:00
Wei Lu
34ea0a2bdc [PyTorch][Vulkan] Create context for layernorm (#114701)
Summary:
`Layernorm` has two arguments, weight and bias, which are stored as constant tensors on the CPU and transferred to the GPU at every inference call. We create a context for this op to avoid the repeated transfer. Specifically, we
- created `create_layernorm_context` and `run_layernorm_context` in `Layernorm.h` and `Layernorm.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`

Test Plan:
## Numerical test
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (b6ccc956c)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm*"
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.

If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa

  Targets matching .buckconfig buck2.supported_projects:
  {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}

  To suppress this warning: touch ~/.config/.dont_hint_buck2

Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.packed_layer_norm_2d
[       OK ] VulkanAPITest.packed_layer_norm_2d (342 ms)
[ RUN      ] VulkanAPITest.packed_layer_norm_3d
[       OK ] VulkanAPITest.packed_layer_norm_3d (284 ms)
[ RUN      ] VulkanAPITest.packed_layer_norm_4d
[       OK ] VulkanAPITest.packed_layer_norm_4d (5 ms)
[ RUN      ] VulkanAPITest.layer_norm_invalid_inputs
[       OK ] VulkanAPITest.layer_norm_invalid_inputs (28 ms)
[ RUN      ] VulkanAPITest.layer_norm_2d
[       OK ] VulkanAPITest.layer_norm_2d (1 ms)
[ RUN      ] VulkanAPITest.layer_norm_3d
[       OK ] VulkanAPITest.layer_norm_3d (2 ms)
[ RUN      ] VulkanAPITest.layer_norm_4d
[       OK ] VulkanAPITest.layer_norm_4d (4 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_2d
[       OK ] VulkanAPITest.native_layer_norm_2d (1 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_3d
[       OK ] VulkanAPITest.native_layer_norm_3d (2 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_4d
[       OK ] VulkanAPITest.native_layer_norm_4d (6 ms)
[----------] 10 tests from VulkanAPITest (679 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (679 ms total)
[  PASSED  ] 10 tests.
```
Full test result in P888496077, summary as below
```
[----------] 419 tests from VulkanAPITest (21652 ms total)

[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (21652 ms total)
[  PASSED  ] 418 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```

## Graph representation comparison
We created a model using `layer_norm` and traced it as below
```
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layer_norm = torch.nn.LayerNorm(normalized_shape=10)

    def forward(self, x):
        return self.layer_norm(x)

# Create an instance of the model
model = MyModel()

# Create a dummy input tensor for tracing
input_tensor = torch.randn(1, 10)

# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer

vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Then we can print the graph of the `vulkan_model` as `print(vk_model.graph)`

- Before this diff
```
  %4 : bool = prim::Constant[value=1](), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
  %5 : float = prim::Constant[value=1.0000000000000001e-05](), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
  %14 : int[] = prim::Constant[value=[10]]()
  %33 : Tensor = aten::to(%x, %53, %30, %31, %31)
  %10 : Tensor = aten::layer_norm(%33, %14, %self.layer_norm.weight, %self.layer_norm.bias, %5, %4), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
```

- after this diff
```
  %14 : int[] = prim::Constant[value=[10]]()
  %47 : Tensor = aten::to(%x, %78, %44, %45, %45)
  %16 : Tensor = vulkan_prepack::run_layernorm_context(%47, %14, %17)
```

Reviewed By: SS-JIA

Differential Revision: D51530478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114701
Approved by: https://github.com/yipjustin
2023-11-30 01:33:50 +00:00
cyy
4e38178bb8 [Reland] [1/N] Fixes clang-tidy warnings in header files (#114668)
Reland of #113608 after fixing the problematic parts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114668
Approved by: https://github.com/huydhn
2023-11-29 07:11:51 +00:00
Brian Hirsh
64ccdd4afb AOTAutograd: keep input mutations in the graph if they are under no_grad, even if they require_grad (#114646)
Quick recap of events:

(1) https://github.com/pytorch/pytorch/pull/111347, which fixed a perf regression in 2.1 compared to 2.0, introduced a correctness problem around input mutations on inputs that require grad that show up in an inference-only graph (the specific case where this can happen is rare and nobody reported the issue, but it was fixed a few weeks later)

(2) That fix happened here: https://github.com/pytorch/pytorch/pull/113584, which makes sure to keep input mutations outside of the graph, so the autograd engine can set metadata properly on them

(3) That in turn caused a slight regression compared to (1), which is what this PR attempts to fix. In particular, code like the below is safe to keep the mutations in the graph for:

```
@torch.compile
def f(x):
    x.mul_(2)

x = torch.ones(2, requires_grad=True).clone()
# x requires_grad, so the input mutation will change some autograd metadata, like the version counter
# However, the mutation is under no_grad, so we don't have to worry about e.g. aliases of x having their .grad_fn fields changed
with torch.no_grad():
    f(x)
```

This particular case is pretty important to the shampoo optimizer code, which is run under `torch.compile`, and mutates parameters (which require grad).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114646
Approved by: https://github.com/zou3519
2023-11-29 04:29:32 +00:00
Will Constable
43d0659d74 [C10D] Fix DUMP_ON_TIMEOUT env (#114699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114699
Approved by: https://github.com/kwen2501, https://github.com/XilunWu, https://github.com/fduwjj
2023-11-29 00:15:45 +00:00
Will Constable
44c9e4cbf0 [C10D] Decouple PGNCCL desync from dbg dump (#114614)
Add TORCH_NCCL_DUMP_DEBUG_INFO env to control dumping independently
of desync debug feature.

Currently default to disabled (so no behavior change by default),
but plan to default this to true after validation.

Moves the 'sleep for 30 sec' that used to be after desync debug to before
it. In my view, sleeping before desync debug is equivalent since we always
sleep the same duration, and this keeps the code simpler.

Fixes #114433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
2023-11-28 19:46:10 +00:00
voznesenskym
ddf1cb7870 AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation() ` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation

**This PR is still silently correct in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-28 19:33:35 +00:00
Will Constable
e6a8052051 [C10D] Flight recorder - disable c++ stacktrace by default (#114651)
CPP Stacktrace processing (symbolizer) takes a long time on some systems
using a particular version of addr2line.  In slow systems, this makes
flight-recorder dumping slow enough to time out on even toy programs.

TORCH_NCCL_TRACE_CPP_STACK=True will re-enable CPP stacktrace collection
as part of the flight recorder.

CPP stacktrace is fast enough for use on certain combinations of OS. We
can investigate moving to llvm's symbolizer as a replacement.

On devserver with C++ stacktraces disabled/enabled:
```
python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 12.175s

TORCH_NCCL_TRACE_CPP_STACK=1 python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 53.338s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114651
Approved by: https://github.com/zdevito
2023-11-28 16:49:20 +00:00
cyy
8933ff3595 Make torch::jit::module movable (#114041)
This PR makes torch::jit::module movable to improve performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114041
Approved by: https://github.com/huydhn
2023-11-28 05:03:37 +00:00
Pritam Damania
f505d76462 Bug fixes to DDP _update_process_group API. (#114194)
https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular, if an error occurred during backward, DDP would end up in an incorrect state.
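A rough sketch of the private API being fixed (assumes an initialized default group; the exact signature is an assumption):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(torch.nn.Linear(8, 8))
# ... training; if the group needs replacing, build a new one ...
new_pg = dist.new_group(ranks=list(range(dist.get_world_size())))
model._update_process_group(new_pg)  # private API introduced in #113580
```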

As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
2023-11-27 23:52:40 +00:00
Ke Wen
800cf5f7cb Add USE_C10D_NCCL around NCCL trace utils (#114597)
Fixes #114575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114597
Approved by: https://github.com/malfet
2023-11-27 19:55:31 +00:00
Chip Turner
066e072524 Retry #112889 (Opportunistically use ncclCommSplit when creating new NCCL groups) (#114385)
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`

Fixes cause of revert of original PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
2023-11-23 07:00:00 +00:00
Ke Wen
36763d3135 [ProcessGroupNCCL] Move new trace utils (#114367)
to TraceUtils.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114367
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2023-11-23 05:07:41 +00:00
PyTorch MergeBot
b927a4e2ca Revert "Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)"
This reverts commit 64a5372e6c.

Reverted https://github.com/pytorch/pytorch/pull/112889 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing ROCm distributed jobs in trunk 4d07428ede ([comment](https://github.com/pytorch/pytorch/pull/112889#issuecomment-1823214376))
2023-11-22 17:43:51 +00:00
Pavan Balaji
00ae299016 [c10d] Remove unused function (#114341)
Summary: As the title suggests

Test Plan: OSS CI

Differential Revision: D51386619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114341
Approved by: https://github.com/Skylion007
2023-11-22 17:31:20 +00:00
Ke Wen
f2ca07b680 [ProcessGroupNCCL] Remove jumper to UCC (#114170)
The "jumper" to UCC lib in ProcessGroupNCCL was a temporary solution a while back. Cleaning it now that UCC has its own "PG" representation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114170
Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/Aidyn-A
2023-11-22 15:35:06 +00:00
Bin Bao
33fad1c0d4 [AOTI] Fix a weight loading issue when the weight size can be 0 (#114280)
Summary: When a weight tensor is 0-size, no device memory should be allocated for it. This PR fixes the weight loading logic for such a case. This problem was found when running the 14K model test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114280
Approved by: https://github.com/chenyang78
2023-11-22 14:03:51 +00:00
PyTorch MergeBot
3e1abde46d Revert "AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)"
This reverts commit a911b4db9d.

Reverted https://github.com/pytorch/pytorch/pull/111554 on behalf of https://github.com/DanilBaibak due to The lower PR in the stack #113926 breaks the internal build ([comment](https://github.com/pytorch/pytorch/pull/111554#issuecomment-1822472206))
2023-11-22 10:13:48 +00:00
Edward Z. Yang
6187153753 Consolidate sym/non-sym overloads for _make_wrapper_subclass (#114236)
I'm not sure why we needed two overloads previously, let's find out! Removing the int overload is load-bearing because it now forces specialization on SymInt arguments instead of falling through to the SymInt overload; see the new test.

I decided NOT to allow storage offset simultaneously with None strides.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114236
Approved by: https://github.com/albanD
2023-11-22 02:03:29 +00:00
Andrew Calvano
4d07428ede Fix for out of bounds read in mobile interpreter FORMAT opcode handler (#110303)
Summary:
The FORMAT opcode for the mobile TorchScript interpreter contained an out-of-bounds read issue leading to memory corruption.

This change adds an explicit check that the number of inputs passed to the format method called when handling the FORMAT opcode is valid and within the bounds of the stack.

Test Plan: contbuild + OSS signals

Differential Revision: D49739095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110303
Approved by: https://github.com/malfet
2023-11-22 01:05:42 +00:00
Antonio Kim
7fc292930c Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` (a short usage sketch follows this list)
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)
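A short usage sketch of the new `generator` argument on the affected `torch.nn.init` functions:

```python
import torch
import torch.nn as nn

g = torch.Generator().manual_seed(0)
w = torch.empty(3, 5)
nn.init.uniform_(w, a=-0.1, b=0.1, generator=g)  # reproducible init via an explicit generator
```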

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-21 23:07:21 +00:00
Chip Turner
64a5372e6c Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful, as it creates non-shared pairs
of endpoint queues and costs time to re-establish communication.

This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.
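A sketch of the user-facing call; nothing changes in the API, the split happens opportunistically underneath (assumes an initialized NCCL default group):

```python
import torch.distributed as dist

# With this change, the subgroup can be created via ncclCommSplit from the
# healthy default group instead of a full ncclCommInitRankConfig.
subgroup = dist.new_group(ranks=[0, 1])
```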

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
2023-11-21 21:03:52 +00:00
Ying Liu
85b97605ab Enable set sequence nr (#114120)
Summary:
In some cases (especially those involving collective calls), we want to always kick off a collective call first before going down another path.

For  example:

```
tbe lookup -> a2a ->
                     overarch
dense ------------->
```

If the forward code is written as
```
a2a_out = a2a
dense = dense_net
out = overarch(a2a_out, dense)
out.backward()
```

The current default is to run backward in the opposite order of the forward calls. However, there is no data dependency between a2a and dense, so in reality either of them could run first. We would like the a2a to run first because it provides optimal (on average) overlap.

Changing the seq_nr of a2a_out to something large enough would allow autograd engine to kick it off first.

Test Plan: Tests incoming

Differential Revision: D51445261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114120
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-11-21 19:47:28 +00:00
Pavan Balaji
8f8722e3f1 [nccl-pg] Avoid using NCCL_ prefix for non-NCCL env variables (#114077)
The NCCL_ prefix should only be used for the NCCL library's own environment variables. We currently use a few environment variables in PyTorch with the NCCL_ prefix that the NCCL library does not understand.

This patch renames such environment variables to use the TORCH_NCCL_ prefix instead.  We still maintain the old NCCL_ variables, but throw a warning when they are used.

The following env changes have been made:

`NCCL_BLOCKING_WAIT` -> `TORCH_NCCL_BLOCKING_WAIT`
`NCCL_ENABLE_TIMING` -> `TORCH_NCCL_ENABLE_TIMING`
`NCCL_DESYNC_DEBUG` -> `TORCH_NCCL_DESYNC_DEBUG`
`NCCL_ASYNC_ERROR_HANDLING` -> `TORCH_NCCL_ASYNC_ERROR_HANDLING`
`ENABLE_NCCL_HEALTH_CHECK` -> `TORCH_ENABLE_NCCL_HEALTH_CHECK`
`NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK` -> `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`
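For example, a launch script would now set the TORCH_-prefixed form (sketch):

```python
import os

# The old NCCL_-prefixed name still works but emits a deprecation warning.
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"
```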

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114077
Approved by: https://github.com/fduwjj
2023-11-21 07:23:42 +00:00
David Berard
99af534e93 [docs][jit] Mention dynamic-shapes settings in jit/OVERVIEW.md (#113964)
Document torch._C._jit_set_fusion_strategy, which can control how many static-shape compilation attempts are made before falling back to dynamic shapes, and then to uncompiled graph execution.
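A sketch of the setting being documented (the counts here are illustrative, not the defaults):

```python
import torch

# Try 2 static-shape specializations, then 1 dynamic-shape attempt,
# before falling back to uncompiled graph execution.
torch._C._jit_set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 1)])
```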

Would be good to keep all the graph executor settings documented in one place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113964
Approved by: https://github.com/eellison
2023-11-21 06:21:38 +00:00
Ke Wen
dc65f6c601 [c10d] Remove deprecated multi-gpu-per-thread APIs (#114156)
As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its documentation. The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now, before the 2.2 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156
Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang
2023-11-21 03:50:23 +00:00
voznesenskym
a911b4db9d AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation() ` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation

**This PR is still silently correct in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-21 01:52:46 +00:00
Jacob Szwejbka
e8996055a9 [iOS][PTMCoreMLCompiler] update other deprecated function (#114177)
Summary: The old way was deprecated.

Test Plan: ci

Reviewed By: kirklandsign

Differential Revision: D51172622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114177
Approved by: https://github.com/kirklandsign
2023-11-21 01:36:00 +00:00
Guilherme Leobas
77f16eb00c Fix prod double backward when there are 2+ zeros (#113969)
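A minimal sketch of the affected case (a `prod` over an input containing two or more zeros, differentiated twice):

```python
import torch

x = torch.tensor([0.0, 0.0, 3.0], requires_grad=True)
y = x.prod()
(g,) = torch.autograd.grad(y, x, create_graph=True)
gg = torch.autograd.grad(g.sum(), x)  # double backward through prod
print(gg)
```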
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113969
Approved by: https://github.com/albanD
2023-11-21 01:32:10 +00:00
Ke Wen
585332fb8d [ProcessGroupNCCL] Fix avoid-record-stream warning for P2P (#114168)
I have been seeing the warning below even though I did not set `TORCH_NCCL_AVOID_RECORD_STREAMS` to 1.
```
Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives.  (function operator())
```

Turns out that `TORCH_WARN_ONCE` is unconditional, so the original code below would print out both the value of `avoidRecordStreams_` and the error message:
```
TORCH_WARN_ONCE(
   avoidRecordStreams_,
   "TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point "
   "collectives.");
```
 That's also where the "0" in the message came from.

Cc: @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114168
Approved by: https://github.com/eqy, https://github.com/fduwjj, https://github.com/H-Huang
2023-11-21 01:29:00 +00:00
Jacob Szwejbka
d70857bd9e [pytorch][lite interpreter] add tracer run under inference guard (#114003)
Summary: This can change the ops called under the hood. It's not safe to always call because of on-device training.

Test Plan: ci

Differential Revision: D51440119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114003
Approved by: https://github.com/Jack-Khuu
2023-11-21 00:45:52 +00:00
Adnan Akhundov
ae00d9623e [inductor] Add ABI shim function for torch.scatter (#114027)
Summary: Scatter fallback calls `at::scatter` in the C++ wrapper codegen. This doesn't work in the ABI compatibility mode, as the latter requires a shim function. One is added in this PR.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_scatter_fallback
s...
----------------------------------------------------------------------
Ran 4 tests in 52.713s

OK (skipped=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114027
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #114024
2023-11-20 22:51:59 +00:00
Edward Z. Yang
8c4812be80 Replace expect_int with guard_int (#113921)
The idea is that instead of erroring, we will just specialize at these sites.

Fixes https://github.com/pytorch/pytorch/issues/113142

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113921
Approved by: https://github.com/zou3519
2023-11-20 21:27:48 +00:00
rzou
d1bb0b0e4d Mark more built-in ops as pt2_compliant (#114128)
See title

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114128
Approved by: https://github.com/ezyang
2023-11-20 20:55:55 +00:00
Andrew Gallagher
95eab508e3 [caffe2] Add non-x86 stub definition for libraryFor too (#114023)
Summary: Fix non-x86 build errors with missing `libraryFor` symbol.

Test Plan:
```
$ buck2 build -c fbcode.arch=aarch64 fbcode//admarket/adfinder:adfinder
```

Reviewed By: malfet

Differential Revision: D51444766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114023
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
2023-11-20 17:01:47 +00:00
PyTorch MergeBot
f36d09fcb7 Revert "Add function to materialize COW storages (#113396)"
This reverts commit e2f090086b.

Reverted https://github.com/pytorch/pytorch/pull/113396 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113396#issuecomment-1818769090))
2023-11-20 10:26:01 +00:00
PyTorch MergeBot
fe428a284b Revert "Add torch._lazy_clone to create COW tensors (#113397)"
This reverts commit 9916d8a9ea.

Reverted https://github.com/pytorch/pytorch/pull/113397 on behalf of https://github.com/DanilBaibak due to Unfortunately, I need to revert your PR because the lower [PR in the stack](https://github.com/pytorch/pytorch/pull/113396) is failing a bunch of internal build jobs. ([comment](https://github.com/pytorch/pytorch/pull/113397#issuecomment-1818761224))
2023-11-20 10:21:09 +00:00
cyy
226384b460 [2/N] Cleanup header inclusions in torch_cpu by iwyu (#109964)
Further cleaning up of torch_cpu header inclusions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109964
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-11-19 20:56:32 +00:00
cyy
bae61ecb96 [Reland 1] Cleanup header inclusions in torch_cpu by iwyu (#112311)
Reland https://github.com/pytorch/pytorch/pull/101178 to use IWYU on torch_cpu. The header file changes are excluded to avoid breaking internal jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112311
Approved by: https://github.com/ezyang
2023-11-19 04:06:36 +00:00
Pavan Balaji
958f3b0df6 [nccl-pg] Migrate to getCvar* functions for env variable checking (#113797)
Summary:
The getCvar* functions allow us to provide multiple environment variables for the same value.  This allows us to deprecate some variables in favor of others, while still allowing users to temporarily use the old variables for some time.

Test Plan: OSS CI

Reviewed By: fduwjj, XilunWu

Differential Revision: D51225487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113797
Approved by: https://github.com/fduwjj
2023-11-19 03:48:58 +00:00
Edward Z. Yang
fdaddec2c3 make_fx can now SymIntify int inputs (#113452)
This PR also contains a basket of fixes that were turned up by now testing more arguments with SymInt. I fixed as many of the easy ones as I could easily get earlier in this stack and a bunch here, but there are some more annoying ones I xfailed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113452
Approved by: https://github.com/Chillee
ghstack dependencies: #113877, #113911
2023-11-18 06:39:09 +00:00
albanD
855a5cf427 312 test fix in named tensor and TS deprecations (#113981)
Fix existing bugs / deprecations that become hard errors when running CI with Python 3.12

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113981
Approved by: https://github.com/malfet
2023-11-18 03:06:04 +00:00
Nikita Shulga
2efa89a388 [torch/csrc/onnx] Use nested namespaces (3/N) (#113993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113993
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #113991, #113992
2023-11-18 00:20:19 +00:00
Nikita Shulga
d6744a698c [torch/csrc/onnx] Use nested namespaces (2/N) (#113992)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113992
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #113991
2023-11-18 00:20:19 +00:00
Nikita Shulga
c83a897348 [torch/csrc/onnx] Use nested namespaces (1/N) (#113991)
Differential Revision: [D51439849](https://our.internmc.facebook.com/intern/diff/D51439849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113991
Approved by: https://github.com/ZainRizvi
2023-11-18 00:20:10 +00:00
Andrzej Kotlowski
0885c58296 Add Bfloat16 scalar support to gloo backend (#113557)
Support for bfloat16 scalars was missing. When I use the gloo backend
`torch.distributed.init_process_group(backend='gloo')`
and run
`torch.nn.parallel.DistributedDataParallel(model)`
and _model_ has BFloat16 features, I receive the following error:
`RuntimeError: Invalid scalar type`

This change fixes the issue.
c10::BFloat16 defines conversions from/to float, so calculations are done in float for bfloat16.
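A minimal sketch of the previously failing setup (assumes a gloo-capable multi-process launch):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
model = torch.nn.Linear(4, 4).to(torch.bfloat16)
ddp_model = torch.nn.parallel.DistributedDataParallel(model)  # no longer raises "Invalid scalar type"
```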

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113557
Approved by: https://github.com/XilunWu, https://github.com/jgong5
2023-11-17 21:16:54 +00:00
soulitzer
c435b8c10a Fix autograd engine callback error propagation from device thread (#113702)
The existing try-catch doesn't work because it doesn't call err.persist(). This is in contrast to the try-catch for evaluate_function which does work because it calls into python_engine's thread_on_exception which calls persist.

Calling persist on a python_error stashes the PyErr state from the thread-local PyThreadState onto the python_error object, so that when this error object is stored onto the future and passed back to the calling cpu thread, python_engine's execute try-catch can then err.restore() the error state. Finally, the python_engine's execute would re-raise so that this is re-caught by the HANDLE_TH_ERRORS macro.

Fixes https://github.com/pytorch/pytorch/issues/75750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113702
Approved by: https://github.com/albanD
2023-11-17 20:17:02 +00:00