Commit Graph

78699 Commits

Author SHA1 Message Date
Riley Dulin
3be150653c [torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.

Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.

Test Plan: Added a test for this case in `test_numeric_debugger`.

Reviewed By: jerryzh168

Differential Revision: D62898297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
2024-09-24 03:28:12 +00:00
Guilherme Leobas
e09c5b6046 Remove vt argument in raise_observed_exception (#136037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136037
Approved by: https://github.com/zou3519
2024-09-24 02:36:57 +00:00
fduwjj
9372692c7b [FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473)
Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473
Approved by: https://github.com/c-p-i-o
2024-09-24 02:34:43 +00:00
Nikita Shulga
4fd16dd8aa Clarify that libtorch API is C++17 compatible (#136471)
As it relies on some common C++17 primitives, such as `std::optional`
Replace all docs references from C++14 to C++17

Fixes https://github.com/pytorch/pytorch/issues/133205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136471
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-24 02:03:33 +00:00
Jez Ng
e4d294221b [inductor] Log precompilation time (#136395)
This has been useful for diagnosing the long compile time issues I've seen in the Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136395
Approved by: https://github.com/eellison
2024-09-24 01:47:54 +00:00
Edward Z. Yang
802ba79121 Inherit all secrets to inductor workflow (#135354)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135354
Approved by: https://github.com/desertfire, https://github.com/atalman, https://github.com/malfet
2024-09-24 01:30:40 +00:00
Aaron Orenstein
06909803cc Existing mypy issues (#136236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2024-09-24 01:02:07 +00:00
Xuan Zhang
a14f57b126 fix the inductor tests (#136474)
Fixes https://github.com/pytorch/pytorch/issues/136464 introduced in https://github.com/pytorch/pytorch/pull/134874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136474
Approved by: https://github.com/malfet
2024-09-24 00:59:22 +00:00
Nikita Shulga
9d9bc65b5e Make FlashAttentionKernel.cpp compilable for SVE with GCC-11 (#136477)
Extends https://github.com/pytorch/pytorch/pull/132434 to all minor revisions of GCC-11, as they all likely affected by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

Hattip to @abhishek-iitmadras  for the investigation

Fixes https://github.com/pytorch/pytorch/issues/136432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136477
Approved by: https://github.com/atalman, https://github.com/kit1980
2024-09-24 00:54:26 +00:00
Ke Wen
e0f84f40f7 [Pipelining] Allow non-0 stages to accept kwargs (#136416)
For supporting usage case in torchchat:
all non-0 stages requires `input_pos` and `cache_lane`.
```
kwargs = {"input_pos": input_pos, "cache_lane": lane}

if pp_rank == first_pp_rank:
    output = decorder.step(new_token, **kwargs)
elif pp_rank == last_pp_rank:
    output = decorder.step(**kwargs)
else:  # middle pp ranks
    decorder.step(**kwargs)
```

The `forward_one_chunk` code today hard sets `{}` as kwarg for non-0 stages, hence cannot support the above use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136416
Approved by: https://github.com/wconstab
2024-09-23 23:50:59 +00:00
Guilherme Leobas
52c917b0ba Optimize dict reconstruct to not codegen untouched values (#134876)
PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follow:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)

We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.

Fixes: #133487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
2024-09-23 21:45:44 +00:00
fduwjj
5033a1ca0d [RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)
1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time when we retry, we create a new TCPStore server first so that we don't need to append attempt count as prefix and avoid eventually TCPStore sync failure. (This is only for the TCPStore sharing enabled case)
2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned to a free port. We then pass that downstream (trainer or c10d). By doing so, TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer.
3. Then the port be broadcasted for dynamic_rendezvous.

Only one more question, what do we do about the store created from (_create_tcp_store) torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py, are we ok with creating a duplicate TCPStore server?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
2024-09-23 20:32:24 +00:00
PyTorch MergeBot
fd182b90a7 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d45b0151e5.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internall CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))
2024-09-23 19:57:13 +00:00
Nikita Shulga
08dba25775 [BE] Do not use deprecated APIs in SparseCsrTensorMath.cu (#136449)
- `Tensor::type()` -> `Tensor::scalar_type()`
- `Tensor::data<T>()` -> `Tensor::data_ptr<T>()`

Should fix following warnings during the compilation:
```
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassB_f32_notaligned_k128_dropout.cu.o
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In function ‘void at::native::_GLOBAL__N__496f0b0c_22_SparseCsrTensorMath_cu_868dd545::_apply_sparse_csr_linear_solve(const at::Tensor&, const at::Tensor&, bool, const at::Tensor&)’:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:739:36: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   739 |   int* rowOffsets = crow.data<int>();
       |                                    ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:740:35: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   740 |   int* colIndices = col.data<int>();
       |                                   ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:44: error: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                            ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:225:1: note: declared here
   225 |   DeprecatedTypeProperties & type() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                               ^
 /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here
   109 | inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) {
       | ^~~~~~~~~~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                               ^
 /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here
   109 | inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) {
       | ^~~~~~~~~~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1014: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1054: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1094: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136449
Approved by: https://github.com/huydhn
2024-09-23 19:20:34 +00:00
Xiaodong Wang
9a1dc41de7 [AMD] Skipping 0 byte send/recv for AMD GPU (#136362)
Summary: We found jobs getting stuck by send/recv zero bytes with RDMA on AMD GPUs. So just skipping them.

Reviewed By: danzimm

Differential Revision: D63075000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136362
Approved by: https://github.com/malfet, https://github.com/houseroad
2024-09-23 19:14:12 +00:00
Edward Z. Yang
3116fbda0f Increase update_hint_regression problem size to 1000 (#136434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434
Approved by: https://github.com/laithsakka
2024-09-23 18:51:44 +00:00
PyTorch MergeBot
274883083d Revert "[AOTI] Create another wrapper class to handle ArrayRef (#136318)"
This reverts commit d21841d077.

Reverted https://github.com/pytorch/pytorch/pull/136318 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136318#issuecomment-2368957264))
2024-09-23 17:47:49 +00:00
Aleksei Nikiforov
d859fcbc61 s390x: build s390x binaries on each pull request (#125399)
Ensure that s390x keeps building for each PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125399
Approved by: https://github.com/huydhn
2024-09-23 17:39:48 +00:00
Joel Schlosser
83a3ee0699 Support embedding_bag() with NJT input (#135888)
Fixes #93843

`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
2024-09-23 17:35:19 +00:00
James Wu
4649aeaebf Make AOTAutogradCache support remote FXGraphCache (#136173)
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.

This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.

Test Plan: (Meta only tests)

Reviewed By: aorenste

Differential Revision: D62384944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
2024-09-23 17:24:27 +00:00
Nikita Shulga
c3e678382b Fix addmm silent correctness on aarch64 (#136371)
Do not dispatch to fast gemmv functions when alpha is not equal to 1

Add regression test to address the problem

Fixes https://github.com/pytorch/pytorch/issues/136299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371
Approved by: https://github.com/swolchok
2024-09-23 17:10:34 +00:00
Edward Z. Yang
f0f79dd8f1 Correctly convert Python float to float64 when passing argument as Tensor (#136413)
I can't actually test the Dynamo codegen fix as it is impossible to
directly use the Tensor at the moment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413
Approved by: https://github.com/bobrenjc93
2024-09-23 16:48:08 +00:00
wz337
637d5c4b7e [DSD] Fix loading uneven full tensor into sharded state dict (#136365)
Fix #136228.

This is a follow up on https://github.com/pytorch/pytorch/pull/135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365
Approved by: https://github.com/fegin
2024-09-23 16:35:58 +00:00
fduwjj
da51fe1c42 [FR] Fix errors in all2all check, improve some log output (#136399)
We found that we show the hashed pg name in our script output, which is not UX friendly.
Also we found a bug in our all2all check and we made a bunch of changes to error messages to make it better readable.

Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399
Approved by: https://github.com/c-p-i-o
2024-09-23 16:31:31 +00:00
PyTorch MergeBot
df6a8fa1eb Revert "[aotd] Fix freezing API for subclasses (#136265)"
This reverts commit cdef760560.

Reverted https://github.com/pytorch/pytorch/pull/136265 on behalf of https://github.com/atalman due to Breaks internal CI sorry, need to revert ([comment](https://github.com/pytorch/pytorch/pull/136265#issuecomment-2368772574))
2024-09-23 16:25:05 +00:00
Andrew Gu
9992084f38 [FSDP2] Fixed test_all_gather_extensions_monkey_patch (#136130)
I messed up the test before. The extensions were not running :/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136130
Approved by: https://github.com/weifengpy
ghstack dependencies: #136129
2024-09-23 15:12:44 +00:00
Andrew Gu
b9f53c0dce [FSDP2] Added module, mp policy to fsdp_pre_all_gather (#136129)
- Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR.
- Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example.

The major paint point here is how to deal with backward compatibility. For now, we use `signature.inspect` to check if the user subclass follows the old vs. new signature. However, for the new signature, the `param_dtype` in the post-all-gather is redundant, as if the user needed it, the user could save it from the `mp_policy` passed in the pre-all-gather now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129
Approved by: https://github.com/weifengpy
2024-09-23 15:12:36 +00:00
Bin Bao
d21841d077 [AOTI] Create another wrapper class to handle ArrayRef (#136318)
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.

Test Plan: CI

Differential Revision: D62961885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318
Approved by: https://github.com/frank-wei
2024-09-23 15:10:27 +00:00
PyTorch MergeBot
0e19522122 Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)"
This reverts commit 239a9ad65e.

Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](239a9ad65e) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))
2024-09-23 14:52:23 +00:00
Edward Z. Yang
bae427e4b1 Refactor maybe_evaluate_static into a worker function off of ShapeEnv (#135107)
By refactoring this way, I can put a non-expiring LRU cache here.
Splitting also will make it easier for me to tell who is using up all
the time.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135107
Approved by: https://github.com/aorenste
2024-09-23 14:39:20 +00:00
PyTorch MergeBot
e9bfbf78d5 Revert "Allow fx graph caching higher order operators (opt-in) (#135877)"
This reverts commit 66d5eb64e0.

Reverted https://github.com/pytorch/pytorch/pull/135877 on behalf of https://github.com/jeanschmidt due to seems to have introduced regressions on rocm signals ([comment](https://github.com/pytorch/pytorch/pull/135877#issuecomment-2367616653))
2024-09-23 09:04:24 +00:00
cyy
75f141be62 Avoid unnecessary CMake warnings on Windows (#136393)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393
Approved by: https://github.com/ezyang
2024-09-23 06:42:59 +00:00
Yuxin Wu
663e760065 add unittest for OOM message (#129671)
Add unittest for the bug in #123984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671
Approved by: https://github.com/eqy
2024-09-23 04:48:01 +00:00
Yiming Zhou
068fdd602f [export] enable custom tag metadata re-export test (#136048)
Improves and enables a commented out test originally introduced in #131912

In `test_custom_tag_metadata_re_export()`, we check the added "custom" metadata to given nodes is preserved and not copied to other nodes after re-exporting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048
Approved by: https://github.com/zhxchen17
2024-09-23 04:37:58 +00:00
Oguz Ulgen
66d5eb64e0 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-23 04:33:27 +00:00
cyy
a38e4c5e1e Enable clang-tidy warnings on aten/src/ATen/cuda/*.cpp (#134547)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134547
Approved by: https://github.com/ezyang
2024-09-23 03:44:55 +00:00
Isuru Fernando
f276da7f98 Remove prims.slice_in_dim and prims.slice (#136150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150
Approved by: https://github.com/ezyang
2024-09-23 01:27:22 +00:00
Xilun Wu
3406ac24d9 [BE] fix circular import in torch/distributed/utils.py (#136286)
**Summary**
Fix circular import in `torch/distributed/utils.py` found when running internal test, see D62901023. Curious why this wasn't causing any issue. Is this relevant code deprecated and no longer used?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286
Approved by: https://github.com/Skylion007
2024-09-22 20:54:12 +00:00
Shangdi Yu
3bc073d728 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix handle grid_fn  for grid computation. Pass in "RBLOCK" to `split_scan_grid`
-  Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-22 04:51:37 +00:00
Zhou, Lingzhi
35532fc477 [Partitioner] Reuse partition to check whether nodes exist (#135317)
The time complexity of find node whether in NodeList is O(n). Reuse partition to speed up due to partition.nodes is hash table and has same elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-21 23:52:02 +00:00
cyy
e4cdc31227 [14/N] Fix clang-tidy warnings in aten/src/ATen (#133988)
Follows  #133807
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133988
Approved by: https://github.com/ezyang
2024-09-21 22:41:40 +00:00
Bob Ren
9731ccb9e0 Type _dynamo/variables/lazy.py (#136376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136376
Approved by: https://github.com/Skylion007
2024-09-21 22:18:02 +00:00
Jovian Anthony Jaison
09715638ab Add _dynamo.config.suppress_errors logging (#136379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136379
Approved by: https://github.com/ezyang
2024-09-21 21:00:26 +00:00
Aaron Orenstein
3176966732 update cache tests (#136215)
Summary:
- Clean up cache test code a bit.
- Removed patch_fbcode() - it turned out to cause flaky issues (image if it set fbcode=False and then loaded a module for the first time which had a top-level fbcode check).

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D62648248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215
Approved by: https://github.com/bobrenjc93
2024-09-21 20:36:22 +00:00
Ramana Sundararaman
be4b7e8131 Param fixes in docstring (#136097)
Fixes wrong param names in docstrings. cc: @kit1980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136097
Approved by: https://github.com/ezyang
2024-09-21 18:56:34 +00:00
Aaron Gokaslan
b6ffa381e1 [BE]: Add half CUDA support nextafter (#136373)
Making CUDA support match CPU support for nextafter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373
Approved by: https://github.com/ezyang
2024-09-21 17:13:45 +00:00
PyTorch MergeBot
cc17d58809 Revert "S390x update builder image (#132983)"
This reverts commit 080a249fc2.

Reverted https://github.com/pytorch/pytorch/pull/132983 on behalf of https://github.com/atalman due to Authenticate With PUSH is failing. Error: no registries found in registries.conf, a registry must be provided. Error: Process completed with exit code 125. ([comment](https://github.com/pytorch/pytorch/pull/132983#issuecomment-2365249249))
2024-09-21 16:46:54 +00:00
Xuan Zhang
03957efa5d [inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874)
**Motivations**:
A topological order of the scheduler nodes that optimize the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf).

**Solutions**:
1. implement a peak memory estimator via liveness analysis
2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory

**Results**:
On some models we can reduce the peak memory significantly:
|             model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:-----------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| alexnet                       | 128        |         1.17         |       0.99      | 1.19  |
| vgg16                         | 64         |         4.10         |       3.57      | 1.15  |
| DebertaV2ForQuestionAnswering | 1          |         11.60        |      10.56      | 1.10  |

In the presence of compiler based AC, peak memory can be further reduced:
|              model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:------------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| AlbertForMaskedLM              | 4          |         6.87         |       6.43      | 1.07  |
| AlbertForQuestionAnswering     | 4          |         8.69         |       7.76      | 1.12  |
| MobileBertForQuestionAnswering | 128        |         4.67         |       3.90      | 1.20  |

[Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case.

**Other infos:**
* neutral model runtime, because the the reordering happens after fusion. So memory saving is _for free_.
* minimal compile time overhead as the algorithm is linear in the number of edges of the inductor graph. For all hugglingface benchmark models, the additional compile time is less than 1 second.
* no peak memory regression since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memories, but for large models, the working memory should be negligible. We haven't observed any significant regressions on all of our tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874
Approved by: https://github.com/yf225
2024-09-21 16:28:38 +00:00
DavidGu-Datong
fb4670a1f9 fix mean_out: op does not update parameter out for BF16/FP16 dtype on CPU (#135174)
Fixes #134848

For BF16/FP16, when a tensor is specified in `out` parameter of mean, the mean kernel should use its storage for output, but that doesn't happen, since an `at::to` in the current code causes storage to be allocated again, but the `out` parameter tensor's storage doesn't get updated, resulting in it not holding the mean output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174
Approved by: https://github.com/soulitzer
2024-09-21 14:21:42 +00:00
Will Constable
ea737e4e5d [Pipelining] Make PipelineStage support meta initialization (#136243)
Avoid allocating memory or dry-running the submodule during stage init.

Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.

Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.

For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.

Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.

TODO:
- delete 'device' arg from PipelineStage ctor? (move it to inferred from
  args tensors passed to first step call? separate PR.
- delete 'output_args' from PipelineStage ctor? we don't actually need
  it, but we use it to do shape validation, which is why I didn't remove
  it in this PR.  Proposal: leave it until we add lazy shape inference?

Fixes #136225, #136226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243
Approved by: https://github.com/H-Huang, https://github.com/kwen2501
2024-09-21 09:47:22 +00:00