Commit Graph

77878 Commits

PyTorch MergeBot
10c31e96df Revert "[dynamo][itertools] refactor itertools.islice to use polyfill (#133876)"
This reverts commit 7d12e6dceb.

Reverted https://github.com/pytorch/pytorch/pull/133876 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:09 +00:00
Yidi Wu
d261a1751a [HOP] fix export x inline_inbuilt_nn_modules (#133731)
TL;DR: this PR supports exporting cond with the inline_inbuilt_nn_modules flag by inlining into the tracing code in proxy_tensor.py and _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).

We have two special treatments for the following cases:

1. _ModuleStackTracer wraps all the nn modules into _AttrProxy. This _AttrProxy has several subtleties that make it hard to inline in dynamo, such as overriding _modules with a property method and overriding `__getattr__`, which mutates captured states when `__getattr__` is called.

The solution is to unwrap the _AttrProxy and get its corresponding nn module (a 1-1 correspondence), so that dynamo symbolically traces the original nn module instead of the _AttrProxy.

2. The tracer applies a bunch of patches to the `__getattr__` and `__call__` of nn.Module for tracking purposes. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py`, caused by a weak dict in PythonKeyTracer.

The solution is to temporarily remove the patches during dynamo symbolic conversion, so that dynamo has a clean environment. make_fx will then trace the transformed bytecode produced by dynamo and patch the nn modules there instead.
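
A minimal sketch of the export-under-compile pattern this PR supports (illustrative toy module; `make_fx` and its `record_module_stack` flag are real APIs, and the composition follows the TL;DR above):

```
import torch
from torch.fx.experimental.proxy_tensor import make_fx

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin() + 1

# Internally the pattern is make_fx(record_module_stack)(torch.compile(f)):
compiled = torch.compile(M())
gm = make_fx(compiled, record_module_stack=True)(torch.randn(3))
```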

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
2024-08-30 15:58:20 +00:00
Yidi Wu
932c4ca5a0 make make_fx collective test single threaded (#134775)
make_fx is not thread-safe because it mutates and patches global state. Making it thread-safe is difficult and low-ROI, so just turn the tracing test into a single-threaded test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134775
Approved by: https://github.com/yifuwang
2024-08-30 15:58:20 +00:00
eqy
c07e566baf [CUDA][P2P] Check device capability in requires_cuda_p2p_access (#134523)
Tests seem to fail on, e.g., Volta without this, given the compile-time macros used, e.g., in 79b7fff188/torch/csrc/distributed/c10d/intra_node_comm.cu (L487)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134523
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
2024-08-30 14:08:55 +00:00
Joona Havukainen
92f282ca52 Enable batch matmul for result sizes > 2**32 when the tensor can be split along the batch axis (#133430)
Fixes #131865. Addresses the issue seen when running the llama v3.1 8B parameter model on the MPS backend, where the batch matmul output size can exceed the 32-bit indexing limit of MPS tensors, causing an assert.

Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:

```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```

Notably, the current change only works as long as each individual output matrix in the bmm does not exceed 2**32 elements. This lets us split up the computation along the batch axis to avoid going over the limit.
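
A hypothetical sketch of the batch-splitting idea (not the actual MPS kernel change; `chunked_bmm` and `max_elems` are illustrative names):

```
import torch

def chunked_bmm(a, b, max_elems=2**32):
    # Split the bmm along the batch axis so each chunk's output stays
    # under the 32-bit element-indexing limit.
    out_elems_per_batch = a.shape[1] * b.shape[2]
    batches_per_chunk = max(1, max_elems // out_elems_per_batch)
    chunks = [torch.bmm(ac, bc)
              for ac, bc in zip(a.split(batches_per_chunk),
                                b.split(batches_per_chunk))]
    return torch.cat(chunks, dim=0)
```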

Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large for this op to handle, until a more general workaround that tiles the matmuls is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-30 14:08:43 +00:00
wz337
50efbb9f1e [DeviceMesh][Test] Add a unit test for get_local_rank for flattened mesh (#134603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134603
Approved by: https://github.com/fduwjj
ghstack dependencies: #133838, #133839, #134048
2024-08-30 08:13:37 +00:00
Animesh Jain
0f8bec4399 [dynamo] mark_static_nn_module (#134713)
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656

With this API, we can mark the offending module as static in detectron2.

Today's world: user-defined nn module int attributes are treated as automatic dynamic. Use the API in this PR to make them static if you want.

Alternative considered: treat all int attributes of any user-defined nn module class as static, and then introduce an API `torch._dynamo.mark_nn_module_attribute_dynamic`. Defaulting to static is worrying if users have a `counter` in their model that is updated on each forward invocation.
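
A sketch of the intended usage, assuming the API is exposed through `torch._dynamo.mark_static` applied to an nn.Module (hedged: the exact entry point may differ; the module is an illustrative toy):

```
import torch

class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.steps = 0  # int attribute; automatic dynamic by default

    def forward(self, x):
        return x + self.steps

m = Counter()
torch._dynamo.mark_static(m)  # assumption: marks the module's int attributes static
out = torch.compile(m)(torch.randn(2))
```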

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713
Approved by: https://github.com/jansel
ghstack dependencies: #134653
2024-08-30 07:01:06 +00:00
Jason Ansel
a5630239ad [dynamo] Improve minifier error message when fp64 not supported (#134737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134737
Approved by: https://github.com/anijain2305
2024-08-30 06:42:32 +00:00
Ankur Neog
1011e0ae98 Generalize device-specific UTs for dynamo (#130714)
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970, adding the facility to run the content on Intel Gaudi devices.
We intend to extend similar generalization to the rest of the content in test/dynamo, which is currently written to work specifically for CUDA devices. Other devices can build onto it if support is available.

## Changes
- carve out BERT-related content into another class
- use the instantiate_device_type utility to instantiate this class for devices which support the functionality (see the sketch below)
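
A sketch of the generalization pattern (illustrative class and test names; `instantiate_device_type_tests` is the real testing utility):

```
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestDynamoGeneric(TestCase):
    def test_add_one(self, device):
        x = torch.ones(4, device=device)
        self.assertEqual(torch.compile(lambda t: t + 1)(x), x + 1)

# Generates TestDynamoGenericCPU, TestDynamoGenericCUDA, etc., only for
# device types that are actually available.
instantiate_device_type_tests(TestDynamoGeneric, globals())

if __name__ == "__main__":
    run_tests()
```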

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714
Approved by: https://github.com/anijain2305
2024-08-30 05:02:47 +00:00
Animesh Jain
7a694f6683 [justknobs] Override __bool__ method (#134799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134799
Approved by: https://github.com/ezyang
2024-08-30 04:54:02 +00:00
PyTorch UpdateBot
75b86b1554 [executorch hash update] update the pinned executorch hash (#134736)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134736
Approved by: https://github.com/pytorchbot
2024-08-30 04:11:51 +00:00
Jack Taylor
5e8bf29148 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-08-30 03:38:35 +00:00
Xu Han
1f1e2eeb9d [inductor] Install tlparse for test\dynamo\test_structured_trace.py UTs. (#134806)
Install tlparse for test\dynamo\test_structured_trace.py UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134806
Approved by: https://github.com/ezyang
2024-08-30 03:16:03 +00:00
Laith Sakka
0d5f978795 add basic nn modules diff time benchmarks (#134658)
Benchmarks several shapes of basic nn modules, in both eager and inductor:

```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
2024-08-30 02:13:52 +00:00
Xilun Wu
a645a18d2e [reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294

Fixes #131446
Fixes #126852
Fixes #126868
Fixes #126493

The PR was reverted due to a CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294. Therefore this PR also removes the `xfail` mark on this specific test to make the CI signal green.

See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-08-30 02:13:45 +00:00
Chen Haifeng
27ffa67984 Support __class__ attr for tuple and list variables (#134099)
Fixes #134086

This supports the `__class__` attribute for TupleVariable and ListVariable, and allows constructing a tuple or list via the `__class__` attribute. This patch also fixes a bug in NamedTupleVariable, which was missing a return when calling the super class's `var_getattr`.
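
A minimal sketch of the kind of user code this enables under torch.compile (illustrative function):

```
import torch

@torch.compile(fullgraph=True)
def f(x):
    t = (x, x + 1)
    cls = t.__class__          # __class__ on a traced tuple
    return cls((x - 1, x))     # construct a new tuple via __class__

f(torch.randn(3))
```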

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134099
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-08-30 01:57:49 +00:00
Colin L. Rice
cf11fc0dcb dynamo: Only log if we've disabled eval_frame once. (#134529)
This spams logs pretty badly otherwise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134529
Approved by: https://github.com/chuanhaozhuge, https://github.com/oulgen
2024-08-30 00:35:25 +00:00
Ivan Zaitsev
8b68912dfc Correctly detect "Rate limit exceeded" error (#134785)
Currently all 403 errors are treated as "Rate limit exceeded":
https://github.com/pytorch/pytorch/actions/runs/10622019167/job/29445336924

[Github docs](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#exceeding-the-rate-limit) claim:
> If you exceed your primary rate limit, you will receive a 403 or 429 response, and the x-ratelimit-remaining header will be 0. You should not retry your request until after the time specified by the x-ratelimit-reset header.
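
A sketch of a check consistent with that doc (hypothetical helper name; a `requests`-style response object is assumed):

```
def is_rate_limited(response):
    # Per the GitHub docs above: rate limiting is a 403/429 *and*
    # x-ratelimit-remaining == 0, not just any 403.
    return (
        response.status_code in (403, 429)
        and response.headers.get("x-ratelimit-remaining") == "0"
    )
```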

After this change:
https://github.com/pytorch/pytorch/actions/runs/10622365327/job/29446456395

Note: the 403 error in the jobs above is a separate issue; this PR addresses only the logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134785
Approved by: https://github.com/clee2000
2024-08-29 23:58:15 +00:00
Yu, Guangye
3402a5d865 fix windows xpu build issue (#133845)
# Motivation
If XPU is built via oneAPI 2024.2, the build fails because `sycl-preview.lib` exists on Windows, and linking the unexpected lib results in `error LNK2019: unresolved external symbol`.

# Solution
Explicitly use `sycl-preview` in the Linux build only.

# Additional Context
For `find_library`, please note that the variable will not be updated once a result has been stored:
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
2024-08-29 23:53:32 +00:00
leslie-fang-intel
3775fc982d [Inductor][CPP] Fix Index name error (#134645)
**Summary**

Fix the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the 3 test suites (TorchBench, TIMM, HuggingFace) we expect:

* `_node` is a FX Node with target in ["index_expr", "load", "store"]
* `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index`
* `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression

This turns out not to hold in some FB-internal test cases, per the failure log posted in the above link, so add a condition check to work around it.
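
A sketch of the guarded lookup (hypothetical helper; the node targets and arg positions are taken from the bullets above):

```
def get_index_name(node):
    arg_idx = 1 if node.target == "index_expr" else 2
    index_node = node.args[arg_idx]
    if (
        getattr(index_node, "target", None) == "get_index"
        and isinstance(index_node.args[0], str)
    ):
        return index_node.args[0]
    return None  # unexpected structure: fall back instead of erroring
```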

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645
Approved by: https://github.com/jgong5, https://github.com/masnesral
2024-08-29 23:33:15 +00:00
Shuqiang Zhang
d13ce2e2b5 [c10d] release gil lock during eager init (#134779)
Summary:
We found that if we init the ProcessGroup in a background thread, it blocks the main thread until init is complete. This is because we never release the GIL in the pybinding.
Test Plan:
existing CI on eager init

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779
Approved by: https://github.com/c-p-i-o
2024-08-29 23:25:33 +00:00
Lucian Grijincu
71ff168dbb pytorch: llvm_codegen: prefix JIT generated functions with 8B of data so jitted code can be called from ASAN+UBSAN on LLVM17 (llvm/llvm-project#65253) (#134572)
Summary:
A similar workaround was already applied elsewhere in PyTorch: https://github.com/pytorch/pytorch/pull/133623 {D61348865}

LLVM17 UBSAN change discussion https://github.com/llvm/llvm-project/issues/104505

Here we also have to associate the data with the function via `setPrefixData(dummyPrefixData)` to prevent this workaround from being disabled by the `optimize(*module_);` call, which could change the layout, remove the unused variable, etc.

Differential Revision: D61845799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134572
Approved by: https://github.com/atalman
2024-08-29 23:15:13 +00:00
Laith Sakka
496e57283d add add_loop benchmarks (#134652)
This benchmark measures the cost of compiling the following function in eager and inductor; it's basically two benchmarks.

```
        @torch.compile(backend=self.backend, fullgraph=True)
        def f(a, b):
            result = a.clone()
            for i in range(1000):
                if i % 3 == 0:
                    result = result + b
                elif i % 3 == 1:
                    result = result + 8 * b
                else:
                    result = result.sin()
            return result
```

Run with `PYTHONPATH=$(pwd) python benchmarks/add_loop.py out`:

```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407

collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
2024-08-29 23:04:01 +00:00
fduwjj
65864d0134 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the Option from ProcessGroup and asks users to either set the timeout or backend later on, or directly create the backend after creating a PG.

Also, PGNCCL is using the option class from ProcessGroup, but we actually should use the Option from the Backend class. So this PR aligns the type and name with what we are doing on the C++ side. I don't change the signature of the public API, so it still uses args named "pg_options".

We need to make changes to the tests to align them with this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
2024-08-29 22:40:12 +00:00
Zhuoran Zhao
8b4c487581 Fix AOTInductor compilation on ROCm (#134522)
Summary:
The original PR (https://github.com/pytorch/pytorch/pull/124123) was broken by the cpp_builder refactoring, so resubmit it as a fix.

Test Plan: Test with command here: https://www.internalfb.com/phabricator/paste/view/P1549765548

Differential Revision: D61827208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134522
Approved by: https://github.com/frank-wei
2024-08-29 21:59:04 +00:00
Shunting Zhang
1e92d7b688 [inductor] move loop ordering after fusion (#126254)
Restart the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR, since it's hard to rebase. Expect some code to be copied from the previous PR; the main idea is the same.

Previously we saw a relatively large compilation-time increase due to too many loop orders being considered. This PR continues the work by pruning and only considering loop orders that we know for sure are relevant (i.e., doing it on demand).

Some manually created cases where loop ordering matters are added as unit tests. The PR makes sure inductor does not miss fusion opportunities for them.

This PR should solve the not-able-to-fuse problem in https://github.com/pytorch/pytorch/issues/130015

Right now there is still a significant increase in compilation time, so I'll disable the feature by default. Later on, after the compilation-time issue is resolved, I'll enable it by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
2024-08-29 21:50:07 +00:00
min-jean-cho
416a7894fe [Windows][XPU] Disable Kineto PTI on Windows only (#134620)
Disable Kineto + XPU PTI on Windows only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134620
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-29 20:58:55 +00:00
Xuehai Pan
7d12e6dceb [dynamo][itertools] refactor itertools.islice to use polyfill (#133876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779, #133864, #133894
2024-08-29 20:56:16 +00:00
Xuehai Pan
a2566adfb6 [dynamo] refactor builtins.enumerate to use polyfill (#133894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779, #133864
2024-08-29 20:56:16 +00:00
Xuehai Pan
1b70366957 [dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779
2024-08-29 20:56:16 +00:00
Xuehai Pan
eaa449fbf0 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133769, #133778
2024-08-29 20:56:16 +00:00
Xuehai Pan
b5f1ffa7ab [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133769
2024-08-29 20:56:16 +00:00
Xuehai Pan
e09324e7da [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
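
These polyfill PRs share one idea: instead of hand-implementing a builtin's semantics in dynamo's VariableTracker machinery, dynamo traces through an equivalent pure-Python reimplementation. A minimal sketch of such a polyfill (illustrative only, not the exact registration mechanism used by these PRs):

```
def all_polyfill(iterable):
    # Pure-Python equivalent of builtins.all; dynamo can inline and
    # trace this loop instead of special-casing the C builtin.
    for elem in iterable:
        if not elem:
            return False
    return True
```
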
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
2024-08-29 20:56:16 +00:00
drisspg
b977abd5de [Inductor] Fix error checking for scaled_mm lowering (#134765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134765
Approved by: https://github.com/Skylion007
2024-08-29 20:18:42 +00:00
atalman
6180574771 Move py 3.8->3.9 pull, trunk, inductor, periodic CI tests (#133624)
Part of the deprecation of Python 3.8 and the move to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718
Except for XPU and ROCm jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133624
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi
2024-08-29 19:15:59 +00:00
Jason Ansel
202e5cc87d [inductor] Fix error in debug_str_extra (#134747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134747
Approved by: https://github.com/Skylion007, https://github.com/shunting314
2024-08-29 19:09:50 +00:00
Brian Vaughan
43e1df64f8 register all entry_point backends on first attempt (#132546)
fixes: https://github.com/pytorch/pytorch/issues/131360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132546
Approved by: https://github.com/jansel
2024-08-29 18:59:29 +00:00
Ke Wen
5470fcd5b9 [5/N] Reconcile barrier and NaN checker (#134707)
By using a zeros() tensor instead of an empty() tensor.
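
Illustrative of why this matters for the NaN checker: `empty()` returns uninitialized memory, which may happen to contain NaNs, while `zeros()` is guaranteed NaN-free (CPU tensors shown; the PG barrier tensor lives on the device):

```
import torch

t = torch.empty(1024)        # arbitrary bits, possibly NaN
safe = torch.zeros(1024)     # guaranteed finite
```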

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134707
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134345, #134357, #134701
2024-08-29 18:51:12 +00:00
zdevito
d91b49dbaa expandable_segments <-> other allocator options (#134338)
Previously, setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage collection and max_split freeing do not accidentally try to release expandable segments.
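
Illustrative of combining the options discussed above (the env var and option names are real; the values are examples, and the variable must be set before the CUDA allocator initializes):

```
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "expandable_segments:True,"
    "garbage_collection_threshold:0.8,"
    "max_split_size_mb:128"
)

import torch  # import after setting the env var so the allocator sees it
```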

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
2024-08-29 18:43:59 +00:00
Rachel Guo
3fc6e47d42 [AOTI] Fix cosmetic indentation issue in cuda cpp wrapper codegen for DeferredCudaKernelLine/GridLine (#134705)
Summary:
Follow-up fix for D61018114, D61800622.

Increase indentation for the `loadKernel`, `launchKernel`, and `Grid` lines.

Test Plan:
```
TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_unbacked_symbols_abi_compatible_cuda
```
```
TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_backed_symbols_abi_compatible_cuda
```

Differential Revision: D61927248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134705
Approved by: https://github.com/ColinPeppler
2024-08-29 18:38:45 +00:00
Aaron Gokaslan
5573c17877 [BE][Ez]: Update ruff to 0.6.3 (#134769)
Mostly a bugfix release; updating because it fixes an edge case in a rule we are using.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134769
Approved by: https://github.com/albanD
2024-08-29 18:35:47 +00:00
Xintong Hu
ce96146623 [PT2] Fix node metadata setting in group_batch_fusion_aten (#134543)
Summary: The current impl results in `meta` missing fields like `val`; use `FakeTensorProp` to update the information.

Differential Revision: D61832932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134543
Approved by: https://github.com/frank-wei
2024-08-29 18:32:04 +00:00
chilli
348d02a983 Changed masked out rows logsumexp to be -inf and not zero (#134650)
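
Illustrative of the semantics: the logsumexp of a fully masked-out row is the log of an empty sum, i.e. -inf, not 0.

```
import torch

row = torch.full((4,), float("-inf"))  # a fully masked-out row of scores
print(torch.logsumexp(row, dim=0))     # tensor(-inf), not tensor(0.)
```
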
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134650
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng, https://github.com/drisspg
2024-08-29 17:22:52 +00:00
Pian Pawakapan
36a6516290 [export] use single FQN for param_buffer_mapping (#134500)
Fixes #133252

In strict mode, we have this routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least 1 unique FQN for each traced parameter, but this seems to break with parameter reuse when call_module nodes are present. Adding a test case where this breaks.

This fixes it by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine for handling aliasing: https://github.com/pytorch/pytorch/pull/125758
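
A sketch of the FQN-dedup idea described above (hypothetical structure; `mod` is an illustrative module with an aliased parameter):

```
import torch

mod = torch.nn.Linear(2, 2)
mod.alias = mod.weight  # parameter reuse: two FQNs, one tensor

# Map each tensor id to one canonical FQN and reuse it for every alias.
fqn_by_tensor_id = {}
for fqn, param in mod.named_parameters(remove_duplicate=False):
    fqn_by_tensor_id.setdefault(id(param), fqn)
```
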
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
2024-08-29 17:06:31 +00:00
Ke Wen
d9d95dc55e [4/N] Test NaN checker against broadcast (#134701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134701
Approved by: https://github.com/wconstab
ghstack dependencies: #134345, #134357
2024-08-29 17:00:07 +00:00
PyTorch MergeBot
ab646cd805 Revert "[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)"
This reverts commit ba5aec88c6.

Reverted https://github.com/pytorch/pytorch/pull/134509 on behalf of https://github.com/ZainRizvi due to Sorry but this fails internally. For details see D61953754 ([comment](https://github.com/pytorch/pytorch/pull/134509#issuecomment-2318323161))
2024-08-29 16:39:19 +00:00
Ke Wen
26aea277f7 [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA (illegal memory access) hitting the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is no longer needed to work around the IMA.

Also refactored a couple of places where the guard is created: preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
2024-08-29 16:25:27 +00:00
Xu Han
d503217ea4 [inductor] calibration inductor windows uts (15/N) (#134586)
Fix the `test_logs_out` UT on Windows; make all UTs in `test/dynamo/test_logging.py` pass on Windows.

Changes:
1. Close the `NamedTemporaryFile` to release the file handle and avoid a `PermissionError`.
2. To avoid `PermissionError` on auto-delete, create the file with `delete=False` so it is not deleted automatically.
3. Open the log file as "utf-8" to align with Linux.
4. Handle the process-wrapping difference on Windows.
5. Delete the tmp file manually (see the sketch below).
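
A sketch of the Windows-safe tempfile pattern described above (illustrative):

```
import os
import tempfile

f = tempfile.NamedTemporaryFile(mode="w", encoding="utf-8", delete=False)
try:
    f.write("log line\n")
    f.close()  # release the handle so other opens don't hit PermissionError
    with open(f.name, encoding="utf-8") as log:
        print(log.read())
finally:
    os.unlink(f.name)  # delete the tmp file manually
```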

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586
Approved by: https://github.com/jansel
2024-08-29 16:18:40 +00:00
Ke Wen
9953f55f4c [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in the case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 16:13:15 +00:00
Bin Bao
387d3fc296 [AOTI] Switch benchmarking to use export non-strict mode (#130977)
Summary: Switch the export step used by AOTInductor benchmarking from strict to non-strict mode, and switch it from producing Torch IR to ATen IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130977
Approved by: https://github.com/angelayi
ghstack dependencies: #134639
2024-08-29 16:08:52 +00:00