TLDR; this PR supports exporting cond in combination with the inline_inbuilt_nn_modules flag by inlining into the tracing code in proxy_tensor.py and _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).
We apply two special treatments for the following cases:
1. _ModuleStackTracer wraps all nn modules into _AttrProxy. This _AttrProxy has several subtleties that make it hard to inline in dynamo, e.g. it overrides _modules with a property method and overrides `__getattr__`, which mutates captured state when called.
The solution is to unwrap the _AttrProxy and get its corresponding nn module (a 1-1 correspondence), so that dynamo symbolically traces the original nn module instead of the _AttrProxy.
2. The tracer applies a bunch of patches to `__getattr__` and `__call__` of nn.Module for tracking purposes. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py`, caused by a weakdict in PythonKeyTracer.
The solution is to temporarily remove these patches during dynamo symbolic convert so that dynamo has a clean environment; make_fx then traces the transformed bytecode produced by dynamo and patches the nn modules there instead.
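A minimal sketch of the pattern described above (the module, shapes, and the use of `torch.cond` are illustrative; running it end-to-end assumes this PR's changes and the `inline_inbuilt_nn_modules` config):
```
import torch
import torch._dynamo
from torch.fx.experimental.proxy_tensor import make_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.cond(x.sum() > 0, lambda x: self.lin(x), lambda x: x.sin(), (x,))

with torch._dynamo.config.patch(inline_inbuilt_nn_modules=True):
    # make_fx(record_module_stack)(torch.compile(f)) pattern
    compiled = torch.compile(M(), fullgraph=True)
    gm = make_fx(compiled, record_module_stack=True)(torch.randn(2, 4))
print(gm.graph)
```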
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
Fixes #131865. Addresses the issue seen when running the llama v3.1 8B parameter model on the MPS backend, where the batch matmul output size can exceed the 32-bit indexing limit of MPS tensors, causing an assert.
Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:
```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32, device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```
Notably, the current change only works as long as each individual output matrix in the bmm does not exceed 2**32 elements; this lets us split the computation along the batch axis to stay under the limit.
Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large for this op, until a more general workaround that tiles the matmuls is available.
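For illustration only (not the actual MPS kernel change), the batch-axis split is conceptually equivalent to:
```
import torch

def bmm_in_batch_chunks(a, b, chunk_size):
    # Run bmm on slices of the batch dimension so each partial result stays
    # under the 32-bit element limit, then concatenate along the batch axis.
    outs = [torch.bmm(a[i:i + chunk_size], b[i:i + chunk_size])
            for i in range(0, a.shape[0], chunk_size)]
    return torch.cat(outs, dim=0)
```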
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656
With this API, we can mark the offending module as static in detectron2.
Today's world: user-defined nn module int attributes are considered automatic dynamic. Use the API in this PR to make them static if you want.
Alternative considered: treat all int attributes of any user-defined nn module class as static, and introduce an API `torch._dynamo.mark_nn_module_attribute_dynamic`. Defaulting to static is worrying if users have a `counter` in their model that is updated on each forward invocation.
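A hedged usage sketch, assuming the entry point is `torch._dynamo.mark_static` applied to the module; the exact API surface of this PR may differ:
```
import torch
import torch._dynamo

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.window = 8  # int attribute that would otherwise become automatic dynamic

    def forward(self, x):
        return x[..., : self.window]

mod = Block()
torch._dynamo.mark_static(mod)  # assumed entry point; marks the module's int attributes static
out = torch.compile(mod)(torch.randn(2, 16))
```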
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713
Approved by: https://github.com/jansel
ghstack dependencies: #134653
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970, adding the facility to run the content on Intel Gaudi devices.
We intend to extend similar generalization to the rest of the content in test/dynamo, which is currently written to work specifically on CUDA devices. Other devices can build on this if support is available.
## Changes
- Carve out BERT-related content into a separate class.
- Use the instantiate_device_type utility to instantiate this class for devices which support the functionality (see the sketch below).
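A hedged sketch of the instantiation pattern referred to above, assuming the standard `instantiate_device_type_tests` helper; class and test names are illustrative:
```
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestBertBenchmarks(TestCase):
    def test_bert_forward(self, device):
        # `device` is e.g. "cuda" or "hpu" depending on the instantiated variant
        ...

# Generates per-device classes (e.g. TestBertBenchmarksCUDA) only for devices
# that support the functionality.
instantiate_device_type_tests(TestBertBenchmarks, globals())

if __name__ == "__main__":
    run_tests()
```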
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714
Approved by: https://github.com/anijain2305
Benchmarks several shapes of basic nn modules, in both eager and inductor.
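For context, a hypothetical shape of one of the benchmarked modules, with the name inferred from the log below (the actual benchmark definitions may differ):
```
import torch

class ListOfLinears(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(16, 16) for _ in range(20)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

compiled = torch.compile(ListOfLinears(), backend="inductor")
compiled(torch.randn(8, 16))
```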
```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
**Summary**
Reland of https://github.com/pytorch/pytorch/pull/134294. Fixes #131446, #126852, #126868, #126493.
The PR was reverted due to a red CI signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294, so this PR also removes the `xfail` mark on that specific test to make the CI signal green.
See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
# Motivation
If building XPU via oneAPI 2024.2, the build fails on Windows because `sycl-preview.lib` exists there, and linking this unexpected lib results in `error LNK2019: unresolved external symbol`.
# Solution
Explicitly use `sycl-preview` in the Linux build only.
# Additional Context
For `find_library`, note that the result variable is not updated once it has been stored:
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
**Summary**
Fix the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the 3 test suites (TorchBench, TIMM, HuggingFace) we expect:
* `_node` is a FX Node with target in ["index_expr", "load", "store"]
* `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index`
* `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression
This does not seem to hold in some FB-internal test cases, per the failure log posted in the link above. So, add a condition check to work around it (sketched below).
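A hedged sketch of the condition check described by the bullets above; the helper name and exact guards are illustrative, not the actual inductor code:
```
def _try_get_index_name(_node):
    # Return the name of the index expression only when the node matches the
    # expected pattern; otherwise return None instead of failing.
    if _node.target not in ("index_expr", "load", "store"):
        return None
    arg_pos = 1 if _node.target == "index_expr" else 2
    if arg_pos >= len(_node.args):
        return None
    get_index_node = _node.args[arg_pos]
    if getattr(get_index_node, "target", None) != "get_index":
        return None
    name = get_index_node.args[0]
    return name if isinstance(name, str) else None
```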
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645
Approved by: https://github.com/jgong5, https://github.com/masnesral
Summary:
We found that if we init the PG in a background thread, it blocks the main thread until init is complete. This is because the pybinding never releases the GIL.
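An illustration of the scenario, assuming an environment with NCCL and the usual rendezvous env vars set (`device_id` triggers eager init); sketch only:
```
import threading
import torch
import torch.distributed as dist

def init_pg():
    # Before this fix, the pybinding held the GIL for the duration of eager
    # init, so this background initialization also blocked the main thread.
    dist.init_process_group("nccl", device_id=torch.device("cuda:0"))

t = threading.Thread(target=init_pg)
t.start()
# ... the main thread should stay responsive while the PG initializes ...
t.join()
```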
Test Plan:
existing CI on eager init
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779
Approved by: https://github.com/c-p-i-o
This benchmark measures the cost of compiling the following function in eager and inductor.
It's basically two benchmarks.
```
@torch.compile(backend=self.backend, fullgraph=True)
def f(a, b):
result = a.clone()
for i in range(1000):
if i % 3 == 0:
result = result + b
elif i % 3 == 1:
result = result + 8 * b
else:
result = result.sin()
return result
```
PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up that cleans up the options of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create a backend after creating a PG.
Also, PGNCCL is using the Options class from ProcessGroup, but we actually should use Options from the Backend class, so this PR aligns the type and name with what we are doing on the C++ side. I don't change the signature of the public API, so it still uses args named "pg_options".
We need to change the tests to align them with this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
Restart the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR, since it's hard to rebase. Expect some code to be copy/pasted from the previous PR; the main idea is the same.
Previously we saw a relatively large compilation-time increase because too many loop orders were considered. This PR continues the work by pruning and only considering loop orders that we know for sure are relevant (i.e., doing it on demand).
Some manually created cases where loop ordering matters are added as unit tests. The PR makes sure inductor does not miss fusion opportunities for them.
This PR should solve the unable-to-fuse problem in https://github.com/pytorch/pytorch/issues/130015.
Right now there is still a significant increase in compilation time, so I'll disable the feature by default. Later on, after the compilation-time issue is resolved, I'll enable it by default.
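A hedged usage sketch, assuming the feature is gated behind the inductor config flag below (flag name and example workload are assumptions):
```
import torch
import torch._inductor.config as inductor_config

# Off by default for now due to the compile-time overhead mentioned above.
inductor_config.loop_ordering_after_fusion = True

@torch.compile
def f(x):
    y = x.transpose(0, 1).contiguous()
    return y.sum(dim=-1)

f(torch.randn(1024, 2048, device="cuda"))
```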
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
Previously, setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures that garbage collection and max_split freeing do not accidentally try to release expandable segments.
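For reference, the configuration combination in question can be set like this; it must happen before the CUDA caching allocator is initialized (the specific threshold and split-size values are arbitrary):
```
import os

# Combination in question: expandable segments plus GC threshold / max split
# size; previously this could trip allocator asserts under memory pressure.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "expandable_segments:True,"
    "garbage_collection_threshold:0.8,"
    "max_split_size_mb:128"
)

import torch  # import after setting the env var so the allocator picks it up
```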
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
Fixes #133252
In strict mode, we have a routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least one unique FQN for each traced parameter, but this breaks with parameter reuse when call_module nodes are present. This PR adds a test case where this breaks.
Fixes this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine for handling aliasing: https://github.com/pytorch/pytorch/pull/125758
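A minimal illustration of the parameter-reuse pattern at play (hypothetical module; the real failing case is covered by the added test):
```
import torch

class Shared(torch.nn.Module):
    def __init__(self):
        super().__init__()
        lin = torch.nn.Linear(4, 4)
        self.a = lin
        self.b = lin  # the same parameters are reachable under two FQNs (a.* and b.*)

    def forward(self, x):
        return self.b(self.a(x))

ep = torch.export.export(Shared(), (torch.randn(2, 4),))
print(list(ep.state_dict.keys()))
```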
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA hit by the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.
With this fix, `torch.cuda.set_device(device)` is no longer needed to work around the IMA.
Also refactored a couple of places where the guard is created; preferably we create the guard with a known device rather than setting the device later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
Fix the `test_logs_out` UT on Windows, making all UTs in `test/dynamo/test_logging.py` pass on Windows.
Changes:
1. Close the `NamedTemporaryFile` to release the file handle and avoid a `PermissionError`.
2. Create the temp file with `delete=False` so it is not auto-deleted while still in use.
3. Open the log file as "utf-8" to align with Linux.
4. Handle the process-wrapping difference on Windows.
5. Delete the tmp file manually (see the sketch below).
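A hedged sketch of the Windows-friendly temp-file handling described above, using only the standard library (not the exact test code):
```
import os
import tempfile

# delete=False so the file is not auto-deleted; close the handle explicitly so
# reopening it on Windows does not hit a PermissionError.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False, encoding="utf-8")
try:
    tmp.write("example log line\n")
    tmp.close()
    with open(tmp.name, encoding="utf-8") as f:  # utf-8 to align with Linux
        contents = f.read()
finally:
    os.unlink(tmp.name)  # delete the tmp file manually
```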
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586
Approved by: https://github.com/jansel