86106 Commits

Author SHA1 Message Date
rzou
c41fbb4f78 Change arg_kwarg_vals propagation strategy (#148046)
Instead of always propagating arg_kwarg_vals in _COPY_META_FIELDS, we
special-case the pattern matcher to propagate arg_kwarg_vals when
it sees triton_kernel_wrapper_functional.

The strategy is:
1) trace out the replacement graph with arg_kwarg_vals (which have accurate eager-mode metadata)
2) trace out the replacement graph with vals (which have the accurate Inductor metadata)
3) Propagate the arg_kwarg_vals from the first graph to the second.
4) Use the second graph as the replacement graph.
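
The four steps above can be sketched with a toy model. This is only an illustration, not the pattern matcher's code: `Node`, `trace_replacement`, and `propagate_meta` are hypothetical names, and nodes are matched by position, which the real implementation may not do.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a graph node that carries a meta dict."""
    name: str
    meta: dict = field(default_factory=dict)


def trace_replacement(names, meta_key, values):
    """Pretend to trace the replacement graph, attaching one kind of metadata."""
    return [Node(n, {meta_key: v}) for n, v in zip(names, values)]


def propagate_meta(src, dst, key):
    """Copy meta[key] from each src node to the dst node at the same position."""
    if len(src) != len(dst):
        raise ValueError("graphs must have the same number of nodes")
    for s, d in zip(src, dst):
        if key in s.meta:
            d.meta[key] = s.meta[key]
    return dst


names = ["x", "triton_kernel_wrapper_functional", "out"]
# 1) trace with arg_kwarg_vals (eager-mode metadata)
eager_graph = trace_replacement(names, "arg_kwarg_vals", ["eager_x", "eager_k", "eager_out"])
# 2) trace with vals (Inductor metadata)
inductor_graph = trace_replacement(names, "val", ["ind_x", "ind_k", "ind_out"])
# 3) propagate from the first graph to the second, 4) use the second graph
replacement = propagate_meta(eager_graph, inductor_graph, "arg_kwarg_vals")
```

After step 3 every node of the second graph carries both kinds of metadata, which is why it can serve as the replacement graph.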

The strategy is this because we want to extend this to handle
auto_functionalized later up in the stack.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148046
Approved by: https://github.com/eellison
2025-04-02 13:17:52 +00:00
Bin Bao
03138733ba [AOTI] Emit Triton kernels as comment (#150188)
Summary: Emit the corresponding Triton kernel code as comment in each call_triton_ wrapper function, for easier debugging.

Differential Revision: [D72178907](https://our.internmc.facebook.com/intern/diff/D72178907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150188
Approved by: https://github.com/yushangdi
2025-04-02 12:41:54 +00:00
Benjamin Glass
75f38dfd4e cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
2025-04-02 09:54:27 +00:00
Boyuan Feng
3f54b14c75 [CUDAGraph] support meta tensor (#150478)
Previously, cudagraph is skipped if the graph contains any meta tensor. However, we should not skip since meta tensor does not have actual computation. This PR fixes the issue.

### Example

```python
import torch

def foobar(x, y):
    return x * 2, y * 3

foo_c = torch.compile(mode="reduce-overhead")(foobar)
t = torch.empty((1, 16, 128, 128), device="meta")
y = torch.rand([64], device="cuda")

eager_out = foobar(t, y)

for _ in range(3):
    compiled_out = foo_c(t, y)
```

Prior to this PR, the above code leads to
```
skipping cudagraphs due to multiple devices: device(type='cuda', index=0), device(type='meta')
```

With this PR, we don't skip.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150478
Approved by: https://github.com/eellison
2025-04-02 07:21:50 +00:00
Sukchul Cho
0da8127f77 Compare device name of profiler dynamically (#150396)
Compare self.use_device of torch.autograd.profiler.profiler with _get_privateuse1_backend_name(), since privateuse1 backend can be renamed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150396
Approved by: https://github.com/sraikund16
2025-04-02 06:06:06 +00:00
Rebecca Chen
c65de03196 Add Any return annotation to __getattr__ methods that return a union of types. (#150204)
Adds an `Any` return type annotation to `__getattr__` methods in `torch/_ops.py` that return a union of types. Attribute access returning a union of types can cause issues downstream because consumers would need to handle all of the possible types to make the type checker happy. This doesn't seem to matter today for mypy, presumably because `Any` is always inferred when a return type annotation is missing, but it still makes explicit what mypy is already doing implicitly.
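
A minimal sketch of the annotation pattern, using a hypothetical `OpNamespace` class rather than the actual classes in `torch/_ops.py`:

```python
from typing import Any, Union


class OpNamespace:
    """Hypothetical namespace whose attributes resolve to different types."""

    def __init__(self) -> None:
        self._table: dict[str, Union[int, str]] = {"add": 1, "name": "aten"}

    def __getattr__(self, key: str) -> Any:
        # Only called when normal attribute lookup fails. Annotating the
        # return as Any (instead of Union[int, str]) means callers do not
        # have to narrow the type before using the result.
        try:
            return self._table[key]
        except KeyError:
            raise AttributeError(key) from None


ns = OpNamespace()
total = ns.add + 1        # accepted by a type checker because ns.add is Any
label = ns.name.upper()   # likewise, no isinstance() check is needed
```

With a `Union[int, str]` return annotation, both of the last two lines would be flagged by a type checker unless the caller narrowed the type first.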

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150204
Approved by: https://github.com/malfet
2025-04-02 05:25:07 +00:00
Nikita Shulga
dee016ceb7 [MPSInductor] Add store_reduce method (#150457)
This restricts the store operation to the 0th thread, which should be much better
(though I don't observe it in the benchmark).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150457
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150452
2025-04-02 05:12:49 +00:00
William Wen
3ac5a499dd [dynamo] add dynamo disable reasons to codebase (#150440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150440
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #150341
2025-04-02 04:26:48 +00:00
William Wen
25eff6e991 [dynamo] add reason field to torch.compiler.disable (#150341)
Implements https://github.com/pytorch/pytorch/issues/146445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150341
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-04-02 04:26:48 +00:00
Mu-Chu Lee
063ea5d669 [AOTInductor] Modify test for Memory tracking for memory-related (#150269)
operations

Summary:
Fix the test for memory tracking. This PR does:
(1) Add tracking before and after for all memory-related operations.
Make sure the operations do indeed capture consumed memory both in CUDA
and in torch's CUDACachingAllocator.
(2) Keep track of the memory reserved by CUDACachingAllocator in
torch and its relationship with global CUDA memory consumption.

Test Plan:
This PR is adding tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150269
Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire
2025-04-02 04:18:18 +00:00
Shivam Raikundalia
5734909f34 [Profiler] Fix Empty C Call Queue (#150370)
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102

Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging.

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
2025-04-02 02:44:50 +00:00
eqy
f09513e515 [CUDA][SymmetricMemory] Interpret empty string as std::nullopt in rendezvous (#149793)
This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`.
e.g.,
9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)

this currently breaks `test_intra_node_comm_all_reduce`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-04-02 02:41:07 +00:00
Animesh Jain
61ebe999cc [invoke_subgraph] Do not cache fake tensors for AOTDispatcher first pass (#150450)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150450
Approved by: https://github.com/zou3519
ghstack dependencies: #150082
2025-04-02 02:31:54 +00:00
Animesh Jain
b060fedfa8 [invoke_subgraph] Support None in the fwd output (#150082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150082
Approved by: https://github.com/zou3519
2025-04-02 02:31:54 +00:00
Rithesh Baradi
0ae75ca2de assert on all_reduce_event only if it's not CPU device. (#150316)
Summary: For CPU based runs, `all_reduce_event` would be None since this is the result of the `all_reduce_stream.record_event()`, which does not do much other than returning None when device type is CPU.
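
A minimal sketch of the guarded check, assuming a hypothetical helper `check_all_reduce_event` and a plain string for the device type (the actual call site is not shown in this message):

```python
from typing import Optional


def check_all_reduce_event(device_type: str, all_reduce_event: Optional[object]) -> None:
    """Require an event only where record_event() is expected to return one."""
    if device_type != "cpu":
        # On non-CPU devices a missing event indicates a real problem.
        assert all_reduce_event is not None, "expected an all-reduce event"
    # On CPU the event is None by construction, so nothing is checked.


check_all_reduce_event("cpu", None)        # passes: None is expected on CPU
check_all_reduce_event("cuda", object())   # passes: an event is present
```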

Test Plan: CI

Differential Revision: D72176406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150316
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/mori360
2025-04-02 01:54:35 +00:00
cyy
e872c38eb3 Remove cppcoreguidelines-pro-type-member-init_fix suppression (#148638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148638
Approved by: https://github.com/zou3519
2025-04-02 01:33:20 +00:00
vasiliy
c974b5322a enable torch.compile for torch._scaled_mm nvfp4 recipe (#150462)
Summary:

Updates the meta registration for `torch._scaled_mm` to work for the
nvfp4 recipe.

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_blockwise_nvfp4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150462
Approved by: https://github.com/eellison
2025-04-02 01:08:40 +00:00
Nikita Shulga
ee97299961 [MPS][Testing] Benchmark reduction ops (#150452)
This compares eager vs. compile.
On my M4 Pro mini I'm getting the following now:
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      sum (torch.float32)  |      121.0      |       201.5       |       130.3       |        772.3        |       179.4       |        1470.5       |        476.1      |        2980.0
      max (torch.float32)  |      154.1      |       165.9       |       198.7       |        211.6        |       344.2       |         386.9       |       1326.6      |        1345.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150452
Approved by: https://github.com/dcci, https://github.com/manuelcandales
2025-04-02 01:06:27 +00:00
tvukovic-amd
db32093192 [ROCm][Windows] Fix torchvision build with ROCm 6.4 on windows (#150180)
Since hipcc files and calls are restructured in HIP SDK 6.4, a case for calling hipcc.exe is added for building torchvision with HIP SDK 6.4 on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150180
Approved by: https://github.com/malfet, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-02 00:35:47 +00:00
Junjie Wang (PyTorch)
d22e3d5efe [fr] Add logger config for flight record in PGNCCL (#150356)
Summary: We want to move from scuba-based direct logging to logger-config-based logging. Most changes are internal, but we need to change the exception to exception_msg.

Test Plan: Following https://www.internalfb.com/wiki/Server_Logging/Getting_Started_with_Logging/Onboarding_Existing_Scribe-Based_Logging_(Alpha)/ to test it.

Differential Revision: D72198171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150356
Approved by: https://github.com/fegin
2025-04-01 23:54:07 +00:00
Tristan Rice
6aea4d90fb gloo: use shared Stores (#150230)
Summary:
X-link: https://github.com/facebookincubator/gloo/pull/423

This modifies `connectFullMesh` to take in a shared_ptr<IStore> instead of a reference. This is an API breaking change but fairly easy to work around.

To have backwards compatibility in PyTorch during the commit phase we add a new ifdef `GLOO_SHARED_STORE` which can provide backwards compatibility until we update the pinned Gloo version in pytorch OSS repo.

This also adds a new `wait_get` method to `IStore` which will allow us to do a more efficient operation in PyTorch TCPStore. PyTorch's `Store::get` automatically waits so we want to make sure we can avoid waiting twice to reduce network traffic.

This change will land simultaneously in PyTorch and Gloo repos.

Test Plan:
```
buck2 test //gloo/... //caffe2/caffe2/contrib/gloo:
```

Differential Revision: D72084111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150230
Approved by: https://github.com/fduwjj
2025-04-01 23:37:25 +00:00
Nick Riasanovsky
4934a83347 [AMD] [TRITON] [INDUCTOR] Add tl.assume to enable bufferops on AMD (#150373)
Summary: Update the GEMM template to include the necessary `tl.assume` annotations to enable bufferops with AMD.

Test Plan: Tested manually with a simple matmul run with torch.compile(f, mode="max-autotune") and the environment variables TRITON_ALWAYS_COMPILE=1 AMDGCN_ENABLE_DUMP=1 AMDGCN_USE_BUFFER_OPS=1.
Inspecting the generated AMDGCN all loads/stores use bufferops.
Note: Since inductor is loading constants for many of the shape values assumes are generally not needed for the stride/shape information, but pid calculations are generally a gap in Triton's inference capability.

Differential Revision: D71922698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150373
Approved by: https://github.com/eellison
2025-04-01 23:29:39 +00:00
angelayi
60fe0922f6 [pytree] Register normal class to register_dataclass (#147752)
Fixes https://github.com/pytorch/pytorch/pull/147532#discussion_r1964365330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147752
Approved by: https://github.com/zou3519
2025-04-01 23:28:20 +00:00
PyTorch MergeBot
203a27e0ce Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 8f7fbe3d7d.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/clee2000 due to reverted internally by D72140190 ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2770874244))
2025-04-01 23:07:28 +00:00
Will Feng
80ab233786 [Inductor] Hide reinplace_fsdp_all_gather pass behind skip_fsdp_hooks config (#150436)
The `reinplace_fsdp_all_gather` pass is currently only for Traceable FSDP2 and doesn't work together with SimpleFSDP. We should hide the pass behind `skip_fsdp_hooks` config which makes it only apply to Traceable FSDP2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150436
Approved by: https://github.com/BoyuanFeng
2025-04-01 22:56:06 +00:00
PyTorch MergeBot
9458460211 Revert "if blaslt fails, fall back to blas (#150147)"
This reverts commit 65139eb050.

Reverted https://github.com/pytorch/pytorch/pull/150147 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150147#issuecomment-2770847320))
2025-04-01 22:52:22 +00:00
PyTorch MergeBot
76e1b3ba4c Revert "[ROCm] use correct workspace for hipblaslt, silence warning (#150227)"
This reverts commit c158eac0de.

Reverted https://github.com/pytorch/pytorch/pull/150227 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150227#issuecomment-2770827563))
2025-04-01 22:31:13 +00:00
henrylhtsang
629c1bd2dd [ez][inductor][tests] Skip triton backend only for CPU tests (#150343)
Motivation: to unblock https://github.com/pytorch/pytorch/pull/148622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150343
Approved by: https://github.com/chenyang78
2025-04-01 22:03:48 +00:00
Avik Chaudhuri
b70d105c77 infer dynamic shapes through additional inputs (#150144)
Summary:
Instead of explicitly specifying dynamic shapes, it is possible to infer them from additional example inputs. Together with the example inputs provided to export, we can basically make any varying dim dynamic and keep any fixed dim static. This should be useful for prod scenarios that have access to tests and/or profiling data, yet are somewhat removed from the model authoring process.

However this alone is not satisfactory: the exported program by design has only one graph, representing one path through the model, and we cannot necessarily guarantee that this graph works for the additional example inputs because different guards might have been created if we had exported with them instead (corresponding to different traced paths). However, checking that the additional example inputs satisfy the guards created by the original export should be sufficient for generalization.

Now, while we don't preserve all guards in the exported program, we do check a subset of them as part of input matching. So we add a verification step at the end of export when such additional example inputs are provided. This should be enough for now.
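
The idea can be illustrated on plain shape tuples. This is a toy sketch, not the export implementation: `infer_dynamic_dims` and `satisfies_guards` are hypothetical helpers, and the guards are made-up predicates standing in for the ones created by the original export.

```python
def infer_dynamic_dims(example_shapes):
    """Return one bool per dimension: True if its size varies across examples."""
    first = example_shapes[0]
    if any(len(s) != len(first) for s in example_shapes):
        raise ValueError("all examples must have the same rank")
    return [len({s[i] for s in example_shapes}) > 1 for i in range(len(first))]


def satisfies_guards(shape, guards):
    """Check one example shape against every guard predicate."""
    return all(g(shape) for g in guards)


# The first shape plays the role of the example input given to export;
# the rest are the additional example inputs.
shapes = [(4, 16, 128), (8, 16, 128), (2, 16, 128)]
dynamic = infer_dynamic_dims(shapes)  # dim 0 varies, dims 1 and 2 are fixed

# Hypothetical guards produced while tracing with the first example.
guards = [lambda s: s[0] >= 2, lambda s: s[1] == 16]
verified = all(satisfies_guards(s, guards) for s in shapes[1:])
```

The second half mirrors the verification step: the additional inputs are accepted only if they satisfy the guards of the single traced graph.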

Test Plan: added test (positive and negative cases)

Differential Revision: D72001771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150144
Approved by: https://github.com/bobrenjc93
2025-04-01 21:13:39 +00:00
Michael Lazos
0d44a8aea1 [Hierarchical Compile] Apply deduplication after output node creation (#150306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150306
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303, #150304, #150305
2025-04-01 20:54:18 +00:00
Michael Lazos
8740ffa760 [Hierarchical Compile] Add cycle detection to graph region expansion (#150305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150305
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303, #150304
2025-04-01 20:54:18 +00:00
Michael Lazos
a2300aff94 [Hierarchical Compile] Add cycle detection function for debug (#150304)
Remove print

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150304
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303
2025-04-01 20:54:10 +00:00
Michael Lazos
99fd96c10b [Hierarchical Compile] Remove spammy debug log (#150303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150303
Approved by: https://github.com/williamwen42
2025-04-01 20:54:03 +00:00
atalman
295162ec3a Smoke Test - disable pypi package validation for binaries that package cuda libs (#150194)
Smoke Test - disable pypi package validation for binaries that package cuda libs. These binaries do not install packages via pypi.
Should Resolve this from `linux-binary-manywheel / manywheel-py3_11-cuda12_6-full-test / test`:
```
Traceback (most recent call last):
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 468, in <module>
    main()
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 462, in main
    smoke_test_cuda(
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 274, in smoke_test_cuda
    compare_pypi_to_torch_versions(
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 220, in compare_pypi_to_torch_versions
    raise RuntimeError(f"Can't find {package} in PyPI for Torch: {torch_version}")
RuntimeError: Can't find cudnn in PyPI for Torch: 9.5.1
```
Link: https://github.com/pytorch/pytorch/actions/runs/14101221665/job/39505479587#step:15:982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150194
Approved by: https://github.com/ZainRizvi
2025-04-01 19:18:44 +00:00
Tianyu Liu
d2ad9aa2f2 [dtensor][tp] add a ParallelStyle PrepareModuleInputOutput (#150372)
Needed this class because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied at the same time.

The `PrepareModuleInputOutput` in this PR initializes two variables `prepare_module_input` and `prepare_module_output` and uses them to process module / inputs / outputs.

I had another implementation which put all code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit the monolithic `PrepareModuleInputOutput`. But it is
1. less clean
2. conceptually abusing inheritance because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa
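
The composition approach can be sketched as follows. The classes here are simplified, hypothetical stand-ins (no DTensor or module hooks involved); only the attribute names `prepare_module_input` and `prepare_module_output` come from the description above.

```python
class PrepareInput:
    """Toy stand-in for an input-preparing style."""

    def __init__(self, fn):
        self.fn = fn

    def apply(self, inputs):
        return self.fn(inputs)


class PrepareOutput:
    """Toy stand-in for an output-preparing style."""

    def __init__(self, fn):
        self.fn = fn

    def apply(self, outputs):
        return self.fn(outputs)


class PrepareInputOutput:
    """Holds one of each helper instead of inheriting from either."""

    def __init__(self, input_fn, output_fn):
        self.prepare_module_input = PrepareInput(input_fn)
        self.prepare_module_output = PrepareOutput(output_fn)

    def run(self, module, inputs):
        prepared = self.prepare_module_input.apply(inputs)
        return self.prepare_module_output.apply(module(prepared))


style = PrepareInputOutput(lambda x: x + 1, lambda y: y * 10)
result = style.run(lambda v: v * 2, 3)  # ((3 + 1) * 2) * 10
```

Because the two helpers are separate objects, neither can reach the other's methods, which is the inheritance concern raised in point 2.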

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372
Approved by: https://github.com/wanchaol
2025-04-01 19:15:43 +00:00
Tianyu Liu
5d6ac2dced [dtensor] add op support for select_backward and slice_backward (#150357)
Inheriting and rebasing @awgu 's PR https://github.com/pytorch/pytorch/pull/149071
- fixed an issue for `select_backward` and an issue for `slice_backward`
- removed `_experimental_ops.py` as it becomes empty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150357
Approved by: https://github.com/awgu, https://github.com/XilunWu
2025-04-01 19:15:25 +00:00
IvanKobzarev
a37afd23fa [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)
(benchmark for 1 call)

Before:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

After:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555
Approved by: https://github.com/zou3519
2025-04-01 18:45:48 +00:00
Ethan Wee
78300c8205 [ROCm] update test buffer fudge factor for hipblaslt (#150348)
The default workspace for hipblaslt is larger than for cublas/cublaslt which requires a slight increase to the buffer needed.

Forward-fix for #150227 that broke ROCm distributed tests but wasn't part of initial CI signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150348
Approved by: https://github.com/jeffdaily
2025-04-01 18:31:25 +00:00
Jason Ansel
37ebb0b56a [inductor] Fix inductor windows linker error (#150256)
Fixes #149889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150256
Approved by: https://github.com/anijain2305, https://github.com/eellison
2025-04-01 18:30:55 +00:00
eellison
15dbad2115 Update torch.compile issue template (#150192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150192
Approved by: https://github.com/malfet
ghstack dependencies: #149947
2025-04-01 18:16:16 +00:00
PyTorch MergeBot
f04cf13bdd Revert "Merge Triton ScaledMM as epilogue to MM template (#150045)"
This reverts commit 981048854d.

Reverted https://github.com/pytorch/pytorch/pull/150045 on behalf of https://github.com/PaulZhang12 due to Need to add PR 150415 fixes for internal merge ([comment](https://github.com/pytorch/pytorch/pull/150045#issuecomment-2770252452))
2025-04-01 17:54:28 +00:00
Will Feng
b0c560ef2a [dynamo][hooks] use wrap_top_frame config for functions (#150209)
When torch.compile is applied to a module via `mod.compile(...)`, it's equivalent to `torch.compile(mod._call_impl)` which takes a different path than `OptimizedModule`. This PR ensures that the `wrap_top_frame` config can also take effect for the `torch.compile(mod._call_impl)` use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150209
Approved by: https://github.com/anijain2305
2025-04-01 17:41:23 +00:00
Nikita Shulga
48af2cdd27 [BE] Move all lint runner to 24.04 (#150427)
As Ubuntu-20 reached EOL on Apr 1st, see https://github.com/actions/runner-images/issues/11101
This forces older python version to be 3.8
Delete all linux-20.04 runners from the lintrunner.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150427
Approved by: https://github.com/seemethere
2025-04-01 17:33:15 +00:00
Xia, Weiwen
3b0cd9b542 [Quant][PT2E] add a lowering pass for x86 backend (#149708)
**Summary**
This PR adds a lowering pass for x86 backend
- Patterns of `dequantize -> conv/linear (-> quantize)` are fused to corresponding quantized onednn ops.
- Weights are prepacked ahead of time.
- Post ops of conv/linear are fused if supported.
- The pass returns a `GraphModule` with the modifications mentioned above.

**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_lowering_to_x86
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149708
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
2025-04-01 17:32:41 +00:00
Catherine Lee
783f045c4f [ez] Remove dead lite interpreter CI code (#150424)
There are no lite-interpreter build environments in CI

I assume every mac build is arm64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150424
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-04-01 17:14:32 +00:00
Catherine Lee
a17ee8181a [CI] Fix log artifact not containing test logs attempt 2 (#150234)
Fixes #ISSUE_NUMBER
Take two of https://github.com/pytorch/pytorch/pull/149577 since it didn't work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150234
Approved by: https://github.com/malfet, https://github.com/seemethere
2025-04-01 17:13:58 +00:00
Nikita Shulga
f94ac263af [MPSInductor] Fix neg for unsigned types (#150412)
By more-or-less copy-n-pasting the fix from https://github.com/pytorch/pytorch/pull/94035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150412
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150382, #150386
2025-04-01 16:52:41 +00:00
Xuehai Pan
ae74ef9d53 Set proper LD_LIBRARY_PATH on Linux in nightly venv in nightly pull tool (#143262)
Before this change:

```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.12"
$ source venv/bin/activate
$ python3 -c 'import torch'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/PanXuehai/Projects/pytorch/torch/__init__.py", line 379, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory
```

This PR adds `site-packages/nvidia/**/lib` to `LD_LIBRARY_PATH` in the `venv/bin/activate` script so that NVIDIA PyPI packages can be loaded correctly.
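
A rough sketch of how such a path could be assembled, written in Python for illustration. The helper names are hypothetical and this is not the nightly pull tool's code, which edits the activate script itself.

```python
import os
from pathlib import Path


def nvidia_lib_dirs(site_packages: Path) -> list[str]:
    """Collect site-packages/nvidia/**/lib directories, sorted for determinism."""
    root = site_packages / "nvidia"
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.glob("**/lib") if p.is_dir())


def extend_ld_library_path(site_packages: Path, current: str = "") -> str:
    """Prepend the NVIDIA lib directories to an existing LD_LIBRARY_PATH value."""
    parts = nvidia_lib_dirs(site_packages)
    if current:
        parts.append(current)
    return os.pathsep.join(parts)
```

Usage would look like `extend_ld_library_path(Path("venv/lib/python3.12/site-packages"), os.environ.get("LD_LIBRARY_PATH", ""))`, where the site-packages location is an assumption about the venv layout.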

See also:

- #141837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143262
Approved by: https://github.com/malfet
2025-04-01 16:51:02 +00:00
Sriram Kumar
a19b667bca [ROCm] Update CUDAPluggableAllocator.h (#1984) (#150010)
Alter the flag to use the correct streamType in the CUDAPluggableAllocator class for ROCm GPUs. The flag TORCH_HIP_VERSION does not work for ROCm as intended, so it is replaced with USE_ROCM. This impacts Distributed Fused Adam in ROCm/apex when using the nccl_ub feature. This has been tested with ROCm/apex.

See PR https://github.com/ROCm/apex/pull/184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150010
Approved by: https://github.com/jeffdaily
2025-04-01 16:49:03 +00:00
Ke Wen
35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels.
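
The lifetime-management part of items 2 to 4 can be pictured with a toy model. This is a pure-Python sketch under simplifying assumptions, not the `ProcessGroupNCCL` implementation: `Work` and `launch_collective` are hypothetical, and `TensorShelf` only borrows the name mentioned in the squashed contents below.

```python
import threading


class TensorShelf:
    """Toy shelf: holds references so buffers are not recycled while in use."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def stash(self, tensors):
        with self._lock:
            self._items.extend(tensors)

    def unstash(self):
        with self._lock:
            self._items.clear()

    def __len__(self):
        with self._lock:
            return len(self._items)


class Work:
    """Hypothetical handle returned for async collectives."""

    def __init__(self, shelf):
        self.shelf = shelf

    def wait(self):
        # Releasing the references is what lets the buffers be reused.
        self.shelf.unstash()


def launch_collective(tensors, async_op):
    if not async_op:
        # Runs on the current stream, so no extra lifetime management here.
        return None
    shelf = TensorShelf()
    shelf.stash(tensors)
    return Work(shelf)


work = launch_collective([bytearray(4)], async_op=True)
held_before = len(work.shelf)
work.wait()
held_after = len(work.shelf)
```

In this model the watchdog safety net of item 4 would simply be another caller of `unstash()` once completion is detected.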

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU/GPU time from **230us to 195us, i.e. by 15%**

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00