Commit Graph

81462 Commits

Mark Saroufim
f3d16ec76f Add doc preview command (#141590)
Convenience improvements for building the PyTorch docs:
1. The build docs weren't clear that `make html` is the main command intended to be run
2. Once you run `make html` you need to view the result; spinning up a simple HTTP server seems like the simplest solution, so this adds a `make serve` command

Usage

```shell
numpy ❯ make serve PORT=8080 # Add port optionally
Serving HTTP on :: port 8080 (http://[::]:8080/) ...
::1 - - [26/Nov/2024 10:05:41] "GET / HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/copybutton.css HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/katex-math.css HTTP/1.1" 200 -
```

![Screenshot 2024-11-26 at 10 05 46 AM](https://github.com/user-attachments/assets/3b275c33-1515-4e21-b540-f5a68c8a8e55)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141590
Approved by: https://github.com/svekars, https://github.com/malfet
2024-11-26 21:56:54 +00:00
PyTorch MergeBot
65dbd5cc2d Revert "[Inductor] Inplacing with Donated Buffer (#140113)"
This reverts commit eecc8e362c.

Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to breaking test_donated_buffer_inplace internally, since `donated_buffer = False if is_fbcode() else True` ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))
2024-11-26 21:20:59 +00:00
Joel Schlosser
869d629c0f Forward / backward NJT support for several activation functions (#140736)
Several activation functions were unimplemented due to missing `pointwise` tags. This PR adds them and corresponding backwards implementations.
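A hedged sketch of the newly supported behavior (softshrink is one of the activations whose derivative formula is involved here; exact coverage is per this PR's tests):

```python
import torch
import torch.nn.functional as F

# Build a nested jagged tensor (NJT) that requires grad.
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)],
    layout=torch.jagged,
    requires_grad=True,
)
out = F.softshrink(nt)          # forward now works on NJT
out.values().sum().backward()   # and so does backward
```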
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140736
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #141500
2024-11-26 21:19:58 +00:00
Tristan Rice
9f4f061f89 PyProcessGroup: support rank, world size, group name/desc overrides (#141529)
This improves `PyProcessGroup` so you can override rank, world size and group name/desc methods from Python. These will be needed to support resizable process groups in torchft.
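A minimal sketch (assumed shape, following the existing subclassing pattern in test_c10d_pypg.py) of overriding these queries from Python:

```python
import torch.distributed as dist

class ResizablePG(dist.ProcessGroup):
    def __init__(self, rank: int, world_size: int) -> None:
        super().__init__(rank, world_size)
        self._rank = rank
        self._world_size = world_size

    def rank(self) -> int:   # now overridable from Python
        return self._rank

    def size(self) -> int:   # now overridable from Python
        return self._world_size

    # Group name/desc overrides follow the same pattern; exact
    # method names are per this PR.
```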

This also has some small fixes in test_c10d_pypg.py to use threads instead of processes, which speeds up test execution by ~10x.

Test plan:

```
pytest test/distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141529
Approved by: https://github.com/fegin
2024-11-26 20:56:57 +00:00
Eli Uriegas
5696df439b tools: Add script to do split build in one command (#141359)
Usage:
```bash
python3 tools/packaging/split_wheel.py bdist_wheel
python3 tools/packaging/split_wheel.py install
python3 tools/packaging/split_wheel.py develop
```
Ideally this should make it easier to do the split build locally while
we're doing development.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141359
Approved by: https://github.com/kit1980
2024-11-26 20:51:05 +00:00
Joel Schlosser
8ba555ec8a Fix where() for NJT (#141500)
**Background:** It's common to use `scalar_tensor()` in the input to `where()` to convert any scalars present to compatible tensors with matching options, *including layout*. This shows up in various places, notably including derivative formulas ([example](78491d6afc/tools/autograd/derivatives.yaml (L432-L434))). It causes problems for NJTs because they have `layout=torch.jagged` and it never makes sense to create a scalar tensor with this layout. Some of the breakage only seems to happen in CI for reasons I don't fully understand (see the revert of #140736 due to softshrink's derivative formula).

**This PR:**
* Allows non-contiguous NJT inputs to `where()` + adds tests for this
* Handles scalar tensor / dense tensor inputs for `condition` / `other` + adds tests for this
    * Uses limited `broadcast_tensors()` / `broadcast_to()` support
    * Improves `expand()` to work on non-contig NJTs
* Changes `scalar_tensor()` to use `torch.strided` instead of `torch.jagged` in both eager and torch.compile (i.e. meta registration)
* Changes backward formulas for `sinc`, `pow`, `special.i1`, and `special.i1e` to use `scalar_tensor()` instead of e.g. `zeros({})`

**Alternative approach:** Update all problematic usages of `scalar_tensor()` to avoid ever passing `layout=torch.jagged`. This is an extensive change and includes `torch.where()` logic, a bunch of derivative formulas, and likely other places not yet discovered.
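A hedged illustration of the fixed path (exact supported input combinations are per this PR's tests):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged
)
# The scalar "other" is materialized via scalar_tensor() with
# layout=torch.strided rather than torch.jagged.
out = torch.where(nt > 0, nt, 0.0)
```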
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141500
Approved by: https://github.com/malfet, https://github.com/cpuhrsch, https://github.com/soulitzer
2024-11-26 20:13:27 +00:00
Zhengxu Chen
011650adc5 [sigmoid] Refactor out a helper function to insert const graph into top level graph. (#140854)
Summary: Add a helper function to put a const graph back into the top-level graph; this can be useful when we're taking const graphs from delegates.

Test Plan: CI

Reviewed By: trieuat

Differential Revision: D63031982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140854
Approved by: https://github.com/SherlockNoMad
2024-11-26 20:07:46 +00:00
William Wen
6fa4356451 handle sympy.oo in bitwise_and/or value_ranges (#141522)
An internal test is failing due to not handling `sympy.oo` properly in bitwise_and/or value_ranges: [T208684142](https://www.internalfb.com/intern/tasks/?t=208684142). I don't know how to repro this; it seems to require inductor to trigger as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141522
Approved by: https://github.com/ezyang
ghstack dependencies: #138777
2024-11-26 20:01:31 +00:00
Tsung-Hsien Lee
84f818f359 [DTensorTestbase] Fix TestFunc typing issue (#141513)
Summary: `TestFunc` is annotated as `Callable[[object], object]`, which represents a callable that takes a single argument of type `object` and returns a value of type `object`. However, in reality, `TestFunc` can take any number of arguments; as a result, the correct typing is `Callable[..., object]`, which represents a callable that takes any number of arguments (including zero) and returns a value of any type (`object`).
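For reference, a minimal before/after of the annotation:

```python
from typing import Callable

# Before: admits only single-argument callables.
TestFuncOld = Callable[[object], object]

# After: admits callables with any number of arguments (including zero).
TestFunc = Callable[..., object]
```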

Test Plan: Contbuild & OSS CI

Differential Revision: D66463705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141513
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-11-26 19:48:34 +00:00
atalman
893a4390c9 Use cuda 12.6 wheels with Manylinux 2.28. Use Manylinux2014 for CPU, CUDA11.8, CUDA12.4 (#141565)
For release 2.6 we will be using only CUDA 12.6 binaries on Manylinux 2.28.
Issue: https://github.com/pytorch/pytorch/issues/123649
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141565
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/malfet
2024-11-26 19:36:42 +00:00
eqy
816ca98cd2 [cuDNN][SDPA] Update cuDNN grad output layout check (#141147)
Thanks to https://github.com/pytorch/pytorch/pull/137978 from @Skylion007, which bumps cuDNN to 9.5.1, the broken assumption that dO strides == O strides is fixed.

Note that there is still the restriction that the innermost stride of the grad output must be 1 (this is almost always guaranteed, because the same condition is required of the input tensors). The main exception would be test code that does e.g. `.sum().backward()`, which yields grad output tensors with strides `[0, 0, 0, 0]`.
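A quick illustration (plain autograd, not specific to this cuDNN path) of how such stride-0 grad outputs arise:

```python
import torch

# The gradient flowing out of .sum() is a scalar expanded to the input's
# shape, so all of its strides are zero.
x = torch.randn(2, 3, 4, 5, requires_grad=True)
y = x * 2.0
y.register_hook(lambda g: print(g.stride()))  # prints (0, 0, 0, 0)
y.sum().backward()
```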

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141147
Approved by: https://github.com/drisspg
2024-11-26 19:17:01 +00:00
Nichols A. Romero
a99332eb25 [ROCM] Support Multi-GPU offline tuning in TunableOp (#139673)
This PR enhances offline tuning to support multi-GPUs.

High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename

This PR also adds support for GemmAndBias and ScaledGemm.
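A minimal sketch (an assumed shape of the algorithm above, not TunableOp's actual code) of the de-duplicate-and-distribute step:

```python
def distribute_gemms(gemm_keys: list[str], num_gpus: int) -> list[list[str]]:
    unique = list(dict.fromkeys(gemm_keys))  # eliminate duplicate GEMMs, keep order
    # Round-robin the unique GEMMs across GPUs for tuning.
    return [unique[i::num_gpus] for i in range(num_gpus)]

shards = distribute_gemms(["gemm_a", "gemm_b", "gemm_a", "gemm_c"], num_gpus=2)
# shards == [["gemm_a", "gemm_c"], ["gemm_b"]]; per-GPU results are then
# gathered into a single file with `_full` in the filename.
```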

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
2024-11-26 19:07:41 +00:00
fduwjj
5b4c864672 [c10d] Enable CudaEventCache by default and add multi device support (#140975)
We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727. This feature reuses CUDA events so that we avoid calling CudaEvent's destructor, which has caused hangs in the past. We have a bunch of tests and have tested on TorchTitan and internal workloads; so far no errors or crashes have been found, so we are rolling it out to all OSS users. For internal workloads, this PR has no effect because of internal gating.

We also observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-11-26 18:42:45 +00:00
Isuru Fernando
44186a0a4e Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-26 18:11:00 +00:00
Shivam Raikundalia
29ca44839e Add skip_first_wait to profiler.schedule (V2) (#141512)
Summary:
Another try at D66198138. The original diff had some weird issue with type checking; setting everything to int this time to get around it.

Addresses https://github.com/pytorch/pytorch/issues/91888
We use `wait` as the number of steps to wait in between cycles when profiling, and `skip_first` to delay the start of said profiling. However, once the `skip_first` steps are completed, we immediately go to the wait phase. This is not problematic if `wait` is smaller than `skip_first`, because we can just lower the value of `skip_first`; but if it is larger, we end up starting the first profile much later than desired. For example, imagine `skip_first=1` and `wait=100` with `repeat=2`. We do want to wait 100 steps between cycles 1 and 2, but we may not want to start the warmup of cycle 1 at step 101 (forced because the wait occurs directly after the first skipped steps). This diff addresses this by adding a flag to skip the first wait.
Adds the new flag but sets it to false by default so that the existing implementation is not affected.
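A sketch of the prose example above (`skip_first_wait` is an int flag per this diff):

```python
import torch

# With skip_first=1, wait=100, repeat=2: previously, warmup of cycle 1 could
# not start before step 101; with skip_first_wait=1 it starts right after the
# skipped step, while cycles 1 and 2 remain separated by the 100-step wait.
schedule = torch.profiler.schedule(
    wait=100, warmup=3, active=1, repeat=2, skip_first=1, skip_first_wait=1
)
```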

Test Plan:
Got the following traces with this schedule:
```python
schedule = torch.profiler.schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
```

Differential Revision: D66465860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141512
Approved by: https://github.com/aaronenyeshi
2024-11-26 18:10:54 +00:00
atalman
809de05693 Update libgfortran version in aarch64 Docker (#141583)
From `libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb` to `libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb`, as the former is no longer available:
```
% curl --head http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb
HTTP/1.1 404 Not Found
Date: Tue, 26 Nov 2024 16:58:10 GMT
Server: Apache/2.4.29 (Ubuntu)
Content-Type: text/html; charset=iso-8859-1
```
vs
```
% curl --head http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb
HTTP/1.1 200 OK
Date: Tue, 26 Nov 2024 16:58:48 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: Sun, 31 Mar 2024 10:51:08 GMT
ETag: "713d4-614f2a681d48b"
Accept-Ranges: bytes
Content-Length: 463828
Content-Type: application/x-debian-package
```

Here is the failure: https://github.com/pytorch/pytorch/actions/runs/12032016986/job/33542862322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141583
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/malfet
2024-11-26 17:49:34 +00:00
Yidi Wu
000d4e9d43 [hop][inductor] remove codegen_subgraph_suffix and directly assign call function result to outer outputs (#141181)
Before the PR: P1683356646
After the PR: P1683356585

Relevant changes:
```
@@ -231,7 +421,8 @@
             true_graph_0_args = [true_graph_0_arg0_1, true_graph_0_arg1_1]
             del true_graph_0_arg0_1
             del true_graph_0_arg1_1
+            (buf5[0],) = true_graph_0(true_graph_0_args)
-             (true_graph_0_buf0,) = true_graph_0(true_graph_0_args)
-             buf5[0] = true_graph_0_buf0
         else:
             # subgraph: false_graph_0
             false_graph_0_arg0_1 = buf4
@@ -239,7 +430,8 @@
             false_graph_0_args = [false_graph_0_arg0_1, false_graph_0_arg1_1]
             del false_graph_0_arg0_1
             del false_graph_0_arg1_1
+            (buf5[0],) = false_graph_0(false_graph_0_args)
-             (false_graph_0_buf0,) = false_graph_0(false_graph_0_args)
-             buf5[0] = false_graph_0_buf0
         del arg2_1
         del buf4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141181
Approved by: https://github.com/anijain2305
ghstack dependencies: #140334, #141172
2024-11-26 17:32:51 +00:00
Yidi Wu
aae581d921 [hop free symbols][inductor] remove un-used add_symbol_graph_inputs (#141172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141172
Approved by: https://github.com/Chillee
ghstack dependencies: #140334
2024-11-26 17:32:50 +00:00
Yidi Wu
45bc9165fe [hop] add discard_graph_changes to remove the empty calls before hop (#140334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140334
Approved by: https://github.com/zou3519
2024-11-26 17:32:43 +00:00
Boyuan Feng
eecc8e362c [Inductor] Inplacing with Donated Buffer (#140113)
Currently, inductor does not in-place update a buffer if it is an input buffer, because we don't know whether an input will be used by other functions.

A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.
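A conceptual sketch (hypothetical function, not inductor's generated code) of the enabled optimization:

```python
import torch

def step(donated_in: torch.Tensor) -> torch.Tensor:
    # Without donation info, a fresh output buffer must be allocated:
    #   return donated_in + 1
    # Knowing the input is donated (the caller never reuses it), the same
    # storage can be updated in place instead:
    return donated_in.add_(1)
```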

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-26 17:19:50 +00:00
Boyuan Feng
3ef031909f [Donated Buffer] support metadata mutation ops (#141308)
### Background:

`set_(x, y)` changes the untyped storage of `x` to be the same as `y`'s.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

x1 = torch.ones(2,3)
y1 = torch.ones(2,3)
z1 = torch.ops.aten.set_.source_Tensor(x1, y1)

fake_tensor_mode = FakeTensorMode()
x2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
y2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
z2 = torch.ops.aten.set_.source_Tensor(x2, y2)

print(f"x1: {x1.untyped_storage()._cdata}, y1: {y1.untyped_storage()._cdata}, z1: {z1.untyped_storage()._cdata}")
print(f"x2: {x2.untyped_storage()._cdata}, y2: {y2.untyped_storage()._cdata}, z2: {z2.untyped_storage()._cdata}")
# x1: 99973024, y1: 99973024, z1: 99973024
# x2: 112107232, y2: 112107232, z2: 112107232
```

### Error before this diff

Consider this example:
```python
import torch

def fn(x):
    p = torch.nn.Parameter(x + 123)
    return p, p.sin()

opt = torch.compile(fn, fullgraph=True)
x = torch.ones(16, device="cuda", requires_grad=True)

p, r = opt(x)
r.sum().backward()
```

When running with `TORCH_LOGS=aot`, we see `set_` in the graph:
```
def forward(self, primals_1: "f32[16][1]cuda:0", primals_2: "f32[16][1]cuda:0"):
   # File: /home/boyuan/playground/inductor/donated_buffer.py:4 in fn, code: p = torch.nn.Parameter(x + 123)
  add: "f32[16][1]cuda:0" = torch.ops.aten.add.Tensor(primals_1, 123);  primals_1 = None

   # File: /home/boyuan/playground/inductor/donated_buffer.py:5 in fn, code: return p, p.sin()
  sin: "f32[16][1]cuda:0" = torch.ops.aten.sin.default(add)

  # No stacktrace found for following nodes
  set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add);  primals_2 = set_ = None
  return (sin, add)
```

`set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add)` should change the storage of `primals_2` to be the same as `add`'s. However, this was not true before this diff: we found different `untyped_storage()` values for the `meta['val']` of `set_`, `add`, and `primals_2`.

This also leads to an error with donated buffers (#130580), which check aliasing via `untyped_storage`. Since `add` and `primals_2` have different untyped storages (which is wrong), `add` is wrongly marked as a donated buffer.

### Root Cause

During tracing, we have args, kwargs, out, and proxy_args, proxy_kwargs, proxy_out.

We use `args` and `kwargs` to compute `out = func(*args, **kwargs)` ([Here](https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L912)). Later, we set `out` as the value of its proxy, essentially calling `proxy_out.node.meta["val"] = out.detach()`.

Due to the detach, the storage change happens on `args` but not on `proxy_args.node.meta["val"]` when `func` is `torch.ops.aten.set_`. I reproduced this behavior of detach in eager code.

```python
import torch

x = torch.ones(2,3)
x_detach = x.detach()
y = torch.ones(2,3)
z = torch.ops.aten.set_.source_Tensor(x_detach, y)

print(f"x: {x.untyped_storage()._cdata}, x_detach: {x_detach.untyped_storage()._cdata}, y: {y.untyped_storage()._cdata}, z: {z.untyped_storage()._cdata}")
# x: 97023632, x_detach: 97026480, y: 97026480, z: 97026480
```

To fix the issue, this PR manually resets node.meta["val"] if the storage has changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141308
Approved by: https://github.com/bdhirsh
2024-11-26 17:06:46 +00:00
Ryan Guo
99a0e2b1a1 [dynamo] Trace through dataclasses by removing it from BUILTIN_SKIPLIST (#141294)
Fixes #141261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141294
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-11-26 17:05:23 +00:00
Stephen Matthews
2bbd984aa2 Fix typo in Reproducibility docs (#141341)
Fixes trivial issue in the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141341
Approved by: https://github.com/svekars
2024-11-26 16:53:26 +00:00
Edward Z. Yang
42ab61241e Add README for torch._inductor.runtime (#141492)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141492
Approved by: https://github.com/jansel
ghstack dependencies: #141491
2024-11-26 14:43:02 +00:00
Edward Z. Yang
94ff3985c9 AFAICT, compile workers never actually mocked torch (#141491)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141491
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-11-26 14:43:02 +00:00
leslie-fang-intel
9d4c0527b3 [Inductor][CPP] Modularize the CPP GEMM Template (#141006)
**Summary**
Move the common template code, which may be reused in subsequent group GEMM templates, into the standalone sub-templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141006
Approved by: https://github.com/jgong5
2024-11-26 14:32:40 +00:00
Ting Lu
313c1b33c5 Update CUDA installation script to 12.6.3 (#141365)
related to https://github.com/pytorch/pytorch/issues/138440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141365
Approved by: https://github.com/atalman
2024-11-26 13:49:51 +00:00
xinan.lin
9dd3b85d05 [Inductor XPU] Fix wrong device check before skip concat linear. (#140916)
Fix #140917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140916
Approved by: https://github.com/EikanWang, https://github.com/eellison
2024-11-26 13:30:26 +00:00
xinan.lin
4742080ed9 [AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-11-26 11:51:32 +00:00
ZhiweiYan-96
c418a9ac75 [Intel GPU] XPUInductorQuantizer for XPU int8 recipe customization (#139578)
# Motivation
This PR adds `XPUInductorQuantizer`, which defines the recipe for int8 quantization on the XPU backend.

# Details
`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).

We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxx` methods. Ideally, `XPUInductorQuantizer` would have an empty class body, since all of the implementation can be inherited from the base class.

In this PR, we override the `annotate_xxx` methods for operators that have NOT been implemented. All operators the XPU backend does not implement fall back to the fp32 implementation, since the corresponding nodes in the graph remain `dq-op-q` pairs. This helps provide good out-of-the-box usability for the XPU backend. On the other hand, implemented operators use the `annotate_op` methods from the base class and can be lowered successfully.
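A minimal sketch (assumed shape, not the verbatim source) of the idea — inherit the `X86InductorQuantizer` recipe and override annotation only for ops the XPU backend has not implemented:

```python
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
)

class XPUInductorQuantizer(X86InductorQuantizer):
    # Implemented ops (qconv, qlinear, ...) inherit annotate_* unchanged.

    def _annotate_maxpool2d(self, node, quantization_config) -> None:
        # Hypothetical override for an op not yet implemented on XPU:
        # skip annotation so the pattern keeps its dq-op-q form and
        # falls back to the fp32 kernel.
        return
```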

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
2024-11-26 09:44:14 +00:00
PyTorch MergeBot
5318bf8baf Revert "[sparse] add extra options to _cslt_spare_mm (#137427)"
This reverts commit f1451163ec.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to This looks like the test is still failing, plz do a rebase ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2499918590))
2024-11-26 08:01:24 +00:00
cyy
6d4cd3e5f2 Remove linking of private cuda targets (#141463)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141463
Approved by: https://github.com/malfet
2024-11-26 03:51:53 +00:00
ZhiweiYan-96
648f5d9dd9 [Intel GPU] qconv at XPU backend (#133080)
# Motivation
This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, since both call kernels implemented in the oneDNN library.

# Details

The implemented operators will be further integrated into the pt2e quant flow. In this PR, we validate the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend device support for the inductor lowering pass and the inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends share the general optimization passes (op fusion) and the quantization inductor IR; after lowering, the final kernel is dispatched to the appropriate implementation in the oneDNN library.

In this PR, we share the same int8 quantizer as the CPU, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer`, which will customize the pt2e behaviors for the XPU backend. The capability of `XPUInductorQuantizer` will grow gradually along with the development of quantized operators on XPU.
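A hedged sketch of the intended pt2e flow (the XPU-specific quantizer lands in #139578; device availability and the exact export API are assumptions here):

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Conv2d(3, 8, 3).eval().to("xpu")
example = (torch.randn(1, 3, 16, 16, device="xpu"),)
exported = torch.export.export_for_training(model, example).module()

quantizer = X86InductorQuantizer()
quantizer.set_global(get_default_x86_inductor_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example)                   # calibration
quantized = convert_pt2e(prepared)
compiled = torch.compile(quantized)  # lowering dispatches qconv to oneDNN
```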

# Validation
*  UT testing
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
   -k test_qconv2d_xpu \
   -k test_qconv2d_silu_xpu \
   -k test_qconv2d_relu6_xpu \
   -k test_qconv2d_hardtanh_xpu \
   -k test_qconv2d_hardswish_xpu
```
* Runtime exemplification
```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945

#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
2024-11-26 02:24:30 +00:00
Nikita Shulga
f2d388eddd [BE] Use torch.special.expm1 (#141518)
Instead of `torch.exp(x) - 1`, as suggested by TorchFix.
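A quick illustration of why:

```python
import torch

# For tiny x, exp(x) - 1 loses everything to cancellation in float32,
# while expm1 computes the result accurately.
x = torch.tensor(1e-10)
print(torch.exp(x) - 1)        # tensor(0.)
print(torch.special.expm1(x))  # tensor(1.0000e-10)
```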

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141518
Approved by: https://github.com/kit1980
2024-11-26 01:47:11 +00:00
Yanbo Liang
dcd16bdc21 [Dynamo][autograd.Function] Use fake tensor prop to infer fwd output (#136184)
Fixes #129963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136184
Approved by: https://github.com/zou3519
2024-11-26 01:10:08 +00:00
cyy
6b60f4bc91 Fix some typos in cuda.cmake (#141462)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141462
Approved by: https://github.com/peterbell10
2024-11-26 01:08:25 +00:00
Yifu Wang
6a22cae436 [IntraNodeComm] fix a recent breakage (#141200)
- Pass `group_name` to `CUDASymmetricMemory::alloc()` instead of `CUDASymmetricMemory::rendezvous()`. We can only move the argument to rendezvous() once all the underlying operators do the same.
- Added `float` to the allowlist for intra-node all-reduces.
- Added a warning when `IntraNodeComm::rendezvous()` is performed with overlapping devices among participants.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141200
Approved by: https://github.com/weifengpy, https://github.com/kwen2501
2024-11-26 00:46:38 +00:00
Ryan Guo
583484b726 [dynamo] Fix and simplify hanlding of Set.update method (#141286)
The old implementation of `SetVariable.call_method("update", ...)` was
incorrect because it wouldn't handle iterable inputs. This patch
removes the input type restriction altogether and implements the method
as a polyfill (like most of the other set methods).
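A minimal sketch of the polyfill semantics (CPython-style behavior, not the verbatim Dynamo code):

```python
def set_update(self: set, *others) -> None:
    # update() accepts any number of arbitrary iterables, not just sets.
    for other in others:
        for item in other:
            self.add(item)

s = {1, 2}
set_update(s, [2, 3], (x * 10 for x in range(2)))
assert s == {0, 1, 2, 3, 10}
```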

Fixes #141283.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141286
Approved by: https://github.com/anijain2305
2024-11-26 00:41:50 +00:00
Avik Chaudhuri
5d7c3701e4 fix non termination in unflatten + state (#141494)
With largish systems of nn modules with buffers, sinking params suffered from a kind of exponential blowup that is easily fixed by using a set instead of a list to keep track of unlifted buffer placeholders.
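A sketch of the essence of the fix (hypothetical names, not the unflatten code):

```python
# Membership tests on a list are O(n) each and compound during traversal;
# a set makes them O(1), avoiding the blowup.
placeholders = [f"buf_{i % 100}" for i in range(10_000)]

seen = set()            # previously: seen = []
for p in placeholders:
    if p not in seen:   # O(1) with a set
        seen.add(p)
```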

Test Plan: added random dag test that failed previously

Differential Revision: D66457661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141494
Approved by: https://github.com/angelayi
2024-11-26 00:17:56 +00:00
Jithun Nair
9ccbd84316 Upgrade ROCm wheels to manylinux2_28 - 1 of 2 (docker images) (#140681)
Fixes #140631

Highlights:
* Use `cpu_final` base for ROCm in `.ci/docker/manywheel/Dockerfile_2_28`
* Cleans up install_miopen.sh to remove old ROCm references
* Install `gcc-gfortran` package to build magma for ROCm on almalinux

Needs builder PR https://github.com/pytorch/builder/pull/2043 (merged) so that the expected GCC_ABI value is updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140681
Approved by: https://github.com/jeffdaily
2024-11-26 00:10:40 +00:00
Nikita Shulga
8f5ce865a4 [Build] Add COMMIT_SHA to caffe2::GetBuildOptions (#141313)
Using the same `tools/generate_torch_version.py` script.

It's already available at the Python level, but not at the C++ one.

Please note that updating the commit hash will force recompilation of fewer than 10 files, according to
```
% touch caffe2/core/macros.h; ninja -d explain -j1 -v -n torch_python
ninja explain: output caffe2/torch/CMakeFiles/gen_torch_version doesn't exist
ninja explain: caffe2/torch/CMakeFiles/gen_torch_version is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/version.py is dirty
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546390618881 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546233600752 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546651089243 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546224176845 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546464535054 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301550062608920 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301547538843492 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o is dirty
```

Differential Revision: [D66468257](https://our.internmc.facebook.com/intern/diff/D66468257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141313
Approved by: https://github.com/ezyang
2024-11-26 00:09:36 +00:00
PyTorch MergeBot
ad37afd590 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit ba5253da9b.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/yf225 due to perf regression on torchbench ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2499277511))
2024-11-26 00:03:03 +00:00
PyTorch MergeBot
964655bf0c Revert "Remove THC from OSS build (#134969)"
This reverts commit 9c7660be0e.

Reverted https://github.com/pytorch/pytorch/pull/134969 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking the installation of https://github.com/facebookresearch/detectron2/blob/main/detectron2/layers/csrc/deformable/deform_conv_cuda_kernel.cu#L76 ([comment](https://github.com/pytorch/pytorch/pull/134969#issuecomment-2499275378))
2024-11-26 00:00:12 +00:00
Jesse Cai
f1451163ec [sparse] add extra options to _cslt_spare_mm (#137427)
Summary:

Splitting this PR into two: one for the cuSPARSELt improvements, and one for the inductor lowering.

This PR adds the additional cuSPARSELt bindings into PyTorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8
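
A hedged usage sketch of the `get_max_alg_id` binding listed above (requires a CUDA build with cuSPARSELt available):

```python
import torch

# Query the maximum algorithm id exposed by cuSPARSELt.
max_alg_id = torch.backends.cusparselt.get_max_alg_id()
print(max_alg_id)
```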

Test Plan:

```bash
python test/test_sparse_semi_structured.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-25 23:45:41 +00:00
Shangdi Yu
02990fe36b Populate nn.module.stack in _fuse_conv_bn_qat (#141400)
Summary:
Populate nn.module.stack in _fuse_conv_bn_qat for replacement nodes that correspond to a `get_attr` node in the original graph.

In the new training IR, `get_attr` nodes don't have `nn_module_stack` in their node meta anymore (because the `get_attr` nodes are de-duplicated, so one `get_attr` node can potentially have users in different module stacks).

We populate it by checking whether the "conv_input" or "conv_weight" replacement node has `nn_module_stack`; if not, we copy it from the conv node.

Test Plan:
CI

```
buck run fbcode//caffe2/test:quantization_pt2e -- -r test_preserve_nn_module_stack
```

Differential Revision: D66393517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141400
Approved by: https://github.com/angelayi
2024-11-25 23:41:28 +00:00
Jithun Nair
851edf208b [ROCm] Remove gfx906 from CI docker build (#141523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141523
Approved by: https://github.com/jeffdaily
2024-11-25 22:23:28 +00:00
Ke Wen
915625307e [PGNCCL] Record device index for GPU guarding during NCCLComm method calls (#141270)
### Motivation
`ncclCommInitRank` needs a GPU guard (documented by NCCL).

`ncclCommAbort`, `ncclCommFinalize`, and `ncclCommDestroy` may also need a GPU guard (undocumented by NCCL); otherwise, an extra CUDA context may be created (or worse, a hang). Both effects have been seen in our tests before.

### Solution
This PR records a device index during `NCCLComm` object creation so that we can apply a GPU guard in the `NCCLComm` methods that call into the above NCCL APIs.

### Note
This is not a bug fix, just a safety improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141270
Approved by: https://github.com/eqy
ghstack dependencies: #141374
2024-11-25 21:31:21 +00:00
Ke Wen
af4522b81c [c10d][CI] Use new store for PG restart tests (#141374)
A new Store is used to recreate PGs upon restart. The new Store is obtained by adding a prefix to the existing one.
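A minimal sketch of the idea (hypothetical prefix and port):

```python
import torch.distributed as dist

# Wrapping the original store in a PrefixStore gives the restarted PG a
# fresh key namespace without standing up a new server.
store = dist.TCPStore("localhost", 29500, world_size=1, is_master=True)
restart_store = dist.PrefixStore("restart_attempt_1", store)
```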

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141374
Approved by: https://github.com/fduwjj
2024-11-25 21:31:21 +00:00
Xuehai Pan
b18bbc965c [dynamo] support list.sort sort non-constant iterable with constant keys (#141485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141485
Approved by: https://github.com/jansel
2024-11-25 21:06:11 +00:00
Benjamin Glass
efec302dd0 cpp_wrapper tests: Fix tests assuming non-cpp_wrapper code (#141175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141175
Approved by: https://github.com/desertfire
2024-11-25 19:33:55 +00:00