Convenience improvements for building the PyTorch docs:
1. The docs build instructions weren't clear that `make html` is the main command intended to be run
2. Once you run `make html`, you need a way to view the result; opening a simple HTTP server seems like the simplest solution, so this adds a `make serve` command
Usage:
```shell
numpy ❯ make serve PORT=8080 # PORT is optional
Serving HTTP on :: port 8080 (http://[::]:8080/) ...
::1 - - [26/Nov/2024 10:05:41] "GET / HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/copybutton.css HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/katex-math.css HTTP/1.1" 200 -
```
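For reference, `make serve` is essentially a thin wrapper around Python's built-in HTTP server (the log above is that module's output). A minimal sketch of the equivalent, assuming the built HTML lives in `docs/build/html` (the exact path and port are assumptions for illustration):
```python
import http.server
from functools import partial

# Serve the built docs directory on port 8080.
handler = partial(http.server.SimpleHTTPRequestHandler, directory="docs/build/html")
with http.server.ThreadingHTTPServer(("", 8080), handler) as httpd:
    print("Serving docs at http://localhost:8080/")
    httpd.serve_forever()
```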

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141590
Approved by: https://github.com/svekars, https://github.com/malfet
This improves `PyProcessGroup` so you can override rank, world size and group name/desc methods from Python. These will be needed to support resizable process groups in torchft.
This also includes some small fixes in test_c10d_pypg.py to use threads instead of processes, which speeds up test execution by ~10x.
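A rough sketch of the kind of Python-side override this enables, loosely modeled on the process-group subclasses in test_c10d_pypg.py (the class name and the resizing logic here are hypothetical, not torchft's actual implementation):
```python
import torch.distributed as dist

class ResizablePG(dist.ProcessGroup):
    """Hypothetical Python process group whose rank/world size can change."""

    def __init__(self, rank: int, world_size: int):
        super().__init__(rank, world_size)
        self._rank = rank
        self._world_size = world_size

    # With this PR these accessors can be overridden from Python and are
    # respected by the C++ side.
    def rank(self) -> int:
        return self._rank

    def size(self) -> int:
        return self._world_size

    def resize(self, rank: int, world_size: int) -> None:
        self._rank, self._world_size = rank, world_size
```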
Test plan:
```
pytest test/distributed/test_c10d_pypg.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141529
Approved by: https://github.com/fegin
Usage:
```bash
python3 tools/packaging/split_wheel.py bdist_wheel
python3 tools/packaging/split_wheel.py install
python3 tools/packaging/split_wheel.py develop
```
Ideally this should make it easier to do the split build locally while
we're doing development.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141359
Approved by: https://github.com/kit1980
**Background:** It's common to use `scalar_tensor()` in the input to `where()` to convert any scalars present to compatible tensors with matching options, *including layout*. This shows up in various places, notably including derivative formulas ([example](78491d6afc/tools/autograd/derivatives.yaml (L432-L434))). It causes problems for NJTs because they have `layout=torch.jagged` and it never makes sense to create a scalar tensor with this layout. Some of the breakage only seems to happen in CI for reasons I don't fully understand (see the revert of #140736 due to softshrink's derivative formula).
**This PR:**
* Allows non-contiguous NJT inputs to `where()` + adds tests for this
* Handles scalar tensor / dense tensor inputs for `condition` / `other` + adds tests for this
* Uses limited `broadcast_tensors()` / `broadcast_to()` support
* Improves `expand()` to work on non-contig NJTs
* Changes `scalar_tensor()` to use `torch.strided` instead of `torch.jagged` in both eager and torch.compile (i.e. meta registration)
* Changes backward formulas for `sinc`, `pow`, `special.i1`, and `special.i1e` to use `scalar_tensor()` instead of e.g. `zeros({})`
**Alternative approach:** Update all problematic usages of `scalar_tensor()` to avoid ever passing `layout=torch.jagged`. This is an extensive change and includes `torch.where()` logic, a bunch of derivative formulas, and likely other places not yet discovered.
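For illustration, a hedged sketch of the pattern at issue: wrapping a scalar via `scalar_tensor()` before calling `where()` on a jagged NJT. After this PR the scalar is created with `torch.strided`, so a call like the one below is expected to work (exact supported input combinations are as listed in the bullets above):
```python
import torch

# Two variable-length rows packed into a jagged NJT.
njt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
# Scalar "other" as a 0-dim strided tensor; creating it with layout=torch.jagged
# would never make sense, which is what used to break derivative formulas.
zero = torch.scalar_tensor(0.0, dtype=njt.dtype, device=njt.device)
out = torch.where(njt > 0, njt, zero)
```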
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141500
Approved by: https://github.com/malfet, https://github.com/cpuhrsch, https://github.com/soulitzer
Summary: Add a helper function to put a const graph back into the top-level graph; this can be useful when we're taking const graphs from delegates.
Test Plan: CI
Reviewed By: trieuat
Differential Revision: D63031982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140854
Approved by: https://github.com/SherlockNoMad
Summary: `TestFunc` is annotated as `Callable[[object], object]`, which represents a callable that takes a single argument of any type (`object`) and returns a value of any type (`object`). However, in reality, `TestFunc` can take any number of arguments; as a result, the correct annotation is `Callable[..., object]`, which represents a callable that takes any number of arguments (including zero) and returns a value of any type (`object`).
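A small illustration of the difference (the names here are just for this example):
```python
from typing import Callable

def takes_two(a: int, b: int) -> object:
    return a + b

one_arg_only: Callable[[object], object]  # exactly one positional argument
any_arity: Callable[..., object]          # any number of arguments

any_arity = takes_two     # accepted by a type checker
one_arg_only = takes_two  # rejected by a type checker: arity mismatch
```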
Test Plan: Contbuild & OSS CI
Differential Revision: D66463705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141513
Approved by: https://github.com/wz337, https://github.com/Skylion007
Thanks to https://github.com/pytorch/pytorch/pull/137978 from @Skylion007, which bumps cuDNN to 9.5.1, the broken assumption that dO strides == O strides is fixed.
Note that there is still the restriction that the innermost stride of the grad output is 1 (this is almost always guaranteed because this condition is required of the input tensors). The main exception is test code that does e.g. `.sum().backward()`, which yields grad output tensors with strides `[0, 0, 0, 0]`.
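A minimal illustration of that corner case (purely to show the strides; not part of this PR's tests):
```python
import torch

q = torch.randn(2, 4, 8, 16, requires_grad=True)
out = q * 2.0
# The gradient flowing out of .sum() is an expanded scalar, so its strides are
# all zero and its innermost stride is not 1.
out.register_hook(lambda g: print(g.stride()))  # prints (0, 0, 0, 0)
out.sum().backward()
```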
CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141147
Approved by: https://github.com/drisspg
This PR enhances offline tuning to support multi-GPUs.
High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename
Also adds support for GemmAndBias and ScaledGemm.
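A purely illustrative Python sketch of the three steps above; the helper names and file format are hypothetical and are not the actual TunableOp API:
```python
import torch

def tune_on_device(device: int, gemm_sigs: list[str]) -> list[str]:
    # Placeholder for per-GPU tuning: in the real flow each GPU tunes its shard.
    with torch.cuda.device(device):
        return [f"{sig},solution_from_gpu{device}\n" for sig in gemm_sigs]

def offline_tune(untuned_sigs: list[str], out_prefix: str = "tunableop_results") -> None:
    unique = list(dict.fromkeys(untuned_sigs))         # 1. drop duplicate GEMMs
    ngpus = max(torch.cuda.device_count(), 1)
    shards = [unique[i::ngpus] for i in range(ngpus)]  # 2. distribute across GPUs
    results: list[str] = []
    for device, shard in enumerate(shards):            # (one process per GPU in practice)
        results.extend(tune_on_device(device, shard))
    with open(f"{out_prefix}_full.csv", "w") as f:     # 3. gather into a single "_full" file
        f.writelines(results)
```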
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727. This feature reuses CUDA events so that we don't have to destroy them, which has caused hangs in the past. We already have a bunch of tests plus testing on TorchTitan and internal workloads; so far no errors or crashes have been found, so we have decided to roll it out to all OSS users. Internal workloads are not affected by this PR because of some internal gating.
We also observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
Summary:
Another try at D66198138. The original diff had some weird issue with type checking; setting everything to int this time to get around it.
Addresses https://github.com/pytorch/pytorch/issues/91888
We use `wait` as the number of steps to wait between profiling cycles and `skip_first` to delay the start of profiling. However, once the `skip_first` steps are completed, we immediately enter the wait phase. This is not a problem if `wait` is smaller than `skip_first`, because we can just lower `skip_first`; but if `wait` is larger, the first profile starts much later than desired. For example, imagine `skip_first=1`, `wait=100`, and `repeat=2`: we do want to wait 100 steps between cycles 1 and 2, but we may not want warmup for cycle 1 to start at step 101 (which is forced because the wait occurs directly after the skipped steps). This diff addresses that by adding a flag to skip the first wait.
The new flag defaults to false so that the existing behavior is not affected.
Test Plan:
Got the following traces with this schedule:
schedule=torch.profiler.schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
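For completeness, a hedged end-to-end sketch using that schedule (the workload and trace handler below are arbitrary choices, not from this PR):
```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

sched = schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
with profile(
    activities=[ProfilerActivity.CPU],
    schedule=sched,
    on_trace_ready=lambda prof: prof.export_chrome_trace("trace.json"),
) as prof:
    for _ in range(30):
        torch.randn(128, 128) @ torch.randn(128, 128)
        prof.step()  # with skip_first_wait=1, warmup starts right after the skipped step
```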
Differential Revision: D66465860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141512
Approved by: https://github.com/aaronenyeshi
### Background:
`set_(x, y)` changes the untyped storage of `x` to be the same as `y`'s.
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
x1 = torch.ones(2,3)
y1 = torch.ones(2,3)
z1 = torch.ops.aten.set_.source_Tensor(x1, y1)
fake_tensor_mode = FakeTensorMode()
x2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
y2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
z2 = torch.ops.aten.set_.source_Tensor(x2, y2)
print(f"x1: {x1.untyped_storage()._cdata}, y1: {y1.untyped_storage()._cdata}, z1: {z1.untyped_storage()._cdata}")
print(f"x2: {x2.untyped_storage()._cdata}, y2: {y2.untyped_storage()._cdata}, z2: {z2.untyped_storage()._cdata}")
# x1: 99973024, y1: 99973024, z1: 99973024
# x2: 112107232, y2: 112107232, z2: 112107232
```
### Error before this diff
Consider this example:
```python
import torch
def fn(x):
    p = torch.nn.Parameter(x + 123)
    return p, p.sin()
opt = torch.compile(fn, fullgraph=True)
x = torch.ones(16, device="cuda", requires_grad=True)
p, r = opt(x)
r.sum().backward()
```
When running with `TORCH_LOGS=aot`, we have `set_` in the graph.
```
def forward(self, primals_1: "f32[16][1]cuda:0", primals_2: "f32[16][1]cuda:0"):
    # File: /home/boyuan/playground/inductor/donated_buffer.py:4 in fn, code: p = torch.nn.Parameter(x + 123)
    add: "f32[16][1]cuda:0" = torch.ops.aten.add.Tensor(primals_1, 123); primals_1 = None
    # File: /home/boyuan/playground/inductor/donated_buffer.py:5 in fn, code: return p, p.sin()
    sin: "f32[16][1]cuda:0" = torch.ops.aten.sin.default(add)
    # No stacktrace found for following nodes
    set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add); primals_2 = set_ = None
    return (sin, add)
```
`set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add)` should change the storage of `primals_2` to be the same as `add`. However, this is not true before this diff. We found different untyped_storage() for meta['val'] of `set_`, `add`, and `primals_2`.
This also leads to an error with donated buffer (#130580), which checks alias by untyped_storage. Since `add` and `primals_2` have different untyped_storage (which is wrong), add is wrongly marked as donated buffer.
### Root Cause
During tracing, we have args, kwargs, out, and proxy_args, proxy_kwargs, proxy_out.
We use args and kwargs to compute `out = func(*args, **kwargs)` ([Here](https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L912)). Later, we attach `out` to its proxy, essentially calling `proxy_out.node.meta["val"] = out.detach()`.
Due to the detach, the storage change happens on args but not on `proxy_args.node.meta["val"]` when func is `torch.ops.aten.set_`. I repro'ed this behavior of detach in eager code.
```python
import torch
x = torch.ones(2,3)
x_detach = x.detach()
y = torch.ones(2,3)
z = torch.ops.aten.set_.source_Tensor(x_detach, y)
print(f"x: {x.untyped_storage()._cdata}, x_detach: {x_detach.untyped_storage()._cdata}, y: {y.untyped_storage()._cdata}, z: {z.untyped_storage()._cdata}")
# x: 97023632, x_detach: 97026480, y: 97026480, z: 97026480
```
To fix the issue, this PR manually resets node.meta["val"] if the storage has changed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141308
Approved by: https://github.com/bdhirsh
# Motivation
This PR adds `XPUInductorQuantizer`, which defines the recipe for int8 quantization on the XPU backend.
# Details
`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).
We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxx` methods. Ideally, `XPUInductorQuantizer` would have no class body at all, since every implementation could be inherited from the base class.
In this PR, we override the `annotate_xxx` methods for operators that have NOT been implemented yet. Any operator the XPU backend does not implement falls back to the fp32 implementation, because the node in the graph remains a `dq-op-q` pair. This helps provide good out-of-the-box usability for the XPU backend. On the other hand, the implemented operators use the `annotate_op` implementations from the base class and can be lowered successfully.
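A hedged sketch of that inheritance pattern (the import path for the XPU quantizer and the specific overridden method are assumptions for illustration, not the final API):
```python
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

class XPUInductorQuantizer(X86InductorQuantizer):
    """Reuses the X86 int8 recipe; ideally the class body stays empty."""

    # Hypothetical override for an op the XPU backend has not implemented yet:
    # skipping annotation keeps the node as a dq-op-q pair, so it falls back
    # to the fp32 implementation.
    def _annotate_maxpool2d(self, node, quantization_config):
        return
```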
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
# Motivation
This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, as both call kernels implemented in the oneDNN library.
# Details
The implemented operators will be further integrated into the pt2e quant flow. In this PR, we validate the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend device support in the inductor lowering passes and inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends share the general optimization passes (op fusion) and the quantization inductor IR; after lowering, the final kernel is dispatched to the corresponding implementation in the oneDNN library.
In this PR, we share the same int8 quantizer as the CPU backend, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer` that customizes the pt2e behavior for the XPU backend. The capabilities of `XPUInductorQuantizer` will grow gradually along with the development of quantized operators on XPU.
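As context, a hedged sketch of the pt2e flow these kernels plug into, assuming an XPU build and device are available (the capture API shown is one of several options):
```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().to("xpu")
example = (torch.randn(1, 3, 16, 16, device="xpu"),)

exported = torch.export.export_for_training(model, example).module()
quantizer = X86InductorQuantizer().set_global(get_default_x86_inductor_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example)                      # calibration
converted = convert_pt2e(prepared)
compiled = torch.compile(converted)     # inductor lowers to the onednn::qconv* kernels
compiled(*example)
```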
# Validation
* UT testing
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
-k test_qconv2d_xpu \
-k test_qconv2d_silu_xpu \
-k test_qconv2d_relu6_xpu \
-k test_qconv2d_hardtanh_xpu \
-k test_qconv2d_hardswish_xpu
```
* Runtime exemplification
```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945
#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
- Pass `group_name` to `CUDASymmetricMemory::alloc()` instead of `CUDASymmetricMemory::rendezvous()`. We can only move the argument to rendezvous() once all the underlying operators do the same.
- Added `float` to the allowlist for intra-node all-reduces.
- Added a warning when `IntraNodeComm::rendezvous()` is performed with overlapping devices among participants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141200
Approved by: https://github.com/weifengpy, https://github.com/kwen2501
The old implementation of `SetVariable.call_method("update", ...)` was incorrect because it didn't handle iterable inputs. This patch removes the input type restriction altogether and implements the method as a polyfill (like most of the other set methods).
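A minimal sketch of what a polyfill-style `update` looks like, assuming standard Python set semantics (illustrative only, not the exact polyfill added to Dynamo):
```python
def set_update(self, *others):
    # Accepts arbitrary iterables, not just sets.
    for other in others:
        for item in other:
            self.add(item)

s = {1, 2}
set_update(s, [2, 3], (x for x in (4, 5)))
assert s == {1, 2, 3, 4, 5}
```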
Fixes #141283.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141286
Approved by: https://github.com/anijain2305
With largish systems of nn modules with buffers, sinking params suffered from some kind of exponential blowup that is easily fixed by using a set instead of a list to keep track of unlifted buffer placeholders.
Test Plan: added a random DAG test that failed previously
Differential Revision: D66457661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141494
Approved by: https://github.com/angelayi
Using the same `tools/generate_torch_version.py` script.
It's already available at the Python level, but not at the C++ one.
Please note that updating the commit hash will force recompilation of fewer than 10 files, according to:
```
% touch caffe2/core/macros.h; ninja -d explain -j1 -v -n torch_python
ninja explain: output caffe2/torch/CMakeFiles/gen_torch_version doesn't exist
ninja explain: caffe2/torch/CMakeFiles/gen_torch_version is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/version.py is dirty
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546390618881 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546233600752 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546651089243 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546224176845 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546464535054 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301550062608920 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301547538843492 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o is dirty
```
Differential Revision: [D66468257](https://our.internmc.facebook.com/intern/diff/D66468257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141313
Approved by: https://github.com/ezyang
Summary:
Splitting this PR into two: one for the cuSPARSELt improvements, and one for the inductor lowering.
This PR adds the additional cuSPARSELt bindings into PyTorch.
* `torch._cslt_sparse_mm_search` will be deprecated in a future PR, so a warning has been added
* Added a header file for cuSPARSELtOps.cpp
* `max_id` is now available in `torch.backends.cusparselt` via `torch.backends.cusparselt.get_max_alg_id()` (see the usage sketch after this list)
* Fixed meta registrations for float8
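A hedged usage sketch for the new binding (requires a PyTorch build with cuSPARSELt support):
```python
import torch

if torch.backends.cusparselt.is_available():
    print(torch.backends.cusparselt.version())         # cuSPARSELt version in use
    print(torch.backends.cusparselt.get_max_alg_id())  # new binding from this PR
```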
Test Plan:
python test/test_sparse_semi_structured.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
Summary:
Populate `nn_module_stack` in `_fuse_conv_bn_qat` for replacement nodes that correspond to a `get_attr` node in the original graph.
In the new training IR, `get_attr` nodes don't have `nn_module_stack` in their node meta anymore (because `get_attr` nodes are de-duplicated, so one `get_attr` node can potentially have users in different module stacks).
We populate it by checking whether the "conv_input" or "conv_weight" replacement node has `nn_module_stack`; if not, we copy it from the conv node.
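A small sketch of that fallback copy, with hypothetical variable names (not the exact code in `_fuse_conv_bn_qat`):
```python
from torch.fx import Node

def ensure_nn_module_stack(replacement: Node, conv_node: Node) -> None:
    # If the replacement node replaced a de-duplicated get_attr and therefore
    # has no nn_module_stack, inherit it from the matched conv node.
    if "nn_module_stack" not in replacement.meta and "nn_module_stack" in conv_node.meta:
        replacement.meta["nn_module_stack"] = conv_node.meta["nn_module_stack"]
```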
Test Plan:
CI
```
buck run fbcode//caffe2/test:quantization_pt2e -- -r test_preserve_nn_module_stack
```
Differential Revision: D66393517
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141400
Approved by: https://github.com/angelayi
### Motivation
`ncclCommInitRank` needs a GPU guard (documented in NCCL).
`ncclCommAbort`, `ncclCommFinalize` and `ncclCommDestroy` may also need a GPU guard (undocumented in NCCL); otherwise, an extra CUDA context may be created (or worse, a hang); both effects have been seen before in our tests.
### Solution
This PR records a device index during `NCCLComm` object creation, so that we can add a GPU guard in the `NCCLComm` methods that call into the above NCCL APIs.
### Note
This is not a bug fix. Just a safety improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141270
Approved by: https://github.com/eqy
ghstack dependencies: #141374