Convenience improvements for building the PyTorch docs:
1. The docs build instructions weren't clear that `make html` is the main command intended to be run
2. Once you run `make html`, you need a way to view the result; opening a simple HTTP server seems like the simplest solution, so this adds a `make serve` command
Usage:
```shell
numpy ❯ make serve PORT=8080 # PORT is optional
Serving HTTP on :: port 8080 (http://[::]:8080/) ...
::1 - - [26/Nov/2024 10:05:41] "GET / HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/copybutton.css HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/katex-math.css HTTP/1.1" 200 -
```
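For reference, `make serve` is essentially a thin wrapper around Python's built-in HTTP server (the log above is that module's output). A minimal sketch of the equivalent, assuming the built HTML lives in `docs/build/html` (the exact path and port are assumptions for illustration):
```python
import http.server
from functools import partial

# Serve the built docs directory on port 8080.
handler = partial(http.server.SimpleHTTPRequestHandler, directory="docs/build/html")
with http.server.ThreadingHTTPServer(("", 8080), handler) as httpd:
    print("Serving docs at http://localhost:8080/")
    httpd.serve_forever()
```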

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141590
Approved by: https://github.com/svekars, https://github.com/malfet
This improves `PyProcessGroup` so you can override rank, world size and group name/desc methods from Python. These will be needed to support resizable process groups in torchft.
This also includes some small fixes in test_c10d_pypg.py to use threads instead of processes, which speeds up test execution by ~10x.
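A rough sketch of the kind of Python-side override this enables, loosely modeled on the process-group subclasses in test_c10d_pypg.py (the class name and the resizing logic here are hypothetical, not torchft's actual implementation):
```python
import torch.distributed as dist

class ResizablePG(dist.ProcessGroup):
    """Hypothetical Python process group whose rank/world size can change."""

    def __init__(self, rank: int, world_size: int):
        super().__init__(rank, world_size)
        self._rank = rank
        self._world_size = world_size

    # With this PR these accessors can be overridden from Python and are
    # respected by the C++ side.
    def rank(self) -> int:
        return self._rank

    def size(self) -> int:
        return self._world_size

    def resize(self, rank: int, world_size: int) -> None:
        self._rank, self._world_size = rank, world_size
```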
Test plan:
```
pytest test/distributed/test_c10d_pypg.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141529
Approved by: https://github.com/fegin
Usage:
```bash
python3 tools/packaging/split_wheel.py bdist_wheel
python3 tools/packaging/split_wheel.py install
python3 tools/packaging/split_wheel.py develop
```
Ideally this should make it easier to do the split build locally while
we're doing development.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141359
Approved by: https://github.com/kit1980
**Background:** It's common to use `scalar_tensor()` in the input to `where()` to convert any scalars present to compatible tensors with matching options, *including layout*. This shows up in various places, notably including derivative formulas ([example](78491d6afc/tools/autograd/derivatives.yaml (L432-L434))). It causes problems for NJTs because they have `layout=torch.jagged` and it never makes sense to create a scalar tensor with this layout. Some of the breakage only seems to happen in CI for reasons I don't fully understand (see the revert of #140736 due to softshrink's derivative formula).
**This PR:**
* Allows non-contiguous NJT inputs to `where()` + adds tests for this
* Handles scalar tensor / dense tensor inputs for `condition` / `other` + adds tests for this
* Uses limited `broadcast_tensors()` / `broadcast_to()` support
* Improves `expand()` to work on non-contig NJTs
* Changes `scalar_tensor()` to use `torch.strided` instead of `torch.jagged` in both eager and torch.compile (i.e. meta registration)
* Changes backward formulas for `sinc`, `pow`, `special.i1`, and `special.i1e` to use `scalar_tensor()` instead of e.g. `zeros({})`
**Alternative approach:** Update all problematic usages of `scalar_tensor()` to avoid ever passing `layout=torch.jagged`. This is an extensive change and includes `torch.where()` logic, a bunch of derivative formulas, and likely other places not yet discovered.
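For illustration, a hedged sketch of the pattern at issue: wrapping a scalar via `scalar_tensor()` before calling `where()` on a jagged NJT. After this PR the scalar is created with `torch.strided`, so a call like the one below is expected to work (exact supported input combinations are as listed in the bullets above):
```python
import torch

# Two variable-length rows packed into a jagged NJT.
njt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
# Scalar "other" as a 0-dim strided tensor; creating it with layout=torch.jagged
# would never make sense, which is what used to break derivative formulas.
zero = torch.scalar_tensor(0.0, dtype=njt.dtype, device=njt.device)
out = torch.where(njt > 0, njt, zero)
```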
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141500
Approved by: https://github.com/malfet, https://github.com/cpuhrsch, https://github.com/soulitzer
Summary: Add a helper function to put a const graph back into the top-level graph; this can be useful when we're taking const graphs from delegates.
Test Plan: CI
Reviewed By: trieuat
Differential Revision: D63031982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140854
Approved by: https://github.com/SherlockNoMad
Summary: `TestFunc` is annotated as `Callable[[object], object]`, which represents a callable that takes a single argument of any type (`object`) and returns a value of any type (`object`). However, in reality, `TestFunc` can take any number of arguments; as a result, the correct annotation is `Callable[..., object]`, which represents a callable that takes any number of arguments (including zero) and returns a value of any type (`object`).
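A small illustration of the difference (the names here are just for this example):
```python
from typing import Callable

def takes_two(a: int, b: int) -> object:
    return a + b

one_arg_only: Callable[[object], object]  # exactly one positional argument
any_arity: Callable[..., object]          # any number of arguments

any_arity = takes_two     # accepted by a type checker
one_arg_only = takes_two  # rejected by a type checker: arity mismatch
```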
Test Plan: Contbuild & OSS CI
Differential Revision: D66463705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141513
Approved by: https://github.com/wz337, https://github.com/Skylion007
Thanks to https://github.com/pytorch/pytorch/pull/137978 from @Skylion007, which bumps cuDNN to 9.5.1, the broken assumption that dO strides == O strides is fixed.
Note that there is still the restriction that the innermost stride of the grad output is 1 (this is almost always guaranteed because this condition is required of the input tensors). The main exception is test code that does e.g. `.sum().backward()`, which yields grad output tensors with strides `[0, 0, 0, 0]`.
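A minimal illustration of that corner case (purely to show the strides; not part of this PR's tests):
```python
import torch

q = torch.randn(2, 4, 8, 16, requires_grad=True)
out = q * 2.0
# The gradient flowing out of .sum() is an expanded scalar, so its strides are
# all zero and its innermost stride is not 1.
out.register_hook(lambda g: print(g.stride()))  # prints (0, 0, 0, 0)
out.sum().backward()
```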
CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141147
Approved by: https://github.com/drisspg
This PR enhances offline tuning to support multi-GPUs.
High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename
Also adds support for GemmAndBias and ScaledGemm.
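A purely illustrative Python sketch of the three steps above; the helper names and file format are hypothetical and are not the actual TunableOp API:
```python
import torch

def tune_on_device(device: int, gemm_sigs: list[str]) -> list[str]:
    # Placeholder for per-GPU tuning: in the real flow each GPU tunes its shard.
    with torch.cuda.device(device):
        return [f"{sig},solution_from_gpu{device}\n" for sig in gemm_sigs]

def offline_tune(untuned_sigs: list[str], out_prefix: str = "tunableop_results") -> None:
    unique = list(dict.fromkeys(untuned_sigs))         # 1. drop duplicate GEMMs
    ngpus = max(torch.cuda.device_count(), 1)
    shards = [unique[i::ngpus] for i in range(ngpus)]  # 2. distribute across GPUs
    results: list[str] = []
    for device, shard in enumerate(shards):            # (one process per GPU in practice)
        results.extend(tune_on_device(device, shard))
    with open(f"{out_prefix}_full.csv", "w") as f:     # 3. gather into a single "_full" file
        f.writelines(results)
```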
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727. This feature reuses CUDA events so that we don't have to destroy them, which has caused hangs in the past. We already have a bunch of tests plus testing on TorchTitan and internal workloads; so far no errors or crashes have been found, so we have decided to roll it out to all OSS users. Internal workloads are not affected by this PR because of some internal gating.
We also observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
Summary:
Another try at D66198138. The original diff had some weird issue with type checking; setting everything to int this time to get around it.
Addresses https://github.com/pytorch/pytorch/issues/91888
We use `wait` as the number of steps to wait between profiling cycles and `skip_first` to delay the start of profiling. However, once the `skip_first` steps are completed, we immediately enter the wait phase. This is not a problem if `wait` is smaller than `skip_first`, because we can just lower `skip_first`; but if `wait` is larger, the first profile starts much later than desired. For example, imagine `skip_first=1`, `wait=100`, and `repeat=2`: we do want to wait 100 steps between cycles 1 and 2, but we may not want warmup for cycle 1 to start at step 101 (which is forced because the wait occurs directly after the skipped steps). This diff addresses that by adding a flag to skip the first wait.
The new flag defaults to false so that the existing behavior is not affected.
Test Plan:
Got the following traces with this schedule:
schedule=torch.profiler.schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
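For completeness, a hedged end-to-end sketch using that schedule (the workload and trace handler below are arbitrary choices, not from this PR):
```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

sched = schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
with profile(
    activities=[ProfilerActivity.CPU],
    schedule=sched,
    on_trace_ready=lambda prof: prof.export_chrome_trace("trace.json"),
) as prof:
    for _ in range(30):
        torch.randn(128, 128) @ torch.randn(128, 128)
        prof.step()  # with skip_first_wait=1, warmup starts right after the skipped step
```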
Differential Revision: D66465860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141512
Approved by: https://github.com/aaronenyeshi
### Background:
`set_(x, y)` changes the untyped storage of `x` to be the same as `y`'s.
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
x1 = torch.ones(2,3)
y1 = torch.ones(2,3)
z1 = torch.ops.aten.set_.source_Tensor(x1, y1)
fake_tensor_mode = FakeTensorMode()
x2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
y2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
z2 = torch.ops.aten.set_.source_Tensor(x2, y2)
print(f"x1: {x1.untyped_storage()._cdata}, y1: {y1.untyped_storage()._cdata}, z1: {z1.untyped_storage()._cdata}")
print(f"x2: {x2.untyped_storage()._cdata}, y2: {y2.untyped_storage()._cdata}, z2: {z2.untyped_storage()._cdata}")
# x1: 99973024, y1: 99973024, z1: 99973024
# x2: 112107232, y2: 112107232, z2: 112107232
```
### Error before this diff
Consider this example:
```python
import torch
def fn(x):
    p = torch.nn.Parameter(x + 123)
    return p, p.sin()
opt = torch.compile(fn, fullgraph=True)
x = torch.ones(16, device="cuda", requires_grad=True)
p, r = opt(x)
r.sum().backward()
```
When running with `TORCH_LOGS=aot`, we have `set_` in the graph.
```
def forward(self, primals_1: "f32[16][1]cuda:0", primals_2: "f32[16][1]cuda:0"):
    # File: /home/boyuan/playground/inductor/donated_buffer.py:4 in fn, code: p = torch.nn.Parameter(x + 123)
    add: "f32[16][1]cuda:0" = torch.ops.aten.add.Tensor(primals_1, 123); primals_1 = None
    # File: /home/boyuan/playground/inductor/donated_buffer.py:5 in fn, code: return p, p.sin()
    sin: "f32[16][1]cuda:0" = torch.ops.aten.sin.default(add)
    # No stacktrace found for following nodes
    set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add); primals_2 = set_ = None
    return (sin, add)
```
`set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add)` should change the storage of `primals_2` to be the same as `add`. However, this is not true before this diff. We found different untyped_storage() for meta['val'] of `set_`, `add`, and `primals_2`.
This also leads to an error with donated buffer (#130580), which checks alias by untyped_storage. Since `add` and `primals_2` have different untyped_storage (which is wrong), add is wrongly marked as donated buffer.
### Root Cause
During tracing, we have args, kwargs, out, and proxy_args, proxy_kwargs, proxy_out.
We use args and kwargs to compute `out = func(*args, **kwargs)` ([Here](https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L912)). Later, we attach `out` to its proxy, essentially calling `proxy_out.node.meta["val"] = out.detach()`.
Due to the detach, the storage change happens on args but not on `proxy_args.node.meta["val"]` when func is `torch.ops.aten.set_`. I repro'ed this behavior of detach in eager code.
```python
import torch
x = torch.ones(2,3)
x_detach = x.detach()
y = torch.ones(2,3)
z = torch.ops.aten.set_.source_Tensor(x_detach, y)
print(f"x: {x.untyped_storage()._cdata}, x_detach: {x_detach.untyped_storage()._cdata}, y: {y.untyped_storage()._cdata}, z: {z.untyped_storage()._cdata}")
# x: 97023632, x_detach: 97026480, y: 97026480, z: 97026480
```
To fix the issue, this PR manually resets node.meta["val"] if the storage has changed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141308
Approved by: https://github.com/bdhirsh
# Motivation
This PR adds `XPUInductorQuantizer`, which defines the recipe for int8 quantization on the XPU backend.
# Details
`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).
We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxx` methods. Ideally, `XPUInductorQuantizer` would have no class body at all, since every implementation could be inherited from the base class.
In this PR, we override the `annotate_xxx` methods for operators that have NOT been implemented yet. Any operator the XPU backend does not implement falls back to the fp32 implementation, because the node in the graph remains a `dq-op-q` pair. This helps provide good out-of-the-box usability for the XPU backend. On the other hand, the implemented operators use the `annotate_op` implementations from the base class and can be lowered successfully.
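A hedged sketch of that inheritance pattern (the import path for the XPU quantizer and the specific overridden method are assumptions for illustration, not the final API):
```python
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

class XPUInductorQuantizer(X86InductorQuantizer):
    """Reuses the X86 int8 recipe; ideally the class body stays empty."""

    # Hypothetical override for an op the XPU backend has not implemented yet:
    # skipping annotation keeps the node as a dq-op-q pair, so it falls back
    # to the fp32 implementation.
    def _annotate_maxpool2d(self, node, quantization_config):
        return
```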
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
# Motivation
This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, as both call kernels implemented in the oneDNN library.
# Details
The implemented operators will be further integrated into the pt2e quant flow. In this PR, we validate the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend device support in the inductor lowering passes and inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends share the general optimization passes (op fusion) and the quantization inductor IR; after lowering, the final kernel is dispatched to the corresponding implementation in the oneDNN library.
In this PR, we share the same int8 quantizer as the CPU backend, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer` that customizes the pt2e behavior for the XPU backend. The capabilities of `XPUInductorQuantizer` will grow gradually along with the development of quantized operators on XPU.
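As context, a hedged sketch of the pt2e flow these kernels plug into, assuming an XPU build and device are available (the capture API shown is one of several options):
```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().to("xpu")
example = (torch.randn(1, 3, 16, 16, device="xpu"),)

exported = torch.export.export_for_training(model, example).module()
quantizer = X86InductorQuantizer().set_global(get_default_x86_inductor_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example)                      # calibration
converted = convert_pt2e(prepared)
compiled = torch.compile(converted)     # inductor lowers to the onednn::qconv* kernels
compiled(*example)
```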
# Validation
* UT testing
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
-k test_qconv2d_xpu \
-k test_qconv2d_silu_xpu \
-k test_qconv2d_relu6_xpu \
-k test_qconv2d_hardtanh_xpu \
-k test_qconv2d_hardswish_xpu
```
* Runtime exemplification
```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945
#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
- Pass `group_name` to `CUDASymmetricMemory::alloc()` instead of `CUDASymmetricMemory::rendezvous()`. We can only move the argument to rendezvous() once all the underlying operators do the same.
- Added `float` to the allowlist for intra-node all-reduces.
- Added a warning when `IntraNodeComm::rendezvous()` is performed with overlapping devices among participants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141200
Approved by: https://github.com/weifengpy, https://github.com/kwen2501
The old implementation of `SetVariable.call_method("update", ...)` was incorrect because it didn't handle iterable inputs. This patch removes the input type restriction altogether and implements the method as a polyfill (like most of the other set methods).
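A minimal sketch of what a polyfill-style `update` looks like, assuming standard Python set semantics (illustrative only, not the exact polyfill added to Dynamo):
```python
def set_update(self, *others):
    # Accepts arbitrary iterables, not just sets.
    for other in others:
        for item in other:
            self.add(item)

s = {1, 2}
set_update(s, [2, 3], (x for x in (4, 5)))
assert s == {1, 2, 3, 4, 5}
```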
Fixes #141283.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141286
Approved by: https://github.com/anijain2305
With largish systems of nn modules with buffers, sinking params suffered from some kind of exponential blowup that is easily fixed by using a set instead of a list to keep track of unlifted buffer placeholders.
Test Plan: added a random DAG test that failed previously
Differential Revision: D66457661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141494
Approved by: https://github.com/angelayi
Using the same `tools/generate_torch_version.py` script.
It's already available at the Python level, but not at the C++ one.
Please note that updating the commit hash will force recompilation of fewer than 10 files, according to:
```
% touch caffe2/core/macros.h; ninja -d explain -j1 -v -n torch_python
ninja explain: output caffe2/torch/CMakeFiles/gen_torch_version doesn't exist
ninja explain: caffe2/torch/CMakeFiles/gen_torch_version is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/version.py is dirty
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546390618881 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546233600752 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546651089243 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546224176845 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546464535054 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301550062608920 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301547538843492 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o is dirty
```
Differential Revision: [D66468257](https://our.internmc.facebook.com/intern/diff/D66468257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141313
Approved by: https://github.com/ezyang
Summary:
Splitting this PR into two: one for the cuSPARSELt improvements, and one for the inductor lowering.
This PR adds the additional cuSPARSELt bindings into PyTorch.
* `torch._cslt_sparse_mm_search` will be deprecated in a future PR, so a warning has been added
* Added a header file for cuSPARSELtOps.cpp
* `max_id` is now available in `torch.backends.cusparselt` via `torch.backends.cusparselt.get_max_alg_id()` (see the usage sketch after this list)
* Fixed meta registrations for float8
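A hedged usage sketch for the new binding (requires a PyTorch build with cuSPARSELt support):
```python
import torch

if torch.backends.cusparselt.is_available():
    print(torch.backends.cusparselt.version())         # cuSPARSELt version in use
    print(torch.backends.cusparselt.get_max_alg_id())  # new binding from this PR
```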
Test Plan:
python test/test_sparse_semi_structured.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
Summary:
Populate `nn_module_stack` in `_fuse_conv_bn_qat` for replacement nodes that correspond to a `get_attr` node in the original graph.
In the new training IR, `get_attr` nodes don't have `nn_module_stack` in their node meta anymore (because `get_attr` nodes are de-duplicated, so one `get_attr` node can potentially have users in different module stacks).
We populate it by checking whether the "conv_input" or "conv_weight" replacement node has `nn_module_stack`; if not, we copy it from the conv node.
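A small sketch of that fallback copy, with hypothetical variable names (not the exact code in `_fuse_conv_bn_qat`):
```python
from torch.fx import Node

def ensure_nn_module_stack(replacement: Node, conv_node: Node) -> None:
    # If the replacement node replaced a de-duplicated get_attr and therefore
    # has no nn_module_stack, inherit it from the matched conv node.
    if "nn_module_stack" not in replacement.meta and "nn_module_stack" in conv_node.meta:
        replacement.meta["nn_module_stack"] = conv_node.meta["nn_module_stack"]
```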
Test Plan:
CI
```
buck run fbcode//caffe2/test:quantization_pt2e -- -r test_preserve_nn_module_stack
```
Differential Revision: D66393517
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141400
Approved by: https://github.com/angelayi
### Motivation
`ncclCommInitRank` needs a GPU guard (documented in NCCL).
`ncclCommAbort`, `ncclCommFinalize` and `ncclCommDestroy` may also need a GPU guard (undocumented in NCCL); otherwise, an extra CUDA context may be created (or worse, a hang); both effects have been seen before in our tests.
### Solution
This PR records a device index during `NCCLComm` object creation, so that we can add a GPU guard in the `NCCLComm` methods that call into the above NCCL APIs.
### Note
This is not a bug fix. Just a safety improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141270
Approved by: https://github.com/eqy
ghstack dependencies: #141374