Commit Graph

43731 Commits

Author SHA1 Message Date
cyy
d558c1a047 Enable cppcoreguidelines-special-member-functions (#139132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132
Approved by: https://github.com/sraikund16
2024-11-06 13:42:20 +00:00
Nikita Shulga
c0c6bf4ef2 Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353
2024-11-06 13:34:45 +00:00
PyTorch MergeBot
44e4949bcf Revert "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)"
This reverts commit 22e89ea2aa.

Reverted https://github.com/pytorch/pytorch/pull/139595 on behalf of https://github.com/malfet due to It broke number of tests, see 22e89ea2aa ([comment](https://github.com/pytorch/pytorch/pull/139595#issuecomment-2459754355))
2024-11-06 13:31:26 +00:00
PyTorch MergeBot
10d7729333 Revert "Enable cppcoreguidelines-special-member-functions (#139132)"
This reverts commit a9b4989c72.

Reverted https://github.com/pytorch/pytorch/pull/139132 on behalf of https://github.com/ZainRizvi due to Sorry but this fails on trunk. See inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_smooth_quant_with_int_mm [GH job link](https://github.com/pytorch/pytorch/actions/runs/11699366379/job/32591132460) [HUD commit link](22e89ea2aa) ([comment](https://github.com/pytorch/pytorch/pull/139132#issuecomment-2459743145))
2024-11-06 13:27:42 +00:00
PyTorch MergeBot
53299b8a38 Revert "Don't use deprecated type properties in UpsampleKernel (#139399)"
This reverts commit 0058f71002.

Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/malfet due to And it was backed out again due to the internal usages of deprecated API ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2459740090))
2024-11-06 13:23:43 +00:00
Jack Taylor
5f266b5a02 [ROCm] re-enable flex attention UTs (#139632)
https://github.com/pytorch/pytorch/pull/136792 accidentally disabled flex attention UTs on ROCm. Re-enabling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139632
Approved by: https://github.com/drisspg
2024-11-06 12:49:44 +00:00
Michael Lazos
d622b490d6 [Dynamo] Support tensor mro without source (#139838)
Fixes https://github.com/pytorch/pytorch/issues/137743

The issue here is that if `type` was called on a tensor without a source, we wouldn't have a source even for `torch.Tensor`, and the `__mro__` retrieval would fail. Since `torch.Tensor` is an internal torch type, I add handling for it in `call_type` in builtins.
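A minimal repro sketch of the failure mode (the shape and ops are illustrative, not the code from the linked issue):

```python
import torch

@torch.compile
def f(x):
    t = x * 2                      # intermediate tensor: it has no source
    return len(type(t).__mro__)    # retrieving __mro__ here used to fail

print(f(torch.randn(3)))           # length of torch.Tensor's MRO
```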

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139838
Approved by: https://github.com/williamwen42
2024-11-06 08:52:53 +00:00
cyy
a9b4989c72 Enable cppcoreguidelines-special-member-functions (#139132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132
Approved by: https://github.com/sraikund16
2024-11-06 07:59:09 +00:00
Xia, Weiwen
22e89ea2aa [Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of the oneDNN library, so in that case we set the activation scale to 1 and bias to none, and apply the scales and add the bias after `onednn.qlinear_pointwise`.
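For reference, a hedged sketch of the unfused computation that the no-bias pattern corresponds to (tensor names and shapes are illustrative, not taken from the Torchao code):

```python
import torch

def smoothquant_linear_unfused(a_int8, b_int8, a_scale, b_scale):
    # a_int8: [B, T, K] int8 activation, b_int8: [K, N] int8 weight
    B, T, K = a_int8.shape
    c = torch._int_mm(a_int8.reshape(B * T, K), b_int8)  # reshape -> _int_mm (int32 output)
    c = c.to(torch.float32)                              # convert_element_type
    c = c * b_scale                                      # (expand ->) mul with weight scale
    c = c * a_scale                                      # mul with activation scale
    return c.reshape(B, T, -1)                           # reshape back to 3D
```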

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-11-06 07:54:47 +00:00
Huy Do
c19c384690 Fix torch.load (torch.utils.benchmark) after #137602 (#139810)
After #137602, the default `weights_only` has been set to True.  This test is failing in trunk slow jobs atm
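For context, a small sketch of what the default flip means for callers (the file name is illustrative):

```python
import torch

torch.save({"w": torch.randn(2)}, "ckpt.pt")
obj = torch.load("ckpt.pt")                      # weights_only=True is now the default
obj = torch.load("ckpt.pt", weights_only=False)  # opt back into full unpickling
```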

benchmark_utils/test_benchmark_utils.py::TestBenchmarkUtils::test_collect_callgrind [GH job link](https://github.com/pytorch/pytorch/actions/runs/11672436111/job/32502454946) [HUD commit link](1aa71be56c)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139810
Approved by: https://github.com/kit1980
2024-11-06 03:08:29 +00:00
Colin Peppler
63b01f328e [inductor] support masked_scatter w/ unbacked sized source (#138083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138083
Approved by: https://github.com/jansel
2024-11-06 02:16:25 +00:00
cyy
028c5d3426 [2/N] Replace c10::sv with std::sv (#139456)
Follows  #139453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139456
Approved by: https://github.com/ezyang
2024-11-06 01:50:38 +00:00
Andrew Gu
39ede99a33 Add current FSDP2 path to old composable FSDP1 warning (#139759)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139759
Approved by: https://github.com/weifengpy, https://github.com/wz337
ghstack dependencies: #139650
2024-11-06 01:43:04 +00:00
David Berard
aec179e2be Fix docs for logcumsumexp formula (#139768)
The previous formula was wrong and reused some indexing variables.
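For reference, the intended form with a distinct summation index looks like this (a hedged reconstruction, not a verbatim quote of the fixed docs):

$$\operatorname{logcumsumexp}(x)_{ij} = \log \sum_{k=0}^{j} \exp(x_{ik})$$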
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139768
Approved by: https://github.com/janeyx99
2024-11-06 01:19:09 +00:00
Laith Sakka
a787320d0f Do not try to optimize new implications in get_implications (#139738)
Summary:
Saves around 8% on the torchrec model.
In most cases the new implications are not optimizations anyway; in some cases they are,
but optimizing them is useless.

ex:
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding True
adding False
```

VS
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding 0 <= Mod(s0, 3)
adding Mod(s0, 3) < 0
```
The main difference is that 0 <= Mod(s0, 3) can be simplified to True and Mod(s0, 3) < 0 to False, but with this change
that simplification won't happen. True: True and False: False implications are useless anyway, so it's ok.
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```

<img width="1082" alt="Screenshot 2024-11-04 at 9 25 51 PM" src="https://github.com/user-attachments/assets/a26e291b-9280-4b55-9275-f3201a36ac51">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139738
Approved by: https://github.com/ezyang
ghstack dependencies: #139703
2024-11-06 00:23:40 +00:00
Will Feng
6a30c14a0a [Traceable FSDP2] Run any unexecuted post_backward at beginning of pre_backward hook (#139671)
Assuming the forward pass user code looks like:
```
for _ in range(2):
    x = layer(x)
```
and we have `fully_shard(layer)`, then:
- the forward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time -> reshard layer" (currently same for both eager and compile)
- the backward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time -> reshard layer" in eager, but currently it's "unshard layer -> call layer 1st time -> call layer 2nd time -> reshard layer" in compile

The behavior in the backward pass is different between eager and compile, which is not ideal.

I am currently looking for a way to fix this non-ideal behavior of compile and have tried a few things:
1. Tracing the RegisterPostBackwardFunction custom autograd function - this still seems to be a no-go, due to HOP not supporting side effects.
2. Instead of custom autograd function, do a "multi-grad hook" to wait for all gradients to be ready before triggering post_backward. However, this approach seems to have bad interaction with register_hook of pre_backward, in the sense that it's unclear which of them will be triggered first in practice.
3. Force execute any pending post_backward before unshard in pre_backward hook, and rely on compiler to move the reshard to the right place to optimize peak memory. -> This PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139671
Approved by: https://github.com/awgu
2024-11-06 00:19:06 +00:00
Xiaodong Wang
e7cf7d00be Support torch.bool in torch.sort + CUDA (#139409)
Summary: This might be outdated, so I'm adding it back to see if we pass all the tests. I'm pretty sure cuda12 is ok.

Test Plan: CI

Differential Revision: D65282650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409
Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy
2024-11-06 00:02:54 +00:00
Aaron Orenstein
06f619d999 typing ir.py - part 2 (#131846)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131846
Approved by: https://github.com/eellison
ghstack dependencies: #139238
2024-11-06 00:01:15 +00:00
Aaron Orenstein
c2109ec479 typing ir.py - Disallow untyped defs for ir.py (#139238)
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`)
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-06 00:01:15 +00:00
leslie-fang-intel
82e4de4994 [Inductor][CPU] Enable the oneDNN Linear fusion for special case (#139172)
**Summary**
In the case of LLaMA2, a linear operation with an activation size of `(4, 1, 4096)` and a stride of `(4096, 128, 1)` is decomposed into `matmul`, and the decomposition of `matmul` results in `bmm` due to a strict continuity check. We can align the continuity check with ATen by skipping dims of size 1 to enable decomposition into `mm` instead.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_input_non_contiguous_3D_wo_bias
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139172
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-11-05 23:49:53 +00:00
Thomas Bohnstingl
d1c26b0781 Improvements for associative_scan - slicing of xs (#138858)
In this PR, the combine_fn is consistently called with a slice along the scan dim. It implements part of https://github.com/pytorch/pytorch/pull/136966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138858
Approved by: https://github.com/ydwu4
2024-11-05 23:38:21 +00:00
Mikayla Gawarecki
86d7d39bff Forward fix D65441551 for T206731737 (#139767)
Test Plan: -

Differential Revision: D65482429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139767
Approved by: https://github.com/awgu
2024-11-05 23:19:08 +00:00
Shuqiang Zhang
c0d642a295 [pgnccl][simple] log started work numel (#139773)
Summary:
We saw some cases where the same work was started on multiple ranks, but
did not complete. This info could help us check whether the numel matches across ranks.
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139773
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-11-05 23:11:19 +00:00
PyTorch MergeBot
1d28b8b6d5 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit e84d1121ad.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. More details in D65483292 ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2458381056))
2024-11-05 23:10:38 +00:00
drisspg
16da289402 [Workspace Inductor] Fix dynamic shapes (#139777)
# Summary
Arg ordering was wrong when dynamic shapes are enabled and we pass in the additional size args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139777
Approved by: https://github.com/eellison
ghstack dependencies: #139157
2024-11-05 22:34:09 +00:00
Animesh Jain
b09eb6ed6a [dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)
This is a bug on main exposed by https://github.com/pytorch/pytorch/issues/139476

We have a dict tag optimization where, if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors as immutable in such scenarios. This is critical for
guard eval performance, because generally users don't change their
parameters.

If I try to remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.

So, I am adding a flag which keeps the current state but allows
users to remove this optimization. Not ideal, but given how important guard eval perf is,
we are in the gray area of the unsoundness vs performance tradeoff.
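As an illustration of that tradeoff (an assumed scenario, not the repro from the linked issue): an in-place change to a parameter does not change the dict tag of `_parameters`, so with the optimization on the corresponding tensor guards may be skipped.

```python
import torch

m = torch.nn.Linear(2, 2)
cf = torch.compile(m)
cf(torch.randn(1, 2))
m.weight.requires_grad_(False)  # in-place mutation: the dict tag of m._parameters is unchanged
cf(torch.randn(1, 2))           # with the optimization, guards on the tensor may not re-run
```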

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
2024-11-05 21:48:07 +00:00
Yidi Wu
6734cb7bf2 [hop free symbols] refactor tensor.to_list implementation to call wrap_fx_proxy. (#139663)
Refactoring only. Previously, we manually called SymNodeVariable.create; now we handle it with wrap_fx_proxy. This unifies the handling of operations that produce symints in wrap_fx_proxy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139663
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737, #138559
2024-11-05 20:19:09 +00:00
rzou
b9f0563aaf Add repro instructions to fx_graph_runnable.py (#139481)
This PR adds some instructions for how to add a TARGETS file to run the
fx_graph_runnable script. I'm planning to add some followups that will
add additional imports for custom ops and use autodeps to get the
dependencies, but I figure this PR is an easy first step.

Test Plan:
- pytest test/dynamo/test_structured_trace.py
- Does anyone have suggestions for how to test this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139481
Approved by: https://github.com/eellison
2024-11-05 19:24:16 +00:00
Ryan Guo
01bcf37123 [dynamo][NFC] Remove some dead code paths (#139674)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139674
Approved by: https://github.com/Skylion007, https://github.com/anijain2305, https://github.com/mlazos
2024-11-05 19:12:17 +00:00
Ryan Guo
2b3a227b35 [dynamo] Add is_mutable() and is_immutable() methods to VariableTracker (#139341)
This patch adds 2 simple methods `VariableTracker.is_mutable()` and
`VariableTracker.is_immutable()`, which helps clarify intention. For
instance, rather than writing
```python
if var.mutation_type:
    ...
```
After this patch one can write
```python
if var.is_mutable():
    ...
```

This patch also simplifies `mutation_type` propagation in some
`ListVariable` methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139341
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #139339, #139340
2024-11-05 19:11:41 +00:00
Ryan Guo
0ba3962b80 [dynamo][NFC] Move MutationType classes into variables/base.py (#139340)
As title, this addresses
https://github.com/pytorch/pytorch/pull/137905/files#r1806800222.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139340
Approved by: https://github.com/anijain2305
ghstack dependencies: #139339
2024-11-05 19:11:41 +00:00
Ryan Guo
693a0a1bd4 [dynamo][NFC] Rename mutable_local and add documentation (#139339)
This patch addresses the renaming part of #133027, specifically, it
renames the following and adds documentation for relevant classes.
1. `VariableTracker.mutable_local` to `mutation_type`
2. `MatableLocal `to `ValueMutationNew`
3. `MutableSideEffects `to `ValueMutationExisting`
4. `MutableLocalSource` to `SourceType`
5. `MutableLocalSource.Local` to `New`

Note that (2), (3) and (5) are mainly to bring consistency between them
and `AttributeMutationNew`, `AttributeMutationExisting`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2024-11-05 19:11:41 +00:00
Ke Wen
5f2ed505eb [PGNCCL] Watchdog prints call-time traceback when reporting timeout (#139659)
### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with user's program or library stack.

### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.

The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks divert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-11-05 19:07:17 +00:00
Yifu Wang
ee42a99745 [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-05 18:47:24 +00:00
Boyuan Feng
87059d4547 [AOTAutograd] Handle edge cases for donated buffer & enable in oss (#139669)
This PR enables donated buffer in OSS and handles two edge cases:

1. While donated buffer relies on storage to check aliasing, sparse tensor subclasses do not provide access to storage. So we skip sparse tensor subclasses for donated buffer.
2. Handles missing "val" from n.meta. This is observed from `inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_11_cpu`,
`functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_simple_with_none_and_nontensor`, and
`inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139669
Approved by: https://github.com/bdhirsh
2024-11-05 18:38:20 +00:00
rzou
27ec3921bc Optimize mutable torch.library.custom_op overhead (#139513)
We don't need to loop over all the args and kwargs in the
ADInplaceOrView key; we just need to bump the version on the args and
kwargs that are mutable.
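For context, a mutable custom op of the kind this overhead applies to might be declared like this (illustrative op name, not the benchmark from the linked issue):

```python
import torch

@torch.library.custom_op("mylib::inplace_scale_", mutates_args={"x"})
def inplace_scale_(x: torch.Tensor, factor: float) -> None:
    x.mul_(factor)  # only `x` is mutable, so only its version counter needs bumping

t = torch.randn(4)
inplace_scale_(t, 2.0)
```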

On the benchmark mentioned in
https://github.com/pytorch/pytorch/issues/139494
this made the time go from
```
mutate2 = 61.72943878173828
no_mutate2 = 36.89440155029297
mutate = 236.3092498779297
no_mutate = 59.31964874267578

```
to
```
mutate2 = 47.976478576660156
no_mutate2 = 38.37468719482422
mutate = 71.21315002441406
no_mutate = 59.7432975769043
```

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139513
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139509
2024-11-05 18:30:53 +00:00
Tomasz Bohutyn
9dc5851f5d handle more devices in method_type method of TensorVariable (#138078)
Fixes #138077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138078
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-11-05 18:19:52 +00:00
Angela Yi
de509abe1c [export] Dedup data-dependent errors based on stacktrace (#139540)
Summary:
Dedup the data-dependent errors based on the stacktrace they point to. Right now we just display every propagate-real-tensor log that shows up, but we can actually dedup them if they are due to the same piece of code (e.g., there could be multiple calls to a piece of code that does some data-dependent computation).

This occurred when trying out draft export on the PT2I model zoo. For a specific model, previously we would get ~3k data dependent errors, but after deduping based on the stacktrace we now only get 4 errors.

Test Plan: CI

Differential Revision: D65374254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139540
Approved by: https://github.com/pianpwk, https://github.com/zou3519
2024-11-05 18:16:05 +00:00
Sam Ginzburg
cc25b6d7ba [inductor] Error on unsupported autotuner configs (#139658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139658
Approved by: https://github.com/aakhundov
2024-11-05 18:09:02 +00:00
Junjie Wang (PyTorch)
41e4d88584 [logging][ez] Add timer logging for pickling and unpickle for object based collective (#139757)
Summary: As discussed, we want to measure the time spent on pickling and unpickling.

Test Plan: CI

Reviewed By: wz337

Differential Revision: D65462767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139757
Approved by: https://github.com/awgu, https://github.com/Skylion007, https://github.com/fegin, https://github.com/c-p-i-o
2024-11-05 17:40:27 +00:00
Oguz Ulgen
c0d21b6581 End TritonBundle on non-cache write codepaths (#139698)
Summary:
When we bypass the cache write in Inductor, we were also forgetting to reset the bundle; this moves resetting the bundle into the post_compile step so it gets uniformly reset.

This diff also turns on the cache internally so that we can do a code rollout.

Test Plan: updated tests

Differential Revision: D65457224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
2024-11-05 17:00:40 +00:00
PyTorch MergeBot
4d5cc1b4ef Revert "[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)"
This reverts commit e6ff07f00e.

Reverted https://github.com/pytorch/pytorch/pull/139560 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking internal tests. Please see D65430317 for more details ([comment](https://github.com/pytorch/pytorch/pull/139560#issuecomment-2457620720))
2024-11-05 16:22:30 +00:00
cyy
a2bc2e38f9 Use clang-tidy 17 (#139678)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139678
Approved by: https://github.com/Skylion007
2024-11-05 16:00:25 +00:00
Junjie Wang (PyTorch)
13eb3b3f6f [Torch Elastic] Fix the bug caused by wrong host address in creating TCPStore server inside dynamic rendezvous (#139702)
Summary: During dynamic rendezvous, we shouldn't use the address from the store; we should just use `self._this_node.addr` directly, because sometimes the store host is not the host of rank 0. Passing the wrong host will cause a timeout error. This is a follow-up fix to S463164; for internal tests, we disable the TCPStore sharing for now.

Test Plan: CI.

Differential Revision: D65453312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
2024-11-05 15:28:03 +00:00
Edward Z. Yang
349cd49406 Fix compiler collective TORCH_TRACE and improve code state printing (#139716)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139716
Approved by: https://github.com/yf225
2024-11-05 14:32:52 +00:00
cyy
546318e559 [7/N] Don't skip ASAN on some tests (#139675)
Follows #139565
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139675
Approved by: https://github.com/ezyang
2024-11-05 14:01:01 +00:00
Xuehai Pan
e84d1121ad Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-05 10:44:56 +00:00
zeshengzong
ffb7a08921 Fix torch.histc not checking min > max on cuda for int8 tensors (#139372)
Fixes #139360

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)

Assigning `min` and `max` to the low-precision `input_t` variables `minvalue` and `maxvalue` causes a wrong comparison result in the following check here:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)

![image](https://github.com/user-attachments/assets/0d5c87f4-3dc6-48bb-bcc8-b1803e7cd487)

Change the type of `minvalue` and `maxvalue` to fix it, similar to these lines:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)
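A hedged repro sketch of the kind of call affected (the concrete bounds are my own illustration of the description, not the values from the linked issue):

```python
import torch

x = torch.zeros(8, dtype=torch.int8, device="cuda")
# min > max should be rejected; before the fix, casting the bounds to the
# low-precision input type could wrap them so the min/max check passed silently.
torch.histc(x, bins=4, min=200, max=100)
```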

**Test Result**
```bash
$ pytest test/test_reductions.py -vv
```
![image](https://github.com/user-attachments/assets/6b5d0d48-ebc2-4a8c-85f4-dbad147c086c)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/f97c2d6d-78ea-4439-a1ba-907bc9defad7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372
Approved by: https://github.com/eqy
2024-11-05 08:42:38 +00:00
Laith Sakka
6ad52db8c8 use torch.sym_sum instead of incremental sum in _cat_meta (#139653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139653
Approved by: https://github.com/ezyang
2024-11-05 07:24:24 +00:00
Aaron Orenstein
51a3d6dbc3 Fix existing lint issues in ir.py (#139237)
- Remove stale mypy "type: ignores"
- Made ir.py pass the rest of the lints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139237
Approved by: https://github.com/Skylion007
2024-11-05 06:06:12 +00:00
Eli Simhayev
b2f5a5311b RMSNorms docs - remove biases initialization (#139620)
RMSNorm doesn't use a bias in `elementwise_affine`, so I've removed it from the documentation.
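A quick check that reflects this (assuming the current `nn.RMSNorm` API):

```python
import torch.nn as nn

m = nn.RMSNorm(64, elementwise_affine=True)
print([name for name, _ in m.named_parameters()])  # ['weight'] -- no bias parameter
```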
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139620
Approved by: https://github.com/mikaylagawarecki
2024-11-05 05:59:41 +00:00
Chen, Zejun
9aaf3a04fa [profiler][UT] instantiate profiler UTs for devices and enable UTs for xpu profiler (#134316)
This PR makes the profiler-related UTs device-agnostic. It instantiates the profiler UTs for different device types and enables them on the XPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134316
Approved by: https://github.com/etaf, https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-11-05 05:46:13 +00:00
CaoE
9e14d86573 [Inductor][CPP] Add oneDNN BRGEMM config for Half cpp gemm template (#136255)
`kernel_micro_gemm` generated using BRGEMM:
```
template <bool accum>
inline void kernel_micro_gemm(
    const half* __restrict__ A,
    const half* __restrict__ B,
    float* __restrict__ C,
    int64_t M,
    int64_t N,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc
) {
    at::native::cpublas::brgemm(
      M, N, K,
      lda, ldb, ldc,
      1.f, accum ? 1.f : 0.f,
      A,
      B,
      C);
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136255
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 05:33:29 +00:00
Meet Vadakkanchery
c8a55eea88 [DCP] Fix process_group logging for DCP methods (#139428)
Summary:
Currently, we incorrectly log process_group for DCP based events.

We rely on [c10d_logger.py](https://fburl.com/v4mdme9z) to fill in information about process_group (e.g. backend, nccl_version if available).

In [checkpoint/logger.py](https://fburl.com/yho9nqbu) we pass the `msg_dict` to c10d_logger which never contains the `process_group` param, so [c10d_logger](https://fburl.com/zlw2ukxp) logs information about the default process_group which is always `NCCL`.

Test Plan:
Before:

Always defaults to NCCL even though GLOO is passed by caller.

{F1950847585}

After:

GLOO backend shows up.

{F1950848375}

Differential Revision: D65255871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139428
Approved by: https://github.com/teja-rao, https://github.com/mhorowitz
2024-11-05 05:24:38 +00:00
Animesh Jain
fe4fa1df9f [dynamo][eval_frame] Set the callback to None earlier for guard eval (#139655)
xref - https://fb.workplace.com/groups/1075192433118967/permalink/1536570810314458/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139655
Approved by: https://github.com/jansel, https://github.com/williamwen42
2024-11-05 05:18:46 +00:00
Gabriel Ferns
a766d84a3c Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
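A hedged sketch of that test pattern (the function and shapes are illustrative, not the actual unit test):

```python
import torch

def f(a, b, weight, bias):
    x = a @ b
    # the matmul output buffer's other users are inconsequential here,
    # so layer_norm can now reuse it in place
    return torch.nn.functional.layer_norm(x, x.shape[-1:], weight, bias)

cf = torch.compile(f)
out = cf(torch.randn(8, 16), torch.randn(16, 32), torch.randn(32), torch.randn(32))
```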

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-11-05 03:44:09 +00:00
Andrew Gu
9039fbb47e [FSDP2] Make module-to-state mapping use weakrefs (#139650)
Without this, `del model` does not free memory of a module with FSDP2 applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139650
Approved by: https://github.com/yf225
2024-11-05 02:16:52 +00:00
cyy
5008d15ae9 [2/N] Remove usage of C array (#139589)
Follows  #139567
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139589
Approved by: https://github.com/ezyang
2024-11-05 01:58:12 +00:00
CaoE
3672c688e3 Fix layout for SetSourceTensorKernel (#137973)
Fixes #136837.
`aten.set_.source_Tensor` will make the size and stride of the first input and output follow that of the second input: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorShape.cpp#L440. If the layouts of the two inputs are different, the following `assert_size_stride` will fail.
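For reference, a small eager-mode illustration of that aliasing behavior of `Tensor.set_` (shapes are illustrative):

```python
import torch

src = torch.randn(4, 3).transpose(0, 1)  # size (3, 4), stride (1, 3): non-contiguous layout
dst = torch.empty(0)
dst.set_(src)                             # dst now shares storage and takes src's size/stride
assert dst.size() == src.size() and dst.stride() == src.stride()
```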

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137973
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 00:55:17 +00:00
Edward Yang
639162f39a Add cache size to pt2_compile_events (#139627)
Summary:
I realized I wanted to check "are my cache entries/IO unreasonably large"
and there's no easy way to do it.  This lets me do it.

Test Plan: servicelab

Differential Revision: D65390363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139627
Approved by: https://github.com/c00w
2024-11-05 00:30:10 +00:00
Nikita Shulga
0058f71002 Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353, #139358
2024-11-05 00:29:58 +00:00
PyTorch MergeBot
4a3ee96427 Revert "Don't use deprecated type properties in UpsampleKernel (#139399)"
This reverts commit 9d096e4d9f.

Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/ZainRizvi due to Change reverted internally due to broken builds. See D65378845 ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2455959040))
2024-11-05 00:13:48 +00:00
cyy
64d9ee88d7 [11/N] Fix extra warnings brought by clang-tidy-17 (#139599)
Follows #139385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139599
Approved by: https://github.com/sraikund16
2024-11-04 23:57:41 +00:00
Laith Sakka
3f248a5735 Classify miss-inplaced tensors in logs. (#139240)
Summary:
Use signpost logs.
A follow-up is to remove the field possibly_missed_reinplacing_opportunities from the dynamo compile table.

Differential Revision: D65180194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139240
Approved by: https://github.com/zou3519
2024-11-04 23:56:14 +00:00
Mikayla Gawarecki
e947649e8f [BE] Change _marked_safe_globals_list to set (#139303)
Prevent same global from being added multiple times

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139303
Approved by: https://github.com/janeyx99
ghstack dependencies: #138936, #139221, #139433, #139541, #137602
2024-11-04 23:50:55 +00:00
Pian Pawakapan
a678eaf1ad check fake/real mismatches during real tensor prop (#137747)
Summary:
While testing exportability for PT2 Inference models, we found various cases of invalid op inputs during tracing, for example errors like: `a and b must have same reduction dim`, `expected scalar type Long but found Int`, etc. Looking more closely, these happened due to the same few meta kernels & eager kernels producing mismatched outputs upstream (e.g. different output tensor dtype, int output).

Adding checks to catch mismatched outputs in real tensor prop upstream, so errors are raised at the mismatched op, instead of the downstream ops taking them as inputs. Relies a lot on utils from [CrossRefFakeMode](929797dedb/torch/_subclasses/fake_utils.py (L78))

Follow ups: could add more checks, and maybe have a flag to only enable these for cases like draft mode, so perf doesn't suffer?

Test Plan: test_export, test_fake_tensor

Differential Revision: D64210055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137747
Approved by: https://github.com/zou3519
2024-11-04 23:39:48 +00:00
Bob Ren
9919932783 Specialize symfloats that flow through is_integer (#139572)
Fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_is_integer_num_type6_dynamic_shapes` when specialize_float = False
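A hedged sketch of the kind of code involved (not the actual failing test): a Python float argument flows through `float.is_integer()` inside a compiled function, and with `specialize_float=False` the resulting symfloat is now specialized at that point.

```python
import torch

@torch.compile
def f(x, v: float):
    return x + 1 if v.is_integer() else x - 1

print(f(torch.randn(3), 2.0))
```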

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139572
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568
2024-11-04 23:35:35 +00:00
Henry Tsang
350bc2a166 [export] Add support for symbool to make it usable for torch.cond (#138765)
# Why?

I want the following code to work.

minimal repro:
```
class M(torch.nn.Module):
    def forward(self, dilate_flag):
        return dilate_flag.item()

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()

ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```

error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program

second error will be handled by https://github.com/pytorch/pytorch/pull/138760

# Motivation

I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.

```
class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.dilate_flag = 0

    def forward(self, dilate_flag):
        self.dilate_flag = dilate_flag.item()

        def true_fn(dilate_flag):
            return dilate_flag.clone()

        def false_fn(dilate_flag):
            return dilate_flag.clone()

        torch.cond(
            self.dilate_flag,
            true_fn,
            false_fn,
            (dilate_flag,),
        )
        return self.dilate_flag

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()

for input in inputs:
    expected_output = model(*input)

    ep = torch.export.export(model, input, strict=False)
    path = torch._inductor.aot_compile(ep.module(), input)
    aot_model = torch._export.aot_load(path, device="cuda")
    actual_output = aot_model(*input)

    assert (
        expected_output == actual_output
    ), f"henry they are not equal {expected_output} != {actual_output}"
```

Differential Revision: D64867504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
2024-11-04 23:31:49 +00:00
PyTorch MergeBot
6add86a29f Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit bf5cd8d011.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking lint on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11673543178/job/32504499599) [HUD commit link](bf5cd8d011) ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2455908056))
2024-11-04 23:30:15 +00:00
Jane Xu
23169a6bcc Disable foreach tests for complex128 internally (#139649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139649
Approved by: https://github.com/ngimel
2024-11-04 23:24:47 +00:00
Tugsbayasgalan Manlaibaatar
87a379b61b Move pippy to training IR (#139233)
Differential Revision: [D65282662](https://our.internmc.facebook.com/intern/diff/D65282662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139233
Approved by: https://github.com/kwen2501
ghstack dependencies: #138658, #139209
2024-11-04 23:07:14 +00:00
Yidi Wu
397938b453 [hop free symbols][refactor] lift freevar to parent graph before lifting to subgraph (#138559)
This refactoring is for getting a deterministic ordering of binding tensors and the sizes of tensors. When seeing a free tensor x with shape (s0,) in a subgraph, the ordering of lifting changes from
```
lift_x_in_child, lift_s0_in_child, lift_s0_in_parent, lift_x_in_parent
```
to
```
lift_x_in_parent, lift_s0_in_parent, lift_x_in_child, lift_s0_in_child
```
This produces a deterministic ordering of handling the symints in lifted tensors.

This is also the current contract of dynamo top-level graph: we lift free_symbols in sizes after tensor x and insert the free symbols before the tensor x's proxy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138559
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737
2024-11-04 22:48:14 +00:00
Yidi Wu
c5b79699e1 [hop free symbols] replace ctx.save_for_backward to support symints/ints (#138737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138737
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Chillee
ghstack dependencies: #138345, #138428, #138558
2024-11-04 22:48:14 +00:00
Yidi Wu
ac20d0f893 [hop free symbols][refactor] make map's save_for_backward to handle int (#138558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138558
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428
2024-11-04 22:48:07 +00:00
Yidi Wu
dc3a6a9d08 [hop free symbols][refactor] make create_graph_input always take example_value (#138428)
Code refactoring only. We move the wrap-to-fake-tensor logic out of wrap_fx_proxy for placeholders to provide the invariant that **all graph inputs must set their example values when creating the inputs**. This invariant helps us identify all the free symbols in the graph at the top level and in sub-graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138428
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #138345
2024-11-04 22:47:49 +00:00
Yidi Wu
54c69a785b [hop free symbols][refactor] make bound_symbols a dictionary (#138345)
Code refactoring only. Change all self.tx.output.bound_symbols to self.tx.output.root_tracer.bound_symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138345
Approved by: https://github.com/zou3519
2024-11-04 22:47:41 +00:00
Felix Zimmermann
bf5cd8d011 Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-04 22:10:04 +00:00
Shunting Zhang
888110841c [inductor] don't fuse two nodes if likely increase peak memory (#138756)
Partially fixing https://github.com/pytorch/pytorch/issues/138685

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two nodes. The order of nodes may change later, which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuse these 2 nodes
        2. find the intersection of these buffers for the two nodes and sum the total buffer size.
           If we don't fuse these two nodes, we can at least avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing a peak memory increase.
           This is just a heuristic.
        We return true only if the saving from fusion cannot trade off the extra memory allocation.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138756
Approved by: https://github.com/jansel
ghstack dependencies: #139136
2024-11-04 20:49:29 +00:00
Ze Sheng
1aa71be56c [PT2] Decouple decompose_triton_kernel_wrapper_functional from decompose_auto_functionalized (#139526)
As title. We may not always want to remove `triton_kernel_wrapper_functional`; see, for example, the references of [`unsafe_remove_auto_functionalized_pass`](c8ab9b06a2/torch/export/_remove_auto_functionalized_pass.py (L48)).

Test Plan: CI & [D62592946](https://www.internalfb.com/diff/D62592946)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139526
Approved by: https://github.com/zou3519
2024-11-04 20:16:18 +00:00
Will Constable
71dc5df93c [pipelining] Fix 'last backward' counting for dI / dW (#139415)
Since any stage can run a mixture of full backwards and split backwards,
it is important to count the sum of (full_backwards + backward_weight)
when comparing to num microbatches to determine last backward.
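A minimal sketch of the counting rule described above (names are illustrative, not the pipelining code):

```python
def is_last_backward(num_full_backwards: int, num_backward_weights: int,
                     num_microbatches: int) -> bool:
    # a stage may run a mix of full backwards and split weight backwards (dW),
    # so compare their sum against the number of microbatches
    return num_full_backwards + num_backward_weights == num_microbatches
```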

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
2024-11-04 20:14:10 +00:00
Ryan Guo
30a83ca991 [dynamo] Improve codegen for DataPtrVariable and fix tensor reference issue (#139487)
This addresses
https://github.com/pytorch/pytorch/pull/137677/files#r1799836499, which
had to set `allow_cache=False` for codegen on `DataPtrVariable.base`,
which is a `TensorVariable`, otherwise we observe failure of
`test_no_grad_copy` when testing with Dynamo.

I've seen `test_no_grad_copy` failing a few times, and every single time
it's related to cyclic reference, my best guess is the cyclic reference
holds some tensor object longer in memory than necessary, preventing the
optimization introduced in #11165.

This patch makes `OutputGraph.cleanup()` more aggressive by clearing out
all fields that might reference a `VariableTracker`. As a result, we can
remove the aforementioned `allow_cache=False`, which helps generate
better code (e.g., in the case of `test_no_grad_copy`, it skipped generating
a redundant graph whose only op is returning the input tensor; instead we just
generate a single `LOAD_FAST`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139487
Approved by: https://github.com/jansel, https://github.com/aakhundov
2024-11-04 19:14:06 +00:00
Bin Bao
740054ffe6 [AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland https://github.com/pytorch/pytorch/pull/139154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597
Approved by: https://github.com/angelayi
2024-11-04 18:53:17 +00:00
Oguz Ulgen
e76ce20177 Log to pt2 compile events (#139601)
Summary: This option was added after I wrote the original diff; let's publish to pt2_compile_events.

Test Plan: CI

Differential Revision: D65404910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139601
Approved by: https://github.com/jamesjwu
2024-11-04 18:39:06 +00:00
Shunting Zhang
4930c4b716 [inductor] patterns to remove pointless view/permute pairs (#139136)
These are not artificial patterns I came up with; they show up in the linear+CrossEntropyLoss graph.

Consider this snippet:
```
        class LinearAndCEL(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(C, V)
                self.ce = nn.CrossEntropyLoss()

            def forward(self, x, y):
                return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```

`x` passed to `forward` is a 3D tensor of shape [B, T, C].
`self.linear` will first view x as a [BxT, C] tensor, do the matmul and produce a [BxT, V] tensor, and then view this output back to a 3D tensor of shape [B, T, V]. User code then adds another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views. A pair of redundant permutes happens in the backward pass when we compute gradients.

The view ops make it hard to chunk linear+CEL. When a view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and I think it's in general a nice thing to do.
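Concretely, the kind of no-op view pair being removed looks like this (illustrative shapes):

```python
import torch

B, T, V = 2, 3, 5
out = torch.randn(B * T, V)
y = out.view(B, T, V).view(B * T, V)  # the two views cancel out
assert y.shape == out.shape and y.data_ptr() == out.data_ptr()
```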

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-11-04 18:39:02 +00:00
Mikayla Gawarecki
ca43ecd599 Flip default on weights_only (#137602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137602
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #138936, #139221, #139433, #139541
2024-11-04 18:30:29 +00:00
Mikayla Gawarecki
f55dfbcf87 Remove hasattr(__slots__) for BUILD logic in weights_only unpickler (#139541)
This is tested in PR stacked above in

```python
python test/distributed/fsdp/test_fsdp_state_dict.py TestFSDPStateDict.test_torch_save_load
```

We cannot depend on `hasattr(..., '__slots__')` to know whether a BUILD instruction has slotstate. For example, if a class subclasses ABC, `hasattr(cls, '__slots__')` will be `True` but there might be no slots (and hence `state` will not be a tuple). So revert #138936 to follow the pickle library's code

```python

>>> from abc import ABC
>>> hasattr(ABC, "__slots__")
True
```

So

```python
import torch
from abc import ABC
from dataclasses import dataclass

class Foo(ABC):
    pass

class FooWrapper(Foo):
    def __init__(self, x, y):
        self.x = x
        self.y = y

f = FooWrapper(1, 2)
torch.save(f, "temp.pt")
with torch.serialization.safe_globals([FooWrapper]):
    torch.load("temp.pt")
```

Would fail on the previous code with
```
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1934, in _load
    result = unpickler.load()
  File "/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py", line 366, in load
    for k, v in slotstate.items():
```

As there is actually no slotstate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139541
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221, #139433
2024-11-04 18:30:29 +00:00
Tugsbayasgalan Manlaibaatar
ae0e7042f6 Fix custom obj being input (#139209)
Differential Revision: [D65158939](https://our.internmc.facebook.com/intern/diff/D65158939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139209
Approved by: https://github.com/ydwu4
ghstack dependencies: #138658
2024-11-04 18:24:29 +00:00
rzou
85c3c4132d no-op torch.library.custom_op APIs on torch.deploy (#139509)
We forgot this case in the previous PR. Fixes
https://github.com/pytorch/pytorch/issues/137536

Test Plan:
- better tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139509
Approved by: https://github.com/williamwen42
2024-11-04 18:01:08 +00:00
PyTorch MergeBot
6dada2136a Revert "Refactor FxGraphDrawer to use HTML-like labels (#137726)"
This reverts commit 1e73842029.

Reverted https://github.com/pytorch/pytorch/pull/137726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like some internal components are failing after this change and need to be updated ([comment](https://github.com/pytorch/pytorch/pull/137726#issuecomment-2455332612))
2024-11-04 17:44:44 +00:00
Tugsbayasgalan Manlaibaatar
e080c89bdc Make test_torchbind.py training IR compatible (#138658)
In this diff, I make the test_torchbind.py tests handle the training IR. Today in the training IR, we don't see the effect token and HOP because this happens at the FunctionalTensorMode. Maybe in the future we should move this logic up to the training IR so that writing passes etc. on the training IR is safer, but for migration purposes I think it is ok for now. I also fixed the following bugs:
1. ep.module() doesn't register all aliased constants in the module.
2. When we retrace, we need to fakify the original Torchbind object.
3. We don't run any DCE on the training IR, so we need to add some more torch ops to the verifier.

Differential Revision: [D64853530](https://our.internmc.facebook.com/intern/diff/D64853530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138658
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2024-11-04 17:43:11 +00:00
Bob Ren
68c515b292 don't run z3 analysis on backed symfloat nodes (#139568)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139568
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457
2024-11-04 17:04:29 +00:00
PyTorch MergeBot
3ca794783f Revert "[SymmetricMemory] introduce a binding for cuMemset32Async (#138755)"
This reverts commit 924e726c3a.

Reverted https://github.com/pytorch/pytorch/pull/138755 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally.  Can you please fix this PR so it works internally and re-merge it? See D65401876 for more details ([comment](https://github.com/pytorch/pytorch/pull/138755#issuecomment-2455173596))
2024-11-04 16:34:34 +00:00
Bob Ren
87404b6ca6 support symfloats in translation validation (#139457)
fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesHigherOrderOpTests.test_cond_pytree_operands_with_non_tensor_leaves_dynamic_shapes` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139457
Approved by: https://github.com/ezyang
ghstack dependencies: #139569
2024-11-04 15:40:08 +00:00
Richard Barnes
6b8e3022f2 Remove c10::optional usages in PyTorch (#139525)
Test Plan: Sandcastle

Reviewed By: swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139525
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-11-04 15:35:23 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
Bob Ren
12d225d91c add opaque unary sin and cos to SYMPY_INTERP (#139569)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nn.py TestNNDeviceTypeCPU.test_affine_3d_rotateRandom_cpu` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139569
Approved by: https://github.com/ezyang
2024-11-04 07:37:11 +00:00
Sun, Jiayi
3337439dc0 [inductor] modify the heuristic for disabling vectorization (#136422)
Summary
Since we have already implemented tail loop mask vectorization (https://github.com/pytorch/pytorch/pull/126526), I re-tuned the heuristic for disabling vectorization from a performance perspective. The heuristic is now: when the total number of elements along the vec dim is less than `tiling_factor/4` and the number of operations is less than 10, we disable vectorization.
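A minimal sketch of the predicate described above (names are illustrative, not the Inductor code):

```python
def should_disable_vectorization(num_elems_along_vec_dim: int,
                                 num_ops: int,
                                 tiling_factor: int) -> bool:
    # disable vectorization only for very small loops with few operations
    return num_elems_along_vec_dim < tiling_factor / 4 and num_ops < 10
```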

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136422
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-04 07:33:32 +00:00
James Wu
f4ee5a243d Add PT2 Compile Events for triton and kernel compilation + load_by_key_path (#139402)
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.

In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.

In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.

The `triton_compile` event is a bit questionable: do we need a row for each triton compile event? It might eat into our already low retention, so I might just remove that. Will discuss with @slarsen.

Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
2024-11-04 06:37:18 +00:00
cyy
3179eb15ae [1/N] Remove usage of C array (#139567)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139567
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-04 04:52:46 +00:00
Yuxin Wu
cadc50e7e9 LOG(INFO) -> VLOG(2) in ProcessGroupNCCL (#130696)
In the same spirit as https://github.com/pytorch/pytorch/pull/105695

Initialization and error handling logs are mostly kept. Routine logs are changed to VLOG.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130696
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@fb.com>
2024-11-04 04:43:42 +00:00
Jason Ansel
ed30fa74ab [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-04 04:28:40 +00:00
Jason Ansel
b6fb135c2c [inductor] Simplify remove_kernel_local_buffers (#139452)
I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365, #139370
2024-11-04 04:28:40 +00:00
Jason Ansel
3d633f12ba [inductor] Move remove_kernel_local_buffers to Kernel (#139370)
This method mutates the kernel, so it fits better in that class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365
2024-11-04 04:28:33 +00:00
Jason Ansel
66d5e2405d [inductor] Remove Node.last_usage mutation (#139365)
I can't figure out why this is needed.  Let's see if tests fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365
Approved by: https://github.com/shunting314
ghstack dependencies: #139364
2024-11-04 04:28:25 +00:00
Jason Ansel
d189f92eb1 [inductor] Remove SIMDKernel.last_usage (#139364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-04 04:28:18 +00:00
Animesh Jain
e6ff07f00e [dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)
This is a bug on main exposed by https://github.com/pytorch/pytorch/issues/139476

We have a dict tag optimization: if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors immutable in such scenarios. This is critical for
guard eval performance, because users generally don't change their
parameters.

If I try to remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.

So, I am adding a flag which keeps the current state but allows
users to remove this optimization. Not ideal, but given how critical guard
eval perf is, we are in the gray area of the unsoundness vs performance tradeoff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
2024-11-04 00:54:20 +00:00
cyy
7f387fa612 [10/N] Fix extra warnings brought by clang-tidy-17 (#139385)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139385
Approved by: https://github.com/Skylion007
2024-11-04 00:47:19 +00:00
briancoutinho
3242049daa [profiler] Annotate triton kernels with kernel hash (#139531)
As above, this annotates the triton kernel hash in the profile attributes.

Added a new unit test in the profiler for triton/dynamo events.

Test plan:

Running new unit test in CI

Internal:
  buck2 run @mode/dev-nosan caffe2/test:profiler -- -r test_pt2_triton_attributes

Running on an example, this is how the kernel hash file looks
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 1670242, "tid": 1670242,
    "ts": 2413669097354.058, "dur": 95.812,
    "args": {
      "External id": 3,"kernel_hash": "cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu",
"grid": "grid(100,)", "Record function id": 0, "stream": 0, "Concrete Inputs": ["", "", "", "100"], "kernel_file": "/tmp/torchinductor_bcoutinho/qa/cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu.py", "kernel_backend": "triton", "Input type": ["float", "float", "float", "Scalar"], "Input Strides": [[10, 1], [10, 1], [10, 1], []], "Input Dims": [[10, 10], [10, 10], [10, 10], []], "Ev Idx": 2

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139531
Approved by: https://github.com/davidberard98
2024-11-03 23:19:35 +00:00
Yifu Wang
924e726c3a [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-03 21:37:31 +00:00
Bob Ren
5d07651c72 only use hint_size in _smart_symbol_sort for size type symbols (#139571)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_exponential_kstest_cpu_bfloat16` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139571
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482, #139484, #139486
2024-11-03 21:15:08 +00:00
leslie-fang-intel
d84a344410 [Inductor] Skip coordinate_descent_tuning for mm/bmm decomposition on CPU (#139537)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/138823: `coordinate_descent_tuning` doesn't help on CPU, where we prefer lowering `mm`/`bmm` into ATen kernels or the CPP GEMM template.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_coordinate_descent_tuning
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139537
Approved by: https://github.com/jansel
2024-11-03 10:10:29 +00:00
Edward Z. Yang
585dbfa583 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation is done from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-03 06:29:57 +00:00
Bob Ren
a1370259ba always specialize float on export path (#139486)
This is the next step in supporting dynamic float arguments in PT2: docs.google.com/document/d/1HswUSp9H6mg8Vg27mhRk8YzC9q_uf63b6wz-gwx65BQ/edit?pli=1#heading=h.xvyiqp8tuje6. To make this more incremental and tractable, we've decided to opt the export path out of this first phase of the rollout.

Fixes python test/export/test_export.py TestExport.test_export_input_mutation_dynamic_shape when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139486
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482, #139484
2024-11-03 04:47:12 +00:00
Bob Ren
25f243ff5d Update tensorify pass to specialize symfloats we didn't tensorify away (#139564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139564
Approved by: https://github.com/huydhn
2024-11-03 04:27:43 +00:00
PyTorch MergeBot
067d2a089d Revert "Expose Storage _use_count API in Python (#139426)"
This reverts commit e31136d07b.

Reverted https://github.com/pytorch/pytorch/pull/139426 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing some inductor job in trunk ([comment](https://github.com/pytorch/pytorch/pull/139426#issuecomment-2453269063))
2024-11-03 02:40:45 +00:00
Bob Ren
b8b60e0bc5 add is_integer to support example_value function whitelist (#139484)
Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_is_integer_dynamic_shapes when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139484
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482
2024-11-03 02:01:38 +00:00
Ke Wen
f121eab018 [c10d] Remove dead Dynamo marker (#139545)
Per discussion with @anijain2305, `dynamo_unsupported_distributed_c10d_ops` is not referenced anywhere.
Removing this dead code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139545
Approved by: https://github.com/Skylion007
2024-11-03 00:40:26 +00:00
Yukio Siraichi
a3cb8ee38b AOTAutograd: Make general SymInt hashable when merging view inputs. (#139553)
Fix: #139111

This PR wraps `SymInt` input arguments with `SymIntEqByExpr`, making them hashable when
merging view inputs (`merge_view_inputs` function).
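
A minimal sketch of the wrapper idea, assuming a `.node.expr` attribute on symbolic ints; `EqByExpr` below is an illustrative stand-in, not the actual `SymIntEqByExpr` class:

```py
class EqByExpr:
    """Hash/compare a possibly-symbolic int by its underlying expression."""
    def __init__(self, val):
        self.val = val

    def _key(self):
        # SymInt-like objects carry a .node.expr; plain ints are their own key.
        node = getattr(self.val, "node", None)
        return node.expr if node is not None else self.val

    def __eq__(self, other):
        return isinstance(other, EqByExpr) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())
```
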
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139553
Approved by: https://github.com/ezyang
2024-11-02 23:57:11 +00:00
Yuanhao Ji
b46e1fc141 [Dynamo] Fix graph break when tensor.split() is called within a device context manager (#139270)
Fixes: #139183

Note: this case can also be reproduced on cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139270
Approved by: https://github.com/ezyang

Co-authored-by: Vincent Moens <vincentmoens@gmail.com>
2024-11-02 23:55:51 +00:00
Jane Xu
e31136d07b Expose Storage _use_count API in Python (#139426)
It would be nice to replace the torch._C._storage_Use_Count call in https://github.com/pytorch/torchtune/pull/1936, at least without needing to know about _cdata in OSS code.

Initially keeping it private as Tensor._use_count is also private.

This is preferred over https://github.com/pytorch/pytorch/pull/139109 for solving the same problem, as exposing an existing API is better than adding a new one (and this enables a more robust fix).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139426
Approved by: https://github.com/soulitzer
2024-11-02 23:36:31 +00:00
Bob Ren
232af152b5 Fix graph breaks related to specialized float inputs (#139482)
Fixes issue with timm models where

example_value = 0.09999
proxy.node.target = <built-in function sub>

would fall through to

```
        unimplemented(
            "torch.* op returned non-Tensor "
            + f"{typestr(example_value)} {proxy.node.op} {proxy.node.target}",
            case_name="unsupported_operator",
        )
```

and graph break

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139482
Approved by: https://github.com/ezyang
ghstack dependencies: #139451
2024-11-02 21:58:46 +00:00
PyTorch MergeBot
854be65fa0 Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (#139013)"
This reverts commit 55038aa661.

Reverted https://github.com/pytorch/pytorch/pull/139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](https://github.com/pytorch/pytorch/pull/139013#issuecomment-2453085932))
2024-11-02 18:29:10 +00:00
PyTorch MergeBot
92d7f29e59 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit f6be44c74e.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to more fbcode errors ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452985581))
2024-11-02 13:11:04 +00:00
Edward Z. Yang
f6be44c74e Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation is done from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-02 11:50:11 +00:00
Ke Wen
55038aa661 [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option which is meant for the previous PG creation. The P2P comm creation may then use `ncclCommSplit` and hang, because not all ranks join this call. The bug slipped through until now because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-11-02 07:47:55 +00:00
PyTorch MergeBot
2a3fe06ce0 Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598)"
This reverts commit 39ec5a20ea.

Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails an executorch test https://github.com/pytorch/executorch/blob/main/exir/backend/test/test_graph_partition.py#L114-L175 ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2452903705))
2024-11-02 07:19:22 +00:00
PyTorch MergeBot
f3238106fd Revert "Allow inplacing buffer when other users are inconsequential (#138383)"
This reverts commit 030f70b40b.

Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but I think it has a test failing internally and also on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2452898229))
2024-11-02 06:53:48 +00:00
PyTorch MergeBot
0863d6a08e Revert "[inductor] Remove SIMDKernel.last_usage (#139364)"
This reverts commit 286d3ce266.

Reverted https://github.com/pytorch/pytorch/pull/139364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:11 +00:00
PyTorch MergeBot
9331640e26 Revert "[inductor] Remove Node.last_usage mutation (#139365)"
This reverts commit 1e934b473c.

Reverted https://github.com/pytorch/pytorch/pull/139365 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
PyTorch MergeBot
dc4b459737 Revert "[inductor] Move remove_kernel_local_buffers to Kernel (#139370)"
This reverts commit b57b4b7f9b.

Reverted https://github.com/pytorch/pytorch/pull/139370 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
PyTorch MergeBot
66a401c9e1 Revert "[inductor] Simplify remove_kernel_local_buffers (#139452)"
This reverts commit 73c0762a34.

Reverted https://github.com/pytorch/pytorch/pull/139452 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
PyTorch MergeBot
98e11b0021 Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)"
This reverts commit c53beab377.

Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
Bob Ren
fdd298dcb7 add hex method on SymFloat (#139451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139451
Approved by: https://github.com/ezyang
2024-11-02 05:33:19 +00:00
PyTorch MergeBot
8d1eaa3da6 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit a6630bcf87.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))
2024-11-02 03:38:15 +00:00
drisspg
540f3ef9b1 Fix flex_decode to build offsets off of strides (#139516)
Fixes: https://github.com/pytorch/pytorch/issues/139462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139516
Approved by: https://github.com/Chillee
2024-11-02 03:17:46 +00:00
Bin Bao
a46a79fe92 [AOTI] Ignore .o files in package_aoti (#139153)
Summary: There is no point in packaging .o files since a .so file is already included in that package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139153
Approved by: https://github.com/angelayi
2024-11-02 03:10:05 +00:00
Jason Ansel
c53beab377 [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-02 03:04:22 +00:00
Justin Chu
387b120549 [ONNX] Remove type promotion rule for pow (#139527)
ONNX supports different input types in Pow, so type promotion is not needed.

The resulting graph is the following:

```py
ONNXProgram(
    model=
        <
            ir_version=9,
            opset_imports={'': 18, 'pkg.onnxscript.torch_lib.common': 1},
            producer_name='pytorch',
            producer_version='2.6.0a0+git59a1af5',
            domain=None,
            model_version=None,
        >
        graph(
            name=main_graph,
            inputs=(
                %"x"<FLOAT16,[3]>
            ),
            outputs=(
                %"pow_1"<FLOAT16,[3]>
            ),
        ) {
            0 |  # node_Constant_0
                 %"val_0"<?,?> ⬅️ ::Constant() {value=Tensor<FLOAT,[]>(array(2., dtype=float32), name=None)}
            1 |  # node_Pow_1
                 %"pow_1"<FLOAT16,[3]> ⬅️ ::Pow(%"x", %"val_0")
            return %"pow_1"<FLOAT16,[3]>
        }
...
    ,
    exported_program=
        ExportedProgram:
            class GraphModule(torch.nn.Module):
                def forward(self, x: "f16[3]"):
                     # File: /workspace/pytorch/test/onnx/exporter/test_small_models_e2e.py:53 in forward, code: return x**2.0
                    pow_1: "f16[3]" = torch.ops.aten.pow.Tensor_Scalar(x, 2.0);  x = None
                    return (pow_1,)

        Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='pow_1'), target=None)])
        Range constraints: {}

)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139527
Approved by: https://github.com/titaiwangms
2024-11-02 02:19:50 +00:00
Chen, Zejun
edd3f5a94d [profiler] fix a building warning by adding USE_KINETO namespace for setTraceID (#139461)
Fix: https://github.com/pytorch/pytorch/issues/139460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139461
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/sraikund16
2024-11-02 01:02:29 +00:00
Angela Yi
092fe2f422 Handle nan case when checking mutations (#139483)
Test Plan: PT2 readiness models

Differential Revision: D65340986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139483
Approved by: https://github.com/zou3519
2024-11-02 00:49:05 +00:00
William Wen
b71e813bce [dynamo, 3.13] fix bytecode nop tests (#139323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323
Approved by: https://github.com/jansel
2024-11-02 00:39:36 +00:00
Bin Bao
8c17830dea [AOTI] Unify how weights are stored as data section (#139471)
Summary: https://github.com/pytorch/pytorch/pull/118076 introduced a cleaner way to link weights as a data section for macOS. This unifies the code by adopting that approach for Linux as well.

Test Plan: CI

Differential Revision: D65302273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139471
Approved by: https://github.com/chenyang78
2024-11-02 00:23:24 +00:00
eellison
ee2f8a50d3 Class rename (#139490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139490
Approved by: https://github.com/exclamaforte, https://github.com/zou3519
ghstack dependencies: #139295
2024-11-02 00:10:17 +00:00
PyTorch MergeBot
b617d4813c Revert "fix dynamo tracking numpy 2 ops (#138686)"
This reverts commit 124eac255e.

Reverted https://github.com/pytorch/pytorch/pull/138686 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I am seeing inductor failure with hf_BigBird number of graph breaks after it lands ([comment](https://github.com/pytorch/pytorch/pull/138686#issuecomment-2452718164))
2024-11-01 23:34:06 +00:00
eellison
2382b3b6d8 [Easy] Add joint graph passes, fallback_random to bisector (#139295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139295
Approved by: https://github.com/zou3519, https://github.com/exclamaforte
2024-11-01 23:21:53 +00:00
Gabriel Ferns
1e73842029 Refactor FxGraphDrawer to use HTML-like labels (#137726)
Fixes https://github.com/pytorch/pytorch/issues/137499
Testing: Added a new unit test to make sure that the regression case succeeds.
I'm debating about whether to make the borders visible. I'm partial to no borders, but it might make it harder for some people to read?
![68a2b0e3-orig_fx_graph_diagram](https://github.com/user-attachments/assets/fbc2fd98-9e76-488e-8ebe-c64fbf206932)
Vs.
![2bfe1c4f-orig_fx_graph_diagram](https://github.com/user-attachments/assets/b6bc88ba-dda2-4cf7-84ac-a615e1e03a74)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137726
Approved by: https://github.com/eellison, https://github.com/malfet
2024-11-01 23:19:50 +00:00
David Berard
60542eeb33 [inductor] set sanitize_overflow=False for triton kernels (#139502)
In upstream triton, https://github.com/triton-lang/triton/pull/4589 introduces overflow checks. However, overflow checks likely add some overhead, and have some correctness bugs at the moment (e.g. https://github.com/triton-lang/triton/pull/5033). Let's set `sanitize_overflow=False` but keep `debug=True` so that we can keep using device_assert but without the additional asserts added by `sanitize_overflow`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139502
Approved by: https://github.com/bertmaher
2024-11-01 23:10:21 +00:00
Mikayla Gawarecki
a979318ef7 Add section to serialization note re weights_only (#139433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221
2024-11-01 21:51:50 +00:00
Edward Z. Yang
a6630bcf87 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation is done from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-01 21:43:25 +00:00
Xuan Zhang
9c2ffce71a add condition for freeable input buffer (#139480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139480
Approved by: https://github.com/yf225
ghstack dependencies: #139396
2024-11-01 21:15:40 +00:00
Sam Larsen
c412a42ae2 [pt2 logging] move remote cache get/put logging up one level (#139423)
Summary: I need to refactor the way we record CompilationMetrics. That will be much easier to do in OSS, and having the relevant timing code in the OSS area of the codebase simplifies the refactor. I doubt this meaningfully changes the values we see.

Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x

Differential Revision: D65290089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423
Approved by: https://github.com/oulgen
2024-11-01 21:06:59 +00:00
Animesh Jain
0e57f2b589 [invoke_subgraph] Change the joint_graph output signature to simplify min-cut partitioner (#139326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139326
Approved by: https://github.com/zou3519
ghstack dependencies: #139216, #139130
2024-11-01 21:02:32 +00:00
Animesh Jain
6a268c3fbb [invoke_subgraph] Generate fake_inputs correctly (#139130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139130
Approved by: https://github.com/zou3519
ghstack dependencies: #139216
2024-11-01 21:02:32 +00:00
Animesh Jain
4c756cacfd [invoke_subgraph] Re-enable fake tensor model in the fake tensor impl (#139216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139216
Approved by: https://github.com/zou3519
2024-11-01 21:02:32 +00:00
Justin Chu
5d67efb809 [ONNX] New registration API (#135403)
The ONNX custom ops registration API.

## Design

1. Create a "custom_translation_table: dict[Callable, Sequence[Callable] | Callable" parameter for specifying extra functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
		- When there is a single callable, it is the translation function for the op
		- When there is a Sequence of callable, the exporter's dispatcher will dispatch to these callables in order based on input dtypes.
		- The translation functions can be a plain python function that calls onnxscript ops (traced), or an onnxscript function.
		- Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.

```py
torch.onnx.export(dynamo=True,
	custom_translation_table = {
	torch.ops.aten.add: [overload1, overload2],
	torch.sym_not: sym_not_onnx,
})
```
Support for functions that handles complex inputs will be in separate PRs.

fixes https://github.com/pytorch/pytorch/issues/138391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403
Approved by: https://github.com/titaiwangms
2024-11-01 20:58:54 +00:00
Jason Ansel
73c0762a34 [inductor] Simplify remove_kernel_local_buffers (#139452)
I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365, #139370
2024-11-01 20:36:39 +00:00
Yifu Wang
0dbc284a72 [SymmetricMemory] expose signal_pads as tensors in Python (#138754)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR

```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)

# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)

# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
2024-11-01 20:17:15 +00:00
Haifeng Jin
124eac255e fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
As we upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` as unsupported in dynamo tracing.
This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of `numpy.random` that considered supported for NumPy 2.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/lezcano, https://github.com/williamwen42
2024-11-01 19:51:40 +00:00
Mikayla Gawarecki
ea0e09b3f3 Add utility to get all unsafe globals in checkpoint (no pickletools dependency) (#139221)
Fixes https://github.com/pytorch/pytorch/issues/129698

https://github.com/pytorch/pytorch/pull/139106 without pickletools

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139221
Approved by: https://github.com/malfet
ghstack dependencies: #138936
2024-11-01 19:31:39 +00:00
rzou
f3b485eb2a [reland] Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#137064)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been
more than two weeks since this landed. You can flip the config locally
as a workaround.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064
Approved by: https://github.com/albanD, https://github.com/eellison
2024-11-01 19:21:16 +00:00
Colin L. Rice
abc5d59dcb config: create Config objects with JK support (#138766)
This teaches install_config_module (and the underlying code) to
understand Config objects. Additionally, we've added a JK option to this
which resolves the JK.

This config gets stored within the _ConfigEntry class and is evaluated
when __getattr__ is called. If justknobs is set, it'll call
justknobs_check to see the result.

Due to preceding work, basically everything works correctly here; we only
had to update a couple of tests and modify the getattr behaviour.

Note that we are updating the justknob_check function to support a
default option, to make default work.
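
A minimal sketch of the access-time resolution described above; `ConfigEntry` and the `justknobs_check` stub below are assumptions that mirror the description, not the actual implementation:

```py
def justknobs_check(name: str, default: bool = True) -> bool:
    # Stub for illustration: with no JK backend available, the default wins.
    return default

class ConfigEntry:
    def __init__(self, default, justknob=None):
        self.default = default
        self.justknob = justknob  # optional JK name

class ConfigModule:
    def __init__(self, entries):
        # Write directly to __dict__ to avoid recursing into __getattr__.
        object.__setattr__(self, "_config", entries)

    def __getattr__(self, name):
        entry = self._config[name]
        if entry.justknob is not None:
            # The JK is resolved every time the config is read.
            return justknobs_check(entry.justknob, default=entry.default)
        return entry.default
```
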
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
2024-11-01 19:20:37 +00:00
Sam Larsen
d8b606ecb5 [fx graph cache] Support freezing with FX graph caching (#136505)
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module that needs to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.

Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
2024-11-01 18:29:29 +00:00
vladimirrotariu
7d644f025f make equation behind torch.isclose element-wise (#138459)
The current formula behind torch.isclose, according to the docs, is
![imagen](https://github.com/user-attachments/assets/6b79f6d8-e675-4585-b26b-0c6933f7ecdd)

However, torch.isclose acts element-wise, so this formula may be misleading at first, given that the docs say that `input` and `other` are the first and second tensors to compare, respectively. I propose the following change, to stress the element-wise nature of the norms in the equation:
![imagen](https://github.com/user-attachments/assets/2926a1c6-c4fa-4c48-8874-106521d3f54c)
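
For reference, the documented criterion written element-wise (presumably what the second image proposes) is:

```latex
\lvert \mathrm{input}_i - \mathrm{other}_i \rvert \le \mathrm{atol} + \mathrm{rtol} \cdot \lvert \mathrm{other}_i \rvert
```
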
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138459
Approved by: https://github.com/soulitzer
2024-11-01 18:18:33 +00:00
sanchitintel
3cbf0c0bbf [Inductor][CPP] Cache weight tiles in L1D for AMX int8 WoQ GEMM (#136688)
# Summary

The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights.

This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension.

The figure below is from the document linked above -

<img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94">

## Performance data

Collected on 48 physical cores of one socket of Intel Xeon  Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.

|M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
|4096|4096|4096| 45.844 ms | 9.322 ms| 5.2181 ms |
|4096|11008|4096| 127.618 ms |24.6258 ms | 13.6046 ms|
|4096|4096|11008| 121.953 ms | 25.4692 ms | 10.2669 ms |
|4096|32000|4096| 478.450 ms| 75.3942 ms | 48.21 ms |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688
Approved by: https://github.com/jgong5
2024-11-01 16:32:22 +00:00
Jason Ansel
b57b4b7f9b [inductor] Move remove_kernel_local_buffers to Kernel (#139370)
This method mutates the kernel, so it fits better in that class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365
2024-11-01 16:28:15 +00:00
Jason Ansel
1e934b473c [inductor] Remove Node.last_usage mutation (#139365)
I can't figure out why this is needed.  Let's see if tests fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365
Approved by: https://github.com/shunting314
ghstack dependencies: #139364
2024-11-01 16:28:15 +00:00
Jason Ansel
286d3ce266 [inductor] Remove SIMDKernel.last_usage (#139364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-01 16:28:15 +00:00
Shuqiang Zhang
df0c1eceb9 [pgnccl][simple] clean up unused members of PGNCCL (#139436)
Summary:
Found these unused members while prototyping something else.
Better to remove unused members.
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139436
Approved by: https://github.com/Skylion007
2024-11-01 16:25:04 +00:00
Bin Bao
33dce10ece [AOTI][reland] Update zero size computation in clone_preserve_strides (#139458)
Summary: Reland https://github.com/pytorch/pytorch/pull/139224. clone_preserve_strides implemented in _inductor/utils.py does not handle multi-dimensional 0-size tensors correctly.

Differential Revision: [D65317451](https://our.internmc.facebook.com/intern/diff/D65317451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139458
Approved by: https://github.com/hl475
2024-11-01 13:51:02 +00:00
Yifu Wang
e6e140c3d7 [Inductor] fix a compilation time regression caused by user-visible output handling (#139420)
This PR fixes a compilation time regression manifested in timm_models/hrnet_w18 caused by https://github.com/pytorch/pytorch/pull/136732.

The regression is reproducible locally. The compilation time is a bit noisy, but it's still possible to tell the difference.

```
Before the offending PR

compilation_latency mean=176.022 seconds
compilation_latency mean=176.564 seconds

On the offending PR

compilation_latency mean=180.096 seconds
compilation_latency mean=179.101 seconds

On the fix

compilation_latency mean=173.153 seconds
compilation_latency mean=174.182 seconds
```

(I think the fix being faster than the baseline is due to noise)

The cause of the regression is an inefficiency in `is_user_visible_output()`. Specifically, it used `output_node.args[0].index(node)` to obtain the output idx for each node (and we called this for each node twice). The offending PR assumed that `len(output_node.args[0])` is rather small. However, the benchmark proved that assumption false (it was 1900+ for timm_models/hrnet_w18).

The fix is to precompute `user_visible_output_strides` once by iterating only over the nodes in `output_node.args[0]`.
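
A sketch of the shape of the fix, with illustrative helper names rather than the actual Inductor code:

```py
def precompute_output_indices(output_nodes):
    # One pass over output_node.args[0]; later lookups are O(1) per node
    # instead of an O(len(args[0])) .index() scan.
    return {node: idx for idx, node in enumerate(output_nodes)}

# Usage sketch:
# indices = precompute_output_indices(output_node.args[0])
# idx = indices.get(node)   # replaces output_node.args[0].index(node)
```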

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139420
Approved by: https://github.com/ezyang
2024-11-01 08:27:40 +00:00
Shunting Zhang
5e4c8b671c [inductor] loaf-fix (#139376)
Fix https://github.com/pytorch/pytorch/issues/128063 .

Now for this snippet
```
        def f(x):
            y = torch.sum(torch.sum(x, dim=-1))

            z = x / 10.0
            z_t = z.t().contiguous().t()
            return y, z, z_t
```
Inductor can generate a single kernel for the first reduction and the two pointwise kernels (if loop ordering after fusion is enabled). And the generated kernel reads `x` only ONCE. (Without proper handling, the two pointwise kernels may each access x once even when they are fused.)

The PR needs to fix 2 subtle bugs regarding LOAF:
1. When we reorder loops for a FusedSchedulerNode, we check whether each sub-node's sizes match. But some nodes have their sizes as a `list` (if their loop is not reordered) while others have them as a `tuple` (if their loop is reordered). I could change the upstream code to uniformly use either `list` or `tuple`, but without strong enforcement future code could break this. So I just convert the sizes to a uniform type before comparison (see the sketch after this list).
2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in a (very) bad kernel.
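
An illustrative sketch of the normalization mentioned in bug 1 (not the actual Inductor code):

```py
def sizes_match(a, b) -> bool:
    # Some sub-nodes carry sizes as lists, reordered ones as tuples;
    # normalize both to tuples before comparing.
    return tuple(a) == tuple(b)
```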

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376
Approved by: https://github.com/jansel, https://github.com/eellison
2024-11-01 07:54:32 +00:00
lingzhi98
39ec5a20ea [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we get all partition ids by iterating `assignment`, whose size equals the number of nodes in the graph. But we can get the same result by iterating `partitions_by_id`, whose size is much smaller than the number of nodes. If the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.
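
An illustrative sketch of the per-lookup difference (assumed data shapes, not the actual partitioner code):

```py
def partition_ids_via_assignment(assignment):
    # assignment maps node -> partition_id; walking it is O(N) in the node count.
    return set(assignment.values())

def partition_ids_via_partitions(partitions_by_id):
    # partitions_by_id maps partition_id -> partition; walking it is O(P), P << N.
    return set(partitions_by_id.keys())
```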

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/tarun292

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-01 07:42:36 +00:00
andras_matyassy
61df90e3f6 Add TORCHDYNAMO_EXTENDED_ADVICE (#137159) (#137196)
Fixes #137159

Happy to contribute to this project for the first time. If I missed any contribution guidelines, please let me know, I'm happy to adjust.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137196
Approved by: https://github.com/ezyang
2024-11-01 06:43:26 +00:00
angelayi
86db2cd194 [export] Initial draft export (#139383)
Differential Revision: [D65288590](https://our.internmc.facebook.com/intern/diff/D65288590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139383
Approved by: https://github.com/zou3519
2024-11-01 06:25:44 +00:00
FFFrog
300ca6368f Remove depracated alias macro(2/3) (#137559)
**Detailed Descriptions:**
- Remove AT_ASSERTM Macro
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137559
Approved by: https://github.com/ezyang
2024-11-01 06:17:57 +00:00
William Wen
0c47657b05 [dynamo] ignore False/None callback in fail_on_recompile/force_backend stances (#139215)
Fix https://github.com/pytorch/pytorch/issues/139202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139215
Approved by: https://github.com/jansel
2024-11-01 06:15:28 +00:00
cyy
4a2da52137 [1/N] Replace c10::sv with std::sv (#139453)
Picks some safe replacements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139453
Approved by: https://github.com/Skylion007
2024-11-01 05:39:37 +00:00
Will Constable
84416618a6 [Pipelining] Update schedules to use I, B actions. (#138886)
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.

Previously, schedules would issue a 'B' operation and leave it ambiguous
whether that operation should be BACKWARD_INPUT or FULL_BACKWARD,
depending on a separate flag (use_full_backward) passed to the schedule
class, which would determine which behavior was taken at runtime.

Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR.  The logic for 'use_full_backward' is removed
from the runtime.

_validate_pipeline_order is replaced with _simulate_comms_compute. Both
offer similar functionality, to validate the correctness of a schedule
IR.  'validate' operates on compute-only IR, while simulate operates on
compute + comm IR.  To convert from using validate to simulate, you have
to first insert comm actions via '_add_send_recv'.

'simulate' was inefficiently written before this PR and needed to be
optimized to run quickly for extra large schedules with >32 ranks and
microbatches per rank used in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
2024-11-01 03:54:06 +00:00
Bob Ren
094d288f40 Update tensorify pass to specialize symfloats we didn't tensorify away (#138868)
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:

1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization

This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:

1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
2024-11-01 03:18:02 +00:00
James Wu
c8a648d4df Add option to dynamo_timed and chromium_event_logger for logging pt2 compile events (#139309)
This diff considerably changes the column format of PT2 Compile Events:

- Now, instead of logging one new column for every piece of metadata, we just log a single column, "metadata". This vastly decreases the number of columns we need to log, which should help with retention.

- Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba.

Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309
Approved by: https://github.com/oulgen
2024-11-01 02:40:25 +00:00
Gabriel Ferns
030f70b40b Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
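
An illustrative predicate for that rule (assumed attribute names, not the actual scheduler code):

```py
def users_are_inconsequential(buffer, ancestors) -> bool:
    # A buffer may be reused in place if every user is already out of the way:
    # removed, completed, or part of the ancestors set of the candidate node.
    return all(
        user.is_removed or user.is_completed or user.node in ancestors
        for user in buffer.users
    )
```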

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-11-01 01:24:40 +00:00
cyyever
8ace3e8023 Add sv starts/ends_with (#139261)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139261
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-01 01:17:42 +00:00
Mikayla Gawarecki
2a309c0997 Fix weights_only for BUILD instructions for user allowlisted objects with __slots__ (#138936)
Previously, the `BUILD` instruction missed handling for `__slots__`. **This only applies to things allowlisted via `add_safe_globals`/`safe_globals` that use slots.**

### Background
When does pickle serialize a `BUILD` instruction? When `state` is not `None` and `state_setter` is `None` [[link](c5b99f5c2c/Lib/pickle.py (L765))]. In this case, the docs tell us that either `__setstate__` or a `__dict__` update will be performed [[link](https://github.com/python/cpython/blob/3.13/Lib/pickletools.py#L1984)]

`__reduce__`/`__reduce_ex__` are expected to return tuples of length 2 to 6 where `state` is the 3rd argument. When the user doesn't patch `__reduce__` but patches `__setstate__`/`__getstate__`, `state` will be whatever is returned by `__getstate__`.

Note the return type for [`__getstate__` ](https://docs.python.org/3/library/pickle.html#object.__getstate__)

- For a class that has no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is None.
- For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is `self.__dict__`.
- For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is a tuple consisting of two dictionaries: `self.__dict__`, and a dictionary mapping slot names to slot values. Only slots that have a value are included in the latter.
- For a class that has [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__) and no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__), the default state is a tuple whose first item is None and whose second item is a dictionary mapping slot names to slot values described in the previous bullet.

see handling in pickle code c5b99f5c2c/Lib/pickle.py (L1846-L1867)

Before this PR, we didn't account for the fact that when `__setstate__` is not defined, `state` might be a tuple, so this would fail:

```python
from dataclasses import dataclass

# Define the dataclass
@dataclass
class MyDataClass:
    __slots__ = ["x", "y"]
    x: int
    y: str
# Create an instance of the dataclass
my_data = MyDataClass(x=2, y=3)
# Save the dataclass to a file
torch.save(my_data, "my_data.pt")
with torch.serialization.safe_globals([MyDataClass]):
    loaded_my_data = torch.load("my_data.pt", weights_only=True)
# AttributeError: 'MyDataClass' object has no attribute '__dict__'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138936
Approved by: https://github.com/malfet
2024-11-01 00:59:29 +00:00
Jason Ansel
f9ef880c0b [inductor] Refactor kernel args into SIMDKernelFeatures (#139327)
This is a refactor PR to move stuff around.  I'm planning to use the SIMDKernelFeatures class (in a future PR) to host new heuristics for selecting kernel types and block sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139327
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-01 00:30:14 +00:00
PyTorch MergeBot
b6b9596607 Revert "[dynamo] Fix constant propagation in builtins and UserClasses (#131354)"
This reverts commit 44257c063e.

Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break some internal tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2451050605))
2024-11-01 00:13:20 +00:00
IvanKobzarev
d33849908d [aotd] Fuse tangents subclasses runtime traversals (#139068)
Reason:
Currently we have multiple traversals for tangents in runtime:
 - To check that types and structure are identical to what we guessed during tracing time
 - Coerce metadata
 - Coerce memory_format
 - Unwrap_tensor_subclass
All of them traverse the tree of subclasses of the tangents via __tensor_flatten__ calls.

Change:
To do everything in one traversal at runtime (including flattening)

Implementation details:

1. Add memory_format information inside SubclassCreationMeta; for PlainTensors, keep not only the (int) unwrapped_index but also the memory_format.

   Preparing memory_format is optional (controlled by with_memory_format=True).

2. Remove the unused subclass_utils.create_metadata_for_subclass, which has no usages inside torch and would otherwise require updating this logic.
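
For intuition, a rough sketch of what a single combined traversal could look like (illustrative only, not the actual AOTAutograd helpers):

```python
import torch

def flatten_tangent(t, expected_type, out):
    # Check the guessed type, coerce memory_format, and flatten leaves in one pass.
    assert type(t) is expected_type, f"guessed {expected_type}, got {type(t)}"
    if hasattr(type(t), "__tensor_flatten__"):
        attrs, _ctx = t.__tensor_flatten__()
        for name in attrs:
            inner = getattr(t, name)
            # In the real code the expected inner types come from
            # SubclassCreationMeta; here we just reuse the runtime type.
            flatten_tangent(inner, type(inner), out)
    else:
        # Plain tensor leaf: coerce memory_format in the same pass
        # (the actual format is recorded per-leaf in SubclassCreationMeta).
        out.append(t.contiguous(memory_format=torch.contiguous_format))
    return out

flat = flatten_tangent(torch.randn(2, 3), torch.Tensor, [])
```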

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068
Approved by: https://github.com/bdhirsh
2024-11-01 00:03:02 +00:00
Xuan Zhang
86602a66d7 [orm] fix live_memory computation in lpmf algorithm (#139396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139396
Approved by: https://github.com/yf225
2024-10-31 23:45:30 +00:00
PyTorch MergeBot
3d3551506d Revert "[dynamo, 3.13] fix bytecode nop tests (#139323)"
This reverts commit c2d754441f.

Reverted https://github.com/pytorch/pytorch/pull/139323 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a regression in instruction count metric ([comment](https://github.com/pytorch/pytorch/pull/139323#issuecomment-2451017609))
2024-10-31 23:34:00 +00:00
Will Constable
8e8040a5c2 [Pipelining] Optimize ready_to_schedule logic (#138924)
Used in both the simulator and the add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled.  For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.

The old implementation iteratively compared the candidate op to the
previous ops.  The new implementation uses set lookups to reduce
complexity.  It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.
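
Roughly, the set-based check looks like this (an illustrative sketch, not the actual scheduling pass):

```python
# Keep a running set of already-scheduled (action, stage) pairs so each
# readiness check is a set lookup instead of a scan over the prior schedule.
scheduled = set()

def forward_ready(stage, stages_on_this_rank):
    # FORWARD is unblocked by a prior RECV_F for this stage, by being stage 0,
    # or by an earlier stage on the same rank having already run FORWARD.
    return (
        stage == 0
        or ("RECV_F", stage) in scheduled
        or any(("F", s) in scheduled for s in stages_on_this_rank if s < stage)
    )

def record(action, stage):
    # Maintained incrementally as ops are scheduled, not rebuilt on demand.
    scheduled.add((action, stage))

record("RECV_F", 1)
print(forward_ready(1, stages_on_this_rank=[1, 3]))  # True
```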

I did not save benchmark results, but this results in a 10-100x speedup,
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s.  Most schedules take less than
10ms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
2024-10-31 22:49:45 +00:00
Will Constable
c82e0d117a [Pipelining] Support separate dI / dW and V-schedules (#131762)
### Separate dI / dW:

PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD
or separate dI / dW operations.

Separating B and W may add execution overhead or may be suboptimal in
cases where B and W are effectively 'fused', but it is worthwhile when
separating them lets the schedule fill in bubbles and be more efficient.
In some cases the schedule will still issue B immediately followed by W;
in those cases we merge them back into BW ops and execute them as full
backwards rather than executing a B followed by a W (see the sketch below).
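
A minimal sketch of the merge-back idea (illustrative only; the real runtime operates on its own schedule IR):

```python
def merge_adjacent_b_w(ops):
    # When a B (dI) is immediately followed by its matching W (dW) for the
    # same stage and microbatch, merge them into a single full-backward BW op.
    merged = []
    for op in ops:  # op is (action, stage, microbatch)
        if merged and merged[-1][0] == "B" and op[0] == "W" and merged[-1][1:] == op[1:]:
            merged[-1] = ("BW",) + op[1:]
        else:
            merged.append(op)
    return merged

print(merge_adjacent_b_w([("B", 3, 0), ("W", 3, 0), ("B", 2, 1)]))
# [('BW', 3, 0), ('B', 2, 1)]
```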

### V-schedules:

V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g., if rank 3 has stage 3 and stage 4, then we should transfer stage 3
outputs directly to stage 4 inputs without a send/recv.

In the scheduling logic, we also must allow scheduling the stage 4
forward after running the stage 3 forward, without expecting a stage 4
RECV_F.

In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process. We
add new APIs to the PipelineStage abstraction for passing the
activations during both forward and backward. Currently the
implementation directly modifies the 'recv buffers' the stage is
managing, so the forward/backward execution logic does not need to know
the difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928
2024-10-31 22:49:45 +00:00
Zhengxu Chen
45da80b970 reland D65167805 "[export] Update min_val and max_val to Optional[int] in serialization." (#139394)
Summary:
The previous attempt had a land race with another diff, D65166035, to fix the schema.

According to the export team's discussion, we are upgrading min_val and max_val to optional fields, which shouldn't break BC and allows the schema to express infinity.
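
Conceptually (a hypothetical sketch, not the actual export schema classes), making the bounds optional lets `None` stand in for an unbounded side:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RangeConstraint:
    min_val: Optional[int] = None  # None == unbounded below (-inf)
    max_val: Optional[int] = None  # None == unbounded above (+inf)

# e.g. a value constrained to be >= 0 with no upper bound
non_negative = RangeConstraint(min_val=0, max_val=None)
```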

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/rec/ir/tests:ir_export_deserialize_test

Differential Revision: D65273170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139394
Approved by: https://github.com/yiming0416
2024-10-31 22:28:32 +00:00
Donald Tolley
c1e7d85ce6 Add Weighted Loss Functions to PyTorch : WMSE, WMAE, and Weighted Huber Loss (#132049)
#### Summary
This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, which is important for imbalanced data or when certain samples are more significant than others.

#### Changes
- **`weighted_huber_loss`**: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter.
- **`wmse_loss`** (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks.
- **`wmae_loss`** (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers.

#### Code Details
- **Input Validation**: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors.
- **Reduction Options**: Supports `none`, `mean`, and `sum` reductions to suit various computational needs.
- **Backward Compatibility**: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument.
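
For reference, a minimal sketch of the weighted Huber computation described above (not the PR's actual implementation; input validation and the deprecated `size_average`/`reduce` handling are omitted):

```python
import torch

def weighted_huber_loss_sketch(input, target, weights, delta=1.0, reduction="mean"):
    diff = input - target
    abs_diff = diff.abs()
    # Quadratic inside |diff| <= delta, linear outside; then scale per-sample.
    quadratic = 0.5 * diff.pow(2)
    linear = delta * (abs_diff - 0.5 * delta)
    loss = torch.where(abs_diff <= delta, quadratic, linear) * weights
    if reduction == "mean":
        return loss.mean()
    if reduction == "sum":
        return loss.sum()
    return loss  # reduction == "none"
```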

#### Usage Example
```python
import torch
input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32)
target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32)
weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32)

# weighted_huber_loss is the function added by this PR; import it from
# wherever the PR exposes it (exact module path not shown here).
loss = weighted_huber_loss(input, target, weights, delta=1.0)
print(loss)
```
---

Feedback on these implementations is welcome; please let me know if further modifications are required.

Resolves #132465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2024-10-31 21:59:43 +00:00
Simon Fan
82e74ad40e [aot autograd] refactor CompiledFunction.backward: control flow (3/N) (#139347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139347
Approved by: https://github.com/zou3519
ghstack dependencies: #139331, #139343
2024-10-31 21:53:03 +00:00
Simon Fan
8134456a27 [aot autograd] refactor CompiledFunction.backward: epilogue (2/N) (#139343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139343
Approved by: https://github.com/zou3519
ghstack dependencies: #139331
2024-10-31 21:53:03 +00:00
Simon Fan
04ce9ec087 [aot autograd] refactor CompiledFunction.backward: prologue (1/N) (#139331)
So for functional autograd + compiled autograd (CA), most nodes are inlined in AOT autograd. But user-defined callables aren't safe to make_fx unless Dynamo traces through them, so the AOT backward must be inlined by the time Dynamo runs. We plan to directly insert calls to the backward in the graph:
- call prologue
- call bwd graph
- call epilogue

Restructuring our AOT bwd implementation will make this easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139331
Approved by: https://github.com/zou3519
2024-10-31 21:53:03 +00:00
angelayi
8c22e09e39 [aoti] Add masked_select to cshim (#139071)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139071
Approved by: https://github.com/desertfire
2024-10-31 21:52:53 +00:00
PyTorch MergeBot
b9acbde4fd Revert "Update tensorify pass to specialize symfloats we didn't tensorify away (#138868)"
This reverts commit a494572799.

Reverted https://github.com/pytorch/pytorch/pull/138868 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the new tests are failing on fbcode ([comment](https://github.com/pytorch/pytorch/pull/138868#issuecomment-2450863895))
2024-10-31 21:46:06 +00:00
Laith Sakka
6a1c451479 Don't uselessly recompute axiom dict every static eval call (#138967)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967
Approved by: https://github.com/ezyang
2024-10-31 21:16:55 +00:00
PyTorch MergeBot
c4d9428b17 Revert "[AOTI] Update zero size computation in clone_preserve_strides (#139224)"
This reverts commit 206a8dde68.

Reverted https://github.com/pytorch/pytorch/pull/139224 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139224#issuecomment-2450811914))
2024-10-31 21:05:07 +00:00
Joel Schlosser
ddb291a881 Fix and test several NJT reductions (#139317)
I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit.

It applies to the following ops:
* `sum` / `mean` / `prod`
* `all` / `any`
* `amin` / `amax`
* `min` / `max`
* `argmin` / `argmax`

The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, *args, **kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim.
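
A conceptual sketch of the identity-element idea using plain tensors (not the NJT internals): padding the ragged dim with the reduction's identity lets a dense reduction return the ragged result unchanged.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two "ragged" rows of different lengths, illustrated with plain tensors.
rows = [torch.tensor([1., 2.]), torch.tensor([3.])]

# Pad with the reduction's identity so padded slots don't affect the result:
# 0 for sum, +inf for amin, -inf for amax, 1 for prod, etc.
padded_for_sum = pad_sequence(rows, batch_first=True, padding_value=0.0)
print(padded_for_sum.sum(dim=1))    # tensor([3., 3.])

padded_for_amin = pad_sequence(rows, batch_first=True, padding_value=float("inf"))
print(padded_for_amin.amin(dim=1))  # tensor([1., 3.])
```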

Extensive test coverage includes:
* reductions across ragged dim
* reductions across non-batch, non-ragged dims
* reductions across both batch and ragged dims
* multiple dim reductions (for ops that support this)
* full reduction -> scalar

Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317
Approved by: https://github.com/cpuhrsch
2024-10-31 20:55:38 +00:00