Commit Graph

43731 Commits

Author SHA1 Message Date
PyTorch MergeBot
7e02386303 Revert "[2/N] Replace c10::sv with std::sv (#139456)"
This reverts commit 028c5d3426.

Reverted https://github.com/pytorch/pytorch/pull/139456 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. @ezyang can you please help get this landed? See D65546398 for more details ([comment](https://github.com/pytorch/pytorch/pull/139456#issuecomment-2462768891))
2024-11-07 17:00:59 +00:00
IvanKobzarev
781c68c865 [aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095)
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731

Introduces the ability for a subclass to implement type conversion to expected_type.
```
    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
```
Here, `expected_type=None` means that `SubclassClass` is expected.

E.g. for `DTensor` we may find a tangent of type `AsyncCollectiveTensor` where we expected `Tensor`; in this case
`__coerce_same_metadata_as_tangent__` will be called with `expected_type=Tensor` at runtime.

Adds an implementation to AsyncCollectiveTensor that just triggers `wait()`.
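
A minimal sketch of the protocol described above, written without torch so it runs standalone. `PlainValue` and `AsyncValue` are hypothetical stand-ins for `Tensor` and `AsyncCollectiveTensor`; returning `None` for an unsupported coercion is an assumption of this sketch, not a statement about PyTorch's behavior.

```python
from typing import Any, Optional, Type


class PlainValue:
    """Hypothetical stand-in for a plain tensor."""

    def __init__(self, data: float) -> None:
        self.data = data


class AsyncValue:
    """Hypothetical stand-in for a lazily materialized wrapper subclass."""

    def __init__(self, data: float) -> None:
        self._data = data
        self.waited = False

    def wait(self) -> PlainValue:
        # Materialize the underlying value and unwrap it.
        self.waited = True
        return PlainValue(self._data)

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        if expected_type is None:
            # None means the subclass itself is the expected type.
            return self
        if expected_type is PlainValue:
            # A plain value was expected, so just trigger wait().
            return self.wait()
        # Assumption of this sketch: None signals an unsupported coercion.
        return None


tangent = AsyncValue(3.0)
coerced = tangent.__coerce_same_metadata_as_tangent__(None, PlainValue)
print(type(coerced).__name__, coerced.data)
```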

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
2024-11-07 16:24:48 +00:00
Bob Ren
8d3d47e439 Trigger symfloat specialization in argument binding code (#139454)
Fixes the test `python test/inductor/test_torchinductor.py CpuTests.test_upsample_cat_conv_cpu` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139454
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846
2024-11-07 16:10:23 +00:00
James Wu
c35a01173b Remove compile event logging for automatic dynamic (#139891)
Summary: These events are a pretty large portion of the table, but not really currently used. Only log to tlparse for now.

Test Plan: Unit tests

Differential Revision: D65539986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139891
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-07 14:52:10 +00:00
Annop Wongwathanarat
81ecf98d23 Pass all arguments when quantizing embedding bag from float (#137697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137697
Approved by: https://github.com/snadampal, https://github.com/jerryzh168
2024-11-07 09:53:49 +00:00
sanchitintel
314aa268ce In AMX GEMM micro-kernel, use same dtype for A & B only if B is dequantized (#139906)
@frost-intel discovered that some Inductor auto-tuning UTs for CPU are currently broken on machines supporting AMX ISA. That's because in #136688, I had reverted a change in the AMX GEMM micro-kernel that was introduced in #131887, but it looks like some other implementations introduced after the aforementioned change rely upon it, so it should not have been reverted.

Added a fix.

Ideally, a CI machine that supports AMX should cover these UTs (test/inductor/test_cpu_select_algorithm.py). We do have at least one CI machine that supports AMX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139906
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-11-07 09:18:59 +00:00
Bob Ren
a4e7b8001c refuse to generate a symbolic variable if a float input is inf (#139846)
Fixes `PYTORCH_TEST_WITH_INDUCTOR=1 tlp python test/test_torch.py TestTorchDeviceTypeCPU.test_cauchy_cpu_float64` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139846
Approved by: https://github.com/ruidazeng, https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572
2024-11-07 09:16:55 +00:00
xinan.lin
c4a323ed05 [Inductor] Generalize device-bias code newly introduced in scheduler.py (#139872)
[Inductor] Generalize device-bias code newly introduced in scheduler.py to align the Inductor behavior for xpu with cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139872
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
ghstack dependencies: #139705
2024-11-07 07:10:28 +00:00
xinan.lin
320374b011 [Inductor] Refine triton_bundler.py to support correctly on Intel GPU and fix CI failures. (#139705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139705
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
2024-11-07 07:10:28 +00:00
FFFrog
c03324de2d Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- Add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-11-07 06:28:47 +00:00
Max Podkorytov
ca30704f0b [Inductor][ROCm][CK] Add standalone runner (#139441)
Generate standalone executable to debug and profile CK gemm instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139441
Approved by: https://github.com/ColinPeppler
2024-11-07 06:21:27 +00:00
Yidi Wu
ab42967238 [hop free symbols] lift free symbols in example_value when create_graph_input (#138363)
There are 5 parts in this PR (they are hard to split further because they're highly coupled):
1. **Whenever we call create_graph_input, we try to bind the symbols in the graph input.**
We've enforced the invariant that all create_graph_input calls must provide an example value, so we can intercept at the create_graph_input calls (this PR only handles free symbols in tensors).
2. **We cache the bound_symbols** to avoid lifting the same symbol repeatedly.
3. For lifted symbols, we reuse **lifted_freevars**, i.e. the mapping from a symbol proxy in the parent graph to the lifted placeholders (phs) in the current subgraph, which is also how we handle lifted tensors. In this way, all hops that support lifted tensors should be able to handle lifted_symints automatically (at least in the dynamo part).
4. For **unbacked symbols** created during tracing, we also need to bind these symbols to their proxies. This supports the test cases where we want to lift unbacked symbols as inputs. We need the proxy of the unbacked symbol in the parent graph in order to properly create the args to the hop.
5. We update all the tests now that free symbols are lifted in subgraphs, and also support the lifted symbols in existing higher order ops.

**The interaction of nested tracers:**
The previous design for lifting tensor closures is as follows: suppose we're in nested tracers. Whenever we see a new proxy that was not created by the current tracer, we recursively look for the proxy in the parent tracer until we find the tracer that created this proxy (either as a placeholder or as some intermediate result). More detail is in Note [Nested SubgraphTracer and free_variable handling].

Given the above design, the plan for lifting the free symbols is: whenever we lift a free tensor to be an input of the current subgraph, we look at the symbols in it and bind them at the same time.

For example, suppose we have the following function:
```python
def f(x: [s1, s2]):
  def true_f():
    def true_f_inner():
      return x.sin()
```
What happens, in chronological order:

1. We create subtracer 1 and start to speculate the outer cond's true_f.
2. We create another subtracer 2 and start to speculate the inner cond's true_f_inner.
3. Dynamo realizes the tensor input x by calling wrap_tensor at the top level to create graph input x (tracer 0); we bind the symbols s1 and s2 after the ph for x is created. So the graph now looks like:
```python
def gm(s1, s2, x):
```
4. When seeing TensorVariable.call_method of x, tracer 2 wants to create a call_function(sin, proxy_of_x), but it finds that proxy_of_x was not created by the current tracer. So it recursively looks up its parent, tracer 1, and finds that tracer 1 doesn't track proxy_of_x either; then it finds the root tracer 0, which created it and tracks it as a ph. Tracer 1 then calls create_graph_input to lift the closure to its input ph1 and adds the (proxy_of_x: ph1) key-value pair to the **lifted_freevars** of tracer 1.
Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(x):
```
5. Since there are free symbols inside this new tensor input, tracer 1 also binds the symbols (maybe_bind_symbol), which calls create_graph_input for s1 and s2. Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
```
6. Then it goes back to tracer 2, which calls create_graph_input for x and gets ph2; tracer 2's **lifted_freevars** records (ph1, ph2). Tracer 2 also binds the symbols in this new tensor input. Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner(s1, s2, x):
```
7. Finally the sin call_function node is created by tracer 2.
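
The walkthrough above can be mimicked with a toy model. This is a sketch only: `ToyTracer`, `lift`, and `bind_symbol` are hypothetical names, tensors and symbols are plain strings, and the real Dynamo tracer tracks proxies rather than names.

```python
from typing import Dict, List, Optional, Sequence


class ToyTracer:
    """Hypothetical, heavily simplified model of a nested subgraph tracer."""

    def __init__(self, name: str, parent: Optional["ToyTracer"] = None) -> None:
        self.name = name
        self.parent = parent
        self.symbol_inputs: List[str] = []
        self.tensor_inputs: List[str] = []
        self.lifted_freevars: Dict[str, str] = {}  # parent value -> local placeholder
        self.bound_symbols: Dict[str, str] = {}    # symbol -> local placeholder

    def signature(self) -> str:
        return f"{self.name}({', '.join(self.symbol_inputs + self.tensor_inputs)})"

    def bind_symbol(self, sym: str) -> None:
        if sym in self.bound_symbols:
            return  # cached: this tracer already has an input for the symbol
        if self.parent is not None:
            self.parent.bind_symbol(sym)  # the parent must be able to supply it
        self.symbol_inputs.append(sym)
        self.bound_symbols[sym] = f"{self.name}.{sym}"

    def lift(self, tensor: str, symbols: Sequence[str]) -> None:
        if tensor in self.tensor_inputs:
            return  # already an input of this tracer
        if self.parent is not None:
            # Recurse towards the tracer that owns the tensor, then record
            # the lifted free variable on the way back down.
            self.parent.lift(tensor, symbols)
            self.lifted_freevars[tensor] = f"{self.name}.{tensor}"
        self.tensor_inputs.append(tensor)
        for sym in symbols:
            self.bind_symbol(sym)  # bind free symbols of the new input


root = ToyTracer("gm")
tracer1 = ToyTracer("true_gm", parent=root)
tracer2 = ToyTracer("true_gm_inner", parent=tracer1)

root.lift("x", ("s1", "s2"))     # step 3: the top level creates the input x
tracer2.lift("x", ("s1", "s2"))  # steps 4-6: x.sin() is seen in the inner body
for tracer in (root, tracer1, tracer2):
    print(tracer.signature())
```

Each tracer ends up with `s1, s2, x` as inputs, matching the final graph shown in step 6.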

**This PR also handles the following cases:**
- What if we lift two tensors that share the same symbol, e.g. x1 [s1, s2] and x2 [s2, s3]? Each subtracer maintains bound_symbols as a cache that maps a symbol.expr to its proxy in the current tracer. So when we see x1, we track s1 and s2 as inputs and bind s1 to ph1 and s2 to ph2. When we then try to bind the symbols of x2, s2 is already tracked, so no graph input is created for it.
- What if a subgraph closes over a symint? e.g.
```python
def f(x):
  def true_f():
    c = x.size(0)
    def true_fn_inner():
      return c
```
When we speculate true_fn_inner, we find that proxy_of_c is not tracked by tracer 2, so it recursively looks up its parent. At this point, x and its symbols have been lifted as inputs of true_f (as a result of lifting x while tracing true_f in tracer 1). Specifically, the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner():
```
So tracer 2 is able to find that s1 has been tracked as a ph in tracer 1, so it returns to gm and calls create_graph_input on s1. The graph now looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner(s1):
      return s1
```

- What if a subgraph closes over an unbacked symint? e.g.
```python
def f(x):
  def true_f():
    c =  x.item()
    def true_f_inner():
      return c
```
When x.item() is called, proxy_of_c and its symnode variable are created for tracer 1, and we also call track_unbacked_symbols to record this relationship. So when tracer 2 finds that proxy_of_c was not created by the current tracer, it recursively looks up its parent tracer and finds that the expression u0 has been tracked as a result of track_unbacked_symbols in tracer 1. So it stops the recursion and calls create_graph_input for u0 in tracer 2. The graph looks like:
```python
def f(x):
  def true_f(s1, s2, x):
    c = x.item()
    def true_gm_inner(u0):
      return u0
    cond(pred, true_gm_inner, false_gm_inner, (c,))
```

- What if a subgraph closes over a tensor with an unbacked symint shape?
```python
def f(x):
  def true_f():
    c = x.item()
    r = torch.randn((c,))
    def true_f_inner():
      return r + 1
```
This is the same as the case of closing over tensors with backed shapes: we first lift r, then bind u0 in it, which recursively calls bind_symint for u0 in its parent and finds that u0 is tracked in the parent tracer as a result of the .item() call.
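
The shared-symbol case from the list above (x1 [s1, s2] and x2 [s2, s3]) comes down to a per-tracer cache lookup. A small sketch, assuming symbols are plain strings; `lift_tensors` and the `ph_` naming are hypothetical and only illustrate the caching idea.

```python
from typing import Dict, List, Sequence


def lift_tensors(tensors: Dict[str, Sequence[str]]) -> List[str]:
    """Return the graph inputs created when lifting `tensors` in order.

    `tensors` maps a tensor name to the symbols that appear in its shape.
    """
    bound_symbols: Dict[str, str] = {}  # symbol -> placeholder (the cache)
    graph_inputs: List[str] = []
    for name, shape_symbols in tensors.items():
        graph_inputs.append(name)
        for sym in shape_symbols:
            if sym in bound_symbols:
                continue  # already tracked: no new graph input
            bound_symbols[sym] = f"ph_{sym}"
            graph_inputs.append(bound_symbols[sym])
    return graph_inputs


print(lift_tensors({"x1": ["s1", "s2"], "x2": ["s2", "s3"]}))
# ['x1', 'ph_s1', 'ph_s2', 'x2', 'ph_s3']
```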

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138363
Approved by: https://github.com/zou3519
2024-11-07 04:44:32 +00:00
Justin Chu
3368f3ad41 [ONNX] Update TorchTensor implementation to handle fake mode (#139534)
Update TorchTensor implementation to handle fake mode better. Specifically, we disable fake mode before calling detach() etc. when getting the weights if it is already a real tensor so we do not lose it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139534
Approved by: https://github.com/fatcat-z, https://github.com/titaiwangms
2024-11-07 04:36:24 +00:00
Gabriel Ferns
2037ea3e15 Add type annotations to Configs (#139833)
Summary:
Adds types to Configs, and fixes a bug in options that was caused by the lack of types.

fixes: https://github.com/pytorch/pytorch/issues/139822

Configs are used by many modules so not sure which label to put.

Types also allow https://github.com/pytorch/pytorch/pull/139736 to fuzz configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139833
Approved by: https://github.com/c00w
2024-11-07 03:49:09 +00:00
Yifu Wang
5203138483 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```
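
To make the rasterization order concrete, the loop above can be reduced to the sequence of chunk indices it visits. This sketch only reproduces the index arithmetic; `chunk_consumption_order` is a hypothetical helper, and the signal wait and tile computation are omitted.

```python
from typing import List


def chunk_consumption_order(num_chunks: int, a_chunk_pivot: int) -> List[int]:
    """Order in which input chunks are waited on and consumed."""
    return [
        chunk_idx % num_chunks
        for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot)
    ]


print(chunk_consumption_order(4, 0))  # [0, 1, 2, 3]
print(chunk_consumption_order(4, 2))  # [2, 3, 0, 1]
```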

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.
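
A sketch of how the pivot spreads ranks across different m indices. The `pivot` formula is the one quoted above; choosing `rank * tiles_per_rank` as the pivot value is an assumption made for illustration and is not stated in the PR description.

```python
def pivot(m: int, tile_idx_pivot_m: int, tiles_m: int) -> int:
    """Rotate the m tile index by the pivot, wrapping around tiles_m."""
    return (m + tile_idx_pivot_m) % tiles_m


world_size = 4
tiles_m = 8
tiles_per_rank = tiles_m // world_size  # assumed per-rank offset

first_tiles = []
for rank in range(world_size):
    order = [pivot(m, rank * tiles_per_rank, tiles_m) for m in range(tiles_m)]
    first_tiles.append(order[0])
    print(rank, order)
```

With these assumed values, each rank starts on a different m tile (0, 2, 4, 6), so the ranks do not all request the same chunk at the same time.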

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us
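
For reference, the relative speedups implied by the timings listed above can be computed directly. The numbers are copied from the two benchmark lists; the `speedups` helper is only illustrative arithmetic.

```python
from typing import Dict

timings_us: Dict[str, Dict[str, int]] = {
    "4096x3584x8192": {
        "cublas + nccl": 539,
        "decomp-based async-tp w/o cuda graph": 694,
        "decomp-based async-tp w/ cuda graph": 478,
        "new cutlass kernel": 408,
    },
    "2048x3584x8192": {
        "cublas + nccl": 301,
        "decomp-based async-tp w/o cuda graph": 687,
        "decomp-based async-tp w/ cuda graph": 356,
        "new cutlass kernel": 276,
    },
}


def speedups(timings: Dict[str, int]) -> Dict[str, float]:
    """Time of each baseline divided by the time of the new kernel."""
    new = timings["new cutlass kernel"]
    return {
        name: round(t / new, 2)
        for name, t in timings.items()
        if name != "new cutlass kernel"
    }


for shape, timings in timings_us.items():
    print(shape, speedups(timings))
```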

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-07 03:43:12 +00:00
Sun, Jiayi
a59132b9c8 fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661)
Fix https://github.com/pytorch/pytorch/issues/132634.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2024-11-07 03:21:36 +00:00
PyTorch MergeBot
604e353cae Revert "Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)"
This reverts commit 060bee7f22.

Reverted https://github.com/pytorch/pytorch/pull/139787 on behalf of https://github.com/huydhn due to Sorry for reverting this, but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/139787#issuecomment-2461234683))
2024-11-07 03:17:16 +00:00
Ryan Guo
f459c3095f [dynamo] Document codegen and clean up some code paths (#139670)
This patch
1. Adds documentation to `PyCodegen.__call__`, `PyCodegen.tempvars` and
   the `allow_cache` flag.
2. Merges a few existing code paths in `PyCodegen.__call__`.
3. Removes the `elif var in cg.tempvars` code path in
   `codegen_save_tempvars`, because it's no longer needed after #113725,
   as we have up-to-date `VariableTracker.source` now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139670
Approved by: https://github.com/jansel
ghstack dependencies: #139538
2024-11-07 03:14:16 +00:00
Ryan Guo
183b386cb2 [dynamo] Simplify Codegen for variables with MutableSideEffects (#139538)
This effectively undoes #115095, which is no longer needed after #113725.

Why did we need #115095? I went back in history and found that [this line](https://github.com/pytorch/pytorch/pull/113725/files#diff-0bb1756725c4426408938314b0c9d3988ae5bf49994892d7038ad7746e209e9fR86)
actually fixed what #115095 fixed. Specifically, without the
`allow_cache` check for the "dup_top" optimization, we could incorrectly
codegen based on source, despite `codegen_update_mutated` requested to
codegen from value, for updates to pre-existing lists, etc. Since #113725 added
the `allow_cache` check, we no longer need the `mutable_side_effects_from_source`
code path from #115095.

However, #115442 introduced a `value_from_source` flag which didn't
account for the `mutable_side_effects_from_source` branch. So this patch
adds an extra check to keep existing behavior for export, and leaves a
TODO for investigating what exactly export wants from codegen, when it
comes to side effects and sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139538
Approved by: https://github.com/jansel
2024-11-07 03:14:16 +00:00
Valentine233
cf0bb6c435 [cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)
Reopen https://github.com/pytorch/pytorch/pull/121782, as more optimizations have landed.

Fixes https://github.com/pytorch/pytorch/issues/115261, https://github.com/pytorch/pytorch/issues/113017.
For CPU inductor path, remove -ftree-loop-vectorize from optimization flags to fix functional issues.

### Validation on 3 benchmark suites

#### FP32
![image](https://github.com/user-attachments/assets/ec920928-fa36-467f-ba07-d2c05c51b92e)

Outlier models (speedup<0.8, single socket): None.

#### BF16
![image](https://github.com/user-attachments/assets/4a301e5e-147d-4b74-beb1-40290969ed80)

Outlier models (speedup<0.8, single socket multi threads):

- functorch_dp_cifar10 0.58
- opacus_cifar10 0.57

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136827
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-11-07 02:49:52 +00:00
Gregory Comer
617b4538f1 Support symbolic builtin round in export (#139549)
Differential Revision: D65380866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139549
Approved by: https://github.com/digantdesai, https://github.com/angelayi
2024-11-07 02:49:44 +00:00
Edward Z. Yang
4e647871d6 Ensure TORCH_TRACE is run for Dynamo/Distributed tests (#139786)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139786
Approved by: https://github.com/bobrenjc93, https://github.com/c00w, https://github.com/anijain2305
ghstack dependencies: #139716
2024-11-07 01:58:05 +00:00
Bin Bao
d0ffd6d142 [AOTI] Add data_ptr to RAIIAtenTensorHandle (#139895)
Summary: To increase the readability of the generated code. This is not BC-breaking, because RAIIAtenTensorHandle is implemented as header-only.

Differential Revision: [D65547216](https://our.internmc.facebook.com/intern/diff/D65547216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139895
Approved by: https://github.com/chenyang78
2024-11-07 01:36:28 +00:00
sgui/a3213105
4ddf015e7d [ONNX export] exporting model to onnx error when tensor.index_fill ops met dim=0 #139594 (#139596)
When the index_fill op's `dim` parameter is 0, there is no need to unsqueeze the index tensor's dimension. So we return the index tensor directly if the size of axes_i == 0.

Fixes #139594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139596
Approved by: https://github.com/justinchuby
2024-11-07 01:32:34 +00:00
Bin Bao
bd5a2c2c71 [AOTI] Simplify the return code (#139889)
Summary:
```
    if constexpr (std::is_same_v<std::decay_t<decltype(buf3)>,RAIIAtenTensorHandle> || std::is_same_v<std::decay_t<decltype(buf3)>,AtenTensorHandle> || std::is_same_v<std::decay_t<decltype(buf3)>,ConstantHandle>) {
        output_handles[0] = buf3.release();
    } else {
        thread_local ThreadLocalCachedOutputTensor<std::decay_t<decltype(buf3)>> cached_output_0(buf3);
        cached_output_0.copy_data_from(buf3);
        AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&output_handles[0]));
        AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_assign_tensors(cached_output_0.tensor(), output_handles[0]));
    }
```
->
```
 output_handles[0] = buf3.release();
```

Test Plan: CI

Differential Revision: D65460719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139889
Approved by: https://github.com/chenyang78
2024-11-07 01:28:43 +00:00
Valentine233
6fcef86cfa [inductor] fix the unligned variable ranges issue in fuse node (#138568)
Fixes #138550.

### Description
In the fusion of two nodes, the node with fewer variables (`node_to_recomp`) makes its variable ranges align with those of the other node (`ref_node`). In detail, `node_to_recomp` changes its variable ranges to the original ranges of `ref_node`. However, if both nodes have changed their ranges, i.e., their simplified variable ranges differ from their original ones, the issue comes up.

### Solution
For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-11-07 01:17:58 +00:00
Songhao Jia
59ec011855 [numerical debugger] bumped up the starting handler id (#139666)
Differential Revision: D65445250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139666
Approved by: https://github.com/tarun292, https://github.com/dulinriley
2024-11-07 01:00:43 +00:00
Colin L. Rice
e675c6702d justknobs: Remove JustKnobsConfig and justknobs_feature (#138767)
This never ended up getting used, and instead we're doing this
resolution within the configuration system.

Removing these unused internal features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138767
Approved by: https://github.com/ezyang
ghstack dependencies: #138766, #138956
2024-11-07 00:21:46 +00:00
Sam Larsen
52446d7f30 Revert D65290089 (#139893)
Summary:
This diff reverts D65290089
This change introduces more logging than I realized and could present problems for tlparse.

Test Plan: NA

Reviewed By: jamesjwu

Differential Revision: D65541060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139893
Approved by: https://github.com/jamesjwu
2024-11-07 00:10:09 +00:00
Animesh Jain
ac5fa26e07 [dynamo][weakref] Support weakref.ref call (#139914)
Should fix - https://github.com/pytorch/pytorch/pull/135001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139914
Approved by: https://github.com/jansel
ghstack dependencies: #139856
2024-11-06 23:16:41 +00:00
Animesh Jain
738bfff5f9 [dynamo][user-defined] Fix bugs with method descriptors (#139856)
Should fix some problems in https://github.com/pytorch/pytorch/pull/138080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139856
Approved by: https://github.com/jansel
2024-11-06 23:16:40 +00:00
eellison
060bee7f22 Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)
Previously we were checking for a last dim with stride == 1. When the size is <= 1 that also is sufficient because the stride is insignificant. Fix for https://github.com/pytorch/pytorch/issues/138317
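
The relaxed condition can be written as a small predicate over sizes and strides. This is a sketch of the rule as described, not the actual Inductor constraint code; `last_dim_ok` is a hypothetical name.

```python
from typing import Sequence


def last_dim_ok(sizes: Sequence[int], strides: Sequence[int]) -> bool:
    """True if the last dim has stride 1, or has size <= 1 (stride irrelevant)."""
    return strides[-1] == 1 or sizes[-1] <= 1


print(last_dim_ok((2, 8, 64), (512, 64, 1)))    # True: stride 1
print(last_dim_ok((2, 8, 1), (8, 1, 8)))        # True: size 1, stride ignored
print(last_dim_ok((2, 8, 0), (0, 0, 5)))        # True: size 0, stride ignored
print(last_dim_ok((2, 8, 64), (1024, 128, 2)))  # False
```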

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139787
Approved by: https://github.com/drisspg
2024-11-06 22:53:01 +00:00
Sam Larsen
b8cf324e50 [pt2 logging] move remote cache get/put logging up one level (#139423)
Summary: I need to refactor the way we record CompilationMetrics. It will be much easier to do in OSS and having the relevant timing code in the OSS area of the codebase will make this much easier. I doubt this meaningfully changes the values we see.

Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x

Differential Temp Revision: D65290089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423
Approved by: https://github.com/oulgen
2024-11-06 22:44:53 +00:00
Max Podkorytov
8f077b811b [ROCm][Inductor]Fixing missing ck package warning when the backend is disabled (#139790)
```

test_addmm_multiple_dynamic_cuda (__main__.AOTInductorTestABICompatibleCuda) ... W1101 10:26:20.492000 1361741 torch/_inductor/utils.py:1207] Please pip install Composable Kernel package
AUTOTUNE addmm(16x6, 16x16, 16x6)
  triton_mm_0 0.0104 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=0, num_stages=2, num_warps=1
  triton_mm_1 0.0104 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=2, num_warps=1
SingleProcess AUTOTUNE benchmarking takes 0.2182 seconds and 0.2979 seconds precompiling for 2 choices
```
This PR disables the warning message when the CK backend is disabled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139790
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-11-06 22:04:32 +00:00
eellison
aafb3deaf1 Remove multinomial from cudagraph skip list' (#139897)
Since https://github.com/pytorch/pytorch/pull/134818/files we can run multinomial in cudagraph without error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139897
Approved by: https://github.com/BoyuanFeng
2024-11-06 21:28:42 +00:00
Justin Chu
86475dfc9f [ONNX] Prioritize strict=False export strategy (#139905)
Prioritize the `strict=False` export strategy in ONNX export because it is preferred according to @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139905
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-11-06 21:27:29 +00:00
Shunting Zhang
779c0b80cd [inductor] collect memory snapshort in the wrapper (#138429)
To collect memory snapshot for a generated wrapper, run the wrapper with `--cuda-memory-snapshot`. E.g.
```
python /tmp/torchinductor_shunting/tmpyhtfwdlv/wp/cwpulanbieu4beruc6w5uc3podcs2x3rzdk5okftu37c4k3bnd4b.py --cuda-memory-snapshot
```
gives me:

<img width="800" alt="Screenshot 2024-11-05 at 3 53 47 PM" src="https://github.com/user-attachments/assets/82edd2d6-df57-488e-a390-8fa5fc00ba5f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138429
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #139136, #138756
2024-11-06 21:22:18 +00:00
Colin L. Rice
2a857e940d config: Add env_name_default and env_name_force to Config (#138956)
This allows Configs to handle setting their defaults (or overriding
themselves) via environment variables.

The environment variables are resolved at install time (which is usually
import time). This is done 1) to avoid any race conditions between
threads etc..., but 2) to help encourage people to just go modify the
configs directly, vs overriding environment variables to change
pytorch behaviour.
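
A rough sketch of the install-time resolution described above. `ToyConfig` is a hypothetical class, not the real Config implementation; the field names `env_name_default` and `env_name_force` come from the commit title, while the precedence (forced env var, then user-set value, then default) is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    """Hypothetical config entry; not the real torch Config class."""

    default: str
    env_name_default: Optional[str] = None  # env var that replaces the default
    env_name_force: Optional[str] = None    # env var that overrides everything
    user_value: Optional[str] = None
    _forced: Optional[str] = None

    def install(self) -> None:
        # Read the environment exactly once, at install time.
        if self.env_name_default is not None:
            self.default = os.environ.get(self.env_name_default, self.default)
        if self.env_name_force is not None:
            self._forced = os.environ.get(self.env_name_force)

    def value(self) -> str:
        # Assumed precedence: forced env var > user-set value > default.
        if self._forced is not None:
            return self._forced
        if self.user_value is not None:
            return self.user_value
        return self.default


os.environ["TOY_BACKEND_DEFAULT"] = "from_env"
cfg = ToyConfig(default="builtin", env_name_default="TOY_BACKEND_DEFAULT")
cfg.install()
os.environ["TOY_BACKEND_DEFAULT"] = "changed_later"  # ignored: already resolved
print(cfg.value())  # from_env
```

Because the environment is read only in `install()`, later changes to the variable have no effect, which is the race-avoidance property the message describes.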
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956
Approved by: https://github.com/ezyang
ghstack dependencies: #138766
2024-11-06 21:20:42 +00:00
Oguz Ulgen
1270c78268 Add logging for num_triton_bundles (#139807)
Summary: Adding logs for number of inductor cache triton bundles

Test Plan:
Ran adhoc code and looked at dynamo_compile/sandbox

https://fburl.com/scuba/dynamo_compile/sandbox/nhktfy19

Differential Revision: D65490826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139807
Approved by: https://github.com/masnesral
2024-11-06 21:11:04 +00:00
PyTorch MergeBot
9018326bb8 Revert "[pt2 logging] move remote cache get/put logging up one level (#139423)"
This reverts commit c412a42ae2.

Reverted https://github.com/pytorch/pytorch/pull/139423 on behalf of https://github.com/ZainRizvi due to Reverted internally. See D65541060 for more details ([comment](https://github.com/pytorch/pytorch/pull/139423#issuecomment-2460765579))
2024-11-06 20:59:54 +00:00
zeshengzong
ff616c26fb Optimize isclose description (#139724)
Fixes #139563

Make description user friendly.

After Change:

![image](https://github.com/user-attachments/assets/88a805c0-0105-4441-812b-582c09abc72b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139724
Approved by: https://github.com/janeyx99
2024-11-06 19:30:44 +00:00
Joel Schlosser
3abbde976d Allow any single non-batch dim to be ragged for NJT (#137125)
Fixes #137512

Relaxes the restriction that the ragged dim is immediately next to the batch dim e.g. `(B, *, D_0, ..., D_N)`. This allows for constructing NJTs of shape e.g. `(B, D, j0)` directly. It's possible before this PR to get an NJT of e.g. shape `(B, D, j0)` by constructing an NJT of shape `(B, j0, D)` and transposing it. This PR allows a user to go straight there without the transpose. The standard `torch.nested.nested_tensor(list)` constructor has been updated to support this.
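
To illustrate what "any single non-batch dim may be ragged" means for a list of components, the sketch below infers the ragged dim from component shapes alone. It is a toy over plain tuples, not the logic of `torch.nested.nested_tensor`; `infer_ragged_dim` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def infer_ragged_dim(component_shapes: List[Tuple[int, ...]]) -> Optional[int]:
    """Index of the ragged dim in the batched shape, or None if none varies."""
    ndim = len(component_shapes[0])
    if any(len(shape) != ndim for shape in component_shapes):
        raise ValueError("all components must have the same number of dims")
    varying = [
        d for d in range(ndim)
        if len({shape[d] for shape in component_shapes}) > 1
    ]
    if len(varying) > 1:
        raise ValueError("at most one non-batch dim may be ragged")
    # +1 accounts for the leading batch dim of the nested tensor.
    return varying[0] + 1 if varying else None


print(infer_ragged_dim([(5, 3), (2, 3), (7, 3)]))  # 1, i.e. shape (B, j0, D)
print(infer_ragged_dim([(3, 5), (3, 2), (3, 7)]))  # 2, i.e. shape (B, D, j0)
```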

At the very least, this is useful for testing on transposed NJTs. I'm willing to make this functionality private if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137125
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-11-06 18:50:08 +00:00
Bin Bao
d1e2e81ede [AOTI] Fix two test failures from #139471 (#139885)
Summary: https://github.com/pytorch/pytorch/pull/139471 caused two internal test failures due to different compiler path settings.

Differential Revision: D65519537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139885
Approved by: https://github.com/hl475
2024-11-06 18:41:28 +00:00
Frank Li
6ed237e5b5 [pytorch] Make global module hook to pass kwargs similar to how module hook works (#137403)
Differential Revision: D63576353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137403
Approved by: https://github.com/mikaylagawarecki
2024-11-06 18:20:57 +00:00
Bin Bao
6bdbc86550 [AOTI] Fix a cubin file path issue (#139848)
Summary: When we use aoti_compile_and_package to package the AOTI compiled artifacts, cubin files are included, and at deploy time we should set the cubin file directory to the path that contains the unzipped cubin files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139848
Approved by: https://github.com/aakhundov
2024-11-06 16:45:30 +00:00
James Wu
dd6a5de00d Allow OpOverloadPackets as safe torch functions, sanitize dynamo gm before running aotdispatch with cache (#139785)
Summary:
This diff implements two things to improve cache hit rates after testing AOTAutogradCache with internal cogwheel jobs:
- We should allow torch functions that are OpOverloadPackets
- When running with cache, there are some fields that dynamo puts into the input graph module to aotdispatch that are not stable between runs. We use a context manager to null these out so that they can't be used to affect the output of AOTAutograd, and then we put the fields back onto the gm before returning from AOTAutogradCache.load().
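
The "null out, then restore" pattern from the second bullet can be sketched with a context manager. The object and field names here (`gm`, `meta`) are hypothetical; which fields Dynamo actually attaches is not listed in the summary.

```python
from contextlib import contextmanager
from types import SimpleNamespace
from typing import Iterator, Sequence


@contextmanager
def sanitized(gm: SimpleNamespace, unstable_fields: Sequence[str]) -> Iterator[SimpleNamespace]:
    """Temporarily set unstable fields to None, restoring them on exit."""
    saved = {name: getattr(gm, name) for name in unstable_fields}
    for name in unstable_fields:
        setattr(gm, name, None)
    try:
        yield gm
    finally:
        # Put the original values back, even if the body raised.
        for name, value in saved.items():
            setattr(gm, name, value)


gm = SimpleNamespace(graph="stable", meta={"run_id": 123})
with sanitized(gm, ["meta"]) as clean:
    inside = clean.meta  # None while the cache key/lookup would run
print(inside, gm.meta)
```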

Test Plan:
New unit tests + running nanogpt with AOTAutogradCache.

Meta:

Run on a long running job
Cache miss:
 {F1953831996}

Cache hit:
 {F1953830872}

Servicelabs here:
https://www.internalfb.com/servicelab/experiment/4301352991/

Cache hit:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f660597709-TrainingApplication/attempt_0/version_0/rank_0/index.html

Cache miss:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f660569960-TrainingApplication/attempt_0/version_0/rank_0/index.html

We can see that with these changes, autograd cache hits and saves compile time:
https://fburl.com/scuba/pt2_compile_events/ycddxstd

Differential Revision: D65436373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139785
Approved by: https://github.com/bdhirsh
2024-11-06 16:34:02 +00:00
Edward Z. Yang
e05a096c49 Ignore polyfill when reporting user backtraces in summarized form (#139850)
Fixes https://github.com/pytorch/pytorch/issues/139316

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139850
Approved by: https://github.com/bobrenjc93
2024-11-06 16:33:34 +00:00
Nikita Shulga
68ef445c33 [MPS][Perf] Dispatch to SDP-math-mps for non-contig Tensors (#139791)
macOS 15 and newer support those out of the box. This significantly reduces memory requirements and improves performance for some Stable Diffusion networks.

Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
                                    subfolder='vae',
                                    torch_dtype=torch.bfloat16,
                                    force_upcast=False).to('mps')

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
                                                 torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_mps_mem)/1024.0**2:.2f} Mb")
image.save(f'bfloat16.png')
```

Before the change, total memory use was 16Gb and the run needed 65 sec to complete; after it, memory use drops to 14Gb and the run takes 50 sec on an M2 Pro, while the generated image remains the same:
![image](https://github.com/user-attachments/assets/1a35efef-9f80-4cd0-ac9c-30203eab6bb1)

Fixes https://github.com/pytorch/pytorch/issues/139389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
2024-11-06 16:25:39 +00:00
Alan Du
59cf4bc5ae Fix the use of fsspec transactions (#135541)
fsspec transactions do not support concurrency and assume that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file.
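
As a rough illustration of the commit-or-discard idea, here is a standard-library analogue rather than the fsspec-based code from this PR. The helper name `atomic_write` is hypothetical, and the sketch assumes a local filesystem where `os.replace` is atomic.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(path):
    """Write to a temporary sibling file; commit on success, discard on error."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    stream = os.fdopen(fd, "wb")
    try:
        yield stream
    except BaseException:
        # Discard: drop the partial file so readers never see it.
        stream.close()
        os.remove(tmp_path)
        raise
    else:
        # Commit: move the finished file into place.
        stream.close()
        os.replace(tmp_path, path)
```

Each call owns its own temporary file, so concurrent writers do not share any per-filesystem transaction state.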

I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
2024-11-06 15:16:12 +00:00
Sun, Jiayi
44df6522ee add Half/BFloat16 support for grid_sample on CPU (#134812)
Fix https://github.com/pytorch/pytorch/issues/127224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812
Approved by: https://github.com/Skylion007, https://github.com/mingfeima
2024-11-06 14:02:08 +00:00
cyy
d558c1a047 Enable cppcoreguidelines-special-member-functions (#139132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132
Approved by: https://github.com/sraikund16
2024-11-06 13:42:20 +00:00
Nikita Shulga
c0c6bf4ef2 Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353
2024-11-06 13:34:45 +00:00
PyTorch MergeBot
44e4949bcf Revert "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)"
This reverts commit 22e89ea2aa.

Reverted https://github.com/pytorch/pytorch/pull/139595 on behalf of https://github.com/malfet due to It broke number of tests, see 22e89ea2aa ([comment](https://github.com/pytorch/pytorch/pull/139595#issuecomment-2459754355))
2024-11-06 13:31:26 +00:00
PyTorch MergeBot
10d7729333 Revert "Enable cppcoreguidelines-special-member-functions (#139132)"
This reverts commit a9b4989c72.

Reverted https://github.com/pytorch/pytorch/pull/139132 on behalf of https://github.com/ZainRizvi due to Sorry but this fails on trunk. See inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_smooth_quant_with_int_mm [GH job link](https://github.com/pytorch/pytorch/actions/runs/11699366379/job/32591132460) [HUD commit link](22e89ea2aa) ([comment](https://github.com/pytorch/pytorch/pull/139132#issuecomment-2459743145))
2024-11-06 13:27:42 +00:00
PyTorch MergeBot
53299b8a38 Revert "Don't use deprecated type properties in UpsampleKernel (#139399)"
This reverts commit 0058f71002.

Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/malfet due to And it was backed out again due to the internal usages of deprecated API ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2459740090))
2024-11-06 13:23:43 +00:00
Jack Taylor
5f266b5a02 [ROCm] re-enable flex attention UTs (#139632)
https://github.com/pytorch/pytorch/pull/136792 accidentally disabled flex attention UTs on ROCm. Re-enabling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139632
Approved by: https://github.com/drisspg
2024-11-06 12:49:44 +00:00
Michael Lazos
d622b490d6 [Dynamo] Support tensor mro without source (#139838)
Fixes https://github.com/pytorch/pytorch/issues/137743

The issue here is that if `type` was called on a tensor without a source, we wouldn't have a source even for `torch.Tensor`, and the `__mro__` retrieval would fail. Since `torch.Tensor` is an internal torch type, I add handling for it in `call_type` in builtins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139838
Approved by: https://github.com/williamwen42
2024-11-06 08:52:53 +00:00
cyy
a9b4989c72 Enable cppcoreguidelines-special-member-functions (#139132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132
Approved by: https://github.com/sraikund16
2024-11-06 07:59:09 +00:00
Xia, Weiwen
22e89ea2aa [Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.
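
For orientation, the unfused arithmetic can be sketched in plain Python on nested lists. This is only a toy model of the math, not the Torchao or Inductor implementation: the names `int_mm` and `smoothquant_linear` are hypothetical, and the scale layouts (one `a_scale` entry per activation row, one `b_scale` entry per output column) are assumptions made for the example.

```python
def int_mm(a, b):
    """Integer matmul on nested lists (toy stand-in for `_int_mm`)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smoothquant_linear(a_q, b_q, a_scale, b_scale, bias=None):
    """Toy version of: _int_mm(a, b) -> mul(b_scale) -> mul(a_scale) -> optional add."""
    acc = int_mm(a_q, b_q)  # integer accumulation
    out = [
        [(acc[i][j] * b_scale[j]) * a_scale[i] for j in range(len(b_scale))]
        for i in range(len(a_scale))
    ]
    if bias is not None:
        out = [[v + bias[j] for j, v in enumerate(row)] for row in out]
    return out


a_q = [[1, -2], [3, 4]]   # quantized activations
b_q = [[5, 6], [7, -8]]   # quantized weights
result = smoothquant_linear(a_q, b_q, a_scale=[2.0, 0.5], b_scale=[0.5, 0.25])
# result == [[-9.0, 11.0], [10.75, -1.75]]
```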

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-11-06 07:54:47 +00:00
Huy Do
c19c384690 Fix torch.load (torch.utils.benchmark) after #137602 (#139810)
After #137602, the default `weights_only` has been set to True.  This test is failing in trunk slow jobs atm

benchmark_utils/test_benchmark_utils.py::TestBenchmarkUtils::test_collect_callgrind [GH job link](https://github.com/pytorch/pytorch/actions/runs/11672436111/job/32502454946) [HUD commit link](1aa71be56c)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139810
Approved by: https://github.com/kit1980
2024-11-06 03:08:29 +00:00
Colin Peppler
63b01f328e [inductor] support masked_scatter w/ unbacked sized source (#138083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138083
Approved by: https://github.com/jansel
2024-11-06 02:16:25 +00:00
cyy
028c5d3426 [2/N] Replace c10::sv with std::sv (#139456)
Follows  #139453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139456
Approved by: https://github.com/ezyang
2024-11-06 01:50:38 +00:00
Andrew Gu
39ede99a33 Add current FSDP2 path to old composable FSDP1 warning (#139759)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139759
Approved by: https://github.com/weifengpy, https://github.com/wz337
ghstack dependencies: #139650
2024-11-06 01:43:04 +00:00
David Berard
aec179e2be Fix docs for logcumsumexp formula (#139768)
The previous formula was wrong and reused some indexing variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139768
Approved by: https://github.com/janeyx99
2024-11-06 01:19:09 +00:00
Laith Sakka
a787320d0f Do not try to optimize new implications in get_implications (#139738)
Summary:
Saves around 8% on the torchrec model.
In most cases the new implications cannot be optimized anyway; in some cases they can be,
but optimizing them is useless.

ex:
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding True
adding False
```

VS
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding 0 <= Mod(s0, 3)
adding Mod(s0, 3) < 0
```
The main difference is that `0 <= Mod(s0, 3)` can be simplified to `True` and `Mod(s0, 3) < 0` to `False`, but with this change
that no longer happens. The `True: True` and `False: False` entries are useless anyway, so this should be fine.
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```

<img width="1082" alt="Screenshot 2024-11-04 at 9 25 51 PM" src="https://github.com/user-attachments/assets/a26e291b-9280-4b55-9275-f3201a36ac51">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139738
Approved by: https://github.com/ezyang
ghstack dependencies: #139703
2024-11-06 00:23:40 +00:00
Will Feng
6a30c14a0a [Traceable FSDP2] Run any unexecuted post_backward at beginning of pre_backward hook (#139671)
Assuming the forward pass user code looks like:
```
for _ in range(2):
    x = layer(x)
```
and we have `fully_shard(layer)`, then:
- the forward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time-> reshard layer" (currently same for both eager and compile)
- the backward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time-> reshard layer" in eager, but currently it's "unshard layer -> call layer 1st time -> call layer 2nd time -> reshard layer" in compile

The behavior in the backward pass is different between eager and compile, which is not ideal.

I am currently trying to find a way to fix this non-ideal behavior of compile. I tried a few things:
1. Tracing the RegisterPostBackwardFunction custom autograd function - this still seems to be a no-go, due to HOP not supporting side-effects.
2. Instead of custom autograd function, do a "multi-grad hook" to wait for all gradients to be ready before triggering post_backward. However, this approach seems to have bad interaction with register_hook of pre_backward, in the sense that it's unclear which of them will be triggered first in practice.
3. Force execute any pending post_backward before unshard in pre_backward hook, and rely on compiler to move the reshard to the right place to optimize peak memory. -> This PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139671
Approved by: https://github.com/awgu
2024-11-06 00:19:06 +00:00
Xiaodong Wang
e7cf7d00be Support torch.bool in torch.sort + CUDA (#139409)
Summary: This restriction might be outdated, so I'm adding support back to see if we pass all the tests. I'm pretty sure cuda12 is ok.

Test Plan: CI

Differential Revision: D65282650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409
Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy
2024-11-06 00:02:54 +00:00
Aaron Orenstein
06f619d999 typing ir.py - part 2 (#131846)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131846
Approved by: https://github.com/eellison
ghstack dependencies: #139238
2024-11-06 00:01:15 +00:00
Aaron Orenstein
c2109ec479 typing ir.py - Disallow untyped defs for ir.py (#139238)
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`)
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-06 00:01:15 +00:00
leslie-fang-intel
82e4de4994 [Inductor][CPU] Enable the oneDNN Linear fusion for special case (#139172)
**Summary**
In the case of LLaMA2, a linear operation with an activation size of `(4, 1, 4096)` and a stride of `(4096, 128, 1)` is decomposed into `matmul`, and the decomposition of `matmul` results in `bmm` due to a strict contiguity check. We can align the contiguity check with ATen by skipping dims of size 1, which enables decomposition into `mm` instead.
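
To make the size-1 rule concrete, here is a small pure-Python sketch of a row-major contiguity check with and without skipping dims of size 1. It is not the Inductor or ATen code; `is_contiguous` and its `skip_size_one` flag are hypothetical names used only for this illustration.

```python
def is_contiguous(shape, stride, skip_size_one=True):
    """Row-major contiguity check, optionally ignoring dims of size 1."""
    expected = 1
    # Walk from the innermost dim outwards, tracking the stride a dense layout would have.
    for size, st in zip(reversed(shape), reversed(stride)):
        if skip_size_one and size == 1:
            continue  # the stride of a size-1 dim never affects addressing
        if st != expected:
            return False
        expected *= size
    return True


shape, stride = (4, 1, 4096), (4096, 128, 1)
strict = is_contiguous(shape, stride, skip_size_one=False)   # False
relaxed = is_contiguous(shape, stride, skip_size_one=True)   # True
```

With the strict check, the stride of 128 on the size-1 dim is compared against the dense value 4096 and fails; once that dim is skipped, the remaining strides match a dense layout.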

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_input_non_contiguous_3D_wo_bias
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139172
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-11-05 23:49:53 +00:00
Thomas Bohnstingl
d1c26b0781 Improvements for associative_scan - slicing of xs (#138858)
In this PR, the combine_fn is consistently called with a slice along the scan dim. It implements part of https://github.com/pytorch/pytorch/pull/136966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138858
Approved by: https://github.com/ydwu4
2024-11-05 23:38:21 +00:00
Mikayla Gawarecki
86d7d39bff Forward fix D65441551 for T206731737 (#139767)
Test Plan: -

Differential Revision: D65482429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139767
Approved by: https://github.com/awgu
2024-11-05 23:19:08 +00:00
Shuqiang Zhang
c0d642a295 [pgnccl][simple] log started work numel (#139773)
Summary:
We saw some cases where the same work was started on multiple ranks but
did not complete. Logging the numel of the started work lets us check whether it matches across ranks.
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139773
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-11-05 23:11:19 +00:00
PyTorch MergeBot
1d28b8b6d5 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit e84d1121ad.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. More details in D65483292 ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2458381056))
2024-11-05 23:10:38 +00:00
drisspg
16da289402 [Workspace Inductor] Fix dynamic shapes (#139777)
# Summary
Arg ordering was wrong when dynamic shapes are enabled and we pass in the additional size args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139777
Approved by: https://github.com/eellison
ghstack dependencies: #139157
2024-11-05 22:34:09 +00:00
Animesh Jain
b09eb6ed6a [dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)
This is a bug on the main exposed by https://github.com/pytorch/pytorch/issues/139476

We have a dict tag optimization where, if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors as immutable in such scenarios. This is critical for
guard eval performance, because generally users don't change their
parameters.

If I try to remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.

So, I am adding a flag which keeps the current state but allows
users to remove this optimization. Not ideal, but given how serious guard eval perf has to be,
we are in the gray area of the unsoundness vs. performance tradeoff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
2024-11-05 21:48:07 +00:00
Yidi Wu
6734cb7bf2 [hop free symbols] refactor tensor.to_list implementation to call wrap_fx_proxy. (#139663)
Refactoring only. Previously, we manually called SymNodeVariable.create; now we handle it with wrap_fx_proxy. This unifies the handling of operations that produce symints in wrap_fx_proxy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139663
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737, #138559
2024-11-05 20:19:09 +00:00
rzou
b9f0563aaf Add repro instructions to fx_graph_runnable.py (#139481)
This PR adds some instructions for how to add a TARGETS file to run the
fx_graph_runnable script. I'm planning to add some followups that will
add additional imports for custom ops and use autodeps to get the
dependencies, but I figure this PR is an easy first step.

Test Plan:
- pytest test/dynamo/test_structured_trace.py
- Does anyone have suggestions for how to test this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139481
Approved by: https://github.com/eellison
2024-11-05 19:24:16 +00:00
Ryan Guo
01bcf37123 [dynamo][NFC] Remove some dead code paths (#139674)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139674
Approved by: https://github.com/Skylion007, https://github.com/anijain2305, https://github.com/mlazos
2024-11-05 19:12:17 +00:00
Ryan Guo
2b3a227b35 [dynamo] Add is_mutable() and is_immutable() methods to VariableTracker (#139341)
This patch adds 2 simple methods `VariableTracker.is_mutable()` and
`VariableTracker.is_immutable()`, which helps clarify intention. For
instance, rather than writing
```python
if var.mutation_type:
    ...
```
After this patch one can write
```python
if var.is_mutable():
    ...
```

This patch also simplifies `mutation_type` propagation in some
`ListVariable` methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139341
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #139339, #139340
2024-11-05 19:11:41 +00:00
Ryan Guo
0ba3962b80 [dynamo][NFC] Move MutationType classes into variables/base.py (#139340)
As title, this addresses
https://github.com/pytorch/pytorch/pull/137905/files#r1806800222.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139340
Approved by: https://github.com/anijain2305
ghstack dependencies: #139339
2024-11-05 19:11:41 +00:00
Ryan Guo
693a0a1bd4 [dynamo][NFC] Rename mutable_local and add documentation (#139339)
This patch addresses the renaming part of #133027, specifically, it
renames the following and adds documentation for relevant classes.
1. `VariableTracker.mutable_local` to `mutation_type`
2. `MutableLocal` to `ValueMutationNew`
3. `MutableSideEffects` to `ValueMutationExisting`
4. `MutableLocalSource` to `SourceType`
5. `MutableLocalSource.Local` to `New`

Note that (2), (3) and (5) are mainly to bring consistency between them
and `AttributeMutationNew`, `AttributeMutationExisting`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2024-11-05 19:11:41 +00:00
Ke Wen
5f2ed505eb [PGNCCL] Watchdog prints call-time traceback when reporting timeout (#139659)
### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with the user's program or library stack.

### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.

The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-11-05 19:07:17 +00:00
Yifu Wang
ee42a99745 [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

In combination, these aim to give users more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-05 18:47:24 +00:00
Boyuan Feng
87059d4547 [AOTAutograd] Handle edge cases for donated buffer & enable in oss (#139669)
This PR enables donated buffer in OSS and handles two edge cases:

1. While donated buffer relies on storage to check aliasing, sparse tensor subclasses do not provide access to storage. So we skip sparse tensor subclasses for donated buffer.
2. Handles missing "val" from n.meta. This is observed from `inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_11_cpu`,
`functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_simple_with_none_and_nontensor`, and
`inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139669
Approved by: https://github.com/bdhirsh
2024-11-05 18:38:20 +00:00
rzou
27ec3921bc Optimize mutable torch.library.custom_op overhead (#139513)
We don't need to loop over all the args and kwargs in the
ADInplaceOrView key; we just need to bump the version on the args and
kwargs that are mutable.

On the benchmark mentioned in
https://github.com/pytorch/pytorch/issues/139494
this made the time go from
```
mutate2 = 61.72943878173828
no_mutate2 = 36.89440155029297
mutate = 236.3092498779297
no_mutate = 59.31964874267578

```
to
```
mutate2 = 47.976478576660156
no_mutate2 = 38.37468719482422
mutate = 71.21315002441406
no_mutate = 59.7432975769043
```

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139513
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139509
2024-11-05 18:30:53 +00:00
Tomasz Bohutyn
9dc5851f5d handle more devices in method_type method of TensorVariable (#138078)
Fixes #138077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138078
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-11-05 18:19:52 +00:00
Angela Yi
de509abe1c [export] Dedup data-dependent errors based on stacktrace (#139540)
Summary:
Dedup the data-dependent errors based on the stacktrace they point to. Right now we just display every propagate-real-tensor log that shows up, but we can dedup them if they are due to the same piece of code (e.g., there could be multiple calls to a piece of code that does some data-dependent computation).

This occurred when trying out draft export on the PT2I model zoo. For a specific model, previously we would get ~3k data dependent errors, but after deduping based on the stacktrace we now only get 4 errors.
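
The deduplication idea can be sketched as follows. This is a simplified illustration, not the export code: the record layout (a dict with a `"stack"` list of frame strings) and the function name `dedup_by_stack` are assumptions made for the example.

```python
def dedup_by_stack(errors):
    """Keep the first error seen for each distinct stack trace, preserving order."""
    seen = set()
    unique = []
    for err in errors:
        key = tuple(err["stack"])  # lists are unhashable, so key on a tuple
        if key not in seen:
            seen.add(key)
            unique.append(err)
    return unique


errors = [
    {"msg": "u0 > 5", "stack": ["model.py:10", "layer.py:42"]},
    {"msg": "u1 > 5", "stack": ["model.py:10", "layer.py:42"]},  # same code path
    {"msg": "u2 == 0", "stack": ["model.py:20", "head.py:7"]},
]
unique_errors = dedup_by_stack(errors)  # two entries remain
```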

Test Plan: CI

Differential Revision: D65374254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139540
Approved by: https://github.com/pianpwk, https://github.com/zou3519
2024-11-05 18:16:05 +00:00
Sam Ginzburg
cc25b6d7ba [inductor] Error on unsupported autotuner configs (#139658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139658
Approved by: https://github.com/aakhundov
2024-11-05 18:09:02 +00:00
Junjie Wang (PyTorch)
41e4d88584 [logging][ez] Add timer logging for pickling and unpickle for object based collective (#139757)
Summary: As discussed, we want to measure the time spent during pickling and unpickling.

Test Plan: CI

Reviewed By: wz337

Differential Revision: D65462767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139757
Approved by: https://github.com/awgu, https://github.com/Skylion007, https://github.com/fegin, https://github.com/c-p-i-o
2024-11-05 17:40:27 +00:00
Oguz Ulgen
c0d21b6581 End TritonBundle on non-cache write codepaths (#139698)
Summary:
When we bypassed the cache write in inductor, we also forgot to reset the bundle. This moves resetting the bundle into the post_compile step so it gets reset uniformly.

This diff also turns on the cache for internal so that we can do a code rollout.

Test Plan: updated tests

Differential Revision: D65457224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
2024-11-05 17:00:40 +00:00
PyTorch MergeBot
4d5cc1b4ef Revert "[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)"
This reverts commit e6ff07f00e.

Reverted https://github.com/pytorch/pytorch/pull/139560 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking internal tests. Please see D65430317 for more details ([comment](https://github.com/pytorch/pytorch/pull/139560#issuecomment-2457620720))
2024-11-05 16:22:30 +00:00
cyy
a2bc2e38f9 Use clang-tidy 17 (#139678)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139678
Approved by: https://github.com/Skylion007
2024-11-05 16:00:25 +00:00
Junjie Wang (PyTorch)
13eb3b3f6f [Torch Elastic] Fix the bug caused by wrong host address in creating TCPStore server inside dynamic rendezvous (#139702)
Summary: During dynamic rendezvous, we shouldn't use the address from the store but should use `self._this_node.addr` directly, because sometimes the store host is not the host of rank0. Passing the wrong host causes a timeout error. This is a follow-up fix to S463164; for internal tests, we disable the TCPStore sharing for now.

Test Plan: CI.

Differential Revision: D65453312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
2024-11-05 15:28:03 +00:00
Edward Z. Yang
349cd49406 Fix compiler collective TORCH_TRACE and improve code state printing (#139716)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139716
Approved by: https://github.com/yf225
2024-11-05 14:32:52 +00:00
cyy
546318e559 [7/N] Don't skip ASAN on some tests (#139675)
Follows #139565
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139675
Approved by: https://github.com/ezyang
2024-11-05 14:01:01 +00:00
Xuehai Pan
e84d1121ad Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-05 10:44:56 +00:00
zeshengzong
ffb7a08921 Fix torch.histc not checking min > max on cuda for int8 tensors (#139372)
Fixes #139360

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)

Assigning `min` and `max` to the low-precision `input_t` variables `minvalue` and `maxvalue` causes a wrong comparison result in the following check:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)

![image](https://github.com/user-attachments/assets/0d5c87f4-3dc6-48bb-bcc8-b1803e7cd487)

Change the type of `minvalue` and `maxvalue` to fix it, similar to these lines:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)
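
A minimal sketch of the failure mode, in plain Python rather than the CUDA kernel. `ctypes.c_int8` stands in for the low-precision `input_t`, and the bounds 200 and 100 are made-up values for illustration; the helper names are hypothetical.

```python
import ctypes


def narrow_to_int8(value: int) -> int:
    """Mimic storing a wider bound in an int8 variable (wraps on overflow)."""
    return ctypes.c_int8(value).value


def bounds_ok(min_value: int, max_value: int) -> bool:
    """The validity check: the lower bound must not exceed the upper bound."""
    return min_value <= max_value


user_min, user_max = 200, 100  # invalid request: min > max

# Buggy shape: narrow first, then compare. 200 wraps to a negative int8 value,
# so the check passes even though the requested bounds are invalid.
buggy = bounds_ok(narrow_to_int8(user_min), narrow_to_int8(user_max))

# Fixed shape: compare in a type wide enough to hold the requested bounds.
fixed = bounds_ok(user_min, user_max)

print(buggy, fixed)
```

Keeping the bounds in a wider type until after the check is the idea this sketch illustrates; the actual type chosen in the kernel is not shown here.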

**Test Result**
```bash
$ pytest test/test_reductions.py -vv
```
![image](https://github.com/user-attachments/assets/6b5d0d48-ebc2-4a8c-85f4-dbad147c086c)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/f97c2d6d-78ea-4439-a1ba-907bc9defad7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372
Approved by: https://github.com/eqy
2024-11-05 08:42:38 +00:00
Laith Sakka
6ad52db8c8 use torch.sym_sum instead of incremental sum in _cat_meta (#139653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139653
Approved by: https://github.com/ezyang
2024-11-05 07:24:24 +00:00
Aaron Orenstein
51a3d6dbc3 Fix existing lint issues in ir.py (#139237)
- Remove stale mypy "type: ignores"
- Made ir.py pass the rest of the lints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139237
Approved by: https://github.com/Skylion007
2024-11-05 06:06:12 +00:00
Eli Simhayev
b2f5a5311b RMSNorms docs - remove biases initialization (#139620)
RMSNorm doesn't use a bias in `elementwise_affine`, so I've removed it from the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139620
Approved by: https://github.com/mikaylagawarecki
2024-11-05 05:59:41 +00:00
Chen, Zejun
9aaf3a04fa [profiler][UT] instantiate profiler UTs for devices and enable UTs for xpu profiler (#134316)
This PR makes the profiler-related UTs device-agnostic. It instantiates the profiler UTs for different device types and enables them on the XPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134316
Approved by: https://github.com/etaf, https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-11-05 05:46:13 +00:00
CaoE
9e14d86573 [Inductor][CPP] Add oneDNN BRGEMM config for Half cpp gemm template (#136255)
`kernel_micro_gemm` generated using BRGEMM:
```
template <bool accum>
inline void kernel_micro_gemm(
    const half* __restrict__ A,
    const half* __restrict__ B,
    float* __restrict__ C,
    int64_t M,
    int64_t N,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc
) {
    at::native::cpublas::brgemm(
      M, N, K,
      lda, ldb, ldc,
      1.f, accum ? 1.f : 0.f,
      A,
      B,
      C);
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136255
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 05:33:29 +00:00
Meet Vadakkanchery
c8a55eea88 [DCP] Fix process_group logging for DCP methods (#139428)
Summary:
Currently, we incorrectly log process_group for DCP based events.

We rely on [c10d_logger.py](https://fburl.com/v4mdme9z) to fill in information about process_group (e.g. backend, nccl_version if available).

In [checkpoint/logger.py](https://fburl.com/yho9nqbu) we pass the `msg_dict` to c10d_logger which never contains the `process_group` param, so [c10d_logger](https://fburl.com/zlw2ukxp) logs information about the default process_group which is always `NCCL`.

Test Plan:
Before:

Always defaults to NCCL even though GLOO is passed by caller.

{F1950847585}

After:

GLOO backend shows up.

{F1950848375}

Differential Revision: D65255871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139428
Approved by: https://github.com/teja-rao, https://github.com/mhorowitz
2024-11-05 05:24:38 +00:00
Animesh Jain
fe4fa1df9f [dynamo][eval_frame] Set the callback to None earlier for guard eval (#139655)
xref - https://fb.workplace.com/groups/1075192433118967/permalink/1536570810314458/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139655
Approved by: https://github.com/jansel, https://github.com/williamwen42
2024-11-05 05:18:46 +00:00
Gabriel Ferns
a766d84a3c Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
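
A rough sketch of that predicate, assuming a simplified model of buffer users. `User`, `is_inconsequential`, and `can_inplace` are hypothetical names for illustration, not Inductor's scheduler API, and the ancestors set is assumed to belong to the node that wants to inplace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical record of one node that reads a buffer."""
    name: str
    removed: bool = False    # node was removed from the graph
    completed: bool = False  # node has already been scheduled


def is_inconsequential(user: User, ancestors: set) -> bool:
    """A user no longer matters if it is removed, completed, or an ancestor."""
    return user.removed or user.completed or user.name in ancestors


def can_inplace(users: list, candidate: str, ancestors: set) -> bool:
    """The candidate may reuse the buffer if every other user is inconsequential."""
    return all(
        is_inconsequential(u, ancestors) for u in users if u.name != candidate
    )


users = [User("matmul", completed=True), User("layer_norm")]
print(can_inplace(users, "layer_norm", ancestors=set()))
```

With a pending user that is neither removed, completed, nor an ancestor, `can_inplace` returns `False` and the buffer is left alone.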

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-11-05 03:44:09 +00:00
Andrew Gu
9039fbb47e [FSDP2] Make module-to-state mapping use weakrefs (#139650)
Without this, `del model` does not free memory of a module with FSDP2 applied.
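
A small illustration of why a strong mapping pins its keys, using only the standard library. `Module` and `State` are hypothetical stand-ins, and `weakref.WeakKeyDictionary` is one way to get weak keys; the exact structure used by FSDP2 is not shown here.

```python
import gc
import weakref


class Module:
    """Hypothetical stand-in for an nn.Module."""


class State:
    """Hypothetical stand-in for per-module state."""


strong_map = {}                         # a plain dict keeps every key alive
weak_map = weakref.WeakKeyDictionary()  # weak keys do not keep modules alive

kept, freed = Module(), Module()
strong_map[kept] = State()
weak_map[freed] = State()

kept_ref, freed_ref = weakref.ref(kept), weakref.ref(freed)
del kept, freed
gc.collect()

print(kept_ref() is not None)  # the module is still pinned by strong_map
print(freed_ref() is None)     # the module was collected
print(len(weak_map))           # its entry disappeared with it
```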

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139650
Approved by: https://github.com/yf225
2024-11-05 02:16:52 +00:00
cyy
5008d15ae9 [2/N] Remove usage of C array (#139589)
Follows  #139567
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139589
Approved by: https://github.com/ezyang
2024-11-05 01:58:12 +00:00
CaoE
3672c688e3 Fix layout for SetSourceTensorKernel (#137973)
Fixes #136837.
`aten.set_.source_Tensor` will make the size and stride of the first input and output follow that of the second input: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorShape.cpp#L440. If the layouts of the two inputs are different, the following `assert_size_stride` will fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137973
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 00:55:17 +00:00
Edward Yang
639162f39a Add cache size to pt2_compile_events (#139627)
Summary:
I realized I wanted to check "are my cache entries/IO unreasonably large"
and there's no easy way to do it.  This lets me do it.

Test Plan: servicelab

Differential Revision: D65390363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139627
Approved by: https://github.com/c00w
2024-11-05 00:30:10 +00:00
Nikita Shulga
0058f71002 Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353, #139358
2024-11-05 00:29:58 +00:00
PyTorch MergeBot
4a3ee96427 Revert "Don't use deprecated type properties in UpsampleKernel (#139399)"
This reverts commit 9d096e4d9f.

Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/ZainRizvi due to Change reverted internally due to broken builds. See D65378845 ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2455959040))
2024-11-05 00:13:48 +00:00
cyy
64d9ee88d7 [11/N] Fix extra warnings brought by clang-tidy-17 (#139599)
Follows #139385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139599
Approved by: https://github.com/sraikund16
2024-11-04 23:57:41 +00:00
Laith Sakka
3f248a5735 Classify miss-inplaced tensors in logs. (#139240)
Summary:
Use signpost logs.
A follow-up is to remove the field possibly_missed_reinplacing_opportunities from the dynamo compile table.

Differential Revision: D65180194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139240
Approved by: https://github.com/zou3519
2024-11-04 23:56:14 +00:00
Mikayla Gawarecki
e947649e8f [BE] Change _marked_safe_globals_list to set (#139303)
Prevent same global from being added multiple times

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139303
Approved by: https://github.com/janeyx99
ghstack dependencies: #138936, #139221, #139433, #139541, #137602
2024-11-04 23:50:55 +00:00
Pian Pawakapan
a678eaf1ad check fake/real mismatches during real tensor prop (#137747)
Summary:
While testing exportability for PT2 Inference models, we found various cases of invalid op inputs during tracing, for example errors like: `a and b must have same reduction dim`, `expected scalar type Long but found Int`, etc. Looking more closely, these happened due to the same few meta kernels & eager kernels producing mismatched outputs upstream (e.g. different output tensor dtype, int output).

Adding checks to catch mismatched outputs in real tensor prop upstream, so errors are raised at the mismatched op, instead of the downstream ops taking them as inputs. Relies a lot on utils from [CrossRefFakeMode](929797dedb/torch/_subclasses/fake_utils.py (L78))
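
A toy sketch of the idea, assuming outputs are summarized by dtype and shape only. `TensorMeta` and `check_fake_real` are hypothetical names; the real checks rely on the CrossRefFakeMode utilities and cover more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical summary of one op output."""
    dtype: str
    shape: tuple


def check_fake_real(op_name: str, fake: TensorMeta, real: TensorMeta) -> None:
    """Raise at the op that produced the mismatch, not at a downstream consumer."""
    for field in ("dtype", "shape"):
        fake_value, real_value = getattr(fake, field), getattr(real, field)
        if fake_value != real_value:
            raise RuntimeError(
                f"{op_name}: fake/real mismatch on {field}: "
                f"fake={fake_value}, real={real_value}"
            )


# Matching metadata passes silently.
check_fake_real("op_a", TensorMeta("int64", (4,)), TensorMeta("int64", (4,)))

# A dtype mismatch is reported against the op that produced it.
try:
    check_fake_real("op_b", TensorMeta("int64", (4,)), TensorMeta("int32", (4,)))
except RuntimeError as err:
    message = str(err)
    print(message)
```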

Follow ups: could add more checks, and maybe have a flag to only enable these for cases like draft mode, so perf doesn't suffer?

Test Plan: test_export, test_fake_tensor

Differential Revision: D64210055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137747
Approved by: https://github.com/zou3519
2024-11-04 23:39:48 +00:00
Bob Ren
9919932783 Specialize symfloats that flow through is_integer (#139572)
Fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_is_integer_num_type6_dynamic_shapes` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139572
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568
2024-11-04 23:35:35 +00:00
Henry Tsang
350bc2a166 [export] Add support for symbool to make it usable for torch.cond (#138765)
# Why?

I want the following code to work.

minimal repro:
```
import torch


class M(torch.nn.Module):
    def forward(self, dilate_flag):
        return dilate_flag.item()

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()

ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```

error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program

second error will be handled by https://github.com/pytorch/pytorch/pull/138760

# Motivation

I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.

```
import torch


class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.dilate_flag = 0

    def forward(self, dilate_flag):
        self.dilate_flag = dilate_flag.item()

        def true_fn(dilate_flag):
            return dilate_flag.clone()

        def false_fn(dilate_flag):
            return dilate_flag.clone()

        torch.cond(
            self.dilate_flag,
            true_fn,
            false_fn,
            (dilate_flag,),
        )
        return self.dilate_flag

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()

for input in inputs:
    expected_output = model(*input)

    ep = torch.export.export(model, input, strict=False)
    path = torch._inductor.aot_compile(ep.module(), input)
    aot_model = torch._export.aot_load(path, device="cuda")
    actual_output = aot_model(*input)

    assert (
        expected_output == actual_output
    ), f"henry they are not equal {expected_output} != {actual_output}"
```

Differential Revision: D64867504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
2024-11-04 23:31:49 +00:00
PyTorch MergeBot
6add86a29f Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit bf5cd8d011.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking lint on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11673543178/job/32504499599) [HUD commit link](bf5cd8d011) ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2455908056))
2024-11-04 23:30:15 +00:00
Jane Xu
23169a6bcc Disable foreach tests for complex128 internally (#139649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139649
Approved by: https://github.com/ngimel
2024-11-04 23:24:47 +00:00
Tugsbayasgalan Manlaibaatar
87a379b61b Move pippy to training IR (#139233)
Differential Revision: [D65282662](https://our.internmc.facebook.com/intern/diff/D65282662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139233
Approved by: https://github.com/kwen2501
ghstack dependencies: #138658, #139209
2024-11-04 23:07:14 +00:00
Yidi Wu
397938b453 [hop free symbols][refactor] lift freevar to parent graph before lifting to subgraph (#138559)
This refactoring is for getting a deterministic ordering of binding tensors and sizes of tensors. When seeing a free tensor x with shape (s0,) in a subgraph, the order of lifting changes from
```
lift_x_in_child, lift_s0_in_child, lift_s0_in_parent, lift_x_in_parent
```
to
```
lift_x_in_parent, lift_s0_in_parent, lift_x_in_child, lift_s0_in_child
```
This produces a deterministic ordering of handling the symints in lifted tensors.

This is also the current contract of dynamo top-level graph: we lift free_symbols in sizes after tensor x and insert the free symbols before the tensor x's proxy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138559
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737
2024-11-04 22:48:14 +00:00
Yidi Wu
c5b79699e1 [hop free symbols] replace ctx.save_for_backward to support symints/ints (#138737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138737
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Chillee
ghstack dependencies: #138345, #138428, #138558
2024-11-04 22:48:14 +00:00
Yidi Wu
ac20d0f893 [hop free symbols][refactor] make map's save_for_backward to handle int (#138558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138558
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428
2024-11-04 22:48:07 +00:00
Yidi Wu
dc3a6a9d08 [hop free symbols][refactor] make create_graph_input always take example_value (#138428)
Code refactoring only. We move the wrap_to_fake_tensor_logic out of wrap_fx_proxy for placeholders to provide the invariant that **all graph inputs must set their example values when creating the inputs**. This invariant helps us to identify all the free symbols in the graph in top-level and sub-graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138428
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #138345
2024-11-04 22:47:49 +00:00
Yidi Wu
54c69a785b [hop free symbols][refactor] make bound_symbols a dictionary (#138345)
Code refactoring only. Change all self.tx.output.bound_symbols to self.tx.output.root_tracer.bound_symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138345
Approved by: https://github.com/zou3519
2024-11-04 22:47:41 +00:00
Felix Zimmermann
bf5cd8d011 Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-04 22:10:04 +00:00
Shunting Zhang
888110841c [inductor] don't fuse two nodes if likely increase peak memory (#138756)
Partially fixing https://github.com/pytorch/pytorch/issues/138685

Add a (relatively safe?) heuristic to skip fusion if it could potentially increase peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two nodes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuse these 2 nodes
        2. find the intersection of these buffers for the two nodes and sum the total buffer size.
           If we don't fuse these two nodes, we can at least avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```
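
The two steps in the docstring can be sketched roughly as follows. This is an illustration under assumptions, not the scheduler code: the function names, the byte sizes, and the way the fusion saving is measured (`fusion_saving`, compared with a strict `<`) are all hypothetical.

```python
def extra_alloc_lower_bound(reads_a, reads_b, buffer_sizes):
    """Sum the sizes of the single-user buffers that both nodes read.

    reads_a / reads_b: names of single-user buffers read by each node (step 1).
    The intersection and the size sum are step 2.
    """
    shared = set(reads_a) & set(reads_b)
    return sum(buffer_sizes[name] for name in shared)


def should_skip_fusion(reads_a, reads_b, buffer_sizes, fusion_saving):
    """Skip fusion when its saving cannot cover the extra allocation."""
    return fusion_saving < extra_alloc_lower_bound(reads_a, reads_b, buffer_sizes)


sizes = {"buf0": 4096, "buf1": 1024, "buf2": 512}  # made-up sizes in bytes
node_a_reads = {"buf0", "buf1"}
node_b_reads = {"buf0", "buf2"}

lower_bound = extra_alloc_lower_bound(node_a_reads, node_b_reads, sizes)
print(lower_bound)
print(should_skip_fusion(node_a_reads, node_b_reads, sizes, fusion_saving=1024))
```

As the docstring says, this is only a lower bound on extra allocation; it does not prove that peak memory actually increases.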

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138756
Approved by: https://github.com/jansel
ghstack dependencies: #139136
2024-11-04 20:49:29 +00:00
Ze Sheng
1aa71be56c [PT2] Decouple decompose_triton_kernel_wrapper_functional from decompose_auto_functionalized (#139526)
As title. We may not always want to remove `triton_kernel_wrapper_functional`, for example in the references of [`unsafe_remove_auto_functionalized_pass`](c8ab9b06a2/torch/export/_remove_auto_functionalized_pass.py (L48)).

Test Plan: CI & [D62592946](https://www.internalfb.com/diff/D62592946)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139526
Approved by: https://github.com/zou3519
2024-11-04 20:16:18 +00:00
Will Constable
71dc5df93c [pipelining] Fix 'last backward' counting for dI / dW (#139415)
Since any stage can run a mixture of full backwards and split backwards,
it is important to count the sum of (full_backwards + backward_weight)
when comparing to num microbatches to determine last backward.
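
A minimal sketch of that counting rule. `BackwardCounter` and the action names `"full"`, `"input"`, and `"weight"` are hypothetical; treating a dI-only step as never being the last backward is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class BackwardCounter:
    """Hypothetical per-stage counter of completed backward actions."""
    num_microbatches: int
    full_backwards: int = 0
    backward_weights: int = 0

    def record(self, kind: str) -> bool:
        """Record one backward action; return True if it was the last backward."""
        if kind == "input":
            # dI alone does not finish a microbatch; its dW is still pending.
            return False
        if kind == "full":
            self.full_backwards += 1
        elif kind == "weight":
            self.backward_weights += 1
        else:
            raise ValueError(f"unknown backward kind: {kind}")
        done = self.full_backwards + self.backward_weights
        return done == self.num_microbatches


counter = BackwardCounter(num_microbatches=3)
schedule = ["full", "input", "weight", "input", "weight"]
results = [counter.record(kind) for kind in schedule]
print(results)
```

Counting only full backwards would never reach 3 in this schedule, which is the mistake the sum avoids.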

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
2024-11-04 20:14:10 +00:00
Ryan Guo
30a83ca991 [dynamo] Improve codegen for DataPtrVariable and fix tensor reference issue (#139487)
This addresses
https://github.com/pytorch/pytorch/pull/137677/files#r1799836499, which
had to set `allow_cache=False` for codegen on `DataPtrVariable.base`,
which is a `TensorVariable`, otherwise we observe failure of
`test_no_grad_copy` when testing with Dynamo.

I've seen `test_no_grad_copy` failing a few times, and every single time
it's related to cyclic reference, my best guess is the cyclic reference
holds some tensor object longer in memory than necessary, preventing the
optimization introduced in #11165.

This patch makes `OutputGraph.cleanup()` more aggressive by clearing out
all fields that might reference a `VariableTracker`. As a result, we can
remove the aforementioned `allow_cache=False`, which helps generate
better code (e.g., in the case of `test_no_grad_copy`, it skipped generating
a redundant graph whose only op is returning the input tensor; instead we just
generate a single `LOAD_FAST`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139487
Approved by: https://github.com/jansel, https://github.com/aakhundov
2024-11-04 19:14:06 +00:00
Bin Bao
740054ffe6 [AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland https://github.com/pytorch/pytorch/pull/139154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597
Approved by: https://github.com/angelayi
2024-11-04 18:53:17 +00:00
Oguz Ulgen
e76ce20177 Log to pt2 compile events (#139601)
Summary: This option was added after I wrote the original diff, so let's publish to pt2_compile_events

Test Plan: CI

Differential Revision: D65404910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139601
Approved by: https://github.com/jamesjwu
2024-11-04 18:39:06 +00:00
Shunting Zhang
4930c4b716 [inductor] patterns to remove pointless view/permute pairs (#139136)
These are not artificial patterns I came up with. They show up in the linear+CrossEntropyLoss graph.

Consider this snippet:
```
        class LinearAndCEL(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(C, V)
                self.ce = nn.CrossEntropyLoss()

            def forward(self, x, y):
                return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```

`x` passed to `forward` is a 3D tensor of shape [B, T, C].
`self.linear` first views x as a [BxT, C] tensor, does the matmul to produce a [BxT, V] tensor, and then views this output back as a 3D tensor of shape [B, T, V]. User code then adds another view op to convert the tensor shape to [BxT, V]. This generates a pair of redundant views. A pair of redundant permutes appears in the backward part when we compute gradients.

The view ops make it hard to chunk linear+CEL. When the view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and I think it's generally a nice thing to do.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-11-04 18:39:02 +00:00
Mikayla Gawarecki
ca43ecd599 Flip default on weights_only (#137602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137602
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #138936, #139221, #139433, #139541
2024-11-04 18:30:29 +00:00
Mikayla Gawarecki
f55dfbcf87 Remove hasattr(__slots__) for BUILD logic in weights_only unpickler (#139541)
This is tested in PR stacked above in

```python
python test/distributed/fsdp/test_fsdp_state_dict.py TestFSDPStateDict.test_torch_save_load
```

We cannot depend on `hasattr(..., __slots__)` to know whether a BUILD instruction has slotstate. For example, if a class subclasses ABC, `hasattr(__slots__)` will be `True` but there might be no slots (and hence `state` will not be a tuple). So revert the change from #138936 and go back to following the pickle library's code

```python

>>> from abc import ABC
>>> hasattr(ABC, "__slots__")
True
```

So

```python
import torch
from abc import ABC
from dataclasses import dataclass

class Foo(ABC):
    pass

class FooWrapper(Foo):
    def __init__(self, x, y):
        self.x = x
        self.y = y

f = FooWrapper(1, 2)
torch.save(f, "temp.pt")
with torch.serialization.safe_globals([FooWrapper]):
    torch.load("temp.pt")
```

Would fail on the previous code with
```
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1934, in _load
    result = unpickler.load()
  File "/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py", line 366, in load
    for k, v in slotstate.items():
```

As there is actually no slotstate
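
For reference, a simplified sketch of how the pickle library's BUILD handling decides whether there is slotstate: it inspects the shape of `state`, not the class. The helper names are hypothetical, and `__setstate__` handling and key interning are omitted.

```python
from abc import ABC


def split_build_state(state):
    """Return (dict_state, slotstate); slotstate exists only for a 2-tuple state."""
    slotstate = None
    if isinstance(state, tuple) and len(state) == 2:
        state, slotstate = state
    return state, slotstate


def apply_build_state(inst, state):
    """Apply BUILD state to an instance without consulting hasattr(__slots__)."""
    state, slotstate = split_build_state(state)
    if state:
        inst.__dict__.update(state)
    if slotstate:
        for key, value in slotstate.items():
            setattr(inst, key, value)
    return inst


class FooWrapper(ABC):
    """Inherits __slots__ from ABC but stores attributes in __dict__."""


class Point:
    __slots__ = ("a",)


foo = apply_build_state(FooWrapper.__new__(FooWrapper), {"x": 1, "y": 2})
point = apply_build_state(Point.__new__(Point), (None, {"a": 5}))
print(hasattr(FooWrapper, "__slots__"), foo.x, foo.y, point.a)
```

`FooWrapper` reports `__slots__` yet its state is a plain dict, so only the tuple check gives the right answer for both classes.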

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139541
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221, #139433
2024-11-04 18:30:29 +00:00
Tugsbayasgalan Manlaibaatar
ae0e7042f6 Fix custom obj being input (#139209)
Differential Revision: [D65158939](https://our.internmc.facebook.com/intern/diff/D65158939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139209
Approved by: https://github.com/ydwu4
ghstack dependencies: #138658
2024-11-04 18:24:29 +00:00
rzou
85c3c4132d no-op torch.library.custom_op APIs on torch.deploy (#139509)
We forgot this case in the previous PR. Fixes
https://github.com/pytorch/pytorch/issues/137536

Test Plan:
- better tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139509
Approved by: https://github.com/williamwen42
2024-11-04 18:01:08 +00:00
PyTorch MergeBot
6dada2136a Revert "Refactor FxGraphDrawer to use HTML-like labels (#137726)"
This reverts commit 1e73842029.

Reverted https://github.com/pytorch/pytorch/pull/137726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like some internal components are failing after this change and need to be updated ([comment](https://github.com/pytorch/pytorch/pull/137726#issuecomment-2455332612))
2024-11-04 17:44:44 +00:00
Tugsbayasgalan Manlaibaatar
e080c89bdc Make test_torchbind.py training IR compatible (#138658)
In this diff, I make the test_torchbind.py tests handle training IR. Today in the training IR, we don't see the effect token and HOP because this happens at the FunctionalTensorMode. Maybe in the future we should move this logic up to the training IR so that writing passes etc. on training IR is safer. But for migration purposes, I think it is OK for now. I also fixed the following bugs:
1. ep.module() doesn't register all aliased constants in the module.
2. When we retrace, we need to fakify the original Torchbind object.
3. We don't run any DCE on training IR so we need to add some more torch ops to verifier.

Differential Revision: [D64853530](https://our.internmc.facebook.com/intern/diff/D64853530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138658
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2024-11-04 17:43:11 +00:00
Bob Ren
68c515b292 don't run z3 analysis on backed symfloat nodes (#139568)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139568
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457
2024-11-04 17:04:29 +00:00
PyTorch MergeBot
3ca794783f Revert "[SymmetricMemory] introduce a binding for cuMemset32Async (#138755)"
This reverts commit 924e726c3a.

Reverted https://github.com/pytorch/pytorch/pull/138755 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally.  Can you please fix this PR so it works internally and re-merge it? See D65401876 for more details ([comment](https://github.com/pytorch/pytorch/pull/138755#issuecomment-2455173596))
2024-11-04 16:34:34 +00:00
Bob Ren
87404b6ca6 support symfloats in translation validation (#139457)
fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesHigherOrderOpTests.test_cond_pytree_operands_with_non_tensor_leaves_dynamic_shapes` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139457
Approved by: https://github.com/ezyang
ghstack dependencies: #139569
2024-11-04 15:40:08 +00:00
Richard Barnes
6b8e3022f2 Remove c10::optional usages in PyTorch (#139525)
Test Plan: Sandcastle

Reviewed By: swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139525
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-11-04 15:35:23 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
Bob Ren
12d225d91c add opaque unary sin and cos to SYMPY_INTERP (#139569)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nn.py TestNNDeviceTypeCPU.test_affine_3d_rotateRandom_cpu` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139569
Approved by: https://github.com/ezyang
2024-11-04 07:37:11 +00:00
Sun, Jiayi
3337439dc0 [inductor] modify the heuristic for disabling vectorization (#136422)
Summary
Since we have already implemented tail loop mask vectorization (https://github.com/pytorch/pytorch/pull/126526), I re-tuned the heuristic for disabling vectorization from a performance perspective. I changed the heuristic to: when the total number of elements along the vec dim is less than `tiling_factor/4` and the number of operations is less than 10, we disable the vectorization.
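
The rule can be written as a small predicate. This is a sketch only: the function and parameter names are hypothetical, and treating `tiling_factor/4` as true division is an assumption.

```python
def should_disable_vectorization(
    num_elements_along_vec_dim: int,
    tiling_factor: int,
    num_ops: int,
) -> bool:
    """Disable vectorization only for short vec dims with few operations."""
    too_few_elements = num_elements_along_vec_dim < tiling_factor / 4
    too_few_ops = num_ops < 10
    return too_few_elements and too_few_ops


# With a tiling factor of 16 the element threshold is 4.
print(should_disable_vectorization(3, 16, 5))
print(should_disable_vectorization(4, 16, 5))
print(should_disable_vectorization(3, 16, 10))
```

Both conditions must hold, so a kernel with many operations stays vectorized even when the vec dim is short.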

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136422
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-04 07:33:32 +00:00
James Wu
f4ee5a243d Add PT2 Compile Events for triton and kernel compilation + load_by_key_path (#139402)
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.

In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.

In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.

The `triton_compile` event is a bit questionable: do we need a row for each triton compile event? It might eat into our already low retention, so I might just remove that. Will discuss with @slarsen.

Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
2024-11-04 06:37:18 +00:00
cyy
3179eb15ae [1/N] Remove usage of C array (#139567)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139567
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-04 04:52:46 +00:00
Yuxin Wu
cadc50e7e9 LOG(INFO) -> VLOG(2) in ProcessGroupNCCL (#130696)
In the same spirit as https://github.com/pytorch/pytorch/pull/105695

Initialization and error handling logs are mostly kept. Routine logs are changed to VLOG.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130696
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@fb.com>
2024-11-04 04:43:42 +00:00