Commit Graph

80818 Commits

Author SHA1 Message Date
Bob Ren
85204d0081 Don't wrap inf values as symfloat (#139896)
Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139896
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454
2024-11-07 20:03:54 +00:00
cyy
9d09af981b Wrap torch_python with torch_compile_options (#136743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136743
Approved by: https://github.com/ezyang
2024-11-07 19:36:40 +00:00
Menglu Yu
d0da40a8b9 [PT2][Optimus] fix the default alpha and beta values (#139857)
Summary:
We noticed that the default coefficient values for beta and alpha should be int 1 instead of float 1.0; the float defaults cause an error when the inputs to the add are integer types.

More context:

https://fb.workplace.com/groups/1075192433118967/permalink/1539142760057263/
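
A minimal sketch (not the Optimus pass itself) of the underlying constraint: with integer inputs, the scaling coefficient must also be an integer, so a float default of 1.0 breaks the int case while an int 1 works for both.

```python
import torch

a = torch.ones(2, 2, dtype=torch.int64)
b = torch.ones(2, 2, dtype=torch.int64)

# Integer alpha with integer inputs: fine.
torch.add(a, b, alpha=1)

# Float alpha with integer inputs: raises a RuntimeError along the lines of
# "For integral input tensors, argument alpha must not be a floating point
# number" (exact wording may vary by version).
try:
    torch.add(a, b, alpha=1.0)
except RuntimeError as e:
    print(e)
```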

Test Plan:
# local reproduce
```
buck2 run mode/opt scripts/shuaiyang:test -- --optimus --flow_id 660724017 2>&1 | tee ~/local_run_shuai_660724017.txt
```

trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2024-11-05-21-18-17/trace.json.gz&bucket=gpu_traces

# E2E

before fix:
f660724017

after fix:

Differential Revision: D65521638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139857
Approved by: https://github.com/jackiexu1992
2024-11-07 19:12:23 +00:00
cyy
72d3f5b26d Turn static inline into static function (#139843)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139843
Approved by: https://github.com/ezyang
2024-11-07 19:08:41 +00:00
William Wen
f5147e989c [dynamo] prefix some eval_frame.c functions with dynamo_ (#139921)
Fixes https://github.com/pytorch/pytorch/issues/137994. I didn't prefix every function, only the ones on the hot path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139921
Approved by: https://github.com/ezyang
2024-11-07 19:07:23 +00:00
Sherlock Huang
071d48c56e Add output_node util function to fx.Graph (#139770)
Summary: A util function for accessing the output node of an FX graph
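
The new helper itself isn't shown in this log; as a sketch, it is presumably equivalent to the long-standing idiom of grabbing the single node whose op is "output":

```python
import torch
from torch import fx

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu() + 1

gm = fx.symbolic_trace(M())

# Every fx.Graph ends with exactly one node whose op is "output"; the new
# fx.Graph output_node util presumably wraps this lookup.
output_node = next(n for n in gm.graph.nodes if n.op == "output")
print(output_node.op, output_node.args)
```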

Test Plan: OSS CI

Differential Revision: D65486457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139770
Approved by: https://github.com/ezyang, https://github.com/Chillee
2024-11-07 18:54:59 +00:00
Max Podkorytov
ee54dfb64d [Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643)
Set the PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force the output layout to channels-last.

This way, the channels-last CK instances will be added to the benchmark choices in max autotune.
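
A hedged sketch of exercising this path: a channels-last conv2d compiled with max autotune, with the environment variable set up front. Whether the CK instances actually appear among the benchmark choices depends on a ROCm build with the CK backend enabled, which this sketch does not configure.

```python
import os

# Set before compilation so the layout hint is visible when choices are built.
os.environ["PYTORCH_MIOPEN_SUGGEST_NHWC"] = "1"

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
conv = (
    torch.nn.Conv2d(8, 16, kernel_size=3, padding=1)
    .to(device)
    .to(memory_format=torch.channels_last)
)
x = torch.randn(4, 8, 32, 32, device=device).to(memory_format=torch.channels_last)

compiled = torch.compile(conv, mode="max-autotune")
out = compiled(x)
print(out.is_contiguous(memory_format=torch.channels_last))
```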

# Testing
```
pytest test/inductor/test_ck_backend.py -k conv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643
Approved by: https://github.com/chenyang78
2024-11-07 18:37:39 +00:00
Howard Huang
edbf57b336 [pipelining] remove extra variables (#139817)
Cleaning up counters / extra variables not needed after https://github.com/pytorch/pytorch/pull/139415 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139817
Approved by: https://github.com/wconstab
2024-11-07 18:32:20 +00:00
Nikita Shulga
8f4b29810b Fix aarch64 wheel builds (#140020)
The shell script was still referencing the builder checkout rather than the PyTorch one, which results in
```
python /builder/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
python: can't open file '/builder/aarch64_linux/aarch64_wheel_ci_build.py': [Errno 2] No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140020
Approved by: https://github.com/atalman
2024-11-07 18:24:34 +00:00
David Berard
eabef5000f [user triton] reset kernel_side_table before test_tma_capture_and_functionalize (#139907)
The test was failing when I ran the whole test suite. I'm guessing that the exact indices would previously depend on the order that tests would run; by resetting the kernel_side_table we should hopefully get results that are reproducible independent of the test execution order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139907
Approved by: https://github.com/oulgen, https://github.com/aakhundov
2024-11-07 17:56:53 +00:00
cyy
ca7fdfe4d2 [Reland] Use static_assert to detect get_type_index used in device code (#139966)
#139173 was reverted due to an internal build break caused by using get_type_index in device code. This PR is created to ease importing into Meta for further investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139966
Approved by: https://github.com/malfet, https://github.com/huydhn

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-07 17:36:47 +00:00
Ke Wen
e474f0de82 [PGNCCL] Slimming watchdog loop (#139834)
- Refactored traceback code into `work.printTraceback()`.  cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @shuqiangzhang
- Refactored desync debug code into `class DesyncDebugger`.
- Moved occurrences of `futureWorkResult_->markCompleted` into `checkAndSetException` and `checkTimeout`, respectively. cc @shuqiangzhang
- Modularized dump signal broadcast code into `ProcessGroupNCCL::broadcastDumpSignal`. cc @fduwjj @c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139834
Approved by: https://github.com/shuqiangzhang
2024-11-07 17:22:44 +00:00
PyTorch MergeBot
a60bc051e3 Revert "Fix the use of fsspec transactions (#135541)"
This reverts commit 59cf4bc5ae.

Reverted https://github.com/pytorch/pytorch/pull/135541 on behalf of https://github.com/ZainRizvi due to Breaking internally. See D65551490 ([comment](https://github.com/pytorch/pytorch/pull/135541#issuecomment-2462774239))
2024-11-07 17:03:37 +00:00
PyTorch MergeBot
7e02386303 Revert "[2/N] Replace c10::sv with std::sv (#139456)"
This reverts commit 028c5d3426.

Reverted https://github.com/pytorch/pytorch/pull/139456 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. @ezyang can you please help get this landed? See D65546398 for more details ([comment](https://github.com/pytorch/pytorch/pull/139456#issuecomment-2462768891))
2024-11-07 17:00:59 +00:00
IvanKobzarev
781c68c865 [aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095)
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731

Introducing the ability for a subclass to implement type conversion to an expected_type.
```
    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
```
Here `expected_type=None` means the subclass type itself is expected.

E.g. for `DTensor` we may find an `AsyncCollectiveTensor` tangent where a `Tensor` was expected; in this case the hook is called with `expected_type=Tensor` at runtime.

Adding an implementation to AsyncCollectiveTensor that just triggers `wait()`.
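
A minimal sketch of the shape of such an implementation, using a toy subclass rather than the real AsyncCollectiveTensor (which lives in torch.distributed._functional_collectives and may differ in details):

```python
from typing import Any, Optional, Type

import torch

class ToyAsyncTensor(torch.Tensor):
    """Toy stand-in for AsyncCollectiveTensor; only the hook's shape matters here."""

    def wait(self) -> torch.Tensor:
        # The real subclass blocks on the pending collective and returns the
        # underlying plain Tensor.
        raise NotImplementedError("illustrative only")

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        # expected_type=None means the subclass type itself is expected.
        if expected_type is torch.Tensor:
            # AOTAutograd found a plain Tensor where this subclass appeared:
            # unwrap by waiting on the collective.
            return self.wait()
        return self
```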

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
2024-11-07 16:24:48 +00:00
Bob Ren
8d3d47e439 Trigger symfloat specialization in argument binding code (#139454)
Fixes the test `python test/inductor/test_torchinductor.py CpuTests.test_upsample_cat_conv_cpu` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139454
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846
2024-11-07 16:10:23 +00:00
James Wu
c35a01173b Remove compile event logging for automatic dynamic (#139891)
Summary: These events are a pretty large portion of the table, but not really currently used. Only log to tlparse for now.

Test Plan: Unit tests

Differential Revision: D65539986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139891
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-07 14:52:10 +00:00
Annop Wongwathanarat
81ecf98d23 Pass all arguments when quantizing embedding bag from float (#137697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137697
Approved by: https://github.com/snadampal, https://github.com/jerryzh168
2024-11-07 09:53:49 +00:00
sanchitintel
314aa268ce In AMX GEMM micro-kernel, use same dtype for A & B only if B is dequantized (#139906)
@frost-intel discovered that some Inductor auto-tuning UTs for CPU are currently broken on machines supporting the AMX ISA. That's because in #136688 I reverted a change in the AMX GEMM micro-kernel that had been introduced in #131887, but some other implementations introduced after that change rely on it, so it should not have been reverted.

Added a fix.

Ideally, a CI machine that supports AMX should cover these UTs (test/inductor/test_cpu_select_algorithm.py). We do have at least one CI machine that supports AMX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139906
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-11-07 09:18:59 +00:00
Bob Ren
a4e7b8001c refuse to generate a symbolic variable if a float input is inf (#139846)
Fixes `PYTORCH_TEST_WITH_INDUCTOR=1 tlp python test/test_torch.py TestTorchDeviceTypeCPU.test_cauchy_cpu_float64` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139846
Approved by: https://github.com/ruidazeng, https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572
2024-11-07 09:16:55 +00:00
xinan.lin
c4a323ed05 [Inductor] Generalize device-bias code newly introduced in scheduler.py (#139872)
[Inductor] Generalize the device-biased code newly introduced in scheduler.py so that Inductor behaves the same on XPU as on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139872
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
ghstack dependencies: #139705
2024-11-07 07:10:28 +00:00
xinan.lin
320374b011 [Inductor] Refine triton_bundler.py to support correctly on Intel GPU and fix CI failures. (#139705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139705
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
2024-11-07 07:10:28 +00:00
Mengwei Liu
3caf56d97a Mark full_like as core ATen (#139937)
Fixes #139617

As titled. For ExecuTorch `full_like` is implemented so this should be fine: https://github.com/pytorch/executorch/blob/main/kernels/portable/cpu/op_full.cpp

There are also decompositions for ops such as `fill.Scalar` that produce `full_like`: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L164
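
As a small sketch of that relationship (mirroring, not copying, the decomposition linked above), `fill.Scalar` can be written directly in terms of `full_like`, which is why keeping `full_like` in the core set keeps such decompositions representable:

```python
import torch

def fill_scalar_decomp(x: torch.Tensor, value) -> torch.Tensor:
    # fill.Scalar produces a tensor with x's shape/dtype/device filled with
    # `value` -- exactly what full_like provides.
    return torch.full_like(x, value)

x = torch.randn(2, 3)
assert torch.equal(fill_scalar_decomp(x, 7.0), torch.full((2, 3), 7.0))
```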

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139937
Approved by: https://github.com/tugsbayasgalan
2024-11-07 07:08:18 +00:00
FFFrog
c03324de2d Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-11-07 06:28:47 +00:00
Max Podkorytov
ca30704f0b [Inductor][ROCm][CK] Add standalone runner (#139441)
Generate standalone executable to debug and profile CK gemm instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139441
Approved by: https://github.com/ColinPeppler
2024-11-07 06:21:27 +00:00
Zhenbin Lin
d36fdaf157 Openreg: Support stream (#136991)
Support streams. When the driver communicates with the executor, it sends the stream id corresponding to the execution command; when the executor receives a command with a stream id, it ignores the stream id because the CPU backend doesn't support asynchronous execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136991
Approved by: https://github.com/ezyang
2024-11-07 06:09:07 +00:00
Yidi Wu
ab42967238 [hop free symbols] lift free symbols in example_value when create_graph_input (#138363)
There are 5 parts in this PR (they are hard to break into smaller pieces because they're highly coupled):
1. **Whenever we call create_graph_input, we try to bind the symbols in the graph input.**
Since we've enforced the invariant that all create_graph_input calls must provide an example value, we can intercept at the create_graph_input calls (this PR only handles free symbols in tensors).
2. **We cache the bound_symbols** to avoid lifting the same symbol repeatedly.
3. For lifted symbols, we re-use **lifted_freevars**, i.e. the mapping from symbol proxies in the parent graph to the lifted placeholders in the current subgraph, which is also how we handle lifted tensors. In this way, all hops that support lifted tensors should be able to handle lifted symints automatically (at least in the dynamo part).
4. For **unbacked symbols** created during tracing, we also need to bind these symbols to their proxies. This is to support the test cases where we want to lift unbacked symbols as inputs. We need the proxy of the unbacked symbol in the parent graph in order to properly create the args to the hop.
5. We update all the tests after free symbols are lifted in subgraphs, and also support the lifted symbols in existing higher order ops.

**The interaction of nested tracers:**
The previous design for lifting tensor closures: suppose we're in nested tracers; whenever we see a new proxy that's not created by the current tracer, we recursively look for the proxy in the parent tracers until we find the tracer that created it (either as a placeholder or as some intermediate result). More detail is in Note [Nested SubgraphTracer and free_variable handling].

Given the above design, the plan for lifting free symbols is: whenever we lift a free tensor to be an input of the current subgraph, we look at the symbols in it and bind them at the same time.

For example, suppose we have the following function:
```python
def f(x: [s1, s2]):
  def true_f():
    def true_f_inner():
      return x.sin()
```
what will happen in time order:

1. we create a subtracer 1 and start to speculate the outer cond's true_f
2. we create a another subtracer 2 and start to speculate the inner cond's true_f_inner.
3. dynamo realizes the tensor input x by calling wrap_tensor at the top level to create graph input x (tracer 0); we bind the symbols s1, s2 after the ph for x is created. So the graph now looks like:
```python
def gm(s1, s2, x):
```
4. when seeing TensorVariable.call_method of x, tracer 2 wants to create a call_function(sin, proxy_of_x), but it finds that proxy_of_x was not created by the current tracer. So it recursively looks up its parent tracer 1, finds that tracer 1 also doesn't track this proxy_of_x, and then finds the root tracer 0, which is the creator of proxy_of_x and tracks it as a ph. Then tracer 1 calls create_graph_input to lift the closure to its input ph1 and adds the (proxy_of_x: ph1) k-v pair to tracer 1's **lifted_freevars**.
Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(x):
```
5. Since there are free symbols inside this new tensor input, tracer 1 also binds the symbols (maybe_bind_symbol), which calls create_graph_input for s1 and s2. Now the graph looks like
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
```
6. Then it goes back to tracer 2, which calls create_graph_input for x and gets ph2; tracer 2's **lifted_freevars** records (ph1, ph2). Tracer 2 also binds the symbols in this new tensor input. Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner(s1, s2, x):
```
7. Finally the sin call_function node is created by tracer 2.

**This PR also handles the following cases:**
- What if we lift two tensors that share the same symbol? e.g. x1 [s1, s2], x2 [s2, s3]? Each subtracer maintains bound_symbols as a cache that maps a symbol.expr to its proxy in the current tracer. So when we see x1, we track s1 and s2 as inputs and bind s1 to ph1 and s2 to ph2. When we then try to bind the symbols of x2, s2 will already be tracked, so no new graph input is created.
- What if a subgraph closes over a symint? e.g.
```python
def f(x):
  def true_f():
    c = x.size(0)
   def true_fn_inner():
     return c
```
When we speculate true_fn_inner, we find proxy_of_c is not tracked by tracer 2, so it recursively looks up its parents. At this point, x and its symbols have been lifted as inputs of true_f (as a result of lifting x while tracing true_f in tracer 1). Specifically, the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner():
```
So tracer 2 is able to find that s1 has already been tracked as a ph in tracer 1, so the recursion stops there and tracer 2 calls create_graph_input on s1. The graph now looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner(s1):
     return s1
```

- What if a subgraph closes over an unbacked symint? e.g.
```python
def f(x):
  def true_f():
    c =  x.item()
    def true_f_inner():
      return c
```
When x.item() is called, proxy_of_c and its symnode variable are created in tracer 1, and we also call track_unbacked_symbols to record this relationship. So when tracer 2 finds that proxy_of_c was not created by the current tracer, it recursively looks up its parent tracers and finds that the expression u0 has been tracked as a result of track_unbacked_symbol in tracer 1. So it stops the recursion and calls create_graph_input for u0 in tracer 2. The graph looks like:
```python
def f(x):
  def true_f(s1, s2, x):
    c = x.item()
    def true_gm_inner(u0):
      return u0
    cond(pred, true_gm_inner, false_gm_inner, (c,))
```

- What if a subgraph closes over a tensor with an unbacked symint shape?
```python
def f(x):
  def true_f():
    c = x.item()
    r = torch.randn((c,))
    def true_f_inner():
      return r + 1
```
This is the same as the case of closing over tensors with backed shapes: we first lift r, then bind u0 in it, which recursively calls bind_symint on u0 in the parent and finds that u0 is tracked in the parent tracer as a result of the .item() call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138363
Approved by: https://github.com/zou3519
2024-11-07 04:44:32 +00:00
Justin Chu
3368f3ad41 [ONNX] Update TorchTensor implementation to handle fake mode (#139534)
Update the TorchTensor implementation to handle fake mode better. Specifically, if the weight is already a real tensor, we disable fake mode before calling detach() etc. when getting it, so we do not lose the real data.
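
A hedged sketch of the pattern described above; the actual TorchTensor code in the exporter may differ, and the use of `unset_fake_temporarily` here is an assumption about which helper is involved:

```python
import torch
from torch._subclasses import fake_tensor

def detach_real_weight(w: torch.Tensor) -> torch.Tensor:
    """Sketch: suspend fake mode only when the weight is already a real
    tensor, so detach() does not turn it into a fake one."""
    if isinstance(w, fake_tensor.FakeTensor):
        return w.detach()
    with fake_tensor.unset_fake_temporarily():  # assumed helper
        return w.detach()
```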

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139534
Approved by: https://github.com/fatcat-z, https://github.com/titaiwangms
2024-11-07 04:36:24 +00:00
Gabriel Ferns
2037ea3e15 Add type annotations to Configs (#139833)
Summary:
Adds types to Configs, and fixes a bug in options that was caused by the lack of types.

fixes: https://github.com/pytorch/pytorch/issues/139822

Configs are used by many modules so not sure which label to put.

Types also allow https://github.com/pytorch/pytorch/pull/139736 to fuzz configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139833
Approved by: https://github.com/c00w
2024-11-07 03:49:09 +00:00
Yifu Wang
5203138483 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that an input chunk is not consumed before its corresponding signal arrives.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```
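
A pure-Python rendering of that pseudocode, runnable on its own; the spin-wait and the int32 signal tensor are stand-ins for the device-side wait the real kernel performs:

```python
import torch

def reference_chunk_order(a_chunk_signals: torch.Tensor, a_chunk_pivot: int):
    def wait_signal(signals, idx):
        # Stand-in for the kernel's device-side wait on the signal pad.
        while signals[idx].item() != 1:
            pass

    num_chunks = a_chunk_signals.numel()
    order = []
    for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
        chunk_idx = chunk_idx % num_chunks
        wait_signal(a_chunk_signals, chunk_idx)
        order.append(chunk_idx)  # compute output tiles consuming this chunk
    return order

signals = torch.ones(4, dtype=torch.int32)              # all chunks already "ready"
print(reference_chunk_order(signals, a_chunk_pivot=2))  # [2, 3, 0, 1]
```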

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-07 03:43:12 +00:00
Sun, Jiayi
a59132b9c8 fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661)
Fix https://github.com/pytorch/pytorch/issues/132634.
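
A minimal check of the fixed path (assuming the fix routes chalf through the same computation as complex64, up to half precision):

```python
import torch

x = torch.randn(4, 4, dtype=torch.complex64).to(torch.complex32)

# After the fix, the chalf results should agree (up to half precision)
# with the complex64 reference instead of erroring or diverging.
print(torch.linalg.norm(x))
print(torch.norm(x))
print(torch.linalg.norm(x.to(torch.complex64)))
```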

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2024-11-07 03:21:36 +00:00
PyTorch MergeBot
604e353cae Revert "Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)"
This reverts commit 060bee7f22.

Reverted https://github.com/pytorch/pytorch/pull/139787 on behalf of https://github.com/huydhn due to Sorry for reverting this, but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/139787#issuecomment-2461234683))
2024-11-07 03:17:16 +00:00
Ryan Guo
f459c3095f [dynamo] Document codegen and clean up some code paths (#139670)
This patch
1. Adds documentation to `PyCodegen.__call__`, `PyCodegen.tempvars` and
   the `allow_cache` flag.
2. Merges a few existing code paths in `PyCodegen.__call__`.
3. Removes the `elif var in cg.tempvars` code path in
   `codegen_save_tempvars`, because it's no longer needed after #113725,
   as we have up-to-date `VariableTracker.source` now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139670
Approved by: https://github.com/jansel
ghstack dependencies: #139538
2024-11-07 03:14:16 +00:00
Ryan Guo
183b386cb2 [dynamo] Simplify Codegen for variables with MutableSideEffects (#139538)
This effectively undoes #115095, which is no longer needed after #113725.

Why did we need #115095? I went back in history and found that [this line](https://github.com/pytorch/pytorch/pull/113725/files#diff-0bb1756725c4426408938314b0c9d3988ae5bf49994892d7038ad7746e209e9fR86)
actually fixed what #115095 fixed. Specifically, without the
`allow_cache` check for the "dup_top" optimization, we could incorrectly
codegen based on source, despite `codegen_update_mutated` requesting to
codegen from value, for updates to pre-existing lists, etc. Since #113725 added
the `allow_cache` check, we no longer need the `mutable_side_effects_from_source`
code path from #115095.

However, #115442 introduced a `value_from_source` flag which didn't
account for the `mutable_side_effects_from_source` branch. So this patch
adds an extra check to keep existing behavior for export, and leaves a
TODO for investigating what exactly export wants from codegen, when it
comes to side effects and sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139538
Approved by: https://github.com/jansel
2024-11-07 03:14:16 +00:00
Valentine233
cf0bb6c435 [cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)
Reopen https://github.com/pytorch/pytorch/pull/121782, as more optimizations have landed.

Fixes https://github.com/pytorch/pytorch/issues/115261, https://github.com/pytorch/pytorch/issues/113017.
For the CPU inductor path, remove `-ftree-loop-vectorize` from the optimization flags to fix functional issues.

### Validation on 3 benchmark suites

#### FP32
![image](https://github.com/user-attachments/assets/ec920928-fa36-467f-ba07-d2c05c51b92e)

Outlier models (speedup<0.8, single socket): None.

#### BF16
![image](https://github.com/user-attachments/assets/4a301e5e-147d-4b74-beb1-40290969ed80)

Outlier models (speedup<0.8, single socket multi threads):

- functorch_dp_cifar10 0.58
- opacus_cifar10 0.57

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136827
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-11-07 02:49:52 +00:00
Gregory Comer
617b4538f1 Support symbolic builtin round in export (#139549)
Differential Revision: D65380866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139549
Approved by: https://github.com/digantdesai, https://github.com/angelayi
2024-11-07 02:49:44 +00:00
FEI
54e680151b Optimize peak memory for flash _scaled_dot_product_attention_math (#139612) (#139613)
Fixes #139612

@drisspg @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139613
Approved by: https://github.com/drisspg
2024-11-07 02:25:39 +00:00
Will Constable
2b400236c2 [DCP] Cross-link DCP doc to tutorials (#139776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139776
Approved by: https://github.com/mhorowitz, https://github.com/LucasLLC, https://github.com/fduwjj
ghstack dependencies: #139938
2024-11-07 02:19:49 +00:00
Will Constable
b51b7e28ee Add DCP doc to DCP merge-rules (#139938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139938
Approved by: https://github.com/LucasLLC, https://github.com/c-p-i-o, https://github.com/fduwjj
2024-11-07 02:19:49 +00:00
Edward Z. Yang
4e647871d6 Ensure TORCH_TRACE is run for Dynamo/Distributed tests (#139786)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139786
Approved by: https://github.com/bobrenjc93, https://github.com/c00w, https://github.com/anijain2305
ghstack dependencies: #139716
2024-11-07 01:58:05 +00:00
Chirag Pandya
47446cb5f3 [fr][c10d] move logger out from utils.py (#139806)
Summary:
Move flight recorder logger class out from utils.py into its own file.
This makes the program more modular.
This is mostly a refactoring/non-functional change.

Test Plan:
Built fr_trace locally and ran it.
```
buck build //caffe2/fb/flight_recorder:fr_trace
Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c
Network: Up: 0B  Down: 0B
Jobs completed: 6818. Time elapsed: 0.2s.
BUILD SUCCEEDED
```
Ran it as follows:
```
cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder

./fr_trace.par  -p trace_ /tmp
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

Differential Revision: D65503768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806
Approved by: https://github.com/fduwjj
2024-11-07 01:44:12 +00:00
Bin Bao
d0ffd6d142 [AOTI] Add data_ptr to RAIIAtenTensorHandle (#139895)
Summary: To increase the readability of the generated code. This is not BC-breaking, because RAIIAtenTensorHandle is implemented as header-only.

Differential Revision: [D65547216](https://our.internmc.facebook.com/intern/diff/D65547216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139895
Approved by: https://github.com/chenyang78
2024-11-07 01:36:28 +00:00
sgui/a3213105
4ddf015e7d [ONNX export] exporting model to onnx error when tensor.index_fill ops met dim=0 #139594 (#139596)
When the index_fill op's param dim == 0, there is no need to unsqueeze the index tensor's dimension, so we return the index tensor directly if the size of axes_i == 0.

Fixes #139594
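
A hedged repro sketch of the case described above (dim=0 index_fill through the TorchScript-based exporter); the opset version is an arbitrary choice:

```python
import io

import torch

class M(torch.nn.Module):
    def forward(self, x, index):
        # dim=0 is the case the fix targets: the index tensor should be
        # passed through without an extra unsqueeze.
        return x.index_fill(0, index, -1.0)

x = torch.randn(4, 3)
index = torch.tensor([0, 2])

buf = io.BytesIO()
torch.onnx.export(M(), (x, index), buf, opset_version=17)
```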

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139596
Approved by: https://github.com/justinchuby
2024-11-07 01:32:34 +00:00
Bin Bao
bd5a2c2c71 [AOTI] Simplify the return code (#139889)
Summary:
```
    if constexpr (std::is_same_v<std::decay_t<decltype(buf3)>,RAIIAtenTensorHandle> || std::is_same_v<std::decay_t<decltype(buf3)>,AtenTensorHandle> || std::is_same_v<std::decay_t<decltype(buf3)>,ConstantHandle>) {
        output_handles[0] = buf3.release();
    } else {
        thread_local ThreadLocalCachedOutputTensor<std::decay_t<decltype(buf3)>> cached_output_0(buf3);
        cached_output_0.copy_data_from(buf3);
        AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&output_handles[0]));
        AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_assign_tensors(cached_output_0.tensor(), output_handles[0]));
    }
```
->
```
 output_handles[0] = buf3.release();
```

Test Plan: CI

Differential Revision: D65460719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139889
Approved by: https://github.com/chenyang78
2024-11-07 01:28:43 +00:00
Valentine233
6fcef86cfa [inductor] fix the unaligned variable ranges issue in fuse node (#138568)
Fixes #138550.

### Description
In the fusion of two nodes, the node with fewer variables (`node_to_recomp`) aligns its variable ranges with the other node (`ref_node`). In detail, `node_to_recomp` changes its variable ranges to the original ranges of `ref_node`. However, if both nodes have changed their ranges, i.e., the simplified variable ranges differ from their original ones, the issue comes up.

### Solution
For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-11-07 01:17:58 +00:00
Danial Javady
ed0e63e938 Add NHWC support for group normalization (#126635)
Fixes #111824

Currently, if the user specifies their group normalization input to be in NHWC format, PyTorch will default to NCHW tensors and convert. This conversion is not immediately obvious to the user unless they check the format themselves, which is not intuitive. This PR adds support for NHWC on CUDA by adding the necessary kernels.
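
A small sketch of the behavior this targets: with the NHWC kernels in place, a channels-last input is expected to stay channels-last instead of being silently converted; the final check is illustrative and depends on the build/device.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
gn = torch.nn.GroupNorm(num_groups=4, num_channels=16).to(device)
x = torch.randn(8, 16, 32, 32, device=device).to(memory_format=torch.channels_last)

out = gn(x)
# With NHWC support, the output should remain channels-last on CUDA.
print(out.is_contiguous(memory_format=torch.channels_last))
```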

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126635
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-11-07 01:12:08 +00:00
Songhao Jia
59ec011855 [numerical debugger] bumped up the starting handler id (#139666)
Differential Revision: D65445250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139666
Approved by: https://github.com/tarun292, https://github.com/dulinriley
2024-11-07 01:00:43 +00:00
Colin L. Rice
e675c6702d justknobs: Remove JustKnobsConfig and justknobs_feature (#138767)
This never ended up getting used, and instead we're doing this
resolution within the configuration system.

Removing these unused internal features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138767
Approved by: https://github.com/ezyang
ghstack dependencies: #138766, #138956
2024-11-07 00:21:46 +00:00
Sam Larsen
52446d7f30 Revert D65290089 (#139893)
Summary:
This diff reverts D65290089
This change is introducing more logging than I realized and could present problems for tlparse.

Test Plan: NA

Reviewed By: jamesjwu

Differential Revision: D65541060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139893
Approved by: https://github.com/jamesjwu
2024-11-07 00:10:09 +00:00
Animesh Jain
ac5fa26e07 [dynamo][weakref] Support weakref.ref call (#139914)
Should fix - https://github.com/pytorch/pytorch/pull/135001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139914
Approved by: https://github.com/jansel
ghstack dependencies: #139856
2024-11-06 23:16:41 +00:00