Commit Graph

82517 Commits

Author SHA1 Message Date
Mikayla Gawarecki
372b023eb1 Fix test_serialization_zipfile_actually_jit when weights_only is not default (#143668)
Fails in fbcode, where `weights_only` is not the default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143668
Approved by: https://github.com/awgu
ghstack dependencies: #143326, #143403
2024-12-20 21:25:10 +00:00
Darshan Sanghani
33dd4f187d [pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430)
The resources directory lets the ET observer dump additional data, such as Triton kernels, while capturing the ET.

This allows us to use the ET trace to replay PT2 workloads and get visibility into data such as generated kernels and their usage in a model, index tensor data, etc.

We also added a few ways to enable ET and ET Resources through OS environment variables.

Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in PyTorch.

Additionally, setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run, such as Triton kernels.
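
For example, a minimal sketch of turning both on from Python before profiling (the variable names come from this PR; the value `"1"` and the assumption that they are read when the observer starts are mine):

```python
import os

# Enable default Execution Trace collection (value "1" is an assumption;
# the PR names the variables but does not show how they are parsed).
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE"] = "1"
# Also collect extra resources, such as generated Triton kernels.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS"] = "1"
```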

Differential Revision: [D58707846](https://our.internmc.facebook.com/intern/diff/D58707846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143430
Approved by: https://github.com/shengfukevin, https://github.com/sraikund16
2024-12-20 21:20:32 +00:00
zeshengzong
cee06e74ee Apply clang-format for ATen/core/dispatch headers (#143620)
Code changed by adding a path config in the `.lintrunner.toml` file and running:

```bash
 $ lintrunner -a --take CLANGFORMAT --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143620
Approved by: https://github.com/malfet
2024-12-20 21:16:23 +00:00
Mikayla Gawarecki
8e483654cb Add config.save.use_pinned_memory_for_d2h to serialization config (#143342)
This was benchmarked with two separate scripts on my A100
(A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save``
(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save`
Timings are an average of 5 runs and benchmark scripts + results are attached

Under both scenarios, we see **~2x speedup in ``torch.save`` time with (``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``)** compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False``).

(A)  Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc32=True` (Default) | 28.54s | 20.76s |
| `compute_crc32=False` | 22.57s | **14.51s** |

(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc32=True` (Default) | 8.38s | 5.53s |
| `compute_crc32=False` | 6.94s | **3.99s** |

Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False`
<img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" />

Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True`
<img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342
Approved by: https://github.com/albanD
ghstack dependencies: #143324
2024-12-20 21:01:18 +00:00
Mikayla Gawarecki
3f63b742e6 Refactor serialization getter/setters into torch.utils.serialization.config (#143324)
Consolidate
- get/set_default_load_endianness
- get/set_default_mmap_options
- get/set_crc32_options

into one global, dynamo-style config and allow setting mmap globally. The existing APIs are not removed and will get/set from the config (they can't be removed, for backward compatibility).

In #143459, I add the local (argument-style) config.
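
A sketch of the intended backward compatibility, assuming the consolidated config exposes a `load.endianness` field mirroring the old getter/setter:

```python
import torch
from torch.utils.serialization import config

# The legacy setter still works...
torch.serialization.set_default_load_endianness(
    torch.serialization.LoadEndianness.LITTLE
)
# ...and is expected to read/write the same underlying config value
# (field name is an assumption based on this PR's description).
assert config.load.endianness == torch.serialization.LoadEndianness.LITTLE
assert torch.serialization.get_default_load_endianness() == config.load.endianness
```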

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324
Approved by: https://github.com/albanD
2024-12-20 21:01:17 +00:00
Scott Wolchok
629de988df Fix old-compiler-unfriendly zero init of bfloat16_t array (#143504)
clang versions before 17 refuse to assign 0 to a `bfloat16_t`, and gcc versions before 13 likewise won't assign 0.0 (citation: https://godbolt.org/z/Gzs5ebdej).

Differential Revision: [D67396740](https://our.internmc.facebook.com/intern/diff/D67396740/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143504
Approved by: https://github.com/malfet
2024-12-20 20:49:51 +00:00
Chirag Pandya
485497e727 [c10d][fr] flight recorder improvements (#143446)
Summary:
1. Flight recorder dumps are now automatically dumped by default upon
   timeout or exception. Users don't need to opt in.
2. Change the default dump location to the `.cache` folder in the running
   user's home directory.

Test Plan:
1. Tested locally by running the crash program from flight recorder
   tutorial page.
   https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html#an-end-to-end-example
2. Noted that flight recorder files were correctly created.
```
❯ pwd
/home/cpio/.cache/fr_trace
❯ ls
nccl_trace_rank_0  nccl_trace_rank_1
```

Differential Revision: [D67363720](https://our.internmc.facebook.com/intern/diff/D67363720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143446
Approved by: https://github.com/d4l3k
2024-12-20 20:41:30 +00:00
Colin L. Rice
a94f259a69 pgo: Log feature use (#142819)
This will cause dynamo_compile to populate the feature column if we have
a hit for PGO.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142819
Approved by: https://github.com/ezyang
2024-12-20 20:22:20 +00:00
Aaron Orenstein
8ce0bc282a dynamo tracing perf: bytecode_transform improvements: 34.86 -> 33.9 (#143068)
See #143056 for overall docs.

This PR: Use `__slots__` on `InstructionExnTabEntry` and `Instruction`. Stop doing Python
version checks in the middle of `convert_instruction()` and
`inst_has_op_bits()`.
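
A generic sketch of both techniques (the field names are illustrative, not the actual dynamo classes):

```python
import sys

# Hoist the version check out of the per-instruction hot path: compute it
# once at import time instead of branching inside every call.
IS_PY_312_PLUS = sys.version_info >= (3, 12)

class Instruction:
    # __slots__ removes the per-instance __dict__, shrinking each object
    # and speeding up attribute access for objects created in large numbers
    # during tracing.
    __slots__ = ("opcode", "opname", "arg", "argval")

    def __init__(self, opcode, opname, arg, argval):
        self.opcode = opcode
        self.opname = opname
        self.arg = arg
        self.argval = argval
```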

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143068
Approved by: https://github.com/jansel
ghstack dependencies: #143065, #143067
2024-12-20 20:06:42 +00:00
Aaron Orenstein
5feb2d7b41 dynamo tracing perf: don't call expensive _set_guard_export_info if it's a duplicate guard: 37.66 -> 34.86 (#143067)
See #143056 for overall docs.

This PR: Move the call to `_set_guard_export_info()` after the duplicate-guard
check in `GuardBuilder.DUPLICATE_INPUT()`.
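
The general pattern, as a sketch with hypothetical names:

```python
def install_guard(guard, seen_guards, set_guard_export_info):
    """Check for duplicates before doing expensive bookkeeping
    (hypothetical helper mirroring the reordering described above)."""
    if guard in seen_guards:
        return  # duplicate: skip the expensive export-info call entirely
    set_guard_export_info(guard)  # expensive, now only paid for new guards
    seen_guards.add(guard)
```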

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143067
Approved by: https://github.com/jansel
ghstack dependencies: #143065
2024-12-20 20:06:42 +00:00
Aaron Orenstein
7d4e7fbfc1 dynamo tracing perf: no import on hot path: 47.62 -> 47.26 (#143065)
See #143056 for overall docs.

This PR: Removed another `import` from the body of the hot path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143065
Approved by: https://github.com/jansel
2024-12-20 20:06:42 +00:00
Yanbo Liang
792e6184c5 [GPT-fast] Support running a specific model or micro-benchmark (#143607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143607
Approved by: https://github.com/BoyuanFeng, https://github.com/jerryzh168, https://github.com/huydhn
2024-12-20 19:58:07 +00:00
Nikhil Gupta
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model perf:
- 7B Transformer model: Prefill 340 t/s, Decode 40 t/s
- 2B Transformer model: Prefill 747 t/s, Decode 80 t/s

Tests:
```
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
```

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
Tom Ritchford
b5475d334e [inductor] Fix an unused variable in cpu_vec_isa.py (#138473)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138473
Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/xuhancn
2024-12-20 18:50:19 +00:00
Nikita Shulga
5a69c2a649 [BE][Sparse] Get rid of gcc-5 workaround (#143653)
Discovered these comments while looking at https://github.com/pytorch/pytorch/pull/143620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143653
Approved by: https://github.com/albanD
2024-12-20 18:40:45 +00:00
Joy Dong
a5ed499f6a FlexAttention Benchmark (#139665)
1. Add alibi, sliding window, tanh softcap, prefixLM, and document_mask from attn_gym to the benchmark.

2. Add comparison to different SDPA backends & FAv2, FAv3, FAKV.

Dependent on https://github.com/pytorch/pytorch/pull/139639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139665
Approved by: https://github.com/drisspg
2024-12-20 17:52:24 +00:00
Hyunho Yeo
c7d9f29807 (MTIA) Move "empty_cache" API (#143402)
Summary: This diff moves one of the memory-related APIs to the consolidated location, `mtia/memory.py`.
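
A sketch of the call site after the move (assuming the public Python API mirrors the new file location; requires MTIA hardware):

```python
import torch

# After this change, the API lives under the consolidated memory module.
if torch.mtia.is_available():  # MTIA devices only
    torch.mtia.memory.empty_cache()
```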

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api
```

https://www.internalfb.com/intern/testinfra/testrun/13510798943184259

Reviewed By: nautsimon

Differential Revision: D67148738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402
Approved by: https://github.com/nautsimon
2024-12-20 17:39:06 +00:00
Colin L. Rice
d79fbf6b6d test/dynamo/test_utils: logging - Stop testing for impossible things. (#143535)
We don't support assigning to objects or numeric constants at the top level in
config modules, so there is no need to test for them.

(This specifically breaks a later sorting refactor, since sorting requires `<`
to be implemented.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143535
Approved by: https://github.com/ppanchalia
2024-12-20 17:21:49 +00:00
Huamin Li
f5af87c23c Make Inductor cpp backend enable_floating_point_contract_flag take a string (#143450)
Differential Revision: D66269001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143450
Approved by: https://github.com/desertfire
2024-12-20 16:28:54 +00:00
William Wen
7ab880bc5e fix typo in autocast header (#143625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143625
Approved by: https://github.com/mlazos
ghstack dependencies: #143592
2024-12-20 16:17:15 +00:00
bobrenjc93
4f8b7c4272 Revert "refactor tensorify restart logic to use sources (#141517)" (#143623)
This reverts commit 30d8b30db7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143623
Approved by: https://github.com/mlazos
2024-12-20 15:38:34 +00:00
leslie-fang-intel
607884c9af [Inductor][CPP] Fix bitwise shift with corner inputs (#143635)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566 by aligning the implementation with Eager (29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501)) at these corner inputs.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635
Approved by: https://github.com/jgong5
2024-12-20 13:47:40 +00:00
Guilherme Leobas
7bf3b7cdc5 Rewrite _reparametrize_module to use contextmanager (#138203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203
Approved by: https://github.com/zou3519
ghstack dependencies: #136033, #140604
2024-12-20 12:02:27 +00:00
Guilherme Leobas
1c817fe671 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2024-12-20 12:02:27 +00:00
Guilherme Leobas
673cc88fd6 Add support for contextmanager in Dynamo (#136033)
Fixes #130559

* Intro

This PR adds support for `@contextmanager` in Dynamo. We chose to limit the
scope of this work to only `@contextmanager` and plan to handle generators fully
in #141055 (still in draft).

* Motivation

Dynamo lacks support for generator functions. When it encounters one, it traces
it as if it were a regular function. This is problematic because it can lead to
incorrect behavior. To illustrate, consider the test case below:

```python
import torch
import contextlib

@contextlib.contextmanager
def set_default_dtype(dtype):
    old_dtype = torch.get_default_dtype()
    try:
        torch.set_default_dtype(dtype)
        yield
    finally:
        torch.set_default_dtype(old_dtype)

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with set_default_dtype(torch.float64):
        x = torch.tensor([3.0, 3.0 + 5.0j])
    return x
```

Before this work, Dynamo would not stop at the `yield`, and the graph produced
would contain both calls to `set_default_dtype` executed one after the other.
This is incorrect because the context manager should execute code before and
after the `yield`.

* List of changes

`YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control
flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE`
behaves differently in a generator function. Unlike regular functions, where
`RETURN_VALUE` indicates the final result, in generators it signifies that the
generator is exhausted and implicitly raises `StopIteration`.

A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable`
was introduced to handle `@contextmanager`. This variable tracker acts not just
as a wrapper for the original function but also maintains an internal `tx`
(InstructionTranslator) object to suspend and return control flow to the parent
tracer when a `yield` is encountered.

* Corner cases

Returning a context manager from a compiled function is not supported. This
would require PyTorch to synchronize the generator state between Dynamo and the
interpreter. Any attempt to return it will result in an `IncorrectUsage`
exception.

Graph breaks require special handling as well. In the event of a graph break,
the frame associated with the context manager is skipped, and the context
manager runs in eager mode.

* This PR is breaking my code

There is a configuration flag (`enable_trace_contextlib`) that can be set to
`False` to disable tracing of context managers. If this still causes crashes,
please revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033
Approved by: https://github.com/zou3519
2024-12-20 12:02:20 +00:00
Jason Ansel
04b26ee1e8 Fix false positive from f-strings in set_linter (#143628)
This linter was going crazy in Python 3.12, where f-strings are broken into individual tokens (PEP 701), so the braces inside them were misread as `set` literals. Example:
```py
$ python3 tools/linter/adapters/set_linter.py torch/_inductor/runtime/triton_heuristics.py
torch/_inductor/runtime/triton_heuristics.py:192:25: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

[... many more similar false positives on f-string braces elided ...]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143628
Approved by: https://github.com/yanboliang, https://github.com/rec
2024-12-20 11:45:26 +00:00
Xu Han
6733045a4a export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-20 11:42:09 +00:00
Michael Lazos
b539c61631 [Hierarchical Compile] Update NoneAsConstantBuffer to support graph d… (#143531)
Fixes issues I hit while running graph deduplication with torchtune.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143531
Approved by: https://github.com/eellison
2024-12-20 09:23:12 +00:00
Pian Pawakapan
f9f82ca48f [ts converter] use Dim.AUTO for ts -> export converter (#138273)
Switches the TS converter to use `Dim.AUTO` by default, exporting models with maximum dynamism. Adds runtime input tests to `test_converter.py`.
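
A minimal sketch (module and shapes hypothetical) of what exporting with `Dim.AUTO` looks like: each marked dimension is inferred as dynamic or specialized based on how it is used, with no hand-written `Dim(...)` constraints.

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Dim.AUTO asks export to infer per-dimension dynamism automatically
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": (Dim.AUTO, Dim.AUTO)})
```
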
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138273
Approved by: https://github.com/avikchaudhuri
2024-12-20 07:48:24 +00:00
Michael Lazos
270ad513c8 [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)
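
A minimal sketch of the version gate, assuming a `packaging`-based comparison (the helper name is hypothetical; the real check lives in Dynamo's trace-rules setup):

```python
import importlib.metadata

from packaging.version import Version

def should_import_einops() -> bool:
    # eagerly import einops only when the installed version predates 0.7.0;
    # newer versions no longer need the import-time workarounds
    try:
        return Version(importlib.metadata.version("einops")) < Version("0.7.0")
    except importlib.metadata.PackageNotFoundError:
        return False
```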

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-20 07:46:49 +00:00
Avik Chaudhuri
29b586bbad fix formatting in programming model doc (#143587)
Test Plan: Some of the formatting in https://docs-preview.pytorch.org/pytorch/pytorch/143546/export.programming_model.html is broken.

Differential Revision: D67458972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143587
Approved by: https://github.com/yushangdi
2024-12-20 07:09:19 +00:00
Huy Do
fe0f20615c [DynamoBench] Handle accuracy results in benchmark records (#143611)
I discovered this issue when trying to search for the accuracy results in the database and couldn't find any. It turns out that the results are there in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because `benchmark_values` is a list of strings here while the expectation is that it's a list of numbers.

ClickHouse doesn't support mixed types at the moment. It has a Variant type (https://clickhouse.com/docs/en/sql-reference/data-types/variant), but this isn't recommended by the ClickHouse team itself. So, the remaining option is to store these values in the `extra_info` field. This field is a dictionary, so they can go there.
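
A hedged sketch of the rerouting this implies (the record layout follows the example above; the helper is hypothetical):

```python
def normalize_metric(metric: dict) -> dict:
    # move non-numeric benchmark values into extra_info so that
    # benchmark_values stays a list of numbers for ClickHouse
    values = metric.get("benchmark_values", [])
    if any(not isinstance(v, (int, float)) for v in values):
        metric.setdefault("extra_info", {})["benchmark_values"] = values
        metric["benchmark_values"] = []
    return metric

normalize_metric({"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]})
```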

### Testing

https://github.com/pytorch/pytorch/actions/runs/12421747715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
2024-12-20 06:43:38 +00:00
Sam Ginzburg
132fcf4e0d [user triton] Raise an exception when encountering nested @triton.autotune decorators or @triton.heuristics (#143519)
We support running a single Autotuner for each Triton kernel. Currently,
if there are multiple autotuning decorators, the subsequent ones will be
silently ignored.

Instead, we should raise an error here to avoid silent incorrectness.
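
For illustration, the now-rejected pattern looks like this (kernel body elided; the configs are placeholders):

```python
import triton
import triton.language as tl

@triton.autotune(configs=[triton.Config({"BLOCK": 64})], key=["n"])
@triton.autotune(configs=[triton.Config({"BLOCK": 128})], key=["n"])  # before this PR, one of the stacked autotuners was silently ignored
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    ...
```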

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143519
Approved by: https://github.com/aakhundov
2024-12-20 06:38:45 +00:00
PyTorch MergeBot
71479a9b9c Revert "[AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352)"
This reverts commit 429f4cd140.

Reverted https://github.com/pytorch/pytorch/pull/143352 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/143352#issuecomment-2556365140))
2024-12-20 06:21:31 +00:00
Jane Xu
4e29e4aa63 [BE] Add a test to ensure grads are never inplaced into accidentally (#143612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143612
Approved by: https://github.com/soulitzer
2024-12-20 06:15:08 +00:00
Xu Han
2daa666591 update kineto to XPU Windows fixed PR. [submodule kineto] (#143445)
Include the XPU Windows fix PR: https://github.com/pytorch/kineto/pull/1012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143445
Approved by: https://github.com/sraikund16
2024-12-20 05:57:30 +00:00
zeshengzong
217a4ddb04 Add range check embedding_bag on input index >= 0 of cuda device (#140791)
Fixes #89362

**Test Result**

**Before**

```python
>>> import torch
>>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda()
>>> weight = torch.rand([2, 3], dtype=torch.float32).cuda()
>>> print(torch.nn.functional.embedding_bag(input, weight))
tensor([[0., 0., 0.]], device='cuda:0')
```

**After**

```python
>>> import torch
>>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda()
>>> weight = torch.rand([2, 3], dtype=torch.float32).cuda()
>>> print(torch.nn.functional.embedding_bag(input, weight))
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [1,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [2,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 357, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

```

```bash
$ pytest test/nn/test_embedding.py
```
![image](https://github.com/user-attachments/assets/6a5ec759-a3dc-4d51-9e5e-ec79c0aac526)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/2ce4ac24-74fb-4181-9510-18b96a2c2acb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140791
Approved by: https://github.com/eqy
2024-12-20 05:47:26 +00:00
bobrenjc93
9713a6eeca remove allow-untyped-defs from torch/fx/experimental/refinement_types.py (#143602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143602
Approved by: https://github.com/aorenste
2024-12-20 05:40:52 +00:00
bobrenjc93
78d294379a remove allow-untyped-defs from torch/_lazy/config.py (#143603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143603
Approved by: https://github.com/aorenste
2024-12-20 05:34:19 +00:00
bobrenjc93
cb4e9888df remove allow-untyped-defs from torch/ao/quantization/experimental/APoT_tensor.py (#143601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143601
Approved by: https://github.com/aorenste
2024-12-20 05:26:09 +00:00
bobrenjc93
dd346dbeab remove allow-untyped-defs from torch/distributed/elastic/multiprocessing/errors/handlers.py (#143605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143605
Approved by: https://github.com/aorenste
2024-12-20 05:25:01 +00:00
Michael Lazos
fd23cf5848 [Dynamo] check node class first for graph dedup (#143609)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143609
Approved by: https://github.com/williamwen42
2024-12-20 04:09:46 +00:00
William Wen
1c2593f035 [dynamo] guard global autocast state (#143592)
Fixes https://github.com/pytorch/pytorch/issues/112260.
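
A minimal sketch of the behavior now guarded (shapes arbitrary): flipping the global autocast state between calls must trigger a recompile instead of reusing a graph compiled under different autocast settings.

```python
import torch

@torch.compile
def f(x):
    return x @ x

x = torch.randn(8, 8)
f(x)  # compiled with autocast disabled
with torch.autocast("cpu", dtype=torch.bfloat16):
    f(x)  # autocast guard fails -> recompile, not a stale graph
```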

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143592
Approved by: https://github.com/jansel
2024-12-20 03:30:54 +00:00
drisspg
d339f1506b Add cutlass version guard in prep for upgrade (#143551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143551
Approved by: https://github.com/eqy
2024-12-20 02:40:02 +00:00
Mayank Mishra
75661f2036 try root fix for FP8 tensor (#143248)
Fixes #143194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143248
Approved by: https://github.com/fegin
2024-12-20 01:57:17 +00:00
PyTorch MergeBot
4462cc6375 Revert "[Inductor] inplace padding (#140249)"
This reverts commit 297ce77636.

Reverted https://github.com/pytorch/pytorch/pull/140249 on behalf of https://github.com/huydhn due to This breaks an internal test https://fburl.com/test/ppl2we5l ([comment](https://github.com/pytorch/pytorch/pull/140249#issuecomment-2556079406))
2024-12-20 01:30:27 +00:00
bobrenjc93
e1b4635504 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143606
Approved by: https://github.com/aorenste
2024-12-20 01:26:51 +00:00
Jane Xu
a0cff096bc Improve cond error messaging (#143595)
Discovered by @drisspg and me while trying out a simple toy example and being way too confused :')
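
For context, the kind of toy example involved (the exact snippet isn't in the PR description; this is a generic `torch.cond` call where unclear errors were easy to hit):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(4)
# data-dependent predicate: supported by torch.cond, but easy to get wrong
out = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
```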

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143595
Approved by: https://github.com/zou3519, https://github.com/ydwu4
2024-12-20 01:19:20 +00:00
Yanan Cao (PyTorch)
d547fae5b0 [Codemod][AddExplicitStrictExportArg] caffe2/torch/onnx/_internal/exporter (#143542)
Reviewed By: avikchaudhuri

Differential Revision: D67381244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143542
Approved by: https://github.com/ydwu4, https://github.com/titaiwangms
2024-12-20 00:54:52 +00:00
Sun, Jiayi
544de4008e [Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)
Fix https://github.com/pytorch/pytorch/issues/141671.

Summary:
The performance regression in these two timm_models is caused by the Conv/Linear + broadcast add fusion falling into the oneDNN reference path. This PR constrains the shape of the other tensor in the Conv/Linear + broadcast add fusion so that the fused op stays on the optimized path; see the sketch below.
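
A hedged sketch of the pattern in question (sizes hypothetical): the fusion candidate is a Linear/Conv output plus a tensor that broadcasts over it, and the fix only admits other-tensor shapes that oneDNN's fused kernel handles efficiently.

```python
import torch

lin = torch.nn.Linear(64, 64)
x = torch.randn(32, 64)
other = torch.randn(64)   # broadcasts over the batch dimension
out = lin(x) + other      # Linear + broadcast add: the fusion candidate
```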

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-20 00:35:58 +00:00