Commit Graph

270 Commits

Author SHA1 Message Date
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
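
For illustration, a minimal before/after of the kind of change this rule enforces (RSE102 in ruff, assuming the flake8-raise naming):

```python
# Before: redundant call parentheses on an exception raised with no arguments.
def check_positive(x: int) -> None:
    if x <= 0:
        raise ValueError()

# After: the RSE fix drops the useless parentheses.
def check_positive_fixed(x: int) -> None:
    if x <= 0:
        raise ValueError
```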

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Shunting Zhang
fb6f6270d6 [inductor] comprehensive padding (#120758)
This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (where possible) that tensors with unfriendly shapes get aligned strides so the GPU can access the memory more efficiently.

Testing BlenderbotSmallForConditionalGeneration already shows a 2.5 ms speedup.
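
A minimal sketch of the idea (hypothetical helper, not Inductor's actual padding logic): round the row stride of an unfriendly inner dimension up to an alignment boundary and view the logical shape into the padded buffer.

```python
import torch

ALIGNMENT_BYTES = 32  # hypothetical alignment target

def round_up(n: int, multiple: int) -> int:
    return (n + multiple - 1) // multiple * multiple

x = torch.empty(64, 125)  # rows of 125 floats start at unaligned addresses
align_elems = ALIGNMENT_BYTES // x.element_size()
padded_cols = round_up(x.size(1), align_elems)             # 128
buf = torch.empty(x.size(0), padded_cols)
y = buf[:, : x.size(1)]                                    # same logical shape
assert y.shape == x.shape and y.stride() == (padded_cols, 1)
```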

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758
Approved by: https://github.com/jansel
2024-04-15 19:05:51 +00:00
Jason Ansel
6022600cc6 [inductor] Handle meta tensor ops in graph (#123786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123786
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705
2024-04-12 19:03:13 +00:00
angelayi
493478db4a [effects] Add inductor support for tokens (#122347)
Given the following code/dynamo graph:
```
class GraphModule(torch.nn.Module):
    def forward(self, L_x_ : torch.Tensor):
        l_x_ = L_x_
        _print = torch.ops.aten._print('moo')
        res = l_x_ + l_x_;  l_x_ = None
        _print_1 = torch.ops.aten._print('moo')
        return (res,)
```

AOTAutograd will trace the following program, threading tokens in from the inputs, through the effectful operator calls (torch.ops.aten._print), and back out as outputs:
```
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"):
        with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo');  arg0_1 = None
        getitem: "f32[0]" = with_effects[0];  with_effects = None
        add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
        with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
        getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
        return (getitem_2, add)
```
However, when we get to Inductor, we want the generated code to have no token inputs/outputs for better readability. So we modify the ATen graph by removing the tokens from the inputs, creating them instead through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators.
This has to be done *after* the partitioner; otherwise the partitioner will add the make_token/sink_token operators to the backward graph.
```
class <lambda>(torch.nn.Module):
    def forward(self, arg1_1: "f32[2, 3]"):
        _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default()
        with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo');  _make_dep_token_default = None
        getitem: "f32[0]" = with_effects[0];  with_effects = None
        add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
        with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
        getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
        _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,));  getitem_2 = None
        return (add,)
```
When doing Inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which is just a `FallbackKernel` with a pointer to the previous effectful operator's call. During scheduling, we create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The Inductor-generated Python code looks like:
```
def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    # Source Nodes: [_print], Original ATen: []
    buf2 = aten._print.default('moo')
    # Source Nodes: [_print_1], Original ATen: []
    buf3 = aten._print.default('moo')
    buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, buf4)
    del arg1_1
    return (buf4, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347
Approved by: https://github.com/bdhirsh
2024-04-09 03:22:32 +00:00
Yang Chen
e4e5449dfc [aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixes the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.
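
A minimal sketch of the fix (the surrounding helper is hypothetical; only the two replacement dicts are named above): reset the caches between the Python and C++ codegen passes so both passes emit identical indexing expressions.

```python
def codegen_cpp_wrapper(graph):
    sizevars = graph.sizevars
    # Replacements accumulated during the Python codegen pass (e.g. ks0 -> ks2*48)
    # would otherwise make the C++ pass emit different indexing source, breaking
    # the kernel-name <-> cuda-kernel-key correspondence.
    sizevars.precomputed_replacements.clear()
    sizevars.inv_precomputed_replacements.clear()
    return graph.codegen()  # hypothetical second-pass entry point
```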

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136
Approved by: https://github.com/desertfire
2024-04-08 16:51:43 +00:00
Adnan Akhundov
63c221b7fa Clone mutated inputs in first pass of CPP wrapper compilation (#123316)
Summary: CPP wrapper compilation is currently done in two passes: in the first pass, the Python wrapper is generated and run to compile the Triton kernels as a side effect; in the second pass, the C++ wrapper is generated and compiled. When model inputs are mutated, running the Python wrapper in the first pass mutates the inputs. However, the first pass (including the Python wrapper run) is strictly part of the compilation process and therefore must not introduce any side effects on the example inputs.

In this PR, we clone mutated inputs in the first pass to avoid input mutation.
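
A minimal sketch of the approach (hypothetical helper, not the actual Inductor code): clone any example input the graph is known to mutate before running the first-pass Python wrapper.

```python
import torch

def clone_mutated_inputs(example_inputs, mutated_indices):
    cloned = list(example_inputs)
    for i in mutated_indices:
        if isinstance(cloned[i], torch.Tensor):
            # Running the first-pass wrapper mutates this input; work on a copy
            # so compilation leaves the caller's example inputs untouched.
            cloned[i] = cloned[i].clone()
    return cloned
```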

Fixes https://github.com/pytorch/pytorch/issues/117364.

Test Plan:

```
$ TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k test_inductor_layout_optimization_input_mutations_cuda
...
.
----------------------------------------------------------------------
Ran 1 test in 6.368s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123316
Approved by: https://github.com/jansel, https://github.com/chenyang78, https://github.com/desertfire
2024-04-05 21:47:19 +00:00
xinan.lin
9743e3a19c [Inductor Intel GPU backend Upstream] Add Inductor Intel GPU backend. (#121895)
Following the design in RFC https://github.com/pytorch/pytorch/issues/114856, this PR implements the Intel GPU Inductor backend by:
- Reusing WrapperCodegen and TritonScheduling for Python wrapper and kernel code generation, and implementing device-specific code generation in XPUDeviceOpOverrides
- Reuse fx_pass, lowering, codecache, triton kernel auto-tuning, and compilation.

For testing, this PR provides test/inductor/test_xpu_basic.py to cover basic Inductor backend functionality.
We'll reuse all the existing Inductor test cases in the next PR.
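
A hedged usage sketch, assuming a PyTorch build with Intel GPU (XPU) support: once the backend is registered, `torch.compile` routes "xpu" tensors through Inductor like any other device.

```python
import torch

model = torch.nn.Linear(128, 128).to("xpu")
compiled = torch.compile(model)                      # Inductor handles the "xpu" device
out = compiled(torch.randn(32, 128, device="xpu"))
```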

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121895
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-04-05 09:05:11 +00:00
Andrew M. James
bde1a93bc4 Add lowering for resize, decomp for resize_as. (#122317)
This has been split off from #121354, as the inplace versions of these
methods prove to be rather tricky.
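
A hedged sketch of the relationship the `resize_as` decomposition captures (this is the semantic identity, not the exact registered decomposition; `memory_format` handling is omitted):

```python
import torch

def resize_as_sketch(self: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # resize_as_(other) behaves like resize_(other.size())
    return self.resize_(other.size())

a = torch.zeros(2, 3)
b = torch.zeros(4)
assert resize_as_sketch(a, b).shape == b.shape
```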

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122317
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-04-03 17:47:29 +00:00
Bin Bao
0ff6155eee [AOTI] Support module buffer mutation (#123164)
Summary: Fixes https://github.com/pytorch/pytorch/issues/120424. Because in a forward pass module buffers may be mutated, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training.
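
For context, a small illustrative module (not from the PR) whose forward mutates a registered buffer, which is the pattern AOTI needs to support:

```python
import torch

class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("steps", torch.zeros(()))

    def forward(self, x):
        self.steps += 1          # buffer mutation inside forward
        return x * self.steps
```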

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123164
Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov
2024-04-02 20:25:26 +00:00
PyTorch MergeBot
1f503dffb3 Revert "[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)"
This reverts commit 7eadb157bd.

Reverted https://github.com/pytorch/pytorch/pull/123136 on behalf of https://github.com/albanD due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/123136#issuecomment-2032163699))
2024-04-02 14:17:03 +00:00
Yang Chen
7eadb157bd [aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixes the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136
Approved by: https://github.com/desertfire
2024-04-02 09:00:05 +00:00
PyTorch MergeBot
a236fa9f06 Revert "[aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882)"
This reverts commit 384de46395.

Reverted https://github.com/pytorch/pytorch/pull/122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/122882#issuecomment-2027544640))
2024-03-29 17:52:39 +00:00
Yang Chen
384de46395 [aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixes the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122882
Approved by: https://github.com/desertfire
2024-03-28 19:06:29 +00:00
Wang, Eikan
f8eeae7aaa Enable CPP wrapper codegen registration (#121296)
Extend codegen registration for `CppWrapper`. With this PR, a new backend can register its specific `CppWrapper` at runtime.
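
A conceptual sketch of what per-backend registration looks like (hypothetical registry and names, not the actual Inductor API): a backend maps its device type to the scheduling, Python-wrapper, and C++-wrapper classes Inductor should use.

```python
device_codegen_registry: dict[str, dict[str, type]] = {}

def register_device_backend(device: str, scheduling: type,
                            python_wrapper: type, cpp_wrapper: type) -> None:
    # Called by a backend at init time; Inductor later looks up the entry
    # matching the device of the graph being compiled.
    device_codegen_registry[device] = {
        "scheduling": scheduling,
        "python_wrapper": python_wrapper,
        "cpp_wrapper": cpp_wrapper,
    }
```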

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121296
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-26 06:51:03 +00:00
Adnan Akhundov
9223b2cb31 Pop codegened parent graph from wrapper in GraphLowering (#122469)
Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done.
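
A minimal sketch of the change (hypothetical names): pop the graph from the codegen stack when its codegen finishes instead of keeping the reference around, so the lowered graph can be freed earlier.

```python
class WrapperSketch:
    def __init__(self):
        self.codegened_graph_stack = []

    def push_codegened_graph(self, graph):
        self.codegened_graph_stack.append(graph)

    def pop_codegened_graph(self):
        # Dropping the reference here lets the subgraph be garbage-collected
        # once its codegen is done, instead of living for the whole compilation.
        return self.codegened_graph_stack.pop()
```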

Fixes https://github.com/pytorch/pytorch/issues/121887.

Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469
Approved by: https://github.com/eellison
2024-03-25 20:27:59 +00:00
Honglin Zhu
adeedc060f [Inductor] Fix unbacked symbol in stride when using item() (#122298)
Fixes #122296

Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda
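
A hedged, hypothetical repro of the kind of pattern involved (not the actual test above): a value produced by `.item()` becomes an unbacked SymInt that can end up in an output's strides.

```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile
def f(length: torch.Tensor, x: torch.Tensor):
    n = length.item()            # unbacked SymInt
    torch._check_is_size(n)
    out = x.new_zeros(4, n)      # stride (n, 1) involves the unbacked symbol
    return out + 1

f(torch.tensor(8), torch.randn(3))
```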

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298
Approved by: https://github.com/ezyang
2024-03-24 06:27:15 +00:00
chilli
d34514f8db Renamed mutationlayout/aliasedlayout (#122474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474
Approved by: https://github.com/jansel
ghstack dependencies: #121624
2024-03-22 08:32:14 +00:00
Adnan Akhundov
e419011471 [inductor] Add torch.while_loop support to JIT Inductor (#122069)
Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file).

AOT Inductor support will be added in a follow-up PR.
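
A hedged usage sketch of the `torch.while_loop` HOP that Inductor now lowers (the import path and constraints follow the `_higher_order_ops` implementation and may change):

```python
import torch
from torch._higher_order_ops.while_loop import while_loop

def cond_fn(i, x):
    return i < 10                # must return a boolean scalar tensor

def body_fn(i, x):
    return i + 1, x * 2          # carried values keep their shapes/dtypes

@torch.compile
def f(x):
    i0 = torch.zeros((), dtype=torch.int64)
    return while_loop(cond_fn, body_fn, (i0, x))

final_i, final_x = f(torch.ones(3))
```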

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 38 tests in 159.387s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-22 02:45:27 +00:00
eellison
18c164ef7c [Inductor] Match insignificant strides on outputs (#122239)
Fix for https://github.com/pytorch/pytorch/issues/116433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122239
Approved by: https://github.com/Chillee
2024-03-21 05:35:59 +00:00
Mu-Chu Lee
7676433012 [AOTInductor] Reuse generated kernels between constant graph and main graph (#121564)
Summary: We copy the src_to_kernel from the constant graph to the main graph so that we avoid generating duplicate kernels, and pass through the name counter so that no duplicated names are generated.
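
A minimal sketch of the reuse (attribute names other than `src_to_kernel` are hypothetical): copy the source-to-kernel cache from the constant graph's wrapper into the main graph's wrapper and continue its name counter, so identical Triton source is emitted only once.

```python
import itertools

def inherit_kernel_state(const_wrapper, main_wrapper, next_kernel_index: int):
    # Same Triton source -> same kernel name, so the main graph reuses kernels
    # already generated for the constant graph instead of duplicating them.
    main_wrapper.src_to_kernel = dict(const_wrapper.src_to_kernel)
    main_wrapper.kernel_name_counter = itertools.count(next_kernel_index)
```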

Test Plan: Included in commit

Differential Revision: D54706767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-03-11 22:44:38 +00:00
Adnan Akhundov
3d089de851 Add torch.cond support to AOT Inductor (#121120)
Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).

Notable additions:

- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).

- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.

- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).

- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.

Codegen examples from the new unit tests:

- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-07 22:39:57 +00:00
Xia, Weiwen
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature is targeting PyTorch 2.3 release.
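
An eager-mode reference sketch of the semantics being fused, i.e. dynamic activation quantization feeding a statically quantized linear (illustrative only; the actual lowering targets `onednn.qlinear_pointwise.tensor`):

```python
import torch
import torch.nn.functional as F

def dynamic_qlinear_reference(x: torch.Tensor, w: torch.Tensor, bias: torch.Tensor):
    # Dynamically choose qparams for the activation, quantize, then dequantize.
    xq = torch.quantize_per_tensor_dynamic(x, torch.qint8, reduce_range=False)
    xdq = xq.dequantize()
    # Weight is quantized ahead of time with fixed qparams.
    w_scale = float(w.abs().max()) / 127
    wq = torch.quantize_per_tensor(w, scale=w_scale, zero_point=0, dtype=torch.qint8)
    return F.linear(xdq, wq.dequantize(), bias)

out = dynamic_qlinear_reference(torch.randn(8, 32), torch.randn(16, 32), torch.randn(16))
```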

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
- latency for shape (128, 128) = 0.153 ms
- latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
Jason Ansel
01ec8df6d8 [Compiled Autograd] Introduce BackwardState capture (#120382)
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)

We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
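
An illustrative example (not from the PR) of a hook that is both interior to the graph and dynamically generated, which this mechanism makes traceable:

```python
import torch

def loss_fn(x: torch.Tensor, scale: float) -> torch.Tensor:
    y = x.sin()
    # Interior to the graph and created on the fly as a lambda closing over `scale`.
    y.register_hook(lambda grad: grad * scale)
    return y.cos().sum()
```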

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
2024-02-28 20:36:47 +00:00
Sherlock Huang
3e8b56d362 [Inductor] Track constant's original_fqn mapping (#120524)
When compiling a deserialized ExportedProgram, the constant's original_fqn is not populated. The highlighted line below is missing, and a later assertion breaks because original_fqn is missing.

```
        constants_info_[0].name = "L__self___w_pre";
	constants_info_[0].dtype = static_cast<int32_t>(cached_torch_dtype_float32);
	constants_info_[0].offset = 0;
	constants_info_[0].data_size = 64;
	constants_info_[0].from_folded = false;
	constants_info_[0].shape = {4, 4};
	constants_info_[0].stride = {4, 1};
	// constants_info_[0].original_fqn = "w_pre";   // this line is missing
```

Inductor relies on `dynamo_flat_name_to_original_fqn` to populate the original_fqn field. This field originates from `graph_module.meta["dynamo_flat_name_to_original_fqn"]` and is set during dynamo tracing. However, when compiling
a deserialized ExportedProgram, we don't do dynamo tracing, so this field is missing.

As a fix, I maintain AOTI's own mapping for constant tensors' FQNs.

Differential Revision: D54097073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120524
Approved by: https://github.com/chenyang78
2024-02-28 17:36:29 +00:00
Edward Z. Yang
1a1fc1047d Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py has some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace, which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log (a short usage sketch follows this list). Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long, is emitted after the metadata log line, and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.
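
A hedged usage sketch of emitting a structured trace entry with `trace_structured` (argument names follow the description above; treat the exact signature as an assumption):

```python
from torch._logging import trace_structured

trace_structured(
    "artifact",
    metadata_fn=lambda: {"name": "example_artifact", "encoding": "string"},
    payload_fn=lambda: "multi-line\npayload goes here",
)
```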

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
2024-02-28 01:01:41 +00:00
PyTorch MergeBot
f3dd2a544c Revert "Add structured trace logs (#120289)"
This reverts commit 9dfaef962c.

Reverted https://github.com/pytorch/pytorch/pull/120289 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54230697 ([comment](https://github.com/pytorch/pytorch/pull/120289#issuecomment-1967477120))
2024-02-27 19:49:05 +00:00
Edward Z. Yang
9dfaef962c Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py has some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace, which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long, is emitted after the metadata log line, and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
2024-02-27 00:04:23 +00:00
Yang Chen
b96ea097ee [aotinductor] rename CppWrapperCodeGen and CudaWrapperCodeGen (#120391)
make WrapperCodeGen subclass names consistent with the
file names:

CppWrapperCodeGen -> CppWrapperCpu
CudaWrapperCodeGen -> CppWrapperCuda

Differential Revision: [D54074938](https://our.internmc.facebook.com/intern/diff/D54074938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120391
Approved by: https://github.com/aakhundov
2024-02-23 10:41:50 +00:00
Adnan Akhundov
badf84bd6b [inductor] Add torch.cond support to JIT Inductor (#119759)
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, as the latter doesn't have a way of dealing with child fx subgraphs and properly lowering and codegen-ing them.

In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as new `Conditional` IR node type weaving the parent graph with the true and false child subgraphs.

Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.
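
For reference, a hedged usage sketch of the `torch.cond` HOP whose lowering is added here (the module is illustrative, not from the PR):

```python
import torch

class CondModel(torch.nn.Module):
    def forward(self, pred: torch.Tensor, x: torch.Tensor):
        def true_fn(x):
            return x + 1.0

        def false_fn(x):
            return x - 1.0

        # pred is a boolean scalar tensor; both branches must return
        # tensors of matching shape/dtype.
        return torch.cond(pred, true_fn, false_fn, (x,))

m = torch.compile(CondModel())
out = m(torch.tensor(True), torch.randn(4))
```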

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
2024-02-17 07:25:27 +00:00
Yang Chen
bc7f3efb09 [aot_inductor] move CppWrapperCodeGen into a separate file (#119871)
This reverts commit d8e319a961.

Differential Revision: [D53817853](https://our.internmc.facebook.com/intern/diff/D53817853)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119871
Approved by: https://github.com/albanD, https://github.com/khabinov
ghstack dependencies: #119870
2024-02-16 08:14:20 +00:00
Yang Chen
78c9b2948a [aot_inductor] move CudaWrapperCodeGen into a separate file (#119870)
This reverts commit 3ab08946d5.

Differential Revision: [D53817852](https://our.internmc.facebook.com/intern/diff/D53817852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119870
Approved by: https://github.com/khabinov
2024-02-16 08:10:51 +00:00
Yanbo Liang
7f5b87c953 [torch.compile] Log more compilation time breakdown (#119865)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119865
Approved by: https://github.com/ezyang
2024-02-15 02:20:07 +00:00
Taras Tsugrii
fb492f7ca1 [inductor] Reorder if check to avoid more expensive check. (#119817)
If `mkldnn` is not enabled or not available, there is no point in performing a relatively expensive `all` check.
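
A schematic illustration of the reordering (names are illustrative, not the actual Inductor code): evaluate the cheap boolean first so the relatively expensive `all(...)` scan is skipped when mkldnn is unavailable.

```python
def should_use_mkldnn(mkldnn_enabled: bool, tensors) -> bool:
    # `and` short-circuits, so the all(...) scan only runs when mkldnn is usable.
    return mkldnn_enabled and all(t.is_contiguous() for t in tensors)
```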

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119817
Approved by: https://github.com/Skylion007
2024-02-14 16:04:31 +00:00
Taras Tsugrii
184605ae7d [inductor] Replace generators with map. (#119818)
It's more concise and efficient.
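
A micro-example of the style change (illustrative):

```python
names = ["buf0", "buf1", "buf2"]
upper_before = (s.upper() for s in names)   # generator expression
upper_after = map(str.upper, names)         # equivalent, more concise
assert list(upper_after) == [s.upper() for s in names]
```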

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119818
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze
2024-02-14 16:02:52 +00:00
PyTorch MergeBot
3ab08946d5 Revert "[aot_inductor] move CudaWrapperCodeGen into a separate file (#119448)"
This reverts commit 0597dab523.

Reverted https://github.com/pytorch/pytorch/pull/119448 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119448#issuecomment-1937345167))
2024-02-10 23:04:36 +00:00
PyTorch MergeBot
d8e319a961 Revert "[aot_inductor] move CppWrapperCodeGen into a separate file (#119491)"
This reverts commit 760056bbdc.

Reverted https://github.com/pytorch/pytorch/pull/119491 on behalf of https://github.com/DanilBaibak due to Reverted as a dependency for #119448 ([comment](https://github.com/pytorch/pytorch/pull/119491#issuecomment-1937344548))
2024-02-10 23:02:05 +00:00
Yang Chen
760056bbdc [aot_inductor] move CppWrapperCodeGen into a separate file (#119491)
This PR moves the CppWrapperCodeGen class into a separate file,
cpp_wrapper.py, to simplify wrapper.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119491
Approved by: https://github.com/desertfire, https://github.com/albanD
2024-02-10 02:15:56 +00:00
Yang Chen
0597dab523 [aot_inductor] move CudaWrapperCodeGen into a separate file (#119448)
wrapper.py is getting more complex. Let's first split it
into smaller pieces. A follow-up PR will move CppWrapperCodeGen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119448
Approved by: https://github.com/desertfire
2024-02-09 20:18:04 +00:00
Shunting Zhang
fd0bf96c2b [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so initially the two features don't work with each other.

Thanks Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it is not needed.
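
A schematic sketch of the multi-kernel dispatch idea (not the actual MultiKernel implementation): benchmark the equivalent variants once, then always call the faster one.

```python
import time

class MultiKernelSketch:
    def __init__(self, kernels):
        self.kernels = kernels
        self.picked = None

    def __call__(self, *args):
        if self.picked is None:
            # One-time benchmark on the first call; later calls go straight
            # to the winner.
            timings = []
            for kernel in self.kernels:
                start = time.perf_counter()
                kernel(*args)
                timings.append(time.perf_counter() - start)
            self.picked = self.kernels[timings.index(min(timings))]
        return self.picked(*args)
```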

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-05 23:35:41 +00:00
PyTorch MergeBot
b964a1222c Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit c24ffc3f66.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1927877102))
2024-02-05 19:25:39 +00:00
Shunting Zhang
c24ffc3f66 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so initially the two features don't work with each other.

Thanks Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it is not needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-03 00:06:21 +00:00
PyTorch MergeBot
796278b57e Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit 20484a1936.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to broke linux-focal-rocm5.7-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1922613135))
2024-02-02 01:19:19 +00:00
Shunting Zhang
20484a1936 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so initially the two features don't work with each other.

Thanks Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it is not needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-01 21:29:02 +00:00
Mu-Chu Lee
2b48891e62 [AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765)
Summary:
Add Runtime Constant-folding for AOTInductor.
This also includes the invocation of constant folding at load time.

The constant folding lowering is a 2-step process.
First, we split the graph into two modules, one of which is the constant module: it doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for it.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). Compared with a normal lowering, the main module takes in one additional component, the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.
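
A conceptual sketch of the first step (schematic, not the AOTI implementation): mark nodes whose inputs are all constants as candidates for the one-time constant module.

```python
import torch.fx as fx

def constant_foldable_nodes(gm: fx.GraphModule):
    const_nodes = set()
    for node in gm.graph.nodes:
        if node.op == "get_attr":
            const_nodes.add(node)          # parameters/buffers are constants
        elif node.op == "call_function" and node.all_input_nodes and all(
            arg in const_nodes for arg in node.all_input_nodes
        ):
            const_nodes.add(node)          # computable without any runtime input
    return const_nodes
```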

Test Plan: Unit tests included in commit.

Differential Revision: D53274382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
2024-02-01 04:54:25 +00:00
hodavand
8026534a2f Add torch.complex128 and torch.complex32 to DTYPE_TO_ATEN dictionary. (#117929)
Fixes #117370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117929
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-01-31 19:34:58 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
suo
d0627cc2af [export] do not rewrite state dict when unlifting (#118611)
This is Very Bad; changing state dict keys violates one of the key contracts we have, which is "do not mess with the state dict".

Change unlift to use a similar `_assign_attr` approach that fx.GraphModule and unflatten do.

Also took the opportunity to improve the interface of `_assign_attr` to be more general.

Differential Revision: [D53139277](https://our.internmc.facebook.com/intern/diff/D53139277/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118611
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609, #118610
2024-01-30 19:14:19 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00