eellison
fd35be2fd3
TritonTemplate dtype fixes ( #141991 )
...
- Set the dtype of "acc" appropriately so that epilogue fusion has args with a known dtype
- Update dtype propagation to use `type_to_dtype` instead of instantiating a tensor; see the sketch after this list
- Throw if we have a string arg where we should have a proper CSEVariable, unless we're doing the Modification Subgraph handling, which is not yet implemented; everything else is appropriately typed.
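A minimal sketch of the second bullet's idea, assuming `type_to_dtype` keeps the shape of `torch._prims_common.type_to_dtype`: map a Python scalar type straight to a `torch.dtype` rather than allocating a tensor just to read `.dtype` off it.
```python
import torch
from torch._prims_common import type_to_dtype

# Direct type -> dtype mapping, no tensor allocation involved.
assert type_to_dtype(bool) is torch.bool
assert type_to_dtype(int) is torch.int64

# The allocation-heavy pattern this replaces:
assert torch.tensor(0).dtype is torch.int64
```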
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141991
Approved by: https://github.com/drisspg
ghstack dependencies: #139945 , #140057 , #141495 , #141882
2024-12-04 17:24:23 +00:00
iupaikov-amd
5c2584a14c
[ROCm] Enable inductor GEMM lowering for gfx11 ( #141687 )
...
This check doesn't make sense for some AMD GPUs: they have the right number of CUs, but multi_processor_count returns WGPs on RDNA even though those GPUs still perform adequately. Many tests fail on modern archs because this check defaults them to not using the GEMM backend.
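A sketch of the heuristic being relaxed, with an illustrative threshold and helper name rather than the exact Inductor code; the point is that a raw `multi_processor_count` cutoff misreads RDNA, where the counter reports WGPs:
```python
import torch

def gemm_backend_ok(min_sms: int = 68) -> bool:
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    # On ROCm, gcnArchName identifies the arch; WGP counts understate
    # RDNA (gfx11) throughput, so skip the raw threshold there.
    if torch.version.hip and props.gcnArchName.startswith("gfx11"):
        return True
    return props.multi_processor_count >= min_sms
```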
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141687
Approved by: https://github.com/pruthvistony , https://github.com/jeffdaily , https://github.com/malfet
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-02 22:13:34 +00:00
eellison
f83361b274
inductor dtype propagation fixes ( #141495 )
...
- Add in upcast_compute_type on creation of new tensors (loads, constants); see the sketch after this list
- Fixes index_expr - right now we are somewhat inconsistent in dtype and don't always respect the dtype specified; that would be nice to fix, but not in this PR
- Bug fix in view dtype: we were always upcasting back to fp32 when the input was bf16/fp16, but we should only do that if the output is also bf16/fp16
- For masked, avoid calling dtype propagation and just use the output dtype
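An illustrative-only sketch of what the upcast in the first bullet does to a value's compute dtype when `codegen_upcast_to_fp32` is on: reduced-precision float loads are computed in fp32 and cast back at the store.
```python
import torch

def upcast_compute_type(dtype: torch.dtype) -> torch.dtype:
    # fp16/bf16 values are computed in fp32; everything else is untouched
    return torch.float32 if dtype in (torch.float16, torch.bfloat16) else dtype

assert upcast_compute_type(torch.bfloat16) is torch.float32
assert upcast_compute_type(torch.int32) is torch.int32
```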
This turns on runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it to test with codegen_upcast_to_fp32 turned off.
Follow ups:
- We could consider requiring fewer explicit upcast_compute_type calls and doing it automatically. That would potentially make things easier but less flexible in the future; maybe I should have done that in this PR.
- Be more consistent in our index expr dtype printing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister , https://github.com/arui-meta , https://github.com/ezyang
ghstack dependencies: #139945 , #140057
2024-11-28 11:39:38 +00:00
Edward Z. Yang
dbbebee9d7
Code motion CompiledFxGraph to a dedicated file ( #141654 )
...
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141654
Approved by: https://github.com/aorenste , https://github.com/jansel
ghstack dependencies: #141491 , #141492 , #141574
2024-11-27 20:42:21 +00:00
James Wu
a7ca6a9113
Enable autograd cache on inductor tests ( #140890 )
...
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.
I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests; AOTAutogradCache successfully caches 99% of them. A few tests use view_replay and therefore save functional tensors, which causes AOTAutogradCache to fail to pickle its result. I'll look into next steps there, but for now it seems okay if the cache just misses in cases where it can't serialize the result. It would be better to check before pickling, though.
I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass, so I changed a try/except to catch Exception instead of just one specific exception; see the sketch after this list.
- Add extra structured logging for metadata on cache hits.
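A minimal sketch of the bypass-not-crash pattern from the first bullet; the structure and helper names are illustrative, not the actual cache code.
```python
class BypassAOTAutogradCache(Exception):
    """Signals "skip the cache", never a hard failure."""

def run_cacheability_checks(gm):
    # Hypothetical stand-in for the real checks, which can raise
    # AttributeError when Inductor runs standalone without dynamo.
    if not hasattr(gm, "graph"):
        raise AttributeError("no dynamo-produced graph attached")

def check_can_cache(gm):
    try:
        run_cacheability_checks(gm)
    except Exception as e:  # catch everything: cache checking must not crash
        raise BypassAOTAutogradCache(str(e)) from e
```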
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
2024-11-27 20:41:43 +00:00
Boyuan Feng
17fd53d8e5
[Inductor] Inplacing with Donated Buffer ( #140113 )
...
Currently, inductor does not in-place update a buffer if it is an input buffer, because we don't know whether the input will be used by other functions.
A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.
[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee )
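A toy sketch (not Inductor code) of the reuse this enables: when an input is known to be donated, a kernel may write its result into that input's storage instead of allocating a fresh buffer.
```python
import torch

def grad_input(saved: torch.Tensor, grad_out: torch.Tensor, donated: bool) -> torch.Tensor:
    if donated:
        return saved.mul_(grad_out)  # in place: reuses saved's storage
    return saved * grad_out          # out of place: fresh allocation
```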


Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-27 18:51:52 +00:00
eellison
fd553b9817
Add remaining method and tests for dtype propagation ( #140057 )
...
Adds the remaining unimplemented ops, as well as an assertion failure if someone adds a new op without a dtype rule (see the sketch below).
We test all unique pointwise operators registered as lowerings that have an OpInfo. There will be some follow-ups to make this work well with `codegen_upcast_to_fp32` as both True and False.
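The assertion's shape, with a hypothetical registry name; the point is that registering a lowering for a pointwise op without a matching dtype rule now fails loudly instead of silently guessing.
```python
DTYPE_RULES: dict = {"abs": "same_as_input"}  # hypothetical registry

def get_dtype_rule(op_name: str):
    assert op_name in DTYPE_RULES, f"no dtype propagation rule for {op_name}"
    return DTYPE_RULES[op_name]
```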
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140057
Approved by: https://github.com/arui-meta , https://github.com/blaine-rister , https://github.com/ezyang
ghstack dependencies: #139945
2024-11-27 17:06:44 +00:00
eellison
566ceb3e7e
Refactor dtype propagation ( #139945 )
...
A couple of changes:
- Tries to reuse dtype propagation rules that were already registered in inductor. These were present in both `pointwise_overrides_data` and the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules; this saves those registrations and reuses them later.
- Factors out `get_promoted_dtype`, which uses functools.lru_cache and takes non-CSEVariable args, because CSEVariable args will not work with the functools cache; see the sketch after this list.
Tests get added later in the stack when everything is implemented.
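A sketch of the factoring in the second bullet, assuming promotion reduces to `torch.promote_types`: cache on hashable `torch.dtype` keys only, since CSEVariable objects make poor `lru_cache` keys.
```python
import functools
import torch

@functools.lru_cache(None)
def get_promoted_dtype(*dtypes: torch.dtype) -> torch.dtype:
    return functools.reduce(torch.promote_types, dtypes)

assert get_promoted_dtype(torch.float16, torch.int64) is torch.float16
```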
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945
Approved by: https://github.com/blaine-rister , https://github.com/arui-meta , https://github.com/ezyang
2024-11-27 16:57:02 +00:00
PyTorch MergeBot
65dbd5cc2d
Revert "[Inductor] Inplacing with Donated Buffer ( #140113 )"
...
This reverts commit eecc8e362c .
Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300 ))
2024-11-26 21:20:59 +00:00
Boyuan Feng
eecc8e362c
[Inductor] Inplacing with Donated Buffer ( #140113 )
...
Currently, inductor does not in-place update a buffer if it is an input buffer, because we don't know whether the input will be used by other functions.
A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.
[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee )


Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-26 17:19:50 +00:00
Colin Peppler
8f5edcb75c
[CUTLASS] Lift shape & stride information as kernel args ( #138611 )
...
Differential Revision: [D64773324](https://our.internmc.facebook.com/intern/diff/D64773324 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138611
Approved by: https://github.com/chenyang78
2024-11-25 17:52:33 +00:00
Jason Ansel
808f0f656d
[inductor] Refactor MutableBox to make IRNode typing easier ( #140895 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang , https://github.com/Skylion007
2024-11-20 19:50:46 +00:00
PyTorch MergeBot
d472a5f680
Revert "[inductor] Refactor MutableBox to make IRNode typing easier ( #140895 )"
...
This reverts commit c79e78b503 .
Reverted https://github.com/pytorch/pytorch/pull/140895 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think test_torchbind_inductor is failing in trunk after this lands ([comment](https://github.com/pytorch/pytorch/pull/140895#issuecomment-2484679319 ))
2024-11-19 04:25:41 +00:00
Jason Ansel
c79e78b503
[inductor] Refactor MutableBox to make IRNode typing easier ( #140895 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang , https://github.com/Skylion007
2024-11-19 00:24:35 +00:00
Jack Taylor
a33fa37b4e
[ROCm] Support new AMD triton stream pipeliner ( #139881 )
...
Fixes #139182
In Triton 3.2, num_stages=0 will be deprecated for Triton's AMD backend, so let's query the default num_stages from the relevant Triton backend instead.
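A hedged sketch of the idea; the target query below matches how Inductor reads Triton's active target, but the stage counts are placeholders, not Triton's real defaults.
```python
import triton

def default_num_stages() -> int:
    target = triton.runtime.driver.active.get_current_target()
    # Placeholder values: pick a backend-appropriate default instead of
    # passing num_stages=0, which Triton 3.2 deprecates on AMD.
    return 2 if target.backend == "hip" else 3
```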
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139881
Approved by: https://github.com/bertmaher
2024-11-08 14:51:05 +00:00
xinan.lin
320374b011
[Inductor] Refine triton_bundler.py to support correctly on Intel GPU and fix CI failures. ( #139705 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139705
Approved by: https://github.com/EikanWang , https://github.com/jansel , https://github.com/guangyey
2024-11-07 07:10:28 +00:00
Max Podkorytov
8f077b811b
[ROCm][Inductor]Fixing missing ck package warning when the backend is disabled ( #139790 )
...
```
test_addmm_multiple_dynamic_cuda (__main__.AOTInductorTestABICompatibleCuda) ... W1101 10:26:20.492000 1361741 torch/_inductor/utils.py:1207] Please pip install Composable Kernel package
AUTOTUNE addmm(16x6, 16x16, 16x6)
triton_mm_0 0.0104 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=0, num_stages=2, num_warps=1
triton_mm_1 0.0104 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=2, num_warps=1
SingleProcess AUTOTUNE benchmarking takes 0.2182 seconds and 0.2979 seconds precompiling for 2 choices
```
This PR disables the warning message when the CK backend is disabled.
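A sketch of the fix's shape (the helper is hypothetical; `max_autotune_gemm_backends` is a real Inductor config): only warn about the missing Composable Kernel package when CK is actually among the enabled backends.
```python
from torch._inductor import config

def maybe_warn_missing_ck(log, have_ck: bool) -> None:
    enabled = config.max_autotune_gemm_backends.upper().split(",")
    if "CK" in enabled and not have_ck:
        log.warning("Please pip install Composable Kernel package")
```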
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139790
Approved by: https://github.com/ColinPeppler , https://github.com/chenyang78
2024-11-06 22:04:32 +00:00
eellison
aafb3deaf1
Remove multinomial from cudagraph skip list ( #139897 )
...
Since https://github.com/pytorch/pytorch/pull/134818/files we can run multinomial in cudagraph without error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139897
Approved by: https://github.com/BoyuanFeng
2024-11-06 21:28:42 +00:00
Jason Ansel
ed30fa74ab
[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) ( #139523 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364 , #139365 , #139370 , #139452
2024-11-04 04:28:40 +00:00
Jason Ansel
d189f92eb1
[inductor] Remove SIMDKernel.last_usage ( #139364 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison , https://github.com/shunting314
2024-11-04 04:28:18 +00:00
PyTorch MergeBot
0863d6a08e
Revert "[inductor] Remove SIMDKernel.last_usage ( #139364 )"
...
This reverts commit 286d3ce266 .
Reverted https://github.com/pytorch/pytorch/pull/139364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337 ))
2024-11-02 06:49:11 +00:00
PyTorch MergeBot
98e11b0021
Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) ( #139523 )"
...
This reverts commit c53beab377 .
Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337 ))
2024-11-02 06:49:10 +00:00
Jason Ansel
c53beab377
[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) ( #139523 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364 , #139365 , #139370 , #139452
2024-11-02 03:04:22 +00:00
Jason Ansel
286d3ce266
[inductor] Remove SIMDKernel.last_usage ( #139364 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison , https://github.com/shunting314
2024-11-01 16:28:15 +00:00
Bin Bao
33dce10ece
[AOTI][reland] Update zero size computation in clone_preserve_strides ( #139458 )
...
Summary: Reland https://github.com/pytorch/pytorch/pull/139224. clone_preserve_strides, implemented in _inductor/utils.py, does not handle multi-dimensional 0-size tensors correctly.
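A minimal sketch of the bug class, not the Inductor code: sizing the backing storage as sum((size - 1) * stride) + 1 goes wrong once any dim is 0, because the tensor then holds no elements at all.
```python
import torch

def clone_preserve_strides(x: torch.Tensor) -> torch.Tensor:
    if 0 in x.shape:  # multi-dimensional 0-size tensors need their own path
        return torch.empty_strided(x.shape, x.stride(), dtype=x.dtype)
    needed = sum((s - 1) * st for s, st in zip(x.shape, x.stride())) + 1
    return torch.as_strided(x.new_empty(needed), x.shape, x.stride()).copy_(x)
```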
Differential Revision: [D65317451](https://our.internmc.facebook.com/intern/diff/D65317451 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139458
Approved by: https://github.com/hl475
2024-11-01 13:51:02 +00:00
PyTorch MergeBot
c4d9428b17
Revert "[AOTI] Update zero size computation in clone_preserve_strides ( #139224 )"
...
This reverts commit 206a8dde68 .
Reverted https://github.com/pytorch/pytorch/pull/139224 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139224#issuecomment-2450811914 ))
2024-10-31 21:05:07 +00:00
Bin Bao
206a8dde68
[AOTI] Update zero size computation in clone_preserve_strides ( #139224 )
...
Summary: clone_preserve_strides, implemented in _inductor/utils.py, does not handle multi-dimensional 0-size tensors correctly. Fix that.
Differential Revision: [D65250405](https://our.internmc.facebook.com/intern/diff/D65250405 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139224
Approved by: https://github.com/angelayi
2024-10-31 17:07:18 +00:00
rzou
ccaa2a206a
[inductor] make requires_stride_order more unbacked-symint-aware ( #137063 )
...
Previously, we tried to sort SymInt strides to determine the stride
order. This PR makes the sorting more unbacked-symint-aware: given a Tensor
with sizes (u0, u1, u2), it has strides (u1 * u2, u2, 1), which is
sortable under the guard_size_oblivious assumptions.
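An illustration of why strides like (u1 * u2, u2, 1) stay sortable even when the u's are unbacked: each stride divides the next one up, so the ordering holds for any positive values, which is exactly what a guard_size_oblivious-style comparison relies on.
```python
def stride_order(strides):
    # dims ordered innermost (smallest stride) to outermost
    return sorted(range(len(strides)), key=lambda i: strides[i])

assert stride_order([12, 4, 1]) == [2, 1, 0]  # e.g. u1=3, u2=4
```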
Test Plan:
- test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137063
Approved by: https://github.com/eellison
2024-10-31 13:11:02 +00:00
eellison
4db6b740bc
[Easy] GraphTransformObserver Refactoring ( #139292 )
...
Uses `torch._inductor.config.trace.log_url_for_graph_xform` by default as the log url. It was only ever instantiated with this as the log_url argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139292
Approved by: https://github.com/shengfukevin , https://github.com/shunting314
2024-10-31 00:33:28 +00:00
Colin L. Rice
0b9320b7c5
fx_graph_cache: Remove custom amd JK ( #137501 )
...
This split in JKs was never actually used (we just set both JKs to the same values, except when we accidentally didn't, because humans make mistakes). This simplifies the overall JK structure and will eventually let us delete the duplicate JK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137501
Approved by: https://github.com/oulgen
2024-10-24 01:30:39 +00:00
Max Podkorytov
2dab4ccb65
[Inductor][ROCm][CK] add CK grouped conv2d fwd kernels to ROCm codegen ( #137947 )
...
Plugging into lowering, plus an end-to-end test, will come in a later PR.
Instance parsing companion PR https://github.com/ROCm/composable_kernel/pull/1585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137947
Approved by: https://github.com/ColinPeppler , https://github.com/chenyang78
2024-10-22 18:25:23 +00:00
Aaron Orenstein
07cc4bd3e2
typing compile_fx.py ( #138033 )
...
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions), so I bailed on those and used `Any` just to get the rest landed; see the sketch after this list.
- There are also changes to type signatures in other files, which I made just to let mypy know more about the types in compile_fx.py.
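The flavor of signature that motivated the `Any` fallback; the alias names are mine, not from the PR.
```python
from typing import Any, Callable

CompiledFn = Callable[..., Any]
# A function that returns a function that takes a function:
BackendWrapper = Callable[[CompiledFn], Callable[[CompiledFn], CompiledFn]]
```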
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
2024-10-21 18:14:59 +00:00
Jason Ansel
85a6a782e5
[inductor] Generalize WorkspaceArg for graph-level semaphores ( #138170 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138170
Approved by: https://github.com/Chillee
2024-10-18 23:05:54 +00:00
chilli
6752e7dc3e
Moved some of Inductor IR nodes to be frozen ( #137859 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137859
Approved by: https://github.com/ezyang
2024-10-17 18:04:45 +00:00
Li, Xingyuan
0dbbcfa7ae
[Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) ( #136947 )
...
[Inductor UT] Generalize newly introduced inductor UTs for Intel GPU
reuse `test/inductor/test_pattern_matcher.py`
reuse `test/inductor/test_snode_runtime.py`
reuse `test/inductor/test_unbacked_symints.py`
fix `test/inductor/test_triton_kernels.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947
Approved by: https://github.com/etaf , https://github.com/EikanWang , https://github.com/jansel
2024-10-12 13:21:20 +00:00
leslie-fang-intel
71010bf097
[Inductor][CPP] generalize the wgt tensor delete ( #135101 )
...
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR we identify nodes that are present in `input_nodes` but not in `new_input_nodes`, indicating they are no longer used by the GEMM template and can be considered candidates for deletion; see the sketch below.
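A sketch of the generalized rule, reusing the commit message's list names: any node present in `input_nodes` but absent from `new_input_nodes` is no longer consumed by the GEMM template and becomes a deletion candidate.
```python
def deletion_candidates(input_nodes: list, new_input_nodes: list) -> list:
    # compare by identity, since IR nodes may not be hashable by value
    remaining = set(map(id, new_input_nodes))
    return [node for node in input_nodes if id(node) not in remaining]
```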
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5 , https://github.com/jansel
2024-10-10 06:01:09 +00:00
Jez Ng
4061910ba2
Have Triton CPU backend respect max_autotune setting ( #137276 )
...
We would previously autotune regardless of the setting's value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137276
Approved by: https://github.com/jansel , https://github.com/desertfire
2024-10-06 03:03:33 +00:00
David Berard
54094c0c26
[inductor][user triton] Check size hints to determine indexing dtype ( #137234 )
...
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.
This PR checks the value to determine which dtype to use for indexing: if it is known to be below the int32 max, use int32 (adding guards where relevant); if we can't check (e.g. an unbacked symint), use int64.
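A sketch of the rule, with an illustrative helper name: int32 indexing only when the hint proves the value fits, and int64 as the safe fallback for unbacked symbolic ints that have no hint at all.
```python
import torch

INT32_MAX = 2**31 - 1

def index_dtype_for(size_hint) -> torch.dtype:
    if isinstance(size_hint, int) and size_hint <= INT32_MAX:
        return torch.int32  # a guard is added when the size is dynamic
    return torch.int64      # unknown / unbacked: play it safe
```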
Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
2024-10-03 22:07:26 +00:00
Colin Peppler
d117ec1d6e
[3/3][Inductor] Make CK work in FBCode ( #136234 )
...
Summary:
# Context
Goal: Enable CK for Inductor in FBCode
We split this stack into three diffs to help with review & in case we need to revert anything.
# This Diff
* Makes CK kernels available as an option for GEMM autotuning in Inductor.
Reviewed By: zjing14
Differential Revision: D62662705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234
Approved by: https://github.com/tenpercent , https://github.com/chenyang78
2024-10-02 12:17:38 +00:00
Jason Ansel
e9a55b43a1
[inductor] Support lists of tensors in operatorbench ( #136911 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136911
Approved by: https://github.com/eellison
2024-10-02 06:41:06 +00:00
eellison
0788d016d6
Update incompatible cudagraph ops skip message ( #137015 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137015
Approved by: https://github.com/BoyuanFeng
2024-10-01 21:23:36 +00:00
niklasz
3f457ee1f6
Fix AOT Graph capture not propagating non_blocking copy parameter to … ( #136513 )
...
…inductor codegen.
Fixes #136260
**Note**: this is my first code contribution to torch so please let me know if there's anything I need to fix/some other convention I should follow.
Regarding the bug, re-running the issue's reproduction code:
```python
import torch

def fn(x):
    return x.to(device="cuda", non_blocking=True)

inp = torch.randn(3, 4)
torch.compile(fn)(inp)
```
We now have the non_blocking being passed on to codegen properly:
```
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] ===== pre insert_deferred_runtime_asserts __compiled_fn_1 =====
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] <eval_with_key>.0 class GraphModule(torch.nn.Module):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] def forward(self, L_x_: "f32[3, 4]"):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] l_x_ = L_x_
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] to: "f32[3, 4]" = l_x_.to(device = 'cuda', non_blocking = True); l_x_ = None
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] return (to,)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] ===== __compiled_fn_1 =====
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] def forward(self, L_x_: "f32[3, 4][4, 1]cpu"):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] l_x_ = L_x_
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] to: "f32[3, 4][4, 1]cuda:0" = l_x_.to(device = 'cuda', non_blocking = True); l_x_ = None
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] return (to,)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.404000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:114] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[], is_train=False, traced_tangent_metas=None, num_symints_saved_for_bw=None, grad_enabled_mutation=None, deterministic=None, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None, num_backward_tokens=0),subclass_metadata=None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] TRACED GRAPH
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] ===== Forward graph 0 =====
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] def forward(self, arg0_1: "f32[3, 4][4, 1]cpu"):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] device_put: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.device_put.default(arg0_1, device(type='cuda', index=0), True); arg0_1 = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] convert_element_type: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.convert_element_type.default(device_put, torch.float32); device_put = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] return (convert_element_type,)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1134] [0/0] [__output_code] Output code written to: /tmp/torchinductor_niklasz/ha/chaai264g6ribfw3q2qhl6ayjtaqaavku5wivxtzw4nabgd6htsv.py
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] Output code:
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] # AOT ID: ['0_inference']
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import torch
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import math
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import random
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import os
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import tempfile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from math import inf, nan
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import maybe_profile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch import device, empty_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] aten = torch.ops.aten
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] inductor_ops = torch.ops.inductor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] _quantized = torch.ops._quantized
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile = AsyncCompile()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile.wait(globals())
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del async_compile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def call(args):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] arg0_1, = args
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] args.clear()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride(arg0_1, (3, 4), (4, 1))
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] with torch.cuda._DeviceGuard(0):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] torch.cuda.set_device(0)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] buf0 = empty_strided_cuda((3, 4), (4, 1), torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] buf0.copy_(arg0_1, True)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del arg0_1
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] return (buf0, )
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._dynamo.testing import rand_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import print_performance
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] arg0_1 = rand_strided((3, 4), (4, 1), device='cpu', dtype=torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] fn = lambda: call([arg0_1])
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] return print_performance(fn, times=times, repeat=repeat)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] if __name__ == "__main__":
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.wrapper_benchmark import compiled_module_main
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] compiled_module_main('None', benchmark_compiled_module)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
```
See the line `buf0.copy_(arg0_1, True)` above. Specific log setting used: `export TORCH_LOGS="graph_code,aot_graphs,output_code"`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136513
Approved by: https://github.com/eellison
2024-10-01 00:32:47 +00:00
Jez Ng
71aac59e93
Add Triton CPU as an Inductor backend ( #133408 )
...
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.
Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel , https://github.com/blaine-rister , https://github.com/malfet
2024-09-30 20:24:52 +00:00
Quinn Zhu
d33638588e
[aoti][inplace] Support skipping model buffers ( #136770 )
...
Summary: Some AOTI tensor constants may be model buffers that never need to be updated.
Differential Revision: D62777502
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136770
Approved by: https://github.com/muchulee8
2024-09-30 18:28:42 +00:00
PyTorch MergeBot
36428f91e9
Revert "Add Triton CPU as an Inductor backend ( #133408 )"
...
This reverts commit 31c0467594 .
Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517 ))
2024-09-27 16:54:27 +00:00
Jez Ng
31c0467594
Add Triton CPU as an Inductor backend ( #133408 )
...
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.
Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel , https://github.com/blaine-rister , https://github.com/malfet
2024-09-26 15:35:26 +00:00
Bin Bao
95c0f7493f
[Inductor] Rename WrapperCodeGen to PythonWrapperCodegen ( #136062 )
...
Summary: Rename WrapperCodeGen to PythonWrapperCodegen to make its meaning more explicit.
Differential Revision: [D63300358](https://our.internmc.facebook.com/intern/diff/D63300358 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136062
Approved by: https://github.com/angelayi , https://github.com/chenyang78
2024-09-24 21:02:51 +00:00
James Wu
4649aeaebf
Make AOTAutogradCache support remote FXGraphCache ( #136173 )
...
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.
This does *not* implement a remote AOTAutogradCache; it just allows AOTAutogradCache to work with the remote FXGraphCache.
Test Plan: (Meta only tests)
Reviewed By: aorenste
Differential Revision: D62384944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
2024-09-23 17:24:27 +00:00
PyTorch MergeBot
d0cebedb31
Revert "Add Triton CPU as an Inductor backend ( #133408 )"
...
This reverts commit e498b02b47 .
Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816 ))
2024-09-16 18:33:33 +00:00
Jez Ng
e498b02b47
Add Triton CPU as an Inductor backend ( #133408 )
...
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00