dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.
The suppressions were added automatically with the following script generated by GPT-4:
```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
This diff introduces the following changes:
1. Fix sympy_subs to preserve the integer and non-negative properties of the replaced symbol when the replacement is a string.
Why is this needed?
I was compiling the expression ``x*abs(y)`` where ``y = -2``. This expression is passed as ``s1*abs(s0)``, and then ``s0`` is replaced by ``ks0`` with a call to sympy_subs.
But sympy_subs used to replace ``s0`` (integer=False, nonnegative=False) with ``ks0`` (integer=True, nonnegative=True),
resulting in ``x*abs(ks0) = x*ks0``, which is wrong (see the sketch after this list).
2. Rename sympy_symbol to sympy_index_symbol to make it explicit.
3. Add an assertion that the replaced expression is never passed as a string but always as a sympy expression.
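Below is a minimal standalone sympy sketch (not the inductor sympy_subs code) showing why the assumptions on the replacement symbol matter:
```python
import sympy

x = sympy.Symbol("x")
s0 = sympy.Symbol("s0")  # no assumptions, like the original shape symbol
expr = x * sympy.Abs(s0)

# Substituting a nonnegative integer symbol lets sympy drop the Abs(), which
# is wrong when the runtime value can be negative (e.g. y = -2).
ks0_wrong = sympy.Symbol("ks0", integer=True, nonnegative=True)
print(expr.subs(s0, ks0_wrong))  # x*ks0

# Preserving the original (lack of) assumptions keeps the Abs().
ks0_right = sympy.Symbol("ks0")
print(expr.subs(s0, ks0_right))  # x*Abs(ks0)
```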
Fixes https://github.com/pytorch/pytorch/issues/117757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118150
Approved by: https://github.com/ezyang
Previously, we generated the grid argument with tree.numel for
a benchmark TritonKernel. This was not correct, because it
didn't match the launch config used for profiling and running.
This PR fixes the issue by emitting the grid value computed by the kernel's grid_fn, which is used by the profiler and the kernel's runner.
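To illustrate the mismatch (the numbers and names here are hypothetical, not taken from the PR): the benchmark harness used to pass the raw element count as the grid, while the real launch config divides by the block size.
```python
import math

xnumel = 4096  # tree.numel for the kernel
XBLOCK = 128   # block size chosen by the launch config

wrong_grid = (xnumel, 1, 1)                      # what the benchmark used to emit
right_grid = (math.ceil(xnumel / XBLOCK), 1, 1)  # what the kernel's grid_fn computes
print(wrong_grid, right_grid)                    # (4096, 1, 1) (32, 1, 1)
```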
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118202
Approved by: https://github.com/shunting314, https://github.com/jansel
For a persistent reduction, we generate two flavors of 'equivalent' kernels at the same time:
- persistent reduction
- regular reduction
A MultiKernel wraps these two kernels and picks the one with better performance at runtime.
Here I talk more about implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the inductor state before we can generate the counterpart.
***There is one thing I need some comments from others***:
There is one tricky thing about kernel arguments. In general, inductor removes a buffer from the argument list if it's only used inside the kernel. But a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of some CSE invalidation rule. My current implementation avoids removing buffers when multi_kernel is enabled, which makes sure both flavors of the reduction have consistent argument lists. Another idea is to generate the multi-kernel definition with the union of arguments from both sub-kernels and let each sub-kernel pick the subset of arguments it wants, but this would make the code-gen for multi-kernel much more complex.
I'm not sure if there is some easy and clean way to resolve this.
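A minimal sketch of the runtime-picking idea (the class name and timing logic below are mine, not the actual MultiKernel implementation; real code would also synchronize the device around the timed launches):
```python
import time

class MultiKernelSketch:
    """Wrap the persistent and regular reduction kernels; on the first call,
    time both and remember the faster one for all subsequent calls."""

    def __init__(self, persistent_kernel, regular_kernel):
        self.kernels = [persistent_kernel, regular_kernel]
        self.picked = None

    def __call__(self, *args, **kwargs):
        if self.picked is None:
            timings = []
            for kernel in self.kernels:
                start = time.perf_counter()
                kernel(*args, **kwargs)  # both flavors write the same outputs
                timings.append(time.perf_counter() - start)
            self.picked = timings.index(min(timings))
        else:
            self.kernels[self.picked](*args, **kwargs)
```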
Testing command:
```
TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is the step 1 to add Intel GPU backend as an alternative inductor backend.
### Design
Typically, in order to integrate the Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses respectively. However, since `WrapperCodegen` and `TritonScheduling` have device-specific code generation **scattered** across their methods, overriding them in subclasses would introduce a lot of duplicated parent-class code.
For example:
2a44034895/torch/_inductor/codegen/wrapper.py (L487)
2a44034895/torch/_inductor/codegen/triton.py (L1996)
So we abstract the device-specific code scattered in WrapperCodegen and TritonScheduling behind a unified interface, `DeviceOpOverrides`. This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface for device flexibility.
Currently `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover Cpp wrapper code generation on demand.
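A rough sketch of the interface shape (the method names and the CUDA-flavored snippets below are illustrative assumptions, not the exact API):
```python
class DeviceOpOverrides:
    # Each backend supplies the device-specific source snippets the Python
    # wrapper needs, so WrapperCodegen itself stays device-agnostic.
    def import_get_raw_stream_as(self, name: str) -> str:
        raise NotImplementedError

    def set_device(self, device_idx: int) -> str:
        raise NotImplementedError


class CUDADeviceOpOverrides(DeviceOpOverrides):
    def import_get_raw_stream_as(self, name: str) -> str:
        return f"from torch._C import _cuda_getCurrentRawStream as {name}"

    def set_device(self, device_idx: int) -> str:
        return f"torch.cuda.set_device({device_idx})"
```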
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
Two recent triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface of triton.compile; this PR adds the necessary changes on the inductor side to work with both the old and the new compile API.
There is also some simplification of the compilation call in the subprocess versus the one in the main process:
- previously we passed warm_cache_only=True if the compilation happened in a subprocess, but triton never uses that argument at the currently used pin, so I removed it.
- previously we only passed compute_capability if compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, whether the compilation happens in the main process or a subprocess.
Updated:
There are more interface changes on the triton side, e.g.:
- tl.math.{min, max} now requires a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune. On the other hand, JITFunction.run now forbids the stream argument; simply not passing it when benchmarking matmul triton kernels works with both the old and the new version of triton.
- the triton Autotuner changed attribute names from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused dynamo to fail to handle triton Autotuner objects, since dynamo's TritonKernelVariable makes assumptions about attribute names. That path is exercised in some test cases where a model calls the triton Autotuner directly.
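As a hedged sketch (not the actual inductor code), one way a caller can straddle both interfaces is to inspect the installed triton's signature before passing version-specific keyword arguments:
```python
import inspect

def run_jit_fn(jit_fn, *args, grid, stream=None, **kwargs):
    """Call JITFunction.run across triton versions: only pass `warmup` or
    `stream` when the installed triton actually accepts them."""
    params = inspect.signature(jit_fn.run).parameters
    if "warmup" in params:
        kwargs["warmup"] = False      # required by the new interface
    if "stream" in params and stream is not None:
        kwargs["stream"] = stream     # accepted only by the old interface
    return jit_fn.run(*args, grid=grid, **kwargs)
```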
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
Fixes #114310 and supersedes #114748.
There are two reasons why we have quite a few special cases for `round`:
1. `round` is actually two ops. With `ndigits=None` (default), `round` always returns an integer. When `ndigits` is an integer, the returned type is a float.
2. Although `round` takes two arguments, it is a unary function with a parameter rather than a binary one.
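For reference, plain Python shows both behaviors:
```python
print(type(round(2.5)))     # <class 'int'>   -- ndigits=None returns an int
print(type(round(2.5, 0)))  # <class 'float'> -- an integer ndigits returns a float
```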
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115259
Approved by: https://github.com/peterbell10, https://github.com/lezcano
Currently the inductor code for `x.any(-1)` does this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```
This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.
In the current code we also see:
```python
tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.
Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
- the cpp test is skipped since it is failing; https://github.com/pytorch/pytorch/pull/115647 may fix it
This PR fixes a compilation error with the following code:
```python
import torch
def func(x, a):
    n = (a * 1.234) // 8.234
    y = x + n
    return y
cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
^
AssertionError()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
This PR fixes #104791.
bitcast requires the source and target to have the same bitwidth.
Because the input tensor's dtype could be promoted, e.g. from float16 to
float, we have to cast the tensor to its original source dtype before
invoking bitcast in such cases. After that, we also need to convert
the bit-casted tensor back to float to make sure we keep using higher
precision values for the rest of the computation.
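An eager-mode illustration of the dtype round-trip described above (it mirrors the idea, not the generated code):
```python
import torch

x_fp16 = torch.tensor([1.5, -2.0], dtype=torch.float16)
promoted = x_fp16.to(torch.float32)  # dtype promotion inside the fused kernel

# Bitcast only works when the bit widths match: cast back to float16 first,
# view the bits, then return to float32 for the rest of the computation.
bits = promoted.to(torch.float16).view(torch.int16)
back = bits.view(torch.float16).to(torch.float32)
print(bits, back)
```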
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
Previously, if two calls to cumsum were generated in the same triton kernel,
we would generate identical helper functions with different names. Now we
recognize identical functions and define each one only once. To do this, I
defer choosing the name until after codegen.
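A small sketch of the deduplication idea (hypothetical helper, not the actual codegen):
```python
helper_bodies = {}  # maps helper body template -> assigned name

def define_helper(body_template: str) -> str:
    """body_template uses {name} as a placeholder, so identical helper bodies
    compare equal; a real name is assigned only for the first occurrence."""
    if body_template not in helper_bodies:
        helper_bodies[body_template] = f"_triton_helper_fn{len(helper_bodies)}"
    return helper_bodies[body_template]

add_fn = "def {name}(a, b):\n    return a + b\n"
assert define_helper(add_fn) == define_helper(add_fn)  # defined only once
```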
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115546
Approved by: https://github.com/lezcano
ghstack dependencies: #109132
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.
Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
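The kind of op this covers, with an epilogue that can fuse into the scan kernel (the snippet assumes a CUDA device, since `ir.Scan` is currently CUDA-only):
```python
import torch

@torch.compile
def f(x):
    # cumsum lowers to a scan; the `* 2` epilogue can fuse into the same kernel
    return x.cumsum(dim=-1) * 2

x = torch.randn(8, 128, device="cuda")
print(f(x).shape)
```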
Fixes https://github.com/pytorch/pytorch/issues/93631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
This was invaluable when I was debugging #114917. Without the node names
in the log message, it was difficult to make sense of them.
However, I did not want to bloat the number of LOC with this change.
Thus, instead of calling `debug()` directly with the node arguments, I
made a new callable class WhyNoFuse to partially apply the node
arguments at the top of each fusion-checking method. WhyNoFuse generates
the logging string only when its `__str__` method gets called, so there
is minimal overhead when logging is disabled.
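A condensed sketch of the pattern (simplified from the description above; the real class may differ in detail):
```python
import logging

fusion_log = logging.getLogger(__name__)

class WhyNoFuse:
    # Partially applied with the two nodes up front; the full message string
    # is only built if debug logging is actually enabled for this logger.
    def __init__(self, node1, node2):
        self.node1, self.node2 = node1, node2
        self.reason, self.args = "", ()

    def __call__(self, reason, *args):
        self.reason, self.args = reason, args
        fusion_log.debug("cannot fuse: %s", self)

    def __str__(self):
        return f"{self.node1} with {self.node2}: " + (self.reason % self.args)
```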
I also removed the various logging 'tags' like "vert:1" / "triton:1" --
the log messages themselves are unique enough that the user can identify
them without the tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115003
Approved by: https://github.com/Skylion007
Want to be able to benchmark epilogue-fused triton matmul kernels for a couple of reasons:
1. @eellison found that certain TB models (resnet50, resnet152, moco) sometimes fail in max-autotune mode on the dashboard. The issue is quite hard to repro due to flakiness and only gets triggered when a certain triton config for a certain epilogue-fused kernel gets picked (disabling epilogue fusion bypasses the issue). It would be nice to have a runnable script that directly runs that kernel to ease further debugging.
2. this is a necessary piece to do benchmark fusion for triton matmul kernels. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler for this
Example runnable kernel script: https://gist.github.com/shunting314/00bdbc1b6b46bfa73d1389d8f40cd669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114809
Approved by: https://github.com/eellison
torch.split(x, l) fails when l's shape is an unbacked symint. E.g. l = y.tolist() makes l unbacked, because l depends on the data access of y. The downstream call `SliceView.create()` evaluates the shape even if the input shape is an unbacked symint, which surfaces the bug.
Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
Fixes #97361
When a fused kernel has more than 1024 parameters, ctypes throws an error.
Limiting the number of args should be a mechanism to protect stack memory: as we know, CPP passes args via stack memory, and stack memory has a size limitation.
Code change:
1. The cpp backend checks the fused nodes' number of args; once it reaches the limit, it flushes its status to ready.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Add a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False since it does not support this yet.
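A hedged sketch of how the pieces described above fit together (signatures are assumptions, not the exact code):
```python
class BaseScheduling:
    def ready_to_flush(self) -> bool:
        # Default for backends (e.g. Triton) that don't support flushing yet.
        return False

class CppSchedulingSketch(BaseScheduling):
    MAX_FUSED_KERNEL_ARGS_NUM = 1024

    def __init__(self):
        self.current_args_num = 0  # updated as nodes are fused

    def ready_to_flush(self) -> bool:
        return self.current_args_num > self.MAX_FUSED_KERNEL_ARGS_NUM

# In the scheduler's codegen loop, roughly:
#     if backend.ready_to_flush():
#         backend.flush()
```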
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
Previously V.graph lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream, as V.graph is referenced in
many places in inductor code. I've now typed it properly as GraphLowering
and fixed the numerous type errors this surfaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
`install_config_module` makes a regular module into a ConfigModule with
extra methods defined on it. mypy thinks those extra methods (or module
functions) are undefined since it cannot analyze something so
dynamic. As a workaround, I've created a fake module that defines these
extra functions, which I import into the config modules during type
checking.
As part of this change, I've also added more types to config_utils.py
and enabled typechecking for torch/_dynamo/config.py.
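The workaround follows the familiar TYPE_CHECKING pattern; roughly (the stub module path shown is an assumption):
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Fake module declaring the functions install_config_module adds
    # dynamically, visible only to the type checker, never at runtime.
    from torch.utils._config_typing import *  # noqa: F401, F403

# ordinary config values remain plain module-level globals
verbose = False
```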
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130
Approved by: https://github.com/jansel
This PR is split out of https://github.com/pytorch/pytorch/pull/108193. It adds the ability to add an assertion after each triton kernel call to make sure all tensor arguments are not nan/inf. It helped me find a few bugs when working on benchmark fusion (due to messing up some kernel/graph level states when generating kernel code).
Right now we have to disable cudagraphs to enable the nan/inf checks. Otherwise we will see errors like: https://gist.github.com/shunting314/053db66c4f121e5f4c5de159bf0032ed . My best guess is it's due to GPU->CPU copy during capturing for cudagraphs. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @eellison if there is easy way to make it work with cudagraphs. But even if the nan-checker is not compatible with cudagraphs, it's probably still fine since it's just for debugging purpose.
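Conceptually, the generated check boils down to something like this (a sketch, not the emitted wrapper code):
```python
import torch

def assert_no_nan_or_inf(kernel_name: str, *tensors: torch.Tensor) -> None:
    # Called after a triton kernel launch on each tensor argument. The .item()
    # calls force a GPU->CPU sync, which is likely what clashes with cudagraph
    # capture as mentioned above.
    for i, t in enumerate(tensors):
        assert not torch.isnan(t).any().item(), f"{kernel_name} arg {i} has NaN"
        assert not torch.isinf(t).any().item(), f"{kernel_name} arg {i} has Inf"
```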
Test command:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_NAN_ASSERTS=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training --disable-cudagraphs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112091
Approved by: https://github.com/eellison, https://github.com/jansel
triton_meta is intended to be passed directly to triton. Previously we were also putting other metadata into triton_meta; we should split that other metadata into a separate dict to avoid possible conflicts in the future.
This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton.
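For illustration, the split looks roughly like this (the keys are examples, not an exhaustive or exact list):
```python
# Everything triton.compile should see:
triton_meta = {
    "signature": {0: "*fp32", 1: "*fp32", 2: "i32"},
    "constants": {},
    "device": 0,
}

# Inductor-internal bookkeeping that triton should never receive:
inductor_meta = {
    "kernel_name": "triton_poi_fused_add_0",
    "mutated_arg_names": [],
}
```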
Tests - wait for CI
Differential Revision: [D50864493](https://our.internmc.facebook.com/intern/diff/D50864493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112351
Approved by: https://github.com/eellison
In dynamo/inductor, it sometimes helps to gather metrics/statistics for each model at different levels: model level, graph level, kernel level, or pair-of-fusion-nodes level. This kind of thing would be very easy to do with Scuba, but we only have Scuba in fbcode. This PR builds metric tables to solve part of the problem.
Q: why not log to stdout/err directly?
A: sometimes we need more structured data. E.g., it would be helpful to gather all the stats in a CSV and then do post-processing (like calculating a geomean etc.). Also, the metric table tags each row with the model name, which is helpful.
Q: what's the difference with speedup_inductor.csv?
A: speedup_inductor.csv is a special case that gathers statistics at the model level, i.e. we have one row for each model. But recording statistics at a finer-grained level, like per graph, is also helpful.
Example use cases:
- As a followup to the benchmark fusion PR, I want to gather all the 'slow' fusions and analyze them. With the metric table, I can easily log slow fusions for each model into a csv file. Here is the log gathered for huggingface:
https://gist.github.com/shunting314/964e73cc98368b301414ec7b7ad4c702 .
- To help understand the effect of the 'loop ordering after fusion' PR, it would be helpful to gather stats like how many fusions happen for each graph. Previously we logged the metric to stderr directly, but logging these metrics in a structured way is more useful.
- Gather the number of registers, register spills, and shared memory usage for each kernel in each model, with the runnable kernel code logged.
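A rough sketch of the metric-table idea (hypothetical class, not the PR's exact API):
```python
import csv
import os

class MetricTable:
    """Each table has fixed columns; every row is tagged with the model name
    and appended to a CSV so it can be post-processed later."""

    def __init__(self, name: str, column_names: list):
        self.name, self.column_names = name, column_names

    def add_row(self, model_name: str, row: dict) -> None:
        path = f"metric_table_{self.name}.csv"
        new_file = not os.path.exists(path)
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            if new_file:
                writer.writerow(["model_name"] + self.column_names)
            writer.writerow([model_name] + [row[c] for c in self.column_names])

slow_fusion = MetricTable("slow_fusion", ["kernel1", "kernel2", "speedup"])
slow_fusion.add_row("BertForMaskedLM", {"kernel1": "k0", "kernel2": "k1", "speedup": 0.83})
```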
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109245
Approved by: https://github.com/jansel, https://github.com/mlazos