pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Jeff Daily	01abb5af21	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10, https://github.com/malfet	2024-01-22 18:33:41 +00:00
leslie-fang-intel	af831415a8	fix cpp backend relu codegen with inf input (#117622 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/117544. For CPP backend, current `ReLU` will code gen to `f"{x} * ({x}>0)"` in `CppOverrides`. The result mismatches with eager when input has `inf`, since `inf * 0` will result to `nan` based on [IEEE_754](https://en.wikipedia.org/wiki/IEEE_754). Change the code gen to `f"std::max({x}, decltype({x})(0))"` to align with eager implementation as in `1deb75b584/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L392)` TestPlan ``` python -u -m pytest test_cpu_repro.py -k test_relu_with_inf_value ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117622 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-01-20 13:28:03 +00:00
PyTorch MergeBot	b637fdc8b3	Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 )" This reverts commit `74e1362499`. Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))	2024-01-19 17:35:04 +00:00
Jeff Daily	74e1362499	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10	2024-01-19 00:50:18 +00:00
Jason Ansel	a669319450	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire	2024-01-18 16:20:12 +00:00
Nikita Shulga	a1afd1b195	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" It should have never been landed, but was landed again, thanks to ghstack grafting/ungrafting see discussion on https://github.com/pytorch/pytorch/pull/116910 This reverts commit `e457b6fb18`.	2024-01-17 17:06:32 -08:00
titaiwangms	e457b6fb18	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire ghstack dependencies: #117409, #116667, #117591	2024-01-17 23:03:15 +00:00
PyTorch MergeBot	da6abaeeac	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" This reverts commit `bb0fd1bd3c`. Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))	2024-01-17 19:34:26 +00:00
titaiwangms	bb0fd1bd3c	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire ghstack dependencies: #117409, #116667, #117591	2024-01-17 19:12:24 +00:00
PyTorch MergeBot	9da01affd3	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" This reverts commit `3a52147cc5`. Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))	2024-01-17 18:42:39 +00:00
Jason Ansel	3a52147cc5	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire	2024-01-16 22:30:04 +00:00
Jiong Gong	2b56d80460	[inductor][cpp] apply simplify_index_in_vec_range to vector store and vector transpose (#117263 ) As the title, this PR extends the `simplify_index_in_vec_range` to store and transpose. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117263 Approved by: https://github.com/jansel ghstack dependencies: #117221, #117260	2024-01-15 08:41:28 +00:00
Jiong Gong	3b00dd5843	[inductor][cpp] apply simplify_index_in_vec_range in select_tiling_indices to enable more contiguous vec load (#117260 ) For the one of the kernels in the UT `test_vec_contiguous_ModularIndexing`: Before: ```c++ for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L)) { { #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()}) #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()}) Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L)) { auto tmp0 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x1_inner = 0; x1_inner < 16; x1_inner++) { tmpbuf[x1_inner] = in_ptr0[static_cast<long>((128L(c10::div_floor_integer(x2, 256L))) + (256Lx1) + (256Lx1_inner) + (7168L(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336Lx0) + (static_cast<long>(x2) % static_cast<long>(128L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data()); } () ; tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0); } tmp_acc0_vec.mean.store(out_ptr0 + static_cast<long>(x1 + (28Lx0))); tmp_acc0_vec.m2.store(out_ptr1 + static_cast<long>(x1 + (28Lx0))); } } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(16L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L)) { { #pragma omp declare reduction( welford:Welford<float>: omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()}) Welford<float> tmp_acc0 = Welford<float>(); for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>((128L(c10::div_floor_integer(x2, 256L))) + (256Lx1) + (7168L(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336Lx0) + (static_cast<long>(x2) % static_cast<long>(128L)))]; tmp_acc0 = welford_combine(tmp_acc0, tmp0); } out_ptr0[static_cast<long>(x1 + (28Lx0))] = tmp_acc0.mean; out_ptr1[static_cast<long>(x1 + (28Lx0))] = tmp_acc0.m2; } } ``` After: ```c++ for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L)) { { #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()}) #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()}) Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>((128L(c10::div_floor_integer(x2, 256L))) + (256Lx1) + (7168L(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336Lx0) + (static_cast<long>(x2) % static_cast<long>(128L)))); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0); } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (28Lx0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.m2); } } } ``` This PR also further speeds up the model `swin_base_patch4_window7_224` from 1.25x to 1.28x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117260 Approved by: https://github.com/jansel ghstack dependencies: #117221	2024-01-15 06:57:25 +00:00
Edward Z. Yang	7a7535283f	Some basic support for uint{16,32,64} codegen in CPU inductor (#116810 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116810 Approved by: https://github.com/chenyang78, https://github.com/eellison, https://github.com/desertfire	2024-01-12 23:13:28 +00:00
Jiong Gong	172dd13ecf	[inductor][cpp] improve vector contiguous checks for FloorDiv and ModularIndexing (#117221 ) Fix https://github.com/pytorch/pytorch/issues/114488 The PR tries to enable contiguous vector loads for cases where we can reduce `FloorDiv` and `ModularIndexing` in the vectorized loop. Take the index expression in test case `test_vec_contiguous_ModularIndexing` for example. `14336x0 + 256x1 + 128((x2//256)) + ModularIndexing(x2, 1, 128) + 7168ModularIndexing(x2, 128, 2)` can be reduced to `14336x0 + 256x1 + x2 + 128x2_div_c0 + 7168x2_mod_c0 + x2_mod_c1` where `x2` is a vectorized loop variable and the vector length is 16. This means we can do vectorized load for this index. Check the code comment for more details: https://github.com/pytorch/pytorch/pull/117221/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R317-R329 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117221 Approved by: https://github.com/jansel	2024-01-12 15:20:36 +00:00
vfdev-5	7005a4bcb6	[dynamo] Added dyn shapes support for math trigo ops: sin(h), cos(h), tan(h) ... (#114866 ) Description: - Added dynamic shapes support for math trigo ops: sin(h), cos(h), tan(h) ... ```python import math import torch def func(x, a, b): c = 0 c = c + math.sqrt(a) c = c + math.cos(a) c = c + math.cosh(a) c = c + math.sin(a) c = c + math.sinh(a) c = c + math.tan(a) c = c + math.tanh(a) c = c + math.asin(b) c = c + math.acos(b) c = c + math.atan(a) y = x + c return y cfunc = torch.compile(func, dynamic=True, fullgraph=True) device = "cpu" # or "cuda" x = torch.tensor([0, 1, 2, 3], dtype=torch.float32, device=device) a = 12 b = 1 out = cfunc(x, a, b) expected = func(x, a, b) torch.testing.assert_close(out, expected) ``` and the graph `TORCH_LOGS=+graph_code python check_math_ops.py`: <details> <summary> graph code </summary> ``` [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] ===== __compiled_fn_0 ===== [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] <eval_with_key>.0 class GraphModule(torch.nn.Module): [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] def forward(self, L_a_ : torch.SymInt, s1 : torch.SymInt, L_x_ : torch.Tensor): [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] l_a_ = L_a_ [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] l_x_ = L_x_ [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:57, code: c = c + math.sqrt(a) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sym_sqrt = torch.sym_sqrt(l_a_) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add = 0 + sym_sqrt; sym_sqrt = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:58, code: c = c + math.cos(a) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sym_cos = torch.sym_cos(l_a_) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_1 = add + sym_cos; add = sym_cos = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:59, code: c = c + math.cosh(a) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sym_cosh = torch.sym_cosh(l_a_) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_2 = add_1 + sym_cosh; add_1 = sym_cosh = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:60, code: c = c + math.sin(a) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sym_sin = torch.sym_sin(l_a_) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_3 = add_2 + sym_sin; add_2 = sym_sin = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:61, code: c = c + math.sinh(a) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sym_sinh = torch.sym_sinh(l_a_) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_4 = add_3 + sym_sinh; add_3 = sym_sinh = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:62, code: c = c + math.tan(a) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sym_tan = torch.sym_tan(l_a_) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_5 = add_4 + sym_tan; add_4 = sym_tan = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:63, code: c = c + math.tanh(a) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sym_tanh = torch.sym_tanh(l_a_) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_6 = add_5 + sym_tanh; add_5 = sym_tanh = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:64, code: c = c + math.asin(b) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_7 = add_6 + 1.5707963267948966; add_6 = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:65, code: c = c + math.acos(b) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_8 = add_7 + 0.0; add_7 = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:66, code: c = c + math.atan(a) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sym_atan = torch.sym_atan(l_a_); l_a_ = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] add_9 = add_8 + sym_atan; add_8 = sym_atan = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: check_math_ops.py:67, code: y = x + c [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] y = l_x_ + add_9; l_x_ = add_9 = None [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] return (y,) [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] [2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] ``` </details> Generated code with `TORCH_LOGS=+output_code python check_math_ops.py`: <details> <summary> C++ code </summary> ``` [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] cpp_fused_add_0 = async_compile.cpp(''' [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] #include "/tmp/torchinductor_root/2l/c2ljzlm4sosod7u6lyrroqdba6hmfcyijrric6p4t3fhbcmw6osp.h" [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] extern "C" void kernel(const float* in_ptr0, [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] float* out_ptr0, [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] const long ks0, [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] const long ks1) [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] { [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] { [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] #pragma GCC ivdep [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] for(long x0=static_cast<long>(0L); x0<static_cast<long>(ks0); x0+=static_cast<long>(1L)) [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] { [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] auto tmp0 = in_ptr0[static_cast<long>(x0)]; [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] auto tmp1 = c10::convert<float>(1.57079632679490 + (std::sqrt(ks1)) + (std::atan(ks1)) + (std::cos(ks1)) + (std::cosh(ks1)) + (std::sin(ks1)) + (std::sinh(ks1)) + (std::tan(ks1)) + (std::tanh(ks1))); [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] out_ptr0[static_cast<long>(x0)] = tmp2; [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] } [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] } [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] } [2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] ''') ``` </details> <details> <summary> Triton code </summary> ``` [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] @pointwise( [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] size_hints=[4], [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] filename=__file__, [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: 'i32', 3: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1), equal_to_1=(), i ds_of_folded_args=(), divisible_by_8=())]}, [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_add_0', 'mutated_arg_names': []}, [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] min_elem_per_thread=0 [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] ) [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] @triton.jit [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr): [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] xoffset = tl.program_id(0) * XBLOCK [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] xindex = xoffset + tl.arange(0, XBLOCK)[:] [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] xmask = xindex < xnumel [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] x0 = xindex [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] tmp0 = tl.load(in_ptr0 + (x0), xmask) [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] tmp1 = 1.57079632679490 + (tl.math.sqrt(ks0.to(tl.float32))) + (tl.math.atan((ks0).to(tl.float32))) + (tl.math.cos((ks0).to(tl.float32))) + (tl.math.cosh((ks0).to(tl.float32))) + (tl.math.sin((ks0) .to(tl.float32))) + (tl.math.sinh((ks0).to(tl.float32))) + (tl.math.tan((ks0).to(tl.float32))) + (tl.math.tanh((ks0).to(tl.float32))) [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] tmp2 = tmp1.to(tl.float32) [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] tmp3 = tmp0 + tmp2 [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] tl.store(out_ptr0 + (x0), tmp3, xmask) [2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] ''') ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/114866 Approved by: https://github.com/peterbell10	2024-01-11 11:52:28 +00:00
Jiong Gong	4c6e842496	[inductor][cpp] load as scalar for the index invariant in the vector range (#116387 ) For the test `test_expr_vec_non_contiguous`. The index_expr `31L + (63L(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` is invariant under the vector range of `x2`. Before change ```c++ #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L)) { { #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())}) float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L)) { auto tmp0 = [&] { __at_align__ std::array<int, 16> tmpbuf; #pragma GCC unroll 16 for (long x1_inner = 0; x1_inner < 16; x1_inner++) { tmpbuf[x1_inner] = static_cast<long>(31L + (63L(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L))); } return at::vec::Vectorized<int>::loadu(tmpbuf.data()); } () ; auto tmp1 = static_cast<int>(2048); auto tmp2 = at::vec::Vectorized<int>(tmp1); auto tmp3 = to_float_mask(tmp0 < tmp2); auto tmp4 = [&] { auto tmp5 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x1_inner = 0; x1_inner < 16; x1_inner++) { if (vector_lane_mask_check(tmp3, x1_inner)) { tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536Lx0) + (c10::div_floor_integer(x2, 32L)))]; } } return at::vec::Vectorized<float>::loadu(tmpbuf.data()); } () ; return tmp5; } ; auto tmp6 = [&] { if (all_zero(to_float_mask(tmp3))) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3)); } } () ; tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6); } tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024Lx0))); } } } } ``` After change ```c++ #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L)) { { #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())}) float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L)) { auto tmp0 = c10::convert<int>(31L + (63L(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))); auto tmp1 = static_cast<int>(2048); auto tmp2 = tmp0 < tmp1; auto tmp3 = [&] { auto tmp4 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x1_inner = 0; x1_inner < 16; x1_inner++) { if (tmp2 != 0) { tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536Lx0) + (c10::div_floor_integer(x2, 32L)))]; } } return at::vec::Vectorized<float>::loadu(tmpbuf.data()); } () ; return tmp4; } ; auto tmp5 = [&] { if (all_zero(to_float_mask(tmp2))) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(tmp3())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp3(), to_float_mask(tmp2)); } } () ; tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp5); } tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0))); } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/116387 Approved by: https://github.com/EikanWang, https://github.com/lezcano ghstack dependencies: #114545	2023-12-26 08:45:04 +00:00
Jiong Gong	ffe6f9ac91	[inductor cpp] support vectorization for index_expr that depends on tiling itervar or with indirect indexing (#114545 ) As the title, this PR enables vectorization for the situation when the the index_expr depends on vectorized itervar. There are two cases here: 1. The vectorized itervar has constant stride in the index_expr. We vectorize the index_expr with `Vectorized<int32>::arange` for this case. 2. Otherwise, we load the index_expr vector in a non-contiguous way with a loop. Below is the generated code for the first case from the test `test_concat_inner_vec`. Here `x1` is the index_expr and depends on the vectorized itervar `x1`. It has constant stride 1. We vectorized it with arange. We use `all_zero` to implement a short-cut for masks to avoid unnecessary execution of nested masked regions which are invalid. Before: ```c++ #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(155L); x1+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(x1); auto tmp1 = static_cast<long>(0); auto tmp2 = tmp0 >= tmp1; auto tmp3 = static_cast<long>(35); auto tmp4 = tmp0 < tmp3; auto tmp5 = [&] { auto tmp6 = in_ptr0[static_cast<long>(x1 + (35Lx0))]; return tmp6; } ; auto tmp7 = tmp4 ? tmp5() : static_cast<decltype(tmp5())>(0.0); auto tmp8 = tmp0 >= tmp3; auto tmp9 = static_cast<long>(155); auto tmp10 = tmp0 < tmp9; auto tmp11 = [&] { auto tmp12 = in_ptr1[static_cast<long>((-35L) + x1 + (120Lx0))]; return tmp12; } ; ... ``` After: ```c++ #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(144L); x1+=static_cast<long>(16L)) { auto tmp0 = c10::convert<int>(x1); auto tmp1 = at::vec::Vectorized<int32_t>::arange(tmp0, 1); auto tmp2 = static_cast<int>(0); auto tmp3 = at::vec::Vectorized<int>(tmp2); auto tmp4 = to_float_mask(tmp1 >= tmp3); auto tmp5 = static_cast<int>(35); auto tmp6 = at::vec::Vectorized<int>(tmp5); auto tmp7 = to_float_mask(tmp1 < tmp6); auto tmp8 = [&] { auto tmp9 = masked_load(in_ptr0 + static_cast<long>(x1 + (35Lx0)), to_float_mask(tmp7)); return tmp9; } ; auto tmp10 = [&] { if (all_zero(to_float_mask(tmp7))) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(tmp8())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp8(), to_float_mask(tmp7)); } } () ; ... ``` Below is the generated code for the second case from the test case `test_expr_vec_non_contiguous`. Here, the index_expr is `31L + (63L(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` which depends on the vectorized itervar `x2` and doesn't have constant stride. So, we load the index_expr vector with a loop. (In fact, this can be further optimized since the index_expr is invariant with the data points in the range [x2, x2+16). So it can be regarded as a scalar. This will be optimized in the follow-up PR.) The code uses `vector_lane_mask_check` to implement the masked version of non-contiguous load. Before: ```c++ #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L)) { { float tmp_acc0 = -std::numeric_limits<float>::infinity(); for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(31L + (63L(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))); auto tmp1 = static_cast<long>(2048); auto tmp2 = tmp0 < tmp1; auto tmp3 = [&] { auto tmp4 = in_ptr0[static_cast<long>(31L + (63L(c10::div_floor_integer(x1, 32L))) + (2048L(static_cast<long>(x1) % static_cast<long>(32L))) + (65536Lx0) + (c10::div_floor_integer(x2, 32L)))]; return tmp4; } ; auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0); tmp_acc0 = max_propagate_nan(tmp_acc0, tmp5); } out_ptr0[static_cast<long>(x1 + (1024Lx0))] = tmp_acc0; } } } ``` After: ```c++ #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L)) { { #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())}) float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L)) { auto tmp0 = [&] { __at_align__ std::array<int, 16> tmpbuf; #pragma GCC unroll 16 for (long x1_inner = 0; x1_inner < 16; x1_inner++) { tmpbuf[x1_inner] = static_cast<long>(31L + (63L(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L))); } return at::vec::Vectorized<int>::loadu(tmpbuf.data()); } () ; auto tmp1 = static_cast<int>(2048); auto tmp2 = at::vec::Vectorized<int>(tmp1); auto tmp3 = to_float_mask(tmp0 < tmp2); auto tmp4 = [&] { auto tmp5 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x1_inner = 0; x1_inner < 16; x1_inner++) { if (vector_lane_mask_check(tmp3, x1_inner)) { tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536Lx0) + (c10::div_floor_integer(x2, 32L)))]; } } return at::vec::Vectorized<float>::loadu(tmpbuf.data()); } () ; return tmp5; } ; auto tmp6 = [&] { if (all_zero(to_float_mask(tmp3))) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3)); } } () ; tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6); } tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024Lx0))); } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114545 Approved by: https://github.com/lezcano	2023-12-26 05:36:39 +00:00
Philip Meier	505a9e4854	add support for dynamic shapes in round (#115259 ) Fixes #114310 and supersedes #114748. There are two reasons why we have quite a few special cases for `round`: 1. `round` is actually two ops. With `ndigits=None` (default), `round` always returns an integer. When `ndigits` is an integer, the returned type is a float. 2. Although `round` takes two arguments, it is a unary function with a parameter rather than a binary one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115259 Approved by: https://github.com/peterbell10, https://github.com/lezcano	2023-12-19 15:45:50 +00:00
Yang Chen	1392843e7b	[inductor] make sure bitcast input and target type have the same bitwidth (#115619 ) This PR fixed #104791 bitcast requires the source and target have the bitwidth. Because the input tensor's dtype could be promoted, e.g. from float16 to float, we have to cast the tensor to its original source dtype before invoking bitcast in such cases. After that, we also need to convert the bit-casted tensor back to float to make sure we keep using higher precision values for the rest of the computation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619 Approved by: https://github.com/jansel, https://github.com/eellison	2023-12-13 00:53:04 +00:00
Jiong Gong	bfa2c844a8	[inductor][cpp] avoid redundant lowp type cast for direct load/store (#115006 ) Fix https://github.com/pytorch/pytorch/issues/114879. See https://github.com/pytorch/pytorch/issues/114879#issuecomment-1836977610 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115006 Approved by: https://github.com/jansel	2023-12-04 06:39:27 +00:00
Jez Ng	f1fd02503b	Reland #113487 and #112527 (sdpa shim & fp8 AOTInductor support) (#114974 ) This is a backout of #113747 which reverted the above two commits. Now that #113997 has landed, this diff can be landed safely without breaking ABI compatibility. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114974 Approved by: https://github.com/chenyang78	2023-12-02 03:25:51 +00:00
vfdev	c867fddab5	[inductor] Fix in CppPrinter._print_Pow (#114872 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114872 Approved by: https://github.com/lezcano	2023-11-30 20:21:44 +00:00
Jez Ng	71b742b42c	[inductor] Remove more type: ignore comments (#114162 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114162 Approved by: https://github.com/Skylion007, https://github.com/eellison	2023-11-28 06:45:55 +00:00
Xu Han	0f887a6d1a	limit fused kernel num args. (#113131 ) Fixes #97361 When fused kernel more than 1024 parameters, it should throw error from ctypes. Limit args number is should be a mechanism to protect stack memory. As we known, CPP is passing args via stack memory, and stack memory has size limitation. Code change: 1. cpp backend will check the fused nodes' args number, if it is reach the limitation. It will status flush status to ready. 2. scheduler will check `ready_to_flush` API and help backend flush codegen. 3. Add `ready_to_flush` API to `BaseScheduling`, Triton backend will return False due to not support it yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131 Approved by: https://github.com/jgong5, https://github.com/mlazos	2023-11-22 18:05:33 +00:00
Jiong Gong	a0e3321f0c	[inductor cpp] vectorize embedding lookup (#114062 ) For embedding lookup, there are indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar. This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors so that vector operations are only generated if needed by `CppVecKernel` when any of the inputs are vectors, otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable` which bookkeeps the relevant itervars to the variable and has a flag to mark whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed. For the added UT `test_embedding_vec`, the generated code before this PR is: ```c++ extern "C" void kernel(const long* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0) { #pragma omp parallel num_threads(64) { { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; auto tmp5 = in_ptr2[static_cast<long>(x1 + (128Lx0))]; auto tmp1 = decltype(tmp0)(tmp0 + 64); auto tmp2 = tmp0 < 0; auto tmp3 = tmp2 ? tmp1 : tmp0; TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L") auto tmp4 = in_ptr1[static_cast<long>(x1 + (128Ltmp3))]; auto tmp6 = decltype(tmp4)(tmp4 + tmp5); out_ptr0[static_cast<long>(x1 + (128Lx0))] = tmp6; } } } } } ``` After this PR, we have: ```c++ extern "C" void kernel(const long in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0) { #pragma omp parallel num_threads(64) { { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128Lx0))); auto tmp1 = decltype(tmp0)(tmp0 + 64); auto tmp2 = tmp0 < 0; auto tmp3 = tmp2 ? tmp1 : tmp0; TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L") auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128Ltmp3))); auto tmp6 = tmp4 + tmp5; tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0))); } } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062 Approved by: https://github.com/jansel	2023-11-22 11:19:42 +00:00
PyTorch MergeBot	dd6ef0877e	Revert "[inductor cpp] vectorize embedding lookup (#114062 )" This reverts commit `2c0474c02d`. Reverted https://github.com/pytorch/pytorch/pull/114062 on behalf of https://github.com/huydhn due to Sorry for reverting your change, please help fix lint and reland it `2c0474c02d` ([comment](https://github.com/pytorch/pytorch/pull/114062#issuecomment-1820526515))	2023-11-21 09:21:20 +00:00
Jiong Gong	2c0474c02d	[inductor cpp] vectorize embedding lookup (#114062 ) For embedding lookup, there are indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar. This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors so that vector operations are only generated if needed by `CppVecKernel` when any of the inputs are vectors, otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable` which bookkeeps the relevant itervars to the variable and has a flag to mark whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed. For the added UT `test_embedding_vec`, the generated code before this PR is: ```c++ extern "C" void kernel(const long* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0) { #pragma omp parallel num_threads(64) { { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; auto tmp5 = in_ptr2[static_cast<long>(x1 + (128Lx0))]; auto tmp1 = decltype(tmp0)(tmp0 + 64); auto tmp2 = tmp0 < 0; auto tmp3 = tmp2 ? tmp1 : tmp0; TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L") auto tmp4 = in_ptr1[static_cast<long>(x1 + (128Ltmp3))]; auto tmp6 = decltype(tmp4)(tmp4 + tmp5); out_ptr0[static_cast<long>(x1 + (128Lx0))] = tmp6; } } } } } ``` After this PR, we have: ```c++ extern "C" void kernel(const long in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0) { #pragma omp parallel num_threads(64) { { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128Lx0))); auto tmp1 = decltype(tmp0)(tmp0 + 64); auto tmp2 = tmp0 < 0; auto tmp3 = tmp2 ? tmp1 : tmp0; TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L") auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128Ltmp3))); auto tmp6 = tmp4 + tmp5; tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0))); } } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062 Approved by: https://github.com/jansel ghstack dependencies: #113950	2023-11-21 07:37:15 +00:00
PyTorch MergeBot	ff7c06a01b	Revert "limit fused kernel num args. (#113131 )" This reverts commit `7b442c2b0a`. Reverted https://github.com/pytorch/pytorch/pull/113131 on behalf of https://github.com/albanD due to Breaks lint on trunk ([comment](https://github.com/pytorch/pytorch/pull/113131#issuecomment-1817548349))	2023-11-18 16:14:08 +00:00
Jiong Gong	b53d47a719	[inductor cpp] refactor: CppVecOverrides inherits CppOverrides (#113950 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113950 Approved by: https://github.com/Skylion007	2023-11-18 15:33:30 +00:00
Han, Xu	7b442c2b0a	limit fused kernel num args. (#113131 ) Fixes #97361 When fused kernel more than 1024 parameters, it should throw error from ctypes. Limit args number is should be a mechanism to protect stack memory. As we known, CPP is passing args via stack memory, and stack memory has size limitation. Code change: 1. cpp backend will check the fused nodes' args number, if it is reach the limitation. It will status flush status to ready. 2. scheduler will check `ready_to_flush` API and help backend flush codegen. 3. Add `ready_to_flush` API to `BaseScheduling`, Triton backend will return False due to not support it yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131 Approved by: https://github.com/jgong5, https://github.com/mlazos	2023-11-18 03:55:52 +00:00
Wei Wei	b19cf868e8	Back out "Support fp8 in AOTInductor + support optional<> in C ABI (#112527 )" (#113747 ) Test Plan: sandcastle Differential Revision: D51330618 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113747 Approved by: https://github.com/chenyang78, https://github.com/khabinov	2023-11-15 22:42:22 +00:00
Jiong Gong	fcdfcdeef9	[inductor cpp] fix non-contiguous reduction store (#113261 ) Fix https://github.com/pytorch/pytorch/issues/113018 The reduction store in this case works on non-contiguous buffer. Previously, we only do scalar fallback for normal stores but not reduction stores. This PR fixes this. Before fix ```c++ #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L)) { { #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())}) float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17Lx2) + (306Lx0))); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0); } tmp_acc0_vec.store(out_ptr1 + static_cast<long>(x0 + (39Lx1))); // this is wrong since x0 is not vector dim } } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L)) { { float tmp_acc0 = -std::numeric_limits<float>::infinity(); for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L)) { auto tmp0 = in_ptr1[static_cast<long>(x1 + (17Lx2) + (306Lx0))]; tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0); } out_ptr1[static_cast<long>(x0 + (39Lx1))] = tmp_acc0; } } } ``` After fix ```c++ #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L)) { { #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())}) float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17Lx2) + (306Lx0))); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0); } { __at_align__ float tmpbuf[16sizeof(float)/sizeof(float)]; tmp_acc0_vec.store(tmpbuf); for (long x1_inner = 0; x1_inner < 16; x1_inner++) out_ptr1[static_cast<long>(x0 + (39Lx1) + (39Lx1_inner))] = tmpbuf[x1_inner]; } } } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L)) { { float tmp_acc0 = -std::numeric_limits<float>::infinity(); for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L)) { auto tmp0 = in_ptr1[static_cast<long>(x1 + (17Lx2) + (306Lx0))]; tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0); } out_ptr1[static_cast<long>(x0 + (39Lx1))] = tmp_acc0; } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/113261 Approved by: https://github.com/lezcano	2023-11-15 03:27:17 +00:00
Aaron Gokaslan	b7b2178204	[BE]: Remove useless lambdas (#113602 ) Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602 Approved by: https://github.com/albanD	2023-11-14 20:06:48 +00:00
Jez Ng	5e03af8295	[inductor] Enable floor_div indexing to work under ABI-compat mode (#113276 ) Previously, floor_div operations were defined in ATen/native/BinaryOps.h. Since this header was not included under ABI-compat mode, trying to use those indexing operations would result in compilation errors. Technically, it is safe to use aten::native::floor_div_* functions in ABI-compat mode as they are header-only; we could simply include BinaryOps.h. However, there are other declarations in BinaryOps.h that are not binary-compatible, so this is not ideal. Thus, I have moved those functions into a separate file, and put them under c10/util, since they don't really have tensor-specific logic. c10 functions are not all header-only, so this still isn't ideal, but this still seems like an improvement. Moreover, cpp_prefix.h -- used when compiling cpp kernels -- already includes c10 header files, so ABI-compatibility already depends on maintaining some c10 functions as header-only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113276 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-11-11 02:51:29 +00:00
Jiong Gong	cb48f7855a	[inductor cpu] fix uint8 add and sub (#113253 ) Fix https://github.com/pytorch/pytorch/issues/113016 and https://github.com/pytorch/pytorch/issues/113020 and https://github.com/pytorch/pytorch/issues/113141 and https://github.com/pytorch/pytorch/issues/113143 and https://github.com/pytorch/pytorch/issues/113144 Explicit typecast result of add/sub to uint8 (similar to how we fixed mul previously) to avoid implicit type promotion from C. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113253 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-11-10 04:06:42 +00:00
Jiong Gong	8c704f7a0e	[inductor cpp] fix argmax with >1 reduction dims (#113168 ) Fix #113013. The argmax (and argmin) implementation doesn't handle the index compute properly when the number of reduction dims is larger than 1. It wrongly assumed only one reduction dim. With the given reproducer, the generated code before the change: ```c++ #include "/tmp/torchinductor_jgong5/tb/ctbgktuhgnnlel6ipqkfk76lfztr5pledachdkcq3asdqtlxpzt6.h" extern "C" void kernel(const double* in_ptr0, long* out_ptr0) { { { struct IndexValue_1 {size_t index; double value;}; IndexValue_1 tmp_acc0{0, -std::numeric_limits<double>::infinity()}; #if !defined(__clang_major__) \|\| __clang_major__ > 9 #pragma omp declare reduction(argmax : IndexValue_1 :\ omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\ omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\ initializer(omp_priv = {0, -std::numeric_limits<double>::infinity()}) #endif for(long x0=static_cast<long>(0L); x0<static_cast<long>(9L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(2L); x1+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(0); auto tmp1 = c10::convert<long>(1); auto tmp2 = tmp0 < tmp1; auto tmp3 = c10::convert<long>(at::native::div_floor_integer((3Lx1), 2L)); auto tmp4 = c10::convert<long>(2L + (at::native::div_floor_integer((3Lx1), 2L))); auto tmp5 = tmp3 < tmp4; auto tmp6 = tmp2 & tmp5; auto tmp7 = [&] { auto tmp8 = in_ptr0[static_cast<long>((3Lx0) + (at::native::div_floor_integer((3Lx1), 2L)))]; return tmp8; } ; auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0); auto tmp10 = c10::convert<long>(1L + (at::native::div_floor_integer((3Lx1), 2L))); auto tmp11 = tmp10 < tmp4; auto tmp12 = tmp2 & tmp11; auto tmp13 = [&] { auto tmp14 = in_ptr0[static_cast<long>(1L + (3Lx0) + (at::native::div_floor_integer((3Lx1), 2L)))]; return tmp14; } ; auto tmp15 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0); auto tmp16 = tmp15 + tmp9; auto tmp17 = [&] { auto tmp18 = c10::convert<double>(1.0); return tmp18; } ; auto tmp19 = tmp6 ? tmp17() : static_cast<decltype(tmp17())>(0.0); auto tmp20 = [&] { auto tmp21 = c10::convert<double>(1.0); return tmp21; } ; auto tmp22 = tmp12 ? tmp20() : static_cast<decltype(tmp20())>(0.0); auto tmp23 = tmp22 + tmp19; auto tmp24 = tmp16 / tmp23; if (tmp_acc0.value < tmp24) { tmp_acc0.index = x1; tmp_acc0.value = tmp24; // both x0 and x1 are reduction vars while only x1 is assigned to tmp_acc0.index } } } out_ptr0[static_cast<long>(0L)] = tmp_acc0.index; } } } ``` After fix: ```c++ #include "/tmp/torchinductor_jgong5/tb/ctbgktuhgnnlel6ipqkfk76lfztr5pledachdkcq3asdqtlxpzt6.h" extern "C" void kernel(const double in_ptr0, long* out_ptr0) { { { struct IndexValue_1 {size_t index; double value;}; IndexValue_1 tmp_acc0{0, -std::numeric_limits<double>::infinity()}; #if !defined(__clang_major__) \|\| __clang_major__ > 9 #pragma omp declare reduction(argmax : IndexValue_1 :\ omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\ omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\ initializer(omp_priv = {0, -std::numeric_limits<double>::infinity()}) #endif for(long x0=static_cast<long>(0L); x0<static_cast<long>(9L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(2L); x1+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(0); auto tmp1 = c10::convert<long>(1); auto tmp2 = tmp0 < tmp1; auto tmp3 = c10::convert<long>(at::native::div_floor_integer((3Lx1), 2L)); auto tmp4 = c10::convert<long>(2L + (at::native::div_floor_integer((3Lx1), 2L))); auto tmp5 = tmp3 < tmp4; auto tmp6 = tmp2 & tmp5; auto tmp7 = [&] { auto tmp8 = in_ptr0[static_cast<long>((3Lx0) + (at::native::div_floor_integer((3Lx1), 2L)))]; return tmp8; } ; auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0); auto tmp10 = c10::convert<long>(1L + (at::native::div_floor_integer((3Lx1), 2L))); auto tmp11 = tmp10 < tmp4; auto tmp12 = tmp2 & tmp11; auto tmp13 = [&] { auto tmp14 = in_ptr0[static_cast<long>(1L + (3Lx0) + (at::native::div_floor_integer((3Lx1), 2L)))]; return tmp14; } ; auto tmp15 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0); auto tmp16 = tmp15 + tmp9; auto tmp17 = [&] { auto tmp18 = c10::convert<double>(1.0); return tmp18; } ; auto tmp19 = tmp6 ? tmp17() : static_cast<decltype(tmp17())>(0.0); auto tmp20 = [&] { auto tmp21 = c10::convert<double>(1.0); return tmp21; } ; auto tmp22 = tmp12 ? tmp20() : static_cast<decltype(tmp20())>(0.0); auto tmp23 = tmp22 + tmp19; auto tmp24 = tmp16 / tmp23; if (tmp_acc0.value < tmp24) { tmp_acc0.index = static_cast<long>(x1 + (2Lx0)); tmp_acc0.value = tmp24; } } } out_ptr0[static_cast<long>(0L)] = tmp_acc0.index; } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/113168 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-11-09 11:47:51 +00:00
Jez Ng	297c26bb8e	Support fp8 in AOTInductor + support optional<> in C ABI (#112527 ) This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358 It turns out that we need to add support for optional types in order to support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface can't support optional<> directly, I am passing in optional types via pointer instead. `AtenTensorHandle`s are already pointers, so nothing needs to change there. Only value types need to change. We decided on this approach instead of adding an extra `bool` param to the callee because this simplifies things. Having the same number of arguments regardless of whether we are emitting Python / C++ / ABI-compatible C++ makes codegen easier. There are a number of existing ABI-compatible functions that have optional-typed value parameters. Previously, they just assumed they would never be passed a `nullopt` / `None` at runtime. Changing them to use pointer types now would break ABI stability, so I have created an exclude list for those functions. Finally, I think the current implementation is kind of messy, and only works for FallbackKernels, even though technically ExternKernels could also have the same issue. It also doesn't support optional types nested in lists. I've left FIXME comments for both issues. Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-11-08 22:56:48 +00:00
Jez Ng	dc63248b76	Make dynamo configs more amenable to static type checking (#112130 ) `install_config_module` makes a regular module into a ConfigModule with extra methods defined on it. mypy thinks those extra methods (or module functions) are undefined since it cannot analyze something so dynamic. As a workaround, I've created a fake module that defines these extra functions, which I import into the config modules during type checking. As part of this change, I've also added more types to config_utils.py and enabled typechecking for torch/_dynamo/config.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130 Approved by: https://github.com/jansel	2023-11-08 21:17:45 +00:00
Aaron Gokaslan	8219bf051b	[BE]: Apply RUF015 to torch folder (#113025 ) Removes unnecessary allocations of iterators. There is a small chance this may have side effects as the entire iterator is no longer consumed, but this is a way more efficient method for retrieving the first element. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-11-07 00:48:15 +00:00
Peter Bell	718035791d	Prefer `e.is_number` over `not e.free_symbols` in SymPy (#112688 ) We spend somewhere on the order 1% in `sympy.Expr.free_symbols` as it is called millions of times. Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is horribly slow. It turns out though that there is another propery `is_number` that does what we want. > property is_number: > > Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster > than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined > function. Even further, we also avoid the overhead of building the unnecessary set object. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688 Approved by: https://github.com/lezcano	2023-11-06 20:05:13 +00:00
Jiong Gong	e061144aaf	[inductor] replace ops.div with ops.truediv (#112243 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112243 Approved by: https://github.com/lezcano ghstack dependencies: #112234	2023-11-01 05:50:51 +00:00
Shunting Zhang	fbafff3668	[reland][inductor] benchmark fusion (#112450 ) reland https://github.com/pytorch/pytorch/pull/108193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112450 Approved by: https://github.com/jansel	2023-10-31 18:17:06 +00:00
Jiong Gong	a1c56df1f0	[inductor cpp] vectorize support for truediv (#112234 ) Ops like group_norm has `ops.truediv` that doesn't have vectorization support yet. This PR adds the support. `test_group_norm_vec` Before: ```c++ extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(64) { { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L)) { { #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()}) #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()}) Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024Lx0))); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0); } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2); } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x2 + (1024Lx1) + (32768Lx0))]; auto tmp1 = out_ptr0[static_cast<long>(x1 + (32Lx0))]; auto tmp3 = out_ptr1[static_cast<long>(x1 + (32Lx0))]; auto tmp10 = in_ptr1[static_cast<long>(x1)]; auto tmp12 = in_ptr2[static_cast<long>(x1)]; auto tmp2 = tmp0 - tmp1; auto tmp4 = c10::convert<float>(1024.0); auto tmp5 = tmp3 / tmp4; auto tmp6 = c10::convert<float>(1e-05); auto tmp7 = tmp5 + tmp6; auto tmp8 = 1 / std::sqrt(tmp7); auto tmp9 = decltype(tmp2)(tmp2 tmp8); auto tmp11 = decltype(tmp9)(tmp9 * tmp10); auto tmp13 = tmp11 + tmp12; out_ptr2[static_cast<long>(x2 + (1024Lx1) + (32768Lx0))] = tmp13; } } } } } } ``` After: ```c++ extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(64) { { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L)) { { #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()}) #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()}) Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024Lx0))); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0); } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2); } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (1024Lx1) + (32768Lx0))); auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(out_ptr0[static_cast<long>(x1 + (32Lx0))])); auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(out_ptr1[static_cast<long>(x1 + (32Lx0))])); auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(in_ptr1[static_cast<long>(x1)])); auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(in_ptr2[static_cast<long>(x1)])); auto tmp2 = tmp0 - tmp1; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1024.0)); auto tmp5 = tmp3 / tmp4; auto tmp6 = at::vec::Vectorized<float>(static_cast<float>(1e-05)); auto tmp7 = tmp5 + tmp6; auto tmp8 = tmp7.rsqrt(); auto tmp9 = tmp2 tmp8; auto tmp11 = tmp9 * tmp10; auto tmp13 = tmp11 + tmp12; tmp13.store(out_ptr2 + static_cast<long>(x2 + (1024Lx1) + (32768Lx0))); } } } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112234 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-10-31 17:15:21 +00:00
PyTorch MergeBot	fc0b0820fc	Revert "Readded device_assert skipping in index and index_put (and also added (#112093 )" This reverts commit `b110d87ac2`. Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/ZainRizvi due to Stack breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1785922905))	2023-10-30 19:45:41 +00:00
chilli	b110d87ac2	Readded device_assert skipping in index and index_put (and also added (#112093 ) copy to noop pass) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093 Approved by: https://github.com/oulgen, https://github.com/lezcano	2023-10-27 18:23:49 +00:00
PyTorch MergeBot	64fd027f2e	Revert "[inductor] benchmark fusion (#108193 )" This reverts commit `73cc5d1cdd`. Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1782157638))	2023-10-27 01:40:06 +00:00
PyTorch MergeBot	0a3199dd7e	Revert "Readded device_assert skipping in index and index_put (and also added (#112093 )" This reverts commit `e38347f490`. Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/izaitsevfb due to Sorry, trying to resolve a conflict with intern, and unblock the revert of #108690 ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1782154814))	2023-10-27 01:37:33 +00:00
Shunting Zhang	73cc5d1cdd	[inductor] benchmark fusion (#108193 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193 Approved by: https://github.com/jansel	2023-10-26 22:18:37 +00:00
PyTorch MergeBot	485cc0faae	Revert "[inductor] benchmark fusion (#108193 )" This reverts commit `ec0cdcdf6a`. Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1781473282))	2023-10-26 16:41:20 +00:00

1 2 3 4 5

235 Commits