This PR decouples the logic necessary to compute bounds on variables
from the logic that uses this information to perform the strength analysis on
int64 variables. While doing so, it tries to minimize the number of
class attributes in favour of local variables.
This class is now accessible from any `LoopBody` object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100549
Approved by: https://github.com/eellison
Fix https://github.com/pytorch/pytorch/issues/100830.
For an inplace node, a `copy_` is generated, and the `copy_` is `realized` as a `scheduler buffer` since it is a mutation. This `scheduler buffer` is a memory copy, but after fusing with the previous buffer it is no longer a memory-copy-only buffer.
This PR solves the issue by removing `load_bf16_as_fp32` and `store_bf16_from_fp32`. Instead, it enables fp32/bf16 vec conversion in `to_dtype`, so we always store bf16.
```python
import torch
import torch.nn as nn
torch.manual_seed(420)
from torch._inductor import config

x = torch.randn(1, 18, dtype=torch.bfloat16)

class ExampleModel(nn.Module):
    def __init__(self):
        super(ExampleModel, self).__init__()
        self.relu = nn.ReLU(inplace=True)  # nn.ReLU(inplace=False)

    def forward(self, input1):
        out = self.relu(input1)
        # input1.copy_(out)
        return out

func = ExampleModel()

with torch.no_grad():
    func.train(False)
    res1 = func(x)  # without jit
    print(res1)
    jit_func = torch.compile(func)
    res2 = jit_func(x)
    print(res2)
```
Generated code without this PR (the `tmp3` store is wrong: `tmp3` is `float` while `out_ptr1` is `bf16`):
```
auto tmp0 = load_bf16_as_float(out_ptr1 + static_cast<long>(i0));
auto tmp1 = (tmp0);
auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
auto tmp3 = (tmp2);
store_float_as_bf16(out_ptr0 + static_cast<long>(i0), tmp3);
tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```
Generated code with this PR:
```
auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(out_ptr1 + static_cast<long>(i0), 16);
auto tmp1 = cvt_bf16_to_fp32(tmp0);
auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
auto tmp3 = cvt_fp32_to_bf16(tmp2);
tmp3.store(out_ptr0 + static_cast<long>(i0), 16);
tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```
This PR also fixes the data type propagation for `masked_subblock`.
Previously, the masked_subblock's dtype was propagated from its inputs, which is wrong.
```
opcode       name              target            args            kwargs
-----------  ----------------  ----------------  --------------  --------
call_module  masked_subblock1  masked_subblock1  (and__2, -inf)
```
Now we propagate it from the subblock with the same name:
```
# graph for body.subblocks['masked_subblock1']
opcode       name       target     args                        kwargs
-----------  ---------  ---------  --------------------------  --------
placeholder  ops        ops        ()                          {}
call_module  get_index  get_index  ('index2',)                 {}
call_method  load       load       (ops, 'arg0_1', get_index)  {}
call_method  to_dtype   to_dtype   (ops, load, torch.float32)  {}
output       output     output     (to_dtype,)                 {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101042
Approved by: https://github.com/jgong5, https://github.com/jansel
Currently, if we have an inplaced buffer that is completely internal to a fused kernel and thus doesn't need to be allocated, we still allocate it and pass an unused argument to the kernel, because our buffer-removal analysis treats it separately (assuming that either the original or the mutated value is still needed).
This PR extends buffer removal to inplaced buffers that can be removed.
Generated kernel for e.g. ln changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where in_out_ptr0 is unused in the kernel to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and corresponding allocation/reuse lines in the wrapper are removed.
`in_out_ptr1` is also mislabeled: it's not really `in_out`, it is only written to, but this PR doesn't fix that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel
## Issue description
The PR https://github.com/pytorch/pytorch/pull/100064 introduces a new RNG operation process. However, it causes every `randint` to load a separate random seed by default. TorchInductor generates a buffer to store all necessary random seeds and places the offsets as constant values in the subsequent compute buffers. In the `ir_pre_fusion` output generated by TorchInductor, some buffers differ by only one line: the load of the random seed with the corresponding offset. The codegen then generates Triton kernels following the same rule. Finally, in `output_code.py`, some Triton kernels differ by only one line, meaning redundant kernels are being generated.
## Solution
This PR captures the seed offset and adds it to the existing `self.sizevars` structure. It generates variable names as placeholders, allowing the code wrapper to pass the offset as an argument to the kernels. I've also modified the divisible_by_16 check to exclude this argument.
This PR reduces the number of generated kernels from 50 to 17 for BertForMaskedLM forward.
According to tests on my own environment, the compilation time of attention_is_all_you_need_pytorch has been reduced from 94s to 66s. The speedup remains largely unchanged, at 1.37X.
The following is a comparison for a simple example.
Before:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + 0)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
triton_poi_fused_1 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + 1)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
...''')
def call(args):
triton_poi_fused_0.run(buf0, buf1, 1024, grid=grid(1024), stream=stream0)
triton_poi_fused_1.run(buf0, buf2, 1024, grid=grid(1024), stream=stream0)
```
After:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + load_seed_offset)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
....
def call(args):
triton_poi_fused_0.run(buf0, buf1, 0, 1024, grid=grid(1024), stream=stream0)
triton_poi_fused_0.run(buf0, buf2, 1, 1024, grid=grid(1024), stream=stream0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102104
Approved by: https://github.com/jansel, https://github.com/ngimel
This wraps `ops` into an `OpsWrapper` object which wraps any returned
IR values into an `OpsValue` instance. This allows magic methods to
be implemented and means lowerings can write mathematical expressions much more
fluently. So instead of
```python
ops.add(ops.mul(ops.mul(ops.sub(ops.mul(_Ap2, x), _Ap3), x), x), _1)
```
we can write
```python
(_Ap2 * x - _Ap3) * x * x + _1
```
And it will translate to the equivalent `ops` calls.
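A minimal, self-contained sketch of the mechanism (simplified, with an illustrative stand-in handler rather than the actual inductor implementation): `OpsWrapper` wraps the handler so every returned IR value becomes an `OpsValue`, and `OpsValue`'s magic methods turn Python operators back into `ops.*` calls.
```python
class OpsValue:
    """Wraps an IR value so Python operators forward to ops.* calls."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return ops.add(self, other)

    def __sub__(self, other):
        return ops.sub(self, other)

    def __mul__(self, other):
        return ops.mul(self, other)


class OpsWrapper:
    """Wraps a handler so every returned IR value becomes an OpsValue."""
    def __init__(self, handler):
        self._handler = handler

    def __getattr__(self, name):
        def inner(*args):
            # Unwrap OpsValue arguments, call the real handler, re-wrap the result.
            unwrapped = [a.value if isinstance(a, OpsValue) else a for a in args]
            return OpsValue(getattr(self._handler, name)(*unwrapped))
        return inner


class StringHandler:
    """Illustrative stand-in handler that formats each op as a string."""
    def __getattr__(self, name):
        return lambda *args: f"{name}({', '.join(map(str, args))})"


ops = OpsWrapper(StringHandler())

_Ap2, _Ap3, _1, x = map(OpsValue, ["_Ap2", "_Ap3", "_1", "x"])
print(((_Ap2 * x - _Ap3) * x * x + _1).value)
# add(mul(mul(sub(mul(_Ap2, x), _Ap3), x), x), _1)
```
Running the snippet prints the same nested call string as the explicit `ops.add(...)` version above.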
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101076
Approved by: https://github.com/lezcano, https://github.com/ngimel
Fixes #100831, fixes #100878
Previously `gen_assert_indirect_indexing` was only called on the index
expressions passed to `ops.load` and `ops.store`, which means that if the
variable is optimized out during lowering, we never generate the
assert. This PR instead makes `ops.indirect_indexing` eagerly generate
the assert statement, whether or not the result will be used.
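A rough, hedged sketch of the behaviour (the `KernelSketch` class and its methods are illustrative stand-ins, not inductor's real codegen classes): emitting the assert inside `indirect_indexing` means it survives even if no load or store ever consumes the index.
```python
class KernelSketch:
    """Illustrative stand-in for the codegen state; collects emitted lines."""
    def __init__(self):
        self.lines = []

    def indirect_indexing(self, var, size):
        # Eager: the bounds assert is emitted here, so it survives even if the
        # index is optimized out during lowering and never reaches a load/store.
        self.lines.append(
            f'tl.device_assert((0 <= {var}) & ({var} < {size}), '
            f'"index out of bounds: 0 <= {var} < {size}")'
        )
        return var

    def load(self, buf, index):
        self.lines.append(f"{buf}_val = tl.load({buf} + ({index}))")
        return f"{buf}_val"


k = KernelSketch()
idx = k.indirect_indexing("tmp0", "ks3")
# Even if this load were later optimized away, the assert above would remain.
k.load("in_ptr1", f"x0 + ks1*ks2*{idx}")
print("\n".join(k.lines))
```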
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100895
Approved by: https://github.com/lezcano, https://github.com/ngimel
**Summary**
Since the current quantization flow has not decomposed quant/dequant into prim ops, in this PR:
- We enable the quant/dequant decomposition as a lowering inside inductor (a sketch of the underlying math follows this list).
- For the `decomposed.quant/dequant.tensor` overloads, scalar tensors for `zero point` and `scale` are loaded, so we need to enable vec codegen for these op overloads.
- Minor change: add `is_load_uint8_as_float` and `is_store_float_as_uint8` with default value `False` to `OptimizationContext`.
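For reference, a sketch of the elementwise math such a lowering boils down to, using standard affine quantization formulas (the helper names below are illustrative, not the `decomposed.*` ops themselves):
```python
import torch

def dequantize_per_tensor_sketch(x_uint8, scale, zero_point):
    # Affine dequantization: (q - zero_point) * scale in fp32.
    return (x_uint8.to(torch.float32) - zero_point) * scale

def quantize_per_tensor_sketch(x_fp32, scale, zero_point, qmin=0, qmax=255):
    # Affine quantization: round(x / scale) + zero_point, clamped to the uint8 range.
    q = torch.round(x_fp32 / scale) + zero_point
    return torch.clamp(q, qmin, qmax).to(torch.uint8)

x = torch.randint(0, 256, (16,), dtype=torch.uint8)
y = quantize_per_tensor_sketch(
    torch.relu(dequantize_per_tensor_sketch(x, scale=0.05, zero_point=128)),
    scale=0.05, zero_point=128,
)
```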
**TestPlan**
```
cd test/inductor && python -m pytest test_cpu_repro.py -k test_dequant_quant_lowering
```
Co-authored with @Xia-Weiwen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99131
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
This PR also adds a way to CSE statements (not only assignments).
The tests follow the pattern from https://github.com/openai/triton/pull/1143
They take a fair amount of time to run (90s on my box). If we wanted to
improve this, we could avoid testing the `ndim == 3` case.
Changes like this one make me hope that we get to clean up the number of
lowerings we have at some point...
Generated code for `x[y]` with `x.shape == (3, 2, 4), y.ndim == 1`:
With `dynamic=False`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < 3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < 3")
tmp1 = tl.load(in_ptr1 + (x0 + (8*tmp0)), xmask)
```
With `dynamic=True`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tmp1 = tl.load(in_ptr1 + (x0 + (ks1*ks2*tmp0)), xmask)
```
Generated code for `x[y+1, y+1]` with `x.shape == (3, 2, 4)`, `y.shape == (3, 3)`:
With `dynamic=False` (note how it folds the two upper bounds to `min(3, 2) == 2`):
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = 1
tmp2 = tmp0 + tmp1
tl.device_assert(((0 <= tmp2) & (tmp2 < 2)) | (~xmask), f"index out of bounds: 0 <= tmp2 < 2")
tmp3 = tl.load(in_ptr1 + (x0 + (12*tmp2)), xmask)
```
With `dynamic=True`:
```python
tl.device_assert(((0 <= tmp2) & (tmp2 < min(ks2, ks1))) | (~xmask), f"index out of bounds: 0 <= tmp2 < min(ks2, ks1)")
```
The same works when the CSE'd variable appears 3 or more times, but then it generates `min(ks0, min(ks1, ks2))`.
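A small illustrative sketch (not the actual inductor code) of how several upper bounds on the same CSE'd index can be folded into the nested `min` expression above, with concrete integers folded eagerly:
```python
def fold_upper_bounds(bounds):
    # Fold the upper bounds right-to-left into a nested min(...) expression,
    # collapsing pairs of concrete integers as we go.
    def fold(a, b):
        if isinstance(a, int) and isinstance(b, int):
            return min(a, b)
        return f"min({a}, {b})"

    result = bounds[-1]
    for b in reversed(bounds[:-1]):
        result = fold(b, result)
    return str(result)

print(fold_upper_bounds([3, 2]))                 # 2
print(fold_upper_bounds(["ks2", "ks1"]))         # min(ks2, ks1)
print(fold_upper_bounds(["ks0", "ks1", "ks2"]))  # min(ks0, min(ks1, ks2))
```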
Generated code for `x[y] = z` with `x.ndim == 3`, `y.ndim == 1` and dynamic shapes:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = tl.load(in_ptr1 + (x2), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tl.store(out_ptr0 + (x0 + (ks1*ks2*tmp0) + tl.zeros([XBLOCK], tl.int32)), tmp1, xmask)
```
Fixes https://github.com/pytorch/pytorch/issues/93538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98590
Approved by: https://github.com/ngimel
This makes only a cosmetic change to the generated code, but means
triton's broadcasting logic doesn't leak out into the CSE class.
Before:
```python
tmp5_load = tl.load(in_ptr1 + (0))
tmp5 = tl.broadcast_to(tmp5_load, [XBLOCK])
```
After:
```python
tmp5 = tl.load(in_ptr1 + (0))
tmp6 = tl.broadcast_to(tmp5, [XBLOCK])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98304
Approved by: https://github.com/ngimel
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.
The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.
* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.
Fixes https://github.com/pytorch/pytorch/issues/95469
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
This generates compilable code for maskrcnn graph 13, with ceilings hoisted to be computed on the host. But it now fails with
```
File "/scratch/ngimel/work/pytorch/torch/_dynamo/symbolic_convert.py", line 379, in wrapper
self.output.compile_subgraph(self, reason=reason)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/output_graph.py", line 562, in compile_subgraph
pass1.foreach(stack_values)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 166, in foreach
self(i)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 148, in __call__
output.extend(value.reconstruct(self))
File "/scratch/ngimel/work/pytorch/torch/_dynamo/variables/dicts.py", line 40, in reconstruct
codegen.create_load_python_module(collections),
TypeError: create_load_python_module() missing 1 required positional argument: 'push_null'
from user code:
File "/scratch/ngimel/work/env/lib/python3.9/site-packages/torchvision-0.15.0a0+928b05c-py3.9-linux-x86_64.egg/torchvision/models/detection/backbone_utils.py", line 58, in forward
x = self.fpn(x)
```
It looks like we never exercise this `create_load_python_module()` path for other subgraphs.
Any advice on how to fix this, @voznesenskym @jansel?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95690
Approved by: https://github.com/jansel
This PR is a new version of #89566, fixing a test failure.
I couldn't get ghstack to cooperate on updating that PR after re-opening,
so I started a new one.
This changes the way masks for loads/stores are computed in the triton backend of inductor.
The new approach is to iterate over all variables used in an indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when the variable is created.
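A hedged sketch of the idea (the helper name and structure are illustrative, not the actual inductor code): derive the mask set from the symbols appearing in the indexing expression, and fall back to per-variable mask sets for indirect indices.
```python
import sympy

def mask_vars_for(index_expr, indirect_masks=None):
    # Collect the mask variables implied by an indexing expression: x0 -> xmask,
    # y1 -> ymask, r3 -> rmask; indirect indices (e.g. tmp5) contribute the mask
    # set recorded when the corresponding CSE variable was created.
    masks = set()
    for sym in index_expr.free_symbols:
        name = sym.name
        if name[0] in "xyr" and name[1:].isdigit():
            masks.add(f"{name[0]}mask")
        elif indirect_masks and name in indirect_masks:
            masks |= indirect_masks[name]
    return masks

x0, r3, tmp5 = sympy.symbols("x0 r3 tmp5")
print(mask_vars_for(x0 + 16 * r3))                        # {'xmask', 'rmask'}
print(mask_vars_for(x0 + 8 * tmp5, {"tmp5": {"xmask"}}))  # {'xmask'}
```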
I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which was meanwhile fixed by #89524 with a different approach, making this change less necessary. Note, however, that #89524 fixes the issue by broadcasting the indices being loaded to a larger size, while this approach fixes it by making the mask contain only the necessary terms.
Relative to #89566, the only change is to not include the mask variables
of arguments when the function being called is `tl.where`. The reason
being that `tl.where` is often used precisely to make sure the output
variable has valid values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91241
Approved by: https://github.com/ngimel
Currently the default `ops` handler expects strings as arguments and
just formats them into a function call template string. For complex
expressions, this can lead to exponential growth in the number of terms. Say, for
example, you have:
```python
def fn(a):
    for _ in range(3):
        a = ops.mul(a, a)
    return a
```
You might expect `inner_fn_str` to contain 1 load and 3 multiplies,
but instead you find 8 loads and 7 multiplies:
```python
load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0)
```
This type of blowup is present in the lowering for
`max_pool2d_with_indices_backward`, which in pytorch/torchdynamo#1352
was reported to have caused the entire compilation to hang.
This PR fixes the issue by formatting the string as a series of assignments to
variables, so for the example above, we now get:
```
tmp0 = load(arg_0, i0)
tmp1 = tmp0 * tmp0
tmp2 = tmp1 * tmp1
tmp3 = tmp2 * tmp2
return tmp3
```
This corresponds to the sequence of `ops` calls made.
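A self-contained sketch of the difference (simplified stand-ins, not the actual inductor handlers): nesting formatted strings duplicates shared subexpressions, while assigning each `ops` result to a temporary keeps the output linear in the number of ops.
```python
import itertools

class NestedStringOps:
    """Stand-in for the old behaviour: ops format directly into nested strings."""
    def load(self, buf, idx):
        return f"load({buf}, {idx})"
    def mul(self, a, b):
        return f"{a} * {b}"

class AssigningOps:
    """Stand-in for the new behaviour: every op result is assigned to a temp."""
    def __init__(self):
        self.lines = []
        self._counter = itertools.count()
    def _assign(self, expr):
        var = f"tmp{next(self._counter)}"
        self.lines.append(f"{var} = {expr}")
        return var
    def load(self, buf, idx):
        return self._assign(f"load({buf}, {idx})")
    def mul(self, a, b):
        return self._assign(f"{a} * {b}")

def fn(ops, a):
    # Same computation as the example above, with the handler passed explicitly.
    for _ in range(3):
        a = ops.mul(a, a)
    return a

nested = NestedStringOps()
print(fn(nested, nested.load("arg_0", "i0")))   # 8 loads, 7 multiplies

assigning = AssigningOps()
result = fn(assigning, assigning.load("arg_0", "i0"))
print("\n".join(assigning.lines) + f"\nreturn {result}")  # 1 load, 3 multiplies
```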
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88933
Approved by: https://github.com/jansel
For reductions, the code strings in the codegen stage and the execution stage differ because of `\` line continuations.
- The code string obtained from `code.getvalue()` (`code` is an `IndentedBuffer`) in the codegen stage:
```
#pragma omp declare reduction(argmax : struct IndexValue_1 :\
omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
```
- The code string loaded during execution (the `\` will be escaped):
```
#pragma omp declare reduction(argmax : struct IndexValue_1 : omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value, omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index) initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
```
Thus we can't get the same hash value for these two pieces of code.
This PR adds a function that applies the same backslash-escaping transformation in the codegen stage, so both stages produce the same string (and hash).
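A hedged sketch of the idea (the function name and the exact normalization below are illustrative, not the helper added in this PR): normalize the backslash line continuations on the codegen-side string so both stages hash the same text.
```python
import hashlib
import re

def normalize_line_continuations(code: str) -> str:
    # Hypothetical helper: collapse a backslash-newline continuation (and the
    # following indentation) into a single space, mimicking how the pragma
    # reads back as one line at execution time.
    return re.sub(r"\\\n\s*", " ", code)

codegen_str = (
    "#pragma omp declare reduction(argmax : struct IndexValue_1 :\\\n"
    "    omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value)"
)
executed_str = (
    "#pragma omp declare reduction(argmax : struct IndexValue_1 : "
    "omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value)"
)

h1 = hashlib.sha256(normalize_line_continuations(codegen_str).encode()).hexdigest()
h2 = hashlib.sha256(executed_str.encode()).hexdigest()
assert h1 == h2  # after normalization both stages hash to the same value
```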
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88561
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire