Commit Graph

419 Commits

Author SHA1 Message Date
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod constructor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
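
For illustration (hypothetical snippet, not from the PR diff), this is the kind of rewrite RUF025 suggests:

```python
keys = ["a", "b", "c"]

# Before: a dict comprehension assigning the same constant to every key
d1 = {k: 0 for k in keys}

# After: dict.fromkeys makes the shared static value explicit and is faster
d2 = dict.fromkeys(keys, 0)
assert d1 == d2

# The pitfall the message alludes to: a mutable default is the *same* object
# for every key, which may be a bug if unintentional.
d3 = dict.fromkeys(keys, [])
d3["a"].append(1)
print(d3)  # {'a': [1], 'b': [1], 'c': [1]}
```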

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Pearu Peterson
2327879fb6 Add lowering to special.bessel_j0 (2nd try) (#118565)
This PR is a copy of https://github.com/pytorch/pytorch/pull/118464 that was merged without using pytorchbot. Sorry for the noise!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118565
Approved by: https://github.com/peterbell10
2024-01-30 15:26:59 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

# Read the mypy error report
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the file path, line number and error code out of each error line
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Append "# type: ignore[...]" comments to the offending source lines
# (processed in reverse line order)
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
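
For reference, a hypothetical example (file path and message invented) of the kind of mypy output line the script parses:

```python
import re

line = 'torch/_dynamo/utils.py:123:5: error: Name "result" may be undefined  [possibly-undefined]'
match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", line)
print(match.groups())  # ('torch/_dynamo/utils.py', '123', 'possibly-undefined')
```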

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Edward Z. Yang
cad79bd0bb Remove follow_imports = skip from sympy (#118469)
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.

The suppressions were added automatically with the following script generated by GPT-4:

```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
2024-01-28 13:38:38 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.
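
A generic example (not from this PR) of the kind of code whose checking changes under this flag:

```python
from typing import Optional

class Cache:
    # With local_partial_types (implied by dmypy), a None-initialized attribute
    # must be annotated here; mypy will no longer infer its type from an
    # assignment made in a different scope such as __init__.
    hits = None                    # error: needs e.g. an Optional[int] annotation
    misses: Optional[int] = None   # fine: the partial type is resolved immediately

    def __init__(self) -> None:
        self.hits = 0
```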

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Edward Z. Yang
d03173e88c Unify MYPYINDUCTOR and MYPY (#118432)
The original motivation for MYPYINDUCTOR was a faster type checking configuration that only checked a subset of files. With the removal of `follow_imports = ignore`, we are now able to use dmypy to do fast incremental typechecking, eliminating the need for this.

Perhaps erroneously, when I teed up this PR I elected to delete the `follow_imports = skip` designations in the mypy-inductor.ini. This led to a number of extra type error suppressions that I manually edited. You will need to review.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118432
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418
2024-01-27 17:23:20 +00:00
eellison
b95c45fbf7 add stack trace to device skip (#118112)
Log the stack trace of the offending CPU use if it causes cudagraphs to be disabled. Also refactors `disable_cudagraphs: bool`, and `disable_cudagraphs_reason: str` -> `Optional[str]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118112
Approved by: https://github.com/bdhirsh
2024-01-26 22:33:48 +00:00
Peter Bell
f129e3fe03 [inductor] Handle cum{sum,prod} on zero-dim tensors (#117990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117990
Approved by: https://github.com/lezcano
2024-01-26 22:21:42 +00:00
Edward Z. Yang
25f72194e8 Realize inputs to DynamicScalar before unwrapping storage (#118125)
Fixes https://github.com/pytorch/pytorch/issues/118102

Unfortunately, the test still fails due to an unrelated problem https://github.com/pytorch/pytorch/issues/117665

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118125
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #117862
2024-01-26 18:08:03 +00:00
leslie-fang-intel
b66c4eda61 [Inductor] Add Thread Number Checker in scatter_reduce_ fallback for CPP backend (#118278)
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/108220, which improved the performance of `basic_gnn_gin`, `basic_gnn_sage` and `basic_gnn_gcn` in the multi-thread test cases. However, it caused a performance regression for these 3 models in the single-thread test case, as reported in https://github.com/pytorch/pytorch/issues/117740. This PR fixes the single-thread issue by adding a thread-number check to decide whether or not to fall back for `scatter_reduce_` (see the sketch below).
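
A rough sketch of the idea (hypothetical helper, not the PR's actual code): fall back to the atomic-add `scatter_reduce_` path only when more than one thread is in use.

```python
import torch

def should_fallback_scatter_reduce() -> bool:
    # Use the ATen scatter_reduce_ (atomic-add) fallback only in the
    # multi-threaded case; single-threaded runs are faster without it.
    return torch.get_num_threads() > 1
```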

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_scatter_using_atomic_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118278
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-01-26 12:43:25 +00:00
Michael Lazos
aaae2d8bb6 Add compilable and capturable foreach adamax with tests (#117835)
Based off of https://github.com/pytorch/pytorch/pull/110345

Fixes https://github.com/pytorch/pytorch/issues/117812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117835
Approved by: https://github.com/janeyx99
2024-01-20 05:29:05 +00:00
vfdev-5
d0fc268918 Fixed issue in upsample_nearestnd lowering with scales (#117538)
Fixed #116848

Related to the bug introduced in my previous PR here: https://github.com/pytorch/pytorch/pull/113749/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R3264

Originally, the code was
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales):
        if scale:
            scales[i] = scale
```
which is wrong, as `scales_x` is not used even though it can be provided by the user. The code worked in cases where the user-provided scale value can be recovered from the `input / output` sizes, e.g. scale=2.0. However, it would fail if the input scale is a float value such as 2.3: in that case the recomputed scale is slightly different (e.g. 2.292682926829268, depending on input and output size) and can lead to an inconsistent output.
This problem was "fixed" to the following in my previous PR: https://github.com/pytorch/pytorch/pull/113749
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = scale
```
however, this leads to a wrong scale value, as the user-provided scale should be inverted (1 / scale) before being used; see the sketch below.
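
A sketch of the presumably intended behavior (not the exact PR diff): honor the user-provided scales, inverting them to match the input/output ratio convention used above.

```python
scales = [i / o for i, o in zip(i_sizes, o_sizes)]
for i, scale in enumerate(scales_x):
    if scale:
        scales[i] = 1.0 / scale  # the user scale is output/input, so invert it
```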

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117538
Approved by: https://github.com/peterbell10
2024-01-17 18:14:35 +00:00
Peter Bell
7a8013fbfa [inductor] Handle more edge cases in slice and slice_scatter (#117377)
Fixes #117110

When slicing we can end up with start and end which are out of bounds, which is
handled in python slicing by clamping to the correct bounds. There is also the
case where end < start which should result in an empty slice.

In the isoneutral_mixing failure we have the second case, with `start=2, end=0`
which in `slice_scatter` became `src_size[dim] = -2`.

This PR improves slice's edge case handling and factors the start and end
normalization code out so it can be shared with slice_scatter.
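
A minimal sketch (hypothetical helper, not the PR's code) of the normalization described above:

```python
def normalize_slice(start: int, end: int, dim_size: int):
    # Mirror Python slicing semantics: wrap negative indices, clamp to the
    # valid range, and collapse end < start into an empty slice.
    if start < 0:
        start += dim_size
    if end < 0:
        end += dim_size
    start = min(max(start, 0), dim_size)
    end = min(max(end, 0), dim_size)
    if end < start:
        end = start  # e.g. start=2, end=0 becomes the empty slice [2:2]
    return start, end
```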

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117377
Approved by: https://github.com/lezcano
2024-01-15 17:05:48 +00:00
Adnan Akhundov
c3e2b94827 Realize non-ReinterpretView Views in custom Triton kernel args (#117468)
Summary: If any of the `TensorBox` arguments of a custom (user-written) Triton kernel in the graph is wrapped into a `BaseView` subclass which is not `ReinterpretView`, this currently conflicts with the cloning (which preserves RVs) and downstream processing (which needs a layout to mark mutation) of the input.

This PR adds conversion of the non-RV views to `ReinterpretView`s by realizing the corresponding inputs to the Triton kernel. As realization happens anyway before the Triton kernel call, this should not affect the perf. But it covers currently missed patterns in the internal models (see the unit test for a repro).

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_slice_and_view_input
...
----------------------------------------------------------------------
Ran 1 test in 3.909s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117468
Approved by: https://github.com/oulgen
2024-01-14 23:31:38 +00:00
rzou
cb42bc705b Make auto_functionalized HOP fallback in inductor (#117084)
It looks like the inductor fallback previously worked with HOPs but no longer
does, so I fixed that:
- all HOPs are exposed under torch.ops.higher_order, so I changed how
  inductor looks them up
- the inductor fallback assumed that an operator's signature was (*args,
  **kwargs). This is true for all the OpOverloads but not HOPs. I
  rewrote the code to not rely on this.

Test Plan:
- existing tests
- new test for auto_functionalized HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117084
Approved by: https://github.com/williamwen42
2024-01-12 17:57:01 +00:00
Sun, Jiayi
9f57cf502f [inductor][cpu]disable pointwise_cat on CPU (#116313)
We observed a negative performance impact of the pointwise_cat optimization on CPU, so we disabled it. We will revisit this later after enabling vectorization on index_expr.

This PR fixes the following three regression issues:
https://github.com/pytorch/pytorch/issues/115827
https://github.com/pytorch/pytorch/issues/112139
https://github.com/pytorch/pytorch/issues/114495

and causes a performance regression of pytorch_unet again. Related issue: https://github.com/pytorch/pytorch/issues/115343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116313
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-11 08:00:00 +00:00
PyTorch MergeBot
1174e82bde Revert "Add _assert_scalar and teach Inductor to codegen it (#114148)"
This reverts commit b6028acfa4.

Reverted https://github.com/pytorch/pytorch/pull/114148 on behalf of https://github.com/osalpekar due to Going to revert this given the broken torchrec PT2 tests internally: [D52648865](https://www.internalfb.com/diff/D52648865). Logs aren't too clear but @dstaay-fb can help debug as well ([comment](https://github.com/pytorch/pytorch/pull/114148#issuecomment-1886100368))
2024-01-11 02:30:22 +00:00
Edward Z. Yang
b6028acfa4 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
2024-01-09 23:21:26 +00:00
Jason Ansel
94363cee41 [inductor] Indexing refactors (#116078)
Perf differences seem to be noise:
![image](https://github.com/pytorch/pytorch/assets/533820/d7a36574-0388-46e4-bd4d-b274d37cab2b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116078
Approved by: https://github.com/aakhundov
2024-01-09 19:06:51 +00:00
Valentine233
20c2ec9a15 [CPU] Add flash attention mask version (#115913)
Add a masked version of flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-07 04:58:23 +00:00
PyTorch MergeBot
2ccc7af028 Revert "[CPU] Add flash attention mask version (#115913)"
This reverts commit 76a3fbb709.

Reverted https://github.com/pytorch/pytorch/pull/115913 on behalf of https://github.com/zou3519 due to broke transformer test on dynamo shard ([comment](https://github.com/pytorch/pytorch/pull/115913#issuecomment-1878043389))
2024-01-05 02:39:12 +00:00
Valentine233
76a3fbb709 [CPU] Add flash attention mask version (#115913)
Add a masked version of flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-05 01:27:36 +00:00
leslie-fang-intel
81cebca3d2 [Inductor] [Quant] Fix QConv Binary Inplace Layout Issue (#115613)
This pull request primarily addresses two issues to resolve the `QConvPointWiseBinaryPT2E` layout problem:

- Following the changes made in 611a7457ca, for `QConvPointWiseBinaryPT2E` with post-op `sum`, we should also utilize `NoneLayout` and return `accum` instead of `QConvPointWiseBinaryPT2E`.

- Additionally, this pull request fixes an issue in the `_quantized_convolution_onednn` implementation. Given that we expect `accum` to be changed in place, we should avoid copying `accum` by changing its memory format or data type inside the kernel implementation. Instead, the necessary memory format or data type changes have been moved to the lowering of `QConvPointWiseBinaryPT2E`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115613
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #116172
2023-12-24 08:04:29 +00:00
Peter Bell
4f4b931aba [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16 precision, which may lead to overflow as the intermediate result is unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
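
A small illustration (not from the PR) of the overflow described above:

```python
import torch

x = torch.full((100_000,), 8.0)     # float32
sum_sq = (x * x).sum()              # 6.4e6: fine in float32 (the opmath type)
print(sum_sq.to(torch.float16))     # inf: exceeds float16's max (~65504)
print(sum_sq / x.numel())           # 64.0: the normalized value would have fit
```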

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-23 01:06:43 +00:00
Aaron Meurer
f08c4da86d Add a decomposition for take() (#114813)
Presumably this can close https://github.com/pytorch/pytorch/pull/109784

Also related to https://github.com/pytorch/pytorch/issues/93757 (though `take` is not listed there).

There's no bounds checking here (out of bounds indices cause a segfault or undefined behavior). Should that be added somehow?
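
For context, a minimal eager-mode sketch of the semantics being decomposed (hypothetical, not the PR's lowering): `take` indexes into the input as if it were flattened to 1-D.

```python
import torch

def take_sketch(x: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    # Index into the flattened input; the result has the same shape as `index`.
    return x.reshape(-1)[index]

x = torch.arange(12).reshape(3, 4)
idx = torch.tensor([0, 5, 11])
assert torch.equal(take_sketch(x, idx), torch.take(x, idx))
```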

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114813
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-22 18:14:57 +00:00
Yifu Wang
718b576e2c Port all_to_all_single to native c10d_functional (#113438)
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
2023-12-22 08:12:13 +00:00
Adnan Akhundov
247f9c3de4 Preserve strides of custom Triton kernel args (#116219)
Summary: Currently, we [`clone`](19207b9183/torch/_inductor/lowering.py (L5273)) every `TensorBox` argument of custom Triton kernels while lowering them to the Inductor IR, during which the stride information of the kernel inputs is lost. This is problematic in the common case when the strides of a `torch.Tensor` argument are passed as scalars to a custom Triton kernel alongside the tensor itself (due to the underlying Triton code interpreting the tensors as raw pointers, so the contained stride semantics of the `torch.Tensor` is lost).

In this PR, we add an extended version of the existing [`clone` lowering](19207b9183/torch/_inductor/lowering.py (L2289))---`clone_preserve_reinterpret_view`---which carries over the `ir.ReinterpretView` layers (if any) from the source `TensorBox` to the cloned one. The rationale behind adding a new function (and switching to it in the `triton_kernel_wrap` only for now) as opposed to extending the existing `clone` is keeping the semantics of the latter untouched, as it is a lowering of `torch.clone` (albeit incomplete, as the `memory_format` is currently ignored). Changing the existing `clone` would change the semantics, which is not necessarily desirable in general. Open to suggestions, though.

Test Plan:

```
$ python test/dynamo/test_functions.py -k test_triton_kernel_strided_input
...
----------------------------------------------------------------------
Ran 1 test in 5.568s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116219
Approved by: https://github.com/jansel
2023-12-21 22:46:32 +00:00
vfdev-5
b72127cd4b [inductor] Support sym exprs in lowering constant promotion (#116196)
Follow-up to https://github.com/pytorch/pytorch/pull/115920

This PR fixes the error with symbolic expression in aten.div:
```python
import torch
aten = torch.ops.aten

def func(x, a):
    return aten.div(x * 0.5, a, rounding_mode=None)

cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cpu"
x = 124
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message:
```
  File "/pytorch/torch/_inductor/graph.py", line 700, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4823, in div_mode
    return div(a, b)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4857, in div
    a, b = promote_constants(
  File "/pytorch/torch/_inductor/lowering.py", line 368, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: StopIteration:
  target: aten.div.Tensor_mode
  args[0]: 1.0*s0
  args[1]: s1
  kwargs: {'rounding_mode': None}

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116196
Approved by: https://github.com/peterbell10
2023-12-20 21:59:51 +00:00
PyTorch MergeBot
c215e59bf2 Revert "[inductor] Avoid bool being upcast to int (#109913)"
This reverts commit 92998693a9.

Reverted https://github.com/pytorch/pytorch/pull/109913 on behalf of https://github.com/jeanschmidt due to causing performance regression in relevant metrics, @malfet I believe you are the correct person to help identify and fix the issues. More details check internal OPS count for ads metricsnin the internal related diff ([comment](https://github.com/pytorch/pytorch/pull/109913#issuecomment-1864397407))
2023-12-20 12:33:50 +00:00
Elias Ellison
9a2a44457a SDPA extend backward realized tensor alignment checking to forward realized tensors (#116069)
The logic to check alignment for realized tensors in the backward can be extended for realized tensors in the forward. This fixes an interaction with freezing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116069
Approved by: https://github.com/drisspg
2023-12-20 00:14:20 +00:00
Peter Bell
92998693a9 [inductor] Avoid bool being upcast to int (#109913)
Currently the inductor code for `x.any(-1)` does this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```

This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.

In the current code we also see:
```python
     tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.

Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:

```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
2023-12-19 14:16:10 +00:00
Isuru Fernando
8b0122ad33 Add lowerings for reflection_pad{1, 3}d_backward (#115645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115645
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-12-19 04:05:10 +00:00
vfdev-5
2a2f2e454a [inductor] Fixed issue with true div on integer input with dyn shapes (#115920)
Related to https://github.com/pytorch/pytorch/issues/115742, `Cpu/CudaTests.test_div8`

Description:
- Fixed issue with true div on integer input with dyn shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115920
Approved by: https://github.com/peterbell10
2023-12-16 02:06:39 +00:00
PyTorch MergeBot
ca4caf4eac Revert "[inductor] Do variance calculation in opmath type (#115181)"
This reverts commit 42390a097b.

Reverted https://github.com/pytorch/pytorch/pull/115181 on behalf of https://github.com/atalman due to OSSCI oncall, broke periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115181#issuecomment-1856360644))
2023-12-14 18:21:49 +00:00
Peter Bell
ad76a4e1e7 [inductor] Allow sympy expressions to participate in type promotion (#115676)
In the test example we have `add(i64[10], sympy.Expr)` where
`sympy.Expr` is not considered a promoting arg so isn't factored into
the type promotion. However, in eager it would promote to float32.
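
An analogous eager example with a plain Python float scalar (illustrative only; under dynamic shapes the sympy.Expr plays a similar non-tensor scalar role):

```python
import torch

x = torch.arange(10)      # int64 tensor
print((x + 0.5).dtype)    # torch.float32: the float scalar promotes to the default dtype
```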

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115676
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699, #115700
2023-12-13 22:22:37 +00:00
Peter Bell
42390a097b [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16 precision, which may lead to overflow as the intermediate result is unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-13 18:40:44 +00:00
Yang Chen
1392843e7b [inductor] make sure bitcast input and target type have the same bitwidth (#115619)
This PR fixes #104791.

bitcast requires that the source and target types have the same bitwidth. Because the input tensor's dtype could be promoted, e.g. from float16 to float, we have to cast the tensor back to its original source dtype before invoking bitcast in such cases. After that, we also need to convert the bit-casted tensor back to float to make sure we keep using higher-precision values for the rest of the computation.
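
An analogous illustration at the eager-tensor level (the PR concerns Inductor's bitcast op, but the constraint is the same):

```python
import torch

x = torch.tensor([1.0], dtype=torch.float16)
print(x.view(torch.int16))   # ok: a bitcast between two 16-bit types

# If the value had been promoted to float32 upstream, cast it back to float16
# first, bitcast, then return to float32 so later computation keeps the
# higher precision.
promoted = x.to(torch.float32)
bits = promoted.to(torch.float16).view(torch.int16)
print(bits.to(torch.float32))
```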

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
2023-12-13 00:53:04 +00:00
Peter Bell
02196c21ac [inductor] Parameterize ir.Scan on combine_fn (#109132)
This replaces `tl.cumsum` and `tl.cumprod` with calls to `tl.associative_scan`
where the combine function is generated from inductor IR.

So before we had:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 20
    rnumel = 30
    RBLOCK: tl.constexpr = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (30*x0)), rmask & xmask, other=0).to(tl.float32)
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
    tmp2 = tl.where(rmask & xmask, tmp1, 0)
    tmp3 = tl.cumsum(tmp2, 1)
    tl.store(out_ptr0 + (r1 + (30*x0)), tmp3, rmask & xmask)
```

Now we have:
```python
@triton.jit
def _triton_helper_fn0(arg0, arg1):
    tmp0 = arg0 + arg1
    return tmp0

@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 20
    rnumel = 30
    RBLOCK: tl.constexpr = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (30*x0)), rmask & xmask, other=0).to(tl.float32)
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
    tmp2 = tl.where(rmask & xmask, tmp1, 0)
    tmp3 = tl.associative_scan(tmp2, 1, _triton_helper_fn0)
    tl.store(out_ptr0 + (r1 + (30*x0)), tmp3, rmask & xmask)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109132
Approved by: https://github.com/lezcano
2023-12-12 16:30:50 +00:00
Isuru Fernando
505574c46a Add decomposition for torch.block_diag (#115096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115096
Approved by: https://github.com/peterbell10
2023-12-11 20:04:22 +00:00
Isuru Fernando
d40a7c6026 Add decompositions for replication_pad (#115113)
Fixes #115395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115113
Approved by: https://github.com/peterbell10
2023-12-09 02:44:07 +00:00
Isuru Fernando
fb19947962 Add decompositions for reflection_pad{1, 2, 3}d (#115100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115100
Approved by: https://github.com/peterbell10
2023-12-08 23:05:57 +00:00
Peter Bell
7aac689b19 [inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.

Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.

Fixes https://github.com/pytorch/pytorch/issues/93631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
2023-12-05 23:31:49 +00:00
lezcano
0a9819e3e1 Prefer is_number over is_constant() (#114513)
`is_constant` tries really hard to check whether an expression is constant; `is_number` is often enough. Note that `sympy.nan.is_number` is true, and the same holds for infinities.
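
A quick sympy illustration of the `is_number` behavior noted above:

```python
import sympy

print(sympy.nan.is_number)                  # True
print(sympy.oo.is_number)                   # True (same for infinities)
print(sympy.Symbol("s0").is_number)         # False
print((sympy.Symbol("s0") + 1).is_number)   # False
```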

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114513
Approved by: https://github.com/peterbell10
2023-12-05 16:56:15 +00:00
PyTorch MergeBot
0ee1e469cb Revert "Modify pointwise cat heuristic to only apply when inputs are all pointwise and outputs are all pointwise (#114520)"
This reverts commit 3d47b92dfb.

Reverted https://github.com/pytorch/pytorch/pull/114520 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114520#issuecomment-1840890210))
2023-12-05 14:24:30 +00:00
chilli
3d47b92dfb Modify pointwise cat heuristic to only apply when inputs are all pointwise and outputs are all pointwise (#114520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114520
Approved by: https://github.com/eellison
2023-12-02 04:02:39 +00:00
Antoni Viros
d47f715d29 Expose Flash attn to autograd (#114378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114378
Approved by: https://github.com/drisspg
2023-12-01 23:42:06 +00:00
Kurt Mohler
6f32eb7eef Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-12-01 18:56:09 +00:00
PyTorch MergeBot
013675ff59 Revert "Add decomp for replication_pad2d and use for CUDA deterministic (#111590)"
This reverts commit f1286161a6.

Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing XLA job.  The job is also failing on the PR, but the log classifier failed to find the failed test which lead to it being marked wrongly as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794))
2023-11-30 02:28:14 +00:00
chilli
597d3fb86a Add additional guard for index_put fallback for bfloat16 on whether it's accumulating or not (#114788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114788
Approved by: https://github.com/cpuhrsch
2023-11-30 00:33:50 +00:00