pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
nopperl	0c21161488	Add meta function for `torch.histc` (#124548 ) Registers a meta function for the `aten.histc.default` and `aten.histc.out` ops to support `torch.compile(dynamic=True)`. Fixes #124512. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124548 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-04-23 00:24:59 +00:00
Nikita Shulga	00372b1211	Extend int[48]mm ops to float32 input (#124287 ) Just for completeness Pull Request resolved: https://github.com/pytorch/pytorch/pull/124287 Approved by: https://github.com/mikekgfb	2024-04-17 23:10:49 +00:00
Nikita Shulga	298eb69c91	[EZ] Make `weight_int4pack_mm` compilable for `half` input dtype (#124136 ) To enable efficient int4 quantization on ARM Followup after https://github.com/pytorch/pytorch/pull/124022 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124136 Approved by: https://github.com/mikekgfb	2024-04-16 08:10:59 +00:00
Nikita Shulga	a096e99a5d	Enable int8mm kernel for float16 (#124022 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124022 Approved by: https://github.com/mikekgfb	2024-04-14 19:48:43 +00:00
Aleksandar Samardžić	f5331aade5	Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473 Approved by: https://github.com/cpuhrsch	2024-04-14 06:57:41 +00:00
PyTorch MergeBot	97261be0a8	Revert "Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 )" This reverts commit `b2a0b8c446`. Reverted https://github.com/pytorch/pytorch/pull/123473 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/123473#issuecomment-2053561077))	2024-04-13 07:47:32 +00:00
Aleksandar Samardžić	b2a0b8c446	Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473 Approved by: https://github.com/cpuhrsch	2024-04-11 11:56:27 +00:00
Episkey0109	02b29e7d07	Add meta function for channel_shuffle operation (#123033 ) This commit introduces a meta function for the `channel_shuffle` operation, enabling PyTorch to perform shape inference and optimizations related to this operation without actual computation. The meta function assumes input shape (*, C, H, W) and validates that the number of channels (C) is divisible by the specified number of groups. Fixes #122771 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123033 Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki	2024-04-11 10:07:18 +00:00
Jane Xu	adcfc2b582	Add meta reg for addcdiv/addcmul ScalarList (#123486 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123486 Approved by: https://github.com/awgu	2024-04-09 22:05:58 +00:00
angelayi	493478db4a	[effects] Add inductor support for tokens (#122347 ) Given the following code/dynamo graph: ``` class GraphModule(torch.nn.Module): def forward(self, L_x_ : torch.Tensor): l_x_ = L_x_ _print = torch.ops.aten._print('moo') res = l_x_ + l_x_; l_x_ = None _print_1 = torch.ops.aten._print('moo') return (res,) ``` AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output: ``` class <lambda>(torch.nn.Module): def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"): with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo'); arg0_1 = None getitem: "f32[0]" = with_effects[0]; with_effects = None add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None return (getitem_2, add) ``` However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators. This has to be done after the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph. 
``` class <lambda>(torch.nn.Module): def forward(self, arg1_1: "f32[2, 3]"): _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default() with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo'); _make_dep_token_default = None getitem: "f32[0]" = with_effects[0]; with_effects = None add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,)); getitem_2 = None return (add,) ``` When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which is just a `FallbackKernel` with a pointer to the previous effectful operator's call. During scheduling, we create a `StarDep` between each `EffectfulKernel` and the previous `EffectfulKernel` so that they don't get reordered. The inductor-generated Python code looks like: ``` def call(args): arg1_1, = args args.clear() assert_size_stride(arg1_1, (2, 3), (3, 1)) # Source Nodes: [_print], Original ATen: [] buf2 = aten._print.default('moo') # Source Nodes: [_print_1], Original ATen: [] buf3 = aten._print.default('moo') buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32) cpp_fused_add_0(arg1_1, buf4) del arg1_1 return (buf4, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347 Approved by: https://github.com/bdhirsh	2024-04-09 03:22:32 +00:00
Edward Z. Yang	deeeaded1f	Add metas for randint/rand factory functions out overload (#122375 ) Fixes https://github.com/pytorch/pytorch/issues/121897 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122375 Approved by: https://github.com/lezcano	2024-03-25 04:01:38 +00:00
Flavio Sales Truzzi	bde22835c6	[PT2] - Guard oblivious on meta registrations (#122216 ) Summary: ``` [trainer0\|0]:Potential framework code culprit (scroll up for full backtrace): [trainer0\|0]: File "/mnt/xarfuse/uid-539346/56d4bb3d-seed-nspid4026531836_cgpid183208940-ns-4026531840/torch/_meta_registrations.py", line 5043, in scatter_gather_dtype_check [trainer0\|0]: if index.numel() != 0: ``` Test Plan: General CI. Reviewed By: ezyang Differential Revision: D54689183 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122216 Approved by: https://github.com/ezyang	2024-03-22 01:36:03 +00:00
mingfeima	b3065f6899	add int8 packed gemm support on CPU device (#118056 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056 Approved by: https://github.com/mikekgfb	2024-03-07 08:41:43 +00:00
Peter Bell	eae9751e82	Fix linalg_eigvals invalid use of composite dispatch key (#121142 ) `linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals` also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op as not all types support out variants. Instead, I add a new helper `_linalg_eigvals` which does the same thing in a non-composite operator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142 Approved by: https://github.com/lezcano	2024-03-05 21:13:27 +00:00
PyTorch MergeBot	0c07c0c15f	Revert "add int4 packed gemm support on CPU device (#117475 )" This reverts commit `30befa592e`. Reverted https://github.com/pytorch/pytorch/pull/117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/117475#issuecomment-1977474686))	2024-03-04 21:20:57 +00:00
PyTorch MergeBot	a98c17edc7	Revert "add int8 packed gemm support on CPU device (#118056 )" This reverts commit `f84375ca5d`. Reverted https://github.com/pytorch/pytorch/pull/118056 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/118056#issuecomment-1977368720))	2024-03-04 20:09:40 +00:00
Xia, Weiwen	83d848e1c7	[Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605 ) Description: Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear. The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case. This feature targets the PyTorch 2.3 release. Test plan: ``` python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear ``` Performance before and after lowering `choose_qparam` to Inductor: Before: - latency for shape (32, 32) = 0.151 ms - latency for shape (128, 128) = 0.153 ms - latency for shape (1024, 1024) = 0.247 ms After: - latency for shape (32, 32) = 0.049 ms - latency for shape (128, 128) = 0.052 ms - latency for shape (1024, 1024) = 0.133 ms Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor. Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168	2024-03-02 05:11:17 +00:00
mingfeima	f84375ca5d	add int8 packed gemm support on CPU device (#118056 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056 Approved by: https://github.com/mikekgfb ghstack dependencies: #117475	2024-03-02 04:35:49 +00:00
mingfeima	30befa592e	add int4 packed gemm support on CPU device (#117475 ) This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast The default perf measured on Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec` * WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec` * WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec` WOQ int4 is measured with the method described at https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-03-02 00:17:34 +00:00
Andrew M. James	19fcf6de1a	Add lowering for fraction_max_pool2d (#120460 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120460 Approved by: https://github.com/peterbell10, https://github.com/lezcano	2024-03-01 20:13:20 +00:00
Simon Fan	9b2c35b4fe	[dynamo] Fix convolution meta kernel when input channel is 0 (#120944 ) Addresses https://github.com/pytorch/pytorch/issues/118797 Adding in special channel handling logic from eager (set output channels to 0 when input channels are 0): `67d3e4f2a2/aten/src/ATen/native/Convolution.cpp (L1400-L1403)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120944 Approved by: https://github.com/zou3519	2024-03-01 01:18:21 +00:00
Jane Xu	da559c98e3	Fix isin decomp and add python meta registration (#120821 ) Fixes #119792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120821 Approved by: https://github.com/malfet, https://github.com/peterbell10	2024-02-29 22:08:50 +00:00
David Berard	d6c202975c	Move attention kernels from meta_registrations to fake_impls (#120682 ) This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns. This PR: * Move the `_meta_registrations.py` implementations to `fake_impls.py` * Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them * Wrap all the returned tensors in FakeTensors Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682 Approved by: https://github.com/drisspg	2024-02-28 21:49:13 +00:00
angelayi	f064dec7e0	Add torch.ops.aten.print (#120295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295 Approved by: https://github.com/zou3519	2024-02-27 01:34:59 +00:00
PyTorch MergeBot	b01bd1f7a1	Revert "Add torch.ops.aten.print (#120295 )" This reverts commit `3b944113c8`. Reverted https://github.com/pytorch/pytorch/pull/120295 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54123688 ([comment](https://github.com/pytorch/pytorch/pull/120295#issuecomment-1965618191))	2024-02-27 01:18:48 +00:00
David Berard	fdae9363b3	[meta registration] efficient_attention_forward fix for NT inputs (#120594 ) When cu_seqlens_q is provided, we should use the user-specified max_seqlen_q instead of inferring it as query.size(1): `1c7b0e7cd1/aten/src/ATen/native/transformers/cuda/attention.cu (L989)` This wasn't caught because the value is taken as ceil(max_seqlen / 32) * 32 in the opinfos, and the opinfo inputs were small enough that this value was 32 in either case. Differential Revision: [D54179733](https://our.internmc.facebook.com/intern/diff/D54179733) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120594 Approved by: https://github.com/drisspg	2024-02-27 00:10:37 +00:00
angelayi	3b944113c8	Add torch.ops.aten.print (#120295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295 Approved by: https://github.com/zou3519	2024-02-23 17:01:22 +00:00
Jane Xu	4319735ace	Add meta registration for _foreach_norm (2nd try) (#119927 ) The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. We just launch multiple kernels with a simpler version of the struct (to minimize kernels launched). Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927 Approved by: https://github.com/albanD	2024-02-16 00:23:23 +00:00
Joel Schlosser	31e59766e7	Fix meta registration for _flash_attention_forward() (#119812 ) Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case. Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812 Approved by: https://github.com/drisspg	2024-02-14 02:38:53 +00:00
Jesse Cai	1c1dc0e4e0	[sparse] Add in out_dtype support (i8i8->bf16, i32) for cusparselt (#119296 ) Summary: Adds in out_dtype support for (i8i8->bf16) and (i8i8->i32) matmul with cuSPARSELt. Test Plan: ``` python test/test_sparse_semi_structured.py -k mixed ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/119296 Approved by: https://github.com/cpuhrsch, https://github.com/alexsamardzic	2024-02-12 16:02:36 +00:00
Pearu Peterson	2c91e13afc	Add lowerings to special functions (#119187 ) As in the title. In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187 Approved by: https://github.com/peterbell10	2024-02-11 16:35:40 +00:00
PyTorch MergeBot	dea15c9fdc	Revert "Add meta registration for _foreach_norm (#118604 )" This reverts commit `b8bb12cd45`. Reverted https://github.com/pytorch/pytorch/pull/118604 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118604#issuecomment-1930849491))	2024-02-06 22:20:44 +00:00
Vladimir Malinovskii	73f0fdea5b	[fix] accounting for dilation in pool padding assertion (#118897 ) Fixes https://github.com/pytorch/pytorch/issues/7541 This is a copy of https://github.com/pytorch/pytorch/pull/111427, which was closed before I could fix all of its issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118897 Approved by: https://github.com/mikaylagawarecki	2024-02-06 20:32:58 +00:00
Jane Xu	b8bb12cd45	Add meta registration for _foreach_norm (#118604 ) This PR also fixes the discrepancy between _foreach_norm fast path and slow path, where storage_offsets will be different between the lists of tensors. Here are some profile results showing that we aren't significantly slower. Do note that we're replacing N `as_strided`/`select` calls to N `empty` calls. For script: ``` import torch ts = [torch.rand(32, 16, device="cuda") for _ in range(128)] with torch.profiler.profile( activities=[ torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, ] ) as p: res = torch._foreach_norm(ts) print(p.key_averages().table(sort_by="cpu_time_total")) ``` OG baseline: ``` (pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7cf98987)]$ python playground2.py STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:314] Completed Stage: Warm Up STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:320] Completed Stage: Collection STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:324] Completed Stage: Post Processing ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ aten::_foreach_norm 25.36% 4.209ms 99.94% 16.586ms 16.586ms 8.000us 88.89% 9.000us 9.000us 1 cudaLaunchKernel 61.21% 10.159ms 61.21% 10.159ms 2.540ms 0.000us 0.00% 0.000us 0.000us 4 aten::zeros 0.43% 71.000us 58.35% 9.683ms 9.683ms 0.000us 0.00% 1.000us 1.000us 1 aten::zero_ 0.33% 55.000us 57.35% 9.517ms 9.517ms 0.000us 0.00% 1.000us 1.000us 1 aten::fill_ 0.42% 69.000us 57.01% 9.462ms 9.462ms 
1.000us 11.11% 1.000us 1.000us 1 aten::select 8.04% 1.335ms 11.29% 1.873ms 14.633us 0.000us 0.00% 0.000us 0.000us 128 aten::as_strided 3.24% 538.000us 3.24% 538.000us 4.203us 0.000us 0.00% 0.000us 0.000us 128 aten::empty 0.90% 150.000us 0.90% 150.000us 75.000us 0.000us 0.00% 0.000us 0.000us 2 cudaDeviceSynchronize 0.06% 10.000us 0.06% 10.000us 10.000us 0.000us 0.00% 0.000us 0.000us 1 void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.000us 11.11% 1.000us 1.000us 1 void at::native::(anonymous namespace)::multi_tensor... 0.00% 0.000us 0.00% 0.000us 0.000us 6.000us 66.67% 6.000us 3.000us 2 void at::native::lpnorm_cleanup<float, (at::native::... 0.00% 0.000us 0.00% 0.000us 0.000us 2.000us 22.22% 2.000us 2.000us 1 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 16.596ms Self CUDA time total: 9.000us ``` And here's after this PR: ``` STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:314] Completed Stage: Warm Up STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:320] Completed Stage: Collection STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:324] Completed Stage: Post Processing ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ aten::_foreach_norm 30.95% 4.653ms 99.95% 15.026ms 15.026ms 9.000us 90.00% 10.000us 10.000us 1 cudaLaunchKernel 52.41% 7.879ms 52.41% 7.879ms 
1.970ms 0.000us 0.00% 0.000us 0.000us 4 aten::zeros 0.39% 58.000us 48.29% 7.260ms 7.260ms 0.000us 0.00% 1.000us 1.000us 1 aten::zero_ 0.35% 53.000us 47.25% 7.103ms 7.103ms 0.000us 0.00% 1.000us 1.000us 1 aten::fill_ 0.43% 65.000us 46.90% 7.050ms 7.050ms 1.000us 10.00% 1.000us 1.000us 1 aten::empty 15.42% 2.318ms 15.42% 2.318ms 17.969us 0.000us 0.00% 0.000us 0.000us 129 cudaDeviceSynchronize 0.05% 7.000us 0.05% 7.000us 7.000us 0.000us 0.00% 0.000us 0.000us 1 void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.000us 10.00% 1.000us 1.000us 1 void at::native::(anonymous namespace)::multi_tensor... 0.00% 0.000us 0.00% 0.000us 0.000us 6.000us 60.00% 6.000us 3.000us 2 void at::native::lpnorm_cleanup<float, (at::native::... 0.00% 0.000us 0.00% 0.000us 0.000us 3.000us 30.00% 3.000us 3.000us 1 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 15.033ms Self CUDA time total: 10.000us ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118604 Approved by: https://github.com/albanD	2024-02-05 22:01:01 +00:00
David Berard	1b03423526	[meta registration] fix _efficient_attention_forward for jagged inputs (#118657 ) Fixes the meta registration for the logsumexp output, whose shape should be defined by the size of the offsets tensor when it exists. `644f64f2d1/aten/src/ATen/native/transformers/cuda/attention.cu (L1045)` Differential Revision: [D53234217](https://our.internmc.facebook.com/intern/diff/D53234217) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118657 Approved by: https://github.com/YuqingJ	2024-01-31 00:11:39 +00:00
Catherine Lee	4f5785b6b3	Enable possibly-undefined error code (#118533 ) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Co-authored-by: Catherine Lee <csl@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 21:07:01 +00:00
PyTorch MergeBot	40ece2e579	Revert "Enable possibly-undefined error code (#118533 )" This reverts commit `4f13f69a45`. Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))	2024-01-30 19:00:34 +00:00
Edward Z. Yang	4f13f69a45	Enable possibly-undefined error code (#118533 ) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 05:08:10 +00:00
Jeff Daily	01abb5af21	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10, https://github.com/malfet	2024-01-22 18:33:41 +00:00
PyTorch MergeBot	b637fdc8b3	Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 )" This reverts commit `74e1362499`. Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))	2024-01-19 17:35:04 +00:00
Jeff Daily	74e1362499	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10	2024-01-19 00:50:18 +00:00
vfdev-5	f6767244cf	Added meta function for _upsample_bicubic2d_aa (#117347 ) This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127 ``` /opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors) E torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>((FakeTensor(..., size=(1, s0, s1, s2)),), {'size': [s4, floor(s3s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}): E aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers E E from user code: E File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image E image = interpolate( E E Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information E E E You can suppress this exception and fall back to eager by setting: E import torch._dynamo E torch._dynamo.config.suppress_errors = True ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347 Approved by: https://github.com/peterbell10	2024-01-16 23:33:55 +00:00
Valentine233	20c2ec9a15	[CPU] Add flash attention mask version (#115913 ) Add a masked-version flash attention for CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913 Approved by: https://github.com/jgong5, https://github.com/drisspg	2024-01-07 04:58:23 +00:00
PyTorch MergeBot	2ccc7af028	Revert "[CPU] Add flash attention mask version (#115913 )" This reverts commit `76a3fbb709`. Reverted https://github.com/pytorch/pytorch/pull/115913 on behalf of https://github.com/zou3519 due to broke transformer test on dynamo shard ([comment](https://github.com/pytorch/pytorch/pull/115913#issuecomment-1878043389))	2024-01-05 02:39:12 +00:00
Valentine233	76a3fbb709	[CPU] Add flash attention mask version (#115913 ) Add a masked-version flash attention for CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913 Approved by: https://github.com/jgong5, https://github.com/drisspg	2024-01-05 01:27:36 +00:00
Aleksandar Samardžić	f081c45a34	Add out_dtype support for sparse semi-structured CUTLASS back-end (#116519 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116519 Approved by: https://github.com/cpuhrsch	2024-01-03 16:23:17 +00:00
soulitzer	8885128dcc	Fix backward for SDPA NT jagged layout (#115576 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115576 Approved by: https://github.com/jbschlosser, https://github.com/ani300	2023-12-12 18:35:40 +00:00
Jesse Cai	4cb7dd0fc9	[sparse][quant] Add support for vector alpha in cusparselt mm (#112056 ) Summary: This PR adds support for passing in an alpha Tensor, which represents a tensor of alpha values to fuse into the matmul. ``` cusparselt_sparse_mm = alpha A @ B + bias ``` This operation is necessary for quantization, where we would like to fuse one of the dequant matmuls into the sparse op. Test Plan: ``` python test/test_sparse_semi_structured.py -k alpha ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/112056 Approved by: https://github.com/cpuhrsch	2023-12-04 16:56:06 +00:00
Antoni Viros	d47f715d29	Expose Flash attn to autograd (#114378 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114378 Approved by: https://github.com/drisspg	2023-12-01 23:42:06 +00:00
Jesse Cai	ae593d0393	[sparse][semi-structured][inductor] meta registrations for _cslt_sparse_mm + additional stride checking in test. (#114685 ) _cslt_sparse_mm + additional stride checking in test. Summary: This PR adds meta registrations for _cslt_sparse_mm, based on the work @drisspg did in #114370. Additionally, it updates the tests to check that the strides of the sparse result and the result returned by sparse+compile are the same, to avoid errors like those found in https://github.com/pytorch/pytorch/pull/114477. Test Plan: ``` python test/test_sparse_semi_structured.py -k compile_cusparselt python test/test_sparse_semi_structured.py -k compile_cutlass ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/114685 Approved by: https://github.com/alexsamardzic, https://github.com/drisspg	2023-11-29 00:31:52 +00:00
Jon Chuang	cef79c0df4	[inductor] `_sparse_semi_structured_linear` fallback - no meta registration; not on testing path (#114477 ) Test was wrong in original PR and merged changes were never tested. Further, the sparse op was never actually compiled due to missing `fullgraph=True` and missing meta registration. When meta is added as per this PR, it gives wrong answers when input needs to be padded and when input needs to be reshaped. Is this something to do with the generated inductor code for: ``` constant_pad_nd: "f16[32, 128]" = torch.ops.aten.constant_pad_nd.default(primals_3, [0, 0, 0, 31], 0.0) ... slice_1: "f16[1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 0, 0, 1); _sparse_semi_structured_linear = None ``` and ``` [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] mul: "Sym(s0*s1)" = primals_4 * primals_5 [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] view: "f16[s0*s1, 128]" = torch.ops.aten.view.default(primals_6, [mul, 128]); primals_6 = mul = None ... 
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] view_1: "f16[s0, s1, 128]" = torch.ops.aten.view.default(slice_1, [primals_4, primals_5, 128]); slice_1 = None ``` Failing graphs: Padded: ``` [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] ===== Forward graph 5 ===== [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] <eval_with_key>.66 class GraphModule(torch.nn.Module): [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] def forward(self, primals_1: "f16[128, 64]", primals_2: "i16[128, 8]", primals_3: "f16[1, 128]"): [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x) [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] constant_pad_nd: "f16[32, 128]" = torch.ops.aten.constant_pad_nd.default(primals_3, [0, 0, 0, 31], 0.0) [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] _sparse_semi_structured_linear: "f16[32, 128]" = torch.ops.aten._sparse_semi_structured_linear.default(constant_pad_nd, primals_1, primals_2); constant_pad_nd = primals_1 = primals_2 = None [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] slice_1: "f16[1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 0, 0, 1); _sparse_semi_structured_linear = None [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] slice_2: "f16[1, 128]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 9223372036854775807); slice_1 = None [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] # File: 
/home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:147, code: return torch.nn.functional.relu(x) [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] relu: "f16[1, 128]" = torch.ops.aten.relu.default(slice_2); slice_2 = None [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] alias: "f16[1, 128]" = torch.ops.aten.alias.default(relu) [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] alias_1: "f16[1, 128]" = torch.ops.aten.alias.default(alias); alias = None [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] le: "b8[1, 128]" = torch.ops.aten.le.Scalar(alias_1, 0); alias_1 = None [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x) [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] permute: "f16[128, 1]" = torch.ops.aten.permute.default(primals_3, [1, 0]); primals_3 = None [2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] return [relu, le, permute] ``` Reshape: ``` [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] <eval_with_key>.69 class GraphModule(torch.nn.Module): [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] def forward(self, primals_1: "f16[128, 64]", primals_2: "i16[128, 8]", primals_3: "f16[128]", primals_4: "Sym(s0)", primals_5: "Sym(s1)", primals_6: "f16[s0, s1, 128]"): [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x) [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: 
[INFO] mul: "Sym(s0*s1)" = primals_4 * primals_5 [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] view: "f16[s0*s1, 128]" = torch.ops.aten.view.default(primals_6, [mul, 128]); primals_6 = mul = None [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] _sparse_semi_structured_linear: "f16[s0*s1, 128]" = torch.ops.aten._sparse_semi_structured_linear.default(view, primals_1, primals_2, bias = primals_3); primals_1 = primals_2 = primals_3 = None [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] slice_1: "f16[s0*s1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 1, 0, 9223372036854775807); _sparse_semi_structured_linear = None [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] view_1: "f16[s0, s1, 128]" = torch.ops.aten.view.default(slice_1, [primals_4, primals_5, 128]); slice_1 = None [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:147, code: return torch.nn.functional.relu(x) [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] relu: "f16[s0, s1, 128]" = torch.ops.aten.relu.default(view_1); view_1 = None [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] alias: "f16[s0, s1, 128]" = torch.ops.aten.alias.default(relu) [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] alias_1: "f16[s0, s1, 128]" = torch.ops.aten.alias.default(alias); alias = None [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] le: "b8[s0, s1, 128]" = torch.ops.aten.le.Scalar(alias_1, 0); alias_1 = None [2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] return [relu, view, le, primals_4, primals_5] 
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114477 Approved by: https://github.com/jcaip	2023-11-28 19:35:05 +00:00
drisspg	8556a09d44	Require less alignment for attn bias (#114173 ) # Summary Improved Fix for Attention Mask Alignment Issue (#112577) This PR addresses Issue #112577 by refining the previously implemented fix, which was found to be incorrect and caused unneeded memory regressions. The update simplifies the approach to handling the alignment of the attention mask for memory-efficient attention. ## Changes Alignment Check and Padding: Initially, the alignment of the attention mask is checked. If misalignment is detected, padding is applied, followed by slicing. During this process, a warning is raised to alert users. Should this be warn_once? We call expand only once, on the aligned mask. Reference https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L115 @albanD, @mruberry, @jbschlosser, @walterddr, and @mikaylagawarecki. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114173 Approved by: https://github.com/danthe3rd	2023-11-28 02:40:41 +00:00
PyTorch MergeBot	88a8a0daa4	Revert "Require less alignment for masking (#114173 )" This reverts commit `f882c175d8`. Reverted https://github.com/pytorch/pytorch/pull/114173 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing some inductor tests `f882c175d8` ([comment](https://github.com/pytorch/pytorch/pull/114173#issuecomment-1823552362))	2023-11-22 21:49:31 +00:00
drisspg	f882c175d8	Require less alignment for masking (#114173 ) # Summary Improved Fix for Attention Mask Alignment Issue (#112577) This PR addresses Issue #112577 by refining the previously implemented fix, which was found to be incorrect and caused unneeded memory regressions. The update simplifies the approach to handling the alignment of the attention mask for memory-efficient attention. ## Changes Alignment Check and Padding: Initially, the alignment of the attention mask is checked. If misalignment is detected, padding is applied, followed by slicing. During this process, a warning is raised to alert users. Should this be warn_once? We call expand only once, on the aligned mask. Reference https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L115 @albanD, @mruberry, @jbschlosser, @walterddr, and @mikaylagawarecki. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114173 Approved by: https://github.com/danthe3rd	2023-11-22 20:02:51 +00:00
Tomasz Bohutyn	84909fef52	Add meta registration for aten.linear_backward (#114359 ) Fixes #114358 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114359 Approved by: https://github.com/ezyang	2023-11-22 18:24:24 +00:00
Isuru Fernando	4b7f9fa436	Meta register all foreach ops (#112281 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112281 Approved by: https://github.com/lezcano	2023-11-21 14:23:09 +00:00
vfdev-5	1f8d00c5a3	[inductor] Added decomposition for upsample_nearest_exact Nd (#113749 ) Description: - Added decomposition for upsample_nearest_exact: 1d, 2d, 3d Pull Request resolved: https://github.com/pytorch/pytorch/pull/113749 Approved by: https://github.com/lezcano	2023-11-21 13:03:47 +00:00
lezcano	1d96034816	[BE][easy] Simplify the registration of a few metafunctions (#113635 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113635 Approved by: https://github.com/Skylion007 ghstack dependencies: #113634, #113674	2023-11-16 19:09:12 +00:00
lezcano	9b3e694f5d	Fix metafunction for many pointwise operations (#113634 ) The previous metafunction was completely broken. It incorrectly used a metafunction that was designed for prims. It also passed in an incorrect enum class for the type promotion. Fixes https://github.com/pytorch/pytorch/issues/113119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113634 Approved by: https://github.com/peterbell10	2023-11-16 19:09:12 +00:00
drisspg	c46fc46dba	expose mem-eff to autograd (#110495 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110495 Approved by: https://github.com/jbschlosser	2023-11-13 17:47:40 +00:00
Edward Z. Yang	f49b8e9313	Register SymInt-aware meta function for mm out, symintify resize (#113202 ) Fixes https://github.com/pytorch/pytorch/issues/112489 Fixes https://github.com/pytorch/pytorch/issues/112494 New OpInfo tests for out variants added, since these were not exercised previously. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113202 Approved by: https://github.com/albanD	2023-11-10 14:27:05 +00:00
jiayisun	63d65dd6cd	Correct output shape of meta registration for qlinear_pointwise (#112390 ) Corrected output shape of meta registration for qlinear_pointwise. Because the weight of qlinear_pointwise has been transposed during the qLinear weight prepack process, the shape of the weight of qlinear_pointwise is (in_features, out_features). Pull Request resolved: https://github.com/pytorch/pytorch/pull/112390 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison	2023-11-10 07:50:59 +00:00
eellison	325e0fdfdd	Enable masked_scatter_backward for inductor (#109642 ) masked_scatter_backward was previously implemented as a CompositeExplicitAutograd, which involved a decomp that calls masked_select, and masked_select in general produces data-dependent shapes that inductor doesn't support. But masked_scatter_backward reshapes the return value of masked_select such that the end result has a static shape again. I have converted masked_scatter_backward into an aten op to avoid this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109642 Approved by: https://github.com/ezyang ghstack dependencies: #108170	2023-11-09 01:27:57 +00:00
Aaron Gokaslan	376217cc0b	[BE]: Apply FURB145 to make code more readable and idiomatic. (#112990 ) Testing out some new rules that are in beta, I think I will apply this one codebase wide once it's out of preview. Replaces the hack of using `[:]` to do copies of list with the proper copy method. More efficient and more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112990 Approved by: https://github.com/ezyang	2023-11-06 13:15:04 +00:00
leslie-fang-intel	a53d29cc18	Enable oneDNN QLinear FP32/BF16 output (#112126 ) Summary - PR 2 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640. - Enable QLinear (relu) with BFloat16 or Float32 output. TestPlan ``` python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112126 Approved by: https://github.com/jerryzh168, https://github.com/jgong5 ghstack dependencies: #112010	2023-11-03 08:20:54 +00:00
leslie-fang-intel	b6fc7af8a0	Enable oneDNN QConv FP32/BF16 output (#112010 ) Summary - PR 1 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640. - Enable QConv (relu, add, add_relu) with BFloat16 or Float32 output. Test Plan ``` python -u -m pytest -s -v test_quantized_op.py -k test_qconv1d_pt2e python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_pt2e python -u -m pytest -s -v test_quantized_op.py -k test_qconv3d_pt2e python -u -m pytest test_quantized_op.py -k test_qconv2d_relu_pt2e python -u -m pytest test_quantized_op.py -k test_qconv2d_add_pt2e python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_pt2e python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_float_output_pt2e ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112010 Approved by: https://github.com/jerryzh168, https://github.com/jgong5	2023-11-03 08:16:45 +00:00
drisspg	458e7d09fd	Add meta func for scaled mm (#112609 ) # Summary Adds a meta implementation for _scaled_mm which is required for dynamic shapes Pull Request resolved: https://github.com/pytorch/pytorch/pull/112609 Approved by: https://github.com/eellison, https://github.com/malfet	2023-11-03 03:44:22 +00:00
PyTorch MergeBot	2e29172942	Revert "Add meta func for scaled mm (#112609 )" This reverts commit `75174c3797`. Reverted https://github.com/pytorch/pytorch/pull/112609 on behalf of https://github.com/huydhn due to Sorry for reverting this change, but it is failing ROCm jobs `75174c3797` ([comment](https://github.com/pytorch/pytorch/pull/112609#issuecomment-1791704037))	2023-11-02 23:37:16 +00:00
drisspg	75174c3797	Add meta func for scaled mm (#112609 ) # Summary Adds a meta implementation for _scaled_mm which is required for dynamic shapes Pull Request resolved: https://github.com/pytorch/pytorch/pull/112609 Approved by: https://github.com/eellison, https://github.com/malfet	2023-11-02 18:42:41 +00:00
Peter Bell	04024926f4	Use `pytree.tree_map_` everywhere (#112417 ) Wherever we discard the output of `tree_map` it's better to call `tree_map_` which doesn't unflatten the mapped results and so is a lot cheaper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112417 Approved by: https://github.com/lezcano ghstack dependencies: #112391, #112392, #112393, #112394	2023-10-31 15:57:06 +00:00
lezcano	c8a5bb451e	Do not import sympy within torch._prims_common (#112034 ) This is the first of a few PRs that avoid importing SymPy at import time. The pitch here is that we (almost!) do not have SymPy on our API, so this should be feasible. This should speed up torch imports by a good 15% as per https://dev-discuss.pytorch.org/t/delving-into-what-happens-when-you-import-torch/1589 In this PR we just move a few global imports into local imports. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112034 Approved by: https://github.com/ezyang	2023-10-26 12:53:25 +00:00
Jez Ng	ad3572a5dc	Unify torch.SymInt and torch.types.SymInt (#110573 ) Per @ezyang, this should be fine Pull Request resolved: https://github.com/pytorch/pytorch/pull/110573 Approved by: https://github.com/ezyang	2023-10-24 16:17:23 +00:00
Yuanjing Shi	920c9adcc6	[MetaTensor] fix inplace copy for meta tensor (#111705 ) Fixes #105685 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111705 Approved by: https://github.com/ezyang	2023-10-21 06:02:37 +00:00
Jane Xu	93a9b1314b	Make step() faster by passing in a tensor vs scalar 1 (#111084 ) This is the culminated result of https://github.com/pytorch/pytorch/pull/110954#issuecomment-1758520411. We are making the code slightly more complicated to gain some perf in minimizing calls to `.copy_()` and `.to()`. ### Code ``` import torch with torch.cuda.device(0): steps = [torch.zeros((), device="cpu", dtype=torch.float32) for i in range(1000)] with torch.profiler.profile( activities=[ torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, ] ) as p: # New code: # step_device = steps[0].device # one = torch.tensor(1.0, device=step_device) if str(step_device) == "cpu" else 1 # torch._foreach_add_(steps, one, 1.0) # Old code: torch._foreach_add_(steps, 1) print(p.key_averages().table(sort_by="cpu_time_total")) ``` ### Profiles with old code ``` ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::_foreach_add_ 35.31% 52.089ms 99.99% 147.495ms 147.495ms 1 aten::add_ 25.05% 36.949ms 64.68% 95.406ms 95.406us 1000 aten::to 3.97% 5.852ms 39.63% 58.457ms 58.457us 1000 aten::_to_copy 10.11% 14.917ms 35.66% 52.605ms 52.605us 1000 aten::copy_ 21.65% 31.939ms 21.65% 31.939ms 31.939us 1000 aten::empty_strided 3.90% 5.749ms 3.90% 5.749ms 5.749us 1000 cudaDeviceSynchronize 0.01% 18.000us 0.01% 18.000us 18.000us 1 ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 147.513ms ``` with new code ``` ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ 
aten::_foreach_add_ 55.06% 49.963ms 99.86% 90.625ms 90.625ms 1 aten::add_ 44.81% 40.662ms 44.81% 40.662ms 40.662us 1000 aten::detach_ 0.01% 8.000us 0.05% 45.000us 45.000us 1 detach_ 0.04% 37.000us 0.04% 37.000us 37.000us 1 aten::empty 0.03% 30.000us 0.03% 30.000us 30.000us 1 aten::to 0.03% 23.000us 0.03% 23.000us 23.000us 1 cudaDeviceSynchronize 0.02% 22.000us 0.02% 22.000us 22.000us 1 aten::lift_fresh 0.01% 6.000us 0.01% 6.000us 6.000us 1 ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 90.751ms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/111084 Approved by: https://github.com/albanD ghstack dependencies: #111079	2023-10-20 01:34:08 +00:00
Scruel Tao	108378e2af	Fix: `torch.matrix_exp` performance issue (#105225 ) (#110848 ) Fixes #105225 - New implementation for `compute_T18_scale_square` method. - Always use the highest degree for large batch sizes (size > 1). Pull Request resolved: https://github.com/pytorch/pytorch/pull/110848 Approved by: https://github.com/lezcano	2023-10-18 04:43:25 +00:00
Yanbo Liang	29048be41c	[Reland] Add int4mm kernel (#111403 ) This is a reland for #110914, #111327 and #111390 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111403 Approved by: https://github.com/Chillee	2023-10-17 06:33:18 +00:00
PyTorch MergeBot	408e991dfe	Revert "Quant: add weight int4pack mm kernel (#110914 )" This reverts commit `9980876cab`. Reverted https://github.com/pytorch/pytorch/pull/110914 on behalf of https://github.com/jeanschmidt due to Breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110914#issuecomment-1765302621))	2023-10-16 21:27:26 +00:00
Brian Hirsh	0d368f586a	fix wrong meta for index_select.out (#111364 ) fixes https://github.com/pytorch/pytorch/issues/110699 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111364 Approved by: https://github.com/ezyang ghstack dependencies: #111040	2023-10-16 15:18:20 +00:00
Yanbo Liang	9980876cab	Quant: add weight int4pack mm kernel (#110914 ) Adding the weight int4pack mm CUDA kernel. The kernel comes from the tinygemm project, which was developed by Jeff Johnson. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110914 Approved by: https://github.com/Chillee	2023-10-13 01:21:18 +00:00
drisspg	e0dbaa04d2	Fix the meta func for mem_eff_backward (#110893 ) Fixes #110832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110893 Approved by: https://github.com/eellison	2023-10-11 02:58:54 +00:00
Jon Chuang	37afa0c349	fix(inductor): Increase coverage of Inductor ATen lowering (#110473 ) Add sqrt to decomp testing path and fix missing `minimum`, `clamp_min`,`clamp_max` lowerings and/or registrations. Follow up to: https://github.com/pytorch/pytorch/pull/110468#issuecomment-1745718602 (requires upstream to merge to avoid merge conflict) CC: @janeyx99 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110473 Approved by: https://github.com/janeyx99	2023-10-04 23:40:46 +00:00
Jon Chuang	3fd938369f	add `foreach_abs` meta registration and inductor decomp (#110468 ) Fixes https://github.com/pytorch/pytorch/issues/110458 Somehow it is on allowlist but not on testing path. CC @janeyx99 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110468 Approved by: https://github.com/janeyx99	2023-10-04 06:09:37 +00:00
Mwiza Kunda	5c4b5baf21	Fix python decomps for OpOverloadPackets and add tests (#107707 ) - Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but have aten op overloads that have out parameters. Additionally, Python decompositions may register `OpOverloadPacket`'s so decompositions need to be tested to ensure all `OpOverloads` still function for the `Meta` key (e.g. if a python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the python function needs to support receiving out arguments) - Add out parameter wrappers to python decomps for aten ops that have out overloads CC. @ezyang @albanD @lezcano Fixes #107713 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707 Approved by: https://github.com/lezcano	2023-09-25 20:53:30 +00:00
Mwiza Kunda	83b4aab5bc	Allow zero sized tensors to be resized with meta_randperm (#109721 ) Failure will be handled by `_maybe_resize_out` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/109721 Approved by: https://github.com/ezyang	2023-09-21 18:41:29 +00:00
eellison	d24ba7a634	Add 3d Attn Pattern to match HF Whisper (#109156 ) Adds a 3d pattern that improves perf of HF Whisper from 1.3 -> 4.1. We could be matching more generally on 3d, but i'll leave that for another pr. Thanks to @drisspg for helping me write the pattern. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109156 Approved by: https://github.com/yanboliang ghstack dependencies: #109663, #108894, #108917, #109142	2023-09-20 16:39:31 +00:00
eellison	ad53b53518	Generate patterns in fp16 and fp32 (#109142 ) aten.softmax will generate a different decomposition for fp16/bf16 and fp32 because when invoked in lower precision it will upcast the inputs to fp32 and then downcast after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as do I'm sure many other models). Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142 Approved by: https://github.com/yanboliang ghstack dependencies: #109663, #108894, #108917	2023-09-20 06:38:02 +00:00
PyTorch MergeBot	c2f5d4d8f0	Revert "Generate patterns in fp16 and fp32 (#109142 )" This reverts commit `14994cc978`. Reverted https://github.com/pytorch/pytorch/pull/109142 on behalf of https://github.com/eellison due to MESSAGE ([comment](https://github.com/pytorch/pytorch/pull/109142#issuecomment-1726641232))	2023-09-19 22:52:05 +00:00
eellison	14994cc978	Generate patterns in fp16 and fp32 (#109142 ) aten.softmax will generate a different decomposition for fp16/bf16 and fp32 because when invoked in lower precision it will upcast the inputs to fp32 and then downcast after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as do I'm sure many other models). Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142 Approved by: https://github.com/yanboliang ghstack dependencies: #108894, #108917	2023-09-19 20:59:42 +00:00
leslie-fang-intel	4a60bd22b2	[Quant][Inductor] Enable quantization dynamic batch size support (#108550 ) Summary This diff enables dynamic batch size support for the quantization use case in Inductor. Taking the UT in this PR as an example, after this PR the generated code assumes a dynamic input batch size. ``` cpp_fused_quantize_per_tensor_0 = async_compile.cpp(''' #include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h" extern "C" void kernel(const float* in_ptr0, unsigned char* out_ptr0, const long ks0, const long ks1) { { #pragma GCC ivdep for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(3L); i1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i2=static_cast<long>(0L); i2<static_cast<long>(static_cast<long>(ks1*ks1)); i2+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i2 + (i1*(static_cast<long>(ks1*ks1))) + (3L*i0*(static_cast<long>(ks1*ks1))))]; auto tmp1 = static_cast<float>(40.36037717834931); auto tmp2 = decltype(tmp0)(tmp0 * tmp1); auto tmp3 = std::nearbyint(tmp2); auto tmp4 = static_cast<float>(97.0); auto tmp5 = tmp3 + tmp4; auto tmp6 = static_cast<float>(0.0); auto tmp7 = max_propagate_nan(tmp5, tmp6); auto tmp8 = static_cast<float>(255.0); auto tmp9 = min_propagate_nan(tmp7, tmp8); auto tmp10 = static_cast<unsigned char>(tmp9); out_ptr0[static_cast<long>(i1 + (3L*i2) + (3L*i0*(static_cast<long>(ks1*ks1))))] = tmp10; } } } } } ''') cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1 = async_compile.cpp(''' #include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h" extern "C" void kernel(const unsigned char* in_ptr0, float* out_ptr0, unsigned char* out_ptr1, const long ks0, const long ks1) { { #pragma GCC ivdep for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L)) { for(long i1=static_cast<long>(0L); 
i1<static_cast<long>(16L); i1+=static_cast<long>(16L)) { { #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={at::vec::Vectorized<float>(0)}) float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); for(long i2=static_cast<long>(0L); i2<static_cast<long>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L)))); i2+=static_cast<long>(1L)) { auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i1 + (16L*i0) + (16L*i2) + (16L*i0*(static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L))))) + (32L*i0*(at::native::div_floor_integer(ks1, 2L))))); auto tmp1 = at::vec::convert_uint8_to_float(tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.010429476387798786)); auto tmp5 = tmp3 * tmp4; tmp_acc0_vec = tmp_acc0_vec + tmp5; } tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (16L*i0))); } } } } { #pragma GCC ivdep for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L)) { auto tmp0 = out_ptr0[static_cast<long>(i0)]; auto tmp1 = static_cast<float>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L)))); auto tmp2 = tmp0 / tmp1; auto tmp3 = static_cast<float>(168.09128392896545); auto tmp4 = decltype(tmp2)(tmp2 * tmp3); auto tmp5 = std::nearbyint(tmp4); auto tmp6 = static_cast<float>(0.0); auto tmp7 = tmp5 + tmp6; auto tmp8 = max_propagate_nan(tmp7, tmp6); auto tmp9 = static_cast<float>(255.0); auto tmp10 = min_propagate_nan(tmp8, tmp9); auto tmp11 = static_cast<unsigned char>(tmp10); out_ptr1[static_cast<long>(i0)] = tmp11; } } } ''') cpp_fused_dequantize_per_tensor_2 = 
async_compile.cpp(''' #include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h" extern "C" void kernel(const unsigned char* in_ptr0, float* out_ptr0, const long ks0) { { for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i0)); auto tmp1 = at::vec::convert_uint8_to_float(tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.0056716203689575195)); auto tmp5 = tmp3 * tmp4; tmp5.store(out_ptr0 + static_cast<long>(i0)); } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg8_1, arg9_1, arg10_1 = args args.clear() s0 = arg8_1 s2 = arg9_1 assert_size_stride(arg10_1, (s0, 3, s2, s2), (3*(s2*s2), s2*s2, s2, 1)) buf0 = empty_strided((s0, 3, s2, s2), (3*(s2*s2), 1, 3*s2, 3), device='cpu', dtype=torch.uint8) cpp_fused_quantize_per_tensor_0(c_void_p(arg10_1.data_ptr()), c_void_p(buf0.data_ptr()), c_long(s0), c_long(s2)) del arg10_1 buf1 = torch.ops.onednn.qconv2d_pointwise(buf0, 0.024776775389909744, 97, constant5, constant2, constant3, constant0, [1, 1], [1, 1], [1, 1], 1, 95.88209060714476, 0, False, 'relu', [], '') assert_size_stride(buf1, (s0, 16, 1 + s2, 1 + s2), (16 + (16*(s2*s2)) + (32*s2), 1, 16 + (16*s2), 16)) del buf0 # Source Nodes: [quantize_per_tensor_default_2], Original ATen: [quantized_decomposed.quantize_per_tensor] buf2 = torch.ops.quantized.max_pool2d(buf1, [3, 3], [2, 2], [1, 1], [1, 1], False) del buf1 buf3 = buf2 assert_size_stride(buf3, (s0, 16, 1 + (s2 // 2), 1 + (s2 // 2)), (16 + (16*((s2 // 2)*(s2 // 2))) + (32*(s2 // 2)), 1, 16 + (16*(s2 // 2)), 16)) del buf2 buf4 = empty_strided((s0, 16, 1, 1), (16, 1, 16*s0, 16*s0), device='cpu', dtype=torch.float32) buf5 = empty_strided((s0, 16), (16, 1), device='cpu', dtype=torch.uint8) 
cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1(c_void_p(buf3.data_ptr()), c_void_p(buf4.data_ptr()), c_void_p(buf5.data_ptr()), c_long(s0), c_long(s2)) del buf3 buf6 = torch.ops.onednn.qlinear_pointwise(buf5, 0.005949148442596197, 0, constant6, constant4, constant3, constant1, 176.31645543014483, 100, False, 'none', [], '') assert_size_stride(buf6, (s0, 16), (16, 1)) del buf5 buf7 = reinterpret_tensor(buf4, (s0, 16), (16, 1)); del buf4 # reuse cpp_fused_dequantize_per_tensor_2(c_void_p(buf6.data_ptr()), c_void_p(buf7.data_ptr()), c_long(s0)) return (buf7, ) ``` TestPlan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_maxpool2d_linear_dynamic ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/108550 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-09-19 08:30:16 +00:00
Jez Ng	7f3885137f	Add meta function for _segment_reduce (#109359 ) This fixes numerous tests which were xfailing. For instance, the `_segment_reduce.lengths` OpInfo test, which was previously relying on the fallback kernel to determine the shape of the meta tensor. The fallback kernel would fail with `segment_reduce(): Expected all rows of lengths along axis to sum to data.size(lengths.dim()-1) when !unsafe.` as it was trying to read the values of a meta tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109359 Approved by: https://github.com/ezyang	2023-09-16 13:31:03 +00:00
PyTorch MergeBot	be9f73f031	Revert "Add meta and OpInfo for _embedding_bag_dense_backward (#109211 )" This reverts commit `fe14e43d14`. Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing `492a93d185` https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918))	2023-09-14 22:29:12 +00:00
Edward Z. Yang	fe14e43d14	Add meta and OpInfo for _embedding_bag_dense_backward (#109211 ) The sample inputs are a bit involved because there are a lot of shenanigans in the derivative formula. Check comments. This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211 Approved by: https://github.com/albanD, https://github.com/zou3519	2023-09-14 18:49:32 +00:00
drisspg	ad90ab31f2	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the changed files are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py (this has been incorporated upstream into the flash-attention GitHub repository) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts. 
### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` \| https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake - [x] Let multi_query/attention pass through and test \| UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update tests to exercise the above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFlops reported here are on an A100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-13 13:59:05 +00:00
Jez Ng	063a62622b	Add memory overlap check to `meta_copy_` (#108989 ) Fixes `test_copy_many_to_one`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108989 Approved by: https://github.com/eellison	2023-09-12 23:28:14 +00:00
Peter Bell	464f9c3725	[meta] Add meta implementation for aten.masked_scatter (#108802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108802 Approved by: https://github.com/lezcano	2023-09-12 16:16:05 +00:00
Li-Huai (Allan) Lin	b2cba439b4	Introduce Tensor overload to linspace and logspace (#104889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889 Approved by: https://github.com/zou3519 ghstack dependencies: #107958	2023-09-11 23:30:40 +00:00
PyTorch MergeBot	a7f5abeade	Revert "Introduce Tensor overload to linspace and logspace (#104889 )" This reverts commit `57e5239321`. Reverted https://github.com/pytorch/pytorch/pull/104889 on behalf of https://github.com/clee2000 due to sorry have to revert this to revert https://github.com/pytorch/pytorch/pull/107958 ([comment](https://github.com/pytorch/pytorch/pull/104889#issuecomment-1714305768))	2023-09-11 17:33:48 +00:00
Li-Huai (Allan) Lin	57e5239321	Introduce Tensor overload to linspace and logspace (#104889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889 Approved by: https://github.com/zou3519 ghstack dependencies: #107958	2023-09-11 15:29:39 +00:00
Huy Do	a9c663c269	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 07:43:04 +00:00
PyTorch MergeBot	e45b290127	Revert "Revert "Flash Attention v2 (#105602 )" (#108827 )" This reverts commit `24e9bbe22a`. Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))	2023-09-08 03:25:45 +00:00

455 Commits