Commit Graph

70022 Commits

David Berard
df1e855313 [fake_impls] fix max_seqlen return values in efficient_attention_forward (#120842)
To match the actual implementation, we should return max_seqlen_q/k, not M and N, in the sparse case:

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L981-L996)

Note that although the .cu file sets max_seqlen_k = 0 in the sparse case, it actually returns max_seqlen_k or N:

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L1224-L1231)

Tests: added in the next PR (#102839, which also fixes other parts of the test_fake tests so that we can un-xfail them and actually run them)
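For reference, a minimal sketch of the intended behavior (illustrative only, not the actual fake_impls code):

```python
# Illustrative sketch: the fake impl should mirror the kernel and report
# max_seqlen_q / max_seqlen_k in the sparse (varlen) case, and M / N otherwise.
def _fake_max_seqlens(M, N, cu_seqlens_q, max_seqlen_q, max_seqlen_k):
    if cu_seqlens_q is not None:        # sparse / varlen case
        return max_seqlen_q, max_seqlen_k
    return M, N                         # dense case
```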
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120842
Approved by: https://github.com/YuqingJ
ghstack dependencies: #120682
2024-02-29 07:12:27 +00:00
eqy
d1d50d2e4c [Inductor][cuDNN] Disable tf32 in test_mutate_base_for_conv_output (#120867)
Looks like there is a sum(?) comparison where TF32 may not provide the necessary accuracy, leading to failures on sm86.

CC @Skylion007 , hopefully this unblocks #120642
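For context, disabling TF32 in a test usually boils down to toggles like the following (the test suite typically applies them via a decorator rather than directly):

```python
import torch

# Force full fp32 precision for matmuls/convolutions so small numerical
# differences don't fail tight comparisons on Ampere-class GPUs (e.g. sm86).
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```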

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120867
Approved by: https://github.com/Skylion007
2024-02-29 06:59:32 +00:00
cyy
8a42cff7b1 [DeviceIndex][7/N] Use DeviceIndex in XPU (#120576)
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120576
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2024-02-29 05:54:23 +00:00
Oleg Khabinov
4b18ab869f [torch.export] Support is_compiling() flag for non-strict mode (#119602)
Summary: In non-strict mode of torch.export(), we didn't set the `is_compiling()` flags to `True`, which some models need.
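A sketch of the kind of model-side check this affects (using `torch._dynamo.is_compiling()` here; the exact flags touched are the ones referenced in the diff):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        if torch._dynamo.is_compiling():
            return x + 1   # path some models take while being compiled/exported
        return x - 1       # eager path

# With this change, non-strict torch.export() also sees is_compiling() == True,
# matching the strict-mode behavior.
```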

Test Plan: Unit tests and manual testing.

Differential Revision: D53624452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
2024-02-29 05:52:51 +00:00
Adnan Akhundov
0a46102b37 Add equal_to_1 to triton_meta for user-written Triton kernels (#120579)
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel in the Inductor codegen is compiled without `equal_to_1` specialization.

Fixes #120478. The repro from the issue, on A100:

Before this PR:

```
Triton matmul:           0.0167 seconds
Triton matmul compiled:  0.0751 seconds
```

After this PR:

```
Triton matmul:           0.0168 seconds
Triton matmul compiled:  0.0072 seconds
```

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k  test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
2024-02-29 05:19:39 +00:00
Shunting Zhang
4407138bf6 [inductor][eazy] fix a typo in test (#120832)
In theory we could test anything, but the test name mentions attention, so we should multiply by the inv_scale rather than divide by it. I guess that was the initial intention of the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120832
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-02-29 05:04:04 +00:00
Adnan Akhundov
2d17230212 [inductor] Do not reuse buffers across scopes in mem planning (#120777)
Summary: Previously, `memory_plan_reuse` assumed that the generated code is flat, in the sense that it cannot have nested scopes. However, with nested control flow being codegen-ed, this is no longer the case. This caused bugs where buffers were reused across visibility boundaries in different nested scopes.

In this PR, we add nested planning states in `memory_plan_reuse` on entering and exiting a scope in the codegen. This restricts buffer reuse to only the currently active (peak) scope / planning state.
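A simplified sketch of the idea, with made-up names (not the Inductor code):

```python
# Free buffers are tracked per scope, so a buffer released inside a nested
# scope is never handed out to an enclosing (or sibling) scope.
class PlanningState:
    def __init__(self):
        self.free_buffers = []

class MemoryPlanner:
    def __init__(self):
        self.scopes = [PlanningState()]   # outermost scope

    def enter_scope(self):
        self.scopes.append(PlanningState())

    def exit_scope(self):
        self.scopes.pop()                 # buffers freed in the nested scope are not reused outside it

    def allocate(self, name):
        state = self.scopes[-1]           # only the currently active (peak) scope is consulted
        return state.free_buffers.pop() if state.free_buffers else name

    def free(self, buf):
        self.scopes[-1].free_buffers.append(buf)
```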

Test Plan:

```
python test/inductor/test_control_flow.py -k test_subgraphs_with_parameters
...
----------------------------------------------------------------------
Ran 27 tests in 149.413s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120777
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #120665
2024-02-29 03:52:02 +00:00
Will Constable
f5b99976ad [C10D] Make _set_pg_timeout work with DeviceMesh PG (#120850)
Fixes #120847

Makes _set_pg_timeout work on both the NCCL and Gloo backends, instead of only one backend (Gloo), when both backends exist for the group.
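A hedged usage sketch (private API, requires an initialized distributed run; with this fix the new timeout applies to every backend the group carries):

```python
from datetime import timedelta

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Inside an initialized torchrun job:
mesh = init_device_mesh("cuda", (2,))
# After this PR, the new timeout is applied to both NCCL and Gloo when the
# DeviceMesh group wraps both backends.
dist.distributed_c10d._set_pg_timeout(timedelta(minutes=5), mesh.get_group())
```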

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120850
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-29 03:41:15 +00:00
PaliC
26d6ddc232 [bug burndown]Fix #119784 (#120846)
Addresses https://github.com/pytorch/pytorch/issues/119784. Interestingly, the tests seem to just pass (yay!). Tested locally that the failing set of tests passes using `PYTORCH_TEST_WITH_DYNAMO=1 pytest functorch/test_vmap.py -v`

Will wait for CI to pass first before bugging people for reviews.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120846
Approved by: https://github.com/Skylion007
2024-02-29 03:30:40 +00:00
Yifu Wang
fad228c7cc Fix a potential race condition in the test decorators for enabling/disabling native funcol (#120833)
Previously, we parametrized some tests to run with both the native and the Python funcol by flipping a global variable. However, some of these tests are multi-threaded, and this parametrization mechanism could lead to a race condition.

This PR changes the mechanism to use `mock.patch`, which is applied on a per-thread basis.
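A minimal sketch of the approach with stand-in names (the real flag lives in the funcol module):

```python
import types
from unittest import mock

# Hypothetical stand-in for the module-level switch the tests used to flip globally.
funcol_config = types.SimpleNamespace(use_native_funcol=False)

def run_test_with_native_funcol(test_fn):
    # The patch is scoped to this call (and hence to the calling thread),
    # instead of mutating a global that other test threads also read.
    with mock.patch.object(funcol_config, "use_native_funcol", True):
        test_fn()

run_test_with_native_funcol(lambda: print(funcol_config.use_native_funcol))  # True
print(funcol_config.use_native_funcol)                                       # back to False
```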

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833
Approved by: https://github.com/wconstab
2024-02-29 03:19:44 +00:00
youkaichao
2c0c70f763 [Dynamo] enumerate imported names for eval_frame.py (#120778)
Fixes https://github.com/pytorch/pytorch/issues/120699.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120778
Approved by: https://github.com/Skylion007
2024-02-29 03:08:43 +00:00
Shruthi GN
ef9e89984c [pytorch] Support output types that are non tensors (#120804)
Summary:
Per title. This is needed because some modules return None and other non-tensor values as output.

Test Plan: sandcastle?

Reviewed By: zhxchen17

Differential Revision: D54311609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120804
Approved by: https://github.com/zhxchen17
2024-02-29 02:49:10 +00:00
Adnan Akhundov
0dbef1618f [inductor] Apply fx passes recursively to nested subgraphs (#120665)
Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied to the outermost graph. This assumption was valid before #119759, but now there is control flow bringing (arbitrarily deeply) nested fx subgraphs to `compile_fx`.

In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph.

For the recursive application of the `pre_grad` passes (which require example inputs per subgraph), we don't pass example inputs for the nested subgraphs. A few different attempts to infer the latter via fake tensor prop have led to different side effects in the model. Therefore, we only apply to the nested subgraphs the subset of `pre_grad` passes that doesn't require example inputs.
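Conceptually (hypothetical helper, not the actual `compile_fx` code), the recursion looks like this:

```python
import torch.fx as fx

# Apply a pass to every nested GraphModule registered as a submodule, the way
# control-flow subgraphs are, and then to the enclosing module itself.
def apply_pass_recursively(gm: fx.GraphModule, fx_pass) -> None:
    for _, submod in gm.named_children():
        if isinstance(submod, fx.GraphModule):
            apply_pass_recursively(submod, fx_pass)
    fx_pass(gm)
```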

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 26 tests in 59.252s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665
Approved by: https://github.com/eellison
2024-02-29 02:34:54 +00:00
PyTorch MergeBot
db1cc781db Revert "[dynamo] Function => FunctionCtx for placeholder obj (#120577)"
This reverts commit ee01d0807b.

Reverted https://github.com/pytorch/pytorch/pull/120577 on behalf of https://github.com/jansel due to Causing breakages internally ([comment](https://github.com/pytorch/pytorch/pull/120577#issuecomment-1970254363))
2024-02-29 01:56:09 +00:00
Edward Z. Yang
b2e4b621cc Reduce create_env log level to DEBUG (#120772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120772
Approved by: https://github.com/albanD
2024-02-29 01:33:16 +00:00
Brian Hirsh
9e0631cc8a get CommsDebugMode to work with DTensor (#118769)
Tested with Wanchao's repro:
```
from typing import Tuple, List, Dict, cast
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, DTensor, Shard, Placement, Replicate

mesh = init_device_mesh(device_type="cuda", mesh_shape=(2,))
x = torch.randn(4, 8, requires_grad=True)
y = torch.randn(4, 32, requires_grad=True)
x_dtensor = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
y_dtensor = DTensor.from_local(y, mesh, [Shard(0)], run_check=False)
from torch.distributed._tensor.debug import CommDebugMode
comm_mode = CommDebugMode()
with comm_mode:
    z = torch.mm(x_dtensor, y_dtensor)
print(comm_mode.get_comm_counts())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118769
Approved by: https://github.com/wanchaol
2024-02-29 01:11:05 +00:00
Will Constable
381a7ad3f1 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
ghstack dependencies: #120724, #120270
2024-02-29 01:03:31 +00:00
Will Constable
f85d3a022c [C10D] Fix pointToPoint op Flight Recording (#120270)
Fix and test issues with both coalesced and individual send/recv ops

Considered an alternate approach and then ditched it
 - alternate approach: #119757
 - reason ditched: prefer recording individual collective events inside
   coalescing region instead of just the event at the end of the region,
   which also would not have tensor sizes or opnames without additional
   state variables added

Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
  recording when recording in workEnqueue.  Adding the info onto the
  Work obj would be possible, but adds to overhead of copying Works
  which we do on every collective. We can get info off the input/output
  tensors directly in initWork, but we don't want to keep refs to those
  tensors alive while the work is Enqueued, so we'd have to specifically
  copy size lists or something.

This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
initialized and enqueued. But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.

It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().

The testing uncovers some odd pre-existing behaviors and leaves them
alone for now. We could change some of these:
- seq starts off at 1, not 0, for the first op (but this is inconsistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
2024-02-29 01:03:31 +00:00
Will Constable
7f4d673885 [C10D] Add record_id to flight recorder (#120724)
In cases where the sequence number is shared between events (e.g. coalesced
collectives), we want to ensure a unique (and ordered) ID per record.

Note: the records are already in a list, so their ID could be observed
implicitly. But (1) it's a ring buffer, so the absolute ID is lost once the
buffer rolls over, and (2) users may sort, process, or filter their
flight records, so having the ID as an explicit member of an entry is
still useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
2024-02-29 01:03:31 +00:00
leslie-fang-intel
950b484356 skip three pyhpc models with dynamic shape test (#120599)
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state`, and `pyhpc_turbulent_kinetic_energy` fail with dynamic shape testing; we propose to skip the dynamic batch size testing of these 3 models in this PR.

* The error message is
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```

* The root cause is:
  *  The benchmark code only annotates an input dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If no dim equals the batch size, the above error is thrown.
  * However, for these 3 models, none of the input dims equals the input batch size, because of the [relationship of the dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16)):
  ```
    shape = (
        math.ceil(2 * size ** (1/3)),
        math.ceil(2 * size ** (1/3)),
        math.ceil(0.25 * size ** (1/3)),
    )
  ```
  * Another thing: `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing because the batch size is set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.

* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`; per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, this relationship is not currently expressible. So I think we need to skip the dynamic batch size testing for these 3 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-02-29 00:38:06 +00:00
Chien-Chin Huang
3179107629 [DDP][PT2D] Ignore gradient sync if the gradient is not defined (#120419)
From the test, accum_grad_hook can still fire even if the gradient is None. We need to skip the gradient sync in this case.
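A hedged sketch of the guard being added, in plain autograd terms (names here are illustrative):

```python
import torch

def make_accum_grad_hook(param, sync_fn):
    # sync_fn stands in for the gradient all-reduce; the hook bails out when the grad is undefined
    def hook(*unused):
        if param.grad is None:
            return                     # hook fired, but no gradient was produced: skip the sync
        sync_fn(param.grad)
    return hook

p = torch.nn.Parameter(torch.randn(4))
hook = make_accum_grad_hook(p, lambda g: print("syncing", tuple(g.shape)))
hook()                                 # no-op: p.grad is still None
p.grad = torch.ones_like(p)
hook()                                 # prints "syncing (4,)"
```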

Differential Revision: [D54076485](https://our.internmc.facebook.com/intern/diff/D54076485/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120419
Approved by: https://github.com/yf225, https://github.com/XilunWu
2024-02-29 00:27:54 +00:00
Pian Pawakapan
ab38354887 Allow str inputs in non-strict tracing (#120536)
Previously, torch.export in non-strict mode failed on str inputs while creating fake inputs for tracing (fakify()) and while using graph nodes to create constraints. This fixes those two stages to let strs pass through.

Failing test case:
```
class Foo(torch.nn.Module):
    def forward(self, a, b, mode):
        return torch.div(a, b, rounding_mode=mode)

foo = Foo()
inps = (torch.randn(4, 4), torch.randn(4), "trunc")
exported = export(foo, inps)
with self.assertRaisesRegex(
    RuntimeError, "to be equal to trunc, but got floor"
):
    _ = exported.module()(torch.randn(4, 4), torch.randn(4), "floor")
self.assertTrue(torch.allclose(exported.module()(*inps), foo(*inps)))
```

Before:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
E
======================================================================
ERROR: test_runtime_assert_for_prm_str_non_strict (__main__.NonStrictExportTestExport.test_runtime_assert_for_prm_str_non_strict)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pianpwk/Documents/pytorch/torch/testing/_internal/common_utils.py", line 2744, in wrapper
    method(*args, **kwargs)
  File "/Users/pianpwk/Documents/pytorch/test/export/testing.py", line 40, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/test/export/test_export.py", line 1588, in test_runtime_assert_for_prm_str
    exported = export(foo, inps)
               ^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/test/export/test_export_nonstrict.py", line 16, in mocked_non_strict_export
    return export(*args, **kwargs, strict=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/__init__.py", line 186, in export
    return _export(
           ^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 541, in wrapper
    raise e
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 527, in wrapper
    ep = fn(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/exported_program.py", line 83, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 707, in _export
    ) = make_fake_inputs(f, args, kwargs, constraints)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 133, in make_fake_inputs
    fake_args, fake_kwargs = tree_map_with_path(
                             ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in tree_map_with_path
    return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 734, in unflatten
    leaves = list(leaves)
             ^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in <genexpr>
    return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
                              ^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 134, in <lambda>
    lambda kp, val: fakify(fake_mode, kp, val, t_constraints, sources),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 68, in fakify
    raise ValueError("Only tensors allowed as input")
ValueError: Only tensors allowed as input

To execute this test, run the following from the base repo dir:
     python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str_non_strict

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.008s

FAILED (errors=1)
```

After:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
.
----------------------------------------------------------------------
Ran 1 test in 0.237s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120536
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/gmagogsfm
2024-02-28 23:56:30 +00:00
Aaron Orenstein
1b8bb027f6 Fix guard for SUPPORTED_NODES (#120798)
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.

This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module.  It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.

Also added a unit test which fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
2024-02-28 23:34:17 +00:00
Aaron Enye Shi
aa36821615 [Memory Snapshot] Stop clearing history when changing context (#120436)
Summary:
This change avoids clearing the memory event history when changing the context from `record_memory_history(context=None)` to `record_memory_history(context="python")`.

Now it continues recording memory events while the context is changed on the fly. Only `record_memory_history(enabled=None)` clears the history.
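A hedged usage sketch of the flow described above (private `torch.cuda.memory` helpers; signatures may differ across versions):

```python
import torch

torch.cuda.memory._record_memory_history(context=None, stacks="python")   # start recording, no context
# ... run a few iterations ...
torch.cuda.memory._record_memory_history(context="all", stacks="python")  # switch context on the fly; history is kept
# ... run a few more iterations ...
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)                    # only this clears the history
```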

Test Plan:
# Ran on the following local Resnet50 example:

- At iteration=0, record_memory_history(context=None, stacks="python")
- At iteration=3, record_memory_history(context="all", stacks="python")
- After iteration=4, export_memory_snapshot()

## Before:
 - Only collects the last 2 iterations with python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/86154532-9f73-4d10-9194-19e8c96ee4f3)

## After:
 - Collects all 5 iterations, where first 3 iterations have no call stacks, and last 2 iterations have python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/c2c277d6-b400-4da2-85c8-a7f119d409f8)
![image](https://github.com/pytorch/pytorch/assets/17602366/dc9da2f8-41cc-44b0-9c32-ec3cbe79d2c4)

Differential Revision: D54084017

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120436
Approved by: https://github.com/zdevito, https://github.com/leitian
2024-02-28 22:46:26 +00:00
PyTorch MergeBot
86ff31c4a0 Revert "Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)"
This reverts commit cabc09a5f2.

Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100))
2024-02-28 22:30:18 +00:00
PyTorch MergeBot
dbe0967a0a Revert "Add test to check that COW inputs are not materialized (#119507)"
This reverts commit 2ebf2c88ba.

Reverted https://github.com/pytorch/pytorch/pull/119507 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/119507#issuecomment-1970022840))
2024-02-28 22:26:59 +00:00
Eddie Yan
7e185277cd [cuDNN] bump cuDNN-frontend submodule to 1.1.2 (#120761)
Hopefully resolves additional `CUDNN_STATUS_SUCCESS` failures that we have been seeing on H100 (though curiously not on upstream CI, perhaps due to the different hardware being tested)

Need to confirm the fix on our end before merging

CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120761
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
2024-02-28 22:15:43 +00:00
Elias Ellison
9c9bde515c Factor out Submod compilers (#120527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120527
Approved by: https://github.com/kadeng
2024-02-28 22:11:47 +00:00
Edward Z. Yang
5b5bcf0470 Test that tlparse understands the structured logs we output (#120658)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120658
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #120712, #120289
2024-02-28 21:58:39 +00:00
David Berard
d6c202975c Move attention kernels from meta_registrations to fake_impls (#120682)
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.

This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them
* Wrap all the returned tensors in FakeTensors

Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
2024-02-28 21:49:13 +00:00
lancerts
50073248ed add a note wrt torch.nn.functional.scaled_dot_product_attention (#120668)
Follow-up change to https://github.com/pytorch/pytorch/pull/120565.

- Added a note in the transformer class pointing out that the mask definition is the opposite of that of :attr:`attn_mask` in torch.nn.functional.scaled_dot_product_attention.
@mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120668
Approved by: https://github.com/mikaylagawarecki
2024-02-28 21:16:34 +00:00
Shubhraprakash Das
e2ee87d48b Fix segfault on mac when running vulkan tests (#120337)
Summary: Vulkan gtests were segfaulting on Mac because the memory for barriers, created inside the local function CommandBuffer::insert_barrier, can get destroyed after that function exits. Since we hand this barrier pointer to the Vulkan library, it needs to stay alive even after the function exits; otherwise we crash.

Test Plan:
See that there is no segfault on Mac with the fix and the tests can run.

Compile the gtests:
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output

Crash without the diff:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (88 ms)
[ RUN      ] VulkanAPITest.copy_to_buffer
Segmentation fault: 11

With diff there is no crash:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (296 ms)
.....
[  FAILED  ] VulkanAPITest.gelu_quint8_self (23 ms)
[----------] 85 tests from VulkanAPITest (1494 ms total)

[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1494 ms total)
[  PASSED  ] 72 tests.
[  FAILED  ] 13 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large
[  FAILED  ] VulkanAPITest.gelu_qint8
[  FAILED  ] VulkanAPITest.gelu_qint8_self
[  FAILED  ] VulkanAPITest.gelu_quint8
[  FAILED  ] VulkanAPITest.gelu_quint8_self

The above failing tests were failing before as well and are being worked on.

Differential Revision: D54023146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120337
Approved by: https://github.com/SS-JIA
2024-02-28 20:55:47 +00:00
yuanx749
e317e39a02 Fix nonlinearity arg issue in RNN (#120234)
Fixes #114617

This PR fixes the issue with `nonlinearity` so that it can be passed as a positional or keyword argument.

Alternatively, if making `nonlinearity` kwarg-only is preferred, I can revert to another commit. cc @mikaylagawarecki
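For illustration, both spellings below should now construct the same two-layer ReLU RNN:

```python
import torch.nn as nn

# nonlinearity as the fourth positional argument (input_size, hidden_size, num_layers, nonlinearity) ...
rnn_pos = nn.RNN(10, 20, 2, "relu")
# ... and as a keyword argument
rnn_kw = nn.RNN(10, 20, num_layers=2, nonlinearity="relu")
```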
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120234
Approved by: https://github.com/mikaylagawarecki
2024-02-28 20:53:18 +00:00
Yanbo Liang
8b22fe9594 [FX passes] Set group/batch fusion log to DEBUG level (#120780)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120780
Approved by: https://github.com/jackiexu1992
2024-02-28 20:48:11 +00:00
PyTorch MergeBot
4903e33e19 Revert "Capture non tensor arguments in record_function (#120017)"
This reverts commit 5c5b71b6ee.

Reverted https://github.com/pytorch/pytorch/pull/120017 on behalf of https://github.com/soulitzer due to regresses perf on autograd Function when using profiler ([comment](https://github.com/pytorch/pytorch/pull/120017#issuecomment-1969883792))
2024-02-28 20:43:33 +00:00
Jason Ansel
01ec8df6d8 [Compiled Autograd] Introduce BackwardState capture (#120382)
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)

We do this by creating a BackwardState object that is used to register the hooks in the forward, and then populated by dynamo *after* the forward runs.
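An eager illustration of the hook pattern this enables under compilation, i.e. a hook that is both interior to the graph and dynamically generated (under torch.compile this is what BackwardState makes capturable; depending on configuration it may also require compiled autograd):

```python
import torch

def fn(x, log):
    y = x.sin()
    y.register_hook(lambda grad: log.append(grad.shape))  # interior, dynamically generated hook
    return y.cos()

log = []
out = fn(torch.randn(3, requires_grad=True), log)
out.sum().backward()
print(log)  # [torch.Size([3])] -- the hook fired during backward
```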

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
2024-02-28 20:36:47 +00:00
Will Constable
c016ffed5b [C10D] Fix logic for default group=None in _set_pg_timeout (#120686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120686
Approved by: https://github.com/yifuwang
2024-02-28 20:31:14 +00:00
Shengbao Zheng
11de40f82f [flight recorder] record process group configuration (#120262)
Summary: Record process group configuration (i.e. ranks involved in a process group) to facilitate NCCL related debugging.

Differential Revision: D53792087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120262
Approved by: https://github.com/shuqiangzhang
2024-02-28 20:31:08 +00:00
Hongtao Yu
5aa7f8646f [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120742)
Relanding https://github.com/pytorch/pytorch/pull/120639 + a fix to drop `matrix_instr_nonkdim` that does not align with `BLOCK_M` or `BLOCK_N`

Matrix multiplication with Triton is usually done in a tiled way. For a tile size too large for a single hardware instruction to handle, e.g. a 128x128 matmul, the operation has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, matrix_instr_nonkdim controls the shape of the mma instructions; its default value in Triton is 0, which means that by default Triton decomposes a large tiled matmul into a sequence of 32x32x8 mma instructions. Other mma instructions are available, such as 16x16x16, which requires matrix_instr_nonkdim=16. This change enables tuning this value for Gemm, which seems to improve its performance by 20% to 2x.

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120742
Approved by: https://github.com/xw285cornell
2024-02-28 20:27:14 +00:00
Scott Wolchok
b020ee5b05 [PyTorch Use MaybeOwned when promoting indices/offsets in embedding_bag (#120755)
We're currently doing two unnecessary reference count
operations in the case where promotion doesn't need to happen.

Differential Revision: [D54285999](https://our.internmc.facebook.com/intern/diff/D54285999/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120755
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #120752
2024-02-28 20:13:30 +00:00
Scott Wolchok
98d1529474 [PyTorch] fix mixed int32/int64 indices/offsets for embedding_bag_out (#120752)
This was an oversight in D27482738 (#55189) -- it only patched the regular embedding_bag operator, but static runtime uses the out variant.

Differential Revision: [D54285460](https://our.internmc.facebook.com/intern/diff/D54285460/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120752
Approved by: https://github.com/houseroad
2024-02-28 20:13:30 +00:00
Emmett Neyman
db92558229 [codemod][lowrisk] Fix deprecated use of 0/NULL (#120740)
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D54163060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120740
Approved by: https://github.com/Skylion007
2024-02-28 20:13:13 +00:00
Guilherme Leobas
491c2b4665 Let torch dynamo inline torch.func.grad (#118407)
When dynamo sees torch.func.grad, it tries to inline all frames related to it.
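A small example of the pattern dynamo now inlines instead of falling back on:

```python
import torch

def f(x):
    return (x ** 2).sum()

# torch.func.grad(f) is itself a Python function; dynamo now inlines its frames.
grad_f = torch.compile(torch.func.grad(f))
print(grad_f(torch.randn(4)))   # gradient of f, i.e. 2 * x
```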

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118407
Approved by: https://github.com/zou3519
2024-02-28 20:05:00 +00:00
Avik Chaudhuri
5472923998 derived dim (#118729)
With the current `Dim`-based dynamic shapes API for export, one can express that shapes of different input shapes must be equal by reusing the same `Dim`. However, non-trivial relationships between such input shapes cannot be expressed.

Recently we are seeing more and more examples of code that require this additional expressibility, e.g., where a pair of shapes might differ by one, or a shape might be double another (or simply even).

This PR introduces the concept of a "derived" `Dim`, i.e., a linear arithmetic expression over a `Dim`. By using a combination of `Dim`s and derived `Dim`s to specify input shapes, the desired relationships can be expressed naturally. E.g., a pair of shapes might be `dim` and `dim + 1`, or `dim` and `2*dim`, or even `2*dim` and `dim + 1`.

We extend the current infrastructure that translates `Dim`s to deprecated `dynamic_dim`-based constraints to work with derived `Dim`s. As usual, we raise constraint violation errors when shape guards cannot be verified given a dynamic shapes spec; suggest fixes; and raise runtime errors when future inputs violate the spec.

Importantly, some guards that used to cause forced specializations in the constraint solver because they were deemed "too complex" now do not do so, because they can now be specified as constraints. Since this was what motivated the introduction of a `disable_constraint_solver` flag to some internal APIs, we may not need that flag any more.

Note that shapes of placeholders in exported programs can now contain symbolic expressions and not just symbols.
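A minimal sketch of the new spelling (assuming the `torch.export` dynamic shapes API at the time of this PR):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]          # only valid when y.shape[0] == x.shape[0] + 1

dim = Dim("dim", min=2, max=128)
ep = export(
    M(),
    (torch.randn(7), torch.randn(8)),
    dynamic_shapes={"x": {0: dim}, "y": {0: dim + 1}},   # derived dim: y's dim 0 is x's dim 0 plus one
)
```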

Differential Revision: D53254587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118729
Approved by: https://github.com/ezyang
2024-02-28 19:48:32 +00:00
Adam J. Stewart
9c55aa6ff6 TransformerEncoder/Decoder: add type hints (#120550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120550
Approved by: https://github.com/mikaylagawarecki
2024-02-28 19:36:08 +00:00
drisspg
4b7a521856 Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)
# Summary
Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.5](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5).

The usual changes were then re-rolled on top of the updated kernel: changing how dropout is saved for backward, and removing the head_dim_pad, since that would make the kernel mutate in place, which has a bad interaction with functionalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935
Approved by: https://github.com/cpuhrsch
2024-02-28 19:31:15 +00:00
PyTorch MergeBot
a9d9077f12 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit 7c556428c7.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54286923 ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1969634480))
2024-02-28 18:57:09 +00:00
alanhe151220037
1c67f6cb26 fix decomposition of aten.diag_embed (#120549)
Fixes #117019
Makes inputs where one dim is negative and the other is nonnegative resolve correctly in the decomposition of `aten.diag_embed`.
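For example (the eager call below already worked; the fix makes the decomposition used under compilation produce the same result):

```python
import torch

x = torch.randn(2, 3)
# One negative and one nonnegative dim: dim1=-1 resolves to the last output dim, dim2=0 to the first.
out = torch.diag_embed(x, offset=0, dim1=-1, dim2=0)
print(out.shape)  # torch.Size([3, 2, 3])
```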
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120549
Approved by: https://github.com/Dalian991, https://github.com/janeyx99
2024-02-28 18:48:01 +00:00
Chien-Chin Huang
f422467ccb [BE]Delay the call to set_pytorch_distributed_envs_from_justknobs (#120625)
When the default process group is being initialized twice, `init_process_group` shows an explicit message indicating this.

However, with `set_pytorch_distributed_envs_from_justknobs` as the very first call in `init_process_group`, the error message becomes implicit and the root cause is hard to understand when testing with the FB code base.

Differential Revision: [D54206202](https://our.internmc.facebook.com/intern/diff/D54206202/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120625
Approved by: https://github.com/wconstab, https://github.com/yifuwang
2024-02-28 18:34:45 +00:00
andrewor14
91190d8087 [quant][pt2e] Relax model_is_exported input (#120720)
Summary: This commit relaxes the `model_is_exported` API to
work for `torch.nn.Module`s in addition to just
`torch.fx.GraphModule`s, simplifying downstream uses.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D54263935](https://our.internmc.facebook.com/intern/diff/D54263935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120720
Approved by: https://github.com/tugsbayasgalan
2024-02-28 18:32:03 +00:00