pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Richard Barnes	3705e65254	Add `pin_memory` to `torch.Tensor` type annotation args (#109797 ) Test Plan: Sandcastle Differential Revision: D49504528 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109797 Approved by: https://github.com/jianyuh	2023-09-26 17:12:37 +00:00
Zain Rizvi	1277d0e834	[BE] Add sharding data by default to metrics (#110035 ) Extend metric library to allow setting global metrics on a process level which will always be emitted. Current use case for them is to include shard information every time a metric is emitted by run_test.py <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at 0cae92c</samp> > _`run_test` refactored_ > _Sharding metrics in Rockset_ > _Autumn of testing_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035 Approved by: https://github.com/clee2000	2023-09-26 17:06:49 +00:00
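The process-level global metrics described in the commit above could be sketched roughly as follows. This is an illustration only, not the code from the PR: `add_global_metric`, `emit_metric`, and the in-memory `EMITTED` list are hypothetical stand-ins, and the shard values are made up.

```python
from typing import Any, Dict, List

# Hypothetical process-wide store: set once, attached to every later metric.
_GLOBAL_METRICS: Dict[str, Any] = {}

# Stand-in for the real upload backend so the sketch runs offline.
EMITTED: List[Dict[str, Any]] = []


def add_global_metric(name: str, value: Any) -> None:
    """Register a value that every later emit_metric call will include."""
    _GLOBAL_METRICS[name] = value


def emit_metric(metric_name: str, metrics: Dict[str, Any]) -> Dict[str, Any]:
    """Merge global metrics into the payload; per-call values win on key clashes."""
    payload = {**_GLOBAL_METRICS, **metrics, "metric_name": metric_name}
    EMITTED.append(payload)
    return payload


# Example: shard information is set once and then rides along with each metric.
add_global_metric("shard", 2)
add_global_metric("num_shards", 5)
record = emit_metric("test_duration", {"test": "test_ops", "seconds": 12.5})
```

Keeping the globals in one dictionary means callers such as a test runner only have to register shard details once per process.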
Zain Rizvi	5dcee01c2b	Monitor baseline for TD prioritizations (#110031 ) For tests that TD prioritizes, we should track what their ordering _would have been_ if none of the TD heuristics had applied to it. This is useful for two reasons: 1. It lets us better understand TD may have contributed to that test running sooner 2. it's possible that heuristics actually mark a test as less important than the default sorting would have claimed (the default sorts tests in a fixed order). This will let us track how often that happens Pull Request resolved: https://github.com/pytorch/pytorch/pull/110031 Approved by: https://github.com/clee2000	2023-09-26 04:27:16 +00:00
wangxiyuan	5589b81173	Remove redundant change for gloo (#106750 ) HIP deprecated symbols are removed by `d74270ece2` and `fe2ad9c328` which is included in pytorch gloo already. gloo in pytorch master: `597accfd79` There is no need to fix it in pytorch now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106750 Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980	2023-09-26 03:46:14 +00:00
PyTorch MergeBot	83deaa16ed	Revert "[1/N] Cleanup header inclusions in torch_cpu by iwyu (#101178 )" This reverts commit `b7a95f4fdb`. Reverted https://github.com/pytorch/pytorch/pull/101178 on behalf of https://github.com/atalman due to Break internal CI ([comment](https://github.com/pytorch/pytorch/pull/101178#issuecomment-1734384645))	2023-09-25 20:05:25 +00:00
Zain Rizvi	d6cc3ac8b2	Add PR number to metrics when available (#109406 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 780bfa6</samp> Add a new metric for pull request number in `tools/stats/upload_metrics.py`. This allows tracking the CI performance of pull requests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109406 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/clee2000	2023-09-25 19:57:34 +00:00
cyy	b7a95f4fdb	[1/N] Cleanup header inclusions in torch_cpu by iwyu (#101178 ) Following our previous IWYU work #100304 on C10, it makes more sense to try IWYU on torch_cpu. This PR does exactly that. Meanwhile, it fixes issue #48684. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101178 Approved by: https://github.com/ezyang	2023-09-24 05:01:20 +00:00
lezcano	835c18e7ea	Avoid saving `self` for mean.backward (#109935 ) Fixes https://github.com/pytorch/pytorch/issues/109876 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109935 Approved by: https://github.com/soulitzer	2023-09-23 11:50:54 +00:00
Oguz Ulgen	1df14f1bf8	Move has_triton to top level triton utils so that dynamo can also access (#109832 ) it without creating cyclic dependencies Pull Request resolved: https://github.com/pytorch/pytorch/pull/109832 Approved by: https://github.com/zou3519	2023-09-22 19:33:41 +00:00
Randolf Scholz	c6b9481c15	Update type hint for `Tensor.__getitem__`. (#109531 ) Better type-hint that's similar in spirit to `numpy.ndarray.__getitem__`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109531 Approved by: https://github.com/ezyang	2023-09-21 18:19:38 +00:00
PyTorch MergeBot	b1f1b39feb	Revert "Add PR number to metrics when available (#109406 )" This reverts commit `5e19216a6e`. Reverted https://github.com/pytorch/pytorch/pull/109406 on behalf of https://github.com/atalman due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/109406#issuecomment-1730049340))	2023-09-21 17:59:12 +00:00
hauntsaninja	2cd0b94533	Hide __getattr__ from type checkers (#109683 ) Visibility of this causes type checkers to conservatively assume that all attributes are defined on torch module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109683 Approved by: https://github.com/ngimel, https://github.com/ezyang, https://github.com/malfet	2023-09-21 17:01:23 +00:00
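One common way to hide a module-level `__getattr__` (PEP 562) from type checkers is to guard it with `typing.TYPE_CHECKING`. The sketch below assumes that approach; the `demo` module and its `lazy_value` attribute are hypothetical and exist only to make the example runnable.

```python
import types

# Source of a hypothetical module that defines a lazy module-level __getattr__
# but keeps it out of sight of static type checkers.
MODULE_SOURCE = '''
from typing import TYPE_CHECKING

if not TYPE_CHECKING:
    # Type checkers treat TYPE_CHECKING as True, so they never see this
    # function and keep reporting unknown attributes as errors.
    def __getattr__(name):
        if name == "lazy_value":
            return 42
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
'''

# Build the module in memory instead of importing it from disk.
demo = types.ModuleType("demo")
exec(MODULE_SOURCE, demo.__dict__)

lazy = demo.lazy_value                 # resolved at runtime by __getattr__
has_missing = hasattr(demo, "missing")  # __getattr__ raises AttributeError
```

At runtime the fallback still works, while a type checker analysing the module source sees no `__getattr__` and can flag misspelled attribute names.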
Zain Rizvi	5e19216a6e	Add PR number to metrics when available (#109406 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 780bfa6</samp> Add a new metric for pull request number in `tools/stats/upload_metrics.py`. This allows tracking the CI performance of pull requests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109406 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/clee2000	2023-09-21 16:47:05 +00:00
PyTorch MergeBot	a399f839ac	Revert "Add PR number to metrics when available (#109406 )" This reverts commit `f0fb4b3897`. Reverted https://github.com/pytorch/pytorch/pull/109406 on behalf of https://github.com/ZainRizvi due to breaks trunk ([comment](https://github.com/pytorch/pytorch/pull/109406#issuecomment-1724061024))	2023-09-18 17:35:37 +00:00
Nikita Shulga	d2ca5fa6c5	[lintrunner] Capture mypy internal error (#109421 ) Mypy internal errors are reported to stderr rather than stdout and do not contain a column number. This should prevent internal errors from creeping into the code and occluding other legitimate errors. Test plan: Check out `5cd861fcf7`, apply this change, and see the `lintrunner` run report the internal error. Fixes https://github.com/pytorch/pytorch/issues/104940 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109421 Approved by: https://github.com/Skylion007	2023-09-18 15:48:14 +00:00
Zain Rizvi	f0fb4b3897	Add PR number to metrics when available (#109406 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 780bfa6</samp> Add a new metric for pull request number in `tools/stats/upload_metrics.py`. This allows tracking the CI performance of pull requests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109406 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/clee2000	2023-09-18 03:17:54 +00:00
Aaron Gokaslan	6d725e7d66	[BE]: enable ruff rules PLR1722 and PLW3301 (#109461 ) Enables two ruff rules derived from pylint: * PLR1722 replaces any exit() calls with sys.exit(). exit() is only designed to be used in REPL contexts, as it may not always be imported by default. This always uses the version in the sys module, which is better. * PLW3301 replaces nested min / max calls with simplified versions (i.e. `min(a, min(b, c))` => `min(a, b, c)`). The new version is more idiomatic and more efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109461 Approved by: https://github.com/ezyang	2023-09-18 02:07:21 +00:00
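The two rewrites named in the commit above can be illustrated with a small sketch. The helper names `nested_min`, `flat_min`, and `exit_with` are made up for this example.

```python
import sys


def nested_min(a: int, b: int, c: int) -> int:
    # The form PLW3301 flags: a min() call nested inside another min().
    return min(a, min(b, c))


def flat_min(a: int, b: int, c: int) -> int:
    # The flattened form: one call, same result.
    return min(a, b, c)


def exit_with(code: int) -> None:
    # PLR1722: call sys.exit() instead of the exit() helper, which is
    # installed by the site module and intended for interactive use.
    sys.exit(code)


# sys.exit() raises SystemExit, so the exit code can be observed in a test.
try:
    exit_with(3)
except SystemExit as exc:
    exit_code = exc.code
```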
cyy	75b954b715	[4/N] Enable clang-tidy in torch/csrc/autograd (#109455 ) The PR enables clang-tidy checks in torch/csrc/autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109455 Approved by: https://github.com/Skylion007	2023-09-17 17:11:50 +00:00
drisspg	b275a902d3	Small type hint fix (#109414 ) # Summary Adds these types to the type hint list for better IDE experience Pull Request resolved: https://github.com/pytorch/pytorch/pull/109414 Approved by: https://github.com/Skylion007	2023-09-16 18:46:46 +00:00
cyy	7bce7f50f3	Add torchgen path in gen_vulkan_spy (#108980 ) Fixes the CMake building error ``` from torchgen.code_template import CodeTemplate ModuleNotFoundError: No module named 'torchgen' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/108980 Approved by: https://github.com/ezyang	2023-09-16 04:09:56 +00:00
Zain Rizvi	28169193b4	[TD] Improve heuristic metrics collection (#109305 ) Fixes a bug with heuristic metrics collection where the metrics would sometimes inaccurately claim a heuristic to have ranked a test more highly than any other heuristic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109305 Approved by: https://github.com/clee2000	2023-09-14 22:20:34 +00:00
Andrei Gheorghe	00908475e6	Use global variables to register the return_types namedtuples (#108832 ) Fixes #69221. Builds on top of #107000, fixing the buck build issue linked [here](https://github.com/pytorch/pytorch/pull/107000#issuecomment-1708857375). Pull Request resolved: https://github.com/pytorch/pytorch/pull/108832 Approved by: https://github.com/zou3519	2023-09-13 17:42:46 +00:00
drisspg	ad90ab31f2	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That being said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.
### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` | https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake - [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update tests to exercise the above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFLOPs reported here are on an A100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-13 13:59:05 +00:00
PyTorch MergeBot	5a7c008b30	Revert "[ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141 )" This reverts commit `8ff00360a4`. Reverted https://github.com/pytorch/pytorch/pull/105141 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/105141#issuecomment-1715629007))	2023-09-12 12:29:55 +00:00
Jack Taylor	8ff00360a4	[ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141 ) Follows from previous enablement attempt: https://github.com/pytorch/pytorch/pull/101797 Adds support for hsaco binaries in inductor's cpp_wrapper codegen and enables the CUDA tests in test_cpp_wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105141 Approved by: https://github.com/jansel	2023-09-09 16:28:56 +00:00
Huy Do	a9c663c269	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 07:43:04 +00:00
PyTorch MergeBot	e45b290127	Revert "Revert "Flash Attention v2 (#105602 )" (#108827 )" This reverts commit `24e9bbe22a`. Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))	2023-09-08 03:25:45 +00:00
Huy Do	24e9bbe22a	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 02:54:20 +00:00
PyTorch MergeBot	27d5dcf589	Revert "Use global variables to register the return_types namedtuples (#107000 )" This reverts commit `ae8eb7a3f9`. Reverted https://github.com/pytorch/pytorch/pull/107000 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing internal build ([comment](https://github.com/pytorch/pytorch/pull/107000#issuecomment-1708862325))	2023-09-06 18:13:23 +00:00
Andrei Gheorghe	ae8eb7a3f9	Use global variables to register the return_types namedtuples (#107000 ) Fixes #69221 @pytorchbot label "topic: not user facing" Pull Request resolved: https://github.com/pytorch/pytorch/pull/107000 Approved by: https://github.com/zou3519	2023-09-05 20:00:29 +00:00
cyy	efc7c366f4	Remove auto_gil.h (#108492 ) auto_gil.h has been deprecated for a long time. We can switch to pybind11. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108492 Approved by: https://github.com/Skylion007	2023-09-05 08:26:13 +00:00
Zhicheng Yan	01b662bafe	[gen_operators_yaml] add arguments to control include_all_overloads (#108396 ) Summary: In SelectiveBuildOperator, we can specify argument `include_all_overloads`. If True, all overloaded operators (for example, `aten::to.dtype_layout` and `aten::to.prim_Device` are considered overloaded operators of `aten::to`) will be built and linked into the final binary. This can significantly increase the final binary size, which could be a deal breaker for on-device deployment. In this diff, we make backward-compatible changes to add new arguments `--not-include-all-overloads-static-root-ops` and `--not-include-all-overloads-closure-ops`. When they are set, we set the `include_all_overloads` flag to False for static root ops and closure ops, and rely on the code analyzer to decide which overloaded operators are actually used. Test Plan: - unit test ``` buck test //xplat/caffe2/tools:gen_operators_yaml_test ``` - See test plan in D48771544 where we reduce the shared lib file `libmrengine.lib` from 16653072 bytes to 13686032 bytes. - See detailed document: https://fburl.com/gdoc/mc93h6kb Reviewed By: larryliu0820 Differential Revision: D48772302 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108396 Approved by: https://github.com/larryliu0820	2023-09-02 17:37:36 +00:00
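The two opt-out flags named in the commit above could be wired up along these lines. Only the flag names come from the commit message; `build_parser`, `overload_settings`, and the returned dictionary are hypothetical, and the default of `True` is an assumption based on the change being described as backward compatible.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of overload selection flags")
    # Both flags default to False, so existing invocations keep all overloads.
    parser.add_argument(
        "--not-include-all-overloads-static-root-ops",
        action="store_true",
        help="Do not force all overloads for static root ops",
    )
    parser.add_argument(
        "--not-include-all-overloads-closure-ops",
        action="store_true",
        help="Do not force all overloads for closure ops",
    )
    return parser


def overload_settings(argv: List[str]) -> Dict[str, bool]:
    """Map the negative CLI flags onto positive include_all_overloads values."""
    args = build_parser().parse_args(argv)
    return {
        "static_root_ops": not args.not_include_all_overloads_static_root_ops,
        "closure_ops": not args.not_include_all_overloads_closure_ops,
    }


defaults = overload_settings([])
trimmed = overload_settings(["--not-include-all-overloads-closure-ops"])
```

Expressing the options as negative flags keeps the old behavior when neither is passed, which is what makes the change safe for existing callers.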
drisspg	add45aea1c	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That being said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.
### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` | https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake - [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update tests to exercise the above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFLOPs reported here are on an A100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-01 22:14:44 +00:00
Jun Luo	8289ad8e5e	Support is_mtia attribute. (#108307 ) (#108310 ) Summary: FBGEMM uses `self.iter.is_cuda` to check if the tensor is for CUDA. This diff enables similar feature `self.iter.is_mtia` for tensors with MTIA device key. Test Plan: See diff D48693225 Reviewed By: jackm321 Differential Revision: D48809191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108310 Approved by: https://github.com/albanD	2023-09-01 01:25:40 +00:00
PyTorch MergeBot	d569e506ab	Revert "Flash Attention v2 (#105602 )" This reverts commit `9df3d882c8`. Reverted https://github.com/pytorch/pytorch/pull/105602 on behalf of https://github.com/huydhn due to I think we miss a case here for sm80 build on inductor workflow as it is now OOM on trunk https://github.com/pytorch/pytorch/actions/runs/6042843139 ([comment](https://github.com/pytorch/pytorch/pull/105602#issuecomment-1701974862))	2023-09-01 01:15:01 +00:00
Jirka Borovec	9178deedff	removing some redundant str splits (#106089 ) drop some redundant string splits, no factual changes, just cleaning the codebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/106089 Approved by: https://github.com/albanD, https://github.com/malfet	2023-09-01 00:22:58 +00:00
Zain Rizvi	5727b07ac6	TD: logging bugfix (#108288 ) Fix bug where logging metrics don't get emitted unless the 'keep-going' label is specified on the PR Also adds some extra logging to make debugging easier Pull Request resolved: https://github.com/pytorch/pytorch/pull/108288 Approved by: https://github.com/Skylion007	2023-08-31 16:51:49 +00:00
drisspg	9df3d882c8	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That being said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.
### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` | https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake - [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update tests to exercise the above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFLOPs reported here are on an A100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-08-31 16:02:20 +00:00
Zain Rizvi	238cc84af9	[TD] Emit metrics to compare heuristic quality (#108192 ) When a test fails, we will now emit fine grained details about how accurately heuristics predicted the relevance of that test. ## Context Why only look at failing tests? Our only signal that a PR is most likely relevant to a test is whether or not a test fails on it. Green tests don't tell us if the success was due to the code being good vs being irrelevant. This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation. ## What's measured? The metrics this PR collects are designed to answer the following questions ### How comprehensive are the heuristics? - What's the false negative rate, the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per-heuristic level) ### How precise are the heuristics? - What % of failed tests were prioritized by a given heuristic? What % was prioritized overall? - How relevant was a failed test considered to be? (Both at an aggregate level and at a per-heuristic level) - What % of the time was a given heuristic prioritizing a failing test higher than any other heuristic? Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192 Approved by: https://github.com/huydhn ghstack dependencies: #108117	2023-08-30 18:28:18 +00:00
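Two of the questions in the commit above, the false negative rate and the share of failures each heuristic prioritized, reduce to simple set arithmetic. The sketch below is a rough illustration under stated assumptions: `heuristic_report`, the metric key names, and the heuristic names `edited_files` and `previously_failed` are all hypothetical, not taken from the PR.

```python
from typing import Dict, List, Set


def heuristic_report(
    failed_tests: List[str], prioritized_by: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Percentages describing how well heuristics anticipated the failures."""
    total = len(failed_tests)
    if total == 0:
        return {}  # nothing failed, so there is no signal to measure

    report: Dict[str, float] = {}
    # Per heuristic: what % of failed tests did it prioritize?
    for name, tests in prioritized_by.items():
        hits = sum(1 for t in failed_tests if t in tests)
        report[f"pct_prioritized_by_{name}"] = 100.0 * hits / total

    # Aggregate: a failure is a false negative if no heuristic prioritized it.
    prioritized_by_any: Set[str] = set().union(*prioritized_by.values())
    missed = sum(1 for t in failed_tests if t not in prioritized_by_any)
    report["pct_prioritized_overall"] = 100.0 * (total - missed) / total
    report["false_negative_pct"] = 100.0 * missed / total
    return report


report = heuristic_report(
    ["test_a", "test_b", "test_c", "test_d"],
    {
        "edited_files": {"test_a", "test_b"},
        "previously_failed": {"test_b", "test_x"},
    },
)
```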
Zain Rizvi	620d267ef3	Refactor TestPrioritizations to support more priorities and reduce risk of accidental mutations (#108117 ) Refactor TD code to make it easier to add additional categories later and also support the changes required to enable the metrics needed for TD Pull Request resolved: https://github.com/pytorch/pytorch/pull/108117 Approved by: https://github.com/huydhn	2023-08-30 04:14:28 +00:00
Brian Hirsh	da54f3c519	reorder proxy / fake modes so they always run last (#104482 ) Update: Made refactor of the original PR. See the original description below, but here I'll describe the updates: (1) TLS changes in `TorchDispatchModeTLS.h/cpp`. I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now also contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack). `TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes. `TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes". `TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes). There were also some mild codegen changes to support the new enum (2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py` The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. 
`TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py`) now check if the current mode has a mode key, and if so they plumb it into any `push_onto_stack()` calls (which eventually instructs `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`. I also had to change both of these modes' enter/exit, to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly. (3) dispatching logic in `python_arg_parser.cpp` This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now: ``` (a) dispatch_on_mode() # try user modes first (where the mode stack automatically considers infra modes last) (b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses) (c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses) ``` Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this will work today: If proxy/fake modes run in step (a), they'll return NotImplemented if they see a user subclass, allowing us to redispatch to the user subclass. How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the `_mode_key`. (4) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).
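The push/pop behavior described for the mode TLS above can be modeled in a few lines of plain Python. This is a toy model, not the C++ implementation: `ModeKey`, `ModeState`, and their methods are hypothetical names, popping PROXY before FAKE is an assumption drawn from the ordering mentioned in the original description, and raising on an occupied infra slot is likewise an assumption.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ModeKey(Enum):
    PROXY = "proxy"
    FAKE = "fake"


class ModeState:
    """User modes live on a stack; infra modes live in one slot per key."""

    def __init__(self) -> None:
        self.user_stack: List[object] = []
        self.infra_modes: Dict[ModeKey, Optional[object]] = {k: None for k in ModeKey}

    def push(self, mode: object, mode_key: Optional[ModeKey] = None) -> None:
        if mode_key is None:
            self.user_stack.append(mode)
            return
        if self.infra_modes[mode_key] is not None:
            # Assumption: only one mode per infra key may be active at a time.
            raise RuntimeError(f"{mode_key.name} slot is already occupied")
        self.infra_modes[mode_key] = mode

    def pop(self) -> Tuple[object, Optional[ModeKey]]:
        # User modes are consulted first, so infra modes always run last.
        if self.user_stack:
            return self.user_stack.pop(), None
        for key in ModeKey:  # definition order: PROXY, then FAKE
            mode = self.infra_modes[key]
            if mode is not None:
                self.infra_modes[key] = None
                return mode, key  # the key tells the caller where to re-push
        raise IndexError("pop from empty mode state")

    def dispatch_mode_enabled(self, skip_infra_modes: bool = False) -> bool:
        if self.user_stack:
            return True
        if skip_infra_modes:
            return False
        return any(m is not None for m in self.infra_modes.values())


state = ModeState()
state.push("fake_mode", ModeKey.FAKE)
state.push("user_logging_mode")
state.push("proxy_mode", ModeKey.PROXY)
only_infra_after_pop = []
popped = [state.pop()]
only_infra_after_pop.append(state.dispatch_mode_enabled(skip_infra_modes=True))
popped += [state.pop(), state.pop()]
```

Regardless of the order in which the three modes were pushed, the user mode comes off first and the infra modes follow, which is the property the reordering is meant to guarantee.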
----- original description below ----- The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run after any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that. (1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes. I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, including fake or proxy, if one is set. This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler. (2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser() This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs. 
(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` API's Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new API's to directly check if either proxy or fake modes are active. (4) Allow traceable tensor subclasses to access storages from python This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to python if the tensor subclass that you are passing in is "traceable" (5) Fixed subclass fakeification @wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct. (6) make tensor subclasses resizeable Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`. (7) Added a basic DoubleTensor subclass for testing I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482 Approved by: https://github.com/ezyang ghstack dependencies: #107415	2023-08-29 02:36:48 +00:00
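The dispatch ordering described in #104482 (user modes, then infra modes, then user subclasses, then infra subclasses) can be modeled with a small toy in plain Python. This is only a sketch of the ordering rule, not the logic in `python_arg_parser.cpp`: the `Mode` and `Arg` classes, the `is_infra` flag (standing in for the `_mode_key` attribute), and the `dispatch` function are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mode:
    """Hypothetical stand-in for a torch_dispatch mode."""
    name: str
    is_infra: bool = False  # models "has a _mode_key" (e.g. fake/proxy)


@dataclass
class Arg:
    """Hypothetical stand-in for a tensor argument."""
    name: str
    is_subclass: bool = False
    is_infra: bool = False  # models an infra subclass such as FakeTensor


def dispatch(user_modes: List[Mode], infra_modes: List[Mode],
             args: List[Arg]) -> Optional[str]:
    """Return a label for whichever handler would run first in this toy model."""
    has_user_subclass = any(a.is_subclass and not a.is_infra for a in args)

    # (a) Modes: user modes (innermost first), with infra modes considered last.
    for mode in list(reversed(user_modes)) + list(infra_modes):
        if mode.is_infra and has_user_subclass:
            # Models an infra mode returning NotImplemented so that the
            # user subclass gets a chance to handle the call.
            continue
        return f"mode:{mode.name}"

    # (b) User subclasses next, skipping infra subclasses.
    for arg in args:
        if arg.is_subclass and not arg.is_infra:
            return f"subclass:{arg.name}"

    # (c) Infra subclasses last, skipping user subclasses.
    for arg in args:
        if arg.is_subclass and arg.is_infra:
            return f"subclass:{arg.name}"

    return None  # nothing intercepts; the regular kernel would run
```

For example, with only a fake mode active and a user subclass among the arguments, `dispatch([], [Mode("fake", True)], [Arg("two", True)])` selects the user subclass rather than the fake mode, which is the behavior the commit message says should keep working.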
Pearu Peterson	fe3309b4b8	Add optional is_coalesced argument to sparse coo tensor factory function. (#107638 ) Resolves https://github.com/pytorch/pytorch/issues/107097 After this PR, instead of ```python torch.sparse_coo_tensor(indices, values, size)._coalesced_(is_coalesced) ``` (that does not work in the autograd context, see #107097), use ```python torch.sparse_coo_tensor(indices, values, size, is_coalesced=is_coalesced) ``` All sparse coo factory functions that take indices as input support the `is_coalesced` argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107638 Approved by: https://github.com/cpuhrsch	2023-08-26 07:24:29 +00:00
Zain Rizvi	36399d067a	Port existing heuristics to TD framework (#107071 ) This PR looks big, but it's mostly just refactorings with a bit of dead code deletion. Exceptions are: - Some metric emissions were changed to comply with the new TD format - Some logging changes - We now run tests in three batches (highly_relevant, probably_relevant, unranked_relevance) instead of the previous two (prioritized and general) Refactorings done: - Moves all test reordering code to the new TD framework - Refactors run_test.py to cleanly support multiple levels of test priorities - Deletes some dead code that was originally written for logging Pull Request resolved: https://github.com/pytorch/pytorch/pull/107071 Approved by: https://github.com/clee2000, https://github.com/huydhn	2023-08-23 21:23:23 +00:00
Aaron Gokaslan	660e8060ad	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings. I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems like there are no instances of it in our codebase, so I'm enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-22 23:16:38 +00:00
PyTorch MergeBot	d59a6864fb	Revert "[BE]: Update ruff to 0.285 (#107519 )" This reverts commit `88ab3e4322`. Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was probably accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))	2023-08-22 19:53:32 +00:00
Huy Do	a5f83245fd	Access ROCKSET_API_KEY from ephemeral runners (#107652 ) Hardening the access to ROCKSET_API_KEY by only using this key from ephemeral runners `ubuntu-22.04` Pull Request resolved: https://github.com/pytorch/pytorch/pull/107652 Approved by: https://github.com/clee2000	2023-08-22 17:02:44 +00:00
Zain Rizvi	5ddb8ef827	Make emit_metrics importable without having boto3 installed (#107070 ) Make it so that scripts can import and run the `emit_metrics` function even if they don't have boto3 installed, in which case it will still validate the inputs but skip the actual metric emission part. It's purely a refactor without any real logic changes. Motivation: So that run_test.py and the target determination code can use this library easily without worrying about whether it was imported or whether its dependencies are installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107070 Approved by: https://github.com/huydhn	2023-08-21 21:13:01 +00:00
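The pattern described in #107070 (validate inputs always, skip emission when boto3 is missing) is commonly written as a guarded import. The following is a minimal sketch under that assumption; `emit_metric_sketch` and `HAS_BOTO3` are hypothetical names, not the actual code in the PyTorch metrics library, and the upload step is deliberately left out.

```python
from typing import Any, Dict

# Guarded import: the module stays importable even when boto3 is absent.
try:
    import boto3  # noqa: F401
    HAS_BOTO3 = True
except ImportError:
    HAS_BOTO3 = False


def emit_metric_sketch(metric_name: str, metrics: Dict[str, Any]) -> bool:
    """Validate inputs, then report whether emission would be attempted.

    Hypothetical helper: returns False when boto3 is unavailable and the
    emission step is skipped, True when emission would go ahead.
    """
    # Validation runs regardless of whether boto3 is installed.
    if not metric_name:
        raise ValueError("metric_name must be a non-empty string")
    if not metrics:
        raise ValueError("metrics must contain at least one entry")

    if not HAS_BOTO3:
        # Skip the emission part, as the commit message describes.
        return False

    # A real implementation would upload the metrics here; omitted in this sketch.
    return True
```

Callers such as a test runner can then import the module unconditionally and still get input validation in environments without boto3.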
Pearu Peterson	a816aa785b	Implement autograd support for sparse compressed tensor constructors (#107384 ) Fixes https://github.com/pytorch/pytorch/issues/107126 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107384 Approved by: https://github.com/cpuhrsch ghstack dependencies: #107447	2023-08-21 20:26:39 +00:00
Pearu Peterson	d7c0c5de2d	Set crow_indices outputs as non-differentiable. (#107447 ) Fixes https://github.com/pytorch/pytorch/issues/107083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107447 Approved by: https://github.com/cpuhrsch	2023-08-21 19:52:32 +00:00
Catherine Lee	3b2c5d47c0	Use default build env and test config for test times (#107325 ) Redo of #107312 Pairs with https://github.com/pytorch/test-infra/pull/4476 If the build env and test config combo cannot be found in the test times, use the default. Then we don't have to manually change test-times.json whenever a new job is added or we update the jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107325 Approved by: https://github.com/huydhn	2023-08-21 18:39:55 +00:00
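The fallback described in #107325 amounts to a two-level lookup with a default at each level. A rough sketch follows, assuming the test-times data is a nested mapping of build env to test config to per-file times with a `"default"` key at both levels; that layout, the `select_test_times` name, and the sample data are assumptions for illustration, not taken from the PR.

```python
from typing import Dict

# Assumed shape: build env -> test config -> test file -> seconds
TestTimes = Dict[str, Dict[str, Dict[str, float]]]

DEFAULT_KEY = "default"  # assumed name of the fallback entry


def select_test_times(test_times: TestTimes, build_env: str,
                      test_config: str) -> Dict[str, float]:
    """Pick times for (build_env, test_config), falling back to defaults."""
    # Unknown build env: use the default build env's entry.
    env_times = test_times.get(build_env) or test_times.get(DEFAULT_KEY, {})
    # Unknown test config within that env: use its default config.
    return env_times.get(test_config) or env_times.get(DEFAULT_KEY, {})


# Made-up sample data for illustration only.
sample: TestTimes = {
    "linux-example-env": {"default": {"test_a": 10.0}, "slow": {"test_a": 30.0}},
    "default": {"default": {"test_a": 12.0}},
}
```

With this shape, a newly added job whose build env is not yet in the file still gets the default timings instead of requiring a manual edit.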

4669 Commits