pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Joel Schlosser	ae53510b9e	Fix setUpClass() / tearDownClass() for device-specific tests (#151129 ) Finishes up the work started in #121686 + adds test Update: this was not as straightforward as I originally imagined. Context below. TL;DR: `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not. Background: The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`. Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an empty class called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`. (1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests: `070f389745/test/test_ops.py (L218-L221)` (2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. 
As an example, the open device registration tests load a module during setup and use it in the test logic: `070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)` `070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)` To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA actually derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been set up with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc). Side note: I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](`4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)`) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here: `4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)` To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](`4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384)`)). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129 Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet	2025-04-16 02:18:42 +00:00
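The inheritance-plus-`delattr()` approach described in the commit message above can be illustrated with plain `unittest`. This is only a minimal sketch, not the actual PyTorch implementation: `instantiate_sketch` is a hypothetical, heavily simplified stand-in for `instantiate_device_type_tests()`, and it leaves out the `{CPU, CUDA}TestBase` classes, decorators, and dtype handling entirely.

```python
import unittest


def instantiate_sketch(generic_cls, scope, device_types=("cpu", "cuda")):
    """Hypothetical, simplified stand-in for instantiate_device_type_tests()."""
    # Collect the generic tests defined directly on the generic class.
    test_names = [n for n in vars(generic_cls) if n.startswith("test_")]
    for dev in device_types:
        name = f"{generic_cls.__name__}{dev.upper()}"
        # Derive from the generic class itself so that super() calls made in
        # generic_cls.setUpClass() resolve correctly for every device variant.
        device_cls = type(name, (generic_cls,), {"device_type": dev})
        for test_name in test_names:
            generic_test = vars(generic_cls)[test_name]

            # Bind the generic test and the device through default arguments.
            def device_test(self, _test=generic_test, _dev=dev):
                return _test(self, _dev)

            setattr(device_cls, f"{test_name}_{dev}", device_test)
        scope[name] = device_cls
    # Delete the generic tests so that only the device-specific variants
    # (e.g. test_bar_cuda) are discoverable on the derived classes.
    for test_name in test_names:
        delattr(generic_cls, test_name)
    # Remove the generic class from the scope used for test discovery.
    scope.pop(generic_cls.__name__, None)


class TestFoo(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()  # works because TestFooCUDA derives from TestFoo
        cls.resource = f"loaded for {cls.__name__}"  # stored on the device class

    def test_bar(self, device):
        self.assertEqual(self.resource, f"loaded for TestFoo{device.upper()}")


scope = {"TestFoo": TestFoo}
instantiate_sketch(TestFoo, scope)
```

Under these assumptions, `scope` ends up holding `TestFooCPU` and `TestFooCUDA`, each exposing only its device-specific test, while state stored on `cls` during `setUpClass()` lands on the device-specific class.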
PyTorch MergeBot	98b1e82ba8	Revert "Fix setUpClass() / tearDownClass() for device-specific tests (#151129 )" This reverts commit `bd4cf30e31`. Reverted https://github.com/pytorch/pytorch/pull/151129 on behalf of https://github.com/jbschlosser due to flex attention tests failing ([comment](https://github.com/pytorch/pytorch/pull/151129#issuecomment-2807632119))	2025-04-15 22:07:25 +00:00
Joel Schlosser	bd4cf30e31	Fix setUpClass() / tearDownClass() for device-specific tests (#151129 ) Finishes up the work started in #121686 + adds test Update: this was not as straightforward as I originally imagined. Context below. TL;DR: `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not. Background: The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`. Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an empty class called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`. (1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests: `070f389745/test/test_ops.py (L218-L221)` (2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. 
As an example, the open device registration tests load a module during setup and use it in the test logic: `070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)` `070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)` To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA actually derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been set up with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc). Side note: I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](`4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)`) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here: `4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)` To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](`4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384)`)). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129 Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet	2025-04-15 20:13:26 +00:00
Aidyn-A	96afa8a2bb	[TEST][SPARSE] Simplify branching in test_cusparselt_backend (#148318 ) Due to introduction of CUDA versions, the branching becomes more complicated. This PR is proposed to simplify branching in `test_cusparselt_backend` in order to avoid checking each and every CUDA version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148318 Approved by: https://github.com/jcaip	2025-03-05 10:17:00 +00:00
Aidyn-A	aac0577796	[TEST][Sparse] Force CUTLASS backend in TestSparseSemiStructuredCUTLASS (#146398 ) We have noticed a discrepancy between the ways `test_sparse_semi_structured.py` can be invoked. With some of them the test falsely fails, because it attempts to run on the wrong backend. This happens because `SparseSemiStructuredTensor._FORCE_CUTLASS = True` was never set in the setup of `TestSparseSemiStructuredCUTLASS` as it was in its `TestSparseSemiStructuredCUSPARSELT` counterpart `8444fe019a/test/test_sparse_semi_structured.py (L1039-L1046)` When I run the tests via pytest, by sheer luck it calls `test_values_backend_cutlass_cuda`, which sets the backend to CUTLASS `bb4bd5f00b/test/test_sparse_semi_structured.py (L475)`, before `test_conversions_all_patterns_cuda_`: ``` test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_values_backend_cutlass_cuda PASSED [0.0071s] [ 72%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_bfloat16 PASSED [0.0484s] [ 73%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_float16 PASSED [0.0041s] [ 73%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_int8 PASSED [0.0079s] [ 73%] ``` In this scenario everything is good. But when the tests are run as `python test/test_sparse_semi_structured.py -v -k cuda`, the order of the tests is different, and the cuSparseLt backend is set just before running `test_conversions_all_patterns_cuda_`, which causes failures: ``` test_cusparselt_backend_cuda (__main__.TestSparseSemiStructuredCUSPARSELTCUDA.test_cusparselt_backend_cuda) ... ok ... test_conversions_all_patterns_cuda_bfloat16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_bfloat16) ... FAIL test_conversions_all_patterns_cuda_float16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_float16) ... 
FAIL test_conversions_all_patterns_cuda_int8 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_int8) ... ERROR ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146398 Approved by: https://github.com/Skylion007, https://github.com/jcaip, https://github.com/eqy	2025-02-04 22:07:12 +00:00
Aleksandar Samardžić	2b00d211f0	Build RowwiseScaledMM.cu for SM89 (#145676 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145676 Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/eqy	2025-02-01 11:44:58 +00:00
Wei Wang	6bcb545d9c	[CI][CUDA][cuSPARSELt] cusparselt 0.6.3 and cu121 related cleanups (#145793 ) Make ci cusparselt installation be consistent with nightly binary Remove cu121 related docker build jobs and inductor runs Update test failures relating to cu121 Retry of https://github.com/pytorch/pytorch/pull/145696 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145793 Approved by: https://github.com/eqy, https://github.com/tinglvv	2025-01-28 21:01:58 +00:00
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
Nikita Shulga	38bbe37187	Enable CI on SM89 (#140305 ) Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376 To enable more balanced sharding, had to push `148ae19935` Added `@xfailIfSM89` to the following tests: - test_fp8_pattern_2 - test_original_aten_preserved_split_addmm - test_sparse_semi_structured_scaled_mm - test_sparse_semi_structured_scaled_mm_fp8 - test_sparse_fp8fp8_mm Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA` Skipped the following inductor tests (which either flakily OOM or time out): - test_reduction_fn_std_float64 - test_reduction_fn_var_mean_float64 - test_multi_output_unbacked_custom_op Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305 Approved by: https://github.com/wdvr, https://github.com/ZainRizvi	2024-12-03 04:49:46 +00:00
Jesse Cai	5accae4197	[sparse] add extra options to _cslt_spare_mm (#137427 ) Summary: Splitting this PR into two, one for the cuSPARSELt improvements, and one for the inductor lowering. This PR adds in the additional cuSPARSELt bindings into pytorch. * `torch._cslt_sparse_mm_search` will be deprecated in a future PR, so a warning has been added * Added a header file for cuSPARSELtOps.cpp * max_id is now available in `torch.backends.cusparselt` via `torch.backends.cusparselt.get_max_alg_id()` * fixed meta registrations for float8 Test Plan: python test/test_sparse_semi_structured.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427 Approved by: https://github.com/cpuhrsch, https://github.com/eqy	2024-11-27 05:32:45 +00:00
PyTorch MergeBot	5318bf8baf	Revert "[sparse] add extra options to _cslt_spare_mm (#137427 )" This reverts commit `f1451163ec`. Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to This looks like the test is still failing, plz do a rebase ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2499918590))	2024-11-26 08:01:24 +00:00
Jesse Cai	f1451163ec	[sparse] add extra options to _cslt_spare_mm (#137427 ) Summary: Splitting this PR into two, one for the cuSPARSELt improvements, and one for the inductor lowering. This PR adds in the additional cuSPARSELt bindings into pytorch. * `torch._cslt_sparse_mm_search` will be deprecated in a future PR, so a warning has been added * Added a header file for cuSPARSELtOps.cpp * max_id is now available in `torch.backends.cusparselt` via `torch.backends.cusparselt.get_max_alg_id()` * fixed meta registrations for float8 Test Plan: python test/test_sparse_semi_structured.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427 Approved by: https://github.com/cpuhrsch, https://github.com/eqy	2024-11-25 23:45:41 +00:00
PyTorch MergeBot	cc90ba8924	Revert "[sparse] add extra options to _cslt_spare_mm (#137427 )" This reverts commit `45b30a5aec`. Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_sparse_semi_structured is failing in trunk after it lands ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2494047577))	2024-11-22 15:40:21 +00:00
Jesse Cai	45b30a5aec	[sparse] add extra options to _cslt_spare_mm (#137427 ) Summary: Splitting this PR into two, one for the cuSPARSELt improvements, and one for the inductor lowering. This PR adds in the additional cuSPARSELt bindings into pytorch. * `torch._cslt_sparse_mm_search` will be deprecated in a future PR, so a warning has been added * Added a header file for cuSPARSELtOps.cpp * max_id is now available in `torch.backends.cusparselt` via `torch.backends.cusparselt.get_max_alg_id()` * fixed meta registrations for float8 Test Plan: python test/test_sparse_semi_structured.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427 Approved by: https://github.com/cpuhrsch, https://github.com/eqy	2024-11-21 23:37:36 +00:00
PyTorch MergeBot	8197e4c70d	Revert "[sparse] add search for optimal alg_id to torch.compile (#137427 )" This reverts commit `39bfba3f56`. Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/jcaip due to this PR breaks AO tests ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2435906592))	2024-10-24 17:27:06 +00:00
Jesse Cai	39bfba3f56	[sparse] add search for optimal alg_id to torch.compile (#137427 ) Summary: This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal alg_id and cache it when running with `torch.compile` Seeing speedups on both bfloat16 and float8 dtypes: <img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b"> <img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6"> * `torch._cslt_sparse_mm_search` has been modified to return optimal split-k parameters as well as max alg_id. * max_id is now available in `torch.backends.cusparselt` via `torch.backends.cusparselt.get_max_alg_id()` * fixed meta registrations for float8 Test Plan: python test/test_sparse_semi_structured.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427 Approved by: https://github.com/cpuhrsch	2024-10-22 22:39:42 +00:00
Jez Ng	71aac59e93	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-30 20:24:52 +00:00
Jesse Cai	bc21689136	[sparse][semi-structured] Add float8 dtype support to 24 sparsity (#136397 ) Summary: This PR adds `torch.float8_e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`. This lets users run fp8 + 2:4 sparse matmuls on Hopper GPUs with cusparselt >= 0.6.2, via the `scaled_mm` API. ``` A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16) B = torch.rand(dense_input_shape, device=device).to(torch.float16).t() A_fp8, A_scale = to_float8(A) B_fp8, B_scale = to_float8(B) dense_result = torch._scaled_mm( A_fp8, B_fp8, scale_a=A_scale, scale_b=B_scale, out_dtype=out_dtype ) A_fp8_sparse = to_sparse_semi_structured(A_fp8) sparse_result = torch._scaled_mm( A_fp8_sparse, B_fp8, scale_a=A_scale, scale_b=B_scale, out_dtype=out_dtype ) ``` Note that to keep this consistent with normal torch behavior, calling `torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError. I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the backend to make the tests a bit cleaner. Test Plan: ``` python test/test_sparse_semi_structured.py -k scaled_mm python test/test_sparse_semi_structured.py -k fp8 ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/136397 Approved by: https://github.com/drisspg	2024-09-27 21:37:34 +00:00
PyTorch MergeBot	36428f91e9	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit `31c0467594`. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))	2024-09-27 16:54:27 +00:00
Jez Ng	31c0467594	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-26 15:35:26 +00:00
PyTorch MergeBot	d0cebedb31	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit `e498b02b47`. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
Jez Ng	e498b02b47	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel	2024-09-14 21:45:19 +00:00
Jesse Cai	157de30f53	[sparse] Update cuSPARSELt to v0.6.2 (#134022 ) Summary: This PR updates cuSPARSELt to v0.6.2. I think we should land https://github.com/pytorch/pytorch/pull/128534 first though. Most of this PR is just enabling tests to run when cuSPARSELt v0.6.2 is available. Unfortunately I was running into a bug with fp32 support on Hopper, so I removed fp32 support from the cuSPARSELt backend. I think this should be fine since almost everybody uses the bfloat16/float16/int8 kernels. Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/134022 Approved by: https://github.com/jerryzh168, https://github.com/malfet ghstack dependencies: #128534	2024-08-23 19:34:53 +00:00
Jesse Cai	255cd75a97	[sparse] Add cuSPARSELt as a backend (#128534 ) Summary: This PR adds cuSPARSELt as a backend to PyTorch. It is now possible to check whether cuSPARSELt is available and, if so, which version, with ``` torch.backends.cusparselt.is_available() torch.backends.cusparselt.version() ``` Test Plan: ``` python test/test_sparse_semi_structured.py -k test_cusparselt_backend ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534 Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed	2024-08-21 22:06:07 +00:00
Oguz Ulgen	221350e3a4	Add None return type to init -- tests (#132352 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352 Approved by: https://github.com/ezyang ghstack dependencies: #132335, #132351	2024-08-01 15:44:51 +00:00
Eddie Yan	ce79b09415	[CUDA][Sparse] Change comparison function of `test_sparse_semi_structured.py` and bump tolerances for `sp24_matmuls` (#128553 ) Minor tweak of comparison as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bump a few tolerances that seem to be causing failures on sm86/sm90 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553 Approved by: https://github.com/jcaip	2024-06-13 06:58:07 +00:00
Jesse Cai	c9db59e9e4	[sparse] Add fast semi-structured spasification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels allow for accelerated semi-structured sparsification in PyTorch. The kernels have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above, but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-19 13:31:58 +00:00
PyTorch MergeBot	2dc15b6849	Revert "[sparse] Add fast semi-structured spasification kernels (#122350 )" This reverts commit `14b2273b0c`. Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2061070350))	2024-04-17 11:47:02 +00:00
Jesse Cai	14b2273b0c	[sparse] Add fast semi-structured spasification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels allow for accelerated semi-structured sparsification in PyTorch. The kernels have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above, but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-16 20:31:52 +00:00
Aleksandar Samardžić	f5331aade5	Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473 Approved by: https://github.com/cpuhrsch	2024-04-14 06:57:41 +00:00
PyTorch MergeBot	97261be0a8	Revert "Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 )" This reverts commit `b2a0b8c446`. Reverted https://github.com/pytorch/pytorch/pull/123473 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/123473#issuecomment-2053561077))	2024-04-13 07:47:32 +00:00
PyTorch MergeBot	3120dbbf81	Revert "[sparse] Add fast semi-structured spasification kernels (#122350 )" This reverts commit `aaec97a403`. Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2051757450))	2024-04-12 13:26:10 +00:00
Jesse Cai	aaec97a403	[sparse] Add fast semi-structured spasification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels allow for accelerated semi-structured sparsification in PyTorch. The kernels have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above, but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-12 02:22:56 +00:00
Aleksandar Samardžić	b2a0b8c446	Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473 Approved by: https://github.com/cpuhrsch	2024-04-11 11:56:27 +00:00
William Wen	cbde0f048b	[dynamo, 3.12] enable tests disabled due to missing dynamo 3.12 support (#123300 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123300 Approved by: https://github.com/jansel, https://github.com/malfet, https://github.com/zou3519	2024-04-05 20:13:17 +00:00
PyTorch MergeBot	e61d04e467	Revert "[sparse] Add fast semi-structured spasification kernels (#122350 )" This reverts commit `c63a7b5691`. Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/malfet due to This broke rocm builds, which is visible on PR as well ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2038424125))	2024-04-04 23:15:36 +00:00
Jesse Cai	c63a7b5691	[sparse] Add fast semi-structured spasification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels allow for accelerated semi-structured sparsification in PyTorch. The kernels have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above, but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-04 19:07:35 +00:00
PyTorch MergeBot	0d8e960f74	Revert "[Sparsity] add support for H100 compute capability 9.x (#121768 )" This reverts commit `91fdaa1b41`. Reverted https://github.com/pytorch/pytorch/pull/121768 on behalf of https://github.com/jeanschmidt due to Agreed on reverting and fixing rocm tests ([comment](https://github.com/pytorch/pytorch/pull/121768#issuecomment-2011893826))	2024-03-21 10:42:08 +00:00
Luoshang Pan	91fdaa1b41	[Sparsity] add support for H100 compute capability 9.x (#121768 ) Summary: as title Test Plan: buck test mode/opt //caffe2/test/... Differential Revision: D54792168 @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/121768 Approved by: https://github.com/SherlockNoMad	2024-03-20 19:00:54 +00:00
drisspg	39c092d242	Skip semi-structured-sparse on windows (#120807 ) # Summary We can see that in this job on the other PR: https://github.com/pytorch/pytorch/actions/runs/8086597674/job/22096699337?pr=120641#step:11:11272 building the SemiStructuredSparse kernel is erroring on Windows machines, so I think we should land this. ### Details Introduced in here: https://github.com/pytorch/pytorch/pull/120434 we don't compile for Windows, so we should have skipped this test. There is another PR: https://github.com/pytorch/pytorch/pull/120641 which removes this skip for Windows, so if that is green we should do that, otherwise skip Windows tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/120807 Approved by: https://github.com/alexsamardzic, https://github.com/jcaip	2024-02-29 21:48:52 +00:00
Driss Guessous	36c1cc962a	Update cutlass from 3.3.0 to 3.4.1 (#120434 ) ### COPY OF https://github.com/pytorch/pytorch/pull/120010 ### Update I have rolled the two blocking changes into this PR, I also imported this to fbcode to verify that nothing is breaking: D53870253 This copy was generated by merging in all the internal only changes into one merged atomic commit and re-exporting to github ### Current Status - [PR](https://github.com/pytorch/pytorch/pull/118935) aims to update the flash attention kernels to a more recent version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120434 Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch	2024-02-23 03:57:26 +00:00
atalman	244b124bb8	Add linux cpu test for 3.12 (#117853 ) This is continuation of work: https://github.com/pytorch/pytorch/pull/113987 Co-authored-by: albanD <desmaison.alban@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853 Approved by: https://github.com/albanD	2024-02-14 20:52:23 +00:00
Jesse Cai	16369816a2	[sparse] semi-structured sparse refactor (#117302 ) Summary: This PR is a refactor of semi-structured sparsity support. deprecation: Before `torch.sparse.to_sparse_semi_structured` had a kwarg param `transposed=False`, which has been removed. This kwarg was unused and now throws a deprecation warning. Namely, I've taken the subclassing implementation that xFormers has created and brought it over to PyTorch, as part of our plan to upstream runtime 2:4 sparsity. I've also copied over all the op support that Daniel implemented that did not depend on the fast sparsification routines, into `_sparse_semi_structured_ops.py` With this subclass, all of our internal tests pass, as well as those in xFormers. The main change is that we now define a base subclass, `SparseSemiStructuredTensor` that is inherited from for each of the specific backends. We also now can arbitrarily override the sparse dispatch table with `_load_dispatch_table()`, idea being this is still general enough where users don't need to modify pytorch source code to get their model working. This also adds in padding support and stores alg_id and fuse_transpose as flags on the tensor, instead of hardcoding them. There still remain two components in xFormers that will need to be ported over eventually: - the autograd functions (`Sparsify24`, `Sparsify24_like`) - fast sparsification routines that they rely on Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302 Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles	2024-02-14 01:10:40 +00:00
Jesse Cai	1c1dc0e4e0	[sparse] Add in out_dtype support (i8i8->bf16, i32) for cusparselt (#119296 ) Summary: Adds in out_dtype support for (i8i8->bf16) and (i8i8->i32) matmul with cuSPARSELt. Test Plan: ``` python test/test_sparse_semi_structured.py -k mixed ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/119296 Approved by: https://github.com/cpuhrsch, https://github.com/alexsamardzic	2024-02-12 16:02:36 +00:00
Aleksandar Samardžić	f081c45a34	Add out_dtype support for sparse semi-structured CUTLASS back-end (#116519 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116519 Approved by: https://github.com/cpuhrsch	2024-01-03 16:23:17 +00:00
Aleksandar Samardžić	341c4227a8	Update F32 sparse semi-structured support for CUTLASS back-end (#116017 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116017 Approved by: https://github.com/jcaip	2023-12-22 16:53:04 +00:00
Jesse Cai	a8e354a9a0	[sparse][semi-structured] enable fp32 support, separate sparse and dense constraints (#115550 ) Summary: Both cuSPARSELt and CUTLASS support 1:2 semi-structured sparsity for fp32, which this PR enables (thanks @alexsamardzic). Furthermore, this PR also updates the sparse_config to take into account the different shape constraints for sparse and dense matrices. Technically, cuSPARSELt supports smaller sparse matrix constraints as it seems to pad to the CUTLASS constraints under the hood. However, in practice small sparse matrices are not commonly used and we care more about the dense constraints for LLM inference. For now, we keep the CUTLASS constraints in place for both cuSPARSELt and CUTLASS tensors. This PR also reconnects the _FUSE_TRANSPOSE flag for cuSPARSELt tensors. Test Plan: ``` python test/test_sparse_semi_structured.py ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/115550 Approved by: https://github.com/cpuhrsch	2023-12-15 02:28:17 +00:00
Jesse Cai	4471fe6c39	[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_sparse_mm_search (#115178 ) Summary: cuSPARSELt has support for different alg_id, which are set via `cusparseLtMatmulAlgSetAttribute`; in total there are 4 different alg_ids, 0 - 3. Previously we were just using the default alg_id, as from our initial experiments we found that for most shapes the default alg_id is the fastest and that they made no difference on numerical correctness, just performance. From our previous experiments the fastest alg_id seemed to differ only on small matmul shapes. danthe3rd found a performance regression when running with cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these characteristics (activations are small, weights are large). However it's likely that this is due to the alg_id ordering changing, as mentioned in the release notes for v0.5.0. ``` cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of algorithm id alg as in v0.4.0. ``` This PR adds in the following: - support for passing in alg_id to _cslt_sparse_mm - a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for a given matmul _cslt_sparse_mm_search has the same function signature as _cslt_sparse_mm, minus the alg_id parameter. We are able to achieve v0.4.0 performance with alg_id=1 on the shapes that daniel provided. We will address autoselecting the best alg_id in a future PR, possibly with torch.compile. Test Plan: ``` python test/test_sparse_semi_structured.py -k cslt ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178 Approved by: https://github.com/cpuhrsch	2023-12-11 23:08:51 +00:00
PyTorch MergeBot	40a14e07ef	Revert "[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178 )" This reverts commit `1e5636f791`. Reverted https://github.com/pytorch/pytorch/pull/115178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the Windows build failure looks legit `1e5636f791` ([comment](https://github.com/pytorch/pytorch/pull/115178#issuecomment-1850605711))	2023-12-11 18:07:17 +00:00
Jesse Cai	1e5636f791	[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178 ) Summary: cuSPARSELt has support for different alg_id, which are set via `cusparseLtMatmulAlgSetAttribute`; in total there are 4 different alg_ids, 0 - 3. Previously we were just using the default alg_id, as from our initial experiments we found that for most shapes the default alg_id is the fastest and that they made no difference on numerical correctness, just performance. From our previous experiments the fastest alg_id seemed to differ only on small matmul shapes. danthe3rd found a performance regression when running with cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these characteristics (activations are small, weights are large). However it's likely that this is due to the alg_id ordering changing, as mentioned in the release notes for v0.5.0. ``` cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of algorithm id alg as in v0.4.0. ``` This PR adds in the following: - support for passing in alg_id to _cslt_sparse_mm - a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for a given matmul _cslt_sparse_mm_search has the same function signature as _cslt_sparse_mm, minus the alg_id parameter. We are able to achieve v0.4.0 performance with alg_id=1 on the shapes that daniel provided. We will address autoselecting the best alg_id in a future PR, possibly with torch.compile. Test Plan: ``` python test/test_sparse_semi_structured.py -k cslt ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178 Approved by: https://github.com/cpuhrsch	2023-12-11 15:47:28 +00:00
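
Note on the 2:4 pattern referenced in the sparse commits above: several messages mention 2:4 semi-structured sparsity applied in "a 1x4 strip as usual". As a rough illustration only, the pure-Python sketch below keeps the two largest-magnitude values in each group of four and returns them together with their in-group positions. It is not how the PyTorch, CUTLASS, or cuSPARSELt kernels are implemented, it does not reproduce their packed metadata layout or the 4x4 tile variant, and the names `top2_indices` and `prune_2_4` are hypothetical helpers invented for this example.

```python
def top2_indices(group):
    """Return the positions (ascending) of the two largest-magnitude values."""
    # Sort positions by descending magnitude; ties keep the earlier position.
    order = sorted(range(len(group)), key=lambda i: -abs(group[i]))
    return sorted(order[:2])


def prune_2_4(row):
    """Apply 2:4 magnitude pruning to a flat row, one 1x4 strip at a time.

    Returns (dense, values, indices):
      dense   - the row with the two smallest-magnitude entries per strip zeroed
      values  - the kept entries, two per strip
      indices - the in-strip position (0..3) of each kept entry
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    dense, values, indices = [], [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = top2_indices(group)
        dense.extend(v if i in keep else 0.0 for i, v in enumerate(group))
        values.extend(group[i] for i in keep)
        indices.extend(keep)
    return dense, values, indices


if __name__ == "__main__":
    dense, values, indices = prune_2_4([0.1, -2.0, 0.3, 4.0, 1.0, 1.0, -5.0, 0.0])
    print(dense)
    print(values)
    print(indices)
```

The `values` and `indices` lists are only meant to convey why a 2:4 tensor can be stored at half the dense size plus a small amount of position metadata; the actual compressed formats are backend-specific.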

75 Commits