pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
drisspg	80c7c7178e	Make sure all SDPA tests are run with tensor cores enabled (#135592 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135592 Approved by: https://github.com/eqy	2024-10-29 20:53:10 +00:00
Tom Ritchford	c0582fd0f8	Remove unused Python variables in torch/[b-z]* (#136963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963 Approved by: https://github.com/ezyang	2024-10-19 16:45:22 +00:00
Aaron Orenstein	524fe784ec	BundledAutotuneCache (take 2) (#137902 ) Summary: Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can be retrieved later. Attempt 2 of #134959 (D60677499). Various configs: env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE config: bundled_autotune_remote_cache jk: pytorch/remote_cache:bundled_autotune_remote_cache_version Test Plan: unit tests Manually tested w/ EMU: ``` cd fbcode/accelerators/workloads/models/emu_flash/v1p4 make build_benchmark_model && make save_model_to_path make test_pt2_latency ``` - on a cold run we got 0 hits and 40 misses. On a warm run we got 40 hits and 0 misses. - perf seems a little better - for 8 runs: - no bundled cache averaged 14m11s - bundled cache averaged 14m6s - 125ms saved per cache entry seems reasonable Cache Metrics for a sample run: no bundled cache: ``` INFO: Cache Metrics: FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0} FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0} FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0} LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0} backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0} backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0} backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0} ``` bundled cache: ``` INFO: Cache Metrics: FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0} FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0} FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<< FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0} LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0} backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0} backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, 
exception: 0} backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0} ``` Differential Revision: D64336043 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902 Approved by: https://github.com/oulgen	2024-10-15 18:39:47 +00:00
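The bundling behaviour described in #137902 above (on a bundle hit, the individual autotune results are copied into the local caches so later lookups find them) can be pictured with a toy model. This is only a minimal sketch using plain dictionaries: `LocalCache`, `BundledCache`, and the key names are hypothetical and are not Inductor's classes or configuration.

```python
from typing import Dict, Optional


class LocalCache:
    """Hypothetical stand-in for a per-kernel autotune cache."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        value = self.entries.get(key)
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def put(self, key: str, value: str) -> None:
        self.entries[key] = value


class BundledCache:
    """Hypothetical bundle store: one entry holds many autotune results."""

    def __init__(self) -> None:
        self.bundles: Dict[str, Dict[str, str]] = {}

    def save(self, bundle_key: str, local: LocalCache) -> None:
        # Snapshot every individual result under a single bundle key.
        self.bundles[bundle_key] = dict(local.entries)

    def load(self, bundle_key: str, local: LocalCache) -> bool:
        # On a bundle hit, copy each result into the local cache so the
        # existing per-kernel lookups keep working unchanged.
        bundle = self.bundles.get(bundle_key)
        if bundle is None:
            return False
        for key, value in bundle.items():
            local.put(key, value)
        return True
```

In this model a cold run misses the bundle, tunes each kernel, and saves the bundle at the end; a warm run loads the bundle once and then hits the local cache for every kernel.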
Wei Feng	14b4099521	[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955 ) This PR unblocks the unit test with a single Float8Linear module. It fixes the following error ``` torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs) [rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn' ``` Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955 Approved by: https://github.com/vkuzo, https://github.com/eqy	2024-10-07 16:36:31 +00:00
PyTorch MergeBot	6cf493158e	Revert "Enable FlashAttention on Windows (#131906 )" This reverts commit `b90bc66766`. Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))	2024-07-29 16:49:23 +00:00
Luca Wehrstedt	b90bc66766	Enable FlashAttention on Windows (#131906 ) Let's just give this a try. Reland of https://github.com/pytorch/pytorch/pull/131875. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906 Approved by: https://github.com/drisspg	2024-07-26 21:41:56 +00:00
Jerry Mannil	42f647219a	[ROCm] Add int4 support (#129710 ) - Add AMD support for int4 kernel - Only supports CDNA2 and CDNA3 gpus for now - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply - Uses `v_and_or_b32` instruction and `__hfma2` intrinsic for unpacking bf16 values - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types - Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus - Fix torchscript issues due to hipify for `__nv_bfloat16` type - TorchScript has its own implementation for bfloat16 type - Implemented in `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h) - So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation - Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify Fixes #124699 Fixes pytorch-labs/gpt-fast/issues/154 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710 Approved by: https://github.com/malfet	2024-07-09 19:49:12 +00:00
PyTorch MergeBot	d7b7f8b79f	Revert "[ROCm] Add int4 support (#129710 )" This reverts commit `d0ad13fa42`. Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))	2024-07-08 16:07:53 +00:00
Jerry Mannil	d0ad13fa42	[ROCm] Add int4 support (#129710 ) Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction. Only supports CDNA2 and CDNA3 gpus for now. Fixes #124699 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710 Approved by: https://github.com/malfet	2024-07-07 23:54:22 +00:00
eqy	f845a7a91a	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-30 19:22:16 +00:00
PyTorch MergeBot	999eec8dea	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit `b7e7a4cb01`. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))	2024-06-28 06:03:54 +00:00
Andres Lugo	b9a1c2c991	[ROCm] Enable F8 Inductor Unit tests (#128353 ) First batch of inductor unit test enablement on ROCm for the fnuz f8 variant on MI300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353 Approved by: https://github.com/jansel, https://github.com/eellison	2024-06-26 18:30:43 +00:00
Eddie Yan	b7e7a4cb01	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-26 00:49:18 +00:00
PyTorch MergeBot	817ce6835b	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit `4c971932e8`. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))	2024-06-12 18:47:52 +00:00
eqy	4c971932e8	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-09 06:53:34 +00:00
Xinya Zhang	d34075e0bd	Add Efficient Attention support on ROCM (#124885 ) This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation Known limitations: - Only supports MI200/MI300X GPUs - Does not support varlen - Does not support `CausalVariant` - Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null - Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM. This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129 `PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885 Approved by: https://github.com/malfet	2024-06-08 22:41:05 +00:00
Fuzzkatt	1cf62e86a4	skip various unit tests for Jetson (#122531 ) skip multiprocessing, cuda expandable segments, mem eff and flash attention tests on Jetson due to hanging / sigkill issues from nvidia internal testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/122531 Approved by: https://github.com/eqy, https://github.com/malfet	2024-04-16 01:26:26 +00:00
drisspg	f4e2a226aa	ScoreMod API (#121845 ) # Summary This PR adds a new higher-order op: `templated_attention`. This op is designed to extend the functionality of torch.nn.functional.scaled_dot_product_attention. PyTorch has efficient pre-written fused-attention kernels. However, users want to modify how scores are computed (a substep inside attention) -- this traditionally requires the user to write their own attention kernel. One such modification to attention scores that is not currently supported by the top level SDPA op is [Attention with Linear Biases (ALiBi)](https://arxiv.org/abs/2108.12409). This higher-order op will instead accept a callable (`score_mod`) that, through torch.compile, will be used to create an efficient attention kernel instantiation. ### Details This HOP utilizes the existing fx and HOP infra to capture the user's `score_mod` function and convert it to an FX graph module. Inductor then consumes this HOP that has a `ir.Subgraph` input. It will inline this lowered subgraph into a triton kernel which performs fused attention with the modification to the scores matrix inlined. ### API The API for a score_mod function should be as follows: ```Python def score_mod(score: torch.Tensor, batch: torch.Tensor, head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor ``` This function receives five parameters: - `score`: A scalar tensor representing the attention score, with the same data type and device as the query, key, and value tensors. - `batch`, `head`, `token_q`, `token_kv`: Scalar tensors indicating the batch index, head index, query index, and key/value index, respectively, with torch.int data type and located on the same device as the score tensor. Consider inputs query, key, value of shapes (2, 4, 16, 8), leading to an intermediate attention score matrix of shape (2, 4, 16, 16). The score_mod function will be vectorized over each element of this matrix. 
For instance, modifying the score at the position corresponding to the 0th batch, 2nd head, between the 8th query and the 9th key element, would be invoked as: ```Python score_mod(score[0,2,8,9], torch.tensor(0), torch.tensor(2), torch.tensor(8), torch.tensor(9)) ``` ### Examples ```Python import torch from torch.nn.attention.templated_attention import templated_attention torch.manual_seed(0) # Let's create some input tensors # The input tensor has shape (batch_size, num_heads, seq_len, head_dim) query = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32) key = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32) value = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32) # Let's create a fun new score_modification! I will call this # Checkerboard. It will reduce the score for neighboring tokens (1 step apart) # in the sequence. And increase the score for tokens 2 steps apart. For everything # else, the score will remain the same. def checkerboard(score, batch, head, token_q, token_kv): score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score) score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score) return score # Let's call templated_attention with this new score modification output = templated_attention(query, key, value, score_mod=checkerboard) compiled_templated_attention = torch.compile(templated_attention) out_compiled = compiled_templated_attention(query, key, value, score_mod=checkerboard) torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2) ``` ### Future Work - This PR is currently forward only. However, the Triton kernel for backwards, where score modifications do not rely on external buffers, has been explored here: https://github.com/drisspg/transformer_nuggets/blob/main/transformer_nuggets/flash/flash_attention.py - Kernel improvements: there have been some larger updates to the fused attention implementation that Triton uses in its tutorials. 
The implementation of this kernel is based on a prior version and should be updated. - We may want to unify this API under the top level SDPA API and leave that as a follow up once this is more stable - Should we error on CPU? - There are some issues with dynamic shapes - Capturing of free variables and lifting to inputs to the subgraph is not working correctly today ### Performance Comparisons generated by this benchmark: \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|---------------\|----------------\| \| Average \| 5.412 \| \| \| \| \| \| \| \| \| Max \| 8.882 \| 16 \| 16 \| 4096 \| 4096 \| 64 \| relative_bias \| torch.bfloat16 \| \| Min \| 3.645 \| 8 \| 16 \| 512 \| 512 \| 64 \| causal_mask \| torch.bfloat16 \| \| Min \| 0.345 \| 1 \| 16 \| 1024 \| 1024 \| 64 \| pathological \| torch.bfloat16 \| For reference \| Configuration \| Forward Time (µ seconds) \| Backend \| Speedup \| \|-----------------------------------------------\|--------------------------\|------------------\|---------\| \| Fastest Config in Sweep (`8 16 4096 4096 64 relative_bias torch.bfloat16`) \| 3608 \| Templated Attention \| 1.0 \| \| Compiled SDPA (No Mask) \| 9928 \| Math \| 2.75x \| \| Compiled SDPA (With Mask) \| 11898 \| Math \| 3.29x \| \| Compiled SDPA (With Mask) \| 8704 \| Memory Efficient Attention \| 2.42x \| \| Compiled SDPA (No Mask) \| 2548 \| FlashAttention2 \| 0.706x \| The speedups are measuring compiled templated attention speed versus different calls to torch.nn.functional.sdpa <details> <summary> FULL PERFORMANCE SWEEP NUMBERS </summary> \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| eager_time \| compiled_time \| speedup \| 
\|--------------\|-------------\|-------------\|-------------\|------------\|---------------\|----------------\|--------------\|-----------------\|-----------\| \| 1 \| 16 \| 512 \| 512 \| 64 \| causal_mask \| torch.bfloat16 \| 331.444 \| 67.221 \| 4.931 \| \| 1 \| 16 \| 512 \| 512 \| 64 \| relative_bias \| torch.bfloat16 \| 335.300 \| 64.187 \| 5.224 \| \| 1 \| 16 \| 512 \| 512 \| 64 \| head_bias \| torch.bfloat16 \| 352.039 \| 63.806 \| 5.517 \| \| 1 \| 16 \| 512 \| 512 \| 64 \| pathological \| torch.bfloat16 \| 371.699 \| 711.349 \| 0.523 \| \| 1 \| 16 \| 1024 \| 1024 \| 64 \| causal_mask \| torch.bfloat16 \| 333.488 \| 86.455 \| 3.857 \| \| 1 \| 16 \| 1024 \| 1024 \| 64 \| relative_bias \| torch.bfloat16 \| 322.363 \| 82.469 \| 3.909 \| \| 1 \| 16 \| 1024 \| 1024 \| 64 \| head_bias \| torch.bfloat16 \| 349.967 \| 82.233 \| 4.256 \| \| 1 \| 16 \| 1024 \| 1024 \| 64 \| pathological \| torch.bfloat16 \| 486.359 \| 1412.453 \| 0.344 \| \| 1 \| 16 \| 4096 \| 4096 \| 64 \| causal_mask \| torch.bfloat16 \| 2794.597 \| 551.188 \| 5.070 \| \| 1 \| 16 \| 4096 \| 4096 \| 64 \| relative_bias \| torch.bfloat16 \| 3965.150 \| 513.101 \| 7.728 \| \| 1 \| 16 \| 4096 \| 4096 \| 64 \| head_bias \| torch.bfloat16 \| 2408.013 \| 504.759 \| 4.771 \| \| 1 \| 16 \| 4096 \| 4096 \| 64 \| pathological \| torch.bfloat16 \| 6850.531 \| 16733.675 \| 0.409 \| \| 8 \| 16 \| 512 \| 512 \| 64 \| causal_mask \| torch.bfloat16 \| 441.939 \| 123.576 \| 3.576 \| \| 8 \| 16 \| 512 \| 512 \| 64 \| relative_bias \| torch.bfloat16 \| 560.379 \| 116.710 \| 4.801 \| \| 8 \| 16 \| 512 \| 512 \| 64 \| head_bias \| torch.bfloat16 \| 421.172 \| 115.825 \| 3.636 \| \| 8 \| 16 \| 512 \| 512 \| 64 \| pathological \| torch.bfloat16 \| 994.492 \| 2132.806 \| 0.466 \| \| 8 \| 16 \| 1024 \| 1024 \| 64 \| causal_mask \| torch.bfloat16 \| 1436.430 \| 309.495 \| 4.641 \| \| 8 \| 16 \| 1024 \| 1024 \| 64 \| relative_bias \| torch.bfloat16 \| 1892.216 \| 290.186 \| 6.521 \| \| 8 \| 16 \| 1024 \| 1024 \| 64 \| 
head_bias \| torch.bfloat16 \| 1360.665 \| 282.956 \| 4.809 \| \| 8 \| 16 \| 1024 \| 1024 \| 64 \| pathological \| torch.bfloat16 \| 3525.532 \| 8359.702 \| 0.422 \| \| 8 \| 16 \| 4096 \| 4096 \| 64 \| causal_mask \| torch.bfloat16 \| 22026.839 \| 3864.604 \| 5.700 \| \| 8 \| 16 \| 4096 \| 4096 \| 64 \| relative_bias \| torch.bfloat16 \| 31262.746 \| 3609.551 \| 8.661 \| \| 8 \| 16 \| 4096 \| 4096 \| 64 \| head_bias \| torch.bfloat16 \| 20219.079 \| 3480.402 \| 5.809 \| \| 8 \| 16 \| 4096 \| 4096 \| 64 \| pathological \| torch.bfloat16 \| 54654.647 \| 116652.357 \| 0.469 \| \| 16 \| 16 \| 512 \| 512 \| 64 \| causal_mask \| torch.bfloat16 \| 820.606 \| 188.683 \| 4.349 \| \| 16 \| 16 \| 512 \| 512 \| 64 \| relative_bias \| torch.bfloat16 \| 1058.362 \| 179.295 \| 5.903 \| \| 16 \| 16 \| 512 \| 512 \| 64 \| head_bias \| torch.bfloat16 \| 784.372 \| 175.714 \| 4.464 \| \| 16 \| 16 \| 512 \| 512 \| 64 \| pathological \| torch.bfloat16 \| 1890.792 \| 4212.877 \| 0.449 \| \| 16 \| 16 \| 1024 \| 1024 \| 64 \| causal_mask \| torch.bfloat16 \| 2781.830 \| 557.017 \| 4.994 \| \| 16 \| 16 \| 1024 \| 1024 \| 64 \| relative_bias \| torch.bfloat16 \| 3694.050 \| 525.249 \| 7.033 \| \| 16 \| 16 \| 1024 \| 1024 \| 64 \| head_bias \| torch.bfloat16 \| 2634.164 \| 507.613 \| 5.189 \| \| 16 \| 16 \| 1024 \| 1024 \| 64 \| pathological \| torch.bfloat16 \| 6959.917 \| 15331.116 \| 0.454 \| \| 16 \| 16 \| 4096 \| 4096 \| 64 \| causal_mask \| torch.bfloat16 \| 43889.096 \| 7582.018 \| 5.789 \| \| 16 \| 16 \| 4096 \| 4096 \| 64 \| relative_bias \| torch.bfloat16 \| 62784.293 \| 7075.846 \| 8.873 \| \| 16 \| 16 \| 4096 \| 4096 \| 64 \| head_bias \| torch.bfloat16 \| 40308.606 \| 6829.587 \| 5.902 \| \| 16 \| 16 \| 4096 \| 4096 \| 64 \| pathological \| torch.bfloat16 \| 108892.137 \| 233090.953 \| 0.467 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121845 Approved by: https://github.com/Chillee, https://github.com/zou3519	2024-04-06 01:10:44 +00:00
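The `score_mod` contract in #121845 above (one call per element of the attention score matrix, with the batch, head, query, and key/value indices passed alongside the score) can be illustrated without a GPU. The following is a plain-Python sketch under stated assumptions: `apply_score_mod` is a hypothetical helper, scores are nested lists rather than tensors, and indices are Python ints rather than scalar tensors. It shows only the element-wise semantics and says nothing about how the fused Triton kernel is generated.

```python
from typing import Callable, List

# Nested-list stand-in for a score tensor indexed [batch][head][q_idx][kv_idx].
Scores = List[List[List[List[float]]]]
ScoreMod = Callable[[float, int, int, int, int], float]


def apply_score_mod(scores: Scores, score_mod: ScoreMod) -> Scores:
    """Call score_mod once per element, passing that element's indices."""
    return [
        [
            [
                [score_mod(s, b, h, q, kv) for kv, s in enumerate(row)]
                for q, row in enumerate(head_scores)
            ]
            for h, head_scores in enumerate(batch_scores)
        ]
        for b, batch_scores in enumerate(scores)
    ]


def checkerboard(score: float, batch: int, head: int,
                 token_q: int, token_kv: int) -> float:
    """Halve scores for tokens 1 step apart, double those 2 steps apart."""
    distance = abs(token_kv - token_q)
    if distance == 1:
        return score * 0.5
    if distance == 2:
        return score * 2.0
    return score


# One batch, one head, 4 queries x 4 keys, every score equal to 1.0.
uniform = [[[[1.0] * 4 for _ in range(4)]]]
modified = apply_score_mod(uniform, checkerboard)
```

Because the two distance conditions are mutually exclusive, the `if` chain here matches the two sequential `torch.where` calls in the commit's `checkerboard` example.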
Xinya Zhang	12116aee68	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton supports them. - [x] Only supports power of two sequence lengths. * Now it supports arbitrary sequence lengths - [ ] No support for varlen APIs. * varlen API will be supported in a future release of AOTriton - [x] Only supports head dimension 16,32,64,128. * Now it supports arbitrary head dimension <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/huydhn	2024-03-28 00:27:38 +00:00
PyTorch MergeBot	764eae9c4e	Revert "Add Flash Attention support on ROCM (#121561 )" This reverts commit `a37e22de70`. Reverted https://github.com/pytorch/pytorch/pull/121561 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this needs more work to be able to land in fbcode because https://github.com/ROCm/aotriton is not available there atm. We are working to reland this change before 2.3 release ([comment](https://github.com/pytorch/pytorch/pull/121561#issuecomment-2007717091))	2024-03-19 17:14:28 +00:00
Xinya Zhang	a37e22de70	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton supports them. - [x] Only supports power of two sequence lengths. * Now it supports arbitrary sequence lengths - [ ] No support for varlen APIs. * varlen API will be supported in the next release of AOTriton - [x] Only supports head dimension 16,32,64,128. * Now it supports arbitrary head dimension <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/malfet, https://github.com/atalman	2024-03-12 01:16:53 +00:00
Eddie Yan	cd380c794f	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-02-14 22:02:06 +00:00
CaoE	113138aa55	add test cases for GradScaler on CPU (#109994 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-02-02 21:49:07 +00:00
Edward Z. Yang	9bce208dfb	Replace follow_imports = silent with normal (#118414 ) This is a lot of files changed! Don't panic! Here's how it works: * Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file. * When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded. * The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors. * Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list. * Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves. * torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state. * There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many. In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file. The codemod was done with this script authored by GPT-4: ``` import glob exclude_patterns = [ ... 
] for pattern in exclude_patterns: for filepath in glob.glob(pattern, recursive=True): if filepath.endswith('.py'): with open(filepath, 'r+') as f: content = f.read() f.seek(0, 0) f.write('# mypy: ignore-errors\n\n' + content) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414 Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD	2024-01-27 02:44:11 +00:00
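The codemod quoted in #118414 above prepends the directive unconditionally, so running it a second time would add a second header to every file. A hedged variant that skips files already starting with the directive might look like the sketch below. `add_ignore_errors` is a hypothetical helper, not part of the PyTorch tooling, and the glob patterns are supplied by the caller because the commit message elides the real exclude list.

```python
import glob
from typing import Iterable, List

DIRECTIVE = "# mypy: ignore-errors"


def add_ignore_errors(patterns: Iterable[str]) -> List[str]:
    """Prepend the mypy directive to matching .py files; return changed paths."""
    changed: List[str] = []
    for pattern in patterns:
        for filepath in glob.glob(pattern, recursive=True):
            if not filepath.endswith(".py"):
                continue
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
            # Skip files that already carry the directive so reruns are no-ops.
            if content.startswith(DIRECTIVE):
                continue
            with open(filepath, "w", encoding="utf-8") as f:
                f.write(DIRECTIVE + "\n\n" + content)
            changed.append(filepath)
    return changed
```

Reading the whole file and rewriting it is slower than the in-place `r+` approach in the original script, but it keeps the read and write steps separate and makes the skip check straightforward.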
PyTorch MergeBot	2f84a9d37c	Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 )" This reverts commit `5aa92b5090`. Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))	2024-01-18 23:40:30 +00:00
Eddie Yan	5aa92b5090	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-01-18 01:20:36 +00:00
Xinya Zhang	e3ca7346ce	Re-add initial Flash Attention support on ROCM (#115981 ) Note about the Updates: This PR: 1. Skips more flash attention related UTs on MI200 2. Fixes additional ATen compiling errors after hipification 3. Fixes the author "root" of a specific commit 4. Includes the patch from Nikita in favor of block level static initialization. CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge. Original PR (https://github.com/pytorch/pytorch/pull/114309) Note: This pull request adds initial Flash Attention support for the AMD/ROCM platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We plan to release this specialized Triton as a separate project. Known limitations: - Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). - Only supports power of two sequence lengths. - No support for varlen APIs. - Only supports head dimension 16,32,64,128. - Performance is still being optimized. Fixes #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981 Approved by: https://github.com/malfet	2024-01-04 22:21:31 +00:00
Jeff Daily	e3aefe2970	Revert "Initial Flash Attention support on ROCM (#114309 )" (#115975 ) This reverts commit `5bddbed399`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975 Approved by: https://github.com/atalman, https://github.com/malfet	2023-12-16 03:40:14 +00:00
Xinya Zhang	5bddbed399	Initial Flash Attention support on ROCM (#114309 ) This pull request adds initial Flash Attention support for the AMD/ROCM platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We plan to release this specialized Triton as a separate project. Known limitations: - [ ] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). - [ ] Only supports power of two sequence lengths. - [ ] No support for varlen APIs. - [ ] Only supports head dimension 16,32,64,128. - [ ] Performance is still being optimized. Fixes https://github.com/pytorch/pytorch/issues/112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309 Approved by: https://github.com/jeffdaily, https://github.com/malfet --------- Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>	2023-12-14 08:52:57 -08:00
chilli	e686341f64	Consider that ops can be fused into cat in the min-cut partitioner (#110501 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110501 Approved by: https://github.com/eellison	2023-10-05 01:34:57 +00:00
Ying Zhang	a2d5f13310	[Inductor CUTLASS backend] Step 5: Gemm CUTLASS templates (#108015 ) This is the step 5 to add cutlass as an alternative inductor backend. Feature request: https://github.com/pytorch/pytorch/issues/106991. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108015 Approved by: https://github.com/kadeng, https://github.com/jansel, https://github.com/aakhundov ghstack dependencies: #107802, #107847, #107901, #107931	2023-09-12 17:44:38 +00:00
Huy Do	a9c663c269	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 07:43:04 +00:00
PyTorch MergeBot	e45b290127	Revert "Revert "Flash Attention v2 (#105602 )" (#108827 )" This reverts commit `24e9bbe22a`. Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))	2023-09-08 03:25:45 +00:00
Huy Do	24e9bbe22a	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 02:54:20 +00:00
drisspg	add45aea1c	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That being said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts. 
### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` \| https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake - [x] Let multi_query/attention pass through and test \| UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update test to exercise above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFlops reported here are on an A100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-01 22:14:44 +00:00
shibo19	bb2fcc7659	unify TEST_CUDA (#106685 ) Fixes #ISSUE_NUMBER as title, unify TEST_CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/106685 Approved by: https://github.com/zou3519	2023-08-10 09:01:36 +00:00
Fuzzkatt	3c7331742a	test_fused_sdp_choice in test_transformers.py fix (#106587 ) sdp dispatcher prioritizes flash attention over efficient attention: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp#L684-L687, and flash attention is enabled for sm75+: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp#L625. Thus, the unit test `test_fused_sdp_choice` from `test_transformers.py` which is failing on T4 (sm75) should have this `SM80OrLater` check changed to `SM75OrLater`: https://github.com/pytorch/pytorch/blob/main/test/test_transformers.py#L1914-L1917. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106587 Approved by: https://github.com/drisspg	2023-08-04 03:43:56 +00:00
Fuzzkatt	1cebfef8a4	sm90 efficient attention test fixes (#105978 ) Fixes the following two test cases involving efficient attention on sm90: Explanations: functorch/test_ops.py: test_vjp_nn_functional_scaled_dot_product_attention_cuda_float32 * originally the test had xfail for all sm * in https://github.com/pytorch/pytorch/issues/102029, we found that it was unexpectedly passing on sm90 * I made https://github.com/pytorch/pytorch/pull/102131 to update the test to let it pass * @drisspg seems to have made changes to the behavior such that the original xfail was getting triggered (https://github.com/pytorch/pytorch/issues/102029#issuecomment-1560071148) * the CI began complaining about the failure again: https://github.com/pytorch/pytorch/issues/102663 * I'm now reverting https://github.com/pytorch/pytorch/pull/102131 to bring back the original xfail now that the behavior has been fixed by @drisspg to trigger the xfail in sm90 similar to all other sm test_transformers.py: test_mem_efficient_fail_sm90_cuda * the test as it's currently written seems to expect the sdp dispatcher to fail for mem efficient attention on sm90; however, testing this on H100, it actually succeeds, so I'm disabling the test for now as the current expected result may be outdated Pull Request resolved: https://github.com/pytorch/pytorch/pull/105978 Approved by: https://github.com/eqy, https://github.com/kshitij12345, https://github.com/zou3519	2023-07-31 17:59:40 +00:00
Justin Chu	be03a56955	[BE] Enable ruff's UP rules and autoformat testing/ (#105425 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105425 Approved by: https://github.com/malfet	2023-07-18 21:04:39 +00:00
Nikita Shulga	c3e4a67905	Refactor multigpu tests to `test_cuda_multigpu` (#104059 ) Mostly refactor, that moves all the tests from `test_cuda` that benefit from multiGPU environment into its own file. - Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`) - Move individual tests from `TestCuda` to `TestCudaMultiGPU` - Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda` - Add newly created `test_cuda_multigpu` to the multigpu periodic test <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at f4d46fa</samp> This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059 Approved by: https://github.com/huydhn	2023-06-27 05:32:05 +00:00
Nikita Shulga	cd05c3b98c	[BE] Use `TEST_MULTIGPU` from `common_cuda.py` (#103982 ) Comment about `TEST_CUDNN` called over and over has long been alleviated by wrapping the check with `LazyVal`, that caches the results. Also, delete unused `TEST_MAGMA`. Prep change for https://github.com/pytorch/pytorch/issues/100006 <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at e3a5b39</samp> > _`common_cuda.py`_ > _Refactored for dynamo tests_ > _Winter code cleanup_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/103982 Approved by: https://github.com/atalman, https://github.com/janeyx99	2023-06-22 00:07:44 +00:00
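The caching idea mentioned in the commit above (an expensive check wrapped so that it runs once and its result is reused) can be sketched as follows. This is a minimal illustration only: `LazyValue` and `expensive_check` are hypothetical names, and this is not the actual `LazyVal` implementation used by the PyTorch test utilities.

```python
class LazyValue:
    """Hypothetical stand-in: defer a computation and cache its result."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable, run at most once
        self._cached = False
        self._value = None

    def get(self):
        # Run the wrapped callable on first access only, then reuse the result.
        if not self._cached:
            self._value = self._compute()
            self._cached = True
        return self._value

    def __bool__(self):
        # Lets the wrapper be used directly in `if` conditions.
        return bool(self.get())


calls = []

def expensive_check():
    # Placeholder for a costly probe (for example, a device capability query).
    calls.append(1)
    return True

flag = LazyValue(expensive_check)
first = bool(flag)
second = bool(flag)
```

Nothing runs at construction time, so a module that defines such flags can be imported without triggering the probe.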
Fuzzkatt	5b01c8dc6a	fix functorch/test_ops.py test_vjp flash attention unexpected success (#102131 ) add isSm90 check for expected failure in nn.functional.scaled_dot_product_attention in functorch/test_ops.py Fixes #102029 Uses solution https://github.com/pytorch/pytorch/issues/102029#issuecomment-1560052965 which was verified by https://github.com/pytorch/pytorch/issues/102029#issuecomment-1560071148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102131 Approved by: https://github.com/zou3519	2023-05-25 22:17:25 +00:00
Edward Z. Yang	3a5427baf4	Add torch.utils._content_store (#99809 ) Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor. Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case. One major piece of this PR is implementing the content hash for tensors. For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times). Using a conventional cryptographic hash and hashing each snapshot would likely take on the order of minutes, which seemed too slow to me. So instead, I implemented a crappy hash function that can be run on GPU. It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor sum universal hash family. The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs 32-bit hash five times and cats them together. By the way, this sets the first precedent for a kernel in the PyTorch library which MUST be torch.compile'd to be run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen.) I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum. Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through. xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range. Thanks to Jeff Johnson for letting me bounce ideas off him on a Saturday morning lol. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809 Approved by: https://github.com/voznesenskym	2023-04-26 18:02:59 +00:00
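The hashing scheme described in the commit above (a multiply-shift universal hash per 32-bit word, combined by XOR, run five times and concatenated into 160 bits) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in torch/utils/_content_store.py: the function names are hypothetical, `random.Random` stands in for Philox, and the exact parameter layout is a guess at the general technique.

```python
import random

MASK64 = (1 << 64) - 1

def multiply_shift_32(x, a, b):
    # Multiply-add-shift hash: 32-bit word -> 32 bits, using 64-bit
    # parameters a (odd) and b. Keeps the top 32 bits of the low 64.
    return ((a * x + b) & MASK64) >> 32

def xor_sum_hash(words, seed):
    # Assumption: random.Random replaces Philox as the parameter source.
    rng = random.Random(seed)
    acc = 0
    for w in words:
        a = rng.getrandbits(64) | 1   # fresh odd multiplier per position
        b = rng.getrandbits(64)
        acc ^= multiply_shift_32(w & 0xFFFFFFFF, a, b)  # XOR reduction
    return acc

def content_hash_160(words, seed=0):
    # Run five independent 32-bit hashes and concatenate them as hex,
    # rather than doing true 160-bit arithmetic.
    parts = [xor_sum_hash(words, seed * 5 + i) for i in range(5)]
    return "".join(f"{p:08x}" for p in parts)

digest = content_hash_160([1, 2, 3])
```

Because each position draws its own parameters, reordering the words generally changes the digest even though XOR itself is order-independent.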
Edward Z. Yang	cf354a0491	Don't eagerly initialize CUDA when importing common_cuda (#99536 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99536 Approved by: https://github.com/Chillee, https://github.com/bertmaher, https://github.com/albanD	2023-04-19 22:12:10 +00:00
eqy	2fddcf0fc0	[CUDA][CUDA 11] Remove more CUDA 11 version checks (#92934 ) Working on removing stragglers missed in previous CUDA version < 11.0 cleanup PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92934 Approved by: https://github.com/ngimel	2023-03-30 19:49:52 +00:00
Kazuaki Ishizaki	4610ce49f6	Fix typo under torch/testing directory (#97254 ) This PR fixes typo in comments and messages under `torch/testing` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97254 Approved by: https://github.com/kit1980, https://github.com/malfet	2023-03-23 01:46:17 +00:00
Driss Guessous	653dc73df0	[SDPA] Wire up FlashAttention's backward (#92917 ) # Summary This PR creates _flash_attention_backward and _scaled_dot_product_flash_attention_backward native functions and registers them to the respective derivatives.yaml. The goal is to replicate the torch.autograd.Function defined in the FlashAttention repo [here](`33e0860c9c/flash_attn/flash_attn_interface.py (L126)`) natively in PyTorch. One thing that we don't have access to is ctx.save_for_backward in native PyTorch, so in order to save these variables I extended the returned objects from the forward functions. ### MetaFunctions I also updated the FlashAttention meta functions to mirror the real outputs now. I also added a meta registration for backwards. I have an XLMR training script, and while eager training now works with FlashAttention, compiling this module fails with the inductor error down below. ### Questions? Performance issues vs mem efficient when using torch.nn.mha_forward TorchCompile -> See proposed solution below. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917 Approved by: https://github.com/cpuhrsch	2023-02-02 04:02:30 +00:00
Eddie Yan	0bf7506051	[CUDA] Drop CUDA < 11.0 test flags (#92605 ) Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best-guess of how `torch.version.cuda` would behave in ROCM builds, so I've added `not TEST_WITH_ROCM` in cases where ROCM wasn't previously explicitly allowed. CC @ptrblck @malfet @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605 Approved by: https://github.com/ngimel	2023-01-24 04:34:06 +00:00
jpvillam	38dd4cbdf1	ROCm enable sparse_sampled_addmm (#86401 ) Enables: test_comprehensive_sparse_sampled_addmm_cuda_complex128 test_comprehensive_sparse_sampled_addmm_cuda_complex64 test_comprehensive_sparse_sampled_addmm_cuda_float32 test_comprehensive_sparse_sampled_addmm_cuda_float64 test_dispatch_meta_sparse_sampled_addmm_cuda_complex128 test_dispatch_meta_sparse_sampled_addmm_cuda_complex64 test_dispatch_meta_sparse_sampled_addmm_cuda_float32 test_dispatch_meta_sparse_sampled_addmm_cuda_float64 test_meta_sparse_sampled_addmm_cuda_complex128 test_meta_sparse_sampled_addmm_cuda_complex64 test_meta_sparse_sampled_addmm_cuda_float32 test_meta_sparse_sampled_addmm_cuda_float64 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86401 Approved by: https://github.com/ngimel	2022-10-26 19:39:24 +00:00
jpvillam	247468baf0	[ROCm] More Sparse UTs enablement and more hipification mappings. (#78939 ) Enables: test_bmm_cuda_float64 test_bmm_deterministic_cuda_float64 test_csr_matvec_cuda_complex128 test_csr_matvec_cuda_complex64 test_csr_matvec_cuda_float32 test_csr_matvec_cuda_float64 To enable the above tests had to add some more hip mappings for the hipification process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78939 Approved by: https://github.com/pruthvistony, https://github.com/malfet	2022-08-23 13:54:09 +00:00


73 Commits