Commit Graph

430 Commits

Sun, Jiayi
a59132b9c8 fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661)
Fix https://github.com/pytorch/pytorch/issues/132634.
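
A minimal check of the fixed behavior, hedged as a sketch (shapes and tolerances are illustrative; complex32/chalf input is the case this commit addresses):

```python
import torch

x = torch.randn(8, dtype=torch.complex64)
n_chalf = torch.linalg.norm(x.to(torch.complex32))  # chalf input, previously broken
n_ref = torch.linalg.norm(x)                        # complex64 reference
torch.testing.assert_close(n_chalf.float(), n_ref, rtol=1e-2, atol=1e-2)
```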

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2024-11-07 03:21:36 +00:00
Nichols A. Romero
641ca67d5a [ROCM] Fix hipBLASLt version check in TunableOp test (#139811)
Allow three or more digits in the hipBLASLt version check in the TunableOp test. This is needed for the upcoming ROCm 6.3 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139811
Approved by: https://github.com/eqy, https://github.com/malfet
2024-11-06 14:37:45 +00:00
Nichols A. Romero
2922b9fee1 [ROCm] Fix ADDMM hipBLASLt regression (#138267)
Fixes #138067

A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604

The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267
Approved by: https://github.com/eqy, https://github.com/jeffdaily
2024-10-28 16:07:11 +00:00
Xia, Weiwen
e299193423 Bug fix: Use oneDNN for torch._int_mm CPU only when avx512_vnni is supported (#136942)
Fixes #136746

If AVX512_VNNI is not supported, an overflow occurs inside oneDNN, so we fall back to the reference path in that case.
The unit test is also updated to catch the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136942
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-26 01:17:11 +00:00
Jeff Daily
3f3b692a00 [ROCm] CK-based GEMM (#131004)
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()`
- reference CK gemm implementations for float, bfloat16, and half types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony

Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-10-20 02:57:43 +00:00
Nichols A. Romero
aa28062169 [ROCm] TunableOp more unit test follow-up - Part 2 (#134517)
More unit tests to cover TunableOp functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134517
Approved by: https://github.com/jeffdaily
2024-10-16 01:49:47 +00:00
Huy Do
929797dedb Fix test_matmul_offline_tunableop by writing its output files to a temp dir (#137835)
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:

```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
    os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```

For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097

The test tried to catch and ignore this error, but that is not enough on Windows.  So, the fix is to:

1. Ignore the failure if these files couldn't be removed
2. Write them to a temp directory instead; otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) complains
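
A minimal sketch of the shape of that fix (names are illustrative, not the exact test code):

```python
import os
import tempfile

# 2. write TunableOp output files into a temp dir so the git checkout stays clean
with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "tunableop_untuned0.csv")
    ...  # run the matmuls that produce `filename` (elided)
    # 1. ignore the failure if Windows still holds the file open
    try:
        os.remove(filename)
    except (FileNotFoundError, PermissionError):
        pass
```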

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman
2024-10-14 17:28:33 +00:00
Jin Zhou
5516ac5c21 [ROCm] Tunableop record untuned (#128813)
When TunableOp is enabled, it is easy to hit OOM, since the application itself usually needs a large amount of GPU memory, such as when running an LLM for inference. So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp:

- record untuned GEMMs to a file.

- a Python API named tune_gemm_in_file is added to read the untuned file and tune the GEMMs in it.
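
A hedged sketch of the workflow; the environment-variable names and the module hosting `tune_gemm_in_file` are assumptions based on the existing TunableOp controls, not verbatim from this PR:

```python
import os

# Step 1: run the workload with TunableOp enabled but tuning disabled, recording
# every GEMM it would have tuned (variable names are assumptions).
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "0"
os.environ["PYTORCH_TUNABLEOP_RECORD_UNTUNED"] = "1"

import torch
# ... run inference; untuned GEMM shapes are appended to tunableop_untuned0.csv ...

# Step 2: later, tune the recorded GEMMs offline, without the model occupying memory.
torch.cuda.tunable.tune_gemm_in_file("tunableop_untuned0.csv")
```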

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
Yuanhao Ji
758dbac308 Add type check for ord in torch.linalg.vector_norm() and torch.linalg.matrix_norm() (#137463)
fixes #137424, fixes #137460
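A hedged illustration of the new behavior (the linked issues contain the exact failing inputs; here a non-numeric `ord` is simply expected to be rejected up front rather than crashing):

```python
import torch

x = torch.randn(4)
try:
    torch.linalg.vector_norm(x, ord=[2])  # illustrative invalid ord type
except (TypeError, RuntimeError) as e:
    print("rejected:", e)
```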
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137463
Approved by: https://github.com/lezcano
2024-10-08 17:53:56 +00:00
Peter Y. Yeh
8aa110cb00 [ROCm] Enable int_mm_error tests for rocm 6.0+ (#124999)
This pull request enables the int_mm_error tests for ROCm 6.0+, since #122431 landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124999
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-10-07 21:10:18 +00:00
Haifeng Jin
37dd924c2d Fix test/test_linalg.py for NumPy 2 (#136800)
Related to  #107302.

When built and tested with NumPy 2, the following unit tests failed.

```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```

This PR fixes them. The test is now compatible with both NumPy 1 & 2.

Some more details:

1. `np.linalg.solve` has changed its behavior, so I added an adapter function in the unit test to keep the behavior the same whether it runs with NumPy 1 or NumPy 2.
2. The cause of the `householder_product` failures is that when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while in NumPy 2 it is `(torch.Tensor, torch.Tensor)`.
3. NumPy 2 no longer allows `np.array(obj, copy=False)` when a copy is needed; `np.asarray(obj)` is recommended instead.
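
A small sketch of the third point, which works under both NumPy versions (illustrative, not the exact test diff):

```python
import numpy as np
import torch

t = torch.arange(6.0).reshape(2, 3)
# Under NumPy 2, np.array(t, copy=False) raises when a copy cannot be avoided;
# np.asarray works on both NumPy 1 and NumPy 2 and copies only if required.
a = np.asarray(t)
```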

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
2024-10-01 07:53:24 +00:00
Nikita Shulga
c3e678382b Fix addmm silent correctness on aarch64 (#136371)
Do not dispatch to fast gemmv functions when alpha is not equal to 1

Add regression test to address the problem
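
A sketch of what such a regression check can look like (shapes are illustrative; per the description, the fast gemmv path previously ignored `alpha`):

```python
import torch

bias = torch.randn(8, 1)
a = torch.randn(8, 16)
b = torch.randn(16, 1)          # effectively a matrix-vector product
out = torch.addmm(bias, a, b, alpha=2.0, beta=1.0)
ref = bias + 2.0 * (a @ b)      # plain reference that honors alpha
torch.testing.assert_close(out, ref)
```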

Fixes https://github.com/pytorch/pytorch/issues/136299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371
Approved by: https://github.com/swolchok
2024-09-23 17:10:34 +00:00
Jeff Daily
0eb9c870fd [reland][ROCm] TunableOp for gemm_and_bias (#128919)
Reland of #128143, but with `alpha` and `bias` initialization added to `launchTunableGemmAndBias`.

Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.
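
gemm_and_bias is the fused path used by ops like `torch.nn.functional.linear` with a bias, so a hedged sketch of exercising it under TunableOp could look like this (a CUDA/ROCm device is assumed; whether a given call takes the fused path depends on dtypes and layouts):

```python
import torch
import torch.nn.functional as F

torch.cuda.tunable.enable(True)          # turn TunableOp on
torch.cuda.tunable.tuning_enable(True)   # allow it to tune new GEMM shapes

x = torch.randn(32, 64, device="cuda", dtype=torch.half)
w = torch.randn(128, 64, device="cuda", dtype=torch.half)
b = torch.randn(128, device="cuda", dtype=torch.half)
y = F.linear(x, w, b)                    # typically dispatches through gemm_and_bias
```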

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919
Approved by: https://github.com/malfet
2024-08-22 18:27:50 +00:00
Nichols A. Romero
f25df31008 TunableOp more unit test follow-up (#130065)
More unit tests for preventing TunableOp regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130065
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-08-08 22:42:16 +00:00
Jiang, Yanbing
bceb91222c Fix meta error in _convert_weight_to_int4pack (#130915)
This PR fixes the meta error in `_convert_weight_to_int4pack`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130915
Approved by: https://github.com/jerryzh168
2024-07-26 08:36:30 +00:00
Zixi Qi
c3fe9075a9 [ROCM] Use hipblaslt version from hipblaslt runtime instead of header for tunableops validator (#131078)
Summary:
When TunableOp loads selected kernels from a CSV file, it validates the hipBLASLt version defined in hipblaslt-version.h.

This PR changes the validator to fetch the hipBLASLt version and revision from the hipBLASLt runtime instead of the header file, because in our environment we might roll out a new version of the runtime before updating the header file fleet-wide.

Test Plan:
Verified that the generated TunableOp kernel selection has the correct hipBLASLt version from the runtime:

```
Validator,PT_VERSION,2.5.0
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
Validator,HIPBLASLT_VERSION,800-bf2c3184
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676
GemmTunableOp_BFloat16_TN,tn_7168_2_8192,Gemm_Hipblaslt_TN_482,0.0359019
GemmTunableOp_BFloat16_TN,tn_8192_2_1024,Default,0.0173723
GemmTunableOp_BFloat16_TN,tn_1280_2_8192,Gemm_Hipblaslt_TN_491,0.0191047
```

Differential Revision: D59889043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131078
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
2024-07-25 00:54:07 +00:00
Jeff Daily
69b1999586 TunableOp size hotfix (#130800)
Fixes #130727.  The GetSize calculation was incorrect for strided batched GEMM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130800
Approved by: https://github.com/xw285cornell
2024-07-22 23:42:26 +00:00
Aaron Gokaslan
d1c4e6b55f [BE]: Enable a few additional ruff rules (#130700)
Enables a few extra ruff rules, most of which have no violations because I already cleaned them up in earlier PRs; this just turns them on to enforce them. Adds one noqa, since we want the suboptimal lambda generation + call kept as a test. Also enables the check in flake8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130700
Approved by: https://github.com/justinchuby, https://github.com/ezyang
2024-07-17 02:06:04 +00:00
inkcherry
f422027fce fix torch.linalg.lstsq input check (#130612)
Fixes [#117236 ](https://github.com/pytorch/pytorch/issues/117236)
The current code does not handle the vector scenario correctly, and its input checks are insufficient (relying solely on `dim_diff` is not enough). Consequently, such inputs trigger an internal assertion error.
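
A hedged illustration of the kind of input checking involved (the exact failing inputs are in the linked issue; the shapes here are illustrative):

```python
import torch

A = torch.randn(2, 5, 3)        # batched A
b = torch.randn(5)              # vector right-hand side (illustrative)
try:
    torch.linalg.lstsq(A, b)
except RuntimeError as e:       # should now be a clear input-check error, not an internal assert
    print(e)
```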

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130612
Approved by: https://github.com/lezcano
2024-07-12 23:06:52 +00:00
Jiang, Yanbing
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without regenerating it on one particular platform. Meanwhile, the size of the input `weight` is reduced to 1/8 of its former size.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled to platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not reuse a generated weight on a different ISA or platform, because the compute format differs when the weight is loaded onto a device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) as the input `weight` of `_convert_weight_to_int4pack` across devices and ISAs, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)
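
A hedged sketch of producing the new serialized layout and handing it to the op (sizes, nibble order, the innerKTiles value, and the CUDA device are illustrative assumptions):

```python
import torch

n, k = 128, 256
q = torch.randint(0, 16, (n, k), dtype=torch.uint8)    # one int4 value per element
packed = (q[:, ::2] << 4) | q[:, 1::2]                  # serialized [n, k // 2] uint8 layout
# each backend (CPU/CUDA/MPS) converts this common format into its own compute layout
w_int4 = torch._convert_weight_to_int4pack(packed.cuda(), 8)   # 8 = innerKTiles (illustrative)
```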

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression from this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
PyTorch MergeBot
637cc8d27f Revert "update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)"
This reverts commit 6367f02a0e.

Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main 6367f02a0e ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))
2024-07-10 13:48:32 +00:00
Jiang, Yanbing
6367f02a0e update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without regenerating it on one particular platform. Meanwhile, the size of the input `weight` is reduced to 1/8 of its former size.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled to platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not reuse a generated weight on a different ISA or platform, because the compute format differs when the weight is loaded onto a device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) as the input `weight` of `_convert_weight_to_int4pack` across devices and ISAs, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression from this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-10 07:38:42 +00:00
Jerry Mannil
42f647219a [ROCm] Add int4 support (#129710)
- Add AMD support for the int4 kernel
  - Only supports CDNA2 and CDNA3 GPUs for now
  - Uses the `mfma_f32_16x16x16bf16` instruction for the matrix multiply
  - Uses the `v_and_or_b32` instruction and the `__hfma2` intrinsic for unpacking bf16 values
  - Enable hipify for the `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD GPUs
- Fix TorchScript issues due to hipify of the `__nv_bfloat16` type
  - TorchScript has its own implementation of the bfloat16 type
    - Implemented in the `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So, we shouldn't hipify any reference to `__nv_bfloat16` in the TorchScript implementation
    - Hence, the direct `__nv_bfloat16` references in `codegen.cpp` and `cuda_codegen.cpp` were moved to `resource_strings.h`, which is already exempted from hipify

Fixes #124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-09 19:49:12 +00:00
PyTorch MergeBot
d7b7f8b79f Revert "[ROCm] Add int4 support (#129710)"
This reverts commit d0ad13fa42.

Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))
2024-07-08 16:07:53 +00:00
Jerry Mannil
d0ad13fa42 [ROCm] Add int4 support (#129710)
Add AMD support for the int4 kernel using the mfma_f32_16x16x16bf16 instruction.
Only supports CDNA2 and CDNA3 GPUs for now.
Fixes #124699

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-07 23:54:22 +00:00
Edward Z. Yang
29c68df600 Stop immediately specializing common constants 0/1 for plain int (#128327)
Fixes https://github.com/pytorch/pytorch/issues/128319

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128327
Approved by: https://github.com/lezcano
ghstack dependencies: #129983
2024-07-03 16:41:51 +00:00
Fuzzkatt
0441173ab2 Add slowTest marker to test_linalg_solve_triangular_large (#129903)
In NVIDIA internal testing, on slower devices such as Orin NX and with large dtypes like complex128, test_linalg_solve_triangular_large takes multiple hours to complete and times out CI. This PR adds a slowTest marker so it can be skipped due to speed issues. cc @nWEIdia
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129903
Approved by: https://github.com/lezcano
2024-07-02 12:27:12 +00:00
awayzjj
ccc4ee7793 check boolean alpha and beta of Fake tensor impl for Tensor.addr (#129839)
Fixes https://github.com/pytorch/pytorch/issues/127043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129839
Approved by: https://github.com/lezcano
2024-07-02 09:20:49 +00:00
Jeff Daily
04206d1898 TunableOp hotfix, unit test follow-up (#129606)
PR #129281 was landed to fix critical issues but did not contain unit tests to exercise those issues.  This is a follow-up set of unit tests that exercise the problems seen previously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129606
Approved by: https://github.com/atalman
2024-06-27 17:01:04 +00:00
Jeff Daily
0e7bd7fedd [ROCm] TunableOp improvements (#124362)
- use less memory; smaller default hipblaslt workspace size
- options to avoid cache effects
  - icache flush option
  - rotating buffers during tuning
- python APIs (see the sketch after this list)
- unit tests
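
A hedged sketch of the Python-side controls referenced above; the function names follow the `torch.cuda.tunable` module, but treat the exact names and units as assumptions rather than this PR's verbatim API:

```python
import torch
import torch.cuda.tunable as tunable

tunable.enable(True)                    # master switch for TunableOp
tunable.tuning_enable(True)             # allow tuning of unseen GEMM shapes
tunable.set_max_tuning_duration(30)     # per-GEMM tuning budget (assumed milliseconds)
tunable.set_max_tuning_iterations(100)  # cap on candidate evaluations

# ... run the workload ...

tunable.write_file("tunable_results.csv")  # persist the selected kernels
```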

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362
Approved by: https://github.com/xw285cornell
2024-06-03 22:30:11 +00:00
Xuehai Pan
8b08b0f340 [BE] enable ruff rule Q from flake8-quotes (#127713)
Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes:

- [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003)
- [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713
Approved by: https://github.com/ezyang
2024-06-02 23:25:26 +00:00
lezcano
48538d3d14 Implement svd_lowrank and pca_lowrank for complex numbers (#125580)
We fix a number of bugs previously present in the complex
implementation.

We also heavily simplify the implementation, using, among
other things, the fact that we now have conjugate views.

There is a comment regarding how slow some checks on this
function are, so I removed quite a few of the input combinations
to make the OpInfo lighter. I still left a couple of relevant examples so as not to regress
coverage.
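
A small hedged example of the newly supported complex path (sizes and rank are illustrative):

```python
import torch

A = torch.randn(64, 32, dtype=torch.complex64)
U, S, V = torch.svd_lowrank(A, q=8)
approx = (U * S) @ V.mH        # rank-8 approximation; S is real, U and V are complex
print(torch.linalg.matrix_norm(A - approx))
```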

Fixes https://github.com/pytorch/pytorch/issues/122188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125580
Approved by: https://github.com/pearu, https://github.com/peterbell10
2024-05-30 14:45:58 +00:00
William Wen
5359af0c7e [dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341)
Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-05-29 05:18:04 +00:00
Nikita Shulga
4ff9113e3d [MPS] Add _weight_int8pack_mm tests (#127041)
Also extends the test to cover MV cases (where the A matrix is 1xM). Limits int8 op testing to 32x32 matrix sizes for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041
Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales
2024-05-24 16:08:06 +00:00
Scott Wolchok
85fd76f76d Add test coverage for fp16 matrix-vector specialized kernel (#126700)
Summary: This kernel is special-cased on ARM because it's important for LLMs, so let's have test coverage.
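
A hedged illustration of the shape the kernel covers, a CPU fp16 matrix-vector product (sizes and tolerances are illustrative, not the test's actual values):

```python
import torch

A = torch.randn(64, 128, dtype=torch.float16)
x = torch.randn(128, dtype=torch.float16)
y = torch.mv(A, x)                                 # fp16 matrix-vector product on CPU
ref = torch.mv(A.float(), x.float()).half()        # float32 reference
torch.testing.assert_close(y, ref, rtol=1e-2, atol=1e-2)
```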

Test Plan: Ran locally and it passes. Intentionally broke fp16_gemv_trans and saw it fail, confirming it provides coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126700
Approved by: https://github.com/malfet
2024-05-21 17:23:16 +00:00
Catherine Lee
6f619cc727 [ez] functorch/test_vmap and test_dataloader to run in parallel (#125597)
Also mark test_svd as serial in linalg to see if it helps with the flakiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125597
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-08 15:37:29 +00:00
Nikita Shulga
30610251ec [MPS] And naive quantized intmm and .gputrace capture hooks (#125163)
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfiler`, which can be used to help debug/profile Metal kernels by wrapping the calls with the following
  ```cpp
   if (getMPSProfiler().isCaptureEnabled()) {
     getMPSProfiler().startCapture(__func__, mpsStream);
   }
   ...
   if (getMPSProfiler().isCapturing()) {
     getMPSProfiler().stopCapture(mpsStream);
   }
  ```
  that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to 1, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">

- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`

TODOs:
 - Add weight pack
 - Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
2024-05-03 15:20:39 +00:00
Nikita Shulga
acac7aa70f [CI] Unskip Linalg tests on ARM (#125377)
Removes the obscure "Issue with numpy version on arm" skip added by https://github.com/pytorch/pytorch/pull/82213
and replaces it with 4 targeted skips:
- test_addmv for `float16`
- test_vector_norm for `float16`, `bfloat16` and `float32`

Followups to fix them are tracked in https://github.com/pytorch/pytorch/issues/125438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125377
Approved by: https://github.com/kit1980
2024-05-03 01:18:52 +00:00
aaitzhan
47ba7a76e2 [ATen][CUDA][AMP] Fix dtype mismatch in linalg_vector_norm (#125175)
Fixes #125174
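
A hedged sketch of the kind of call involved (a CUDA device and autocast are assumed; the exact repro lives in the linked issue):

```python
import torch

x = torch.randn(1024, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):
    n = torch.linalg.vector_norm(x)   # could hit a dtype mismatch under AMP before this fix
print(n.dtype)
```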

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125175
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-05-01 10:57:12 +00:00
Aaron Orenstein
a8574a9719 Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-26 15:35:53 +00:00
PyTorch MergeBot
1ac60484c1 Revert "Fix global flake8 issues (#124771)"
This reverts commit f01275934b.

Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
Aaron Orenstein
f01275934b Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-25 14:25:00 +00:00
Peter Y Yeh
2e7b8ff116 [ROCm] Fix Int_mm() Integration with hipblasLT (#122431)
The PR

- fixes the int_mm() / int8_gemm() integration with the hipBLASLt backend (requires ROCm 6.0).
- enables/fixes the following tests on ROCm:
    - test__int_mm_k_16_n_16_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_True_use_transpose_b_True_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122431
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/atalman
2024-04-24 02:29:33 +00:00
Jeff Daily
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.
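
A minimal usage sketch; the API and environment-variable names are quoted from the description above:

```python
import torch

# opt in to the Lt backend at runtime; "hipblaslt" is accepted as an alias on ROCm
torch.backends.cuda.preferred_blas_library(backend="cublaslt")

# equivalently, before process start:
#   TORCH_BLAS_PREFER_CUBLASLT=1 python train.py   (or TORCH_BLAS_PREFER_HIPBLASLT=1)
```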

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
Nikita Shulga
c74dfca5e7 Int4MM: Unswizzle for different dtypes (#124448)
If the dtype is not the one this platform is optimized for, it might need different unswizzling patterns. Implement them for the non-vectorized flavor of the kernel, so that int4mm can be used with the float32 and float16 dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124448
Approved by: https://github.com/jgong5, https://github.com/mikekgfb
2024-04-19 21:17:15 +00:00
William Wen
cbde0f048b [dynamo, 3.12] enable tests disabled due to missing dynamo 3.12 support (#123300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123300
Approved by: https://github.com/jansel, https://github.com/malfet, https://github.com/zou3519
2024-04-05 20:13:17 +00:00
AyaseNana
0a7162f898 Fix svd_lowrank parameter M (#122681)
ISSUE: #122699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122681
Approved by: https://github.com/lezcano
2024-03-29 18:06:38 +00:00
min-jean-cho
057892f4be [CPU] optimize Lp norm for 1-dimensional vector (#122143)
Fixes https://github.com/pytorch/pytorch/issues/120229

- Optimize vector norm by simplifying the vector norm formula for 1-dimensional vectors.
- The vector norm formula for a 1-dimensional vector simplifies to `abs(x)`. See below for the proof and the sketch after this list.
- Next step, we can similarly optimize matrix norm (`torch.linalg.matrix_norm`) for 1 x 1 matrices.
- Additionally, this avoids overflow in the power `abs(x) ** p` for large `p` or `x` for 1-dimensional vectors.
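
A small check of the identity the optimization relies on (the `ord` values and tolerances are illustrative; `ord=0` is excluded because it counts non-zeros rather than reducing to `abs(x)`):

```python
import torch

x = torch.randn(2**18, 1)
for ord in (float("inf"), float("-inf"), 1, -1, 2, -2, 3.5):
    torch.testing.assert_close(
        torch.linalg.vector_norm(x, ord=ord, dim=-1),
        x.abs().squeeze(-1),
        rtol=1e-5,
        atol=1e-7,
    )
```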

### Performance
Avg Latency (ms) of `torch.norm` and `torch.linalg.vector_norm` for
`torch.norm(torch.randn(2**18, 1), ord, -1)`
`torch.linalg.vector_norm(torch.randn(2**18, 1), ord, -1)`
Tested on 28 physical cores/socket, 1 socket on Skylake.

| op | input shape | dim | ord | baseline (master) avg latency (ms) | optimized (7102f1ef372b248414d36cbd0c51a546b6b6a41a) avg latency (ms) | speedup ratio (baseline/optimized) |
|---|---|---|---|---|---|---|
| torch.norm | (2**18, 1) | -1 | fro | 34.3755531 | 0.0125408 | 2741.094 |
| torch.norm | (2**18, 1) | -1 | inf | 34.0952635 | 0.0122237 | 2789.271 |
| torch.norm | (2**18, 1) | -1 | -inf | 34.3674493 | 0.0120759 | 2845.953 |
| torch.norm | (2**18, 1) | -1 | 0 | 34.1004515 | 0.0175261 | 1945.69 |
| torch.norm | (2**18, 1) | -1 | 1 | 34.1688442 | 0.0121593 | 2810.089 |
| torch.norm | (2**18, 1) | -1 | -1 | 33.949492 | 0.0120282 | 2822.487 |
| torch.norm | (2**18, 1) | -1 | 2 | 34.3669581 | 0.0120401 | 2854.366 |
| torch.norm | (2**18, 1) | -1 | -2 | 33.9252067 | 0.0121069 | 2802.139 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | inf | 34.090879 | 0.0095105 | 3584.545 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | -inf | 34.3708754 | 0.0099111 | 3467.931 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | 0 | 34.0880775 | 0.0141716 | 2405.38 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | 1 | 34.1392851 | 0.0093174 | 3664.036 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | -1 | 33.925395 | 0.0092483 | 3668.302 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | 2 | 34.3854165 | 0.0092459 | 3719.002 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | -2 | 33.932972 | 0.0093007 | 3648.429 |

### Proof
<details>
<summary>For those interested :)</summary>

<img width="382" alt="1_dim_vector_norm_proof1" src="https://github.com/pytorch/pytorch/assets/93151422/59b1e00b-8fcd-47cb-877d-d31403b5195b">
<img width="432" alt="1_dim_vector_norm_proof2" src="https://github.com/pytorch/pytorch/assets/93151422/236bea15-2dd5-480b-9871-58b2e3b24322">

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122143
Approved by: https://github.com/lezcano
2024-03-20 23:20:25 +00:00
Xia, Weiwen
8168338063 Add CPU implementation for torch._int_mm (s8*s8->s32) (#121792)
Fixes #121647

**Description**
Currently, the op `torch._int_mm` only supports the CUDA device. This PR adds a CPU implementation for it.
Besides the request from the issue, this op may also be useful for planned CPU implementations of [LLM.int8()](https://arxiv.org/abs/2208.07339) in [Bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used.
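
A hedged usage sketch (shapes are illustrative; `torch._int_mm` is a private op, so treat the exact shape constraints as per the PR and its tests):

```python
import torch

a = torch.randint(-128, 128, (32, 64), dtype=torch.int8)
b = torch.randint(-128, 128, (64, 48), dtype=torch.int8)
c = torch._int_mm(a, b)                     # s8 * s8 -> s32, now also on CPU
ref = a.to(torch.int32) @ b.to(torch.int32)  # plain int32 reference
torch.testing.assert_close(c, ref)
```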

**Test plan**
`python test/test_linalg.py -k test__int_mm_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121792
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-03-19 08:44:33 +00:00
mingfeima
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00