Commit Graph

529 Commits

Author SHA1 Message Date
Nichols A. Romero
641ca67d5a [ROCM] Fix hipBLASLt version check in TunableOp test (#139811)
Allow 3 or more digits for hipBLASLt version check in TunableOp test. Needed due to upcoming ROCm 6.3 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139811
Approved by: https://github.com/eqy, https://github.com/malfet
2024-11-06 14:37:45 +00:00
Nichols A. Romero
2922b9fee1 [ROCm] Fix ADDMM hipBLASLt regression (#138267)
Fixes #138067

A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604

The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267
Approved by: https://github.com/eqy, https://github.com/jeffdaily
2024-10-28 16:07:11 +00:00
Xia, Weiwen
e299193423 Bug fix: Use oneDNN for torch._int_mm CPU only when avx512_vnni is supported (#136942)
Fixes #136746

If AVX512_VNNI is not supported, overflow occurs inside oneDNN. Fall back to ref path in such case.
UT is also updated to catch the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136942
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-26 01:17:11 +00:00
Jeff Daily
3f3b692a00 [ROCm] CK-based GEMM (#131004)
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()`
- reference CK gemm implementations for float, bfloat16, and half types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony

Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-10-20 02:57:43 +00:00
Nichols A. Romero
aa28062169 [ROCm] TunableOp more unit test follow-up - Part 2 (#134517)
More unit tests to cover TunableOp functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134517
Approved by: https://github.com/jeffdaily
2024-10-16 01:49:47 +00:00
Huy Do
929797dedb Fix test_matmul_offline_tunableop by writing its output files to a temp dir (#137835)
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:

```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
    os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```

For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097

The test tried to catch and ignore this, but this is Windows.  So, the fix is to:

1. Ignore if these files couldn't be removed
2. Write them to a temp directory instead, otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman
2024-10-14 17:28:33 +00:00
Jin Zhou
5516ac5c21 [ROCm] Tunableop record untuned (#128813)
When enable tunableop, It is easy to have OOM since APP usually needs large video memory size, such as running a LLM for inference.  So we need a offline mode to tune the GEMMs. This PR provide an offline mode for tunableOp:

- record untuned GEMMs to file.

- a python API named tune_gemm_in_file is added to read the untuned file and tune the GEMMs in file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
Yuanhao Ji
758dbac308 Add type check for ord in torch.linalg.vector_norm() and torch.linalg.matrix_norm() (#137463)
fixes #137424, fixes #137460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137463
Approved by: https://github.com/lezcano
2024-10-08 17:53:56 +00:00
Peter Y. Yeh
8aa110cb00 [ROCm] Enable int_mm_error tests for rocm 6.0+ (#124999)
This pull request enables the int_mm_error tests for rocm 6.0+ . since  #122431 landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124999
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-10-07 21:10:18 +00:00
Haifeng Jin
37dd924c2d Fix test/test_linalg.py for NumPy 2 (#136800)
Related to  #107302.

When built and tested with NumPy 2 the following unit tests failed.

```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```

This PR fixes them. The test is now compatible with both NumPy 1 & 2.

Some more details:

1. The `np.linalg.solve` has changed its behavior. So I added an adapt function in the unit test to keep its behavior the same no matter it is NumPy 1 or Numpy 2.
2. The cause of the failure is when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while it is `(torch.Tensor, torch.Tensor)` in NumPy 2.
3. NumPy 2 does not allow `np.array(obj, copy=False)`, but recommended to use `np.asarray(obj)` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
2024-10-01 07:53:24 +00:00
Nikita Shulga
c3e678382b Fix addmm silent correctness on aarch64 (#136371)
Do not dispatch to fast gemmv functions when alpha is not equal to 1

Add regression test to address the problem

Fixes https://github.com/pytorch/pytorch/issues/136299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371
Approved by: https://github.com/swolchok
2024-09-23 17:10:34 +00:00
Jeff Daily
0eb9c870fd [reland][ROCm] TunableOp for gemm_and_bias (#128919)
Reland of #128143 but added `alpha` and `bias` initialization to `launchTunableGemmAndBias`

Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919
Approved by: https://github.com/malfet
2024-08-22 18:27:50 +00:00
Nichols A. Romero
f25df31008 TunableOp more unit test follow-up (#130065)
More unit tests for preventing TunableOp regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130065
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-08-08 22:42:16 +00:00
Jiang, Yanbing
bceb91222c Fix meta error in _convert_weight_to_int4pack (#130915)
This PR is to fix meta error in _convert_weight_to_int4pack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130915
Approved by: https://github.com/jerryzh168
2024-07-26 08:36:30 +00:00
Zixi Qi
c3fe9075a9 [ROCM] Use hipblaslt version from hipblaslt runtime instead of header for tunableops validator (#131078)
Summary:
When tunable ops load selected kernels from csv file, it will validate hipblaslt version defined in hipblaslt-version.h

This PR changes the validator to fetch hipblaslt version and revision from hipblaslt runtime instead of the header file, as in our environment we might rollout a new version of the run time prior to updating the header file fleet wide.

Test Plan:
Verified generated tunableops kernel selection has the correct hipblaslt version from runtime:

```
Validator,PT_VERSION,2.5.0
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
Validator,HIPBLASLT_VERSION,800-bf2c3184
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676
GemmTunableOp_BFloat16_TN,tn_7168_2_8192,Gemm_Hipblaslt_TN_482,0.0359019
GemmTunableOp_BFloat16_TN,tn_8192_2_1024,Default,0.0173723
GemmTunableOp_BFloat16_TN,tn_1280_2_8192,Gemm_Hipblaslt_TN_491,0.0191047
```

Differential Revision: D59889043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131078
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
2024-07-25 00:54:07 +00:00
Jeff Daily
69b1999586 TunableOp size hotfix (#130800)
Fixes #130727.  GetSize calculation was incorrect for strided batched gemm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130800
Approved by: https://github.com/xw285cornell
2024-07-22 23:42:26 +00:00
Aaron Gokaslan
d1c4e6b55f [BE]: Enable a few additional ruff rules (#130700)
Enables a few extra ruff rules, most of which do not have any violations as I already cleaned them with earlier PRs, these just turns them on to enforce them. Adds 1 noqa as we want the suboptimal lambda generation + call kept as a test. Also enables the test in flake8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130700
Approved by: https://github.com/justinchuby, https://github.com/ezyang
2024-07-17 02:06:04 +00:00
inkcherry
f422027fce fix torch.linalg.lstsq input check (#130612)
Fixes [#117236 ](https://github.com/pytorch/pytorch/issues/117236)
The current case does not meet the vector scenario requirements, and it lacks sufficient checks (relying solely on ```dim_diff``` is insufficient).  Consequently, it triggers an internal assertion error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130612
Approved by: https://github.com/lezcano
2024-07-12 23:06:52 +00:00
Jiang, Yanbing
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.

Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
PyTorch MergeBot
637cc8d27f Revert "update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)"
This reverts commit 6367f02a0e.

Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main 6367f02a0e ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))
2024-07-10 13:48:32 +00:00
Jiang, Yanbing
6367f02a0e update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.

Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-10 07:38:42 +00:00
Jerry Mannil
42f647219a [ROCm] Add int4 support (#129710)
- Add AMD support for int4 kernel
  - Only supports CDNA2 and CDNA3 gpus for now
  - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
  - Uses `v_and_or_b32` instruction and `__hfma2` instrinsic for unpacking bf16 values
  - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus
- Fix torchscript issues due to hipify for `__nv_bfloat16` type
  - TorchScript has its own implementation for bfloat16 type
    - Implemented in `__nv_bloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation
    - Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify

Fixes #124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-09 19:49:12 +00:00
PyTorch MergeBot
d7b7f8b79f Revert "[ROCm] Add int4 support (#129710)"
This reverts commit d0ad13fa42.

Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))
2024-07-08 16:07:53 +00:00
Jerry Mannil
d0ad13fa42 [ROCm] Add int4 support (#129710)
Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction.
Only supports CDNA2 and CDNA3 gpus for now.
Fixes #124699

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-07 23:54:22 +00:00
Edward Z. Yang
29c68df600 Stop immediately specializing common constants 0/1 for plain int (#128327)
Fixes https://github.com/pytorch/pytorch/issues/128319

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128327
Approved by: https://github.com/lezcano
ghstack dependencies: #129983
2024-07-03 16:41:51 +00:00
Fuzzkatt
0441173ab2 Add slowTest marker to test_linalg_solve_triangular_large (#129903)
In nvidia internal testing, for slower devices such as Orin NX, on large dtypes like complex128, test_linalg_solve_triangular_large is taking multiple hours to complete and timing out CI. This PR adds a slowTest marker so it can be skipped due to speed issues. cc @nWEIdia
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129903
Approved by: https://github.com/lezcano
2024-07-02 12:27:12 +00:00
awayzjj
ccc4ee7793 check boolean alpha and beta of Fake tensor impl for Tensor.addr (#129839)
Fixes https://github.com/pytorch/pytorch/issues/127043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129839
Approved by: https://github.com/lezcano
2024-07-02 09:20:49 +00:00
Jeff Daily
04206d1898 TunableOp hotfix, unit test follow-up (#129606)
PR #129281 was landed to fix critical issues but did not contain unit tests to exercise those issues.  This is a follow-up set of unit tests that would exercise the problems seen previously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129606
Approved by: https://github.com/atalman
2024-06-27 17:01:04 +00:00
Jeff Daily
0e7bd7fedd [ROCm] TunableOp improvements (#124362)
- use less memory; smaller default hipblaslt workspace size
- options to avoid cache effects
  - icache flush option
  - rotating buffers during tuning
- python APIs
- unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362
Approved by: https://github.com/xw285cornell
2024-06-03 22:30:11 +00:00
Xuehai Pan
8b08b0f340 [BE] enable ruff rule Q from flake8-quotes (#127713)
Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes:

- [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003)
- [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713
Approved by: https://github.com/ezyang
2024-06-02 23:25:26 +00:00
lezcano
48538d3d14 Implement svd_lowrank and pca_lowrank for complex numbers (#125580)
We fix a number of bugs previously present in the complex
implementation.

We also heavily simplify the implementation, using, among
other things, that we now have conjugate views.

I saw there is a comment regarding how slow some checks on this
function are. As such, I removed quite a few of the combinations of inputs
to make the OpInfo lighter. I still left a couple relevant examples to not regress
coverage though.

Fixes https://github.com/pytorch/pytorch/issues/122188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125580
Approved by: https://github.com/pearu, https://github.com/peterbell10
2024-05-30 14:45:58 +00:00
William Wen
5359af0c7e [dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341)
Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-05-29 05:18:04 +00:00
Nikita Shulga
4ff9113e3d [MPS] Add _weight_int8pack_mm tests (#127041)
As well as extend the test to cover MV cases (where A matrix is 1xM) Limit int8 op testing to 32x32 matrix sizes for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041
Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales
2024-05-24 16:08:06 +00:00
Scott Wolchok
85fd76f76d Add test coverage for fp16 matrix-vector specialized kernel (#126700)
Summary: This kernel is special-cased on ARM because it's important for LLMs, so let's have test coverage.

Test Plan: Ran locally and it passes. Intentionally broke fp16_gemv_trans and saw it fail, confirming it provides coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126700
Approved by: https://github.com/malfet
2024-05-21 17:23:16 +00:00
Catherine Lee
6f619cc727 [ez] functorch/test_vmap and test_dataloader to run in parallel (#125597)
Also mark test_svd serial in linalg to see if it helps with the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125597
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-08 15:37:29 +00:00
Nikita Shulga
30610251ec [MPS] And naive quantized intmm and .gputrace capture hooks (#125163)
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfile` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
  ```cpp
   if (getMPSProfiler().profiler.isCaptureEnabled()) {
     getMPSProfiler().startCapture(__func__, mpsStream);
   }
   ...
   if (getMPSProfiler().isCapturing()) {
     getMPSProfiler().stopCapture(mpsStream);
   }
  ```
  that, if invoked with `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files, in the current working directory, which can later be loaded and used to debug or profiler the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">

- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`

TODOs:
 - Add weight pack
 - Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
2024-05-03 15:20:39 +00:00
Nikita Shulga
acac7aa70f [CI] Unskip Linalg tests on ARM (#125377)
Removes obscure "Issue with numpy version on arm" added by https://github.com/pytorch/pytorch/pull/82213
And replaces it with 4 targeted skips:
 - test_addmv for `float16`
- test_vector_norm for `float16`, `bfloat16` and `float32`

Followups to fix them are tracked in https://github.com/pytorch/pytorch/issues/125438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125377
Approved by: https://github.com/kit1980
2024-05-03 01:18:52 +00:00
aaitzhan
47ba7a76e2 [ATen][CUDA][AMP] Fix dtype mismatch in linalg_vector_norm (#125175)
Fixes #125174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125175
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-05-01 10:57:12 +00:00
Aaron Orenstein
a8574a9719 Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-26 15:35:53 +00:00
PyTorch MergeBot
1ac60484c1 Revert "Fix global flake8 issues (#124771)"
This reverts commit f01275934b.

Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
Aaron Orenstein
f01275934b Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-25 14:25:00 +00:00
Peter Y Yeh
2e7b8ff116 [ROCm] Fix Int_mm() Integration with hipblasLT (#122431)
The PR

- fixes int_mm() /int8_gemm() integration with hipblasLT backend (require ROCm 6.0).
- enables/fixes the following tests on Rocm
    - test__int_mm_k_16_n_16_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_True_use_transpose_b_True_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122431
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/atalman
2024-04-24 02:29:33 +00:00
Jeff Daily
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
Nikita Shulga
c74dfca5e7 Int4MM: Unswizzle for different dtypes (#124448)
If dtype is not the one this platform is optimized for, it might need different unswizzling pattenrs Implement ones for non-vectorized flavor of the kernel, so that int4mm can be used with float32 and float16 dtypes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124448
Approved by: https://github.com/jgong5, https://github.com/mikekgfb
2024-04-19 21:17:15 +00:00
William Wen
cbde0f048b [dynamo, 3.12] enable tests disabled due to missing dynamo 3.12 support (#123300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123300
Approved by: https://github.com/jansel, https://github.com/malfet, https://github.com/zou3519
2024-04-05 20:13:17 +00:00
AyaseNana
0a7162f898 Fix svd_lowrank parameter M (#122681)
ISSUE: #122699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122681
Approved by: https://github.com/lezcano
2024-03-29 18:06:38 +00:00
min-jean-cho
057892f4be [CPU] optimize Lp norm for 1-dimensional vector (#122143)
Fixes https://github.com/pytorch/pytorch/issues/120229

- Optimize vector norm by simplifying vector norm formula for 1-dimensional vector.
- Vector norm formula for 1-dimensional vector simplifies to `abs(x)`. See below for proof.
- Next step, we can similarly optimize matrix norm (`torch.linalg.matrix_norm`) for 1 x 1 matrix.
- Additionally, avoids overflow in power, `abs(x) ** p` for large `p` or `x`, for 1-dimensional vector.

### Performance
Avg Latency (ms) of `torch.norm` and `torch.linalg.vector_norm` for
`torch.norm(torch.randn(2**18, 1), ord, -1)`
`torch.linalg.vector_norm(torch.randn(2**18, 1), ord, -1)`
Tested on 28 physical cores/socket, 1 socket on Skylake.

|                          	|                 	|         	|         	| **Avg Latency (ms)**  	|                       	|                                        	|
|--------------------------	|-----------------	|---------	|---------	|-----------------------	|-----------------------	|----------------------------------------	|
| **op**                   	| **input shape** 	| **dim** 	| **ord** 	| **baseline (master)** 	| **optimized (7102f1ef372b248414d36cbd0c51a546b6b6a41a)** 	| **speedup ratio (baseline/optimized)** 	|
| torch.norm               	| (2**18, 1)      	| -1      	| fro     	| 34.3755531            	| 0.0125408             	| 2741.094                               	|
|                          	|                 	|         	| inf     	| 34.0952635            	| 0.0122237             	| 2789.271                               	|
|                          	|                 	|         	| -inf    	| 34.3674493            	| 0.0120759             	| 2845.953                               	|
|                          	|                 	|         	| 0       	| 34.1004515            	| 0.0175261             	| 1945.69                                	|
|                          	|                 	|         	| 1       	| 34.1688442            	| 0.0121593             	| 2810.089                               	|
|                          	|                 	|         	| -1      	| 33.949492             	| 0.0120282             	| 2822.487                               	|
|                          	|                 	|         	| 2       	| 34.3669581            	| 0.0120401             	| 2854.366                               	|
|                          	|                 	|         	| -2      	| 33.9252067            	| 0.0121069             	| 2802.139                               	|
|                          	|                 	|         	|         	|                       	|                       	|                                        	|
| torch.linalg.vector_norm 	| (2**18, 1)      	| -1      	| inf     	| 34.090879             	| 0.0095105             	| 3584.545                               	|
|                          	|                 	|         	| -inf    	| 34.3708754            	| 0.0099111             	| 3467.931                               	|
|                          	|                 	|         	| 0       	| 34.0880775            	| 0.0141716             	| 2405.38                                	|
|                          	|                 	|         	| 1       	| 34.1392851            	| 0.0093174             	| 3664.036                               	|
|                          	|                 	|         	| -1      	| 33.925395             	| 0.0092483             	| 3668.302                               	|
|                          	|                 	|         	| 2       	| 34.3854165            	| 0.0092459             	| 3719.002                               	|
|                          	|                 	|         	| -2      	| 33.932972             	| 0.0093007             	| 3648.429                               	|

### Proof
<details>
<summary>For those interested :)</summary>

<img width="382" alt="1_dim_vector_norm_proof1" src="https://github.com/pytorch/pytorch/assets/93151422/59b1e00b-8fcd-47cb-877d-d31403b5195b">
<img width="432" alt="1_dim_vector_norm_proof2" src="https://github.com/pytorch/pytorch/assets/93151422/236bea15-2dd5-480b-9871-58b2e3b24322">

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122143
Approved by: https://github.com/lezcano
2024-03-20 23:20:25 +00:00
Xia, Weiwen
8168338063 Add CPU implementation for torch._int_mm (s8*s8->s32) (#121792)
Fixes #121647

**Description**
Currently, the op `torch._int_mm` only supports CUDA device. This PR adds CPU implementation for it.
Besides the request from the issue, this op may also be useful for planned CPU implementations of [LLM.int8()](https://arxiv.org/abs/2208.07339) in [Bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used.

**Test plan**
`python test/test_linalg.py -k test__int_mm_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121792
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-03-19 08:44:33 +00:00
mingfeima
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
mingfeima
a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speedup https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00
eqy
8dafc81ba9 [cuBLAS][cuBLASLt] Fix expected failures for int_mm on sm75 (turing) (#121277)
CC @malfet @atalman @ptrblck @tinglvv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121277
Approved by: https://github.com/malfet
2024-03-06 01:51:01 +00:00
PyTorch MergeBot
0c07c0c15f Revert "add int4 packed gemm support on CPU device (#117475)"
This reverts commit 30befa592e.

Reverted https://github.com/pytorch/pytorch/pull/117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/117475#issuecomment-1977474686))
2024-03-04 21:20:57 +00:00
PyTorch MergeBot
a98c17edc7 Revert "add int8 packed gemm support on CPU device (#118056)"
This reverts commit f84375ca5d.

Reverted https://github.com/pytorch/pytorch/pull/118056 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/118056#issuecomment-1977368720))
2024-03-04 20:09:40 +00:00
mingfeima
f84375ca5d add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
ghstack dependencies: #117475
2024-03-02 04:35:49 +00:00
mingfeima
30befa592e add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speedup https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-02 00:17:34 +00:00
Jeff Daily
e98dbae0a0 [ROCm] enable hipsolver backend for linalg.eigh (#115177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115177
Approved by: https://github.com/lezcano
2024-02-08 22:03:27 +00:00
haozhe.zhu@intel.com
0ae952db76 enable mkldnn bf32 matmul (#116015)
### Testing
FP32 matmul vs. mkldnn BF32 matmul  on SPR

single core:

Input | BF32   / ms | FP32  /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:
Input | BF32   / ms | FP32 /   ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True |1129.204 | 1708.912 |  1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915  | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 |  8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128  | 1296.419 | 1308.411 | 1.009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-20 09:30:23 +00:00
lezcano
f2d6e99f8d Workaround a cusolver bug on CUDA < 12.1 in triangular_solve (#117636)
Fix https://github.com/pytorch/pytorch/issues/79191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117636
Approved by: https://github.com/malfet
2024-01-19 12:42:37 +00:00
Animesh Jain
6e4e81a9ef [dynamo] Extend LazyVariableTracker to tuples (#117426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117426
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-01-18 15:51:28 +00:00
Sun, Jiayi
7a851fedc8 support torch.mm with conjugate transposed inputs (#117238)
Fix https://github.com/pytorch/pytorch/issues/116855.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117238
Approved by: https://github.com/lezcano
2024-01-15 12:36:01 +00:00
soulitzer
e10cfdd895 Update matmul requires_grad checks (#117067)
Fixes https://github.com/pytorch/pytorch/issues/116099
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117067
Approved by: https://github.com/lezcano, https://github.com/albanD
ghstack dependencies: #116523, #116710
2024-01-10 23:16:42 +00:00
lcskrishna
b9293e74a2 [ROCm] Fixes for hipblasLt for mm use case. (#116537)
This PR fixes the accuracy issues for hipblasLT for mm case on ROCm.
This PR is a follow up to the integration PR https://github.com/pytorch/pytorch/pull/114329 and https://github.com/pytorch/pytorch/pull/114890

The accuracy issue arises for mm usecase for ROCm where hipblasLT is enabled, and a bias has been passed which is not required. This PR addresses that issue.
Added a unit-test case for this issue (bias=None) case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116537
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-01-10 22:13:18 +00:00
Aaron Gokaslan
bbe3261dd3 [BE]: Use iterable.chain.from_iterable where possible (#116376)
This is more readable and more efficient when dealing with lots of sequences to chain together.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116376
Approved by: https://github.com/albanD
2023-12-27 19:20:07 +00:00
Aaron Gokaslan
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00
PyTorch MergeBot
24af118e55 Revert "markDynamoStrictTest more tests (#115871)"
This reverts commit 478f0e96dc.

Reverted https://github.com/pytorch/pytorch/pull/115871 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff, this is required to revert #115870 ([comment](https://github.com/pytorch/pytorch/pull/115871#issuecomment-1862992931))
2023-12-19 15:36:27 +00:00
rzou
49af19cd8e Skip some flaky Dynamo tests in test_linalg.py (#115925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115925
Approved by: https://github.com/lezcano
2023-12-16 02:38:56 +00:00
rzou
478f0e96dc markDynamoStrictTest more tests (#115871)
For:
test_dispatch.py
test_fake_tensor.py
test_indexing.py
test_linalg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115871
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870
2023-12-15 05:26:54 +00:00
eqy
353f2dbd9c [CUDA] Fix V100 expected failures in test_mm_decomp and test_linalg (#115666)
BFloat16 isn't supported on sm70 and we get an unexpected cuBLAS success in 12.3+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115666
Approved by: https://github.com/malfet
2023-12-14 19:17:53 +00:00
atalman
43e3242490 [BE] Remove test corner cases for CUDA older than supported 11.8 (#114989)
Remove deprecated CUDA use cases from tests.
Similar to: https://github.com/pytorch/pytorch/pull/112873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114989
Approved by: https://github.com/malfet
2023-12-04 21:41:03 +00:00
Miles Lubin
070b2d3cff cholesky_solve_backward: speed up using output_mask (#112981)
Introduces a faster path for `cholesky_solve_backward` when the gradient with respect to the cholesky factor isn't required.

Adds test coverage in `test_linalg.py`.

# Example

## Setup

```py
import torch
torch.set_num_threads(1)
mat = torch.randn(500, 1000)
mat = mat @ mat.T
L = torch.linalg.cholesky(mat, upper=False)

rhs = torch.randn(500, 1)
rhs.requires_grad = True

sol = torch.cholesky_solve(rhs, L, upper=False).sum(dim=0)
```

## Before
```
%timeit torch.autograd.grad(sol, rhs, retain_graph=True)
2.61 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

## After
```
%timeit torch.autograd.grad(sol, rhs, retain_graph=True)
109 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112981
Approved by: https://github.com/lezcano
2023-11-16 18:30:57 +00:00
Ting Lu
b3a7d9208b disable test int_mm for sm90 or later (#113327)
disable test int_mm for sm90 or later

```
python test/test_linalg.py -k test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda

_ TestLinalgCUDA.test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda _
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2410, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2410, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 428, in instantiated_test
    raise rte
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 415, in instantiated_test
    result = test(self, **param_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 1084, in only_fn
    return fn(slf, *args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_linalg.py", line 5719, in test__int_mm
    _test(17, k, n, use_transpose_a, use_transpose_b)
  File "/opt/pytorch/pytorch/test/test_linalg.py", line 5680, in _test
    c_int32 = torch._int_mm(a_int8, b_int8)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 0 transpose_mat2 0 m 32 n 17 k 32 mat1_ld 32 mat2_ld 32 result_ld 32 abType 3 cType 10 computeType 72 scaleType 10
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113327
Approved by: https://github.com/malfet
2023-11-13 22:13:44 +00:00
Oleg Bulatov
192477b5ba Enable flake8-bugbear B020 lint (#110823)
Fixes part of https://github.com/pytorch/pytorch/issues/106571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110823
Approved by: https://github.com/Skylion007
2023-10-24 22:43:47 +00:00
Scruel Tao
503f44fbb8 Fix: perverse input's NaN values to prevent undefined behavior for matrix_exp function (#111539)
Currently, for `matrix_exp` function, if we have NaN values in the input matrices (small batches), it will keep outputting a "normal" result without any NaN value in it, and this will cause some problems that we may can't notice. This PR is for preventing such undefined behavior by "bring back" those NaN values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111539
Approved by: https://github.com/lezcano
2023-10-19 09:07:36 +00:00
Yanbo Liang
29048be41c [Reland] Add int4mm kernel (#111403)
This is a reland for #110914, #111327 and #111390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111403
Approved by: https://github.com/Chillee
2023-10-17 06:33:18 +00:00
PyTorch MergeBot
408e991dfe Revert "Quant: add weight int4pack mm kernel (#110914)"
This reverts commit 9980876cab.

Reverted https://github.com/pytorch/pytorch/pull/110914 on behalf of https://github.com/jeanschmidt due to Breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110914#issuecomment-1765302621))
2023-10-16 21:27:26 +00:00
Evgeni Burovski
48989bc820 trace frames with np.ndarray (#110512)
Fixes #109604

Resubmit gh-109715 + several skips and small fixes to make tests pass.

The main fix here is by @ysiraichi : previously, dynamo did not resume tracing numpy ndarrays after a graph break.
While at it, fix several small issues Yukio's fix uncovers:

- graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc)
- recognize array scalars in dynamo, treat them as 0D ndarrays
- make sure that iterating over torch.ndarray generates arrays not bare tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512
Approved by: https://github.com/lezcano
2023-10-15 00:56:10 +00:00
Yanbo Liang
9980876cab Quant: add weight int4pack mm kernel (#110914)
Adding the weight int4pack mm CUDA kernel. The kernel comes from the tinnygemm project which developed by Jeff Johnson.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110914
Approved by: https://github.com/Chillee
2023-10-13 01:21:18 +00:00
Liao, Xuan
8820dda943 Revise def of contiguity in bmm (#110811)
Fixes #108754.

`hf_T5_generate` would encounter a regression when calling `extern_kernels.bmm`, if one input is `reinterpret_tensor(buf2, (8, 1, 64), (64, 0, 1))` rather than `reinterpret_tensor(buf2, (8, 1, 64), (64, 512, 1), 0)`. As @jgong5 mentioned in comment, in fact the two tensors are equivalent: The stride doesn't matter when the corresponding size is 1.

We revise the definition of contiguity in `bmm` to add the above situation as a contiguous case. Thus, when stride equals to 0, `extern_kernels.bmm` could still use `gemm` of MKL to gain the performance.

Speedup of `hf_T5_generate` is **1.343x** now and **1.138x** before, with script `bash inductor_single_test.sh multiple inference performance torchbench hf_T5_generate float32 first dynamic default 0`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110811
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/Chillee
2023-10-10 06:48:08 +00:00
CaoE
9399e0b1ff add fp16 support for gemm (#99498)
### Testing

Native matmul vs. mkldnn matmul  on SPR (with avx512_fp16 support)

single core:

Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401

56 cores:
Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 |  18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849  | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 |  7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128  | 2032.523 | 0.705 | 2882.23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99498
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-09-28 01:03:50 +00:00
Kurt Mohler
3f88e3105f Reland: Remove remaining global set_default_dtype calls from tests (#108088)
Fixes #68972

Relands #107246

To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assert is only done if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that have required changes for this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
2023-09-07 03:04:34 +00:00
PyTorch MergeBot
161ea463e6 Revert "Remove remaining global set_default_dtype calls from tests (#107246)"
This reverts commit aa8ea1d787.

Reverted https://github.com/pytorch/pytorch/pull/107246 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107246#issuecomment-1693838522))
2023-08-25 19:34:55 +00:00
Kurt Mohler
aa8ea1d787 Remove remaining global set_default_dtype calls from tests (#107246)
Fixes #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107246
Approved by: https://github.com/ezyang
2023-08-24 16:10:48 +00:00
Xiao Wang
6bfb4f7c4b [CUDA][Linalg} Patch crash of linalg.eigh when input matrix is ill-conditioned, in some cusolver version (#107082)
Related: https://github.com/pytorch/pytorch/issues/94772, https://github.com/pytorch/pytorch/issues/105359

I can locally reproduce this crash with pytorch 2.0.1 stable pip binary. The test already passes with the latest cuda 12.2 release.

Re: https://github.com/pytorch/pytorch/issues/94772#issuecomment-1658909998
> From discussion in triage review:

- [x] we should add a test to prevent regressions
- [x] properly document support wrt different CUDA versions
- [x] possibly add support using MAGMA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107082
Approved by: https://github.com/lezcano
2023-08-16 21:15:15 +00:00
Andres Lugo
7fb543e36d [ROCm] Enable hipsolver unit tests for batched linalg drivers (#106620)
This is a follow up to https://github.com/pytorch/pytorch/pull/105881 and replaces https://github.com/pytorch/pytorch/pull/103203

The batched linalg drivers from 103203 were brought in as part of the first PR. This change enables the ROCm unit tests that were enabled as a result of that change. Along with a fix to prioritize hipsolver over magma when the preferred linalg backend is set to `default`
The following 16 unit tests will be enabled for rocm in this change:
- test_inverse_many_batches_cuda*
- test_inverse_errors_large_cuda*
- test_linalg_solve_triangular_large_cuda*
- test_lu_solve_batched_many_batches_cuda*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106620
Approved by: https://github.com/lezcano
2023-08-15 15:54:27 +00:00
Nikita Shulga
387e3b04fa Reenable torch._int_mm testing on newer CUDAs (#106840)
Looks like "it just works" on SM80+ on CUDA-12

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106840
Approved by: https://github.com/vkuzo
2023-08-09 16:23:30 +00:00
nikitaved
a61a0fe490 test_linalg: triangular_solve - make well_conditioned well conditioned (#105919)
`well_contioned=True` does not guarantee that the samples for `triangular_solve` are actually well-conditioned. This PR fixes that. This issues was discovered in https://github.com/pytorch/pytorch/pull/104425.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105919
Approved by: https://github.com/lezcano
2023-07-26 09:21:12 +00:00
Andres Lugo-Reyes
c89aec207a [ROCm] reduce tolerance for triangular solve with well_conditioned set to True (#104425)
Current test case causes an edge case tensor input that causes a single generated tensor to fail the tolerance assertion on ROCm only and only for float32. We have reviewed the logic with our libraries team and have discovered the discrepancy is due to a difference in order of operations on AMD GPUs. They came back with "working as intended" and found no perceivable bug. Interestingly, if we change the values in ks, ns, or bs, the test passes on ROCm. These particular sizes in this particular order generates a single problematic input that causes the assertion to fail the tolerance check by ~0.07. Again, this is not a bug, just differences in implementation. This PR loosens the tolerance for ROCm only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104425
Approved by: https://github.com/jeffdaily, https://github.com/nikitaved, https://github.com/lezcano
2023-07-25 05:03:09 +00:00
Andres Lugo-Reyes
b78341dda9 Use hipsolver for default svd case on ROCm (#103540)
Fixes #102678
Fixes #102629
Fixes #102558
HipSOLVER performance on ROCm5.4.2 and later no longer serves as massive bottleneck. Additionally, using magma on rocm in this case caused test_compare_cpu_lialg_pinv_singular_cuda_float32 to fail. Using hipSOLVER, the test now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103540
Approved by: https://github.com/lezcano
2023-07-24 20:50:56 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unusued loop values in python dictionary iteration. Automated fix from Ruff master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
PyTorch MergeBot
f353d17755 Revert "[ROCm] reduce tolerance for triangular solve with well_conditioned set to True (#104425)"
This reverts commit ef7bc3e23d.

Reverted https://github.com/pytorch/pytorch/pull/104425 on behalf of https://github.com/huydhn due to Sorry for reverting your PR.  It is failing CUDA test in trunk built in debug mode https://github.com/pytorch/pytorch/actions/runs/5429187622/jobs/9874360641 ([comment](https://github.com/pytorch/pytorch/pull/104425#issuecomment-1617247699))
2023-07-03 04:18:04 +00:00
Andres Lugo-Reyes
ef7bc3e23d [ROCm] reduce tolerance for triangular solve with well_conditioned set to True (#104425)
Current test case causes an edge case tensor input that causes a single generated tensor to fail the tolerance assertion on ROCm only and only for float32. We have reviewed the logic with our libraries team and have discovered the discrepancy is due to a difference in order of operations on AMD GPUs. They came back with "working as intended" and found no perceivable bug. Interestingly, if we change the values in ks, ns, or bs, the test passes on ROCm. These particular sizes in this particular order generates a single problematic input that causes the assertion to fail the tolerance check by ~0.07. Again, this is not a bug, just differences in implementation. This PR loosens the tolerance for ROCm only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104425
Approved by: https://github.com/jeffdaily, https://github.com/nikitaved, https://github.com/lezcano
2023-06-30 21:43:42 +00:00
cyy
54cb61f7d9 enable ASAN on some tests (#103647)
Enabling more tests on ASAN, meanwhile we disable float-divide-by-zero and float-cast-overflow, both are disabled because they are also disabled by default in latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
 but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
2023-06-28 02:17:14 +00:00
Adnan Akhundov
e4d8504ebc Unify GELU tanh approximation in _addmm_activation GPU back-end (#104061)
Summary:

Currently, cuBLASLt-based fused GELU epilogue in the GPU back-end of the `_addmm_activation` operator uses tanh approximation, whereas other code paths on GPU don't.

With this PR, the GELU tanh approximation is switched on in all back-end code paths of `_addmm_activation` on GPU for better consistency.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 1.896s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.050s

OK
```

Reviewers: @eellison

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104061
Approved by: https://github.com/eellison
2023-06-24 18:36:45 +00:00
Adnan Akhundov
f818036f85 Fix test_addmm_gelu assertion on Windows CUDA (#104031)
Summary:

This PR fixes the wrong assertion in the `test_addmm_gelu` happening in the Windows CUDA CI job caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Windows. See [this comment](https://github.com/pytorch/pytorch/pull/103811#issuecomment-1601936203) in the #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: @eellison @huydhn

Subscribers:

Tasks:

Tags:

Differential Revision: [D46931688](https://our.internmc.facebook.com/intern/diff/D46931688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104031
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-06-22 17:42:33 +00:00
Adnan Akhundov
1c79003b3c Enable addmm + GELU epilogue fusion via cuBLASLt (#103811)
Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: @eellison

Differential Revision: [D46829884](https://our.internmc.facebook.com/intern/diff/D46829884)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103811
Approved by: https://github.com/IvanYashchuk, https://github.com/eellison
2023-06-21 19:59:40 +00:00
PyTorch MergeBot
b34ac35b77 Revert "Use hipsolver for default svd case on ROCm (#103540)"
This reverts commit 0a4a7d4b26.

Reverted https://github.com/pytorch/pytorch/pull/103540 on behalf of https://github.com/huydhn due to Turn out that the failure discussed in https://github.com/pytorch/pytorch/issues/102629 is not a fluke and ROCm signal in trunk is red atm ([comment](https://github.com/pytorch/pytorch/pull/103540#issuecomment-1595309297))
2023-06-16 20:59:40 +00:00
Andres Lugo-Reyes
0a4a7d4b26 Use hipsolver for default svd case on ROCm (#103540)
Fixes #102678
Fixes #102629
Fixes #102558
HipSOLVER performance on ROCm5.4.2 and later no longer serves as massive bottleneck. Additionally, using magma on rocm in this case caused test_compare_cpu_lialg_pinv_singular_cuda_float32 to fail. Using hipSOLVER, the test now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103540
Approved by: https://github.com/lezcano
2023-06-16 14:57:34 +00:00
Bearnardd
2abad0c184 Add dtype check baddbmm (#102659)
Fixes part of the #100838 related to disabling support for non matching dtypes for input/batches for `baddbmm` operator.

* [x] added dtype checks
* [x] added test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102659
Approved by: https://github.com/ngimel
2023-06-13 00:31:06 +00:00
Xinya Zhang
1d857586f1 [ROCM] enable hipSOLVER backend for linalg.ldl_factor (#102665)
* Add complex dtype support for linalg.ldl_factor
* Fixes SWDEV-360139
* Enable the following 19 tests for ROCM
    + test_decomp.py TestDecompCUDA.test_comprehensive_linalg_ldl_factor_cuda_complex128
    + test_decomp.py TestDecompCUDA.test_comprehensive_linalg_ldl_factor_cuda_complex64
    + test_decomp.py TestDecompCUDA.test_comprehensive_linalg_ldl_factor_ex_cuda_complex128
    + test_decomp.py TestDecompCUDA.test_comprehensive_linalg_ldl_factor_ex_cuda_complex64
    + test_meta.py TestMetaCUDA.test_dispatch_meta_linalg_ldl_factor_cuda_complex128
    + test_meta.py TestMetaCUDA.test_dispatch_meta_linalg_ldl_factor_cuda_complex64
    + test_meta.py TestMetaCUDA.test_dispatch_meta_linalg_ldl_factor_ex_cuda_complex128
    + test_meta.py TestMetaCUDA.test_dispatch_meta_linalg_ldl_factor_ex_cuda_complex64
    + test_meta.py TestMetaCUDA.test_meta_linalg_ldl_factor_cuda_complex128
    + test_ops.py TestCommonCUDA.test_noncontiguous_samples_linalg_ldl_factor_cuda_complex64
    + test_ops.py TestCommonCUDA.test_noncontiguous_samples_linalg_ldl_factor_ex_cuda_complex64
    + test_ops.py TestCommonCUDA.test_variant_consistency_eager_linalg_ldl_factor_cuda_complex64
    + test_ops.py TestCommonCUDA.test_variant_consistency_eager_linalg_ldl_factor_ex_cuda_complex64
    + test_ops.py TestMathBitsCUDA.test_conj_view_linalg_ldl_factor_cuda_complex64
    + test_ops.py TestMathBitsCUDA.test_conj_view_linalg_ldl_factor_ex_cuda_complex64
    + test_ops.py TestMathBitsCUDA.test_neg_conj_view_linalg_ldl_factor_cuda_complex128
    + test_ops.py TestMathBitsCUDA.test_neg_conj_view_linalg_ldl_factor_ex_cuda_complex128
    + test_ops_jit.py TestJitCUDA.test_variant_consistency_jit_linalg_ldl_factor_cuda_complex64
    + test_ops_jit.py TestJitCUDA.test_variant_consistency_jit_linalg_ldl_factor_ex_cuda_complex64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102665
Approved by: https://github.com/lezcano
2023-06-08 20:05:01 +00:00
Andres Lugo-Reyes
eaffd98880 Enable hipSOLVER in ROCm builds (#97370)
Enables the hipSolver backend for ROCm builds
--------------------------------------------------------------------------

- Minimum ROCm version requirement - 5.3
- Introduces new macro USE_LINALG_SOLVER the controls enablement of both cuSOLVER and hipSOLVER
- Adds hipSOLVER API to hipification process
- combines hipSOLVER and hipSPARSE mappings into single SPECIAL map that takes priority among normal mappings
- Torch api to be moved to hipsolver backend (as opposed to magma) include: torch.svd(), torch.geqrf(), torch.orgqr(), torch.ormqr()
- Will enable 100+ linalg unit tests for ROCm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97370
Approved by: https://github.com/malfet
2023-05-31 16:53:23 +00:00
PyTorch MergeBot
a64e97b62c Revert "[dynamo 3.11] enable other torch 3.11 dynamo-related tests (#99180)"
This reverts commit aa8dcab1ce.

Reverted https://github.com/pytorch/pytorch/pull/99180 on behalf of https://github.com/huydhn due to Sorry for reverting this, but linux-bionic-py3.11-clang9 test starts to timeout after this taking more than 3h30m. This is probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/99180#issuecomment-1545982256))
2023-05-12 16:18:22 +00:00
William Wen
aa8dcab1ce [dynamo 3.11] enable other torch 3.11 dynamo-related tests (#99180)
Notes:
- No segfaults observed in any CI tests: dynamo unittests, inductor unittests, dynamo-wrapped pytorch tests. So we remove the warning that using dynamo 3.11 may result in segfaults.
- Some dynamo-wrapped pytorch tests hang. They will be skipped in the dynamo-wrapped test suite and will be addressed in a future PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99180
Approved by: https://github.com/malfet
2023-05-12 07:03:09 +00:00
soulitzer
6a02342131 Check inputs have same dtype in addmm_impl_cpu_ even if input has zero numel (#100274)
Fixes #99226

When an inputs has zero numel, addmm_impl_cpu_'s check that the inputs have the same dtype are bypassed. This PR adds a check before  the early return.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100274
Approved by: https://github.com/ngimel
2023-04-29 00:07:54 +00:00
Irem Yuksel
2504089329 Enable test_linalg_solve_triangular_large (#96182)
PR to see if test fails after removing skip line

Fixes #70111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96182
Approved by: https://github.com/lezcano
2023-04-28 12:54:27 +00:00
Larry Liu
687afeb686 [dynamo][numpy] Add NumpyTensorVariable to translate ndarray attribute calls to tensor attributes (#95849)
Issue: #93684

# Problem

Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.

# Design (as I know it)

* Use torch_np.ndarray(a wrapper of tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attributes and methods calls, on ndarray, to torch_np.ndarray equivalent.

This PR adds `NumpyTensorVariable` and supports:
1.  tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`

Next PR will handle returning `np.ndarray` and add support for ndarray methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
2023-04-27 16:18:35 +00:00
eqy
2fddcf0fc0 [CUDA][CUDA 11] Remove more CUDA 11 version checks (#92934)
Working on removing stragglers missed in previous CUDA version < 11.0 cleanup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92934
Approved by: https://github.com/ngimel
2023-03-30 19:49:52 +00:00
Aaron Gokaslan
47dca20d80 [BE] Enable flake8-comprehension rule C417 (#97880)
Enables flake8-comprehension rule C417. Ruff autogenerated these fixes to the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97880
Approved by: https://github.com/ezyang, https://github.com/kit1980, https://github.com/albanD
2023-03-30 14:34:24 +00:00
Christian Puhrsch
9d37cefcb0 Resubmit _int_mm (#96685)
Avoids any changes to gemm_and_bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96685
Approved by: https://github.com/drisspg, https://github.com/ngimel
2023-03-27 16:14:07 +00:00
haozhe.zhu
fe0afc5852 use accumulate type in BF16 gemm(include dot, mv) ref path (#96074)
Fix https://github.com/pytorch/pytorch/issues/95125 and https://github.com/pytorch/pytorch/issues/83863 for bf16 accumulation in gemm ref path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96074
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-03-23 01:22:59 +00:00
Christian Puhrsch
0a53c9624a Back out "Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)" (#96885)
Summary:
Backing out  _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96885
Approved by: https://github.com/drisspg
2023-03-16 05:32:55 +00:00
mantaionut
2cbce06fee Enablee test_inverse_errors_large (#94727)
Test to see if TestLinAlgCUDA.test_inverse_errors_large_cuda_float64 still fails on CI.
The test was not failing in multiple CI runs.
I was not able to reproduce the crash locally.
Fixes #57482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94727
Approved by: https://github.com/lezcano
2023-03-13 08:31:41 +00:00
XiaobingSuper
ac77883e48 fix issue of baddbmm when out has nan value for beta=0 (#96086)
Fix https://github.com/pytorch/pytorch/issues/96037.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96086
Approved by: https://github.com/ngimel, https://github.com/lezcano
2023-03-07 14:54:05 +00:00
Christian Puhrsch
1fe2a9d122 Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)
Add _int_mm primitive that binds cuBLAS int8@int8 -> int32 matmul and that translates to Triton based mm templates under max autotune. This is a very useful first step towards better supporting quantization on the GPU. This is a not a user facing API, but an internal primitive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94339
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-27 20:27:25 +00:00
lezcano
03cc0f587c Don't create large intermediary tensors in the backward of matmul (#95261)
Currently, if we multiply a transposed batch of matrices with shape
[b, m, n] and a matrix with shape [n, k], when computing the gradient
of the matrix, we instantiate a matrix of shape [b, n, k]. This may be
a very large matrix. Instead, we fold the batch of matrices into a
matrix, which avoids creating any large intermediary tensor.

Note that multiplying a batch of matrices and a matrix naturally occurs
within an attention module, so this case surely happens in the wild.
In particular, this issue was found while investigating the OOMs caused by the
improved folding algorithm in the next PR of this stack. See https://github.com/pytorch/pytorch/pull/76828#issuecomment-1432359980
This PR fixes those OOMs and decreases the memory footprint of the
backward of matmul.

I understand this is a tricky one, so I put it on its own PR to discuss it.

Differential Revision: [D43541495](https://our.internmc.facebook.com/intern/diff/D43541495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95261
Approved by: https://github.com/ezyang
2023-02-27 15:19:09 +00:00
kshitij12345
3b966a6ce3 [autograd] disable backward/grad for complex scalar output (#92753)
Fixes https://github.com/pytorch/pytorch/issues/92750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92753
Approved by: https://github.com/ezyang
2023-02-23 11:38:27 +00:00
XiaobingSuper
5730cabdd0 using float type to do the computation of norm reduce for cpu half and bfloat16 dtype (#95166)
As the title, we should use a higher dtype to compute norm reduce for half and bfloat1 dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95166
Approved by: https://github.com/peterbell10, https://github.com/jgong5, https://github.com/ngimel, https://github.com/lezcano
2023-02-23 05:00:25 +00:00
Nikita Shulga
42b6bcdb13 [BE] Add empty tensor check to _compute_linear_combination (#94245)
Fixes https://github.com/pytorch/pytorch/issues/94124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94245
Approved by: https://github.com/lezcano
2023-02-07 11:31:11 +00:00
Ivan Yashchuk
fba13d94a1 Remove deprecated torch.symeig (#70988)
The time has come to remove deprecated linear algebra related functions. This PR removes `torch.symeig`.

- [x] XLA PR: https://github.com/pytorch/xla/pull/4498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70988
Approved by: https://github.com/lezcano, https://github.com/kit1980, https://github.com/malfet
2023-01-31 11:59:11 +00:00
PyTorch MergeBot
acdd462b1a Revert "Remove deprecated torch.symeig (#70988)"
This reverts commit d70ed68162.

Reverted https://github.com/pytorch/pytorch/pull/70988 on behalf of https://github.com/kit1980 due to Failing XLA tests, forward fix unsuccessful
2023-01-24 19:03:40 +00:00
Eddie Yan
0bf7506051 [CUDA] Drop CUDA < 11.0 test flags (#92605)
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best-guess of how `torch.version.cuda` would behave in ROCM builds, so I've added `not TEST_WITH_ROCM` in cases where ROCM wasn't previously explicitly allowed.

CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
2023-01-24 04:34:06 +00:00
Ivan Yashchuk
d70ed68162 Remove deprecated torch.symeig (#70988)
The time has come to remove deprecated linear algebra related functions. This PR removes `torch.symeig`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70988
Approved by: https://github.com/lezcano, https://github.com/kit1980
2023-01-23 22:51:40 +00:00
PyTorch MergeBot
0a6053e9b5 Revert "Avoid copies in matmul (#76828)"
This reverts commit 8c2e82b487.

Reverted https://github.com/pytorch/pytorch/pull/76828 on behalf of https://github.com/mehtanirav due to Internal breakages
2023-01-03 23:36:58 +00:00
lezcano
8c2e82b487 Avoid copies in matmul (#76828)
With this PR, matmul just folds a bmm into a mm o mv if and only if it
can achieve so without copying. We add tests for this to make sure that
our algorithm to detect this is accurate.

For the cases where it was copying before see https://github.com/pytorch/pytorch/pull/75197#discussion_r843413208 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489479 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489805

Fixes https://github.com/pytorch/pytorch/issues/76702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76828
Approved by: https://github.com/ngimel
2023-01-03 14:18:38 +00:00
PyTorch MergeBot
db2a237763 Revert "Avoid copies in matmul (#76828)"
This reverts commit 0c3659586d.

Reverted https://github.com/pytorch/pytorch/pull/76828 on behalf of https://github.com/lezcano due to Makes functorch tests fail
2023-01-03 12:26:29 +00:00
lezcano
0c3659586d Avoid copies in matmul (#76828)
With this PR, matmul just folds a bmm into a mm o mv if and only if it
can achieve so without copying. We add tests for this to make sure that
our algorithm to detect this is accurate.

For the cases where it was copying before see https://github.com/pytorch/pytorch/pull/75197#discussion_r843413208 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489479 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489805

Fixes https://github.com/pytorch/pytorch/issues/76702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76828
Approved by: https://github.com/ngimel
2023-01-02 20:07:38 +00:00
Jithun Nair
e8e591b72f Upgrade CI to ROCm5.3 (#88297)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88297
Approved by: https://github.com/malfet
2022-12-14 05:09:56 +00:00
PyTorch MergeBot
af4735d3ad Revert "Upgrade CI to ROCm5.3 (#88297)"
This reverts commit 181a82ffd2.

Reverted https://github.com/pytorch/pytorch/pull/88297 on behalf of https://github.com/IvanYashchuk due to Tests are unnecessarily skipped on all platforms
2022-12-13 12:23:44 +00:00
Jithun Nair
181a82ffd2 Upgrade CI to ROCm5.3 (#88297)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88297
Approved by: https://github.com/malfet
2022-12-13 04:50:06 +00:00
lezcano
1d6a188d08 Reland Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761) (#84624)
Reland https://github.com/pytorch/pytorch/pull/81761

Differential Revision: [D39332292](https://our.internmc.facebook.com/intern/diff/D39332292)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84624
Approved by: https://github.com/kit1980
2022-11-22 07:53:24 +00:00
lezcano
d8506ff42b Generalize gesvdjBatched to run whith full_matrices==false (#88502)
As brought up in https://github.com/pytorch/pytorch/issues/86234#issuecomment-1268296036, our heuristic for which SVD backend to choose was not great in some cases.
The case in which there could be some improvements is when we have a
large batch of very small non-square matrices.

This PR, adapts the calling code to gesvdj by creating two temporary
square buffers to allow to call gesvdjBatched, and then copies back the
result into the output buffers.

We then modify the heuristic that chooses between gesvdj and
gesvdjBatched.

Fixes https://github.com/pytorch/pytorch/issues/86234
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88502
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved, https://github.com/mruberry, https://github.com/xwang233
2022-11-07 22:07:48 +00:00
Fang Wang
160118d72a Add test case for matrix multiply-add with large inputs (#85550)
Summary:
- Added test case for addmm, baddbmm and linear with large inputs
- Testing with torch types: float32, float16, bfloat16

Test Plan:
Run unit tests with:
`buck2 run mode/opt //caffe2/test:linalg_re_cuda`

```
...
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok

----------------------------------------------------------------------
Ran 24 tests in 63.224s

OK (skipped=12)
```

Differential Revision: D39718256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85550
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
2022-10-11 17:52:21 +00:00
Jane Xu
a348975e00 Add opteinsum backend to give users control (#86219)
This achieves the same things as https://github.com/pytorch/pytorch/pull/85908 but using backends instead of kwargs (which breaks torchscript unfortunately). This also does mean we let go of numpy compatibility BUT the wins here are that users can control what opt einsum they wanna do!

The backend allows for..well you should just read the docs:
```
.. attribute::  torch.backends.opteinsum.enabled

    A :class:`bool` that controls whether opt_einsum is enabled (on by default). If so,
    torch.einsum will use opt_einsum (https://optimized-einsum.readthedocs.io/en/stable/path_finding.html)
    to calculate an optimal path of contraction for faster performance.

.. attribute::  torch.backends.opteinsum.strategy

    A :class:`str` that specifies which strategies to try when `torch.backends.opteinsum.enabled` is True.
    By default, torch.einsum will try the "auto" strategy, but the "greedy" and "optimal" strategies are
    also supported. Note that the "optimal" strategy is factorial on the number of inputs as it tries all
    possible paths. See more details in opt_einsum's docs
    (https://optimized-einsum.readthedocs.io/en/stable/path_finding.html).
```

In trying (and failing) to land 85908, I discovered that jit script does NOT actually pull from python's version of einsum (because it cannot support variadic args nor kwargs). Thus I learned that jitted einsum does not subscribe to the new opt_einsum path calculation. Overall, this is fine since jit script is getting deprecated, but where is the best place to document this?

## Test plan:
- added tests to CI
- locally tested that trying to set the strategy to something invalid will error properly
- locally tested that tests will pass even if you don't have opt-einsum
- locally tested that setting the strategy when opt-einsum is not there will also error properly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86219
Approved by: https://github.com/soulitzer, https://github.com/malfet
2022-10-05 06:33:25 +00:00
albanD
94da90e41f LU solve/unpack fix to prevent bad memory usage on CPU (#85922)
Fixes https://github.com/pytorch/pytorch/issues/77898
Fixes https://github.com/pytorch/pytorch/issues/85026

There is a minor perf impact but:
- For lu_solve, the actual compute is going to be more expensive than this O(n) check (ones pass over the other matrices is O(n^2) in any case)
- For lu_unpack, the check inside the kernel should be almost free.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85922
Approved by: https://github.com/ngimel, https://github.com/nikitaved
2022-09-30 20:07:08 +00:00
Jane Xu
e7e1cd945f Add path optimize kwarg to einsum (#84890)
## This PR seeks to:
- [x] add c++ support for an optimize path
- [x] add python opt_einsum path passthrough
- [x] add opt_einsum to OSS requirements, but a soft one
- [x] show benchmark results here

Additional things I've explored + their conclusions:
- **Delaying the summing over dimensions** => added!
    - The idea here is to not incur kernel calls to `sum` as we try to early sum out in einsum. Thus, we collect all the dimensions that need to be summed together in one contraction + sum at the end instead of summing as we go. While this optimization didn't feel like it made things faster for the random cases we've selected (they all summed 1 dim per contraction), it is a good principle and would help more common use cases that would reduce multiple dimensions at a time (like `bxy,xyi,xyj->bij`).
- **Caching contract_path based on equation and tensor sizes** => dropped :(
    - The benchmarks were strictly worse for all the cases, and, from scanning the use cases, I observed people do not often call einsum on the same equation/tensor order enough for caching to be justified. I do think caching can be effective in the future, but it would require further investigation.

## Not a part of this PR (but are next steps):
- adding opt_einsum package to OSS CI
- adding it to internal CI
- potentially adding a kwarg path argument to the python API -- if the path is given, we wouldn't have to spend time calculating it, but there would be some time lost validating user input.

## Testing:
- Added more tests to CI

## Benchmarking:
**TL;DRs**
- **torch.einsum with opt_einsum is a definite win for the production case**.
- **torch.einsum with opt_einsum installed is consistently fast, but has an overhead** of needing to find the path. If the path is already found/optimal, it will be slightly slower.
- The einsum overhead decreases for bigger dimensions.
- **torch.einsum without opt_einsum installed is comparable to before this commit**, with occasional slowness potentially due to not reshaping/squeezing as we contract until the end.
- For many of the random generated cases, the dimensions were too similar and small where an optimal order wasn't that much more optimal than just going left to right. However, in production, dimensions are commonly quite distinct (batch size will be small, but the data will be huge).
- **torch.einsum opt is comparable (slightly faster overall) compared to numpy.einsum opt for the cpu case**. This is interesting given that torch.einsum currently spends time computing the path, but numpy.einsum takes it as input.
- **torch.einsum opt is significantly faster than numpy.einsum opt for the gpu case**. This is because numpy doesn't take advantage of GPUs.

The following benchmarks were done on an A100 GPU and Linux CPUs. The line in the first chart separates GPU (on top) from CPU, and the line in the second graph separates CPU (on top) and then GPU. Sorry it's flipped 😛 .

Production example (see [colab benchmark](https://colab.research.google.com/drive/1V2s4v1dOOKwRvp5T_DC-PNUosOV9FFJx?authuser=1#scrollTo=WZoQkC8Mdt6I) for more context):
<img width="1176" alt="image" src="https://user-images.githubusercontent.com/31798555/192012636-9a68bfa7-2601-43b1-afeb-b4e0877db6a4.png">

Randomly generated examples (the same ones as in https://github.com/pytorch/pytorch/pull/60191)
<img width="1176" alt="image" src="https://user-images.githubusercontent.com/31798555/192012804-1c639595-b3e6-48c9-a385-ad851c13e1c2.png">

Open below to see old + not super relevant benchmarking results:
<details>
Benchmark results BEFORE this PR (on Linux -- I will update devices so they are consistent later):
<img width="776" alt="image" src="https://user-images.githubusercontent.com/31798555/190807274-18f71fce-556e-47f4-b18c-e0f7d0c0d5aa.png">

Benchmark results with the code on this PR (on my x86 mac):
For the CPU internal use case --
![image](https://user-images.githubusercontent.com/31798555/190801376-6f591b00-cebd-4ca7-bb23-ae8f17f1634e.png)

For the general use case --
It looks like numpy opt still does better in several of these random cases, but torch einsum opt is consistently faster than torch.einsum.
![image](https://user-images.githubusercontent.com/31798555/190811730-fbb6797d-af59-4f5a-92da-ba4103372014.png)
<details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84890
Approved by: https://github.com/albanD, https://github.com/soulitzer
2022-09-24 03:47:36 +00:00
Sourav Mandal
70b27e91c7 [pytorch] Skip linalg tests that fail on Meta infra (#85577)
Summary: test_inverse_errors_large and test_linalg_solve_triangular fail for dtype=float64 when invoked on GPUs on Meta internal testing infra.  Skip in Meta internal testing.

Test Plan: (observe tests skipped on Meta internal infra)

Reviewed By: mikekgfb

Differential Revision: D39785331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85577
Approved by: https://github.com/malfet
2022-09-24 01:02:42 +00:00
Wei Wang
8bd4724f04 Adding a unit test that would gate PRs and prevent reverts, e.g. #83327 (#85442)
PR #83327 slipped through CI, adding this unit test as part of efforts to minimize future reverts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85442
Approved by: https://github.com/Balandat, https://github.com/mehtanirav
2022-09-23 01:05:17 +00:00
Ivan Yashchuk
539076e2c2 Remove deprecated torch.lstsq (#70980)
The time has come to remove deprecated linear algebra related functions. This PR removes `torch.lstsq`.

There's a note in `tools/codegen/gen.py` about `lstsq` schema in `native_function.yaml` that I will not remove:
87139d8532/tools/codegen/gen.py (L734-L770)

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70980
Approved by: https://github.com/lezcano, https://github.com/kit1980
2022-09-23 00:16:55 +00:00
Ivan Yashchuk
bcf93181a0 Remove deprecated torch.matrix_rank (#70981)
The time has come to remove deprecated linear algebra related functions. This PR removes `torch.matrix_rank`.

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70981
Approved by: https://github.com/lezcano, https://github.com/kit1980
2022-09-22 17:40:46 +00:00
Sourav Mandal
5aa84c16db [pytorch] cuBLAS addmm malfunction test (#85432)
Summary:
Re-submit for approved PR that was then reverted: https://github.com/pytorch/pytorch/pull/85084

Create unit test to detect cuBLAS breakage via large differences between CPU and GPU addmm invocations

Test Plan:
Sample unit test output --

[...]
test_cublas_addmm_size_10000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
[...]

Reviewed By: mikekgfb

Differential Revision: D39433029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85432
Approved by: https://github.com/zrphercule
2022-09-21 22:17:48 +00:00
PyTorch MergeBot
2fb820455c Revert "[pytorch] cuBLAS addmm malfunction test (#85084)"
This reverts commit 0297c75c14.

Reverted https://github.com/pytorch/pytorch/pull/85084 on behalf of https://github.com/clee2000 due to broke tests on trunk, https://github.com/pytorch/pytorch/actions/runs/3098347639/jobs/5017166419
2022-09-21 16:48:55 +00:00
Sourav Mandal
0297c75c14 [pytorch] cuBLAS addmm malfunction test (#85084)
Summary: Create unit test to detect cuBLAS breakage via large differences between CPU and GPU addmm invocations

Test Plan:
Sample unit test output --

[...]
test_cublas_addmm_size_10000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
[...]

Reviewed By: mikekgfb

Differential Revision: D39433029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85084
Approved by: https://github.com/zrphercule
2022-09-21 13:42:13 +00:00
Ivan Yashchuk
01c54ad6de Remove deprecated torch.eig (#70982)
The time has come to remove deprecated linear algebra related functions. This PR removes `torch.eig`.

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70982
Approved by: https://github.com/Lezcano, https://github.com/malfet
2022-09-09 21:31:57 +00:00
PyTorch MergeBot
166dec74b5 Revert "Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761)"
This reverts commit 65beff5acb.

Reverted https://github.com/pytorch/pytorch/pull/81761 on behalf of https://github.com/mehtanirav due to Breakages in pytorch/glow
2022-09-06 22:31:14 +00:00
lezcano
65beff5acb Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761)
`torch.norm` is very odd. Some notable issues are:

- The default value of `"fro"` in `torch.norm` has an odd behaviour when `dim=None`. This is handled in the new dispatch
- The treatment of the `dtype` argument in `torch.norm` was completely wrong. This should fix it
- Some `out=` variants in the previous implementation were also wrong. This should fix those.
- This new dispatch should make some paths much faster. For example, `torch.norm(x)` where `x` is complex.

I'll try to make the changes in these PRs as incremental as possible as this is a tricky one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81761
Approved by: https://github.com/ngimel
2022-09-02 19:12:25 +00:00
Mario Lezcano
f5a3515083 Make linalg.inv composite of linalg.solve (#80074)
The `getri` kernel calls inside `getrs` so we can do so explicitly
ourselves and save ourselves from having to maintain an extra kernel.
This way we just need to optimise `lu_factor` and `lu_solve` and `inv`
will be as efficient as it can be, as it'll be choosing the best backend
to perform the factorisation and the best backend (not necessarily the
same) to perform the solve.

Fixes https://github.com/pytorch/pytorch/issues/77498

The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074
Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet
2022-08-25 09:28:55 +00:00
PyTorch MergeBot
5321bf52f2 Revert "Make linalg.inv composite of linalg.solve (#80074)"
This reverts commit 4737b33614.

Reverted https://github.com/pytorch/pytorch/pull/80074 on behalf of https://github.com/malfet due to Depends on the changes from https://github.com/pytorch/pytorch/pull/83628
2022-08-25 00:43:00 +00:00
Mario Lezcano
4737b33614 Make linalg.inv composite of linalg.solve (#80074)
The `getri` kernel calls inside `getrs` so we can do so explicitly
ourselves and save ourselves from having to maintain an extra kernel.
This way we just need to optimise `lu_factor` and `lu_solve` and `inv`
will be as efficient as it can be, as it'll be choosing the best backend
to perform the factorisation and the best backend (not necessarily the
same) to perform the solve.

Fixes https://github.com/pytorch/pytorch/issues/77498

The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074
Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet
2022-08-24 15:18:56 +00:00
lezcano
0bdcfcb840 Strenghten preconditions of linalg.cross (#83798)
This makes `linalg.cross` array API complaint (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry
2022-08-24 15:17:12 +00:00