Triton 2.2 and later have a bug where allowing TF32 generation for a GPU that does not support TF32 causes code generation errors. Patch around this problem by:
1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.
To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` without this fix on a GPU with CUDA compute capability < 8 (e.g. any pre-Ampere NVIDIA GPU).
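A minimal sketch of the kind of capability check this adds (the actual helper name and location in the PR may differ):
```python
import torch

def tf32_is_supported() -> bool:
    # Hypothetical helper: TF32 requires an NVIDIA GPU with compute
    # capability >= 8.0 (Ampere or newer).
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8
```
Inductor can then disable TF32 in the Triton kernels it generates whenever this check returns False.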
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips or fixes several tests that fail on MI300, as observed in https://github.com/pytorch/pytorch/pull/140989
Skipped due to the unsupported Float8_e4m3fn data type on MI300 (the test code needs to be updated to use data types supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2
Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1
Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda
Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)
Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda
Features:
- inductor/test_fp8.py - declares a new function that converts FP8 data types to ROCm-supported FP8 data types. It keeps test names identical for CUDA and ROCm and allows enabling the Inductor FP8 tests on CPU (a hedged sketch of such a helper follows below)
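A hedged sketch of what such a helper could look like (the name and exact dtype mapping are illustrative, not necessarily what the PR uses):
```python
import torch

def convert_fp8_dtype_for_rocm(dtype: torch.dtype) -> torch.dtype:
    # Illustrative mapping: on ROCm (MI300), use the *_fnuz FP8 variants;
    # on CUDA and CPU builds, keep the original dtype so test names match.
    if torch.version.hip is None:
        return dtype
    return {
        torch.float8_e4m3fn: torch.float8_e4m3fnuz,
        torch.float8_e5m2: torch.float8_e5m2fnuz,
    }.get(dtype, dtype)
```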
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
For #143486
Interestingly enough, changing the indexing type seems to degrade performance when a larger width is not needed, even at small sizes, so this is made a template parameter rather than forcing all cases to 64-bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143696
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fixes #143071
Operations performed on tensors with `requires_grad=True` such as
```python
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x, 3)
```
are valid operations.
In contrast, an operation using `numpy` such as
```python
import numpy as np
x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x, 3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.
However, an operation that uses `math` like
```python
import math
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x, 3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!
This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.
To prevent this undesired behavior in the future, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a tensor, we get a single warning:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x, 3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
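For example, either of the following avoids the surprise (illustrative):
```python
y = x ** 3                    # stays a tensor and keeps requires_grad
z = math.pow(x.detach(), 3)   # explicit opt-out: plain Python float, no warning
```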
Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD
Step required for performance in #143122
Adds support for a CPU scalar as `tensor2` in `addcmul`. For example:
```
import torch
a = torch.rand(2, 2, device="cuda")
b = torch.tensor(1e-3)
torch.add(a, b)
torch.addcmul(a, a, b) # used to fail, now works
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143264
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/137267
While the test ensures the finalizer ran so that things are cleared, the objects were not properly collected by the GC due to the faulty tp_clear implementation. So, even though the finalizer ran, the object was still alive.
Fix this by giving tp_clear the same treatment as tp_traverse and tp_dealloc on Tensor: make it a single function that handles the full subclass hierarchy in one place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143203
Approved by: https://github.com/ezyang, https://github.com/colesbury
ghstack dependencies: #143202
On ROCm, hipification converts std::min to ::min, but ::min does not return the right result. This impacts the index_add_ operation on a large tensor: we end up picking the large value instead of the maximum supported block size (128), which leads to the GPU accessing memory out of bounds.
While we wait for ::min to be fixed, we can compare with the `<` operator instead of relying on ::min.
Example code with the failure:
```
import torch

D = 6144
hidden_states = torch.zeros([16384, D], device="cuda:0", dtype=torch.bfloat16)
index = torch.randint(0, 16384, (1, 32, 16384), device="cuda:0", dtype=torch.int64)
output = torch.empty([1, 32, 16384, D], device="cuda:0", dtype=torch.bfloat16)
hidden_states.index_add_(0, index.view(-1), output.view(-1, D))
```
```
Traceback (most recent call last):
RuntimeError: HIP error: invalid configuration argument
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Related to #107302
The breakages are caused by backward incompatibility between NumPy 1 and NumPy 2.
This PR fixes all the corresponding test failures in `test_torch.py`.
1. The dtype of the return value of `np.percentile` when passed a `torch.float32` tensor.
NumPy 1: returns `np.float64`.
NumPy 2: returns `np.float32`.
Solution: enforce `np.float64` with `.astype(np.float64)`.
2. The type of the `np.gradient()` result when it returns multiple arrays.
NumPy 1: a list of arrays.
NumPy 2: a tuple of arrays.
Solution: cast the tuple to a list (a sketch of both workarounds follows below).
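A minimal sketch of the two workarounds (illustrative, not the exact test code):
```python
import numpy as np
import torch

t = torch.rand(10, dtype=torch.float32)

# 1. Under NumPy 2, np.percentile on float32 input returns float32
#    (NumPy 1 returned float64); cast so the comparison dtype is fixed.
p = np.percentile(t.numpy(), 50).astype(np.float64)

# 2. Under NumPy 2, np.gradient returns a tuple of arrays for multi-dim input
#    (NumPy 1 returned a list); cast to a list so existing assertions still hold.
grads = list(np.gradient(np.arange(9.0).reshape(3, 3)))
```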
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137740
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/136358
The bug here is that the Tensor object is actually backed by two classes: `Tensor` from `_tensor.py` and `TensorBase` from C++.
Before this PR, they had the following GC methods:
Tensor:
- tp_clear: subtype_clear
- tp_traverse: THPVariable_subclass_traverse
- tp_dealloc: THPVariable_subclass_dealloc
TensorBase:
- tp_clear: THPVariable_clear
- tp_traverse: THPFunction_traverse (a fake function that just raises an error)
- tp_dealloc: object_dealloc
The problem is that when clear is called on a Tensor, subtype_clear clears the things owned by the `Tensor` type, in particular its `__dict__` attribute, before delegating to the TensorBase clear, where we detect that resurrection needs to happen and skip the clearing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267
Approved by: https://github.com/ezyang, https://github.com/kshitij12345
- Set the new tolerances to roughly N * eps(bfloat16), where N is the inner dimension of the matmul; this should be a comfortable upper bound.
Logic behind the choice of tolerance:
The maximum error of summing a series of N numbers in bfloat16 should be `N * epsilon(bfloat16)`; I confirmed by sampling different random seeds that the maximum observed error does not exceed this value and is usually much less.
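As a rough illustration of the bound (not the exact code in the PR):
```python
import torch

# The worst-case rounding error of accumulating N bfloat16 values grows
# roughly like N * eps, so scale the absolute tolerance with the inner
# (reduction) dimension of the matmul.
N = 1000                                    # example inner dimension
eps_bf16 = torch.finfo(torch.bfloat16).eps  # 2**-7 == 0.0078125
atol = N * eps_bf16
```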
Fixes test failures on Arm® Neoverse™ V1 (not raised as an issue because this hardware type is not currently covered by the linux-aarch64 workflow):
```
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/test_torch.py", line 2478, in test_cdist_large
self.assertEqual(expected, actual)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 134118 / 1000000 (13.4%)
Greatest absolute difference: 0.03829193115234375 at index (291, 726) (up to 0.005 allowed)
Greatest relative difference: 0.03519868478178978 at index (291, 726) (up to 1.3e-06 allowed)
```
@malfet @jondea
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136315
Approved by: https://github.com/albanD
Summary:
There are some issues with dim order creation; T194410923 has a detailed illustration.
One of the reasons is that the `is_contiguous` function may produce an ambiguous memory format result (some tensors can be both channels_last and contiguous at the same time), and dim order generation relies on the underlying memory format result as a shortcut.
To mitigate the issue, dim order uses the shortcut if and only if the tensor belongs to a single memory format; otherwise, we recalculate it. A small example of the ambiguity is shown below.
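For instance, on a 4-D tensor with a single non-trivial dimension, both contiguity checks return true (illustrative):
```python
import torch

# With only one non-trivial dimension, the strides satisfy both the
# contiguous and the channels_last layouts at the same time, so a
# memory-format-based shortcut for dim order is ambiguous.
x = torch.empty(1, 3, 1, 1)
print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # True
```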
Test Plan: CI
Differential Revision: D60056793
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131366
Approved by: https://github.com/ezyang
This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, pin_memory/is_pinned use [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) to get the appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implementations.
Note: new backends that want to implement and use pin memory just need to inherit from AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.
Additional context: to avoid BC-breaking changes, this PR preserves the `device` arg of the related APIs and throws a deprecation warning if it is passed. A follow-up PR will update all PyTorch callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`, ...) not to pass this arg. In the future, the `device` arg will be removed entirely.
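From the Python side the calls stay the same and simply pin for whichever accelerator is current; an illustrative usage on a CUDA (or other accelerator) build:
```python
import torch

x = torch.randn(4)
# pin_memory()/is_pinned() now resolve the current accelerator through its
# hooks interface instead of dispatching through BackendSelect.
pinned = x.pin_memory()      # pinned for the current accelerator (e.g. CUDA)
print(pinned.is_pinned())    # True
# Passing an explicit device= still works for now but emits a deprecation warning.
```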
Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
Extend constant folding to dynamic-shape nodes; only pointwise ops and some other restricted ops are supported.
We support dynamic shapes by limiting constant folding to ops that are guaranteed to produce uniform values (full, pointwise ops, and views) and by running these operators on tensors of shape 1. This also eliminates the potential memory overhead of constant folding. A rough illustration follows.
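A rough illustration of the shape-1 trick (not the actual Inductor code):
```python
import torch

# A subgraph like torch.full((s0, s1), 2.0) * 3 has a uniform value, so it can
# be folded by evaluating the same pointwise ops on a shape-(1,) tensor and
# expanding to the runtime (dynamic) shape afterwards.
folded = torch.full((1,), 2.0) * 3   # evaluated once, tiny memory footprint
s0, s1 = 8, 16                       # stand-ins for dynamic dimensions
result = folded.expand(s0, s1)       # same uniform value at the full size
```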
Taken over from https://github.com/pytorch/pytorch/pull/128937
joint work with @imzhuhl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367