Fix `test_ck_blas_library` on non-ROCm builds:
```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
    raise rte
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
    result = test(self, **param_kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
    fn(*args, **kwargs)
  File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
    torch.backends.cuda.preferred_blas_library('ck')
  File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
    torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.
To execute this test, run the following from the base repo dir:
python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.346s
FAILED (errors=1)
```
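The failure shows that `preferred_blas_library('ck')` is only valid when PyTorch is built for ROCm, so the test must be kept from running on other builds. A minimal sketch of such a guard (hypothetical; the actual PR may use a different decorator):
```python
import unittest

import torch


class TestLinalg(unittest.TestCase):
    # Only run the CK BLAS test when PyTorch was compiled for ROCm;
    # torch.version.hip is None on non-ROCm builds.
    @unittest.skipUnless(torch.version.hip is not None, "CK requires a ROCm build")
    def test_ck_blas_library(self):
        torch.backends.cuda.preferred_blas_library('ck')
```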
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
Summary:
Optimize the decomposition of `aten.native_group_norm`. Reduce unnecessary repeated operations by reordering the operations on `mean`, `rstd`, `weight`, `bias`, and `input`, which improves performance when `flattened_inner_size` is large.
The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, computed over the range [N, C, *],
3. out = out * weight + bias, computed over the range [N, C, *].
The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = -mean * rstd * weight + bias, computed over the range [N, C],
3. out = x * new_weight + new_bias, computed over the range [N, C, *] (see the sketch below).
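A minimal numerical sketch of the two decompositions (illustrative only; the actual change lives in the Inductor decomposition of `aten.native_group_norm`). Here `mean`/`rstd` are assumed to already be expanded to per-channel shape `[N, C, 1]`:
```python
import torch

def decomp_old(x, mean, rstd, weight, bias):
    # Steps 2 and 3 both run over the full [N, C, *] range.
    out = (x - mean) * rstd           # [N, C, *]
    return out * weight + bias        # [N, C, *]

def decomp_new(x, mean, rstd, weight, bias):
    # Fold mean/rstd into the affine parameters over the small [N, C] range,
    # leaving a single fused multiply-add over [N, C, *].
    new_weight = rstd * weight                 # [N, C, 1]
    new_bias = -mean * rstd * weight + bias    # [N, C, 1]
    return x * new_weight + new_bias           # [N, C, *]

# Toy check with a single group (G=1), so group stats broadcast per channel.
N, C, L, eps = 2, 4, 1024, 1e-5
x = torch.randn(N, C, L)
mean = x.mean(dim=(1, 2), keepdim=True).expand(N, C, 1)
rstd = (x.var(dim=(1, 2), unbiased=False, keepdim=True) + eps).rsqrt().expand(N, C, 1)
weight, bias = torch.randn(1, C, 1), torch.randn(1, C, 1)
torch.testing.assert_close(decomp_old(x, mean, rstd, weight, bias),
                           decomp_new(x, mean, rstd, weight, bias))
```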
I tested the Inductor performance benchmarks with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) show about a 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) show about a 2% performance improvement. On A100, no performance gains or regressions were seen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
This is a new version of https://github.com/pytorch/pytorch/pull/148561, fixing the ROCm test failure.
Putting this up for a first-pass review, though I will likely make a bunch of changes before landing to add more features, etc.
This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal is to take a cubin file and some metadata from a Triton `CompiledKernel` and launch the cubin file directly.
Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66
Normally, using Triton's `CompiledKernel.make_launcher()`, we would pay the cost of codegenning C++ and compiling it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.
The tradeoff is that this new kernel launcher cannot use codegen to deal with different numbers and types of arguments, so for now we use templating to handle up to 10 arguments. We also allocate 8 bytes on the stack per argument regardless of the argument type, which can take more memory than the codegenned launcher. On the other hand, we improve compile time on both cold and warm start by not having to call the C++ compiler at all.
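For intuition, here is a rough Python/ctypes sketch of the core idea — pack every argument into a fixed 8-byte slot and hand an array of pointers to the CUDA driver's `cuLaunchKernel`. This is purely illustrative; the actual launcher is C++ inside `torch._C`:
```python
import ctypes

libcuda = ctypes.CDLL("libcuda.so")  # CUDA driver API

def launch(function, grid, block, args, shared_mem=0, stream=None):
    # One fixed 8-byte slot per argument, regardless of its real type
    # (the simplification described above; real code handles types per slot).
    slots = (ctypes.c_uint64 * len(args))(*(int(a) for a in args))
    # cuLaunchKernel takes an array of *pointers* to the argument values.
    params = (ctypes.c_void_p * len(args))(
        *(ctypes.addressof(slots) + 8 * i for i in range(len(args)))
    )
    err = libcuda.cuLaunchKernel(
        ctypes.c_void_p(function),      # CUfunction from the loaded cubin
        *grid, *block,                  # gridX/Y/Z, blockX/Y/Z
        shared_mem,                     # dynamic shared memory bytes
        ctypes.c_void_p(stream or 0),   # CUstream (0 = default stream)
        params,                         # kernel parameter pointers
        None,                           # extra launch options
    )
    if err != 0:
        raise RuntimeError(f"cuLaunchKernel failed with CUDA error {err}")
```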
This diff does not add the launcher to torch, but introduces a basic test suite.
A list of TODOs that are not yet complete:
- Handle `nvTmaDesc` and `cuTensorMap`, which Triton handles
- Embed the grid logic instead of passing in gridX, gridY, gridZ
- Handle launch_enter and exit hooks? (Not sure if Inductor has these.)
- Benchmarking to see if there's runtime performance loss
- Probably lots of features of Triton's generated C++ code that aren't handled yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238
Approved by: https://github.com/oulgen
Summary: as title.
See internal Diff summary for more context.
Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated
Differential Revision: D71241676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246
Approved by: https://github.com/houseroad
Co-authored-by: Huamin Li <huaminli@meta.com>
We use dummy tensors in our initial trace, so we should never inline: the subclass dispatch might not support the dummy tensor. For example, DTensor's accumulate-grad path checks that both the param and the grad are DTensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
And delete some duplicated glue code by relying on the stub.
After this change, `torch.arange(10, device='mps') // torch.arange(10., device='mps')` will return a tensor of floats, which is the common dtype for a float-integer operation, rather than a tensor of ints.
Verified by `test_div2` in Inductor testing.
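A quick repro of the behavior described above (requires an MPS device):
```python
import torch

a = torch.arange(10, device="mps")    # integer tensor
b = torch.arange(10., device="mps")   # float tensor
# After this change the result has the common float dtype, not an int dtype.
print((a // b).dtype)  # torch.float32
```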
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149233
Approved by: https://github.com/atalman
ghstack dependencies: #149216
Fixes #103425
## Changes
- Add doc description that size values `must be > 0`
- Add validation for the `in1_features` param
Currently only `in1_features` causes a runtime error; adding checks for `in2_features` and `out_features` as well might be BC-breaking.
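A minimal sketch of the kind of check this adds (hypothetical helper; the actual validation lives in `nn.Bilinear`'s constructor):
```python
# Hypothetical sketch: reject non-positive feature sizes up front.
def _check_positive(name: str, value: int) -> None:
    if value <= 0:
        raise ValueError(f"{name} must be > 0, but got {value}")

# e.g. inside Bilinear.__init__: _check_positive("in1_features", in1_features)
```
Repro from the issue: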
```python
import torch
from torch import nn

class lenet(nn.Module):
    def __init__(self):
        super(lenet, self).__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1)
        # Error; with `in1_features=1, in2_features=0, out_features=0` there is no error
        self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0)

    def forward(self, x):
        # 1st block
        x = self.conv(x)
        x = self.linear(x)
        return x

if __name__ == '__main__':
    net = lenet()
```
## Test Result
```bash
pytest test/test_nn.py -k test_bilinear -vv
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018
Approved by: https://github.com/mikaylagawarecki
Putting this up for a first-pass review, though I will likely make a bunch of changes before landing to add more features, etc.
This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal is to take a cubin file and some metadata from a Triton `CompiledKernel` and launch the cubin file directly.
Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66
Normally, using Triton's `CompiledKernel.make_launcher()`, we would pay the cost of codegenning C++ and compiling it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.
The tradeoff is that this new kernel launcher cannot use codegen to deal with different numbers and types of arguments, so for now we use templating to handle up to 10 arguments. We also allocate 8 bytes on the stack per argument regardless of the argument type, which can take more memory than the codegenned launcher. On the other hand, we improve compile time on both cold and warm start by not having to call the C++ compiler at all.
This diff does not add the launcher to torch, but introduces a basic test suite.
A list of TODOs that are not yet complete and will be done in separate diffs:
- Handle `nvTmaDesc` and `cuTensorMap`, which Triton handles
- Embed the grid logic instead of passing in gridX, gridY, gridZ. With https://github.com/pytorch/pytorch/pull/147583, we should be able to handle all of the grid logic directly in `_StaticCudaLauncher.launch_kernel` and get rid of the Python evaluation.
- Handle launch_enter and exit hooks? (Not sure if Inductor has these.)
- Benchmarking to see if there's runtime performance loss
- Hooking it up with a config to Inductor
- Testing harness to test against torch-generated Triton kernels
Differential Revision: [D69926783](https://our.internmc.facebook.com/intern/diff/D69926783/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148561
Approved by: https://github.com/aorenste, https://github.com/syed-ahmed
Changes in this PR:
1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, analogous to `namedtuple` for named tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple. Before this PR, only classes directly created by `collections.namedtuple` or `typing.NamedTuple` were considered namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple classes as well.
Resolves #75982. New tests are included in this PR.
- #75982
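A small illustration of the distinction (the helper names come from the change list above; the import path is an assumption):
```python
import collections
import time

Point = collections.namedtuple("Point", ["x", "y"])

class Point3(Point):
    """A subclass of a namedtuple; treated as a namedtuple after this PR."""

# time.struct_time is a PyStructSequence (a C-level tuple-like type),
# not a collections.namedtuple:
print(isinstance(time.localtime(), tuple))  # True, but not a namedtuple

# Hypothetical usage of the new helpers (import path is an assumption):
# from torch.utils._pytree import is_structseq, is_namedtuple
# is_structseq(time.struct_time)  # -> True
# is_namedtuple(Point)            # -> True
# is_namedtuple(Point3)           # -> True after this PR (False before)
```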
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
Summary: Diff D70471332 (https://www.internalfb.com/diff/D70471332) removed the "grid" input when calling Triton kernels. The PyTorch execution trace needs to make the corresponding change, which covers both capturing the ET and replaying the ET.
Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_with_pt2_cuda
buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay
Differential Revision: D71152464
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149159
Approved by: https://github.com/sraikund16, https://github.com/jansel