A few workspace API changes:
1. Return the outer name when creating a workspace. Most use cases do not care about the outer name, but for mix-order reduction (stacked PR) we need it to perform the next layer of reduction on the workspace tensor.
2. Allow overriding the workspace tensor dtype.
3. Allow delaying the deallocation of workspace tensors in TritonKernel.call_kernel, since they may be used after the call. The lifetime of the workspace tensors is only extended slightly; they are deallocated once the next-layer reduction is done.
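A minimal sketch of how these three knobs might look from a caller's perspective; the names (`WorkspaceSpec`, `allocate_workspace`, `defer_free`) are illustrative stand-ins, not the actual Inductor API:
```python
from dataclasses import dataclass
import torch

@dataclass
class WorkspaceSpec:
    # Illustrative stand-in for Inductor's workspace descriptor, not the real class.
    size: int
    dtype: torch.dtype = torch.float32   # (2) the dtype can now be overridden
    defer_free: bool = False             # (3) keep the buffer alive past call_kernel

def allocate_workspace(spec: WorkspaceSpec) -> tuple[str, torch.Tensor]:
    buf = torch.empty(spec.size, dtype=spec.dtype)
    outer_name = f"ws_{id(buf):x}"       # (1) the outer name is now returned to the caller
    return outer_name, buf

# The mix-order-reduction pass can then use `outer_name` to emit the
# next-layer reduction over the workspace tensor before it is freed.
name, ws = allocate_workspace(WorkspaceSpec(size=1024, dtype=torch.float64, defer_free=True))
```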
Tested with the stacked PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166204
Approved by: https://github.com/jansel
Summary:
Attempting to forward fix failures from D85405167 (PR
https://github.com/pytorch/pytorch/pull/166021)
This is Devmate's suggestion and seems to work, but I'm not sure whether it's a good idea. Devmate says the call is being resolved to at::min, which is host-only; the reason it doesn't happen in OSS is likely that `AT_PER_OPERATOR_HEADERS` is defined in OSS but not internally.
```
In file included from .../ATen/native/hip/Normalization.hip:11:
.../ATen/native/hip/Normalization.cuh:302:37: error: no matching function for call to 'min'
302 | v_[u] = input[batch][plane][min(x+u*blockDim.x, input.size(2)-1)];
| ^~~
```
Differential Revision: D85463674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166195
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/eqy
Summary:
The float32 data type has a vectorized routine that computes erf(). That function currently calls std::exp() individually for each float in the vector being processed.
We now use Sleef's vectorized routine to compute exp, improving the performance of erf.
AVX2/AVX512 also have custom erf implementations, which use Sleef to compute exp.
We've observed a throughput increase of 25% when testing on tensors containing 1M elements.
Before:
f32 erf: 3175.977us
After:
f32 erf: 2539.446us
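A minimal sketch of how a comparable measurement could be reproduced from Python, assuming a 1M-element float32 tensor on CPU; the numbers above came from the operator_benchmark run listed in the test plan:
```python
import timeit

import torch

x = torch.randn(1_000_000, dtype=torch.float32)

# Warm up once, then average the vectorized float32 erf over repeated runs.
torch.erf(x)
elapsed = timeit.timeit(lambda: torch.erf(x), number=100) / 100
print(f"f32 erf: {elapsed * 1e6:.3f}us")
```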
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85522651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166262
Approved by: https://github.com/fadara01, https://github.com/jgong5, https://github.com/aditew01
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@8d373b](8d373ba272), which includes:
- Add CONFIGURE_DEPENDS to the install_xpu_headers macro to track these headers
- Add a check to ensure P2P tensors are dense
- Switch philox_engine_inputs usage to philox_xpu_state per the XPU graph request
- Add a vectorization path for maxpool backward channels-last
- Fix the SYCL_PRINT macro so it is usable on Windows
- Eliminate an unnecessary warning when AOT is not enabled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166129
Approved by: https://github.com/EikanWang
For https://github.com/pytorch/pytorch/issues/114850, we port some ATen unit tests to Intel GPU. We enable Intel GPU with the following methods, keeping the original code style as much as possible (see the sketch after the list):
1. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests.
2. Added allow_xpu=True for supported test classes in test parameterization.
3. Used torch.accelerator to extend CUDA-specific tests to XPU where needed.
4. Enabled 'xpu' for some test paths.
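A rough sketch of the ported pattern; the test class and op are placeholders, and exact decorator signatures may differ slightly between test files:
```python
import torch
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    onlyOn,
)
from torch.testing._internal.common_utils import TestCase, run_tests

class TestUnaryOps(TestCase):
    # Previously @onlyCUDA; now enabled on both backends.
    @onlyOn(['cuda', 'xpu'])
    def test_abs(self, device):
        # torch.accelerator covers CUDA and XPU uniformly where
        # device-agnostic code is needed.
        assert torch.accelerator.is_available()
        x = torch.randn(8, device=device)
        self.assertEqual(x.abs(), torch.abs(x))

# allow_xpu=True opts the class into XPU parameterization.
instantiate_device_type_tests(TestUnaryOps, globals(), allow_xpu=True)

if __name__ == "__main__":
    run_tests()
```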
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165405
Approved by: https://github.com/guangyey, https://github.com/ezyang
This PR removes outdated ROCm version checks and their branches. While there is no explicit mention of a minimum supported version, ROCm 6.4 is the version listed on the installation page and in the CI YAML files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166111
Approved by: https://github.com/ezyang
I'd like to propose a new module, `Out-of-tree Backend Integration`, via the `PrivateUse1` device key. Out-of-tree backend integration via the `PrivateUse1` device key has been the recommended mechanism for plugging third-party accelerator devices into PyTorch. There are already quite a few documents/tutorials on its usage, the primary one being https://docs.pytorch.org/docs/main/accelerator/index.html.
We have also seen more and more HW vendors leverage the `PrivateUse1` mechanism to support their accelerators. For example:
1. Ascend NPU
2. Microsoft MAIA
3. MooreThreads MUSA
4. Cambricon MLU
The scope of `PrivateUse1`-based out-of-tree backend integration is composed of two parts:
1. `PrivateUse1` device as an out-of-tree backend, which involves:
(a) making `PrivateUse1` a functionally complete device on par with other in-tree devices: device runtime, autograd, autocast, profiling, distributed, quantization, etc.
(b) a pluggable design that allows out-of-tree integrations to extend the functionality of `PrivateUse1`, such as a backend registration mechanism that allows user-friendly device naming, runtime extension points in both C++ and Python for third parties to plug in their runtime implementation, and a customizable tensor implementation for third parties to add extra info/functionality to the tensor and its serialization.
2. OpenReg: a test suite and documentation effort to guarantee the functional correctness of the `PrivateUse1` mechanism and to guide HW vendors toward the right implementation.
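As a concrete illustration of the registration mechanism in (1)(b), a third-party backend typically renames the `PrivateUse1` key and registers its runtime module from Python roughly as below; the `my_npu` name and the stub module are placeholders, and a real backend also needs the C++ side (allocator, device guard, kernels) that OpenReg demonstrates:
```python
import types

import torch

# Stand-in for a vendor's runtime module; a real backend implements the usual
# hooks (is_available, device_count, current_device, ...) here.
my_npu_module = types.ModuleType("torch.my_npu")
my_npu_module.is_available = lambda: True

# Give PrivateUse1 a user-friendly name so devices print as e.g. 'my_npu:0'.
torch.utils.rename_privateuse1_backend("my_npu")

# Register the runtime module so it is reachable as torch.my_npu.
torch._register_device_module("my_npu", my_npu_module)

# Generate helpers such as Tensor.is_my_npu / Tensor.my_npu() for the new name.
torch.utils.generate_methods_for_privateuse1_backend()
```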
I'm also proposing @FFFrog as the maintainer for this new module due to his continuous contributions to the design and implementation of both parts of the module. Below are the RFCs/feature proposals @FFFrog has been working on:
1. [An improvement of PrivateUse1 mechanism, facilitating third-party backend integration](https://docs.google.com/document/d/1_2EO5A2Ww3xDwqbhIvs9Nk65-jV0oNYg3XAmNUsHdAY/edit?tab=t.0#heading=h.5vt8c1vo4dc7)
2. [The interoperability Standard of Third-party Backend Integration Mechanism](9bd181e742/RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md)
3. [PyTorch Backend Accelerator Integration Verification and Guidance](f6048cbd4f/RFC-0045-PyTorch-Accelerator-Integration-Enhancements.md)
@FFFrog has contributed 240+ PRs, the majority of which are related to `PrivateUse1` (https://github.com/pytorch/pytorch/pulls?q=is%3Apr+author%3Afffrog+). He has also reviewed 50+ PRs in this area, and he is the primary author of OpenReg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165958
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/ezyang
Summary:
Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also give back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simplify the code.
We do have a more ambitious universe-based change here: https://github.com/pytorch/pytorch/pull/165680, but it needs more discussion and would be BC-breaking. We might eventually merge that PR, but probably not now; this change is not BC-breaking and will help concatenate and 2D integration with concatenate.
cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci
imported-using-ghimport
Test Plan: Imported from OSS
Differential Revision: D85526705
Pulled By: fduwjj
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166264
Approved by: https://github.com/XilunWu
This makes `GraphModule.recompile()` also recompile any submodules that are themselves graph modules, which allows us to pass all existing regional Inductor tests without skipping.
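Conceptually, the new behavior is equivalent to something like the following simplified sketch (not the actual implementation):
```python
import torch.fx as fx

def recompile_recursively(gm: fx.GraphModule) -> None:
    # Recompile nested GraphModules (e.g. regionally compiled submodules)
    # first, then regenerate the parent's forward from its own graph.
    for name, submodule in gm.named_modules():
        if submodule is not gm and isinstance(submodule, fx.GraphModule):
            submodule.recompile()
    gm.recompile()
```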
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166002
Approved by: https://github.com/oulgen
ghstack dependencies: #165996
The redistribute test extensively exercises various sharding schemes and redistribution between them. These tests uncovered more edge cases that were not supported by local tensor, primarily different flavors of uneven sharding. To handle these cases, this change implements the missing functional collectives and adds support for the uneven sharding case where the sharding group (ranks) is larger than the size of the dimension being sharded. In the latter case the "missing" shards are represented by zero-sized tensors so that the rest of the local tensor machinery can stay oblivious to this special case.
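A small sketch of the zero-sized-shard representation for the over-sharded case; the split rule here is illustrative, not the exact DTensor chunking rule:
```python
import torch

def local_shards(dim_size: int, num_ranks: int) -> list[torch.Tensor]:
    # Split `dim_size` rows across `num_ranks`; ranks beyond the dimension
    # size hold zero-sized tensors so downstream code needs no special case.
    base, rem = divmod(dim_size, num_ranks)
    sizes = [base + (1 if r < rem else 0) for r in range(num_ranks)]
    return [torch.empty(n, 4) for n in sizes]

# Sharding a dimension of size 3 across 5 ranks: shard sizes [1, 1, 1, 0, 0].
print([s.shape[0] for s in local_shards(3, 5)])
```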
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166081
Approved by: https://github.com/ezyang
Initial autotuning support for foreach kernels, giving a 4x improvement for some kernels in an internal workload. More improvements can surely be made here in the future. This removes num_warps from the kernel definition to enable autotune support in the generated wrapper code.
Before:
triton_for_fused_18.kd 🔍 | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.036 ms | 0.036 ms | 0.018 ms | 2 |
After:
triton_for_fused_18.kd 🔍 | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.024 ms | 0.024 ms | 0.012 ms | 2 |
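One simple way to exercise a foreach kernel end-to-end under Inductor; this assumes a CUDA device and uses max-autotune as an example compile mode, while the kernels in the table above come from an internal workload:
```python
import torch

xs = [torch.randn(4096, device="cuda") for _ in range(16)]
ys = [torch.randn(4096, device="cuda") for _ in range(16)]

@torch.compile(mode="max-autotune")
def foreach_add(a, b):
    # Lowers to a foreach kernel in the generated wrapper code; with this
    # change such kernels can be autotuned rather than pinned to a fixed
    # num_warps.
    return torch._foreach_add(a, b)

out = foreach_add(xs, ys)
```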
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162053
Approved by: https://github.com/mlazos, https://github.com/naromero77amd
Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
This PR partially fixes https://github.com/pytorch/torchtitan/issues/1791, as it only works with the `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1` setting.
The core of the problem: in `max-autotune` mode Inductor runs multiple benchmarks to determine the best config. If one of these benchmarks fails with `cudaErrorLaunchFailure`, all other CUDA calls within the same process will fail, including the rest of the benchmarks.
The solution: restart the child process gracefully and continue benchmarking.
Unfortunately, if autotuning is done in the main process, the whole program falls into an unrecoverable state. In that case, the only way to execute successfully would be to prevent the launch failure (ULF) in the first place.
Here is some info from [CUDA documentation](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html):
>cudaErrorLaunchFailure = 719
An exception occurred on the device while executing a kernel. ... . This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.
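A minimal way to opt into subprocess autotuning using the setting mentioned above; the env var must be set before compilation starts, and the function being compiled here is just an example:
```python
import os

# Run max-autotune benchmarking in a child process, so a cudaErrorLaunchFailure
# during one candidate config only kills the child; with this PR the child is
# restarted and benchmarking continues.
os.environ["TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC"] = "1"

import torch

@torch.compile(mode="max-autotune")
def f(a, b):
    return a @ b

out = f(torch.randn(512, 512, device="cuda"), torch.randn(512, 512, device="cuda"))
```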
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166073
Approved by: https://github.com/syed-ahmed, https://github.com/drisspg