Namely, error out rather than crash when the `out` dtype is of an unexpected type.
Resize the output tensor to the expected size in the `_out` operation, to prevent a crash when a tensor of an unexpected size is passed.
Preserve symbolic shapes whenever possible
Test plan: Run `python test_ops.py -v -k test_out_warning_fft_hfft_mps` for the MPS device; without this change it crashes with `Error: Invalid KernelDAG, equalShape for destination failed`. Run `python ../test/test_ops.py -v -k test_dtypes_stft_mps`; without this change it crashes with `A complex mlir::Type does not have a corresponding complex MPSDataType` when the input dtype is bfloat16.
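For reference, a minimal sketch of the intended `out=` behavior (shapes, dtypes, and the exact warning/error behavior are illustrative assumptions, not taken from the test suite):
```
import torch

# Sketch only: assumes an MPS-capable machine.
x = torch.randn(8, dtype=torch.cfloat, device="mps")

# Wrong-sized out tensor: previously this could crash the Metal kernel;
# with the fix the output is resized to the expected shape (with a warning).
out = torch.empty(3, dtype=torch.float32, device="mps")
torch.fft.hfft(x, out=out)          # out is resized to the expected 14 elements

# Unexpected out dtype: now raises an error instead of crashing.
bad_out = torch.empty(14, dtype=torch.int64, device="mps")
try:
    torch.fft.hfft(x, out=bad_out)
except RuntimeError as e:
    print("rejected:", e)
```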
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166272
Approved by: https://github.com/kulinseth
Fixes #163929
Fixes argmin/argmax operations to return correct logical indices instead of physical memory offsets when applied to transposed/permuted tensors. When `argmin()` or `argmax()` is called on a transposed tensor, Inductor was returning physical memory indices instead of logical row-major indices, producing incorrect results that did not match eager-mode behavior.
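A minimal sketch of the failure mode (tensor values chosen so that the logical index and the physical storage offset differ):
```
import torch

x = torch.tensor([[1., 9., 3.],
                  [4., 5., 6.]])
xt = x.T  # shape (3, 2); underlying storage is still [1, 9, 3, 4, 5, 6]

def f(t):
    return t.argmax()

# Eager: the max (9.) is at logical row-major position (1, 0) of xt -> index 2.
print(f(xt))                    # tensor(2)

# Before the fix, the compiled version could return the physical storage
# offset of 9. (index 1); with the fix it matches eager.
print(torch.compile(f)(xt))     # tensor(2)
```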
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165983
Approved by: https://github.com/shunting314
# why
- enable users to control which choices get used on which inputs
- reduce lowering time, and pin kernel selection, by selecting them for the inputs
# what
- a new InductorChoices subclass that implements a lookup table (see the sketch after this list)
- a README explaining the usage
- corresponding testing
- currently only supports templates that go through `V.choices.get_template_configs`
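A rough conceptual sketch of the lookup-table idea (class, method, and table names here are illustrative only; the actual subclass hooks into `V.choices.get_template_configs`, and the README added in this PR documents the real key and config format):
```
from torch._inductor.choices import InductorChoices

# Illustrative-only table keyed by (op, input shapes, dtype); the real schema
# is defined by the PR's README, not by this sketch.
_LOOKUP_TABLE = {
    ("mm", ((1024, 512), (512, 256)), "torch.float16"): {"BLOCK_M": 64, "BLOCK_N": 64},
}

class LookupTableChoices(InductorChoices):
    """Sketch: answer template-config queries from a static table when a key
    matches, otherwise fall back to the default heuristics in the parent."""

    def _lookup(self, key):
        return _LOOKUP_TABLE.get(key)
```
The custom choices object would be registered as Inductor's choices handler (`V.choices`); see the README in this PR for how table keys are derived from the inputs.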
# testing
```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164978
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
Summary: Fallback kernels are created with flattened constant args and an `unflatten` utility to restore the original structure when needed. Apply it in `FXConverter` to preserve the original structure.
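The general flatten/unflatten pattern looks roughly like this (a generic pytree sketch, not the actual FXConverter code):
```
import torch.utils._pytree as pytree

# Nested constant args, as a fallback kernel might carry them.
constant_args = {"alpha": 2.0, "sizes": [3, (4, 5)], "flag": True}

# Flatten to a plain list of leaves plus a spec describing the structure...
flat_args, spec = pytree.tree_flatten(constant_args)
print(flat_args)   # [2.0, 3, 4, 5, True]

# ...and later unflatten to recover the original nested structure exactly.
restored = pytree.tree_unflatten(flat_args, spec)
assert restored == constant_args
```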
Test Plan: added new CI tests
Differential Revision: D85347589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166144
Approved by: https://github.com/blaine-rister
- TheRock build system for ROCm builds OpenBLAS from source and uses a custom name for the library.
- Following existing conventions in `FindOpenBLAS.cmake` to support finding a custom-named version of OpenBLAS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166333
Approved by: https://github.com/jeffdaily
**Summary:** When operations are performed on tensors with Partial placements, the sharding logic incorrectly determines whether the tensor should be redistributed to Replicate. By delaying that redistribution, we end up performing the operation first and the partial reduction afterwards, which produces incorrect results for max, min, gradient norm clipping, and more. We fix this by setting reduction_linear to False whenever a Partial placement is present, forcing the redistribution before completing the op.
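A small numeric illustration of why the order matters, with plain tensors standing in for the local shards of a Partial(sum) placement (not the DTensor API):
```
import torch

# Two local shards whose elementwise sum is the logical tensor of a
# Partial(sum) placement.
shard_a = torch.tensor([1., 5.])
shard_b = torch.tensor([4., 2.])
full = shard_a + shard_b                 # tensor([5., 7.]) -- the logical tensor

# Correct: redistribute first (sum the partials), then take the max.
print(full.max())                        # tensor(7.)

# Incorrect: take the max on each shard, then reduce the partial results.
print(shard_a.max() + shard_b.max())     # tensor(9.) -- wrong, max is not linear
```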
**Test Cases**
1. pytest test/distributed/tensor/test_math_ops.py -k test_partial_reduction_ops
2. pytest test/distributed/tensor/test_math_ops.py -k test_matching_partial_reduction_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165962
Approved by: https://github.com/wconstab
**Summary:** First, I increased the world size to 8 because test_3d_with_tp_dp_pp wouldn't actually fully shard anything: with tp = 2 and pp = 2, dp was left at 1. Second, I refactored the tests that use single- and multi-stage schedules so that their logic is largely shared. This was accomplished by using the multi-stage-schedule logic from test_replicate_pp_grad to determine the start and end indices of a partial model, while setting virtual_stage to 1 when a single-stage schedule is used. Even if this approach isn't approved, the multi-stage schedule logic in test_3d_with_tp_dp_pp and test_replicate_pp should be changed, as the logic currently used is incorrect.
**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165701
Approved by: https://github.com/H-Huang
- Fixes `s/#pragma onces/#pragma once/` typo
All methods in the headers must be inline, otherwise one gets a barrage of warnings like the following:
```
/Users/malfet/git/pytorch/pytorch/c10/metal/utils.h:337:7: warning: unused function 'conj<half __attribute__((ext_vector_type(2)))>' [-Wunused-function]
half2 conj(half2 a) {
^
/Users/malfet/git/pytorch/pytorch/c10/metal/utils.h:342:8: warning: unused function 'conj<float __attribute__((ext_vector_type(2)))>' [-Wunused-function]
float2 conj(float2 a) {
^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166315
Approved by: https://github.com/seemethere, https://github.com/atalman
This diff moves export's run_decompositions to use aot_export_joint_with_descriptors instead of aot_export_module. In doing so, I ran into 2 main bugs:
1) aot_export_joint_with_descriptors doesn't correctly pass the record_nn_module_stack flag that is needed to populate nn_module_stack by switching the internal tracer.
2) When creating a symint from negative inputs, we need to pass positive=False. This didn't matter before because aot_autograd directly returns integer inputs instead of creating a symint.
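The second point can be illustrated with plain sympy, which the symbolic-shapes machinery builds on (a minimal sketch of the underlying issue, not the export code path itself):
```
import sympy

# A symbol created with positive=True lets sympy fold sign-dependent facts,
# which is unsound if the runtime value backing the symbol is negative (e.g. -3).
s_pos = sympy.Symbol("s0", integer=True, positive=True)
print(s_pos > 0)      # True    -- folded away, even if the real value is -3
print(abs(s_pos))     # s0      -- Abs dropped based on the positivity assumption

# Without the positivity assumption, the facts stay symbolic and can be
# checked against the actual (possibly negative) value later.
s_any = sympy.Symbol("s1", integer=True)
print(s_any > 0)      # s1 > 0
print(abs(s_any))     # Abs(s1)
```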
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165931
Approved by: https://github.com/zhxchen17
Summary:
Conversion from/to float16 was not getting covered by the conversion templates, because these used `float16_t` as the data type instead of the custom `at::Half`.
We are adding a shim that makes the conversion routines use autovec code for float16.
We observed the following performance improvements when compiling targeting `armv9-a+sve2+fp16`:
| Conversion | Before | After | Throughput gain |
| --- | --- | --- | --- |
| `float16_t->uint8->float16_t` | 657.489us | 181.216us | 263% higher |
| `float16_t->int8->float16_t` | 656.518us | 179.821us | 265% higher |
| `float16_t->int16->float16_t` | 668.998us | 183.417us | 265% higher |
| `float16_t->int64->float16_t` | 618.444us | 459.897us | 35% higher |
| `float16_t->double->float16_t` | 439.728us | 351.276us | 25% higher |
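A rough way to reproduce this kind of measurement from Python (a sketch using public torch APIs and an arbitrary tensor size; the numbers above come from the operator_benchmark suite):
```
import timeit
import torch

# Roundtrip conversions on a large half-precision tensor, analogous to the
# float16_t->X->float16_t measurements above.
x = torch.rand(1_000_000, dtype=torch.float16)

for dst in (torch.uint8, torch.int8, torch.int16, torch.int64, torch.float64):
    t = timeit.timeit(lambda: x.to(dst).to(torch.float16), number=100) / 100
    print(f"float16 -> {dst} -> float16: {t * 1e6:.1f}us")
```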
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85533271
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166306
Approved by: https://github.com/mcfi, https://github.com/ezyang
Differential Revision: D85446553
Internal builds failing after https://github.com/pytorch/pytorch/pull/161369
```
buck-headers/ATen/Context.h:22:10: fatal error: 'ATen/detail/XLAHooksInterface.h' file not found
22 | #include <ATen/detail/XLAHooksInterface.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
```
Changes similar to that PR also need to update the build_variables file, which I've done here. I'm not sure why this wasn't caught by the Bazel build we have.
Sanity-checked that some of the previously failing builds pass after this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166179
Approved by: https://github.com/Camyll
A few workspace API changes:
1. Return the outer name when creating a workspace. Usually a use case does not care about the outer name, but for mix-order reduction (stacked PR) we need it to perform the next layer of reduction on the workspace tensor.
2. Allow overriding the workspace tensor dtype.
3. Allow delaying the deallocation of workspace tensors in TritonKernel.call_kernel, since they may be used after the call. The lifetime of the workspace tensors is only extended a little; they are deallocated once the next-layer reduction is done.
Tested with the stacked PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166204
Approved by: https://github.com/jansel
Summary:
Attempting to forward-fix failures from D85405167 (PR https://github.com/pytorch/pytorch/pull/166021).
This is Devmate's suggestion and seems to work, but I don't know whether it's a good idea. Devmate says the call is getting resolved to at::min, which is host-only; the reason it doesn't happen in OSS is likely that `AT_PER_OPERATOR_HEADERS` is defined in OSS but not internally.
```
In file included from .../ATen/native/hip/Normalization.hip:11:
.../ATen/native/hip/Normalization.cuh:302:37: error: no matching function for call to 'min'
302 | v_[u] = input[batch][plane][min(x+u*blockDim.x, input.size(2)-1)];
| ^~~
```
Differential Revision: D85463674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166195
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/eqy
Summary:
The float32 data type has a vectorized routine that computes erf(). That routine currently calls std::exp() individually for each float in the vector being processed.
We now use sleef's vectorized routine to compute exp, improving the performance of erf.
AVX2/AVX512 also have a custom erf implementation, which uses sleef to compute exp.
We've observed a throughput increase of 25% when tested on tensors containing 1M elements:
Before:
f32 erf: 3175.977us
After:
f32 erf: 2539.446us
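A rough way to reproduce the measurement from Python (a sketch using public torch APIs; the numbers above come from the operator_benchmark suite):
```
import timeit
import torch

# 1M-element float32 tensor, matching the size quoted above.
x = torch.rand(1_000_000, dtype=torch.float32)

t = timeit.timeit(lambda: torch.erf(x), number=100) / 100
print(f"f32 erf: {t * 1e6:.3f}us")
```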
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85522651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166262
Approved by: https://github.com/fadara01, https://github.com/jgong5, https://github.com/aditew01