Fix https://github.com/pytorch/pytorch/issues/151589
It's interesting that the Q4_K dequantization example in the referenced GH issue does not crash even when Inductor passes Triton the wrong alignment information. I dug into this a bit. The main reason is that there are 2 things in Triton that decide the vectorization size:
1. alignment
2. the max number of contiguous elements a thread needs to process
Here is the Triton code that decides the vectorization size [link](c5fed8e1ca/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp (L147-L157)), and here is the Triton code that considers contiguity for vectorization [link](c5fed8e1ca/lib/Analysis/AxisInfo.cpp (L1250-L1269)).
When Inductor wrongly tells Triton that an unaligned tensor is aligned, Triton may still not do vectorization (or may not do full vectorization) because of the second restriction.
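As a rough mental model (a hedged sketch, not Triton's actual logic), the effective vector width is the minimum of what the reported alignment permits and how many contiguous elements each thread is assigned:
```python
def effective_vector_width(alignment_bytes, contiguous_elems_per_thread, elem_size_bytes):
    # Restriction 1: the pointer/offset alignment caps the load width.
    align_width = alignment_bytes // elem_size_bytes
    # Restriction 2: a thread can only vectorize over the contiguous
    # elements it is assigned. A wrongly-reported alignment therefore only
    # matters when contiguity would otherwise allow a wide (vectorized) load.
    return min(align_width, contiguous_elems_per_thread)
```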
Check this test:
```
@parametrize(
    "size",
    (
        128,
        1024,
        1024 * 1024,
    ),
)
def test_slice_view_dtype(self, size):
    offset = 1

    def f(x):
        return x[2:].view(dtype=torch.float32) + 1

    x = torch.randn((size + offset) * 2, dtype=torch.bfloat16, device=self.device)
    self.common(f, (x,), reference_in_float=False)
```
Before the fix, Inductor would tell Triton that the output tensor of aten.view.dtype is aligned even though it's not. That tensor is then passed to the Triton kernel for the aten.add. Triton may make a different vectorization decision depending on the tensor size:
1. when size = 128, Triton picks ld.global.b32 to load data from global memory
2. when size = 1024, Triton uses ld.global.v2.b32
3. when size = 1024 * 1024, Triton uses ld.global.v4.b32
So whether the wrong alignment metadata causes an issue depends on whether Triton picks vectorized instructions. That in turn depends on the Triton config (block size) decided by Inductor and on Triton's internal logic (how elements are assigned to each thread). We'd better make sure Inductor always generates correct metadata, so that such hidden issues don't turn into crashes later.
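For reference, the misalignment can be checked directly (an illustrative snippet; the exact offsets depend on the allocator):
```python
import torch

x = torch.randn(10, dtype=torch.bfloat16)
y = x[2:].view(dtype=torch.float32)
# x starts at an allocator-aligned address; y starts 4 bytes (2 bf16 elements)
# into the same storage, so its data pointer is typically not 16-byte aligned.
print(x.data_ptr() % 16, y.data_ptr() % 16)
```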
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151859
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #151841
**Summary**
Fixes [#151290](https://github.com/pytorch/pytorch/issues/151290) and [#151523](https://github.com/pytorch/pytorch/issues/151523), which are regressions introduced by [#144020](https://github.com/pytorch/pytorch/pull/144020). That PR enabled parallelization at the inner loop level.
However, a currently unsupported case arises when parallel reduction occurs under the vectorization loop level, specifically in patterns like:
```
for vec_loop_level:
    do_parallel_reduction
```
In such cases, a temporary buffer `tmp_acc_array` is allocated for tail scalar kernels, and another temporary buffer `tmp_acc_array` is also defined for parallel reduction. This results in a conflict due to overlapping temporary buffers. This PR disables the problematic case to avoid the conflict until proper support is implemented.
**Test Plan**
```
python test/inductor/test_flex_attention.py -k test_make_block_mask_cpu
python test/inductor/test_cpu_repro.py -k test_parallel_reduction_vectorization
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151887
Approved by: https://github.com/jansel
Summary:
I had a minor annoyance when debugging graphs using EXIR dialect ops: all the function normalization went away. For functions with > 5 arguments,
some of which are just simple bools and ints, it's very helpful to have
the kwarg names attached.
Enhance `normalize_target` to handle EdgeOpOverload targets. To avoid
a circular dependency on Executorch from pytorch core, I just use a `hasattr`
check for "_op". This only happens if the target is not already a recognized
torch function.
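A rough sketch of the idea (simplified, with hypothetical helper names; not the actual code):
```python
# Hypothetical, simplified sketch; `normalize_torch_op` stands in for the
# existing normalization path and is just a placeholder here.
def normalize_torch_op(op, args, kwargs):
    return (args, kwargs)  # placeholder for the real kwarg-name normalization

def normalize_target_sketch(target, args, kwargs, recognized_torch_functions=()):
    if target in recognized_torch_functions:       # existing path, unchanged
        return normalize_torch_op(target, args, kwargs)
    # EXIR EdgeOpOverload wraps the underlying ATen OpOverload in `_op`;
    # a hasattr check avoids a circular dependency on Executorch.
    if hasattr(target, "_op"):
        return normalize_torch_op(target._op, args, kwargs)
    return None
```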
Also, I noticed that the new `fx.Node.normalized_arguments` function
didn't forward an important kwarg to `normalize_target`, so I fixed that too.
Test Plan: Tested with FxGraphDrawer and an fx Graph containing EXIR nodes.
Differential Revision: D67545909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143689
Approved by: https://github.com/angelayi
Filter frames out of the stack trace so that the stack trace on nodes looks nicer when using fx.Tracer. I just copied the filtering we have in [proxy_tensor.py](6720d23969/torch/fx/experimental/proxy_tensor.py (L1903-L1931)).
Previously the stacktrace looked like:
```
File "/data/users/angelayi/pytorch/moo.py", line 3964, in <module>
run_tests()
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 1342, in run_tests
unittest.main(argv=argv)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 101, in __init__
self.runTests()
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 271, in runTests
self.result = testRunner.run(self.test)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/runner.py", line 184, in run
test(result)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
test(result)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
test(result)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 650, in __call__
return self.run(*args, **kwds)
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3324, in run
self._run_custom(
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3296, in _run_custom
super_run(result=result)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3156, in wrapper
method(*args, **kwargs)
File "/data/users/angelayi/pytorch/moo.py", line 1495, in test_stack_trace
gm = torch.fx.GraphModule(m, tracer.trace(m))
File "/data/users/angelayi/pytorch/torch/fx/_symbolic_trace.py", line 837, in trace
(self.create_arg(fn(*args)),),
File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
x = x * 2
File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 716, in impl
return tracer.create_proxy("call_function", target, args, kwargs)
File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 248, in create_proxy
proxy.node.stack_trace = "".join(CapturedTraceback.extract().format())
```
Now it looks like:
```
File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
x = x * 2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151029
Approved by: https://github.com/jfix71, https://github.com/zou3519, https://github.com/jingsh
With `AdditionalInputs`, the behavior is the same as with tensors:
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

additional_inputs = torch.export.AdditionalInputs()
additional_inputs.add((5, 5))
additional_inputs.add((3, 5))
additional_inputs.add((5, 4))

ep = torch.export.export(
    M(), (6, 7), dynamic_shapes=additional_inputs, strict=False
)
```
With `ShapesCollection`, we now need to wrap integer inputs as `_IntWrapper` so that we can have a unique identifier for each integer input.
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

from torch.export import Dim
from torch.export.dynamic_shapes import _IntWrapper

args = (_IntWrapper(5), _IntWrapper(5))
# Or we can do `args = pytree.tree_map_only(int, lambda a: _IntWrapper(a), orig_args)`

shapes_collection = torch.export.ShapesCollection()
shapes_collection[args[0]] = Dim.DYNAMIC
shapes_collection[args[1]] = Dim.DYNAMIC

ep = torch.export.export(
    M(), args, dynamic_shapes=shapes_collection, strict=False
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151842
Approved by: https://github.com/pianpwk
By reusing `c10/metal/atomic.h`
This also fixes `GPUTests.test_index_put_fallback[12]_mps`, which is unrolled by Inductor, so there is no need for dedicated atomic_add support
TODOs:
- Get rid of the indexing kernel and compute indices directly when the kernel is run
- Simulate atomic_add for int64 types as a series of int32 atomic-add-and-fetch operations
- Set up tolerances correctly to pass float16/bfloat16 tests (as CPU always takes the sequential strategy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869
Approved by: https://github.com/Skylion007, https://github.com/dcci
In this approach, we catch any lanes within a wave that are doing fastatomics to the same destination address and compute their sum on the CU. This leads to a 3x improvement in scatter_add performance and a 2x improvement in index_select.
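As a host-side illustration only (the actual change is in the HIP kernels), the idea is that lanes hitting the same destination first combine their contributions, so each unique address receives a single atomic update instead of one per lane:
```python
from collections import defaultdict

def wave_scatter_add(dest, lane_indices, lane_values):
    # Combine values from lanes that target the same destination address
    # (done within the wave on the CU in the real kernel).
    combined = defaultdict(float)
    for idx, val in zip(lane_indices, lane_values):
        combined[idx] += val
    # One atomic add per unique address instead of one per lane.
    for idx, val in combined.items():
        dest[idx] += val
```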
scatter_add performance on MI300x:
| dtype | Baseline (before optimizations) | Opportunistic fastatomics |
|-------|---------------------------------|---------------------------|
| f32   | 1.389425039 | 0.430447996 |
| fp16  | 2.195472956 | 0.779729486 |
| bf16  | 2.194051027 | 0.784599513 |
Using the following reproducer
```
import torch
import triton

def main():
    dtype = torch.float32
    dim = 1305301
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)
    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```
co-authored by: @amd-hhashemi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily, https://github.com/mxz297
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
Should be the last part of https://github.com/pytorch/pytorch/pull/150558, except maybe for the s390x stuff, where I'm still not sure what's going on.
For binary builds, do what we already do in CI and tag each image with a hash of the `.ci/docker` folder, to ensure a docker image built from that commit gets used. Previously the builds would use `imagename:arch-main`, which could be a version of the image based on an older commit.
After this, changing a docker image and then tagging with ciflow/binaries on the same PR should use the new docker images
Release and main builds should still pull from docker io
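The tag is derived from the contents of `.ci/docker`, roughly like this sketch (hypothetical helper, not the actual CI script, which may compute the hash differently):
```python
import hashlib
import os

def docker_folder_tag(root=".ci/docker"):
    # Hash every file under .ci/docker so that any change to the folder
    # produces a new tag, forcing binary builds to pick up an image built
    # from the same commit.
    h = hashlib.sha256()
    for dirpath, _dirnames, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(path.encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()
```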
Cons:
* if someone rebuilds the image from main or a PR where the hash is the same (e.g. the folder is unchanged, but the docker build is retriggered for some reason), the release would use that image instead of one built on the release branch
* spin wait for docker build to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151706
Approved by: https://github.com/atalman
standalone_compile needs to get dynamic shape information from
somewhere. We add a new `dynamic_shapes` argument with three options:
1. from the passed-in graph (dynamic="from_graph"). This is the default.
2. from the example inputs, thereby specializing on them. (dynamic="from_example_inputs")
3. from the current tracing context (dynamic="from_tracing_context")
Options 1 and 3 are not exactly the same. Option 2 can also be used for more advanced things (e.g. specializing on one input but not the other).
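A hedged usage sketch (assuming the `torch._inductor.standalone_compile` entry point; the exact signature may differ):
```python
import torch

def f(x):
    return x * 2

gm = torch.fx.symbolic_trace(f)
example_inputs = (torch.randn(8),)

# Default: take dynamic shape information from the passed-in graph.
artifact = torch._inductor.standalone_compile(
    gm, example_inputs, dynamic_shapes="from_graph"
)
# Alternatives (sketch): "from_example_inputs" specializes on the example
# inputs; "from_tracing_context" reads the current tracing context.
```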
Most of this PR is tests.
Test Plan:
- a lot of new tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
Fixes https://github.com/pytorch/pytorch/issues/151055. Thanks @desertfire for the patch that fixed this.
I was a bit careful about the test: I wanted to make sure it accurately ensures that we don't regress and that our error message is not spammy when users enter an invalid `TORCH_LOGS=....` argument. But I tried to avoid using expecttests, since people occasionally add new logging artifacts and I didn't want to add too much churn by forcing this to fail CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151678
Approved by: https://github.com/desertfire, https://github.com/zou3519
Fixes #150367
This PR builds the decomposition table from the ONNX registry, which includes registered ops beyond just ATen and prim. This helps keep the custom ops specified in the custom_translation table from being decomposed during ONNX export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151826
Approved by: https://github.com/justinchuby
Summary: tree_flatten_with_map internally calls the unflatten function with a user-supplied function, but that function was not returning anything, causing the leaves to be None. This is wrong when the constructor is sensitive to this behavior.
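A simplified illustration of the bug pattern (hypothetical names, not the actual pytree code):
```python
# The wrapper applied the user-supplied function but dropped its result,
# so every mapped leaf came back as None.
def buggy_apply(user_fn, leaf):
    user_fn(leaf)            # result discarded -> leaf becomes None

def fixed_apply(user_fn, leaf):
    return user_fn(leaf)     # propagate the mapped leaf
```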
Test Plan: CI
Differential Revision: D73388529
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151824
Approved by: https://github.com/bdhirsh
Summary: Also check the module's named buffers and parameters when resolving name collisions
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```
Differential Revision: D73264885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151684
Approved by: https://github.com/angelayi