Refactor release-only changes into a two-step execution.
1. Step ``tag-docker-images.sh``: tags the latest Docker images for the current release. This step takes about 30 minutes to complete. It may fail due to space issues on the local host or HTTP connection errors when pulling images, and should be rerun if it fails.
2. Step ``apply-release-changes.sh``: applies the release-only changes and prepares a PR with them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121728
Approved by: https://github.com/jeanschmidt
It looks like it was commented out because the original implementation was not sufficiently portable. I had to do some rewrites to the innards to make it portable. No Windows nanoseconds support because I'm lazy.
I tested by running `build/bin/TCPStoreTest` and observing the log messages there. I am actually not sure how to look at the log messages from Python though.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121384
Approved by: https://github.com/Skylion007, https://github.com/malfet
Summary:
`np.asscalar` was deprecated and removed in a recent NumPy release. It used to be implemented as follows, and the recommended alternative is to call `item()` directly:
```python
def asscalar(a):
    return a.item()
```
This fixes all of the references.
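For reference, a minimal before/after sketch of the migration (purely illustrative):
```python
import numpy as np

a = np.array([1.5])

# Before (np.asscalar was removed in recent NumPy releases):
#   value = np.asscalar(a)

# After: call item() directly to get the Python scalar.
value = a.item()
assert isinstance(value, float)
```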
Test Plan: visual inspection and automated tests
Differential Revision: D54697760
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121545
Approved by: https://github.com/malfet
Summary:
## Context
This changeset lays the foundations for supporting dynamic shapes in the ExecuTorch Vulkan delegate by allowing tensors to be resized in one of two ways:
1. Discarding the underlying `vkImage` or `vkBuffer` and reallocating a new `vkImage` or `vkBuffer` with updated sizes. This method is intended for when the current `vkImage` or `vkBuffer` is not large enough to contain the new sizes.
2. Updating the tensor's size metadata without reallocating any new resources. This allows shaders to interpret the underlying `vkImage` or `vkBuffer` as if it were smaller than it actually is, and allows command buffers to be preserved when sizes change.
Test Plan: Check CI. Tests have also been added to `vulkan_compute_api_test` that test the two methods of tensor resizing.
Differential Revision: D54728401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121598
Approved by: https://github.com/jorgep31415
Summary:
We don't want people to move to NCCL exp without explicit opt-in. It seems that sparse allreduce was accidentally called, and people were confused about whether they should use NCCL exp instead.
Update the error message to explicitly say that sparse_allreduce is not supported.
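For context, a hedged sketch of the call path that surfaces this error (assumes an already-initialized NCCL process group, e.g. via torchrun; tensor values are illustrative):
```python
import torch
import torch.distributed as dist

# Build a small sparse COO tensor on the GPU.
indices = torch.tensor([[0, 2]])
values = torch.tensor([1.0, 2.0])
sparse = torch.sparse_coo_tensor(indices, values, size=(4,), device="cuda")

# With the default NCCL backend this is expected to fail, now with an error
# message that explicitly states sparse all_reduce is not supported.
dist.all_reduce(sparse)
```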
Test Plan: sandcastle
Differential Revision: D54759307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
Summary:
X-link: https://github.com/pytorch/executorch/pull/2308
Note: The initial purpose of this PR is to gather suggestions and feedback regarding better alternatives, if any.
At present, the dequantize op for the decomposed quantized Tensor representation, e.g. dequantize_per_tensor(), assumes the output dtype is torch.float and hence does not have the output dtype in its operator argument list. However, this op signature becomes unusable when that assumption breaks: if the output dtype is different from torch.float, there is no way to specify it during dequantization.
This change is aimed at generalizing the signature of dequantize ops like dequantize_per_tensor() for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also like suggestions and feedback regarding any better alternative that could be used instead.
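For illustration, a hedged reference sketch of the generalized signature being proposed (the argument names follow the description above; ordering, defaults, and the Python-level form are assumptions, not the final operator schema):
```python
import torch

def dequantize_per_tensor(
    input: torch.Tensor,
    scale: float,
    zero_point: int,
    quant_min: int,
    quant_max: int,
    dtype: torch.dtype,                        # dtype of the quantized input, e.g. torch.int8
    output_dtype: torch.dtype = torch.float,   # proposed: dtype of the dequantized output
) -> torch.Tensor:
    # Reference semantics only: affine dequantization, cast to the requested output dtype.
    return ((input.to(torch.float32) - zero_point) * scale).to(output_dtype)
```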
cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel
Reviewed By: digantdesai
Differential Revision: D53590486
Pulled By: manuelcandales
Co-authored-by: kausik <kmaiti@habana.ai>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450
Approved by: https://github.com/jerryzh168
Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore.
Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path, which is active only when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When the pin updates, we'll make those tests official (left a TODO comment).
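For reference, a hedged sketch of the kind of user-defined Triton kernel whose reduction produces a `tt.reduce` op in the Triton IR (the kernel below is illustrative, not one of the actual tests):
```python
import triton
import triton.language as tl

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    total = tl.sum(x, axis=0)  # this reduction lowers to a tt.reduce op
    tl.store(out_ptr, total)
```
When such a kernel is called from a `torch.compile`-d function, the analysis pass now walks through the `tt.reduce` op instead of bailing out.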
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706
Approved by: https://github.com/jansel
Reduces the `torch.compile(backend="eager")` compilation time for this code
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)
    return x
~~~
from 18 seconds to 12 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031
Approved by: https://github.com/jansel
In this PR, we create another dynamic test class for TestExport tests that basically serializes/deserializes the pre-dispatch IR. I encountered 4 additional failures, but 3 of them are due to a different operator showing up in the graph, and only one is a legitimate failure, which is tracked by another task internally.
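Conceptually, the new test class round-trips each exported program through serialization before running the usual assertions. A hedged sketch of that flow using the public export APIs (the test class itself relies on internal pre-dispatch export helpers):
```python
import io
import torch
from torch.export import export, save, load

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin() + 1

x = torch.randn(4)
ep = export(M(), (x,))   # produce an ExportedProgram
buf = io.BytesIO()
save(ep, buf)            # serialize
buf.seek(0)
ep2 = load(buf)          # deserialize and re-run
torch.testing.assert_close(ep2.module()(x), M()(x))
```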
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678
Approved by: https://github.com/angelayi
ghstack dependencies: #121652
This PR enables `test_addmm_sizes_all_sparse_csr_k_*_n_*_m_*_cuda_complex128` for ROCm for the trivial cases (m, n, or k = 0).
CUSPARSE_SPMM_COMPLEX128_SUPPORTED is also used for `test_addmm_all_sparse_csr` and `test_sparse_matmul`, and both of them are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton):
- [x] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
  * MI300X is now supported. More architectures will be added once Triton supports them.
- [x] Only supports power-of-two sequence lengths.
  * Arbitrary sequence lengths are now supported.
- [ ] No support for varlen APIs.
  * varlen APIs will be supported in the next release of AOTriton.
- [x] Only supports head dimensions 16, 32, 64, and 128.
  * Arbitrary head dimensions <= 256 are now supported.
- [x] Performance is still being optimized.
  * Kernels are selected according to autotune information from Triton.
Other improvements from AOTriton include:
* More flexible Tensor storage layouts
* A more flexible API
This is a more extensive fix to #112997
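As a usage note, a hedged sketch of how the flash-attention path addressed above is typically exercised on a supported ROCm GPU (shapes, dtypes, and the backend-selection context manager are illustrative of existing PyTorch APIs, not something added by this patch):
```python
import torch
import torch.nn.functional as F

# Force the flash backend so the Triton-based kernels are used (on supported GPUs).
q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
```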
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
Summary: For OEMAE, this contributes 14% of the total DPER pass perf gain.
Test Plan:
Run test cases
Run the oemae lower benchmark with and without this fix. FLOP/s: 29 -> 34.
Reviewed By: frank-wei
Differential Revision: D54711064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121674
Approved by: https://github.com/frank-wei
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch, making the code cleaner and faster.
@albanD This is the GIL change extracted from #112607 as discussed.
Also fixes a potential use of a moved-from object introduced in #116560:
- `f` is captured by value in a lambda that may be called multiple times
- After `std::move(f)`, the lambda is no longer safe to call
CC @cyyever for that change
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695
Approved by: https://github.com/albanD, https://github.com/Skylion007
Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead.
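A hedged sketch of the intended usage after the move (exports are assumed to mirror `torch._dynamo.test_case`; the test body is a placeholder):
```python
from torch._inductor.test_case import TestCase, run_tests

class MyFxGraphCacheTest(TestCase):
    def test_something(self):
        # Real tests exercise torch.compile with the FX graph cache enabled.
        self.assertTrue(True)

if __name__ == "__main__":
    run_tests()
```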
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520
Approved by: https://github.com/eellison
Summary: Does not change the weights structure, so it is compatible with const folding and realtime weights updates.
Test Plan: run added test cases
Reviewed By: frank-wei
Differential Revision: D53843428
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121617
Approved by: https://github.com/frank-wei
Summary: Taking the rightmost part of the FQN can cause name conflicts when there are multiple instances of the same class. Changed to replace "." in the FQN with "_" to avoid invalid syntax in input args.
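A tiny illustrative sketch of the renaming scheme (the helper name is hypothetical):
```python
def mangle_fqn(fqn: str) -> str:
    # Keep the full qualified name but replace "." with "_", so two instances of
    # the same class (e.g. "encoder.linear" and "decoder.linear") no longer
    # collide and the result stays a valid identifier for input args.
    return fqn.replace(".", "_")

assert mangle_fqn("encoder.linear") == "encoder_linear"
assert mangle_fqn("encoder.linear") != mangle_fqn("decoder.linear")
```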
Test Plan: added test case
Differential Revision: D54435230
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145
Approved by: https://github.com/zhxchen17
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 python benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```
sam_fast is designed for inference only, with CUDA and AMP enabled. The code adds these restrictions to the benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
Summary:
This fixes a case left incomplete by https://github.com/pytorch/pytorch/pull/106229.
The object uses __prepare_scriptable__ correctly inside of torch.jit.script(),
but the closure that is obtained below uses the non-prepared version.
This causes issues when the prepared and non-prepared versions are in different Python modules.
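For context, a hedged minimal sketch of the hook involved (the module and method bodies are illustrative):
```python
import torch

class MyModule(torch.nn.Module):
    def forward(self, x):
        return x + 1

    def __prepare_scriptable__(self):
        # torch.jit.script() calls this hook when it is defined and scripts the
        # returned object instead of `self`; the fix ensures the closure captured
        # during scripting also refers to this prepared version.
        return self

scripted = torch.jit.script(MyModule())
print(scripted(torch.ones(2)))
```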
Test Plan:
```
buck2 run mode/opt caffe2/test:jit -- -r test_decorator
```
Differential Revision: D54308741
Re-exporting, as #120806 and #121307 were not properly merged.
Co-authored-by: Daniel Herrera <dherrera@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121553
Approved by: https://github.com/huydhn, https://github.com/seemethere
Update the torch_xla pin to a more recent one (03/08/2024). We need to make sure the torch_xla pin stays up-to-date so that PyTorch can be tested against an up-to-date version of torch_xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121529
Approved by: https://github.com/atalman
Follow-up to #119326, addressing this comment: https://github.com/pytorch/pytorch/pull/119326#issuecomment-1939428705:
> I'd like to propose a slightly different approach. We could check if scipy is version `1.12.0`. If so, overload `scipy_cumulative_trapezoid` with a function that specifically checks `t.shape[axis] == 0`, and in that case return an array of the same shape as `t`, which is the expected behavior as far as I understand. That way, we're not just skipping the test cases
I would like to add that the version check is not necessary, as the outcome is the same in either case.
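A hedged sketch of the overload described in the quoted comment (the helper name follows the quote; the exact handling in the test file may differ):
```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

def scipy_cumulative_trapezoid(y, x=None, axis=-1):
    # scipy 1.12.0 rejects zero-length inputs, so short-circuit: for an empty
    # axis, return an empty array of the same shape as the input.
    if y.shape[axis] == 0:
        return np.empty_like(y)
    return cumulative_trapezoid(y, x=x, axis=axis)
```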
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121541
Approved by: https://github.com/nWEIdia, https://github.com/albanD
Fixes round-robin sharding when there are no test times and sort_by_time=False.
Adds more tests to test_test_selections for sort_by_time=False.
Adds more checks to test_split_shards_random for serial/parallel ordering and the ordering of tests.
Refactors duplicated code.
Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar