Commit Graph

2158 Commits

Author SHA1 Message Date
Kurt Mohler
cabc09a5f2 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-02-28 00:37:33 +00:00
Sergii Dymchenko
bd9db6a9c7 Update to TorchFix 0.4.0 (#119424)
`torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seems obvious to do, otherwise `noqa: TOR901` added - see https://github.com/pytorch/pytorch/pull/118318 for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424
Approved by: https://github.com/zou3519
2024-02-12 23:30:12 +00:00
Hirochika Matsumoto
02c24b0b5e Add Python binding resizable to class {Untyped,Typed}Storage (#119286)
This PR exposes the `resizable` method of `StorageImpl` to the Python frontend, making it accessible to users.

Fixes #119233
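
A minimal usage sketch (assuming the binding surfaces as a `resizable()` method on storage objects, mirroring the C++ accessor):

```python
import torch

s = torch.randn(4).untyped_storage()
print(s.resizable())  # True for an ordinary heap-allocated storage
```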

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-02-07 19:15:55 +00:00
CaoE
113138aa55 add test cases for GradScaler on CPU (#109994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-02-02 21:49:07 +00:00
Yifu Wang
0f7e63620f CUDA fast path for split_with_sizes_copy.out (#117203)
### Motivation
In per-parameter sharding FSDP, each rank holds one shard of every parameter. Before a bucket of parameters is used, FSDP performs an all-gather to reconstruct the full parameters. The following example demonstrates the process for `world_size=2`, `num_params=3` (`A`, `B`, `C` stand for values in params `A`, `B`, `C`):

All-gather output:
```
AAAABBBCCAAAABBBCC
```

After all-gather-copy-out:
```
AAAAAAAA  BBBBBB  CCCC
```

The performance of all-gather-copy-out is crucial for the viability of per-parameter sharding FSDP. After thorough experiments, we believe that acceptable performance for this op is not achievable via composing existing ATen ops today.

We have proven that ideal performance is achievable with a [custom kernel](https://github.com/pytorch/pytorch/pull/115515). This PR aims to incorporate the optimizations to appropriate ATen ops (as suggested by @albanD).

### all-gather-copy-out via Composing ATen Ops

Carrying out the op via composing ATen ops involves a combination of view ops and copy ops. After thorough experiments, we found that the most natural/performant way to express the op is via `split_with_sizes` + `_foreach_copy_`, which works as follows:

Reshape all-gather output as (world_size, -1):
```
AAAABBBCC
AAAABBBCC
```

`split_with_sizes` + `_foreach_copy_`:
```
AAAA BBB CC
AAAA BBB CC
```
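
A minimal sketch of this baseline composition (the shapes are illustrative; `splits` holds the per-rank numel of each param):

```python
import torch

world_size, splits = 2, [4, 3, 2]  # params A, B, C hold 4/3/2 elements per rank
allgather_out = torch.randn(world_size * sum(splits), device="cuda")

# Reshape to (world_size, -1): each row is one rank's shard of every param.
mat = allgather_out.view(world_size, -1)

# O(num_params) intermediate views, shapes (2, 4), (2, 3), (2, 2).
views = torch.split_with_sizes(mat, splits, dim=1)

# Dense destination buffers, one per full (unsharded) param.
params = [torch.empty(world_size, s, device="cuda") for s in splits]
torch._foreach_copy_(params, views)
```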

However, the performance of this approach is still far below that of the custom kernel. We've identified the following reasons:
- The approach requires materializing `O(num_params)` intermediate views, which induces a large amount of CPU overhead when `num_params` is high.
- `_foreach_copy_` uses the same block size for all tensors, leading to waste for small tensors and an insufficient thread count for large tensors. This means low effective occupancy.
- `_foreach_copy_` dispatches multiple kernels for typical all-gather-copy-out problem sizes. This further lowers the effective occupancy.
- Due to the nature of the workload, the underlying copies are unaligned. `_foreach_copy_` isn't aggressive enough in exploiting vectorization opportunities in such workloads.

### PR
Introduces a CUDA backend for `split_with_sizes_copy.out` that addresses the above inefficiencies. See code for details.
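
A hedged sketch of driving the new path from Python (assuming the `out=` overload is reachable as `torch.split_with_sizes_copy`):

```python
import torch

world_size, splits = 2, [4, 3, 2]
mat = torch.randn(world_size, sum(splits), device="cuda")
outs = [torch.empty(world_size, s, device="cuda") for s in splits]

# One fused kernel instead of O(num_params) views plus a foreach copy.
torch.split_with_sizes_copy(mat, splits, dim=1, out=outs)
```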

### Benchmarks
The benchmarks are conducted on a set of representative problem sizes on an A100. CPU overhead and GPU execution time are measured separately, as reasonable CPU overhead doesn't directly affect e2e throughput. The reported copy bandwidth is calculated from the GPU execution time.

We observe 3x-10x higher throughput than the baseline depending on the problem size, as well as lower CPU overhead across the board.

Baseline:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.141 GB/s (gpu ms/iter: 8.384, cpu ms/iter 0.151)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 73.237 GB/s (gpu ms/iter: 0.874, cpu ms/iter 10.664)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 259.902 GB/s (gpu ms/iter: 5.609, cpu ms/iter 0.584)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 238.703 GB/s (gpu ms/iter: 2.158, cpu ms/iter 0.612)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 205.144 GB/s (gpu ms/iter: 0.987, cpu ms/iter 0.559)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 270.467 GB/s (gpu ms/iter: 3.635, cpu ms/iter 0.073)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 267.700 GB/s (gpu ms/iter: 2.997, cpu ms/iter 0.133)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.913 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.093)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 266.589 GB/s (gpu ms/iter: 8.433, cpu ms/iter 0.207)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 135.107 GB/s (gpu ms/iter: 1.495, cpu ms/iter 10.904)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 258.675 GB/s (gpu ms/iter: 5.890, cpu ms/iter 0.996)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 238.919 GB/s (gpu ms/iter: 2.408, cpu ms/iter 0.765)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 209.836 GB/s (gpu ms/iter: 1.172, cpu ms/iter 0.611)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 270.607 GB/s (gpu ms/iter: 3.720, cpu ms/iter 0.100)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 266.375 GB/s (gpu ms/iter: 3.071, cpu ms/iter 0.176)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 270.601 GB/s (gpu ms/iter: 5.952, cpu ms/iter 0.099)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.558 GB/s (gpu ms/iter: 8.371, cpu ms/iter 0.207)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 43.749 GB/s (gpu ms/iter: 0.797, cpu ms/iter 10.531)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 254.084 GB/s (gpu ms/iter: 3.781, cpu ms/iter 0.752)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 216.792 GB/s (gpu ms/iter: 1.299, cpu ms/iter 0.717)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 188.025 GB/s (gpu ms/iter: 0.793, cpu ms/iter 0.633)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 267.793 GB/s (gpu ms/iter: 2.447, cpu ms/iter 0.107)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.232 GB/s (gpu ms/iter: 2.401, cpu ms/iter 0.182)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.455 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.089)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.633 GB/s (gpu ms/iter: 6.394, cpu ms/iter 0.177)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 46.698 GB/s (gpu ms/iter: 0.807, cpu ms/iter 10.488)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 253.450 GB/s (gpu ms/iter: 3.799, cpu ms/iter 0.655)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 216.857 GB/s (gpu ms/iter: 1.307, cpu ms/iter 0.671)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 189.059 GB/s (gpu ms/iter: 0.799, cpu ms/iter 0.572)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 269.849 GB/s (gpu ms/iter: 2.429, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.501 GB/s (gpu ms/iter: 2.399, cpu ms/iter 0.149)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.426 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.086)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.495 GB/s (gpu ms/iter: 6.398, cpu ms/iter 0.170)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 101.151 GB/s (gpu ms/iter: 1.211, cpu ms/iter 10.476)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 252.323 GB/s (gpu ms/iter: 3.963, cpu ms/iter 0.633)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 218.322 GB/s (gpu ms/iter: 1.455, cpu ms/iter 0.622)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 196.369 GB/s (gpu ms/iter: 0.944, cpu ms/iter 0.576)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 269.369 GB/s (gpu ms/iter: 2.491, cpu ms/iter 0.076)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 264.441 GB/s (gpu ms/iter: 2.439, cpu ms/iter 0.140)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 269.955 GB/s (gpu ms/iter: 3.978, cpu ms/iter 0.073)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.168 GB/s (gpu ms/iter: 6.405, cpu ms/iter 0.147)
```
New kernel:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 0.099)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 789.754 GB/s (gpu ms/iter: 2.847, cpu ms/iter 0.138)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 565.667 GB/s (gpu ms/iter: 0.113, cpu ms/iter 0.996)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 670.681 GB/s (gpu ms/iter: 2.174, cpu ms/iter 0.289)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 676.135 GB/s (gpu ms/iter: 0.762, cpu ms/iter 0.264)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 662.603 GB/s (gpu ms/iter: 0.306, cpu ms/iter 0.249)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 769.283 GB/s (gpu ms/iter: 1.278, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 761.057 GB/s (gpu ms/iter: 1.054, cpu ms/iter 0.104)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 774.325 GB/s (gpu ms/iter: 2.031, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 773.048 GB/s (gpu ms/iter: 2.908, cpu ms/iter 0.099)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 641.405 GB/s (gpu ms/iter: 0.315, cpu ms/iter 0.616)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 646.772 GB/s (gpu ms/iter: 2.356, cpu ms/iter 0.276)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 658.157 GB/s (gpu ms/iter: 0.874, cpu ms/iter 0.278)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 642.032 GB/s (gpu ms/iter: 0.383, cpu ms/iter 0.245)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 728.990 GB/s (gpu ms/iter: 1.381, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 689.763 GB/s (gpu ms/iter: 1.186, cpu ms/iter 0.102)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 765.507 GB/s (gpu ms/iter: 2.104, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 757.626 GB/s (gpu ms/iter: 2.967, cpu ms/iter 0.106)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 584.272 GB/s (gpu ms/iter: 0.060, cpu ms/iter 0.656)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 728.234 GB/s (gpu ms/iter: 1.319, cpu ms/iter 0.264)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 730.059 GB/s (gpu ms/iter: 0.386, cpu ms/iter 0.279)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 670.899 GB/s (gpu ms/iter: 0.222, cpu ms/iter 0.274)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 775.699 GB/s (gpu ms/iter: 0.845, cpu ms/iter 0.077)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 773.612 GB/s (gpu ms/iter: 0.820, cpu ms/iter 0.112)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 781.395 GB/s (gpu ms/iter: 1.342, cpu ms/iter 0.081)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 789.156 GB/s (gpu ms/iter: 2.169, cpu ms/iter 0.116)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 517.056 GB/s (gpu ms/iter: 0.073, cpu ms/iter 0.632)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 684.246 GB/s (gpu ms/iter: 1.407, cpu ms/iter 0.294)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 680.593 GB/s (gpu ms/iter: 0.416, cpu ms/iter 0.286)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 682.197 GB/s (gpu ms/iter: 0.221, cpu ms/iter 0.255)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 759.470 GB/s (gpu ms/iter: 0.863, cpu ms/iter 0.074)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 765.694 GB/s (gpu ms/iter: 0.829, cpu ms/iter 0.094)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 766.535 GB/s (gpu ms/iter: 1.368, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 787.608 GB/s (gpu ms/iter: 2.173, cpu ms/iter 0.105)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 640.203 GB/s (gpu ms/iter: 0.191, cpu ms/iter 0.668)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 713.947 GB/s (gpu ms/iter: 1.401, cpu ms/iter 0.274)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 642.855 GB/s (gpu ms/iter: 0.494, cpu ms/iter 0.276)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 643.297 GB/s (gpu ms/iter: 0.288, cpu ms/iter 0.262)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 690.626 GB/s (gpu ms/iter: 0.972, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 754.431 GB/s (gpu ms/iter: 0.855, cpu ms/iter 0.109)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 769.985 GB/s (gpu ms/iter: 1.395, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 766.337 GB/s (gpu ms/iter: 2.233, cpu ms/iter 0.103)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117203
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #118512
2024-02-01 18:23:01 +00:00
CaoE
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.
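
A minimal CPU AMP sketch (assuming the scaler lands as `torch.cpu.amp.GradScaler`, mirroring the CUDA API):

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cpu.amp.GradScaler()

x, y = torch.randn(4, 8), torch.randn(4, 1)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(opt)   # unscales grads and skips the step if they contain inf/NaN
scaler.update()
```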

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Mikayla Gawarecki
41a56f7828 Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955)
This PR intends to fix the following issue when swapping two tensors

```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581,  0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581,  0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```

What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.

57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)

When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.

The TensorImpls should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`.
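
With the fix, the repro above behaves as expected (sketch, continuing the session):

```python
>>> t1.fill_(0.5)  # t1 keeps its swapped (3-element) content
tensor([0.5000, 0.5000, 0.5000])
```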

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
2024-01-24 01:40:18 +00:00
Kurt Mohler
cd084c4909 Add TensorIteratorConfig::add_const_input to avoid COW materialize (#118053)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118053
Approved by: https://github.com/ezyang
2024-01-23 22:32:39 +00:00
Oguz Ulgen
3b38f7b266 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 16:11:38 +00:00
PyTorch MergeBot
bb28965924 Revert "Remove skips for passing tests (#118000)"
This reverts commit 3c339b5b21.

Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752))
2024-01-23 06:10:25 +00:00
Oguz Ulgen
3c339b5b21 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 03:41:23 +00:00
haozhe.zhu@intel.com
0ae952db76 enable mkldnn bf32 matmul (#116015)
### Testing
FP32 matmul vs. mkldnn BF32 matmul on SPR

single core:

Input | BF32 / ms | FP32 / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:
Input | BF32 / ms | FP32 / ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 1129.204 | 1708.912 | 1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915 | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 | 8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128 | 1296.419 | 1308.411 | 1.009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-20 09:30:23 +00:00
CaoE
29516bd2a0 add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
Step1 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-16 15:25:08 +00:00
Edward Z. Yang
2200118f59 Enable some uint{16,32,64} tests that are working (#116809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116809
Approved by: https://github.com/albanD
2024-01-15 02:25:21 +00:00
Edward Z. Yang
edec54b9de Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #113397
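
A small sketch of the new API (checked with the private `torch._C._is_cow_tensor` helper the tests use):

```python
import torch

a = torch.randn(3)
b = torch._lazy_clone(a)           # shares storage, copy-on-write
print(torch._C._is_cow_tensor(b))  # True
b.add_(1)                          # a write triggers materialization
print(torch._C._is_cow_tensor(b))  # False: b now owns its own copy
```
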
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
2024-01-11 01:32:44 +00:00
Edward Z. Yang
8bcdde5058 Support uint{16,32,64} deterministic empty fill and scalar Python binding handling (#116807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116807
Approved by: https://github.com/albanD
ghstack dependencies: #116805, #116806
2024-01-10 02:17:23 +00:00
Edward Z. Yang
43a23a704a Support uint{16,32,64} copy (#116806)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116806
Approved by: https://github.com/albanD
ghstack dependencies: #116805
2024-01-10 02:17:23 +00:00
Edward Z. Yang
2e983fcfd3 Support unsigned int for randint, item, equality, fill, iinfo, tensor (#116805)
These are some basic utilities that are often used for testing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116805
Approved by: https://github.com/albanD
2024-01-10 02:17:23 +00:00
Aaron Gokaslan
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parenthesis i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
Yanbo Liang
f657b2b1f8 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on custom functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-27 18:47:05 +00:00
PyTorch MergeBot
3b709d7c1e Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)"
This reverts commit 015bd0e0a1.

Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506))
2023-12-26 23:47:15 +00:00
Yanbo Liang
015bd0e0a1 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on custom functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-23 09:44:09 +00:00
Mikayla Gawarecki
f206e31e2f Swap slots if slots match in swap_tensor (#116128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128
Approved by: https://github.com/albanD
2023-12-21 00:43:30 +00:00
Kurt Mohler
8a8d0adc0b Fix torch.gradient check for spacing arg list length (#115686)
Fixes #114207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686
Approved by: https://github.com/albanD
2023-12-13 20:17:20 +00:00
mantaionut
d521857411 Terminate handler (#101332)
Fixes #50051.
This PR is based on #50320, and I address the last feedback.
On Windows it is enabled by default. It can be enabled or disabled via the USE_CUSTOM_TERMINATE env variable.

This PR adds support for overriding the terminate handler in order to log uncaught exceptions in threads.
If an exception is thrown and not caught, it will print `<Unhandled exception caught in c10/util/AbortHandler.h>`.
The point of doing this is that in issue #50051, exceptions were thrown but not logged. With this logging system it will be easier to debug such issues in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-12 17:55:27 +00:00
ecao
65651d970b Optimize the copy of Half to Float and Float to Half on CPU (#103148)
### Description
Optimize the copy of Half to Float and Float to Half on CPU.

### Testing

Single core:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (1, 777) | 0.00345 | 0.00344 | 0.00411 | 0.00410
size: (2, 512) | 0.00355 | 0.00344 | 0.00431 | 0.00400
size: (10, 555) | 0.00473 | 0.00391 | 0.00562 | 0.00477
size: (1, 2048, 1024) | 0.488 | 0.480 | 0.498 | 0.499
size: (32, 100, 777) | 0.584 | 0.568 | 0.571 | 0.587

28 cores:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (10, 555) |  0.00472 | 0.00369 | 0.00576 |  0.00481
size: (1, 2048, 1024) |  0.0189 | 0.0188 | 0.0173 | 0.0251
size: (64, 512, 1024) | 3.159 | 2.375 |  3.152 | 2.358
size: (32, 100, 777) | 0.0225 | 0.0195 | 0.0193 | 0.0261
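
The conversions being measured, in miniature (using `torch.utils.benchmark`; the shape is one from the tables above):

```python
import torch
from torch.utils.benchmark import Timer

x = torch.randn(1, 2048, 1024, dtype=torch.half)
print(Timer("x.to(torch.float)", globals={"x": x, "torch": torch}).timeit(100))  # fp16 -> fp32
y = x.float()
print(Timer("y.to(torch.half)", globals={"y": y, "torch": torch}).timeit(100))   # fp32 -> fp16
```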

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103148
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-12 05:57:52 +00:00
FFFrog
3361496f96 Fix the corner case of index_add (#114929)
Fixes #114864

As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114929
Approved by: https://github.com/mikaylagawarecki
2023-12-09 01:57:25 +00:00
albanD
a2b89154bf New swap function (#111747)
This PR proposes a new approach to solving the problem that nn and optim are linked only by Python object identity.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.
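
A minimal sketch of the intended behavior, using the `torch.utils.swap_tensors` name this lands under:

```python
import torch

t1, t2 = torch.zeros(2), torch.ones(3)
ref = t1                          # a pre-existing reference to t1's PyObject
torch.utils.swap_tensors(t1, t2)
print(ref)                        # tensor([1., 1., 1.]): the old reference
                                  # now sees the swapped-in content
```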

This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.

The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the use of slots in subclasses.

Draft for now, to see what @colesbury thinks about doing this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
Kurt Mohler
6f32eb7eef Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-12-01 18:56:09 +00:00
PyTorch MergeBot
013675ff59 Revert "Add decomp for replication_pad2d and use for CUDA deterministic (#111590)"
This reverts commit f1286161a6.

Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing XLA job.  The job is also failing on the PR, but the log classifier failed to find the failed test which lead to it being marked wrongly as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794))
2023-11-30 02:28:14 +00:00
Kurt Mohler
f1286161a6 Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-11-29 21:50:46 +00:00
PyTorch MergeBot
fe428a284b Revert "Add torch._lazy_clone to create COW tensors (#113397)"
This reverts commit 9916d8a9ea.

Reverted https://github.com/pytorch/pytorch/pull/113397 on behalf of https://github.com/DanilBaibak due to Unfortunately, I need to revert your PR because the lower [PR in the stack](https://github.com/pytorch/pytorch/pull/113396) is failing a bunch of internal build jobs. ([comment](https://github.com/pytorch/pytorch/pull/113397#issuecomment-1818761224))
2023-11-20 10:21:09 +00:00
PyTorch MergeBot
d40d72d664 Revert "Skip test_lazy_clone for Inductor (#114012)"
This reverts commit ecd8d388b9.

Reverted https://github.com/pytorch/pytorch/pull/114012 on behalf of https://github.com/DanilBaibak due to I revert the PR due to the original changes broke the internal build. Here is the original diff stack [D51444337](https://www.internalfb.com/diff/D51444337) ([comment](https://github.com/pytorch/pytorch/pull/114012#issuecomment-1818745425))
2023-11-20 10:12:44 +00:00
Nikita Shulga
ecd8d388b9 Skip test_lazy_clone for Inductor (#114012)
As half of those tests fail if run individually, but the first failure masks all subsequent ones, i.e.
```
PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -v -k test_lazy_clone_cuda_float32
test_lazy_clone_cuda_float32 (__main__.TestTorchDeviceTypeCUDA) ... FAIL
...
   self.assertTrue(torch._C._is_cow_tensor(t))
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1 test in 19.419s

FAILED (failures=1)
```
But
```
$ PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -k test_lazy_clone_
...
......................
----------------------------------------------------------------------
Ran 24 tests in 24.969s

OK
```
This flaky behavior was already detected, for example see https://github.com/pytorch/pytorch/issues/113953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114012
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-11-18 04:57:00 +00:00
Kurt Mohler
9916d8a9ea Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
ghstack dependencies: #113396
2023-11-17 01:58:51 +00:00
Brian Hirsh
cebad9867b graph break on intermediate leaves that require grad (#113277)
fixes https://github.com/pytorch/pytorch/issues/90552. This is a simpler fix that just detects the situation where AOTAutograd can't create a proper backward graph and graph breaks. This was technically a silent correctness issue before.

This PR tries to always graph break when we see a factory function that returns a tensor requiring grad. I check this by seeing if the op returned a `TensorVariable` in dynamo, and if one of the input arguments was a `requires_grad=True` kwarg. I think this is high-fidelity enough, and I'm also hoping that this is uncommon enough that a graph break is reasonable here.

The fix to avoid the graph break in user land is also pretty easy - just instantiate your tensor outside of the compiled region and plumb it in.
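
A sketch of the pattern that now graph breaks, plus the user-land workaround (function names are illustrative):

```python
import torch

@torch.compile
def f(x):
    # Factory call returning a leaf that requires grad: dynamo now graph
    # breaks here instead of tracing a silently-incorrect backward.
    w = torch.ones(3, requires_grad=True)
    return (x * w).sum()

# Workaround: create the tensor outside the compiled region and plumb it in.
w = torch.ones(3, requires_grad=True)

@torch.compile
def g(x, w):
    return (x * w).sum()
```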

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113277
Approved by: https://github.com/eellison
ghstack dependencies: #113267, #113416, #113584
2023-11-16 02:47:45 +00:00
Nikita Shulga
78f3937ee8 [BE] Handle errors in set_num_threads (#113684)
and `set_num_interop_threads`

Before this change, calling `torch.set_num_threads(2**65)` resulted in a segmentation fault; afterwards it becomes a good old runtime error:
```
% python -c "import torch;torch.set_num_threads(2**65)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Overflow when unpacking long
```

Similar to https://github.com/pytorch/pytorch/pull/60073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113684
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-11-15 06:17:41 +00:00
Kurt Mohler
8bdce9bb74 Fix UntypedStorage.resize_ to keep same CUDA device index (#113386)
Fixes #113300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113386
Approved by: https://github.com/albanD
2023-11-10 01:57:25 +00:00
Kurt Mohler
fd209543d5 Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)
Part of #109802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
2023-11-01 16:10:09 +00:00
PyTorch MergeBot
ace2713d1e Revert "Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)"
This reverts commit f1785373c0.

Reverted https://github.com/pytorch/pytorch/pull/111377 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111377#issuecomment-1784179040))
2023-10-29 17:41:55 +00:00
Nikita Shulga
b61efe1c2b Fix `torch.[size|stride](dim=None)` invocation (#111991)
Per the documentation, one should be able to explicitly pass the dim argument as None to get the tensor size across all dimensions/strides, but before this change it was incorrectly interpreted as a named tensor call.

Modify the `size` and `stride` signatures generated by `gen_pyi.py` to highlight that the overload with `None` returns a Tuple, while the one with `dim: _int` returns `int`.

Add regression test to validate the behavior, and remove the check for asserts from two named tensors tests (NamedTensors are dead, aren't they?)
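
The fixed call in miniature:

```python
import torch

t = torch.randn(2, 3)
print(t.size(dim=None))    # torch.Size([2, 3]); previously misparsed as a
                           # named-tensor call
print(t.stride(dim=None))  # (3, 1)
print(t.size(0))           # 2
```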

Fixes https://github.com/pytorch/pytorch/issues/111944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111991
Approved by: https://github.com/zou3519
2023-10-26 04:14:35 +00:00
Kurt Mohler
f1785373c0 Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)
Part of #109802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD
2023-10-26 02:39:06 +00:00
Nikita Shulga
7709382b50 Fix regression in torch.equal behavior for NaNs (#111699)
`torch.equal(x, x)` should return False if `x` is a tensor of floats, one of which is NaN.
So this renders some of the optimizations proposed in https://github.com/pytorch/pytorch/pull/100024 invalid, though as a result `torch.equal` will become much slower for identical floating point tensors.

Add a regression test that calls torch.equal on a tensor containing NaN.
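
The regression case in miniature:

```python
import torch

x = torch.tensor([1.0, float("nan")])
print(torch.equal(x, x))  # False: NaN != NaN, even within the same tensor
```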

Fixes https://github.com/pytorch/pytorch/issues/111251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111699
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-10-21 00:02:45 +00:00
CaoE
d1afb7d43d add Half support for multinomial on CPU (#104178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5, https://github.com/kulinseth, https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
Evgeni Burovski
48989bc820 trace frames with np.ndarray (#110512)
Fixes #109604

Resubmit gh-109715 + several skips and small fixes to make tests pass.

The main fix here is by @ysiraichi: previously, dynamo did not resume tracing numpy ndarrays after a graph break (see the sketch after the list below).
While at it, fix several small issues Yukio's fix uncovers:

- graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc)
- recognize array scalars in dynamo, treat them as 0D ndarrays
- make sure that iterating over torch.ndarray generates arrays not bare tensors
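
A tiny sketch of what the main fix enables: tracing resumes on the ndarray after a graph break (the `print` forces one):

```python
import numpy as np
import torch

@torch.compile(fullgraph=False)
def fn(x):
    a = np.sin(x)
    print("graph break")  # unsupported in the graph -> dynamo breaks here
    return np.cos(a)      # ndarray tracing now resumes after the break

print(fn(np.ones(3)))
```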

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512
Approved by: https://github.com/lezcano
2023-10-15 00:56:10 +00:00
CaoE
8713a1a363 add Half support for bernoulli on CPU (#104176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00
Prachi Gupta
53a9ac534c Added decorator skipRocmIfTorchInductor and skipped failing tests (#107760)
This PR adds a skip decorator which will disable tests in CI for ROCm inductor workflow. This new workflow will be coming in via https://github.com/pytorch/pytorch/pull/110544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107760
Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/atalman
2023-10-12 16:00:35 +00:00
Elias Ellison
cf1da9bd17 enable index add test (#111016)
Dynamo is swallowing a user exception when suppress_errors is set to True. There's an issue filed for that: https://github.com/pytorch/pytorch/issues/108798. In the meantime, we would still like the functionality in this test, which works without the default setting (i.e. don't suppress errors), to not regress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111016
Approved by: https://github.com/yanboliang
2023-10-11 19:41:35 +00:00
eellison
fb4b9e9c8e Re-enable a couple of fixed tests (#110770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110770
Approved by: https://github.com/yanboliang, https://github.com/int3, https://github.com/Skylion007
ghstack dependencies: #110651
2023-10-10 19:13:14 +00:00
eellison
c5f06b9753 Re-enable test_copy_transpose_math_view, neg_view/dce fix (#110651)
- neg view can just be lowered to neg() post functionalization
- we were treating all fallback kernels as not having side effects. We shouldn't DCE mutating fallback kernels - either mutations induced by the reinplacing pass or clone_ with unsupported arguments (complex)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110651
Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/malfet, https://github.com/Skylion007
2023-10-10 16:34:01 +00:00
jjsjann123
37567fdf31 Nvfuser cpp api deprecation attempt 2 (#110881)
attempting to re-try #110318 deprecating nvfuser c++ API

The warning has been updated to TORCH_WARN_ONCE;
the warning thrown inside torch::jit::fuser::cuda::isEnabled() is turned off and will be deprecated when we pull out the TorchScript integration in the follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110881
Approved by: https://github.com/davidberard98, https://github.com/NicolasHug
2023-10-10 08:07:03 +00:00
PyTorch MergeBot
bbdc8c7b05 Revert "deprecating nvfuser c++ API (#110318)"
This reverts commit bf0866fc16.

Reverted https://github.com/pytorch/pytorch/pull/110318 on behalf of https://github.com/davidberard98 due to too many warnings being thrown in torchvision https://github.com/pytorch/pytorch/issues/110857 ([comment](https://github.com/pytorch/pytorch/pull/110318#issuecomment-1753245449))
2023-10-09 15:41:50 +00:00
jjsjann123
bf0866fc16 deprecating nvfuser c++ API (#110318)
deprecating nvfuser c++ API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110318
Approved by: https://github.com/davidberard98
2023-10-07 02:25:21 +00:00
eellison
3812f2e40c Preserve layout on like constructors (#110242)
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed in. That graph is generated from running aot_autograd with decompositions. If the decompositions give incorrect strides, so will inductor.

This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations, or b) striding of non-dense outputs. Both of these are lower-pri compared to preserving channels last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout to handle arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in so that this would compose with the `replace_random` pass, and because the two pointwise ops will get fused anyway.
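
The preserved case in miniature (memory-format only, per the limitations above):

```python
import torch

@torch.compile
def f(x):
    return torch.rand_like(x)  # decomposed; the layout must survive aot_autograd

x = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)
print(f(x).is_contiguous(memory_format=torch.channels_last))  # True with this PR
```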

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3
2023-10-02 23:53:55 +00:00
Moritz Hennen
09c598745c Rename torch._C._TensorBase to TensorBase (#109940)
I have gone ahead and implemented the renaming of the type `torch._C._TensorBase` to a non-private class name `TensorBase`.
The changes also leave `torch._C._TensorBase` as an alias to the new type, both in the C++ code (70458768fb/torch/csrc/autograd/python_variable.cpp (L2196-L2197)) and in the corresponding `__init__.pyi.in` file (70458768fb/torch/_C/__init__.pyi.in (L1522)).

Fixes #109438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109940
Approved by: https://github.com/ezyang
2023-09-25 19:10:22 +00:00
Jez Ng
063a62622b Add memory overlap check to meta_copy_ (#108989)
Fixes `test_copy_many_to_one`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108989
Approved by: https://github.com/eellison
2023-09-12 23:28:14 +00:00
Kurt Mohler
4c5e43574c Reland 2: Add PyObject preservation for UntypedStorage (#109039)
Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039
Approved by: https://github.com/ezyang
2023-09-12 22:26:05 +00:00
PyTorch MergeBot
41bd0fde7e Revert "Remove fixed skips (#108674)"
This reverts commit ab9fb03d6f.

Reverted https://github.com/pytorch/pytorch/pull/108674 on behalf of https://github.com/huydhn due to Sorry for picking this up a bit late, but with https://github.com/pytorch/pytorch/pull/108647 reverted, these tests are failing again. So we need to wait for the PR to reland before we can land this change ([comment](https://github.com/pytorch/pytorch/pull/108674#issuecomment-1715202692))
2023-09-12 08:04:32 +00:00
PyTorch MergeBot
59f605be57 Revert "Reland 2: Add PyObject preservation for UntypedStorage (#109039)"
This reverts commit 419e4e17a2.

Reverted https://github.com/pytorch/pytorch/pull/109039 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing linter job in trunk, probably due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/109039#issuecomment-1715147020))
2023-09-12 07:26:11 +00:00
Kurt Mohler
419e4e17a2 Reland 2: Add PyObject preservation for UntypedStorage (#109039)
Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039
Approved by: https://github.com/ezyang
2023-09-12 01:19:40 +00:00
Li-Huai (Allan) Lin
b2cba439b4 Introduce Tensor overload to linspace and logspace (#104889)
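
A small sketch of the new overload (0-d tensor start/end):

```python
import torch

start, end = torch.tensor(0.0), torch.tensor(1.0)
print(torch.linspace(start, end, steps=5))
# tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])
```
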
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 23:30:40 +00:00
PyTorch MergeBot
a7f5abeade Revert "Introduce Tensor overload to linspace and logspace (#104889)"
This reverts commit 57e5239321.

Reverted https://github.com/pytorch/pytorch/pull/104889 on behalf of https://github.com/clee2000 due to sorry have to revert this to revert https://github.com/pytorch/pytorch/pull/107958 ([comment](https://github.com/pytorch/pytorch/pull/104889#issuecomment-1714305768))
2023-09-11 17:33:48 +00:00
Li-Huai (Allan) Lin
57e5239321 Introduce Tensor overload to linspace and logspace (#104889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 15:29:39 +00:00
Edward Z. Yang
137afe74e0 Don't fastpath conj copy when conj/neg bit mismatch (#108881)
Fixes https://github.com/pytorch/pytorch/issues/106051

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108881
Approved by: https://github.com/soulitzer
2023-09-08 20:44:43 +00:00
PyTorch MergeBot
68238606f3 Revert "Reland: Add PyObject preservation for UntypedStorage (#103907)"
This reverts commit 56b848157c.

Reverted https://github.com/pytorch/pytorch/pull/103907 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing torchdistx build which uses check_pyobj here 9c1b9f5cb2/src/python/torchdistx/_C/deferred_init.cc (L87) ([comment](https://github.com/pytorch/pytorch/pull/103907#issuecomment-1712121158))
2023-09-08 19:27:07 +00:00
Evgeni Burovski
1f20531939 fall back to eager on NotImplementedError (#107863)
Follow-up to https://github.com/pytorch/pytorch/pull/107710:

Help dynamo fall back to eager when compiling unimplemented numpy constructs:

- arrays of strings
- (arg){min, max} for complex types
- various arguments typed as NotImplemented (`np.ones(4, order="F")` etc)
- numpy functions which torch._numpy does not implement

To test, run (we do not implement arrays of strings)

```
import torch
import numpy as np

@torch.compile(fullgraph=False)
def fn():
    return np.asarray(["L", "U"])
```

and observe it compiles with fullgraph=False and fails with fullgraph=True

Fixes https://github.com/pytorch/pytorch/issues/107970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107863
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-09-07 21:22:20 +00:00
eellison
ab9fb03d6f Remove fixed skips (#108674)
These no longer fail with TEST_WITH_TORCHINDUCTOR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108674
Approved by: https://github.com/desertfire
2023-09-07 17:36:56 +00:00
Kurt Mohler
56b848157c Reland: Add PyObject preservation for UntypedStorage (#103907)
This relands #97470 after #102553 reverted it. This PR attempts to fix the internal failure by avoiding an unnecessary intermediate storage buffer allocation in `c10::newStorageImplFromRefcountedDataPtr`.

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103907
Approved by: https://github.com/ezyang
2023-09-07 04:24:11 +00:00
Kurt Mohler
3f88e3105f Reland: Remove remaining global set_default_dtype calls from tests (#108088)
Fixes #68972

Relands #107246

To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assert is only done if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that have required changes for this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
2023-09-07 03:04:34 +00:00
PyTorch MergeBot
43527d41a2 Revert "Remove fixed skips (#108674)"
This reverts commit 518cfda2dd.

Reverted https://github.com/pytorch/pytorch/pull/108674 on behalf of https://github.com/huydhn due to Sorry for reverting this, but one test is failing on inductor 518cfda2dd, and it seems easier to revert this than disabling the test ([comment](https://github.com/pytorch/pytorch/pull/108674#issuecomment-1709310192))
2023-09-07 00:56:46 +00:00
eellison
518cfda2dd Remove fixed skips (#108674)
These no longer fail with TEST_WITH_TORCHINDUCTOR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108674
Approved by: https://github.com/desertfire
2023-09-06 22:33:43 +00:00
PyTorch MergeBot
161ea463e6 Revert "Remove remaining global set_default_dtype calls from tests (#107246)"
This reverts commit aa8ea1d787.

Reverted https://github.com/pytorch/pytorch/pull/107246 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107246#issuecomment-1693838522))
2023-08-25 19:34:55 +00:00
Digant Desai
8a7a6867b9 [PyTorch][Tensor] Introduce tensor.dim_order (#106835)
Summary:
This is a stride-based attribute for a tensor, available in Python.

This can help inspect tensors generated using `torch.empty_permuted(.., physical_layout, ...)`, where physical_layout should match the dim_order returned here. `empty_permuted` will be renamed to use dim_order as the param name in the future. This also helps the Executorch export pipeline with implementing dim_order-based tensors.
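
A short sketch (assuming the attribute surfaces as `Tensor.dim_order()`):

```python
import torch

t = torch.empty(2, 3, 4, 5).to(memory_format=torch.channels_last)
print(t.dim_order())  # (0, 2, 3, 1): dims ordered from largest to smallest stride
```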

Differential Revision: D48134476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106835
Approved by: https://github.com/ezyang
2023-08-25 00:06:03 +00:00
Kurt Mohler
aa8ea1d787 Remove remaining global set_default_dtype calls from tests (#107246)
Fixes #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107246
Approved by: https://github.com/ezyang
2023-08-24 16:10:48 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
Catherine Lee
bc053070f8 Mark test_gradient_extreme_cases as slow for inductor (#107189)
test_gradient_extreme_cases_* takes ~5 minutes on the inductor sm86 shard, and possibly even longer on the inductor workflow since it's timing out right now (although I'm not sure what the difference between the two is), and sometimes the automatic slow-test detection doesn't catch it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107189
Approved by: https://github.com/ZainRizvi
2023-08-15 22:03:00 +00:00
Sam Larsen
3d00170b20 [inductor] fix test_dim_function_empty (#106994)
Summary: Looks like the assert syntax was just wrong

Test Plan:
PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py -k test_dim_function_empty
PYTORCH_TEST_WITH_AOT_EAGER=1 python test/test_torch.py -k test_dim_function_empty
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106994
Approved by: https://github.com/eellison
2023-08-11 21:38:53 +00:00
Kshiteej K
a899333ffc fix: nll_loss batch rule with negative ignore_idx (#106118)
We use the Python decompositions instead of writing our own batching rules.

Fixes https://github.com/pytorch/pytorch/issues/105736
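
The fixed case in miniature (vmapping `nll_loss` with a negative `ignore_index`):

```python
import torch
import torch.nn.functional as F

inp = torch.randn(4, 3, 5).log_softmax(-1)  # batch of 4 problems; N=3, C=5
tgt = torch.tensor([[0, -1, 2]] * 4)        # -1 entries are ignored
loss = torch.vmap(lambda i, t: F.nll_loss(i, t, ignore_index=-1))(inp, tgt)
print(loss.shape)  # torch.Size([4])
```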

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106118
Approved by: https://github.com/lezcano, https://github.com/zou3519
2023-08-04 07:43:02 +00:00
Fuzzkatt
ae1c0f42a3 update tf32 thresholds for H100 (#105879)
Addresses tf32 threshold related failures from NVIDIA internal testing for following unit tests:

H100:
- test_nn.py: test_ConvTranspose2d_dilated_cuda_tf32, test_ConvTranspose2d_no_bias_cuda_tf32, test_Transformer_multilayer_coder_cuda_tf32
- test_torch.py: test_cdist_non_contiguous_batch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105879
Approved by: https://github.com/ezyang
2023-08-02 16:44:01 +00:00
Scott Wolchok
b435bff53a [PyTorch] Add tests for empty tensors w/storage null data_ptr (#101426)
Further investigation seems to show that changing this behavior (making empty tensors sometimes have a non-null data_ptr) was the real problem with #98090. Adding tests to lock down this behavior so we don't change it by accident again.

Differential Revision: [D45873002](https://our.internmc.facebook.com/intern/diff/D45873002/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101426
Approved by: https://github.com/zou3519
2023-07-27 05:19:42 +00:00
Nikita Karetnikov
eac9e1b35f [OpInfo] add reference and error inputs for multilabel_margin_loss (#105523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105523
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Kurt Mohler
fcb7d4b358 Mark bincount CUDA deterministic if weights are not given (#105244)
Fixes #98316
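
A minimal sketch of the resulting behavior (assuming a CUDA device and deterministic mode):

```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.randint(0, 10, (1000,), device="cuda")

counts = torch.bincount(x)  # no weights: now runs deterministically
# torch.bincount(x, weights=torch.rand(1000, device="cuda"))
#   ^ with weights the kernel is still nondeterministic and should error out
```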

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105244
Approved by: https://github.com/mikaylagawarecki
2023-07-18 01:16:51 +00:00
Nikita Shulga
5837e95d30 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None, but the type is not annotated as optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`

Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to squash older libstdc++ from the conda environment in favor of the one from the OS in `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where it is done)
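
For illustration, the PEP-484 violation pattern fixed above looks like this (names are hypothetical):

```python
from typing import Optional

# Before: default is None, but the annotation is not Optional (PEP-484 violation)
def load_state(path: str = None): ...

# After: the annotation matches the default
def load_state(path: Optional[str] = None): ...
```
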
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-15 20:30:20 +00:00
PyTorch MergeBot
15fd1ea118 Revert "[Reland] Update mypy to 1.4.1 (#105227)"
This reverts commit c9c4f8efc3.

Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))
2023-07-14 22:28:35 +00:00
Nikita Karetnikov
0c89596e4f [OpInfo] add reference and error inputs for multi_margin_loss (#104850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104850
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
Nikita Shulga
c9c4f8efc3 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None, but the type is not annotated as optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-14 20:45:12 +00:00
PyTorch MergeBot
3c5a494d7a Revert "Update mypy to 1.4.1 (#91983)"
This reverts commit 634659e262.

Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to It's dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))
2023-07-14 15:59:16 +00:00
Kurt Mohler
f987d11fa7 Reland: Make torch.empty* deterministic by filling with NaN or max int (#104995)
Relands #101849 after #104302 reverted it.

torchrec PR https://github.com/pytorch/torchrec/pull/1269 fixes the torchrec failure that caused #101849 to be reverted

Part of #82004
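
A minimal sketch of the relanded behavior:

```python
import torch

torch.use_deterministic_algorithms(True)

torch.empty(3)                     # uninitialized floats are filled with NaN
torch.empty(3, dtype=torch.int64)  # uninitialized ints are filled with the max value
```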

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104995
Approved by: https://github.com/albanD
2023-07-13 22:18:03 +00:00
Nikita Shulga
634659e262 Update mypy to 1.4.1 (#91983)
Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None, but the type is not annotated as optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi
2023-07-13 16:30:36 +00:00
yanbing-j
053654b9cf Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR optimizes scatter_add/scatter_reduce for the BFloat16/Half data types in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. The main point is creating a buffer among threads to accumulate intermediate data in fp32.
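
A minimal usage sketch of the op being optimized (shapes are illustrative; the fp32 accumulation is internal):

```python
import torch

src = torch.randn(100_000, dtype=torch.bfloat16)
index = torch.randint(0, 1_000, (100_000,))
out = torch.zeros(1_000, dtype=torch.bfloat16)

# Per this PR, intermediate sums are accumulated in an fp32 buffer
# and the result is written back as bfloat16.
out.scatter_add_(0, index, src)
```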

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-13 09:34:29 +00:00
Aaron Gokaslan
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
Kurt Mohler
0ccdbbe233 Add deterministic path for Tensor.resize_ (#104300)
New elements added to a tensor by `torch.Tensor.resize_` are set to NaN/MAX_INT when deterministic mode is turned on.

When `torch.Tensor.resize_` is called on a quantized tensor and deterministic mode is turned on, a nondeterministic error is raised.

Part of #82004
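
A minimal sketch of the new deterministic behavior:

```python
import torch

torch.use_deterministic_algorithms(True)

t = torch.tensor([1.0, 2.0])
t.resize_(4)  # the new elements t[2], t[3] are filled with NaN
```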

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104300
Approved by: https://github.com/albanD
2023-07-07 00:22:13 +00:00
Nikita Shulga
ddd7da7546 Enable more tests (#104437)
Remove `test_segment_reductions` from the list of blocklisted tests. Remove the `@onlyCPU` qualifier from `test_segment_reductions` as it has CUDA-specific parts.

Fixes https://github.com/pytorch/pytorch/issues/104410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104437
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-06-30 16:26:11 +00:00
Amr Elshennawy
a78bddac01 Revert D46920584: Multisect successfully blamed D46920584 for test or build failures (#104269) (#104302)
Summary:

This diff is reverting D46920584
D46920584: Make `torch.empty*` deterministic by filling with NaN or max int value (#101849) by generatedunixname499836121 has been identified to be causing the following test or build failures:

Tests affected:
- [torchrec/distributed/composable/tests:test_fsdp - torchrec.distributed.composable.tests.test_fsdp.FullyShardTest: test_composable_checkpoint](https://www.internalfb.com/intern/test/281475062923125/)

Here's the Multisect link:
https://www.internalfb.com/multisect/2341386
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Test Plan: NA

Reviewed By: huydhn, osalpekar

Differential Revision: D46997394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104302
Approved by: https://github.com/osalpekar
2023-06-29 20:20:58 +00:00
Richard Barnes
8cad411d3d Fix UntypedStorage pin error (#104355)
Summary:
Fixes:
```
TypeError: cannot pin 'torch.storage.UntypedStorage' only CPU memory can be pinned
```

Test Plan: Sandcastle

Differential Revision: D47093797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104355
Approved by: https://github.com/malfet
2023-06-29 16:06:52 +00:00
Kurt Mohler
2642f31e4c Make torch.empty* deterministic by filling with NaN or max int value (#101849)
Part of #82004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101849
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/kulinseth
2023-06-21 02:53:22 +00:00
Elias Ellison
40d70ba7ed Remove a number of fixed skips (#103162)
Also adds `PYTORCH_TEST_WITH_AOT_EAGER` to distinguish errors coming from aot_autograd and not inductor (not tested in ci, but useful for local debugging)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103162
Approved by: https://github.com/desertfire
2023-06-08 17:37:59 +00:00
ts
d2d03f0f44 Make index_add_ error if input source shape is wrong (#100321)
Fixes #92576 , checking the following as described in the documentation:

"source.shape[dim] == len(index) and source.shape[i] == self.shape[i] for i != dim"

Would be happy to iterate on this if there are any issues, and would be happy to implement the checking for the CUDA and MPS implementations of index_add_.
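
For illustration, a minimal sketch of the shape check (values are illustrative):

```python
import torch

x = torch.zeros(5, 3)
idx = torch.tensor([0, 2])
src = torch.randn(2, 3)

x.index_add_(0, idx, src)  # ok: src.shape[0] == len(idx) and src.shape[1] == x.shape[1]

bad = torch.randn(3, 3)
# x.index_add_(0, idx, bad)  # now errors: source.shape[dim] != len(index)
```
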
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100321
Approved by: https://github.com/lezcano
2023-06-08 06:51:10 +00:00
Lu Fang
1237502213 Introduce fast path for cuda_equal (#102714)
We introduce the same trick for cuda_equal, assuming that in cuda_equal the flags are already handled correctly.

Added tests for the CUDA part.
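
For illustration, a sketch of a call the fast path can short-circuit; the exact condition (both arguments aliasing the same tensor) is an assumption here, following the CPU version:

```python
import torch

a = torch.randn(1_000_000, device="cuda")
b = a  # same underlying tensor

torch.equal(a, b)  # True; a fast path can answer without an elementwise compare
```
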
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102714
Approved by: https://github.com/ezyang
2023-06-03 05:49:49 +00:00
Shiyan Deng
685505353a Back out "Add PyObject preservation for UntypedStorage (#97470)" (#102553)
Summary:
Original commit changeset: c24708d18ccb

Original Phabricator Diff: D46159983

Test Plan: SL tests and CI

Differential Revision: D46284986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102553
Approved by: https://github.com/DanilBaibak
2023-06-01 17:23:43 +00:00
Edward Z. Yang
818d92f58c Support resize on meta storage (#101988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101988
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-05-25 04:41:45 +00:00
PyTorch MergeBot
210fc28d5e Revert "Support resize on meta storage (#101988)"
This reverts commit 7d1ba0a92a.

Reverted https://github.com/pytorch/pytorch/pull/101988 on behalf of https://github.com/osalpekar due to Need to revert and rebase this in order to unblock train import ([comment](https://github.com/pytorch/pytorch/pull/101988#issuecomment-1561970230))
2023-05-24 21:51:33 +00:00
Wang, Eikan
2e18dd2bdc Improve bf16 neg by bypassing the conversion between BF16 and FP32 (#99711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99711
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/desertfire
2023-05-24 03:25:23 +00:00
Kazuaki Ishizaki
be5e77ca4c Make _StorageBase.byteswap faster ( > 10000x) (#101925)
This PR addresses #101690. It implements faster data element swapping in `_StorageBase` using C++ rather than Python.

This helps when a large model saved on a little-endian machine is loaded on a big-endian machine.

TODO:
- [x] Add test cases
- [x] Add performance comparison before and after the PR
- [ ] (Optional) Investigate further opportunities for performance improvements by [SIMDization](https://dev.to/wunk/fast-array-reversal-with-simd-j3p)

Fixes #101690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101925
Approved by: https://github.com/mikaylagawarecki
2023-05-24 00:13:41 +00:00
Edward Z. Yang
7d1ba0a92a Support resize on meta storage (#101988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101988
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-05-23 16:49:17 +00:00
Kurt Mohler
5fe629e314 Add PyObject preservation for UntypedStorage (#97470)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97470
Approved by: https://github.com/ezyang
2023-05-23 01:27:30 +00:00
drisspg
6f13d6892a Add meta support for multinomial (#101324)
# Summary
Found this when trying to compile the text gen loop of nanogpt here: b33289942b/torchbenchmark/models/nanogpt_generate/model.py (L322)
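
A minimal sketch of what the meta registration enables (shape inference without data):

```python
import torch

probs = torch.rand(8, device="meta")
sample = torch.multinomial(probs, 2)  # no data is touched; only shape/dtype
print(sample.shape, sample.device)    # torch.Size([2]) meta
```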

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101324
Approved by: https://github.com/ngimel
2023-05-19 00:04:26 +00:00
Edward Z. Yang
c567748e16 Make interpolate_bilinear deterministic using decomposition (#101115)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101115
Approved by: https://github.com/ngimel
2023-05-11 22:48:01 +00:00
Yu, Guangye
14964b3aa5 Add is_xpu to torch type (#101072)
# Motivate
Without this PR:
```python
>>>import torch
>>>torch.IntTensor.is_cuda
False
>>>torch.IntTensor.is_xpu
<attribute 'is_xpu' of 'torch._C._TensorBase' objects>
```

With this PR:
```python
>>>import torch
>>>torch.IntTensor.is_xpu
False
```
To align with CUDA: some customer code uses is_xpu to check the backend. Without this PR, the check is always truthy, which results in unexpected behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101072
Approved by: https://github.com/mikaylagawarecki
2023-05-11 17:50:59 +00:00
vfdev-5
622e582a2b Register get_cpu_capability for jit (#100723)
Description:

Context: In torchvision we ensure that functional ops are torchscriptable. Recently exposed `torch.backends.cpu.get_cpu_capability()` in https://github.com/pytorch/pytorch/pull/100164 is failing in torchvision CI
```
RuntimeError:
Python builtin <built-in function _get_cpu_capability> is currently not supported in Torchscript:
  File "/usr/local/lib/python3.10/dist-packages/torch/backends/cpu/__init__.py", line 17
    - "AVX512"
    """
    return torch._C._get_cpu_capability()
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
```
Ref: https://github.com/pytorch/vision/pull/7557

In this PR, `torch._C._get_cpu_capability()` is explicitly registered for JIT and tested.
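
A minimal sketch of the newly scriptable call:

```python
import torch

@torch.jit.script
def cpu_cap() -> str:
    # torch._C._get_cpu_capability is now registered for JIT,
    # so scripting this wrapper no longer fails.
    return torch.backends.cpu.get_cpu_capability()

print(cpu_cap())  # e.g. "AVX2"
```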

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100723
Approved by: https://github.com/albanD
2023-05-09 09:52:29 +00:00
Valentin Andrei
9bc68fcd25 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441 attempt 2) (#100505)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

If `stride == 1` we can use the whole warp to reduce the gradients.
If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop, as `blockDim.x` is always 32.

These changes improve the performance of the kernel when duplicates are present and do not affect performance with a low number of duplicates. The implementation is deterministic.

The proposed implementation uses `opmath_t` to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation, which accumulates in `scalar_t` and does not overflow. In addition, when the stride is 1, we are using warp shuffles to sum the gradient, so the order of the addition is slightly different than in a reference implementation, which causes some minor numerical differences when compared to a reference.

TEST CODE:

```
import torch
from time import time

# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100505
Approved by: https://github.com/ngimel
2023-05-03 23:52:58 +00:00
vfdev-5
6a12f10b08 Publicly exposing torch.backends.cpu.get_cpu_capability() (#100164)
Description:

- As suggested by Nikita, created `torch.backends.cpu` submodule and exposed `get_cpu_capability`.

- In torchvision Resize method we want to know current cpu capability in order to pick appropriate codepath depending on cpu capablities

Newly coded vectorized resize of uint8 images on AVX2 supported CPUs is now faster than older way (uint8->float->resize->uint8). However, on non-avx hardware (e.g. Mac M1) certain configs are slower using native uint8.
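
A minimal usage sketch of the new API (the capability strings follow the docstring, e.g. "AVX2"/"AVX512"):

```python
import torch

cap = torch.backends.cpu.get_cpu_capability()  # e.g. "AVX2"
# Pick a code path based on the capability, as torchvision's Resize intends:
use_vectorized_uint8 = cap in ("AVX2", "AVX512")
```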

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100164
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-03 19:02:07 +00:00
PyTorch MergeBot
1114673c90 Revert "[pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)"
This reverts commit 97afbcbc80.

Reverted https://github.com/pytorch/pytorch/pull/99441 on behalf of https://github.com/ngimel due to breaks ROCM ([comment](https://github.com/pytorch/pytorch/pull/99441#issuecomment-1531804487))
2023-05-02 16:46:04 +00:00
Lu Fang
090ec55f8d
Only skip in torch inductor test
Differential Revision: D45464303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100435
2023-05-01 22:21:37 -07:00
Lu Fang
429155b3c8
Disable some checks to get the test to pass
Differential Revision: D45437730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100364
2023-05-01 16:28:12 -07:00
valentinandrei
97afbcbc80 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

If `stride == 1` we can use the whole warp to reduce the gradients
If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop as `blockDim.x` is always 32

These changes improve the performance of the kernel when duplicates are present and do not affect performance with a low number of duplicates. The implementation is deterministic.

The proposed implementation uses `opmath_t` to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation, which accumulates in `scalar_t` and does not overflow. In addition, when the stride is 1, we are using warp shuffles to sum the gradient, so the order of the addition is slightly different than in a reference implementation, which causes some minor numerical differences when compared to a reference.

TEST CODE:

```
import torch
from time import time

# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99441
Approved by: https://github.com/ngimel
2023-05-01 22:41:00 +00:00
Lu Fang
d7fa7fa8cf
Introduce fast path in the CPU equal op
Differential Revision: D45282119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100024
2023-04-28 16:00:17 -07:00
kshitij12345
61dffa61c3 [fix] masked_scatter_: non-contiguous self (#100232)
Fixes https://github.com/pytorch/pytorch/issues/99638
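
A minimal repro sketch of the fixed case (values are illustrative):

```python
import torch

x = torch.zeros(3, 2).t()  # non-contiguous self (a transpose view)
assert not x.is_contiguous()

mask = torch.tensor([[True, False, True],
                     [False, True, False]])
x.masked_scatter_(mask, torch.arange(3.0))  # previously produced wrong results here
print(x)
```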

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100232
Approved by: https://github.com/ngimel
2023-04-28 18:12:23 +00:00
dujinhang
9cd48b0575 Add warning information for dtypetensor. (#99521)
Fixes #ISSUE_NUMBER

Without affecting the existing cpu/cuda logic, a separate interface is provided for the custom backend, and users can choose whether to use the interface function, which provides 10 tensor types with custom backend variations.

Therefore, users can use torch.set_default_tensor_type to set the default device tensor type, or use torch.xxx.dtypetensor to create a tensor. For example, torch.set_default_tensor_type(torch.foo.DoubleTensor) or torch.foo.DoubleTensor([]).

@albanD , please review my changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99521
Approved by: https://github.com/albanD
2023-04-28 18:01:45 +00:00
Larry Liu
687afeb686 [dynamo][numpy] Add NumpyTensorVariable to translate ndarray attribute calls to tensor attributes (#95849)
Issue: #93684

# Problem

Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.

# Design (as I know it)

* Use torch_np.ndarray (a wrapper of tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attributes and methods calls, on ndarray, to torch_np.ndarray equivalent.

This PR adds `NumpyTensorVariable` and supports:
1.  tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`

Next PR will handle returning `np.ndarray` and add support for ndarray methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
2023-04-27 16:18:35 +00:00
Jiong Gong
e5c9a0fcf5 [dynamo] avoid graph break on repeat_interleave.self_int (#99528)
Address convit_base failure: https://github.com/pytorch/torchdynamo/issues/1886 mentioned in https://github.com/pytorch/pytorch/issues/93777
Also for models like EleutherAI/gpt-j-6B.
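
For illustration, a repro sketch (assuming `torch.compile` with `fullgraph=True` as the way to surface the break):

```python
import torch

def f(x):
    return x.repeat_interleave(2)  # hits repeat_interleave.self_int

# With fullgraph=True, the former graph break would raise; now it compiles.
opt_f = torch.compile(f, fullgraph=True)
opt_f(torch.randn(4))
```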

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99528
Approved by: https://github.com/ezyang
2023-04-25 04:47:39 +00:00
BJ Hargrave
555ab310dc Add itemsize and nbytes properties to Tensor (#98322)
Adds properties for itemsize and nbytes to Tensor matching the properties in NumPy.

Fixes https://github.com/pytorch/pytorch/issues/12728
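
A minimal sketch of the new properties:

```python
import torch

t = torch.zeros(2, 3, dtype=torch.float32)
print(t.itemsize)  # 4: bytes per element, as in NumPy
print(t.nbytes)    # 24: itemsize * numel
```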

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98322
Approved by: https://github.com/ezyang
2023-04-05 12:11:55 +00:00
Jason Ansel
b96fe9b61c Fix issues related to ClassInstantier in HF models (#97997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97997
Approved by: https://github.com/anijain2305
2023-04-04 00:01:08 +00:00
Jason Ansel
71d850a100 [inductor] Fallback on complex64 kernels (#98155)
Later PRs in this stack fix graph breaks in GoogleFnet, which trigger errors from inductor trying to compile torch.complex64; this PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98155
Approved by: https://github.com/anijain2305, https://github.com/ngimel
2023-04-03 01:06:43 +00:00
Nikita Shulga
2af09393f9 masked_scatter should accept only bool masks (#97999)
Modify test_torch to check that an assert is raised in this case.

torch.uint8 usage has been deprecated for a few releases, and errors have been raised for other dtypes on CUDA devices, but not on CPU.
This PR finally restricts the mask to just `torch.bool`.
See https://github.com/pytorch/pytorch/pull/96594 as an example doing it for `torch.masked_fill`

Fixes https://github.com/pytorch/pytorch/issues/94634
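
A minimal sketch of the tightened check:

```python
import torch

x = torch.zeros(4)
src = torch.ones(2)

x.masked_scatter_(torch.tensor([True, False, True, False]), src)  # bool mask: ok

byte_mask = torch.tensor([1, 0, 1, 0], dtype=torch.uint8)
# x.masked_scatter_(byte_mask, src)  # now raises instead of deprecation-warning
```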

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97999
Approved by: https://github.com/ngimel
2023-04-01 23:25:25 +00:00
Nikita Shulga
a1dc2b1774 [BE] Remove bool dtype from masked_scatter (#98015)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at a9fa438</samp>

Simplified a test function for `torch.masked_scatter` in `test/test_torch.py` by removing redundant and unnecessary code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98015
Approved by: https://github.com/ezyang
2023-03-31 01:45:57 +00:00
Aleksei Nikiforov
8289120ef0 Revert "test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)" (#97759)
Tests were already fixed in https://github.com/pytorch/pytorch/pull/92834, and these changes, instead of also fixing the tests, are now breaking them again.

This reverts commit 7f94ea8492.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97759
Approved by: https://github.com/janeyx99
2023-03-28 18:43:08 +00:00
Nikita Shulga
542fb0b1fa Specify file encoding in test_torch.py (#97628)
Attempt to fix
```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 5260: ordinal not in range(128)
```
in https://github.com/pytorch/pytorch/actions/runs/4522628359/jobs/7965372405

In general, it's good practice to explicitly specify the encoding, as otherwise it depends on an environment variable and makes test failures unpredictable.
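
For illustration, the pattern being applied (filename hypothetical):

```python
# Explicit encoding makes the read independent of the environment's locale.
with open("test_data.txt", encoding="utf-8") as f:
    text = f.read()
```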

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97628
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-03-26 20:03:25 +00:00
Edward Z. Yang
37faa48844 DCE inference graphs too (#97275)
I added a bunch of asserts to verify that I didn't accidentally kill copy_ in the graph, hopefully this combined with our existing tests is good enough.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97275
Approved by: https://github.com/bdhirsh
2023-03-23 07:02:52 +00:00
Kurt Mohler
fbc803df0c Only warn once for TypedStorage deprecation (#97379)
Fixes #97207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97379
Approved by: https://github.com/ezyang
2023-03-23 05:40:23 +00:00
Aleksei Nikiforov
7f94ea8492 test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)
Use opposite encoding on big endian systems
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96952
Approved by: https://github.com/ezyang
2023-03-17 14:36:33 +00:00
mingfeima
06054d7df0 fix random output issue on index_select when src is scalar and index is empty (#96408)
Fix https://github.com/pytorch/pytorch/issues/94340
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96408
Approved by: https://github.com/ngimel
2023-03-16 05:30:45 +00:00
Kurt Mohler
06b7285163 Add torch._check* functions analogous to C++ TORCH_CHECK* (#88725)
Adds `_check`, `_check_index`, `_check_value`, `_check_type`, `_check_not_implemented`, `_check_tensor_all`

Part of #72948
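
A minimal usage sketch (the function and messages are hypothetical):

```python
import torch

def pick_row(t: torch.Tensor, i: int) -> torch.Tensor:
    # Each _check* variant raises the matching Python exception type;
    # the message argument is a callable, evaluated lazily.
    torch._check_type(isinstance(i, int), lambda: f"index must be int, got {type(i)}")
    torch._check_index(0 <= i < t.size(0), lambda: f"row {i} out of range for {t.size(0)} rows")
    return t[i]

pick_row(torch.eye(3), 1)
```
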
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88725
Approved by: https://github.com/albanD
2023-03-14 20:44:50 +00:00
kvathupo
2b9d9bcb85 Deprecate non-bool masks in masked_fill (#96594)
__What?__
Per discussion at #94634, deprecate `masked_fill` with non-bool masks. Deprecation warnings were previously added by #22261, but not for Apple MPS. I can revert the MPS changes if deprecation warnings are wanted first, though. See also #96112.

Fixes #85063 and #89320.

__Further Development?__
- Fixed the mask dtype checking for the cuda dispatch for `masked_fill` in `aten/src/ATen/native/cuda/Indexing.cu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96594
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-03-13 01:41:47 +00:00
Nikita Shulga
1cd0929bf7 [BC] Allow only bool tensors as mask in masked_select (#96112)
`byte` support was marked as deprecated in 1.8, so it's fine to remove this in 2.1 (or even 2.0)
Deprecation warning was added by https://github.com/pytorch/pytorch/pull/22261

Also, fix a bunch of syntactic errors in comments
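
A minimal sketch of the restriction:

```python
import torch

x = torch.randn(4)
torch.masked_select(x, x > 0)  # bool mask: ok

# torch.masked_select(x, (x > 0).to(torch.uint8))  # byte mask: now an error
```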

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96112
Approved by: https://github.com/ezyang
2023-03-07 01:43:14 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
Yuxin Wu
9bb2fe3eae fix numpy1.24 deprecations in unittests (#93997)
Fixes https://github.com/pytorch/pytorch/issues/91329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93997
Approved by: https://github.com/ngimel, https://github.com/jerryzh168
2023-02-18 00:59:09 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Brian Hirsh
ceb0f1576b turn functionalization on in aot_autograd inference (#92857)
still waiting for CI fallout
fixes #90759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92857
Approved by: https://github.com/ezyang
2023-02-13 17:48:00 +00:00
Nikita Shulga
4869929f32 Update Triton hash (#94249)
That includes MLIR + the latest packaging changes (which also download ptxas from CUDA-12).
Tweak CI to install gcc-9 to build triton.

Disable a few tests to make everything correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94249
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/weiwangmeta
2023-02-13 13:17:36 +00:00
Aaron Gokaslan
9171f7d4cd [BE] Modernize PyTorch even more for 3.8 with pyupgrade (#94520)
Applies some more pyupgrade fixits to PyTorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94520
Approved by: https://github.com/ezyang
2023-02-10 18:02:50 +00:00
ganler
db6cfff827 fix: forbid multi-index for index_select over scalar (#94347)
Fixes #88940

According to the [doc](https://pytorch.org/docs/stable/generated/torch.index_select.html):
1. "The returned tensor has the same number of dimensions as the original tensor (`input`). "
2.  "The `dim`th dimension has the same size as the length of `index`; other dimensions have the same size as in the original tensor."

These two conditions cannot be satisfied at the same time if the `input` is a scalar && `index` has multiple values: because a scalar at most holds one element (according to property 1, the output is a scalar), it is impossible to satisfy "The `dim`th dimension has the same size as the length of `index`" when `index` has multiple values.

However, currently, if we do so we either get:

1. Buffer overflow with ASAN;
2. Or (w/o ASAN) silently returns outputs that are not consistent with the doc (`x.index_select(0, torch.Tensor([0, 0, 0]).int())` returns `x`).

As a result, we should explicitly reject such cases.
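
A minimal sketch of the rejected case:

```python
import torch

x = torch.tensor(5.0)                        # 0-dim scalar
x.index_select(0, torch.tensor([0]))         # ok: a single index
# x.index_select(0, torch.tensor([0, 0, 0]))  # now rejected explicitly
```
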
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94347
Approved by: https://github.com/malfet
2023-02-10 17:17:09 +00:00
min-jean-cho
900e09c872 [Dynamo] Support torch.Tensor.fn as TorchVariable, not UserDefinedObjectVariable, preventing graph break (#93243)
As found in #92709, thanks to @ngimel and @jansel, `torch.Tensor.fn` currently points to `UserDefinedObjectVariable` rather than `TorchVariable`. The root cause is https://github.com/pytorch/pytorch/pull/92709#pullrequestreview-1273357406. To prevent this, build a `TorchVariable` for `torch.Tensor.fn` pointing to `torch.ops.aten.fn`.

This issue propagates to `torch.Tensor.fn`, causing a graph break with `nopython=True`.
```python
import torch
import torch._dynamo as dynamo

#op = torch.ops.aten.abs_ # no graph break
op = torch.Tensor.abs_ # graph break
args = torch.empty(10)

def foo(args):
    return op(args)

opt_foo = dynamo.optimize("inductor", nopython=True)(foo)
y_ = opt_foo(args)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93243
Approved by: https://github.com/jansel
2023-02-07 09:26:50 +00:00
min-jean-cho
6e1cfcdf4b cauchy_ few fixes (1) check gamma > 0 (2) better dtype error log (#93314)
Related #92047

(1) `torch.Tensor.cauchy_` is missing a check for `gamma > 0` (`torch.distributions.cauchy.Cauchy` correctly checks `gamma > 0`).
(2) Add a better dtype error log, similar to `exponential_`.
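
A minimal sketch of the added check:

```python
import torch

t = torch.empty(5)
t.cauchy_(median=0.0, sigma=2.0)     # ok: gamma (sigma here) > 0
# t.cauchy_(median=0.0, sigma=-1.0)  # now raises: gamma must be positive
```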

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93314
Approved by: https://github.com/jgong5, https://github.com/fritzo, https://github.com/lezcano
2023-02-03 11:56:28 +00:00