Kurt Mohler
13a54ce279
Avoid COW materialization in at::parallel_for/parallel_reduce ( #120455 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-03-01 05:05:28 +00:00
PyTorch MergeBot
86ff31c4a0
Revert "Avoid COW materialization in at::parallel_for/parallel_reduce ( #120455 )"
...
This reverts commit cabc09a5f2 .
Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100 ))
2024-02-28 22:30:18 +00:00
Kurt Mohler
cabc09a5f2
Avoid COW materialization in at::parallel_for/parallel_reduce ( #120455 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-02-28 00:37:33 +00:00
Sergii Dymchenko
bd9db6a9c7
Update to TorchFix 0.4.0 ( #119424 )
...
`torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seemed obvious to do so; otherwise `noqa: TOR901` was added - see https://github.com/pytorch/pytorch/pull/118318 for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424
Approved by: https://github.com/zou3519
2024-02-12 23:30:12 +00:00
Hirochika Matsumoto
02c24b0b5e
Add Python binding resizable to class {Untyped,Typed}Storage ( #119286 )
...
This PR exposes the `resizable` method of `StorageImpl` to the Python frontend, making it accessible to users.
Fixes #119233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286
Approved by: https://github.com/ezyang , https://github.com/mikaylagawarecki
2024-02-07 19:15:55 +00:00
CaoE
113138aa55
add test cases for GradScaler on CPU ( #109994 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994
Approved by: https://github.com/jgong5 , https://github.com/ezyang
2024-02-02 21:49:07 +00:00
Yifu Wang
0f7e63620f
CUDA fast path for split_with_sizes_copy.out ( #117203 )
...
### Motivation
In per-parameter sharding FSDP, each rank holds one shard of every parameter. Before a bucket of parameters is used, FSDP performs an all-gather to reconstruct the full parameters. The following example demonstrates the process for `world_size=2`, `num_params=3` (`A`, `B`, `C` stand for values in params `A`, `B`, `C`):
All-gather output:
```
AAAABBBCCAAAABBBCC
```
After all-gather-copy-out:
```
AAAAAAAA BBBBBB CCCC
```
The performance of all-gather-copy-out is crucial for the viability of per-parameter sharding FSDP. After thorough experiments, we believe that acceptable performance for this op is not achievable via composing existing ATen ops today.
We have proven that ideal performance is achievable with a [custom kernel](https://github.com/pytorch/pytorch/pull/115515 ). This PR aims to incorporate the optimizations to appropriate ATen ops (as suggested by @albanD).
### all-gather-copy-out via Composing ATen Ops
Carrying out the op via composing ATen ops involves a combination of view ops and copy ops. After thorough experiments, we found that the most natural/performant way to express the op is via `split_with_sizes` + `_foreach_copy_`, which works as follows:
Reshape all-gather output as (world_size, -1):
```
AAAABBBCC
AAAABBBCC
```
`split_with_sizes` + `_foreach_copy_`:
```
AAAA BBB CC
AAAA BBB CC
```
However, the performance of this approach is still far below that of the custom kernel. We've identified the following reasons:
- The approach requires materializing `O(num_params)` intermediate views, which induces a large amount of CPU overhead when `num_params` is high.
- `_foreach_copy_` uses the same block size for all tensors, leading to waste for small tensors and insufficient thread count for large tensors. This means low effective occupancy.
- `_foreach_copy_` dispatches multiple kernels for typical all-gather-copy-out problem sizes. This further lowers the effective occupancy.
- Due to the nature of the workload, the underlying copies are unaligned. `_foreach_copy_` isn't aggressive enough in exploiting vectorization opportunities in such workloads.
### PR
Introduces a CUDA backend for `split_with_sizes_copy.out` that addresses the above inefficiencies. See code for details.
### Benchmarks
The benchmarks are conducted on a set of representative problem sizes on an A100. CPU overhead and GPU execution time are measured separately, as reasonable CPU overhead doesn't directly affect e2e throughput. The reported copy bandwidth is calculated from GPU execution time.
We observe 3x-10x higher throughput than the baseline depending on the problem size, as well as lower CPU overhead across the board.
Baseline:
```
num_params=150 world_size=8 mixed=True Param size: 0.059 GB Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)
num_params=54 world_size=8 mixed=True Param size: 1.453 GB Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572)
num_params=54 world_size=8 mixed=True Param size: 0.512 GB Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587)
num_params=50 world_size=8 mixed=True Param size: 0.200 GB Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534)
num_params=3 world_size=8 mixed=True Param size: 0.983 GB Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084)
num_params=9 world_size=8 mixed=True Param size: 0.802 GB Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154)
num_params=3 world_size=8 mixed=True Param size: 1.573 GB Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087)
num_params=9 world_size=8 mixed=True Param size: 2.248 GB Copy bandwidth: 268.141 GB/s (gpu ms/iter: 8.384, cpu ms/iter 0.151)
num_params=150 world_size=128 mixed=True Param size: 0.064 GB Copy bandwidth: 73.237 GB/s (gpu ms/iter: 0.874, cpu ms/iter 10.664)
num_params=54 world_size=128 mixed=True Param size: 1.458 GB Copy bandwidth: 259.902 GB/s (gpu ms/iter: 5.609, cpu ms/iter 0.584)
num_params=54 world_size=128 mixed=True Param size: 0.515 GB Copy bandwidth: 238.703 GB/s (gpu ms/iter: 2.158, cpu ms/iter 0.612)
num_params=50 world_size=128 mixed=True Param size: 0.203 GB Copy bandwidth: 205.144 GB/s (gpu ms/iter: 0.987, cpu ms/iter 0.559)
num_params=3 world_size=128 mixed=True Param size: 0.983 GB Copy bandwidth: 270.467 GB/s (gpu ms/iter: 3.635, cpu ms/iter 0.073)
num_params=9 world_size=128 mixed=True Param size: 0.802 GB Copy bandwidth: 267.700 GB/s (gpu ms/iter: 2.997, cpu ms/iter 0.133)
num_params=3 world_size=128 mixed=True Param size: 1.573 GB Copy bandwidth: 268.913 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.093)
num_params=9 world_size=128 mixed=True Param size: 2.248 GB Copy bandwidth: 266.589 GB/s (gpu ms/iter: 8.433, cpu ms/iter 0.207)
num_params=150 world_size=1024 mixed=True Param size: 0.202 GB Copy bandwidth: 135.107 GB/s (gpu ms/iter: 1.495, cpu ms/iter 10.904)
num_params=54 world_size=1024 mixed=True Param size: 1.524 GB Copy bandwidth: 258.675 GB/s (gpu ms/iter: 5.890, cpu ms/iter 0.996)
num_params=54 world_size=1024 mixed=True Param size: 0.575 GB Copy bandwidth: 238.919 GB/s (gpu ms/iter: 2.408, cpu ms/iter 0.765)
num_params=50 world_size=1024 mixed=True Param size: 0.246 GB Copy bandwidth: 209.836 GB/s (gpu ms/iter: 1.172, cpu ms/iter 0.611)
num_params=3 world_size=1024 mixed=True Param size: 1.007 GB Copy bandwidth: 270.607 GB/s (gpu ms/iter: 3.720, cpu ms/iter 0.100)
num_params=9 world_size=1024 mixed=True Param size: 0.818 GB Copy bandwidth: 266.375 GB/s (gpu ms/iter: 3.071, cpu ms/iter 0.176)
num_params=3 world_size=1024 mixed=True Param size: 1.611 GB Copy bandwidth: 270.601 GB/s (gpu ms/iter: 5.952, cpu ms/iter 0.099)
num_params=9 world_size=1024 mixed=True Param size: 2.248 GB Copy bandwidth: 268.558 GB/s (gpu ms/iter: 8.371, cpu ms/iter 0.207)
num_params=150 world_size=8 mixed=False Param size: 0.035 GB Copy bandwidth: 43.749 GB/s (gpu ms/iter: 0.797, cpu ms/iter 10.531)
num_params=54 world_size=8 mixed=False Param size: 0.961 GB Copy bandwidth: 254.084 GB/s (gpu ms/iter: 3.781, cpu ms/iter 0.752)
num_params=54 world_size=8 mixed=False Param size: 0.282 GB Copy bandwidth: 216.792 GB/s (gpu ms/iter: 1.299, cpu ms/iter 0.717)
num_params=50 world_size=8 mixed=False Param size: 0.149 GB Copy bandwidth: 188.025 GB/s (gpu ms/iter: 0.793, cpu ms/iter 0.633)
num_params=3 world_size=8 mixed=False Param size: 0.655 GB Copy bandwidth: 267.793 GB/s (gpu ms/iter: 2.447, cpu ms/iter 0.107)
num_params=9 world_size=8 mixed=False Param size: 0.634 GB Copy bandwidth: 264.232 GB/s (gpu ms/iter: 2.401, cpu ms/iter 0.182)
num_params=3 world_size=8 mixed=False Param size: 1.049 GB Copy bandwidth: 268.455 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.089)
num_params=9 world_size=8 mixed=False Param size: 1.711 GB Copy bandwidth: 267.633 GB/s (gpu ms/iter: 6.394, cpu ms/iter 0.177)
num_params=150 world_size=128 mixed=False Param size: 0.038 GB Copy bandwidth: 46.698 GB/s (gpu ms/iter: 0.807, cpu ms/iter 10.488)
num_params=54 world_size=128 mixed=False Param size: 0.963 GB Copy bandwidth: 253.450 GB/s (gpu ms/iter: 3.799, cpu ms/iter 0.655)
num_params=54 world_size=128 mixed=False Param size: 0.283 GB Copy bandwidth: 216.857 GB/s (gpu ms/iter: 1.307, cpu ms/iter 0.671)
num_params=50 world_size=128 mixed=False Param size: 0.151 GB Copy bandwidth: 189.059 GB/s (gpu ms/iter: 0.799, cpu ms/iter 0.572)
num_params=3 world_size=128 mixed=False Param size: 0.655 GB Copy bandwidth: 269.849 GB/s (gpu ms/iter: 2.429, cpu ms/iter 0.078)
num_params=9 world_size=128 mixed=False Param size: 0.634 GB Copy bandwidth: 264.501 GB/s (gpu ms/iter: 2.399, cpu ms/iter 0.149)
num_params=3 world_size=128 mixed=False Param size: 1.049 GB Copy bandwidth: 268.426 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.086)
num_params=9 world_size=128 mixed=False Param size: 1.711 GB Copy bandwidth: 267.495 GB/s (gpu ms/iter: 6.398, cpu ms/iter 0.170)
num_params=150 world_size=1024 mixed=False Param size: 0.122 GB Copy bandwidth: 101.151 GB/s (gpu ms/iter: 1.211, cpu ms/iter 10.476)
num_params=54 world_size=1024 mixed=False Param size: 1.000 GB Copy bandwidth: 252.323 GB/s (gpu ms/iter: 3.963, cpu ms/iter 0.633)
num_params=54 world_size=1024 mixed=False Param size: 0.318 GB Copy bandwidth: 218.322 GB/s (gpu ms/iter: 1.455, cpu ms/iter 0.622)
num_params=50 world_size=1024 mixed=False Param size: 0.185 GB Copy bandwidth: 196.369 GB/s (gpu ms/iter: 0.944, cpu ms/iter 0.576)
num_params=3 world_size=1024 mixed=False Param size: 0.671 GB Copy bandwidth: 269.369 GB/s (gpu ms/iter: 2.491, cpu ms/iter 0.076)
num_params=9 world_size=1024 mixed=False Param size: 0.645 GB Copy bandwidth: 264.441 GB/s (gpu ms/iter: 2.439, cpu ms/iter 0.140)
num_params=3 world_size=1024 mixed=False Param size: 1.074 GB Copy bandwidth: 269.955 GB/s (gpu ms/iter: 3.978, cpu ms/iter 0.073)
num_params=9 world_size=1024 mixed=False Param size: 1.711 GB Copy bandwidth: 267.168 GB/s (gpu ms/iter: 6.405, cpu ms/iter 0.147)
```
New kernel:
```
num_params=150 world_size=8 mixed=True Param size: 0.059 GB Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)
num_params=54 world_size=8 mixed=True Param size: 1.453 GB Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417)
num_params=54 world_size=8 mixed=True Param size: 0.512 GB Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419)
num_params=50 world_size=8 mixed=True Param size: 0.200 GB Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410)
num_params=3 world_size=8 mixed=True Param size: 0.983 GB Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098)
num_params=9 world_size=8 mixed=True Param size: 0.802 GB Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134)
num_params=3 world_size=8 mixed=True Param size: 1.573 GB Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 0.099)
num_params=9 world_size=8 mixed=True Param size: 2.248 GB Copy bandwidth: 789.754 GB/s (gpu ms/iter: 2.847, cpu ms/iter 0.138)
num_params=150 world_size=128 mixed=True Param size: 0.064 GB Copy bandwidth: 565.667 GB/s (gpu ms/iter: 0.113, cpu ms/iter 0.996)
num_params=54 world_size=128 mixed=True Param size: 1.458 GB Copy bandwidth: 670.681 GB/s (gpu ms/iter: 2.174, cpu ms/iter 0.289)
num_params=54 world_size=128 mixed=True Param size: 0.515 GB Copy bandwidth: 676.135 GB/s (gpu ms/iter: 0.762, cpu ms/iter 0.264)
num_params=50 world_size=128 mixed=True Param size: 0.203 GB Copy bandwidth: 662.603 GB/s (gpu ms/iter: 0.306, cpu ms/iter 0.249)
num_params=3 world_size=128 mixed=True Param size: 0.983 GB Copy bandwidth: 769.283 GB/s (gpu ms/iter: 1.278, cpu ms/iter 0.078)
num_params=9 world_size=128 mixed=True Param size: 0.802 GB Copy bandwidth: 761.057 GB/s (gpu ms/iter: 1.054, cpu ms/iter 0.104)
num_params=3 world_size=128 mixed=True Param size: 1.573 GB Copy bandwidth: 774.325 GB/s (gpu ms/iter: 2.031, cpu ms/iter 0.075)
num_params=9 world_size=128 mixed=True Param size: 2.248 GB Copy bandwidth: 773.048 GB/s (gpu ms/iter: 2.908, cpu ms/iter 0.099)
num_params=150 world_size=1024 mixed=True Param size: 0.202 GB Copy bandwidth: 641.405 GB/s (gpu ms/iter: 0.315, cpu ms/iter 0.616)
num_params=54 world_size=1024 mixed=True Param size: 1.524 GB Copy bandwidth: 646.772 GB/s (gpu ms/iter: 2.356, cpu ms/iter 0.276)
num_params=54 world_size=1024 mixed=True Param size: 0.575 GB Copy bandwidth: 658.157 GB/s (gpu ms/iter: 0.874, cpu ms/iter 0.278)
num_params=50 world_size=1024 mixed=True Param size: 0.246 GB Copy bandwidth: 642.032 GB/s (gpu ms/iter: 0.383, cpu ms/iter 0.245)
num_params=3 world_size=1024 mixed=True Param size: 1.007 GB Copy bandwidth: 728.990 GB/s (gpu ms/iter: 1.381, cpu ms/iter 0.080)
num_params=9 world_size=1024 mixed=True Param size: 0.818 GB Copy bandwidth: 689.763 GB/s (gpu ms/iter: 1.186, cpu ms/iter 0.102)
num_params=3 world_size=1024 mixed=True Param size: 1.611 GB Copy bandwidth: 765.507 GB/s (gpu ms/iter: 2.104, cpu ms/iter 0.078)
num_params=9 world_size=1024 mixed=True Param size: 2.248 GB Copy bandwidth: 757.626 GB/s (gpu ms/iter: 2.967, cpu ms/iter 0.106)
num_params=150 world_size=8 mixed=False Param size: 0.035 GB Copy bandwidth: 584.272 GB/s (gpu ms/iter: 0.060, cpu ms/iter 0.656)
num_params=54 world_size=8 mixed=False Param size: 0.961 GB Copy bandwidth: 728.234 GB/s (gpu ms/iter: 1.319, cpu ms/iter 0.264)
num_params=54 world_size=8 mixed=False Param size: 0.282 GB Copy bandwidth: 730.059 GB/s (gpu ms/iter: 0.386, cpu ms/iter 0.279)
num_params=50 world_size=8 mixed=False Param size: 0.149 GB Copy bandwidth: 670.899 GB/s (gpu ms/iter: 0.222, cpu ms/iter 0.274)
num_params=3 world_size=8 mixed=False Param size: 0.655 GB Copy bandwidth: 775.699 GB/s (gpu ms/iter: 0.845, cpu ms/iter 0.077)
num_params=9 world_size=8 mixed=False Param size: 0.634 GB Copy bandwidth: 773.612 GB/s (gpu ms/iter: 0.820, cpu ms/iter 0.112)
num_params=3 world_size=8 mixed=False Param size: 1.049 GB Copy bandwidth: 781.395 GB/s (gpu ms/iter: 1.342, cpu ms/iter 0.081)
num_params=9 world_size=8 mixed=False Param size: 1.711 GB Copy bandwidth: 789.156 GB/s (gpu ms/iter: 2.169, cpu ms/iter 0.116)
num_params=150 world_size=128 mixed=False Param size: 0.038 GB Copy bandwidth: 517.056 GB/s (gpu ms/iter: 0.073, cpu ms/iter 0.632)
num_params=54 world_size=128 mixed=False Param size: 0.963 GB Copy bandwidth: 684.246 GB/s (gpu ms/iter: 1.407, cpu ms/iter 0.294)
num_params=54 world_size=128 mixed=False Param size: 0.283 GB Copy bandwidth: 680.593 GB/s (gpu ms/iter: 0.416, cpu ms/iter 0.286)
num_params=50 world_size=128 mixed=False Param size: 0.151 GB Copy bandwidth: 682.197 GB/s (gpu ms/iter: 0.221, cpu ms/iter 0.255)
num_params=3 world_size=128 mixed=False Param size: 0.655 GB Copy bandwidth: 759.470 GB/s (gpu ms/iter: 0.863, cpu ms/iter 0.074)
num_params=9 world_size=128 mixed=False Param size: 0.634 GB Copy bandwidth: 765.694 GB/s (gpu ms/iter: 0.829, cpu ms/iter 0.094)
num_params=3 world_size=128 mixed=False Param size: 1.049 GB Copy bandwidth: 766.535 GB/s (gpu ms/iter: 1.368, cpu ms/iter 0.075)
num_params=9 world_size=128 mixed=False Param size: 1.711 GB Copy bandwidth: 787.608 GB/s (gpu ms/iter: 2.173, cpu ms/iter 0.105)
num_params=150 world_size=1024 mixed=False Param size: 0.122 GB Copy bandwidth: 640.203 GB/s (gpu ms/iter: 0.191, cpu ms/iter 0.668)
num_params=54 world_size=1024 mixed=False Param size: 1.000 GB Copy bandwidth: 713.947 GB/s (gpu ms/iter: 1.401, cpu ms/iter 0.274)
num_params=54 world_size=1024 mixed=False Param size: 0.318 GB Copy bandwidth: 642.855 GB/s (gpu ms/iter: 0.494, cpu ms/iter 0.276)
num_params=50 world_size=1024 mixed=False Param size: 0.185 GB Copy bandwidth: 643.297 GB/s (gpu ms/iter: 0.288, cpu ms/iter 0.262)
num_params=3 world_size=1024 mixed=False Param size: 0.671 GB Copy bandwidth: 690.626 GB/s (gpu ms/iter: 0.972, cpu ms/iter 0.078)
num_params=9 world_size=1024 mixed=False Param size: 0.645 GB Copy bandwidth: 754.431 GB/s (gpu ms/iter: 0.855, cpu ms/iter 0.109)
num_params=3 world_size=1024 mixed=False Param size: 1.074 GB Copy bandwidth: 769.985 GB/s (gpu ms/iter: 1.395, cpu ms/iter 0.080)
num_params=9 world_size=1024 mixed=False Param size: 1.711 GB Copy bandwidth: 766.337 GB/s (gpu ms/iter: 2.233, cpu ms/iter 0.103)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117203
Approved by: https://github.com/albanD , https://github.com/awgu
ghstack dependencies: #118512
2024-02-01 18:23:01 +00:00
CaoE
bacbad5bc9
add GradScaler on CPU ( #109993 )
...
Step 2 of https://github.com/pytorch/pytorch/issues/111559 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5 , https://github.com/ezyang
2024-01-29 23:42:35 +00:00
Edward Z. Yang
46712b019d
Enable local_partial_types ( #118467 )
...
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414 , #118418 , #118432
2024-01-28 13:38:22 +00:00
Mikayla Gawarecki
41a56f7828
Fix swap_tensors to swap PyObjects associated with TensorImpl ( #116955 )
...
This PR intends to fix the following issue when swapping two tensors
```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581, 0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581, 0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```
What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.
57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)
When we run any operation that returns the same TensorImpl (e.g. an inplace op, `t.to(dtype=t.dtype)`, etc.), `t1` now has `t2`'s TensorImpl, but `t2`'s TensorImpl still holds a reference to `t2`. So when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.
The fix is for `swap_tensors` to also swap the PyObjects held in the two TensorImpls' PyObjectSlots.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
2024-01-24 01:40:18 +00:00
Kurt Mohler
cd084c4909
Add TensorIteratorConfig::add_const_input to avoid COW materialize ( #118053 )
...
Part of #97856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118053
Approved by: https://github.com/ezyang
2024-01-23 22:32:39 +00:00
Oguz Ulgen
3b38f7b266
Remove skips for passing tests ( #118000 )
...
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 16:11:38 +00:00
PyTorch MergeBot
bb28965924
Revert "Remove skips for passing tests ( #118000 )"
...
This reverts commit 3c339b5b21 .
Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752 ))
2024-01-23 06:10:25 +00:00
Oguz Ulgen
3c339b5b21
Remove skips for passing tests ( #118000 )
...
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 03:41:23 +00:00
haozhe.zhu@intel.com
0ae952db76
enable mkldnn bf32 matmul ( #116015 )
...
### Testing
FP32 matmul vs. mkldnn BF32 matmul on SPR
Single core:
Input | BF32 / ms | FP32 / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041
56 cores:
Input | BF32 / ms | FP32 / ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 1129.204 | 1708.912 | 1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915 | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 | 8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128 | 1296.419 | 1308.411 | 1.009
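A hedged timing sketch; whether the mkldnn BF32 path is actually taken depends on the build, the CPU, and how the reduced-precision path is gated (assumed here to be via `torch.set_float32_matmul_precision`):

```python
import time
import torch

def bench_mm(m, n, k, iters=10):
    a, b = torch.randn(m, k), torch.randn(k, n)
    torch.mm(a, b)                                   # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        torch.mm(a, b)
    return (time.perf_counter() - start) / iters * 1e3  # ms/iter

torch.set_float32_matmul_precision("highest")        # strict fp32 matmul
fp32_ms = bench_mm(128, 128, 128)
torch.set_float32_matmul_precision("medium")         # permits reduced-precision paths
reduced_ms = bench_mm(128, 128, 128)
print(f"fp32 {fp32_ms:.4f} ms vs reduced {reduced_ms:.4f} ms")
```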
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5 , https://github.com/ezyang
2024-01-20 09:30:23 +00:00
CaoE
29516bd2a0
add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU ( #109281 )
...
Step 1 of https://github.com/pytorch/pytorch/issues/111559 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5 , https://github.com/ezyang
2024-01-16 15:25:08 +00:00
Edward Z. Yang
2200118f59
Enable some uint{16,32,64} tests that are working ( #116809 )
...
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116809
Approved by: https://github.com/albanD
2024-01-15 02:25:21 +00:00
Edward Z. Yang
edec54b9de
Add torch._lazy_clone to create COW tensors ( #113397 )
...
Part of #109833
Stack from [ghstack](https://github.com/ezyang/ghstack ) (oldest at bottom):
* __->__ #113397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
2024-01-11 01:32:44 +00:00
Edward Z. Yang
8bcdde5058
Support uint{16,32,64} deterministic empty fill and scalar Python binding handling ( #116807 )
...
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116807
Approved by: https://github.com/albanD
ghstack dependencies: #116805 , #116806
2024-01-10 02:17:23 +00:00
Edward Z. Yang
43a23a704a
Support uint{16,32,64} copy ( #116806 )
...
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116806
Approved by: https://github.com/albanD
ghstack dependencies: #116805
2024-01-10 02:17:23 +00:00
Edward Z. Yang
2e983fcfd3
Support unsigned int for randint, item, equality, fill, iinfo, tensor ( #116805 )
...
These are some basic utilities that are often used for testing.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116805
Approved by: https://github.com/albanD
2024-01-10 02:17:23 +00:00
Aaron Gokaslan
3fe437b24b
[BE]: Update flake8 to v6.1.0 and fix lints ( #116591 )
...
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, i.e.
- `assert(a == b)` -> `assert a == b`
- `if(x > y or y < z):`->`if x > y or y < z:`
- And `return('...')` -> `return '...'`
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD , https://github.com/malfet
2024-01-03 06:04:44 +00:00
Aaron Gokaslan
bd10fea79a
[BE]: Enable F821 and fix bugs ( #116579 )
...
Fixes #112371
I tried to fix as many of the bugs as I could; for a few I could not figure out the proper fix, so I left them with noqas.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
Yanbo Liang
f657b2b1f8
[Dynamo][10/N] Remove TorchVariable and is_allowed ( #116312 )
...
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
- The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be applied to custom functions rather than torch functions. I'll defer this to a separate decorator refactor PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-27 18:47:05 +00:00
PyTorch MergeBot
3b709d7c1e
Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed ( #116312 )"
...
This reverts commit 015bd0e0a1 .
Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506 ))
2023-12-26 23:47:15 +00:00
Yanbo Liang
015bd0e0a1
[Dynamo][10/N] Remove TorchVariable and is_allowed ( #116312 )
...
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
- The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be applied to custom functions rather than torch functions. I'll defer this to a separate decorator refactor PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-23 09:44:09 +00:00
Mikayla Gawarecki
f206e31e2f
Swap slots if slots match in swap_tensor ( #116128 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128
Approved by: https://github.com/albanD
2023-12-21 00:43:30 +00:00
Kurt Mohler
8a8d0adc0b
Fix torch.gradient check for spacing arg list length ( #115686 )
...
Fixes #114207
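The check in question, sketched: a `spacing` list whose length doesn't match the number of differentiated dimensions should be rejected:

```python
import torch

y = torch.tensor([1.0, 2.0, 4.0, 7.0])
(g,) = torch.gradient(y, spacing=2.0)       # one uniform spacing for one dim: fine
raised = False
try:
    torch.gradient(y, spacing=[2.0, 2.0])   # two spacings for a 1-D input
except RuntimeError:
    raised = True                           # rejected by the length check
```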
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686
Approved by: https://github.com/albanD
2023-12-13 20:17:20 +00:00
mantaionut
d521857411
Terminate handler ( #101332 )
...
Fixes #50051 .
This PR is based on #50320 and addresses the last round of feedback.
On Windows it is enabled by default. It can be enabled or disabled via the USE_CUSTOM_TERMINATE env variable.
This PR adds support for overriding the terminate handler in order to log uncaught exceptions in threads.
If an exception is thrown and not caught, it will print `<Unhandled exception caught in c10/util/AbortHandler.h>`.
The point of doing this is that in issue #50051 , exceptions were thrown but not logged. With this logging system it will be easier to debug such issues in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD , https://github.com/malfet
2023-12-12 17:55:27 +00:00
ecao
65651d970b
Optimize the copy of Half to Float and Float to Half on CPU ( #103148 )
...
### Description
Optimize the copy of Half to Float and Float to Half on CPU.
### Testing
Single core:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (1, 777) | 0.00345 | 0.00344 | 0.00411 | 0.00410
size: (2, 512) | 0.00355 | 0.00344 | 0.00431 | 0.00400
size: (10, 555) | 0.00473 | 0.00391 | 0.00562 | 0.00477
size: (1, 2048, 1024) | 0.488 | 0.480 | 0.498 | 0.499
size: (32, 100, 777) | 0.584 | 0.568 | 0.571 | 0.587
28 cores:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (10, 555) | 0.00472 | 0.00369 | 0.00576 | 0.00481
size: (1, 2048, 1024) | 0.0189 | 0.0188 | 0.0173 | 0.0251
size: (64, 512, 1024) | 3.159 | 2.375 | 3.152 | 2.358
size: (32, 100, 777) | 0.0225 | 0.0195 | 0.0193 | 0.0261
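A timing sketch for the conversions in the tables above (numbers depend heavily on machine and thread count):

```python
import time
import torch

def bench_copy(src, dst, shape=(1, 2048, 1024), iters=20):
    x = torch.randn(shape).to(src)
    x.to(dst)                                       # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        x.to(dst)                                   # dtype-converting copy
    return (time.perf_counter() - start) / iters * 1e3  # ms/iter

for s, d in [(torch.half, torch.float), (torch.float, torch.half),
             (torch.bfloat16, torch.float), (torch.float, torch.bfloat16)]:
    print(f"{s} -> {d}: {bench_copy(s, d):.4f} ms")
```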
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103148
Approved by: https://github.com/jgong5 , https://github.com/cpuhrsch
2023-12-12 05:57:52 +00:00
FFFrog
3361496f96
Fix the corner case of index_add ( #114929 )
...
Fixes #114864
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114929
Approved by: https://github.com/mikaylagawarecki
2023-12-09 01:57:25 +00:00
albanD
a2b89154bf
New swap function ( #111747 )
...
This PR proposes a new approach to solve the problem of nn/optim being linked only by Python object identity.
The idea is to have a function that can swap the content of two Tensors `t1` and `t2` while preserving all the old references.
This would allow us to swap `model.weight` with a new Tensor (which can be any subclass of Tensor and use any TensorImpl; xla, sparse, and nested TensorImpls would work). The use within nn will be done in a follow-up.
This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public `PyObject*` pointer and where the PyObject is bigger due to dict/weakref being inlined (older CPython versions) or due to slots.
The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the use of slots in subclasses.
Draft right now to see what @colesbury thinks about doing this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
Kurt Mohler
6f32eb7eef
Add decomp for replication_pad2d and use for CUDA deterministic ( #111590 )
...
Fixes #95578
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-12-01 18:56:09 +00:00
PyTorch MergeBot
013675ff59
Revert "Add decomp for replication_pad2d and use for CUDA deterministic ( #111590 )"
...
This reverts commit f1286161a6 .
Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing XLA job. The job is also failing on the PR, but the log classifier failed to find the failed test, which led to it being marked wrongly as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794 ))
2023-11-30 02:28:14 +00:00
Kurt Mohler
f1286161a6
Add decomp for replication_pad2d and use for CUDA deterministic ( #111590 )
...
Fixes #95578
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-11-29 21:50:46 +00:00
PyTorch MergeBot
fe428a284b
Revert "Add torch._lazy_clone to create COW tensors ( #113397 )"
...
This reverts commit 9916d8a9ea .
Reverted https://github.com/pytorch/pytorch/pull/113397 on behalf of https://github.com/DanilBaibak due to Unfortunately, I need to revert your PR because the lower [PR in the stack](https://github.com/pytorch/pytorch/pull/113396 ) is failing a bunch of internal build jobs. ([comment](https://github.com/pytorch/pytorch/pull/113397#issuecomment-1818761224 ))
2023-11-20 10:21:09 +00:00
PyTorch MergeBot
d40d72d664
Revert "Skip test_lazy_clone for Inductor ( #114012 )"
...
This reverts commit ecd8d388b9 .
Reverted https://github.com/pytorch/pytorch/pull/114012 on behalf of https://github.com/DanilBaibak due to the original changes breaking the internal build. Here is the original diff stack [D51444337](https://www.internalfb.com/diff/D51444337 ) ([comment](https://github.com/pytorch/pytorch/pull/114012#issuecomment-1818745425 ))
2023-11-20 10:12:44 +00:00
Nikita Shulga
ecd8d388b9
Skip test_lazy_clone for Inductor ( #114012 )
...
As half of those tests fail if run individually, but the first failure masks all subsequent ones, i.e.
```
PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -v -k test_lazy_clone_cuda_float32
test_lazy_clone_cuda_float32 (__main__.TestTorchDeviceTypeCUDA) ... FAIL
...
self.assertTrue(torch._C._is_cow_tensor(t))
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1 test in 19.419s
FAILED (failures=1)
```
But
```
$ PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -k test_lazy_clone_
...
......................
----------------------------------------------------------------------
Ran 24 tests in 24.969s
OK
```
This flaky behavior was already detected, for example see https://github.com/pytorch/pytorch/issues/113953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114012
Approved by: https://github.com/huydhn , https://github.com/kit1980
2023-11-18 04:57:00 +00:00
Kurt Mohler
9916d8a9ea
Add torch._lazy_clone to create COW tensors ( #113397 )
...
Part of #109833
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
ghstack dependencies: #113396
2023-11-17 01:58:51 +00:00
Brian Hirsh
cebad9867b
graph break on intermediate leaves that require grad ( #113277 )
...
fixes https://github.com/pytorch/pytorch/issues/90552 . This is a simpler fix that just detects the situation where AOTAutograd can't create a proper backward graph for the situation and graph breaks. This was technically a silent correctness issue before.
This PR tries to always graph break when we see a factory function that returns a tensor requiring grad. I check this by seeing if the op returned a `TensorVariable` in dynamo, and if one of the input arguments was a `requires_grad=True` kwarg. I think this is high-fidelity enough, and I'm also hoping that this is uncommon enough that a graph break is reasonable here.
The fix to avoid the graph break in user land is also pretty easy - just instantiate your tensor outside of the compiled region and plumb it in.
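The workaround might look like this (a hedged sketch; the `eager` backend is used only to keep the example lightweight):

```python
import torch

# A factory call with requires_grad=True inside the compiled region
# now triggers a graph break instead of a silently wrong backward.
@torch.compile(backend="eager")
def inside(x):
    w = torch.ones(3, requires_grad=True)  # graph-breaks here
    return (x * w).sum()

assert inside(torch.arange(3.0)).item() == 3.0  # still correct, just slower

# Workaround: create the leaf outside and plumb it in as an input.
@torch.compile(backend="eager")
def outside(x, w):
    return (x * w).sum()

w = torch.ones(3, requires_grad=True)
out = outside(torch.arange(3.0), w)
out.backward()
assert torch.equal(w.grad, torch.arange(3.0))
```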
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113277
Approved by: https://github.com/eellison
ghstack dependencies: #113267 , #113416 , #113584
2023-11-16 02:47:45 +00:00
Nikita Shulga
78f3937ee8
[BE] Handle errors in set_num_threads ( #113684 )
...
and `set_num_interop_threads`
Before that, calling `torch.set_num_threads(2**65)` resulted in a segmentation fault; afterwards it becomes a good old runtime error:
```
% python -c "import torch;torch.set_num_threads(2**65)"
Traceback (most recent call last):
File "<string>", line 1, in <module>
RuntimeError: Overflow when unpacking long
```
Similar to https://github.com/pytorch/pytorch/pull/60073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113684
Approved by: https://github.com/Skylion007 , https://github.com/albanD
2023-11-15 06:17:41 +00:00
Kurt Mohler
8bdce9bb74
Fix UntypedStorage.resize_ to keep same CUDA device index ( #113386 )
...
Fixes #113300
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113386
Approved by: https://github.com/albanD
2023-11-10 01:57:25 +00:00
Kurt Mohler
fd209543d5
Add torch.utils.deterministic.fill_uninitialized_memory flag ( #111377 )
...
Part of #109802
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD , https://github.com/aaronenyeshi
2023-11-01 16:10:09 +00:00
PyTorch MergeBot
ace2713d1e
Revert "Add torch.utils.deterministic.fill_uninitialized_memory flag ( #111377 )"
...
This reverts commit f1785373c0 .
Reverted https://github.com/pytorch/pytorch/pull/111377 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111377#issuecomment-1784179040 ))
2023-10-29 17:41:55 +00:00
Nikita Shulga
b61efe1c2b
Fix `torch.[size|stride](dim=None)` invocation ( #111991 )
...
Per the documentation, one should be able to explicitly pass the dim argument as None to get tensor sizes/strides across all dimensions, but before this change it was incorrectly interpreted as a named tensor call.
Modify the `size` and `stride` signatures generated by `gen_pyi.py` to highlight that the overload with `None` returns a Tuple, while the one with `dim: _int` returns an `int`.
Add a regression test to validate the behavior, and remove the check for asserts from two named tensor tests (NamedTensors are dead, aren't they?)
Fixes https://github.com/pytorch/pytorch/issues/111944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111991
Approved by: https://github.com/zou3519
2023-10-26 04:14:35 +00:00
Kurt Mohler
f1785373c0
Add torch.utils.deterministic.fill_uninitialized_memory flag ( #111377 )
...
Part of #109802
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD
2023-10-26 02:39:06 +00:00
Nikita Shulga
7709382b50
Fix regression in torch.equal behavior for NaNs ( #111699 )
...
`torch.equal(x, x)` should return false if `x` is a tensor of floats, one of which is NaN.
This renders some of the optimizations proposed in https://github.com/pytorch/pytorch/pull/100024 invalid, though as a result `torch.equal` will become much slower for identical floating point tensors.
Add a regression test that calls torch.equal on a tensor containing NaN.
Fixes https://github.com/pytorch/pytorch/issues/111251
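A regression check along these lines (IEEE-754 NaN compares unequal to itself, so `equal` must return False even for the identical tensor object):

```python
import torch

x = torch.tensor([1.0, float("nan")])
# NaN != NaN, so a tensor containing NaN is never equal to itself,
# even when both arguments are the same object.
assert not torch.equal(x, x)
assert not torch.equal(x, x.clone())

# Tensors without NaN still compare equal as usual.
y = torch.tensor([1.0, 2.0])
assert torch.equal(y, y.clone())
```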
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111699
Approved by: https://github.com/Skylion007 , https://github.com/albanD
2023-10-21 00:02:45 +00:00
CaoE
d1afb7d43d
add Half support for multinomial on CPU ( #104178 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5 , https://github.com/kulinseth , https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
Evgeni Burovski
48989bc820
trace frames with np.ndarray ( #110512 )
...
Fixes #109604
Resubmit gh-109715 + several skips and small fixes to make tests pass.
The main fix here is by @ysiraichi : previously, dynamo did not resume tracing numpy ndarrays after a graph break.
While at it, fix several small issues Yukio's fix uncovers:
- graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc.)
- recognize array scalars in dynamo and treat them as 0D ndarrays
- make sure that iterating over torch.ndarray generates arrays, not bare tensors
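A sketch of the arrays-in/arrays-out behavior being fixed (compiling a plain NumPy function; the `eager` backend is an assumption to keep the example light):

```python
import numpy as np
import torch

@torch.compile(backend="eager")
def f(x):
    # Dynamo traces the NumPy ops by mapping them onto torch ops;
    # the result is converted back to an ndarray at the boundary.
    return np.sin(x) ** 2 + np.cos(x) ** 2

out = f(np.linspace(0.0, 1.0, 5))
assert isinstance(out, np.ndarray)  # arrays in, arrays out
assert np.allclose(out, 1.0)
```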
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512
Approved by: https://github.com/lezcano
2023-10-15 00:56:10 +00:00
CaoE
8713a1a363
add Half support for bernoulli on CPU ( #104176 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima , https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00