The expression `argv + [f'--junit-xml-reruns={test_report_path}'] if TEST_SAVE_XML else []` evaluates to the empty list when `TEST_SAVE_XML` is false, because the conditional applies to the whole sum rather than just the appended list; it would need parentheses to behave as intended.
Instead, simplify the code by appending the argument, when required, directly where `test_report_path` is set.
Note that `.append()` cannot be used, as that would modify `argv` and in turn `UNITTEST_ARGS`, which might have undesired side effects.
Without this patch, `pytest.main()` would be called with no arguments, which tries to discover all tests in the current working directory and ultimately leads to (many) failures.
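A minimal sketch of the pitfall and the fix, reusing the names from the PR description (the surrounding test-runner code is elided):
```python
# Conditional expressions bind looser than `+`, so the whole sum is discarded
# when TEST_SAVE_XML is falsy -- argv is silently dropped.
TEST_SAVE_XML = ""  # stand-in for the real setting
argv = ["test_foo.py", "-x"]
test_report_path = "report.xml"

args = argv + [f"--junit-xml-reruns={test_report_path}"] if TEST_SAVE_XML else []
assert args == []

# Parentheses would fix the precedence ...
args = argv + ([f"--junit-xml-reruns={test_report_path}"] if TEST_SAVE_XML else [])

# ... but the patch instead builds the extra argument where test_report_path is
# set, without mutating argv / UNITTEST_ARGS:
extra_args = [f"--junit-xml-reruns={test_report_path}"] if TEST_SAVE_XML else []
args = argv + extra_args
assert args == argv
```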
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94589
Approved by: https://github.com/clee2000, https://github.com/Neilblaze
Hopefully fixes #89205.
This is another version of #90847, which was reverted because it increased the compile time significantly.
From my discussion with @ngimel in https://github.com/pytorch/pytorch/pull/93153#issuecomment-1409051528, it seems the jiterator option would be very tricky, if not impossible.
So what I did was optimize the compile time on my machine.
To measure this, I first compile PyTorch as a whole, then change only the `LogcumsumexpKernel.cu` file to see how the change affects the compile time.
Here are the compilation times of just the `LogcumsumexpKernel.cu` file on my machine:
- Original version (without any complex implementations): 56s (about 1 minute)
- The previous PR (#90847): 13m 57s (about 14 minutes)
- This PR: 3m 35s (about 3.5 minutes)
If the previous PR increased the build time by 30 minutes on PyTorch's build machines, then this PR reduces that increase to about 6 minutes. Hopefully this is an acceptable level of build-time increase.
What I did was (sorted from the most significant reduction in build time to the least):
- Replacing `log(x)` with `log1p(x - 1)`. This is applied in the infinite case, so we don't really care about precision.
- Implementing the complex exponential manually (see the sketch after this list).
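For illustration only, here is a hedged Python sketch of the two tricks; the real change is CUDA C++ inside `LogcumsumexpKernel.cu` and differs in the details:
```python
import cmath
import math

# log(x) rewritten as log1p(x - 1): mathematically the same value, and the PR
# only uses it on the branch handling infinities, where precision does not matter.
def log_via_log1p(x: float) -> float:
    return math.log1p(x - 1.0)

# Complex exponential written out by hand via Euler's formula,
# exp(a + bi) = exp(a) * (cos(b) + i*sin(b)), instead of calling the library exp.
def complex_exp(z: complex) -> complex:
    scale = math.exp(z.real)
    return complex(scale * math.cos(z.imag), scale * math.sin(z.imag))

assert math.isclose(log_via_log1p(2.0), math.log(2.0))
assert cmath.isclose(complex_exp(1.0 + 2.0j), cmath.exp(1.0 + 2.0j))
```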
tag: @malfet, @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94310
Approved by: https://github.com/Skylion007, https://github.com/malfet
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the `set` call.
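An illustrative before/after of the kind of rewrite this enforces (not taken from the diff):
```python
b = range(10)

# Before: useless generator expressions passed to set()/list()/dict()
s = set(a for a in b)
lst = list(str(a) for a in b)
dct = dict((a, a * a) for a in b)

# After: comprehensions are more succinct and skip the intermediate generator
s = {a for a in b}
lst = [str(a) for a in b]
dct = {a: a * a for a in b}
```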
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
I applied some flake8 fixes and enabled checking for them in the linter. I also enabled some checks for my previous comprehensions PR.
This is a follow-up to #94323, where I enable the flake8 checkers for the fixes I made and fix a few more of them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94601
Approved by: https://github.com/ezyang
### Motivation of this PR
This patch migrates `spmm_reduce` from `torch-sparse` (a third-party dependency for PyG) to `torch`, in response to the initial proposal for fusing **Gather, Apply, Scatter** in message passing for GNN inference/training: https://github.com/pytorch/pytorch/issues/71300
**GAS** is the major step of message passing; its hotspot depends on the storage type of `EdgeIndex`, which records the connections between nodes:
* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`
The reduce type can be chosen from "sum", "mean", "max", "min".
This PR extends `torch.sparse.mm` with a `reduce` argument, which maps to `torch.sparse_mm.reduce` internally.
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl`, which has two outputs:
* `out` - the actual output
* `arg_out` - records the output indices of the non-zero elements when the reduce type is "max" or "min"; this is only useful for training, so it is not calculated for inference.
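A minimal usage sketch of the extended front-end, assuming the `reduce` keyword lands on `torch.sparse.mm` as described above (the CSR tensor here stands in for an `EdgeIndex` in CSR layout):
```python
import torch

# 2 x 2 CSR adjacency: row 0 connects to cols {0, 1}, row 1 connects to col {1}.
crow_indices = torch.tensor([0, 2, 3])
col_indices = torch.tensor([0, 1, 1])
values = torch.ones(3)
adj = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(2, 2))

x = torch.randn(2, 8)                        # dense node features
out = torch.sparse.mm(adj, x, reduce="sum")  # per-row reduction over neighbors
```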
### Performance
Benchmarked GCN on ogbn-products on a single Xeon socket; the workload is improved by `4.3x` with this patch.
The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential, and the original backward impl for `max|min` is not fused.
#### before:
```
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9
aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9
aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9
aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9
aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6
aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6
aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9
aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3
```
#### after:
```
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9
aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9
aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9
aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9
aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6
aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6
aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9
aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4
aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu
Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parsers. The old underscore arguments (`--command_arg_name`) are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in their arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they look to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
    def __init__(...):
        if option_string.startswith('--'):
            option_string = '--no-' + option_string[2:]
            _option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift (or Caps Lock) key, while `-` does not.
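A hedged sketch of how both spellings can be kept in `argparse` (the option name is the placeholder from above; the actual changes are spread across the individual parsers):
```python
import argparse

parser = argparse.ArgumentParser()
# Register the new dashed spelling plus the old underscored alias; argparse does
# not alias these automatically, so both option strings map to the same dest.
parser.add_argument("--command-arg-name", "--command_arg_name",
                    dest="command_arg_name")

assert parser.parse_args(["--command-arg-name", "1"]).command_arg_name == "1"
assert parser.parse_args(["--command_arg_name", "2"]).command_arg_name == "2"
```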
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
# Summary
Add more checks around shape constraints, and update sdp_utils to properly catch differing head_dims between q/k and v for flash attention, which is not supported.
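Not the actual sdp_utils code, just a hedged sketch of the constraint being checked; `same_head_dim` is a hypothetical helper and the `(batch, num_heads, seq_len, head_dim)` layout is assumed:
```python
import torch

def same_head_dim(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) -> bool:
    # Flash attention requires q, k, and v to share the last (head_dim) dimension.
    return query.size(-1) == key.size(-1) == value.size(-1)

q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 96)  # mismatched head_dim -> must not dispatch to flash
assert not same_head_dim(q, k, v)
```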
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94274
Approved by: https://github.com/cpuhrsch
While discussing a possible addition of `assert_not_close` to the API (see #90005 later in the stack), it became clear that we should have an intermediate function that returns a bool-ish value one can assert on. This PR introduces that function as `are_equal`, a replacement for `assert_equal`. The interface is the same, but instead of raising when a comparison fails, we return the `ErrorMeta`s of all failures and leave it to the caller to handle them. Note that this only applies to errors raised during the comparison stage. Everything else, e.g. only setting `atol` *or* `rtol`, will raise just as before.
We decided to keep this private for now unless there is user demand. The largest issue that needs to be solved before this can become public is the return type: with something like `torch.testing.are_close` we are targeting two use cases:
1. Using it to branch inside code like `if are_close(...):`
2. Using it to assert closeness inside a test like `assert are_close(...)`. This is the default way to assert something with `pytest`
To do that, the return type has to be bool-ish, i.e. an instance of `bool` or implementing `__bool__`. Plus, `bool(are_close())` needs to be `True` if the inputs are close and `False` otherwise. The current logic of `are_close` satisfies the former but violates the latter: in case everything is close, we return an empty list, but `bool([]) is False`.
Directly using an instance of `bool` would work for the requirements above, but then we would have no option to add diagnostics to the error. Meaning, `assert are_close()` would work but would be non-descriptive.
Using `Tuple[bool, str]` would work in general, but is quite dangerous and unexpected: since all non-empty tuples evaluate to `True`, this can easily hide bugs if the user is not super careful:
```pycon
>>> close = (False, "error message with diagnostics")
>>> assert close[0], close[1]
AssertionError: error message with diagnostics
>>> assert close  # passes silently: any non-empty tuple is truthy
```
One possible solution here would be a thin custom object:
```py
class Close:
    def __init__(self, flag: bool, msg: str = "") -> None:
        self._flag = flag
        self._msg = msg

    def __bool__(self):
        return self._flag

    def __str__(self):
        return self._msg
```
Now we can do something like
```pycon
>>> close = Close(False, "error message with diagnostics")  # coming from are_close
>>> if not close:
... print("It works!")
It works!
>>> assert close
AssertionError
>>> assert close, close # This looks weird, but does its job
AssertionError: error message with diagnostics
```
But this means we introduce another abstraction that the user has to deal with.
To reiterate, we are not going to make `are_close` public until there is user demand, since none of the options above is without flaws.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90004
Approved by: https://github.com/mruberry, https://github.com/malfet
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit tests and upgrades more yield-inside-for loops into `yield from` generators, which are more performant and propagate more information / exceptions from the original generator. This is the modern, recommended way of forwarding generators.
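An illustrative before/after of that rewrite (not taken from the diff):
```python
def items_old(iterable):
    # Before: manually re-yielding each element
    for x in iterable:
        yield x

def items_new(iterable):
    # After: `yield from` delegates to the sub-iterable, forwarding values,
    # exceptions, and .send()/.close() between caller and sub-generator
    yield from iterable

assert list(items_old([1, 2, 3])) == list(items_new([1, 2, 3])) == [1, 2, 3]
```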
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
Currently, the test
```
pytest test/distributed/test_multi_threaded_pg.py -vs
```
has errors
```
Traceback (most recent call last):
File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/private/home/howardhuang/pytorch-projects/pytorch/torch/testing/_internal/common_distributed.py", line 1029, in _run
self._tls.precision = TestCase._precision
AttributeError: 'TestCollectivesWithBaseClass' object has no attribute '_tls'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93883
Approved by: https://github.com/awgu, https://github.com/wanchaol
Fixes #92676
`arange` infers the output dtype from the argument types, but in order to reduce
falling back to ATen, inductor preferred to cast whole-number float arguments to
int, which gave the wrong output dtype. Instead, this decomposes floating-point
arange into the prim equivalent for integers.
This also changes the signature of `prims.arange` to
```python
prims.iota(length, *, start, step, **factory_kwargs)
```
which only supports integer arguments. This is done because calculating the
output size from `start, end, step` is surprisingly complex and liable to
off-by-one errors, so it should not be duplicated in each backend.
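A hedged Python sketch of the decomposition idea, not the actual prim implementation (`torch.arange` over integers stands in for `prims.iota`):
```python
import math
import torch

def arange_via_iota(start: float, end: float, step: float, dtype=torch.float32):
    # The length is computed once from start/end/step (the off-by-one-prone part);
    # the floating-point arange is then just an integer iota scaled and shifted.
    length = max(0, math.ceil((end - start) / step))
    iota = torch.arange(length, dtype=torch.int64)  # stand-in for prims.iota
    return (start + step * iota).to(dtype)

assert torch.equal(arange_via_iota(0.0, 1.0, 0.25), torch.arange(0.0, 1.0, 0.25))
```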
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93353
Approved by: https://github.com/ngimel, https://github.com/lezcano
**Overview**
- This PR refactors the `summon_full_params()` unit tests to prepare for `unshard_params()` by consolidating redundant tests and improving others.
- This PR enables `CPUOffload(offload_params=True)` + `NO_SHARD` + `writeback=True`.
- This PR provides an improved error message when calling `summon_full_params()` from an invalid context (i.e. from forward, from backward, or from inside another `summon_full_params()`). A usage sketch of the API follows this list.
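A hedged usage sketch of `summon_full_params()` under the current public API, assuming a `torchrun` launch with CUDA available (not taken from the PR's tests):
```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torchrun has set the rank/world-size env vars and a GPU is available.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = FSDP(torch.nn.Linear(8, 8).cuda())

# Expose the original (unflattened, unsharded) parameters on every rank;
# in-place edits persist only because writeback=True.
with FSDP.summon_full_params(model, recurse=True, writeback=True,
                             rank0_only=False, offload_to_cpu=False):
    with torch.no_grad():
        for p in model.parameters():
            p.add_(1.0)  # written back to the local shards on exit
```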
**Details**
<details>
<summary>Existing Unit Tests</summary>
`test_summon_full_param_writeback()` with `world_size=1`
`test_summon_full_param_writeback()` with `world_size=2`
- Tests that `writeback=True` persists write and that `writeback=False` does not persist write when modifying a root FSDP instance's `flat_param` (`modify_outer=True`) or a non-root FSDP instance's `flat_param` (`modify_outer=False`); additionally configures with `mixed_precision` and `use_orig_params`
- `CPUOffload(offload_params=True)` + `world_size=1` is not tested because it is not supported.
- The write inside `summon_full_params()` is on the `flat_param` itself, which is not the expected usage.
`test_summon_full_param_shard_value()`
- Tests that reconstructing the `flat_param` (by re-flattening and chunking parameters) inside `summon_full_params()` gives the same as the originally constructed `flat_param` when using a single FSDP instance
- This test seems to exercise the FSDP sharding algorithm, not the specification of `summon_full_params()`. The only relevant part being implicitly tested is that `model.parameters()` order is preserved.
- This test assumes the current FSDP sharding algorithm.
`test_summon_full_param_recursive()`
- Tests that `recurse=True` recursively applies to all FSDP instances and that `recurse=False` does not
- This test assumes the current FSDP sharding algorithm.
`test_cannot_summon_full_params_from_forward()`
`test_cannot_summon_full_params_from_backward()`
- Tests that calling `summon_full_params()` from inside the forward or backward raises an error
- The error message leaks `FlatParamHandle` to the user. I provided a better error in this PR.
`test_summon_full_params_respects_reshard_after_forward()`
- Tests that calling `summon_full_params()` after forward preserves whether the padded unsharded `flat_param` data is freed or not (like `reshard_after_forward`)
- This test depends on FSDP internals (`flat_param._full_param_padded.storage().size()`).
`test_summon_single_param()`
- Tests that writing to padding with `writeback=True` does not persist those writes (doing so by using a singleton `(1, 1)` parameter that gets flattened and padded to `(2,)`)
- This test name is misleading.
`test_summon_full_params_equivalence()`
- Tests `writeback`, `rank0_only`, and `offload_to_cpu` with `writeback=not rank0_only`, using `CPUOffload(offload_params=True)` and including a `torch.cuda._sleep(int(1e6))` _after_ the write in `summon_full_params()`
- The PR introducing this test said that the `torch.cuda._sleep(int(1e6))` exercised the stream synchronization in `summon_full_params()`--namely that the current stream waits for the all-gather stream after all-gathering the parameters. I did not follow conceptually how that works since the `torch.cuda._sleep()` call happens after both the all-gather and write and is in the default stream, which seems to be after the relevant ops. If we clarify this, I can re-incorporate this into the unit tests. Doing so is not a high priority since `summon_full_params()` unshards in the default stream now and does not require stream synchronization.
- This unit test has overlap with `test_summon_full_param_writeback()` and can be coalesced.
`test_summon_from_non_fsdp()`
- Tests calling `summon_full_params()` with default args on a non-FSDP root module exposes the original parameters correctly
- This test actually covers much of the specification since checking for original parameter equivalence includes shape, value, device, etc. checking.
`test_reshard_outside_forward_backward_iteration()`
- Tests that calling `summon_full_params()` after forward preserves whether the padded unsharded `flat_param` data is freed or not (like `reshard_after_forward`) and that calling `summon_full_params()` after backward preserves that the padded unsharded `flat_param` data are freed; additionally configures `mixed_precision`
- This test strictly dominates `test_summon_full_params_respects_reshard_after_forward()` in strictness since it includes the check after backward as well.
`test_params_are_unflattenned()`
- Tests that original parameters are exposed with the unflattened shape factoring in `rank0_only` (e.g. including that nonzero ranks reshard early when `rank0_only=True`) and that with `offload_to_cpu=True`, the `flat_param`s are moved back to GPU after exiting the context; additionally configures `mixed_precision`
`test_params_count_and_value()`
- Tests that original parameters are all exposed and with the correct values factoring in `rank0_only` (e.g. including that nonzero ranks do not expose the original parameters when `rank0_only=True`) and that with `offload_to_cpu=True`, the `flat_param`s are moved back to GPU after exiting the context; additionally configures `mixed_precision`
`test_raises_rank0_with_writeback()`
- Tests that `rank0_only` + `writeback=True` raises an error
`test_named_parameters_buffers()`
- Tests that `named_parameters()` and `named_buffers()` return clean names (without FSDP prefixes) inside `summon_full_params()`
`test_with_grads_core()`
- Tests `with_grads=True` by comparing against DDP
`test_with_grads_none_grads()`
- Tests `with_grads=True` when ranks' `FlatParameter`s have `None` gradient
</details>
<details>
<summary>New Unit Tests</summary>
`test_unshard_params_writeback_no_shard()` (with `world_size=1`)
`test_unshard_params_writeback()` (with `world_size=2`)
- Tests the `writeback` argument (using the default value for all others)
`test_unshard_params_param_data_no_shard()` (with `world_size=1`)
`test_unshard_params_param_data()` (with `world_size=2`)
- Tests that parameters are exposed correctly for `recurse=True` and all other argument configs for a non-FSDP root module
`test_unshard_singleton_param_writeback()`
- Tests `writeback=True` for a singleton parameter, which includes testing that writing to padding does not persist
`test_unshard_params_respects_reshard()`
- Tests that unsharding parameters respects the expected reshard behavior between forward and backward as well as after backward
`test_unshard_params_recurse()`
- Tests the `recurse` argument (using default for all others)
`test_offload_to_cpu_no_shard_raises()`
- Tests that `offload_to_cpu=True` with `NO_SHARD` raises an error
</details>
<details>
<summary>Summary of Unit Test Changes</summary>
- `test_summon_full_param_writeback` -> `test_unshard_params_writeback()`
- `test_summon_full_params_equivalence()`, `test_params_are_unflattenned()`, `test_params_count_and_value()` -> `test_unshard_params_param_data()`
- `test_summon_full_params_respects_reshard_after_forward()`, `test_reshard_outside_forward_backward_iteration()` -> `test_unshard_params_respects_reshard()`
- `test_summon_full_param_recursive()` -> `test_unshard_params_recurse()`
- `test_named_parameters_and_buffers()` unchanged
- `test_with_grads_core()` unchanged
- `test_with_grads_none_grads()` unchanged
- `test_cannot_summon_full_params_from_forward()`, `test_cannot_summon_full_params_from_backward()` -> `test_unshard_params_from_forward_raises()`, `test_unshard_params_from_backward_raises()`
- `test_raises_rank0_with_writeback()` -> `test_rank0_only_with_writeback_raises()`
- `test_offload_to_cpu_no_shard_raises()` new
- `test_summon_full_param_shard_value()` removed
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92298
Approved by: https://github.com/rohan-varma
# Summary
This PR creates _flash_attention_backward and _scaled_dot_product_flash_attention_backward native functions and registers them to the respective derivatives.yaml.
The goal is to replicate the `torch.autograd.Function` defined in the FlashAttention repo [here](33e0860c9c/flash_attn/flash_attn_interface.py (L126)) natively in PyTorch. One thing we don't have access to in native PyTorch is `ctx.save_for_backward`, so in order to save these variables I extended the objects returned from the forward functions.
### MetaFunctions
I also updated the FlashAttention meta functions to mirror the real outputs, and added a meta registration for the backward. I have an XLMR training script; while eager training now works with FlashAttention, compiling this module fails with the inductor error below.
### Questions?
Performance issues vs. mem-efficient attention when using torch.nn.mha_forward.
TorchCompile -> see the proposed solution below.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917
Approved by: https://github.com/cpuhrsch
Summary:
This PR adds support for LSTM structured pruning.
- Adds `LSTMSaliencyPruner`, an implemented pruner that splits the packed weights, finds the appropriate mask for each piece individually based on saliency, and then combines them to create an overall mask for the LSTM (a rough sketch of the saliency idea follows below).
- Adds pruning functions for LSTM pruning, which split the weights, apply the masks, and then recombine the pruned weights. This works for both single- and multiple-layer LSTMs.
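A hedged sketch of the saliency idea only; the real `LSTMSaliencyPruner` lives in the AO pruning code and differs in the details, and `lstm_gate_masks` is a hypothetical helper:
```python
import torch

def lstm_gate_masks(weight_ih: torch.Tensor, prune_ratio: float = 0.5) -> torch.Tensor:
    # weight_ih_l{k} packs the 4 LSTM gates as (4 * hidden_size, input_size).
    gates = torch.chunk(weight_ih, 4, dim=0)
    masks = []
    for g in gates:
        saliency = g.abs().sum(dim=1)             # per-row L1 saliency within this gate
        k = max(1, int(prune_ratio * saliency.numel()))
        threshold = saliency.kthvalue(k).values
        masks.append(saliency > threshold)        # keep only the most salient rows
    return torch.cat(masks)                       # recombine into one mask over all 4 gates

mask = lstm_gate_masks(torch.randn(4 * 16, 32))
assert mask.shape == (4 * 16,)
```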
Also added basic patterns to the default set of patterns for:
LSTM -> Linear pruning
LSTM -> LayerNorm -> Linear pruning
Adds tests to check that LSTM pruning works, as well as for LSTMSaliencyPruner.
Test Plan:
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_linear_single_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_linear_multiple_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_layernorm_linear_single_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_layernorm_linear_multiple_layer`
`python test/test_ao_sparsity.py -- TestSaliencyPruner.test_lstm_saliency_pruner_update_mask`
Differential Revision: [D42199001](https://our.internmc.facebook.com/intern/diff/D42199001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90801
Approved by: https://github.com/jerryzh168
Summary:
Changes the PNP test cases to use QNNPACK. The only reason is that
I'm switching to Mac M1 as my primary machine, which supports QNNPACK
but not fbgemm, and it's convenient for me to be able to run these
locally.
PNP itself is not backend specific, so it does not matter which backend
the functionality is tested on.
Test plan:
```
python test/test_quantization.py -k NShadows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91421
Approved by: https://github.com/jerryzh168
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `ConvAdd2d` module for the onednn backend, which will be used for int8 inference with the onednn backend. Calling this module with other quantization backends throws an error.
**Test plan**
```
python -m pytest test_quantization.py -k test_conv2d_add
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91152
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary: Handwritten out ops should have feature parity with the codegen'd ones. This means they should resize `out` to the appropriate size. Q1. Why are these handwritten instead of codegen'd anyway? Q2. Where's a good spot to put the resize and copy helpers, since they are reused in the codegen'd out kernels?
Test Plan: ci.
Differential Revision: D42177051
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91194
Approved by: https://github.com/ezyang