Commit Graph

3599 Commits

Author SHA1 Message Date
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2/3 compatibility libraries [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future), along with `torch._six`. We only support Python 3.8+ now, so it is time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Alexander Grund
a0d1dbc446 Fix pytest arguments when --save-xml is not passed (#94589)
The expression `argv + [f'--junit-xml-reruns={test_report_path}'] if TEST_SAVE_XML else []` evaluates to the empty list when `TEST_SAVE_XML` is false, because the conditional expression applies to the whole concatenation; it would need parentheses around the list literal.

Instead, simplify the code by appending the argument directly where `test_report_path` is set.
Note that `.append()` cannot be used, as that would modify `argv` and in turn `UNITTEST_ARGS`, which might have undesired side effects.

Without this patch, `pytest.main()` would be called with no arguments, which tries to discover all tests in the current working directory and ultimately leads to (many) failures.
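A minimal sketch of the precedence issue (illustrative values, not the actual test harness code):

```py
# Conditional expressions bind looser than `+`, so the whole concatenation
# is gated by TEST_SAVE_XML.
argv = ["-v"]
TEST_SAVE_XML = False
report = "report.xml"

wrong = argv + [f"--junit-xml-reruns={report}"] if TEST_SAVE_XML else []
right = argv + ([f"--junit-xml-reruns={report}"] if TEST_SAVE_XML else [])

print(wrong)  # [] -- argv is dropped entirely
print(right)  # ['-v']
```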

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94589
Approved by: https://github.com/clee2000, https://github.com/Neilblaze
2023-02-13 22:19:51 +00:00
mfkasim1
2acac8a83a Logcumsumexp for CUDA (build-time optimized) (#94310)
Hopefully fixes #89205.
This is another version of #90847, which was reverted because it increased compile time significantly.
From my discussion with @ngimel in https://github.com/pytorch/pytorch/pull/93153#issuecomment-1409051528, it seems the jiterator option would be very tricky, if not impossible.
So what I did was optimize the compile time on my machine.

To optimize the build time, I first compiled PyTorch as a whole, then changed only the `LogcumsumexpKernel.cu` file to see how that changes the compile time.
Here are the compilation times for just the `LogcumsumexpKernel.cu` file on my machine:

- Original version (without any complex implementations): 56s (about 1 minute)
- The previous PR (#90847): 13m 57s (about 14 minutes)
- This PR: 3m 35s (about 3.5 minutes)

If the previous PR increased the build time by 30 minutes on PyTorch's build machines, then this PR should reduce that increase to about 6 minutes. Hopefully this is an acceptable level of build-time increase.

What I did (sorted from the most significant build-time reduction to the least):

- Substituting `log(x)` with `log1p(x - 1)`. This is applied in the infinite case, so we don't really care about precision there.
- Implementing the complex exponential manually (a small numeric sketch of both tricks follows).
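For reference, a minimal Python sketch of the two tricks above; the identity and the manual complex exponential are standard math, not an excerpt from the CUDA kernel:

```py
import cmath
import math

# log(x) and log1p(x - 1) are mathematically identical; the substitution only
# changes which intrinsic the compiler has to instantiate.
x = 2.5
assert math.isclose(math.log(x), math.log1p(x - 1.0))

# "Manual" complex exponential: exp(a + bi) = exp(a) * (cos(b) + i*sin(b))
def complex_exp(z: complex) -> complex:
    r = math.exp(z.real)
    return complex(r * math.cos(z.imag), r * math.sin(z.imag))

z = 0.3 + 1.2j
assert cmath.isclose(complex_exp(z), cmath.exp(z))
```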

tag: @malfet, @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94310
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-13 16:00:52 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite calls to the Python built-in `super()`. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```
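For context, the core rewrite replaces the legacy two-argument form with the zero-argument form. A minimal Python sketch (not an excerpt from this PR):

```py
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        # Before the rewrite (legacy two-argument form):
        #   super(MyModule, self).__init__()
        # After the rewrite (zero-argument form, same semantics here):
        super().__init__()
```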

Some cases where the rewrite would change semantics are kept unchanged, e.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
Aaron Gokaslan
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into a direct `set` call.
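For illustration, the kind of rewrite these checks enforce (a hedged example, not taken from the PR diff):

```py
items = ["a", "b", "a"]

# Before: an unnecessary generator expression passed to set()
unique_old = set(x for x in items)

# After: a set comprehension
unique_new = {x for x in items}

assert unique_old == unique_new == {"a", "b"}
```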

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
haozhe.zhu
ed54a5d06b enable bf16 emb (#94163)
Merge https://github.com/pytorch/pytorch/pull/89199 and https://github.com/pytorch/pytorch/pull/91949 into one PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94163
Approved by: https://github.com/jianyuh, https://github.com/malfet, https://github.com/jgong5
2023-02-12 00:05:09 +00:00
Aaron Gokaslan
3d82d8d0ed [BE] Enable more flake8-comprehensions checks (#94601)
I applied some flake8 fixes and enabled checking for them in the linter. I also enabled some checks for my previous comprehensions PR.

This is a follow-up to #94323, where I enabled the flake8 checkers for the fixes I made and fixed a few more of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94601
Approved by: https://github.com/ezyang
2023-02-10 23:40:29 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite calls to the Python built-in `super()`. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases where the rewrite would change semantics are kept unchanged, e.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
mingfeima
c620ece726 port sparse_mm.reduce to pytorch and optimize it on CPU (#83727)
### Motivation of this PR

This patch migrates `spmm_reduce` from `torch-sparse` (a third-party dependency for PyG) to `torch`, in response to the initial proposal for fusing **Gather, Apply, Scatter** in message passing for GNN inference/training: https://github.com/pytorch/pytorch/issues/71300

**GAS** is the major step in message passing. Its behavior falls into two categories, depending on the storage type of `EdgeIndex`, which records the connections between nodes:

* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`

The reduce type can be chosen from: "sum", "mean", "max", "min".

`torch.sparse.mm` is extended with a `reduce` argument, which maps to `torch.sparse_mm.reduce` internally (a hedged reference sketch of the semantics follows this list).
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:
* `out` - the actual output
* `arg_out` - records the output indices among the non-zero elements when the reduce type is "max" or "min"; this is only useful for training, so it is not calculated for inference.
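A hedged pure-PyTorch reference of what `spmm_reduce` computes for a CSR matrix (a readability sketch, not the optimized kernel added by this PR):

```py
import torch

# For each output row i, reduce over the neighbors j with A[i, j] != 0:
#     out[i] = reduce_j( A[i, j] * B[j, :] )
def spmm_reduce_reference(A_csr: torch.Tensor, B: torch.Tensor, reduce: str) -> torch.Tensor:
    crow, col, val = A_csr.crow_indices(), A_csr.col_indices(), A_csr.values()
    out = torch.zeros(A_csr.size(0), B.size(1), dtype=B.dtype)
    for i in range(A_csr.size(0)):
        start, end = crow[i].item(), crow[i + 1].item()
        if start == end:
            continue
        rows = val[start:end, None] * B[col[start:end]]  # gather + apply
        if reduce == "sum":
            out[i] = rows.sum(dim=0)
        elif reduce == "mean":
            out[i] = rows.mean(dim=0)
        elif reduce == "max":
            out[i] = rows.max(dim=0).values
        elif reduce == "min":
            out[i] = rows.min(dim=0).values
    return out

A = torch.tensor([[0., 2., 0.], [1., 0., 3.]]).to_sparse_csr()
B = torch.randn(3, 4)
print(spmm_reduce_reference(A, B, "max").shape)  # torch.Size([2, 4])
```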

### Performance

Benchmarked on GCN for ogbn-products on a single Xeon socket, the workload is improved by `4.3x` with this patch.

The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential, and the original backward impl for `max|min` is not fused.

#### before:
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
       torch_sparse::spmm_sum        97.09%       56.086s        97.09%       56.088s        6.232s             9
                 aten::linear         0.00%      85.000us         1.38%     795.485ms      88.387ms             9
                 aten::matmul         0.00%      57.000us         1.38%     795.260ms      88.362ms             9
                     aten::mm         1.38%     795.201ms         1.38%     795.203ms      88.356ms             9
                   aten::relu         0.00%      50.000us         0.76%     440.434ms      73.406ms             6
              aten::clamp_min         0.76%     440.384ms         0.76%     440.384ms      73.397ms             6
                   aten::add_         0.57%     327.801ms         0.57%     327.801ms      36.422ms             9
            aten::log_softmax         0.00%      23.000us         0.10%      55.503ms      18.501ms             3
```

#### after:
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::spmm_sum        87.35%       11.826s        87.36%       11.827s        1.314s             9
                 aten::linear         0.00%      92.000us         5.87%     794.451ms      88.272ms             9
                 aten::matmul         0.00%      62.000us         5.87%     794.208ms      88.245ms             9
                     aten::mm         5.87%     794.143ms         5.87%     794.146ms      88.238ms             9
                   aten::relu         0.00%      53.000us         3.35%     452.977ms      75.496ms             6
              aten::clamp_min         3.35%     452.924ms         3.35%     452.924ms      75.487ms             6
                   aten::add_         2.58%     348.663ms         2.58%     348.663ms      38.740ms             9
                 aten::argmax         0.42%      57.473ms         0.42%      57.475ms      14.369ms             4
            aten::log_softmax         0.00%      22.000us         0.39%      52.605ms      17.535ms             3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu
2023-02-10 15:56:40 +00:00
Nicolas Hug
544c04f2df Add uint8 support for interpolate for CPU images (#90771)
Joint work with @vfdev-5

This PR introduces native uint8 support for `interpolate()`, for `bilinear` ~and `bicubic`~ modes for CPU images (`mode=nearest[_exact]` was already supported).

On a typical torchvision training job on ImageNet, the speedup is ~4X when AVX2 is supported, comparing the native uint8 path (this PR) against torchvision's current `Resize()`:

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms

(Note: we removed bicubic support for now)
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

```

There is still room for further speed-ups (see TODOs in the code).
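A hedged usage sketch of the new native path (on a build that includes this PR; previously one had to round-trip through float, as `Resize()` does):

```py
import torch
import torch.nn.functional as F

# A uint8 CPU image tensor can now be resized directly; the output stays uint8.
img = torch.randint(0, 256, (1, 3, 270, 268), dtype=torch.uint8)
out = F.interpolate(img, size=(224, 224), mode="bilinear",
                    align_corners=False, antialias=True)
print(out.dtype, out.shape)  # torch.uint8 torch.Size([1, 3, 224, 224])
```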

#### More benchmark details

With AVX2 support, speedups typically range from 1.5X to 10X. A few edge cases are slower; it is worth investigating why.

<details>

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   5X    1.1ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   12X   2.9ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   3X    0.8ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   7X    1.8ms vs 0.2ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   2.6X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   1.7X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   2.7X  0.7ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   1.8X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   4X    1.0ms vs 0.2ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   4X    2.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   3.0X  1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   3X    1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   4X    2.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   4X    2.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   7X    4.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   3X    2.1ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   4X    2.6ms vs 0.6ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   2.7X  1.6ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   2.6X  1.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   2.1X  1.2ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   2.8X  1.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   5X    2.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   2.3X  1.4ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   3X    1.9ms vs 0.6ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   4X    26.6ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   4X    23.9ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   2.5X  16.8ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   5X    33.1ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   4X    25.9ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   8X    59.6ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.9X  14.3ms vs 7.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   5X    35.4ms vs 7.3ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   2.0X  13.6ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   2.2X  14.8ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   1.3X  8.8ms vs 6.9ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.2X  8.4ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.8X  12.8ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   4X    32.1ms vs 7.2ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.4X  10.1ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.9X  20.9ms vs 7.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   2.1X  0.7ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   1.9X  0.6ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   1.0X  0.3ms vs 0.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.6X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.8X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.2X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   1.2X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   0.9X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   1.5X  1.0ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.2X  0.8ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   2.3X  1.5ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.9X  1.2ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.6X  1.2ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   4X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   2.4X  1.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   2.8X  1.8ms vs 0.6ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   2.1X  12.8ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.6X  3.8ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   1.2X  7.1ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.9X  11.0ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   2.0X  12.6ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  6.1ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   1.8X  11.3ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  4.6ms vs 6.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.6X  9.3ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.3X  2.0ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.2X  7.2ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.3X  1.6ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.1X  7.1ms vs 6.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   0.6X  3.3ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   0.9X  5.9ms vs 6.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.4X  2.4ms vs 5.9ms
```

</details>

Without AVX2 support there is no significant speed-up, but there are various possible improvements (see TODOs).

<details>

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   0.8X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   1.5X  1.7ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   0.9X  1.6ms vs 1.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   2.1X  3.9ms vs 1.9ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   0.8X  1.1ms vs 1.4ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   1.7X  2.4ms vs 1.5ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   0.9X  0.5ms vs 0.6ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   0.7X  0.5ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   0.9X  0.9ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   2.1X  2.0ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   0.8X  0.6ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   1.7X  1.3ms vs 0.8ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   1.0X  3.0ms vs 3.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   1.0X  2.8ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   1.0X  2.3ms vs 2.2ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   1.4X  3.3ms vs 2.3ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   1.0X  3.5ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   1.7X  6.1ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   0.9X  2.6ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   1.4X  4.2ms vs 2.9ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   1.0X  1.7ms vs 1.7ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   0.9X  1.6ms vs 1.8ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   0.9X  1.3ms vs 1.4ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   0.7X  1.1ms vs 1.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   1.0X  2.0ms vs 2.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   1.7X  3.2ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   0.8X  1.5ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   1.2X  2.3ms vs 1.9ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   1.1X  34.7ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   1.0X  31.2ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   1.0X  23.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   1.9X  42.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   0.9X  33.9ms vs 37.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   2.2X  84.0ms vs 37.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.0X  28.4ms vs 28.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   2.0X  56.7ms vs 28.8ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   1.1X  17.5ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   1.1X  17.7ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   0.8X  8.8ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.0X  11.1ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.1X  19.9ms vs 18.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   2.3X  42.5ms vs 18.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.0X  14.1ms vs 14.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.0X  28.4ms vs 14.5ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.0X  0.6ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.3ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   0.9X  0.5ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.7X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   1.0X  0.8ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.1X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   0.9X  0.7ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   0.9X  0.4ms vs 0.4ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.8X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.9X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.3X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   0.9X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   1.2X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   0.8X  2.1ms vs 2.5ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   0.7X  1.6ms vs 2.4ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   1.2X  2.4ms vs 2.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   1.3X  2.6ms vs 2.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   1.1X  3.4ms vs 3.0ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   1.7X  4.8ms vs 2.8ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   1.1X  2.9ms vs 2.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   1.4X  3.5ms vs 2.4ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   0.9X  1.2ms vs 1.3ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.3X  1.6ms vs 1.2ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   0.8X  0.9ms vs 1.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.3X  1.3ms vs 1.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.4X  2.2ms vs 1.6ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   1.9X  2.8ms vs 1.5ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   0.8X  1.1ms vs 1.4ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   1.7X  2.1ms vs 1.3ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   1.0X  10.0ms vs 9.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.7X  4.6ms vs 6.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   0.9X  9.1ms vs 9.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.7X  9.4ms vs 5.7ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   1.0X  15.2ms vs 14.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  7.6ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   0.9X  13.3ms vs 14.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  5.9ms vs 7.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.2X  6.0ms vs 5.2ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.7X  2.3ms vs 3.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.0X  4.8ms vs 5.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.7X  1.9ms vs 2.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.6X  12.3ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   1.0X  3.9ms vs 3.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   1.0X  7.0ms vs 7.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.9X  3.0ms vs 3.5ms

```

</details>

Benchmark code
<details>

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', antialias=False, dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=torch.uint8, device='cpu')

        if channels_last:
            input_image = input_image.contiguous(memory_format=torch.channels_last)

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "antialias": antialias,
            "dtype":dtype,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, antialias, dtype):
        if dtype == torch.float:
            input_image = input_image.float()

        out = torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=False, antialias=antialias)
        if dtype == torch.float:
            out = out.round().clamp(min=0, max=256).to(torch.uint8)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((270, 268), (224, 224)),
        ((256, 256), (1024, 1024)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        # attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        # attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True, False],
            'mode': ["bilinear", "bicubic"],
            'antialias': [True, False],
            # 'dtype': [torch.float, torch.uint8]
            # 'dtype': [torch.uint8]
            'dtype': [torch.float]
        },
        tags=["short"],
    )

    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()

```

```py
import re
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("f1", nargs="?", default="main")
parser.add_argument("f2", nargs="?", default="new")
args = parser.parse_args()

with open(args.f1) as f:
    main = f.readlines()
with open(args.f2) as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    # num_threads=1  # TODO: remove
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("antialias=", "")
        deets = deets.replace("channels_last=", "")
        # deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")

        # size = ','.join(split[:-3])
        # mode, dtype, threads = split[-3:]
        # deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        size = ','.join(split[:-5])
        channels_last, mode, antialias, dtype, threads= split[-5:]
        deets = f"{size:<33} {channels_last:<7} {antialias:<7} {mode:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall("\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        # assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 8 == 0:
        print()
    # if i % 10 == 0 and i % 40 != 0:
    #     print()
    # if i % 40 == 0:
    #     print("-" * 100)
    print(l)

```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90771
Approved by: https://github.com/peterbell10, https://github.com/ngimel
2023-02-10 01:43:54 +00:00
Jeff Daily
66bfcd32fd [ROCm] Remove PYTORCH_MIOPEN_SUGGEST_NHWC flag (#90725)
Fixes #64427. MIOpen supports ChannelsLast, so there is no longer a need to opt in with the env var.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90725
Approved by: https://github.com/malfet
2023-02-09 22:26:24 +00:00
Howard Huang
f45c196653 Update backend config to be under _World (#94191)
All the c10d process group state is under `_World`, so this is BE work to include a missing map.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94191
Approved by: https://github.com/kumpera
2023-02-09 20:48:42 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parser; the old underscore arguments (`--command_arg_name`) are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in their arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift key, unlike `-`.
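A hedged `argparse` sketch of supporting both spellings during the transition (illustrative; not the exact PyTorch parser code, and `--master-port` is just an example option):

```py
import argparse

# Listing both option strings makes the dashed form canonical while the
# underscore form keeps working for backward compatibility.
parser = argparse.ArgumentParser()
parser.add_argument("--master-port", "--master_port", dest="master_port",
                    type=int, default=29500)

print(parser.parse_args(["--master-port", "1234"]).master_port)  # 1234
print(parser.parse_args(["--master_port", "5678"]).master_port)  # 5678
```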

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
min-jean-cho
81853354c3 added aten.log_normal_ decomp (#91674)
Fixes #91275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91674
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2023-02-09 18:34:25 +00:00
Driss Guessous
81bbee7d7e [SDPA] Adds basic correctness checks (#94274)
# Summary
Add more checks around shape constraints, and update sdp_utils to properly catch mismatched head_dims between q/k and v for flash_attention, which is not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94274
Approved by: https://github.com/cpuhrsch
2023-02-09 08:05:26 +00:00
min-jean-cho
92f569fe11 [Inductor] added aten.geometric_ decomp (#91672)
Fixes #91671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91672
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2023-02-09 07:29:14 +00:00
CaoE
c82bb28759 Update autocast policy list on CPU (#92527)
Update autocast policy list on CPU. It depends on #92530.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92527
Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet
2023-02-09 06:40:56 +00:00
Aaron Gokaslan
1e2d82b8e4 [BE] Merge isinstance calls together (#94419)
Simplifies and speeds up `isinstance` calls by checking for multiple types at the same time.
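For illustration, the kind of merge this applies (a hedged example, not taken from the PR diff):

```py
def is_number(x):
    # Before: two separate isinstance calls
    #   return isinstance(x, int) or isinstance(x, float)
    # After: one call with a tuple of types
    return isinstance(x, (int, float))

assert is_number(3) and is_number(2.5) and not is_number("3")
```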

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94419
Approved by: https://github.com/ezyang
2023-02-09 00:47:26 +00:00
min-jean-cho
66ae3aa096 [Inductor] added aten.cauchy_ decomp (#92047)
Fixes #91675

TODO: compare the perf of the decomposed tan vs. libdevice tan and aten tan for the triton and cpp backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92047
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano, https://github.com/ngimel
2023-02-09 00:02:56 +00:00
lezcano
5a7c1b7894 [decompositions] LSTM with packed input (#91465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91465
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
lezcano
bef61225c3 [decompositions] add decomposition for RNN with packed sequence (#91281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91281
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
lezcano
20d01d2dc9 [expanded weights] add RNN support via decomp (#91807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91807
Approved by: https://github.com/albanD
2023-02-08 14:16:30 +00:00
lezcano
c2a92687e0 [decompositions] add RNN decomp and testing (#91123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91123
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
Philip Meier
6f543e0d0a add not_close_error_metas for internal comparison machinery (#90004)
While discussing a possible addition of `assert_not_close` to the API (see #90005 later in the stack), it became clear that we should have an intermediate function that returns a bool-ish value that one can assert on. This PR introduces this function as `are_equal`, a replacement for `assert_equal`. The interface is the same, but instead of raising when a comparison fails, we return the `ErrorMeta`s of all failures and leave it to the caller to handle them. Note that this only applies to errors raised during the comparison stage. Everything else, e.g. setting only `atol` *or* `rtol`, will raise just as before.

We decided to keep this private for now unless there is user demand. The largest issue that needs to be solved before this can become public is the return type: if we have something like `torch.testing.are_close`, we are targeting two use cases:

1. Using it to branch inside code like `if are_close(...):`
2. Using it to assert closeness inside a test like `assert are_close(...)`. This is the default way to assert something with `pytest`.

To do that, the return type has to be bool-ish, i.e. an instance of `bool` or something implementing `__bool__`. Plus, `bool(are_close())` needs to be `True` if the inputs are close and `False` otherwise. The current logic of `are_close` satisfies the former, but violates the latter: in case everything is close, we return an empty list, but `bool([]) is False`.

Directly using an instance of `bool` would work for the requirements above, but then we would have no way to attach diagnostics to the error. Meaning `assert are_close()` would work, but would be non-descriptive.

Using `Tuple[bool, str]` would work in general, but is quite dangerous and unexpected: since all non-empty tuples evaluate to `True`, this can easily hide bugs if the user is not super careful:

```pycon
>>> close = (False, "error message with diagnostics")
>>> assert close[0], close[1]
AssertionError: error message with diagnostics
>>> assert close  # passes silently: a non-empty tuple is truthy
```

One possible solution here would be a thin custom object:

```py
class Close:
    def __init__(self, flag: bool, msg: str = "") -> None:
        self._flag = flag
        self._msg = msg

    def __bool__(self):
        return self._flag

    def __str__(self):
        return self._msg
```

Now we can do something like

```pycon
>>> close = Close(False, "error message with diagnostics")  # coming from are_close
>>> if not close:
...     print("It works!")
It works!
>>> assert close
AssertionError
>>> assert close, close  # This looks weird, but does its job
AssertionError: error message with diagnostics
```

But this means we introduce another abstraction that the user has to deal with.

To reiterate, we are not going to make `are_close` public until there is user demand, since none of the options above is without flaws.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90004
Approved by: https://github.com/mruberry, https://github.com/malfet
2023-02-08 11:22:55 +00:00
Philip Meier
566eb49ed2 minor internal cleanup in assert_close (#90003)
Per title. I'm going to highlight them with inline comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90003
Approved by: https://github.com/mruberry, https://github.com/malfet
2023-02-08 11:22:55 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: it removes the need to inherit from `object` and removes unused `__future__` imports.
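For illustration, the kind of change pyupgrade makes here (a hedged example with a hypothetical class name, not from the PR diff):

```py
# Before (Python 2 style), shown as comments:
#   from __future__ import print_function
#
#   class Foo(object):
#       pass

# After pyupgrade (Python 3 style): implicit object inheritance, no __future__ import.
class Foo:
    pass
```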

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
Aaron Gokaslan
748bac8757 [BE]: Apply pyupgrade yield from and unit test alias upgrades (#94309)
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit tests and upgrades `yield`-in-a-for-loop patterns into `yield from`, which is more performant and propagates more information/exceptions from the original generator. This is the modern, recommended way of forwarding generators.
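For illustration, the `yield from` upgrade (a hedged example, not taken from the PR diff):

```py
# Before: manually re-yielding each item from the sub-iterables
def chain_old(*iterables):
    for it in iterables:
        for item in it:
            yield item

# After pyupgrade: delegate with `yield from`, which also forwards
# send()/throw()/close() and the sub-generator's return value.
def chain_new(*iterables):
    for it in iterables:
        yield from it

assert list(chain_new([1, 2], (3,))) == list(chain_old([1, 2], (3,))) == [1, 2, 3]
```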
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
2023-02-07 20:08:58 +00:00
Natalia Gimelshein
7bba87ed06 add rsub decomposition with alpha (#94144)
Fixes #93376
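For context, a hedged sketch of what an rsub decomposition with `alpha` computes (`rsub(input, other, alpha)` is `other - alpha * input`; this is an illustration, not the PR's decomposition code):

```py
import torch

def rsub_decomp(input, other, alpha=1):
    # rsub reverses the operands of sub and scales `input` by alpha.
    return other - alpha * input

a, b = torch.randn(3), torch.randn(3)
torch.testing.assert_close(rsub_decomp(a, b, alpha=2.0),
                           torch.rsub(a, b, alpha=2.0))
```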

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94144
Approved by: https://github.com/desertfire
2023-02-07 17:21:13 +00:00
albanD
0b2dc3b3ac [Py-3.11] Skip dynamo related tests (#94187)
The quantization test fails to import Dynamo, as expected.
The traceback tool looks a lot trickier; opened https://github.com/pytorch/pytorch/issues/94189 to investigate further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94187
Approved by: https://github.com/malfet
2023-02-07 16:40:55 +00:00
Vasiliy Kuznetsov
f15ab8a7f2 AO migration: replace torch internal callsites (#94170)
Summary:

Do the following renames:
`torch.quantization` -> `torch.ao.quantization`
`torch.nn.quantized` -> `torch.ao.nn.quantized`
`torch.nn.quantizable` -> `torch.ao.nn.quantizable`
`torch.nn.qat` -> `torch.ao.nn.qat`
`torch.nn.intrinsic` -> `torch.ao.nn.intrinsic`

And then, do
`torch.ao.nn.quantized._reference` -> `torch.ao.nn.quantized.reference` to clean up the aftermath of https://github.com/pytorch/pytorch/pull/84974

Then, manually update `test/test_module_init.py` to fix hanging whitespace due to the replace.

Run this script to do the replacements: https://gist.github.com/vkuzo/7f7afebf8c31b9ba48306223e68a1c82
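For illustration, the user-facing shape of the rename (a hedged example; `get_default_qconfig` is just a representative symbol and is not necessarily touched by this PR):

```py
# Old (pre-migration) call sites:
#   from torch.quantization import get_default_qconfig
# New call sites after the AO migration:
from torch.ao.quantization import get_default_qconfig

qconfig = get_default_qconfig("fbgemm")
```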

This is for https://github.com/pytorch/pytorch/issues/81667

Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94170
Approved by: https://github.com/jerryzh168
2023-02-07 02:32:23 +00:00
PyTorch MergeBot
53e4fe076a Revert "enable bf16 emb (#94163)"
This reverts commit f3bf46e801.

Reverted https://github.com/pytorch/pytorch/pull/94163 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I suspect that it causes flaky SIGSEGV failure for linux-bionic-py3.8-clang9 / test (crossref) job in trunk.  For example, 05397b1250
2023-02-07 00:32:22 +00:00
albanD
496c0a207b Make segment_reduce properly private. (#93166)
I am attempting not to change the aten function, to reduce the amount of BC issues on the TorchScript side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93166
Approved by: https://github.com/ngimel
2023-02-06 18:32:23 +00:00
mingfeima
26cba842ad Optimize ConvTransposed2D with mkldnn float32 and bfloat16 on CPU (#92530)
This PR optimizes `ConvTranspose2d` with oneDNN and adds channels-last support for it. The fallback path `slow_conv_transpose2d` also gets channels-last support, so the memory-format propagation behavior stays the same with or without oneDNN.
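A hedged usage sketch of the channels-last path this enables on CPU (sizes borrowed from the benchmark configs below; the final check reflects the expected memory-format propagation):

```py
import torch
import torch.nn as nn

m = nn.ConvTranspose2d(32, 32, kernel_size=3).to(memory_format=torch.channels_last)
x = torch.randn(32, 32, 100, 100).contiguous(memory_format=torch.channels_last)
y = m(x)
print(y.is_contiguous(memory_format=torch.channels_last))  # expected: True
```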

Replacement of https://github.com/pytorch/pytorch/pull/77060, https://github.com/pytorch/pytorch/pull/70897 and https://github.com/pytorch/pytorch/pull/74023 which enables oneDNN for `ConvTranspose2d` and `ConvTranspose3d`

The following results were collected on a Skylake Xeon 8180, dual socket, 28 cores per socket.
### single core channels last

configs | forward before/ms | forward after/ms | ratio | backward   before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 181.36 | 91.16 | 1.99 | 531.38 | 124.08 | 4.28
input size:   (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 324.35 | 153.50 | 2.11 | 973.16 | 185.97 | 5.23
input size:   (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 1086.82 | 671.52 | 1.62 | 3008.94 | 1453.33 | 2.07

### single core channels first

configs | forward before/ms | forward after/ms | ratio | backward   before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 138.10 | 5.94 | 23.23 | 37.97 | 11.25 | 3.38
input size:   (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 236.43 | 8.75 | 27.03 | 87.77 | 18.58 | 4.72
input size:   (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 484.39 | 37.69 | 12.85 | 185.40 | 90.57 | 2.05

### single socket channels last

configs | forward before/ms | forward after/ms | ratio | backward   before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 138.10 | 5.94 | 23.23 | 37.97 | 11.25 | 3.38
input size:   (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 236.43 | 8.75 | 27.03 | 87.77 | 18.58 | 4.72
input size:   (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 484.39 | 37.69 | 12.85 | 185.40 | 90.57 | 2.0

### single socket channels first

configs | forward before/ms | forward after/ms | ratio | backward   before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100,   100), weight size: (32, 32, 3, 3) | 132.56 | 7.19 | 18.43 | 31.43 | 11.20 | 2.81
input size:   (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 227.94 | 13.33 | 17.11 | 63.00 | 23.41 | 2.69
input size:   (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 473.68 | 52.79 | 8.97 | 150.40 | 87.33 | 1.72

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92530
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-02-06 10:11:25 +00:00
haozhe.zhu
f3bf46e801 enable bf16 emb (#94163)
Merge https://github.com/pytorch/pytorch/pull/89199 and https://github.com/pytorch/pytorch/pull/91949 into one PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94163
Approved by: https://github.com/jianyuh, https://github.com/malfet, https://github.com/jgong5
2023-02-06 07:11:40 +00:00
Howard Huang
5c7f4534e9 [small] multithreaded-pg guard attr (#93883)
currently the test
```
pytest test/distributed/test_multi_threaded_pg.py -vs
```

has errors

```
Traceback (most recent call last):
  File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/private/home/howardhuang/pytorch-projects/pytorch/torch/testing/_internal/common_distributed.py", line 1029, in _run
    self._tls.precision = TestCase._precision
AttributeError: 'TestCollectivesWithBaseClass' object has no attribute '_tls'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93883
Approved by: https://github.com/awgu, https://github.com/wanchaol
2023-02-03 23:01:02 +00:00
albanD
5be57d51f9 Fix testing now that random.sample() arg must be a sequence (#94052)
This is only enforced in Python 3.11, but the change is not bad for other versions either (and this is test code, so perf is not a concern).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94052
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-03 21:28:02 +00:00
Peter Bell
5817695bfa [pt2] Fix arange to match ATen behavior (#93353)
Fixes #92676

`arange` infers the output dtype from the argument types, but in order to reduce falling back to ATen, inductor preferred to cast whole-number float arguments to int, which gave the wrong output dtype. Instead, this decomposes floating-point arange into the prim equivalent for integers.

This also changes the signature of `prims.arange` to

```python
prims.iota(length, *, start, step, **factory_kwargs)
```

which only supports integer arguments. This is done because calculating the output size from `start, end, step` is surprisingly complex and liable to off-by-one errors, so it should not be duplicated in each backend (a sketch of that calculation follows).
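A hedged sketch of the length calculation the PR centralizes (integer arange only; illustrative, not the actual prims code):

```py
def arange_length(start: int, end: int, step: int) -> int:
    if step == 0:
        raise ValueError("step must be nonzero")
    # Ceiling division that works for both positive and negative steps.
    return max(-(-(end - start) // step), 0)

assert arange_length(0, 10, 3) == len(range(0, 10, 3))    # 4 -> 0, 3, 6, 9
assert arange_length(10, 0, -3) == len(range(10, 0, -3))  # 4 -> 10, 7, 4, 1
assert arange_length(5, 5, 1) == 0
```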

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93353
Approved by: https://github.com/ngimel, https://github.com/lezcano
2023-02-03 00:44:32 +00:00
Andrew Gu
481a334b7a [FSDP][3/N] Refactor summon_full_params unit tests (#92298)
**Overview**
- This PR refactors the `summon_full_params()` unit tests to prepare for `unshard_params()` by consolidating redundant tests and improving others.
- This PR enables `CPUOffload(offload_params=True)` + `NO_SHARD` + `writeback=True`.
- This PR provides an improved error message when calling `summon_full_params()` from an invalid context (i.e. from forward, backward, or in `summon_full_params()`).

**Details**
<details>
<summary>Existing Unit Tests</summary>

`test_summon_full_param_writeback()` with `world_size=1`
`test_summon_full_param_writeback()` with `world_size=2`
- Tests that `writeback=True` persists write and that `writeback=False` does not persist write when modifying a root FSDP instance's `flat_param` (`modify_outer=True`) or a non-root FSDP instance's `flat_param` (`modify_outer=False`); additionally configures with `mixed_precision` and `use_orig_params`
- `CPUOffload(offload_params=True)` + `world_size=1` is not tested because it is not supported.
- The write inside `summon_full_params()` is on the `flat_param` itself, which is not the expected usage.

`test_summon_full_param_shard_value()`
- Tests that reconstructing the `flat_param` (by re-flattening and chunking parameters) inside `summon_full_params()` gives the same as the originally constructed `flat_param` when using a single FSDP instance
- This test seems to exercise the FSDP sharding algorithm, not the specification of `summon_full_params()`. The only relevant part being implicitly tested is that `model.parameters()` order is preserved.
- This test assumes the current FSDP sharding algorithm.

`test_summon_full_param_recursive()`
- Tests that `recurse=True` recursively applies to all FSDP instances and that `recurse=False` does not
- This test assumes the current FSDP sharding algorithm.

`test_cannot_summon_full_params_from_forward()`
`test_cannot_summon_full_params_from_backward()`
- Tests that calling `summon_full_params()` from inside the forward or backward raises an error
- The error message leaks `FlatParamHandle` to the user. I provided a better error in this PR.

`test_summon_full_params_respects_reshard_after_forward()`
- Tests that calling `summon_full_params()` after forward preserves whether the padded unsharded `flat_param` data is freed or not (like `reshard_after_forward`)
- This test depends on FSDP internals (`flat_param._full_param_padded.storage().size()`).

`test_summon_single_param()`
- Tests that writing to padding with `writeback=True` does not persist those writes (doing so by using a singleton `(1, 1)` parameter that gets flattened and padded to `(2,)`)
- This test name is misleading.

`test_summon_full_params_equivalence()`
- Tests `writeback`, `rank0_only`, and `offload_to_cpu` with `writeback=not rank0_only`, using `CPUOffload(offload_params=True)` and including a `torch.cuda._sleep(int(1e6))` _after_ the write in `summon_full_params()`
- The PR introducing this test said that the `torch.cuda._sleep(int(1e6))` exercised the stream synchronization in `summon_full_params()`--namely that the current stream waits for the all-gather stream after all-gathering the parameters. I did not follow conceptually how that works since the `torch.cuda._sleep()` call happens after both the all-gather and write and is in the default stream, which seems to be after the relevant ops. If we clarify this, I can re-incorporate this into the unit tests. Doing so is not a high priority since `summon_full_params()` unshards in the default stream now and does not require stream synchronization.
- This unit test has overlap with `test_summon_full_param_writeback()` and can be coalesced.

`test_summon_from_non_fsdp()`
- Tests that calling `summon_full_params()` with default args on a non-FSDP root module exposes the original parameters correctly
- This test actually covers much of the specification since checking for original-parameter equivalence includes checking shape, value, device, etc.

`test_reshard_outside_forward_backward_iteration()`
- Tests that calling `summon_full_params()` after forward preserves whether the padded unsharded `flat_param` data is freed or not (like `reshard_after_forward`) and that calling `summon_full_params()` after backward preserves that the padded unsharded `flat_param` data is freed; additionally configures `mixed_precision`
- This test strictly dominates `test_summon_full_params_respects_reshard_after_forward()` since it also includes the check after backward.

`test_params_are_unflattenned()`
- Tests that original parameters are exposed with the unflattened shape factoring in `rank0_only` (e.g. including that nonzero ranks reshard early when `rank0_only=True`) and that with `offload_to_cpu=True`, the `flat_param`s are moved back to GPU after exiting the context; additionally configures `mixed_precision`

`test_params_count_and_value()`
- Tests that original parameters are all exposed and with the correct values factoring in `rank0_only` (e.g. including that nonzero ranks do not expose the original parameters when `rank0_only=True`) and that with `offload_to_cpu=True`, the `flat_param`s are moved back to GPU after exiting the context; additionally configures `mixed_precision`

`test_raises_rank0_with_writeback()`
- Tests that `rank0_only` + `writeback=True` raises an error

`test_named_parameters_buffers()`
- Tests that `named_parameters()` and `named_buffers()` return clean names (without FSDP prefixes) inside `summon_full_params()`

`test_with_grads_core()`
- Tests `with_grads=True` by comparing against DDP

`test_with_grads_none_grads()`
- Tests `with_grads=True` when ranks' `FlatParameter`s have `None` gradient

</details>

<details>
<summary>New Unit Tests</summary>

`test_unshard_params_writeback_no_shard()` (with `world_size=1`)
`test_unshard_params_writeback()` (with `world_size=2`)
- Tests the `writeback` argument (using the default value for all others)

`test_unshard_params_param_data_no_shard()` (with `world_size=1`)
`test_unshard_params_param_data()` (with `world_size=2`)
- Tests that parameters are exposed correctly for `recurse=True` and all other argument configs for a non-FSDP root module

`test_unshard_singleton_param_writeback()`
- Tests `writeback=True` for a singleton parameter, which includes testing that writing to padding does not persist

`test_unshard_params_respects_reshard()`
- Tests that unsharding parameters respects the expected reshard behavior between forward and backward as well as after backward

`test_unshard_params_recurse()`
- Tests the `recurse` argument (using default for all others)

`test_offload_to_cpu_no_shard_raises()`
- Tests that `offload_to_cpu=True` with `NO_SHARD` raises an error

</details>

<details>
<summary>Summary of Unit Test Changes</summary>

- `test_summon_full_param_writeback()` -> `test_unshard_params_writeback()`
- `test_summon_full_params_equivalence()`, `test_params_are_unflattenned()`, `test_params_count_and_value()` -> `test_unshard_params_param_data()`
- `test_summon_full_params_respects_reshard_after_forward()`, `test_reshard_outside_forward_backward_iteration()` -> `test_unshard_params_respects_reshard()`
- `test_summon_full_param_recursive()` -> `test_unshard_params_recurse()`
- `test_named_parameters_buffers()` unchanged
- `test_with_grads_core()` unchanged
- `test_with_grads_none_grads()` unchanged
- `test_cannot_summon_full_params_from_forward()`, `test_cannot_summon_full_params_from_backward()` -> `test_unshard_params_from_forward_raises()`, `test_unshard_params_from_backward_raises()`
- `test_raises_rank0_with_writeback()` -> `test_rank0_only_with_writeback_raises()`
- `test_offload_to_cpu_no_shard_raises()` new
- `test_summon_full_param_shard_value()` removed

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92298
Approved by: https://github.com/rohan-varma
2023-02-02 15:10:14 +00:00
Xilun Wu
966030f7c7 [DTensor][fix] MultiThreadedTestCase misses the _tls object and this is not reflected in CI (#93832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93832
Approved by: https://github.com/wanchaol
2023-02-02 07:56:44 +00:00
Driss Guessous
653dc73df0 [SDPA] Wire up FlashAttention's backward (#92917)
# Summary
This PR creates _flash_attention_backward and _scaled_dot_product_flash_attention_backward native functions and registers them in the respective derivatives.yaml.

The goal is to replicate the torch.autograd.Function defined in the FlashAttention repo [here](33e0860c9c/flash_attn/flash_attn_interface.py (L126)) natively in PyTorch. One thing we don't have access to in native PyTorch is ctx.save_for_backward, so in order to save these variables I extended the objects returned from the forward functions.
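As a rough illustration of that pattern (a pure-Python stand-in using naive attention math, not the FlashAttention kernels or the code in this PR), the key step is `ctx.save_for_backward`, which the native functions emulate by returning the extra tensors from forward:

```python
import torch

class NaiveAttnFunc(torch.autograd.Function):
    # Illustrative only: real FlashAttention does this in fused CUDA kernels.
    @staticmethod
    def forward(ctx, q, k, v):
        scale = q.shape[-1] ** -0.5
        attn = (q @ k.transpose(-2, -1) * scale).softmax(dim=-1)
        out = attn @ v
        # The step with no direct native-function analogue; the PR instead
        # extends the forward op's return values to carry these tensors.
        ctx.save_for_backward(q, k, v, attn)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, k, v, attn = ctx.saved_tensors
        scale = q.shape[-1] ** -0.5
        grad_v = attn.transpose(-2, -1) @ grad_out
        grad_attn = grad_out @ v.transpose(-2, -1)
        # softmax backward along the last dim
        grad_scores = attn * (grad_attn - (grad_attn * attn).sum(dim=-1, keepdim=True))
        grad_q = grad_scores @ k * scale
        grad_k = grad_scores.transpose(-2, -1) @ q * scale
        return grad_q, grad_k, grad_v
```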

### MetaFunctions
I also updated the FlashAttention meta functions to mirror the real outputs, and I added a meta registration for the backward. I have an XLMR training script; eager training now works with FlashAttention, but compiling this module fails with the inductor error below.

### Questions?
Performance issues vs. the memory-efficient kernel when using torch.nn.mha_forward

TorchCompile -> see the proposed solution below.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917
Approved by: https://github.com/cpuhrsch
2023-02-02 04:02:30 +00:00
Jesse Cai
86ab4d49d4 [pruning][core][feature] LSTM Structured Pruning prune_functions + pattern (#90801)
Summary:

This PR adds in support for LSTM Structured Pruning.

- Adds in LSTMSaliencyPruner, an implemented pruner that splits the packed weights, finds the appropriate mask for each piece individually based on saliency, and then combines them into an overall mask for the LSTM.
- Adds in pruning functions for LSTM pruning, which split the weights, apply the masks, and then recombine the pruned weights. Works for both single- and multi-layer LSTMs.
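A rough sketch of the saliency idea on the packed weights (illustrative only, not the LSTMSaliencyPruner implementation; the 50% ratio is an arbitrary choice for the example):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=4, num_layers=1)

# weight_ih_l0 has shape (4 * hidden_size, input_size): the i, f, g, o gate
# weights stacked along dim 0, so split into one chunk per gate.
chunks = torch.chunk(lstm.weight_ih_l0.detach(), 4, dim=0)

masks = []
for chunk in chunks:
    saliency = chunk.norm(dim=1)                 # one score per output row
    keep = torch.zeros_like(saliency, dtype=torch.bool)
    num_keep = saliency.numel() // 2             # keep the most salient half
    keep[saliency.topk(num_keep).indices] = True
    masks.append(keep)

# Recombine into a single mask over all 4 * hidden_size output rows.
overall_mask = torch.cat(masks)
print(overall_mask.shape)  # torch.Size([16])
```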

Also added a basic pattern to the default set of patterns for
LSTM -> Linear pruning
LSTM -> LayerNorm -> Linear pruning

Adds in tests to check that LSTM pruning works, as well as tests for LSTMSaliencyPruner

Test Plan:
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_linear_single_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_linear_multiple_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_layernorm_linear_single_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_layernorm_linear_multiple_layer`
`python test/test_ao_sparsity.py -- TestSaliencyPruner.test_lstm_saliency_pruner_update_mask`

Differential Revision: [D42199001](https://our.internmc.facebook.com/intern/diff/D42199001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90801
Approved by: https://github.com/jerryzh168
2023-02-01 19:29:03 +00:00
Vasiliy Kuznetsov
56f9475625 ns: change PNP testing to use QNNPACK (#91421)
Summary:

Changes the PNP test cases to use QNNPACK. The only reason is that
I'm switching to a Mac M1 as my primary machine, which supports QNNPACK
but not fbgemm, and it's convenient for me to be able to run these tests
locally.

PNP itself is not backend specific, so it does not matter which backend
the functionality is tested on.
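For reference, switching the quantized engine in a test is a one-liner (assuming QNNPACK is compiled into the build):

```python
import torch

# Backends compiled into this build, e.g. ['none', 'qnnpack'] on a Mac M1.
print(torch.backends.quantized.supported_engines)

# Use QNNPACK for subsequent quantized ops.
torch.backends.quantized.engine = "qnnpack"
```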

Test plan:

```
python test/test_quantization.py -k NShadows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91421
Approved by: https://github.com/jerryzh168
2023-02-01 18:34:04 +00:00
jjsjann123
bdca5fcd43 cherry-picking autodiff support for gather/index_select (#93333)
Added gather & index_select to autodiff.
Test coverage should be handled by OpInfo.
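A quick illustrative check of the ops involved (whether the TorchScript autodiff path actually kicks in depends on executor and profiling settings; this only exercises the ops under scripting):

```python
import torch

@torch.jit.script
def f(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    return torch.gather(x, 1, idx).sum() + torch.index_select(x, 0, idx[0]).sum()

x = torch.randn(4, 4, requires_grad=True)
idx = torch.randint(0, 4, (4, 4))
f(x, idx).backward()
print(x.grad.shape)  # torch.Size([4, 4])
```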
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93333
Approved by: https://github.com/ngimel
2023-02-01 09:47:40 +00:00
Will Constable
ac791bddce Refactor dynamo distributed test helpers to be reusable (#93187)
The point is to let test helpers previously defined and used in `test_dynamo_distributed.py` be reused from a new file, `test_traceable_collectives.py`, added later in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93187
Approved by: https://github.com/kumpera
2023-02-01 06:09:42 +00:00
leslie-fang-intel
ef4118e435 [Quant][FX] Lower QConvAdd2d for onednn backend (#91153)
**Summary**
Add quantization mappings for QConvAdd2d for int8 inference with the onednn backend. The fusion and lowering are supported only in FX mode.
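A rough sketch of the FX flow this targets (the qconfig/backend_config wiring here is illustrative and assumes a build with the onednn engine; the tests below are the authoritative setup):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class ConvAdd(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x, y):
        return self.conv(x) + y  # the conv + add pattern targeted by the fusion

torch.backends.quantized.engine = "onednn"  # assumes onednn support in this build
model = ConvAdd().eval()
example_inputs = (torch.randn(1, 3, 8, 8), torch.randn(1, 3, 8, 8))

prepared = prepare_fx(model, get_default_qconfig_mapping("onednn"), example_inputs)
prepared(*example_inputs)          # calibrate
quantized = convert_fx(prepared)   # conv + add is expected to lower to a fused quantized op
```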

**Test plan**
```
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_onednn
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_by_default
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_lowering
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91153
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-02-01 01:14:12 +00:00
leslie-fang-intel
53c3555a6a [Quant] Add fused ConvAdd2d module for onednn backend (#91152)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `ConvAdd2d` module for the onednn backend, which will be used for int8 inference with that backend. Calling this module with any other quantization backend throws an error.

**Test plan**
```
python -m pytest test_quantization.py -k test_conv2d_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91152
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-02-01 01:11:25 +00:00
Ivan Yashchuk
fba13d94a1 Remove deprecated torch.symeig (#70988)
The time has come to remove deprecated linear algebra related functions. This PR removes `torch.symeig`.
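For anyone migrating, the documented replacement is torch.linalg.eigh (or torch.linalg.eigvalsh when only eigenvalues are needed), roughly:

```python
import torch

A = torch.randn(4, 4)
A = A + A.T  # symmetric input

# Old (removed): e, v = torch.symeig(A, eigenvectors=True, upper=True)
e, v = torch.linalg.eigh(A, UPLO="U")

# Eigenvalues only:
e_only = torch.linalg.eigvalsh(A)
```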

- [x] XLA PR: https://github.com/pytorch/xla/pull/4498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70988
Approved by: https://github.com/lezcano, https://github.com/kit1980, https://github.com/malfet
2023-01-31 11:59:11 +00:00
Jacob Szwejbka
2e9107ec1e [Pytorch][Executorch] Handwritten view copy out ops should resize out (#91194)
Summary: Handwritten out ops should have feature parity with the codegen'd ones. This means they should resize out to the appropriate size. Q1: Why are these handwritten instead of codegen'd anyway? Q2: Where's a good spot to put the resize and copy helpers, since they are reused in the codegen'd out kernels?
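For illustration, the resize-to-fit contract on a stock out= op (not the Executorch view-copy ops themselves, but the behavior they should match):

```python
import torch

x = torch.randn(2, 3)

# A zero-element `out` tensor is resized by the kernel to the result shape.
out = torch.empty(0)
torch.add(x, 1.0, out=out)
print(out.shape)  # torch.Size([2, 3])
```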

Test Plan: ci.

Differential Revision: D42177051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91194
Approved by: https://github.com/ezyang
2023-01-30 23:07:14 +00:00
Nikita Shulga
5976f0bdfe Set min supported Python version to 3.8 (#93155)
Also, greps for `if sys.version_info .cond. (3, 8)` patterns and replaces them with the appropriate action.
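A representative (made-up) example of the kind of simplification this enables once 3.8 is the floor:

```python
import sys

# Before: a branch was needed while Python 3.7 was still supported.
if sys.version_info >= (3, 8):
    from functools import cached_property
else:
    cached_property = property  # crude 3.7 stand-in, for illustration only

# After: Python >= 3.8 is guaranteed, so the check collapses to a plain import.
from functools import cached_property
```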

This is the last in a series of PRs that moved CI/CD away from testing PyTorch behavior against Python 3.7.

Fixes https://github.com/pytorch/pytorch/issues/80513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93155
Approved by: https://github.com/huydhn
2023-01-29 18:28:46 +00:00
mfkasim1
75cfc0be21 Logcumsumexp for CPU (#93153)
Partial work from #90847, in the direction of solving #89205.
Most of the content is from #90847, but this is only for CPU, so hopefully it does not increase the build time by a lot.
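For reference, the op is the numerically stable form of a running log-sum-exp; for well-scaled inputs it matches the naive composition:

```python
import torch

x = torch.randn(5)

stable = torch.logcumsumexp(x, dim=0)
naive = torch.log(torch.cumsum(torch.exp(x), dim=0))
print(torch.allclose(stable, naive))  # True here; the naive form overflows for large x
```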

tag: @albanD, @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93153
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-01-27 22:29:33 +00:00