Commit Graph

120 Commits

Author SHA1 Message Date
Aaron Gokaslan
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the `set` call.
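
A representative before/after of the kind of rewrite this applies (illustrative examples, not taken from the diff):

```python
b = [1, 2, 2, 3]

# Unnecessary generator expression -> comprehension
squares = list(x * x for x in range(10))   # before
squares = [x * x for x in range(10)]       # after

# Useless generator -> plain constructor call
unique = set(a for a in b)                 # before
unique = set(b)                            # after
```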

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
Masaki Kozuki
30876229a7 [mta] Backward of unary foreach functions (#89591)
As per the title, this PR defines the backward of the unary foreach functions.
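
A minimal sketch of what this enables (illustrative; `_foreach_exp` stands in for any unary foreach function, assuming a build that includes this change):

```python
import torch

xs = [torch.rand(3, requires_grad=True) for _ in range(4)]
ys = torch._foreach_exp(xs)
loss = sum(y.sum() for y in ys)
loss.backward()
# d/dx exp(x) == exp(x), so each grad should match the forward output
print(torch.allclose(xs[0].grad, ys[0]))
```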

This doesn't implement forward-mode automatic differentiation as [the current codegen](a747326423/tools/autograd/gen_variable_type.py (L1513)) doesn't seem to handle `ArrayRef<Tensor>`.

Rel:
- https://github.com/pytorch/pytorch/issues/53796
- https://github.com/pytorch/pytorch/issues/58833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89591
Approved by: https://github.com/albanD
2023-01-23 08:28:06 +00:00
Masaki Kozuki
32b2d8009a check if multi_tensor_apply_kernel was called (#92077)
Replaces the hard-coded expected numbers of CUDA kernel launches with a check that `multi_tensor_apply_kernel` was called, keeping the dependency on the Kineto profiler.
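
A rough sketch of the idea using the public profiler API (the actual test helpers differ; this is just illustrative):

```python
import torch
from torch.profiler import profile, ProfilerActivity

tensors = [torch.rand(16, device="cuda") for _ in range(10)]
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch._foreach_add_(tensors, 1.0)

# Rather than asserting a hard-coded kernel-launch count, assert that the
# fused multi_tensor_apply_kernel actually ran.
assert any("multi_tensor_apply_kernel" in evt.name for evt in prof.events())
```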

Rel: https://github.com/pytorch/pytorch/pull/91844#issuecomment-1379844523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92077
Approved by: https://github.com/ngimel
2023-01-23 06:46:36 +00:00
milesial
a76bc410df Fix _foreach_norm on some tensor sizes (#91844)
This PR fixes 2 bugs with CUDA `_foreach_norm`:

1. Wrong norm when tensors are larger than kChunkSize = 65536
```
>>> torch._foreach_norm([torch.ones(60000, device="cuda") for _ in range(1)])
(tensor(244.9490, device='cuda:0', grad_fn=<NotImplemented>),)
>>> torch._foreach_norm([torch.ones(70000, device="cuda") for _ in range(1)])
(tensor(256., device='cuda:0', grad_fn=<NotImplemented>),)

>>> torch.ones(60000, device="cuda").norm()
tensor(244.9490, device='cuda:0', grad_fn=<LinalgVectorNormBackward0>)
>>> torch.ones(70000, device="cuda").norm()
tensor(264.5751, device='cuda:0', grad_fn=<LinalgVectorNormBackward0>)
```

2. Error when a tensor numel is smaller than the number of tensors

```
>> torch._foreach_norm([torch.ones(9, device="cuda") for _ in range(10)])
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
IndexError: select(): index 9 out of range for tensor of size [9] at dimension 0
```

This bug could have been caught by tests if `PYTORCH_TEST_WITH_SLOW` were set to 1, because it would have tested tensors of size 300*300=90000. It's not enabled by default; does someone know if it's ever enabled?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91844
Approved by: https://github.com/ngimel
2023-01-12 05:48:01 +00:00
Masaki Kozuki
554a796aef Implement torch._foreach_lerp (#87562)
As per title.

- [ ] ~~Q: Do we want `torch._foreach_lerp.ScalarList` as well?~~
- [ ] ~~we might want to have `ATen/native/cuda/lerp.cuh` and include it in `ATen/native/cuda/Lerp.cu` and `ATen/native/cuda/ForeachTernaryOp.cu`~~

Related:
- https://github.com/pytorch/pytorch/issues/58833
- https://github.com/pytorch/pytorch/issues/71683
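
A minimal usage sketch (assuming the overloads mirror the other foreach ops, i.e. a scalar weight or a tensorlist of weights):

```python
import torch

starts = [torch.zeros(4) for _ in range(3)]
ends = [torch.ones(4) for _ in range(3)]
weights = [torch.full((4,), 0.25) for _ in range(3)]

outs = torch._foreach_lerp(starts, ends, weights)   # out-of-place, tensorlist weights
torch._foreach_lerp_(starts, ends, 0.25)            # in-place, scalar weight
```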

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87562
Approved by: https://github.com/ngimel
2023-01-11 02:52:04 +00:00
milesial
9d20d6d5ec Foreach clamp_min clamp_max (#91384)
Adds `_foreach_clamp_min` and `_foreach_clamp_max` as binary ops, with scalar, scalarlist and tensorlist support.

Timing example for `_foreach_clamp_min_` on an RTX 3070 Ti across lists of tensors with varying count and size (times are in microseconds (us)):

CUDA:

```
[------------------ (tensors, scalar) -------------------]
                                   |  for loop  |  foreach
      10 tensors of size 4         |     29.0   |     10.2
      100 tensors of size 4        |    234.4   |     18.3
      1000 tensors of size 4       |   2194.1   |    113.5
      10000 tensors of size 4      |  21745.6   |   1144.5
      10 tensors of size 16        |     29.5   |     12.0
      100 tensors of size 16       |    256.9   |     19.9
      1000 tensors of size 16      |   2499.7   |    123.6
      10000 tensors of size 16     |  25022.2   |   1295.6
      10 tensors of size 256       |     32.8   |     11.2
      100 tensors of size 256      |    258.8   |     19.7
      1000 tensors of size 256     |   2509.2   |    123.7
      10000 tensors of size 256    |  25016.2   |   1295.4
      10 tensors of size 65536     |     32.9   |     18.7
      100 tensors of size 65536    |    327.1   |    150.3
      1000 tensors of size 65536   |   3051.3   |   1388.0
      10000 tensors of size 65536  |  30476.9   |  14021.5

[------------------ (tensors, tensors) ------------------]
                                   |  for loop  |  foreach
      10 tensors of size 4         |     26.8   |     17.3
      100 tensors of size 4        |    206.8   |     90.5
      1000 tensors of size 4       |   1993.0   |    828.9
      10000 tensors of size 4      |  19851.0   |   9063.3
      10 tensors of size 16        |     34.7   |     20.0
      100 tensors of size 16       |    232.2   |    102.1
      1000 tensors of size 16      |   2220.9   |    977.3
      10000 tensors of size 16     |  22644.5   |  10361.4
      10 tensors of size 256       |     30.5   |     19.7
      100 tensors of size 256      |    231.6   |    102.4
      1000 tensors of size 256     |   2251.9   |    978.7
      10000 tensors of size 256    |  22680.3   |  10405.8
      10 tensors of size 65536     |     30.6   |     34.4
      100 tensors of size 65536    |    315.1   |    223.6
      1000 tensors of size 65536   |   3252.1   |   2114.4
      10000 tensors of size 65536  |  30578.0   |  22826.3

```

CPU:
```
[------------------- (tensors, scalar) -------------------]
                                   |  for loop  |  foreach
      10 tensors of size 4         |      13.0  |       9.6
      100 tensors of size 4        |      62.4  |      31.6
      1000 tensors of size 4       |     562.2  |     245.6
      10000 tensors of size 4      |    5552.2  |    2517.7
      10 tensors of size 16        |      14.9  |      11.3
      100 tensors of size 16       |      74.1  |      36.9
      1000 tensors of size 16      |     663.7  |     285.5
      10000 tensors of size 16     |    6765.2  |    2947.5
      10 tensors of size 256       |      15.2  |      11.8
      100 tensors of size 256      |      76.0  |      37.7
      1000 tensors of size 256     |     728.8  |     323.9
      10000 tensors of size 256    |    7274.4  |    3800.3
      10 tensors of size 65536     |     105.6  |     124.5
      100 tensors of size 65536    |     982.8  |     939.7
      1000 tensors of size 65536   |   14993.1  |   14579.2
      10000 tensors of size 65536  |  163091.0  |  151555.8

[------------------- (tensors, tensors) ------------------]
                                   |  for loop  |  foreach
      10 tensors of size 4         |      11.8  |      10.5
      100 tensors of size 4        |      53.1  |      38.2
      1000 tensors of size 4       |     465.1  |     316.1
      10000 tensors of size 4      |    4616.9  |    3625.9
      10 tensors of size 16        |      13.5  |      12.3
      100 tensors of size 16       |      63.0  |      46.5
      1000 tensors of size 16      |     560.1  |     359.9
      10000 tensors of size 16     |    5586.8  |    3765.9
      10 tensors of size 256       |      15.2  |      13.7
      100 tensors of size 256      |      64.4  |      48.3
      1000 tensors of size 256     |     653.7  |     410.0
      10000 tensors of size 256    |    5916.6  |    3901.3
      10 tensors of size 65536     |     109.1  |     106.8
      100 tensors of size 65536    |    1128.9  |    1105.0
      1000 tensors of size 65536   |   16245.0  |   15950.8
      10000 tensors of size 65536  |  171111.3  |  163540.2
```

Example use:

```
tensors = [torch.randn(16, device='cuda') for _ in range(10)]

out = torch._foreach_clamp_min(tensors, 0.1)
out = torch._foreach_clamp_min(tensors, [0.1] * len(tensors))
out = torch._foreach_clamp_min(tensors, tensors)
torch._foreach_clamp_min_(tensors, 0.1)
torch._foreach_clamp_min_(tensors, [0.1] * len(tensors))
torch._foreach_clamp_min_(tensors, tensors)
```

Does not support complex types.
Changes the existing `foreach_minimum/maximum` to use this new implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91384
Approved by: https://github.com/ngimel
2023-01-09 19:28:47 +00:00
Christian Puhrsch
6fd416650a Add _foreach_addc(div/mul)(_).Tensor (#88157)
Supports passing the value scalars as a flat 1D Tensor.

Previously, we could only pass either an individual scalar or a ScalarList.
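
A usage sketch of the new overload (illustrative; the values are passed as a flat 1D CPU tensor, one entry per list element):

```python
import torch

n = 4
selfs = [torch.zeros(3) for _ in range(n)]
t1 = [torch.ones(3) for _ in range(n)]
t2 = [torch.full((3,), 2.0) for _ in range(n)]
values = torch.arange(1.0, n + 1)  # one scalar per tensor, as a flat 1D Tensor

torch._foreach_addcmul_(selfs, t1, t2, values)
# selfs[i] == 0 + values[i] * 1 * 2
print(selfs[0], selfs[3])  # tensor([2., 2., 2.]) tensor([8., 8., 8.])
```
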
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88157
Approved by: https://github.com/ngimel, https://github.com/albanD
2022-11-02 23:24:35 +00:00
Elias Ellison
f701cb04fb Test Dynamo CI w Fake Tensors (#84282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84282
Approved by: https://github.com/anijain2305
2022-09-01 00:15:05 +00:00
Masaki Kozuki
3139722679 [foreach][mta] Inplace maximum and minimum (#82523)
### Description
Implement `torch._foreach_maximum_` and `torch._foreach_minimum_` mainly for `_multi_tensor_adam` and `_multi_tensor_adamw` with `amsgrad=True` to correctly update their `max_exp_avg_sqs`.
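
A sketch of the amsgrad-style update this enables (names are illustrative):

```python
import torch

max_exp_avg_sqs = [torch.zeros(3) for _ in range(2)]
exp_avg_sqs = [torch.rand(3) for _ in range(2)]

# In-place elementwise maximum across the two tensorlists,
# as used by _multi_tensor_adam(w) with amsgrad=True.
torch._foreach_maximum_(max_exp_avg_sqs, exp_avg_sqs)
```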

### Issue
- https://github.com/pytorch/pytorch/issues/78807
- https://github.com/pytorch/pytorch/pull/81894
- https://github.com/pytorch/pytorch/pull/81348
- https://github.com/pytorch/pytorch/pull/81705
- https://github.com/pytorch/pytorch/issues/58833
- https://github.com/pytorch/pytorch/issues/68041

### Testing
Updated `test_foreach.py::TestForeach::_minmax_test` to compare the outputs of `_foreach_maximum_` (and `_foreach_minimum_`) against those of `[torch.maximum(a, b) for a, b in zip(tensors1, tensors2)]`.

cc @ngimel @albanD @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82523
Approved by: https://github.com/albanD
2022-08-03 03:40:42 +00:00
Nikita Shulga
bfac65dfe5 [testing] Update dispatch macros (#74977)
This PR is a reland of #74289.
Co-authored-by: Khushi Agrawal <khushiagrawal411@gmail.com>
2022-03-30 14:13:21 -07:00
PyTorch MergeBot
2e4152b118 Revert "[testing] Update dispatch macros"
This reverts commit eed19a0f38.

Reverted https://github.com/pytorch/pytorch/pull/74289 on behalf of https://github.com/malfet
2022-03-30 19:52:37 +00:00
Khushi Agrawal
eed19a0f38 [testing] Update dispatch macros
Hi,
This PR is the follow-up to #71561 (the previous PR had a couple of merge conflicts and was reverted; this PR resolves that).
Please take a look. Thanks!

cc: @pmeier @mruberry @kshitij12345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74289
Approved by: https://github.com/pmeier, https://github.com/mruberry
2022-03-30 16:10:16 +00:00
Masaki Kozuki
105e58a552 [Foreach Reduction] Use OpMathType tensor for intermediate results
Follow-up of https://github.com/pytorch/pytorch/pull/62646

In APEX, multi_tensor_norm only supports float and half and the dtype of `output` and `output_per_tensor` is hardcoded as single-precision (see ae757634ef/csrc/multi_tensor_l2norm_kernel.cu (L318)).

But in my previous PR, any tensor created in the kernel had the same dtype as the input tensors.
I'm not quite sure why I didn't see any failures in the previous PR, but in my opinion the internal math should be performed in 32 bits for 16-bit tensors.
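
A small Python illustration of the concern (not the kernel code; it just shows a float16 intermediate overflowing where a float32 one does not):

```python
import torch

x = torch.ones(70000, dtype=torch.float16)

# The sum of squares is 70000, which does not fit in float16 (max ~65504).
print((x * x).sum(dtype=torch.float16).sqrt())  # inf
# Accumulating in float32 (the OpMathType for half) keeps the result finite.
print((x * x).sum(dtype=torch.float32).sqrt())  # tensor(264.5751)
```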

rel: https://github.com/pytorch/pytorch/issues/58833

cc @ptrblck @mcarilli @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68107
Approved by: https://github.com/ngimel
2022-03-23 17:16:37 +00:00
Jeff Daily
1e08448435 [ROCm] enable foreach fastpath
Reverts #46216 now that rocm is fixed.

Benchmark to verify:

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.benchmark as benchmark_utils
import torchvision

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
criterion = nn.CrossEntropyLoss()

# Compare optim.SGD vs optim._multi_tensor.SGD
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.1)
optimizer_mta = optim._multi_tensor.SGD(model.parameters(), lr=1e-3, momentum=0.1)

# One training step so the optimizers have gradients to apply.
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)
optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device, requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="torch.cuda.synchronize();optimizer.step()",
        globals=globals(),
        label="str(optimizer)",
    )

    timer_mta = benchmark_utils.Timer(
        stmt="torch.cuda.synchronize(); optimizer_mta.step()",
        globals=globals(),
        label="str(optimizer_mta)",
    )
    for _ in range(1):
        for i in range(1):
           print(f"Run: {i}\n{'-' * 40}")
           print(f"autorange:\n{timer.blocked_autorange()}\n\n")

        for i in range(1):
            print(f"Run: {i}\n{'-' * 40}")
            print(f"autorange:\n{timer_mta.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Before revert:
```
Run: 0
----------------------------------------
autorange:
<torch.utils.benchmark.utils.common.Measurement object at 0x7f253e67c910>
str(optimizer)
  7.33 ms
  1 measurement, 100 runs , 1 thread

Run: 0
----------------------------------------
autorange:
<torch.utils.benchmark.utils.common.Measurement object at 0x7f253e67c510>
str(optimizer_mta)
  5.76 ms
  1 measurement, 100 runs , 1 thread
```

After revert:
```
Run: 0
----------------------------------------
autorange:
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa2aa15e8d0>
str(optimizer)
  7.35 ms
  1 measurement, 100 runs , 1 thread

Run: 0
----------------------------------------
autorange:
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa2aa15e4d0>
str(optimizer_mta)
  3.53 ms
  1 measurement, 100 runs , 1 thread
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74417
Approved by: https://github.com/ngimel
2022-03-21 17:37:06 +00:00
Nikita Shulga
ef066f0832 Revert D34856571: [pytorch][PR] Replace get_all_ type macros with the ATen dispatch macros.
Test Plan: revert-hammer

Differential Revision:
D34856571 (3ded7b1da3)

Original commit changeset: 0dca038bcad5

Original Phabricator Diff: D34856571 (3ded7b1da3)

fbshipit-source-id: 594553fa0b710d78beba59d5d2b646f1f1270386
(cherry picked from commit 8090eb9b12dcf452a9e7dc01792a66fb91b563b6)
2022-03-15 22:07:11 +00:00
Khushi Agrawal
3ded7b1da3 Replace get_all_ type macros with the ATen dispatch macros. (#71561)
Summary:
Hi, Team!
The PR is motivated by https://github.com/pytorch/pytorch/pull/71153#discussion_r782446738. It aims to replace the `get_all_*` dtype getters with the ATen dispatch macro equivalents.
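
A representative replacement (illustrative; the exact dispatch-macro getter depends on what the original `get_all_*` call covered):

```python
import torch
from torch.testing._internal.common_dtype import all_types_and_complex_and

# Deprecated style:
#     @dtypes(*get_all_dtypes())
# Dispatch-macro style:
#     @dtypes(*all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))
print(all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool))
```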

The files it iterates over are listed below (thanks, Lezcano, for the idea!):

<details>
<summary>

`test/test_autograd.py`</summary>

<p>

```python
43:from torch.testing._internal.common_dtype import get_all_dtypes
8506:        floating_dt = [dt for dt in get_all_dtypes() if dt.is_floating_point]
```

</p>
</details>

<details>
<summary>

`test/test_binary_ufuncs.py`</summary>

<p>

```python
26:    all_types_and_complex_and, integral_types_and, get_all_dtypes, get_all_int_dtypes, get_all_math_dtypes,
27:    get_all_complex_dtypes, get_all_fp_dtypes,
935:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1035:    dtypes(*get_all_dtypes(
1488:    dtypes(*(get_all_dtypes(include_bool=False, include_bfloat16=False)))
1879:    dtypes(*product(get_all_dtypes(include_complex=False), get_all_dtypes(include_complex=False)))
1887:    dtypes(*(get_all_int_dtypes() + [torch.bool]))
1913:    dtypes(*(get_all_fp_dtypes()))
1941:    dtypes(*(get_all_fp_dtypes()))
1977:    dtypes(*product(get_all_complex_dtypes(), get_all_dtypes()))
2019:    dtypes(*product(get_all_fp_dtypes(), get_all_fp_dtypes()))
2048:    dtypes(*get_all_dtypes())
2110:    dtypes(*product(get_all_dtypes(include_complex=False),
2111:                     get_all_dtypes(include_complex=False)))
2128:            types = [torch.bool, torch.bfloat16] + get_all_int_dtypes()
2173:        if dtypes[1] in get_all_fp_dtypes():
2178:    dtypes(*product(get_all_fp_dtypes(),
2179:                     get_all_fp_dtypes()))
2260:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')) - {torch.complex64, torch.complex128})
2261:    dtypes(*set(get_all_math_dtypes('cpu')) - {torch.complex64, torch.complex128})
2273:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')) - {torch.complex64, torch.complex128})
2274:    dtypes(*set(get_all_math_dtypes('cpu')) - {torch.complex64, torch.complex128})
2307:    dtypes(*get_all_math_dtypes('cpu'))
2319:    dtypes(*get_all_fp_dtypes(include_bfloat16=False))
2331:    dtypes(*get_all_int_dtypes())
2356:    dtypes(*get_all_dtypes(include_bfloat16=False, include_bool=False, include_complex=False))
2393:        if dtype in get_all_int_dtypes():
2614:    dtypes(*get_all_dtypes())
2624:    dtypes(*tuple(itertools.combinations_with_replacement(get_all_dtypes(), 2)))
2806:    dtypes(*list(product(get_all_dtypes(include_complex=False),
2807:                          get_all_dtypes(include_complex=False))))
2866:    dtypes(*list(product(get_all_complex_dtypes(),
2867:                          get_all_complex_dtypes())))
2902:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
2906:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
2910:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
3019:        dtypes = [torch.float, torch.double] + get_all_complex_dtypes()
3221:    dtypes(*get_all_dtypes(include_complex=False))
3407:    dtypes(*list(product(get_all_dtypes(include_bool=False),
3408:                          get_all_dtypes(include_bool=False))))
3504:    dtypes(*product(get_all_dtypes(include_complex=False, include_bfloat16=False),
3505:                     get_all_dtypes(include_complex=False, include_bfloat16=False)))
3516:            if x.dtype in get_all_int_dtypes() + [torch.bool]:
3643:    dtypes(*product(get_all_dtypes(include_complex=False,
3645:                     get_all_dtypes(include_complex=False,
```

</p>
</details>

<details>
<summary>

`test/test_complex.py`</summary>

<p>

```python
6:from torch.testing._internal.common_dtype import get_all_complex_dtypes
11:    dtypes(*get_all_complex_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_foreach.py`</summary>

<p>

```python
18:    get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes,
142:            if dtype in get_all_int_dtypes():
179:            disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
201:            disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
205:                disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool]
211:                disable_fastpath |= dtype not in get_all_complex_dtypes()
241:                bool_int_div = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
246:                    disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool]
248:                    disable_fastpath |= dtype not in get_all_complex_dtypes()
250:                    disable_fastpath |= True and dtype not in get_all_complex_dtypes()
307:        disable_fastpath = dtype in get_all_int_dtypes() + [torch.bool]
365:        if opinfo.name == "_foreach_abs" and dtype in get_all_complex_dtypes():
376:    ops(foreach_unary_op_db, dtypes=get_all_dtypes())
393:         dtypes=get_all_dtypes(include_half=True, include_bfloat16=True, include_complex=False))
401:    ops(foreach_minmax_op_db, dtypes=get_all_fp_dtypes(include_bfloat16=True, include_half=True))
426:            if ord in (1, 2) and dtype in torch.testing.get_all_fp_dtypes():
439:    dtypes(*get_all_dtypes())
449:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
481:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
536:            if dtype in get_all_int_dtypes() + [torch.bool] and foreach_op == torch._foreach_div:
545:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
637:    ops(foreach_pointwise_op_db, allowed_dtypes=get_all_fp_dtypes(include_half=False, include_bfloat16=False))
```

</p>
</details>

<details>
<summary>

`test/test_linalg.py`</summary>

<p>

```python
29:    all_types, floating_types, floating_and_complex_types, get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes,
30:    get_all_fp_dtypes,
111:    dtypes(*(get_all_dtypes()))
794:        float_and_complex_dtypes = get_all_fp_dtypes() + get_all_complex_dtypes()
807:    dtypes(*(get_all_int_dtypes()))
828:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
841:        if dtype in get_all_complex_dtypes():
844:    dtypes(*itertools.product(get_all_dtypes(),
845:                               get_all_dtypes()))
855:        for dtypes0, dtypes1, dtypes2 in product(get_all_dtypes(), repeat=3):
5607:                  *get_all_fp_dtypes(include_half=not CUDA9, include_bfloat16=(CUDA11OrLater and SM53OrLater)))
5608:    dtypes(*(set(get_all_dtypes()) - {torch.half, torch.bool}))
5644:    dtypes(*(get_all_complex_dtypes() + get_all_fp_dtypes()))
6255:    dtypesIfCUDA(*get_all_complex_dtypes(),
6256:                  *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater)),
6292:    dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater))))
6323:    dtypesIfCUDA(*get_all_complex_dtypes(),
6324:                  *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater))))
6325:    dtypes(*get_all_complex_dtypes(), *get_all_fp_dtypes())
6358:    dtypesIfCUDA(*([torch.float, torch.double] + get_all_complex_dtypes()))
6556:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
6668:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
6741:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_nn.py`</summary>

<p>

```python
37:from torch.testing._internal.common_dtype import integral_types, get_all_fp_dtypes, get_all_math_dtypes
50:    onlyNativeDeviceTypes, deviceCountAtLeast, largeTensorTest, expectedFailureMeta, skipMeta, get_all_device_types, \
8862:                for device in get_all_device_types():
9629:            for dt1 in get_all_math_dtypes(device):
9630:                for dt2 in get_all_math_dtypes(device):
9631:                    for dt3 in get_all_math_dtypes(device):
9648:            for input_dtype in get_all_math_dtypes(device):
9664:            for input_dtype in get_all_math_dtypes(device):
13015:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
13034:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
13159:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
17400:    dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
17768:    dtypesIfCUDA(*get_all_fp_dtypes())
17773:    dtypesIfCUDA(*get_all_fp_dtypes())
17778:    dtypesIfCUDA(*get_all_fp_dtypes())
17783:    dtypesIfCUDA(*get_all_fp_dtypes())
17788:    dtypesIfCUDA(*get_all_fp_dtypes())
17793:    dtypesIfCUDA(*get_all_fp_dtypes())
17798:    dtypesIfCUDA(*get_all_fp_dtypes())
17963:    dtypesIfCUDA(*get_all_fp_dtypes())
17977:    dtypesIfCUDA(*get_all_fp_dtypes())
18684:    def test_cross_entropy_loss_prob_target_all_reductions(self, device):
```

</p>
</details>

<details>
<summary>

`test/test_numpy_interop.py`</summary>

<p>

```python
12:from torch.testing._internal.common_dtype import get_all_dtypes
399:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_ops.py`</summary>

<p>

```python
12:from torch.testing._internal.common_dtype import floating_and_complex_types_and, get_all_dtypes
86:        for dtype in get_all_dtypes():
```

</p>
</details>

<details>
<summary>

`test/test_reductions.py`</summary>

<p>

```python
16:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes,
360:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
366:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
394:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
750:        for dtype in [dtype for dtype in get_all_math_dtypes('cpu') if dtype != torch.float16]:
1404:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1457:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1458:              get_all_complex_dtypes()))
1465:            return dtype in get_all_int_dtypes()
1494:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1501:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1507:    dtypes(*(get_all_complex_dtypes()))
1514:        dtypes = list(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False))
1523:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1531:        if dtype in get_all_fp_dtypes():
1608:    dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
1837:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1855:    dtypes(*(set(get_all_dtypes(include_bool=False, include_complex=False)) - {torch.uint8}))
3219:        for dtype in get_all_dtypes(include_half=True, include_bfloat16=False,
```

</p>
</details>

<details>
<summary>

`test/test_serialization.py`</summary>

<p>

```python
26:from torch.testing._internal.common_dtype import get_all_dtypes
586:        for device, dtype in product(devices, get_all_dtypes()):
589:            for other_dtype in get_all_dtypes():
```

</p>
</details>

<details>
<summary>

`test/test_shape_ops.py`</summary>

<p>

```python
18:from torch.testing._internal.common_dtype import get_all_dtypes
230:    dtypes(*get_all_dtypes(include_complex=False, include_bool=False, include_half=False,
232:    dtypesIfCUDA(*get_all_dtypes(include_complex=False, include_bool=False, include_bfloat16=False))
344:    dtypes(*get_all_dtypes())
443:    dtypes(*get_all_dtypes())
461:    dtypes(*get_all_dtypes())
570:    dtypes(*get_all_dtypes(include_complex=False))
```

</p>
</details>

<details>
<summary>

`test/test_sort_and_select.py`</summary>

<p>

```python
12:    all_types, all_types_and, floating_types_and, get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes,
136:    dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128})
231:    dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128})
296:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
647:    dtypesIfCUDA(*get_all_fp_dtypes())
678:    dtypesIfCUDA(*(get_all_dtypes(include_complex=False,
682:    dtypes(*(get_all_dtypes(include_complex=False, include_bool=False, include_half=False, include_bfloat16=False)))
739:    dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128})
740:    dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128})
799:    dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128})
800:    dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128})
```

</p>
</details>

<details>
<summary>

`test/test_sparse.py`</summary>

<p>

```python
20:from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes
29:    floating_and_complex_types, floating_and_complex_types_and, get_all_dtypes, get_all_int_dtypes,
1963:            return dtype in get_all_int_dtypes()
1994:    dtypes(*get_all_dtypes(include_bool=False, include_half=False,
2103:            return dtype in get_all_int_dtypes()
2138:    dtypes(*get_all_dtypes(include_bool=False, include_half=False,
2626:        all_sparse_dtypes = get_all_dtypes(include_complex=True)
2633:        all_sparse_dtypes = get_all_dtypes(include_complex=True)
3230:    dtypes(*get_all_complex_dtypes(),
3231:            *get_all_fp_dtypes(include_half=False, include_bfloat16=False))
3234:                  *get_all_fp_dtypes(
```

</p>
</details>

<details>
<summary>

`test/test_sparse_csr.py`</summary>

<p>

```python
7:from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes, floating_and_complex_types, make_tensor
17:from torch.testing._internal.common_dtype import floating_types, get_all_dtypes
120:    dtypes(*get_all_dtypes())
133:    dtypes(*get_all_dtypes())
150:    dtypes(*get_all_dtypes())
180:    dtypes(*get_all_dtypes())
201:    dtypes(*get_all_dtypes())
210:    dtypes(*get_all_dtypes())
225:    dtypes(*get_all_dtypes())
244:    dtypes(*get_all_dtypes())
263:    dtypes(*get_all_dtypes())
285:    dtypes(*get_all_dtypes())
411:    dtypes(*get_all_dtypes())
482:    dtypes(*get_all_dtypes())
502:    dtypes(*get_all_dtypes())
562:    dtypes(*get_all_dtypes())
588:    dtypesIfCUDA(*get_all_complex_dtypes(),
589:                  *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
745:    dtypesIfCUDA(*get_all_complex_dtypes(),
746:                  *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC,
765:    dtypesIfCUDA(*get_all_complex_dtypes(),
766:                  *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC,
801:                  *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater,
841:                  *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater,
1182:    dtypes(*get_all_dtypes())
1276:    dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_bfloat16=False))
1286:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_tensor_creation_ops.py`</summary>

<p>

```python
21:    onlyCUDA, skipCPUIf, dtypesIfCUDA, skipMeta, get_all_device_types)
23:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
150:        for dt in get_all_dtypes():
160:        for dt in get_all_dtypes():
314:        dtypes = [dtype for dtype in get_all_dtypes() if dtype != torch.bfloat16]
1012:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1013:              get_all_complex_dtypes()))
1032:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1033:              get_all_complex_dtypes()))
1050:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1051:              get_all_complex_dtypes()))
1745:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1779:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1868:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1926:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1954:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, torch_device)
1956:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, None)
1957:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, torch_device)
2538:        for device in get_all_device_types():
2645:        for dtype in get_all_dtypes():
2678:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False) +
2679:              get_all_complex_dtypes()))
2716:    dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False))
2827:            for dt in get_all_dtypes():
2913:    dtypes(*get_all_dtypes(include_bool=False, include_half=False))
2914:    dtypesIfCUDA(*get_all_dtypes(include_bool=False, include_half=True))
3028:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
3033:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
3074:    dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_complex=False))
3075:    dtypesIfCUDA(*((get_all_int_dtypes() + [torch.float32, torch.float16, torch.bfloat16])
3077:                    else get_all_dtypes(include_bool=False, include_half=True, include_complex=False)))
3873:    dtypes(*get_all_dtypes())
3884:    dtypes(*get_all_dtypes(include_bool=False))
3916:            for other in get_all_dtypes():
3922:    dtypes(*get_all_dtypes())
3932:    dtypes(*get_all_dtypes(include_bool=False))
3955:    dtypes(*get_all_dtypes(include_bool=False))
3961:    dtypes(*get_all_dtypes(include_bool=False))
3965:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_testing.py`</summary>

<p>

```python
25:from torch.testing._internal.common_dtype import get_all_dtypes
31:    dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
```

</p>
</details>

<details>
<summary>

`test/test_torch.py`</summary>

<p>

```python
51:    expectedAlertNondeterministic, get_all_device_types, skipXLA)
57:    get_all_fp_dtypes, get_all_int_dtypes, get_all_math_dtypes, get_all_dtypes, get_all_complex_dtypes
296:            for d in get_all_device_types():
323:            for device in get_all_device_types():
324:                for dt1 in get_all_dtypes():
325:                    for dt2 in get_all_dtypes():
343:            all_dtypes = get_all_dtypes()
350:            all_dtypes = get_all_dtypes()
781:            for dtype in get_all_dtypes():
986:            for device in get_all_device_types():
1017:            for device in get_all_device_types():
1018:                for dtype in get_all_math_dtypes(device):
2792:            for device in get_all_device_types():
3186:    dtypes(*get_all_dtypes())
3195:        for error_dtype in get_all_dtypes():
3203:    dtypes(*get_all_dtypes())
3212:        for error_dtype in get_all_dtypes():
4539:    dtypes(*get_all_fp_dtypes())
4545:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
4577:    dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False))
4578:    dtypesIfCPU(*(get_all_fp_dtypes(include_half=False, include_bfloat16=True)))
4579:    dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False)))
4599:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False)))
4600:    dtypesIfCPU(*(get_all_dtypes(include_half=False, include_bfloat16=False, include_complex=False)))
4601:    dtypesIfCUDA(*(get_all_dtypes(include_bfloat16=False, include_complex=False)))
4613:        for p_dtype in get_all_fp_dtypes(include_half=device.startswith('cuda'), include_bfloat16=False):
4628:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False)))
4629:    dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False)))
4640:    dtypes(*get_all_fp_dtypes())
4723:    dtypes(*get_all_fp_dtypes())
4735:    dtypes(*get_all_fp_dtypes(include_bfloat16=False))
4736:    dtypesIfCUDA(*get_all_fp_dtypes())
4747:    dtypes(*get_all_fp_dtypes())
4761:    dtypes(*get_all_fp_dtypes())
4771:    dtypes(*get_all_fp_dtypes())
4792:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
5302:    dtypes(*get_all_dtypes(include_bfloat16=False))
5322:    dtypes(*get_all_dtypes(include_half=False, include_bfloat16=False))
5323:    dtypesIfCPU(*get_all_dtypes(include_bfloat16=False))
5324:    dtypesIfCUDA(*get_all_dtypes(include_bfloat16=False))
5591:        for dt in get_all_dtypes():
5611:        for dt in get_all_dtypes():
5678:        for dt in get_all_dtypes():
5696:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')))
5697:    dtypes(*set(get_all_math_dtypes('cpu')))
5746:    dtypes(*get_all_dtypes())
5780:    dtypes(*get_all_dtypes())
5885:    dtypes(*get_all_dtypes())
5902:    dtypes(*get_all_dtypes())
5945:    dtypes(*get_all_dtypes())
5979:    dtypes(*get_all_dtypes(include_bool=False))
6049:    dtypes(*get_all_dtypes(include_bool=False))
6092:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6093:              get_all_complex_dtypes()))
6094:    dtypesIfCPU(*get_all_dtypes())
6095:    dtypesIfCUDA(*get_all_dtypes())
6122:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6123:              get_all_complex_dtypes()))
6124:    dtypesIfCPU(*get_all_dtypes())
6125:    dtypesIfCUDA(*get_all_dtypes())
6163:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6164:              get_all_complex_dtypes()))
6165:    dtypesIfCPU(*get_all_dtypes())
6166:    dtypesIfCUDA(*get_all_dtypes())
6190:    dtypes(*(get_all_complex_dtypes() +
6191:              get_all_int_dtypes()))
6238:    dtypes(*get_all_dtypes())
6323:    dtypes(*get_all_dtypes())
6389:    dtypes(*product(get_all_dtypes(), (torch.uint8, torch.bool)))
6699:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')))
6700:    dtypes(*set(get_all_math_dtypes('cpu')))
7452:    dtypes(*get_all_dtypes(include_bool=False))
7461:    dtypes(*get_all_dtypes(include_bool=False))
7477:    dtypes(*get_all_dtypes(include_bool=False))
7496:    dtypes(*get_all_dtypes(include_bool=False))
7538:    dtypes(*get_all_dtypes(include_bool=False))
8162:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() +
8163:              get_all_complex_dtypes()))
8175:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() +
8176:              get_all_complex_dtypes()))
```

</p>
</details>

<details>
<summary>

`test/test_type_promotion.py`</summary>

<p>

```python
14:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes
187:        for dtype in get_all_dtypes():
262:        dtypes1 = get_all_math_dtypes('cuda')
263:        dtypes2 = get_all_math_dtypes(device)
339:    dtypes(*itertools.product(get_all_dtypes(), get_all_dtypes()))
468:            for dt1 in get_all_math_dtypes(device):
469:                for dt2 in get_all_math_dtypes(device):
519:            for dt1 in get_all_math_dtypes(device):
520:                for dt2 in get_all_math_dtypes(device):
528:        for dt in get_all_math_dtypes(device):
561:        for dtype in get_all_dtypes():
766:                                          dtypes=get_all_math_dtypes(device))
771:                                          dtypes=get_all_math_dtypes(device))
782:                                          dtypes=get_all_math_dtypes(device))
879:        dtypes = get_all_dtypes(include_bfloat16=False)
898:        dtypes = get_all_dtypes(include_bfloat16=False, include_bool=False)
965:    dtypesIfCUDA(*itertools.product(get_all_dtypes(include_bfloat16=False, include_complex=False),
966:                                     get_all_dtypes(include_bfloat16=False, include_complex=False)))
967:    dtypes(*itertools.product(get_all_dtypes(include_half=False, include_bfloat16=False,
969:                               get_all_dtypes(include_half=False, include_bfloat16=False,
976:            return dtype in get_all_int_dtypes() + [torch.bool]
979:            return dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False)
```

</p>
</details>

<details>
<summary>

`test/test_unary_ufuncs.py`</summary>

<p>

```python
24:    floating_types_and, all_types_and_complex_and, floating_and_complex_types_and, get_all_dtypes, get_all_math_dtypes,
25:    get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
517:    dtypes(*(get_all_int_dtypes() + [torch.bool] +
518:              get_all_fp_dtypes(include_bfloat16=False)))
596:    dtypes(*get_all_fp_dtypes(include_half=True, include_bfloat16=False))
611:        invalid_input_dtypes = get_all_int_dtypes() + \
612:            get_all_complex_dtypes() + \
619:        for dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False):
1048:    dtypes(*get_all_math_dtypes('cpu'))
1182:    dtypesIfCUDA(*get_all_fp_dtypes())
1190:    dtypesIfCUDA(*get_all_fp_dtypes())
1205:    dtypesIfCUDA(*get_all_fp_dtypes())
1215:    dtypesIfCUDA(*get_all_fp_dtypes())
1307:    dtypes(*(get_all_dtypes(include_bool=False)))
1349:    dtypes(*(get_all_fp_dtypes(include_half=False) +
1350:              get_all_complex_dtypes()))
1351:    dtypesIfCUDA(*(get_all_fp_dtypes(include_half=True) +
1352:                    get_all_complex_dtypes()))
```

</p>
</details>

<details>
<summary>

`test/test_view_ops.py`</summary>

<p>

```python
19:    get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
124:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
131:    dtypes(*get_all_dtypes(include_bfloat16=False))
213:            for view_dtype in [*get_all_fp_dtypes(), *get_all_complex_dtypes()]:
220:    dtypes(*get_all_dtypes())
224:        for view_dtype in get_all_dtypes():
305:    dtypes(*get_all_complex_dtypes(include_complex32=True))
343:    dtypes(*get_all_dtypes())
354:    dtypes(*get_all_dtypes())
364:    dtypes(*get_all_dtypes())
374:    dtypes(*get_all_dtypes())
384:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
395:    dtypes(*get_all_complex_dtypes())
426:    dtypes(*get_all_complex_dtypes())
451:    dtypes(*product(get_all_complex_dtypes(), get_all_dtypes()))
1263:    dtypes(*(torch.testing.get_all_dtypes()))
1279:    dtypes(*(torch.testing.get_all_dtypes()))
1405:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1406:              get_all_complex_dtypes()))
1471:    dtypes(*get_all_dtypes(include_bfloat16=False))
1574:    dtypes(*get_all_dtypes())
1601:    dtypes(*get_all_dtypes(include_bfloat16=False))
1632:    dtypes(*get_all_dtypes(include_bfloat16=False))
1711:        for dt in get_all_dtypes():
1717:        for dt in get_all_dtypes():
1724:        for dt in get_all_dtypes():
```

</p>
</details>

I'm looking forward to your viewpoints. Thanks :)

cc: mruberry kshitij12345 anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71561

Reviewed By: samdow

Differential Revision: D34856571

Pulled By: mruberry

fbshipit-source-id: 0dca038bcad5cf69906245c496d2e61ac3876335
(cherry picked from commit b058f67b4313143efa714ab105f36e74083131b9)
2022-03-15 20:31:41 +00:00
Philip Meier
d4d0ab71b3 use torch.testing.assert_equal in TestCase.assertEqual (#67796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67796

Supersedes #58981.

cc mruberry

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542994

Pulled By: mruberry

fbshipit-source-id: 527099f5fdc154fd95ee48cd19f0a85eeec43443
(cherry picked from commit 1a58915e2c)
2022-01-27 08:33:55 +00:00
Joel Schlosser
61ea2fc35e Fix device type / dtype handling for parametrized test names (#65217)
Summary:
This PR absolves `_TestParametrizer`s (e.g. `ops`, `modules`, `parametrize`) of the responsibility of adding device type (e.g. `'cpu'`, `'cuda'`, etc.) / dtype (e.g. 'float32') to generated test names. This fixes repeated instances of the device string being added to generated test names (e.g. `test_batch_norm_training_True_cuda_track_running_stats_True_cuda_affine_True_cuda`).

The responsibility for placing device / dtype suffixes is now handled by `instantiate_device_type_tests()` instead, so the suffix is added a single time. It will place `<device>_<dtype>` at the end of the test name unconditionally, maintaining the current naming convention.
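
A minimal sketch of the resulting naming convention (assuming a standard device-generic test file):

```python
import torch
from torch.testing._internal.common_device_type import dtypes, instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class TestExample(TestCase):
    @dtypes(torch.float32)
    def test_add(self, device, dtype):
        x = torch.ones(2, device=device, dtype=dtype)
        self.assertEqual(x + x, 2 * x)


# Generates e.g. TestExampleCPU.test_add_cpu_float32 -- the device/dtype suffix
# is appended exactly once, at the end of the generated name.
instantiate_device_type_tests(TestExample, globals())

if __name__ == "__main__":
    run_tests()
```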

As part of this work, I also tightened the semantics through some additional error case handling:
* Composing multiple decorators that each try to handle the same parameter will error out with a nice message. This includes the case of trying to compose `modules` + `ops`, as they each try to handle `dtype`. Similarly, `ops` + `dtypes` is forbidden when both try to handle `dtype`. This required changes in the following test files:
  * `test/test_unary_ufuncs.py`
  * `test/test_foreach.py`
* The `modules` / `ops` decorators will now error out with a nice message if used with `instantiate_parametrized_tests()` instead of `instantiate_device_type_tests()`, since they're not (currently) written to work outside of a device-specific context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65217

Reviewed By: mruberry

Differential Revision: D32627303

Pulled By: jbschlosser

fbshipit-source-id: c2957228353ed46a0b7da8fa1a34c67598779312
2021-11-29 19:02:23 -08:00
Masaki Kozuki
cdd5d16489 [Foreach] Implement L1&L2 norm (#62646)
Summary:
Implements the L1 & L2 norms on the fast path, using [nvidia/apex](https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_l2norm_kernel.cu) as a reference.
When `ord` is neither 1 nor 2, the slow path is chosen.
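
A minimal usage sketch (the per-tensor results should match calling `torch.linalg.vector_norm` in a loop):

```python
import torch

tensors = [torch.rand(100, device="cuda") for _ in range(5)]

l2 = torch._foreach_norm(tensors)       # ord=2, fast path
l1 = torch._foreach_norm(tensors, 1)    # ord=1, fast path
ref = [torch.linalg.vector_norm(t, ord=1) for t in tensors]
print(all(torch.allclose(a, b) for a, b in zip(l1, ref)))
```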

Related: https://github.com/pytorch/pytorch/issues/58833

cc ptrblck mcarilli ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62646

Reviewed By: malfet

Differential Revision: D32173421

Pulled By: ngimel

fbshipit-source-id: 14b7544601658a979b83509df351e1848ded7675
2021-11-05 11:23:00 -07:00
Jane Xu
c19cda5782 [skip ci] Add test owners for a special hi-pri class of tests (#67553)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

This change does require some context: there were several suggestions regarding what to do about this group of tests: tests that are core and crucial to all of PyTorch and are too broad to be owned by one team.
1. Let's add a "module: core" and put people behind it! This idea sounds appealing unless you are one of the people backing the label. From talking to albanD among others, this idea of putting all these core tests on the shoulder of a few people or one team isn't super fair and I have not yet found anyone willing to take on this job.
2. Taking advantage of the fact that we already have a triaging oncall that takes turns triaging issues, we can leave these tests essentially unlabeled and allow the oncall to triage these tests. Since these tests are crucial to PyTorch, we'll add the "high priority" label to mark them different from other unowned tests (see https://github.com/pytorch/pytorch/issues/67552).
3. I _could_ still create an unbacked label "module: core" and attribute these tests there, but I don't like the idea of creating a facade that the tests are "triaged" to a label when no one is actually taking a look.

Now we could potentially break these tests down into smaller files so that each piece _could_ be owned by a team, but 1. I don't know if this is currently feasible and 2. This approach does not prevent that from happening in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67553

Reviewed By: albanD

Differential Revision: D32025004

Pulled By: janeyx99

fbshipit-source-id: 1fb1aa4c27e305695ab6e80ae3d02f90519939c0
2021-10-29 12:17:21 -07:00
Philip Meier
26b7ff5aea deprecate dtype getters from torch.testing namespace (#63554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63554

Following https://github.com/pytorch/pytorch/pull/61840#issuecomment-884087809, this deprecates all the dtype getters publicly exposed in the `torch.testing` namespace. The reason for this is twofold:

1. If someone is not familiar with the C++ dispatch macros PyTorch uses, the names are misleading. For example `torch.testing.floating_types()` will only give you `float32` and `float64` skipping `float16` and `bfloat16`.
2. The dtype getters provide very minimal functionality that can be easily emulated by downstream libraries.

We thought about [providing a replacement](https://gist.github.com/pmeier/3dfd2e105842ad0de4505068a1a0270a), but ultimately decided against it. The major problem is BC: if we keep it, either the namespace gets messy again after a new dtype is added, or we need to somehow version the return values of the getters.
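
For point 2, a sketch of how a downstream library could emulate one of the removed getters (hypothetical helper, not a PyTorch API):

```python
import torch

def get_all_fp_dtypes(include_half=True, include_bfloat16=True):
    dtypes = [torch.float32, torch.float64]
    if include_half:
        dtypes.append(torch.float16)
    if include_bfloat16:
        dtypes.append(torch.bfloat16)
    return dtypes
```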

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D30662206

Pulled By: mruberry

fbshipit-source-id: a2bdb10ab02ae665df1b5b76e8afa9af043bbf56
2021-09-07 08:58:51 -07:00
Kushashwa Ravi Shrimali
d37636901e [Doc] make_tensor to torch.testing module (#63925)
Summary:
This PR aims to add `make_tensor` to the `torch.testing` module in PyTorch docs.

TODOs:

* [x] Add examples
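
For reference, a minimal `make_tensor` usage sketch (assuming the public `torch.testing.make_tensor` signature):

```python
import torch
from torch.testing import make_tensor

# Values drawn from [low, high) with the requested shape/dtype/device.
t = make_tensor((2, 3), dtype=torch.float32, device="cpu", low=0.0, high=1.0)
print(t.shape, t.dtype)  # torch.Size([2, 3]) torch.float32
```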

cc: pmeier mruberry brianjo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63925

Reviewed By: ngimel

Differential Revision: D30633487

Pulled By: mruberry

fbshipit-source-id: 8e5a1f880c6ece5925b4039fee8122bd739538af
2021-08-30 12:25:40 -07:00
Shen Li
1022443168 Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: revert-hammer

Differential Revision:
D30279364 (b004307252)

Original commit changeset: c1ed77dfe43a

fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e
2021-08-12 11:45:01 -07:00
Zsolt Dollenstein
b004307252 [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: manual inspection & sandcastle

Reviewed By: zertosh

Differential Revision: D30279364

fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a
2021-08-12 10:58:35 -07:00
Yukio Siraichi
693b0af996 Port addcmul kernels to structured kernels. (#62318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62318

Tracking issue: #55070

This PR introduces the method `TensorIteratorBase::build_ternary_op` for building a `TensorIteratorBase` for a 3-input, 1-output kernel.
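
For context, `addcmul` itself is the canonical 3-input, 1-output op this iterator serves (a small illustration):

```python
import torch

# out = input + value * tensor1 * tensor2
a, b, c = torch.rand(3), torch.rand(3), torch.rand(3)
out = torch.addcmul(a, b, c, value=0.5)
print(torch.allclose(out, a + 0.5 * b * c))  # True
```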

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29961997

Pulled By: bdhirsh

fbshipit-source-id: 2208d24823bad6e74c8d508f363716d8125b8619
2021-08-04 15:34:01 -07:00
Ilia Cherniavskii
773a8eede4 [profiler][refactor] Refactor the usage of legacy profiler implementation (#61931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61931

This PR consolidates the profiling code around a new C++ implementation
(profiler_kineto.h/cpp) and uses it unconditionally from
torch.autograd.profiler/torch.profiler:
1. Always use profiler_kineto.h/cpp as the C++ implementation
2. Simplify profiler.py to remove unneeded parts depending on legacy
impl
3. Move some of the legacy logic into profiler_legacy.py (to be fully
deleted later)

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake
python test/test_profiler.py -v
USE_KINETO=0 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake
python test/test_profiler.py -v

Imported from OSS

Reviewed By: gdankel

Differential Revision: D29801599

fbshipit-source-id: 9794d29f2af38dddbcd90dbce4481fc8575fa29e
2021-08-03 18:51:29 -07:00
Masaki Kozuki
43d4fe68cd [Foreach] support implicit broadcasting in slow path (#62167)
Summary:
This PR makes foreach functions support implicit broadcasting via the slow path.
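
A minimal sketch (assuming the slow path simply loops the regular op, which broadcasts):

```python
import torch

# Shapes differ per pair but broadcast, so these fall back to the slow path.
xs = [torch.rand(2, 3), torch.rand(4, 1)]
ys = [torch.rand(3), torch.rand(4, 4)]
outs = torch._foreach_add(xs, ys)
print([tuple(o.shape) for o in outs])  # [(2, 3), (4, 4)]
```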

rel: https://github.com/pytorch/pytorch/issues/58833

cc: ptrblck  ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62167

Reviewed By: mruberry

Differential Revision: D30005109

Pulled By: ngimel

fbshipit-source-id: f48c0a13e304411763541ffcfcfc6154adb26bac
2021-07-30 13:29:56 -07:00
Masaki Kozuki
8a2063e58a Foreach Test Refactor: Pointwise, Min/Max-imum (#61327)
Summary:
- rewrite pointwise unittests using `ops` decorator
- rewrite minimum&maximum unittests using `ops` decorator
- enable minimum/maximum fastpath for BFloat16
- remove _test_data method

https://github.com/pytorch/pytorch/issues/58833

cc: ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61327

Reviewed By: albanD

Differential Revision: D29830209

Pulled By: ngimel

fbshipit-source-id: fa7805262b86c40fc32750b16629d80ad48ea4b5
2021-07-21 21:59:57 -07:00
Masaki Kozuki
fac744e116 Foreach Binary Test Refactor (#59907)
Summary:
Related: https://github.com/pytorch/pytorch/issues/58833

## Changes I'm a bit concerned about
- Binary ops with one tensorlist and one scalarlist now support complex dtypes. To realize this, I added a specialization of [`TensorListScalarListMetadata<c10::complex<double>, 1>`](https://github.com/pytorch/pytorch/pull/59907/files#diff-131eb9b310905b15b3528da6a23e542a3a3aa952bc88f7423c98a23a8a28cca1R49). This might be out of the scope of this pull request.

cc ptrblck ngimel mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59907

Reviewed By: mruberry

Differential Revision: D29551001

Pulled By: ngimel

fbshipit-source-id: 46b25fdba85dd4d6332a77b27376fe96cd422384
2021-07-06 11:49:38 -07:00
Natalia Gimelshein
52b2ed65c0 Revert D29007258: Revert D28926135: [pytorch][PR] Refactor Foreach Tests: Unary Functions
Test Plan: revert-hammer

Differential Revision:
D29007258

Original commit changeset: c15f51661641

fbshipit-source-id: 98236153136a5c6b6c2911079b7bd214da6cb424
2021-06-09 21:02:56 -07:00
Natalia Gimelshein
171142f9cc Revert D28926135: [pytorch][PR] Refactor Foreach Tests: Unary Functions
Test Plan: revert-hammer

Differential Revision:
D28926135 (0897df18a3)

Original commit changeset: 4eb21dcebbff

fbshipit-source-id: c15f51661641f455ae265cdf048051a3c01198f9
2021-06-09 14:05:56 -07:00
Masaki Kozuki
0897df18a3 Refactor Foreach Tests: Unary Functions (#58960)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/58833

__changes__
- slowpath tests: pass tensors of every dtype & device and compare the behavior with the regular functions, including in-place variants
- check the number of `cudaLaunchKernel` calls
- rename `ForeachUnaryFuncInfo` -> `ForeachFuncInfo`: this change is mainly for the future binary/pointwise test refactors

cc: ngimel ptrblck mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58960

Reviewed By: ejguan

Differential Revision: D28926135

Pulled By: ngimel

fbshipit-source-id: 4eb21dcebbffffaf79259e31961626e0707fb8d1
2021-06-09 09:45:16 -07:00
Masaki Kozuki
7eade660c6 [PyTorch] Reduce errors of foreach functions (#56993)
Summary:
This is based on  https://github.com/pytorch/pytorch/issues/48224.

To make `foreach` more flexible, this PR pushes unsupported cases to the slow path.
Also, this adds some tests to verify that
- `foreach` functions work with tensors of different dtypes and/or memory layouts (7bd4b2c89f)
- `foreach` functions work with lists whose tensors live on different devices, as long as tensors at the same index across lists share a device (def4b9b5a1)

Future plans:
1. Improve the coverage of unittests using `ops` decorator & updating `foreach_unary_op_db` and creating `foreach_(binary|pointwise|minmax)_db`.
2. Support broadcasting in slow path. Ref:  https://github.com/pytorch/pytorch/pull/52448
3. Support type promotion in fast path. Ref https://github.com/pytorch/pytorch/pull/52449

CC: ngimel mcarilli  ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56993

Reviewed By: zou3519

Differential Revision: D28630580

Pulled By: ngimel

fbshipit-source-id: e26ee74a39a591025e18c1ead48948cb7ec53c19
2021-05-25 10:50:20 -07:00
Iurii Zdebskyi
cc7a28d727 Refactor Unary Ops tests (#49712)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49712

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25673712

Pulled By: izdeby

fbshipit-source-id: 4420d5d129026195097d914e410b75b144bea795
2021-03-19 09:28:00 -07:00
Iurii Zdebskyi
b5cdb53af1 Add division logic to a slow/fast path (#49250)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49250

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502938

Pulled By: izdeby

fbshipit-source-id: bdd583464eb15d7cb30fd0c22d119cc4b31cbf8d
2021-03-15 12:17:39 -07:00
Iurii Zdebskyi
4bb34c2a75 Update Binary Ops with scalar lists (#49249)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49249

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502939

Pulled By: izdeby

fbshipit-source-id: b16e23063b37521be549e83cb17676e3afc4ddb3
2021-03-15 12:16:04 -07:00
Iurii Zdebskyi
84af0c7acd Refactor ForeachUtils.h (#51131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51131

--------
- Refactored `can_use_fast_route` logic in ForeachUtils.h.
- Fixed related bugs in test_foreach.py

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26103904

Pulled By: izdeby

fbshipit-source-id: b3859b39adaab55c87dab6f7709d227adc0f6342
2021-03-13 13:39:25 -08:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Iurii Zdebskyi
c697e48023 Refactor ForeachUnaryOp.cu (#51894)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51894

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26323605

Pulled By: izdeby

fbshipit-source-id: eb65269ab3e14160d7cb5e6e84e85ef4037d3b0d
2021-03-05 10:26:58 -08:00
Iurii Zdebskyi
110a17a4d9 Update foreach APIs to use scalar lists (#51893)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51893

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26323606

Pulled By: izdeby

fbshipit-source-id: 53791087c924d04526fe7adb8f4ab5676d383b04
2021-03-04 18:20:53 -08:00
Michael Carilli
e36576d153 Probable fix for out of place BinaryOpScalar bad values and/or IMAs on 11.2 (ci-all edition) (#52634)
Summary:
Should close https://github.com/pytorch/pytorch/issues/51992.

ci-all resubmit of https://github.com/pytorch/pytorch/pull/52591. The plot also thickened considerably since then. Every foreach functor, it turns out, has bad `r_args` accesses for certain code paths and instantiations.

Also, I noticed the [`n % kILP == 0`](2680ff7759/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L87)) condition for vectorization in all functors is way too restrictive: it'll refuse to vectorize anything on any tensor whose overall numel is not a multiple of ILP. That's out of scope though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52634

Reviewed By: H-Huang

Differential Revision: D26725991

Pulled By: izdeby

fbshipit-source-id: 4bade0ac186bf85527baddc1c44b2c2b8e3c9777
2021-03-01 12:41:24 -08:00
Natalia Gimelshein
443a431ac3 Revert D25074763: [WIP] Update foreach APIs to use scalar lists
Test Plan: revert-hammer

Differential Revision:
D25074763 (cce84b5ca5)

Original commit changeset: 155e3d2073a2

fbshipit-source-id: ef0d153e2740b50bd4a95f7a57c370bb5da46355
2021-02-03 17:06:40 -08:00
Iurii Zdebskyi
cce84b5ca5 [WIP] Update foreach APIs to use scalar lists (#48223)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48223

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25074763

Pulled By: izdeby

fbshipit-source-id: 155e3d2073a20d16bdbe358820170bf53f93c7a5
2021-02-02 14:54:28 -08:00
Iurii Zdebskyi
dad74e58fc [WIP] Added foreach_trunc, foreach_reciprocal, foreach_sigmoid APIs (#47385)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47385

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24737051

Pulled By: izdeby

fbshipit-source-id: ed259d9184b2b784d8cc1983a8b85cc6cbf930ba
2020-12-07 10:47:23 -08:00
Iurii Zdebskyi
94cd048bda Added foreach_frac API (#47384)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47384

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24737052

Pulled By: izdeby

fbshipit-source-id: 8c94cc42bf22bfbb8f78bfeb2017a5756045763a
2020-11-17 16:56:30 -08:00
Iurii Zdebskyi
134bce7cd0 Adding bunch of unary foreach APIs (#47875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47875

Implementing several unary operators for _foreach_ APIs.
### Planned list of ops
- [x]  abs
- [x]  acos
- [x]  asin
- [x]  atan
- [x]  ceil
- [x]  cos
- [x]  cosh
- [x]  erf
- [x]  erfc
- [x]  exp
- [x]  expm1
- [x]  floor
- [x]  log
- [x]  log10
- [x]  log1p
- [x]  log2
- [ ]  frac
- [x]  neg
- [ ]  reciprocal
- [x]  round
- [ ]  rsqrt
- [ ]  sigmoid
- [x]  sin
- [x]  sinh
- [x]  sqrt
- [x]  tan
- [x]  tanh
- [ ]  trunc
- [x]  lgamma
- [ ]  digamma
- [ ]  erfinv
- [ ]  sign
- [ ]  mvlgamma
- [ ]  clamp
- [ ]  clamp_min
- [ ]  clamp_max

### Perf results
| op | `[torch.<op>(t) for t in inputs]` (median) | `torch._foreach_<op>(inputs)` (median) |
|---|---|---|
| sin | 998.79 us | 300.84 us |
| abs | 1.19 ms | 294.97 us |
| acos | 982.30 us | 299.40 us |
| asin | 1.16 ms | 298.09 us |
| atan | 986.92 us | 295.64 us |
| ceil | 1.17 ms | 297.25 us |
| cos | 972.72 us | 294.41 us |
| cosh | 1.17 ms | 294.97 us |
| erf | 1.17 ms | 297.02 us |
| erfc | 1.14 ms | 299.23 us |
| exp | 1.15 ms | 298.79 us |
| expm1 | 1.17 ms | 291.79 us |
| floor | 1.17 ms | 293.51 us |
| log | 1.13 ms | 318.01 us |
| log10 | 987.17 us | 295.57 us |
| log1p | 1.13 ms | 297.15 us |
| log2 | 974.21 us | 295.01 us |
| frac | 1.15 ms | 296.01 us |
| neg | 1.13 ms | 294.98 us |
| reciprocal | 1.16 ms | 293.69 us |
| round | 1.12 ms | 297.48 us |
| sigmoid | 1.13 ms | 296.53 us |
| sin | 991.02 us | 295.78 us |
| sinh | 1.15 ms | 295.70 us |
| sqrt | 1.17 ms | 297.75 us |
| tan | 978.20 us | 297.99 us |
| tanh | 967.84 us | 297.29 us |
| trunc | 1.14 ms | 298.72 us |
| lgamma | 1.14 ms | 317.53 us |

### Script

```python
import torch
import torch.utils.benchmark as benchmark_utils

# 100 small CUDA tensors, mimicking per-parameter optimizer state.
inputs = [torch.rand(3, 200, 200, device="cuda") for _ in range(100)]

def main():
    for op in [
            "sin", "abs", "acos", "asin", "atan", "ceil",
            "cos", "cosh", "erf", "erfc",
            "exp", "expm1", "floor", "log",
            "log10", "log1p", "log2", "frac",
            "neg", "reciprocal", "round",
            "sigmoid", "sin", "sinh", "sqrt",
            "tan", "tanh", "trunc", "lgamma"
        ]:
        print("\n\n----------------- OP: ", op, " -----------------")

        # Baseline: call the regular op once per tensor in a Python loop.
        timer = benchmark_utils.Timer(
            stmt=f"[torch.{op}(t) for t in inputs]",
            globals=globals(),
            label=f"torch.{op} loop",
        )
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

        # Fused variant: a single torch._foreach_<op> call over the whole list.
        timer_mta = benchmark_utils.Timer(
            stmt=f"torch._foreach_{op}(inputs)",
            globals=globals(),
            label=f"torch._foreach_{op}",
        )
        print(f"autorange:\n{timer_mta.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D24948801

Pulled By: izdeby

fbshipit-source-id: defec3c0394d6816d9a8b05a42a057348f1b4d96
2020-11-17 16:51:54 -08:00
Iurii Zdebskyi
1c45631f10 Revert D24737050: [WIP] Adding bunch of unary foreach APIs
Test Plan: revert-hammer

Differential Revision:
D24737050 (b6a2444eff)

Original commit changeset: deb59b41ad1c

fbshipit-source-id: 76cd85028114cfc8fc5b7bb49cd27efc2e315aa5
2020-11-10 09:41:41 -08:00
Iurii Zdebskyi
b6a2444eff [WIP] Adding bunch of unary foreach APIs (#47383)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47383

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D24737050

Pulled By: izdeby

fbshipit-source-id: deb59b41ad1c79b66cafbd9a9d3d6b069794e743
2020-11-09 14:14:28 -08:00
Iurii Zdebskyi
2c55426610 Renamed a TensorListMetaData property. Cleaned up a test (#46662)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46662

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24453346

Pulled By: izdeby

fbshipit-source-id: f88ac21708befa2e8f3edeffe5805b69a4634d12
2020-11-04 12:01:28 -08:00
Iurii Zdebskyi
2652f2e334 Optimize arguments checks (#46661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46661

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24453342

Pulled By: izdeby

fbshipit-source-id: 26866fdbc9dc2b5410b3b728b175a171cc6a4521
2020-11-03 17:43:10 -08:00
Iurii Zdebskyi
3ea26b1424 [WIP] Push rocm to slow path for foreach APIs (#46733)
Summary:
Move ROCm to the slow path for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46733

Reviewed By: ngimel

Differential Revision: D24485012

Pulled By: izdeby

fbshipit-source-id: f0f4227cc594d8a87d44008cd5e27ebe100b6b22
2020-10-23 10:33:41 -07:00
Iurii Zdebskyi
c57c560744 Revert "Push rocm to slow path (#46216)" (#46728)
Summary:
This reverts commit bc1ce58451.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/46728

Reviewed By: cpuhrsch

Differential Revision: D24482783

Pulled By: izdeby

fbshipit-source-id: 619b710a8e790b9878e7317f672b4947e7b88145
2020-10-22 12:04:29 -07:00
Iurii Zdebskyi
bc1ce58451 Push rocm to slow path (#46216)
Summary:
Push ROCm to the slow path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46216

Reviewed By: bwasti

Differential Revision: D24263731

Pulled By: izdeby

fbshipit-source-id: 98ede2478b8f075ceed44a9e4f2aa292f523b8e2
2020-10-22 09:31:01 -07:00
Iurii Zdebskyi
e7564b076c Refactor scalar list APIs to use overloads (#45673)
Summary:
Refactor foreach APIs to use overloads in case of scalar list inputs.
Tested via unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45673

Reviewed By: heitorschueroff

Differential Revision: D24053424

Pulled By: izdeby

fbshipit-source-id: 35976cc50b4acfe228a32ed26cede579d5621cde
2020-10-19 09:28:49 -07:00
Iurii Zdebskyi
8a074af929 Added scalar lists APIs for addcdiv and addcmul (#45932)
Summary:
1) Added new APIs:
 _foreach_addcdiv(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcdiv_(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcmul(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcmul_(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)

2) Updated optimizers to use new APIs

Tested via unit tests
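
A minimal sketch (shapes and values are illustrative) of the per-parameter update these scalar-list variants enable, one fused call instead of a Python loop:

```python
import torch

params     = [torch.zeros(3) for _ in range(2)]
exp_avgs   = [torch.ones(3) for _ in range(2)]
denoms     = [torch.full((3,), 2.0) for _ in range(2)]
step_sizes = [-0.1, -0.2]   # one scalar per parameter

# params[i] += step_sizes[i] * exp_avgs[i] / denoms[i], applied over the whole list
torch._foreach_addcdiv_(params, exp_avgs, denoms, step_sizes)
print(params[0])   # tensor([-0.0500, -0.0500, -0.0500])
```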

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45932

Reviewed By: navahgar

Differential Revision: D24150306

Pulled By: izdeby

fbshipit-source-id: c2e65dedc95d9d81a2fdd116e41df0accb0b6f26
2020-10-14 08:12:37 -07:00
Iurii Zdebskyi
1a57b390e8 Add torch._foreach_maximum(TensorList, TensorList) & torch._foreach_minimum(TensorList, TensorList) APIs (#45692)
Summary:
- Adding torch._foreach_maximum(TensorList, TensorList) API
- Adding torch._foreach_minimum(TensorList, TensorList) API
- Updated Adam/AdamW optimizers

Tested via unit tests
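
A minimal sketch (values are illustrative) of the element-wise list variant, e.g. the AMSGrad-style running maximum used in Adam/AdamW:

```python
import torch

max_exp_avg_sqs = [torch.tensor([1.0, 5.0]), torch.tensor([2.0, 2.0])]
exp_avg_sqs     = [torch.tensor([3.0, 4.0]), torch.tensor([1.0, 6.0])]

# Element-wise maximum across paired tensors, one call for the whole list.
out = torch._foreach_maximum(max_exp_avg_sqs, exp_avg_sqs)
print(out)   # [tensor([3., 5.]), tensor([2., 6.])]
```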

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45692

Reviewed By: anjali411

Differential Revision: D24142464

Pulled By: izdeby

fbshipit-source-id: 6a4fc343a1613cb1e26c8398450ac9cea0a2eb51
2020-10-13 09:22:30 -07:00
Sam Estep
a69a78daa2 Use smaller N to speed up TestForeach (#45785)
Summary:
Between September 25 and September 27, approximately half an hour was added to the running time of `pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test`. Judging from the CircleCI data, it looks like the majority of the new time was added by the following PRs:

- https://github.com/pytorch/pytorch/issues/44550
- https://github.com/pytorch/pytorch/issues/45298

I'm not sure what to do about https://github.com/pytorch/pytorch/issues/44550, but it looks like https://github.com/pytorch/pytorch/issues/45298 increased the `N` for `TestForeach` from just 20 to include both 30 and 300. This PR would remove the 300, decreasing the test time by a couple orders of magnitude (at least when running it on my devserver), from over ten minutes to just a few seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45785

Reviewed By: malfet

Differential Revision: D24094782

Pulled By: samestep

fbshipit-source-id: 2476cee9d513b2b07bc384de751e08d0e5d8b5e7
2020-10-06 13:29:04 -07:00
Michael Carilli
72bc3d9de4 Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778)
Summary:
Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for).  This PR adds an MTA unscale+infcheck functor.  Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work!

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).

The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`).  This allows calling code to pass lambdas.

Open question:  As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops.  However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control.  I can easily rewrite it that way if you prefer.
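
For context, the public path that exercises this fused unscale+inf-check is `torch.cuda.amp.GradScaler`; a minimal usage sketch (model and data are placeholders):

```python
import torch

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 10, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales grads (multi-tensor) and checks for inf/NaN before stepping
    scaler.update()
```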

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778

Reviewed By: gchanan

Differential Revision: D23944102

Pulled By: izdeby

fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
2020-10-01 07:51:16 -07:00
Iurii Zdebskyi
d5748d9a1a Enable binary ops with Scalar Lists with for foreach APIs (#45298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45298

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931986

Pulled By: izdeby

fbshipit-source-id: 281267cd6f90d57a169af89f9f10b0f4fcab47e3
2020-09-25 12:58:34 -07:00
Xinyu Li
26001a2334 Revert D23753711: [pytorch][PR] Add foreach APIs for binary ops with ScalarList
Test Plan: revert-hammer

Differential Revision:
D23753711 (71d1b5b0e2)

Original commit changeset: bf3e8c54bc07

fbshipit-source-id: 192692e0d3fff4cade9983db0a1760fedfc9674c
2020-09-24 11:55:49 -07:00
iurii zdebskyi
71d1b5b0e2 Add foreach APIs for binary ops with ScalarList (#44743)
Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed a `_foreach_div(...)` bug in native_functions.yaml.
3) Covered all possible cases with scalars and scalar lists in tests.
4) [minor] Fixed a bug in native_functions.yaml by adding "use_c10_dispatcher: full" to all _foreach functions.

Tested via unit tests.
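
A minimal usage sketch (values are illustrative) of a binary op taking one scalar per tensor:

```python
import torch

tensors = [torch.ones(3) for _ in range(3)]
scalars = [1.0, 2.0, 3.0]   # one scalar per tensor in the list

out = torch._foreach_mul(tensors, scalars)
print([t[0].item() for t in out])   # [1.0, 2.0, 3.0]
```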

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743

Reviewed By: bwasti, malfet

Differential Revision: D23753711

Pulled By: izdeby

fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
2020-09-24 08:30:42 -07:00
Mike Ruberry
686e281bcf Updates div to perform true division (#42907)
Summary:
This PR:

- updates div to perform true division
- makes torch.true_divide an alias of torch.div

This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
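
A minimal illustration of the new behavior:

```python
import torch

a = torch.tensor([5, 3])
b = torch.tensor([2, 2])

print(torch.div(a, b))          # tensor([2.5000, 1.5000]) -- true division, float result
print(torch.true_divide(a, b))  # same result: true_divide is now an alias of div
```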

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907

Reviewed By: ngimel

Differential Revision: D23622114

Pulled By: mruberry

fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
2020-09-14 15:50:38 -07:00
Iurii Zdebskyi
40d138f7c1 Added alpha overloads for add/sub ops with lists (#43413)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43413

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331896

Pulled By: izdeby

fbshipit-source-id: 2e7484339fec533e21224f18979fddbeca649d2c
2020-09-08 17:02:08 -07:00
Mike Ruberry
63d62d3e44 Skips test_addcmul_cuda if using ROCm (#44304)
Summary:
This test is failing consistently on linux-bionic-rocm3.7-py3.6-test2. Relevant log snippet:

```
03:43:11 FAIL: test_addcmul_cuda_float16 (__main__.TestForeachCUDA)
03:43:11 ----------------------------------------------------------------------
03:43:11 Traceback (most recent call last):
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 818, in wrapper
03:43:11     method(*args, **kwargs)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 258, in instantiated_test
03:43:11     result = test(self, *args)
03:43:11   File "test_foreach.py", line 83, in test_addcmul
03:43:11     self._test_pointwise_op(device, dtype, torch._foreach_addcmul, torch._foreach_addcmul_, torch.addcmul)
03:43:11   File "test_foreach.py", line 58, in _test_pointwise_op
03:43:11     self.assertEqual(tensors, expected)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1153, in assertEqual
03:43:11     exact_dtype=exact_dtype, exact_device=exact_device)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1127, in assertEqual
03:43:11     self.assertTrue(result, msg=msg)
03:43:11 AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.001 and atol=1e-05, found 10 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.00048828125 (-0.46484375 vs. -0.46533203125), which occurred at index (11, 18).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44304

Reviewed By: malfet, izdeby

Differential Revision: D23578316

Pulled By: mruberry

fbshipit-source-id: 558eecf42677383e7deaa4961e12ef990ffbe28c
2020-09-08 13:14:25 -07:00
Iurii Zdebskyi
cce5982c4c Add unary ops: exp and sqrt (#42537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching many kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with the fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331889

Pulled By: izdeby

fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
2020-09-07 19:57:34 -07:00
Iurii Zdebskyi
10dd25dcd1 Add binary ops for _foreach APIs (#42536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching many kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with the fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)

torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div(TensorList self, Scalar scalar)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331891

Pulled By: izdeby

fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
2020-09-07 10:29:32 -07:00
iurii zdebskyi
2f044d4ee5 Fix CI build (#44068)
Summary:
Some of our machines have only 1 device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44068

Reviewed By: wanchaol

Differential Revision: D23485730

Pulled By: izdeby

fbshipit-source-id: df6bc0aba18feefc50c56a8f376103352fa2a2ea
2020-09-02 17:09:30 -07:00
Iurii Zdebskyi
297c938729 Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs (#42533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching many kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with the fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API
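
A minimal usage sketch (tensors are illustrative) of the list-list variant next to the equivalent per-tensor loop:

```python
import torch

xs = [torch.ones(3) for _ in range(4)]
ys = [torch.full((3,), 2.0) for _ in range(4)]

out = torch._foreach_add(xs, ys)        # one call for the whole list
ref = [x + y for x, y in zip(xs, ys)]   # equivalent Python loop
assert all(torch.equal(a, b) for a, b in zip(out, ref))

torch._foreach_add_(xs, ys)             # in-place variant mutates the tensors in xs
```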

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331894

Pulled By: izdeby

fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
2020-09-02 12:18:28 -07:00
Iurii Zdebskyi
4cb8d306e6 Add _foreach_add_(TensorList tensors, Scalar scalar) API (#42531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42531

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching many kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with the fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

---------------
**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from previous [PR](https://github.com/pytorch/pytorch/pull/41554).

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331892

Pulled By: izdeby

fbshipit-source-id: c585b72e1e87f6f273f904f75445618915665c4c
2020-08-28 14:34:46 -07:00
iurii zdebskyi
e995c3d21e Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar) (#41554)
Summary:
Initial PR for the Tensor List functionality.

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching many kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding `multi_tensor_apply` mechanism which will help to efficiently apply passed functor on a given list of tensors on CUDA.
- Adding a first private API - `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)`
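
A minimal usage sketch of the API added here (on the fast path the whole list is handled with far fewer kernel launches than a per-tensor loop):

```python
import torch

tensors = [torch.zeros(2, 2, device="cuda") for _ in range(100)]

# Same dtype/device/layout across the list -> eligible for the fused fast path.
out = torch._foreach_add(tensors, 1.0)
print(out[0])   # 2x2 tensor of ones
```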

**Tests**
Tested via unit tests

**Plan for the next PRs**

1. Cover these ops with `multi_tensor_apply` support
- exponent
- division
- mul_
- add_
- addcmul_
- addcdiv_
- Sqrt

2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41554

Reviewed By: cpuhrsch

Differential Revision: D22829724

Pulled By: izdeby

fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
2020-08-04 15:01:09 -07:00