Commit Graph

2119 Commits

Author SHA1 Message Date
kshitij12345
aa51704ce5 [complex32] add chalf alias for complex32 and chalf method
Reference: https://github.com/pytorch/pytorch/issues/74537

Adds the `chalf` alias for `complex32` and a `chalf` method analogous to `cfloat` and `cdouble`

TODO:
* [x] Add docs
* [x] Add override
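
A minimal usage sketch of the alias and method added here (assuming they land exactly as described):

```python
import torch

# torch.chalf is an alias for torch.complex32
print(torch.chalf == torch.complex32)  # True

x = torch.randn(3, dtype=torch.cfloat)
y = x.chalf()                          # mirrors .cfloat() / .cdouble()
print(y.dtype)                         # torch.complex32
```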
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75320
Approved by: https://github.com/anjali411
2022-04-20 23:44:47 +00:00
Edward Z. Yang
ee955b8bb9 Cannibalize noarch CI job into crossref CI job
crossref is a new strategy for performing tests when you want
to run a normal PyTorch API call, separately run some variation of
the API call (e.g., same thing but all the arguments are meta tensors)
and then cross-reference the results to see that they are consistent.
Any logic you add to CrossRefMode will get run on *every* PyTorch API
call that is called in the course of PyTorch's test suite.  This can
be a good choice for correctness testing if OpInfo testing is not
exhaustive enough.

For now, the crossref test doesn't do anything except verify that
we can validly push a mode onto the torch function mode stack for all
functions.
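
A rough sketch of the idea, assuming the `TorchFunctionMode` API from `torch.overrides`; the `CrossRefMode` below is illustrative, not the exact class added in this PR:

```python
import torch
from torch.overrides import TorchFunctionMode

class CrossRefMode(TorchFunctionMode):
    """Illustrative: run every torch.* call normally, re-run it on meta
    tensors, and cross-reference the output shapes."""
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        out = func(*args, **kwargs)  # the mode is disabled inside its own handler
        try:
            meta_args = tuple(
                a.to("meta") if isinstance(a, torch.Tensor) else a for a in args
            )
            meta_out = func(*meta_args, **kwargs)
            if isinstance(out, torch.Tensor) and isinstance(meta_out, torch.Tensor):
                assert out.shape == meta_out.shape
        except Exception:
            pass  # meta coverage is incomplete; skip ops that don't support it
        return out

with CrossRefMode():
    torch.add(torch.ones(2), torch.ones(2))
```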

Signed-off-by: Edward Z. Yang <ezyangfb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75988

Approved by: https://github.com/seemethere
2022-04-20 11:56:25 +00:00
Edward Z. Yang
30943d1610 Remove noarchTest decorator
These tests are cheap so it doesn't matter if we run them on all
configs.  This is in preparation for removing the noarch build
configuration entirely.

Signed-off-by: Edward Z. Yang <ezyangfb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75985

Approved by: https://github.com/seemethere, https://github.com/cbalioglu
2022-04-19 00:48:49 +00:00
Beilei Zheng
332086c08d Add BFloat16 support for multinomial and poisson on CPU
Add BFloat16 support for multinomial and poisson on CPU
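
A small sanity check of the newly supported dtype on CPU (sketch; the output dtypes noted in the comments are the expected behavior):

```python
import torch

probs = torch.tensor([0.2, 0.3, 0.5], dtype=torch.bfloat16)   # CPU tensor
samples = torch.multinomial(probs, num_samples=4, replacement=True)

rates = torch.full((3,), 4.0, dtype=torch.bfloat16)
counts = torch.poisson(rates)
print(samples.dtype, counts.dtype)  # int64 indices; poisson keeps bfloat16
```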

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63215
Approved by: https://github.com/frank-wei, https://github.com/bigfootjon
2022-04-14 15:42:18 +00:00
Jagadish Krishnamoorthy
26ba7a9297 ROCm: Enable test_masked_scatter_large_tensor
#68487 fixes issue #60190 for ROCm releases >= 5.0.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Fixes #60190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75455
Approved by: https://github.com/ezyang
2022-04-08 15:59:40 +00:00
Nikita Karetnikov
936a65056e Use the same checks in all grid_sampler functions
Fixes #73187.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75164
Approved by: https://github.com/albanD
2022-04-04 15:21:44 +00:00
Alban Desmaison
0ce02ea52d Revert D35284563: Use the same checks in all grid_sampler functions
Test Plan: revert-hammer

Differential Revision:
D35284563 (835cc66e5d)

Original commit changeset: 1477c506b875

Original Phabricator Diff: D35284563 (835cc66e5d)

fbshipit-source-id: 7260f4dfda23bd60200e5ba2c5bf3e4f833c2646
(cherry picked from commit fbe082905ef678e7dd70dbc9520dca644383ce01)
2022-04-01 16:45:46 +00:00
kshitij12345
65b65af236 [complex32] cat, fill_(partial), item
Reference : #74537

`cat_backwards` (on CUDA) requires support for `fill`, so `fill` support is added here as well (and `fill` in turn requires `item` support).

`fill`'s backward requires `sum`, which will be added in a later PR.
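
A minimal sketch exercising the three ops named above (assuming complex32 tensors are created by casting from cfloat, per the earlier copy_ PR):

```python
import torch

a = torch.randn(2, dtype=torch.cfloat).to(torch.complex32)
b = torch.randn(2, dtype=torch.cfloat).to(torch.complex32)

c = torch.cat([a, b])   # cat now accepts complex32 inputs
a.fill_(1 + 2j)         # partial fill_ support (needed by cat's CUDA backward)
v = a[0].item()         # item() on a complex32 element
print(c.dtype, v)
```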
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75010
Approved by: https://github.com/anjali411
2022-04-01 15:19:05 +00:00
Nikita Karetnikov
835cc66e5d Use the same checks in all grid_sampler functions (#74635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74635

Fixes #73187.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D35284563

Pulled By: albanD

fbshipit-source-id: 1477c506b8755d864ca902ee140bee7bdb0069b0
(cherry picked from commit dcbd5242baaae11f9e323d99a9596e5b88e86bd7)
2022-04-01 14:26:16 +00:00
Mikayla Gawarecki
2bfa018462 [BC-breaking] Use ScatterGatherKernel for scatter_reduce (CPU-only) (#74226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74226

Update signature of `scatter_reduce_` to match `scatter_/scatter_add_`

`Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)`

- Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce`
- `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_`
- Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py`
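
Hedged usage sketch of the updated calling convention (`"sum"` is one of the reductions handled by the CPU kernel):

```python
import torch

src = torch.tensor([1., 2., 3., 4.])
index = torch.tensor([0, 1, 0, 1])
out = torch.zeros(2)

# same calling convention as scatter_/scatter_add_, plus a reduce string;
# slot 0 accumulates 1+3, slot 1 accumulates 2+4 (on top of the initial zeros)
out.scatter_reduce_(0, index, src, reduce="sum")
print(out)
```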

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D35222842

Pulled By: mikaylagawarecki

fbshipit-source-id: 84930add2ad30baf872c495251373313cb7428bd
(cherry picked from commit 1b45139482e22eb0dc8b6aec2a7b25a4b58e31df)
2022-04-01 05:57:45 +00:00
Nikita Shulga
bfac65dfe5
[testing] Update dispatch macros (#74977)
This PR is reland of #74289 
Co-authored-by: Khushi Agrawal <khushiagrawal411@gmail.com>
2022-03-30 14:13:21 -07:00
PyTorch MergeBot
2e4152b118 Revert "[testing] Update dispatch macros"
This reverts commit eed19a0f38.

Reverted https://github.com/pytorch/pytorch/pull/74289 on behalf of https://github.com/malfet
2022-03-30 19:52:37 +00:00
Khushi Agrawal
eed19a0f38 [testing] Update dispatch macros
Hi,
This PR is the follow-up to #71561 (the previous PR had a couple of merge conflicts and was reverted; this one resolves that).
Please take a look. Thanks!

cc: @pmeier @mruberry @kshitij12345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74289
Approved by: https://github.com/pmeier, https://github.com/mruberry
2022-03-30 16:10:16 +00:00
Edward Z. Yang
51e7a3406c Fix formatting of scalar tensors (don't call item)
Signed-off-by: Edward Z. Yang <ezyangfb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74376

Approved by: https://github.com/bdhirsh
2022-03-25 02:22:25 +00:00
Jane Xu
3f9115dc7a Decorate test_pdist_large for requiring large memory (#74574)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/74154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74574

Reviewed By: george-qi

Differential Revision: D35100229

Pulled By: janeyx99

fbshipit-source-id: d7df377318e45c7f5447c034aa025b1422fcc06e
(cherry picked from commit 335a76d9f2a721b30e1b9e1c869bfbe431f01a2a)
2022-03-24 17:25:37 +00:00
kshitij12345
f7ee308dfb [complex-half] support casting (by updating copy_)
Reference https://github.com/pytorch/pytorch/issues/71680
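
A short sketch of the casts this enables (assuming both real-to-complex-half and complex-to-complex-half copies are covered):

```python
import torch

x = torch.randn(4)                                           # float32
y = x.to(torch.complex32)                                    # real -> complex half
z = torch.randn(4, dtype=torch.cfloat).to(torch.complex32)   # complex -> complex half
print(y.dtype, z.dtype)                                      # torch.complex32 twice
```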

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73847
Approved by: https://github.com/anjali411
2022-03-23 21:42:59 +00:00
Kurt Mohler
79ddc72b85 Virtualize <type>Storage classes (#66970)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66228

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66970

Reviewed By: bdhirsh

Differential Revision: D33245612

Pulled By: ezyang

fbshipit-source-id: 4c61c2cb029e2b94b0e68927c377d3e1c358dd7c
(cherry picked from commit d29fcdfb4bc2cc17b1795d4349e4b56fa0d1cf12)
2022-03-22 23:44:48 +00:00
Saketh Are
46a88036af Refactor error input tests in test_torch.py to OpInfos (#73981)
Summary:
This PR ports several tests in `test/test_torch.py` over to OpInfo ErrorInputs.

Some tests commented "convert to ErrorInputs" still remain in `test_torch.py`. They fall under two categories:
- Memory overlap tests which specifically test the in-place version of an operator (e.g. [this test](424a054d53/test/test_torch.py (L3788)) for index_add_).
- Tests with non-trivial behavior calling `torch.cuda.synchronize()` after calling the operator being tested (e.g. [this test](424a054d53/test/test_torch.py (L4948)) for torch.multinomial).
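
For reference, an ErrorInputs generator has roughly this shape (sketch based on the internal OpInfo helpers; the module path, signatures, and error regex are illustrative):

```python
import torch
from torch.testing._internal.common_methods_invocations import (
    ErrorInput, SampleInput)

def error_inputs_gather(op_info, device, **kwargs):
    # gather() requires int64 indices; the harness checks both the
    # exception type and a regex over the message.
    x = torch.rand(2, 3, device=device)
    bad_index = torch.zeros(2, 3, device=device, dtype=torch.int32)
    yield ErrorInput(SampleInput(x, args=(0, bad_index)),
                     error_type=RuntimeError,
                     error_regex="Expected dtype int64 for index")
```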

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73981

Reviewed By: qihqi

Differential Revision: D35016669

Pulled By: saketh-are

fbshipit-source-id: bc0016d2b2bfb566a9dfef81ecf44e0adb9e4b14
(cherry picked from commit 99bcbdb05f2c10a717a269b0010aa3a3e24fe5c0)
2022-03-21 22:31:37 +00:00
Nikita Shulga
ef066f0832 Revert D34856571: [pytorch][PR] Replace get_all_ type macros with the ATen dispatch macros.
Test Plan: revert-hammer

Differential Revision:
D34856571 (3ded7b1da3)

Original commit changeset: 0dca038bcad5

Original Phabricator Diff: D34856571 (3ded7b1da3)

fbshipit-source-id: 594553fa0b710d78beba59d5d2b646f1f1270386
(cherry picked from commit 8090eb9b12dcf452a9e7dc01792a66fb91b563b6)
2022-03-15 22:07:11 +00:00
Khushi Agrawal
3ded7b1da3 Replace get_all_ type macros with the ATen dispatch macros. (#71561)
Summary:
Hi, Team!
The PR is motivated by https://github.com/pytorch/pytorch/pull/71153#discussion_r782446738. It aims to replace the `get_all_*` type macros with the ATen dispatch macros.
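
The substitution looks roughly like this (sketch; the dtype sets are intended to match, with `all_types_and_complex_and` coming from `torch.testing._internal.common_dtype`):

```python
import torch
from torch.testing._internal.common_device_type import dtypes
from torch.testing._internal.common_dtype import all_types_and_complex_and

# Before (old helper):
#     @dtypes(*get_all_dtypes(include_bfloat16=False))
# After (explicit dtype-set function):
@dtypes(*all_types_and_complex_and(torch.half, torch.bool))
def test_some_op(self, device, dtype):   # normally a method on a device-generic TestCase
    ...
```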

The files it iterates over are: (Thanks, Lezcano, for the idea!!)

<details>
<summary>

`test/test_autograd.py`</summary>

<p>

```python
43:from torch.testing._internal.common_dtype import get_all_dtypes
8506:        floating_dt = [dt for dt in get_all_dtypes() if dt.is_floating_point]
```

</p>
</details>

<details>
<summary>

`test/test_binary_ufuncs.py`</summary>

<p>

```python
26:    all_types_and_complex_and, integral_types_and, get_all_dtypes, get_all_int_dtypes, get_all_math_dtypes,
27:    get_all_complex_dtypes, get_all_fp_dtypes,
935:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1035:    dtypes(*get_all_dtypes(
1488:    dtypes(*(get_all_dtypes(include_bool=False, include_bfloat16=False)))
1879:    dtypes(*product(get_all_dtypes(include_complex=False), get_all_dtypes(include_complex=False)))
1887:    dtypes(*(get_all_int_dtypes() + [torch.bool]))
1913:    dtypes(*(get_all_fp_dtypes()))
1941:    dtypes(*(get_all_fp_dtypes()))
1977:    dtypes(*product(get_all_complex_dtypes(), get_all_dtypes()))
2019:    dtypes(*product(get_all_fp_dtypes(), get_all_fp_dtypes()))
2048:    dtypes(*get_all_dtypes())
2110:    dtypes(*product(get_all_dtypes(include_complex=False),
2111:                     get_all_dtypes(include_complex=False)))
2128:            types = [torch.bool, torch.bfloat16] + get_all_int_dtypes()
2173:        if dtypes[1] in get_all_fp_dtypes():
2178:    dtypes(*product(get_all_fp_dtypes(),
2179:                     get_all_fp_dtypes()))
2260:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')) - {torch.complex64, torch.complex128})
2261:    dtypes(*set(get_all_math_dtypes('cpu')) - {torch.complex64, torch.complex128})
2273:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')) - {torch.complex64, torch.complex128})
2274:    dtypes(*set(get_all_math_dtypes('cpu')) - {torch.complex64, torch.complex128})
2307:    dtypes(*get_all_math_dtypes('cpu'))
2319:    dtypes(*get_all_fp_dtypes(include_bfloat16=False))
2331:    dtypes(*get_all_int_dtypes())
2356:    dtypes(*get_all_dtypes(include_bfloat16=False, include_bool=False, include_complex=False))
2393:        if dtype in get_all_int_dtypes():
2614:    dtypes(*get_all_dtypes())
2624:    dtypes(*tuple(itertools.combinations_with_replacement(get_all_dtypes(), 2)))
2806:    dtypes(*list(product(get_all_dtypes(include_complex=False),
2807:                          get_all_dtypes(include_complex=False))))
2866:    dtypes(*list(product(get_all_complex_dtypes(),
2867:                          get_all_complex_dtypes())))
2902:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
2906:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
2910:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
3019:        dtypes = [torch.float, torch.double] + get_all_complex_dtypes()
3221:    dtypes(*get_all_dtypes(include_complex=False))
3407:    dtypes(*list(product(get_all_dtypes(include_bool=False),
3408:                          get_all_dtypes(include_bool=False))))
3504:    dtypes(*product(get_all_dtypes(include_complex=False, include_bfloat16=False),
3505:                     get_all_dtypes(include_complex=False, include_bfloat16=False)))
3516:            if x.dtype in get_all_int_dtypes() + [torch.bool]:
3643:    dtypes(*product(get_all_dtypes(include_complex=False,
3645:                     get_all_dtypes(include_complex=False,
```

</p>
</details>

<details>
<summary>

`test/test_complex.py`</summary>

<p>

```python
6:from torch.testing._internal.common_dtype import get_all_complex_dtypes
11:    dtypes(*get_all_complex_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_foreach.py`</summary>

<p>

```python
18:    get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes,
142:            if dtype in get_all_int_dtypes():
179:            disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
201:            disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
205:                disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool]
211:                disable_fastpath |= dtype not in get_all_complex_dtypes()
241:                bool_int_div = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
246:                    disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool]
248:                    disable_fastpath |= dtype not in get_all_complex_dtypes()
250:                    disable_fastpath |= True and dtype not in get_all_complex_dtypes()
307:        disable_fastpath = dtype in get_all_int_dtypes() + [torch.bool]
365:        if opinfo.name == "_foreach_abs" and dtype in get_all_complex_dtypes():
376:    ops(foreach_unary_op_db, dtypes=get_all_dtypes())
393:         dtypes=get_all_dtypes(include_half=True, include_bfloat16=True, include_complex=False))
401:    ops(foreach_minmax_op_db, dtypes=get_all_fp_dtypes(include_bfloat16=True, include_half=True))
426:            if ord in (1, 2) and dtype in torch.testing.get_all_fp_dtypes():
439:    dtypes(*get_all_dtypes())
449:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
481:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
536:            if dtype in get_all_int_dtypes() + [torch.bool] and foreach_op == torch._foreach_div:
545:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
637:    ops(foreach_pointwise_op_db, allowed_dtypes=get_all_fp_dtypes(include_half=False, include_bfloat16=False))
```

</p>
</details>

<details>
<summary>

`test/test_linalg.py`</summary>

<p>

```python
29:    all_types, floating_types, floating_and_complex_types, get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes,
30:    get_all_fp_dtypes,
111:    dtypes(*(get_all_dtypes()))
794:        float_and_complex_dtypes = get_all_fp_dtypes() + get_all_complex_dtypes()
807:    dtypes(*(get_all_int_dtypes()))
828:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
841:        if dtype in get_all_complex_dtypes():
844:    dtypes(*itertools.product(get_all_dtypes(),
845:                               get_all_dtypes()))
855:        for dtypes0, dtypes1, dtypes2 in product(get_all_dtypes(), repeat=3):
5607:                  *get_all_fp_dtypes(include_half=not CUDA9, include_bfloat16=(CUDA11OrLater and SM53OrLater)))
5608:    dtypes(*(set(get_all_dtypes()) - {torch.half, torch.bool}))
5644:    dtypes(*(get_all_complex_dtypes() + get_all_fp_dtypes()))
6255:    dtypesIfCUDA(*get_all_complex_dtypes(),
6256:                  *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater)),
6292:    dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater))))
6323:    dtypesIfCUDA(*get_all_complex_dtypes(),
6324:                  *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater))))
6325:    dtypes(*get_all_complex_dtypes(), *get_all_fp_dtypes())
6358:    dtypesIfCUDA(*([torch.float, torch.double] + get_all_complex_dtypes()))
6556:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
6668:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
6741:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_nn.py`</summary>

<p>

```python
37:from torch.testing._internal.common_dtype import integral_types, get_all_fp_dtypes, get_all_math_dtypes
50:    onlyNativeDeviceTypes, deviceCountAtLeast, largeTensorTest, expectedFailureMeta, skipMeta, get_all_device_types, \
8862:                for device in get_all_device_types():
9629:            for dt1 in get_all_math_dtypes(device):
9630:                for dt2 in get_all_math_dtypes(device):
9631:                    for dt3 in get_all_math_dtypes(device):
9648:            for input_dtype in get_all_math_dtypes(device):
9664:            for input_dtype in get_all_math_dtypes(device):
13015:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
13034:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
13159:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
17400:    dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
17768:    dtypesIfCUDA(*get_all_fp_dtypes())
17773:    dtypesIfCUDA(*get_all_fp_dtypes())
17778:    dtypesIfCUDA(*get_all_fp_dtypes())
17783:    dtypesIfCUDA(*get_all_fp_dtypes())
17788:    dtypesIfCUDA(*get_all_fp_dtypes())
17793:    dtypesIfCUDA(*get_all_fp_dtypes())
17798:    dtypesIfCUDA(*get_all_fp_dtypes())
17963:    dtypesIfCUDA(*get_all_fp_dtypes())
17977:    dtypesIfCUDA(*get_all_fp_dtypes())
18684:    def test_cross_entropy_loss_prob_target_all_reductions(self, device):
```

</p>
</details>

<details>
<summary>

`test/test_numpy_interop.py`</summary>

<p>

```python
12:from torch.testing._internal.common_dtype import get_all_dtypes
399:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_ops.py`</summary>

<p>

```python
12:from torch.testing._internal.common_dtype import floating_and_complex_types_and, get_all_dtypes
86:        for dtype in get_all_dtypes():
```

</p>
</details>

<details>
<summary>

`test/test_reductions.py`</summary>

<p>

```python
16:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes,
360:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
366:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
394:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
750:        for dtype in [dtype for dtype in get_all_math_dtypes('cpu') if dtype != torch.float16]:
1404:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1457:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1458:              get_all_complex_dtypes()))
1465:            return dtype in get_all_int_dtypes()
1494:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1501:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1507:    dtypes(*(get_all_complex_dtypes()))
1514:        dtypes = list(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False))
1523:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1531:        if dtype in get_all_fp_dtypes():
1608:    dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
1837:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1855:    dtypes(*(set(get_all_dtypes(include_bool=False, include_complex=False)) - {torch.uint8}))
3219:        for dtype in get_all_dtypes(include_half=True, include_bfloat16=False,
```

</p>
</details>

<details>
<summary>

`test/test_serialization.py`</summary>

<p>

```python
26:from torch.testing._internal.common_dtype import get_all_dtypes
586:        for device, dtype in product(devices, get_all_dtypes()):
589:            for other_dtype in get_all_dtypes():
```

</p>
</details>

<details>
<summary>

`test/test_shape_ops.py`</summary>

<p>

```python
18:from torch.testing._internal.common_dtype import get_all_dtypes
230:    dtypes(*get_all_dtypes(include_complex=False, include_bool=False, include_half=False,
232:    dtypesIfCUDA(*get_all_dtypes(include_complex=False, include_bool=False, include_bfloat16=False))
344:    dtypes(*get_all_dtypes())
443:    dtypes(*get_all_dtypes())
461:    dtypes(*get_all_dtypes())
570:    dtypes(*get_all_dtypes(include_complex=False))
```

</p>
</details>

<details>
<summary>

`test/test_sort_and_select.py`</summary>

<p>

```python
12:    all_types, all_types_and, floating_types_and, get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes,
136:    dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128})
231:    dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128})
296:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
647:    dtypesIfCUDA(*get_all_fp_dtypes())
678:    dtypesIfCUDA(*(get_all_dtypes(include_complex=False,
682:    dtypes(*(get_all_dtypes(include_complex=False, include_bool=False, include_half=False, include_bfloat16=False)))
739:    dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128})
740:    dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128})
799:    dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128})
800:    dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128})
```

</p>
</details>

<details>
<summary>

`test/test_sparse.py`</summary>

<p>

```python
20:from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes
29:    floating_and_complex_types, floating_and_complex_types_and, get_all_dtypes, get_all_int_dtypes,
1963:            return dtype in get_all_int_dtypes()
1994:    dtypes(*get_all_dtypes(include_bool=False, include_half=False,
2103:            return dtype in get_all_int_dtypes()
2138:    dtypes(*get_all_dtypes(include_bool=False, include_half=False,
2626:        all_sparse_dtypes = get_all_dtypes(include_complex=True)
2633:        all_sparse_dtypes = get_all_dtypes(include_complex=True)
3230:    dtypes(*get_all_complex_dtypes(),
3231:            *get_all_fp_dtypes(include_half=False, include_bfloat16=False))
3234:                  *get_all_fp_dtypes(
```

</p>
</details>

<details>
<summary>

`test/test_sparse_csr.py`</summary>

<p>

```python
7:from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes, floating_and_complex_types, make_tensor
17:from torch.testing._internal.common_dtype import floating_types, get_all_dtypes
120:    dtypes(*get_all_dtypes())
133:    dtypes(*get_all_dtypes())
150:    dtypes(*get_all_dtypes())
180:    dtypes(*get_all_dtypes())
201:    dtypes(*get_all_dtypes())
210:    dtypes(*get_all_dtypes())
225:    dtypes(*get_all_dtypes())
244:    dtypes(*get_all_dtypes())
263:    dtypes(*get_all_dtypes())
285:    dtypes(*get_all_dtypes())
411:    dtypes(*get_all_dtypes())
482:    dtypes(*get_all_dtypes())
502:    dtypes(*get_all_dtypes())
562:    dtypes(*get_all_dtypes())
588:    dtypesIfCUDA(*get_all_complex_dtypes(),
589:                  *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
745:    dtypesIfCUDA(*get_all_complex_dtypes(),
746:                  *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC,
765:    dtypesIfCUDA(*get_all_complex_dtypes(),
766:                  *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC,
801:                  *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater,
841:                  *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater,
1182:    dtypes(*get_all_dtypes())
1276:    dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_bfloat16=False))
1286:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_tensor_creation_ops.py`</summary>

<p>

```python
21:    onlyCUDA, skipCPUIf, dtypesIfCUDA, skipMeta, get_all_device_types)
23:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
150:        for dt in get_all_dtypes():
160:        for dt in get_all_dtypes():
314:        dtypes = [dtype for dtype in get_all_dtypes() if dtype != torch.bfloat16]
1012:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1013:              get_all_complex_dtypes()))
1032:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1033:              get_all_complex_dtypes()))
1050:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1051:              get_all_complex_dtypes()))
1745:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1779:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1868:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1926:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1954:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, torch_device)
1956:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, None)
1957:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, torch_device)
2538:        for device in get_all_device_types():
2645:        for dtype in get_all_dtypes():
2678:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False) +
2679:              get_all_complex_dtypes()))
2716:    dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False))
2827:            for dt in get_all_dtypes():
2913:    dtypes(*get_all_dtypes(include_bool=False, include_half=False))
2914:    dtypesIfCUDA(*get_all_dtypes(include_bool=False, include_half=True))
3028:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
3033:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
3074:    dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_complex=False))
3075:    dtypesIfCUDA(*((get_all_int_dtypes() + [torch.float32, torch.float16, torch.bfloat16])
3077:                    else get_all_dtypes(include_bool=False, include_half=True, include_complex=False)))
3873:    dtypes(*get_all_dtypes())
3884:    dtypes(*get_all_dtypes(include_bool=False))
3916:            for other in get_all_dtypes():
3922:    dtypes(*get_all_dtypes())
3932:    dtypes(*get_all_dtypes(include_bool=False))
3955:    dtypes(*get_all_dtypes(include_bool=False))
3961:    dtypes(*get_all_dtypes(include_bool=False))
3965:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_testing.py`</summary>

<p>

```python
25:from torch.testing._internal.common_dtype import get_all_dtypes
31:    dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
```

</p>
</details>

<details>
<summary>

`test/test_torch.py`</summary>

<p>

```python
51:    expectedAlertNondeterministic, get_all_device_types, skipXLA)
57:    get_all_fp_dtypes, get_all_int_dtypes, get_all_math_dtypes, get_all_dtypes, get_all_complex_dtypes
296:            for d in get_all_device_types():
323:            for device in get_all_device_types():
324:                for dt1 in get_all_dtypes():
325:                    for dt2 in get_all_dtypes():
343:            all_dtypes = get_all_dtypes()
350:            all_dtypes = get_all_dtypes()
781:            for dtype in get_all_dtypes():
986:            for device in get_all_device_types():
1017:            for device in get_all_device_types():
1018:                for dtype in get_all_math_dtypes(device):
2792:            for device in get_all_device_types():
3186:    dtypes(*get_all_dtypes())
3195:        for error_dtype in get_all_dtypes():
3203:    dtypes(*get_all_dtypes())
3212:        for error_dtype in get_all_dtypes():
4539:    dtypes(*get_all_fp_dtypes())
4545:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
4577:    dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False))
4578:    dtypesIfCPU(*(get_all_fp_dtypes(include_half=False, include_bfloat16=True)))
4579:    dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False)))
4599:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False)))
4600:    dtypesIfCPU(*(get_all_dtypes(include_half=False, include_bfloat16=False, include_complex=False)))
4601:    dtypesIfCUDA(*(get_all_dtypes(include_bfloat16=False, include_complex=False)))
4613:        for p_dtype in get_all_fp_dtypes(include_half=device.startswith('cuda'), include_bfloat16=False):
4628:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False)))
4629:    dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False)))
4640:    dtypes(*get_all_fp_dtypes())
4723:    dtypes(*get_all_fp_dtypes())
4735:    dtypes(*get_all_fp_dtypes(include_bfloat16=False))
4736:    dtypesIfCUDA(*get_all_fp_dtypes())
4747:    dtypes(*get_all_fp_dtypes())
4761:    dtypes(*get_all_fp_dtypes())
4771:    dtypes(*get_all_fp_dtypes())
4792:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
5302:    dtypes(*get_all_dtypes(include_bfloat16=False))
5322:    dtypes(*get_all_dtypes(include_half=False, include_bfloat16=False))
5323:    dtypesIfCPU(*get_all_dtypes(include_bfloat16=False))
5324:    dtypesIfCUDA(*get_all_dtypes(include_bfloat16=False))
5591:        for dt in get_all_dtypes():
5611:        for dt in get_all_dtypes():
5678:        for dt in get_all_dtypes():
5696:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')))
5697:    dtypes(*set(get_all_math_dtypes('cpu')))
5746:    dtypes(*get_all_dtypes())
5780:    dtypes(*get_all_dtypes())
5885:    dtypes(*get_all_dtypes())
5902:    dtypes(*get_all_dtypes())
5945:    dtypes(*get_all_dtypes())
5979:    dtypes(*get_all_dtypes(include_bool=False))
6049:    dtypes(*get_all_dtypes(include_bool=False))
6092:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6093:              get_all_complex_dtypes()))
6094:    dtypesIfCPU(*get_all_dtypes())
6095:    dtypesIfCUDA(*get_all_dtypes())
6122:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6123:              get_all_complex_dtypes()))
6124:    dtypesIfCPU(*get_all_dtypes())
6125:    dtypesIfCUDA(*get_all_dtypes())
6163:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6164:              get_all_complex_dtypes()))
6165:    dtypesIfCPU(*get_all_dtypes())
6166:    dtypesIfCUDA(*get_all_dtypes())
6190:    dtypes(*(get_all_complex_dtypes() +
6191:              get_all_int_dtypes()))
6238:    dtypes(*get_all_dtypes())
6323:    dtypes(*get_all_dtypes())
6389:    dtypes(*product(get_all_dtypes(), (torch.uint8, torch.bool)))
6699:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')))
6700:    dtypes(*set(get_all_math_dtypes('cpu')))
7452:    dtypes(*get_all_dtypes(include_bool=False))
7461:    dtypes(*get_all_dtypes(include_bool=False))
7477:    dtypes(*get_all_dtypes(include_bool=False))
7496:    dtypes(*get_all_dtypes(include_bool=False))
7538:    dtypes(*get_all_dtypes(include_bool=False))
8162:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() +
8163:              get_all_complex_dtypes()))
8175:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() +
8176:              get_all_complex_dtypes()))
```

</p>
</details>

<details>
<summary>

`test/test_type_promotion.py`</summary>

<p>

```python
14:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes
187:        for dtype in get_all_dtypes():
262:        dtypes1 = get_all_math_dtypes('cuda')
263:        dtypes2 = get_all_math_dtypes(device)
339:    dtypes(*itertools.product(get_all_dtypes(), get_all_dtypes()))
468:            for dt1 in get_all_math_dtypes(device):
469:                for dt2 in get_all_math_dtypes(device):
519:            for dt1 in get_all_math_dtypes(device):
520:                for dt2 in get_all_math_dtypes(device):
528:        for dt in get_all_math_dtypes(device):
561:        for dtype in get_all_dtypes():
766:                                          dtypes=get_all_math_dtypes(device))
771:                                          dtypes=get_all_math_dtypes(device))
782:                                          dtypes=get_all_math_dtypes(device))
879:        dtypes = get_all_dtypes(include_bfloat16=False)
898:        dtypes = get_all_dtypes(include_bfloat16=False, include_bool=False)
965:    dtypesIfCUDA(*itertools.product(get_all_dtypes(include_bfloat16=False, include_complex=False),
966:                                     get_all_dtypes(include_bfloat16=False, include_complex=False)))
967:    dtypes(*itertools.product(get_all_dtypes(include_half=False, include_bfloat16=False,
969:                               get_all_dtypes(include_half=False, include_bfloat16=False,
976:            return dtype in get_all_int_dtypes() + [torch.bool]
979:            return dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False)
```

</p>
</details>

<details>
<summary>

`test/test_unary_ufuncs.py`</summary>

<p>

```python
24:    floating_types_and, all_types_and_complex_and, floating_and_complex_types_and, get_all_dtypes, get_all_math_dtypes,
25:    get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
517:    dtypes(*(get_all_int_dtypes() + [torch.bool] +
518:              get_all_fp_dtypes(include_bfloat16=False)))
596:    dtypes(*get_all_fp_dtypes(include_half=True, include_bfloat16=False))
611:        invalid_input_dtypes = get_all_int_dtypes() + \
612:            get_all_complex_dtypes() + \
619:        for dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False):
1048:    dtypes(*get_all_math_dtypes('cpu'))
1182:    dtypesIfCUDA(*get_all_fp_dtypes())
1190:    dtypesIfCUDA(*get_all_fp_dtypes())
1205:    dtypesIfCUDA(*get_all_fp_dtypes())
1215:    dtypesIfCUDA(*get_all_fp_dtypes())
1307:    dtypes(*(get_all_dtypes(include_bool=False)))
1349:    dtypes(*(get_all_fp_dtypes(include_half=False) +
1350:              get_all_complex_dtypes()))
1351:    dtypesIfCUDA(*(get_all_fp_dtypes(include_half=True) +
1352:                    get_all_complex_dtypes()))
```

</p>
</details>

<details>
<summary>

`test/test_view_ops.py`</summary>

<p>

```python
19:    get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
124:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
131:    dtypes(*get_all_dtypes(include_bfloat16=False))
213:            for view_dtype in [*get_all_fp_dtypes(), *get_all_complex_dtypes()]:
220:    dtypes(*get_all_dtypes())
224:        for view_dtype in get_all_dtypes():
305:    dtypes(*get_all_complex_dtypes(include_complex32=True))
343:    dtypes(*get_all_dtypes())
354:    dtypes(*get_all_dtypes())
364:    dtypes(*get_all_dtypes())
374:    dtypes(*get_all_dtypes())
384:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
395:    dtypes(*get_all_complex_dtypes())
426:    dtypes(*get_all_complex_dtypes())
451:    dtypes(*product(get_all_complex_dtypes(), get_all_dtypes()))
1263:    dtypes(*(torch.testing.get_all_dtypes()))
1279:    dtypes(*(torch.testing.get_all_dtypes()))
1405:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1406:              get_all_complex_dtypes()))
1471:    dtypes(*get_all_dtypes(include_bfloat16=False))
1574:    dtypes(*get_all_dtypes())
1601:    dtypes(*get_all_dtypes(include_bfloat16=False))
1632:    dtypes(*get_all_dtypes(include_bfloat16=False))
1711:        for dt in get_all_dtypes():
1717:        for dt in get_all_dtypes():
1724:        for dt in get_all_dtypes():
```

</p>
</details>

I'm looking forward to your viewpoints. Thanks :)

cc: mruberry kshitij12345 anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71561

Reviewed By: samdow

Differential Revision: D34856571

Pulled By: mruberry

fbshipit-source-id: 0dca038bcad5cf69906245c496d2e61ac3876335
(cherry picked from commit b058f67b4313143efa714ab105f36e74083131b9)
2022-03-15 20:31:41 +00:00
Natalia Gimelshein
967606124a port torch cov tests to error inputs (#73977)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73977

Reviewed By: malfet

Differential Revision: D34779552

Pulled By: ngimel

fbshipit-source-id: b4191101a029981eb27c75e1b56d739db046f819
(cherry picked from commit 2c2af726ffdba68f358a4ff0ee07580609bccc34)
2022-03-10 19:04:44 +00:00
Natalia Gimelshein
e47a5a64bb Back out "Revert D34524207: [pytorch][PR] remove _s_where" (#73579)
Summary:
Original commit changeset: 87b1220d851c

Original Phabricator Diff: D34524207 (4eb2482568) (4eb2482568)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73579

Test Plan:
OSS tests
tested with canary https://www.internalfb.com/intern/ads/canary/441912928798660873

Reviewed By: ezyang

Differential Revision: D34688237

Pulled By: ngimel

fbshipit-source-id: 32f3a0046053ef52e95ab45a26bfc1de17e7e061
(cherry picked from commit d1c0acbe3e0ff884c429072923a468ee1d3d447d)
2022-03-08 19:15:30 +00:00
anjali411
37e0d2e361 Fix segfault while real and imaginary attributes are set to a number (#73867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73867

Fixes https://github.com/pytorch/pytorch/issues/72947

Test Plan: Imported from OSS

Reviewed By: davidberard98

Differential Revision: D34695956

Pulled By: anjali411

fbshipit-source-id: 2f3eda272a5214335eae506bd387ce8da4d81b8c
(cherry picked from commit fdb07354cac22c30aa047e65fbac9840608db811)
2022-03-08 18:58:26 +00:00
Natalia Gimelshein
55525632ab Revert D34554432: Back out "Revert D34524207: [pytorch][PR] remove _s_where"
Test Plan: revert-hammer

Differential Revision:
D34554432 (9c03c6163f)

Original commit changeset: 2f3601d3d426

Original Phabricator Diff: D34554432 (9c03c6163f)

fbshipit-source-id: db434750f44c6e6ec545a248c462d8fdcbefbaf8
(cherry picked from commit 866d4d0c795edd7ef519925683b5e57dd9b116ad)
2022-03-04 20:32:39 +00:00
Natalia Gimelshein
9c03c6163f Back out "Revert D34524207: [pytorch][PR] remove _s_where" (#73579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73579

Original commit changeset: 87b1220d851c

Original Phabricator Diff: D34524207 (4eb2482568)

Test Plan: OSS tests

Reviewed By: malfet

Differential Revision: D34554432

fbshipit-source-id: 2f3601d3d4261ebcebb05b4b1aec0c9a8a00ea04
(cherry picked from commit b9cad3f2bc54e12b275567454336cf4d9dcb78c4)
2022-03-04 19:35:41 +00:00
Nikita Karetnikov
eb0d370f14 Write explicit meta-kernels for normal (#70089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70089

See #69386.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34089964

Pulled By: bdhirsh

fbshipit-source-id: eb88eb7c4830545d3d43c82b6f3abb98617cee8e
(cherry picked from commit 89c9c02a0fb1c780495fee6370961104f4b1dcd1)
2022-03-01 23:28:14 +00:00
Nikita Shulga
dd9517cc4a Revert D34524207: [pytorch][PR] remove _s_where
Test Plan: revert-hammer

Differential Revision:
D34524207 (4eb2482568)

Original commit changeset: bc71e27b6d3f

Original Phabricator Diff: D34524207 (4eb2482568)

fbshipit-source-id: 87b1220d851c3d2b51bdd1cf2f8a493c58ab9b14
(cherry picked from commit af1f0cc9e032b00619a7979bbbd2281f69e0fdf0)
2022-03-01 17:43:16 +00:00
Natalia Gimelshein
4eb2482568 remove _s_where (#73468)
Summary:
Per title
Fixes https://github.com/pytorch/pytorch/issues/73135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73468

Reviewed By: albanD

Differential Revision: D34524207

Pulled By: ngimel

fbshipit-source-id: bc71e27b6d3fa50de6737533c92375266d9eadc5
(cherry picked from commit 047b925849370e6e4cbe9e3a722db52bb1e965b9)
2022-03-01 07:30:34 +00:00
Philip Meier
0973c5a1cc align signature of make_tensor with other creation ops (#72702)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72702

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34457729

Pulled By: mruberry

fbshipit-source-id: 83d580c4201eef946dc9cf4b9e28a3d36be55609
(cherry picked from commit aa4cf20fbeb4b795595729b8ac2e6ba7707d8283)
2022-02-25 06:30:31 +00:00
Nikita Shulga
cfb6c942fe scatter_reduce documentation (#73125)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/68580 (which was milestoned for 1.11) plus a partial revert of https://github.com/pytorch/pytorch/pull/72543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73125

Reviewed By: bdhirsh

Differential Revision: D34355217

Pulled By: malfet

fbshipit-source-id: 325ecdeaf53183d653b44ee5e6e8839ceefd9200
(cherry picked from commit 71db31748a)
2022-02-22 19:33:46 +00:00
Philip Meier
1f74e082e2 only compare attributes for meta tensors (#72508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72508

Todo:

- [x] document this behavior
- [x] add tests

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D34262452

Pulled By: ezyang

fbshipit-source-id: bc5c9653d5c3ad5c6efccc9c8e0efc0d28e15104
(cherry picked from commit 233142c88e)
2022-02-17 02:33:08 +00:00
Brian Hirsh
f87f753bb9 avoiding adding some functions to the public python API before 1.11 release (#72543)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72543

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34085724

Pulled By: bdhirsh

fbshipit-source-id: 941d5a90a6fa5328268d623e0e2b01577e4132ca
(cherry picked from commit 6676a0c79a)
2022-02-14 19:49:01 +00:00
Kurt Mohler
47c6993355 Update from_dlpack tests and documentation (#70543)
Summary:
Part of https://github.com/pytorch/pytorch/issues/58742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70543

Reviewed By: soulitzer

Differential Revision: D34172475

Pulled By: mruberry

fbshipit-source-id: d498764b8651a8b7a19181b3421aeebf28a5db2b
(cherry picked from commit 05332f164c)
2022-02-14 03:35:17 +00:00
anjali411
f607af126e Set correct device id on efficientzerotensors (#71611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71611

Fixes https://github.com/pytorch/pytorch/issues/71160 https://github.com/pytorch/pytorch/issues/69925 #69913

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33897543

Pulled By: anjali411

fbshipit-source-id: f1d8608c351876b8c2619da5ef891f74bad30ab5
(cherry picked from commit 643e666ea3)
2022-02-02 21:51:32 +00:00
Anjali Chourdia
1e4aefaa2f Revert D33834916: Set correct device id on efficientzerotensors
Test Plan: revert-hammer

Differential Revision:
D33834916 (a18cfb790d)

Original commit changeset: 11cec343e95e

Original Phabricator Diff: D33834916 (a18cfb790d)

fbshipit-source-id: 3d3f60b760b445383768161b1d21ea4dadbe5d7c
(cherry picked from commit eba41aa646)
2022-01-31 03:49:56 +00:00
anjali411
a18cfb790d Set correct device id on efficientzerotensors (#71611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71611

Fixes https://github.com/pytorch/pytorch/issues/71160 https://github.com/pytorch/pytorch/issues/69925

Test Plan: Imported from OSS

Reviewed By: george-qi

Differential Revision: D33834916

Pulled By: anjali411

fbshipit-source-id: 11cec343e95e2ee188ab7576f26f64aa19317891
(cherry picked from commit f6e86f8a6b)
2022-01-30 20:53:15 +00:00
Mikayla Gawarecki
09c417ae65 Add new reduce options and autograd support for scatter_reduce (#71788)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71788

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D33778525

Pulled By: cpuhrsch

fbshipit-source-id: 47b8544e29df3075bc6ede894c59499a7ffec876
(cherry picked from commit ddcddac726)
2022-01-27 17:38:50 +00:00
Mikayla Gawarecki
fdec94504f Rename _scatter_reduce to scatter_reduce and make it unstructured (#71787)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71787

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D33778524

Pulled By: cpuhrsch

fbshipit-source-id: 55a330e1c2227c0eaaa1c0d2f9205a4dee24a11b
(cherry picked from commit 6e4a8a91da)
2022-01-27 16:29:13 +00:00
Mike Ruberry
0891c908bb Revert D33768645: Set correct device id on efficientzerotensors
Test Plan: revert-hammer

Differential Revision:
D33768645 (5dd6cd55ba)

Original commit changeset: 66ce9907630b

Original Phabricator Diff: D33768645 (5dd6cd55ba)

fbshipit-source-id: 4bb1ad46f01cd33aeb813bdc123741cf665194a8
(cherry picked from commit 8ca385b1d8)
2022-01-26 17:01:32 +00:00
anjali411
5dd6cd55ba Set correct device id on efficientzerotensors (#71611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71611

Fixes https://github.com/pytorch/pytorch/issues/71160

Test Plan: Imported from OSS

Reviewed By: pbelevich, ngimel

Differential Revision: D33768645

Pulled By: anjali411

fbshipit-source-id: 66ce9907630b65a12c0775077147a7e72ff4cee4
(cherry picked from commit 3af98a4d70)
2022-01-25 23:32:11 +00:00
Jonathan Colen
33403f4848 edge_order check in torch.gradient only applies to dim argument (#67926)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67919

The compatibility check on `edge_order` in `pre_check_gradient` now looks only at the `dim` argument if it is present; otherwise it checks all dimensions.

Previously, it would check all dimensions regardless of the dim argument and throw unnecessary errors.
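
Illustrative repro of the fixed behavior (sketch; sizes chosen so the unused dim is too small for `edge_order=2`):

```python
import torch

t = torch.arange(10.0).reshape(1, 10)   # dim 0 has size 1

# edge_order=2 needs at least 3 points along the differentiated dim.
# Only dim=1 (size 10) is checked now, so this succeeds; previously the
# size-1 dim 0 tripped the check even though it was not being differentiated.
gx = torch.gradient(t, dim=1, edge_order=2)
```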

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67926

Reviewed By: albanD

Differential Revision: D33760621

Pulled By: mruberry

fbshipit-source-id: d490cd8610c68ff3787e670fc947de3cbf2db062
(cherry picked from commit 45bc56de9e)
2022-01-25 21:29:31 +00:00
Mike Ruberry
e0d829a266 Kill the test_torch.py mixin and creates test_scatter_gather_ops (#71691)
Summary:
Per title.

Also annotates test_torch.py with additional cleanup tasks and adds empty sample inputs to elementwise unary and binary OpInfos.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71691

Reviewed By: ngimel

Differential Revision: D33735126

Pulled By: mruberry

fbshipit-source-id: 8cc097a7581a8b620540c95b2a5889c1165ecf23
(cherry picked from commit 5c6a245a3f)
2022-01-24 09:32:32 +00:00
Mike Ruberry
7680a0ae9d Deprecates _aminmax (#71576)
Summary:
Replaces https://github.com/pytorch/pytorch/pull/62432. Existing callsites are updated.
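
Assuming `torch.aminmax` is the intended public replacement, migrated callsites look roughly like:

```python
import torch

t = torch.tensor([[1., 5.], [-2., 3.]])

mn, mx = torch.aminmax(t)             # over all elements: -2.0 and 5.0
mn_d, mx_d = torch.aminmax(t, dim=1)  # per-row min/max in a single pass
print(mn.item(), mx.item())
```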

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71576

Reviewed By: ngimel

Differential Revision: D33689960

Pulled By: mruberry

fbshipit-source-id: fad1ba78347ecec7fd48f21862c3eb606662b8f4
(cherry picked from commit 6cd438e9a1)
2022-01-21 09:23:29 +00:00
Peter Bell
17bb68618f Copy: Fix CPU transpose path ignoring neg and conj bits (#69026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69026

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33064533

Pulled By: anjali411

fbshipit-source-id: 98c25586a1707ac2324f69f652ce5a14dd59c0ad
2022-01-14 10:13:33 -08:00
Emilio Castillo
8dfff8b2e2 Fix scatter for empty indexes (#70662)
Summary:
This PR fixes an issue with `scatter` where the output is garbage for zero-sized indexes.

```py
import torch

null_index = torch.zeros((0, 4), dtype=torch.int64)
null_arr = torch.zeros((0, 4))
zeros_arr = torch.zeros((1, 4))

result = zeros_arr.scatter(0, null_index, null_arr)

print(null_index)
print(null_arr)
print(zeros_arr)
print(result)
```

```
tensor([], size=(0, 4), dtype=torch.int64)
tensor([], size=(0, 4))
tensor([[0., 0., 0., 0.]])
tensor([[1.7036e+19, 2.9965e+32, 3.9133e-14, 1.3585e-19]])
```

The `out` array is never filled when the `index` arg has 0 elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70662

Reviewed By: dagitses

Differential Revision: D33476807

Pulled By: albanD

fbshipit-source-id: 97dbdd9c0133899e58828c43ecba81838807b8af
2022-01-07 09:20:43 -08:00
Peter Bell
917d56a7e4 Copy: Fix conj bit being ignored on type mismatch (#68963)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68963

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33064492

Pulled By: anjali411

fbshipit-source-id: 043f927d6bfff46bf5f8ea6fce9409f250bf8ff8
2022-01-05 17:59:32 -08:00
Brian Hirsh
457ba1dd3e Porting index_add to structured kernels, add an out variant (#65993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65993

This PR attempts to port `index_add` to structured kernels, but does more than that:

* Adds an `out=` variant to `index_add`
* Revises `native_functions.yaml` registrations to avoid multiple entries and instead pass a default value for `alpha`.
* Changes in `derivatives.yaml` file for autograd functioning
* Revises error messages, please see: https://github.com/pytorch/pytorch/pull/65993#issuecomment-945441615
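
A hedged sketch of the functional form with the new `out=` variant and the defaulted `alpha`:

```python
import torch

x = torch.ones(3, 4)
index = torch.tensor([0, 2])
source = torch.full((2, 4), 10.0)
out = torch.empty_like(x)

# rows 0 and 2 become 1 + 2*10 = 21, row 1 stays 1
torch.index_add(x, 0, index, source, alpha=2, out=out)
print(out)
```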

Follow-up PRs in near future will attempt to refactor the OpInfo test, and will give another look at tests in `test/test_torch.py` for this function. (hence the use of ghstack for this)

~This is WIP because there are tests failing for `Dimname` variant on mobile/android builds, and I'm working on fixing them.~

Issue tracker: https://github.com/pytorch/pytorch/issues/55070

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32646426

fbshipit-source-id: b035ecf843a9a27d4d1e18b202b035adc2a49ab5
2021-12-14 11:57:13 -08:00
kshitij12345
5b2586fe09 [testing] Ignore expected_regex in assertRaisesRegex for non-native device (#68723)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68723

Reviewed By: zou3519

Differential Revision: D32797061

Pulled By: mruberry

fbshipit-source-id: 3bcae6d3d62d180059dbe39be520b0e7f9aea19f
2021-12-02 14:52:27 -08:00
Emilio Castillo
533e72e0a4 Fix DLPack CUDA stream convention (#67618)
Summary:
Apparently, for the array API, the CUDA default stream and the per-thread default stream should be 1 and 2 instead of 0 and 1:

https://data-apis.org/array-api/latest/API_specification/array_object.html?dlpack-self-stream-none#dlpack-self-stream-none.

This caused a problem in the interop with CuPy https://github.com/cupy/cupy/pull/5970#discussion_r739912926.
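
A small sketch of the convention this aligns with (requires a CUDA build; the `stream` values follow the array API numbering):

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(4, device="cuda")
    # Array API convention on CUDA:
    #   stream=1 -> legacy default stream, stream=2 -> per-thread default stream
    capsule = x.__dlpack__(stream=1)
    y = torch.from_dlpack(capsule)
    print(y.device)
```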

cc rgommers leofang mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67618

Reviewed By: albanD

Differential Revision: D32521805

Pulled By: mruberry

fbshipit-source-id: 95777e4014e5edf1f88ba10adc03c6e34c13248d
2021-11-18 08:36:05 -08:00
kshitij12345
d5d2096dab [testing] make @dtypes mandatory when using @dtypesIf (#68186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53647

With this, if a test forgets to add `dtypes` while using `dtypesIf`, the following error is raised:
```
AssertionError: dtypes is mandatory when using dtypesIf however 'test_exponential_no_zero' didn't specify it
```

**Tested Locally**
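
For contrast, a correctly decorated test under the new rule looks roughly like this (sketch; a real test would be instantiated through the device-generic test machinery):

```python
import torch
from torch.testing._internal.common_device_type import dtypes, dtypesIfCUDA

class TestExampleSketch:
    # The base @dtypes is now mandatory whenever a @dtypesIf* override is used.
    @dtypes(torch.float)
    @dtypesIfCUDA(torch.float, torch.half)
    def test_exponential_no_zero(self, device, dtype):
        t = torch.empty(8, device=device, dtype=dtype).exponential_()
        assert (t > 0).all()
```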

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68186

Reviewed By: VitalyFedyunin

Differential Revision: D32468581

Pulled By: mruberry

fbshipit-source-id: 805e0855f988b77a5d8d4cd52b31426c04c2200b
2021-11-18 08:29:31 -08:00
rusty1s
9807787135 scatter_reduce (#68115)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63780

Basic functionality of a `scatter_reduce` algorithm with `reduce="sum"`:

* `scatter_reduce` is named as `scatter_reduce2` due to compiling issues
* It currently re-uses functionality from `scatter_add`
* Tests are missing: WIP

The error when the `scatter_reduce` naming is used:
```
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13949:18: error: redefinition of ‘struct at::_ops::scatter_reduce’
13949 | struct TORCH_API scatter_reduce {
      |                  ^~~~~~~~~~~~~~
aten/src/ATen/Operators.h:13817:18: note: previous definition of ‘struct at::_ops::scatter_reduce’
13817 | struct TORCH_API scatter_reduce {
      |                  ^~~~~~~~~~~~~~
aten/src/ATen/Operators.h:13960:18: error: redefinition of ‘struct at::_ops::scatter_reduce_out’
13960 | struct TORCH_API scatter_reduce_out {
      |                  ^~~~~~~~~~~~~~~~~~
aten/src/ATen/Operators.h:13839:18: note: previous definition of ‘struct at::_ops::scatter_reduce_out’
13839 | struct TORCH_API scatter_reduce_out {
      |                  ^~~~~~~~~~~~~~~~~~
In file included from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/core/TensorBody.h: In member function ‘at::Tensor at::Tensor::scatter_reduce(int64_t, const at::Tensor&, c10::string_view, c10::optional<long int>) const’:
aten/src/ATen/core/TensorBody.h:3976:83: error: cannot convert ‘c10::string_view’ {aka ‘c10::basic_string_view<char>’} to ‘const at::Tensor&’
 3976 |     return at::_ops::scatter_reduce::call(const_cast<Tensor&>(*this), dim, index, reduce, output_size);
      |                                                                                   ^~~~~~
      |                                                                                   |
      |                                                                                   c10::string_view {aka c10::basic_string_view<char>}
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13824:109: note:   initializing argument 4 of ‘static at::Tensor at::_ops::scatter_reduce::call(const at::Tensor&, int64_t, const at::Tensor&, const at::Tensor&, c10::string_view)’
13824 |   static at::Tensor call(const at::Tensor & self, int64_t dim, const at::Tensor & index, const at::Tensor & src, c10::string_view reduce);
      |                                                                                          ~~~~~~~~~~~~~~~~~~~^~~
In file included from ../aten/src/ATen/ATen.h:15,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Functions.h: In function ‘at::Tensor at::scatter_reduce(const at::Tensor&, int64_t, const at::Tensor&, c10::string_view, c10::optional<long int>)’:
aten/src/ATen/Functions.h:7119:61: error: cannot convert ‘c10::string_view’ {aka ‘c10::basic_string_view<char>’} to ‘const at::Tensor&’
 7119 |     return at::_ops::scatter_reduce::call(self, dim, index, reduce, output_size);
      |                                                             ^~~~~~
      |                                                             |
      |                                                             c10::string_view {aka c10::basic_string_view<char>}
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13824:109: note:   initializing argument 4 of ‘static at::Tensor at::_ops::scatter_reduce::call(const at::Tensor&, int64_t, const at::Tensor&, const at::Tensor&, c10::string_view)’
13824 |   static at::Tensor call(const at::Tensor & self, int64_t dim, const at::Tensor & index, const at::Tensor & src, c10::string_view reduce);
      |                                                                                          ~~~~~~~~~~~~~~~~~~~^~~
In file included from ../aten/src/ATen/ATen.h:15,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Functions.h: In function ‘at::Tensor& at::scatter_reduce_out(at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&, c10::string_view, c10::optional<long int>)’:
aten/src/ATen/Functions.h:7124:65: error: cannot convert ‘c10::string_view’ {aka ‘c10::basic_string_view<char>’} to ‘const at::Tensor&’
 7124 |     return at::_ops::scatter_reduce_out::call(self, dim, index, reduce, output_size, out);
      |                                                                 ^~~~~~
      |                                                                 |
      |                                                                 c10::string_view {aka c10::basic_string_view<char>}
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13846:111: note:   initializing argument 4 of ‘static at::Tensor& at::_ops::scatter_reduce_out::call(const at::Tensor&, int64_t, const at::Tensor&, const at::Tensor&, c10::string_view, at::Tensor&)’
13846 |   static at::Tensor & call(const at::Tensor & self, int64_t dim, const at::Tensor & index, const at::Tensor & src, c10::string_view reduce, at::Tensor & out);
      |                                                                                            ~~~~~~~~~~~~~~~~~~~^~~
In file included from ../aten/src/ATen/ATen.h:15,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Functions.h: In function ‘at::Tensor& at::scatter_reduce_outf(const at::Tensor&, int64_t, const at::Tensor&, c10::string_view, c10::optional<long int>, at::Tensor&)’:
aten/src/ATen/Functions.h:7129:65: error: cannot convert ‘c10::string_view’ {aka ‘c10::basic_string_view<char>’} to ‘const at::Tensor&’
 7129 |     return at::_ops::scatter_reduce_out::call(self, dim, index, reduce, output_size, out);
      |                                                                 ^~~~~~
      |                                                                 |
      |                                                                 c10::string_view {aka c10::basic_string_view<char>}
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13846:111: note:   initializing argument 4 of ‘static at::Tensor& at::_ops::scatter_reduce_out::call(const at::Tensor&, int64_t, const at::Tensor&, const at::Tensor&, c10::string_view, at::Tensor&)’
13846 |   static at::Tensor & call(const at::Tensor & self, int64_t dim, const at::Tensor & index, const at::Tensor & src, c10::string_view reduce, at::Tensor & out);
      |                                                                                            ~~~~~~~~~~~~~~~~~~~^~~
In file included from aten/src/ATen/NativeFunctions.h:6,
                 from ../aten/src/ATen/TensorIndexing.h:12,
                 from ../aten/src/ATen/ATen.h:20,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/NativeMetaFunctions.h: At global scope:
aten/src/ATen/NativeMetaFunctions.h:496:18: error: redefinition of ‘struct at::meta::structured_scatter_reduce’
  496 | struct TORCH_API structured_scatter_reduce : public at::impl::MetaBase {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/NativeMetaFunctions.h:481:18: note: previous definition of ‘struct at::meta::structured_scatter_reduce’
  481 | struct TORCH_API structured_scatter_reduce : public at::impl::MetaBase {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68115

Reviewed By: albanD

Differential Revision: D32488450

Pulled By: cpuhrsch

fbshipit-source-id: 65e79c6d0555c0d5715535bb52aade8d5fcd9722
2021-11-17 19:53:12 -08:00
Mikayla Gawarecki
cac3cd1433 add torch.diff support for n greater than 1 (#67260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67260

Addressing 54853
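
A quick sketch of the `n > 1` behavior added here (iterated first differences):

```python
import torch

x = torch.tensor([1., 3., 6., 10.])
print(torch.diff(x))       # tensor([2., 3., 4.])
print(torch.diff(x, n=2))  # tensor([1., 1.])
```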

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D31930294

Pulled By: mikaylagawarecki

fbshipit-source-id: 97c7a27e9200c6688242680ff96b73dfff828479
2021-11-17 09:16:33 -08:00
Nick Anderson
f9ea41f257 Fixes spelling error writeable to writable, improves warning, and documentation (#67664)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46741
pytorchbot

contributors: nickleus27, yanivsagy, and khanhthien123

SmrutiSikha this is mostly your work.  We just did very minor clean up.

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67664

Reviewed By: gchanan

Differential Revision: D32311838

Pulled By: mruberry

fbshipit-source-id: 0e5d4d888caeccb0fd7c80e6ff11b1b1fa8e00d6
2021-11-11 13:05:00 -08:00
Kurt Mohler
db014b8529 Add set_deterministic_debug_mode and get_deterministic_debug_mode (#67778)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67386
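
A minimal sketch of the intended usage; the accepted mode names/values ("default", "warn", "error" and their integer forms) are assumed from the linked issue:

```python
import torch

torch.set_deterministic_debug_mode("warn")    # assumed to also accept 0/1/2
print(torch.get_deterministic_debug_mode())   # integer form of the current mode
torch.set_deterministic_debug_mode("default")
```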

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67778

Reviewed By: ngimel

Differential Revision: D32310661

Pulled By: mruberry

fbshipit-source-id: 300129e96ca51c22fa711182ce6a9f4d4d2ce57f
2021-11-11 12:48:29 -08:00
Thomas Viehmann
33b7790907 Fix conv_transpose3d backward with non-contiguous grad_out (#67829)
Summary:
Many thanks to Forest Yang (meowmix) from the forum for reporting it with a minimal reproduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67829

Reviewed By: malfet

Differential Revision: D32184786

Pulled By: albanD

fbshipit-source-id: b63dbd3148b5def2109deb2f4612c08f55f59dfb
2021-11-05 08:34:21 -07:00
soulitzer
83e8612d11 Clean up test autograd (#67413)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/66066

This PR:
 - cleans up op-specific testing from test_autograd. test_autograd should be reserved for testing generic autograd functionality
 - tests related to an operator are better colocated
 - see the tracker for details

What to think about when moving tests to their correct test suite:
 - naming, make sure it's not too generic
 - how the test is parametrized, sometimes we need to add/remove a device/dtype parameter
 - can this be merged with existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67413

Reviewed By: jbschlosser, albanD

Differential Revision: D32031480

Pulled By: soulitzer

fbshipit-source-id: 8e13da1e58a38d5cecbfdfd4fe2b4fe6f816897f
2021-11-03 15:26:09 -07:00
kshitij12345
885a8e53ba replace onlyOnCPUAndCUDA with onlyNativeDeviceTypes (#65201)
Summary:
Reference https://github.com/pytorch/pytorch/issues/53849

Replace `onlyOnCPUandCUDA` with `onlyNativeDeviceTypes` which includes `cpu, cuda and meta`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65201

Reviewed By: mrshenli

Differential Revision: D31299718

Pulled By: mruberry

fbshipit-source-id: 2d8356450c035d6a314209ab51b2c237583920fd
2021-11-01 09:22:34 -07:00
kshitij12345
c00806beda Add skipXLA and expectedFailureXLA decorator (#66857)
Summary:
Add skipXLA and expectedFailureXLA decorator and relevant test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66857

Reviewed By: ngimel

Differential Revision: D32039856

Pulled By: mruberry

fbshipit-source-id: 3c99d5e06c1c7684d1f798c11c783bd6ebea9899
2021-10-29 19:53:36 -07:00
jjsjann123
1ec732bc46 Add fp16/fp32 autocasting to JIT/TorchScript (#63939)
Summary:
Adds mixed-precision autocasting support between fp32/fp16 to TorchScript/JIT. A more in-depth description can be found at [torch/csrc/jit/JIT-AUTOCAST.md](https://github.com/pytorch/pytorch/pull/63939/files#diff-1f1772aaa508841c5bb58b74ab98f49a1e577612cd9ea5c386c8714a75db830b)

This PR implements an autocast optimization pass that inserts casting ops per the AMP rules (torch/csrc/jit/passes/autocast.cpp), mimicking the behavior of eager autocast. The pass also takes into account the context of `torch.cuda.amp.autocast` and only inserts casting ops within the enabled context manager, giving feature parity with eager AMP autocast.

We currently provide JIT AMP autocast as a prototype feature, so it is off by default and can be turned on via `torch._C._jit_set_autocast_mode(True)`
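
A minimal sketch, assuming a CUDA device and the prototype flag enabled (the exact set of ops covered by the JIT pass is described in JIT-AUTOCAST.md and may differ from eager autocast):

```python
import torch
from torch.cuda.amp import autocast

torch._C._jit_set_autocast_mode(True)  # prototype feature, off by default

@torch.jit.script
def scripted_mm(a, b):
    # casting ops are inserted only inside the enabled autocast region
    with autocast():
        return torch.mm(a, b)

a = torch.rand(8, 8, device="cuda")
b = torch.rand(8, 8, device="cuda")
print(scripted_mm(a, b).dtype)  # expected: torch.float16 under autocast
```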

The JIT support for autocast is subject to different constraints compared to the eager mode implementation (mostly related to the fact that TorchScript is statically typed); the restrictions on the user-facing Python code are described in torch/csrc/jit/JIT-AUTOCAST.md

This is a prototype; there are also implementation limitations that were necessary to keep this PR small and get something functioning quickly upstream, so we can iterate on designs.

A few limitations/challenges that are not properly resolved in this PR:
1. Autocast inserts cast operations, which affect the scalar type of the output tensors feeding downstream operations. We are not currently propagating the updated scalar types, which could give wrong results for operations that rely on type-promotion rules.

2. The backward pass for autodiff in JIT misses the cast of dgrad to the input scalar type that eager autograd performs. This forces us to explicitly mark the casting operation for certain operations (e.g. binary ops); otherwise we might feed a dgrad with a mismatched scalar type to the input. This could potentially break gradient functions consuming dgrad (e.g. gemm backward, which assumes grad_output has the same scalar type as the input).

3. The `torch.autocast` API has an optional `dtype` argument which is not currently supported in JIT autocast; we require a static value.

Credit goes mostly to:
tlemo
kevinstephano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63939

Reviewed By: navahgar

Differential Revision: D31093381

Pulled By: eellison

fbshipit-source-id: da6e26c668c38b01e296f304507048d6c1794314
2021-10-27 12:11:36 -07:00
Nikita Shulga
77beccaedb Do not build PyTorch with caffe2 by default (#66658)
Summary:
CAFFE2 has been deprecated for a while, but still included in every PyTorch build.
We should stop building it by default, although CI should still validate that caffe2 code is buildable.

Build even fewer dependencies when compiling mobile builds without Caffe2
Introduce `TEST_CAFFE2` in torch.common.utils
Skip `TestQuantizedEmbeddingOps` and `TestJit.test_old_models_bc` if code is compiled without Caffe2
Should be landed after https://github.com/pytorch/builder/pull/864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66658

Reviewed By: driazati, seemethere, janeyx99

Differential Revision: D31669156

Pulled By: malfet

fbshipit-source-id: 1cc45e2d402daf913a4685eb9f841cc3863e458d
2021-10-21 20:32:47 -07:00
Kurt Mohler
94f4e9a995 Enable warning tests for nondeterministic backward functions (#66736)
Summary:
Followup from https://github.com/pytorch/pytorch/issues/66233

Since https://github.com/pytorch/pytorch/issues/50209 was fixed, we can enable these warning tests now

cc mruberry kurtamohler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66736

Reviewed By: zou3519

Differential Revision: D31723385

Pulled By: mruberry

fbshipit-source-id: dc1922a6d0c45cc80020db85710e755a89113861
2021-10-21 12:51:53 -07:00
Jane Xu
8a65047acc [skip ci] Set test owners for everything considered with module: tests (#66865)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66865

Reviewed By: anjali411

Differential Revision: D31771147

Pulled By: janeyx99

fbshipit-source-id: 8bebe5ac2098364ef1ee93b590abb5f4455b0f89
2021-10-20 09:37:03 -07:00
lezcano
0974215c4d Prefer mT and mH over transpose(-2, -1) and transpose(-2, -1).conj() (#64181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64181

This PR replaces all the calls to:
- `transpose(-2, -1)` or `transpose(-1, -2)` by `mT()` in C++ and `mT` in Python
- `conj().transpose(-2, -1)` or `transpose(-2, -1).conj()` or `conj().transpose(-1, -2)` or `transpose(-1, -2).conj()` by `mH()` in C++ and `mH` in Python.

It also simplifies two pieces of code, and fixes one bug where a pair
of parentheses were missing in the function `make_symmetric_matrices`.
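
A small sketch of the equivalences this replacement relies on:

```python
import torch

a = torch.randn(2, 3, 4)
print(torch.equal(a.mT, a.transpose(-2, -1)))            # True

c = torch.randn(2, 3, 4, dtype=torch.complex64)
print(torch.allclose(c.mH, c.transpose(-2, -1).conj()))  # True
```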

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31692896

Pulled By: anjali411

fbshipit-source-id: e9112c42343663d442dc5bd53ff2b492094b434a
2021-10-18 13:02:25 -07:00
Kurt Mohler
a25648953c Add warn_only kwarg to use_deterministic_algorithms (#66233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64883

Adds a `warn_only` kwarg to `use_deterministic_algorithms`. When enabled, calling an operation that does not have a deterministic implementation will emit a warning, rather than raise an error.
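
A minimal sketch of the new kwarg:

```python
import torch

# With warn_only=True, ops that lack a deterministic implementation emit a
# warning instead of raising an error.
torch.use_deterministic_algorithms(True, warn_only=True)
print(torch.are_deterministic_algorithms_enabled())  # True
```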

`torch.testing._internal.common_device_type.expectedAlertNondeterministic` is also refactored and documented in this PR to make it easier to use and understand.

cc mruberry kurtamohler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66233

Reviewed By: bdhirsh

Differential Revision: D31616481

Pulled By: mruberry

fbshipit-source-id: 059634a82d54407492b1d8df08f059c758d0a420
2021-10-15 13:54:59 -07:00
anjali411
a82fcd3560 Disable .numpy() and .tolist() for tensor subclasses and fix .tolist() for conjugated and negated tensors (#66082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66082

Fixes https://github.com/pytorch/pytorch/issues/66024 #65779

cc ezyang anjali411 dylanbespalko mruberry Lezcano nikitaved albanD

Test Plan: Imported from OSS

Reviewed By: Gamrix, albanD

Differential Revision: D31615588

Pulled By: anjali411

fbshipit-source-id: c3e65ef0fe301630eb76732ccd7819683c09aa19
2021-10-13 13:57:51 -07:00
lezcano
82a216c45b Add tensor.{adjoint(),H,mT,mH} methods and properties (#64179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64179

This PR follows the discussion in https://github.com/pytorch/pytorch/issues/45063#issuecomment-904431478

Fixes https://github.com/pytorch/pytorch/issues/45063

cc ezyang anjali411 dylanbespalko mruberry Lezcano nikitaved rgommers pmeier asmeurer leofang AnirudhDagar asi1024 emcastillo kmaehashi heitorschueroff

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30730483

Pulled By: anjali411

fbshipit-source-id: 821d25083f5f682450f6812bf852dc96a1cdf9f2
2021-10-13 07:44:43 -07:00
Kurt Mohler
5883523c1d Remove dtype from torch.Storage and use only torch.ByteStorage (#62030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62030

Remove dtype tracking from Python Storage interface, remove all the different `<type>Storage` classes except for `ByteStorage`, and update serialization accordingly, while maintaining as much FC/BC as possible

Fixes https://github.com/pytorch/pytorch/issues/47442

* **THE SERIALIZATION FORMAT IS FULLY FC/BC.** We worked very hard to make sure this is the case. We will probably want to break FC at some point to make the serialization structure of tensors make more sense, but not today.
* There is now only a single torch.ByteStorage class. Methods like `Tensor.set_` no longer check that the dtype of storage is appropriate.
* As we no longer know what dtype of a storage is, we've **removed** the size method from Storage, replacing it with nbytes. This is to help catch otherwise silent errors where you confuse number of elements with number of bytes.
* `Storage._new_shared` takes a `nbytes` kwarg and will reject previous positional only calls.  `Storage._new_with_file` and `_set_from_file` require explicit element size arguments.
* It's no longer possible to convert storages to different types using the float/double/etc methods. Instead, do the conversion using a tensor.
* It's no longer possible to allocate a typed storage directly using FloatStorage/DoubleStorage/etc constructors. Instead, construct a tensor and extract its storage. The classes still exist but they are used purely for unpickling.
* The preexisting serialization format stores dtype with storage, and in fact this dtype is used to determine the dtype of the tensor overall.
 To accommodate this case, we introduce a new TypedStorage concept that exists only during unpickling time which is used to temporarily store the dtype so we can construct a tensor. **If you overrode the handling of pickling/unpickling, you MUST add handling for TypedStorage** or your serialization code will degrade to standard file-based serialization.

Original pull request: https://github.com/pytorch/pytorch/pull/59671

Reviewed By: soulitzer, ngimel

Differential Revision: D29466819

Pulled By: ezyang

fbshipit-source-id: 4a14e5d3c2b08e06e558683d97f7378a3180b00e
2021-10-05 13:50:34 -07:00
Philip Meier
aebde1bc2b deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31141433

Pulled By: mruberry

fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732
2021-09-29 02:40:52 -07:00
Alban Desmaison
7c62b6e973 add deepcopy support to subclasses (#65584)
Summary:
Happy to get any feedback on how to make this code cleaner!

This:
- Fix Tensor attribute deepcopy BC-breaking?
- Add a test for Tensor attribute deepcopy
- Fix subclass deepcopy
- Moves the subclass serialization tests into their own class not to interfere with other serialization test logic
- Add a test for subclass deepcopy

cc ezyang gchanan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65584

Reviewed By: gchanan

Differential Revision: D31206590

Pulled By: albanD

fbshipit-source-id: 74a8f0767f4933b9c941fbea880a8fd1b893ea2f
2021-09-27 14:36:22 -07:00
Kshiteej K
ff6b475d4a [fix] don't expose unique_dim in torch (#63080)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62793

This is mostly a quick fix. I think the more correct fix could be updating `unique_dim` to `_unique_dim` which could be BC-breaking for C++ users (maybe). Maybe something else I am missing.

~~Not sure how to add a test for it.~~ Have tested it locally.

We can add a test like the following. Tested this locally; it fails currently but passes with the fix.
```python
def test_wildcard_import(self):
    exec('from torch import *')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63080

Reviewed By: gchanan

Differential Revision: D30738711

Pulled By: zou3519

fbshipit-source-id: b86d0190e45ba0b49fd2cffdcfd2e3a75cc2a35e
2021-09-14 18:19:17 -07:00
Victor Quach
8131bc85d0 Raise TypeError on assigned grad with wrong type (#64876)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64813

Raises a TypeError when the value assigned to a Tensor's `.grad` is not a Tensor or None.
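
A minimal sketch of the behavior this enables (the try/except is only there to show the error type):

```python
import torch

x = torch.rand(3, requires_grad=True)
x.grad = torch.zeros(3)  # OK: a Tensor (or None) is accepted

try:
    x.grad = "not a tensor"
except TypeError as e:
    print("rejected:", e)
```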

Adds tests.

cc ezyang gchanan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64876

Reviewed By: anjali411

Differential Revision: D30901678

Pulled By: soulitzer

fbshipit-source-id: dbb3cb5fd0bbac6918e0b2e2f51d340daa43dee0
2021-09-13 16:41:45 -07:00
Emilio Castillo
1cb3507ed3 Adds DLPack support (#57110)
Summary:
Partially Fixes https://github.com/pytorch/pytorch/issues/55090
Depends on https://github.com/pytorch/pytorch/issues/55365

Inspired by https://github.com/dmlc/dlpack/issues/57#issuecomment-774482973
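
A minimal sketch of the interchange this targets, using the existing `torch.utils.dlpack` helpers (whether `from_dlpack` also accepts the tensor directly via the new `__dlpack__` protocol is part of this PR's design and not shown here):

```python
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

x = torch.arange(4.0)
y = from_dlpack(to_dlpack(x))  # round trip through a DLPack capsule
y[0] = 42.0
print(x[0])                    # tensor(42.), since the memory is shared
```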

Questions: in PyTorch we can't create streams or easily synchronize them from just an integer. Should we add an [`ExternalStream`](https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.ExternalStream.html) object like the one we have in CuPy?

TODO: Add tests

Would like some feedback as this design needs quite a few iterations
rgommers leofang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57110

Reviewed By: saketh-are

Differential Revision: D30761481

Pulled By: mruberry

fbshipit-source-id: e85d78df3c1f8defc2a698878da89cd843cb1209
2021-09-12 19:47:15 -07:00
Alban Desmaison
d8ae3cc318 Add more error checking in subclass creation (#64746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64746

This extracts the error checking that used to be in the PR above.
We are not going to land the proposed fix there, but I think we want this error checking in right now as these would lead to respectively a memory leak and arbitrary memory read/write.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30867569

Pulled By: albanD

fbshipit-source-id: bf468033fb8b49fcb26eed423f5fad82b4a46c56
2021-09-10 16:49:10 -07:00
Philip Meier
26b7ff5aea deprecate dtype getters from torch.testing namespace (#63554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63554

Following https://github.com/pytorch/pytorch/pull/61840#issuecomment-884087809, this deprecates all the dtype getters publicly exposed in the `torch.testing` namespace. The reason for this is twofold:

1. If someone is not familiar with the C++ dispatch macros PyTorch uses, the names are misleading. For example `torch.testing.floating_types()` will only give you `float32` and `float64` skipping `float16` and `bfloat16`.
2. The dtype getters provide very minimal functionality that can be easily emulated by downstream libraries.

We thought about [providing a replacement](https://gist.github.com/pmeier/3dfd2e105842ad0de4505068a1a0270a), but ultimately decided against it. The major problem is BC: by keeping it, either the namespace gets messy again after a new dtype is added or we need to somehow version the return values of the getters.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D30662206

Pulled By: mruberry

fbshipit-source-id: a2bdb10ab02ae665df1b5b76e8afa9af043bbf56
2021-09-07 08:58:51 -07:00
Ivan Yashchuk
a91a278d60 Fix copy_transpose_valid condition for copy_same_type_transpose_ (#64425)
Summary:
Thanks to ngimel for the hint where the problem might be (https://github.com/pytorch/pytorch/issues/64358#issuecomment-910868849)!

I added a test that fails on master to verify the fix. The shape `(60, 60)` was chosen because of `MIN_SZ = 60 * 60` in `copy_transpose_valid`.

Fixes https://github.com/pytorch/pytorch/issues/64358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64425

Reviewed By: mruberry

Differential Revision: D30752725

Pulled By: ngimel

fbshipit-source-id: f40370ea8365c94e30f8e8a3dcab5f3b3462464a
2021-09-03 18:50:33 -07:00
Kushashwa Ravi Shrimali
76e187aa08 Port gather to structured kernel (#63312)
Summary:
Will add a description once this is ready for review.

cc: ysiraichi ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63312

Reviewed By: iramazanli

Differential Revision: D30597447

Pulled By: ezyang

fbshipit-source-id: d36e59835c2f4b38e286032dd2a1111a7e16b7e5
2021-09-02 01:36:21 -07:00
anjali411
5d80a48cef Add fast path for addmm when the inputs are conjugate (#59380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59380

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28898374

Pulled By: anjali411

fbshipit-source-id: eab0e64d37bb57c18b54cabb8e5c00666338ba04
2021-09-01 16:34:02 -07:00
Philip Meier
401bbb2aa0 remove componentwise comparison of complex values in TestCase.assertEqual (#63572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63572

Addresses #61906. The issue will be fixed later in the stack when `torch.testing.assert_close` gets the same treatment.

cc ezyang gchanan

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30633527

Pulled By: mruberry

fbshipit-source-id: c2002a4998a7a75cb2ab83f87190bde43a9d4f7c
2021-08-30 12:36:45 -07:00
Kushashwa Ravi Shrimali
d37636901e [Doc] make_tensor to torch.testing module (#63925)
Summary:
This PR aims to add `make_tensor` to the `torch.testing` module in PyTorch docs.
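
A small usage sketch (keyword arguments assumed; see the rendered `torch.testing.make_tensor` docs for the exact signature):

```python
import torch
from torch.testing import make_tensor

t = make_tensor((2, 3), device="cpu", dtype=torch.float32, low=0.0, high=1.0)
print(t.shape, t.dtype, bool(t.min() >= 0), bool(t.max() <= 1))
```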

TODOs:

* [x] Add examples

cc: pmeier mruberry brianjo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63925

Reviewed By: ngimel

Differential Revision: D30633487

Pulled By: mruberry

fbshipit-source-id: 8e5a1f880c6ece5925b4039fee8122bd739538af
2021-08-30 12:25:40 -07:00
mingfeima
b0782f0f32 add BFloat16 support for bernoulli and Dropout on CPU (#56372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56372

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D28836792

Pulled By: VitalyFedyunin

fbshipit-source-id: ede951d172a59276e11383fd767778ab959b5a6b
2021-08-25 12:01:27 -07:00
Aaron Bockover
c78ab28441 Add support for the ONNX Runtime Eager Mode backend (#58248)
Summary:
This PR implements the necessary hooks/stubs/enums/etc for complete ONNX Runtime (ORT) Eager Mode integration. The actual extension will live out of tree at https://github.com/pytorch/ort.

We have been [working on this at Microsoft](https://github.com/microsoft/onnxruntime-pytorch/tree/eager-ort/torch_onnxruntime) for the last few months, and are finally ready to contribute the PyTorch core changes upstream (nothing major or exciting, just the usual boilerplate for adding new backends).

The ORT backend will allow us to ferry [almost] all torch ops into granular ONNX kernels that ORT will eagerly execute against any devices it supports (therefore, we only need a single ORT backend from a PyTorch perspective).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58248

Reviewed By: astaff

Differential Revision: D30344992

Pulled By: albanD

fbshipit-source-id: 69082b32121246340d686e16653626114b7714b2
2021-08-20 11:17:13 -07:00
Philip Meier
99203580a9 Updates internal assert_allclose callsites in favor of assert_close (#61841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61841

Redo of #60863.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30408145

Pulled By: mruberry

fbshipit-source-id: 0b34ebc7f23ba38ecd89640b61d8aca59b7eab58
2021-08-19 12:50:41 -07:00
Thomas J. Fan
07b00fc324 ENH Migrate nll_loss2d from THC to ATen (#62826)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24608
Fixes https://github.com/pytorch/pytorch/issues/24607

With the following benchmark, the backward pass runs a little slower. This is strange since the implementation should be exactly the same.

<details>
 <summary>Benchmark script</summary>

```python
from itertools import product

import torch
import torch.nn as nn
import torch.nn.functional as F
import time

torch.manual_seed(0)
MS_PER_SECOND = 1000

def _time():
    torch.cuda.synchronize()
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 3
n_runs = 30
reductions = ["none", "sum", "mean"]
Ns = [128, 256, 512]
Hs = [128, 256, 512]

for reduction, N, H in product(reductions, Ns, Hs):
    total_fwd_time = 0
    total_back_time = 0
    if reduction == "none":
        grad_out = torch.randn(N, H, H, device=device)
    else:
        grad_out = torch.randn(1)[0]

    for _ in range(n_runs):
        input = torch.randn(N, C, H, H, device=device, requires_grad=True)
        target = torch.rand(N, H, H, device=device).mul(3).floor().long()

        # forward
        start = _time()
        result = F.nll_loss(input, target, reduction=reduction)
        total_fwd_time += _time() - start

    result = F.nll_loss(input, target, reduction=reduction)
    for _ in range(n_runs):
        # backward
        start = _time()
        result.backward(grad_out, retain_graph=True)
        total_back_time += _time() - start

    fwd_avg = total_fwd_time / n_runs
    bwd_avg = total_back_time / n_runs
    print(
        f"input size({N}, {C}, {H}, {H}), reduction: {reduction}, fwd: {fwd_avg:.2f} (ms), back: {bwd_avg:.2f} (ms)"
    )

```

</details>

<details>
 <summary>master results</summary>

```
input size(128, 3, 128, 128), reduction: none, fwd: 0.34 (ms), back: 0.57 (ms)
input size(128, 3, 256, 256), reduction: none, fwd: 2.56 (ms), back: 3.85 (ms)
input size(128, 3, 512, 512), reduction: none, fwd: 14.54 (ms), back: 16.62 (ms)
input size(256, 3, 128, 128), reduction: none, fwd: 1.26 (ms), back: 1.78 (ms)
input size(256, 3, 256, 256), reduction: none, fwd: 7.07 (ms), back: 8.22 (ms)
input size(256, 3, 512, 512), reduction: none, fwd: 29.38 (ms), back: 33.29 (ms)
input size(512, 3, 128, 128), reduction: none, fwd: 3.41 (ms), back: 4.05 (ms)
input size(512, 3, 256, 256), reduction: none, fwd: 14.32 (ms), back: 16.46 (ms)
input size(512, 3, 512, 512), reduction: none, fwd: 59.20 (ms), back: 66.68 (ms)
input size(128, 3, 128, 128), reduction: sum, fwd: 0.08 (ms), back: 0.21 (ms)
input size(128, 3, 256, 256), reduction: sum, fwd: 0.21 (ms), back: 0.73 (ms)
input size(128, 3, 512, 512), reduction: sum, fwd: 0.82 (ms), back: 2.86 (ms)
input size(256, 3, 128, 128), reduction: sum, fwd: 0.12 (ms), back: 0.39 (ms)
input size(256, 3, 256, 256), reduction: sum, fwd: 0.42 (ms), back: 1.45 (ms)
input size(256, 3, 512, 512), reduction: sum, fwd: 1.53 (ms), back: 5.66 (ms)
input size(512, 3, 128, 128), reduction: sum, fwd: 0.21 (ms), back: 0.74 (ms)
input size(512, 3, 256, 256), reduction: sum, fwd: 0.78 (ms), back: 2.86 (ms)
input size(512, 3, 512, 512), reduction: sum, fwd: 2.98 (ms), back: 11.23 (ms)
input size(128, 3, 128, 128), reduction: mean, fwd: 0.07 (ms), back: 0.21 (ms)
input size(128, 3, 256, 256), reduction: mean, fwd: 0.21 (ms), back: 0.73 (ms)
input size(128, 3, 512, 512), reduction: mean, fwd: 0.82 (ms), back: 2.86 (ms)
input size(256, 3, 128, 128), reduction: mean, fwd: 0.13 (ms), back: 0.39 (ms)
input size(256, 3, 256, 256), reduction: mean, fwd: 0.42 (ms), back: 1.45 (ms)
input size(256, 3, 512, 512), reduction: mean, fwd: 1.54 (ms), back: 5.65 (ms)
input size(512, 3, 128, 128), reduction: mean, fwd: 0.22 (ms), back: 0.74 (ms)
input size(512, 3, 256, 256), reduction: mean, fwd: 0.78 (ms), back: 2.87 (ms)
input size(512, 3, 512, 512), reduction: mean, fwd: 2.98 (ms), back: 11.23 (ms)
```

</details>

<details>
 <summary>PR results</summary>

```
input size(128, 3, 128, 128), reduction: none, fwd: 0.33 (ms), back: 0.59 (ms)
input size(128, 3, 256, 256), reduction: none, fwd: 2.51 (ms), back: 3.92 (ms)
input size(128, 3, 512, 512), reduction: none, fwd: 14.52 (ms), back: 17.05 (ms)
input size(256, 3, 128, 128), reduction: none, fwd: 1.23 (ms), back: 1.85 (ms)
input size(256, 3, 256, 256), reduction: none, fwd: 7.07 (ms), back: 8.45 (ms)
input size(256, 3, 512, 512), reduction: none, fwd: 29.39 (ms), back: 34.21 (ms)
input size(512, 3, 128, 128), reduction: none, fwd: 3.40 (ms), back: 4.18 (ms)
input size(512, 3, 256, 256), reduction: none, fwd: 14.33 (ms), back: 16.90 (ms)
input size(512, 3, 512, 512), reduction: none, fwd: 59.04 (ms), back: 68.36 (ms)
input size(128, 3, 128, 128), reduction: sum, fwd: 0.07 (ms), back: 0.25 (ms)
input size(128, 3, 256, 256), reduction: sum, fwd: 0.21 (ms), back: 0.86 (ms)
input size(128, 3, 512, 512), reduction: sum, fwd: 0.82 (ms), back: 3.33 (ms)
input size(256, 3, 128, 128), reduction: sum, fwd: 0.12 (ms), back: 0.46 (ms)
input size(256, 3, 256, 256), reduction: sum, fwd: 0.42 (ms), back: 1.70 (ms)
input size(256, 3, 512, 512), reduction: sum, fwd: 1.53 (ms), back: 6.58 (ms)
input size(512, 3, 128, 128), reduction: sum, fwd: 0.21 (ms), back: 0.87 (ms)
input size(512, 3, 256, 256), reduction: sum, fwd: 0.78 (ms), back: 3.34 (ms)
input size(512, 3, 512, 512), reduction: sum, fwd: 2.98 (ms), back: 13.07 (ms)
input size(128, 3, 128, 128), reduction: mean, fwd: 0.07 (ms), back: 0.26 (ms)
input size(128, 3, 256, 256), reduction: mean, fwd: 0.21 (ms), back: 0.86 (ms)
input size(128, 3, 512, 512), reduction: mean, fwd: 0.82 (ms), back: 3.34 (ms)
input size(256, 3, 128, 128), reduction: mean, fwd: 0.12 (ms), back: 0.46 (ms)
input size(256, 3, 256, 256), reduction: mean, fwd: 0.42 (ms), back: 1.72 (ms)
input size(256, 3, 512, 512), reduction: mean, fwd: 1.53 (ms), back: 6.60 (ms)
input size(512, 3, 128, 128), reduction: mean, fwd: 0.21 (ms), back: 0.87 (ms)
input size(512, 3, 256, 256), reduction: mean, fwd: 0.78 (ms), back: 3.33 (ms)
input size(512, 3, 512, 512), reduction: mean, fwd: 2.98 (ms), back: 13.07 (ms)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62826

Reviewed By: bdhirsh

Differential Revision: D30282279

Pulled By: ngimel

fbshipit-source-id: 4aa0ff3f8af0632957417931d332ec486a12b52d
2021-08-12 18:07:15 -07:00
Shen Li
1022443168 Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: revert-hammer

Differential Revision:
D30279364 (b004307252)

Original commit changeset: c1ed77dfe43a

fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e
2021-08-12 11:45:01 -07:00
Zsolt Dollenstein
b004307252 [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: manual inspection & sandcastle

Reviewed By: zertosh

Differential Revision: D30279364

fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a
2021-08-12 10:58:35 -07:00
Matti Picus
658540f43f remove deprecated is_deterministic and set_deterministic (#62158)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62158

Reviewed By: mruberry

Differential Revision: D29909634

Pulled By: ezyang

fbshipit-source-id: ccffbcf8f378e39bd2c7fbeace7ed1cbbe003981
2021-08-04 16:45:23 -07:00
Natalia Gimelshein
d783617216 enable warnings on cuda synchronization (#62092)
Summary:
This creates a `torch.cuda.set_warn_on_synchronization()` function that warns or errors when a synchronizing operation is performed. We could wrap it in a context manager for ease of use, but it would be a lie, because it sets global, and not thread-local, state. Since it's intended for debugging, maybe that's ok though.
As with all `torch.cuda.*` functions, it goes through CPython, not pybind, so the argument is converted to long before being passed to the c10 function. I'll make the Python argument a Python enum class, but without pybind it'll still have to go through the long conversion.

For a test script
```
import torch
torch.cuda.set_warn_on_synchronization(1)
x=torch.randn(10, device="cuda")
x.nonzero()
y=torch.randn((), device="cuda")

if y:
    print("something")
torch.multinomial(x.abs(), 10, replacement=False)
torch.randperm(20000, device="cuda")
ind = torch.randint(10, (3,), device="cuda")
mask = torch.randint(2, (10,), device="cuda", dtype=torch.bool)
val = torch.randn((), device="cuda")
x[mask]=1.
x[mask] = val
torch.cuda.synchronize()
```
the output is
```
/../playground/sync_warn_test.py:4: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  x.nonzero()
/../playground/sync_warn_test.py:7: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  if y:
something
/../playground/sync_warn_test.py:9: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  torch.multinomial(x.abs(), 10, replacement=False)
/../playground/sync_warn_test.py:15: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  x[mask] = val
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62092

Reviewed By: mruberry

Differential Revision: D29968792

Pulled By: ngimel

fbshipit-source-id: cc6f817212c164727ed99ecf6ab050dc29631b9e
2021-07-30 09:13:01 -07:00
Jagadish Krishnamoorthy
64d61901eb [ROCm] Skip test_masked_scatter_large_tensor_cuda (#61313)
Summary:
Refer to https://github.com/pytorch/pytorch/issues/60190. Skipping the unit test until the hipcub issue is fixed.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61313

Reviewed By: iramazanli

Differential Revision: D29626664

Pulled By: malfet

fbshipit-source-id: db2a390d2a3e28ec05a5032a50aa9a35c86b96ca
2021-07-09 10:27:08 -07:00
kshitij12345
5e9bcf9101 fix: support removing hook in the hook (#61250)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/58354

Problem:
Once a hook is called
05c1e5b655/torch/csrc/autograd/python_hook.cpp (L51-L54)

If the hook calls `handle.remove()` while executing and there are no other references to the hook function object, then Python is free to garbage-collect it.

At the subsequent call to
05c1e5b655/torch/csrc/autograd/python_hook.cpp (L54)

we have `hook` pointing to invalid memory

Thus when we try to fetch the name for `hook` from `check_single_result` with
05c1e5b655/torch/csrc/autograd/python_hook.cpp (L175-L177)
we get a segfault.

Solution:
Temporarily extend the lifetime of the hook with `Py_INCREF` until we have verified the result.
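
A minimal repro sketch of the pattern that used to segfault and is supported after this fix (the hook removes its own handle while it is running):

```python
import torch

x = torch.rand(3, requires_grad=True)
y = x * 2

def hook(grad):
    handle.remove()  # remove the hook from inside the hook itself
    return grad

handle = y.register_hook(hook)
y.sum().backward()
print(x.grad)
```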

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61250

Reviewed By: iramazanli

Differential Revision: D29623826

Pulled By: soulitzer

fbshipit-source-id: c71322311f19066cafb7203980668868c59d4e5e
2021-07-09 09:27:58 -07:00
Heitor Schueroff
f32f85e6da Implemented torch.corrcoef (#60420)
Summary:
Implements `torch.corrcoef` similar to [`np.corrcoef`](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html) using `torch.cov` implemented in https://github.com/pytorch/pytorch/pull/58311.
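
A quick usage sketch (rows are treated as variables, matching the NumPy default):

```python
import torch

x = torch.tensor([[0., 1., 2., 3.],
                  [3., 2., 1., 0.]])
print(torch.corrcoef(x))
# expected: ones on the diagonal, -1. off-diagonal (perfect anti-correlation)
```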

closes https://github.com/pytorch/pytorch/issues/1254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60420

Reviewed By: mruberry

Differential Revision: D29474687

Pulled By: heitorschueroff

fbshipit-source-id: f3c7c5610363aebd88274a51fc77e3cf879cb611
2021-06-30 12:36:02 -07:00
Victor Bittorf
91c076eadc Add TorchVitals for DataLoader (#60959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60959

Add TorchVitals for Dataloader, this indicates that the data loader was enabled.

This is a no-op if TORCH_VITALS environment variable is not set.

Test Plan: buck test mode/dbg caffe2/test:torch -- --regex vitals

Reviewed By: VitalyFedyunin

Differential Revision: D29445146

fbshipit-source-id: d5778fff3dafb3c0463fec7a498bff4905597518
2021-06-29 14:08:32 -07:00
Heitor Schueroff
ec9c03c234 Implemented torch.cov (#58311)
Summary:
Based on https://github.com/pytorch/pytorch/pull/50466

Adds the initial implementation of `torch.cov` similar to `numpy.cov`. For simplicity, we removed support for many parameters in `numpy.cov` that are either redundant such as `bias`, or have simple workarounds such as `y` and `rowvar`.
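
A quick usage sketch (rows are variables; `correction=1`, the sample covariance, is assumed to be the default):

```python
import torch

x = torch.tensor([[0., 1., 2.],
                  [2., 1., 0.]])
print(torch.cov(x))                # sample covariance (correction=1)
print(torch.cov(x, correction=0))  # population covariance
```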

cc PandaBoi

closes https://github.com/pytorch/pytorch/issues/19037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58311

Reviewed By: jbschlosser

Differential Revision: D29431651

Pulled By: heitorschueroff

fbshipit-source-id: 167dea880f534934b145ba94291a9d634c25b01b
2021-06-29 14:02:39 -07:00
kshitij12345
956faea585 [fix] cauchy sampling inf on cuda (#60186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59144

As pointed out by ngimel, the issue is indeed with calling `tan`.

However the C++ `std::tan` [documentation](https://en.cppreference.com/w/cpp/numeric/math/tan) states that

```
The function has mathematical poles at π(1/2 + n); however no common floating-point representation
is able to represent π/2 exactly, thus there is no value of the argument for which a pole error occurs.
```

All `torch.tan`,`numpy.tan` and `math.tan` are compliant with the above statement.

<details>

```python
import torch
import math
import numpy as np

# Single Precision
print(torch.tan(torch.tensor(math.pi, device='cuda', dtype=torch.float32) * 0.5))
print(np.tan(np.array(np.pi, dtype=np.float32) * 0.5))

# Double Precision
print(math.tan(math.pi * 0.5))
print(torch.tan(torch.tensor(math.pi, device='cuda', dtype=torch.double) * 0.5))
print(np.tan(np.array(np.pi, dtype=np.float64) * 0.5))
```

Output
```
tensor(-22877334., device='cuda:0')
-22877332.42885646
1.633123935319537e+16
tensor(1.6331e+16, device='cuda:0', dtype=torch.float64)
1.633123935319537e+16
```

</details>

So this issue stems from the use of `__tanf`, the faster approximation of tan from the CUDA library (for float16, bfloat16 and float).

8a839c5478/aten/src/ATen/NumericUtils.h (L91-L100)

The fix in the PR is to use the **slower** but more correct version.

Benchmark:
```
[ cauchy : input dtype torch.float16 device cuda ]
                             |  Before  |  After
1 threads: -------------------------------------
      (128,)                 |    3.8   |    4.3
      (256, 128)             |    3.8   |    4.2
      (2, 512, 256)          |    3.8   |    4.2
      (2, 64, 256, 128)      |   22.8   |   29.6
      (4, 2, 512, 256, 128)  |  649.6   |  869.3

Times are in microseconds (us).

[ cauchy : input dtype torch.bfloat16 device cuda ]
                             |  Before  |  After
1 threads: -------------------------------------
      (128,)                 |    3.8   |    4.3
      (256, 128)             |    3.8   |    4.3
      (2, 512, 256)          |    3.8   |    4.3
      (2, 64, 256, 128)      |   23.8   |   30.8
      (4, 2, 512, 256, 128)  |  682.5   |  904.2

Times are in microseconds (us).

[ cauchy : input dtype torch.float32 device cuda ]
                             |  Before  |  After
1 threads: --------------------------------------
      (128,)                 |     3.8  |     4.2
      (256, 128)             |     3.7  |     4.2
      (2, 512, 256)          |     3.7  |     4.2
      (2, 64, 256, 128)      |    35.3  |    37.1
      (4, 2, 512, 256, 128)  |  1020.0  |  1058.3

Times are in microseconds (us).

[- cauchy : input dtype torch.float64 device cuda ]
                             |   Before  |   After
1 threads: ----------------------------------------
      (128,)                 |      3.8  |      4.2
      (256, 128)             |      8.0  |      8.0
      (2, 512, 256)          |     46.0  |     46.0
      (2, 64, 256, 128)      |    669.2  |    669.4
      (4, 2, 512, 256, 128)  |  21255.0  |  21262.1

Times are in microseconds (us).
```

<details>

Benchmark Script:
```python
import torch
import itertools
import time
from torch.utils.benchmark import Timer
from torch.utils.benchmark import Compare
import sys
import pickle

print('Using pytorch %s' % (torch.__version__))

cuda_shapes = [(128,), (256, 128), (2, 512, 256), (2, 64, 256, 128), (4, 2, 512, 256, 128)]
cuda_dtypes = [torch.half, torch.bfloat16, torch.float, torch.double]
results = []
repeats = 10

for device in ['cuda']:
    dtypes = cuda_dtypes
    shapes = cuda_shapes

    for dtype in dtypes:
        for shape in shapes:
            t = torch.randn(shape, device=device, dtype=dtype) * 10

            tasks = [("t.cauchy_()", "After", "")]
            timers = [Timer(stmt=stmt, label=f"cauchy : input dtype {dtype} device {device}", sub_label=f"{(shape)}", description=desc, globals=globals()) for stmt, desc, label in tasks]

            for i, timer in enumerate(timers * repeats):
                results.append(
                    timer.blocked_autorange()
                )
                print(f"\r{i + 1} / {len(timers) * repeats}", end="")
                sys.stdout.flush()

with open('after-pr.pkl', 'wb') as f:
    pickle.dump(results, f)

comparison = Compare(results)
comparison.print()
```

Compare Script:
```
import torch
import itertools
import time
from torch.utils.benchmark import Timer
from torch.utils.benchmark import Compare
import sys
import pickle

with open('before-pr.pkl', 'rb') as f:
    after_results = pickle.load(f)

with open('after-pr.pkl', 'rb') as f:
    before_results = pickle.load(f)

comparison = Compare(after_results + before_results)
comparison.print()
```

</details>

TODO:
* [x] Add comment

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60186

Reviewed By: jbschlosser

Differential Revision: D29433897

Pulled By: ngimel

fbshipit-source-id: 9c5f14b83e3372bed72369f70eed9256c04385c6
2021-06-28 12:49:30 -07:00
Victor Bittorf
8b6487c650 Add CUDA Vital (#58059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58059

Add CUDA.used vital sign which is true only if CUDA was "used" which technically means the context was created.

Also adds the following features:
- Force vitals to be written even if vitals are disabled, to enable testing when the env variable is not set from the start of execution
- Add a read_vitals call for python to read existing vital signs.

Test Plan: buck test mode/dbg caffe2/test:torch -- --regex basic_vitals

Reviewed By: xuzhao9

Differential Revision: D28357615

fbshipit-source-id: 681bf9ef63cb1458df9f1c241d301a3ddf1e5252
2021-06-25 16:31:11 -07:00
Masaki Kozuki
a404cc9a7b CUDA addcmul and addcdiv do math in float for 16 bits I/O (#60715)
Summary:
Currently, foreach `addcmul` and `addcdiv` cast the scalar to float so that the actual math is done in FP32 when the tensor dtype is Float16/BFloat16, while the regular `addcmul` and `addcdiv` do not.

### Reproducible steps to see the behavioral difference
```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'

In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)

In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)

In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```

### How foreach casts?
Foreach addcmul and addcdiv cast scalar to `opmath_t` (almost equivalent to acc_type) here: 42c8439b6e/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu (L30) and cast inputs and results here:
42c8439b6e/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L133-L135)

Related to https://github.com/pytorch/pytorch/issues/58833 #60227 https://github.com/pytorch/pytorch/issues/60454
cc ptrblck mcarilli ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60715

Reviewed By: albanD

Differential Revision: D29385715

Pulled By: ngimel

fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603
2021-06-25 10:21:35 -07:00
Ilqar Ramazanli
90cd57ee16 To add edge_order=2 and documentation for gradient operator (#58165)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56036
Fixes https://github.com/pytorch/pytorch/issues/56130

* All interior points are computed using the second-order accurate central differences method for the gradient operator. However, currently we only have a first-order method for the edge points. In this PR we add second-order methods for the edge points as well (see the sketch after this list).

* Currently, there is no detailed description of how the gradient operator is computed using the second-order method, or of how to use its parameters correctly. We add a detailed explanation of the meaning of each parameter and of the return value of the gradient operator, along with a description of the second-order computation.
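
A minimal sketch of requesting the second-order accurate boundary treatment through the public `torch.gradient` API:

```python
import torch

y = torch.tensor([1., 2., 4., 8., 16.])
# edge_order=2 uses second-order accurate one-sided differences at the edges
(dy,) = torch.gradient(y, edge_order=2)
print(dy)
```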

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58165

Reviewed By: mruberry

Differential Revision: D29305321

Pulled By: iramazanli

fbshipit-source-id: 0e0e418eed801c8510b8babe2ad3d064479fb4d6
2021-06-23 03:35:15 -07:00
Philip Meier
0c916c8a4e up the priority of numpy array comparisons in self.assertEqual (#59067)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58988.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59067

Reviewed By: jbschlosser

Differential Revision: D28986642

Pulled By: heitorschueroff

fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0
2021-06-22 13:07:07 -07:00
praneeth
9b30fb8528 add support for constant (#60166)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58739. Adds support for constants according to the Python array API stipulation.
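
Assuming the constants follow the array API names (`e`, `inf`, `nan`, `pi`); which of these names this PR actually adds is not confirmed here, so treat this as a sketch:

```python
import torch

print(torch.pi, torch.e)     # math constants per the array API
print(torch.inf, torch.nan)  # IEEE infinity and NaN as module-level constants
```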

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60166

Reviewed By: anjali411

Differential Revision: D29253958

Pulled By: mruberry

fbshipit-source-id: 0bc86b74d3a4eb3ec4a65c941ec2710747402db1
2021-06-21 20:47:21 -07:00
Thomas J. Fan
c16f87949f ENH Adds nn.ReflectionPad3d (#59791)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/27655

This PR adds a C++ and Python version of ReflectionPad3d with structured kernels. The implementation uses lambdas extensively to better share code between the backward and forward passes.
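
A quick usage sketch of the new module:

```python
import torch
import torch.nn as nn

pad = nn.ReflectionPad3d(1)  # reflect-pad each of the six sides by 1
x = torch.rand(1, 1, 3, 3, 3)
print(pad(x).shape)          # torch.Size([1, 1, 5, 5, 5])
```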

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59791

Reviewed By: gchanan

Differential Revision: D29242015

Pulled By: jbschlosser

fbshipit-source-id: 18e692d3b49b74082be09f373fc95fb7891e1b56
2021-06-21 10:53:14 -07:00
Peter Bell
e8e3394ea8 Recognize transposed dense tensors as a form of partial overlap (#59014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59014

Fixes #48401

`assert_no_overlap` currently has a false-negative where it recognizes
the transpose of a contiguous tensor as fully overlapping. This happens because
the memory regions do fully overlap, but of course the strides are different so
the actual elements don't all overlap.

This goes slightly in the other direction: by requiring strides to exactly
match, we get false positives for some unusual situations, e.g.
```
torch.add(a, a, out=a.view([1, *a.shape]))
```
Or replacing strides of length-1 dimensions, etc. However, I think these are
sufficiently obscure that it's okay to error and the common cases like
inplace operations still work as before.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D29040928

Pulled By: ngimel

fbshipit-source-id: 5a636c67536a3809c83f0d3117d2fdf49c0a45e6
2021-06-18 16:29:25 -07:00
Mike Ruberry
92513038e8 Revert D28994140: [pytorch][PR] Implemented torch.cov
Test Plan: revert-hammer

Differential Revision:
D28994140 (23c232554b)

Original commit changeset: 1890166c0a9c

fbshipit-source-id: 73dfe1b00464e38f004f99960cdeeb604ed4b20a
2021-06-13 02:33:37 -07:00
Heitor Schueroff
23c232554b Implemented torch.cov (#58311)
Summary:
Based on https://github.com/pytorch/pytorch/pull/50466

Adds the initial implementation of `torch.cov` similar to `numpy.cov`. For simplicity, we removed support for many parameters in `numpy.cov` that are either redundant such as `bias`, or have simple workarounds such as `y` and `rowvar`.

cc PandaBoi

TODO

- [x] Improve documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58311

Reviewed By: mruberry

Differential Revision: D28994140

Pulled By: heitorschueroff

fbshipit-source-id: 1890166c0a9c01e0a536acd91571cd704d632f44
2021-06-11 09:40:50 -07:00
Kimish Patel
4f79270b89 [PyTorch ] Thread parallel bmm across batch dim (#59596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59596

Parallelize batch matmul across batch dim. This was found to improve perf for
some use cases on mobile.
ghstack-source-id: 130989569

Test Plan: CI unit tests

Reviewed By: albanD

Differential Revision: D26833417

fbshipit-source-id: 9b84d89d29883a6c9d992d993844dd31a25f76b1
2021-06-10 08:25:40 -07:00
Yukio Siraichi
84061dadad Add reduce variants for scatter operation. (#57015)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56463 #56464

- Add reduce variants for `scatter` in both _native_functions.yaml_ and _TensorAdvancedIndexing.cpp_ (see the usage sketch after this list)
- Add `OpInfo` tests and reduce tests in _test_torch.py_
- Fix default reduce argument for `scatter_` in __tensor_docs.py_
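
A minimal sketch of the in-place `reduce` variant (duplicate indices accumulate into the destination):

```python
import torch

dst = torch.ones(3)
index = torch.tensor([0, 1, 0])
src = torch.tensor([1., 2., 3.])

dst.scatter_(0, index, src, reduce='add')  # dst[0] += 1 + 3, dst[1] += 2
print(dst)                                 # tensor([5., 3., 1.])
```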

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57015

Reviewed By: mrshenli

Differential Revision: D28162657

Pulled By: ezyang

fbshipit-source-id: 4d37ed1569ce8560aca1085c9cf5349f11427c4f
2021-06-08 13:37:26 -07:00
Mike Ruberry
de40c8e495 Adds remaining OpInfos and removes redundant test generators (#55558)
Summary:
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55558

Reviewed By: ngimel

Differential Revision: D28922522

Pulled By: mruberry

fbshipit-source-id: 89cefd93788bc8aa0683f4583cf5caa81aa2dc93
2021-06-06 14:52:26 -07:00
Natalia Gimelshein
344ecb2e71 flip via TI (#59509)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/58747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59509

Reviewed By: mruberry

Differential Revision: D28918665

Pulled By: ngimel

fbshipit-source-id: b045c7b35eaf22e53b1bc359ffbe5a4fda05dcda
2021-06-05 15:43:29 -07:00
Natalia Gimelshein
5117ac3bb4 Revert D28877076: [pytorch][PR] torch.flip via TI
Test Plan: revert-hammer

Differential Revision:
D28877076 (d82bc3feb8)

Original commit changeset: 4fa6eb519085

fbshipit-source-id: c81e7d3283ff6822db913bf9f49a1533268755d0
2021-06-04 23:03:53 -07:00
lezcano
d82bc3feb8 torch.flip via TI (#58747)
Summary:
Implements an idea by ngimel to improve the performance of `torch.flip` via a clever hack into TI to bypass the fact that TI is not designed to work with negative indices.

Something that might be added is vectorisation support on CPU, given how simple the implementation is now.

Some low-hanging fruits that I did not implement:
- Write it as a structured kernel
- Migrate the tests to opinfos
- Have a look at `cumsum_backward` and `cumprod_backward`,  as I think that they could be implemented faster with `flip`, now that `flip` is fast.

**Edit**
This operation already has OpInfos and it cannot be migrated to a structured kernel because it implements quantisation

Summary of the PR:
- x1.5-3 performance boost on CPU
- x1.5-2 performance boost on CUDA
- Comparable performance across dimensions, regardless of the strides (thanks TI)
- Simpler code

<details>
<summary>
Test Script
</summary>

```python
from itertools import product

import torch
from torch.utils.benchmark import Compare, Timer

def get_timer(size, dims, num_threads, device):
    x = torch.rand(*size, device=device)

    timer = Timer(
        "torch.flip(x, dims=dims)",
        globals={"x": x, "dims": dims},
        label=f"Flip {device}",
        description=f"dims: {dims}",
        sub_label=f"size: {size}",
        num_threads=num_threads,
    )

    return timer.blocked_autorange(min_run_time=5)

def get_params():
    sizes = ((1000,)*2, (1000,)*3, (10000,)*2)
    for size, device in product(sizes, ("cpu", "cuda")):
        threads = (1, 2, 4) if device == "cpu" else (1,)
        list_dims = [(0,), (1,), (0, 1)]
        if len(size) == 3:
            list_dims.append((0, 2))
        for num_threads, dims in product(threads, list_dims):
            yield size, dims, num_threads, device

def compare():
    compare = Compare([get_timer(*params) for params in get_params()])
    compare.trim_significant_figures()
    compare.colorize()
    compare.print()

compare()
```
</details>

<details>
<summary>
Benchmark PR
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139954-81e46d80-ba3b-11eb-9aad-e825e515d41b.png)

</details>

<details>
<summary>
Benchmark master
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139915-76914200-ba3b-11eb-9aa8-84b3ca220c93.png)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58747

Reviewed By: agolynski

Differential Revision: D28877076

Pulled By: ngimel

fbshipit-source-id: 4fa6eb519085950176cb3a9161eeb3b6289ec575
2021-06-04 20:13:38 -07:00
Elton Leander Pinto
2119efd234 reflection_pad1d_backward: Port to structured (#59103)
Summary:
Tracking Issue: https://github.com/pytorch/pytorch/issues/55070
Port `reflection_pad1d_backward` to structured kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59103

Test Plan: Pre-existing tests

Reviewed By: jbschlosser

Differential Revision: D28836043

Pulled By: ezyang

fbshipit-source-id: 4c3b0880edf305896f540113dcab70c8af24253b
2021-06-04 10:23:53 -07:00
Edward Yang
f05d5bec48 Preserve PyObject even when it goes dead (#56017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56017

Fixes #55686

This patch is seemingly straightforward but some of the changes are very
subtle.  For the general algorithmic approach, please first read the
quoted issue.  Based on the algorithm, there are some fairly
straightforward changes:

- New boolean on TensorImpl tracking if we own the pyobj or not
- PythonHooks virtual interface for requesting deallocation of pyobj
  when TensorImpl is being released and we own its pyobj, and
  implementation of the hooks in python_tensor.cpp
- Modification of THPVariable to MaybeOwned its C++ tensor, directly
  using swolchok's nice new class

And then, there is python_variable.cpp.  Some of the changes follow the
general algorithmic approach:

- THPVariable_NewWithVar is simply adjusted to handle MaybeOwned and
  initializes as owned (like before)
- THPVariable_Wrap adds the logic for reverting ownership back to
  PyObject when we take out an owning reference to the Python object
- THPVariable_dealloc attempts to resurrect the Python object if
  the C++ tensor is live, and otherwise does the same old implementation
  as before
- THPVariable_tryResurrect implements the resurrection logic.  It is
  modeled after CPython code so read the cited logic and see if
  it is faithfully replicated
- THPVariable_clear is slightly updated for MaybeOwned and also to
  preserve the invariant that if owns_pyobj, then pyobj_ is not null.
  This change is slightly dodgy: the previous implementation has a
  comment mentioning that the pyobj nulling is required to ensure we
  don't try to reuse the dead pyobj.  I don't think, in this new world,
  this is possible, because the invariant says that the pyobj only
  dies if the C++ object is dead too.  But I still unset the field
  for safety.

And then... there is THPVariableMetaType.  colesbury explained in the
issue why this is necessary: when destructing an object in Python, you
start off by running the tp_dealloc of the subclass before moving up
to the parent class (much in the same way C++ destructors work).  The
deallocation process for a vanilla Python-defined class does irreparable
harm to the PyObject instance (e.g., the finalizers get run) making it
no longer valid attempt to resurrect later in the tp_dealloc chain.
(BTW, the fact that objects can resurrect but in an invalid state is
one of the reasons why it's so frickin' hard to write correct __del__
implementations).  So we need to make sure that we actually override
the tp_dealloc of the bottom most *subclass* of Tensor to make sure
we attempt a resurrection before we start finalizing.  To do this,
we need to define a metaclass for Tensor that can override tp_dealloc
whenever we create a new subclass of Tensor.  By the way, it was totally
not documented how to create metaclasses in the C++ API, and it took
a good bit of trial error to figure it out (and the answer is now
immortalized in https://stackoverflow.com/q/67077317/23845 -- the things
that I got wrong in earlier versions of the PR included setting
tp_basicsize incorrectly, incorrectly setting Py_TPFLAGS_HAVE_GC on
the metaclass--you want to leave it unset so that it inherits, and
determining that tp_init is what actually gets called when you construct
a class, not tp_call as another not-to-be-named StackOverflow question
suggests).

Aside: Ordinarily, adding a metaclass to a class is a user visible
change, as it means that it is no longer valid to mixin another class
with a different metaclass.  However, because _C._TensorBase is a C
extension object, it will typically conflict with most other
metaclasses, so this is not BC breaking.

The desired new behavior of a subclass tp_dealloc is to first test if
we should resurrect, and otherwise do the same old behavior.  In an
initial implementation of this patch, I implemented this by saving the
original tp_dealloc (which references subtype_dealloc, the "standard"
dealloc for all Python defined classes) and invoking it.  However, this
results in an infinite loop, as it attempts to call the dealloc function
of the base type, but incorrectly chooses the subclass type (because our
overridden tp_dealloc is no longer a subtype_dealloc; see
b38601d496/Objects/typeobject.c (L1261) )
So, with great reluctance, I must duplicate the behavior of
subtype_dealloc in our implementation.  Note that this is not entirely
unheard of in Python binding code; for example, Cython
c25c3ccc4b/Cython/Compiler/ModuleNode.py (L1560)
also does similar things.  This logic makes up the bulk of
THPVariable_subclass_dealloc

To review this, you should pull up the CPython copy of subtype_dealloc
b38601d496/Objects/typeobject.c (L1230)
and verify that I have specialized the implementation for our case
appropriately.  Among the simplifications I made:

- I assume PyType_IS_GC, because I assume that Tensor subclasses are
  only ever done in Python and those classes are always subject to GC.
  (BTW, yes!  This means I have broken anyone who has extended PyTorch
  tensors from the C API directly.  I'm going to guess no one has actually
  done this.)

- I don't bother walking up the type bases to find the parent dealloc;
  I know it is always THPVariable_dealloc.  Similarly, I can get rid
  of some parent type tests based on knowledge of how
  THPVariable_dealloc is defined

- The CPython version calls some private APIs which I can't call, so
  I use the public PyObject_GC_UnTrack APIs.

- I don't allow the finalizer of a Tensor to change its type (but
  more on this shortly)

One alternative I discussed with colesbury was instead of copy pasting
the subtype_dealloc, we could transmute the type of the object that was
dying to turn it into a different object whose tp_dealloc is
subtype_dealloc, so the stock subtype_dealloc would then be applicable.
We decided this would be kind of weird and didn't do it that way.
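
As a purely illustrative aside, here is a minimal pure-Python sketch of the "resurrection during deallocation" idea, assuming nothing about PyTorch internals; the patch does the moral equivalent at the C level in THPVariable_tryResurrect, so this is an analogue of the concept rather than the actual mechanism:

```python
# Pure-Python analogue: __del__ stashes a new strong reference to the dying
# object, so the object survives its own finalization (is "resurrected").
_keepalive = []

class Resurrecting:
    def __del__(self):
        # Revive the instance by creating a fresh reference to it.
        _keepalive.append(self)

obj = Resurrecting()
del obj                      # the finalizer runs, but the instance is resurrected
assert len(_keepalive) == 1

# CPython runs a finalizer at most once, which is exactly why the C-level code
# must resurrect *before* subclass finalizers get a chance to run.
```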

TODO:

- More code comments

- Figure out how not to increase the size of TensorImpl with the new
  bool field

- Add some torture tests for the THPVariable_subclass_dealloc, e.g.,
  involving subclasses of Tensors that do strange things with finalizers

- Benchmark the impact of taking the GIL to release C++ side tensors
  (e.g., from autograd)

- Benchmark the impact of adding a new metaclass to Tensor (probably
  will be done by separating out the metaclass change into its own
  change)

- Benchmark the impact of changing THPVariable to conditionally own
  Tensor (as opposed to unconditionally owning it, as before)

- Add tests that this actually indeed preserves the Python object

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27765125

Pulled By: ezyang

fbshipit-source-id: 857f14bdcca2900727412aff4c2e2d7f0af1415a
2021-06-03 10:50:36 -07:00
Thomas J. Fan
7f2e620105 FIX Validates that weights are 2d in embedding (#59314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59314

Reviewed By: H-Huang

Differential Revision: D28837753

Pulled By: jbschlosser

fbshipit-source-id: 683378244c61b0937c95563f91ef87ab09fd1653
2021-06-02 12:52:21 -07:00
Natalia Gimelshein
12418a4f86 Back out "Revert D28664514: [pytorch][PR] various TensorIterator speed improvements"
Summary: Original commit changeset: fcad039b7dc8

Test Plan: Existing tests

Reviewed By: mruberry

Differential Revision: D28720186

fbshipit-source-id: 14ac99ee2d7cafb86b20c979f8917beeefd616e1
2021-05-26 12:22:48 -07:00
Edward Yang
17fb651a3b Make torch.Tensor(torch.tensor(1.0)) work (#58885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58885

Fixes #58884

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28687510

Pulled By: ezyang

fbshipit-source-id: 81325f501cc3e83cbac02f7c44ded9d396356bb8
2021-05-26 11:33:05 -07:00
Natalia Gimelshein
8398ebaa86 Revert D28664514: [pytorch][PR] various TensorIterator speed improvements
Test Plan: revert-hammer

Differential Revision:
D28664514 (8a28bbeeb9)

Original commit changeset: 2e03cf90b37a

fbshipit-source-id: fcad039b7dc823fec8afa694ab74a7ac5011f8ab
2021-05-26 10:49:58 -07:00
Xiang Gao
c88333484f [resubmit] masked_scatter thrust->cub (#58865)
Summary:
See ae7760cf50bb2cddff4663a07b9d68decf4b6c75 for the fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58865

Reviewed By: mruberry

Differential Revision: D28657940

Pulled By: ngimel

fbshipit-source-id: 9155c710b0e18ebb3bfa2dabfdd117355ac30840
2021-05-25 11:00:50 -07:00
Natalia Gimelshein
8a28bbeeb9 various TensorIterator speed improvements (#58810)
Summary:
1) remove pushing back to strides vector for 1D tensors, those strides are never used in the loop anyway
2) avoid calling get_data_ptrs unless necessary
3) don't call into assert_no_partial_overlap if tensorImpls are the same (assert_no_partial_overlap has this comparison too, but after a couple of nested function calls)
4) use is_non_overlapping_and_dense instead of is_contiguous in the memory-overlap check (which, for some reason, is faster than is_contiguous, though I had hoped that after is_contiguous was de-virtualized they would be the same).

Altogether, brings instruction count down from ~110K to 102735 for the following binary inplace benchmark:
```
In [2]:  timer = Timer("m1.add_(b);", setup="at::Tensor m1=torch::empty({1}); at::Tensor b = torch::empty({1});", language="c++", timer=timeit.default_timer)
   ...:  stats=timer.collect_callgrind(number=30, repeats=3)
   ...:  print(stats[1].as_standardized().stats(inclusive=False))
```
similar improvements for unary inplace.

Update: restored stride packing for now; the count is now 104295, so packing is worth ~52 instructions. We should think about how to remove it safely.
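
For reference, a self-contained version of the quoted measurement (a sketch: it assumes a local PyTorch checkout where `collect_callgrind` works, i.e. Valgrind is installed and the C++ benchmark utilities are available, and that the build supports the `repeats` argument):

```python
import timeit
from torch.utils.benchmark import Timer

# Same C++ snippet as in the IPython session above, timed under callgrind.
timer = Timer(
    stmt="m1.add_(b);",
    setup="at::Tensor m1 = torch::empty({1}); at::Tensor b = torch::empty({1});",
    language="c++",
    timer=timeit.default_timer,
)
stats = timer.collect_callgrind(number=30, repeats=3)
print(stats[1].as_standardized().stats(inclusive=False))
```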

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58810

Reviewed By: bhosmer

Differential Revision: D28664514

Pulled By: ngimel

fbshipit-source-id: 2e03cf90b37a411d9994a7607402645f1d8f3c93
2021-05-25 10:44:51 -07:00
Serhat Yilmaz
b4f3a989da [torch][repeat_interleave] Fix ambigious function call (#58881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58881

recently added new parameter to the function with PR: https://github.com/pytorch/pytorch/pull/58417

However, this introduced ambiguity when making the call below:
  some_tensor.repeat_interleave(some_integer_value)

Making it optional to avoid the issue.
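
A small sketch of the call patterns involved, assuming a PyTorch build that already has the optional `output_size` argument:

```python
import torch

x = torch.tensor([1, 2, 3])
# With output_size optional, the plain integer form stays unambiguous:
print(x.repeat_interleave(2))                               # tensor([1, 1, 2, 2, 3, 3])

# The keyword can still be passed explicitly when the output length is known,
# which lets the CUDA path skip a device synchronization:
repeats = torch.tensor([2, 1, 2])
print(torch.repeat_interleave(x, repeats, output_size=5))   # tensor([1, 1, 2, 3, 3])
```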

Reviewed By: ezyang, ngimel

Differential Revision: D28653820

fbshipit-source-id: 5bc0b1f326f069ff505554b51e3b24d60e69c843
2021-05-25 00:31:32 -07:00
Yu Guo
74c12da451 add deterministic path for scatter_add_cuda for 1D tensors (#58761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58761

Previously we implemented a deterministic path for gather_backward in https://github.com/pytorch/pytorch/pull/55573, which replaced the non-deterministic scatter_add_cuda.

It's better to move it inside scatter_add so scatter_add can benefit from the deterministic path.
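
A hedged sketch of how the new path is reached from Python (assumes a CUDA build; only the 1-D case is covered by this change):

```python
import torch

torch.use_deterministic_algorithms(True)

if torch.cuda.is_available():
    out = torch.zeros(4, device="cuda")
    index = torch.tensor([0, 1, 1, 3], device="cuda")
    src = torch.ones(4, device="cuda")
    # 1-D scatter_add_ should now take the deterministic implementation
    # rather than tripping the nondeterministic-operation alert.
    out.scatter_add_(0, index, src)
    print(out)   # tensor([1., 2., 0., 1.], device='cuda:0')
```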

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_scatter_add_one_dim_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (5.063)
    ✓ Pass: caffe2/test:torch_cuda - test_scatter_add_one_dim_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (30.909)
    ✓ Pass: caffe2/test:torch_cuda - main (30.909)
Summary
  Pass: 2
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_gather_backward

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (4.613)
    ✓ Pass: caffe2/test:torch_cuda - test_gather_backward_deterministic_path_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.369)

buck test mode/opt //caffe2/test:torch_cuda -- test_nondeterministic_alert

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (5.356)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_CTCLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_accumulate_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_scatter_add_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_NLLLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_median_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_gather_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_bincount_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_histc_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bicubic_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_MaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_EmbeddingBag_max_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_trilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_kthvalue_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_linear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - main (28.146)
Summary
  Pass: 30
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D28585659

fbshipit-source-id: 1ad003d4130501ceff5f6a7a870ca3dbc9a3f1f2
2021-05-23 21:36:02 -07:00
kshitij12345
ee3ea31f12 OpInfo: split, split_with_sizes (#58184)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58184

Reviewed By: ngimel

Differential Revision: D28627271

Pulled By: mruberry

fbshipit-source-id: e6c0d2b005904ddebc9dab76685403530a6f6519
2021-05-23 15:47:35 -07:00
Serhat Yilmaz
4ca4640bae [torch][repeat_interleave] remove stream synchronization if output size is given (#58417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58417

Same as title.

Test Plan:
Rely on CI signal.

Update unit test to exercise new code path as well.

Reviewed By: ngimel

Differential Revision: D28482927

fbshipit-source-id: 3ec8682810ed5c8547b1e8d3869924480ce63dcd
2021-05-22 20:53:28 -07:00
Natalia Gimelshein
9e261de630 Revert D28547564: [pytorch][PR] masked_scatter thrust->cub
Test Plan: revert-hammer

Differential Revision:
D28547564 (5152cf8647)

Original commit changeset: 83aeddfaf702

fbshipit-source-id: d5259afb584e0f6c0a11de4d4cb3d56a2a562eb7
2021-05-21 09:18:34 -07:00
Xiang Gao
5152cf8647 masked_scatter thrust->cub (#56750)
Summary:
Benchmark:

```python
import torch
import itertools

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

run50_sync(lambda: torch.randperm(1000000, device='cuda'))

def benchmark(M):
    a = torch.randn(M, device='cuda')
    m = torch.randint(1, (M,), dtype=torch.long, device='cuda').bool()
    v = torch.randn(M, device='cuda')

    torch.cuda.synchronize()

    %timeit run50_sync(lambda:a.masked_scatter_(m, v))

for M in (100, 1000, 100000, 10000000):
    print(M)
    benchmark(M)
```

Before:
```
100
8.65 ms ± 80.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1000
8.75 ms ± 72.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000
9.27 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000
33.6 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

After
```
100
8.04 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1000
8.09 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000
8.63 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000
31.9 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56750

Reviewed By: ailzhang

Differential Revision: D28547564

Pulled By: ngimel

fbshipit-source-id: 83aeddfaf7023f9f9501c6b1e2faf91e8b6277b1
2021-05-20 10:27:58 -07:00
lezcano
452569dffb cfloat and cdouble functions (#58137)
Summary:
This adds the methods `Tensor.cfloat()` and `Tensor.cdouble()`.

I was not able to find the tests for the `.float()` functions. I'd be happy to add similar tests for these functions once someone points me to them.

Fixes https://github.com/pytorch/pytorch/issues/56014
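
For reference, a minimal usage sketch of the two new methods (mirroring `.float()`/`.double()`):

```python
import torch

x = torch.randn(2, 3)
print(x.cfloat().dtype)    # torch.complex64
print(x.cdouble().dtype)   # torch.complex128
```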

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58137

Reviewed By: ejguan

Differential Revision: D28412288

Pulled By: anjali411

fbshipit-source-id: ff3653cb3516bcb3d26a97b9ec3d314f1f42f83d
2021-05-13 21:13:37 -07:00
kshitij12345
6b1eeef601 OpInfo: squeeze (#58080)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58080

Reviewed By: agolynski

Differential Revision: D28379485

Pulled By: mruberry

fbshipit-source-id: 2b288036f595a5bd6b948a072494ee87f82322ce
2021-05-12 21:29:31 -07:00
Yu Guo
8a45006765 enable deterministic path for index_copy_cuda with index_put (#58144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58144

Reland of D28291041 (14badd9929), which was reverted due to a type error from Tuple[torch.Tensor]; it seems that mypy requires Tuple[torch.Tensor, torch.Tensor, torch.Tensor].

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_copy_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (9.229)
    ✓ Pass: caffe2/test:torch_cuda - test_index_copy_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.750)
    ✓ Pass: caffe2/test:torch_cuda - main (25.750)

Reviewed By: ngimel

Differential Revision: D28383178

fbshipit-source-id: 38896fd6ddd670cfcce36e079aee7ad52adc2a28
2021-05-12 16:26:50 -07:00
kshitij12345
d09abf004c OpInfo: narrow (#58082)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58082

Reviewed By: agolynski

Differential Revision: D28379371

Pulled By: mruberry

fbshipit-source-id: 484e560b1e6ceba234e497585ed308a27cd8b7a0
2021-05-12 15:39:15 -07:00
Mike Ruberry
c911c30520 Revert D28291041: enable deterministic path for index_copy_cuda with index_put
Test Plan: revert-hammer

Differential Revision:
D28291041 (14badd9929)

Original commit changeset: 7f0cf3ec7280

fbshipit-source-id: 6117bc6e5b2044ce70d4e4a19bccd8c183ea3702
2021-05-12 03:33:57 -07:00
Kurt Mohler
c7fb0a0e82 Remove beta warning for use_deterministic_algorithms (#58074)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58074

Reviewed By: ngimel

Differential Revision: D28373676

Pulled By: mruberry

fbshipit-source-id: cae9a92ebbf6ac5f8d3008aa6a6a9cd5c1041c9f
2021-05-12 03:30:12 -07:00
Yu Guo
14badd9929 enable deterministic path for index_copy_cuda with index_put (#57870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57870

This is similar to index_add_cuda, which uses index_put with accumulate=True.
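
A minimal illustration of the relationship the deterministic path builds on; when the indices are unique, index_copy_ matches an index_put_ with accumulate=False (this is only a sketch, not the kernel itself):

```python
import torch

x = torch.zeros(5, 3)
index = torch.tensor([0, 4, 2])                          # unique indices
src = torch.arange(9, dtype=torch.float).reshape(3, 3)

a = x.clone().index_copy_(0, index, src)
b = x.clone().index_put_((index,), src, accumulate=False)
assert torch.equal(a, b)
```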

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_copy_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (9.229)
    ✓ Pass: caffe2/test:torch_cuda - test_index_copy_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.750)
    ✓ Pass: caffe2/test:torch_cuda - main (25.750)

Reviewed By: ngimel

Differential Revision: D28291041

fbshipit-source-id: 7f0cf3ec72805f3617fd1de9ff03e1d49114fed8
2021-05-12 00:32:35 -07:00
Yu Guo
a07a0190f9 enable deterministic path for index_put with accumulate=False on CPU and CUDA (#57839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57839

We reuse `index_put_accum_kernel`, rename it to `index_put_deterministic_kernel`, and add a bool `accumulate` to `index_backward_kernel`.

Test Plan:
buck test mode/opt //caffe2/test:torch -- test_index_put_non_accumulate_deterministic

    ✓ Pass: caffe2/test:torch - test_index_put_non_accumulate_deterministic_cpu (test_torch.TestTorchDeviceTypeCPU) (5.120)
Summary
  Pass: 1
  Skip: 1
    ↻ caffe2/test:torch - test_index_put_non_accumulate_deterministic_meta (test_torch.TestTorchDeviceTypeMETA)
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_index_put_non_accumulate_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (6.397)
    ✓ Pass: caffe2/test:torch_cuda - test_index_put_non_accumulate_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (26.030)
    ✓ Pass: caffe2/test:torch_cuda - main (26.030)
Summary
  Pass: 2
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D28290699

fbshipit-source-id: df8bbe7af2e72017566161b05b85737fda4ceb3f
2021-05-12 00:31:19 -07:00
Ilqar Ramazanli
8b816e9010 To implement gradient for Pytorch (#54617)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56129
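
A small usage sketch of the new function, assuming a build that already ships `torch.gradient`; it estimates derivatives with central differences in the interior and one-sided differences at the boundaries:

```python
import torch

x = torch.tensor([1.0, 2.0, 4.0, 7.0, 11.0])
(g,) = torch.gradient(x)          # unit spacing by default; returns a tuple
print(g)                          # tensor([1.0000, 1.5000, 2.5000, 3.5000, 4.0000])
```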

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54617

Reviewed By: anjali411

Differential Revision: D28057452

Pulled By: iramazanli

fbshipit-source-id: 9bd86679282d34f5e5393e6447121586517eb4f0
2021-05-11 18:52:20 -07:00
kshitij12345
502eb664ae OpInfo: chunk (#57935)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57935

Reviewed By: ngimel

Differential Revision: D28346217

Pulled By: mruberry

fbshipit-source-id: 331995aa18fd2983fc2122a9af31fba43ab9839c
2021-05-11 10:16:10 -07:00
Edward Yang
da8cc355a3 Relax tp_new so that it is OK to call (#57544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57544

Instead of removing tp_new from the superclass (which causes
super().__new__ to not work), I now still install tp_new on the
superclass, but verify that you are not trying to directly
construct _TensorBase.

Fixes https://github.com/pytorch/pytorch/issues/57421

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28189475

Pulled By: ezyang

fbshipit-source-id: 9397a3842a77f5428d182dd62244b42425bca827
2021-05-05 09:04:39 -07:00
Peter Bell
33eea146ee torch.clamp with tensor min and max (#52695)
Summary:
Fixes gh-2793
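
A minimal sketch of the new calling convention, where the bounds may be tensors that broadcast elementwise against the input:

```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0])
low = torch.tensor([-1.0, 0.0, 0.0])
high = torch.tensor([1.0, 1.0, 2.0])
print(torch.clamp(x, min=low, max=high))   # tensor([-1.0000,  0.5000,  2.0000])
```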

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52695

Reviewed By: mruberry

Differential Revision: D27395977

Pulled By: ezyang

fbshipit-source-id: f86aa240feb034d42e4c45447e72218f6a773c24
2021-05-03 12:56:16 -07:00
kshitij12345
154eca0309 OpInfo: ravel, view, view_as (#56910)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56910

Reviewed By: ngimel

Differential Revision: D28141867

Pulled By: mruberry

fbshipit-source-id: bff49d40d7e3bb36bc83d1405bd77f5529eeffe9
2021-05-02 22:10:36 -07:00
Ivan Yashchuk
eaf00bf7d4 Skip linalg.qr saved mode check if compiled without LAPACK (#56284)
Summary:
This PR also removes qr and eig tests from test/test_torch.py. They were not skipped if compiled without LAPACK and they are now replaced with OpInfos.

Fixes https://github.com/pytorch/pytorch/issues/55929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56284

Reviewed By: ejguan

Differential Revision: D27827077

Pulled By: mruberry

fbshipit-source-id: 1dceb955810a9fa34bb6baaccbaf0c8229444d3a
2021-05-02 16:07:07 -07:00
kshitij12345
41099ef71c OpInfo: mvlgamma (#56907)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56907

Reviewed By: astaff

Differential Revision: D28118669

Pulled By: mruberry

fbshipit-source-id: f54ad6dc64ddb6bcfca5c5c7fd8f395cd9761128
2021-05-01 20:51:01 -07:00
Wenlei Xie
20085f6d23 Support auto generation of device check (#56872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56872

ghstack-source-id: 127914018

Test Plan: auto test

Reviewed By: ezyang

Differential Revision: D27986429

fbshipit-source-id: 0da8413b0b8e6810fcea27ed1de499f11f68bd1f
2021-05-01 12:02:09 -07:00
Emilio Castillo
0a9c9cc674 Update DLPack to 0.4 (#55365)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55090

I included the header directly, but I am not sure if we should add this as a git submodule; what do you guys think?
Also, regarding the implementation: lanes do not seem to be supported in ATen, but CuPy exports complex types with 2 lanes, and I am not sure whether this is correct or not. However, in PyTorch this seems to work properly, so I allow 2 lanes for complex dtypes.

TODO: add tests for complex and bfloat

Easy test script against cupy

```python
import cupy
import torch

from torch.utils.dlpack import to_dlpack
from torch.utils.dlpack import from_dlpack

# Create a PyTorch tensor.
tx1 = torch.tensor(
    [2 + 1j, 3 + 2j, 4 + 3j, 5 + 4j], dtype=torch.complex128
).cuda()

# Convert it into a DLPack tensor.
dx = to_dlpack(tx1)

# Convert it into a CuPy array.
cx = cupy.fromDlpack(dx)

# Convert it back to a PyTorch tensor.
tx2 = from_dlpack(cx.toDlpack())
torch.testing.assert_allclose(tx1, tx2)
```

Thanks to leofang, who updated CuPy's DLPack version; his PR served as the guide for this one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55365

Reviewed By: ngimel

Differential Revision: D27724923

Pulled By: mruberry

fbshipit-source-id: 481eadb882ff3dd31e7664e08e8908c60a960f66
2021-04-30 10:30:05 -07:00
Edward Yang
e362ee6f8a Make it illegal to directly construct _TensorBase (#56150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56150

See #56017 for full context; the short story is that by making
it illegal to directly construct _TensorBase, we need only
write a *single* tp_dealloc function which will work universally
for all _TensorBase subclasses, rather than having to write two
versions, one for _TensorBase itself, and others for Python subclasses
of _TensorBase.  This means simpler code.

The subtlety here is that we only install our custom `tp_new` for direct subclasses of TensorBase.  This is important, because overriding the `tp_new` also overrides any user defined constructor.  Fortunately class Tensor(_TensorBase) has no nontrivial constructors and doesn't mind, but other subclasses like Parameter definitely mind!
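
A hedged sketch of the user-visible consequence (the exact error type and message may differ between versions):

```python
import torch

try:
    torch._C._TensorBase()     # direct construction is rejected after this change
except (TypeError, RuntimeError) as exc:
    print("rejected:", exc)

# Going through the Python-level class still works as before:
print(torch.Tensor([1.0, 2.0]))
```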

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28028746

Pulled By: ezyang

fbshipit-source-id: 3c03a14666ad1ded1145fe676afb0a7623cdb9bb
2021-04-28 09:25:25 -07:00
Arindam Roy
5d7e48c9fc Disable one test in rocm (#56951)
Summary:
The test seems to be failing on the ROCm 4.1 CI node. Disabling it for now; the test will be re-enabled for ROCm when CI transitions to 4.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56951

Reviewed By: zou3519

Differential Revision: D28059808

Pulled By: ezyang

fbshipit-source-id: a9b064b7525ae6dce89c51fe29ff07f37b7ac796
2021-04-28 08:58:51 -07:00
Yukio Siraichi
cf17fd6dd5 Fix multinomial CUDA misalignment and non-deterministic behavior (#55364)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46702

- fails on probability distribution with odd items
  - trying to access an `acc_type` (`float`) in a `scalar_t` (`float16`) aligned memory
- produce unrepeatable result for large input tensor
  - parallel cumsum not monotonic at some positions

### Fixes
- computing cumsum on `acc_type` (`float`) instead of using `scalar_t` (`float16`) fixed both issues
- the non-monotonic behavior may happen even using `float`, though
  - in these cases, deterministic behavior may be achieved by eliminating the race condition when writing the result, using the atomic function `atomicMax`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55364

Reviewed By: mruberry

Differential Revision: D28031666

Pulled By: ngimel

fbshipit-source-id: 0fc6289e0b9ea2d31ef3771e7ca370de8f5c02de
2021-04-27 12:04:32 -07:00
Yu Guo
f5c24cc891 add deterministic path for index_copy_cpu (#56900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56900

Use a serial copy with iter.serial_for_each in deterministic mode.

Test Plan:
buck test mode/opt //caffe2/test:torch -- test_index_copy_deterministic

    ✓ Pass: caffe2/test:torch - test_index_copy_deterministic_cpu (test_torch.TestTorchDeviceTypeCPU) (5.581)

buck test mode/opt //caffe2/test:torch_cuda -- test_nondeterministic_alert_index_copy

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (11.565)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_index_copy_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (29.172)
    ✓ Pass: caffe2/test:torch_cuda - main (29.172)

Reviewed By: ngimel

Differential Revision: D27992992

fbshipit-source-id: cebeefd8508553f9dbc4145819fe90dd625502f3
2021-04-26 16:57:47 -07:00
Yu Guo
72c3ee073f add deterministic path for index_add_cuda (#56521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56521

index_add_cuda is non-deterministic due to cuda atomicAdd. Here we add a deterministic code path with index_put(accumulate=True)
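
A minimal illustration of the equivalence the deterministic path relies on (a CPU sketch; the deterministic CUDA path uses the same index_put(accumulate=True) semantics):

```python
import torch

x = torch.zeros(5)
index = torch.tensor([0, 2, 2, 4])        # note the repeated index
src = torch.tensor([1.0, 2.0, 3.0, 4.0])

a = x.clone().index_add_(0, index, src)
b = x.clone().index_put_((index,), src, accumulate=True)
assert torch.equal(a, b)                  # both give tensor([1., 0., 5., 0., 4.])
```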

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_add_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (12.289)
    ✓ Pass: caffe2/test:torch_cuda - test_index_add_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (27.190)
    ✓ Pass: caffe2/test:torch_cuda - main (27.190)
Summary
  Pass: 2
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_nondeterministic_alert

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (16.088)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_kthvalue_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_bincount_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_index_put_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_EmbeddingBag_max_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_MaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_histc_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_linear_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_NLLLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_accumulate_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_trilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bicubic_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_scatter_add_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_CTCLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_index_copy_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_median_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_gather_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - main (37.654)
Summary
  Pass: 32
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D27861072

fbshipit-source-id: c33731017b863751f3e3068a23135129c555b66f
2021-04-26 12:14:58 -07:00
kshitij12345
9eee14704a OpInfo: roll and rot90 (#56770)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56770

Reviewed By: ngimel

Differential Revision: D27987820

Pulled By: mruberry

fbshipit-source-id: c6b86cdc1b89d91eeda2215020137582e7c20c65
2021-04-25 22:12:38 -07:00
kshitij12345
9e027d7ea3 [OpInfo] Add opinfo for transpose and its aliases (#56122)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56122

Reviewed By: ezyang

Differential Revision: D27962878

Pulled By: mruberry

fbshipit-source-id: cfd84bb0dcedeb98233a10e2c9754281f7cb76af
2021-04-25 21:58:16 -07:00
kshitij12345
298db67220 [OpInfo] Add Function Variant and Opinfo for permute (#56125)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56125

Reviewed By: ezyang

Differential Revision: D27960312

Pulled By: mruberry

fbshipit-source-id: b9dd89f7e69d7dff29f3b53828656c13df898fa5
2021-04-25 21:26:44 -07:00
Kurt Mohler
1f04494c0e Consolidate nondeterministic error tests (#55631)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55631

Reviewed By: malfet

Differential Revision: D27909953

Pulled By: mruberry

fbshipit-source-id: 9115b2433f9c276555be55bd51b270a7a2846829
2021-04-22 23:37:01 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.
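
A hedged illustration of what the lint distinguishes (the error code shown is just an example):

```python
from typing import Any

def parse(blob: Any) -> int:
    # Unqualified suppression: this is what the new lint flags.
    return blob.count  # type: ignore

def parse_ok(blob: Any) -> int:
    # Qualified suppression that names the mypy error code passes the lint.
    return blob.count  # type: ignore[no-any-return]
```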

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Brandon Lin
d806b06167 Support int32 indices in torch.repeat_interleave (#55102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55102

To avoid casting a tensor to `.long()`, we introduce support for int32 in `torch.repeat_interleave`.
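
A small sketch of the new capability, assuming a build that includes this change:

```python
import torch

x = torch.tensor([10, 20, 30])
repeats = torch.tensor([1, 2, 3], dtype=torch.int32)   # previously required .long()
print(torch.repeat_interleave(x, repeats))             # tensor([10, 20, 20, 30, 30, 30])
```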

Reviewed By: ezyang

Differential Revision: D27478235

fbshipit-source-id: 08b4cce65fe94ff10535ddc07e1ba2bacea6a2cf
2021-04-19 09:07:25 -07:00
Winston Smith
36b476ccdd Added OpInfos for eq, ne, ge, gt, le, and lt (#55709)
Summary:
A https://github.com/pytorch/pytorch/issues/54261 task
Added OpInfos for `eq`, `ne`, `ge`, `gt`, `le`, and `lt`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55709

Reviewed By: jbschlosser

Differential Revision: D27760382

Pulled By: mruberry

fbshipit-source-id: 30d8c9633c69a097c1e4a9daf4178c617c0a9093
2021-04-17 22:52:47 -07:00
Victor Bittorf
52f1a07b63 Python API for Vitals (#53238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53238

There is a tension in the Vitals design: (1) we want a macro-based logging API for C++, and (2) we want a clean Python API. Furthermore, we want this to work with "print on destruction" semantics.

The unfortunate resolution is that there are (2) ways to define vitals:
(1) Use the macros for local use only within C++ - this keeps the semantics people enjoy
(2) For vitals to be used through either C++ or Python, we use a global VitalsAPI object.

Both these go to the same place for the user: printing to stdout as the globals are destructed.

The long history on this diff shows many different ways to try to avoid having 2 different paths... we tried weak pointers & shared pointers, verbose switch cases, etc. Ultimately each ran into an ugly trade-off, and this approach splits the difference better than the alternatives.

Test Plan:
buck test mode/dev caffe2/test:torch -- --regex vital
buck test //caffe2/aten:vitals

Reviewed By: orionr

Differential Revision: D26736443

fbshipit-source-id: ccab464224913edd07c1e8532093f673cdcb789f
2021-04-15 16:06:43 -07:00
Nikita Shulga
6daa1760d7 Skip geqrf test if compiled without LAPACK (#56105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56105

Reviewed By: walterddr

Differential Revision: D27785443

Pulled By: malfet

fbshipit-source-id: 9701f693a71f77259c0a6371106e7185cc49a803
2021-04-15 08:07:51 -07:00
Yu Guo
8596ac186b deterministic code path for gather_backward for dim = 1 (#55573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55573

provide a deterministic code path for gather_backward when dim = 1

Test Plan:
buck test //caffe2/test:torch -- test_gather_backward
    ✓ Pass: caffe2/test:torch - test_gather_backward_one_dim (test_torch.TestTorch) (1.099)
    ✓ Pass: caffe2/test:torch - test_gather_backward_deterministic_path (test_torch.TestTorch) (1.166)

test on GPU

buck test mode/opt //caffe2/test:torch_cuda -- test_gather_backward_deterministic

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1407375070421778
    ✓ ListingSuccess: caffe2/test:torch_cuda - main (7.484)
    ✓ Pass: caffe2/test:torch_cuda - test_gather_backward_deterministic_path_cuda (test_torch.TestTorchDeviceTypeCUDA) (26.145)
    ✓ Pass: caffe2/test:torch_cuda - main (26.145)
Summary
  Pass: 2
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D27632008

fbshipit-source-id: ec27475332a3b36360cc014193256c21cba77d63
2021-04-13 15:18:00 -07:00
Kurt Mohler
5a45b1b2f2 Add nondeterministic alert for index_put_ when accumulate=False (#55827)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55827

Reviewed By: yinghai

Differential Revision: D27725794

Pulled By: ngimel

fbshipit-source-id: f6b5b3e635170524fdb5a0141ebd27925c37e8d9
2021-04-13 14:28:16 -07:00
Winston Smith
aceceb3d5c Reland #50999 (Added pow() on CPU for float16 & bfloat16) (#55280)
Summary:
#### Reason for relanding
Line 1607 of `torch/testing/_internal/common_methods_invocations.py` of https://github.com/pytorch/pytorch/issues/50999  had `dtype` instead of `dtype=torch.bool`, so 4 of the 9 sample inputs for `bool` had incorrect dtype. This bug was caught by https://github.com/pytorch/pytorch/issues/54949.

1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types.
Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types (a short usage sketch follows this list).
However autograd isn't supported for `Float16` on CPU yet, as `log_vml_cpu` can't be enabled for it.
2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`.
It provides a common path for all the complex types & floating point types (except Float16, due to lack of complete AVX2 vectorization support for it).  It replaced code that had previously been duplicated for (float, double) and complex types,
so PowKernel.cpp looks a lot cleaner now.
3. Enabled (unskipped) some tests for `erf`, `erfc`,`erfinv`, `tan` and `linalg.vector.norm` which were being skipped earlier due to `pow()` not having been implemented for `float16` & `bfloat16`.
4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`.
5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation.
6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`.
7. Removed redundant `dtypesIfCPU` and `dtypesIfCUDA` from `OpInfo`s where they are equal to `dtypes`.
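
A short usage sketch for point 1 above (CPU only; the values are chosen to be exactly representable in the reduced-precision types):

```python
import torch

x = torch.arange(1.0, 5.0, dtype=torch.bfloat16)
print(torch.pow(x, 2))              # tensor([ 1.,  4.,  9., 16.], dtype=torch.bfloat16)

y = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
e = torch.tensor([3.0, 2.0, 1.0], dtype=torch.float16)
print(torch.pow(y, e))              # tensor([1., 4., 3.], dtype=torch.float16)
```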

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55280

Reviewed By: jbschlosser

Differential Revision: D27591772

Pulled By: heitorschueroff

fbshipit-source-id: c7420811b32595bb3353149a61e54a73f2eb352b
2021-04-13 13:23:29 -07:00
albanD
505f6f325f port addcdiv to opinfo (#55518)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55518

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27649411

Pulled By: albanD

fbshipit-source-id: cfb0a235d94ef62589acbeb9bf11d2ea17248484
2021-04-13 06:21:10 -07:00
albanD
9ccae89102 port addcmul to OpInfo (#55517)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55517

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27649413

Pulled By: albanD

fbshipit-source-id: e1faf25cf7f9c3636f62db1512aee78fd7c4f9b6
2021-04-13 06:19:33 -07:00
Wenlei Xie
561b507843 Eliminate device guard in generic dispatch key kernel wrappers (#55131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55131

Benchmark `zeros_out`:

```python
from torch.utils.benchmark import Timer
counts = Timer(
    stmt="""at::zeros_out(t, {1});""",
    setup="auto t = at::empty({1});",
    language="cpp",
).collect_callgrind(number=1_000)
print(counts)
```

With device guard:
```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f834f095ca0>
at::zeros_out(t, {1});
setup: auto t = at::empty({1});
                           All          Noisy symbols removed
    Instructions:      1396022                    1396022
    Baseline:                0                          0
1000 runs per measurement, 1 thread
```

Without device guard:
```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f25e48927c0>
at::zeros_out(t, {1});
setup: auto t = at::empty({1});
                           All          Noisy symbols removed
    Instructions:      1296022                    1296022
    Baseline:                0                          0
1000 runs per measurement, 1 thread
```

We see about `7.7%` improvement.

ghstack-source-id: 126295368

Test Plan:
```
buck build //caffe2/aten/...
buck test mode/dev mode/no-gpu //caffe2/test:torch  -- 'caffe2/test:torch - test_msnpu_error (test_torch.TestTorch)'
```

Reviewed By: ezyang

Differential Revision: D27496584

fbshipit-source-id: 97f783a809b77b28f77a93096d69b3da9ee69df7
2021-04-12 15:42:19 -07:00
Mike Ruberry
399b66c813 Ports logdet from method_tests() to op_db (#55743)
Summary:
Per title. Also updates some tensor construction helpers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55743

Reviewed By: ngimel

Differential Revision: D27702060

Pulled By: mruberry

fbshipit-source-id: f64b7bee855733ad1f4fd182819ceec5831d9878
2021-04-11 20:39:16 -07:00
Yukio Siraichi
93bf0ae6fc Remove legacy constructor calls from pytorch codebase. (#54142)
Summary:
Follow up from https://github.com/pytorch/pytorch/issues/53889
Related to https://github.com/pytorch/pytorch/issues/47112

Removing every occurrence of the legacy constructor call present in PyTorch at:
- _docs_
- _benchmarks_
- _test_
- _caffe2_
- _CONTRIBUTING.md_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142

Reviewed By: ngimel

Differential Revision: D27699450

Pulled By: mruberry

fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546
2021-04-11 15:45:17 -07:00
Nikita Shulga
add49e7e4e Enforce PEP263 for PyTorch python codebase (#55346)
Summary:
All python files containing non-ASCII characters should be correctly annotated with `# -*- coding: utf-8 -*-` comment

Delete a number of superfluous UTF-8 characters, most commonly the UTF-8 right single quotation mark U+2019 (’) used instead of the ASCII apostrophe ', for example `Module’s` -> `Module's`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55346

Reviewed By: samestep

Differential Revision: D27582044

Pulled By: malfet

fbshipit-source-id: c1cd89655915858ff3a41f675cdfffff795a8e44
2021-04-06 18:31:38 -07:00
lezcano
fd02fc5d71 Port put_ and take from TH to ATen (#53356)
Summary:
The two ports were done together, as they can be implemented with the same kernel. In TH, they were already implemented with the same kernel.

Resolves https://github.com/pytorch/pytorch/issues/24751
Resolves https://github.com/pytorch/pytorch/issues/24614
Resolves https://github.com/pytorch/pytorch/issues/24640
Resolves https://github.com/pytorch/pytorch/issues/24772

This port makes sure that it interacts correctly with the "deterministic algorithms" flag, as done in https://github.com/pytorch/pytorch/pull/51388

This PR also makes these two functions correct in the following aspects (all of them added to the tests as well):
- Support for complex numbers
- Correct handling of scalar inputs and zero-dimensional inputs
- Implementation that does not do any copies nor sorting of any of the input tensors
- Faster and more correct implementation of the backwards (now it works as it should when `source.shape() != index.shape()`)
- Now `put_(..., accumulate=True)` is implemented correctly with atomic operations on GPU / CPU (when possible) and is deterministic (modulo the loss of precision that might happen due to the reordering of a sum of floats)
- Adds the `torch.put` function that was missing, (`index_put` exists, for example)
- Corrected docs

It also adds a much more thorough testing to the operations and their gradients.

There is a BC-breaking change: we now check that the inputs do not overlap in the `put_` operation. The TH implementation handled this (correctly in some cases, incorrectly in others) by making contiguous copies of the inputs. How should we handle this one?
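
Before the benchmarks, a small usage sketch of the two ported operations and their flat (linear) indexing semantics:

```python
import torch

src = torch.tensor([[4, 3, 5],
                    [6, 7, 8]])
flat_idx = torch.tensor([0, 2, 5])
print(torch.take(src, flat_idx))    # tensor([4, 5, 8])

dst = torch.zeros(2, 3, dtype=torch.long)
dst.put_(flat_idx, torch.tensor([1, 2, 3]), accumulate=True)
print(dst)                          # tensor([[1, 0, 2],
                                    #         [0, 0, 3]])
```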

**Edit.** Benchmarks:
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch
from itertools import product

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device, cmd):
    print(f"cmd: {cmd}, ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")

    large_tensor = torch.rand(*([size] * ndims), device=device)
    small_tensor = torch.rand((index_len,), device=device)
    index = torch.randint(size * ndims, (index_len,), dtype=torch.long, device=device)
    if cmd == "put":
        command = "large_tensor.put_(index, small_tensor, accumulate=False)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "accumulate":
        command = "large_tensor.put_(index, small_tensor, accumulate=True)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "take":
        command = "torch.take(large_tensor, index)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    ipython.magic(f"timeit {command}")
    print()

for method, device in product(["accumulate", "put", "take"], [cpu, cuda]):
    run_test(3, 1000, 10, device, method)
    run_test(3, 1000, 1000, device, method)
    run_test(3, 1000, 10000, device, method)
    run_test(2, 10000, 100000, device, method)
```
</details>

```python
put_(accumulate=False)
```

<details>
<summary>ATen CPU (1.5x - 2x speedup)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.05 µs ± 2.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.15 µs ± 5.13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
21.6 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
238 µs ± 781 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
722 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
4.89 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
42.5 µs ± 96.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
428 µs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.99 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.4 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.6 µs ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.44 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.09 µs ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.77 µs ± 0.998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.8 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

```python
put_(accumulate=True)
```

<details>
<summary>ATen CPU (x2 speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.12 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.14 µs ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
20.8 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
264 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
814 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
5.11 µs ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
43.9 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
442 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (3x - 11x speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.01 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.3 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
12.6 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
34.7 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
38.2 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
61.2 µs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
140 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

```python
take()
```

<details>
<summary>ATen CPU (1.1x speedup)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.18 µs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.79 µs ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
16.6 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
161 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.1 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.93 µs ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
18.6 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
178 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.38 µs ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.7 µs ± 9.77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.5 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.31 µs ± 7.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.52 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.73 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.7 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53356

Reviewed By: mruberry

Differential Revision: D27520243

Pulled By: ngimel

fbshipit-source-id: e3979349c2c62d2949e09fb05e5fd4883fbc9093
2021-04-05 18:05:38 -07:00
Edward Yang
3acbaf834e Make structured functions properly check device/dtype of explicit out args (#55150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55150

Somehow I forgot to add these checks.  Now they're in here.  Thanks
ngimel for noticing.

This is probably a slight efficiency hit on TensorIterator, which is
probably already doing all these checks.  Would be good to follow up
on this, though it may not be easily fixable with the TI rewrite.
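
For illustration only (not the exact check added here), this is the kind of explicit-`out` mismatch such checks are meant to reject:

```python
import torch

a = torch.randn(3)
b = torch.randn(3)

out = torch.empty(3, dtype=torch.int64)
# Raises: the float result cannot be written into an int64 `out` tensor.
torch.add(a, b, out=out)
```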

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D27523879

Pulled By: ezyang

fbshipit-source-id: 458e617dbc6de6fcfa9e5841148b30b99f52e001
2021-04-05 14:42:43 -07:00
kshitij12345
0a81034dd0 Port atan2 to structured kernel (#55130)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55130

Reviewed By: gchanan

Differential Revision: D27502777

Pulled By: ezyang

fbshipit-source-id: 9c368e2c3670f5633e059024ccff8b3e95e2733e
2021-04-05 00:12:42 -07:00
Nikita Shulga
8377e6221a Revert D27478225: [pytorch][PR] Added pow() on CPU for float16 & bfloat16
Test Plan: revert-hammer

Differential Revision:
D27478225 (6d030c14cf)

Original commit changeset: d309dd98d5a9

fbshipit-source-id: e0518f15185b41946caf3a8456c7af3f52e5a910
2021-04-03 10:26:44 -07:00
Winston Smith
6d030c14cf Added pow() on CPU for float16 & bfloat16 (#50999)
Summary:
Added the functionality desired in https://github.com/pytorch/pytorch/issues/50789.

1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types.
Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types (see the usage sketch after this list).
However autograd isn't supported for `Float16` on CPU yet, as `log_vml_cpu` can't be enabled for it.
2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`.
It provides a common path for all the complex types & floating point types (except Float16, due to lack of complete AVX2 vectorization support for it).  It replaced code that had previously been duplicated for (float, double) and complex types,
so PowKernel.cpp looks a lot cleaner now.
3. Enabled (unskipped) some tests for `erf`, `erfc`,`erfinv`, `linalg.norm` and `linalg.vector.norm` which were being skipped earlier due to `pow()` not having been implemented for `float16` & `bfloat16`.
4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`.
5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation.
6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`.
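
A small usage sketch of what this enables (illustrative values only):

```python
import torch

base = torch.rand(4, dtype=torch.float16)   # CPU Half tensor
exp = torch.rand(4, dtype=torch.float16)

torch.pow(base, 2)      # pow(Tensor, Scalar)
torch.pow(base, exp)    # pow(Tensor, Tensor)

torch.pow(torch.rand(4, dtype=torch.bfloat16), 3)   # same for bfloat16
```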

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50999

Reviewed By: zou3519

Differential Revision: D27478225

Pulled By: heitorschueroff

fbshipit-source-id: d309dd98d5a96d0cb9b08281757bb1c65266d011
2021-04-02 15:57:06 -07:00
lezcano
36c27fd0ac SVD docs improved (#54002)
Summary:
- Corrected a few errata in the SVD docs
- Made the notation more uniform (refer to `Vh` in `linalg.svd`, always use double tilts...)
- Wrote a better explanation about why the gradients of `U` and `V` are not well-defined when the input is complex or real but has repeated singular values. The previous one pointed to a somewhat obscure post on gauge theory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54002

Reviewed By: malfet

Differential Revision: D27459502

Pulled By: mruberry

fbshipit-source-id: f5c35eca02d35dadd2fc0eeadfacc8824f409400
2021-04-01 09:31:40 -07:00
Kurt Mohler
6c235ef267 Allow std=0 in torch.normal, and error if std<0 (#51317)
Summary:
Part of https://github.com/pytorch/pytorch/issues/49998
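
A quick sketch of the intended behavior (illustrative only):

```python
import torch

mean = torch.zeros(3)
torch.normal(mean, std=torch.zeros(3))          # std == 0 is now allowed; returns `mean` exactly
torch.normal(mean, std=torch.full((3,), -1.0))  # std < 0 now raises an error
```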

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51317

Reviewed By: bdhirsh

Differential Revision: D27253939

Pulled By: mruberry

fbshipit-source-id: af7a72c3d91549b1a88b73849b6973e7619dc50b
2021-03-31 21:06:07 -07:00
Edward Yang
6c8d783830 Generate no-op meta functions for all inplace operations (#54901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54901

Some subtleties:
- Need to make sure not to clobber composite definitions when
  deciding when to generate
- I was lazy and so I didn't make inplace on TensorList work,
  nor did I make inplace functions that returned void work
- A few tests started complaining that these noop meta functions
  weren't raising the errors they needed.  This is tracked
  in https://github.com/pytorch/pytorch/issues/54897
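
A rough sketch of the generated behavior described above (assumed semantics):

```python
import torch

t = torch.empty(4, device='meta')
t.add_(1.0)      # dispatched to a generated no-op meta function; only metadata is involved
t.mul_(2.0)
print(t.device)  # meta
```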

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27407232

Pulled By: ezyang

fbshipit-source-id: 5e706a267496368acdafd128942c310954e43d29
2021-03-30 09:31:39 -07:00
Edward Yang
1f36ce6e4d Restore storage on meta tensors; increase meta coverage (#53973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973

Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y.

The first part is restoring the concept of storage to meta tensors.  Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` is an error.) As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by:

* Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written to not actually poke at the data pointer, so everything works out). The key idea here is we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage
* Turn on memory overlap checking in TensorIterator even for meta tensors
* Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error and I have a kludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment).

The second part is adding more support for the most used functions in the test suite.

* Inplace operations have very simple meta functions. I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!)
* `copy_` is now implemented (see the sketch after this list). Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error (as we don't know what the "correct" data to copy out is in this case)
* `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added)
* Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them
* Autograd engine now correctly homes meta tensors with CPU tensors (they have -1 device index so CUDA queues wouldn't work anyway)
* `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently.
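
A minimal sketch of the `copy_` semantics listed above (assumed behavior):

```python
import torch

m = torch.empty(2, 3, device='meta')
src = torch.randn(2, 3)

m.copy_(src)     # copying *into* a meta tensor is always OK
# src.copy_(m)   # copying *out of* a meta tensor raises, since the data is unknown
```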

Getting more meta function support triggers a number of bugs in the test suite, which I then fix:

- Linear algebra functions sometimes don't report NotImplementedError because they get swallowed by catch all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739
- dlpack obviously doesn't work with meta tensors, I just disabled the test

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D27036572

Test Plan: Imported from OSS

Reviewed By: agolynski, bdhirsh

Pulled By: ezyang

fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78
2021-03-29 08:37:46 -07:00
Xiang Gao
eec48303c0 Make index_add take a scalar argument alpha (#54176)
Summary:
```
index_add(Tensor self, int dim, Tensor index, Tensor source) -> Tensor
```
now becomes
```
index_add(Tensor self, int dim, Tensor index, Tensor source, Scalar alpha=1) -> Tensor
```
Generally, this sounds useful and harmless, and inside PyTorch, we are already needing this feature in `add_out_dense_sparse_cuda`, see the `SparseCUDATensorMath.cu` change in this PR.
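
A short usage sketch (illustrative values; `alpha` as described in the new schema above):

```python
import torch

x = torch.zeros(5)
index = torch.tensor([0, 2, 4])
source = torch.ones(3)

x.index_add(0, index, source, alpha=2.0)   # entries 0, 2 and 4 become 2.0
```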

**Test not added yet. Will add if after discussion we believe this is a good idea.**
- [ ] TODO: add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54176

Reviewed By: ngimel

Differential Revision: D27319198

Pulled By: mruberry

fbshipit-source-id: fe43be082d1230c87c5313458213d5252be2ff23
2021-03-28 00:22:45 -07:00
lezcano
5870346173 Port index_copy from TH to ATen (#52203)
Summary:
The design of the `TensorIterator` was similar to that in https://github.com/pytorch/pytorch/pull/50578

Resolves https://github.com/pytorch/pytorch/issues/24670
Resolves https://github.com/pytorch/pytorch/issues/24523

Timings:
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device):
    print(f"ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")

    x = torch.rand(*([size] * ndims), device=device)
    index = torch.randint(size, (index_len,), dtype=torch.long, device=device)
    # Time index_copy along every dimension; t matches x except along dim d,
    # where its size equals the index length.
    for d in range(ndims):
        shape_t = [size] * d + [index_len] + [size] * (ndims - d - 1)
        t = torch.rand(*shape_t, device=device)
        command = "x.index_copy(d, index, t)"
        if device == cuda:
            command = command + "; torch.cuda.synchronize()"
        ipython.magic(f"timeit {command}")
    print()

run_test(3, 700, 10, cpu)
run_test(3, 700, 100, cpu)
run_test(3, 700, 700, cpu)
run_test(2, 10000, 10000, cpu)

run_test(3, 700, 10, cuda)
run_test(3, 700, 100, cuda)
run_test(3, 700, 700, cuda)
run_test(2, 10000, 10000, cuda)
```

</details>

<details>
<summary>CPU ATen</summary>

```
ndims: 3, tensor_size: 700, index_len: 10, device: cpu
327 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
329 ms ± 456 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
378 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 3, tensor_size: 700, index_len: 100, device: cpu
348 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
359 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
526 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 3, tensor_size: 700, index_len: 700, device: cpu
560 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
552 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
932 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 2, tensor_size: 10000, index_len: 10000, device: cpu
163 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
302 ms ± 5.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
</details>

<details>
<summary>CUDA ATen</summary>

```
ndims: 3, tensor_size: 700, index_len: 10, device: cuda
9.63 ms ± 441 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.65 ms ± 230 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.4 ms ± 881 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 700, index_len: 100, device: cuda
10.8 ms ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11 ms ± 417 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
21.2 ms ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 3, tensor_size: 700, index_len: 700, device: cuda
19 ms ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.8 ms ± 493 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
25.8 ms ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 2, tensor_size: 10000, index_len: 10000, device: cuda
5.59 ms ± 109 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
10 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

</details>

<details>
<summary>CPU TH</summary>

```
ndims: 3, tensor_size: 700, index_len: 10, device: cpu
333 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
327 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
366 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 3, tensor_size: 700, index_len: 100, device: cpu
336 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
345 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
884 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 3, tensor_size: 700, index_len: 700, device: cpu
441 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
514 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.46 s ± 6.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 2, tensor_size: 10000, index_len: 10000, device: cpu
141 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.13 s ± 855 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

</details>

<details>
<summary>CUDA TH</summary>

```
ndims: 3, tensor_size: 700, index_len: 10, device: cuda
9.64 ms ± 390 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.68 ms ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.9 ms ± 928 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 700, index_len: 100, device: cuda
11.6 ms ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.1 ms ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
30.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 3, tensor_size: 700, index_len: 700, device: cuda
27.2 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.6 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
146 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 2, tensor_size: 10000, index_len: 10000, device: cuda
6.5 ms ± 3.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
64.7 ms ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

</details>

According to these results, we see a slight performance improvement across both CPU and GPU.

cc: nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52203

Reviewed By: jbschlosser

Differential Revision: D27066572

Pulled By: mruberry

fbshipit-source-id: 6101e461cf731afa3db042a383b723d3d6bfdc26
2021-03-22 22:36:35 -07:00
kshitij12345
afb560065c [testing] OpInfo for sgn and sign (#53885)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

TODO:
* [x] Check rendered docs. https://11525594-65600975-gh.circle-artifacts.com/0/docs/generated/torch.sgn.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53885

Reviewed By: ejguan

Differential Revision: D27114318

Pulled By: mruberry

fbshipit-source-id: 678179d87741aacd3b50f03dc460207c5aa29589
2021-03-22 09:39:40 -07:00
lezcano
9d9986fd10 Support for Half / bfloat16 / index_select and better testing (#53898)
Summary:
Added the support for half / bfloat / bool for `index_select`, as suggested by ngimel in
https://github.com/pytorch/pytorch/issues/49707#issuecomment-788140578

For the tests to pass, I also added the support for `index_add`.

I added `OpInfo` tests for `index_add` and more thorough forward tests for `index_select` to test these changes.

While doing so, I found that the support for scalar types in the derivative of `index_add` was not correct, so I corrected it.

Resolves https://github.com/pytorch/pytorch/issues/49707

It should also resolve similar issues that I encountered when porting `index_copy`, `take` and `put`.
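
A small sketch of the newly supported dtypes (illustrative only):

```python
import torch

x = torch.randn(4, 3).to(torch.bfloat16)
idx = torch.tensor([0, 2])

torch.index_select(x, 0, idx)                                 # bfloat16 now supported
x.index_add(0, idx, torch.ones(2, 3, dtype=torch.bfloat16))   # and index_add as well
```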

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53898

Reviewed By: mruberry

Differential Revision: D27193294

Pulled By: ngimel

fbshipit-source-id: 5a0af2c62a0cf24f3cc9c74f230ab4f3712bbb7a
2021-03-19 20:37:48 -07:00
Edward Yang
49f1336106 Add Tensor::is_cpu, genericize TensorIterator (#54079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54079

Fixes https://github.com/pytorch/pytorch/issues/53815

Instead of testing if something is CUDA, we instead test if something
is not CPU.  This in the general theming of "Don't be so darn CUDA
centric".

Intruigingly, we didn't have a is_cpu() method on Tensor.  Which seems
like a big oversight and one of the reasons how we ended up in this
mess.  So in it goes.  Maybe we should also get this for Python bindings
as well (but in that case, should probably look into redoing all of the
is_X bindings so they aren't done manually).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27109507

Pulled By: ezyang

fbshipit-source-id: abbe72c2e688c452ffe098d206cb79938b5824b1
2021-03-19 09:10:24 -07:00
Edward Yang
3c457043fb Also propagate storage_access_should_throw_ when copying tensor metadata (#53816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53816

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27036574

Pulled By: ezyang

fbshipit-source-id: 71e61b0aa3d46159c9af1112c262cbfa7eaa1879
2021-03-16 15:18:37 -07:00
Edward Yang
547f435763 Fix restriding logic for structured kernels (#53759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53759

Fixes #53587, see issue for in-depth explanation of the bug.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26971342

Pulled By: ezyang

fbshipit-source-id: 805983fed2658e27fb033f36a71fd30950a29328
2021-03-14 20:41:23 -07:00
Edward Yang
d47d246206 Add 'noarch' tests which only run in one CI config (#53747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53747

Fixes #53743

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26971343

Pulled By: ezyang

fbshipit-source-id: cee7aa10063ae674f741406a3af830e4b4f128df
2021-03-14 20:39:07 -07:00
Brian Hirsh
c68cc24cee update upsample tests in test_nn.py to test for memory_format (#53665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53665

ngimel pointed out to me where we already test the behavior of the `Upsample` ops in `test_nn.py`. This PR deletes my bespoke tests in `test_torch.py` and updates those in `test_nn.py` to test memory format properly.

There were two reasons the original test didn't pick up on a memory format regression:
- They didn't test the memory format of the output tensor explicitly, i.e. `output.is_contiguous(memory_format=...)`
- Even with that change, the test tensors were too simple to fail the tests. From some trial and error, it looks like one of the first two dimensions in the inputs needs to be > 1 in order for the `channels_last` memory format to actually re-order the strides (see the sketch below).
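
A minimal sketch of the kind of assertion the updated tests make (assumed shapes):

```python
import torch
import torch.nn.functional as F

# One of the first two dims must be > 1 for channels_last to actually reorder strides.
x = torch.randn(2, 3, 8, 8).contiguous(memory_format=torch.channels_last)
out = F.interpolate(x, scale_factor=2, mode='nearest')
assert out.is_contiguous(memory_format=torch.channels_last)
```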

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26929683

Pulled By: bdhirsh

fbshipit-source-id: d17bc660ff031e9b3e2c93c60a9e9308e56ea612
2021-03-10 14:21:14 -08:00
Natalia Gimelshein
6aa5148df2 Filter 0's returned by exponential distribution (#53480)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48841 for half datatype (it was fixed for other datatypes before).
The reason for https://github.com/pytorch/pytorch/issues/48841 happening for half was that `exponential_` for half was producing 0s.
Exponential distribution implementation on cuda is here e08aae2613/aten/src/ATen/native/cuda/DistributionTemplates.h (L535-L545)
with `transformation::exponential` defined here
e08aae2613/aten/src/ATen/core/TransformationHelper.h (L113-L123)
It takes a uniformly distributed random number and takes `log` of it. If necessary, the result is then converted to low precision datatype (half). To avoid 0's, before applying `log`,  ones are replaced with std::nextafter(1,0). This seems fine, because log(1-eps) is still representable in half precision (`torch.tensor([1.], device="cuda").nextafter(torch.tensor([0.], device="cuda")).log().half()` produces 5.96e-8) , so casting to `scalar_t` should work. However, since fast log approximation is used (`__logf`), the log result is ~3e-9 instead of more accurate 5.96e-8, and underflows when casting to half. Using `::log` instead of fast approximation fixes it, however, it comes with ~20% perf penalty on exponential kernel for fp32 datatype, probably more for half.

Edit: alternative approach used now is to filter all small values returned by transformation. The result is equivalent to squashing of 1's to 1-eps that was used before, and computing correct log of 1-eps (which is -eps, exactly equal even for doubles). This doesn't incur noticeable performance hit.
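
A sanity-check sketch of the fixed behavior (needs a GPU; illustrative only):

```python
import torch

x = torch.empty(10_000_000, device='cuda', dtype=torch.half).exponential_()
assert (x > 0).all()   # after the fix, half-precision exponential_ no longer returns 0s
```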

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53480

Reviewed By: mruberry

Differential Revision: D26924622

Pulled By: ngimel

fbshipit-source-id: dc1329e4773bf91f26af23c8afa0ae845cfb0937
2021-03-10 00:35:31 -08:00
Brian Hirsh
233b9490c2 fix channels_last bug in upsample kernels (#53535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53535

During the port to structured kernels for upsample kernels, I missed that a subset of them explicitly pass `memory_format` information from the input to the output tensors.

Note 1:
I added the logic into the `meta` function of each op, which feels morally correct since this logic affects the output shape/metadata. One consequence is that all backend implementations will get the logic. I synced with fmassa that this seems reasonable.

Note 2:
This logic used to happen in the following operators, which this PR fixes:
- upsample_nearest3d
- upsample_trilinear3d
- upsample_nearest2d
- upsample_bilinear2d

I explicitly didn't patch the other upsample kernels, which look like they never forwarded memory_format information:
- `upsample_bicubic2d` (maybe this should though? `UpSampleBicubic2d.cpp` isn't currently written to do anything different for `channels_last` tensors)
- All of the `upsample_{mode}1d` operators. Probably because, afaik, channels_last isn't supported for 3d tensors
- The corresponding backwards operator for every upsample op.

Note 3:
I'm also wondering why memory_format isn't just directly a part of the `tensor::options()` method, which would cause all ops to universally forward memory_format information from input to output tensors, rather than just the upsample ops. My guess is:
- BC-breakage. I'm not sure whether this would really *break* people, but it's an API change
- performance. `tensor::options()` is called everywhere, and adding a call to `suggest_memory_format()` would probably noticeably hit microbenchmarks. We could probably deal with that by making `memory_format` a precomputed field on the tensor?

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26891540

Pulled By: bdhirsh

fbshipit-source-id: b3845f4dd5646b88bf738b9e41fe829be6b0e5cf
2021-03-09 15:23:53 -08:00
Jane Xu
d0b32156f0 move test to CUDA only (#53561)
Summary:
Helps make master green by removing this hefty memory-allocating test from the CPU configs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53561

Reviewed By: malfet, albanD

Differential Revision: D26897941

Pulled By: janeyx99

fbshipit-source-id: 9f6c2d55f4eea1ab48665f7819fc113f21991036
2021-03-08 16:32:14 -08:00
mattip
54a2498919 Modify tests to use assertWarnsOnceRegex instead of maybeWarnsRegex (#52387)
Summary:
Related to https://github.com/pytorch/pytorch/issues/50006

Follow on for https://github.com/pytorch/pytorch/issues/48560 to ensure TORCH_WARN_ONCE warnings are caught. Most of this is straight-forward find-and-replace, but I did find one place where the TORCH_WARN_ONCE warning was not wrapped into a python warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52387

Reviewed By: albanD

Differential Revision: D26773387

Pulled By: mruberry

fbshipit-source-id: 5be7efbc8ab4a32ec8437c9c45f3b6c3c328f5dd
2021-03-08 03:32:14 -08:00
Edward Yang
758fb94fcb Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276

- One of the tests had a syntax error (but the test
  wasn't fine grained enough to catch this; any error
  was a pass)
- Doesn't work on ROCm

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D26820048

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: ezyang

fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45
2021-03-05 17:36:01 -08:00
Edward Yang
cfd9360d09 Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26837780

Original commit changeset: 21567cab5c0f

fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0
2021-03-04 20:45:35 -08:00
Edward Yang
1accffe450 Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26819810

Original commit changeset: e528260e1aa9

fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30
2021-03-04 18:48:56 -08:00
Edward Yang
9e5e5a7d96 Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26815021

Original commit changeset: 972eaafcdf14

fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607
2021-03-04 09:28:25 -08:00
Mike Ruberry
b864457743 Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26744062 (12d63cc2f5)

Original commit changeset: be6d2653afe5

fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759
2021-03-04 04:11:25 -08:00
Edward Yang
12d63cc2f5 Add assert_async (#53086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086

Fixes #36853

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26744062

Pulled By: ezyang

fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d
2021-03-03 16:18:07 -08:00
Edward Yang
0f81a69a96 Make meta a device (getting rid of empty_meta) (#53143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53143

Meta is now an honest to goodness device type, like cpu, so you can use
device='meta' to trigger allocation of meta tensors.  This is way better
than empty_meta since we now have a working API for most factory functions
(they don't necessarily work yet, though, because we need to register Meta
versions of those functions.)
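
For example (a minimal sketch using `torch.empty`, assumed to have a meta registration):

```python
import torch

t = torch.empty(1024, 1024, device='meta')
print(t.shape, t.device)   # torch.Size([1024, 1024]) meta
```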

Some subtleties:
- I decided to drop the concept of CPU versus CUDA meta tensors; meta
  tensors are device agnostic.  It's hard to say exactly what the
  correct level of abstraction here is, but in this particular case
  implementation considerations trump semantic considerations: it
  is way easier to have just a meta device, than to have a meta device
  AND a cpu device AND a cuda device.  This may limit the applicability
  of meta tensors for tracing models that do explicit cpu()/cuda()
  conversions (unless, perhaps, we make those operations no-ops on meta
  tensors).
- I noticed that the DeviceType uppercase strings are kind of weird.
  Are they really supposed to be all caps?  That's weird.
- I moved the Meta dispatch key to live with the rest of the "device"
  dispatch keys.
- I intentionally did NOT add a Backend for Meta.  For now, I'm going to
  hope meta tensors never exercise any of the Backend conversion code;
  even if it does, better to fix the code to just stop converting to and
  from Backend.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26763552

Pulled By: ezyang

fbshipit-source-id: 14633b6ca738e60b921db66a763155d01795480d
2021-03-03 11:24:13 -08:00
Natalia Gimelshein
e5e54ada61 fix logcumsumexp functor to properly handle infs and nans (#52947)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52213
NaNs were previously propagated inconsistently because std::min always returns its first argument when one of the args is NaN.
Also, when the reduction functor was called on two `-inf` arguments, `std::min(x,y) - std::max(x,y)` evaluated to `-inf - (-inf)` = NaN, even though logcumsumexp is well defined for a `-inf, -inf` pair.
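
A small illustration of that case (expected values after the fix):

```python
import torch

x = torch.tensor([float('-inf'), float('-inf')])
torch.logcumsumexp(x, dim=0)   # tensor([-inf, -inf]) rather than nan
```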

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52947

Reviewed By: H-Huang

Differential Revision: D26718456

Pulled By: ngimel

fbshipit-source-id: a44433889da352cc959786dd15b6361a68fcfed7
2021-03-02 10:58:01 -08:00
kshitij12345
f5617b0932 [testing] Add Opinfo for torch.frac and minor fixes (#52660)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52660

Reviewed By: ailzhang

Differential Revision: D26618151

Pulled By: mruberry

fbshipit-source-id: cf0df38e46f44d3afff6e0015af5a840c661aa0e
2021-03-01 04:58:31 -08:00
Nikita Vedeneev
0048d97eda remove index_fill side-effect for scalar tensors (#52209)
Summary:
`index_fill` silently promotes zero dim Tensors to 1-dim Tensors. This PR fixes that.
Was:
```
In [1]: import torch

In [2]: x = torch.tensor(1)

In [3]: idx = torch.tensor(0).long()

In [4]: x.dim()
Out[4]: 0

In [5]: x.index_fill(0, idx, -1).dim()
Out[5]: 1

```
Now:
```
In [6]: x.index_fill(0, idx, -1).dim()
Out[6]: 0

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52209

Reviewed By: ejguan

Differential Revision: D26446470

Pulled By: ngimel

fbshipit-source-id: 4737e6941a7216b57f3416b59362817834df3a3a
2021-02-25 00:35:27 -08:00
Jane Xu
09516d2d0c Reenables skipped tests for all CUDA versions except 11.2 (#52359)
Summary:
This PR adds functionality to skip a test based on CUDA version.

This way, we can be more specific when skipping a test, such as when the test only fails for a particular CUDA version.

This allows us to add back the skipped tests for CUDA 11.2 for other CUDA versions, such as 10.1 and 11.1.

I tested this locally (by using 11.0 instead of 11.2), but will run all the CI to make sure it works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52359

Reviewed By: walterddr

Differential Revision: D26487951

Pulled By: janeyx99

fbshipit-source-id: 45c71cc6105ffd9985054880009cf68ea5ef3f6a
2021-02-19 15:30:55 -08:00
Nikita Vedeneev
9699c703c2 Stable sort for the CPU take 2. (#51790)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38681.
A duplicate of https://github.com/pytorch/pytorch/pull/50052, created so that it can be imported into the fb internal tests.
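
A usage sketch, assuming the `stable` keyword exposed by this change:

```python
import torch

x = torch.tensor([2, 1, 2, 1])
values, indices = torch.sort(x, stable=True)
# Equal elements keep their original relative order: indices == tensor([1, 3, 0, 2])
```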

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51790

Reviewed By: agolynski

Differential Revision: D26279045

Pulled By: glaringlee

fbshipit-source-id: 348e171dee9c370a76002b65d0c82c329f57a421
2021-02-19 09:28:57 -08:00
Xiong Wei
c7b0005831 Enhance Tensor.unflatten to support -1 as the inferred size (#51955)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51719, https://github.com/pytorch/pytorch/issues/28142

**Change**
- Update `torch.Tensor.unflatten` to support users passing `-1` as the inferred size for both tensors and named tensors (see the example below).
- Examples of using `-1` in the `unflatten` function are added to the docs.
- Fix the rendered issue of original `unflatten` docs by removing a blank line between its example section.
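
A quick illustration of the inferred size (sketch):

```python
import torch

x = torch.randn(2, 12)
x.unflatten(1, (3, -1)).shape   # torch.Size([2, 3, 4]); -1 is inferred as 4
```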

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51955

Reviewed By: agolynski

Differential Revision: D26467198

Pulled By: zou3519

fbshipit-source-id: 6a3ede25561223187273796427ad0cb63f125364
2021-02-18 08:37:41 -08:00
Ailing Zhang
83fa713f2b Fix test to use proper condition. (#52216)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52216

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26427506

Pulled By: ailzhang

fbshipit-source-id: ba4f2f66794cb2843926e5566eb4d25582f7fb2b
2021-02-12 12:59:35 -08:00
Kshiteej K
d7ea0fe75a [testing] Add OpInfo for rad2deg and deg2rad (#51283)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50006

We should probably add aliases for these operators to be consistent with NumPy names i.e. `np.degrees` and `np.radians`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51283

Reviewed By: ngimel

Differential Revision: D26171163

Pulled By: mruberry

fbshipit-source-id: 1869604ed400820d95f6ff50a0e3cba1de1ffa84
2021-02-10 19:45:10 -08:00
Jane Xu
bff8194522 Replace 11.1 with 11.2 on CI for Windows (#51598)
Summary:
Adding CUDA 11.2 to Windows CI.

Disabled tests:

The following ran into `CUDA error: misaligned address` for CUDA 11.2: (issue linked below)
`test_where_scalar_valid_combination_cuda_complex128` in test_torch.py
`test_sgn_complex_cuda` in test_autograd.py

The following ran into `CUDA error: too many resources requested for launch` for CUDA 11.2: (https://github.com/pytorch/pytorch/issues/52002)
test_EmbeddingBag_per_sample_weights_and_new_offsets_cuda_int64_float64
test_EmbeddingBag_per_sample_weights_and_offsets_cuda_int64_float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51598

Reviewed By: mrshenli

Differential Revision: D26344965

Pulled By: janeyx99

fbshipit-source-id: 3c9a4ed16d748969e96593220ec0a9f33e1ffcef
2021-02-10 17:59:11 -08:00
vfdev
8b0cb5ede3 OpInfo: Added clamp and trunc tests with aliases (#51167)
Summary:
Description:
- Added clamp, trunc tests with aliases
- Added tests for aliases for asin(h), acos(h), etc
- fixed 'fix' alias implementation
- fixed annotations in test_jit_alias_remapping
- updated native_functions.yaml aliases guidelines

Blocked by https://github.com/pytorch/pytorch/issues/50368

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51167

Reviewed By: gchanan

Differential Revision: D26245753

Pulled By: mruberry

fbshipit-source-id: e17b657f0515139735a8a677b1ae284904f98aef
2021-02-10 05:36:18 -08:00
Mike Ruberry
594a66d778 Warn about floor_divide performing incorrect rounding (#50281) (#50281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51745

Test Plan: Imported from OSS

Reviewed By: ngimel

Pulled By: mruberry

Differential Revision: D26257855

fbshipit-source-id: e5d497cf07b0c746838ed081c5d0e82fb4cb701b
2021-02-10 03:13:34 -08:00
kshitij12345
768662913a Migrate masked_fill__cuda to ATen (#51404)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51404

Reviewed By: mrshenli

Differential Revision: D26329833

Pulled By: ngimel

fbshipit-source-id: 510988888fad015239ab4766eb391a89b742130b
2021-02-09 22:57:03 -08:00
mattip
b97a040f71 ENH: toggle TORCH_WARN_ONCE to TORCH_WARN for tests (#48560)
Summary:
Toward fixing https://github.com/pytorch/pytorch/issues/47624

~Step 1: add `TORCH_WARN_MAYBE` which can either warn once or every time in c++, and add a c++ function to toggle the value.
Step 2 will be to expose this to python for tests. Should I continue in this PR or should we take a different approach: add the python level exposure without changing any c++ code and then over a series of PRs change each call site to use the new macro and change the tests to make sure it is being checked?~

Step 1: add a python and c++ toggle to convert TORCH_WARN_ONCE into TORCH_WARN so the warnings can be caught in tests
Step 2: add a python-level decorator to use this toggle in tests
Step 3: (in future PRs): use the decorator to catch the warnings instead of `maybeWarnsRegex`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48560

Reviewed By: ngimel

Differential Revision: D26171175

Pulled By: mruberry

fbshipit-source-id: d83c18f131d282474a24c50f70a6eee82687158f
2021-02-08 08:21:19 -08:00
wanyu2018umac
444203c52f Fix torch.cdist backward CUDA error due to illegal gridDim setting (#51569)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51569

Reviewed By: mruberry

Differential Revision: D26215694

Pulled By: ngimel

fbshipit-source-id: 0710417e6a802424e2dcada325f27452c95d042f
2021-02-02 20:41:24 -08:00
Jeffrey Wan
b18eeaa80a Implement np.diff for single order differences (#50569)
Summary:
Implements `np.diff` for single order differences only:
 - method and function variants for `diff` and function variant for `diff_out`
 - supports out variant, but not in-place since shape changes
 - adds OpInfo entry, and test in `test_torch`
 - automatic autograd because we are using the `Math` dispatch

_Update: we only support Tensors for prepend and append in this PR. See discussion below and comments for more details._
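
A short usage sketch (illustrative values; prepend as a Tensor, per the update above):

```python
import torch

x = torch.tensor([1, 3, 6, 10])
torch.diff(x)                              # tensor([2, 3, 4])
torch.diff(x, prepend=torch.tensor([0]))   # tensor([1, 2, 3, 4])
```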

Currently there is a quirk in the c++ API based on how this is implemented: it is not possible to specify scalar prepend and append values without also specifying all 4 arguments.

That is because the goal is to match NumPy's diff signature of `diff(int n=1, int dim=-1, Union[Scalar, Tensor] prepend=None, Union[Scalar, Tensor] append)=None` where all arguments are optional, positional and in the correct order.
There are a couple blockers. One is c++ ambiguity. This prevents us from simply doing `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)` etc for all combinations of {Tensor, Scalar} x {Tensor, Scalar}.

Why not have append, prepend not have default args and then write out the whole power set of {Tensor, Scalar, omitted} x {Tensor, Scalar, omitted} you might ask. Aside from having to write 18 overloads, this is actually illegal because arguments with defaults must come after arguments without defaults. This would mean having to write `diff(prepend, append, n, dim)` which is not desired. Finally writing out the entire power set of all arguments n, dim, prepend, append is out of the question because that would actually involve 2 * 2 * 3 * 3 = 36 combinations. And if we include the out variant, that would be 72 overloads!

With this in mind, the current way this is implemented is actually to still do `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)`. But also make use of `cpp_no_default_args`. The idea is to only have one of the 4 {Tensor, Scalar} x {Tensor, Scalar} provide default arguments for the c++ api, and add `cpp_no_default_args` for the remaining 3 overloads. With this, Python api works as expected, but some calls such as `diff(prepend=1)` won't work on c++ api.

We can optionally add 18 more overloads that cover the {dim, n, no-args} x {scalar-tensor, tensor-scalar, scalar-scalar} x {out, non-out} cases for c++ api. _[edit: counting is hard - just realized this number is still wrong. We should try to count the cases we do cover instead and subtract that from the total: (2 * 2 * 3 * 3) - (3 + 2^4) = 17. 3 comes from the 3 of 4 combinations of {tensor, scalar}^2 that we declare to be `cpp_no_default_args`, and the one remaining case that has default arguments has covers 2^4 cases. So actual count is 34 additional overloads to support all possible calls]_

_[edit: thanks to https://github.com/pytorch/pytorch/issues/50767 hacky_wrapper is no longer necessary; it is removed in the latest commit]_
 hacky_wrapper was also necessary here because `Tensor?` will cause dispatch to look for the `const optional<Tensor>&` schema but also generate a `const Tensor&` declaration in Functions.h. hacky_wrapper allows us to define our function as `const Tensor&` but wraps it in optional for us, so this avoids both the errors while linking and loading.

_[edit: rewrote the above to improve clarity and correct the fact that we actually need 18 more overloads (26 total), not 18 in total to complete the c++ api]_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50569

Reviewed By: H-Huang

Differential Revision: D26176105

Pulled By: soulitzer

fbshipit-source-id: cd8e77cc2de1117c876cd71c29b312887daca33f
2021-02-02 20:25:16 -08:00
Max Balandat
a990ff7001 [SobolEngine] Fix edge case of dtype of first sample (#51578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51578

https://github.com/pytorch/pytorch/pull/49710 introduced an edge case in which
drawing a single sample resulted in ignoring the `dtype` arg to `draw`. This
fixes this and adds a unit test to cover this behavior.
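
A minimal reproduction sketch of the fixed edge case (assumed behavior):

```python
import torch
from torch.quasirandom import SobolEngine

s = SobolEngine(3)
s.draw(1, dtype=torch.float64).dtype   # torch.float64; previously the dtype was ignored when drawing a single sample
```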

Test Plan: Unit tests

Reviewed By: danielrjiang

Differential Revision: D26204393

fbshipit-source-id: 441a44dc035002e7bbe6b662bf6d1af0e2cd88f4
2021-02-02 14:24:56 -08:00
vfdev
b106250047 Introduced AliasInfo for OpInfo (#50368)
Summary:
Introduced AliasInfo for OpInfo.

Context: Split of https://github.com/pytorch/pytorch/issues/49158

cc mruberry , please let me know if you'd like to see here more code to cover

> [ ] fold test_op_aliases.py into OpInfo-based testing in test_ops.py

from https://github.com/pytorch/pytorch/issues/50006

and/or add `UnaryUfuncInfo('abs')` as discussed https://github.com/pytorch/pytorch/pull/49158/files#r548774221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50368

Reviewed By: ngimel

Differential Revision: D26177261

Pulled By: mruberry

fbshipit-source-id: 2e3884a387e8d5365fe05945375f0a9d1b5f5d82
2021-02-02 00:10:09 -08:00
kshitij12345
4b65a27a35 [testing] Add OpInfo for round and logit (#51272)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51272

Reviewed By: ngimel

Differential Revision: D26177020

Pulled By: mruberry

fbshipit-source-id: 4728b14c7a42980c7ca231ca1946430e0e38ed5b
2021-02-01 21:15:40 -08:00
Nikita Vedeneev
b198cf4f1c port index_fill_ from TH to ATen. (#50578)
Summary:
As per title. The port is based on TensorIterator.
Supports complex input.

Resolves https://github.com/pytorch/pytorch/issues/24714.
Resolves https://github.com/pytorch/pytorch/issues/24577.
Resolves https://github.com/pytorch/pytorch/issues/36328.
Possibly resolves https://github.com/pytorch/pytorch/issues/48230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50578

Reviewed By: ngimel

Differential Revision: D26049539

Pulled By: anjali411

fbshipit-source-id: 2be4e78f7a01700c593a9e893e01f69191e51ab1
2021-02-01 16:08:37 -08:00
kshitij12345
50fa415a4d [testing] Add OpInfo for ceil and floor (#51198)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51198

Reviewed By: malfet

Differential Revision: D26105099

Pulled By: mruberry

fbshipit-source-id: 6cfa89f42b87cca66dbc5bf474d17a6cad7eb45a
2021-02-01 10:10:36 -08:00
Max Balandat
449098c2d2 [SobolEngine] Update direction numbers to 21201 dims (#49710)
Summary:
Performs the update that was suggested in https://github.com/pytorch/pytorch/issues/41489

Adjust the functionality to largely match that pf the scipy companion PR https://github.com/scipy/scipy/pull/10844/, including
- a new `draw_base2` method
- include zero as the first point in the (unscrambled) Sobol sequence

The scipy PR is also quite opinionated about the `draw` method being called with a base-2 (power-of-two) number of points, for which the resulting sequence has nice properties; see the scipy PR for a comprehensive discussion of this.

Note that this update is a **breaking change** in the sense that sequences generated with the same parameters after as before will not be identical! They will have the same (better, arguably) distributional properties, but calling the engine with the same seed will result in different numbers in the sequence.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49710

Test Plan:
```
from torch.quasirandom import SobolEngine

sobol = SobolEngine(3)
sobol.draw(4)

sobol = SobolEngine(4, scramble=True)
sobol.draw(5)

sobol = SobolEngine(4, scramble=True)
sobol.draw_base2(2)
```

Reviewed By: malfet

Differential Revision: D25657233

Pulled By: Balandat

fbshipit-source-id: 9df50a14631092b176cc692b6024aa62a639ef61
2021-02-01 08:44:31 -08:00
kshitij12345
a88e1d3ddf [complex] Complex support for masked_scatter and autograd support for masked_scatter and masked_select (#51281)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/33152

Changes
* Enable complex support for masked_scatter
* Enable half support for masked_scatter CPU
* Enable complex autograd support for masked_scatter CPU and masked_select (both CPU and CUDA).

**Note**:
Complex Support for masked_scatter CUDA is disabled as it depends on `masked_fill` which is yet to be ported to ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51281

Reviewed By: ailzhang

Differential Revision: D26127561

Pulled By: anjali411

fbshipit-source-id: 6284926b934942213c5dfc24b5bcc8538d0231af
2021-01-29 13:49:31 -08:00
kshitij12345
eaf5ca09dc Migrate masked_scatter_ CUDA to ATen (#50039)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50039

Reviewed By: heitorschueroff

Differential Revision: D26096247

Pulled By: ngimel

fbshipit-source-id: ec1810d3412e0d7ab6b950265a3123519ad886c1
2021-01-27 14:17:02 -08:00
kshitij12345
6d098095eb [numpy] torch.lgamma: promote integer inputs to float (#50140)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50140

Reviewed By: mrshenli

Differential Revision: D25951094

Pulled By: mruberry

fbshipit-source-id: e53f1dbddff889710f05d43dbc9587382d3decb0
2021-01-27 12:08:46 -08:00
Peter Bell
9b6d463704 Move std and var tests to OpInfos (#50901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50901

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26083289

Pulled By: mruberry

fbshipit-source-id: 7e14ff37bba46dd456e0bc0aa9c4e0a632d0734c
2021-01-27 10:50:51 -08:00
mattip
345844d9d8 test, fix deepcopy of tensor with grad (#50663)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/3307

Previously, `self.grad` was not ~cloned~ deepcopied to the returned tensor in `deepcopy`. Added a test and an implementation.
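
A minimal sketch of the fixed behavior:

```python
import copy

import torch

x = torch.randn(3, requires_grad=True)
x.sum().backward()

y = copy.deepcopy(x)
assert y.grad is not None and torch.equal(y.grad, x.grad)   # .grad is deepcopied too
```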

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50663

Reviewed By: heitorschueroff

Differential Revision: D26074811

Pulled By: albanD

fbshipit-source-id: 536dad36415f1d03714b4ce57453f406ad802b8c
2021-01-26 16:19:53 -08:00
anjali411
e544d74c55 [CPU] Add torch.trace for complex tensors (#50380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50380

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25949361

Pulled By: anjali411

fbshipit-source-id: 9910bc5b532c9bf3add530221d643b2c41c62d01
2021-01-23 09:04:31 -08:00
kshitij12345
a291b254ee Migrate masked_scatter_ CPU to ATen (#49732)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49541

Reference: https://github.com/pytorch/pytorch/issues/24507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49732

Reviewed By: ejguan

Differential Revision: D25991438

Pulled By: ngimel

fbshipit-source-id: a43bd0bfe043d8e32a6cadbbf736a0eaa697e7ec
2021-01-22 12:05:56 -08:00
Kurt Mohler
8ab1a1495d Rename set_deterministic to use_deterministic_algorithms (#49904)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49100
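
In short (a sketch of the rename):

```python
import torch

torch.use_deterministic_algorithms(True)   # formerly torch.set_deterministic(True)
```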

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49904

Reviewed By: ezyang, mrshenli

Differential Revision: D25956761

Pulled By: mruberry

fbshipit-source-id: 86a59289d50825a0ebbd7c358b483c8d8039ffa6
2021-01-22 11:27:07 -08:00
Kyle Chen
16faabe7f0 [ROCm] re-enable tests (#50691)
Summary:
Signed-off-by: Kyle Chen <kylechen@amd.com>

cc: jeffdaily

re-enable test_torch.py and test_unary_ufuncs.py tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50691

Reviewed By: mruberry

Differential Revision: D25967842

Pulled By: ngimel

fbshipit-source-id: dc0f6cb68fe4d151c2719bdf67ead96e1396acf2
2021-01-20 11:23:39 -08:00
Xinyu Li
7526e38cd3 Revert "Stable sort for CPU (#50052)" (#50752)
Summary:
This reverts commit c99f356051.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50752

Reviewed By: zou3519

Differential Revision: D25958146

Pulled By: glaringlee

fbshipit-source-id: f4068d038f9bd337bac8b673eaeb46a4646f6c77
2021-01-19 18:21:25 -08:00
kshitij12345
316f0b89c3 [testing] Port torch.{repeat, tile} tests to use OpInfo machinery (#50199)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50199

Reviewed By: ngimel

Differential Revision: D25949791

Pulled By: mruberry

fbshipit-source-id: 10eaf2d749fac8c08847f50461e72ad1c75c61e3
2021-01-19 06:02:27 -08:00
nikitaved
c458558334 kill multinomial_alias_setup/draw (#50489)
Summary:
As per title. Partially Fixes https://github.com/pytorch/pytorch/issues/49421.
These functions appear to be dead code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50489

Reviewed By: mruberry

Differential Revision: D25948912

Pulled By: ngimel

fbshipit-source-id: 108723bd4c76cbc3535eba902d6f74597bfdfa58
2021-01-19 00:23:58 -08:00
76181208+imaginary-person@users.noreply.github.com
3f052ba07b Remove unnecessary dtype checks for complex types & disable complex dispatch for CPU min/max pointwise ops (#50465)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50064

**PROBLEM DESCRIPTION:**
1. Had not removed dtype checks for complex types in the previous PR (https://github.com/pytorch/pytorch/issues/50347) for this issue.
These type-checks were added in https://github.com/pytorch/pytorch/issues/36377, but are no longer necessary,
as we now rely upon dispatch macros to produce error messages.
2. dtype checks in `clamp_max()` and `clamp_min()` for complex inputs had not been removed either.
3. For min/max pointwise ops in TensorCompareKernel.cpp, complex dispatch had not been removed for min/max functions.

### **FIX DESCRIPTION:**
**FIX SUMMARY:**
1. Removed dtype checks added in https://github.com/pytorch/pytorch/issues/36377, and added 3 more in TensorCompare.cpp.
2. Removed dtype checks for complex inputs in `clamp_max()` and `clamp_min()`.
3.  Disabled complex dispatch for min/max pointwise ops in TensorCompareKernel.cpp.
4. Error messages in the exceptions raised due to min/max ops not being implemented are now checked for containing the text _not support_ (which can also be present in _not supported_), or _not implemented_, so one of them should be a part of error messages, in order for them to be informative.

**REASON FOR NOT CHANGING DISPATCH FOR CUDA AND CLAMP OPS**:

As for the CUDA min/max operations, their kernels do not seem to be compiled & dispatched for complex types anyway, so no further changes seem to be required. Basically, the dispatch macros currently being used don't have cases for complex types.

For example,

1. the reduce CUDA ops use [AT_DISPATCH_ALL_TYPES_AND2 (678fe9f077)](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Dispatch.h#L548-L575) in [ReduceMinMaxKernel.cu](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/ReduceMinMaxKernel.cu), and that macro doesn't allow complex types.

2. In [MinMaxElementwiseKernel.cu](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/MaxMinElementwiseKernel.cu), the CUDA pointwise ops use [`AT_DISPATCH_FLOATING_TYPES_AND2 (678fe9f077)`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Dispatch.h#L240-L263) for non-integral & non-boolean types, and this marco doesn't have a case for complex types either.

3. [clamp CUDA ops](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/UnaryOpsKernel.cu#L170-L211) use `AT_DISPATCH_ALL_TYPES_AND2 (678fe9f077)`, which doesn't have a case for complex types.

Similarly, [CPU clamp min/max ops](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp#L428-L458) use the `AT_DISPATCH_ALL_TYPES_AND `dispatch macro, which doesn't have a case for complex types.

**REASON FOR ADDING 3 dtype CHECKS:**
There are a few cases in which the methods corresponding to `min_stub()` or `max_stub()` are not called, so dispatch macros don't get invoked, resulting in no exceptions being raised. Hence, `dtype` checks are necessary at 3 places to raise exceptions:

1. 52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L342)
2. 52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L422)
3. 52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L389)

The first dtype check requirement can be verified from the following example Python code based on `test_complex_unsupported()`:
```
import unittest
import torch

class MyTestCase(unittest.TestCase):

    def test_1(self):
        t = torch.tensor((1 + 1j), device='cpu', dtype=torch.complex128)
        with self.assertRaises(Exception):
            torch.max(t, dim=0)

if __name__ == '__main__':
    unittest.main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50465

Reviewed By: mruberry

Differential Revision: D25938106

Pulled By: ngimel

fbshipit-source-id: 95e2df02ba8583fa3ce87d4a2fdcd60b912dda46
2021-01-17 22:00:05 -08:00
nikitaved
c99f356051 Stable sort for CPU (#50052)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/38681](https://github.com/pytorch/pytorch/issues/38681) for the CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50052

Reviewed By: mrshenli

Differential Revision: D25900823

Pulled By: glaringlee

fbshipit-source-id: 1a3fa336037d0aa2344d79f46dcacfd478a353d1
2021-01-15 19:34:27 -08:00
kshitij12345
5546a12fe3 remove redundant tests from tensor_op_tests (#50096)
Summary:
All these unary operators already have an entry in the OpInfo DB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50096

Reviewed By: zhangguanheng66

Differential Revision: D25870048

Pulled By: mruberry

fbshipit-source-id: b64e06d5b9ab5a03a202cda8c22fdb7e4ae8adf8
2021-01-12 04:53:12 -08:00
kshitij12345
9f832c8d3e [numpy] torch.exp: promote integer inputs to float (#50093)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50093

Reviewed By: H-Huang

Differential Revision: D25803549

Pulled By: mruberry

fbshipit-source-id: e6f245b5e728f2dca6072f8c359f03dff63aa14d
2021-01-08 06:30:18 -08:00
Thomas Viehmann
def8aa5499 Remove cpu half and dead code from multinomial (#50063)
Summary:
Based on ngimel's (Thank you!) feedback, cpu half was only accidental, so I'm removing it.

This lets us ditch the old codepath for sampling without replacement in favour of the new, better one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50063

Reviewed By: mruberry

Differential Revision: D25772449

Pulled By: ngimel

fbshipit-source-id: 608729c32237de4ee6d1acf7e316a6e878dac7f0
2021-01-05 19:46:33 -08:00
anjali411
8fb5f16931 Complex backward for indexing, slicing, joining, and mutating ops (#49552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49552

This PR:
1. Migrates independent autograd test for `hstack`, `dstack`, `vstack`, `movedim`, `moveaxis` from `test_autograd.py` to the new `OpInfo` based tests.
2. Migrates autograd test for `gather`, `index_select` from the method_tests to the new `OpInfo` based tests.
2. Enables complex backward for `stack, gather, index_select, index_add_` and adds tests for complex autograd for all the above mentioned ops.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25682511

Pulled By: anjali411

fbshipit-source-id: 5d8f89db4a9ec340ab99a6196987d44a23e2c6c6
2021-01-04 19:44:15 -08:00
kshitij12345
42d2e31cd6 [numpy] torch.rsqrt : promote integer inputs to float (#47909)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47909

Reviewed By: ngimel

Differential Revision: D25730876

Pulled By: mruberry

fbshipit-source-id: c87a8f686e1dd64e511640e0278021c4a584ccf2
2020-12-30 10:33:14 -08:00
kshitij12345
963f7629b5 [numpy] torch.digamma : promote integer inputs to float (#48302)
Summary:
**BC-breaking Note:**

This PR updates PyTorch's digamma function to be consistent with SciPy's special.digamma function. This changes the result of the digamma function on the nonpositive integers, where the gamma function is not defined. Since the gamma function is undefined at these points, the (typical) derivative of the logarithm of the gamma function is also undefined at these points, and for negative integers this PR updates digamma to return NaN. For zero, however, it returns -inf to be consistent with SciPy.

Interestingly, SciPy made a similar change, which was noticed by at least one user: https://github.com/scipy/scipy/issues/9663#issue-396587679.

SciPy's returning of negative infinity at zero is intentional:
59347ae8b8/scipy/special/cephes/psi.c (L163)

This change is consistent with the C++ standard for the gamma function:
https://en.cppreference.com/w/cpp/numeric/math/tgamma
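
A small sketch of the new behaviour at the problematic points (expected values per the description above):
```python
import torch

x = torch.tensor([0.0, -1.0, -2.0, 0.5])
print(torch.digamma(x))
# expected: tensor([   -inf,     nan,     nan, -1.9635])
# zero maps to -inf (matching SciPy); negative integers map to NaN
```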

**PR Summary:**
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48302

Reviewed By: ngimel

Differential Revision: D25664087

Pulled By: mruberry

fbshipit-source-id: 1168e81e218bf9fe5b849db0e07e7b22e590cf73
2020-12-24 22:42:55 -08:00
Kshiteej K
3f4b98d568 [numpy] torch.erfinv: promote integer inputs to float (#49155)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49155

Reviewed By: ngimel

Differential Revision: D25664234

Pulled By: mruberry

fbshipit-source-id: 630fd1d334567d78c8130236a67dda0f5ec02560
2020-12-23 14:22:03 -08:00
Kshiteej K
461aafe389 [numpy] torch.angle: promote integer inputs to float (#49163)
Summary:
**BC-Breaking Note:**

This PR updates PyTorch's angle operator to be consistent with NumPy's. Previously angle would return zero for all floating point values (including NaN). Now angle returns `pi` for negative floating point values, zero for non-negative floating point values, and propagates NaNs.
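
A short sketch of the updated behaviour for real inputs (expected values per the note above):
```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0, float('nan')])
print(torch.angle(x))
# expected: tensor([3.1416, 0.0000, 0.0000,    nan])
# pi for negative values, 0 for non-negative values, NaN propagated
```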

**PR Summary:**

Reference: https://github.com/pytorch/pytorch/issues/42515

TODO:

* [x] Add BC-Breaking Note (previously all real numbers returned `0`, even `nan`) -> fixed to match NumPy's behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49163

Reviewed By: ngimel

Differential Revision: D25681758

Pulled By: mruberry

fbshipit-source-id: 54143fe6bccbae044427ff15d8daaed3596f9685
2020-12-22 18:43:14 -08:00
Xiang Gao
50b361a821 Enable BF16 for indexing on CUDA (#48801)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48801

Reviewed By: glaringlee

Differential Revision: D25542914

Pulled By: ngimel

fbshipit-source-id: 4113eb2729d15b40a89268172cc37122b5213624
2020-12-14 17:24:31 -08:00
Chester Liu
3a943e9f82 Use Unicode friendly API on Win32 in THAllocator (#47905)
Summary:
This replaces the narrow character set APIs with the wide character set ones in `THAllocator.cpp`. This fixes the potential crashes caused by passing non-ASCII characters in `torch::from_file` on Windows.

See: https://github.com/pytorch/pytorch/issues/47422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47905

Reviewed By: zhangguanheng66

Differential Revision: D25399146

Pulled By: ezyang

fbshipit-source-id: 0a183b65de171c48ed1718fa71e773224eaf196f
2020-12-14 14:24:20 -08:00
Brian Hirsh
f54ab8fbfe Revert "Revert D25003113: make validate debug-only in Device copy ctr" (#49123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49123

This reverts commit 7a4a2df225.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25463531

Pulled By: bdhirsh

fbshipit-source-id: 7c7ecdc1d63ffd137b84a129887c424b2083a958
2020-12-14 07:33:37 -08:00
kiyosora
15200e385a Enable torch.where() to support Float16 & BFloat16 type inputs (#49004)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/49075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49004

Reviewed By: zou3519

Differential Revision: D25495225

Pulled By: H-Huang

fbshipit-source-id: 09418ee5503f65c8862e40119c5802779505a4db
2020-12-11 13:36:41 -08:00
kshitij12345
eb9516eaa4 [numpy] torch.exp{2, m1}: promote integer inputs to float (#48926)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48926

Reviewed By: zhangguanheng66

Differential Revision: D25392344

Pulled By: mruberry

fbshipit-source-id: ddbabcfd58cc4c944153b1a224cc232efa022104
2020-12-10 00:14:22 -08:00
Kurt Mohler
27f7d1c286 Port eig CPU from TH to ATen (#43215)
Summary:
Also consolidates shared logic between `eig` CPU and CUDA implementations

Fixes https://github.com/pytorch/pytorch/issues/24693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43215

Reviewed By: VitalyFedyunin, zhangguanheng66

Differential Revision: D23862622

Pulled By: ngimel

fbshipit-source-id: ca1002428850520cd74cd5b7ed8cb4d12dbd9c52
2020-12-09 23:27:35 -08:00
Peter Bell
5765bbd78c Review memory overlap checks for advanced indexing operations (#48651)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45964

Indexing operators (e.g. `scatter`/`gather`) use tensor restriding, so the `TensorIterator` built-in overlap checking needs to be disabled. This adds the missing overlap checks for these operators.

In addition, some indexing operators don't work well with `MemOverlapStatus::FULL`, which is explicitly allowed by `assert_no_partial_overlap`. So, I've introduced `assert_no_overlap`, which raises an error on partial _or_ full overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48651

Reviewed By: zhangguanheng66

Differential Revision: D25401047

Pulled By: ngimel

fbshipit-source-id: 53abb41ac63c4283f3f1b10a0abb037169f20b89
2020-12-09 15:10:52 -08:00
Supriya Rao
7a4a2df225 Revert D25003113: make validate debug-only in Device copy ctr
Test Plan: revert-hammer

Differential Revision:
D25003113 (4b26cafb8f)

Original commit changeset: e17e6495db65

fbshipit-source-id: fd636c954a97bd80892464feb974a11b9dd96899
2020-12-09 13:58:11 -08:00
Brian Hirsh
4b26cafb8f make validate debug-only in Device copy ctr (#47854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47854

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25003113

Pulled By: bdhirsh

fbshipit-source-id: e17e6495db65c48c7daf3429acbd86742286a1f3
2020-12-09 08:11:24 -08:00
Rong Rong
58c13cf685 Back out "Revert D25375885: [pytorch][PR] Reenable some BF16 tests on CUDA"
Summary: Revert D25397144 69829f3fff4d4a2d1a71bb52e90d3c7f16b27fa3

Test Plan: Revert Hammer

Reviewed By: janeyx99

Differential Revision: D25397572

fbshipit-source-id: 625ca2a32e4558ae4582a15697b6e1cc57cc1573
2020-12-08 07:52:59 -08:00
Rong Rong
39445f718c Revert D25375885: [pytorch][PR] Reenable some BF16 tests on CUDA
Test Plan: revert-hammer

Differential Revision:
D25375885 (e3893b867f)

Original commit changeset: 2e19fe725ae9

fbshipit-source-id: 69829f3fff4d4a2d1a71bb52e90d3c7f16b27fa3
2020-12-08 07:05:33 -08:00
Xiang Gao
e3893b867f Reenable some BF16 tests on CUDA (#48805)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48805

Reviewed By: agolynski

Differential Revision: D25375885

Pulled By: ailzhang

fbshipit-source-id: 2e19fe725ae9450bd1a2bc4e2d308c59b9f94fac
2020-12-07 16:16:07 -08:00
Gao, Xiang
a39398b9e5 CUDA BF16 norm (#48806)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48806

Reviewed By: mruberry

Differential Revision: D25358465

Pulled By: ngimel

fbshipit-source-id: 1a2afd86f39e96db0754d04bf81de045b1e1235c
2020-12-06 23:41:05 -08:00
Kurt Mohler
2cb9204159 Add nondeterministic alert to index_copy, median CUDA and kthvalue CUDA (#46942)
Summary:
Also fixes issue where skipped tests did not properly restore deterministic flag.

Fixes https://github.com/pytorch/pytorch/issues/46743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46942

Reviewed By: heitorschueroff

Differential Revision: D25298020

Pulled By: mruberry

fbshipit-source-id: 14b1680e1fa536ec72018d0cdb0a3cf83b098767
2020-12-03 11:03:07 -08:00
Edward Yang
f9a0abfc43 Fix code review from #48659 and #48116 (#48731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48731

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278034

Pulled By: ezyang

fbshipit-source-id: 73652311b48d8d80c06e9385b7ff18ef3a158ae8
2020-12-03 08:26:17 -08:00
kshitij12345
90a3049a9a [fix] repr(torch.device) (#48655)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48585

In the following commit 4c9eb57914, the type of `DeviceIndex` was changed from `uint16_t` to `uint8_t`.
`uint8_t` is treated as an ASCII character by `std::cout` and other stream operators, hence the broken `repr`.

Stackoverflow Reference: https://stackoverflow.com/questions/19562103/uint8-t-cant-be-printed-with-cout
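
A minimal check of the fixed behaviour (a sketch; constructing the device object does not require a GPU):
```python
import torch

d = torch.device('cuda', 0)
print(repr(d))   # device(type='cuda', index=0) -- the index renders as a number, not as a character
```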

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48655

Reviewed By: bdhirsh

Differential Revision: D25272289

Pulled By: ezyang

fbshipit-source-id: a1549f5f8d417138cf38795e4c373e3a487d3691
2020-12-02 15:48:17 -08:00
Erjia Guan
c98c98d77d Migrate fmod and fmod_ from TH to ATen (CUDA) (#47323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47323

Fixes #24565

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24763086

Pulled By: ejguan

fbshipit-source-id: fa004baea19bbbdbeb44814903db29226805ef0e
2020-12-02 09:38:29 -08:00
Edward Yang
b4f5efa7b2 Structured kernels generate Meta registrations (#48116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48116

If you port kernels to be structured, you get Meta kernels automatically
generated for you.  This is one payoff of structured kernels.

Code generation was mercifully really simple, although at risk of
"swiss cheese" syndrome: there's two new conditionals in the codegen
to tweak behavior when generating for meta keys.  It's not too bad
right now but there's a risk of things getting out of hand.  One
way to rationalize the logic here would be to transmit "TensorMeta-ness"
inside the TensorOptions (so tensor_from_meta can deal with it); then
the "Meta" kernel magic would literally just be generating empty
out_impls to call after all the scaffolding is done.  But I didn't
do this because it seemed like it would be more annoying short term.

Also had to teach resize_ to work on meta tensors, since we use them
to implement the out kernels.
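
A rough illustration of what a generated Meta kernel buys you (a sketch, assuming a build where the op has been ported to structured kernels and the 'meta' device string is exposed):
```python
import torch

# meta tensors carry shape/dtype/stride but no data; Meta kernels only do shape inference
x = torch.empty(2, 3, device='meta')
y = torch.add(x, x)
print(y.shape, y.device)   # torch.Size([2, 3]) meta
```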

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer, ailzhang

Differential Revision: D25056640

Pulled By: ezyang

fbshipit-source-id: f8fcfa0dbb58a94d9b4196748f56e155f83b1521
2020-12-02 07:54:48 -08:00
kshitij12345
bcc85a363e [numpy] torch.sigmoid : promote integer inputs to float (#47551)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47551

Reviewed By: ngimel

Differential Revision: D25211953

Pulled By: mruberry

fbshipit-source-id: 9174cda401aeba0fd585a4c9bda166dbcf64f42f
2020-12-01 23:28:57 -08:00
Taylor Robie
27905dfe9c Expose CXX_FLAGS through __config__ (#47861)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47861

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25199263

Pulled By: robieta

fbshipit-source-id: 3cfdb0485d686a03a68dd0907d1733634857963f
2020-12-01 19:58:29 -08:00
Mike Ruberry
36c87f1243 Refactors test_torch.py to be fewer than 10k lines (#47356)
Summary:
Creates multiple new test suites to have fewer tests in test_torch.py, consistent with previous test suite creation like test_unary_ufuncs.py and test_linalg.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47356

Reviewed By: ngimel

Differential Revision: D25202268

Pulled By: mruberry

fbshipit-source-id: 75fde3ca76545d1b32b86d432a5cb7a5ba8f5bb6
2020-11-28 20:11:40 -08:00
kiyosora
272f4db043 Implement NumPy-like function torch.float_power() (#44937)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.float_power()` .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44937

Reviewed By: ngimel

Differential Revision: D25192119

Pulled By: mruberry

fbshipit-source-id: 2e446b8e0c2825f045fe057e30c9419335557a05
2020-11-27 18:01:42 -08:00
Antonio Cuni
344918576c Migrate eig from the TH to Aten (CUDA) (#44105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44105

Reviewed By: ngimel

Differential Revision: D25192116

Pulled By: mruberry

fbshipit-source-id: 87f1ba4924b9174bfe0d9e2ab14bbe1c6bae879c
2020-11-27 15:15:48 -08:00
elfringham
db1b0b06c4 Flake8 fixes (#48453)
Summary:
Quiet errors from flake8. Only a couple of code changes for deprecated Python syntax from before 2.4. The rest is just adding noqa markers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48453

Reviewed By: mruberry

Differential Revision: D25181871

Pulled By: ngimel

fbshipit-source-id: f8d7298aae783b1bce2a46827b088fc390970641
2020-11-25 19:09:50 -08:00
Xiao Wang
4ab2055857 Re-enable only cuda tests wrongly disabled before (#48429)
Summary:
Close https://github.com/pytorch/pytorch/issues/46536

Re-enable only cuda tests wrongly disabled in https://github.com/pytorch/pytorch/pull/45332

See discussions https://github.com/pytorch/pytorch/issues/46536#issuecomment-721386038 and https://github.com/pytorch/pytorch/pull/45332#issuecomment-721350987

~~See also https://github.com/pytorch/pytorch/pull/47237 and https://github.com/pytorch/pytorch/pull/47642~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48429

Reviewed By: ngimel

Differential Revision: D25176368

Pulled By: mruberry

fbshipit-source-id: 3822f5a45e58c0e387624e70ea272d16218901a9
2020-11-25 13:26:35 -08:00
kshitij12345
9ecaeb0962 [numpy] Add unary-ufunc tests for erf variants (#47155)
Summary:
Adding Unary Ufunc Test entry for `erf` variants.

We use scipy functions for reference implementation.

We can update the tests later, once these functions are changed to promote integer inputs to float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47155

Reviewed By: ngimel

Differential Revision: D25176654

Pulled By: mruberry

fbshipit-source-id: cb08efed1468b27650cec4f87a9a34e999ebd810
2020-11-25 13:20:14 -08:00
Fayçal Arbai
2e0a8b75d8 An implementation of torch.tile as requested in pytorch/pytorch#38349 (#47974)
Summary:
The approach is to simply reuse `torch.repeat`, adding one piece of functionality to `tile`: 1's are prepended to the reps array when the tensor has more dimensions than the reps given as input. Thus, for a tensor of shape (64, 3, 24, 24), reps of (2, 2) become (1, 1, 2, 2), which is what NumPy does.
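
For instance (a small sketch of the prepending behaviour, using a smaller tensor):
```python
import torch

t = torch.ones(2, 3, 4, 5)
# reps (2, 2) is treated as (1, 1, 2, 2) because the tensor has more dims than reps
print(torch.tile(t, (2, 2)).shape)   # torch.Size([2, 3, 8, 10])
```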

I've encountered some instability with the test on my end, where I could get a random failure (sometimes due to a random value of `self.dim()`, and sometimes segfaults). I'd appreciate any feedback on the test or an explanation for this instability so I can address it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47974

Reviewed By: ngimel

Differential Revision: D25148963

Pulled By: mruberry

fbshipit-source-id: bf63b72c6fe3d3998a682822e669666f7cc97c58
2020-11-24 18:07:25 -08:00
Kurt Mohler
b6654906c7 Fix assertEqual's handling of numpy array inputs (#48217)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48217

Reviewed By: mrshenli

Differential Revision: D25119607

Pulled By: mruberry

fbshipit-source-id: efe84380d3797d242c2aa7d43d2209bcba89cee0
2020-11-22 00:13:42 -08:00
Nikita Shulga
dc843fe197 Fix test_ldexp on Windows (#48335)
Summary:
Force `torch.randint` to generate tensor of int32 rather than tensor of int64
Delete unneeded copies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48335

Reviewed By: ranman

Differential Revision: D25133312

Pulled By: malfet

fbshipit-source-id: 70bfcb6b7ff3bea611c4277e6634dc7473541288
2020-11-20 15:41:59 -08:00
Randall Hunt
562d4c3bc5 Add basic ldexp operator for numpy compatibility (#45370)
Summary:
Adds ldexp operator for https://github.com/pytorch/pytorch/issues/38349

I'm not entirely sure the changes to `NamedRegistrations.cpp` were needed but I saw other operators in there so I added it.

Normally the ldexp operator is used along with the frexp to construct and deconstruct floating point values. This is useful for performing operations on either the mantissa and exponent portions of floating point values.
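
A minimal usage sketch:
```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
exp = torch.tensor([2, 3, 4])
print(torch.ldexp(x, exp))   # x * 2**exp -> tensor([ 4., 16., 48.])
```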

Sleef, std math.h, and cuda support both ldexp and frexp but not for all data types. I wasn't able to figure out how to get the iterators to play nicely with a vectorized kernel so I have left this with just the normal CPU kernel for now.

This is the first operator I'm adding so please review with an eye for errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45370

Reviewed By: mruberry

Differential Revision: D24333516

Pulled By: ranman

fbshipit-source-id: 2df78088f00aa9789aae1124eda399771e120d3f
2020-11-20 04:09:39 -08:00
kiyosora
008f840e7a Implement in-place method torch.cumsum_ and torch.cumprod_ (#47651)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47651

Reviewed By: zou3519

Differential Revision: D24992438

Pulled By: ezyang

fbshipit-source-id: c38bea55f4af1fc92be780eaa8e1d462316e6192
2020-11-19 11:20:12 -08:00
mfkasim91
8819bad86c Implement igammac (3rd PR) (#48171)
Summary:
Related: https://github.com/pytorch/pytorch/issues/46183 (torch.igamma)
This is the regularized upper incomplete gamma function.
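
A quick numerical sanity check (the regularized lower and upper incomplete gamma functions sum to one by definition):
```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
x = torch.tensor([0.5, 1.0, 2.0])
print(torch.igammac(a, x) + torch.igamma(a, x))   # tensor([1., 1., 1.])
```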

This is supposed to be exactly the same as https://github.com/pytorch/pytorch/issues/47463, but after rebasing the `viable/strict` branch.

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48171

Reviewed By: zhangguanheng66

Differential Revision: D25060107

Pulled By: mruberry

fbshipit-source-id: 89780dea21dbb2141cbc4f7f18192cb78a769b17
2020-11-18 23:44:32 -08:00
Edward Yang
a97d059614 Get TestTorch.test_empty_meta working again (#48113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48113

Fix is simple: just treat Meta as a backend covered by AutogradOther.
This semantically makes sense, since meta kernels are just like regular
CPU/CUDA kernels, they just don't do any compute.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25056641

Pulled By: ezyang

fbshipit-source-id: 7b68911982352b3e0ee8616b38cd9c70bd58a740
2020-11-18 19:50:27 -08:00
Scott Wolchok
4c9eb57914 [PyTorch] Narrow Device to 2 bytes by narrowing DeviceType and DeviceIndex (#47023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023

DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430

Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect

Reviewed By: dzhulgakov

Differential Revision: D24605460

fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
2020-11-18 19:39:40 -08:00
Mike Ruberry
ea1e78a0c5 Revert D24853669: [pytorch][PR] Migrate eig from the TH to Aten (CUDA)
Test Plan: revert-hammer

Differential Revision:
D24853669 (866f8591be)

Original commit changeset: a513242dc7f4

fbshipit-source-id: a0c8c424b61b1e627d9102de6b4c6d0717a6c06d
2020-11-18 16:53:18 -08:00
Antonio Cuni
866f8591be Migrate eig from the TH to Aten (CUDA) (#44105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44105

Reviewed By: heitorschueroff

Differential Revision: D24853669

Pulled By: mruberry

fbshipit-source-id: a513242dc7f49f55dbc6046c18d8a9d9aa2aaf8d
2020-11-18 12:10:18 -08:00
kshitij12345
68a3a3f3b5 Add torch.swapdims and torch.swapaxes (#46041)
Summary:
Reference https://github.com/pytorch/pytorch/issues/38349

Delegates to `torch.transpose` (not sure what is the best way to alias)
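
A minimal sketch of the aliases in action:
```python
import torch

x = torch.randn(2, 3, 4)
print(torch.swapaxes(x, 0, 2).shape)                              # torch.Size([4, 3, 2])
print(torch.equal(torch.swapdims(x, 0, 2), x.transpose(0, 2)))    # True, both delegate to transpose
```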

TODO:
* [x] Add test
* [x] Add documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46041

Reviewed By: gchanan

Differential Revision: D25022816

Pulled By: mruberry

fbshipit-source-id: c80223d081cef84f523ef9b23fbedeb2f8c1efc5
2020-11-18 11:35:53 -08:00
Ivan Yashchuk
81b1673a21 Enable complex tests that depend on batched matmul on CUDA (#47910)
Summary:
Now that https://github.com/pytorch/pytorch/pull/42553 is merged, we can delete a bit of code from the tests and enable some of the skipped complex tests.

Unfortunately, `test_pinverse_complex_xfailed` and `test_symeig_complex_xfailed` had bugs, and it wasn't caught automatically that these tests unexpectedly pass (xpass). We need to be careful next time with `unittest.expectedFailure`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47910

Reviewed By: zhangguanheng66

Differential Revision: D25052130

Pulled By: mruberry

fbshipit-source-id: 29512995c024b882f9cb78b7bede77733d5762d0
2020-11-18 10:44:47 -08:00
Heitor Schueroff
2ff748a680 Move kthvalue scalar test to separate method for XLA (#48042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48042

Moving scalar test to a separate method so the XLA team can continue to test for the other cases without failing. Requested here https://github.com/pytorch/xla/issues/2620#issuecomment-725696108

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25055677

Pulled By: heitorschueroff

fbshipit-source-id: 5da66bac78ea197821fee0b9b8a213ff2dc19c67
2020-11-18 07:49:14 -08:00
Xiang Gao
d293413b3e Batched matmul dtypes (#47873)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47873

Reviewed By: navahgar

Differential Revision: D24928256

Pulled By: anjali411

fbshipit-source-id: a26aef7a15a13fc0b5716e905971265d8b1cea61
2020-11-14 22:45:48 -08:00
anjali411
db1f217d8d Add complex support for torch.addcmul and torch.addcdiv (#46639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46639

Resolves: https://github.com/pytorch/pytorch/issues/46546#issuecomment-713122245

Test Plan: Imported from OSS

Reviewed By: izdeby, ansley

Differential Revision: D24879099

Pulled By: anjali411

fbshipit-source-id: 76131dc68ac964e67a633f62e07f7c799df4463e
2020-11-14 21:27:34 -08:00
Ivan Yashchuk
260daf088d Added linalg.cholesky (#46083)
Summary:
This PR adds `torch.linalg.cholesky` function that matches `numpy.linalg.cholesky`.

Fixed `lda` argument to `lapackCholesky` calls.
Added `random_hermitian_pd_matrix` helper function for tests.
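
A small usage sketch on a Hermitian positive-definite input:
```python
import torch

a = torch.randn(3, 3, dtype=torch.cdouble)
h = a @ a.conj().T + 3 * torch.eye(3, dtype=torch.cdouble)   # Hermitian positive-definite
l = torch.linalg.cholesky(h)
print(torch.allclose(l @ l.conj().T, h))                     # True
```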

Ref https://github.com/pytorch/pytorch/issues/42666.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46083

Reviewed By: ailzhang

Differential Revision: D24861752

Pulled By: mruberry

fbshipit-source-id: 214dbceb4e8a2c589df209493efd843962d25593
2020-11-13 16:50:40 -08:00
Richard Zou
1c7c612af0 Revert D24543682: [pytorch][PR] Added support for complex input for torch.lu_solve
Test Plan: revert-hammer

Differential Revision:
D24543682 (ffd0003022)

Original commit changeset: 165bde39ef95

fbshipit-source-id: 790b4157fdbc7149aaf0748555efe6daed7e1a23
2020-11-13 08:24:53 -08:00
Ivan Yashchuk
ffd0003022 Added support for complex input for torch.lu_solve (#46862)
Summary:
`torch.lu_solve` now works for complex inputs both on CPU and GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex dtypes, but I didn't modify/improve the body of the tests.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46862

Reviewed By: nikithamalgifb

Differential Revision: D24543682

Pulled By: anjali411

fbshipit-source-id: 165bde39ef95cafebf976c5ba4b487297efe8433
2020-11-13 02:35:31 -08:00
Gao, Xiang
0652d755d3 Fix some flaky tests in test_torch.py and test_nn.py (#46941)
Summary:
Fixed test:
- `test_is_nonzero`: this asserted an exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`; I changed it to a non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32, as well as the tensor factory forgets a `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941

Reviewed By: heitorschueroff

Differential Revision: D24852725

Pulled By: mruberry

fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
2020-11-12 22:35:42 -08:00
kshitij12345
3649a2c170 [numpy] torch.sqrt : promote integer inputs to float (#47293)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47293

Reviewed By: malfet

Differential Revision: D24855994

Pulled By: mruberry

fbshipit-source-id: 1e6752f2eeba6d638dea0bdea0c650cf722718c9
2020-11-12 16:16:09 -08:00
Ivan Yashchuk
149190c014 Added CUDA support for complex input for torch.solve (#47045)
Summary:
`torch.solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.
Differentiation also works correctly with complex inputs.

Fixes https://github.com/pytorch/pytorch/issues/41084
Ref. https://github.com/pytorch/pytorch/issues/33152

anjali411 I hope you don't mind that I took over https://github.com/pytorch/pytorch/pull/42737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47045

Reviewed By: nikithamalgifb

Differential Revision: D24921503

Pulled By: anjali411

fbshipit-source-id: 4c3fc4f193a84b6e28c43c08672d480715000923
2020-11-12 12:22:59 -08:00
Gregory Chanan
b6cb2caa68 Revert "Fixed einsum compatibility/performance issues (#46398)" (#47821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47821

This reverts commit a5c65b86ce.

 Conflicts:
	test/test_linalg.py

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24909923

Pulled By: gchanan

fbshipit-source-id: 9dcf98e7c4a3c7e5aaffe475867fa086f3bb6ff2
2020-11-12 08:11:40 -08:00
anjali411
e1ee3bfc0e Port bmm and baddbmm from TH to ATen (#42553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42553

Ports `torch.bmm` and `torch.baddbmm` from TH to ATen, as well as adds support for complex dtypes. Also removes dead TH code for Level 2 functions.

Closes #24539

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24893511

Pulled By: anjali411

fbshipit-source-id: 0eba3f2aec99c48b3018a5264ee7789279cfab58
2020-11-12 07:57:42 -08:00
Ivan Yashchuk
52ec8b9340 Added CUDA support for complex input for torch.triangular_solve (#46916)
Summary:
`torch.triangular_solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46916

Reviewed By: navahgar, agolynski

Differential Revision: D24706647

Pulled By: anjali411

fbshipit-source-id: fe780eac93d2ae1b2549539bb385e5fac25213b3
2020-11-11 16:08:11 -08:00
Ivan Yashchuk
a1db5b0f2b Added CUDA support for complex input for torch.inverse #2 (#47595)
Summary:
`torch.inverse` now works for complex inputs on GPU.
Opening a new PR here. The previous PR was merged and reverted due to a bug in tests marked with `slowTest`.
Previous PR https://github.com/pytorch/pytorch/pull/45034

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47595

Reviewed By: navahgar

Differential Revision: D24840955

Pulled By: anjali411

fbshipit-source-id: ec49fffdc4b3cb4ae7507270fa24e127be14f59b
2020-11-11 11:06:08 -08:00
Heitor Schueroff
a5c65b86ce Fixed einsum compatibility/performance issues (#46398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46398

This PR makes torch.einsum compatible with numpy.einsum, except for the sublist input option, as requested here: https://github.com/pytorch/pytorch/issues/21412. It also fixes two performance issues linked below and adds a check for reducing to torch.dot instead of torch.bmm, which is faster in some cases.

fixes #45854, #37628, #30194, #15671

fixes #41467 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.randn(10000, 100, 101, device='cuda')
b = torch.randn(10000, 101, 3, device='cuda')

c = torch.randn(10000, 100, 1, device='cuda')
d = torch.randn(10000, 100, 1, 3, device='cuda')

print(Timer(
    stmt='torch.einsum("bij,bjf->bif", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("bic,bicf->bif", c, d)',
    globals={'c': c, 'd': d}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413850>
torch.einsum("bij,bjf->bif", a, b)
  Median: 4.53 ms
  IQR:    0.00 ms (4.53 to 4.53)
  45 measurements, 1 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413700>
torch.einsum("bic,bicf->bif", c, d)
  Median: 63.86 us
  IQR:    1.52 us (63.22 to 64.73)
  4 measurements, 1000 runs per measurement, 1 thread
```

fixes #32591 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.rand(1, 1, 16, 2, 16, 2, 16, 2, 2, 2, 2, device="cuda")
b = torch.rand(729, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, device="cuda")

print(Timer(
    stmt='(a * b).sum(dim = (-3, -2, -1))',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("...ijk, ...ijk -> ...", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de28850>
(a * b).sum(dim = (-3, -2, -1))
  Median: 17.86 ms
  2 measurements, 10 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de286a0>
torch.einsum("...ijk, ...ijk -> ...", a, b)
  Median: 296.11 us
  IQR:    1.38 us (295.42 to 296.81)
  662 measurements, 1 runs per measurement, 1 thread
```

TODO

- [x] add support for ellipsis broadcasting
- [x] fix corner case issues with sumproduct_pair
- [x] update docs and add more comments
- [x] add tests for error cases

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860367

Pulled By: heitorschueroff

fbshipit-source-id: 31110ee598fd598a43acccf07929b67daee160f9
2020-11-10 19:38:43 -08:00
Heitor Schueroff
bf6a156f64 Fix kthvalue error for scalar input (#47600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47600

fixes https://github.com/pytorch/pytorch/issues/30818

Note that the median case was already fixed by https://github.com/pytorch/pytorch/pull/45847

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860337

Pulled By: heitorschueroff

fbshipit-source-id: 69ccbbb6c7c86671e5712b1c2056c012d898b4f2
2020-11-10 17:21:52 -08:00
kshitij12345
6575e674ce [numpy] torch.{all, any} : Extend Dtype Support (#44790)
Summary:
Reference https://github.com/pytorch/pytorch/issues/44779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44790

Reviewed By: bdhirsh

Differential Revision: D24393119

Pulled By: heitorschueroff

fbshipit-source-id: a9b88e9d06b3c282f2e5360b6eaea4ae8ef77c1d
2020-11-10 17:11:39 -08:00
Natalia Gimelshein
c9d37675b2 Back out "[pytorch][PR] The dimension being reduced should not be coalesced by TensorIterator" (#47642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47642

Original commit changeset: 02bb2b15694c

Test Plan: Covered by CI tests

Reviewed By: anjali411

Differential Revision: D24849072

fbshipit-source-id: a8790cbf46936aee7a6f504dac8595997175fc65
2020-11-10 16:31:33 -08:00
Radhakrishnan Venkataramani
163adb9fa7 Add HalfToFloat + FloatToHalf operators to PyTorch (#45092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092

Adding two operators
1. at::float_to_half -> Converts FP32 tensor to FP16 tensor
2. at::half_to_float -> Converts FP16 tensor to FP32 tensor.

These operators internally use the kernel provided by FBGeMM. Both C2 and PT will use the same FBGeMM kernel underneath.

Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*

Run benchmark locally using

```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```

AI Bench results are pending. I expect that not to finish as we have large queue with jobs pending for 2+ days.

Benchmark for 512x512 tensor with FbGeMM implementation

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```

Benchmark for 512x512 tensor trunk with no FbGeMM integration.

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```

Reviewed By: ngimel

Differential Revision: D23824869

fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
2020-11-10 12:00:53 -08:00
Gregory Chanan
65a72cae2c Fix type promotion for trace on CPU. (#47305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47305

Fixes https://github.com/pytorch/pytorch/issues/47127.

Ideally this would just use diag and sum (as the CUDA implementation does), but that seems to have performance problems, which I'll link in the github PR.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24729627

Pulled By: gchanan

fbshipit-source-id: 151b786b53e7b958f0929c803dbf8e95981c6884
2020-11-10 07:46:03 -08:00
John Kilpatrick
8aca85dbcd Add diagflat complex support (#47564)
Summary:
Adds complex numbers support for `torch.diag`
``` python
>>> import torch
>>> a = torch.ones(2, dtype=torch.complex128)
>>> torch.diagflat(a)
tensor([[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]], dtype=torch.complex128)
>>> b = a.cuda()
>>> torch.diagflat(b)
tensor([[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]], device='cuda:0', dtype=torch.complex128)
```

Note that automatic differentiation isn't implemented:
``` python
>>> d = torch.ones(1, dtype=torch.complex128, requires_grad=True)
>>> torch.diagflat(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: diag does not support automatic differentiation for outputs with complex dtype.
```

Fixes https://github.com/pytorch/pytorch/issues/47499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47564

Reviewed By: heitorschueroff

Differential Revision: D24844467

Pulled By: anjali411

fbshipit-source-id: 9c8cb795d52880b7dcffab0c059b0f6c2e5ef151
2020-11-09 20:28:23 -08:00
Xiang Gao
f23a2a1115 The dimension being reduced should not be coalesced by TensorIterator (#47237)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583#issuecomment-720172838

Also add overload of `<<` for convenience of debugging.

This PR is tested by `test_reduction_split_cuda` which was added in https://github.com/pytorch/pytorch/pull/37788.

Reproduce
```python
import torch

a = torch.zeros(8, 1, 128, 1024, 1024)
a.cuda().sum(1)
```

Before

```
TensorIterator @ 0x7ffd05b10ba0 {
  ntensors() = 2
  noutputs() = 1
  shape() = [1073741824]
  strides(*) = {
    (0) = [4]
    (1) = [4]
  }
  dtype(*) = {
    (0) = Float
    (1) = Float
  }
  is_reduction_ = 1
}
```

After

```
TensorIterator @ 0x7fffc9051010 {
  ntensors() = 2
  noutputs() = 1
  shape() = [1, 1073741824]
  strides(*) = {
    (0) = [0, 4]
    (1) = [536870912, 4]
  }
  dtype(*) = {
    (0) = Float
    (1) = Float
  }
  is_reduction_ = 1
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47237

Reviewed By: ejguan

Differential Revision: D24734763

Pulled By: ngimel

fbshipit-source-id: 02bb2b15694c68f96434f55033b63b6e5ff7085b
2020-11-07 01:30:24 -08:00
Xiong Wei
f90da88d8f Add complex support for torch.mean [CUDA] (#47048)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47048

Reviewed By: heitorschueroff

Differential Revision: D24729895

Pulled By: anjali411

fbshipit-source-id: 8e948480eb87c37de810207edf909375c0380772
2020-11-06 21:29:19 -08:00
Howard Huang
451e7d3db4 Enable diag for bool Tensors (#47455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47455

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24772483

Pulled By: H-Huang

fbshipit-source-id: 08ea4af4352972617db3c6475943b326f36b3049
2020-11-06 21:29:17 -08:00
Howard Huang
3253ccbd9f Add bool tensor support for where (#47454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47454

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24772482

Pulled By: H-Huang

fbshipit-source-id: ea488aae5bf64ac20f7a5d001e8edf55eed16eaf
2020-11-06 21:26:24 -08:00
Rong Rong
5614f72534 Suppres test issues in test_torch running in sandcastle (#47474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47474

After enabling GPU/Re, some issues were specific to those runs

Test Plan:
```
buck test -c test.external_runner=tpx mode/opt //caffe2/test:torch_cuda -- --use-remote-execution --force-tpx --run-disabled
```

Reviewed By: malfet, janeyx99

Differential Revision: D24771578

fbshipit-source-id: 1ada79dae12c8cb6f795a0d261c60f038eee2dfb
2020-11-06 10:34:28 -08:00
Edward Yang
1aeefcdaa6 Revert D24730264: [pytorch][PR] Added CUDA support for complex input for torch.inverse
Test Plan: revert-hammer

Differential Revision:
D24730264 (33acbedace)

Original commit changeset: b9c94ec46301

fbshipit-source-id: beb9263700e9bc92685f74c37c46aa33f3b595b9
2020-11-06 07:28:14 -08:00
Ivan Yashchuk
33acbedace Added CUDA support for complex input for torch.inverse (#45034)
Summary:
`torch.inverse` now works for complex inputs on GPU.
Test cases with complex matrices are xfailed for now. For example, batched matmul does not work with complex yet.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45034

Reviewed By: zou3519

Differential Revision: D24730264

Pulled By: anjali411

fbshipit-source-id: b9c94ec463012913c117278a884adeee96ea02aa
2020-11-05 16:30:11 -08:00
Heitor Schueroff
a4ba018e57 Updated docs/test for dot and vdot (#47242)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47242

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D24733771

Pulled By: heitorschueroff

fbshipit-source-id: 92e3b0e28e0565918335fa85d52abe5db9eeff57
2020-11-05 06:27:50 -08:00
Xiang Gao
f19637e6ee Expand the test of torch.addbmm and torch.baddbmm (#47079)
Summary:
This is to satisfy the request at https://github.com/pytorch/pytorch/pull/42553#issuecomment-673673914. See also https://github.com/pytorch/pytorch/pull/47124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47079

Reviewed By: ejguan

Differential Revision: D24735356

Pulled By: ngimel

fbshipit-source-id: 122fceb4902658f350c2fd6f92455adadd0ec2a4
2020-11-04 21:11:26 -08:00
Xiang Gao
030caa190f Expand the test of torch.bmm on CUDA (#47124)
Summary:
basically https://github.com/pytorch/pytorch/pull/47070, enabled on all CI with `ci-all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47124

Reviewed By: ejguan

Differential Revision: D24735130

Pulled By: ngimel

fbshipit-source-id: c2124562a9f9d1caf24686e5d8a1106c79366233
2020-11-04 17:29:34 -08:00
Brian Hirsh
fe17269e75 Revert "Revert D24335982: explicitly error out in comparison ops when the types don't match" (#47288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47288

This reverts commit b3eb0c86cf.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24706531

Pulled By: bdhirsh

fbshipit-source-id: f3bf34ddba7882932155819251b6c7dcb5c6b56c
2020-11-04 09:27:47 -08:00
Erjia Guan
f1ac63d324 Implement copysign (#46396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396

Related #38349

[numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign)
- No in-place function
- No method
- Optional output
- Available: byte, char, bool, int, short, long, float, double, half
- Integral promoted to float
- Not available: float/double complex

`c = np.copysign(a, b)`
| a  | b  | c  | a.grad |
|----|----|----|--------|
| -1 | -1 | -1 |  1     |
| -0 | -1 | -0 |  0     |
|  0 | -1 | -0 |  0     |
|  1 | -1 | -1 | -1     |
| -1 | -0 | -1 |  1     |
| -0 | -0 |  0 |  0     |
|  0 | -0 |  0 |  0     |
|  1 | -0 | -1 | -1     |
| -1 |  0 |  1 | -1     |
| -0 |  0 |  0 |  0     |
|  0 |  0 |  0 |  0     |
|  1 |  0 |  1 |  1     |
| -1 |  1 |  1 | -1     |
| -0 |  1 |  0 |  0     |
|  0 |  1 |  0 |  0     |
|  1 |  1 |  1 |  1     |

This function becomes **non-differentiable** at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0.
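
A minimal sketch of the semantics (magnitude from the first argument, sign from the second):
```python
import torch

a = torch.tensor([-1.5, 0.0, 2.0])
b = torch.tensor([ 1.0, -1.0, -3.0])
print(torch.copysign(a, b))   # tensor([ 1.5000, -0.0000, -2.0000])
```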

TODO:
- [x] test (cpu/gpu)
- [x] doc
- [x] ~kernel_vec~

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24401366

Pulled By: ejguan

fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d
2020-11-04 08:08:57 -08:00
Qi Zhou
0ec717c830 Support int32 indices and offsets in nn.EmbeddingBag (#46758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
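
A minimal sketch of the intended usage (assuming indices and offsets are given the same integer dtype, as enforced here):
```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(10, 4, mode='sum')
indices = torch.tensor([1, 2, 4, 5, 4, 3], dtype=torch.int32)
offsets = torch.tensor([0, 3], dtype=torch.int32)
print(bag(indices, offsets).shape)   # torch.Size([2, 4])
```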

Test Plan: unit tests

Reviewed By: ngimel

Differential Revision: D24470808

fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
2020-11-03 23:33:50 -08:00
Howard Huang
a8ef4d3f0b Provide 'out' parameter for 'tensordot' (#47278)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42102

Added an optional out parameter to the tensordot operation to allow using buffers.
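
A minimal usage sketch of the new `out` argument:
```python
import torch

a = torch.randn(3, 4, 5)
b = torch.randn(4, 5, 6)
out = torch.empty(3, 6)
torch.tensordot(a, b, dims=2, out=out)   # result is written into the preallocated buffer
print(out.shape)                         # torch.Size([3, 6])
```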

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47278

Test Plan: pytest test/test_torch.py -k tensordot -v

Reviewed By: agolynski

Differential Revision: D24706258

Pulled By: H-Huang

fbshipit-source-id: eb4bcd114795f67de3a670291034107d2826ea69
2020-11-03 15:56:00 -08:00
Xiao Wang
774b638eb6 Change largeCUDATensorTest to largeTensorTest+onlyCUDA; add a buffer to large cuda tensor test (#45332)
Summary:
Effectively, `largeCUDATensorTest` = `largeTensorTest` + `onlyCUDA`.

There was a problem where a user got an OOM for a `largeCUDATensorTest('16GB')` on a 16GB V100. This decorator checked the total memory of a GPU device, but in most cases we can't allocate all of the memory that a GPU has. So it is beneficial to have a buffer on this `largeTensorTest` check for CUDA; I added a 10% buffer to it.

Definition of `largeTensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L560-L578)

`_has_sufficient_memory`

d22dd80128/torch/testing/_internal/common_device_type.py (L535-L557)

`largeCUDATensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L526-L532)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45332

Reviewed By: ngimel

Differential Revision: D24698690

Pulled By: mruberry

fbshipit-source-id: a77544478e45ce271f6639ea04e87700574ae307
2020-11-03 11:43:49 -08:00
Richard Zou
86151da19e Port CPU Trace from TH to ATen (#47126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47126

Context
-------
This PR is a rebase of shihongzhi's https://github.com/pytorch/pytorch/pull/35360.
I forgot to merge it back when it was submitted so I rebased it and ran new benchmarks on it.

Benchmarks
----------

TL;DR: The op has more overhead than the TH version but for larger shapes the overhead disappears.

```
import torch

shapes = [
    [1, 1],
    [100, 100],
    [1000, 1000],
    [10000, 10000],
    [100000, 100000],
]

for shape in shapes:
    x = torch.ones(shape)
    %timeit x.trace()

Before:
1.83 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.98 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.19 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
85.2 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.23 ms ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After:
2.16 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
2.08 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.45 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.8 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.27 ms ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Future work
-----------
Things that can be done after this PR:
- add complex tensor support
- Fix the type promotion discrepancy between CPU and CUDA

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24683259

Pulled By: zou3519

fbshipit-source-id: f92b566ad0d58b72663ab64899d209c96edb78eb
2020-11-02 16:03:22 -08:00
Richard Zou
8054ae3e77 Add test for trace (#47125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47125

We didn't actually have any tests for torch.trace. The tests expose a
discrepancy between the behavior of torch.trace on CPU and CUDA that
I'll file an issue for.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24683260

Pulled By: zou3519

fbshipit-source-id: 71dd3af62bc98c6b9b0ba2bf2923cb6d44daa640
2020-11-02 16:00:33 -08:00
Brian Hirsh
b3eb0c86cf Revert D24335982: explicitly error out in comparison ops when the types don't match
Test Plan: revert-hammer

Differential Revision:
D24335982 (60fea510a1)

Original commit changeset: 3dfb02bcb403

fbshipit-source-id: 00072f1b00e228bbbe295053091cf4a7a46f4668
2020-11-02 14:08:01 -08:00
Xiong Wei
22b3d414de Enhance the torch.pow testcase for the complex scalar base (#47101)
Summary:
Related https://github.com/pytorch/pytorch/issues/45259

This PR is to address the https://github.com/pytorch/pytorch/pull/45259#discussion_r514390664

- leverage the `make_tensor`  function to generate a random tensor as the exponent, preventing the full zeros for the integer exponent.
- add some special cases for the zero exponents and the `1 + 0j` base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47101

Reviewed By: mruberry

Differential Revision: D24682430

Pulled By: zou3519

fbshipit-source-id: f559dc0ba08f37ae070036fb25a52ede17a24149
2020-11-02 13:13:15 -08:00
Brian Hirsh
60fea510a1 explicitly error out in comparison ops when the types don't match (#46399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46399

Explicitly error out in comparison/logical ops when the dtypes of the various input/output tensors don't match. See [this comment](https://github.com/pytorch/pytorch/pull/46399#discussion_r505686406) for more details.

fixes #42660

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24335982

Pulled By: bdhirsh

fbshipit-source-id: 3dfb02bcb403dda5bcbf5ed3eae543354ad698b2
2020-11-02 11:42:32 -08:00
Nikita Shulga
edac4060d7 Fix mul cuda for bool (#47031)
Summary:
Also, add tests for tensor by scalar multiplication / division

Fixes https://github.com/pytorch/pytorch/issues/47007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47031

Reviewed By: walterddr

Differential Revision: D24608874

Pulled By: malfet

fbshipit-source-id: 4e15179904814d6e67228276d3d11ff1b5d15d0d
2020-10-30 10:38:32 -07:00
Heitor Schueroff
ddeacf1565 Fix median bug on discontigous tensors (#46917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46917

fixes https://github.com/pytorch/pytorch/issues/46814

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633412

Pulled By: heitorschueroff

fbshipit-source-id: 54732671b298bdc2b04b13ab3a373892ee0933c3
2020-10-29 17:12:22 -07:00
Xiong Wei
74d730c0b5 implement NumPy-like functionality column_stack, row_stack (#46313)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

This PR implements `column_stack` as a composite of `torch.reshape` and `torch.hstack`, and makes `row_stack` an alias of `torch.vstack`.
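
A minimal sketch:
```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
print(torch.column_stack((a, b)))                                  # 1-D inputs become columns
print(torch.equal(torch.row_stack((a, b)), torch.vstack((a, b))))  # True (alias)
```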

Todo

- [x] docs
- [x] alias pattern for `row_stack`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313

Reviewed By: ngimel

Differential Revision: D24585471

Pulled By: mruberry

fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c
2020-10-29 12:14:39 -07:00
mfkasim91
6eaa324c9f Implement torch.igamma (#46183)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41637
This is regularized lower incomplete gamma function, equivalent to scipy's `gammainc` and tensorflow `igamma`.

cc fritzo mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46183

Reviewed By: gchanan

Differential Revision: D24479126

Pulled By: mruberry

fbshipit-source-id: fdf8ea289fe4ca1b408810732192411e948fcdfe
2020-10-29 11:40:18 -07:00
Sameer Deshmukh
2249a293b7 Fix segfault with torch.orgqr. (#46700)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41768

The fault was that a NULL `tau` would get passed to the LAPACK function. This PR fixes that by checking, at the beginning of the function, whether `tau` contains 0 elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46700

Reviewed By: albanD

Differential Revision: D24616427

Pulled By: mruberry

fbshipit-source-id: 92e8f1489b113c0ceeca6e54dea8b810a51a63c3
2020-10-29 10:34:39 -07:00
Kurt Mohler
b75b961934 Fix requires_grad arg for new_full, new_empty, new_zeros (#46486)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46486

Reviewed By: gchanan

Differential Revision: D24497034

Pulled By: ezyang

fbshipit-source-id: 769a7f00f9a8f7cb77273a1193173a837ae7e32f
2020-10-28 09:34:53 -07:00
kiyosora
53839ac9d7 Fix internal assert for torch.heaviside with cuda tensor and cpu scalar tensor (#46831)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/46681

```
>>> x = torch.randn(10, device='cuda')
>>> y = torch.tensor(1.)
>>> torch.heaviside(x, y)
tensor([0., 1., 0., 1., 1., 0., 1., 1., 1., 0.], device='cuda:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46831

Reviewed By: navahgar

Differential Revision: D24567953

Pulled By: izdeby

fbshipit-source-id: e5fcf4355b27ce0bdf434963d01863d3b24d0bea
2020-10-27 16:47:33 -07:00
Hong Xu
bcbb6baccf Add a warning message that torch.sign would not support complex numbers (#43280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43280

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24538769

Pulled By: anjali411

fbshipit-source-id: ab2d5283501e4c1d7d401d508e32f685add7ebb1
2020-10-26 21:13:12 -07:00
Xiang Gao
7731370e71 CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44997

Reviewed By: izdeby

Differential Revision: D24547748

Pulled By: ngimel

fbshipit-source-id: 34639dfe6ca41c3f59fd2af861e5e3b1bb86757a
2020-10-26 16:01:22 -07:00
Xiang Gao
99cf3b1ce4 CUDA BFloat16 signal windows (#45155)
Summary:
Looks like this op is never tested for the support of different dtypes?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45155

Reviewed By: zou3519

Differential Revision: D24438839

Pulled By: ngimel

fbshipit-source-id: 103ff609e11811a0705d04520c2b97c456b623ef
2020-10-26 15:53:30 -07:00
Alexander Grund
93719440b8 Replace map(lambda constructs (#46462)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal

Makes them more readable and possibly faster. Care has to be taken because `map` applies the function immediately while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`)
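
A small illustration of the kind of rewrite this PR performs (plain Python, not tied to any particular call site):
```python
xs = [1, 2, 3]

# before: list(map(lambda x: x * x, xs))
squares = [x * x for x in xs]          # list comprehension: same values, arguably more readable

# generator expression: values are produced lazily as the consumer iterates
total = sum(x * x for x in xs)
```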

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462

Reviewed By: zou3519

Differential Revision: D24422343

Pulled By: ezyang

fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
2020-10-22 09:50:22 -07:00
Pearu Peterson
905ed3c840 Revised sparse tensor documentation. (#45400)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44635.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45400

Reviewed By: ezyang

Differential Revision: D24359410

Pulled By: mruberry

fbshipit-source-id: 37c691a49a7b0042c7a298e0ed1226702b097c8b
2020-10-22 02:07:54 -07:00
Xiao Wang
fe4f90c40b Cusolver inverse check info (#46625)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46625

Reviewed By: zou3519

Differential Revision: D24438577

Pulled By: ngimel

fbshipit-source-id: d00e6eb2eae4aa39ca6ecf5914fe9cf37c24b906
2020-10-21 21:46:33 -07:00
lixinyu
a651b876a7 preserve non-dense or overlapping tensor's layout in *_like functions (#46046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46046

*_like functions are used in PyTorch to create a new tensor with the same shape as the input tensor. But we don’t always preserve the layout permutation of the tensor. The current behavior is that, for a dense and non-overlapping tensor, its layout permutation is preserved. For example, passing a channels-last contiguous tensor t with ‘shape/stride’ (2, 4, 3, 2)/(24, 1, 8, 4) to the empty_like(t) function will create a new tensor with exactly the same ‘shape/stride’ as the input tensor t. However, if the input tensor is non-dense or has overlapping memory, we simply create a contiguous tensor based on the input tensor’s shape, so the tensor layout permutation is lost.

This PR preserves the layout permutation for non-dense or overlapping tensor. The strides propagation rule that used in this PR is exactly the same as what is being used in TensorIterator.  The behavior changes are listed below:

```
# strided tensors
a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)
print(a.stride())         # old: (2, 24, 8)   new: (2, 24, 8)
print(a.exp().stride())   # old: (6, 3, 1)    new: (1, 12, 4)
print((a+a).stride())     # old: (1, 12, 4)   new: (1, 12, 4)
out = torch.empty(0)
torch.add(a,a,out=out)
print(out.stride())       # old: (6, 3, 1)    new: (1, 12, 4)

# memory dense tensors
a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))
print(a.stride(), (a+torch.randn(1)).stride())   # old: (1, 3, 3) (1, 1, 1)   new: (1, 3, 3) (1, 3, 3)
a=torch.randn(2,3,4).permute(2,0,1)
print(a.stride())         # old: (1, 12, 4)   new: (1, 12, 4)
print(a.exp().stride())   # old: (6, 3, 1)    new: (1, 12, 4)
print((a+a).stride())     # old: (1, 12, 4)   new: (1, 12, 4)
out = torch.empty(0)
torch.add(a,a,out=out)
print(out.stride())       # old: (6, 3, 1)    new: (1, 12, 4)
```

This is to solve the non-dense tensor layout problem in #45505

TODO:
- [x] Fix all the BC broken test cases in pytorch
- [ ] Investigate if any fb internal tests are broken

This change will cover all kinds of non-dense tensors.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24288970

Pulled By: glaringlee

fbshipit-source-id: 320fd4e0d1a810a12abfb1441472298c983a368d
2020-10-20 19:49:49 -07:00
Kurt Mohler
e6ed887908 Add view test for tensor_split (#46427)
Summary:
Fulfills Mike's suggestion here: https://github.com/pytorch/pytorch/pull/44868#discussion_r505095018
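
For context, a small sketch of the view behavior being exercised (my own example, not the added test itself):

```
import torch

t = torch.arange(6)
a, b = torch.tensor_split(t, 2)
a[0] = 99
print(t[0])   # tensor(99) -- the returned pieces are views of the input
```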

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46427

Reviewed By: ezyang

Differential Revision: D24355107

Pulled By: mruberry

fbshipit-source-id: bddef2f9c2c41b5c5ac47a17d5ecdda580072e99
2020-10-20 09:56:37 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug uncovered in the process, where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537); see the sketch below.
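
A hypothetical illustration of both points, including the kind of argument-order slip mentioned above (not the actual diff):

```
import torch

params = [torch.nn.Parameter(torch.zeros(2)), torch.nn.Parameter(torch.ones(3))]

# list(map(...)) works, but swapping the arguments is an easy mistake:
shapes = list(map(lambda p: p.shape, params))
# broken = list(map(params, lambda p: p.shape))   # TypeError: arguments reversed

# the comprehension reads left to right and leaves no room for that slip
shapes = [p.shape for p in params]
```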

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Ailing Zhang
8c629ecc9a [WIP] Move catchAll to Math (#45939)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45939

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165890

Pulled By: ailzhang

fbshipit-source-id: 72fe71ea95a738251b2fafc9eea4ab3831cf426b
2020-10-16 16:17:16 -07:00
Nikita Vedeneev
9300a27702 Make torch.lu support complex input on CUDA. (#45898)
Summary:
As per title. LU decomposition is used for computing determinants, and I need this functionality to implement the matrix square root. Next PR on my list is to enable `torch.det` on CUDA with complex input.
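
A minimal sketch of the new capability (`torch.lu` was the factorization entry point at the time; newer releases expose this under `torch.linalg`):

```
import torch

# assumes a CUDA build with this change
A = torch.randn(3, 3, dtype=torch.complex128, device="cuda")
A_LU, pivots = torch.lu(A)   # complex CUDA input is now accepted
print(A_LU.dtype)            # torch.complex128
```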

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45898

Reviewed By: heitorschueroff

Differential Revision: D24306951

Pulled By: anjali411

fbshipit-source-id: 168f578fe65ae1b978617a66741aa27e72b2172b
2020-10-16 10:29:39 -07:00
Jane Xu
c99378af1b Fixing pow for special case between cuda tensors and cpu tensors and reframed test cases a tiny bit (#46320)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037

The special case is now isolated to CUDA tensor bases combined with CPU tensor exponents. My previous fix was incomplete: it fixed some cases but broke others. The current fix is more thorough:
```
In [1]: import torch
In [2]: a=torch.randn(3)
In [3]: b=torch.tensor(2, device="cuda")
In [4]: torch.pow(a,b) #should not work and throws exception now!

In [5]: a=torch.tensor(3, device="cuda")
In [6]: b=torch.tensor(2)
In [7]: torch.pow(a,b) #should work, and now does

In [8]: a=torch.randn(3, device="cuda")
In [9]: torch.pow(a,b) # yeah, that one is fixed and still works
```

To add a test case to reflect the change, I had to modify the existing setup a little bit. I think it is an improvement but would appreciate any tips on how to make it better!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46320

Reviewed By: malfet

Differential Revision: D24306610

Pulled By: janeyx99

fbshipit-source-id: cc74c61373d1adc2892a7a31226f38895b83066a
2020-10-15 13:43:47 -07:00
Ivan Yashchuk
c1141b6f68 Added support for complex torch.pinverse (#45819)
Summary:
This PR adds support for complex-valued input for `torch.pinverse`.
Fixed the CUDA SVD implementation to return singular values with a real dtype.
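
A minimal usage sketch of the added complex support (assuming a CUDA build):

```
import torch

A = torch.randn(3, 5, dtype=torch.complex64, device="cuda")
P = torch.pinverse(A)
print(P.shape)                                   # torch.Size([5, 3])
print(torch.allclose(A @ P @ A, A, atol=1e-4))   # expect True (A A+ A == A)
```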

Fixes https://github.com/pytorch/pytorch/issues/45385.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45819

Reviewed By: heitorschueroff

Differential Revision: D24306539

Pulled By: anjali411

fbshipit-source-id: 2fe19bc630de528e0643132689e1bc5ffeaa162a
2020-10-15 12:28:22 -07:00
Xiang Gao
5ce46fbbca BFloat16 support for torch.sign (#45244)
Summary:
Added BF16 support for torch.sign on CUDA
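
A quick sanity check of the new dtype coverage (assuming a CUDA build):

```
import torch

x = torch.tensor([-1.5, 0.0, 2.0], dtype=torch.bfloat16, device="cuda")
print(torch.sign(x))   # tensor([-1., 0., 1.], device='cuda:0', dtype=torch.bfloat16)
```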

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45244

Reviewed By: zou3519

Differential Revision: D23932304

Pulled By: izdeby

fbshipit-source-id: e50b9510ecf2337ec0288392d6950046116b2599
2020-10-15 12:23:14 -07:00
Jane Xu
ad376f1a62 trying to make pow work for tensor raised to the power of a scalar (#46185)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037

I'm not sure this is the most performant solution, but this works:

torch.pow(cuda_tensor, 5) should work and worked before.
torch.pow(cuda_tensor, torch.tensor(5)) should work **and now does!**
torch.pow(cuda_tensor, torch.tensor((5,))) should NOT work; it should complain that the tensors are on different devices, and it indeed still does.
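
The three cases above, as a runnable sketch (assuming a CUDA build):

```
import torch

a = torch.randn(3, device="cuda")
torch.pow(a, 5)                      # always worked: Python scalar exponent
torch.pow(a, torch.tensor(5))        # works now: 0-dim CPU tensor treated as a scalar
# torch.pow(a, torch.tensor((5,)))   # still raises: 1-element CPU tensor is on a different device
```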

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46185

Reviewed By: glaringlee, malfet

Differential Revision: D24257687

Pulled By: janeyx99

fbshipit-source-id: 2daf235d62ec5886d7c153da05445c2ec71dec98
2020-10-13 10:14:36 -07:00
Erjia Guan
bed3b40523 Implement ravel (#46098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46098

Doc:
![image](https://user-images.githubusercontent.com/68879799/95611323-ae5cf380-0a2f-11eb-9b8e-56bf79ce68af.png)
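
A minimal usage sketch:

```
import torch

t = torch.arange(6).reshape(2, 3)
print(torch.ravel(t))   # tensor([0, 1, 2, 3, 4, 5]) -- contiguous, flattened
```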

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24253213

Pulled By: ejguan

fbshipit-source-id: 42a866c902272cbe3743a9d0cb3afb9165d51c0b
2020-10-12 16:00:44 -07:00
kshitij12345
a814231616 [fix] torch.kthvalue : handle non-contiguous CUDA tensor (#45802)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45721

TODO
* [x] Test
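
A small sketch of the case being fixed, i.e. a non-contiguous CUDA input (my own example, assuming a CUDA build):

```
import torch

x = torch.randn(5, 4, device="cuda").t()        # transposed -> non-contiguous
values, indices = torch.kthvalue(x, k=2, dim=1)
print(values.shape)                              # torch.Size([4])
```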

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45802

Reviewed By: ngimel

Differential Revision: D24236706

Pulled By: mruberry

fbshipit-source-id: 5a51049233efa710f9500a6f7d099c90d43062c9
2020-10-11 20:13:08 -07:00
Kurt Mohler
a0a8bc8870 Fix mistakes and increase clarity of norm documentation (#42696)
Summary:
* Removes incorrect statement that "the vector norm will be applied to the last dimension".
* More clearly describes each different combination of `p`, `ord`, and input size.
* Moves norm tests from `test/test_torch.py` to `test/test_linalg.py`
* Adds a test ensuring that `p='fro'` and `p=2` give the same results for mutually valid inputs

Fixes https://github.com/pytorch/pytorch/issues/41388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42696

Reviewed By: bwasti

Differential Revision: D23876862

Pulled By: mruberry

fbshipit-source-id: 36f33ccb6706d5fe13f6acf3de8ae14d7fbdff85
2020-10-10 14:12:43 -07:00
Nikita Shulga
f363a2e106 Mark top 3 slowest tests as slow (#46068)
Summary:
`TCPStoreTest.test_numkeys_delkeys` takes 5+ min (mostly in idle wait for socket timeout)
`TestDataLoader.test_proper_exit` and `TestDataLoaderPersistentWorkers.test_proper_exit` take 2.5 min each
`TestXNNPACKConv1dTransformPass.test_conv1d_with_relu_fc` takes 2 min to finish

Add an option to `print_test_stats.py` to skip reporting test classes that run for less than a second, and speed up `TestTorchDeviceTypeCUDA.test_matmul_45724_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46068

Reviewed By: mruberry

Differential Revision: D24208660

Pulled By: malfet

fbshipit-source-id: 780e0d8be4f0cf69ea28de79e423291a1f3349b7
2020-10-08 21:10:03 -07:00
Ivan Yashchuk
f010df35e5 Added CUDA support for complex input for QR decomposition (#45032)
Summary:
QR decomposition now works for complex inputs on GPU.

Ref. https://github.com/pytorch/pytorch/issues/33152
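
A minimal sketch (using `torch.qr`, which was the API at the time; newer releases prefer `torch.linalg.qr`):

```
import torch

# assumes a CUDA build with this change
A = torch.randn(4, 4, dtype=torch.complex64, device="cuda")
Q, R = torch.qr(A)
print(torch.allclose(Q @ R, A, atol=1e-4))   # expect True: A is reconstructed
```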

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45032

Reviewed By: ailzhang

Differential Revision: D24199105

Pulled By: anjali411

fbshipit-source-id: 249552b31fd713446e609b66e508ac54b817b98e
2020-10-08 13:24:21 -07:00
Heitor Schueroff de Souza
636eb18029 Fixed median nan propagation and implemented nanmedian (#45847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847

Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.
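
The behavior after this change, in short:

```
import torch

t = torch.tensor([1.0, float("nan"), 3.0, 2.0])
print(torch.median(t))      # tensor(nan) -- NaN now propagates
print(torch.nanmedian(t))   # tensor(2.)  -- NaN values are ignored
```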

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24136629

Pulled By: heitorschueroff

fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
2020-10-08 11:20:21 -07:00
Kurt Mohler
ef4817fe5a Add tensor_split function, based on numpy.array_split (#45168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9382
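
A minimal sketch of the numpy.array_split-style behavior, where uneven splits are allowed:

```
import torch

parts = torch.tensor_split(torch.arange(7), 3)
print([p.tolist() for p in parts])   # [[0, 1, 2], [3, 4], [5, 6]]
```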

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45168

Reviewed By: ngimel

Differential Revision: D24166164

Pulled By: mruberry

fbshipit-source-id: 795459821e52885bc99623a01a2abec060995ce6
2020-10-07 23:14:48 -07:00
Xiang Gao
b2bff9e431 Workaround for cublas bug for 45724 (#46001)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46001

Reviewed By: mruberry

Differential Revision: D24184058

Pulled By: ngimel

fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9
2020-10-07 22:38:19 -07:00
Your Name
c59c4b0d77 Fix cholesky TF32 tests (#45492)
Summary:
This test was changed one day before the TF32 tests PR landed, so the fix for it is not included in that PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45492

Reviewed By: ezyang

Differential Revision: D24101876

Pulled By: ngimel

fbshipit-source-id: cb3615b2fb8acf17abe54cd18b1faec26582d6b6
2020-10-07 20:42:06 -07:00
Xiang Gao
903acc6b83 CUDA BFloat16 support of clamp, remainder, lshift, rshift (#45247)
Summary:
Add CUDA BFloat16 support for clamp, remainder, lshift, and rshift
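
A quick smoke test for two of the newly covered ops (my own example, assuming a CUDA build):

```
import torch

x = torch.tensor([-1.0, 0.25, 2.0], dtype=torch.bfloat16, device="cuda")
print(torch.clamp(x, -0.5, 0.5))   # tensor([-0.5000, 0.2500, 0.5000], device='cuda:0', dtype=torch.bfloat16)
print(torch.remainder(x, 0.75))    # elementwise remainder, computed in bfloat16
```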

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45247

Reviewed By: dzhulgakov

Differential Revision: D24174258

Pulled By: ngimel

fbshipit-source-id: bfcd2d1b3746bb0527d590533f3c38b9c4d0a638
2020-10-07 20:37:06 -07:00
Vaidotas Simkus
e154b36685 Standardized clamp kernels to Numpy-like implementation (#43288)
Summary:
**BC-breaking note**

For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.

This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)

but in other places it clamps differently:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)

78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)

These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:

```
t = torch.arange(200).to(torch.float)
torch.clamp(t, 4, 2)[0]
: tensor(2.)

torch.clamp(t.cuda(), 4, 2)[0]
: tensor(4., device='cuda:0')

torch.clamp(torch.tensor(0), 4, 2)
: tensor(4)
```

This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation.

**PR Summary**

Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp, so that all implementations are consistent and use the min(max_vec, max(min_vec, x)) formula, making it equivalent to numpy.clip in all implementations.

The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark.
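
With this change, the example above produces the same result on every backend, matching min(max(a, a_min), a_max):

```
import torch

t = torch.arange(200).to(torch.float)
print(torch.clamp(t, 4, 2)[0])           # tensor(2.)
print(torch.clamp(t.cuda(), 4, 2)[0])    # tensor(2., device='cuda:0')
print(torch.clamp(torch.tensor(0), 4, 2))   # tensor(2)
```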

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288

Reviewed By: colesbury

Differential Revision: D24079453

Pulled By: mruberry

fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7
2020-10-06 13:42:08 -07:00
KyleCZH
a9a9d0b181 Rocm skip test cases (#45782)
Summary:
Skip the following test cases for rocm (When PYTORCH_TEST_WITH_ROCM=1):
- test_reference_numerics_tan_cuda_float64 (__main__.TestUnaryUfuncsCUDA)
- test_addmv_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)
- test_logspace_cuda_float64 (__main__.TestTensorCreationCUDA)
- test_gloo_backend_2gpu_module (__main__.DistributedDataParallelTest)
jeffdaily
pruthvistony

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45782

Reviewed By: VitalyFedyunin

Differential Revision: D24115581

Pulled By: xw285cornell

fbshipit-source-id: 4043a9fa19e242301b5007813c15b6b3873889c5
2020-10-05 15:12:25 -07:00
Xiang Gao
e1ff46b6e5 CUDA BFloat16 TopK (#44755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44755

Reviewed By: mruberry

Differential Revision: D23741680

Pulled By: ngimel

fbshipit-source-id: 8fce92a26663336bcb831c72202fe2623a2ddaf0
2020-10-04 11:38:00 -07:00
Nikita Shulga
3a27fc966a Test torch.svd using complex float and double numbers (take 2) (#45795)
Summary:
Adds support for magmaSvd for complex numbers

Fixes a use-after-free error in `apply_symeig`
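
A minimal sketch of the added coverage (my own example, assuming a CUDA build with MAGMA):

```
import torch

A = torch.randn(4, 4, dtype=torch.complex128, device="cuda")
U, S, V = torch.svd(A)
print(U.shape, S.shape, V.shape)   # torch.Size([4, 4]) torch.Size([4]) torch.Size([4, 4])
```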

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45795

Reviewed By: ezyang

Differential Revision: D24096955

Pulled By: malfet

fbshipit-source-id: 0d8d8492f89fe722bbd5aed3528f244245b496d0
2020-10-03 11:33:28 -07:00
Nikita Shulga
5a47a2126d Revert D24018160: [pytorch][PR] Test torch.svd using complex float and double numbers
Test Plan: revert-hammer

Differential Revision:
D24018160 (888f3c12e7)

Original commit changeset: 1b6103f5af94

fbshipit-source-id: 3040250db25995fc0d41fd0f497550dded43cad9
2020-10-02 13:33:11 -07:00
Nikita Shulga
888f3c12e7 Test torch.svd using complex float and double numbers (#45572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45572

Reviewed By: anjali411

Differential Revision: D24018160

Pulled By: malfet

fbshipit-source-id: 1b6103f5af94e9f74b73ed23aa02c0236b199b34
2020-10-02 08:29:14 -07:00