Commit Graph

383 Commits

Author SHA1 Message Date
Michael Carilli
3b040c478a Make custom_fwd a no-op when not executed under autocast (#36171)
Summary:
Currently, a custom autograd function written with
```
torch.cuda.amp.custom_fwd(cast_inputs=dtype)
def forward(ctx, *args):
    ...
```
casts incoming floating-point CUDA tensors to `dtype` unconditionally, regardless of whether the function executes in an autocast-enabled region.  I think I had the wrong idea there.  Autocast-disabled regions should give the user control of input types.  Also, `custom_fwd(cast_inputs=dtype)`-decorated functions' behavior should align with native fp32list/fp16list functions.  C++-side casting wrappers have no effect when autocast is disabled, and  `custom_fwd`'s casting should behave the same way.

The present PR changes `custom_fwd` so it only casts in autocast-enabled regions (also updates custom_fwd to ignore fp64 inputs, like the C++ wrappers).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36171

Differential Revision: D22179511

Pulled By: ngimel

fbshipit-source-id: 5a93d070179a43206066bce19da0a5a19ecaabbd
2020-06-23 10:23:02 -07:00
Michael Carilli
8066fba226 [RELAND2] Change AccumulateGrad to yield .grads that match weights' memory layout (#40358)
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.

This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) supporting debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`).
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master.  In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.

Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358

Differential Revision: D22165785

Pulled By: albanD

fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
2020-06-22 17:13:21 -07:00
Alban Desmaison
08227fea4f Revert D22079377: [pytorch][PR] [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D22079377

Original commit changeset: 9bd2b7e0c34f

fbshipit-source-id: c22cc349d790caa574eace0d63980854c33e5a59
2020-06-17 10:17:27 -07:00
Michael Carilli
1ec8ece2b9 [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout (#40129)
Summary:
https://github.com/pytorch/pytorch/pull/34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).

This PR reverts the revert, and adds diffs that should repair the misconfigured test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40129

Differential Revision: D22079377

Pulled By: albanD

fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
2020-06-17 09:02:54 -07:00
Alban Desmaison
f1e575a0bf Revert D20496044: [pytorch][PR] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D20496044

Original commit changeset: 248d680f4b1b

fbshipit-source-id: 6462b25e3fb9c8596c1da443389089f09c32df4d
2020-06-16 10:38:40 -07:00
Michael Carilli
2beb9690c3 Change AccumulateGrad to yield .grads that match weights' memory layout (#34904)
Summary:
Currently, whether `AccumulateGrad`  [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout.  If the param's layout is channels last, or otherwise not rowmajor contigous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient.  This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads.  This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout).  The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes.  I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as channels-last-specific fix.  The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.

This PR also updates autograd docs to describe potential BC-breaking changes below.

## BC notes
ngimel albanD gchanan

#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change.  Any user code that was accustomed to `view(-1)`ing these grads will break.

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed.  In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params.  Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous.  IMO this is a mild BC breakage.  Param backward hooks still see grads come in with whatever format the backward kernel gave them.  The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`.  Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches
At alband's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place.  Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout.  After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`.  This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.

One limitation (present before this PR and unchanged by this PR):  Presetting `param.grad` does not ensure in-place accumulation all the time.  For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.

----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~  PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately.  Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904

Differential Revision: D20496044

Pulled By: albanD

fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
2020-06-16 08:43:31 -07:00
kshitij12345
97dfdaaad8 torch.multinomial : fast-path for replacement=False (#39742)
Summary:
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import time
import torch
import numpy as np

for n, t in [(500_000, 10),
             (1_000_000, 10)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.from_numpy(np.random.rand(n)).to(dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        print(f'Took:', time.time() - start)

print('****' * 10)

for n, t in [(50_000, 100),
             (100_000, 100)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.rand(n, device='cuda', dtype=dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        # torch.cuda.synchronize()
        print(f'CUDA Took:', time.time() - start)
```

Before:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.64455389976501
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 3.7778031826019287
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 5.045570611953735
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.53191947937012
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 7.640851736068726
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 10.399673461914062
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 4.873984098434448
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 4.713594436645508
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 11.167185068130493
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 7.195427417755127
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 7.669712066650391
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 20.20938801765442
```

After:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 81.09321522712708
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 0.06062650680541992
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 0.0862889289855957
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.85304307937622
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 0.13271093368530273
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 0.17215657234191895
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 0.035035133361816406
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 0.03631949424743652
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 0.05507040023803711
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 0.05105161666870117
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 0.05449223518371582
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 0.09161853790283203
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39742

Differential Revision: D21976915

Pulled By: ngimel

fbshipit-source-id: 34431f814f31b6dfd6179a89f8e4fa574da7a306
2020-06-10 20:42:55 -07:00
rohithkrn
ab6c447f59 [ROCm] Enable AMP autocast tests on ROCm (#39616)
Summary:
Enables AMP autocast tests on ROCm.

ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39616

Differential Revision: D21924219

Pulled By: ezyang

fbshipit-source-id: f4df4ad32cd8fae8c4620cd8ab18b00d74fb46bd
2020-06-08 10:30:39 -07:00
xueht-fnst
faf0a3bd7a Move bernoulli_() to DistributionTemplates (#38558)
Summary:
resolve the feature introduced in https://github.com/pytorch/pytorch/issues/37373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38558

Differential Revision: D21920685

Pulled By: pbelevich

fbshipit-source-id: 50c77d9aaa334b3276a2352afe6c4ad03f12be31
2020-06-07 07:18:30 -07:00
Edward Yang
de5b8797e6 Remove unboxed only from AMP registrations for cat. (#39156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39156

TensorList is now supported for boxing, so we can remove
unboxed only from it.  I didn't check if there were other
operators that were incorrectly classified.

Fixes https://github.com/pytorch/pytorch/issues/38958

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21819821

Pulled By: ezyang

fbshipit-source-id: 6dcf91bc196554e1721d2c704f3bf524f069534b
2020-06-02 07:49:02 -07:00
peter
a5d44800f0 Implement CUDA_KERNEL_ASSERT for MSVC (#39218)
Summary:
Tested locally on CPU/GPU + Debug/Release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39218

Differential Revision: D21786500

Pulled By: malfet

fbshipit-source-id: 7e871003d3509436952932b5ff3599e36bb8f205
2020-05-29 11:44:54 -07:00
Jeff Daily
c25b3d4305 [ROCm] in test_cuda.py, re-enable skipped tests (#37952)
Summary:
- test_stream_context
- test_cublas_multiple_threads_same_device
- test_cusparse_multiple_threads_same_device

These tests passed three rounds of CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37952

Differential Revision: D21532027

Pulled By: vincentqb

fbshipit-source-id: dce7fc4f0943e2be43da71e213e168c455c66751
2020-05-29 11:38:47 -07:00
Jeff Daily
7e16dd299a [ROCm] enable mem leak check for rocm (#35953)
Summary:
CC iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35953

Differential Revision: D21742926

Pulled By: zou3519

fbshipit-source-id: f18534dbb88a84fe98b8d85ce8fde652916a72d5
2020-05-28 07:05:47 -07:00
Nikita Shulga
f5bc91f851 Get rid of multiple inheritence in test_torch (#39110)
Summary:
`_TestTorchMixin` is base class which is instantiated across multiple types.
It was inherited from `object` in order to hide it from unittest test discovery mechanism.
But this approach makes it almost impossible to use static code analyzer on the class.
This PR implements alternative approach by hiding base class into inner class, per https://stackoverflow.com/a/25695512

Change imported class access path in `test_cuda.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39110

Test Plan:
run `test_torch.py --discover-tests` and `test_cuda.py --discover-tests` before and after change:
```
$ python test_torch.py --discover-tests|md5sum
2ca437bb5d65700763ce04cdacf6de3e  -
$ python test_cuda.py --discover-tests|md5sum
b17df916fb0eeb6f0dd7222d7dae392c  -
```

Differential Revision: D21759265

Pulled By: malfet

fbshipit-source-id: b01b06111469e551f7b78387449975e5248f6b9e
2020-05-27 22:45:06 -07:00
Mike Ruberry
13120bf677 Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replace with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21740237

Pulled By: mruberry

fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
2020-05-27 06:31:07 -07:00
Rohan Varma
63e545e0fe Revert D21717199: [pytorch][PR] Updates assertEqual to require atol and rtol, removes positional atol
Test Plan: revert-hammer

Differential Revision:
D21717199

Original commit changeset: 9feb856f94ee

fbshipit-source-id: bfde9c39a5ce99f0ca6183a7dde703c65b7c8259
2020-05-26 18:23:59 -07:00
Mike Ruberry
6ddca30b2d Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replace with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21717199

Pulled By: mruberry

fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
2020-05-26 08:30:23 -07:00
Nik Ved
f80df4ca79 port scatter_add to ATen (CUDA) (#38262)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/24622 ](https://github.com/pytorch/pytorch/issues/24622).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38262

Differential Revision: D21656729

Pulled By: ngimel

fbshipit-source-id: 63dcbf8eeaf59d8295bf4e5c8bb9d28ad165d4eb
2020-05-20 19:03:41 -07:00
Michael Carilli
25f918548d Allow GradScaler to be pickled (#38296)
Summary:
Should unblock https://github.com/PyTorchLightning/pytorch-lightning/issues/1782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38296

Differential Revision: D21553296

Pulled By: albanD

fbshipit-source-id: 9041a72d7cf8833e4b01bc767fd2321f17c7c5f2
2020-05-14 09:14:28 -07:00
SsnL
ae392a77a6 Add better device idx parse checks (#37376)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37376

Differential Revision: D21476036

Pulled By: zou3519

fbshipit-source-id: 86907083c23cbaf165b645307fb340f2656b814e
2020-05-14 09:07:12 -07:00
Ailing Zhang
9232356e5f remove uses of type() and type_as() part 1. (#38029)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38029

Differential Revision: D21468523

Pulled By: ailzhang

fbshipit-source-id: 14b7185d43eb03f630cfaa2d70e02d637ff8551b
2020-05-08 08:16:24 -07:00
ashishfarmer
bbd2350c99 Disable tests failing on test2 in ROCm CI (#37427)
Summary:
This pull request disables the unit tests that were observed to be failing once `test2` was enabled. These tests will be one by one looked at and fixed at the earliest, but until then disabling them to unblock `test2`
The pull request also disables fftPlanDestroy for rocFFT to avoid double-freeing FFT handles

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37427

Differential Revision: D21302909

Pulled By: ezyang

fbshipit-source-id: ecadda3778e65b7f4f97e24b932b96b9ce928616
2020-04-29 09:56:28 -07:00
Emilio Castillo
5fc391a646 Enforce type promotion in torch.cat (#35030)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35014

CUDA `cat` implementation doesn't use `TensorIterator` so there is the need of manually doing some checks in the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35030

Differential Revision: D21155853

Pulled By: nairbv

fbshipit-source-id: 9e78bb7591f806734e12555831157061c925ff40
2020-04-22 13:35:07 -07:00
David Reiss
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
Peter Bell
e99c53dc86 Fix broadcast_coalesce for empty tensors (#35965)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35965

Differential Revision: D20919377

Pulled By: ezyang

fbshipit-source-id: cfbcb35a44507de1c3ed7e0732cfc3b124b9bc0b
2020-04-08 11:02:11 -07:00
Michael Carilli
0f0271e255 [RELAND2] Eager autocasting, out-of-place ops only (with MSVC 2017 fix) (#35102)
Summary:
This is the second reland attempt for https://github.com/pytorch/pytorch/pull/32140.

The first reland attempt https://github.com/pytorch/pytorch/pull/35011 failed due a [small incompatible change](https://github.com/pytorch/pytorch/pull/35011#issuecomment-601754216) in recent master (`skipIfRocm` was removed from `test_data_parallel.py`).

The present PR restores skipIfRocm.

Description from first reland attempt https://github.com/pytorch/pytorch/pull/35011:

> https://github.com/pytorch/pytorch/pull/32140 was approved and merged, but [reverted](d0577e19f0) because it broke builds with versions of Visual Studio older than 15.8 that were not represented in public CI.  The build failures were caused by a [known VS bug](https://developercommunity.visualstudio.com/content/problem/27729/allow-function-with-internal-linkage-as-template-n.html), fixed in versions 15.8 and newer.
>
> The present PR reverts the revert (restoring https://github.com/pytorch/pytorch/pull/32140 's diffs) and adds a workaround to enable compilation with VS < 15.8.  The workaround isn't pretty, but it's guarded by macros such that it's only used when compiling with VS < 15.8.  All other builds compile with the same code/control flow as was merged in https://github.com/pytorch/pytorch/pull/32140.
>
> Original description of https://github.com/pytorch/pytorch/pull/32140:
> > Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
> Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081
>
> > In-place ops and ops with user-supplied out=... can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/issues/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35102

Differential Revision: D20596918

Pulled By: ezyang

fbshipit-source-id: 60caa279bb0ce4a9bb0b28c1d585d42cf1cc7e50
2020-03-24 09:08:04 -07:00
Xiao Wang
37e355622a Pass the missed "non_blocking" argument for to() (#35144)
Summary:
The following code
```python
a = torch.randn(42,)
b = a.cuda(non_blocking=True)
```
will be **blocked** in the current master, and will **not be blocked** in pytorch 1.4 release. This can be verified by a `nvprof --print-api-trace python script.py` profiling. It is causing performance issue.

I isolated the problem, and jjsjann123 & ptrblck pointed out the fix. Thanks!

cc csarofeen ptrblck jjsjann123 VitalyFedyunin ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35144

Differential Revision: D20601163

Pulled By: ngimel

fbshipit-source-id: edd2b1dabd8e615c106188f30ddb3e763bde7471
2020-03-23 13:49:23 -07:00
Mike Ruberry
fe276d541e Revert D20541921: [pytorch][PR] [RELAND] Eager autocasting, out-of-place ops only (with MSVC 2017 fix)
Test Plan: revert-hammer

Differential Revision:
D20541921

Original commit changeset: abb5488dca86

fbshipit-source-id: d2c6038978f80e5429632f8b49107090a8a247f4
2020-03-19 22:39:12 -07:00
Michael Carilli
991b97277a [RELAND] Eager autocasting, out-of-place ops only (with MSVC 2017 fix) (#35011)
Summary:
https://github.com/pytorch/pytorch/pull/32140 was approved and merged, but [reverted](d0577e19f0) because it broke builds with versions of Visual Studio older than 15.8 that were not represented in public CI.  The build failures were caused by a [known VS bug](https://developercommunity.visualstudio.com/content/problem/27729/allow-function-with-internal-linkage-as-template-n.html), fixed in versions 15.8 and newer.

The present PR reverts the revert (restoring https://github.com/pytorch/pytorch/pull/32140 's diffs) and adds a workaround to enable compilation with VS < 15.8.  The workaround isn't pretty, but it's guarded by macros such that it's only used when compiling with VS < 15.8.  All other builds compile with the same code/control flow as was merged in https://github.com/pytorch/pytorch/pull/32140.

Original description of https://github.com/pytorch/pytorch/pull/32140:
> Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081

> In-place ops and ops with user-supplied out=... can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/issues/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35011

Differential Revision: D20541921

Pulled By: ezyang

fbshipit-source-id: abb5488dca8620b0daac4306ebf2bb47fc36e4f5
2020-03-19 20:18:18 -07:00
Edward Yang
d0577e19f0 Revert D20346700: [pytorch][PR] Eager autocasting, out-of-place ops only
Test Plan: revert-hammer

Differential Revision:
D20346700

Original commit changeset: 12d77b391731

fbshipit-source-id: 108d72bf24232f443c0be293ec932c0c478d6a60
2020-03-18 11:42:51 -07:00
Michael Carilli
aaa8f02156 Eager autocasting, out-of-place ops only (#32140)
Summary:
Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081

In-place ops and ops with user-supplied `out=...` can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/pull/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests.  Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32140

Differential Revision: D20346700

Pulled By: ezyang

fbshipit-source-id: 12d77b3917310186fbddf11c59b2794dc859131f
2020-03-18 10:28:21 -07:00
Emilio Castillo
31cc311143 Expose CUDACachingAllocator raw_alloc and raw_delete to python (#33860)
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).

Instead of having two separate and conflicting memory pools. With this PR, CuPy can directly alloc memory from the PyTorch allocator by means of this proposal https://github.com/cupy/cupy/pull/3126

We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860

Differential Revision: D20212788

Pulled By: ngimel

fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
2020-03-03 17:50:11 -08:00
Michael Carilli
fc6a153688 [WIP] Reanimate gradient scaling API with original scale update heuristic (#33366)
Summary:
Also, windows memory failures responsible for the earlier reversion have been fixed.

This PR (initially) contains 2 commits:
* a revert of the revert
* all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366

Differential Revision: D20099026

Pulled By: ngimel

fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529
2020-02-25 19:00:34 -08:00
Mike Ruberry
8291e06f8f Fixes cuda->numpy and non-strided->numpy segfaults (#33612)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/33300.

Calling .numpy() on a CUDA or non-strided (e.g. sparse) tensor segfaults in current PyTorch. This fixes the segfaults and throws the appropriate TypeError, as was intended.

Two tests, one in test_cuda.py and the other in test_sparse.py, are added to verify the behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33612

Differential Revision: D20038210

Pulled By: mruberry

fbshipit-source-id: 265531dacd37c392232fd3ec763489a62ef54795
2020-02-21 22:23:08 -08:00
Xiang Gao
e8a03438cc Make TestCuda.test_memory_stats more robust (#33575)
Summary:
IIUC Python does not guarantee when an object is garbage collected. So it is possible that, some other test running before `TestCuda.test_memory_stats` creates object which is only garbage collected during  `TestCuda.test_memory_stats`, causing mem stats to change and causing this test to fail. This kind of failure is very hard to debug (it took me and mcarilli and ptrblck quite a while to figure out what is happening), and it is the root cause of mcarilli's gradient scaling PR https://github.com/pytorch/pytorch/pull/26512 failing on Windows.

cc: csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33575

Differential Revision: D20009260

Pulled By: ngimel

fbshipit-source-id: 62f2716aefac3aa6c7d1898aa8a78e6b8aa3075a
2020-02-20 21:02:55 -08:00
Peter Bell
c882425c24 Add 64-bit indexing support to THC index reductions (#33405)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32863, (together with https://github.com/pytorch/pytorch/issues/33310 for the `TensorIterator` reductions)

This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.

I have a test for this locally but haven't included it here because `max` is much slower than `argmax`. To the point where the test takes several minutes to call max on just one `2**32` element tensor. That seems excessive, even for a slow test but I can push it if preferred.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33405

Differential Revision: D20010769

Pulled By: ezyang

fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3
2020-02-20 15:20:14 -08:00
Edward Yang
ae53f8dd25 Revert D19859905: [pytorch][PR] Gradient scaling API
Test Plan: revert-hammer

Differential Revision:
D19859905

Original commit changeset: bb8ae6966214

fbshipit-source-id: 28f1c93e8a00e3a4bbe8cc981499b15468f0b970
2020-02-14 11:03:27 -08:00
Michael Carilli
40246fa63c Gradient scaling API (#26512)
Summary:
This PR implements the gradient scaling API that mruberry, jjsjann123, ngimel, zdevito, gchanan and I have been discussing.  Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081.

Volume-wise, this PR is mostly documentation and tests.  The Python API (found entirely in `torch/cuda/amp/amp_scaler.py`) is lightweight .  The exposed functions are intended to make the implementation and control flow of gradient scaling convenient, intuitive, and performant.

The API is probably easiest to digest by looking at the documentation and examples. `docs/source/amp.rst` is the homepage for the Automatic Mixed Precision package.  `docs/source/notes/amp_examples.rst` includes several examples demonstrating common but not-immediately-obvious use cases.  Examples are backed by tests in `test_cuda.py` (and thankfully the tests pass :P).

Two small utility kernels have been added in `native/cuda/AmpKernels.cu` to improve performance and avoid host-device synchronizations wherever possible.

Existing optimizers, both in the wild and in Pytorch core, do not need to change to use the scaling API.

However, the API was also designed to establish a contract between user scripts and optimizers such that writers of _new_ custom optimizers have the control points they need to implement fast, optionally sync-free updates.  User scripts that obey the scaling API can drop such custom optimizers in and reap performance benefits without having to change anything aside from the optimizer constructor itself.  [I know what the contract with custom optimizers should be](35829f24ef/torch/cuda/amp/amp_scaler.py (L179-L184)), but I'm waiting for review on the rest of the API before I go about documenting it (it will be given a dedicated section in `docs/source/notes/amp_examples.rst`.

Currently, the gradient scaling examples do not include the auto-casting API as discussed in https://github.com/pytorch/pytorch/issues/25081.  The gradient scaling API is intended to be orthogonal/modular relative to autocasting.  Without auto-casting the gradient scaling API is fully use-_able_, but not terribly use-_ful_, so it's up to you guys whether you want to wait until auto-casting is ready before merging the scaling API as well.

### Todo
- [ ] How do I get c10 registered status for my two custom kernels?  They're very simple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26512

Differential Revision: D19859905

Pulled By: mruberry

fbshipit-source-id: bb8ae6966214718dfee11345db824389e4286923
2020-02-13 11:06:06 -08:00
Mike Ruberry
ad90c97c0a Removes flaky check (#33146)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/32949.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33146

Differential Revision: D19836001

Pulled By: mruberry

fbshipit-source-id: 773069ae0c181e1a050b65b888c87590c1dddb32
2020-02-11 12:21:07 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00
Michael Carilli
4bdfc71421 Fix race condition for to() backward that spans devices (#31930)
Summary:
While putting finishing touches on the gradient scaling PR (https://github.com/pytorch/pytorch/pull/26512), I discovered my multi-GPU test (which uses `to()` to transfer tensors between devices) was intermittently failing with bad numerics.  I knew it was going to be [a weird case from the start](https://www.imdb.com/title/tt8946378/quotes/qt4868203) and spent a week descending into madness.  It turns out, for backward ops that create gradients on a different device from the device on whose stream the op is executed, the streaming backward synchronizations in [input_buffer.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L46-L83) do not properly tell later ops to wait on the population/creation of those gradients.  For example, a cross-device `to()` backward (CopyBackward Node) enqueues a cudaMemcpyAsync on the current stream of the source (incoming gradient's) device, then [syncs getCurrentCUDAStream on the destination device with the cudaMemcpyAsync](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Copy.cu#L76).  However, `input_buffer.cpp` in such cases ([case (3)](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L77-L81)) was not properly telling `opt_consumer_stream` to wait on the current stream of the destination device (`var`'s device).

Circumstances needed to repro in current master (see [my test](https://github.com/pytorch/pytorch/compare/master...mcarilli:backward_to_race_fix#diff-e68a7bc6ba14f212e5e7eb3727394b40R1901)):
- 2 devices, with non-default streams used for forward-pass ops on both devices (which is the default behavior in test_cuda.py)
- A `to()` that transfers a tensor requiring grad from one device to another
- A backward pass that routes back through to()'s backward (aka CopyBackward).

Under these circumstances, backward ops following CopyBackward on CopyBackward's destination device (aka the original forward-pass source device) race with the device-to-device transfer, and execute using partially-transferred data.

The present PR fixes the race condition and ensures that later ops wait on the CopyBackward transfer.  This PR should also make streaming backward safe for other backward ops that span devices, as long as they play nice and populate any new gradients they create using the "current stream" of the device(s) on which they create those gradients.

There are a couple minor issues where I'm not sure of the best approach:
- Should we guard onto the var's device for the entire body of InputBuffer::add?
- I'm fairly sure we need to `recordStream` on `var` if the consumer stream is different from the stream on which (we expect) `var` was created, but calling `c10::cuda::CUDACachingAllocator::recordStream` in input_buffer.cpp might break CPU-only builds.  I couldn't find a different API call to record streams that seemed CPU-build-agnostic.  Could I wrap the call with a macro?

Thanks to mruberry for helpful suggestions and also the organization/naming of the stream pool and streaming backward code that allowed me to (just barely) wrap my head around the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31930

Differential Revision: D19517617

Pulled By: mruberry

fbshipit-source-id: 183d5460aefa5d27366b465b0473b80ec80fa044
2020-01-22 16:32:24 -08:00
Sameer Deshmukh
2f5eefe525 Raise ValueError if CUDA device is specified without specifying the : (#29087)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/19076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29087

Differential Revision: D19298959

Pulled By: ezyang

fbshipit-source-id: 878ea4840682012f07177d8d159a77c0e5afada6
2020-01-07 10:29:49 -08:00
Vitaly Fedyunin
fde3d707ad Switch default memory format of to (and similar) operators to Preserve
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30088

Test Plan: Imported from OSS

Differential Revision: D18624984

Pulled By: VitalyFedyunin

fbshipit-source-id: 54901786d7496c7dce785140b0585ac9093b1d86
2019-12-14 20:29:01 -08:00
hxia11
06c7420fa2 Raise error if a block can not be found from a CUDA tensor (#30870)
Summary:
After several discussions, we agreed not to put any extra safety check for recordStream as either the check will cause failures in certain scenarios or there is no need to throw for user errors.

As a summary, it simply does what is described in https://github.com/pytorch/pytorch/issues/27405, check if a tensor is indeed allocated by a CUDACachingAllocator instance, if it is, then throw internal error if a block can not be retrieved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30870

Differential Revision: D18851669

Pulled By: yxia11

fbshipit-source-id: c2f01798cd24f1fd0f35db8764057d5d333dab95
2019-12-10 08:04:00 -08:00
Michael Suo
62b10721fb Actually make flake8 do something (#30892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30892

Fixes all outstanding lints and actually installs a properly configured
flake8

Test Plan: Imported from OSS

Differential Revision: D18862825

Pulled By: suo

fbshipit-source-id: 08e9083338a7309272e17bb803feaa42e348aa85
2019-12-06 17:50:50 -08:00
Natalia Gimelshein
2171f91053 reenable cuda_kernel_loop_overflow_large test (#30797)
Summary:
Fix https://github.com/pytorch/pytorch/issues/30771 has landed, original issue https://github.com/pytorch/pytorch/issues/26838 is now closed

cc peterjc123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30797

Differential Revision: D18827307

Pulled By: ngimel

fbshipit-source-id: 41b3db5fc9db85daeaa1b53c55b468976c996285
2019-12-05 10:09:39 -08:00
Mingbo Wan
3636cb0364 windows build (#30556)
Summary:
based on https://github.com/pytorch/pytorch/pull/28677
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30556

Differential Revision: D18764040

Pulled By: mingbowan

fbshipit-source-id: 53104636800f5887b74a82c154bc5e9603de9322
2019-12-02 14:54:22 -08:00
Junjie Bai
45e980a243 Skip broken test test_cuda_kernel_loop_overflow_large (#30021)
Summary:
The previous "expectedFailure" decoration has broken ROCm CI

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/7674//console

```
16:23:52 test_cuda_kernel_loop_overflow_large (__main__.TestCuda) ... unexpected success

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30021

Differential Revision: D18574931

Pulled By: bddppq

fbshipit-source-id: 7b5240f9f3a610adda633f8b0dd9137e40b12e2f
2019-11-18 12:38:37 -08:00
Edward Yang
a573f8f7d7 Disable broken test_cuda_kernel_loop_overflow_large test (#29904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29904

See https://github.com/pytorch/pytorch/issues/26838

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18539740

Pulled By: ezyang

fbshipit-source-id: c3dcaaa0d8eedcfa4173c2b6ec139090bdace4b4
2019-11-18 07:38:34 -08:00
Vitaly Fedyunin
b80c4f60fb Add channels last support to cuda.comm.scatter and gather
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28077

Test Plan: Imported from OSS

Differential Revision: D17980305

Pulled By: VitalyFedyunin

fbshipit-source-id: e4741194baac3d93f2d53724582dc4c38f82ee84
2019-11-18 05:35:35 -08:00