Summary:
Fixes gh-42282
This adds a device-mismatch check to `addmm` on CPU and CUDA. It seems the dispatcher always selects the CUDA version here if any of the inputs are on GPU, so in theory the CPU check is unnecessary, but it is probably better to err on the side of caution.
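For illustration, a minimal sketch (assumes a CUDA build) of the mismatch the new check rejects with a clear error:
```python
import torch

# Hedged sketch: mixing CPU and CUDA inputs to addmm should raise a clear
# RuntimeError about mismatched devices rather than relying on dispatch luck.
M = torch.randn(2, 3, device="cuda")
mat1 = torch.randn(2, 4)                # CPU tensor
mat2 = torch.randn(4, 3, device="cuda")
try:
    torch.addmm(M, mat1, mat2)
except RuntimeError as e:
    print(e)
```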
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43505
Reviewed By: mruberry
Differential Revision: D23331651
Pulled By: ngimel
fbshipit-source-id: 8eb2f64f13d87e3ca816bacec9d91fe285d83ea0
Summary:
Should close https://github.com/pytorch/pytorch/issues/36428.
The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout. This PR implements a "speed of light" fix: [`_cudnn_rnn_cast_reflatten`](https://github.com/pytorch/pytorch/pull/42385/files#diff-9ef93b6a4fb5a06a37c562b83737ac6aR327) (the autocast wrapper assigned to `_cudnn_rnn`) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually then reflattening the individual FP16 weights, which would require 2 read/writes per weight).
It isn't pretty but IMO it doesn't make rnn bindings much more tortuous than they already are.
The [test](https://github.com/pytorch/pytorch/pull/42385/files#diff-e68a7bc6ba14f212e5e7eb3727394b40R2683) tries a forward under autocast and a backward for the full cross product of RNN options and input/weight/hidden dtypes. As for all FP16list autocast tests, forward output and backward grads are checked against a control where inputs (including RNN module weights in this case) are precasted to FP16 on the python side.
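For context, a minimal usage sketch (not from this PR; assumes a CUDA device with cuDNN) of the path being exercised: an FP32 cuDNN RNN run under autocast, whose weights are cast and re-flattened into a single FP16 buffer by the wrapper above.
```python
import torch

rnn = torch.nn.LSTM(16, 32, num_layers=2).cuda()   # FP32 weights
x = torch.randn(5, 4, 16, device="cuda")
with torch.cuda.amp.autocast():
    out, _ = rnn(x)
print(out.dtype)  # expected: torch.float16
```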
Not sure who to ask for review, tagging ezyang and ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major [surgery](https://github.com/pytorch/pytorch/pull/12600).
Side quests discovered:
- Should we update [persistent RNN heuristics](dbdd28207c/aten/src/ATen/native/cudnn/RNN.cpp (L584)) to include compute capability 8.0? Could be another PR but seems easy enough to include.
- Many (maybe all?!) of the raw cudnn API calls in [RNN.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/RNN.cpp) are deprecated in cudnn 8. I don't mind taking the AI to update them since my mental cache is full of rnn stuff, but that would be a substantial separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42385
Reviewed By: zhangguanheng66
Differential Revision: D23077782
Pulled By: ezyang
fbshipit-source-id: a2afb1bdab33ba0442879a703df13dc87f03ec2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139
A bunch of tests were failing with buck since we would output to
stdout and buck would fail parsing stdout in some cases.
Moving these print statements to stderr fixes this issue.
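A minimal sketch of the pattern (the message text is illustrative):
```python
import sys

# Diagnostics go to stderr so tools that parse the test's stdout (e.g. buck)
# only see the output they expect.
print("INFO: initializing process group", file=sys.stderr)
```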
ghstack-source-id: 108606579
Test Plan: Run the offending unit tests.
Reviewed By: mrshenli
Differential Revision: D22779135
fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
Summary:
In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops.
The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework.
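A minimal sketch (illustrative class and test names, not the actual moved tests) of the device generic pattern the relocated tests follow:
```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import instantiate_device_type_tests

class TestSpectralOpsSketch(TestCase):
    # Tests take a `device` argument and get instantiated once per device type.
    def test_input_shape(self, device):
        x = torch.randn(8, 2, device=device)
        self.assertEqual(x.shape, torch.Size([8, 2]))

instantiate_device_type_tests(TestSpectralOpsSketch, globals())

if __name__ == "__main__":
    run_tests()
```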
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157
Reviewed By: albanD
Differential Revision: D22811096
Pulled By: mruberry
fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6
Summary:
Skipping the test `test_streams`, as it is flaky on ROCm.
cc: jeffdaily sunway513
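A hedged sketch of how such a skip typically looks in the test suite (class name here is illustrative; the decorator is assumed from `torch.testing._internal.common_utils`):
```python
from torch.testing._internal.common_utils import TestCase, run_tests, skipIfRocm

class TestCudaStreamsSketch(TestCase):
    @skipIfRocm  # flaky on ROCm, see the discussion above
    def test_streams(self):
        pass

if __name__ == "__main__":
    run_tests()
```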
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41697
Reviewed By: zhangguanheng66
Differential Revision: D22644600
Pulled By: malfet
fbshipit-source-id: b1b16d496e58a91c44c40d640851fd62a5d7393d
Summary:
The test asserts that the stream is "ready" but doesn't wait for the
event to be "executed", which makes it fail on some platforms where the
`query` call occurs "soon enough".
Fixes https://github.com/pytorch/pytorch/issues/38807
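A minimal sketch of the ordering the fix enforces (assumes a CUDA device):
```python
import torch

s = torch.cuda.Stream()
e = torch.cuda.Event()
with torch.cuda.stream(s):
    torch.randn(1 << 20, device="cuda").sum()
    e.record(s)
e.synchronize()   # wait until the recorded event has actually executed...
assert s.query()  # ...only then is the stream guaranteed to report "ready"
```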
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41398
Reviewed By: zhangguanheng66
Differential Revision: D22540012
Pulled By: ezyang
fbshipit-source-id: 6f56d951e48133ce4f6a9a54534298b7d2877c80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41330
`torch.cuda.check_error` is annotated as taking an `int` as argument but when running `torch.cuda.check_error(34)` one would get:
```
TypeError: cudaGetErrorString(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch._C._cudart.cudaError) -> str
Invoked with: 34
```
Even if one explicitly cast the argument, running `torch.cuda.check_error(torch._C._cudart.cudaError(34))` would give:
```
AttributeError: 'str' object has no attribute 'decode'
```
This PR fixes both issues (thus allowing `check_error` to be called with an un-casted int) and adds a test.
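Sketch of the behavior after the fix (assumes a CUDA build; 34 is just an arbitrary nonzero CUDA error code):
```python
import torch

torch.cuda.check_error(0)       # cudaSuccess: returns without raising
try:
    torch.cuda.check_error(34)  # plain int now accepted; raises with the cudart message
except RuntimeError as e:
    print(e)
```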
ghstack-source-id: 107628709
Test Plan: Unit tests
Reviewed By: ezyang
Differential Revision: D22500549
fbshipit-source-id: 9170c1e466dd554d471e928b26eb472a712da9e1
Summary:
Should close https://github.com/pytorch/pytorch/issues/35810.
I decided to keep sparse handling on the Python side for clarity, although it could be moved to the C++ side (into `_amp_non_finite_check_and_unscale_`) without much trouble.
For non-fp16 sparse grads the logic is simple: call `_amp_non_finite_check_and_unscale_` on `grad._values()` instead of `grad` itself. At least I hope it's that easy.
For fp16 sparse grads, it's trickier. Sparse tensors can be uncoalesced. From the [Note](https://pytorch.org/docs/master/sparse.html#torch.sparse.FloatTensor):
> Our sparse tensor format permits uncoalesced sparse tensors, where there may be duplicate coordinates in the indices; in this case, the interpretation is that the value at that index is the sum of all duplicate value entries.
An uncoalesced scaled fp16 grad may have values at duplicate coordinates that are all finite but large, such that adding them to make the coalesced version WOULD cause overflows.** If I checked `_values()` on the uncoalesced version, it might not report overflows, but I think it should.
So, if the grad is sparse, fp16, and uncoalesced, I still call `_amp_non_finite_check_and_unscale_` to unscale `grad._values()` in-place, but I also double-check the coalesced version by calling a second `_amp_non_finite_check_and_unscale_` on `grad.coalesce()._values()`. `coalesce()` is out-of-place, so this call doesn't redundantly affect `grad._values()`, but it does have the power to populate the same `found_inf` tensor. The `is_coalesced()` check and `coalesce()` probably aren't great for performance, but if someone needs a giant embedding table in FP16, they're better than nothing. Memory-wise, they only create a copy of the nnz gradient values+indices, which is still far better than changing the whole table to FP32.
An `unscale` variant with liberty to create unscaled grads out-of-place, and replace `param.grad` instead of writing through it, could get away with just one `_amp_non_finite_check_and_unscale_`. It could say `coalesced = grad.coalesce()`, do only the stronger `_amp_non_finite_check_and_unscale_` on `coalesced._values()`, and set `param.grad = coalesced`. I could even avoid replacing `param.grad` itself by going one level deeper and setting `param.grad`'s indices and values to `coalesced`'s, but that seems brittle and still isn't truly "in place".
** you could whiteboard an uncoalesced fp32 grad with the same property, but fp32's range is big enough that I don't think it's realistic.
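A tiny, self-contained illustration (assumes a CUDA device) of the fp16 overflow scenario described above: both duplicate values are finite, but the coalesced sum overflows FP16.
```python
import torch

i = torch.tensor([[0, 0]])                                  # duplicate coordinate
v = torch.tensor([60000.0, 60000.0], dtype=torch.float16)   # individually finite
g = torch.sparse_coo_tensor(i, v, (4,), device="cuda")
print(g._values())               # both values finite
print(g.coalesce()._values())    # inf: overflow appears only after summation
```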
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36786
Reviewed By: ezyang
Differential Revision: D22202832
Pulled By: ngimel
fbshipit-source-id: b70961a4b6fc3a4c1882f65e7f34874066435735
Summary:
Currently, a custom autograd function written with
```
@torch.cuda.amp.custom_fwd(cast_inputs=dtype)
def forward(ctx, *args):
    ...
```
casts incoming floating-point CUDA tensors to `dtype` unconditionally, regardless of whether the function executes in an autocast-enabled region. I think I had the wrong idea there. Autocast-disabled regions should give the user control of input types. Also, `custom_fwd(cast_inputs=dtype)`-decorated functions' behavior should align with native fp32list/fp16list functions. C++-side casting wrappers have no effect when autocast is disabled, and `custom_fwd`'s casting should behave the same way.
The present PR changes `custom_fwd` so it only casts in autocast-enabled regions (also updates custom_fwd to ignore fp64 inputs, like the C++ wrappers).
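A hedged end-to-end sketch of the post-PR behavior (assumes a CUDA device; the function itself is illustrative): inputs are cast to FP16 only inside an autocast-enabled region.
```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float16)
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

a = torch.randn(4, 4, device="cuda")
b = torch.randn(4, 4, device="cuda")
with torch.cuda.amp.autocast():
    print(MyMM.apply(a, b).dtype)  # torch.float16: inputs were cast
print(MyMM.apply(a, b).dtype)      # torch.float32: autocast disabled, inputs untouched
```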
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36171
Differential Revision: D22179511
Pulled By: ngimel
fbshipit-source-id: 5a93d070179a43206066bce19da0a5a19ecaabbd
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.
This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) support debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`):
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master. In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.
Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358
Differential Revision: D22165785
Pulled By: albanD
fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
Summary:
Currently, whether `AccumulateGrad` [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout. If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient. This may not sound like a big deal, but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.
The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).
Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads. This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.
For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout). The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes. I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.
Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix. The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.
This PR also updates autograd docs to describe potential BC-breaking changes below.
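A hedged sketch of the intended observable effect for a channels-last param (runs on CPU; the final check expresses the contract's expectation rather than a hard guarantee):
```python
import torch

conv = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)
x = torch.randn(4, 8, 32, 32).to(memory_format=torch.channels_last)
conv(x).sum().backward()
# Under the Gradient Layout Contract, the stashed grad is expected to match the
# param's (channels last) layout instead of being forced rowmajor contiguous.
print(conv.weight.grad.is_contiguous(memory_format=torch.channels_last))
```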
## BC notes
ngimel albanD gchanan
#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change. Any user code that was accustomed to `view(-1)`ing these grads will break.
Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed. In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params. Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous. IMO this is a mild BC breakage. Param backward hooks still see grads come in with whatever format the backward kernel gave them. The only BC breakage potential I see is if user code somehow relies on a grad in a hook sharing (or not sharing) memory with the eventual `param.grad`. Any such users hopefully know they're off the edge of the map and understand how to update their expectations.
#### BC escape hatches
At alband's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place. Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout. After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`. This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.
One limitation (present before this PR and unchanged by this PR): Presetting `param.grad` does not ensure in-place accumulation all the time. For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.
----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~ PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately. Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904
Differential Revision: D20496044
Pulled By: albanD
fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
Summary:
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import time
import torch
import numpy as np
for n, t in [(500_000, 10),
             (1_000_000, 10)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.from_numpy(np.random.rand(n)).to(dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        print(f'Took:', time.time() - start)
print('****' * 10)
for n, t in [(50_000, 100),
             (100_000, 100)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.rand(n, device='cuda', dtype=dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        # torch.cuda.synchronize()
        print(f'CUDA Took:', time.time() - start)
```
Before:
```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.64455389976501
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 3.7778031826019287
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 5.045570611953735
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.53191947937012
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 7.640851736068726
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 10.399673461914062
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 4.873984098434448
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 4.713594436645508
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 11.167185068130493
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 7.195427417755127
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 7.669712066650391
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 20.20938801765442
```
After:
```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 81.09321522712708
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 0.06062650680541992
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 0.0862889289855957
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.85304307937622
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 0.13271093368530273
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 0.17215657234191895
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 0.035035133361816406
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 0.03631949424743652
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 0.05507040023803711
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 0.05105161666870117
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 0.05449223518371582
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 0.09161853790283203
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39742
Differential Revision: D21976915
Pulled By: ngimel
fbshipit-source-id: 34431f814f31b6dfd6179a89f8e4fa574da7a306
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39156
TensorList is now supported for boxing, so we can remove the unboxed-only marking from it. I didn't check whether there were other operators that were incorrectly classified.
Fixes https://github.com/pytorch/pytorch/issues/38958
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21819821
Pulled By: ezyang
fbshipit-source-id: 6dcf91bc196554e1721d2c704f3bf524f069534b
Summary:
`_TestTorchMixin` is a base class which is instantiated across multiple types.
It inherited from `object` in order to hide it from the unittest test discovery mechanism.
But this approach makes it almost impossible to use a static code analyzer on the class.
This PR implements an alternative approach by hiding the base class inside an inner class, per https://stackoverflow.com/a/25695512
It also changes the imported class access path in `test_cuda.py`.
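A minimal sketch (hypothetical names) of the inner-class pattern: the mixin is no longer a module-level attribute, so unittest discovery skips it, while subclasses stay discoverable and the class stays analyzable:
```python
import unittest

class AbstractTestCases:
    # Not discovered: only module-level TestCase subclasses are collected.
    class _TestTorchMixin(unittest.TestCase):
        def test_something(self):
            self.assertTrue(True)

class TestTorchCPU(AbstractTestCases._TestTorchMixin):
    pass  # discovered and run

if __name__ == "__main__":
    unittest.main()
```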
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39110
Test Plan:
run `test_torch.py --discover-tests` and `test_cuda.py --discover-tests` before and after change:
```
$ python test_torch.py --discover-tests|md5sum
2ca437bb5d65700763ce04cdacf6de3e -
$ python test_cuda.py --discover-tests|md5sum
b17df916fb0eeb6f0dd7222d7dae392c -
```
Differential Revision: D21759265
Pulled By: malfet
fbshipit-source-id: b01b06111469e551f7b78387449975e5248f6b9e
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
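A short sketch of the resulting call pattern (the test body is illustrative):
```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests

class TestToleranceSketch(TestCase):
    def test_close(self):
        a = torch.tensor([1.0, 2.0])
        b = a + 1e-6
        # atol and rtol must now be given together, and msg is kwarg-only.
        self.assertEqual(a, b, atol=1e-4, rtol=0, msg="tensors diverged")

if __name__ == "__main__":
    run_tests()
```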
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21740237
Pulled By: mruberry
fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21717199
Pulled By: mruberry
fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
Summary:
This pull request disables the unit tests that were observed to be failing once `test2` was enabled. These tests will be looked at and fixed one by one as soon as possible, but until then they are disabled to unblock `test2`.
The pull request also disables fftPlanDestroy for rocFFT to avoid double-freeing FFT handles.
cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37427
Differential Revision: D21302909
Pulled By: ezyang
fbshipit-source-id: ecadda3778e65b7f4f97e24b932b96b9ce928616
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
The following code
```python
a = torch.randn(42,)
b = a.cuda(non_blocking=True)
```
will be **blocked** in the current master, and will **not be blocked** in the pytorch 1.4 release. This can be verified by a `nvprof --print-api-trace python script.py` profiling. It is causing a performance issue.
I isolated the problem, and jjsjann123 & ptrblck pointed out the fix. Thanks!
cc csarofeen ptrblck jjsjann123 VitalyFedyunin ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35144
Differential Revision: D20601163
Pulled By: ngimel
fbshipit-source-id: edd2b1dabd8e615c106188f30ddb3e763bde7471
Summary:
Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081
In-place ops and ops with user-supplied `out=...` can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/pull/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
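A minimal usage sketch of the eager API being added (an out-of-place op under autocast; assumes a CUDA device):
```python
import torch

a = torch.randn(8, 8, device="cuda")
b = torch.randn(8, 8, device="cuda")
with torch.cuda.amp.autocast():
    c = a @ b      # matmul runs in float16 under autocast
print(c.dtype)     # expected: torch.float16
```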
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32140
Differential Revision: D20346700
Pulled By: ezyang
fbshipit-source-id: 12d77b3917310186fbddf11c59b2794dc859131f
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).
Instead of having two separate and conflicting memory pools, with this PR CuPy can directly allocate memory from the PyTorch allocator by means of this proposal: https://github.com/cupy/cupy/pull/3126
We would like to gather feedback to know whether this approach makes sense for PyTorch, or whether other alternative designs would be preferable.
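For orientation, a hedged sketch of the PyTorch-side entry points involved. `torch.cuda.caching_allocator_alloc` / `caching_allocator_delete` are the present-day Python bindings to the caching allocator (they may postdate this PR); the CuPy side of the integration lives in cupy/cupy#3126.
```python
import torch

nbytes = 1024
ptr = torch.cuda.caching_allocator_alloc(nbytes)  # raw device pointer from PyTorch's pool
# ... an external library such as CuPy could wrap and reuse `ptr` here ...
torch.cuda.caching_allocator_delete(ptr)
```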
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860
Differential Revision: D20212788
Pulled By: ngimel
fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
Summary:
Also, the Windows memory failures responsible for the earlier reversion have been fixed.
This PR (initially) contains 2 commits:
* a revert of the revert
* all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review
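For context, a hedged sketch of where the scale-update heuristic fires in present-day user code (toy model and data; assumes a CUDA device):
```python
import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    x = torch.randn(8, 16, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skips the step if infs/NaNs were found
    scaler.update()          # grows or backs off the loss scale (the heuristic above)
```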
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366
Differential Revision: D20099026
Pulled By: ngimel
fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529
Summary:
Addresses https://github.com/pytorch/pytorch/issues/33300.
Calling .numpy() on a CUDA or non-strided (e.g. sparse) tensor segfaults in current PyTorch. This fixes the segfaults and throws the appropriate TypeError, as was intended.
Two tests, one in test_cuda.py and the other in test_sparse.py, are added to verify the behavior.
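Sketch of the intended post-fix behavior for the CUDA case (the sparse case is analogous):
```python
import torch

t = torch.randn(3, device="cuda")
try:
    t.numpy()        # previously segfaulted
except TypeError as e:
    print(e)         # now a TypeError suggesting .cpu() first
```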
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33612
Differential Revision: D20038210
Pulled By: mruberry
fbshipit-source-id: 265531dacd37c392232fd3ec763489a62ef54795
Summary:
IIUC Python does not guarantee when an object is garbage collected, so it is possible that some other test running before `TestCuda.test_memory_stats` creates an object which is only garbage collected during `TestCuda.test_memory_stats`, causing memory stats to change and the test to fail. This kind of failure is very hard to debug (it took me and mcarilli and ptrblck quite a while to figure out what is happening), and it is the root cause of mcarilli's gradient scaling PR https://github.com/pytorch/pytorch/pull/26512 failing on Windows.
cc: csarofeen
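A hedged sketch of the kind of stabilization this implies: collect leftover Python garbage (and optionally empty the CUDA cache) before sampling memory stats, so objects from earlier tests can't perturb them.
```python
import gc
import torch

gc.collect()               # drop objects whose collection timing Python doesn't guarantee
torch.cuda.empty_cache()   # release cached blocks so reserved-memory stats start clean
baseline = torch.cuda.memory_allocated()
print(baseline)
```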
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33575
Differential Revision: D20009260
Pulled By: ngimel
fbshipit-source-id: 62f2716aefac3aa6c7d1898aa8a78e6b8aa3075a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32863, (together with https://github.com/pytorch/pytorch/issues/33310 for the `TensorIterator` reductions)
This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.
I have a test for this locally but haven't included it here because `max` is much slower than `argmax`, to the point where the test takes several minutes to call `max` on just one `2**32`-element tensor. That seems excessive even for a slow test, but I can push it if preferred.
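The test sketched above would look roughly like this (hedged; it is slow and needs a GPU with more than ~4.3 GB free):
```python
import torch

# More than 2**32 elements, so 32-bit index math is insufficient.
t = torch.zeros(2, 2**31 + 1, dtype=torch.int8, device="cuda")
t[0, -1] = 1
values, indices = t.max(dim=1)
print(values)   # expected: tensor([1, 0], ...)
print(indices)  # expected: max position 2**31 for the first row, 0 for the second
```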
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33405
Differential Revision: D20010769
Pulled By: ezyang
fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3