Summary:
Fixes gh-42282
This adds a device-mismatch check to `addmm` on CPU and CUDA. It seems the dispatcher always selects the CUDA version here if any of the inputs are on GPU, so in theory the CPU check is unnecessary, but it is probably better to err on the side of caution.
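For illustration, a minimal sketch (assumes a CUDA build) of the mismatch the new check rejects with a clear error:
```python
import torch

# Hedged sketch: mixing CPU and CUDA inputs to addmm should raise a clear
# RuntimeError about mismatched devices rather than relying on dispatch luck.
M = torch.randn(2, 3, device="cuda")
mat1 = torch.randn(2, 4)                # CPU tensor
mat2 = torch.randn(4, 3, device="cuda")
try:
    torch.addmm(M, mat1, mat2)
except RuntimeError as e:
    print(e)
```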
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43505
Reviewed By: mruberry
Differential Revision: D23331651
Pulled By: ngimel
fbshipit-source-id: 8eb2f64f13d87e3ca816bacec9d91fe285d83ea0
Summary:
Should close https://github.com/pytorch/pytorch/issues/36428.
The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout. This PR implements a "speed of light" fix: [`_cudnn_rnn_cast_reflatten`](https://github.com/pytorch/pytorch/pull/42385/files#diff-9ef93b6a4fb5a06a37c562b83737ac6aR327) (the autocast wrapper assigned to `_cudnn_rnn`) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually then reflattening the individual FP16 weights, which would require 2 read/writes per weight).
It isn't pretty but IMO it doesn't make rnn bindings much more tortuous than they already are.
The [test](https://github.com/pytorch/pytorch/pull/42385/files#diff-e68a7bc6ba14f212e5e7eb3727394b40R2683) tries a forward under autocast and a backward for the full cross product of RNN options and input/weight/hidden dtypes. As for all FP16list autocast tests, forward output and backward grads are checked against a control where inputs (including RNN module weights in this case) are precasted to FP16 on the python side.
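For context, a minimal usage sketch (not from this PR; assumes a CUDA device with cuDNN) of the path being exercised: an FP32 cuDNN RNN run under autocast, whose weights are cast and re-flattened into a single FP16 buffer by the wrapper above.
```python
import torch

rnn = torch.nn.LSTM(16, 32, num_layers=2).cuda()   # FP32 weights
x = torch.randn(5, 4, 16, device="cuda")
with torch.cuda.amp.autocast():
    out, _ = rnn(x)
print(out.dtype)  # expected: torch.float16
```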
Not sure who to ask for review, tagging ezyang and ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major [surgery](https://github.com/pytorch/pytorch/pull/12600).
Side quests discovered:
- Should we update [persistent RNN heuristics](dbdd28207c/aten/src/ATen/native/cudnn/RNN.cpp (L584)) to include compute capability 8.0? Could be another PR but seems easy enough to include.
- Many (maybe all?!) of the raw cudnn API calls in [RNN.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/RNN.cpp) are deprecated in cudnn 8. I don't mind taking the AI to update them since my mental cache is full of rnn stuff, but that would be a substantial separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42385
Reviewed By: zhangguanheng66
Differential Revision: D23077782
Pulled By: ezyang
fbshipit-source-id: a2afb1bdab33ba0442879a703df13dc87f03ec2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139
A bunch of tests were failing with buck since we would output to
stdout and buck would fail parsing stdout in some cases.
Moving these print statements to stderr fixes this issue.
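A minimal sketch of the pattern (the message text is illustrative):
```python
import sys

# Diagnostics go to stderr so tools that parse the test's stdout (e.g. buck)
# only see the output they expect.
print("INFO: initializing process group", file=sys.stderr)
```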
ghstack-source-id: 108606579
Test Plan: Run the offending unit tests.
Reviewed By: mrshenli
Differential Revision: D22779135
fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
Summary:
In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops.
The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework.
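A minimal sketch (illustrative class and test names, not the actual moved tests) of the device generic pattern the relocated tests follow:
```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import instantiate_device_type_tests

class TestSpectralOpsSketch(TestCase):
    # Tests take a `device` argument and get instantiated once per device type.
    def test_input_shape(self, device):
        x = torch.randn(8, 2, device=device)
        self.assertEqual(x.shape, torch.Size([8, 2]))

instantiate_device_type_tests(TestSpectralOpsSketch, globals())

if __name__ == "__main__":
    run_tests()
```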
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157
Reviewed By: albanD
Differential Revision: D22811096
Pulled By: mruberry
fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6
Summary:
Skipping the test `test_streams`, as it is flaky on ROCm.
cc: jeffdaily sunway513
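A hedged sketch of how such a skip typically looks in the test suite (class name here is illustrative; the decorator is assumed from `torch.testing._internal.common_utils`):
```python
from torch.testing._internal.common_utils import TestCase, run_tests, skipIfRocm

class TestCudaStreamsSketch(TestCase):
    @skipIfRocm  # flaky on ROCm, see the discussion above
    def test_streams(self):
        pass

if __name__ == "__main__":
    run_tests()
```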
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41697
Reviewed By: zhangguanheng66
Differential Revision: D22644600
Pulled By: malfet
fbshipit-source-id: b1b16d496e58a91c44c40d640851fd62a5d7393d
Summary:
The test asserts that the stream is "ready" but doesn't wait for the
event to be "executed", which makes it fail on some platforms where the
`query` call occurs "soon enough".
Fixes https://github.com/pytorch/pytorch/issues/38807
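A minimal sketch of the ordering the fix enforces (assumes a CUDA device):
```python
import torch

s = torch.cuda.Stream()
e = torch.cuda.Event()
with torch.cuda.stream(s):
    torch.randn(1 << 20, device="cuda").sum()
    e.record(s)
e.synchronize()   # wait until the recorded event has actually executed...
assert s.query()  # ...only then is the stream guaranteed to report "ready"
```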
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41398
Reviewed By: zhangguanheng66
Differential Revision: D22540012
Pulled By: ezyang
fbshipit-source-id: 6f56d951e48133ce4f6a9a54534298b7d2877c80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41330
`torch.cuda.check_error` is annotated as taking an `int` as argument but when running `torch.cuda.check_error(34)` one would get:
```
TypeError: cudaGetErrorString(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch._C._cudart.cudaError) -> str
Invoked with: 34
```
Even if one explicitly cast the argument, running `torch.cuda.check_error(torch._C._cudart.cudaError(34))` would give:
```
AttributeError: 'str' object has no attribute 'decode'
```
This PR fixes both issues (thus allowing `check_error` to be called with an un-casted int) and adds a test.
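Sketch of the behavior after the fix (assumes a CUDA build; 34 is just an arbitrary nonzero CUDA error code):
```python
import torch

torch.cuda.check_error(0)       # cudaSuccess: returns without raising
try:
    torch.cuda.check_error(34)  # plain int now accepted; raises with the cudart message
except RuntimeError as e:
    print(e)
```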
ghstack-source-id: 107628709
Test Plan: Unit tests
Reviewed By: ezyang
Differential Revision: D22500549
fbshipit-source-id: 9170c1e466dd554d471e928b26eb472a712da9e1
Summary:
Should close https://github.com/pytorch/pytorch/issues/35810.
I decided to keep sparse handling on the Python side for clarity, although it could be moved to the C++ side (into `_amp_non_finite_check_and_unscale_`) without much trouble.
For non-fp16 sparse grads the logic is simple: call `_amp_non_finite_check_and_unscale_` on `grad._values()` instead of `grad` itself. At least I hope it's that easy.
For fp16 sparse grads, it's trickier. Sparse tensors can be uncoalesced. From the [Note](https://pytorch.org/docs/master/sparse.html#torch.sparse.FloatTensor):
> Our sparse tensor format permits uncoalesced sparse tensors, where there may be duplicate coordinates in the indices; in this case, the interpretation is that the value at that index is the sum of all duplicate value entries.
An uncoalesced scaled fp16 grad may have values at duplicate coordinates that are all finite but large, such that adding them to make the coalesced version WOULD cause overflows.** If I checked `_values()` on the uncoalesced version, it might not report overflows, but I think it should.
So, if the grad is sparse, fp16, and uncoalesced, I still call `_amp_non_finite_check_and_unscale_` to unscale `grad._values()` in-place, but I also double-check the coalesced version by calling a second `_amp_non_finite_check_and_unscale_` on `grad.coalesce()._values()`. `coalesce()` is out-of-place, so this call doesn't redundantly affect `grad._values()`, but it does have the power to populate the same `found_inf` tensor. The `is_coalesced()` check and `coalesce()` probably aren't great for performance, but if someone needs a giant embedding table in FP16, they're better than nothing. Memory-wise, they only create a copy of the nnz gradient values+indices, which is still far better than changing the whole table to FP32.
An `unscale` variant with liberty to create unscaled grads out-of-place, and replace `param.grad` instead of writing through it, could get away with just one `_amp_non_finite_check_and_unscale_`. It could say `coalesced = grad.coalesce()`, do only the stronger `_amp_non_finite_check_and_unscale_` on `coalesced._values()`, and set `param.grad = coalesced`. I could even avoid replacing `param.grad` itself by going one level deeper and setting `param.grad`'s indices and values to `coalesced`'s, but that seems brittle and still isn't truly "in place".
** you could whiteboard an uncoalesced fp32 grad with the same property, but fp32's range is big enough that I don't think it's realistic.
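A tiny, self-contained illustration (assumes a CUDA device) of the fp16 overflow scenario described above: both duplicate values are finite, but the coalesced sum overflows FP16.
```python
import torch

i = torch.tensor([[0, 0]])                                  # duplicate coordinate
v = torch.tensor([60000.0, 60000.0], dtype=torch.float16)   # individually finite
g = torch.sparse_coo_tensor(i, v, (4,), device="cuda")
print(g._values())               # both values finite
print(g.coalesce()._values())    # inf: overflow appears only after summation
```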
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36786
Reviewed By: ezyang
Differential Revision: D22202832
Pulled By: ngimel
fbshipit-source-id: b70961a4b6fc3a4c1882f65e7f34874066435735
Summary:
Currently, a custom autograd function written with
```
@torch.cuda.amp.custom_fwd(cast_inputs=dtype)
def forward(ctx, *args):
    ...
```
casts incoming floating-point CUDA tensors to `dtype` unconditionally, regardless of whether the function executes in an autocast-enabled region. I think I had the wrong idea there. Autocast-disabled regions should give the user control of input types. Also, `custom_fwd(cast_inputs=dtype)`-decorated functions' behavior should align with native fp32list/fp16list functions. C++-side casting wrappers have no effect when autocast is disabled, and `custom_fwd`'s casting should behave the same way.
The present PR changes `custom_fwd` so it only casts in autocast-enabled regions (also updates custom_fwd to ignore fp64 inputs, like the C++ wrappers).
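A hedged end-to-end sketch of the post-PR behavior (assumes a CUDA device; the function itself is illustrative): inputs are cast to FP16 only inside an autocast-enabled region.
```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float16)
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

a = torch.randn(4, 4, device="cuda")
b = torch.randn(4, 4, device="cuda")
with torch.cuda.amp.autocast():
    print(MyMM.apply(a, b).dtype)  # torch.float16: inputs were cast
print(MyMM.apply(a, b).dtype)      # torch.float32: autocast disabled, inputs untouched
```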
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36171
Differential Revision: D22179511
Pulled By: ngimel
fbshipit-source-id: 5a93d070179a43206066bce19da0a5a19ecaabbd
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.
This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) support debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`):
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master. In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.
Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358
Differential Revision: D22165785
Pulled By: albanD
fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
Summary:
Currently, whether `AccumulateGrad` [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout. If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient. This may not sound like a big deal, but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.
The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).
Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads. This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.
For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout). The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes. I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.
Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix. The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.
This PR also updates autograd docs to describe potential BC-breaking changes below.
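A hedged sketch of the intended observable effect for a channels-last param (runs on CPU; the final check expresses the contract's expectation rather than a hard guarantee):
```python
import torch

conv = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)
x = torch.randn(4, 8, 32, 32).to(memory_format=torch.channels_last)
conv(x).sum().backward()
# Under the Gradient Layout Contract, the stashed grad is expected to match the
# param's (channels last) layout instead of being forced rowmajor contiguous.
print(conv.weight.grad.is_contiguous(memory_format=torch.channels_last))
```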
## BC notes
ngimel albanD gchanan
#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change. Any user code that was accustomed to `view(-1)`ing these grads will break.
Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed. In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params. Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous. IMO this is a mild BC breakage. Param backward hooks still see grads come in with whatever format the backward kernel gave them. The only BC breakage potential I see is if user code somehow relies on a grad in a hook sharing (or not sharing) memory with the eventual `param.grad`. Any such users hopefully know they're off the edge of the map and understand how to update their expectations.
#### BC escape hatches
At alband's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place. Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout. After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`. This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.
One limitation (present before this PR and unchanged by this PR): Presetting `param.grad` does not ensure in-place accumulation all the time. For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.
----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~ PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately. Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904
Differential Revision: D20496044
Pulled By: albanD
fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
Summary:
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import time
import torch
import numpy as np
for n, t in [(500_000, 10),
             (1_000_000, 10)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.from_numpy(np.random.rand(n)).to(dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        print(f'Took:', time.time() - start)
print('****' * 10)
for n, t in [(50_000, 100),
             (100_000, 100)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.rand(n, device='cuda', dtype=dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        # torch.cuda.synchronize()
        print(f'CUDA Took:', time.time() - start)
```
Before:
```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.64455389976501
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 3.7778031826019287
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 5.045570611953735
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.53191947937012
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 7.640851736068726
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 10.399673461914062
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 4.873984098434448
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 4.713594436645508
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 11.167185068130493
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 7.195427417755127
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 7.669712066650391
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 20.20938801765442
```
After:
```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 81.09321522712708
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 0.06062650680541992
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 0.0862889289855957
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.85304307937622
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 0.13271093368530273
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 0.17215657234191895
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 0.035035133361816406
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 0.03631949424743652
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 0.05507040023803711
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 0.05105161666870117
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 0.05449223518371582
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 0.09161853790283203
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39742
Differential Revision: D21976915
Pulled By: ngimel
fbshipit-source-id: 34431f814f31b6dfd6179a89f8e4fa574da7a306
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39156
TensorList is now supported for boxing, so we can remove the unboxed-only marking from it. I didn't check whether there were other operators that were incorrectly classified.
Fixes https://github.com/pytorch/pytorch/issues/38958
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21819821
Pulled By: ezyang
fbshipit-source-id: 6dcf91bc196554e1721d2c704f3bf524f069534b
Summary:
`_TestTorchMixin` is a base class which is instantiated across multiple types.
It inherited from `object` in order to hide it from the unittest test discovery mechanism.
But this approach makes it almost impossible to use a static code analyzer on the class.
This PR implements an alternative approach by hiding the base class inside an inner class, per https://stackoverflow.com/a/25695512
It also changes the imported class access path in `test_cuda.py`.
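A minimal sketch (hypothetical names) of the inner-class pattern: the mixin is no longer a module-level attribute, so unittest discovery skips it, while subclasses stay discoverable and the class stays analyzable:
```python
import unittest

class AbstractTestCases:
    # Not discovered: only module-level TestCase subclasses are collected.
    class _TestTorchMixin(unittest.TestCase):
        def test_something(self):
            self.assertTrue(True)

class TestTorchCPU(AbstractTestCases._TestTorchMixin):
    pass  # discovered and run

if __name__ == "__main__":
    unittest.main()
```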
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39110
Test Plan:
run `test_torch.py --discover-tests` and `test_cuda.py --discover-tests` before and after change:
```
$ python test_torch.py --discover-tests|md5sum
2ca437bb5d65700763ce04cdacf6de3e -
$ python test_cuda.py --discover-tests|md5sum
b17df916fb0eeb6f0dd7222d7dae392c -
```
Differential Revision: D21759265
Pulled By: malfet
fbshipit-source-id: b01b06111469e551f7b78387449975e5248f6b9e
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
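A short sketch of the resulting call pattern (the test body is illustrative):
```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests

class TestToleranceSketch(TestCase):
    def test_close(self):
        a = torch.tensor([1.0, 2.0])
        b = a + 1e-6
        # atol and rtol must now be given together, and msg is kwarg-only.
        self.assertEqual(a, b, atol=1e-4, rtol=0, msg="tensors diverged")

if __name__ == "__main__":
    run_tests()
```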
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21740237
Pulled By: mruberry
fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21717199
Pulled By: mruberry
fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
Summary:
This pull request disables the unit tests that were observed to be failing once `test2` was enabled. These tests will be looked at and fixed one by one as soon as possible, but until then they are disabled to unblock `test2`.
The pull request also disables fftPlanDestroy for rocFFT to avoid double-freeing FFT handles.
cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37427
Differential Revision: D21302909
Pulled By: ezyang
fbshipit-source-id: ecadda3778e65b7f4f97e24b932b96b9ce928616
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
The following code
```python
a = torch.randn(42,)
b = a.cuda(non_blocking=True)
```
will be **blocked** in the current master, and will **not be blocked** in the pytorch 1.4 release. This can be verified by a `nvprof --print-api-trace python script.py` profiling. It is causing a performance issue.
I isolated the problem, and jjsjann123 & ptrblck pointed out the fix. Thanks!
cc csarofeen ptrblck jjsjann123 VitalyFedyunin ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35144
Differential Revision: D20601163
Pulled By: ngimel
fbshipit-source-id: edd2b1dabd8e615c106188f30ddb3e763bde7471
Summary:
Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081
In-place ops and ops with user-supplied `out=...` can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/pull/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
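A minimal usage sketch of the eager API being added (an out-of-place op under autocast; assumes a CUDA device):
```python
import torch

a = torch.randn(8, 8, device="cuda")
b = torch.randn(8, 8, device="cuda")
with torch.cuda.amp.autocast():
    c = a @ b      # matmul runs in float16 under autocast
print(c.dtype)     # expected: torch.float16
```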
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32140
Differential Revision: D20346700
Pulled By: ezyang
fbshipit-source-id: 12d77b3917310186fbddf11c59b2794dc859131f
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).
Instead of having two separate and conflicting memory pools, with this PR CuPy can directly allocate memory from the PyTorch allocator by means of this proposal: https://github.com/cupy/cupy/pull/3126
We would like to gather feedback to know whether this approach makes sense for PyTorch, or whether other alternative designs would be preferable.
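For orientation, a hedged sketch of the PyTorch-side entry points involved. `torch.cuda.caching_allocator_alloc` / `caching_allocator_delete` are the present-day Python bindings to the caching allocator (they may postdate this PR); the CuPy side of the integration lives in cupy/cupy#3126.
```python
import torch

nbytes = 1024
ptr = torch.cuda.caching_allocator_alloc(nbytes)  # raw device pointer from PyTorch's pool
# ... an external library such as CuPy could wrap and reuse `ptr` here ...
torch.cuda.caching_allocator_delete(ptr)
```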
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860
Differential Revision: D20212788
Pulled By: ngimel
fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
Summary:
Also, the Windows memory failures responsible for the earlier reversion have been fixed.
This PR (initially) contains 2 commits:
* a revert of the revert
* all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review
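For context, a hedged sketch of where the scale-update heuristic fires in present-day user code (toy model and data; assumes a CUDA device):
```python
import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    x = torch.randn(8, 16, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skips the step if infs/NaNs were found
    scaler.update()          # grows or backs off the loss scale (the heuristic above)
```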
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366
Differential Revision: D20099026
Pulled By: ngimel
fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529
Summary:
Addresses https://github.com/pytorch/pytorch/issues/33300.
Calling .numpy() on a CUDA or non-strided (e.g. sparse) tensor segfaults in current PyTorch. This fixes the segfaults and throws the appropriate TypeError, as was intended.
Two tests, one in test_cuda.py and the other in test_sparse.py, are added to verify the behavior.
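Sketch of the intended post-fix behavior for the CUDA case (the sparse case is analogous):
```python
import torch

t = torch.randn(3, device="cuda")
try:
    t.numpy()        # previously segfaulted
except TypeError as e:
    print(e)         # now a TypeError suggesting .cpu() first
```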
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33612
Differential Revision: D20038210
Pulled By: mruberry
fbshipit-source-id: 265531dacd37c392232fd3ec763489a62ef54795
Summary:
IIUC Python does not guarantee when an object is garbage collected, so it is possible that some other test running before `TestCuda.test_memory_stats` creates an object which is only garbage collected during `TestCuda.test_memory_stats`, causing memory stats to change and the test to fail. This kind of failure is very hard to debug (it took me and mcarilli and ptrblck quite a while to figure out what is happening), and it is the root cause of mcarilli's gradient scaling PR https://github.com/pytorch/pytorch/pull/26512 failing on Windows.
cc: csarofeen
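A hedged sketch of the kind of stabilization this implies: collect leftover Python garbage (and optionally empty the CUDA cache) before sampling memory stats, so objects from earlier tests can't perturb them.
```python
import gc
import torch

gc.collect()               # drop objects whose collection timing Python doesn't guarantee
torch.cuda.empty_cache()   # release cached blocks so reserved-memory stats start clean
baseline = torch.cuda.memory_allocated()
print(baseline)
```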
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33575
Differential Revision: D20009260
Pulled By: ngimel
fbshipit-source-id: 62f2716aefac3aa6c7d1898aa8a78e6b8aa3075a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32863, (together with https://github.com/pytorch/pytorch/issues/33310 for the `TensorIterator` reductions)
This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.
I have a test for this locally but haven't included it here because `max` is much slower than `argmax`, to the point where the test takes several minutes to call `max` on just one `2**32`-element tensor. That seems excessive even for a slow test, but I can push it if preferred.
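The test sketched above would look roughly like this (hedged; it is slow and needs a GPU with more than ~4.3 GB free):
```python
import torch

# More than 2**32 elements, so 32-bit index math is insufficient.
t = torch.zeros(2, 2**31 + 1, dtype=torch.int8, device="cuda")
t[0, -1] = 1
values, indices = t.max(dim=1)
print(values)   # expected: tensor([1, 0], ...)
print(indices)  # expected: max position 2**31 for the first row, 0 for the second
```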
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33405
Differential Revision: D20010769
Pulled By: ezyang
fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3