pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Michael Carilli	e841f335aa	[RELAND] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#57373 ) Summary: https://github.com/pytorch/pytorch/pull/56433 was reverted because the test perceived internal dropout state creation as a memory leak. This PR resubmits with the leak check skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/57373 Reviewed By: anjali411 Differential Revision: D28152186 Pulled By: ezyang fbshipit-source-id: 9a593fcdbbabbb09dc4e4221191663e94b697503	2021-05-03 11:41:40 -07:00
Wenlei Xie	20085f6d23	Support auto generation of device check (#56872 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56872 ghstack-source-id: 127914018 Test Plan: auto test Reviewed By: ezyang Differential Revision: D27986429 fbshipit-source-id: 0da8413b0b8e6810fcea27ed1de499f11f68bd1f	2021-05-01 12:02:09 -07:00
Michael Carilli	bbc3cc6718	[CUDA graphs] [BC-breaking] Makes torch.cuda.amp.GradScaler scale updates in-place for better composability with graph capture (#55562 ) Summary: I'd like the following pattern (a natural composition of Amp with full fwd+bwd capture) to work: ```python # Create "static_input" with dummy data, run warmup iterations, # call optimizer.zero_grad(set_to_none=True), then g = torch.cuda._Graph() s.wait_stream(torch.cuda.current_stream()) with torch.cuda.stream(s): optimizer.zero_grad(set_to_none=True) g.capture_begin() with autocast(): out = model(static_input) loss = loss_fn(out) scaler.scale(loss).backward() g.capture_end() torch.cuda.current_stream().wait_stream(s) # Training loop: for b in data: # optimizer.zero_grad() deliberately omitted, replay()'s baked-in backward will refill statically held .grads static_input.copy_(b) g.replay() scaler.step(optimizer) scaler.update() ``` Right now `GradScaler` can't work with this pattern because `update()` creates the scale tensor for the next iteration out of place. This PR changes `update()` to act in place on a long-lived scale tensor that stays static across iterations. I'm not sure how this change affects XLA (see https://github.com/pytorch/pytorch/pull/48570), so we shouldn't merge without approval from ailzhang yaochengji. Tagged bc-breaking because it's a change to the amp update utility function in native_functions.yaml. The function was never meant to be user-facing though. Pull Request resolved: https://github.com/pytorch/pytorch/pull/55562 Reviewed By: zou3519 Differential Revision: D28046159 Pulled By: ngimel fbshipit-source-id: 02018c221609974546c562f691e20ab6ac611910	2021-04-30 13:03:05 -07:00
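The in-place requirement in the row above can be illustrated without a GPU. What follows is a minimal pure-Python sketch, not PyTorch's implementation: `ToyScaler`, the one-element list standing in for the scale tensor, and the `found_inf` argument are hypothetical names introduced for illustration. The growth and backoff constants follow `GradScaler`'s documented defaults.

```python
class ToyScaler:
    """Toy stand-in for a gradient scaler whose scale lives in one long-lived buffer."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        # A one-element list plays the role of the static scale tensor (assumption).
        self.scale_buf = [init_scale]
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_inf):
        # Mutate the existing buffer instead of rebinding it to a new object,
        # so anything that captured a reference to it keeps seeing fresh values.
        if found_inf:
            self.scale_buf[0] *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                self.scale_buf[0] *= self.growth_factor
                self.good_steps = 0


scaler = ToyScaler(growth_interval=2)
captured_ref = scaler.scale_buf      # what a captured graph would keep pointing at
scaler.update(found_inf=True)        # backoff: scale is halved
scaler.update(found_inf=False)
scaler.update(found_inf=False)       # two clean steps: scale is doubled again
print(captured_ref is scaler.scale_buf, captured_ref[0])
```

An out-of-place update would rebind `scale_buf` to a new object each iteration, leaving `captured_ref` stale. That is the failure mode the PR describes for a graph replaying against a fixed address.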
Nikita Shulga	0a30d64c83	Revert D27966444: [pytorch][PR] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout Test Plan: revert-hammer Differential Revision: D27966444 (`610c984d2e`) Original commit changeset: fe0df843c521 fbshipit-source-id: 8223b7f8b7183f0e7c9df6a7aa8f6b164e5634db	2021-04-28 14:51:10 -07:00
Michael Carilli	610c984d2e	[CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#56433 ) Summary: Cudnn rnn calls that use cudnn dropout maintain a "state" buffer across calls. [DropoutState](`fe3f6f2da2/aten/src/ATen/native/cudnn/RNN.cpp (L1388-L1402)`)'s lock() and unlock() ensure the current call's use of the state buffer syncs with the end of the previous call's use of the state buffer (in case the previous call was on a different stream). Telling a capturing stream to wait on an event recorded in a non-capturing stream is an error (1). Telling a non-capturing stream to wait on an event recorded during capture is also an error (2). So DropoutState's flow can error in either of two simple use cases: ```python rnn = nn.LSTM(512, 512, 2, dropout=0.5).cuda() out1 = rnn(in1) # calling cudnn rnn with dropout in capture after calling it uncaptured triggers 1 capture_stream.wait_stream(torch.cuda.current_stream()) with torch.cuda.stream(capture_stream): graph.capture_begin() out2 = rnn(in2) graph.capture_end() torch.cuda.current_stream().wait_stream(capture_stream) # calling cudnn rnn with dropout uncaptured after calling it in capture triggers 2 out3 = rnn(in3) ``` This PR fixes both cases by telling `DropoutState::lock()`: "if the most recent end-of-usage event was in a different capture state (i.e., we crossed a capturing<->noncapturing border) or in a different capture, don't sync on it." While considering the fix I had two assumptions in mind: - only one capture using the RNN can be underway at a time in this process - no noncapturing ops in this process are issuing RNN calls while the capture using the RNN is underway. That second assumption seems brittle if, for example, someone wants to capture an internal region of the forward method of a model wrapped with DataParallel: multiple threads could be issuing RNN calls with some currently capturing and some not. We should talk about whether that use case seems realistic. 
(Bigger-picture thoughts: I don't know if forcing calls to serialize on using the shared state buffer is the best design. And if we want to do it that way, we might as well run all cudnn rnns with dropout on a dedicated side stream synced with the surrounding stream (capturing or not), in which case I don't think this PR's event-handling diffs would be needed.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/56433 Reviewed By: heitorschueroff Differential Revision: D27966444 Pulled By: ezyang fbshipit-source-id: fe0df843c521e0d48d7f2c81a17aff84c5497e20	2021-04-28 12:52:03 -07:00
Michael Carilli	ffdecc1ac4	[CUDA graphs] Allows DeviceCachingAllocator to capture cross-stream memory use (#55860 ) Summary: Safely deallocating and repurposing memory used across streams relies on recording end-of-life events in all an allocation's usage streams beyond its original allocation stream. The events are later queried to see if all GPU work in those extra streams that could have used the allocation is done (from the CPU's perspective) before repurposing the allocation for use in its original stream. The trouble is, calling EventQuery on an ordinary event recorded in a capturing stream is illegal. Calling EventQuery while capture is underway is also illegal. So when we call `tensor.record_stream` (or `c10::cuda::cudaCachingAllocator::recordStream`) on any tensor that's used or deleted in or around a capture, we often end up with a confusing error thrown from the cudaEventQuery in DeviceCachingAllocator::process_events(). This PR enables hopefully-safe deletion of tensors used across streams in or around capture with a conservative but simple approach: don't record or process end of life events for such tensors until the allocator's sure no captures are underway. You could whiteboard cases where this causes cross-stream-used allocations to be unavailable for reuse longer than absolutely necessary, but cross-stream-used allocations are uncommon, so for practical purposes this approach's impact on the memory footprint of captured sequences should be small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/55860 Reviewed By: ejguan Differential Revision: D27822557 Pulled By: ezyang fbshipit-source-id: b2e18a19d83ed05bad67a8157a14a606ed14d04e	2021-04-18 20:32:10 -07:00
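The deferral strategy in the row above ("don't record or process end of life events for such tensors until the allocator's sure no captures are underway") can be modeled as a toy. This is a hedged pure-Python sketch rather than the DeviceCachingAllocator code: `ToyAllocator`, its method names, and the string block labels are all hypothetical, and every recorded "event" is treated as already complete.

```python
class ToyAllocator:
    """Toy model of deferring end-of-life event handling while a capture is underway."""

    def __init__(self):
        self.captures_underway = 0
        self.deferred = []        # cross-stream blocks freed during a capture
        self.pending_events = []  # blocks whose end-of-life "event" was recorded
        self.free_blocks = []     # blocks available for reuse

    def begin_capture(self):
        self.captures_underway += 1

    def end_capture(self):
        self.captures_underway -= 1

    def free_cross_stream(self, block):
        if self.captures_underway:
            # Recording an ordinary event now would be illegal, so only remember the block.
            self.deferred.append(block)
        else:
            self.pending_events.append(block)

    def process_events(self):
        if self.captures_underway:
            return  # querying events during capture would be illegal; do nothing
        # No capture is underway: deferred blocks can finally get their events.
        self.pending_events.extend(self.deferred)
        self.deferred.clear()
        # Toy simplification: assume every recorded event has completed.
        self.free_blocks.extend(self.pending_events)
        self.pending_events.clear()


alloc = ToyAllocator()
alloc.begin_capture()
alloc.free_cross_stream("block_A")
alloc.process_events()
during_capture = list(alloc.free_blocks)   # block_A is not reusable yet
alloc.end_capture()
alloc.process_events()
after_capture = list(alloc.free_blocks)    # block_A becomes reusable
print(during_capture, after_capture)
```

The cost the PR mentions shows up here as `block_A` staying unavailable for the whole capture, even if the work that used it finished earlier.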
Arindam Roy	4cfbb2401f	[ROCM] Re-enable 3 previously failing tests in test_cuda.py (#55813 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/53190 The following tests are passing in ROCM 4.1. Hence re-enabling them. test_grad_scaling_multigpu test_streaming_backwards_device_transfer test_streaming_backwards_multiple_streams Pull Request resolved: https://github.com/pytorch/pytorch/pull/55813 Reviewed By: yinghai Differential Revision: D27725547 Pulled By: ngimel fbshipit-source-id: d8b3ed69fa44c2086f0666b4db0fabb30ad59439	2021-04-13 01:09:11 -07:00
Yukio Siraichi	93bf0ae6fc	Remove legacy constructor calls from pytorch codebase. (#54142 ) Summary: Follow up from https://github.com/pytorch/pytorch/issues/53889 Related to https://github.com/pytorch/pytorch/issues/47112 Removing every occurrence of the legacy constructor call present in PyTorch at: - _docs_ - _benchmarks_ - _test_ - _caffe2_ - _CONTRIBUTING.md_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142 Reviewed By: ngimel Differential Revision: D27699450 Pulled By: mruberry fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546	2021-04-11 15:45:17 -07:00
Heitor Schueroff	5d68b3695c	[Relanding] Implemented torch.linalg.multi_dot (#52859 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52859 This reverts commit `92a4ee1cf6`. Added support for bfloat16 for CUDA 11 and removed fast-path for empty input tensors that was affecting autograd graph. Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D27402390 Pulled By: heitorschueroff fbshipit-source-id: 73c5ccf54f3da3d29eb63c9ed3601e2fe6951034	2021-04-01 04:49:05 -07:00
Kurt Mohler	6c235ef267	Allow `std=0` in `torch.normal`, and error if `std<0` (#51317 ) Summary: Part of https://github.com/pytorch/pytorch/issues/49998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51317 Reviewed By: bdhirsh Differential Revision: D27253939 Pulled By: mruberry fbshipit-source-id: af7a72c3d91549b1a88b73849b6973e7619dc50b	2021-03-31 21:06:07 -07:00
Kurt Mohler	3ddc6174da	Raise error in clip_grad_norm_ if norm is non-finite (#53843 ) Summary: BC-breaking note: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False` Fixes https://github.com/pytorch/pytorch/issues/46849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843 Reviewed By: malfet Differential Revision: D27291838 Pulled By: jbschlosser fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4	2021-03-29 08:41:21 -07:00
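The behavior change in the row above can be sketched in plain Python. This is an illustration of the semantics only, not the `torch.nn.utils.clip_grad_norm_` source: `clip_grad_norm_sketch` is a hypothetical helper operating on a flat list of floats, and its default for `error_if_nonfinite` simply follows the BC-breaking note above (check the installed release's documentation for the actual default).

```python
import math


def clip_grad_norm_sketch(grads, max_norm, error_if_nonfinite=True):
    """Return (total_norm, clipped_grads) for a flat list of float gradients."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if error_if_nonfinite and not math.isfinite(total_norm):
        # New behavior: fail loudly instead of silently producing garbage.
        raise RuntimeError(
            "The total norm of the gradients is non-finite, so it cannot be clipped."
        )
    # Scale only when the norm exceeds max_norm; the small epsilon avoids division by zero.
    clip_coef = min(1.0, max_norm / (total_norm + 1e-6))
    return total_norm, [g * clip_coef for g in grads]


norm, clipped = clip_grad_norm_sketch([3.0, 4.0], max_norm=1.0)
print(norm, clipped)

try:
    clip_grad_norm_sketch([1.0, float("inf")], max_norm=1.0)
except RuntimeError as err:
    print("raised:", err)

# Opting out restores the old silent behavior; the result then contains NaN.
_, silent = clip_grad_norm_sketch([1.0, float("inf")], max_norm=1.0,
                                  error_if_nonfinite=False)
print(silent)
```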
albanD	1126d51de9	Remove useless contiguous calls from torch.matmul (#54616 ) Summary: This reduces the memory usage of matmul significantly for expanded batch size. This reduces the peak memory usage of ``` a = torch.rand(1, 1024, 1024, device="cuda") b = torch.rand(1024, 1024, 1, device="cuda") out = torch.matmul(a, b) ``` From 4GB to 16MB which is not too bad. It also fixes the same problem when `b` is not batched. Pull Request resolved: https://github.com/pytorch/pytorch/pull/54616 Reviewed By: ailzhang Differential Revision: D27327056 Pulled By: albanD fbshipit-source-id: 4bb5f4015aeab4174148512f3c5b8d1ffa97bf54	2021-03-26 06:34:24 -07:00
Nikita Vedeneev	61b074581c	`torch.prod` backward for complex types. (#48125 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/53511 torch.det does depend on torch.prod, which in turn depends on several other functions, and they also depend on torch.prod, so there is a circular relationship, hence this PR will enable complex backward support for several functions at once. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48125 Reviewed By: pbelevich Differential Revision: D27188589 Pulled By: anjali411 fbshipit-source-id: bbb80f8ecb83a0c3bea2b917627d3cd3b84eb09a	2021-03-19 09:44:08 -07:00
Michael Carilli	b27e678dfb	[RELAND] [CUDA graphs] Private mempools for CUDA graphs (#54038 ) Summary: Resubmit of https://github.com/pytorch/pytorch/pull/51436. Apparently some non-public windows builds run cuda tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams. Pull Request resolved: https://github.com/pytorch/pytorch/pull/54038 Reviewed By: mruberry Differential Revision: D27068649 Pulled By: ngimel fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207	2021-03-16 12:13:33 -07:00
Natalia Gimelshein	76129c7cdf	Revert D26993790: [pytorch][PR] [CUDA graphs] Private mempools for CUDA graphs Test Plan: revert-hammer Differential Revision: D26993790 (`90dfdef226`) Original commit changeset: a992eaee1b8c fbshipit-source-id: 6ddb4aedd6154d7d89847aa5a34181158d06a309	2021-03-12 13:07:28 -08:00
Michael Carilli	90dfdef226	[CUDA graphs] Private mempools for CUDA graphs (#51436 ) Summary: Implements https://github.com/pytorch/pytorch/issues/51075#issuecomment-768884685 and additions discussed offline with ezyang ngimel . (Calling it "simple" is charitable but it's not too bad). [High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82) The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want. Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better. Graph bindings in Python are almost unchanged from https://github.com/pytorch/pytorch/pull/48875: ```python # Same bindings as 48875, but now implicitly grabs a private mempool graph1.capture_begin() graph1.capture_end() # pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool graph2.capture_begin(pool=graph1.pool()) graph2.capture_end() # graph3 also implicitly creates its own mempool graph3.capture_begin() graph3.capture_end() ``` Test plan (other suggestions appreciated): - [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other. - [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory. 
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory. - [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](https://github.com/pytorch/pytorch/issues/51075)). - [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/51436 Reviewed By: mruberry Differential Revision: D26993790 Pulled By: ngimel fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da	2021-03-12 11:07:47 -08:00
Jagadish Krishnamoorthy	ec6a7cace3	[ROCm] Fix the flaky test test_stream_event_nogil (#53850 ) Summary: Fix the flaky test in https://github.com/pytorch/pytorch/issues/53192 properly. Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/53850 Reviewed By: albanD Differential Revision: D26993582 Pulled By: malfet fbshipit-source-id: b0aefb188a236a5e94ee31a30ede7e8175443ff5	2021-03-11 16:07:41 -08:00
Jagadish Krishnamoorthy	0a549f9412	[ROCm] Disable flaky tests on ROCm (#53192 ) Summary: The disabled tests are tracked by https://github.com/pytorch/pytorch/issues/53190 Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/53192 Reviewed By: zhangguanheng66 Differential Revision: D26782204 Pulled By: mrshenli fbshipit-source-id: bc90b182c236249961da1f0d4894d29f6b44fa27	2021-03-11 08:29:12 -08:00
Edward Yang	758fb94fcb	Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276 - One of the tests had a syntax error (but the test wasn't fine grained enough to catch this; any error was a pass) - Doesn't work on ROCm Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D26820048 Test Plan: Imported from OSS Reviewed By: mruberry Pulled By: ezyang fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45	2021-03-05 17:36:01 -08:00
Edward Yang	cfd9360d09	Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26837780 Original commit changeset: 21567cab5c0f fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0	2021-03-04 20:45:35 -08:00
Edward Yang	1accffe450	Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26819810 Original commit changeset: e528260e1aa9 fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30	2021-03-04 18:48:56 -08:00
Edward Yang	9e5e5a7d96	Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26815021 Original commit changeset: 972eaafcdf14 fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607	2021-03-04 09:28:25 -08:00
Mike Ruberry	b864457743	Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26744062 (`12d63cc2f5`) Original commit changeset: be6d2653afe5 fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759	2021-03-04 04:11:25 -08:00
Edward Yang	12d63cc2f5	Add assert_async (#53086 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086 Fixes #36853 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D26744062 Pulled By: ezyang fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d	2021-03-03 16:18:07 -08:00
Kyle Chen	f2657d2e4f	[ROCm] Enable test cases in test_cuda.py for ROCm (#52739 ) Summary: Enabling four test cases in test_cuda.py for ROCm because they are passing. Signed-off-by: Kyle Chen <kylechen@amd.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/52739 Reviewed By: H-Huang Differential Revision: D26706321 Pulled By: ngimel fbshipit-source-id: 6907c548c4ac4e387f0eb7c646e8a01f0d036c8a	2021-03-01 12:54:40 -08:00
AJ San Joaquin	578f0a04c7	fix torch.nn.parallel.scatter_gather.gather to handle NamedTuples and handle moving output to CPU (#51104 ) Summary: Fixes [#50510](https://github.com/pytorch/pytorch/issues/50510) Allows `torch.nn.parallel.scatter_gather.gather` to accept a list of NamedTuples as input and return a NamedTuple whose elements are tensors. I added the author's fix using the `is_namedtuple` function. While testing this fix, I encountered a deprecation warning instructing me to use `'cpu'` instead of `-1` to move the outputs to the CPU. However, doing this causes an assertion error in the `_get_device_index` function. I solved this by handling the CPU case in the affected `forward` function. rohan-varma Pull Request resolved: https://github.com/pytorch/pytorch/pull/51104 Reviewed By: albanD Differential Revision: D26395578 Pulled By: rohan-varma fbshipit-source-id: 6e98c9ce1d9f1725973c18d24a6554c1bceae465	2021-02-11 15:50:28 -08:00
Chester Liu	58eb23378f	Clean up usage of torch._six partially (#49785 ) Summary: See https://github.com/pytorch/pytorch/issues/42919 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49785 Reviewed By: mruberry Differential Revision: D25963833 Pulled By: bugra fbshipit-source-id: 11c90d6b8d3f206c9d0a4d8621b773beb10c6ba2	2021-02-08 13:58:34 -08:00
Jagadish Krishnamoorthy	506fdf9abf	[ROCm] disable tests for ROCm 4.0.1 (#51510 ) Summary: These tests are failing for ROCm 4.0/4.0.1 release. Disable the tests until they are fixed. - TestCuda.test_cudnn_multiple_threads_same_device - TestCudaFuser.test_reduction Pull Request resolved: https://github.com/pytorch/pytorch/pull/51510 Reviewed By: H-Huang Differential Revision: D26205179 Pulled By: seemethere fbshipit-source-id: 0c3d29989d711deab8b5046b458c772a1543d8ed	2021-02-02 14:39:08 -08:00
Nikita Shulga	43f0ccd1ec	torch.cuda.memory_allocated to return `{}` if not initialized (#51179 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49952 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51179 Reviewed By: ngimel Differential Revision: D26094932 Pulled By: malfet fbshipit-source-id: 0ec28ef9b0604245753d3f2b0e3536286700668d	2021-01-28 20:38:17 -08:00
Jeffrey Wan	6e3e57095c	Add complex support for torch.nn.L1Loss (#49912 ) Summary: Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640) Things added in this PR: 1. Modify backward and double-backward formulas 2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1) 3. Modify some existing tests to support complex Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912 Reviewed By: zhangguanheng66 Differential Revision: D25853036 Pulled By: soulitzer fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad	2021-01-15 15:53:15 -08:00
Nikita Shulga	bf4fcab681	Fix SyncBatchNorm usage without stats tracking (#50126 ) Summary: In `batch_norm_gather_stats_with_counts_cuda` use `input.scalar_type()` if `running_mean` is not defined In `SyncBatchNorm` forward function create count tensor with `torch.float32` type if `running_mean` is None Fix a few typos Pull Request resolved: https://github.com/pytorch/pytorch/pull/50126 Test Plan: ``` python -c "import torch;print(torch.batch_norm_gather_stats_with_counts( torch.randn(1, 3, 3, 3, device='cuda'), mean = torch.ones(2, 3, device='cuda'), invstd = torch.ones(2, 3, device='cuda'), running_mean = None, running_var = None , momentum = .1, eps = 1e-5, counts = torch.ones(2, device='cuda')))" ``` Fixes https://github.com/pytorch/pytorch/issues/49730 Reviewed By: ngimel Differential Revision: D25797930 Pulled By: malfet fbshipit-source-id: 22a91e3969b5e9bbb7969d9cc70b45013a42fe83	2021-01-07 18:31:13 -08:00
Michael Carilli	ee271047b5	torch.utils.checkpoint.checkpoint + torch.cuda.amp (#49757 ) Summary: Adds a test to orphaned original PR (https://github.com/pytorch/pytorch/pull/40221). Should fix https://github.com/pytorch/pytorch/issues/49738 and https://github.com/pytorch/pytorch/issues/47183 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49757 Reviewed By: mruberry Differential Revision: D25689609 Pulled By: ngimel fbshipit-source-id: 0a6adc11eb98382048ef9a9775e185dcdeff6010	2020-12-22 22:25:11 -08:00
Nikita Shulga	befe337072	Fix test_cuda_init_race skip rules (#49693 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49693 Reviewed By: walterddr, janeyx99 Differential Revision: D25668027 Pulled By: malfet fbshipit-source-id: 802cbd39e4ebe585709179f332b680f5f7978814	2020-12-21 14:30:00 -08:00
Michael Carilli	c068180a17	[CUDA graphs] Cuda RNG-safe graph capture and replay bindings (#48875 ) Summary: Part 2 of https://github.com/pytorch/pytorch/pull/46148 refactor. (part 1 was https://github.com/pytorch/pytorch/pull/48694.) Contains - a few more CUDAGeneratorImpl diffs to clean up graph capture interaction - Capture and replay bindings that interact correctly with CUDAGeneratorImpl - Tests. Diffs compile and tests pass on my machine (ubuntu 20.04, cuda 11.0) but it needs finetuning for many CI builds. See [Note [CUDA Graph-safe RNG states]](`02d89f9f1d/aten/src/ATen/CUDAGeneratorImpl.h (L13-L85)`) for the strategy, based on https://github.com/pytorch/pytorch/pull/46148#issuecomment-724414794. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48875 Reviewed By: zou3519 Differential Revision: D25482654 Pulled By: ngimel fbshipit-source-id: 634dbc4c6c9d7d0d9a62dc81a52d430561f905fe	2020-12-14 10:51:58 -08:00
Jeff Daily	d5c4a80cfd	Allow ROCm CI to use non-default stream. (#48424 ) Summary: Revert https://github.com/pytorch/pytorch/issues/26394. Fixes https://github.com/pytorch/pytorch/issues/27356. Not all MIOpen handles were setting their stream to the current stream prior to running the op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48424 Reviewed By: H-Huang Differential Revision: D25420384 Pulled By: mruberry fbshipit-source-id: 051683ba9e3d264b71162bd344031a0c58bf6a41	2020-12-10 09:55:11 -08:00
x00480351	47aa253632	[Feature] Allow user to specify a fraction of the GPU memory. (#48172 ) Summary: Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: https://github.com/pytorch/pytorch/issues/18626 The fraction (a float from 0 to 1) limits the memory available to the caching allocator on a GPU device. It can be set on any visible GPU. The allowed memory equals total memory * fraction. An OOM error is raised when an allocation would exceed the allowed value. This function is similar to TensorFlow's per_process_gpu_memory_fraction. Note: this setting only limits the caching allocator in one process. If you are using multiprocessing, you need to apply this setting in each subprocess to limit its GPU memory, because each subprocess can have its own allocator. ## usage In some cases, one needs to split a GPU device into two parts. Set the limit before any GPU memory is used. E.g., for device 0 with each part taking half the memory, the code is as follows: ``` torch.cuda.set_per_process_memory_fraction(0.5, 0) ``` The following example shows the behavior. ```python import torch torch.cuda.set_per_process_memory_fraction(0.5, 0) torch.cuda.empty_cache() total_memory = torch.cuda.get_device_properties(0).total_memory # less than 0.5 will be ok: tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda') del tmp_tensor torch.cuda.empty_cache() # this allocation will raise an OOM: torch.empty(total_memory // 2, dtype=torch.int8, device='cuda') """ It raises an error as follows: RuntimeError: CUDA out of memory. 
Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch) """ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172 Reviewed By: bdhirsh Differential Revision: D25275381 Pulled By: VitalyFedyunin fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f	2020-12-03 11:45:56 -08:00
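The rule stated in the row above (allowed memory equals total memory * fraction) can be checked with plain arithmetic. The sketch below is a hedged illustration that needs no GPU: `allowed_bytes`, `would_exceed_limit`, and the 12 GiB device size are hypothetical, and the real caching allocator may round requests up, so requests close to the limit can fail there even when this toy check passes.

```python
def allowed_bytes(total_memory, fraction):
    """Memory budget implied by a per-process fraction of a device's total memory."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return int(total_memory * fraction)


def would_exceed_limit(request, already_reserved, total_memory, fraction):
    """True if serving `request` bytes would push this process past its budget."""
    return already_reserved + request > allowed_bytes(total_memory, fraction)


total = 12 * 1024 ** 3                      # hypothetical 12 GiB device
budget = allowed_bytes(total, 0.5)          # half of the device
small_ok = would_exceed_limit(int(total * 0.499), 0, total, 0.5)
too_big = would_exceed_limit(total // 2 + 1, 0, total, 0.5)
print(budget, small_ok, too_big)
```

Because the budget is tracked per allocator, two processes that each set a fraction of 0.5 on the same device are limited independently, which is why the commit message recommends applying the setting inside each subprocess.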
pbialecki	22c3ae8b57	Disable autocast cache for tensor views as fix for #48049 (#48696 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/48049 Root cause of the issue explained [here](https://github.com/pytorch/pytorch/issues/48049#issuecomment-736701769). This PR implements albanD's suggestion to add the `!t.is_view()` check and disable autocast caching for views of tensors. The added test checks for an increase in memory usage by comparing the initially allocated memory with the memory after 3 iterations using a single `nn.Linear` layer in a `no_grad` and `autocast` context. After this PR the memory usage in the original issue doesn't grow anymore and yields: ```python autocast: True 0: 0MB (peak 1165MB) 1: 0MB (peak 1264MB) 2: 0MB (peak 1265MB) 3: 0MB (peak 1265MB) 4: 0MB (peak 1265MB) 5: 0MB (peak 1265MB) 6: 0MB (peak 1265MB) 7: 0MB (peak 1265MB) 8: 0MB (peak 1265MB) 9: 0MB (peak 1265MB) ``` CC ngimel mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/48696 Reviewed By: bdhirsh Differential Revision: D25276231 Pulled By: ngimel fbshipit-source-id: e2571e9f166c0a6f6f569b0c28e8b9ca34132743	2020-12-02 20:25:13 -08:00
Jeff Daily	5dfced3b0d	work around #47028 until a proper fix is identified (#48405 ) Summary: Otherwise, this test will appear flaky for ROCm even though it is a generic PyTorch issue. CC albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/48405 Reviewed By: mrshenli Differential Revision: D25183473 Pulled By: ngimel fbshipit-source-id: 0fa19b5497a713cc6c5d251598e57cc7068604be	2020-11-26 18:33:19 -08:00
Gao, Xiang	315122ce15	Bump up the CUDA OOM test memory size (#48029 ) Summary: 80GB is no longer large https://nvidianews.nvidia.com/news/nvidia-doubles-down-announces-a100-80gb-gpu-supercharging-worlds-most-powerful-gpu-for-ai-supercomputing Hopefully, the new size could be OK until the end of Moore's Law :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/48029 Reviewed By: linbinyu Differential Revision: D25003603 Pulled By: zou3519 fbshipit-source-id: 626b9c031daee950df8453be4d7643dd67647213	2020-11-17 11:16:31 -08:00
Jeff Daily	6906701bde	[ROCm] enable stream priorities (#47136 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47136 Reviewed By: mruberry Differential Revision: D24672457 Pulled By: ngimel fbshipit-source-id: 54f60c32df87cbd40fccd7fb1ecf0437905f01a3	2020-11-02 11:25:44 -08:00
Michael Carilli	3c643d112e	Pin destination memory for cuda_tensor.to("cpu", non_blocking=True) (#46878 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/39694. [`torch.cuda._sleep(int(100 * get_cycles_per_ms()))`](https://github.com/pytorch/pytorch/pull/46878/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R511-R513) in the test helps avoid flakiness noted by ngimel (https://github.com/pytorch/pytorch/pull/35144#issuecomment-602103631). Pull Request resolved: https://github.com/pytorch/pytorch/pull/46878 Reviewed By: izdeby Differential Revision: D24550403 Pulled By: xw285cornell fbshipit-source-id: 1ecc35ef75f9a38ab332aacdf4835955105edafc	2020-10-29 15:42:55 -07:00
Jeff Daily	151f31ba27	remove event not ready assertion from TestCuda.test_copy_non_blocking (#46857 ) Summary: It is incorrect to assume that a newly recorded event will immediately query as False. This test is flaky on ROCm due to this incorrect assumption. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46857 Reviewed By: albanD Differential Revision: D24565581 Pulled By: mrshenli fbshipit-source-id: 0e9ba02cf52554957b29dbeaa5093696dc914b67	2020-10-27 14:21:40 -07:00
anjali411	d94bd998ec	Update backward formulas (Re #44444 ) (#46275 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46275 Re #44444 Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D24285785 Pulled By: anjali411 fbshipit-source-id: c60ecd4fe4f144132085f2c91d3b950e92b2a491	2020-10-25 19:40:59 -07:00
ashish	88e94da580	Enable softmax and tiny norm FP16 tests on ROCm (#46363 ) Summary: This pull request enables the following tests on ROCm: * TestCuda.test_tiny_half_norm_ * TestNNDeviceTypeCUDA.test_softmax_cuda_float16 * TestNNDeviceTypeCUDA.test_softmax_cuda_float32 * TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16 * TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32 The earlier failures, because of which the tests were skipped, were because of a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8. The pull request fixes https://github.com/pytorch/pytorch/issues/37493 cc: jeffdaily ezyang malfet mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363 Reviewed By: heitorschueroff Differential Revision: D24325639 Pulled By: ezyang fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9	2020-10-22 19:40:00 -07:00
Richard Barnes	52a970bac9	Minor cleaning of `test_cuda.py` (#46617 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46617 Sort includes, fix deprecated test warning Test Plan: ``` buck run mode/dev-nosan //caffe2/test:cuda ``` Reviewed By: drdarshan Differential Revision: D24429247 fbshipit-source-id: 65f53d7c904032e5c8f8ca45d1d2bb437358ffdd	2020-10-22 09:03:30 -07:00
Alexander Grund	5b0f400488	Replace list(map(...)) constructs by list comprehensions (#46461 ) Summary: As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant. It also fixes a bug detected by this where the argument order of `map` was confused: `030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)` Fixes https://github.com/pytorch/pytorch/issues/46392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461 Reviewed By: ailzhang Differential Revision: D24367015 Pulled By: ezyang fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7	2020-10-19 18:42:49 -07:00
Michael Carilli	5640b79bf8	Allow consumer ops to sync on GraphRoot's gradient (#45787) Summary: Currently, a GraphRoot instance doesn't have an associated stream. Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream. If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition. The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
    # and the side_stream context is irrelevant. GraphRoot's interaction with its first consumer(s) is the spot where
    # the side_stream context causes a problem.
```
This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.) The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.
With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787 Reviewed By: nairbv Differential Revision: D24138376 Pulled By: albanD fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3	2020-10-07 08:53:53 -07:00
Rohan Varma	f8c1ca5dd8	Enable NamedTuple data type to work with DDP (#44220 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44220 Closes https://github.com/pytorch/pytorch/issues/44009 Currently if a dataloader returns objects created with a collections.namedtuple, this will incorrectly be cast to a tuple. As a result, if we have data of these types, there can be runtime errors during the forward pass if the module is expecting a named tuple. Fix this in `scatter_gather.py` to resolve the issue reported in https://github.com/pytorch/pytorch/issues/44009 ghstack-source-id: 113423287 Test Plan: CI Reviewed By: colesbury Differential Revision: D23536752 fbshipit-source-id: 3838e60162f29ebe424e83e474c4350ae838180b	2020-10-02 13:33:08 -07:00
Michael Carilli	72bc3d9de4	Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778 ) Summary: Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work! Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293). The PR also modifies Unary/Binary/Pointwise Functors to - do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about. - accept an instantiated op functor rather than an op functor template (`template<class> class Op`). This allows calling code to pass lambdas. Open question: As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops. However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control. I can easily rewrite it that way if you prefer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778 Reviewed By: gchanan Differential Revision: D23944102 Pulled By: izdeby fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d	2020-10-01 07:51:16 -07:00
Nikita Shulga	c3a5aed5f7	Run pytorch_core CUDA tests on GPU using TPX Summary: Modify contbuild to disable sanitizers, add option to run "cuda" test using TPX RE (Note: this ignores all push blocking failures!) Test Plan: CI Reviewed By: walterddr, cspanda Differential Revision: D23854578 fbshipit-source-id: 327d7cc3655c17034a6a7bc78f69967403290623	2020-09-24 12:12:23 -07:00
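
Note on f8c1ca5dd8 (NamedTuple with DDP) above: the problem it describes can be illustrated without PyTorch. Rebuilding a container with `tuple(...)` drops the namedtuple type, while rebuilding with `type(obj)(*fields)` keeps it. The following is a minimal sketch under stated assumptions, not the code in `scatter_gather.py`: `Batch`, `is_namedtuple`, and `split_fields` are hypothetical names, and plain lists stand in for tensors.

```python
from collections import namedtuple

# Hypothetical batch type, standing in for what a dataloader might return.
Batch = namedtuple("Batch", ["inputs", "labels"])


def is_namedtuple(obj):
    """Heuristic check: a tuple subclass that exposes _fields."""
    return isinstance(obj, tuple) and hasattr(obj, "_fields")


def split_fields(obj, num_chunks):
    """Split every field of obj into num_chunks pieces, one container per chunk.

    Lists stand in for tensors; each field is cut into contiguous chunks.
    """
    def chunk(seq):
        size = (len(seq) + num_chunks - 1) // num_chunks  # ceiling division
        return [seq[i * size:(i + 1) * size] for i in range(num_chunks)]

    per_field = [chunk(field) for field in obj]
    per_chunk = zip(*per_field)
    if is_namedtuple(obj):
        # Rebuild with the original class so field names survive.
        return [type(obj)(*fields) for fields in per_chunk]
    # Plain tuples are rebuilt as plain tuples.
    return [tuple(fields) for fields in per_chunk]


batch = Batch(inputs=[1, 2, 3, 4], labels=[0, 1, 0, 1])
shards = split_fields(batch, 2)
```

With this sketch, each element of `shards` is still a `Batch`, so code that reads `shard.inputs` or `shard.labels` keeps working after the split. Without the `is_namedtuple` branch, the attribute access would fail on a plain tuple.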

449 Commits