Summary:
This diff restores the previous behavior of silently allowing overflow when inserting instructions. The behavior was changed recently in https://github.com/pytorch/pytorch/issues/45382, but that started to break some existing use cases that have overflow problems.
This restores the original behavior but throws a warning, to unblock existing use cases where overflow happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46369
Reviewed By: kwanmacher, wanchaol, fbhuba
Differential Revision: D24324345
Pulled By: gmagogsfm
fbshipit-source-id: 1c0fac421d4de38f070e21059bbdc1b788575bdf
Summary:
Opset 9 only supports a no-op squeeze (where the squeezed dim is not 1) under static axes.
Updating the test case that was setting dynamic axes.
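As a rough illustration (a hedged sketch, not the exact test from this PR), the case in question is a squeeze whose target dim is not size 1, exported with static shapes:
```python
import torch

class NoOpSqueeze(torch.nn.Module):
    def forward(self, x):
        # dim 1 has size 3, so this squeeze is a no-op
        return x.squeeze(1)

x = torch.randn(2, 3, 4)
# With static shapes, opset 9 can export the no-op squeeze; marking that dim
# as dynamic via dynamic_axes requires a newer opset.
torch.onnx.export(NoOpSqueeze(), x, "squeeze.onnx", opset_version=9)
```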
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45369
Reviewed By: anjali411
Differential Revision: D24280180
Pulled By: bzinodev
fbshipit-source-id: d7cda88ab338a1c41a68052831dcebe739a3843c
Summary:
When the input to an indexing operation is a boolean, for example array[True] = value,
the resulting index_put node needs to be converted to a masked_scatter or masked_fill node based on the type of the value being assigned. If that value is a single scalar, we use masked_fill; if the value is a tensor of appropriate size, we use masked_scatter.
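A minimal sketch of the two paths (hypothetical shapes, not the test case from the PR):
```python
import torch

x = torch.zeros(3)
x[True] = 1.0                          # scalar value: lowered via masked_fill
y = torch.zeros(3)
y[True] = torch.tensor([1., 2., 3.])   # tensor value: lowered via masked_scatter
```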
Fixes https://github.com/pytorch/pytorch/issues/34054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45584
Reviewed By: VitalyFedyunin
Differential Revision: D24116921
Pulled By: bzinodev
fbshipit-source-id: ebd66e06d62e15f0d49c8191d9997f55edfa520e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45942
We only need to keep track of this for traversing the autograd graph
when find_unused_parameters=True. Without that, we populate and keep this
mapping in memory, which occupies sizeof(pointer) * number of grad accumulators
of extra memory.
ghstack-source-id: 114219289
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D24154407
fbshipit-source-id: 220d723e262f36590a03a3fd2dab47cbfdb87d40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221
The RPC framework only allowed sending RPCs based on provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks since DDP doesn't support names yet.
As a result, it would be helpful to support a `to` parameter in the RPC APIs that also
allows specifying a rank.
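A minimal sketch of what this enables (assumes `init_rpc` has already been called on two workers):
```python
import torch
import torch.distributed.rpc as rpc

def add(a, b):
    return a + b

# rpc.init_rpc("worker0", rank=0, world_size=2)  # assumed to have run already
# Previously `to` had to be a worker name or WorkerInfo; with this change an
# integer rank is accepted as well.
fut = rpc.rpc_async(1, add, args=(torch.ones(2), torch.ones(2)))
print(fut.wait())
```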
ghstack-source-id: 114207172
Test Plan:
1) waitforbuildbot
2) Unit Tests
Reviewed By: mrshenli
Differential Revision: D24264989
fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46116
Ideally I would just use one of the existing preprocessor flags such as `FBCODE_CAFFE2`, but this implies a whole bunch of other things elsewhere, so it is not really a solution for ovrsource.
Test Plan: CI green, we are able to disable it internally with `-DNVALGRIND`
Reviewed By: malfet
Differential Revision: D24227360
fbshipit-source-id: 24a3b393cf46d6a16acca0a9ec52610d4bb8704f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46250
Previously the type of GetAttr nodes was getting set incorrectly and wasn't matching the module type.
Test Plan:
Existing quantization tests
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D24279872
fbshipit-source-id: 2b2e3027f6e9ad8ba9e9b7937bd5cc5daaf6e17c
Summary:
The record_stream method was hard-coded for the CUDA device. Defining record_stream in native_functions.yaml enables dynamic dispatch to different backend devices.
Fixes https://github.com/pytorch/pytorch/issues/36556
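The user-facing call is unchanged; a minimal sketch of typical usage:
```python
import torch

x = torch.empty(1024, device="cuda")
side = torch.cuda.Stream()
with torch.cuda.stream(side):
    y = x * 2
# Inform the caching allocator that x is in use on `side`, so its memory is
# not reclaimed and reused before the work queued on `side` finishes.
x.record_stream(side)
```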
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301
Reviewed By: glaringlee
Differential Revision: D23763954
Pulled By: ezyang
fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112
### Summary
This PR adds support for running TorchScript models on iOS GPU via Metal (inference only). The feature is currently in a prototype state; API changes are expected. The tutorial and documentation will be added once it goes to beta.
allow-large-files
- Users API
```
auto module = torch::jit::load(model);
module.eval();
at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
- Person Segmentation v106 (FB Internal)
- Mobilenetv2
- Supported Operators
- aten::conv2d
- aten::addmm
- aten::add.Tensor
- aten::sub.Tensor
- aten::mul.Tensor
- aten::relu
- aten::hardtanh
- aten::hardtanh_
- aten::sigmoid
- aten::max_pool2d
- aten::adaptive_avg_pool2d
- aten::reshape
- aten::t
- aten::view
- aten::log_softmax.int
- aten::upsample_nearest2d.vec
- Supported Devices
- Apple A9 and above
- iOS 10.2 and above
- CMake scripts
- `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`
### Test Plan
- Circle CI
ghstack-source-id: 114155638
Test Plan:
1. Sandcastle CI
2. Circle CI
Reviewed By: dreiss
Differential Revision: D23236555
fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen which could break when using Half floats:
1. The Registerizer will generate new scalars with the type of the load being replaced, and doesn't have Cuda specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load casting rules.
2. We're not handling explicit casts to Half inserted by the user (in the report the user being the JIT). Addressing this by replacing these with casts to Float, since that's the type we do Half math in.
Fixes https://github.com/pytorch/pytorch/issues/45953.
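A hedged sketch of the kind of scripted Half computation that exercises this path (not the exact repro from the issue):
```python
import torch

@torch.jit.script
def fused(x, y):
    return torch.sigmoid(x * y + y)

a = torch.rand(1024, device="cuda", dtype=torch.half)
b = torch.rand(1024, device="cuda", dtype=torch.half)
# The fused CUDA kernel does its math in float; previously the generated code
# could still contain Half-typed scalars or casts and fail to compile.
out = fused(a, b)
```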
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129
Reviewed By: glaringlee
Differential Revision: D24253639
Pulled By: nickgg
fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46036
Previously, this function didn't do error-bounds checking on the GetItem (GET_ITEM) calls, which led to issues like https://github.com/pytorch/pytorch/issues/46020.
A better solution would be to use pybind, but given that writing the file is going to dominate the cost of bounds checking, this is strictly better.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24228370
Pulled By: gchanan
fbshipit-source-id: f5d0a3d21ff12b4380beefe1e9954fa81ea2f567
Summary:
Fixes a crash bug in the IRSimplifier when the LHS is a Term (e.g. 2x) and the RHS is a Polynomial (e.g. 2x+1).
This case crashes 100% of the time so I guess it's not very common in models we've been benchmarking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46108
Reviewed By: agolynski
Differential Revision: D24226593
Pulled By: nickgg
fbshipit-source-id: ef454c855ff472febaeba16ec34891df932723c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933
Occasionally users run DDP with models that have unused params; in this
case we would like to surface an error message telling them to run with
find_unused_parameters=True. However, a recent change to the rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size-mismatch error when this happens, but the
information about unused parameters is likely to be more useful and is likely
the most common cause of failure. Prefer raising this error over the
subsequent size-mismatch errors.
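A minimal sketch of the failure mode this targets (hypothetical model; assumes the process group is already initialized):
```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never used in forward()

    def forward(self, x):
        return self.used(x)

model = DDP(Net().cuda())  # find_unused_parameters defaults to False
loss = model(torch.randn(2, 4).cuda()).sum()
# Training this model now surfaces the "unused parameters" error (suggesting
# find_unused_parameters=True) instead of a bucket size-mismatch error.
loss.backward()
```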
ghstack-source-id: 113914759
Test Plan: Added unittest
Reviewed By: mrshenli
Differential Revision: D24151256
fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45926
torch/csrc/cuda/nccl.cpp is compiled as part of the torch_cuda library, so calling this function from ProcessGroupNCCL.cpp avoids linking a second instance of libnccl.a into torch_python.
Fixes a similar issue to https://github.com/pytorch/pytorch/issues/42517
ghstack-source-id: 113910530
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24147802
fbshipit-source-id: d8901fdb31bdc22ddca2364f8050844639a1beb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783
After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
as some tests take more than 60s to finish.
This commit extracts the tensor device checking logic out of pipeWrite
and makes sure the error is thrown before the active call count is
incremented.
Differential Revision: D24094803
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: mrshenli
fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003
sparse is confusing because it is used in training for sparse gradients
Test Plan: Imported from OSS
Reviewed By: radkris-git, qizzzh
Differential Revision: D24178248
fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45997
The sparse field currently used in the float module is for sparse gradients, which is not applicable
to inference. The sparse field in the quantized ops denotes pruned weights.
Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag
Imported from OSS
Reviewed By: qizzzh
Differential Revision: D24176543
fbshipit-source-id: a05b4ff949e0375462ae411947f68076e1b460d2
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45558
This assertion failure is caused by the incorrect implementation of ``aten::set_grad_enabled`` in [torch/csrc/jit/runtime/register_special_ops.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_special_ops.cpp#L436). The current implementation is:
```cpp
Operator(
"aten::set_grad_enabled(bool val) -> ()",
[](Stack* stack) {
torch::GradMode::set_enabled(pop(stack).toBool());
push(stack, IValue());
},
aliasAnalysisConservative()),
```
which pushes a ``None`` on to the evaluation stack after calling ``set_enabled``. According to the signature, this behavior is incorrect: the function is not supposed to return a value. I guess the original author might have been confused by the behavior of Python, which pushes a ``None`` on to the evaluation stack when the function definition does not end with a return statement with an explicit result value.
If ``aten::set_grad_enabled`` pushes a ``None`` on to the evaluation stack, each time it's called, the evaluation stack will accumulate an extra ``None``. In our case, ``with torch.no_grad():`` will cause ``aten::set_grad_enabled`` to be called twice, so when the ``forward`` method finishes, the evaluation stack will be ``[None, None, Tensor]``. But the return statement of ``GraphFunction::operator()`` in [torch/csrc/jit/api/function_impl.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/api/function_impl.cpp#L51) is ``return stack.front();`` which will try to extract a tensor out of a ``None`` thus causes the assertion failure.
The solution is simple: just remove the push in the implementation of ``aten::set_grad_enabled``.
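For reference, a minimal sketch of code that hits the assertion (along the lines of the linked issue, not copied from it):
```python
import torch

@torch.jit.script
def f(x):
    with torch.no_grad():  # lowered to two aten::set_grad_enabled calls
        y = x * 2
    return y

# Each call previously pushed a stray None, so the value popped off the top of
# the stack at return time could be a None instead of the result tensor.
print(f(torch.ones(3)))
```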
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45559
Reviewed By: albanD
Differential Revision: D24142153
Pulled By: SplitInfinity
fbshipit-source-id: 75aad0e38bd912a437f7e1a1ee89ab4445e35b5d
Summary:
Adds a new transform to the NNC compiler, which adds support for buffer access caching. All accesses within a provided scope are redirected to a cache which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:
Before:
```
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
A[i, j] = i * j;
}
}
for (int i_1 = 0; i_1 < 20; i_1++) {
for (int j_1 = 0; j_1 < 10; j_1++) {
B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
}
}
```
After `cacheAccesses(A->buf(), "A_local", j_loop);`
```
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
A[i, j] = i * j;
}
}
for (int i_1 = 0; i_1 < 20; i_1++) {
for (int i_2 = 0; i_2 < 2; i_2++) {
for (int j_1 = 0; j_1 < 11; j_1++) {
A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
}
}
for (int j_2 = 0; j_2 < 10; j_2++) {
B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
}
}
```
Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
sum[l1] = 0.f;
for (int n1_1 = 0; n1_1 < 3; n1_1++) {
for (int m1_1 = 0; m1_1 < 2; m1_1++) {
sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
}
}
}
```
After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:
```
for (int l1 = 0; l1 < 4; l1++) {
Allocate(d_local, float, {1});
sum[l1] = 0.f;
d_local[0] = 0.f;
for (int n1_1 = 0; n1_1 < 3; n1_1++) {
for (int m1_1 = 0; m1_1 < 2; m1_1++) {
d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
}
}
sum[l1] = (sum[l1]) + (d_local[0]);
Free(d_local);
}
```
I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. Didn't feel too useful so I removed them, but let me know.
This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.
While working on this I realized that it overlaps heavily with `computeAt`: which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM opinions on this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869
Reviewed By: mruberry
Differential Revision: D24195276
Pulled By: nickgg
fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45791
Most of the lowering for log1p and lgamma already existed, add JIT integration.
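A minimal sketch (illustrative only) of a scripted function the fuser can now handle:
```python
import torch

@torch.jit.script
def f(x):
    return torch.lgamma(x) + torch.log1p(x)

# After profiling warm-up runs, the tensor-expression fuser can take over and
# lower log1p and lgamma directly.
x = torch.rand(1024, device="cuda")
for _ in range(3):
    out = f(x)
```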
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169536
Pulled By: eellison
fbshipit-source-id: a009c77a3471f3b5d378bad5de6d8e0880e9da3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45790
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169534
Pulled By: eellison
fbshipit-source-id: a2666df53fbb12c64571e960f59dbe94df2437e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45789
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D24169535
Pulled By: eellison
fbshipit-source-id: 54d7af434772ba52144b12d15d32ae30460c0c3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788
We were only running the traced graph once, which would not yet have been fused at that point. We should run for num_profiled_runs + 1, and also assert that all nodes in the graph were fused.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169537
Pulled By: eellison
fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847
Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24136629
Pulled By: heitorschueroff
fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
Summary:
This enables the cuda fuser on ROCm and enables tests for them.
Part of this patch is based on work of Rohith Nallamaddi, thank you.
Errors are my own, of course.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45965
Reviewed By: seemethere
Differential Revision: D24170457
Pulled By: walterddr
fbshipit-source-id: 3dd25b3501a41d2f00acba3ce8642ce51c49c9a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45892
Previously we were using a hashtable (`std::unordered_map` in OSS, `folly::F14FastMap` in fb) for the workspace, a container for all the IValues in the graph. Hashtable-based lookups can be expensive. This diff replaces the hashtable with `std::vector`, and extra bookkeeping is introduced to keep track of the indices of graph inputs/outputs in `StaticRuntime` and op inputs/outputs in `ProcessedNode`.
Reviewed By: dzhulgakov
Differential Revision: D24098763
fbshipit-source-id: 337f835ee144985029b5fa2ab98f9bcc5e3606b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45948
No functionality changes expected, it's just a preparation for further
changes in the LoopNest interface.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24156000
Pulled By: ZolotukhinM
fbshipit-source-id: f95ab07aac0aba128bc4ed5376a3251ac9c31c06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45946
Also, make these functions static - they are not using anything from
`LoopNest` and can be applied to any `Stmt`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24156002
Pulled By: ZolotukhinM
fbshipit-source-id: 1c7d205f85a2a1684e07eb836af662f10d0a50fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45936
`Tensor` has been a view into a `Function` that was supposed to be used
for a more general case when we have multiple computations over the same
domain (aka multiple output functions). We have never got to a point
where we need this and now have other ideas in mind on how to support
this case if need be. For now, let's just nuke `Function` to reduce the
overall system complexity.
The change should not affect any existing behavior.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24153214
Pulled By: ZolotukhinM
fbshipit-source-id: 26d5f11db5d661ff5e1135f4a49eff1c6d4c1bd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900
Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`
Fixes https://github.com/pytorch/pytorch/issues/42517
Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                      ^
    |                                      |
    --------> libc10d.a -------------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`|`ncclRecv` to ProcessGroupNCCL.cpp forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.
This PR adds `nccl[Send|Recv]` call to `torch_cuda.so` by implementing `all2all` in `torch_cuda` and thus avoids double linking the static library.
A more involved, but more robust, solution would be to use wrappers exported in the `torch::cuda::nccl` namespace, instead of making direct NCCL API calls.
Test Plan: Imported from OSS
Reviewed By: mingzhe09088
Differential Revision: D24138011
Pulled By: malfet
fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
Summary:
This is a rewrite of the Registerizer, supporting scalar replacement in *vastly* more situations. As a refresher, the registerizer does this:
Before:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
A[0] = (A[0]) + x;
}
```
After:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
A_ = x + A_;
}
A[0] = A_;
```
This can greatly reduce the number of accesses to main memory in a kernel. There are cases where doing this gets complicated, and the existing implementation bails out whenever it encounters multiple partial overlaps of the same buffer, or conditional accesses under any circumstances. This makes it much less useful in the presence of complex (i.e. real-world, not example) kernels. This new version should work optimally in almost all cases (I have a few minor follow-ups).
I tested this version extensively, and found quite a few bugs in the original implementation that I'd prefer not to backport fixes for - so I'm in favor of landing this even if we don't immediately see a perf win. I believe the killer app for this kind of optimization is fused reductions, and we haven't enabled many examples of that yet.
It is safe to move two accesses of the same Tensor element to a local scalar Var if between all usages of the element there are no other Loads or Stores that may refer to it. In the comments I refer to this as overlapping the access, or "cutting" the existing AccessInfo. In the case where a candidate for registerization is cut, it may be possible to finalize the access early by writing it back to the Tensor and then create a new scalar variable after the overlapping access is complete. We will attempt to do this when it saves memory accesses.
There are a few cases that make this more challenging:
- For: Loops change the number of real usages of a buffer by the loop extent, but only if we can pull the definition and finalization of the scalar variable out of the loop block. For loops often create accesses which are conditional on a loop var and will overlap large ranges of elements.
E.g. Before:
```
A[0] = 2;
for (int x1 = 0; x1 < 10; x1++) {
A[0] = (A[0]) + x1;
}
for (int x2 = 1; x2 < 10; x2++) {
A[x2] = A[x2 - 1];
}
for (int x3 = 0; x3 < 10; x3++) {
A[0] = (A[0]) + x3;
}
```
After:
```
int A_1 = 2;
for (int x1 = 0; x1 < 10; x1++) {
A_1 = A_1 + x1;
}
A[0] = A_1;
for (int x2 = 1; x2 < 10; x2++) {
A[x2] = A[x2 - 1];
}
int A_2 = A[0];
for (int x3 = 0; x3 < 10; x3++) {
A_2 = A_2 + x3;
}
A[0] = A_2;
```
- Cond: Conditions complicate lifting scalars out of internal scopes. Generally we cannot lift an access outside of a conditional scope unless there is already a reference to that same access at the higher scope, since we don't know if the condition was guarding an array access not safe at the higher scope. In the comments I refer to this as the condition "hiding" the access, and the outer access "unhiding" it.
E.g. this example:
```
if (x<5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
A[x] = (A[x]) + 1;
if (x>5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
```
The A[x] access can be registerized due to the unconditional access between the two conditions:
```
int A_1 = A[x];
if (x<5 ? 1 : 0) {
A_1 = A_1 + 1;
}
A_1 = A_1 + 1;
if (x>5 ? 1 : 0) {
A_1 = A_1 + 1;
}
A[x] = A_1;
```
But this example has no accesses that can be registerized:
```
if (x<5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
if (x>5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
```
- IfThenElse: Same situation as Cond, except since IfThenElse is an Expr rather than a Stmt we cannot insert the scalar definition or finalizer within the conditional scope. Accesses inside an IfThenElse can be safely combined with external accesses but cannot exist completely within.
E.g in this example the `B[x]` cannot be registerized as there is no safe place to define it.
```
A[x] = IfThenElse(x<3 ? 1 : 0, (B[x]) + (B[x]), B[x]);
```
But the equivalent kernel using Cond can be registerized:
```
if (x<3 ? 1 : 0) {
float B_1 = B[x];
A[x] = B_1 + B_1;
} else {
A[x] = B[x];
}
```
- Let: Accesses dependent on local variables via Let Stmts, or loop vars, cannot be raised outside of the scope of the dependent var.
E.g. no accesses in this example can be registerized:
```
for (int x = 0; x < 10; x++) {
int y = 30;
A[y] = x + (A[y]);
}
```
But they can in this example:
```
int y = 30;
for (int x = 0; x < 10; x++) {
A[y] = x + (A[y]);
}
```
**Testing**
The majority of this PR is tests, over 3k lines of them, because there are many different rules to consider and they can interact together more or less arbitrarily. I'd greatly appreciate any ideas for situations we could encounter that are not covered by the tests.
**Performance**
Still working on it, will update. In many FastRRNS sub kernels this diff reduces the number of total calls to Store or Load by 4x, but since those kernels use Concat very heavily (meaning a lot of branches) the actual number encountered by any particular thread on GPU is reduced only slightly. Overall perf improved by a very small amount.
Reductions are where this optimization should really shine, and in particular, the more complex the kernel gets (with extra fusions, etc.), the better this version of the registerizer should do compared to the existing version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45574
Reviewed By: albanD
Differential Revision: D24151517
Pulled By: nickgg
fbshipit-source-id: 9f0b2d98cc213eeea3fda16fee3d144d49fd79ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45929
We were checking `and` when we should have been checking `or`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24148804
Pulled By: eellison
fbshipit-source-id: 9c394ea10ac91a588169d934b1e3208512c71b9d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45857
Fix for https://github.com/pytorch/pytorch/issues/45627
The op was calling `insert` instead of `insert_or_assign`, so it wouldn't overwrite an existing key.
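A minimal TorchScript sketch of the behavior the fix restores (illustrative, not the test from the PR):
```python
import torch
from typing import Dict

@torch.jit.script
def overwrite() -> Dict[str, int]:
    d = {"k": 1}
    d["k"] = 2  # previously a no-op because the op used insert() under the hood
    return d

assert overwrite()["k"] == 2
```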
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24148805
Pulled By: eellison
fbshipit-source-id: bf39c71d5d928890b82cff1a9a0985dc47c1ffac
Summary:
Currently, a GraphRoot instance doesn't have an associated stream. Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream. If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.
The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
# loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
# GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
loss.backward()
# Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
# and the side_stream context is irrelevant. GraphRoot's interaction with its first consumer(s) is the spot where
# the side_stream context causes a problem.
```
This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.)
The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.
With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
loss.backward()
# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
kickoff_grad = torch.ones_like(loss)
loss.backward(gradient=kickoff_grad)
# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
loss.backward(gradient=kickoff_grad)
# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787
Reviewed By: nairbv
Differential Revision: D24138376
Pulled By: albanD
fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3