Summary:
When the input to an indexing operation is a boolean, for example `array[True] = value`,
the resulting index_put node needs to be converted to a masked_scatter or masked_fill node depending on the type of value being assigned: if the value is a single scalar we use masked_fill, and if the value is a tensor of the appropriate size we use masked_scatter.
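A minimal eager-mode sketch (with assumed shapes and values) of the two cases the export pass has to distinguish:
```python
import torch

x = torch.zeros(2, 3)
mask = torch.tensor([[True, False, True], [False, True, False]])

# Scalar value: the index_put is converted to masked_fill.
x[mask] = 1.0
# Tensor value of the appropriate size: the index_put is converted to masked_scatter.
x[mask] = torch.tensor([1.0, 2.0, 3.0])
```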
Fixes https://github.com/pytorch/pytorch/issues/34054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45584
Reviewed By: VitalyFedyunin
Differential Revision: D24116921
Pulled By: bzinodev
fbshipit-source-id: ebd66e06d62e15f0d49c8191d9997f55edfa520e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45942
We only need to keep track of this for traversing the autograd graph
when find_unused_parameters=True. Without that setting, we were still populating and keeping this
mapping in memory, which occupies sizeof(pointer) * (number of grad accumulators)
of extra memory.
ghstack-source-id: 114219289
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D24154407
fbshipit-source-id: 220d723e262f36590a03a3fd2dab47cbfdb87d40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221
The RPC framework only allowed sending RPCs based on provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks since DDP doesn't support names yet.
As a result, it would be helpful to support a `to` parameter in the RPC APIs
that also allows specifying a rank.
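A minimal sketch of what such a call could look like, assuming RPC has already been initialized with at least two workers (using a plain integer rank as `to` is the new part):
```python
import torch
import torch.distributed.rpc as rpc

# Previously `to` had to be a WorkerInfo or a worker name; now a rank works too.
ret = rpc.rpc_sync(to=1, func=torch.add, args=(torch.ones(2), 3))
```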
ghstack-source-id: 114207172
Test Plan:
1) waitforbuildbot
2) Unit Tests
Reviewed By: mrshenli
Differential Revision: D24264989
fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46116
Ideally I would just use one of the existing preprocessor flags such as `FBCODE_CAFFE2`, but this implies a whole bunch of other things elsewhere, so it is not really a solution for ovrsource.
Test Plan: CI green, we are able to disable it internally with `-DNVALGRIND`
Reviewed By: malfet
Differential Revision: D24227360
fbshipit-source-id: 24a3b393cf46d6a16acca0a9ec52610d4bb8704f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46250
Previously the type of GetAttr nodes was getting set incorrectly and wasn't matching the module type
Test Plan:
Existing quantization tests
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D24279872
fbshipit-source-id: 2b2e3027f6e9ad8ba9e9b7937bd5cc5daaf6e17c
Summary:
The record_stream method was hard-coded for the CUDA device. Define record_stream in native_functions.yaml to enable dynamic dispatch to different device backends.
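A minimal sketch of the dispatched API, assuming a CUDA device is available (other backends can now provide their own implementation):
```python
import torch

s = torch.cuda.Stream()
t = torch.empty(1024, device="cuda")
with torch.cuda.stream(s):
    t.add_(1)
# Tell the caching allocator that t is still in use on stream s.
t.record_stream(s)
```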
Fixes https://github.com/pytorch/pytorch/issues/36556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301
Reviewed By: glaringlee
Differential Revision: D23763954
Pulled By: ezyang
fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112
### Summary
This PR adds support for running TorchScript models on iOS GPU via Metal (inference only). The feature is currently in a prototype state; API changes are expected. The tutorial and documents will be added once it goes to beta.
allow-large-files
- Users API
```
auto module = torch::jit::load(model);
module.eval();
at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
- Person Segmentation v106 (FB Internal)
- Mobilenetv2
- Supported Operators
- aten::conv2d
- aten::addmm
- aten::add.Tensor
- aten::sub.Tensor
- aten::mul.Tensor
- aten::relu
- aten::hardtanh
- aten::hardtanh_
- aten::sigmoid
- aten::max_pool2d
- aten::adaptive_avg_pool2d
- aten::reshape
- aten::t
- aten::view
- aten::log_softmax.int
- aten::upsample_nearest2d.vec
- Supported Devices
- Apple A9 and above
- iOS 10.2 and above
- CMake scripts
- `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`
### Test Plan
- Circle CI
ghstack-source-id: 114155638
Test Plan:
1. Sandcastle CI
2. Circle CI
Reviewed By: dreiss
Differential Revision: D23236555
fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen which could break when using Half floats:
1. The Registerizer will generate new scalars with the type of the load being replaced, and doesn't have Cuda specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load casting rules.
2. We weren't handling explicit casts to Half inserted by the user (in the report, the "user" being the JIT). This is addressed by replacing them with casts to Float, since that's the type we do Half math in.
Fixes https://github.com/pytorch/pytorch/issues/45953.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129
Reviewed By: glaringlee
Differential Revision: D24253639
Pulled By: nickgg
fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46036
Previously, this function didn't do error-bounds checking on the GetItem (GET_ITEM) calls, which led to issues like https://github.com/pytorch/pytorch/issues/46020.
A better solution would be to use pybind, but given that writing the file dominates the cost of bounds checking, this is strictly better.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24228370
Pulled By: gchanan
fbshipit-source-id: f5d0a3d21ff12b4380beefe1e9954fa81ea2f567
Summary:
Fixes a crash bug in the IRSimplifier when the LHS is a Term (e.g. 2x) and the RHS is a Polynomial (e.g. 2x+1).
This case crashes 100% of the time so I guess it's not very common in models we've been benchmarking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46108
Reviewed By: agolynski
Differential Revision: D24226593
Pulled By: nickgg
fbshipit-source-id: ef454c855ff472febaeba16ec34891df932723c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933
Occasionally users run DDP with models with unused params, in this
case we would like to surface an error message telling them to run with
find_unused_parameters=True. However, a recent change to the rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size mismatch error when this happens, but the
information about unused parameters is likely to be more useful and likely to
be the most common case of failure. Prefer raising this error over the
subsequent size mismatch errors.
ghstack-source-id: 113914759
Test Plan: Added unittest
Reviewed By: mrshenli
Differential Revision: D24151256
fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45926
torch/csrc/cuda/nccl.cpp is compiled as part of the torch_cuda library, so calling this function from ProcessGroupNCCL.cpp avoids linking a second instance of libnccl.a into torch_python.
Fixes a similar issue to https://github.com/pytorch/pytorch/issues/42517
ghstack-source-id: 113910530
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24147802
fbshipit-source-id: d8901fdb31bdc22ddca2364f8050844639a1beb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783
After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently timeout. I noticed this
as some tests take more than 60s to finish.
This commit extracts the tensor device checking logic out of pipeWrite,
and makes sure the error is thrown before the active call count is
incremented.
Differential Revision: D24094803
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: mrshenli
fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003
sparse is confusing because it is used in training for sparse gradients
Test Plan: Imported from OSS
Reviewed By: radkris-git, qizzzh
Differential Revision: D24178248
fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45997
The current sparse field used in the float module is for sparse gradients, which is not applicable
to inference. The sparse field in the quantized ops denotes pruned weights.
Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag
Imported from OSS
Reviewed By: qizzzh
Differential Revision: D24176543
fbshipit-source-id: a05b4ff949e0375462ae411947f68076e1b460d2
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45558
This assertion failure is caused by the incorrect implementation of ``aten::set_grad_enabled`` in [torch/csrc/jit/runtime/register_special_ops.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_special_ops.cpp#L436). The current implementation is:
```cpp
Operator(
"aten::set_grad_enabled(bool val) -> ()",
[](Stack* stack) {
torch::GradMode::set_enabled(pop(stack).toBool());
push(stack, IValue());
},
aliasAnalysisConservative()),
```
which pushes a ``None`` onto the evaluation stack after calling ``set_enabled``. But this behavior is incorrect, as the signature says this function won't return a value. I guess the original author might have been confused by the behavior of Python, which pushes a ``None`` onto the evaluation stack when the function definition does not end with a return statement with an explicit result value.
If ``aten::set_grad_enabled`` pushes a ``None`` onto the evaluation stack, each time it's called, the evaluation stack will accumulate an extra ``None``. In our case, ``with torch.no_grad():`` will cause ``aten::set_grad_enabled`` to be called twice, so when the ``forward`` method finishes, the evaluation stack will be ``[None, None, Tensor]``. But the return statement of ``GraphFunction::operator()`` in [torch/csrc/jit/api/function_impl.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/api/function_impl.cpp#L51) is ``return stack.front();``, which will try to extract a tensor out of a ``None``, thus causing the assertion failure.
The solution is simple, just remove the push in the implementation of ``aten::set_grad_enabled``.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45559
Reviewed By: albanD
Differential Revision: D24142153
Pulled By: SplitInfinity
fbshipit-source-id: 75aad0e38bd912a437f7e1a1ee89ab4445e35b5d
Summary:
Adds a new transform to the NNC compiler, which adds support for buffer access caching. All accesses within a provided scope are redirected to a cache which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:
Before:
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int j_1 = 0; j_1 < 10; j_1++) {
    B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
  }
}
```
After `cacheAccesses(A->buf(), "A_local", j_loop);`
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int i_2 = 0; i_2 < 2; i_2++) {
    for (int j_1 = 0; j_1 < 11; j_1++) {
      A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
    }
  }
  for (int j_2 = 0; j_2 < 10; j_2++) {
    B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
  }
}
```
Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
  sum[l1] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
}
```
After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:
```
for (int l1 = 0; l1 < 4; l1++) {
  Allocate(d_local, float, {1});
  sum[l1] = 0.f;
  d_local[0] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
  sum[l1] = (sum[l1]) + (d_local[0]);
  Free(d_local);
}
```
I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. Didn't feel too useful so I removed them, but let me know.
This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.
While working on this I realized that it overlaps heavily with `computeAt`: which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM opinions on this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869
Reviewed By: mruberry
Differential Revision: D24195276
Pulled By: nickgg
fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45791
Most of the lowering for log1p and lgamma already existed, add JIT integration.
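A minimal sketch of the kind of TorchScript function these ops can now participate in, assuming the profiling executor and the tensor-expression fuser are enabled:
```python
import torch

@torch.jit.script
def f(x):
    return torch.log1p(x) * torch.lgamma(x)

x = torch.rand(1024)
for _ in range(3):  # profiling runs happen before the fused graph kicks in
    f(x)
```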
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169536
Pulled By: eellison
fbshipit-source-id: a009c77a3471f3b5d378bad5de6d8e0880e9da3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45790
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169534
Pulled By: eellison
fbshipit-source-id: a2666df53fbb12c64571e960f59dbe94df2437e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45789
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D24169535
Pulled By: eellison
fbshipit-source-id: 54d7af434772ba52144b12d15d32ae30460c0c3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788
We were only running the traced graph once, which would not yet have been fused at that point. We should run for num_profiled_runs + 1, and also assert that all nodes in the graph were fused.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169537
Pulled By: eellison
fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847
Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24136629
Pulled By: heitorschueroff
fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
Summary:
This enables the cuda fuser on ROCm and enables tests for them.
Part of this patch is based on work of Rohith Nallamaddi, thank you.
Errors are my own, of course.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45965
Reviewed By: seemethere
Differential Revision: D24170457
Pulled By: walterddr
fbshipit-source-id: 3dd25b3501a41d2f00acba3ce8642ce51c49c9a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45892
Previously we were using a hashtable (`std::unordered_map` in OSS, `folly::F14FastMap` in fb) for the workspace, a container for all the IValues in the graph. Hashtable-based lookups can be expensive. This diff replaces the hashtable with `std::vector`, and extra bookkeeping is introduced to keep track of the indices of graph inputs/outputs in `StaticRuntime` and op inputs/outputs in `ProcessedNode`.
Reviewed By: dzhulgakov
Differential Revision: D24098763
fbshipit-source-id: 337f835ee144985029b5fa2ab98f9bcc5e3606b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45948
No functionality changes expected, it's just a preparation for further
changes in the LoopNest interface.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24156000
Pulled By: ZolotukhinM
fbshipit-source-id: f95ab07aac0aba128bc4ed5376a3251ac9c31c06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45946
Also, make these functions static - they are not using anything from
`LoopNest` and can be applied to any `Stmt`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24156002
Pulled By: ZolotukhinM
fbshipit-source-id: 1c7d205f85a2a1684e07eb836af662f10d0a50fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45936
`Tensor` has been a view into a `Function` that was supposed to be used
for a more general case when we have multiple computations over the same
domain (aka multiple output functions). We have never got to a point
where we need this and now have other ideas in mind on how to support
this case if need be. For now, let's just nuke `Function` to reduce the
overall system complexity.
The change should not affect any existing behavior.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24153214
Pulled By: ZolotukhinM
fbshipit-source-id: 26d5f11db5d661ff5e1135f4a49eff1c6d4c1bd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900
Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`
Fixes https://github.com/pytorch/pytorch/issues/42517
Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                       ^
    |                                       |
    --------> libc10d.a --------------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.
This PR adds `nccl[Send|Recv]` call to `torch_cuda.so` by implementing `all2all` in `torch_cuda` and thus avoids double linking the static library.
A more involved, but less error-prone, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.
Test Plan: Imported from OSS
Reviewed By: mingzhe09088
Differential Revision: D24138011
Pulled By: malfet
fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
Summary:
This is a rewrite of the Registerizer, supporting scalar replacement in *vastly* more situations. As a refresher, the registerizer does this:
Before:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```
After:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```
This can greatly reduce the number of accesses to main memory in a kernel. There are cases where doing this gets complicated, and the existing implementation bails out whenever encountering multiple partial overlaps of the same buffer, or conditional accesses under any circumstances. This makes it much less useful in the presence of complex (i.e. real-world, not example) kernels. This new version should work optimally in almost all cases (I have a few minor follow-ups).
I tested this version extensively, and found quite a few bugs in the original implementation I'd prefer not to back port fixes for - so I'm in favor of landing this even if we don't immediately see a perf win. I believe the killer app for this kind of optimization is fused reductions and we haven't enabled many examples of that yet.
It is safe to move two accesses of the same Tensor element to a local scalar Var if between all usages of the element there are no other Loads or Stores that may refer to it. In the comments I refer to this as overlapping the access, or "cutting" the existing AccessInfo. In the case where a candidate for registerization is cut, it may be possible to finalize the access early by writing it back to the Tensor and then create a new scalar variable after the overlapping access is complete. We will attempt to do this when it saves memory accesses.
There are a few cases that make this more challenging:
- For: Loops change the number of real usages of a buffer by the loop extent, but only if we can pull the definition and finalization of the scalar variable out of the loop block. For loops often create accesses which are conditional on a loop var and will overlap large ranges of elements.
E.g. Before:
```
A[0] = 2;
for (int x1 = 0; x1 < 10; x1++) {
A[0] = (A[0]) + x1;
}
for (int x2 = 1; x2 < 10; x2++) {
A[x2] = A[x2 - 1];
}
for (int x3 = 0; x3 < 10; x3++) {
A[0] = (A[0]) + x3;
}
```
After:
```
int A_1 = 2;
for (int x1 = 0; x1 < 10; x1++) {
  A_1 = A_1 + x1;
}
A[0] = A_1;
for (int x2 = 1; x2 < 10; x2++) {
  A[x2] = A[x2 - 1];
}
int A_2 = A[0];
for (int x3 = 0; x3 < 10; x3++) {
  A_2 = A_2 + x3;
}
A[0] = A_2;
```
- Cond: Conditions complicate lifting scalars out of internal scopes. Generally we cannot lift an access outside of a conditional scope unless there is already a reference to that same access at the higher scope, since we don't know if the condition was guarding an array access not safe at the higher scope. In the comments I refer to this as the condition "hiding" the access, and the outer access "unhiding" it.
E.g. this example:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
A[x] = (A[x]) + 1;
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```
The A[x] access can be registerized due to the unconditional access between the two conditions:
```
int A_1 = A[x];
if (x<5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A_1 = A_1 + 1;
if (x>5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A[x] = A_1;
```
But this example has no accesses that can be registerized:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```
- IfThenElse: Same situation as Cond, except since IfThenElse is an Expr rather than a Stmt we cannot insert the scalar definition or finalizer within the conditional scope. Accesses inside an IfThenElse can be safely combined with external accesses but cannot exist completely within.
E.g in this example the `B[x]` cannot be registerized as there is no safe place to define it.
```
A[x] = IfThenElse(x<3 ? 1 : 0, (B[x]) + (B[x]), B[x]);
```
But the equivalent kernel using Cond can be registerized:
```
if (x<3 ? 1 : 0) {
  float B_1 = B[x];
  A[x] = B_1 + B_1;
} else {
  A[x] = B[x];
}
```
- Let: Accesses dependent on local variables via Let Stmts, or loop vars, cannot be raised outside of the scope of the dependent var.
E.g. no accesses in this example can be registerized:
```
for (int x = 0; x < 10; x++) {
  int y = 30;
  A[y] = x + (A[y]);
}
```
But they can in this example:
```
int y = 30;
for (int x = 0; x < 10; x++) {
  A[y] = x + (A[y]);
}
```
**Testing**
The majority of this PR is tests, over 3k lines of them, because there are many different rules to consider and they can interact together more or less arbitrarily. I'd greatly appreciate any ideas for situations we could encounter that are not covered by the tests.
**Performance**
Still working on it, will update. In many FastRRNS sub kernels this diff reduces the number of total calls to Store or Load by 4x, but since those kernels use Concat very heavily (meaning a lot of branches) the actual number encountered by any particular thread on GPU is reduced only slightly. Overall perf improved by a very small amount.
Reductions are where this optimization should really shine, and in particular, the more complex the kernel gets (with extra fusions, etc.), the better this version of the registerizer should do compared to the existing version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45574
Reviewed By: albanD
Differential Revision: D24151517
Pulled By: nickgg
fbshipit-source-id: 9f0b2d98cc213eeea3fda16fee3d144d49fd79ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45929
We were checking `and` when we should have been checking `or`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24148804
Pulled By: eellison
fbshipit-source-id: 9c394ea10ac91a588169d934b1e3208512c71b9d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45857
Fix for https://github.com/pytorch/pytorch/issues/45627
Op was calling `insert` instead of `insert_or_assign`, so it wouldn't overwrite an existing key.
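A minimal sketch of the behavior being fixed; the assumption here is that the affected op is the scripted dict item-assignment path from the linked issue:
```python
import torch
from typing import Dict

@torch.jit.script
def overwrite(d: Dict[str, int]) -> Dict[str, int]:
    d["k"] = 1
    d["k"] = 2  # with insert_or_assign this now overwrites the existing value
    return d

print(overwrite({}))  # {'k': 2}
```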
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24148805
Pulled By: eellison
fbshipit-source-id: bf39c71d5d928890b82cff1a9a0985dc47c1ffac
Summary:
Currently, a GraphRoot instance doesn't have an associated stream. Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream. If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.
The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
# loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
# GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
loss.backward()
# Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
# and the side_stream context is irrelevant. GraphRoot's interaction with its first consumer(s) is the spot where
# the side_stream context causes a problem.
```
This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.)
The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.
With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
loss.backward()
# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
kickoff_grad = torch.ones_like(loss)
loss.backward(gradient=kickoff_grad)
# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
loss.backward(gradient=kickoff_grad)
# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787
Reviewed By: nairbv
Differential Revision: D24138376
Pulled By: albanD
fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45665
Fixes #43944
Note that the codegen doesn't use a proper parser so, in the same way as with lists, the string `, ` cannot appear in defaults or it will be interpreted as a splitting point between arguments.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D24141835
Pulled By: ezyang
fbshipit-source-id: 578127861fd2504917f4486c44100491a2c40343
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45639
`StaticRuntime::run_individual` mimics the caffe2 operator benchmark `SimpleNet::TEST_Benchmark`, so we can get accurate information on the operator breakdown. We found that the PyTorch AutogradProfiler adds a lot of overhead to small models, such as the adindexer precomputation_merge net: 100% for batch_size 1, 33% for batch_size 20. This implementation adds very little overhead, as shown in the test plan.
Test Plan: Test results are fb internal only.
Reviewed By: yinghai, dzhulgakov
Differential Revision: D24012088
fbshipit-source-id: f32eb420aace93e2de421a15e4209fce6a3d90f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45262
**Summary**
This commit adds an API for ignoring arbitrary module attributes during
scripting. A class attribute named `ignored_attributes` containing names
of attributes to ignore can be added to the class of the instance being
scripted. Attributes ignored in this fashion cannot be used in
`forward`, methods used by `forward` or by `exported` methods. They
are, however, copied to the `RecursiveScriptModule` wrapper and can be
used by `ignored` methods and regular Python code.
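A minimal sketch of the API as described above; the exact spelling of the class attribute is taken from this summary and should be treated as an assumption:
```python
import torch

class M(torch.nn.Module):
    ignored_attributes = ["debug_blob"]  # names of attributes to skip during scripting

    def __init__(self):
        super().__init__()
        self.debug_blob = object()  # not scriptable, but ignored by the compiler

    def forward(self, x):
        return x + 1

scripted = torch.jit.script(M())
print(scripted.debug_blob)  # still available from regular Python code
```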
**Test Plan**
This commit adds unit tests to `TestScriptPy3` to test this new API.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23971882
Pulled By: SplitInfinity
fbshipit-source-id: 8c81fb415fde7b78aa2f87e5d83a477e876a7cc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45899
Use function polymorphism to avoid repeated casts.
I.e., instead of using `NCCL_CHECK(from_nccl_result(...))`, add a variant of the function that takes `ncclResult_t` as an input argument.
Add a non-pointer variant of `to_nccl_comm` to avoid the `*to_nccl_comm(&comm)` pattern.
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D24138012
Pulled By: malfet
fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45867
In most cases the lock ordering was: hold a lock in local autograd and
then hold a lock in DistAutogradContext.
In the case of `set_exception_without_signal` the lock order was reversed, and as
a result we saw potential deadlock issues in our TSAN tests. To fix this, I
removed the lock and instead just used std::atomic exchange.
In addition to this, I fixed TestE2E to ensure that we use the appropriate
timeout.
TestE2EProcessGroup was flaky for these two reasons and now is fixed.
ghstack-source-id: 113592709
Test Plan: waitforbuildbot.
Reviewed By: albanD
Differential Revision: D24120962
fbshipit-source-id: 12447b84ceae772b91e9a183c90d1e6340f44e66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921
This diff adds support for process group point-to-point operations on the NCCL backend, based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
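A minimal sketch of the user-facing calls this enables, assuming a two-rank NCCL process group is already initialized and each rank has selected its own CUDA device:
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
t = torch.ones(4, device="cuda")
if rank == 0:
    dist.send(t, dst=1)   # point-to-point send on the NCCL backend
else:
    dist.recv(t, src=0)   # matching receive on rank 1
```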
ghstack-source-id: 113592785
Test Plan: unittest
Reviewed By: jiayisuse
Differential Revision: D23709848
fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
Summary:
Instead of dynamically loading `caffe2_nvrtc`, lazyNVRTC provides the same functionality by binding all the hooks to a lazy-bind implementation, very similar to shared-library jump tables:
On the first call, each function from the list tries to get a global handle to the respective shared library and replaces itself with the dynamically resolved symbol, using the following template:
```
auto fn = reinterpret_cast<decltype(&NAME)>(getCUDALibrary().sym(C10_SYMBOLIZE(NAME)));
if (!fn)
  throw std::runtime_error("Can't get " C10_SYMBOLIZE(NAME));
lazyNVRTC.NAME = fn;
return fn(...);
```
Fixes https://github.com/pytorch/pytorch/issues/31985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45674
Reviewed By: ezyang
Differential Revision: D24073946
Pulled By: malfet
fbshipit-source-id: 1479a75e5200e14df003144625a859d312885874
Summary:
* Add a pass at the end of runCleanupPasses to annotate `aten::warn` so that each has its unique id
* Enhanced the interpreter so that it tracks which `aten::warn` instructions have been executed before and skips them (see the sketch after this list)
* Improved insertInstruction so that it correctly checks for overflow
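A minimal sketch of the observable change, assuming the new per-instruction tracking matches Python's default "warn once" semantics:
```python
import warnings
import torch

@torch.jit.script
def f(x):
    warnings.warn("x is expected to be positive")
    return x * 2

for _ in range(3):
    f(torch.ones(2))  # the TorchScript warning is now reported once, not three times
```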
Fixes https://github.com/pytorch/pytorch/issues/45108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45382
Reviewed By: mrshenli
Differential Revision: D24060677
Pulled By: gmagogsfm
fbshipit-source-id: 9221bc55b9ce36b374bdf614da3fe47496b481c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45726
FB has an old internal platform that uses some random llvm version
that looks sort of like llvm 7. I've guarded that with the appropriate
LLVM_VERSION_PATCH.
I've also swapped out some of our uses of ThreadSafeModule/ThreadSafeContext
for the variants without ThreadSafe in the name. As far as I can tell we
weren't using the bundled locks anyways, but I'm like 85% sure this is OK since
we compile under the Torch JIT lock anyways.
Test Plan: unit tests
Reviewed By: ZolotukhinM, asuhan
Differential Revision: D24072697
fbshipit-source-id: 7f56b9f3cbe5e6d54416acdf73876338df69ddb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543
This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store.
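A minimal sketch of the Python key-value store API being documented, using a single-process TCPStore for illustration:
```python
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      timeout=timedelta(seconds=30))
store.set("first_key", "first_value")
print(store.get("first_key"))  # b'first_value'
```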
ghstack-source-id: 113409195
Test Plan: Will verify screenshots by building the docs.
Reviewed By: pritamdamania87
Differential Revision: D24005598
fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45464
Usage of Symbols to find arguments requires one to generate a nonsense symbol for inputs which don't already have one. The intention of Symbols appears to be something of an interned string, but the namespace component doesn't apply to an argument. In order to access the arguments by name without adding new symbols, versions of those functions with std::string input were added. These can be proven valid based on the existing codepath. Additionally, a hasNamedInput convenience function was added to remove the necessity of a try/catch block in user code.
The primary motivation is to be able to easily handle the variable number of arguments in glow, so that the arange op may be implemented.
Reviewed By: eellison
Differential Revision: D23972315
fbshipit-source-id: 3e0b41910cf07e916186f1506281fb221725a91b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44678
This is a prototype PR that introduces 4 bit qtensors. The new dtype added for this is c10::quint4x2
The underlying storage for this is still uint8_t, so we pack 2 4-bit values in a byte while quantizing it.
This change uses most of the existing scaffolding for qtensor storage. We allocate storage
based on the dtype before creating a new qtensor.
It also adds a dispatch mechanism for this dtype so we can use this to get the bitwidth, qmin and qmax info
while quantizing and packing the qtensor (when we add 2-bit qtensor)
Kernels that use this dtype should be aware of the packing format.
Test Plan:
Locally tested
```
x = torch.ones((100, 100), dtype=torch.float)
qx_8bit = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint8)
qx = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint4x2)
torch.save(x, "temp.p")
print('Size float (B):', os.path.getsize("temp.p"))
os.remove('temp.p')
torch.save(qx_8bit, "temp.p")
print('Size quantized 8bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
torch.save(qx, "temp.p")
print('Size quantized 4bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
```
Size float (B): 40760
Size quantized 8bit(B): 10808
Size quantized 4bit(B): 5816
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23993134
fbshipit-source-id: 073bf262f9680416150ba78ed2d932032275946d
Summary:
This modifies the default bailout depth to 20, which gives us reasonable performance in the benchmarks we considered (fastrnns, maskrcnn, hub/benchmark, etc.).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45710
Reviewed By: robieta
Differential Revision: D24071861
Pulled By: Krovatkin
fbshipit-source-id: 472aacc136f37297b21f577750c1d60683a6c81e
Summary:
We are trying to build libtorch statically (BUILD_SHARED_LIBS=OFF) then link it into a DLL. Our setup hits the infinite loop mentioned [here](54c05fa34e/torch/csrc/autograd/engine.cpp (L228)) because we build with `BUILD_SHARED_LIBS=OFF` but still link it all into a DLL at the end of the day.
This PR fixes the issue by changing the condition to guard on which windows runtime the build links against using the `CAFFE2_USE_MSVC_STATIC_RUNTIME` flag. `CAFFE2_USE_MSVC_STATIC_RUNTIME` defaults to ON when `BUILD_SHARED_LIBS=OFF`, so backwards compatibility is maintained.
I'm not entirely confident I understand the subtleties of the windows runtime versus linking setup, but this setup works for us and should not affect the existing builds.
Fixes https://github.com/pytorch/pytorch/issues/44470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43532
Reviewed By: mrshenli
Differential Revision: D24053767
Pulled By: albanD
fbshipit-source-id: 1127fefe5104d302a4fc083106d4e9f48e50add8
Summary:
* Support propagating `dim_param` in ONNX by encoding it as a `ShapeSymbol` in the `SymbolicShape` of outputs. If export is called with `dynamic_axes` provided, shape inference starts with these axes set as dynamic (see the export sketch after this list).
* Add new test file `test_pytorch_onnx_shape_inference.py`, reusing all test cases from `test_pytorch_onnx_onnxruntime.py`, but focusing on validating shapes for all nodes in the graph. Currently this is not enabled in CI, since there are still quite a few existing issues and corner cases to fix. The test defaults to running only at opset 12.
* Bug fixes, such as div, _len, and peephole.cpp passes for PackPadded, and LogSoftmaxCrossEntropy.
* This PR depends on existing PR such as 44332.
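A minimal export sketch showing `dynamic_axes`, which shape inference now treats as symbolic dimensions (the model and names here are illustrative):
```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)
torch.onnx.export(
    model, (x,), "linear.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=12,
)
```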
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44920
Reviewed By: eellison
Differential Revision: D23958398
Pulled By: bzinodev
fbshipit-source-id: 00479d9bd19c867d526769a15ba97ec16d56e51d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45343
The current default dynamic quant observer is not correct, since we don't accumulate
min/max and we don't need to calculate qparams.
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D23933995
fbshipit-source-id: 3ff497c9f5f74c687e8e343ab9948d05ccbba09b
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45586
Test Plan: The unit test has been softened to be less platform sensitive.
Reviewed By: mruberry
Differential Revision: D24025415
Pulled By: robieta
fbshipit-source-id: ee986933b984e736cf1525e1297de6b21ac1f0cf
Summary:
This is an attempt at refactoring the `torch.distributed` implementation. The goal is to push the Python layer's global states (like `_default_pg`) into the C++ layer so that `torch.distributed` becomes more TorchScript-friendly.
This PR adds the skeleton of the C++ implementation; at the moment it is not included in any build (and won't be until the method implementations are filled in). If you see any related test failures, feel free to revert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45547
Reviewed By: izdeby
Differential Revision: D24024213
Pulled By: gmagogsfm
fbshipit-source-id: 2762767f63ebef43bf58e17f9447d53cf119f05f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45585
I discovered this bug when I was trying to print the graph to a file. Turns out I had to close the file, but flushing should be a good safeguard in case other users forget.
Test Plan:
Tested with and without flushing.
with P144064292
without P144064767
Reviewed By: mortzur
Differential Revision: D24023819
fbshipit-source-id: 39574b3615feb28e5b5939664c04ddfb1257706a
Summary:
Export of embedding bag with dynamic list of offsets.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44693
Reviewed By: malfet
Differential Revision: D23831980
Pulled By: bzinodev
fbshipit-source-id: 3eaff1a0f20d1bcfb8039e518d78c491be381e1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45377
This PR adds a C++ implementation of the TripletMarginWithDistanceLoss, for which the Python implementation was introduced in PR #43680. It's based on PR #44072, but I'm resubmitting this to unlink it from Phabricator.
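For reference, a minimal sketch of the Python API (from #43680) that this C++ implementation mirrors:
```python
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=nn.PairwiseDistance(), margin=1.0)
anchor, positive, negative = (torch.randn(8, 128, requires_grad=True) for _ in range(3))
loss = loss_fn(anchor, positive, negative)
loss.backward()
```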
Test Plan: Imported from OSS
Reviewed By: izdeby
Differential Revision: D24003973
fbshipit-source-id: 2d9ada7260a6f27425ff2fdbbf623dad0fb79405
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826
As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.
To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass in that parameter to reducer if that parameter is in the given list.
ghstack-source-id: 113210109
Test Plan: Added unittest
Reviewed By: xw285cornell, mrshenli
Differential Revision: D23740639
fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45461
This PR disables autograd for all C -> C, R -> C functions which are not included in the whitelist `GRADIENT_IMPLEMENTED_FOR_COMPLEX`. In practice, there will be a RuntimeError during forward computation when the outputs are differentiable:
```
>>> x=torch.randn(4, 4, requires_grad=True, dtype=torch.cdouble)
>>> x.pow(3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: pow does not support automatic differentiation for outputs with complex dtype.
```
The implicit assumption here is that all the C -> R functions have correct backward definitions. So before merging this PR, the following functions must be tested and verified to have correct backward definitions:
`torch.abs` (updated in #39955 ), `torch.angle`, `torch.norm`, `torch.irfft`, `torch.istft`.
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D23998156
Pulled By: anjali411
fbshipit-source-id: 370eb07fe56ac84dd8e2233ef7bf3a3eb8aeb179
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:
A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.
Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
    "x.backward()",
    setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()
for c, fn in counts[:20]:
    print(f"{c:>12} {fn}")
```
```
812800 ???:_dl_update_slotinfo
355600 ???:update_get_addr
308300 work/Python/ceval.c:_PyEval_EvalFrameDefault'2
304800 ???:__tls_get_addr
196059 ???:_int_free
152400 ???:__tls_get_addr_slow
138400 build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
126526 work/Objects/dictobject.c:_PyDict_LoadGlobal
114268 ???:malloc
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
85900 work/Python/ceval.c:_PyEval_EvalFrameDefault
79946 work/Objects/typeobject.c:_PyType_Lookup
72000 build/../c10/core/Device.h:c10::Device::validate()
70000 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
66400 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
63000 ???:pthread_mutex_lock
61200 work/Objects/dictobject.c:PyDict_GetItem
59800 ???:free
58400 work/Objects/tupleobject.c:tupledealloc
56707 work/Objects/dictobject.c:lookdict_unicode_nodummy
```
Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions: {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
_ = count_dict.setdefault(fn, 0)
count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
print(f"{c:>8} {fn}")
```
```
Head instructions: 7609547
1.6 instructions: 6059648
169600 ???:_dl_update_slotinfo
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
74200 ???:update_get_addr
63600 ???:__tls_get_addr
46800 work/Python/ceval.c:_PyEval_EvalFrameDefault
33512 work/Objects/dictobject.c:_PyDict_LoadGlobal
31800 ???:__tls_get_addr_slow
31700 build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
28300 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
27800 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
27401 work/Objects/dictobject.c:lookdict_unicode_nodummy
24115 work/Objects/typeobject.c:_PyType_Lookup
24080 ???:_int_free
21700 work/Objects/dictobject.c:PyDict_GetItemWithError
20700 work/Objects/dictobject.c:PyDict_GetItem
...
-3200 build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
-3400 build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
-3500 /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
-3700 build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
-4207 work/Objects/obmalloc.c:PyMem_Calloc
-4500 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
-4800 build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
-5000 build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
-5300 work/Objects/listobject.c:PyList_New
-5400 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
-5600 /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
-6231 work/Objects/obmalloc.c:PyMem_Free
-6300 work/Objects/listobject.c:list_repeat
-11200 work/Objects/listobject.c:list_dealloc
-28900 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```
Remaining TODOs:
* Include a timer in the generated script for cuda sync.
* Add valgrind to CircleCI machines and add a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717
Reviewed By: soumith
Differential Revision: D24010742
Pulled By: robieta
fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45482
Working on some models that need these ops on lite interpreter.
Test Plan: locally build and load/run the TS model without problem.
Reviewed By: iseeyuan
Differential Revision: D23906581
fbshipit-source-id: 01b9de2af2046296165892b837bc14a7e5d59b4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396 .
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506
Reviewed By: zhangguanheng66
Differential Revision: D23991410
Pulled By: Krovatkin
fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
Summary:
Updated `cholesky_backward` to work correctly for complex input.
Note that the current implementation gives the conjugate of what JAX would return. anjali411, is that the correct thing to do?
Ref. https://github.com/pytorch/pytorch/issues/44895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45267
Reviewed By: bwasti
Differential Revision: D23975269
Pulled By: anjali411
fbshipit-source-id: 9908b0bb53c411e5ad24027ff570c4f0abd451e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45488
model_name logging was broken; the issue comes from the recent change that assigned the method name to the module name, and this diff fixes it.
ghstack-source-id: 113103942
Test Plan:
made sure that now the model_name is logged from module_->name().
verified with one model which does not contain the model metadata, and the model_name field is logged as below:
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING run() module = __torch__.Model
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING metadata does not have model_name assigning to __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log model_name = __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log method_name = labels
09-28 21:59:30.068 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod()
Reviewed By: linbinyu
Differential Revision: D23984165
fbshipit-source-id: 5b00f50ea82106b695c2cee14029cb3b2e02e2c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45479
Add a top-level boolean attribute called mobile_optimized to the model that is set to true if it is optimized.
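A minimal sketch of how the attribute can be checked after optimization; the attribute name comes from this summary, and the model here is just an illustrative placeholder:
```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

scripted = torch.jit.script(torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU()))
optimized = optimize_for_mobile(scripted)
print(getattr(optimized, "mobile_optimized", False))  # True after optimization
```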
Test Plan: buck test //caffe2/test:mobile passes
Reviewed By: kimishpatel
Differential Revision: D23956728
fbshipit-source-id: 79c5931702208b871454319ca2ab8633596b1eb8