pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Jeremy Lilley	1b2d2ba504	[PyTorch] Fix write-after-free (TSAN) in GraphTask::set_error() (#33156 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33156 When dist_autograd_spawn_thrift's 'test_backward_node_failure_python_udf' test is run, it was encountering a TSAN error related to holding the mutex while the underlying datastructure was being dealloced. In this change, we simply get a shared_ptr<> reference to the future, and set_exception() without having the lock held, to avoid deallocing underneath the lock. ghstack-source-id: 98303434 Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/rpc:dist_autograd_spawn_thrift -- 'test_backward_node_failure_python_udf \(test_dist_autograd_spawn\.DistAutogradTestWithSpawn\)' Differential Revision: D19821362 fbshipit-source-id: 82f735e33f8e608552418ae71592400fa3621e40	2020-02-14 09:32:17 -08:00
Pritam Damania	05d18ffaf5	Distributed Autograd: Allow multiple backward passes to accumulate gradients. (#32506 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32506 In this PR, we've introduced a `retain_graph` parameter to distributed autograd similar to `torch.autograd.backward`. In terms of design, this parameter is sent over RPC to all nodes and is used to create the GraphTask on the local nodes. This enables us to run `dist_autograd.backward()` multiple times in the same context. The use case currently for this is to benchmark only the backward pass for distributed autograd. We'd like to measure the QPS for the backward pass and as a result, running a single forward pass and multiple backward passes in a loop is one way to benchmark backward pass performance. ghstack-source-id: 97868900 Test Plan: waitforbuildbot Differential Revision: D19521288 fbshipit-source-id: 7ad8521059fd400d7b5a6ab77ce56e1927ced90a	2020-02-06 23:27:21 -08:00
Edward Yang	67c1d930eb	Lock graph_task before writing leaf_streams. (#31995 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31995 Fixes #31906. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D19331259 Pulled By: ezyang fbshipit-source-id: 5d24bf3555e632211a9b6f8e50ff241603c18b3d	2020-01-09 13:26:36 -08:00
Alban Desmaison	1314f7f4f4	Ensure the original grad_mode is restored during backward (#31884 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31884 Fix #31715 Test Plan: Imported from OSS Differential Revision: D19301076 Pulled By: albanD fbshipit-source-id: 2d20c01bfb6364fa96c8fe5aa5ce7ea39defa3ce	2020-01-08 14:16:51 -08:00
Pritam Damania	bf8e1c0710	Integrate async mode for autograd engine with distributed autograd. (#31508 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31508 This PR builds on top of https://github.com/pytorch/pytorch/pull/31230 to ensure that distributed autograd doesn't block an RPC thread anymore during the backward pass. I've also added a unit test where all ranks hammer rank 0 without about 60 backward calls (which would cause a deadlock earlier), but now such a test passes without any issues. ghstack-source-id: 96345097 Test Plan: waitforbuildbot Differential Revision: D19188749 fbshipit-source-id: b21381b38175699afd0f9dce1ddc8ea6a220f589	2020-01-07 11:01:16 -08:00
Pritam Damania	5cc62f2913	Ensure autograd callbacks are called only once for reentrant backward. (#31909 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31909 https://github.com/pytorch/pytorch/pull/31230 introduced a bug where we would end up calling `graph_task_post_processing` twice for reentrant backward calls (once when we mark the future completed and then we we called graph_task_post_processing in execute_with_graph_task). This PR fixes the issues by verifying the future we return in that case is completed and we remove the call to graph_task_post_processing. In addition to that I added a test that reproduced the problem and verified it is fixed by this PR. ghstack-source-id: 96349102 Test Plan: waitforbuildbot Differential Revision: D19296363 fbshipit-source-id: dc01a4e95989709ad163bb0357b1d191ef5a4fb2	2020-01-07 10:35:04 -08:00
Pritam Damania	fde94e7556	Provide async mode for local autograd engine. (#31230 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31230 A major issue with distributed autograd currently is that we block an RPC thread when we call Engine::execute_with_graph_task. To resolve this issue, I've made modifications to the local autograd engine such that `execute_with_graph_task` returns a Future instead. The `execute()` methods for Engine::execute() and DistEngine::execute() still wait() on this Future which ensures there is no change in behavior yet. In follow up PRs we can modify the distributed autograd engine to take advantage of this Future. Closes #26359 ghstack-source-id: 96298057 Test Plan: waitforbuildbot Differential Revision: D18999709 fbshipit-source-id: 388f54467fd2415a0acb7df17bd063aedc105229	2020-01-05 00:29:28 -08:00
Shen Li	8a57362000	Fix index out of bound error in Engine::ready_queue_size when called before start_threads Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30967 Test Plan: Imported from OSS Differential Revision: D18887178 Pulled By: mrshenli fbshipit-source-id: 67baeac9214a4749ce7e9b4d89862c93620b2d5e	2019-12-09 14:39:07 -08:00
Pritam Damania	776fdda753	Add debug info API for distributed autograd. (#30642 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30642 Adding a couple of basic metrics for distributed autograd which would help in determining stuckness. ghstack-source-id: 95156189 Test Plan: waitforbuildbot Differential Revision: D18776478 fbshipit-source-id: a0556ad6fe2b7c3cd0082ee2350c1c78cafaaec5	2019-12-07 13:56:51 -08:00
Nathan Goldbaum	f531815526	Deprecate tensor.type() (#30281 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/29161. I looked a bit at the code changes related to this and think I have all of the use cases of `DeprecatedTypeProperties` covered in the message, but suggestions from someone with more context on this would be very much appreciated :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/30281 Differential Revision: D18830818 Pulled By: ezyang fbshipit-source-id: 1a7fcee15354ae09e6644577e7fa33bd26acfe20	2019-12-05 10:55:34 -08:00
Prasun Anand	3cf8382984	detect_anomaly() for SparseTensors (#29803 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/28649 1. Modified detect_anomaly() to use isnan() 2. isnan() for SparseTensors returns a bool Tensor of _values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/29803 Differential Revision: D18594299 Pulled By: ezyang fbshipit-source-id: 3f4190c569f53219be330584fc604ca43c4a6c7a	2019-12-03 15:42:51 -08:00
Pritam Damania	1322daa506	Improve error handling for distributed autograd engine. (#27940 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940 1) If we receive an error for outstanding rpcs, we enqueue an appropriate error on the local autograd engine. 2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error. ghstack-source-id: 92603377 Test Plan: Added unit tests to test failures. Differential Revision: D17916844 fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0	2019-10-25 12:07:27 -07:00
Pritam Damania	3bccd3fc0d	Distributed Autograd - FAST mode backward pass implementation. (#27022 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022 This change implements the "FAST" mode distributed autograd backward pass as described in https://github.com/pytorch/pytorch/issues/23110. At a high level the backward pass works as follows: 1. We start by computing dependencies on the node that calls `torch.distributed.backward`. 2. This node computes the dependencies starting from the root nodes provided in the backward call and all the 'send' functions present in the current autograd context. The "FAST" mode assumes all 'send' functions are part of the autograd computation. 3. Once the dependency computation is done, the distributed autograd engine calls the local autograd engine to execute the autograd graph. Note that the autograd graph on a single node is not necessarily connected because of inter-node communication. As a result, we have special handling to ensure the local autograd engine ensures we execute the entire graph starting from the provided roots and all 'send' functions on the node. 4. When the local autograd engine hits a 'recv' function, it performs an async RPC to send the gradients over to the appropriate node and stores a future in the autograd context to keep track of this RPC. 5. On the destination node, the appropriate 'send' function is looked up and enqueued on the local autograd engine. If this is the first time the node is hearing about this autograd context id on the backward pass, then the node computes dependencies for the local autograd engine. 6. As part of compute dependencies, the distributed autograd engine discovers all leaf nodes and ensures those are passed as 'outputs' to the local autograd engine. This avoids running the 'AccumulateGrad' function. 7. The gradients computed for the leaf nodes are then actually accumulated in `DistAutogradContext` for the appropriate autograd context id. 8. The distributed autograd engine waits for the local autograd engine to complete and also waits for all the 'Futures' (stored in 4.) for respective RPCs to finish. We have made the following changes to the local autograd engine for this purpose: 1. Expose GraphTask and NodeTask so that the distributed autograd engine can use them. 2. Expose a `execute_with_graph_task` API which gives the distributed engine to build a GraphTask and pass it to the local autograd engine. 3. Expose a `enqueue_on_cpu` API, which allows the distributed engine to build a `NodeTask` for a 'send' function and enqueue it on the local autograd engine. In addition to this a few general improvements: 1. Added a `PropagateGradients` RPC call for the 'recv' function to pass gradients to the appropriate node during the backward pass. 2. Use IValues as much as possible in serialization for RpcWithAutograd. 3. If Future.wait(), contains a message type EXCEPTION, we throw an appropriate exception instead of just returning the message. This is inline with what most Future.wait() APIs do. 4. Added a `get_gradients(context_id)` API which allows users to retrieve a map from Tensor to respective gradient for the provided context_id on the local node. ghstack-source-id: 91794926 Test Plan: unit tests. Differential Revision: D17652615 fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3	2019-10-12 09:47:49 -07:00
Mike Ruberry	87a2c92615	Updates autograd engine to respect streams set in forward (#8354 ) Summary: This PR addresses issue https://github.com/pytorch/pytorch/issues/7601. Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user. For example, a model with forward declared like (original example courtesy of ngimel): ``` def forward(self,x): x0 = x.clone() torch._C._cuda_setStream(self.stream1._cdata) y0 = self.fc1(x0) self.event1.record(stream = torch.cuda.current_stream()) torch._C._cuda_setStream(self.stream2._cdata) y1 = self.fc2(x) self.event2.record(stream = torch.cuda.current_stream()) self.stream2.wait_event(self.event1) return y0 + y1 ``` currently will backward on a single stream. With this change the kernels will go on the streams they are assigned in forward and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously. The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use cuda events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often excessive cudaSetDevice() is not being called when inputs are added to a buffer. In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same and the expanded tracing might allow further optimizations in the autograd engine. (apaszke, for example, now after initial enumeration we know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354 Test Plan: Two tests were added specifically for this behavior. Differential Revision: D17275980 Pulled By: mruberry fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d	2019-09-10 23:46:51 -07:00
Brian Vaughan	88e4cee3e7	Improve handling of mixed-type tensor operations (#22273 ) Summary: Improve handling of mixed-type tensor operations. This PR affects the arithmetic (add, sub, mul, and div) operators implemented via TensorIterator (so dense but not sparse tensor ops). For these operators, we will now promote to reasonable types where possible, following the rules defined in https://github.com/pytorch/pytorch/issues/9515, and error in cases where the cast would require floating point -> integral or non-boolean to boolean downcasts. The details of the promotion rules are described here: https://github.com/nairbv/pytorch/blob/promote_types_strict/docs/source/tensor_attributes.rst Some specific backwards incompatible examples: * now `int_tensor * float` will result in a float tensor, whereas previously the floating point operand was first cast to an int. Previously `torch.tensor(10) * 1.9` => `tensor(10)` because the 1.9 was downcast to `1`. Now the result will be the more intuitive `tensor(19)` * Now `int_tensor *= float` will error, since the floating point result of this operation can't be cast into the in-place integral type result. See more examples/detail in the original issue (https://github.com/pytorch/pytorch/issues/9515), in the above linked tensor_attributes.rst doc, or in the test_type_promotion.py tests added in this PR: https://github.com/nairbv/pytorch/blob/promote_types_strict/test/test_type_promotion.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/22273 Reviewed By: gchanan Differential Revision: D16582230 Pulled By: nairbv fbshipit-source-id: 4029cca891908cdbf4253e4513c617bba7306cb3	2019-09-05 18:26:09 -07:00
Ilia Cherniavskii	936632b120	Thread local debug info Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22365 Test Plan: USE_CUDA=0 python setup.py develop ./build/bin/test_jit Imported from OSS Reviewed By: ajyu Differential Revision: D16065129 Pulled By: ilia-cher fbshipit-source-id: f300985459a83c2c1049ed8c4fefd23a3144047a	2019-08-12 14:53:57 -07:00
mal	e7a9b0d62f	Rename torch::autograd::Function to torch::autograd::Node Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23269 Test Plan: Imported from OSS Differential Revision: D16454878 fbshipit-source-id: b1e840fc2d3901955280d141e5ad6efd5e9d66af	2019-07-23 20:52:22 -07:00
mal	0140a756d8	Prioritize reentrant tasks and execute them recursively until close to limit Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22397 Test Plan: Added test for reentrant backwards with checkpoint and a test for a recursive backwards function (which should fail if we run all the reentrant tasks recursively in the same thread) and for testing priority of reentrant tasks. ~~Will add a test for priority of reentrant tasks in future pr.~~ Imported from OSS Differential Revision: D16131955 fbshipit-source-id: 18301d45c1ec9fbeb566b1016dbaf7a84a09c7ac	2019-07-05 08:51:06 -07:00
mal	c8b5f1d2f8	Switch autograd to use a pool of workers for each device (#21911 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21911 ghimport-source-id: 3b7d37481201aa4b4ca8f7767603d0dfd13f871f Test Plan: Tested on https://github.com/pytorch/pytorch/issues/6959 and ensured no Recursion Error. Performance testing: [word_language_model](https://gist.github.com/malvika2147/34c214871d549f9275812f2d20506990) (no significant change) [mnist](https://gist.github.com/malvika2147/77890eef102099490a1029122fb20dd0) (no significant change) [Comparison of performance](https://gist.github.com/malvika2147/c0a8790910b8513bd2e20b224bdd6300) on https://github.com/pytorch/pytorch/issues/6959 with smaller inputs. (slower by about ~25%, expected) Imported from OSS Differential Revision: D15985852 fbshipit-source-id: ca172690857fd1718462b80f3a244af9d8825d6c	2019-06-25 09:08:26 -07:00
mal	f308b07e8c	Don't leak threads on exit (#21438 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21438 ghimport-source-id: 33f145f5b3508163365442c22a223c4a44e677d8 Differential Revision: D15738856 fbshipit-source-id: 656e8d0e3d0d22f116e3ab66bf0282608d6f1a76	2019-06-10 09:14:13 -07:00
Malvika Joshi	9deab0cf0e	Documentation for locking discipline in engine.cpp/.h (#21548 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21548 Added documentation as titled. Reviewed By: ezyang Differential Revision: D15723146 fbshipit-source-id: fab4a35c62f07256673318c0874701f7628b2f7a	2019-06-10 07:50:01 -07:00
mal	4980b8b95c	Renaming member variables in engine.cpp/h (#21283 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21283 ghimport-source-id: 360a138e420ace3cd4ca6ccbc761c8e68319440d Differential Revision: D15607428 fbshipit-source-id: f8df6b42796a49c4d68fa8366b6a68d5715f6421	2019-06-03 12:54:50 -07:00
Roy Li	ab78449e8c	Add ScalarType argument to Type::options() (#19270 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19270 ghimport-source-id: a5ade6131f3260066c5750ea1fa9ed5c998bb791 Differential Revision: D14938707 Pulled By: li-roy fbshipit-source-id: 018fb3f01706531a06515d6d861e5683a455a705	2019-04-21 21:16:07 -07:00
Ilia Cherniavskii	646cb6157d	Move OMP/MKL thread initialization into ATen/Parallel (#19011 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19011 ghimport-source-id: 432e31eccfd0e59fa21a790f861e6b2ff4fdbac6 Differential Revision: D14846034 Pulled By: ilia-cher fbshipit-source-id: d9d03c761d34bac80e09ce776e41c20fd3b04389	2019-04-16 00:16:32 -07:00
Roy Li	c705d9eb1e	Introduce DeprecatedTypeProperties class (#17991 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17991 changes: -Breaks bc: Tensor::type() now returns DeprecatedTypeProperties& rather than Type&. -Added DeprecatedTypeProperties, it serves as a temporary replacement for Type as the return value of Tensor::type(). This contributes to making Type just for dispatch purposes so that we can make it dtype agnostic. -Tensor::dispatch_type() now returns Type& like Tensor::type() used to do. -Changed callsites of Tensor::type() appropriately. Reviewed By: ezyang Differential Revision: D14443117 fbshipit-source-id: 239ccb7a09626279a71d1a37f8f82e7f57bf7d9e	2019-04-04 02:24:13 -07:00
Davide Libenzi	272a48f6fe	Enable autograd to recognize the XLA backend as one providing multiple devices (#17847 ) Summary: …e devices, while not being CUDA/HIP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/17847 Differential Revision: D14545634 Pulled By: ezyang fbshipit-source-id: 417181bf2ff4f8978544afe2fb6b042e787854ed	2019-03-20 13:58:36 -07:00
bddppq	27f6a29fd0	Remove USE_CUDA and USE_ROCM in engine.cpp Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15893 Differential Revision: D13627319 Pulled By: zdevito fbshipit-source-id: 7c72c1c6cc242143fb66383423c668c9b9810884	2019-01-10 14:45:11 -08:00
albanD	828cb18fa3	Allow ReadyQueue to handle empty tasks (#15791 ) Summary: Allow the comparison function used in ReadyQueue to handle the empty FunctionTasks created by the reentrant autograd. Fix #11732 Pull Request resolved: https://github.com/pytorch/pytorch/pull/15791 Differential Revision: D13598006 Pulled By: soumith fbshipit-source-id: 0bfdf28a735fbfe44f0fdbaf8b74a6198e6a1984	2019-01-08 20:06:04 -08:00
Peter Goldsborough	7a61306031	Enable all clang-tidy performance checks (#15198 ) Summary: This PR adds the final set of clang-tidy checks we should add for our codebase: a last set of performance-related checks. Most fixes here are around changing `auto` to `const auto&` in a few places where unnecessary copies were made, and adding `reserve()` calls before loops doing repeated `push_back()`. Also a few cases of calling `std::string::find` with a single-character string literal instead of a single char, which uses a less efficient string search algorithm meant for searching larger substrings. ![image](https://user-images.githubusercontent.com/6429851/49978940-adc1a780-ff01-11e8-99da-a4e431361f07.png) ezyang apaszke Pull Request resolved: https://github.com/pytorch/pytorch/pull/15198 Differential Revision: D13468797 Pulled By: goldsborough fbshipit-source-id: 2bed1ea1c7c162b7f3e0e1026f17125e88c4d5b2	2018-12-14 13:32:47 -08:00
Peter Goldsborough	1e9c384afb	Enable performance-unnecessary-value-param in .clang-tidy (#15026 ) Summary: This PR fixes around 250 places in the codebase where we were making unnecessary copies of objects (some large, some small). ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/15026 Differential Revision: D13458784 Pulled By: goldsborough fbshipit-source-id: be5148b2ce09493588d70952e6f6d6ff5ec5199b	2018-12-13 16:15:35 -08:00
Edward Yang	2d485ffb17	Move CUDAGuard, CUDAStream and CUDAGuardImpl to c10/cuda (#14248 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248 This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current behavior of pretending CUDA is HIP when you build with ROCm enabled. Reviewed By: bddppq Differential Revision: D13145293 fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02	2018-12-12 11:24:26 -08:00
Edward Yang	b710642969	Make ATen HIPify out-of-place, but still reuse CUDA names. (#14866 ) Summary: ``` This diff changes the HIPification of ATen to be out-of-place. We now have the following mappings: - ATen/cuda => ATen/hip - ATen/native/cuda => ATen/native/hip - ATen/native/sparse/cuda => ATen/native/sparse/hip - THC => THH - THCUNN => THHUNN The build system is adjusted to know about these new build paths, and HIPify is taught how to adjust include paths and THC_GENERIC_FILE appropriately. ATen_hip is now built as the ATen_hip library, rather than reusing ATen_cuda. However, despite these new filepaths, none of the identifiers in ATen have actually changed. So, e.g., THHGeneral.h still defines functions named THC_blahblah, and HIP still shows up as CUDA in PyTorch itself. We'll tackle this in a subsequent PR; this diff is just to get the files out-of-place. Minor extra improvements: - Don't edit tmp_install when hipifying - HIP no longer builds native_cudnn_cpp; it was unnecessary - Caffe2_HIP_INCLUDES is now Caffe2_HIP_INCLUDE, for consistency with all the other variables. - HIP build now properly respects ATEN_CUDA_FILES_GEN_LIB (it did not previously.) - You can now override file extension matching in pyHIPIFY by explicitly specifying its full name in the matching list. This is used so we can HIPify CMakeLists.txt in some situations. A little bit of string and ceiling wax: - gen.py grows a --rocm flag so that it knows to generate CUDA files which actually refer to the HIP headers (e.g., THH.h) We'll get rid of this eventually and generate real HIP files, but not for this PR. - Management of HIP dependencies is now completely deleted from the ATen CMakeLists.txt. The old code was dead (because it was shoveled in ATen_CUDA_DEPENDENCY_LIBS and promptly ignored by the Caffe2 build system) and didn't actually work. ``` Stacked on https://github.com/pytorch/pytorch/pull/14849 review last commit only Pull Request resolved: https://github.com/pytorch/pytorch/pull/14866 Differential Revision: D13419475 Pulled By: ezyang fbshipit-source-id: cb4c843df69a1d8369314c9fab1b7719520fa3db	2018-12-11 19:15:27 -08:00
Edward Yang	517c7c9861	Canonicalize all includes in PyTorch. (#14849 ) Summary: Anywhere we used #include "foo.h", we now say #include <foo.h> Paths are adjusted to be rooted out of aten/src, torch/lib, or the root level directory. I modified CMakeLists.txt by hand to remove TH and THC from the include paths. I used the following script to do the canonicalization: ``` import subprocess import re import os.path files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n') for fn in files: if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']): continue if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]): continue with open(fn, 'r') as f: c = f.read() def fmt(p): return "#include <{}>".format(p) def repl(m): p = m.group(1) if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]: return fmt(p) if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]): return fmt(p) for root in ["aten/src", "torch/lib", ""]: for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]: new_p = os.path.relpath(os.path.join(bad_root, p), root) if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))): return fmt(new_p) print("ERROR: ", fn, p) return m.group(0) new_c = re.sub(r'#include "([^"]+)"', repl, c) if new_c != c: print(fn) with open(fn, 'w') as f: f.write(new_c) ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849 Reviewed By: dzhulgakov Differential Revision: D13363445 Pulled By: ezyang fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68	2018-12-08 19:38:30 -08:00
Junjie Bai	6651fae827	Make autograd engine compatible with hip Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14873 Differential Revision: D13375053 Pulled By: bddppq fbshipit-source-id: f3051640386667bbf0566856ed433eb83276c39e	2018-12-07 00:12:06 -08:00
Adam Paszke	8812a5d42e	Reduce broadcasted inputs in derivative code (#14485 ) Summary: Previously symbolic AD formulas assumed that no broadcasting happened, and would return gradients of incorrect shapes (possibly leading to silent errors later). Fixes a few bugs (known and unknown): - #11736 - ArgumentSpec didn't compute the input types correctly [(it didn't advance the offset for non-tensor args)](https://github.com/pytorch/pytorch/pull/14485/files#diff-4fd3157a056596aefb8cdf41022a208bR153) - Symbolic AD could suffer from use after free (dangling pointers in grad map), because [`EliminateDeadCode` could have removed nodes](https://github.com/pytorch/pytorch/pull/14485/files#diff-25d33ad1ed6855684dec79d927ca6142L781) that referenced gradients of certain values. - Undefined behavior in `aten::size` During my tests I've also found a few new problems, and I have opened issues for them: - FusionGroup seems to think that cat nodes broadcast their inputs (#14483) - `prim::ConstantChunk` derivative formula doesn't handle undefined inputs (#14484) This patch unfortunately deoptimizes some of our code (Fusion doesn't happen past chunk nodes, and outputs more tensors only because we have to get their size). I know how to fix those issues, but wanted to fix this terrible bug quickly. cc zou3519 zdevito ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/14485 Reviewed By: eellison Differential Revision: D13312888 Pulled By: suo fbshipit-source-id: ad46bfb4d0a306ad9451002f8270f7a790f72d58	2018-12-04 00:16:21 -08:00
Michael Suo	9ac845f734	Revert D13280899: [pytorch][PR] Reduce broadcasted inputs in derivative code Differential Revision: D13280899 Original commit changeset: 80cc5ec9331b fbshipit-source-id: 2335093cca8fd7db95470fd83b9299adfa17aa8e	2018-12-03 14:55:02 -08:00
Adam Paszke	68ffe46991	Reduce broadcasted inputs in derivative code (#14485 ) Summary: Previously symbolic AD formulas assumed that no broadcasting happened, and would return gradients of incorrect shapes (possibly leading to silent errors later). Fixes a few bugs (known and unknown): - #11736 - ArgumentSpec didn't compute the input types correctly [(it didn't advance the offset for non-tensor args)](https://github.com/pytorch/pytorch/pull/14485/files#diff-4fd3157a056596aefb8cdf41022a208bR153) - Symbolic AD could suffer from use after free (dangling pointers in grad map), because [`EliminateDeadCode` could have removed nodes](https://github.com/pytorch/pytorch/pull/14485/files#diff-25d33ad1ed6855684dec79d927ca6142L781) that referenced gradients of certain values. - Undefined behavior in `aten::size` During my tests I've also found a few new problems, and I have opened issues for them: - FusionGroup seems to think that cat nodes broadcast their inputs (#14483) - `prim::ConstantChunk` derivative formula doesn't handle undefined inputs (#14484) This patch unfortunately deoptimizes some of our code (Fusion doesn't happen past chunk nodes, and outputs more tensors only because we have to get their size). I know how to fix those issues, but wanted to fix this terrible bug quickly. cc zou3519 zdevito ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/14485 Differential Revision: D13280899 Pulled By: soumith fbshipit-source-id: 80cc5ec9331be80e1bb9ddfe85b81c2b997e0b0c	2018-12-03 13:44:18 -08:00
Michael Carilli	edb3ddf1a5	Accumulate grad fix (#14587 ) Summary: Rebased version of https://github.com/pytorch/pytorch/pull/13337. I don't think the lint errors in the original PR had to do with files I touched, so hopefully the rebase fixes them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14587 Differential Revision: D13277428 Pulled By: soumith fbshipit-source-id: f04c186b1dd4889b4250597eef87f9e9bf7b2426	2018-11-30 10:49:15 -08:00
Edward Yang	e35418b3be	New implementations of DeviceGuard, StreamGuard and MultiStreamGuard (with CUDA specializations) (#13342 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342 This PR introduces a few new concepts: - DeviceGuardImplInterface, and implementations for CPU and CUDA, which provide a generic interface for interfacing with device and stream state, without requiring a direct dependency on the code in question. - InlineDeviceGuard, a general template for generating both specialized and dynamically dispatched device guard implementations. Dynamic dispatch is done by specializing it on a VirtualGuardImpl. - Provide a device-independent DeviceGuard class, which can be used even from CPU code. It uses the aforementioned dynamic dispatch. - CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch but can only be used from CUDA. - StreamGuard, which is the same as above, but for streams rather than devices. - Optional variants of all the aforementioned guards, which are a no-op if no device/stream is specified - CUDAMultiStreamGuard, specifically for the case when we want to set a device on every guard. There are some subtle semantic changes, which have been thoroughly documented in the class definition. BC-breaking changes: - Move constructor/assignment have been removed from all device guard implementations. - In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write 'reset_device', because if you switch devices/device types, the stream/device on the previous device is unset. This is different from previous behavior. - CUDAGuard no longer handles streams, or multiple streams. Use CUDAStreamGuard or CUDAMultiStreamGuard as appropriate for your use case. Reviewed By: dzhulgakov Differential Revision: D12849620 fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e	2018-11-11 12:11:10 -08:00
Peter Goldsborough	0479517325	Add modernize-* checks to clang-tidy (#13196 ) Summary: Enables almost all `modernize-*` checks in clang-tidy. This warns against things such as: - Use of `const std::string&` instead of new-style `std::string` + move, - Using old-style loops instead of range-for loops, - Use of raw `new` - Use of `push_back` instead of `emplace_back` - Use of `virtual` together with `override` (`override` is sufficient) ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/13196 Differential Revision: D12891837 Pulled By: goldsborough fbshipit-source-id: 4d0f782a09eb391ee718d3d66f74c095ee121c09	2018-11-02 20:30:40 -07:00
Edward Yang	e5d56659ec	Delete DeviceGuard(int64_t) constructor. (#13232 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13232 DeviceGuard should be device agnostic, which means that it shouldn't assume that int64_t means select the CUDA device. Reviewed By: gchanan Differential Revision: D10858024 fbshipit-source-id: b40e8337e4046906fd8f83a95e6206367fb29dbe	2018-10-31 07:55:11 -07:00
Yangqing Jia	08aab4dfdd	remove ATen/Error.h and ATen/core/Error.h (#12792 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12792 This is a follow up diff after D10238910. Only non-codemod change is the removal of ATen/Error.h and ATen/core/Error.h. Other files are basically changing the inclusion path + clang format for inclusion order. Reviewed By: bddppq Differential Revision: D10437824 fbshipit-source-id: 7f885f80ab5827468d1351cfb2765d0e3f555a69	2018-10-17 17:25:42 -07:00
Yangqing Jia	13cf39294d	Remove ATen/Error.h and use ATen/core/Error.h instead. (#12132 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12132 TSIA. No code change involved. Reviewed By: bwasti Differential Revision: D10083237 fbshipit-source-id: bdab029015b9d0f1fa1f866c68aa5945cc68db9d	2018-09-27 10:11:17 -07:00
Christian Puhrsch	a9e6a673ae	Remove caffe2::Tensor::capacity_nbytes, at::Tensor::to##name##Data, (#11876 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11876 Modern C++ api instead of macros, item() is aligned with Python frontend. caffe2::Tensor::capacity_nbytes is effecitvely unused and confusing w.r.t. caffe2::Tensor::nbytes(). codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCByte "item<uint8_t>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCLong "item<int64_t>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCInt "item<int32_t>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCDouble "item<double>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCFloat "item<float>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toByteData "data<uint8_t>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toLongData "data<int64_t>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toIntData "data<int32_t>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toDoubleData "data<double>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toFloatData "data<float>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCByte "item<uint8_t>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCLong "item<int64_t>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCInt "item<int32_t>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCDouble "item<double>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCFloat "item<float>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toByteData "data<uint8_t>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toLongData "data<int64_t>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toIntData "data<int32_t>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toDoubleData "data<double>" codemod -d hphp --extensions cc,cpp,cu,cuh,h,py,hpp,mm toFloatData "data<float>" codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCComplexDouble "item<std::complex<double>>" codemod -d tc --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCFloat "item<float>" Reviewed By: ezyang Differential Revision: D9948572 fbshipit-source-id: 70c9f5390d92b82c85fdd5f8a5aebca338ab413c	2018-09-24 10:40:10 -07:00
Tongzhou Wang	7df6650e9c	Fix empty embedding bag on cuda (#11740 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/11739 Pull Request resolved: https://github.com/pytorch/pytorch/pull/11740 Differential Revision: D9881392 Pulled By: SsnL fbshipit-source-id: 2964d314f199dd9b4bb69e36592b67efdf5e0760	2018-09-17 14:40:03 -07:00
Peter Goldsborough	dccd0f2de6	Bag of clang tidy fixes for torch/csrc/ and torch/csrc/autograd (#11050 ) Summary: Linting `torch/csrc/` (non-recursive) and `torch/csrc/autograd` (non-recursive). Fixed things like: - `typedef` vs `using` - Use `.empty()` instead of comparing with empty string/using `.size() == 0` - Use range for loops instead of old style loops (`modernize-`) - Remove some `virtual` + `override` - Replace `stdint.h` with `cstdint` - Replace `return Type(x, y)` with `return {x, y}` - Use boolean values (`true`/`false`) instead of numbers (1/0) - More ... ezyang apaszke cpuhrsch Pull Request resolved: https://github.com/pytorch/pytorch/pull/11050 Differential Revision: D9597505 Pulled By: goldsborough fbshipit-source-id: cb0fb4793ade885a8dbf4b10484487b84c64c7f2	2018-09-05 19:55:50 -07:00
mruberry	9b1a65bec3	Extends type and shape tracing with device (#9796 ) Summary: This PR extends the existing type and shape metadata tracing and verification done in autograd with device information. This expansion of tracing is required for #8354, is likely useful in other scenarios, and is a healthy sanity check, just like type and shape tracing. The precise changes are: - TypeAndShape -> InputMetadata, now includes device() - Creating InputMetadata is simplified to just require a tensor, and callers were updated to use this simpler invocation wherever possible - The gradient accumulator of a variable is now reset when set_data() is called if either the type or device changes, and this reset now locks to avoid contention with acquiring the gradient accumulator - Mismatched devices during backward() will throw a runtime error, just like mismatched type and shape - (Bonus!) Two uninitialized pointers in THCReduce are now initialized (to nullptr) to prevent build warnings fyi colesbury Pull Request resolved: https://github.com/pytorch/pytorch/pull/9796 Reviewed By: goldsborough Differential Revision: D9119325 Pulled By: ezyang fbshipit-source-id: 76d1861b8d4f74db0575ff1f3bd965e18f9463de	2018-08-07 12:25:17 -07:00
Peter Goldsborough	04939a4745	Match parameter names and = default (#9737 ) Summary: More clang tidy cleanups in `torch/csrc`. This time: 1. `hicpp-use-equals-default` recommends `= default` instead of `{}` for constructors/destructors. This is better practice because it expresses the intent better (https://stackoverflow.com/questions/6502828/what-does-default-mean-after-a-class-function-declaration) 2. `readability-inconsistent-declaration-parameter-name` enforces that parameter names in the declaration match parameter names in the definition. This is just generally useful and can prevent confusion and bugs. Also updated my script a little bit. apaszke ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/9737 Differential Revision: D9069069 Pulled By: goldsborough fbshipit-source-id: f7b3f3a4eb4c9fadc30425a153566d3b613a41ae	2018-07-30 14:10:00 -07:00
Sam Gross	829d763c69	Implement add, sub, mul, div using TensorIterator (#8919 ) Summary: ``` This adds TensorIterator, a helper class for computing element-wise operations that's intended to replace the CPU and CUDA apply utils functions. CPU kernels are implemented as functions that operate on strided 1-d tensors compared to CPUApplyUtils which operated individual elements. This allows the kernels to handle vectorization, while TensorIterator handles parallelization and non-coalesced dimensions. GPU kernels continue to operate on elements, but the number of specializations is reduced. The contiguous case remains the same. The non-contiguous case uses a single (reduced) shape for all operands and the fast integer division from THCIntegerDivider. To avoid extra specializations for indexing with 64-bits, large operations are split into smaller operations that can be indexed with 32-bits. Major semantic changes: - No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by TensorIterator. The autograd engine performs the reduction assuming standard broadcasting if the gradient shape does not match the expected shape. Functions that do not use standard broadcasting rules should either continue to trace the expand calls or handle the reduction in their derivative formula. - Use ONNX v7, which supports broadcasting ops. Performance impact: - Small increased fixed overhead (~0.5 us) - Larger overhead for wrapped numbers (~2.5 us) - No significant change for ops on contiguous tensors - Much faster worst-case performance for non-contiguous GPU tensors - Faster CPU bias addition (~2x) - Faster GPU bias addition (~30% faster) Future work: - Decrease overhead, especially for wrapping numbers in Tensors - Handle general inter-type operations - Extend to unary ops and reductions - Use buffering for compute-bound operations on non-contiguous tensors (pull in from CPUApplyUtils) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919 Differential Revision: D8677600 Pulled By: colesbury fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd	2018-07-27 14:43:24 -07:00
Mary McBreen	483ae8cb5d	Replaces const ref with && for apply (#9175 ) Summary: Addresses https://github.com/pytorch/pytorch/issues/5011 Tested with python test/test_autograd.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/9175 Reviewed By: zdevito Differential Revision: D8736377 Pulled By: marymcbreen fbshipit-source-id: ff86f427f7b2cf0cab5912e7f32812bd0f49a712	2018-07-12 08:31:59 -07:00

1 2 3

107 Commits