pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Zachary DeVito	bdc57d3833	Merge ProfiledTensorType and TensorType (#24284 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24284 This PR finishes the unification of all Tensor types into a single object. ProfiledTensorType is renamed to TensorType and the old TensorType is deleted. Notes: * Fixes bug in merge for VaryingShape by changing its representation to an optional list of optional ints. * Removes ProfiledTensorType::create(type) invocations that can now simply be expect calls on tensor type. Test Plan: Imported from OSS Differential Revision: D16794034 Pulled By: zdevito fbshipit-source-id: 10362398d0bb166d0d385d74801e95d9b87d9dfc	2019-08-20 13:01:28 -07:00
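The "optional list of optional ints" representation mentioned in the commit above can be illustrated with a small sketch. This is not the C++ implementation from the PR; `merge_dim` and `merge_shape` are hypothetical names, and the merge rule shown (keep a dimension only when both sides agree, drop everything when the rank is unknown or differs) is an assumption about what a merge over such a representation would plausibly do.

```python
from typing import List, Optional

# None at the outer level means "rank unknown";
# None inside the list means "this dimension's size is unknown".
Shape = Optional[List[Optional[int]]]


def merge_dim(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Keep a dimension only when both sides agree on a known size."""
    return a if a is not None and a == b else None


def merge_shape(a: Shape, b: Shape) -> Shape:
    """Return the most specific shape description consistent with both inputs."""
    if a is None or b is None or len(a) != len(b):
        # Rank is unknown or mismatched, so no per-dimension facts survive.
        return None
    return [merge_dim(x, y) for x, y in zip(a, b)]
```

With two levels of optionality, "rank known but some sizes unknown" and "nothing known" are distinct states, which a flat list of ints cannot express.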
Zachary DeVito	0cbd7fa46f	remove CompleteTensorType Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24169 Test Plan: Imported from OSS Reviewed By: jamesr66a Differential Revision: D16765329 Pulled By: zdevito fbshipit-source-id: 88560cefba635c3d586a3e4dee67f9b1d901a642	2019-08-15 13:31:34 -07:00
Zachary DeVito	c2549cb8d3	Remove DimensionedTensorType (#24077) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24077 This replaces all uses of DimensionedTensorType with ProfiledTensorType. For places where we propagate shape information, we still follow the dimension-only propagation rules, meaning that even if full size information is known on inputs the outputs will only have dimension information. This fixes several bugs in existing implementations that this change uncovered: * requires_grad was not propagated correctly across loops * requires_grad on ProfiledTensorType returned false when requires_grad information is unknown but the conservative result is true * some equality code on ProfiledTensorType contained bugs. Test Plan: Imported from OSS Reviewed By: suo Differential Revision: D16729581 Pulled By: zdevito fbshipit-source-id: bd9f823c1c6b1d06a236a1b5b2b2fcdf0245edce	2019-08-13 10:05:47 -07:00
Nikolay Korovaiko	3d15ee1b34	Remove more uses of `DimensionedTensorType` Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23060 Differential Revision: D16460391 Pulled By: Krovatkin fbshipit-source-id: b50ee87d22ad18b8cbfff719b199ea876ef172f1	2019-08-01 21:19:28 -07:00
Thomas Viehmann	cf50249bde	Disable fusion of grad_sum_to_size (#23372 ) Summary: Fixes: https://github.com/pytorch/pytorch/issues/22833 grad_sum_to_size does not commute with AutogradAdd after all because it turns the broadcasting AutogradAdd into a broadcasting add. Chillee did actually do most of the tracking down to the fusion of grad_sum_to_size and pinging me when he had found the cause. Thank you! About the choice of removing the fusion completely instead of being more precise: - We do have grad_sum_to_size elimination which works for cases where broadcasting does not actually happen in the forward, so the cases where the fusing of grad_sum_to_size is actually beneficial is much smaller than when initially proposed. - There will be less fusion, in terms of the tests, IOU stops being fully fused. I vaguely think that it is a case we could handle with refined logic. - Keeping it would add complexity in checking when to merge fusion groups to the complexities that this PR removes. - The future of fusion probably lies more in more complete solutions including reductions (TVM or KeOps or our own or ...). Pull Request resolved: https://github.com/pytorch/pytorch/pull/23372 Differential Revision: D16489930 Pulled By: soumith fbshipit-source-id: bc0431b0d3eda264c401b634675872c4ce46f0f4	2019-07-25 08:55:33 -07:00
Bram Wasti	05d56bd1b6	Remove hard-coded NVRTC specific constant from fuser header Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22699 Test Plan: Imported from OSS Differential Revision: D16192290 Pulled By: bwasti fbshipit-source-id: 4dccaf3e6e0151e86d35474c36e1ddb7f2afb5cf	2019-07-11 13:44:25 -07:00
Roy Ju	a9a29dd63f	Fixes error when too many parameters are passed to fused cuda kernel (#18063) Summary: Bug fix for https://github.com/pytorch/pytorch/issues/15043, where a large fusion in JIT produces a kernel whose number of arguments exceeds the limit allowed by nvrtc on a cuda device. The fix is to check the number of arguments before a cuda kernel is generated. If the number exceeds the limit, take the runFallBack() path. Add a reduced test from the original issue to keep the test time low. The test would fail without this fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/18063 Differential Revision: D14691401 Pulled By: soumith fbshipit-source-id: b98829bc89ed7724e91eda82ae3a5a1151af721a	2019-04-09 22:37:09 -07:00
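The check-then-fall-back pattern described in the commit above can be sketched as follows. The names `run_fusion`, `compile_and_launch`, `run_fallback`, and the value of `MAX_KERNEL_ARGS` are all hypothetical; the commit message does not state the actual nvrtc limit, so the number here is a placeholder.

```python
from typing import Any, Callable, Sequence

# Placeholder value for illustration only; the real limit is not given
# in the commit message.
MAX_KERNEL_ARGS = 128


def run_fusion(
    args: Sequence[Any],
    compile_and_launch: Callable[[Sequence[Any]], Any],
    run_fallback: Callable[[Sequence[Any]], Any],
    max_args: int = MAX_KERNEL_ARGS,
) -> Any:
    """Count the kernel arguments before any code is generated."""
    if len(args) > max_args:
        # Too many arguments for the device compiler: run the unfused path.
        return run_fallback(args)
    return compile_and_launch(args)
```

Checking before code generation means an oversized fusion never reaches the compiler, so the failure becomes a slower but correct execution instead of an error.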
Natalia Gimelshein	ed47b85d3b	Allow fusion of float function arguments (#18087 ) Summary: so that functions like `def fn(x, p:float)` can be fused. Fixes #9940 and #11186. Fuses only float (not integer) arguments to simplify assembling arguments for fusion launch. CPU fusion is disabled in CI and this won't be tested, but I tested it locally. cc t-vi, apaszke Pull Request resolved: https://github.com/pytorch/pytorch/pull/18087 Differential Revision: D14581206 Pulled By: wanchaol fbshipit-source-id: ccb0cf79b1751706f9b2cdf1715115eae5a39fb6	2019-03-22 13:52:33 -07:00
Zachary DeVito	f34192db0f	Rename DynamicType -> TensorType (#16787) Summary:
```
import json
from subprocess import check_call
from pprint import pprint

renames = {
    'c10::TensorType': 'DimentionedTensorType',
    'c10::DynamicType': 'TensorType',
    'c10::TensorTypePtr': 'DimentionedTensorTypePtr',
    'c10::DynamicTypePtr': 'TensorTypePtr',
    'c10::TypeKind::DynamicType': 'TensorType',
    'c10::TypeKind::TensorType': 'DimentionedTensorType',
}

entries = json.loads(open('compile_commands.json', 'r').read())

build = None
sources = []

for e in entries:
    name = e['file']
    if not ('jit' in name or 'ATen/core' in name):
        continue
    build = e['directory']
    sources.append(name)

args = ['clang-rename', '-i', '-force', '-pl']
for name in sorted(renames.keys()):
    args += ['-qualified-name={}'.format(name), '-new-name={}'.format(renames[name])]

for source in sources:
    cmd = args + [source]
    pprint(args)
    check_call(cmd, cwd=build)
    check_call(['git', 'stash', 'push', '-m', 'rename'])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16787 Differential Revision: D13974132 Pulled By: zdevito fbshipit-source-id: 8368fd53e17cff83707bbe77f2d7aad74f8ce60e	2019-02-06 17:31:07 -08:00
James Reed	dfb081a7e4	Fix a lot of C++ build warnings (#16411 ) Summary: I went through my build log and did what I thought were reasonable fixes to all the C++ compilation warnings that came up Pull Request resolved: https://github.com/pytorch/pytorch/pull/16411 Differential Revision: D13901006 Pulled By: jamesr66a fbshipit-source-id: 02df4e3e5a5c8dd9e69ac9f065cd3f2a80645033	2019-01-31 14:35:56 -08:00
Thomas Viehmann	20d45c43d7	Get more fusion after autodiff uses SumToSize (#14957) Summary: Here is a fresh attempt at getting some fusion back in autodiff-generated graphs in the presence of SumToSize. - The sum to size operator is now `aten::_grad_sum_to_size` to allow symbolic script differentiation (and that in turn would need to use this in place of sum_to_size to signal that it strictly operates on gradients). This is also used in the autodiff code, replacing `prim::SumToSize`. - `_grad_sum_to_size` is now fusable, `cat`s - which are fused afterwards thanks to Adam's simplification of the code - are only fused if there is no `_grad_sum_to_size` in the fusion group. - I push the `_grad_sum_to_size` out of the fusion group when compiling and record the desired summations in the KernelSpec. The reasoning is the following: - As the autodiff is a repeated application of the chain rule, we always have the pattern `grad_in = mm(A, grad_out)`, with A often diagonal for cases interesting to the fuser, whence it is `grad_in = a * grad_out` (a pointwise multiplication). We know that only `grad_out` may have AutodiffGradSumToSize applied, so we can commute AutodiffGradSumToSize with the `mul` (and `div` and `neg` are of similar origin). - For `type_as` the gradient might be giving the type, so just skip SumToSize, - `add` (which was inserted as `prim::AutogradAdd`) adding gradients when the forward used the same value in several places. This is non-broadcasting, so we know that the two arguments would have the same sizes as inputs - which is good so we don't have to do bookkeeping of the two parts. Details: - During fusion, the Tensor arguments are always kept as the first parameters of the fusion group to accommodate indexing assumptions in the fuser. - The rewriting of the fusion group to record the necessary output transformation and eliminate `_grad_sum_to_size` from the fusion group is now in the fuser compile step. - In the execution step, the arguments are split into Tensor / Non-Tensor and the non-tensor args are mostly forgotten about except for doing `sum_to_size` at the end. This would want to be improved if/when we fuse nonconstant scalar arguments. - In a number of places in the fuser, the non-Tensor arguments to the fusion group needed to be ignored. Thank you, apaszke for the insightful discussion. All bad ideas and errors are my own. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14957 Differential Revision: D13888173 Pulled By: zou3519 fbshipit-source-id: 071992c876e8b845f2b3e6329ae03a835d39a0ea	2019-01-31 12:24:38 -08:00
Mikhail Zolotukhin	47bf30661f	Directly include headers from ATen. Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16287 Differential Revision: D13792949 Pulled By: ZolotukhinM fbshipit-source-id: d627d8dc469df048063c70d0b5b8d33fede809a3	2019-01-24 11:22:27 -08:00
Zachary DeVito	3f6b212e80	Register CPU/CUDA fuser dynamically (#15887 ) Summary: This avoids a bunch of conditional compilation logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/15887 Reviewed By: eellison Differential Revision: D13613239 Pulled By: zdevito fbshipit-source-id: a18fc69676b3ef19b4469ab58d8714d1f6efccbb	2019-01-11 10:50:35 -08:00
mruberry	1bc47c0d86	Refactors shape logic out of code generation, fixes possible segfault (#15750 ) Summary: This PR: - Removes shape logic from the code generator, which was previously relied on to return chunk and concat information - Copies the logic to detect if a kernel has a rand_like node to the executor, making its pass independent of the code generator - Fixes a possible segfault where references to a vector still being modified were relied upon The actual shape logic is unchanged. The possible segfault is in the handling of the former "flat_inputs" in codegen.cpp. This vector holds pairs, and the second element of these pairs is a reference. In some cases these would be references to items in the vector chunk_desc, which could be added to later, possibly invalidating any references to items in it. I hit a similar segfault in testing when naively making parallel code for "flat_outputs." I'm submitting this small PR because it's separable, self-contained, has a fix, and I am trying to actively get away from large PRs to encourage more stability and incremental change in the fuser. ngimel zou3519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/15750 Differential Revision: D13597451 Pulled By: zou3519 fbshipit-source-id: 0d48b365779b42849b044ba0286258aacc7b0332	2019-01-08 12:36:59 -08:00
Michael Suo	f636dc9276	clang format world (#15524 ) Summary: The PR clang-formats everything in `torch/csrc/jit/` and adds it to the pre-commit hook. Here is a list of non-mechanical changes: - I went over each file and fixed up whenever I could tell that clang-format was clobbering comment formatting. - Made the macros in register_prim_ops a little more clang-format friendly by omitting trailing commas - Refactored autodiff.cpp to use a helper class with explicit state rather than a bunch of capturing lambdas - Small improvements to the precommit hook clang-format Pull Request resolved: https://github.com/pytorch/pytorch/pull/15524 Differential Revision: D13547989 Pulled By: suo fbshipit-source-id: 3ff1541bb06433ccfe6de6e33f29227a2b5bb493	2018-12-26 06:55:01 -08:00
Peter Goldsborough	7a61306031	Enable all clang-tidy performance checks (#15198 ) Summary: This PR adds the final set of clang-tidy checks we should add for our codebase: a last set of performance-related checks. Most fixes here are around changing `auto` to `const auto&` in a few places where unnecessary copies were made, and adding `reserve()` calls before loops doing repeated `push_back()`. Also a few cases of calling `std::string::find` with a single-character string literal instead of a single char, which uses a less efficient string search algorithm meant for searching larger substrings. ![image](https://user-images.githubusercontent.com/6429851/49978940-adc1a780-ff01-11e8-99da-a4e431361f07.png) ezyang apaszke Pull Request resolved: https://github.com/pytorch/pytorch/pull/15198 Differential Revision: D13468797 Pulled By: goldsborough fbshipit-source-id: 2bed1ea1c7c162b7f3e0e1026f17125e88c4d5b2	2018-12-14 13:32:47 -08:00
Richard Zou	b14d6d730a	Reuse KernelSpec for FusionGroups with equivalent graphs (#14541 ) Summary: Before this PR, loop unrolling + the graph fuser was creating multiple FusionGroups with the same bodies (with different variable names) for JIT LSTMs. Each FusionGroup got registered to a separate fusion key; each key resulted in a different compilation for the same specializations. This PR makes it so that when registering FusionGroups with the fusion compiler, the compiler first checks the KernelSpec cache to see if the FusionGroup's graph exists already. If it does, then return the corresponding KernelSpec's key to share compiled kernels. In addition, graphs in the KernelSpec cache are canonicalized before being cached. I added a flag to the canonicalize pass to remove unique names of values. This shortens the compile time for a JIT LSTM (seq_len of 100, loop unroll factor of 8) from 5.3s to 2.3s. Most of this compile time is running the graph fuser and/or fusion compiler; while this PR makes it so that there is only one unique kernel in the forward pass, there are a lot of different kernels (6) in the backward pass (after loop unrolling) that should be investigated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14541 Differential Revision: D13324487 Pulled By: zou3519 fbshipit-source-id: b841d82ed35a959b5cfc72db033bf5a7b42cc4fb	2018-12-13 07:54:35 -08:00
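The caching idea in the commit above (canonicalize a graph by discarding its unique value names, then look it up before registering a new key) can be sketched in a few lines. This is a toy model, not the PyTorch code: graphs are represented here as lists of `(output, op, inputs)` tuples, and `canonicalize` and `KernelSpecCache` are hypothetical names for illustration.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, str, Tuple[str, ...]]  # (output name, op, input names)


def canonicalize(graph: List[Node]) -> Tuple[Node, ...]:
    """Rename every value to v0, v1, ... in order of first appearance."""
    names: Dict[str, str] = {}

    def rename(value: str) -> str:
        return names.setdefault(value, f"v{len(names)}")

    canonical = []
    for output, op, inputs in graph:
        new_inputs = tuple(rename(i) for i in inputs)
        canonical.append((rename(output), op, new_inputs))
    return tuple(canonical)


class KernelSpecCache:
    """Maps canonicalized graphs to fusion keys so equal bodies share a key."""

    def __init__(self) -> None:
        self._keys: Dict[Tuple[Node, ...], int] = {}

    def register(self, graph: List[Node]) -> int:
        canon = canonicalize(graph)
        # Reuse the existing key if an equivalent graph was seen before.
        return self._keys.setdefault(canon, len(self._keys))
```

Two graphs that differ only in variable names canonicalize to the same tuple and therefore receive the same key, which is what lets the unrolled loop bodies share one compiled kernel.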
Edward Yang	517c7c9861	Canonicalize all includes in PyTorch. (#14849) Summary: Anywhere we used #include "foo.h", we now say #include <foo.h> Paths are adjusted to be rooted out of aten/src, torch/lib, or the root level directory. I modified CMakeLists.txt by hand to remove TH and THC from the include paths. I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path

files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
    if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
        continue
    if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
        continue
    with open(fn, 'r') as f:
        c = f.read()

    def fmt(p):
        return "#include <{}>".format(p)

    def repl(m):
        p = m.group(1)
        if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
            return fmt(p)
        if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
            return fmt(p)
        for root in ["aten/src", "torch/lib", ""]:
            for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
                new_p = os.path.relpath(os.path.join(bad_root, p), root)
                if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
                    return fmt(new_p)
        print("ERROR: ", fn, p)
        return m.group(0)

    new_c = re.sub(r'#include "([^"]+)"', repl, c)
    if new_c != c:
        print(fn)
        with open(fn, 'w') as f:
            f.write(new_c)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849 Reviewed By: dzhulgakov Differential Revision: D13363445 Pulled By: ezyang fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68	2018-12-08 19:38:30 -08:00
Adam Paszke	31b3d81714	Broadcast prim::FusedConcat inputs independently when checking kernels (#14503 ) Summary: Fixes #14483. cc zou3519 mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/14503 Differential Revision: D13256343 Pulled By: zou3519 fbshipit-source-id: 1c68a23f425be067a742bada7ee8cdfab7fc3fa2	2018-11-29 13:05:00 -08:00
Thomas Viehmann	8408dff55a	Add Type support to the fuser, fuse more (#14336 ) Summary: This adds scalar type support to the fuser, both internally (instead of auto / assuming float) and for the inputs/outputs. We can now fuse things with input / output of arbitrary scalar type, in particular comparisons and where work well. So it fixes #13384 by returning the right type tensor (and adds a test where byte and double tensors are returned). The type inference is done by re-calling PropagateTensorShapeOnNode in the compilation, I would venture that it isn't prohibitively expensive compared to the actual compilation. (Propagation was fixed for where to return the second argument's type and amended to handle FusedConcat.) I'm not sure how to add a check for the code generated by the fuser, but I am not sure we absolutely need to (we'd see if it is invalid / produces wrong results). Thanks in particular to apaszke, fmassa, mruberry for advice and encouragement! All the errors are my own. I have discussed order of PRs briefly with mruberry, if this goes in before he submits the PR, he graciously agreed to rebasing his, but I'd happily rebase, too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14336 Differential Revision: D13202620 Pulled By: soumith fbshipit-source-id: 855159e261fa15f21aca3053bfc05fb3f720a8ef	2018-11-27 11:33:11 -08:00
Michael Suo	33d091f432	shape analysis fix (#14325 ) Summary: This PR is deceptively large because of an indenting change. The actual change is small; I will highlight it inline Pull Request resolved: https://github.com/pytorch/pytorch/pull/14325 Differential Revision: D13183296 Pulled By: suo fbshipit-source-id: fcbf6d5317954694ec83e6b8cc1c989f2d8ac298	2018-11-23 11:24:24 -08:00
Thomas Viehmann	1f871f126f	Have PYTORCH_FUSION_DEBUG print C kernel source (#14213 ) Summary: - Move up handling the environment variable from CPU only to all - Introduce two levels to be enabled with PYTORCH_FUSION_DEBUG=n: 1: print C source 2: print CPU assembly, too (previous effect of PYTORCH_FUSION_DEBUG) apaszke Pull Request resolved: https://github.com/pytorch/pytorch/pull/14213 Differential Revision: D13135393 Pulled By: soumith fbshipit-source-id: befa4ebea3b3c97e471393a9f6402b93a6b24031	2018-11-20 12:45:07 -08:00
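The two debug levels described in the commit above can be modelled as a small lookup. This is only a sketch of the behaviour stated in the message (level 1 prints the C source, level 2 additionally prints CPU assembly); the function names are hypothetical, and treating an unset or non-numeric value as level 0 is an assumption, not something the commit states.

```python
import os
from typing import List, Mapping


def fusion_debug_level(environ: Mapping[str, str] = os.environ) -> int:
    """Read PYTORCH_FUSION_DEBUG as an integer level."""
    raw = environ.get("PYTORCH_FUSION_DEBUG", "0")
    try:
        return int(raw)
    except ValueError:
        # Assumption: unparsable values disable debug output.
        return 0


def debug_actions(level: int) -> List[str]:
    """List what would be printed at a given level; levels are cumulative."""
    actions = []
    if level >= 1:
        actions.append("print C source")
    if level >= 2:
        actions.append("print CPU assembly")
    return actions
```

In a shell this would correspond to running a script with, for example, `PYTORCH_FUSION_DEBUG=1` set in the environment.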
ArmenAg	751b5ea941	use at::Device throughout JIT (#14181 ) Summary: zdevito soumith Sorry about the previous PR, had some git issues. This is the same exact code as the previous PR but updated w.r.t pytorch/master. fixes #13254 Pull Request resolved: https://github.com/pytorch/pytorch/pull/14181 Differential Revision: D13117688 Pulled By: soumith fbshipit-source-id: 044840b2c7a0101ef43dd16655fd9a0f9981f53f	2018-11-19 09:21:57 -08:00
mruberry	6fe089c6ea	Hierarchical device independent -> device specific architecture (#13108) Summary: This PR principally redesigns the fuser's logical flow to be hierarchical, with device-independent logic directing (relatively little) device-specific logic. This design is based on reviews of XLA, TVM, internal design review at NVIDIA and discussions with fuser owners at Facebook. To further vet the design I have begun developing the next significant PR (extended fusion logic) on top of this architecture and it has made the work significantly easier. This PR also improves fuser modularity, which should make it easier for others to contribute to. Unfortunately, this PR is large and its nature has made breaking it into smaller pieces challenging. Future PRs should be smaller. The fusion flow is now: - Fusions are "registered" and "upfront compilation" occurs. The fusion specifications, which include the graph, go into a thread-safe device-independent cache. Upfront compilation generates some information used later during shape inference. - Fusions are run, which passes them to an executor that performs shape inference, requests an instantiated fusion from the specification's thread-safe store, and launches them. Launch logic eventually defers to device-specific logic. - Fusions not previously instantiated are compiled. Compilation is device-specific and arg-specific. Compilation logic eventually defers to device-specific logic. - If the fusion could not be run because fusion on the requested device is disabled or shape inference fails a fallback is invoked. This flow can be thought of as PyTorch IR -> Device-Independent Fusion Logic -> Device-Specific Fusion Logic. The current upstream logic is, by contrast, PyTorch IR -> Device-Specific Logic -> Device-Independent Logic, which results in needless code duplication and lack of conceptual clarity. That was my mistake when splitting the fuser off from the rest of the jit and our reviews since then have been incredibly helpful in understanding why the approach in this PR is better. This PR does not only move code around. It also fixes a couple of bugs and makes some logical/code changes. Bug fixes: - thread-safety is improved with caches preventing concurrent access - the nvrtc version is now reviewed to determine the appropriate compute architecture to compile for, fixing a bug that would cause runtime errors if a user's nvrtc didn't support the compute architecture their gpu reported - an issue with DeviceGuard not setting the device properly and failing silently is worked around (ezyang mentioned he was reviewing the dynamic registration DeviceGuard uses, which may resolve the issue) Code/Logical changes: - "const" now appears many more places (note: I cast const away in operator.h because of some obscure build issues -- I think we should be able to fix this and will take a look while this goes through testing) - The new flow allowed some redundant code to be removed (AnnotatedGraph is gone, for example, and the more straightforward flow eliminated duplication of effort elsewhere) - Fallback logic is now also invoked if a fusion is requested on a device that cannot handle fusions - Use of macros to determine which files are compiled is reduced (though they may come back if the Windows build is unhappy) - There is no more "common" code or folder, the device-independent logic being at the forefront of the fuser replaces and improves upon the goal of sharing code apaszke who I promised naming rights to zdevito who correctly pointed out that the device-independent logic should be the bulk of what the fuser is doing ngimel who contributed to the design of this architecture Pull Request resolved: https://github.com/pytorch/pytorch/pull/13108 Reviewed By: gchanan, fmassa Differential Revision: D12850608 Pulled By: soumith fbshipit-source-id: 24e2df6dfa97591ee36aeca8944519678c301fa3	2018-10-31 18:13:00 -07:00

24 Commits