Commit Graph

36 Commits

Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Edward Z. Yang
d303665d33 Make int unspecialization actually work (#95621)
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.

The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.

* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.
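A hedged sketch of the intended user-visible behavior (the `torch.compile(dynamic=True)` usage and exact recompile behavior here are my illustration, not part of this PR):

```python
import torch

# A minimal sketch, assuming dynamic shapes are enabled: a plain int
# argument is traced as a SymInt instead of burned in as a constant.
@torch.compile(dynamic=True)
def f(x, n):
    return x + n

x = torch.randn(4)
f(x, 2)  # traces with n unspecialized (a SymInt)
f(x, 3)  # should reuse the same graph rather than recompile
f(x, 1)  # 0 and 1 are still specialized by default per this PR
```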

Fixes https://github.com/pytorch/pytorch/issues/95469

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
2023-03-04 01:22:08 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Shunting Zhang
5d29b68bbc [inductor] generate triton kernel benchmark (#95506)
A PR to generate benchmark code for individual triton kernels. We can explore improving autotuning with the saved compiled kernel directly. This can potentially speed up our iteration and separate this concern from the upstream components that generate the compiled module.

Since I'm still ramping up on inductor, I'll reflect what I learned here so people can correct me if I'm wrong. In inductor, the WrapperCodeGen class is used to generate the compiled module for CUDA (or triton). Here is an example compiled module for a toy model like `def f(x): return sin(x) + cos(x)`: https://gist.github.com/shunting314/c6ed9f571919e3b414166f1696dcc61b . A compiled module contains the following parts:
- various triton kernels
- a wrapper (a method named `call`; the name is hardcoded) that calls the triton kernels and potentially ATen kernels to efficiently do the same work as the original Fx graph being compiled by inductor
- some utility code that generates random inputs and runs the wrapper

The triton kernels in the compiled module are annotated with a decorator like `pointwise`, which is used for autotuning.
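For intuition, here is a heavily simplified, hand-written sketch of that structure for the toy model above (the kernel body and wrapper are illustrative, not the actual generated code):

```python
import torch
import triton
import triton.language as tl

# part 1: a triton kernel (inductor's version carries a pointwise decorator
# with autotuning metadata; omitted here)
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK: tl.constexpr):
    xindex = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    tmp0 = tl.load(in_ptr0 + xindex, xmask)
    tl.store(out_ptr0 + xindex, tl.sin(tmp0) + tl.cos(tmp0), xmask)

# part 2: the wrapper; the name `call` is hardcoded
def call(args):
    (arg0,) = args
    buf0 = torch.empty_like(arg0)
    grid = lambda meta: (triton.cdiv(arg0.numel(), meta["XBLOCK"]),)
    triton_[grid](arg0, buf0, arg0.numel(), XBLOCK=1024)
    return (buf0,)

# part 3: utility code that generates random inputs and runs the wrapper
if __name__ == "__main__":
    x = torch.randn(4096, device="cuda")
    (out,) = call((x,))
    torch.testing.assert_close(out, torch.sin(x) + torch.cos(x))
```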

This PR adds a config; enabling it triggers printing the path of the compiled module. It can be controlled from an environment variable as well.

The path to each compiled triton kernel is added as a comment in the compiled module. E.g.
```
# kernel path: /tmp/torchinductor_shunting/gn/cgn6x3mqoltu7q77gjnu2elwfupinsvcovqwibc6fhsoiy34tvga.py
triton__0 = async_compile.triton('''
import triton
import triton.language as tl
...
""")
````

Example command:
```
TORCHINDUCTOR_OUTPUT_COMPILED_MODULE_PATH=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training --dashboard --only AlbertForMaskedLM --disable-cudagraphs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95506
Approved by: https://github.com/Chillee
2023-03-01 18:29:07 +00:00
Christian Puhrsch
1fe2a9d122 Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)
Add _int_mm primitive that binds cuBLAS int8@int8 -> int32 matmul and that translates to Triton-based mm templates under max autotune. This is a very useful first step towards better supporting quantization on the GPU. This is not a user-facing API, but an internal primitive.
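A hedged usage sketch (requires CUDA; the shape constraints noted in comments are assumptions, not verified against the landed checks):

```python
import torch

# int8 @ int8 -> int32 on the GPU via cuBLAS; sizes chosen to satisfy the
# primitive's alignment checks (assumed: k and n multiples of 8, m > 16)
a = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (64, 16), dtype=torch.int8, device="cuda")
c = torch._int_mm(a, b)
assert c.dtype == torch.int32
# verify against an int32 matmul on CPU (integer mm is unsupported on CUDA)
torch.testing.assert_close(c.cpu(), a.cpu().int() @ b.cpu().int())
```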

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94339
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-27 20:27:25 +00:00
Jason Ansel
d3e1f165b3 Copy helper next_power_of_2 from triton (#95436)
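The helper is presumably the standard bit-twiddling idiom; a sketch of what such a copy looks like (not verified against the landed code):

```python
def next_power_of_2(n: int) -> int:
    """Return the smallest power of 2 greater than or equal to n."""
    n -= 1
    n |= n >> 1
    n |= n >> 2
    n |= n >> 4
    n |= n >> 8
    n |= n >> 16
    return n + 1

assert next_power_of_2(1) == 1
assert next_power_of_2(1000) == 1024
```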
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95436
Approved by: https://github.com/ngimel
2023-02-26 20:49:36 +00:00
Nicolas Macchioni
17d0b7f532 [pt2][inductor]global autotuning cache (#94922)
Summary:
this diff adds logic to handle a global autotuning cache, stored in json format at config.global_cache_path.

what is changing from `DiskCache`:
* `DiskCache` is renamed to `PersistentCache`
* the local cache is now stored as a single file in json format, located at `/tmp/torchinductor_{$USER}/local_cache`. the file contains a dictionary structure like `local_cache[name][inputs][choice]`, where `name` is the type of operation (like `addmm`), `inputs` is the repr of the inputs, and `choice` is the hash of a `ChoiceCaller`. the stored value is the benchmark time for that `ChoiceCaller` (a sketch of this layout follows after this list).
* a global cache is added, initially stored at `fbcode/caffe2/torch/_inductor/global_cache`, with an almost identical format to the local cache. since the global cache exists over different machines, there is an additional `dinfo` field, such that `global_cache[dinfo] = local_cache` (at least structure-wise; there is no guarantee that the global cache and local cache share the same values). `dinfo` is just a repr of the cuda device properties.
* the autotuner will prioritize the global cache, and return values from there first, before looking in the local cache
* the autotuner will look in both the global cache and the local cache even when `max_autotune=False`, but will still only generate values if `max_autotune=True`.
* the autotuner will log global cache hits and misses to a scuba table (inductor_autotuning_cache) which will be used to update the global cache at regular intervals
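As referenced above, a hypothetical illustration of the local cache layout (keys abbreviated; real entries use full input reprs and `ChoiceCaller` hashes):

```python
import json

local_cache = {
    "addmm": {                                  # name: the type of operation
        "repr_of_inputs": {                     # inputs: repr of the inputs
            "choice_hash_aten": 0.021,          # benchmark time for choice
            "choice_hash_triton_128x128": 0.017,
        }
    }
}
with open("/tmp/torchinductor_user/local_cache", "w") as f:  # path illustrative
    json.dump(local_cache, f)
```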

Test Plan: D43285472

Differential Revision: D42785435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94922
Approved by: https://github.com/jansel
2023-02-19 05:35:18 +00:00
Aaron Gokaslan
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving it into just the set call.
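An illustrative before/after for the kinds of rewrites applied:

```python
b = [1, 2, 2, 3]

# before: generators passed to constructors
s1 = set(a for a in b)        # useless generator
l1 = list(x * 2 for x in b)   # unnecessary generator expression

# after: direct constructor call / comprehension
s2 = set(b)
l2 = [x * 2 for x in b]

assert s1 == s2 and l1 == l2
```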

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
Aaron Gokaslan
3d82d8d0ed [BE] Enable more flake8-comprehensions checks (#94601)
I applied some flake8 fixes and enabled checking for them in the linter. I also enabled some checks for my previous comprehensions PR.

This is a follow up to #94323 where I enable the flake8 checkers for the fixes I made and fix a few more of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94601
Approved by: https://github.com/ezyang
2023-02-10 23:40:29 +00:00
Jason Ansel
24ae50bcc7 Add config option to reduce warnings in inductor (#94413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94413
Approved by: https://github.com/ezyang
2023-02-10 15:44:15 +00:00
Edward Z. Yang
dc70b00d0b Track and record hint on SymNode and use when possible (#94201)
Historically, we worked out `size_hint` on the fly by substituting the `var_to_val` mapping into the sympy expression. With this change, we also maintain the hint directly on SymNode (in `expr._hint`) and use it in lieu of sympy substitution when it is available (mostly for guards on SymInt, etc.; in particular, in idiomatic Inductor code, we typically manipulate sympy expressions directly and so do not have a way to conveniently maintain hints).

While it's possible this will give us modest performance improvements, this is not the point of this PR; the goal is to make it easier to carefully handle unbacked SymInts, where hints are expected not to be available. You can now easily test if a SymInt is backed or not by checking `symint.node.hint is None`.
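A small helper sketch based on the check named above (`.node` and `.hint` are internal attributes, so treat this as illustrative):

```python
import torch

def is_backed(i) -> bool:
    """True if `i` is a plain int or a SymInt with a hint available."""
    if isinstance(i, torch.SymInt):
        return i.node.hint is not None  # unbacked SymInts have no hint
    return True
```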

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94201
Approved by: https://github.com/voznesenskym
2023-02-09 00:00:44 +00:00
Will Constable
f2156ef42b Make triton debug util reusable (#94225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94225
Approved by: https://github.com/Chillee
2023-02-08 22:03:35 +00:00
chunyuan
cff4d3bb22 inductor: fix convert_shape_to_symint (#93349)
Fixes https://github.com/pytorch/pytorch/issues/93833.

When `lst` is composed of a mix of static shapes and `sympy.Expr`, convert the static shapes to ints and the `sympy.Expr` entries to SymInts.
The old logic required all of the elements of `lst` to be static before converting them to ints.
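A hedged sketch of the fixed conversion (the real helper lives in torch._inductor; `to_symint` here stands in for inductor's expr-to-SymInt machinery):

```python
import sympy

def convert_shape_to_symint(lst, to_symint):
    # static entries become plain ints; symbolic entries become SymInts
    return [
        int(e) if isinstance(e, (int, sympy.Integer)) else to_symint(e)
        for e in lst
    ]
```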

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93349
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-02 07:34:57 +00:00
Horace He
19c9b09449 Replace IndexingDiv with FloorDiv in Inductor (#92878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92878
Approved by: https://github.com/ezyang
2023-01-24 15:06:22 +00:00
Horace He
20bf77f9bd Fixed virtualized import and typing rule (#92774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92774
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2023-01-22 22:19:40 +00:00
Horace He
5c4f0fd72c Change convolution to use symbolic shapes for propagation (#92397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92397
Approved by: https://github.com/ezyang
2023-01-21 21:54:24 +00:00
Horace He
4f4b62e4a2 some fixes to get symbolic shapes working through inductor (#92320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92320
Approved by: https://github.com/ezyang
2023-01-19 03:09:02 +00:00
Jason Ansel
9b173b87b2 Refactor away leftover import indirection (#92188)
These indirect ways of importing are a leftover from when we wanted to support both `import torchdynamo` and `import torch._dynamo`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92188
Approved by: https://github.com/desertfire
2023-01-18 04:53:05 +00:00
Jason Ansel
7c1c239db1 [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91575)
This reverts commit 94262efc7d to reland #91105 / #90738.

Fixes https://github.com/pytorch/torchdynamo/issues/2015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91575
Approved by: https://github.com/ngimel
2023-01-11 00:08:03 +00:00
PyTorch MergeBot
94262efc7d Revert "[inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)"
This reverts commit d6dd2e97da.

Reverted https://github.com/pytorch/pytorch/pull/91105 on behalf of https://github.com/atalman due to Broke internal builds
2022-12-21 00:02:38 +00:00
Jason Ansel
d6dd2e97da [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)
https://github.com/pytorch/pytorch/pull/90738 seems a bit borked. ghimport fails on it, and I unlinked it from the Phabricator diff, but it still won't land. This is an exact copy of that PR without using ghstack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91105
Approved by: https://github.com/ngimel
2022-12-20 02:38:23 +00:00
Peter Bell
81f351acd7 [inductor] Prevent blowup in inner_fn_str and extract_read_writes (#88933)
Currently the default `ops` handler expects strings as arguments and
just formats them into a function call template string. For complex
expressions, this can lead to exponential growth in terms. Say for
example you have:

```python
def fn(a):
    for _ in range(3):
        a = ops.mul(a, a)
   return a
```

You might expect `inner_fn_str` to contain 1 load and 3 multiplies,
but instead you find 8 loads and 7 multiplies:
```python
load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0)
```

This type of blowup is present in the lowering for
`max_pool2d_with_indices_backward`, which in pytorch/torchdynamo#1352
was reported to have caused the entire compilation to hang.

This PR fixes the issue by formatting the string as a series of assignments to
variables, so for the example above, we now get:
```
tmp0 = load(arg_0, i0)
tmp1 = tmp0 * tmp0
tmp2 = tmp1 * tmp1
tmp3 = tmp2 * tmp2
return tmp3
```

Which corresponds to the sequence of `ops` calls made.
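The blowup is easy to reproduce with plain strings; a standalone illustration of both formatting strategies:

```python
# naive formatting: shared subterms are duplicated exponentially
expr = "load(arg_0, i0)"
for _ in range(3):
    expr = f"{expr} * {expr}"
print(expr.count("load"))  # 8 loads instead of 1

# assignment-style formatting: growth stays linear
lines, cur = ["tmp0 = load(arg_0, i0)"], "tmp0"
for i in range(1, 4):
    lines.append(f"tmp{i} = {cur} * {cur}")
    cur = f"tmp{i}"
print("\n".join(lines))  # 1 load, 3 multiplies, as expected
```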

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88933
Approved by: https://github.com/jansel
2022-12-15 15:36:52 +00:00
Andrew M. James
7a7f29704f Remove hard numpy dep introduced by _inductor/utils.py (#90716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90716
Approved by: https://github.com/cpuhrsch
2022-12-13 04:58:26 +00:00
Natalia Gimelshein
a88400e0cc pad low precision matmuls when requested (#90235)
Matmul padding is beneficial not only for fp32; fp16/bf16 with amp can benefit as well.
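A hedged sketch of the underlying idea (zero-padding operands up to friendlier multiples, then slicing the result; this is not the PR's actual implementation):

```python
import torch
import torch.nn.functional as F

def padded_mm(a, b, mult=8):
    m, k = a.shape
    _, n = b.shape
    # zero-pad each dim up to the next multiple; zeros don't change the result
    a_p = F.pad(a, (0, -k % mult, 0, -m % mult))
    b_p = F.pad(b, (0, -n % mult, 0, -k % mult))
    return (a_p @ b_p)[:m, :n]

a, b = torch.randn(123, 253), torch.randn(253, 509)
torch.testing.assert_close(padded_mm(a, b), a @ b, rtol=1e-5, atol=1e-5)
```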

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90235
Approved by: https://github.com/jiawenliu64
2022-12-06 04:13:24 +00:00
Animesh Jain
d09c52e4fd [inductor] Deterministic kernel names (#89713)
`node.origins` is a set and does not have an order. Therefore, inductor experiments with and without cudagraphs generate different kernel names, making it hard to debug.
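An illustrative sketch of the determinism issue and one fix (sorting; the PR's exact approach may differ):

```python
origins = {"mul", "add", "sigmoid"}  # a set: iteration order is unstable
kernel_name = "triton_" + "_".join(sorted(origins))
print(kernel_name)  # triton_add_mul_sigmoid, stable across runs
```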

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89713
Approved by: https://github.com/soumith, https://github.com/mlazos, https://github.com/ngimel
2022-12-02 02:37:36 +00:00
Natalia Gimelshein
a188f05e8c Reland #89031 Added conv constraint that infers layouts (#89530)
Relands #89031
Per title. We now set strides from the fx graph only for convolutions and mm, which is a hack, but bmm in some cases caused an extra copy and there is no obvious way to fix that; we should rethink the strides anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530
Approved by: https://github.com/Chillee
2022-11-23 20:18:54 +00:00
Horace He
419ef2cdcf Added utility to count memory reads/written in Inductor (#89203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89203
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-11-19 04:18:26 +00:00
Jiawen Liu
55b88cde0a [Inductor] Build Shape Padding in Inductor (#88709)
Summary: Build shape padding for matmul/bmm/addmm in Inductor

Differential Revision: D41071282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88709
Approved by: https://github.com/bertmaher, https://github.com/Chillee
2022-11-15 03:10:36 +00:00
Michael Lazos
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds config to cap the number of ops in the kernel name so they don't get too long

Caveats:
- The ordering in the name may not match the order that the ops are executed in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
Elias Ellison
2381548071 add stride constraints to fallbacks (#88534)
Add stride/contiguity constraints to fallbacks so that inputs will be in the right stride permutation for the fallback kernel.

Improves perf of coat_lite_mini from 1.48415536054865 -> 2.010956856330101.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88534
Approved by: https://github.com/ngimel
2022-11-10 01:13:44 +00:00
Animesh Jain
d67b2edec3 [dynamo][dashboard] minor fixes for a clean Dashboard (#88056)
* better check for cold start latency
* sort on inductor column for better readability.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88056
Approved by: https://github.com/ngimel
2022-10-31 02:30:29 +00:00
Animesh Jain
1b575782a0 [dynamo][benchmarks] use fresh inductor cache and raise batch size wherever possible (#88044)
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88044
Approved by: https://github.com/ngimel
2022-10-30 17:10:17 +00:00
Horace He
2418ddb1ec Unified symbolic shape variables between Inductor and AOTDispatcher (#87161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161
Approved by: https://github.com/jansel
2022-10-19 04:50:34 +00:00
Zachary DeVito
d36c284d14 [triton] allow cuda properties to be queried from workers (#87101)
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork.

Actually attempting to get CUDA to load in the workers is probably not desired: cuda initialization takes O(seconds). Having multiple processes using the same device will slow things down.

This just moves the needed properties from the main trainer process to the workers.
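A hedged sketch of the pattern (function names here are illustrative):

```python
import torch

_CACHED_PROPS = None

def snapshot_cuda_props():
    """Run in the parent before forking workers; touches CUDA once."""
    global _CACHED_PROPS
    _CACHED_PROPS = [
        torch.cuda.get_device_properties(i)
        for i in range(torch.cuda.device_count())
    ]

def get_props_in_worker(device_index: int):
    """Run in a forked worker; reads the snapshot, never initializes CUDA."""
    return _CACHED_PROPS[device_index]
```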

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith
2022-10-18 04:48:29 +00:00
Jason Ansel
054a2fd6c2 Sync changes from pytorch/torchdynamo (#87013)
This updates to:
6380959be2

Generated with:
https://github.com/pytorch/torchdynamo/blob/main/copy_to_core.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87013
Approved by: https://github.com/voznesenskym
2022-10-15 21:00:57 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00