Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.
This PR implements a two-pass wrapper codegen for the Triton
backend to achieve ahead-of-time compilation. In the first pass, the
regular Python wrapper code is generated, and the generated code is then
executed to perform Triton compilation and autotuning.
After that, the second-pass wrapper codegen generates a C++ wrapper
that uses the proper CUDA APIs to load and launch the Triton-generated CUDA kernels.
As with the AOT mode for the cpp backend, the next step would be to provide
a more complete API for AOT.
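A toy, self-contained sketch of the two-pass idea (not Inductor's actual code; all names below are made up for illustration): pass 1 generates and executes a Python wrapper so the compilation/autotuning results become available, and pass 2 uses those results to emit the C++ wrapper.
```python
def first_pass_codegen():
    # Stands in for the regular Python/Triton wrapper; "autotuning" here just
    # picks a block size at runtime and records it.
    return (
        "tuned = {}\n"
        "def autotune():\n"
        "    tuned['BLOCK'] = max([64, 128, 256], key=lambda b: -abs(b - 128))\n"
        "autotune()\n"
    )

def second_pass_codegen(tuned):
    # Stands in for the C++ wrapper codegen, which in the real PR emits code
    # that loads and launches the compiled CUDA kernels; here we only
    # template in the tuned constant.
    return f"// launch kernel with BLOCK={tuned['BLOCK']} via the CUDA driver API\n"

scope = {}
exec(first_pass_codegen(), scope)            # pass 1: run the generated Python wrapper
print(second_pass_codegen(scope["tuned"]))   # pass 2: emit the C++ wrapper text
```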
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
The following metrics should be helpful:
- percent of time the GPU is busy
- percent of time each category of kernels (e.g. pointwise/reduction Triton kernels) takes
- percent of time each individual kernel takes compared to the total wall time of the benchmark
This PR adds those.
Example result from the hf_Bert inference graph:
```
== triton_pointwise category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_poi_fused_gelu_6_0d1d 0.48154 12.0 5.52%
triton_poi_fused_clone_1_0d1d2 0.29011 24.0 3.33%
triton_poi_fused_clone_2_0d1d2 0.17417 12.0 2.00%
triton_poi_fused_clone_4_0d1d2 0.10797 12.0 1.24%
Total 1.05379 12.08%
== triton_persistent_reduction category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_per_fused__softmax__to_ 0.97188 12.0 11.14%
triton_per_fused_add_native_la 0.37401 24.0 4.29%
triton_per_fused_gelu_native_l 0.02 1.0 0.23%
triton_per_fused_add_embedding 0.01718 1.0 0.20%
Total 1.38307 15.86%
== unknown category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
ampere_fp16_s16816gemm_fp16_12 2.24514 24.0 25.74%
ampere_fp16_s16816gemm_fp16_25 1.39796 49.0 16.03%
void cutlass::Kernel<cutlass_8 1.36093 1.0 15.61%
ampere_fp16_s16816gemm_fp16_64 0.74591 12.0 8.55%
ampere_fp16_s16816gemm_fp16_12 0.61989 12.0 7.11%
Memset (Device) 0.024 12.0 0.28%
void at::native::(anonymous na 0.01543 2.03 0.18%
void at::native::vectorized_el 0.00011 0.03 0.00%
Total 6.40937 73.49%
Percent of time when GPU is busy: 101.44%
```
Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled but measure GPU time with profiling enabled, which may distort the measurement a bit. I assume the effect is not too large, though, since the profiler mostly increases CPU time (rather than GPU time).
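For reference, a rough sketch of how numbers like these could be gathered with torch.profiler; `run_once`, `wall_time_ms`, and the per-step averaging are assumptions made for illustration, not the PR's actual implementation.
```python
import torch
from torch.profiler import profile, ProfilerActivity

def report_gpu_time(run_once, wall_time_ms, steps=10):
    """Aggregate per-kernel self CUDA time and compare it to the wall time
    measured separately (with profiling disabled)."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(steps):
            run_once()
        torch.cuda.synchronize()

    busy_ms = 0.0
    for evt in prof.key_averages():
        if evt.self_cuda_time_total > 0:  # keep only events with GPU self time
            ms = evt.self_cuda_time_total / 1000.0 / steps  # us -> ms, per step
            busy_ms += ms
            print(f"{evt.key[:30]:30} {ms:10.5f} ms {evt.count / steps:6.1f} "
                  f"{ms / wall_time_ms:7.2%}")
    print(f"Percent of time when GPU is busy: {busy_ms / wall_time_ms:.2%}")
```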
## interesting usages
1. I pick a model for which cudagraphs improves perf significantly, like densenet121, and run the tool on its forward graph. Unsurprisingly, the GPU is idle for quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```
Its backward graph has a lower percentage of GPU idle time, but it is still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```
2. I profile a subset of torchbench models and build a table showing the percent of execution time spent in pointwise/reduction/persistent_reduction/unknown_category kernels. Since I plan to explore using the coordinate descent tuner to improve reductions, models that spend a high percent of time in reductions should be good candidates (e.g. resnet50, mobilenet_v2).
NOTE: each model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile the different graphs of a model separately.
```
benchmark_name pointwise_percent reduction_percent persistent_reduction_percent unknown_category_percent GPU_busy_percent wall_time_ms
----------------------- ------------------- ------------------- ------------------------------ -------------------------- ------------------ --------------
resnet18 19.73% 7.86% 4.81% 41.25% 73.65% 2.549ms
resnet18 18.59% 7.13% 3.35% 67.35% 96.41% 3.467ms
resnet50 29.57% 22.13% 2.07% 51.68% 105.46% 6.834ms
resnet50 26.42% 15.27% 0.94% 59.68% 102.31% 13.346ms
vgg16 26.23% 0.00% 0.00% 74.20% 100.43% 18.212ms
vgg16 15.63% 5.61% 0.10% 79.42% 100.75% 33.485ms
BERT_pytorch 28.62% 4.82% 14.88% 33.32% 81.64% 7.162ms
BERT_pytorch 14.43% 13.41% 18.19% 49.24% 95.27% 10.395ms
densenet121 11.89% 2.14% 3.86% 16.36% 34.25% 16.531ms
densenet121 10.37% 2.06% 4.09% 31.46% 47.98% 16.934ms
hf_Bert 23.94% 0.00% 29.88% 46.09% 99.90% 7.766ms
hf_Bert 11.65% 10.54% 20.26% 61.66% 104.11% 11.892ms
nvidia_deeprecommender 42.92% 0.00% 0.00% 56.75% 99.67% 3.476ms
nvidia_deeprecommender 31.36% 3.44% 0.46% 65.20% 100.45% 3.872ms
alexnet 30.99% 0.00% 0.00% 69.16% 100.14% 3.169ms
alexnet 24.41% 4.83% 0.17% 71.09% 100.50% 4.709ms
mobilenet_v2 29.21% 27.79% 2.49% 44.00% 103.49% 10.160ms
mobilenet_v2 17.50% 15.05% 1.06% 69.68% 103.29% 20.715ms
resnext50_32x4d 18.96% 9.28% 2.31% 28.79% 59.33% 5.899ms
resnext50_32x4d 18.48% 11.01% 1.86% 53.80% 85.14% 7.167ms
mnasnet1_0 19.07% 14.52% 3.01% 35.43% 72.03% 6.028ms
mnasnet1_0 14.17% 12.00% 1.87% 67.56% 95.60% 9.225ms
squeezenet1_1 38.56% 0.00% 1.77% 56.21% 96.53% 2.221ms
squeezenet1_1 21.26% 7.57% 1.05% 67.30% 97.18% 4.942ms
timm_vision_transformer 17.05% 0.00% 18.80% 65.79% 101.64% 9.608ms
timm_vision_transformer 9.31% 9.07% 10.32% 73.25% 101.96% 16.814ms
```
## how to use
`python {compiled_module_wrapper.py} -p`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
V.graph.constants like seed_cuda_0 are not handled properly in the wrapper. We recently moved the code that initializes constants from the global scope into a function. As a result, assigning to seed_cuda_0 creates a new local variable rather than setting the global one.
Add 'global var_name' lines to maintain the same behavior as before.
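A minimal illustration of the Python scoping issue (not Inductor's actual wrapper code; the constant name is reused from the description above):
```python
seed_cuda_0 = None  # module-level constant expected by the generated code

def initialize_constants_broken():
    # Without a 'global' declaration, this assignment creates a new *local*
    # variable; the module-level seed_cuda_0 stays None.
    seed_cuda_0 = 42

def initialize_constants_fixed():
    global seed_cuda_0  # the kind of line this PR adds for each constant
    seed_cuda_0 = 42

initialize_constants_broken()
assert seed_cuda_0 is None
initialize_constants_fixed()
assert seed_cuda_0 == 42
```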
Test:
Run the forward graph of nvidia_deeprecommender's training run. It previously failed and now passes with the fix.
Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571
Approved by: https://github.com/ngimel
This is a follow-up to PR #95506 that runs all the Triton kernels in a compiled module individually, as suggested by Horace.
Here are the steps:
1. Run the model as usual with a benchmark script and with TORCHINDUCTOR_BENCHMARK_KERNEL enabled. e.g.
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --disable-cudagraphs --training
```
2. From the output we will see 3 lines like
```
Compiled module path: /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py
```
That's because we have one graph module each for the fwd/bwd/optimizer steps. Each graph module produces one such line pointing to its compiled module.
3. We can run the compiled module directly. Without any extra arguments, we keep the previous behavior and run the call function, which does what the original graph module does but in a more efficient way. If we add the '-k' argument, we benchmark each individual kernel in the file.
```
python /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py -k
```
Example output:
<img width="430" alt="Screenshot 2023-03-01 at 4 51 06 PM" src="https://user-images.githubusercontent.com/52589240/222302996-814a85be-472b-463c-9e85-39d2c9d20e1a.png">
Note: I use the first 10 characters of the hash to identify each kernel since
1. the hash is easier to get in the code :)
2. a name like `triton__3` only makes sense within a compiled module, but a hash makes sense even without specifying the compiled module (assuming we keep enough bytes of the hash)
If we find a Triton kernel with a hash like c226iuf2wi that has poor performance, we can look it up in the original compiled module file. This works because we comment each compiled Triton kernel with its full hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95845
Approved by: https://github.com/Chillee
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
Inductor implementations of collectives/wait must match the
eager impls in _functional_collectives in terms of how they interact
with the _register_tensor_work API. If they do, then splitting
a collective-wait pair so that one half is in a compiled graph should
work fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95893
Approved by: https://github.com/kumpera
A PR to generate benchmark code for individual Triton kernels. We can explore improving autotuning with the saved compiled kernels directly. This can potentially speed up our iteration and decouple this work from the upstream components that generate the compiled module.
Since I'm still ramping up on Inductor, I'll reflect what I learned here so people can correct me if I'm wrong. In Inductor, the WrapperCodeGen class is used to generate the compiled module for CUDA (or Triton). Here is an example compiled module for a toy model like `def f(x): return sin(x) + cos(x)`: https://gist.github.com/shunting314/c6ed9f571919e3b414166f1696dcc61b . A compiled module contains the following parts (a simplified sketch is shown after the list):
- the various Triton kernels
- a wrapper (a method named `call`; the name is hardcoded) that calls the Triton kernels and potentially ATen kernels to efficiently do the same work as the original FX graph being compiled by Inductor
- some utility code that generates random inputs and runs the wrapper
The Triton kernels in the compiled module are annotated with a decorator like `pointwise`, which is used for autotuning.
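A heavily simplified sketch of that structure for `f(x) = sin(x) + cos(x)`; the kernel here is a plain Python stand-in rather than a real async_compile.triton(...) kernel, and the names and details only loosely follow the linked gist:
```python
import torch

# (1) the Triton kernels: in a real compiled module these are defined via
#     async_compile.triton(...) and decorated (e.g. pointwise) for autotuning;
#     we stand in with eager ops so the sketch stays runnable.
def triton_fused_sin_cos_add_0(in_tensor, out_tensor):
    out_tensor.copy_(torch.sin(in_tensor) + torch.cos(in_tensor))

# (2) the wrapper: a function named `call` that allocates output buffers and
#     launches the kernels in order, doing the work of the original FX graph.
def call(args):
    arg0_1, = args
    args.clear()
    buf0 = torch.empty_like(arg0_1)
    triton_fused_sin_cos_add_0(arg0_1, buf0)
    return (buf0,)

# (3) utility code that generates random inputs and runs the wrapper.
if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1024, device=device)
    print(call([x]))
```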
This PR adds a config so that enabling it prints the path of the compiled module. It can also be controlled via an environment variable.
The path to each compiled Triton kernel is added as a comment in the compiled module, e.g.
```
# kernel path: /tmp/torchinductor_shunting/gn/cgn6x3mqoltu7q77gjnu2elwfupinsvcovqwibc6fhsoiy34tvga.py
triton__0 = async_compile.triton('''
import triton
import triton.language as tl
...
""")
````
Example command:
```
TORCHINDUCTOR_OUTPUT_COMPILED_MODULE_PATH=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training --dashboard --only AlbertForMaskedLM --disable-cudagraphs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95506
Approved by: https://github.com/Chillee
Summary:
Update the hashing method for the `ChoiceCaller` class.
`TritonTemplateCaller` objects will now be hashed to:
`{name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}`
for example:
`triton_mm-(64, 32, 32)-4-8-cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o`
`ExternKernelCaller` objects will now be hashed to:
`{name}-{kwargs.keys()[0]}={kwargs.vals()[0]}-...-{code_hash}`
for example:
`addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf`
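For illustration, a tiny snippet that reproduces the Triton-template format above; the function and parameter names are hypothetical, not ChoiceCaller's actual attributes:
```python
# Hypothetical helper mirroring the documented hash format; not the real
# TritonTemplateCaller implementation.
def triton_choice_hash(name, block_m, block_n, block_k, num_stages, num_warps, code_hash):
    return f"{name}-({block_m}, {block_n}, {block_k})-{num_stages}-{num_warps}-{code_hash}"

print(triton_choice_hash(
    "triton_mm", 64, 32, 32, 4, 8,
    "cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o",
))
# triton_mm-(64, 32, 32)-4-8-cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o
```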
Test Plan: sandcastle
Differential Revision: D43285470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94853
Approved by: https://github.com/jansel, https://github.com/bertmaher
This adds `torch.cuda._DeviceGuard`, which is a stripped-down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts an `int` as
the device, so we don't need to call `_get_device_index`, and it is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8 us of overhead to 0.94 us with this simple benchmark:
```python
import torch

def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
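A hedged, script-friendly variant using timeit instead of the IPython magic, comparing against the new guard; this assumes `torch.cuda._DeviceGuard` takes a plain int device index as described above and requires a CUDA-capable machine:
```python
import timeit
import torch

def with_device():
    with torch.cuda.device(0):
        pass

def with_device_guard():
    with torch.cuda._DeviceGuard(0):  # private API; int index only, per the PR
        pass

print("torch.cuda.device      :", timeit.timeit(with_device, number=10_000), "s")
print("torch.cuda._DeviceGuard:", timeit.timeit(with_device_guard, number=10_000), "s")
```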
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
Fixes https://github.com/pytorch/torchdynamo/issues/1717, https://github.com/pytorch/torchdynamo/issues/1990
<s>TODO: add test with multiple devices, figure out extra context initialization</s>
Problems:
<s>It still initializes context on 0-th device that it shouldn't, I'll take a look where that happens and fix before landing</s>
It adds a Python device context manager that is absurdly slow, taking ~2.5 us (it should be nanoseconds). That's not a problem for real models, because it will be called just once, but it is a bit of an inconvenience for microbenchmarking; we should make that context manager more performant (won't fix in this PR).
It can still have bugs for graphs that run on multiple devices, and buffers can be incorrectly shared between devices by memory reuse; if that happens, it will need to be solved separately.
Generated code:
```
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    with torch.cuda.device(1):
        buf0 = empty_strided((4, ), (1, ), device='cuda', dtype=torch.float32)
        stream1 = get_cuda_stream(1)
        triton_fused_div_0.run(arg0_1, arg1_1, buf0, 4, grid=grid(4), stream=stream1)
        del arg0_1
        del arg1_1
        return (buf0, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90934
Approved by: https://github.com/wconstab
## Pitch
Change input args type from `std::tuple` to `std::vector` to reduce the compilation time.
## Description
`std::tie()` takes quite a long time to compile when the number of input args grows.
For example, for a graph from the `PegasusForConditionalGeneration` model with 318 input args, compiling the `std::tie` for the args takes about 10s. By changing to `std::vector`, the compilation time of the arg assignment is reduced to less than 1s.
### Code before:
```cpp
at::Tensor call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    ...
    return buf0;
}
```
### Code after:
```cpp
at::Tensor call_0(std::vector<at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    arg0_1 = args[0];
    arg1_1 = args[1];
    ...
    return buf0;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90754
Approved by: https://github.com/jgong5, https://github.com/jansel
For reductions, the code string differs between the codegen stage and the execution stage due to `\`.
- The code string gotten from `code.getvalue()` (`code` is an `IndentedBuffer`) in codegen stage:
```
#pragma omp declare reduction(argmax : struct IndexValue_1 :\
omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
```
- The code string loaded during the execution (`\` will be escaped):
```
#pragma omp declare reduction(argmax : struct IndexValue_1 : omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value, omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index) initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
```
Thus we can't get the same hash value for these two pieces of code.
This PR adds a function that escapes the backslash in the codegen stage so that the two strings hash to the same value.
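A minimal, hand-written sketch of the hash mismatch and of one way to normalize the strings; this is not the PR's actual helper (which escapes the backslash during codegen), only an illustration of the problem:
```python
import hashlib
import re

def normalize_continuations(code: str) -> str:
    # Join "\<newline> + indentation" into a single space, mimicking how the
    # pragma looks once the backslash line continuations are resolved.
    return re.sub(r"\\\n\s*", " ", code)

codegen_form = (
    "#pragma omp declare reduction(argmax : struct IndexValue_1 :\\\n"
    "    omp_out.value = omp_in.value)"
)
runtime_form = (
    "#pragma omp declare reduction(argmax : struct IndexValue_1 : "
    "omp_out.value = omp_in.value)"
)

def digest(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

assert digest(codegen_form) != digest(runtime_form)
assert digest(normalize_continuations(codegen_form)) == digest(runtime_form)
```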
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88561
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire