Fixes cpp wrapper support for kernels that are not exposed in `torch.ops.aten`. This PR limits the support scope to `repeat_interleave.Tensor`; follow-up PRs will cover more OPs.
The PR maps the python schema of the kernel to the cpp schema and uses `c10::Dispatcher::singleton().findSchemaOrThrow` to find the corresponding cpp OP.
The current support is limited and will raise `AssertionError` for unsupported cases.
The limitations include:
- only supports kernels that are not aliases
- only supports kernels whose args and returns carry no `alias_info`
- only supports output args that are `Tensor`
- only supports input args that are `Tensor`, `Optional[int]`, `Optional[float]`, or `Optional[bool]`
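For illustration, a minimal Python sketch of the same lookup through the dispatcher. This goes through the private binding `torch._C._dispatch_find_schema_or_throw`, which is an assumption for illustration; the generated cpp wrapper calls `c10::Dispatcher::singleton().findSchemaOrThrow` directly:
```python
import torch

# Find the op by qualified name and overload name, mirroring what the
# generated cpp wrapper does via c10::Dispatcher::findSchemaOrThrow.
# NOTE: _dispatch_find_schema_or_throw is a private binding and may differ
# across builds; this is illustrative, not the code this PR generates.
op_handle = torch._C._dispatch_find_schema_or_throw(
    "aten::repeat_interleave", "Tensor"
)
print(op_handle)  # the lookup throws if no such schema is registered
```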
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100788
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes #100314
In dependencies, we should track not only the immediately used buffer but also any aliased buffers that point to it; otherwise we can reuse and overwrite the buffer while there are still pending uses.
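A minimal sketch of the hazard (illustrative plain PyTorch, not Inductor code): a view aliases a buffer's storage, so reusing that storage after the buffer's last direct use clobbers the pending use of the alias.
```python
import torch

buf0 = torch.arange(4, dtype=torch.float32)
alias = buf0.view(2, 2)        # an aliased buffer pointing at buf0's storage
# buf0 has no further direct uses below, so a scheduler that tracks only
# direct uses would consider its storage free for reuse...
buf0.copy_(torch.zeros(4))     # ...but overwriting it here
print(alias)                   # corrupts the still-pending use of the alias
```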
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100332
Approved by: https://github.com/jansel
This is a two-part PR; I can split it if you really want me to.
The first part is a refactor of the after-AOT repro/minifier scripts to give them a command line interface. I maintain exact BC with the previous interface (so, e.g., you still get a repro.py and a run_minifier.py that do the same thing as before), but each of these scripts now also takes command line arguments which you can use to customize what actually happens. Check `run_repro` for full documentation on the arguments.
The second part is an implementation of the `analyze` subcommand on the new CLI, which works for any repro.
<img width="1277" alt="image" src="https://user-images.githubusercontent.com/13564/235045677-8545aab7-5e83-4813-bbec-47783dc60122.png">
This facility is oriented towards accuracy debugging. It does several things:
1. It will run your model twice and check for nondeterminism in inductor/float64, *even* on intermediate inputs (our benchmarking nondeterminism test only checks for nondeterminism on the final output). This makes localizing which operator is nondeterministic easy.
2. It will run your compiled model side-by-side with eager and float64 variants, and then report when things diverge too far, based on the RMSE delta from float64.
Importantly, it does all this without requiring every intermediate to be held in memory (which would cause an OOM on large repros, such as the one I tested this on).
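For example, a hypothetical invocation (assuming the generated script is named `repro.py`; see `run_repro` for the actual argument list):
```
python repro.py analyze
```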
Some other minor improvements:
* MinifierTestBase now has an easy-to-comment-out spot that you can use to retain the temporary directory; good for debugging
* We print "running minifier" and "running repro" in MinifierTestBase to make it easier to tell where logs are coming from
* `same` takes an optional `log_error` argument which you can use to reroute the error logs when things mismatch
* counters["inductor"]["intermediate_hooks"] tracks the number of intermediate hooks we've codegen'ed; good for populating the tqdm interface
* torch.fx.Interpreter gets an official `boxed_run` interface which uses the boxed arguments calling convention and doesn't retain inputs unnecessarily long (see the sketch after this list)
* torch.utils._content_store gets compute_tensor_metadata/read_tensor_metadata helper functions for computing tensor information without serializing it
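A minimal usage sketch of `boxed_run` (toy traced function; the calling convention is the point):
```python
import torch
import torch.fx

def f(x):
    return x.sin() + x.cos()

gm = torch.fx.symbolic_trace(f)
args = [torch.randn(4)]
# boxed_run takes ownership of the argument list and clears it, so the
# interpreter does not keep the inputs alive for the whole execution.
out = torch.fx.Interpreter(gm).boxed_run(args)
assert args == []  # inputs were consumed
```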
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100226
Approved by: https://github.com/bertmaher, https://github.com/bdhirsh, https://github.com/anijain2305
Currently, we track 'origins' on IR nodes so that we have some idea about what FX IR nodes contributed to any given fused kernel. However, the origins are dumped into an undifferentiated set, so if you have, e.g., multiple outputs, you cannot easily tell which output corresponds to which FX node.
This PR introduces a more precise notion of tracking, "origin_node", which says that the contents of this Buffer/Loop node correspond EXACTLY to the output of a particular FX node; e.g., if you serialized each intermediate when running the generated inductor code, you could compare them with the corresponding intermediates from the original FX graph.
Tracking origin_node in all cases requires quite a bit of effort, so this PR introduces the tracking on a strictly best-effort basis. The logic in torch/_inductor/graph.py sets up the associations, but only when it is "obvious" which IR node should get the assignment, and there is work in torch/_inductor/ir.py for propagating this information around as necessary. Like origins, origin_node is not a true dataclass field (as that would break all existing positional-arg call sites); instead, it is added post facto via `__post_init__`. At the moment, only Buffer/Loop nodes may have an origin_node, but we could imagine relaxing this in the future.
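A minimal sketch of the post-facto field pattern (illustrative classes, not Inductor's actual IR):
```python
from dataclasses import dataclass

@dataclass
class Buffer:
    name: str

    def __post_init__(self):
        # Not a true dataclass field: adding one would break existing
        # positional-arg call sites like Buffer("buf0").
        self.origin_node = None

buf = Buffer("buf0")
buf.origin_node = "add_1"  # associated later, strictly best effort
```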
The payoff is in torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/triton.py, where we currently just print the FX node name and the tensor (a more useful integration will come later).
I also introduce a debugging tool `debug_ir_traceback` which tracks tracebacks of where IRNodes were allocated, to help you understand why a node doesn't have an `origin_node`.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100110
Approved by: https://github.com/voznesenskym
The python function `benchmark_compiled_module` ends up using the C++ expression printer to print the sizes for `rand_strided`, so you get a set, e.g. `{2, 17}`, instead of a tuple `(2, 17)`. Here is a complete example from master:
```python
def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.uint8)
    return print_performance(lambda: call([arg0_1, arg1_1]), times=times, repeat=repeat)
```
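After the fix, the sizes and strides should come out as Python tuples instead (expected output, sketched):
```python
arg0_1 = rand_strided((2, 17), (17, 1), device='cpu', dtype=torch.float32)
arg1_1 = rand_strided((2, 17), (17, 1), device='cpu', dtype=torch.uint8)
```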
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98608
Approved by: https://github.com/ngimel
Summary: This is a reland of #98264.
When _inductor.config.cpp_wrapper is specified, we run a two-pass wrapper codegen to generate wrapper code in cpp which calls cuLaunchKernel to launch the pre-compiled cuda kernels, and then call load_inline to load that generated wrapper back into the python world.
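A minimal sketch of turning this on (assumes a CUDA build; `cpp_wrapper` is the config flag named above):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True  # triggers the two-pass wrapper codegen

@torch.compile
def f(x):
    return x.sin() + x.cos()

f(torch.randn(8, device="cuda"))
```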
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98534
Approved by: https://github.com/huydhn
Summary: When _inductor.config.cpp_wrapper is specified, we run a two-pass wrapper codegen to generate wrapper code in cpp which calls cuLaunchKernel to launch the pre-compiled cuda kernels, and then call load_inline to load that generated wrapper back into the python world.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98264
Approved by: https://github.com/ngimel
1. Fixed dynamic shapes support in cpp_wrapper
- fixed the cpp codegen of `size()` and `stride()`
- fixed the cpp codegen of `ShapeAsConstantBuffer`
- changed to use `cexpr` instead of `pexpr` in the cpp codegen of the `sizevar`
2. Enabled dynamic shapes tests for cpp_wrapper
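A minimal sketch exercising the combination (the config flag is from the PRs above; a second input size forces the dynamic-shape codegen path):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True

@torch.compile(dynamic=True)
def f(x):
    return x * 2.0

f(torch.randn(4))
f(torch.randn(8))  # the second size exercises the size()/stride() codegen
```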
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97965
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make the landing easier.
This PR implements a two-pass wrapper codegen for the Triton backend to achieve ahead-of-time compilation. In the first pass, the regular python wrapper code is generated and then executed to perform Triton compilation and autotuning. After that, the second-pass wrapper codegen generates a C++ wrapper with the proper CUDA APIs to load and launch the Triton-generated CUDA kernels. Like the AOT mode for the cpp backend, the next step would be to provide a more complete API for AOT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
The following metrics should be helpful:
- percent of time the GPU is busy
- percent of time various categories of kernels (e.g. pointwise/reduction triton kernels) take
- percent of time each individual kernel takes compared to the total wall time of the benchmark
This PR adds them.
Example result from the hf_Bert inference graph:
```
== triton_pointwise category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_poi_fused_gelu_6_0d1d 0.48154 12.0 5.52%
triton_poi_fused_clone_1_0d1d2 0.29011 24.0 3.33%
triton_poi_fused_clone_2_0d1d2 0.17417 12.0 2.00%
triton_poi_fused_clone_4_0d1d2 0.10797 12.0 1.24%
Total 1.05379 12.08%
== triton_persistent_reduction category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_per_fused__softmax__to_ 0.97188 12.0 11.14%
triton_per_fused_add_native_la 0.37401 24.0 4.29%
triton_per_fused_gelu_native_l 0.02 1.0 0.23%
triton_per_fused_add_embedding 0.01718 1.0 0.20%
Total 1.38307 15.86%
== unknown category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
ampere_fp16_s16816gemm_fp16_12 2.24514 24.0 25.74%
ampere_fp16_s16816gemm_fp16_25 1.39796 49.0 16.03%
void cutlass::Kernel<cutlass_8 1.36093 1.0 15.61%
ampere_fp16_s16816gemm_fp16_64 0.74591 12.0 8.55%
ampere_fp16_s16816gemm_fp16_12 0.61989 12.0 7.11%
Memset (Device) 0.024 12.0 0.28%
void at::native::(anonymous na 0.01543 2.03 0.18%
void at::native::vectorized_el 0.00011 0.03 0.00%
Total 6.40937 73.49%
Percent of time when GPU is busy: 101.44%
```
Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure wall time with profiling disabled but GPU time with profiling enabled, which may distort the measurement a bit. But I assume the effect is not too large, since the profiler mostly increases CPU time (rather than GPU time).
## interesting usages
1. I picked a model where cudagraphs improves perf significantly, like densenet121, and ran the tool on its forward graph. It's no surprise that the GPU is idle quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```
Its backward graph has a smaller share of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```
2. I profiled a subset of torchbench models and plotted a table showing the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category. Since I plan to explore using the coordinate descent tuner to improve reductions, models with a high percent of time spent on reduction should be good candidates (e.g. resnet50, mobilenet_v2).
NOTE: the same model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile the different graphs of a model separately.
```
benchmark_name pointwise_percent reduction_percent persistent_reduction_percent unknown_category_percent GPU_busy_percent wall_time_ms
----------------------- ------------------- ------------------- ------------------------------ -------------------------- ------------------ --------------
resnet18 19.73% 7.86% 4.81% 41.25% 73.65% 2.549ms
resnet18 18.59% 7.13% 3.35% 67.35% 96.41% 3.467ms
resnet50 29.57% 22.13% 2.07% 51.68% 105.46% 6.834ms
resnet50 26.42% 15.27% 0.94% 59.68% 102.31% 13.346ms
vgg16 26.23% 0.00% 0.00% 74.20% 100.43% 18.212ms
vgg16 15.63% 5.61% 0.10% 79.42% 100.75% 33.485ms
BERT_pytorch 28.62% 4.82% 14.88% 33.32% 81.64% 7.162ms
BERT_pytorch 14.43% 13.41% 18.19% 49.24% 95.27% 10.395ms
densenet121 11.89% 2.14% 3.86% 16.36% 34.25% 16.531ms
densenet121 10.37% 2.06% 4.09% 31.46% 47.98% 16.934ms
hf_Bert 23.94% 0.00% 29.88% 46.09% 99.90% 7.766ms
hf_Bert 11.65% 10.54% 20.26% 61.66% 104.11% 11.892ms
nvidia_deeprecommender 42.92% 0.00% 0.00% 56.75% 99.67% 3.476ms
nvidia_deeprecommender 31.36% 3.44% 0.46% 65.20% 100.45% 3.872ms
alexnet 30.99% 0.00% 0.00% 69.16% 100.14% 3.169ms
alexnet 24.41% 4.83% 0.17% 71.09% 100.50% 4.709ms
mobilenet_v2 29.21% 27.79% 2.49% 44.00% 103.49% 10.160ms
mobilenet_v2 17.50% 15.05% 1.06% 69.68% 103.29% 20.715ms
resnext50_32x4d 18.96% 9.28% 2.31% 28.79% 59.33% 5.899ms
resnext50_32x4d 18.48% 11.01% 1.86% 53.80% 85.14% 7.167ms
mnasnet1_0 19.07% 14.52% 3.01% 35.43% 72.03% 6.028ms
mnasnet1_0 14.17% 12.00% 1.87% 67.56% 95.60% 9.225ms
squeezenet1_1 38.56% 0.00% 1.77% 56.21% 96.53% 2.221ms
squeezenet1_1 21.26% 7.57% 1.05% 67.30% 97.18% 4.942ms
timm_vision_transformer 17.05% 0.00% 18.80% 65.79% 101.64% 9.608ms
timm_vision_transformer 9.31% 9.07% 10.32% 73.25% 101.96% 16.814ms
```
## how to use
`python {compiled_module_wrapper.py} -p`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
V.graph.constants like seed_cuda_0 are not handled properly in the wrapper. Recently we moved the code that initializes constants from the global scope into a function. That makes assigning to seed_cuda_0 create a new local variable rather than set the global one.
Add 'global var_name' lines to maintain the same behavior as before.
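A minimal sketch of the underlying Python scoping pitfall (illustrative, not the generated wrapper itself):
```python
seed_cuda_0 = None  # module-level constant set up by the wrapper

def call(args):
    global seed_cuda_0  # without this line, the assignment below would
                        # create a new local instead of setting the global
    seed_cuda_0 = 42

call([])
assert seed_cuda_0 == 42
```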
Test:
Run the forward graph for nvidia_deeprecommender's training run. It previously failed and now passes with the fix.
Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571
Approved by: https://github.com/ngimel
This is a follow-up to PR #95506 to run all the triton kernels in a compiled module individually, as suggested by Horace.
Here are the steps:
1. Run the model as usual with a benchmark script and with TORCHINDUCTOR_BENCHMARK_KERNEL enabled. e.g.
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --disable-cudagraphs --training
```
2. From the output we will see 3 lines like
```
Compiled module path: /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py
```
That's because we have one graph module each for fwd/bwd/optimizer. Each graph module produces one such line pointing at its compiled module.
3. We can run the compiled module directly. Without any extra arguments, we maintain the previous behavior of running the call function, which just does what the original graph module does but in a more efficient way. But if we add the '-k' argument, we will benchmark each individual kernel in the file.
```
python /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py -k
```
Example output:
<img width="430" alt="Screenshot 2023-03-01 at 4 51 06 PM" src="https://user-images.githubusercontent.com/52589240/222302996-814a85be-472b-463c-9e85-39d2c9d20e1a.png">
Note: I use the first 10 characters of the hash to identify each kernel since
1. the hash is easier to get in the code :)
2. a name like `triton__3` only makes sense within a compiled module, but a hash makes sense even without specifying the compiled module (assuming we have enough bytes of the hash)
If we find a triton kernel with a hash like c226iuf2wi that has poor performance, we can look it up in the original compiled module file. This works since we comment each compiled triton kernel with its full hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95845
Approved by: https://github.com/Chillee
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
Inductor implementations of collectives/wait must match the eager impls in _functional_collectives in terms of interacting with the _register_tensor_work API. If they do, then splitting a collective-wait pair so that one half is in a compiled graph should work fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95893
Approved by: https://github.com/kumpera
A PR to generate benchmark code for individual triton kernels. We can explore improving autotuning with the saved compiled kernel directly. This can potentially speed up our iteration and separate the concern from the upstream components that generate the compiled module.
Since I'm still ramping up on inductor, I'll reflect what I learned here so people can correct me if I'm wrong. In inductor, the WrapperCodeGen class is used to generate the compiled module for CUDA (or triton). Here is an example compiled module for a toy model like `def f(x): return sin(x) + cos(x)`: https://gist.github.com/shunting314/c6ed9f571919e3b414166f1696dcc61b . A compiled module contains the following parts:
- various triton kernels
- a wrapper (a method named `call`; the name is hardcoded) that calls the triton kernels, and potentially ATen kernels, to efficiently do the same work as the original FX graph being compiled by inductor
- some utility code that generates random inputs and runs the wrapper
The triton kernels in the compiled module are annotated with decorators like `pointwise`, which are used for autotuning.
This PR adds a config which, when enabled, prints the path of the compiled module. It can be controlled from an environment variable as well.
The path to each compiled triton kernel is added as a comment in the compiled module, e.g.
```
# kernel path: /tmp/torchinductor_shunting/gn/cgn6x3mqoltu7q77gjnu2elwfupinsvcovqwibc6fhsoiy34tvga.py
triton__0 = async_compile.triton('''
import triton
import triton.language as tl
...
""")
````
Example command:
```
TORCHINDUCTOR_OUTPUT_COMPILED_MODULE_PATH=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training --dashboard --only AlbertForMaskedLM --disable-cudagraphs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95506
Approved by: https://github.com/Chillee
Summary:
Update the hashing method for the `ChoiceCaller` class.
`TritonTemplateCaller` objects will now be hashed to:
`{name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}`
for example:
`triton_mm-(64, 32, 32)-4-8-cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o`
`ExternKernelCaller` objects will now be hashed to:
`{name}-{kwargs.keys()[0]}={kwargs.vals()[0]}-...-{code_hash}`
for example:
`addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf`
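A hypothetical helper reconstructing the format described above (illustrative only, not the actual Inductor code):
```python
def triton_choice_hash(name, block_m, block_n, block_k,
                       num_stages, num_warps, code_hash):
    # {name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}
    return f"{name}-({block_m}, {block_n}, {block_k})-{num_stages}-{num_warps}-{code_hash}"

print(triton_choice_hash(
    "triton_mm", 64, 32, 32, 4, 8,
    "cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o",
))
```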
Test Plan: sandcastle
Differential Revision: D43285470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94853
Approved by: https://github.com/jansel, https://github.com/bertmaher