Currently, if we have an inplaced buffer that's completely internal to a fused kernel and thus doesn't need to be allocated, we still allocate it and pass an unused argument to the kernel, because our buffer-removal analysis treats inplaced buffers separately (assuming that either the original or the mutated value is still needed).
This PR extends buffer removal to inplaced buffers that can be removed.
The generated kernel for, e.g., layer norm changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where `in_out_ptr0` is unused in the kernel, to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and corresponding allocation/reuse lines in the wrapper are removed.
`in_out_ptr1` is also mislabeled: it's not really an `in_out` pointer, since it is only written to, but this PR doesn't fix that.
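For illustration only, a hypothetical sketch (not output from this PR, and assuming a CUDA device) of the general shape of the wrapper lines that become dead and are no longer emitted:
```python
import torch

# hypothetical allocation of the internal inplaced buffer; no longer emitted
buf0 = torch.empty_strided((8, 2048), (2048, 1), device='cuda', dtype=torch.float32)
# hypothetical reuse line for that buffer; likewise dropped
buf3 = buf0; del buf0  # reuse
```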
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel
Fixes cpp wrapper support for kernels that are not exposed in `torch.ops.aten`. This PR limits the support scope to `repeat_interleave.Tensor`; follow-up PRs will cover more ops.
The PR maps the Python schema of the kernel to the cpp schema and uses `c10::Dispatcher::singleton().findSchemaOrThrow` to find the corresponding cpp op.
The current support is limited and will raise an `AssertionError` for unsupported cases.
The limitations include:
- only supports kernels that are not aliases
- only supports kernels whose args and returns don't have `alias_info`
- only supports output args that are `Tensor`
- only supports input args that are `Tensor`, `Optional[int]`, `Optional[float]`, or `Optional[bool]`
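A minimal Python sketch of the schema lookup described above (illustrative only; `emit_cpp_dispatcher_lookup` is a hypothetical helper, not the PR's actual codegen):
```python
# Hypothetical helper that maps a Python op name/overload to the C++
# dispatcher lookup mentioned above.
def emit_cpp_dispatcher_lookup(op_name: str, overload: str) -> str:
    # e.g. op_name="aten::repeat_interleave", overload="Tensor"
    return (
        f'auto op = c10::Dispatcher::singleton()'
        f'.findSchemaOrThrow("{op_name}", "{overload}");'
    )

print(emit_cpp_dispatcher_lookup("aten::repeat_interleave", "Tensor"))
```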
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100788
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes #100314
In dependencies, we should track not only the immediately used buffer but also the aliased buffers that point to it; otherwise we can reuse and overwrite the buffer while there are still pending uses of the aliases.
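A small, hypothetical illustration of the hazard: `c` aliases the intermediate `b`, so `b`'s storage must not be reused while `c` still has pending uses.
```python
import torch

@torch.compile
def f(a):
    b = a + 1        # intermediate buffer
    c = b.view(-1)   # alias of b: reusing b's storage too early would corrupt c
    d = b * 2        # a later use that could otherwise trigger buffer reuse
    return c, d

f(torch.randn(4, 4))
```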
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100332
Approved by: https://github.com/jansel
This is a two part PR; I can split it if you really want me to.
The first part is a refactor of the after-aot repro/minifier scripts to come with a command-line interface. I maintain exact BC with the previous interface (so, e.g., you still get a repro.py and a run_minifier.py that do the same thing as before), but each of these scripts now also takes command-line arguments which you can use to customize what actually happens. Check `run_repro` for full documentation on the arguments.
The second part is an implementation of an `analyze` subcommand on the new CLI for any repro.
<img width="1277" alt="image" src="https://user-images.githubusercontent.com/13564/235045677-8545aab7-5e83-4813-bbec-47783dc60122.png">
This facility is oriented towards accuracy debugging. It does several things:
1. It will run your model twice and check for nondeterminism in inductor/float64, *even* on intermediate inputs (our benchmarking nondeterminism test only checks for nondeterminism on the final output). This makes it easy to localize which operator is nondeterministic.
2. It will run your compiled model side-by-side with eager and float64 variants, and then report when things diverge too far, measured by RMSE delta from float64.
Importantly, it does all this without requiring every intermediate to be held in memory (which would cause an OOM on large repros, such as the one I tested this on).
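For example, assuming the dumped script is named repro.py as above, the new subcommand can be invoked as `python repro.py analyze` (a hypothetical invocation; `run_repro` documents the exact arguments).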
Some other minor improvements:
* MinifierTestBase now has an easy-to-comment-out spot that you can use to retain the temporary directory; good for debugging
* We print "running minifier" and "running repro" in MinifierTestBase to make it easier to orient where logs are coming from
* `same` takes an optional `log_error` argument which you can use to reroute the error logs when things mismatch
* counters["inductor"]["intermediate_hooks"] tracks the number of intermediate hooks we've codegen'ed; good for populating the tqdm interface
* torch.fx.Interpreter gets an official `boxed_run` interface which uses the boxed arguments calling convention and doesn't retain inputs longer than necessary (see the sketch after this list)
* torch.utils._content_store gets compute_tensor_metadata/read_tensor_metadata helper functions for computing tensor information without serializing it
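A minimal sketch of the `boxed_run` calling convention from the list above; the traced function is just a stand-in, the point being that the input list is consumed so the interpreter does not keep inputs alive:
```python
import torch
import torch.fx

def f(x):
    return x.relu() + 1

gm = torch.fx.symbolic_trace(f)
args = [torch.randn(8)]
out = torch.fx.Interpreter(gm).boxed_run(args)
# args has been emptied: boxed_run takes ownership of the list and drops each
# input as soon as it is no longer needed.
print(out.shape, len(args))  # torch.Size([8]) 0
```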
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100226
Approved by: https://github.com/bertmaher, https://github.com/bdhirsh, https://github.com/anijain2305
Currently, we track 'origins' on IR nodes so that we have some idea about what FX IR nodes contributed to any given fused kernel. However, the origins are dumped into an undifferentiated set, so if you have, e.g., multiple outputs, you cannot easily tell which output corresponds to which FX node.
This PR introduces a more precise notion of tracking, "origin_node", which says that the contents of this Buffer/Loop node correspond EXACTLY to the output of a particular FX node; e.g., if you serialized each intermediate when running the generated inductor code, you could compare them with the corresponding intermediates from the original FX graph.
Tracking origin_node in all cases requires quite a bit of effort, so this PR introduces the tracking on a strictly best-effort basis. The logic in torch/_inductor/graph.py sets up the associations, but only when it is "obvious" which IR node should get the assignment, and there is work in torch/_inductor/ir.py for propagating this information around as necessary. Like origins, origin_node is not a true dataclass field (as that would break all existing positional-arg call sites); instead, it is added post facto via `__post_init__`. At the moment, only Buffer/Loop may have an origin_node, but we could imagine relaxing this in the future.
The payoff is in torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/triton.py, where we currently just print the FX node name and the tensor (but a more useful integration will be coming later).
I also introduce a debugging tool `debug_ir_traceback` which tracks tracebacks of where IRNodes were allocated, to help you understand why a node doesn't have an `origin_node`.
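A tiny, hypothetical sketch (not the actual inductor classes) of the `__post_init__` pattern described above, attaching `origin_node` without breaking positional-arg call sites:
```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Buffer:
    name: str
    layout: Any

    def __post_init__(self):
        # Not a true dataclass field, so Buffer("buf0", layout) call sites keep
        # working; origin_node is attached after construction and filled in
        # later only when the association to an FX node is obvious.
        self.origin_node: Optional[Any] = None

buf = Buffer("buf0", layout=None)
assert buf.origin_node is None
```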
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100110
Approved by: https://github.com/voznesenskym
The Python function `benchmark_compiled_module` ends up using the C++ expression printer to print the sizes for `rand_strided`, so you get a set, e.g. `{2, 17}`, instead of a tuple `(2, 17)`. Here is a complete example from master:
```python
def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.uint8)
    return print_performance(lambda: call([arg0_1, arg1_1]), times=times, repeat=repeat)
```
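With the fix, the same sizes and strides should be printed with the Python expression printer, i.e. as tuples, e.g.:
```python
arg0_1 = rand_strided((2, 17), (17, 1), device='cpu', dtype=torch.float32)
```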
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98608
Approved by: https://github.com/ngimel
Summary: This is a reland of #98264.
When `_inductor.config.cpp_wrapper` is specified, we run a two-pass wrapper codegen to generate wrapper code in C++ that calls cuLaunchKernel to launch the pre-compiled CUDA kernels, and then call `load_inline` to load that generated wrapper back into the Python world.
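A minimal sketch of enabling the flag (illustrative only; assumes a CUDA device and uses the config name from the description above):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True  # switch on the two-pass C++ wrapper codegen

@torch.compile
def f(x):
    return torch.relu(x) + 1

f(torch.randn(1024, device="cuda"))
```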
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98534
Approved by: https://github.com/huydhn
Summary: When `_inductor.config.cpp_wrapper` is specified, we run a two-pass wrapper codegen to generate wrapper code in C++ that calls cuLaunchKernel to launch the pre-compiled CUDA kernels, and then call `load_inline` to load that generated wrapper back into the Python world.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98264
Approved by: https://github.com/ngimel
1. Fixed dynamic shapes support in cpp_wrapper
- fixed the cpp codegen of `size()` and `stride()`
- fixed the cpp codegen of `ShapeAsConstantBuffer`
- changed to use `cexpr` instead of `pexpr` in the cpp codegen of the `sizevar`
2. Enabled dynamic shapes tests for cpp_wrapper
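A small sketch of exercising the dynamic-shapes path under the same flag (illustrative only; assumes the `cpp_wrapper` config shown earlier):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True

@torch.compile(dynamic=True)  # exercises the size()/stride() cpp codegen fixed here
def g(x):
    return x * 2 + x.sum()

g(torch.randn(5, 7))
g(torch.randn(9, 3))  # a second shape goes through the dynamic-shape wrapper path
```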
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97965
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.
This PR implements a two-pass wrapper codegen for the Triton backend to achieve ahead-of-time compilation. In the first pass, the regular Python wrapper code is generated, and the generated code is then executed to perform Triton compilation and autotuning. After that, the second-pass wrapper codegen generates a C++ wrapper with the proper CUDA APIs to load and launch the Triton-generated CUDA kernels.
As with the AOT mode for the cpp backend, the next step would be to provide a more complete API for AOT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
The following metrics should be helpful:
- percent of time GPU is busy
- percent of time each category of kernels (e.g. pointwise/reduction Triton kernels) takes
- percent of time each individual kernel takes compared to total wall time of the benchmark
This PR adds those.
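A rough sketch (not this PR's implementation) of how per-kernel self CUDA time and the percent-of-wall-time numbers could be derived from the PyTorch profiler; the helper name and the wall-time argument are assumptions:
```python
import torch
from torch.profiler import profile, ProfilerActivity

def kernel_time_report(fn, wall_time_ms):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        fn()
    rows = []
    for evt in prof.key_averages():
        cuda_ms = evt.self_cuda_time_total / 1000.0  # profiler reports microseconds
        if cuda_ms > 0:
            rows.append((evt.key, cuda_ms, evt.count, 100.0 * cuda_ms / wall_time_ms))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```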
Example result from hf_Bert inference graph:
```
== triton_pointwise category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_poi_fused_gelu_6_0d1d 0.48154 12.0 5.52%
triton_poi_fused_clone_1_0d1d2 0.29011 24.0 3.33%
triton_poi_fused_clone_2_0d1d2 0.17417 12.0 2.00%
triton_poi_fused_clone_4_0d1d2 0.10797 12.0 1.24%
Total 1.05379 12.08%
== triton_persistent_reduction category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_per_fused__softmax__to_ 0.97188 12.0 11.14%
triton_per_fused_add_native_la 0.37401 24.0 4.29%
triton_per_fused_gelu_native_l 0.02 1.0 0.23%
triton_per_fused_add_embedding 0.01718 1.0 0.20%
Total 1.38307 15.86%
== unknown category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
ampere_fp16_s16816gemm_fp16_12 2.24514 24.0 25.74%
ampere_fp16_s16816gemm_fp16_25 1.39796 49.0 16.03%
void cutlass::Kernel<cutlass_8 1.36093 1.0 15.61%
ampere_fp16_s16816gemm_fp16_64 0.74591 12.0 8.55%
ampere_fp16_s16816gemm_fp16_12 0.61989 12.0 7.11%
Memset (Device) 0.024 12.0 0.28%
void at::native::(anonymous na 0.01543 2.03 0.18%
void at::native::vectorized_el 0.00011 0.03 0.00%
Total 6.40937 73.49%
Percent of time when GPU is busy: 101.44%
```
Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled while measuring GPU time with profiling enabled, which may distort the measurement a bit, but I assume the effect is not too large since the profiler mostly increases CPU time (rather than GPU time).
## interesting usages
1. I pick a model for which cudagraphs improves perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle for quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```
Its backward graph has a lower percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```
2. I profile a subset of torchbench models and plot a table to show the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category. Since I plan to explore using the coordinate descent tuner to improve reductions, models with a high percent of time spent on reduction should be good candidates (e.g. resnet50, mobilenet_v2).
NOTE: each model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile a model's different graphs separately.
```
benchmark_name pointwise_percent reduction_percent persistent_reduction_percent unknown_category_percent GPU_busy_percent wall_time_ms
----------------------- ------------------- ------------------- ------------------------------ -------------------------- ------------------ --------------
resnet18 19.73% 7.86% 4.81% 41.25% 73.65% 2.549ms
resnet18 18.59% 7.13% 3.35% 67.35% 96.41% 3.467ms
resnet50 29.57% 22.13% 2.07% 51.68% 105.46% 6.834ms
resnet50 26.42% 15.27% 0.94% 59.68% 102.31% 13.346ms
vgg16 26.23% 0.00% 0.00% 74.20% 100.43% 18.212ms
vgg16 15.63% 5.61% 0.10% 79.42% 100.75% 33.485ms
BERT_pytorch 28.62% 4.82% 14.88% 33.32% 81.64% 7.162ms
BERT_pytorch 14.43% 13.41% 18.19% 49.24% 95.27% 10.395ms
densenet121 11.89% 2.14% 3.86% 16.36% 34.25% 16.531ms
densenet121 10.37% 2.06% 4.09% 31.46% 47.98% 16.934ms
hf_Bert 23.94% 0.00% 29.88% 46.09% 99.90% 7.766ms
hf_Bert 11.65% 10.54% 20.26% 61.66% 104.11% 11.892ms
nvidia_deeprecommender 42.92% 0.00% 0.00% 56.75% 99.67% 3.476ms
nvidia_deeprecommender 31.36% 3.44% 0.46% 65.20% 100.45% 3.872ms
alexnet 30.99% 0.00% 0.00% 69.16% 100.14% 3.169ms
alexnet 24.41% 4.83% 0.17% 71.09% 100.50% 4.709ms
mobilenet_v2 29.21% 27.79% 2.49% 44.00% 103.49% 10.160ms
mobilenet_v2 17.50% 15.05% 1.06% 69.68% 103.29% 20.715ms
resnext50_32x4d 18.96% 9.28% 2.31% 28.79% 59.33% 5.899ms
resnext50_32x4d 18.48% 11.01% 1.86% 53.80% 85.14% 7.167ms
mnasnet1_0 19.07% 14.52% 3.01% 35.43% 72.03% 6.028ms
mnasnet1_0 14.17% 12.00% 1.87% 67.56% 95.60% 9.225ms
squeezenet1_1 38.56% 0.00% 1.77% 56.21% 96.53% 2.221ms
squeezenet1_1 21.26% 7.57% 1.05% 67.30% 97.18% 4.942ms
timm_vision_transformer 17.05% 0.00% 18.80% 65.79% 101.64% 9.608ms
timm_vision_transformer 9.31% 9.07% 10.32% 73.25% 101.96% 16.814ms
```
## how to use
`python {compiled_module_wrapper.py} -p`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
V.graph.constants like `seed_cuda_0` are not handled properly in the wrapper. Recently we moved the code that initializes constants from the global scope into a function. That makes assigning to `seed_cuda_0` create a new local variable rather than set up the global variable.
Add 'global var_name' lines to maintain the same behavior as before.
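A minimal illustration of the scoping issue and the fix (the constant name mirrors the `seed_cuda_0` example above; the value is arbitrary):
```python
seed_cuda_0 = None  # module-level constant set up by the wrapper

def _initialize_constants():
    global seed_cuda_0  # without this line, the assignment below would only
                        # create a local and leave the module-level name untouched
    seed_cuda_0 = 42

_initialize_constants()
assert seed_cuda_0 == 42
```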
Test:
Run the forward graph for nvidia_deeprecommender's training run. It previously failed and now passes with the fix.
Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571
Approved by: https://github.com/ngimel