Summary:
This is a prototype for running extern fallback kernels with a host-side proxy executor.
A sample of the generated cpp wrapper call:
```
at::Tensor buf0; // output buffer
void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```
- In my current implementation, the proxy executor interprets the raw pointers according to the op's schema.
This assumes that a custom op MUST have a valid schema registered with the Dispatcher. (I would like to validate this assumption.)
- I am using the callBoxed() API to invoke the custom kernels. This is unavoidable, as we want a single call_function API that works for all possible custom kernels.
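As a rough sketch of how this looks from the proxy executor's side (the op name `my_ns::my_op`, its signature, and the argument flattening below are hypothetical; only the Dispatcher lookup and `callBoxed()` usage reflect the actual mechanism):
```
#include <ATen/ATen.h>
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/core/stack.h>

// Hypothetical example: invoke a custom op through the boxed calling
// convention, pushing arguments onto an IValue stack in schema order.
at::Tensor call_custom_op_boxed(const at::Tensor& x, int64_t k) {
  // The proxy executor would take the op name from the serialized node and
  // rely on the schema being registered with the Dispatcher.
  auto op = c10::Dispatcher::singleton().findSchemaOrThrow("my_ns::my_op", "");
  torch::jit::Stack stack;
  stack.push_back(x);  // tensor args come from the tensor_args array
  stack.push_back(k);  // int args come from the int_args array
  op.callBoxed(stack);
  // The boxed call replaces the inputs on the stack with the outputs.
  return stack.back().toTensor();
}
```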
- These are all the input argument types I support so far:
union Argument {
# Bool value does not matter
1: bool asNone;
2: TensorArgument asTensor;
3: list<TensorArgument> asTensors;
5: i64 asInt;
7: list<i64> asInts;
8: double asFloat;
9: list<double> asFloats;
10: string asString;
10.5: list<string> asStrings;
11: SymIntArgument asSymInt;
12: list<SymIntArgument> asSymInts;
13: ScalarType asScalarType;
14: MemoryFormat asMemoryFormat;
15: Layout asLayout;
16: Device asDevice;
17: bool asBool;
18: list<bool> asBools;
}
- We need a policy for handling unpopulated arguments that have default values. Here are the options; this has BC implications.
1. Require the exported fx graph to explicitly populate default values if the user doesn't specify them.
2. Require the cpp wrapper to explicitly populate default values if the fx graph doesn't specify them.
3. Have the proxy executor look up default values from the op schema (see the sketch below).
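For context, a minimal sketch of what option 3 could look like, using the defaults recorded in the op's `FunctionSchema` (the helper name and surrounding plumbing are made up):
```
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/core/stack.h>
#include <c10/util/Optional.h>

// Hypothetical helper: push a serialized argument if present, otherwise fall
// back to the default value recorded in the op's schema.
void push_arg_or_default(torch::jit::Stack& stack,
                         const c10::OperatorHandle& op,
                         size_t arg_index,
                         const c10::optional<c10::IValue>& serialized) {
  if (serialized.has_value()) {
    stack.push_back(*serialized);
    return;
  }
  const c10::Argument& arg = op.schema().arguments().at(arg_index);
  TORCH_CHECK(arg.default_value().has_value(),
              "argument ", arg.name(), " is unpopulated and has no default");
  stack.push_back(*arg.default_value());
}
```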
For fixing T162112344
Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main
test:
buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops
backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main
buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'
Reviewed By: suo
Differential Revision: D48747417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to graph.removed_buffers
- during codegen, we do memory allocation/free.
These side effects make it hard to codegen multiple versions of the same kernel. This PR refactors the code so that kernel codegen does not change graph-level state. After codegening a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
This cherry-picks the reinterpret_tensor change from #102625 in order to fix a subtle correctness bug when the graph inputs already have a storage_offset set.
The view change also fixes some issues with quantized models in torchbench.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108168
Approved by: https://github.com/desertfire
Summary:
Include the constants in the AOTInductor .so file.
We do not modify existing API signatures; instead, we create the necessary format with the weights lifted out.
Test Plan:
test/inductor/test_aot_inductor.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
When generating a wrapper call, we may have implicit resize applied to
the kernel's output. For example, for addmm(3d_tensor, 2d_tensor),
its output buffer is resized to a 2d tensor. This triggers a warning from
ATen's resize_output op:
"UserWarning: An output with one or more elements was resized since it had...
This behavior is deprecated, and in a future PyTorch release outputs will
not be resized unless they have zero elements..."
More importantly, the output shape is then not what we expect, i.e.
a 2d tensor vs. the expected 3d tensor.
This PR fixes the issue by injecting resize_(0) before the relevant
kernel call and resize_(expected_shape) after the kernel call.
This PR also fixes a minor typo.
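As a standalone illustration of the injected resize pattern above (the shapes below are made up; the real codegen operates on the wrapper's own buffers):
```
#include <ATen/ATen.h>

int main() {
  at::Tensor bias = at::randn({4});
  at::Tensor mat1 = at::randn({10, 5});
  at::Tensor mat2 = at::randn({5, 4});
  // Suppose the wrapper allocated the output buffer with a 3d shape while the
  // fallback kernel produces a 2d result.
  at::Tensor buf = at::empty({1, 10, 4});
  // Injected before the call: a zero-element buffer avoids the resize_output
  // deprecation warning.
  buf.resize_({0});
  at::addmm_out(buf, bias, mat1, mat2);
  // Injected after the call: restore the shape the rest of the graph expects.
  buf.resize_({1, 10, 4});
  return 0;
}
```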
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107848
Approved by: https://github.com/desertfire, https://github.com/jansel
Summary:
This relands #107601, which was reverted due to the new test failing in the internal CI. Here we skip the new test (as well as the existing tests in `test_aot_inductor.py`, as those are also failing in the internal CI).
Test Plan:
```
$ python test/inductor/test_aot_inductor.py
...
----------------------------------------------------------------------
Ran 5 tests in 87.309s
OK
```
Differential Revision: [D48623171](https://our.internmc.facebook.com/intern/diff/D48623171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107814
Approved by: https://github.com/eellison
Summary:
When AOT Inductor runs a Triton matmul kernel (generated from the Triton mm template) on large inputs of particular shape, the `RuntimeError: CUDA driver error: 1` may happen. E.g., when `x @ y` is compiled with AOT Inductor and run on the input shapes `[10285, 96]` and `[96, 1]`. Digging deeper into the generated AOT Inductor wrapper code, we see this line:
```
launchKernel(triton_unk_fused_mm_0, 81, 1, 1, 4, 55296, kernel_args_var_0, stream);
```
`55296` is the required amount (in bytes) of dynamic shared memory. This is larger than the default dynamic shared memory on A100: `49152` bytes. In these cases, `cudaFuncSetAttribute` must be called explicitly to set the `cudaFuncAttributeMaxDynamicSharedMemorySize` attribute of the kernel before launching it. Or, because the AOT Inductor wrapper relies on the CUDA Driver API, the equivalent [`cuFuncSetAttribute`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g0e37dce0173bc883aa1e5b14dd747f26) function can be called to set the `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` attribute.
This PR adds the above call in the AOT Inductor codegen for every case when the required amount of dynamic SMEM is > 0. The call is done *within* the `launchKernel` function, meaning that it will happen only once per kernel and not affect the subsequent AOT Inductor-compiled model performance (after the first run).
P.S. One could, in principle, call the `cuFuncSetAttribute` only when the required amount of dynamic SMEM is above the default limit, but that would require detecting the default limit which is different on different devices. Assuming that the `cuFuncSetAttribute` is relatively lightweight and because it's performed only once per kernel, for simplicity, the suggestion is to call the function in every non-zero dynamic SMEM case.
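Roughly, the added codegen amounts to the following guard before the kernel launch (a simplified sketch; the generated code wraps the driver call in its own error-checking macro):
```
#include <cuda.h>
#include <stdexcept>

// Simplified version of the attribute-setting call emitted inside the
// generated launchKernel helper whenever dynamic shared memory is required.
void setMaxDynamicSharedMemory(CUfunction func, int sharedMemBytes) {
  if (sharedMemBytes > 0) {
    // Without this, requesting more shared memory than the device default
    // (48 KB on A100) makes the subsequent cuLaunchKernel fail.
    CUresult result = cuFuncSetAttribute(
        func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, sharedMemBytes);
    if (result != CUDA_SUCCESS) {
      throw std::runtime_error("cuFuncSetAttribute failed");
    }
  }
}
```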
Test Plan:
```
$ python test/inductor/test_aot_inductor.py
...
----------------------------------------------------------------------
Ran 5 tests in 100.177s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107601
Approved by: https://github.com/jansel
Fix cpp wrapper failure on `clip` in Torchbench:
```
RuntimeError: tensor does not have a device
```
An `optional<at::Tensor>` variable whose value is `at::Tensor()` is considered to _contain a value_, so converting it to `bool` returns `true`. In contrast, converting Python's `None` to `bool` returns `false`.
The fix makes it an optional variable that _does not contain a value_.
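A small sketch of the distinction (assuming the cpp wrapper's `c10::optional<at::Tensor>` convention):
```
#include <ATen/ATen.h>
#include <c10/util/Optional.h>

int main() {
  // An optional holding a default-constructed (undefined) tensor still
  // "contains a value", so it converts to bool as true; device queries on the
  // held tensor then fail with "tensor does not have a device".
  c10::optional<at::Tensor> wrong = at::Tensor();
  // What Python's None should map to: an optional without a value.
  c10::optional<at::Tensor> fixed = c10::nullopt;
  TORCH_CHECK(wrong.has_value());
  TORCH_CHECK(!fixed.has_value());
  return 0;
}
```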
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106847
Approved by: https://github.com/jgong5, https://github.com/jansel
## Summary
This is a re-land of https://github.com/pytorch/pytorch/pull/100706 that addresses the compilation latency performance regression.
## Root Cause
Regarding the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming, one-shot operation. It makes importing the `codegen.cpp` package slower because `LoopLevel` in that package is decorated with `@dataclasses.dataclass`, and the decorator invokes `codecache.pick_vec_isa()` to initialize `LoopLevel`'s `simd_nelements`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)
The Triton backend does not need to touch this code, but we prefer to keep the code uniform. Therefore, the new design registers `CppScheduling` for CPU and `TritonScheduling` for Triton simultaneously, regardless of whether the current backend is Triton. This brings additional overhead to the Triton backend.
```python
def init_backend_registration(self):
if get_scheduling_for_device("cpu") is None:
from .codegen.cpp import CppScheduling
register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)
if get_scheduling_for_device("cuda") is None:
from .codegen.triton import TritonScheduling
register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```
## Solution
To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` slightly ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` initialization into `__post_init__`, which brings the compilation performance back.
## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:
- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`
- W/ PR #100706, the compilation latency is about **57~58** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```
- W/O PR #100706, the compilation latency is about **46~47** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```
This PR fixes the compilation performance regression.
- W/ this PR #106874, the compilation latency is about **47~48** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
This PR extends Inductor to support third-party backends that only need to focus on code generation, just like the existing C++/OpenMP and Triton backends.
Currently, the code generated by Inductor contains two major parts: the kernels, and the Python wrapper that glues the kernels together. Therefore, a third-party backend needs to customize both parts to generate its backend-specific code.
- Python wrapper code generation
Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code that glues the kernels. Therefore, it is straightforward for a third-party backend to generate backend-specific Python wrapper code: it just needs to inherit from the `WrapperCodeGen` class and override the relevant member functions.
- Kernel code generation
Kernel code generation is driven by a `Scheduling` class. Hence, a third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` serve the C++/OpenMP and Triton backends, respectively, but there is no common `Scheduling` base class. Based on how scheduling is invoked, this PR abstracts a common `Scheduling` class containing the following member functions:
- [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
- [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
- [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
- [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
- [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for Triton. If the third-party backend is implemented as a subclass of `TritonScheduling`, it can override or reuse it._
- [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
- [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only used for Triton debugging purposes, but it might also be useful for other compute devices, so we prefer to keep it._
The third-party backend needs to inherit from the `Scheduling` class and implement these functions.
Other code generation classes such as `CppKernel` and `TritonKernel` are used by, or are part of, the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define an interface for them and leaves that flexibility to the third-party backend, which can implement these classes from scratch or reuse them by inheriting and overriding them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
Summary:
When in AOT mode, make use of the existing stream param:
- Pass through and use the stream param in the launchKernel helper function.
- In non-AOT mode, assign the stream param in the caller and pass to launchKernel
- Use a CUDAStreamGuard so all fallback ops execute on the stream
- CUDAStreamGuard subsumes CUDAGuard in AOT mode since it sets both stream and device
Test Plan:
- Ran cpp_wrapper tests: pytest test/inductor/test_cpp_wrapper.py
- Manually inspected cpp output from the alexnet benchmark:
a) In AOT mode:
```
static inline void launchKernel(
CUfunction func,
int gridX,
int gridY,
int gridZ,
int numWraps,
int sharedMemBytes,
cudaStream_t stream) {
AT_CUDA_DRIVER_CHECK_OVERRIDE(cuLaunchKernel(
func, gridX, gridY, gridZ, 32*numWraps, 1, 1, sharedMemBytes, stream, args, nullptr));
...
at::cuda::CUDAStreamGuard stream_guard(at::cuda::getStreamFromExternal(stream, 0));
...
launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream);
...
```
b) Regular cpp wrapper:
```
...
at::cuda::CUDAGuard device_guard(0);
cudaStream_t stream0 = at::cuda::getCurrentCUDAStream(0);
...
launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream0);
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105589
Approved by: https://github.com/desertfire
Summary: This is a follow-up to https://github.com/pytorch/pytorch/pull/105496. There are several issues with the previous fix:
1) It explicitly does a copy for every output at the end of the main function;
2) When an output is a ReinterpretView, no as_strided was generated for it;
3) There can be duplicated buffer declarations.
This PR fixes these by making sure can_reuse behaves consistently between the two AOTInductor passes, and thus always generates the same set of kernels. It also adds handling of ReinterpretView.
Differential Revision: [D47692214](https://our.internmc.facebook.com/intern/diff/D47692214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105773
Approved by: https://github.com/jansel
Summary:
Original PR at https://github.com/pytorch/pytorch/pull/104977. Landing from fbcode instead.
Add an aot_inductor backend (Export+AOTInductor) in the benchmarking harness. Note it is not a dynamo backend.
Moved files from torch/_inductor/aot_inductor_include to torch/csrc/inductor as a more standard way of exposing headers.
Created a caching function in benchmarks/dynamo/common.py for compiling, loading and caching the .so file, as a proxy for a pure C++ deployment, but easier for benchmarking.
Differential Revision: D47452591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105221
Approved by: https://github.com/jansel
Fix cpp wrapper failure on TorchBench model `hf_Reformer` with `randn`:
```
random_rotations = torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)
```
For the cpp wrapper, when `kwargs` is not empty for an `OpOverloadPacket` kernel, we need to know the exact overload schema to handle the `kwargs` properly when calling the cpp kernel. This includes finding the correct order of the kwargs and getting the default values for optional args that are not provided at the call site (`layout` in the above case).
The current support in this PR is conservative and we'll extend the functionality in subsequent PRs.
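For illustration, resolving the exact overload schema and walking its arguments to recover kwarg order and defaults could look roughly like this (a sketch only; the real logic lives in the cpp wrapper codegen):
```
#include <ATen/core/dispatch/Dispatcher.h>
#include <iostream>

// Sketch: fetch the schema of a resolved overload (here aten::randn's default
// overload) and print each argument's name, kwarg-only flag, and whether it
// carries a default value.
void print_kwarg_layout() {
  auto op = c10::Dispatcher::singleton().findSchemaOrThrow("aten::randn", "");
  for (const c10::Argument& arg : op.schema().arguments()) {
    std::cout << arg.name()
              << " kwarg_only=" << arg.kwarg_only()
              << " has_default=" << arg.default_value().has_value() << "\n";
  }
}
```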
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104575
Approved by: https://github.com/jgong5, https://github.com/desertfire
This PR combines the C++ code for the AOTInductor's model and interface with Bin Bao's changes to AOTInductor codegen.
It adds a number of AOTInductor C interfaces that can be used by an inference runtime. Under the hood of these interfaces, the model code generated by the AOTInductor codegen is wrapped into a class, AOTInductorModel, which manages tensors and runs the model inference.
On top of AOTInductorModel, we provide one more abstract layer, AOTInductorModelContainer, which allows the user to have multiple inference runs concurrently for the same model.
This PR also adjusts the compilation options for AOT codegen, particularly some fbcode-related changes such as libs to be linked and header-file search paths.
Note that this is the very first version of the AOTInductor model and interface, so many features (e.g. dynamic shapes) are incomplete. We will support those missing features in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104202
Approved by: https://github.com/desertfire
## Description
Fix cpp wrapper for models which have constants in the graph inputs.
The Python wrapper directly gets the constant values inside the wrapper call as global variables, which are passed in when calling:
4081e924a8/torch/_inductor/codecache.py (L757)
The constant values have been saved in `mod.__dict__` in
4081e924a8/torch/_inductor/graph.py (L874-L875)
For the cpp wrapper, we need to append the constants to the input args so as to pass these Python values to the `inductor_entry_cpp` function explicitly.
### Example
Example of output code for dlrm in TorchBench with this fix:
```py
module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'cfkc6c36t7cggi6mnokrdm5jhesnunjg5xysv3o3x3vaqmzmpe6r', False)
def _wrap_func(f):
def g(args):
args_tensor = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
constants_tensor = [constant0, constant1]
args_tensor.extend(constants_tensor)
return f(args_tensor)
return g
call = _wrap_func(module.inductor_entry_cpp)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103496
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
Introduces two higher-order operators:
* run_and_save_rng_state - Saves the current rng state and then runs the op.
* run_with_rng_state - Runs the op with the rng state supplied as an input
Ideally, we would like to use torch.compile for these operators, but the current plan is to introduce them at the partitioner level, obviating the need to support them fully through the torch.compile stack. To ensure that we have good enough debugging with minifiers, we have ensured that they work with make_fx. In the future, we can move them to torch.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102934
Approved by: https://github.com/jansel, https://github.com/zou3519
We previously compared a FakeTensor's strides with the real tensor's strides. This causes the FakeTensor's dynamic dimensions to be specialized to static ints, which may result in a graph specialized for one shape being used for another shape, which is wrong.
Use stride hints for the comparison instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103342
Approved by: https://github.com/malfet
The changes in this PR include:
- Support ConvTranspose in cpp wrapper
- Fix cpp wrapper support for aten convolution when bias is `not None`: bias is in `args` instead of `kwargs` when it is `not None`. The change is covered by ConvTranspose dynamic shapes UT since we'll fall back to aten convolution in dynamic shape cases.
- Fix cpp wrapper support for `inf`. This is a UT added in https://github.com/pytorch/pytorch/issues/101865. The cpp wrapper UT is covered in `test_conv2d_unary` of `test_cpp_wrapper.py`. It's in the `slowTest` category and doesn't seem to have been captured by that PR's CI.
I will submit another PR to remove the hard-coded schema in these `ExternKernel`s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103308
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes #102752
These 3 ops appear as fallback kernels in GoogleFnet because they take complex arguments; usually, they aren't fallback kernels. To support this model, we added support for these 3 ops.
Details:
1. Add these 3 ops to the allowlist. I assume that we eventually want to support all fallback kernels, but for now we just add these 3 ops to the allowlist.
2. Support complex64 in cpp codegen
3. Support List[] arguments and ScalarType arguments in cpp codegen
4. Allow alias_info in schema arguments. In the original PR supporting fallback kernels for the cpp wrapper, ops whose schemas have non-null alias_info for any argument were disallowed, but I don't think there's any reason we need to disallow these in cpp wrapper code.
Caveats:
* This has not added support for complex32 or complex128
* It only works with static shapes, not dynamic shapes. It seems like the dynamic shapes issue is unrelated to cpp wrapper, since it fails in the test_torchinductor_dynamic_shapes.py test. I checked these `test_fft_.*` tests, which I added in this PR, and verified that they were broken with dynamic shapes before any of the code changes from this PR.
**Test**:
```
benchmarks/dynamo/huggingface.py --inductor --amp --accuracy --inference --device cuda --cpp-wrapper --only GoogleFnet
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103183
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w
Currently, if we have an inplaced buffer that's completely internal to a fused kernel and thus doesn't need to be allocated, we still allocate it and pass an unused argument to the kernel, because our buffer-removal analysis treats it separately (assuming that either the original or the mutated value is still needed).
This PR extends buffer removal to inplaced buffers that can be removed.
The generated kernel for, e.g., layer norm changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where in_out_ptr0 is unused in the kernel to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and the corresponding allocation/reuse lines in the wrapper are removed.
`in_out_ptr1` is also mislabeled: it's not really `in_out` since it's only written to, but this PR doesn't fix that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel