Posting this PR as an RFC since I have resorted to some horrible hacks to make this work.
```
(Pdb) p triton.language.float32
triton.language.fp32
(Pdb) p str(triton.language.float32)
'fp32'
(Pdb) p repr(triton.language.float32)
'triton.language.fp32'
```
This means we need to "rewrite" these `triton.language.dtype` objects for fx graph and inductor execution.
This PR allows Mamba2 to work with `torch.compile`.
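A minimal sketch (hypothetical kernel, not taken from Mamba2) of the pattern this enables: passing a `triton.language.dtype` such as `tl.float32` as a kernel argument inside a compiled region.
```
import torch
import triton
import triton.language as tl

@triton.jit
def cast_copy_kernel(x_ptr, out_ptr, n, DTYPE: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x.to(DTYPE), mask=mask)

@torch.compile
def f(x):  # expects a CUDA tensor
    out = torch.empty_like(x, dtype=torch.float32)
    n = x.numel()
    # tl.float32 is a triton.language.dtype object whose str() and repr()
    # differ, which is why Inductor has to rewrite it when emitting the
    # fx graph and the wrapper code.
    cast_copy_kernel[(triton.cdiv(n, 1024),)](x, out, n, DTYPE=tl.float32, BLOCK=1024)
    return out
```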
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121690
Approved by: https://github.com/Skylion007
This means that when codegen depends on a particular import, we only need to add it in one place and it is applied to all Triton kernels.
This also changes codegen slightly: instead of generating `@pointwise`, we now generate `@triton_heuristics.pointwise`, just so the imports are the same for all kernel types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438
Approved by: https://github.com/lezcano
Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).
Notable additions:
- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).
- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.
- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).
- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.
Codegen examples from the new unit tests:
- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)
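For reference, a minimal sketch of the kind of model these tests exercise (a hypothetical module, not the exact test code; `torch.cond` takes roughly a tensor predicate, two callables, and a tuple of operands):
```
import torch

class CondModel(torch.nn.Module):
    def forward(self, p, x):
        def true_fn(x):
            return x.cos()

        def false_fn(x):
            return x.sin()

        # p is a boolean scalar tensor; only one branch runs at runtime,
        # but both subgraphs are lowered and codegen-ed by the C++ wrapper.
        return torch.cond(p, true_fn, false_fn, (x,))
```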
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
Summary: Inductor currently has a best-config cache for the kernels it generates. This is a local cache implemented by writing to the file system. This diff takes the local cache remote by reusing the existing Triton caching mechanism, built on Memcache internally and Redis externally.
Test Plan:
Tested locally using `TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1`.
Look at scuba to verify the local testing: https://fburl.com/scuba/triton_remote_cache/z6pypznk
The plan is to land this diff with the feature turned off and gradually roll it out.
Differential Revision: D54398076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120963
Approved by: https://github.com/jansel
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel in the Inductor codegen is compiled without `equal_to_1` specialization.
Fixes #120478. The repro from the issue, on A100:
Before this PR:
```
Triton matmul: 0.0167 seconds
Triton matmul compiled: 0.0751 seconds
```
After this PR:
```
Triton matmul: 0.0168 seconds
Triton matmul compiled: 0.0072 seconds
```
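For illustration (a hypothetical kernel, not the matmul repro from the issue): when an integer argument such as a stride is 1 at call time, Triton specializes it (`equal_to_1`) and folds the arithmetic away; without that information in `triton_meta`, the compiled-path kernel treats it as a generic integer.
```
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, stride, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs * stride, mask=mask)
    y = tl.load(y_ptr + offs * stride, mask=mask)
    tl.store(out_ptr + offs * stride, x + y, mask=mask)

def add(x, y):  # expects contiguous 1D CUDA tensors
    out = torch.empty_like(x)
    n = x.numel()
    # x.stride(0) == 1 here, so Triton specializes the argument (equal_to_1)
    # and the stride multiplication disappears from the generated code.
    add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, x.stride(0), BLOCK=1024)
    return out
```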
Test Plan:
```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
Summary: Previously, `memory_plan_reuse` assumed that the generated code is flat, in the sense that it can't have nested scopes. However, with nested control flow codegen-ing, this is no longer the case, causing bugs where buffers were reused across visibility boundaries in different nested scopes.
In this PR, we add nested planning states in `memory_plan_reuse` on entering and exiting a scope in the codegen. This restricts buffer reuse to the currently active (peak) scope / planning state.
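A toy sketch (not Inductor's actual implementation) of scope-aware buffer reuse: a stack of per-scope free lists, so a buffer freed inside an inner scope is only offered for reuse within that same scope.
```
import itertools

class ScopedReusePool:
    def __init__(self):
        self.scopes = [[]]                  # stack of free lists; top = active scope
        self._counter = itertools.count()

    def enter_scope(self):
        self.scopes.append([])

    def exit_scope(self):
        # Buffers freed inside the scope are not leaked into the outer scope's
        # free list, mirroring "reuse only within the active planning state".
        self.scopes.pop()

    def allocate(self, nbytes):
        free = self.scopes[-1]
        for i, (name, size) in enumerate(free):
            if size >= nbytes:
                return free.pop(i)[0]       # reuse from the active scope only
        return f"buf{next(self._counter)}"  # otherwise allocate a fresh buffer

    def free(self, name, nbytes):
        self.scopes[-1].append((name, nbytes))
```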
Test Plan:
```
python test/inductor/test_control_flow.py -k test_subgraphs_with_parameters
...
----------------------------------------------------------------------
Ran 27 tests in 149.413s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120777
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #120665
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)
We do this by creating a `BackwardState` object that is used to register the hooks in the forward, then populated by dynamo *after* the forward runs.
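A minimal sketch of the pattern this enables (hypothetical function; the hook is created dynamically and attached to a graph-interior tensor inside the compiled region):
```
import torch

@torch.compile
def fn(x):
    y = x.sin()
    # A dynamically generated hook on an intermediate tensor: registered in
    # the forward, routed through the BackwardState object that dynamo
    # populates after the forward has run.
    y.register_hook(lambda grad: grad * 2)
    return y.cos()

x = torch.randn(4, requires_grad=True)
fn(x).sum().backward()
print(x.grad)
```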
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, which doesn't have a way of dealing with child fx subgraphs and properly lowering and codegen-ing them.
In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as a new `Conditional` IR node type weaving the parent graph together with the true and false child subgraphs.
Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.
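A minimal usage sketch of what now lowers through JIT Inductor (shapes and branch bodies are illustrative):
```
import torch

def true_fn(x):
    return x.cos()

def false_fn(x):
    return x.sin()

@torch.compile
def f(p, x):
    # p is a boolean scalar tensor; both branches are lowered as subgraphs
    # and wired into the new Conditional IR node.
    return torch.cond(p, true_fn, false_fn, (x,))

x = torch.randn(8, device="cuda" if torch.cuda.is_available() else "cpu")
print(f(torch.tensor(True), x), f(torch.tensor(False), x))
```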
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
Summary: The codegen of the `with torch.cuda._DeviceGuard` context manager in the Python wrapper code is implemented via `device_cm_stack: contextlib.ExitStack()`. As the context managers in the stack are `code.indent()`, the whole stack is unindented at once on `device_cm_stack.close()`. This becomes problematic when attempting to codegen indented code (e.g., for control flow in Python and / or nested subgraph codegen-ing).
In this PR, we refactor the device guard codegen-ing in Python by replacing the `device_cm_stack` with explicit indent and unindent calls for entering and exiting the `with torch.cuda._DeviceGuard` context manager. This allows for nested device guard context managers and aligns better with the other indented codegen-ing intertwined with it (e.g., for nested subgraph codegen-ing).
This is necessary for the upcoming support for `torch.cond` (and other control flow operators) in Inductor. Before that, the only change in the Python wrapper codegen is that the `return outputs` is now happening outside the `with torch.cuda._DeviceGuard` context manager.
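A toy illustration of the limitation (not the wrapper code itself): an `ExitStack` can only unwind all of its context managers at once, so partially closing an inner indent while keeping the outer one open isn't possible.
```
import contextlib

class Indent:
    # Stand-in for the wrapper's code.indent() context manager.
    def __init__(self, label):
        self.label = label

    def __enter__(self):
        print(f"indent: {self.label}")

    def __exit__(self, *exc):
        print(f"unindent: {self.label}")

stack = contextlib.ExitStack()
stack.enter_context(Indent("outer device guard"))
stack.enter_context(Indent("inner device guard"))
# close() unwinds both managers at once; there is no way to pop only the
# innermost indent, which is exactly what nested codegen needs.
stack.close()
```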
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119673
Approved by: https://github.com/peterbell10
Summary: Currently, when a custom (user-written) Triton kernel has a `ReinterpretView` argument in the IR, we always skip the alignment check for this argument when preparing the `signature_of` for the AOT compilation of the Triton kernel (by setting `TensorArg.check_alignment` to `False`). This is problematic for user-written kernels where the Triton kernel's argument (the data pointer), albeit reinterpreted, can still be aligned to 16. When we skip alignment checking, the performance of the AOT-compiled internal Triton kernels can degrade 2-3x.
In this PR, we replace `TensorArg.check_alignment` by `TensorArg.offset`, in which we specify the offset of the `ReinterpretView.layout` relative to the underlying `ir.Buffer` (corresponding to the data pointer before reinterpretation). As the size and stride of the layout don't change the alignment properties, those can be skipped. Importantly, for `ReinterpretView` arguments of custom Triton kernels, we use `arg.data.get_name()` as the buffer name. That, together with the offset, is used to check the alignment.
Bonus: the namedtuples in `codegen/common.py` are refactored as `dataclass`es, with nicer type hints and default values (for the newly added `TensorArg.offset`).
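To illustrate the alignment point above, a hypothetical helper (not Inductor code): whether a reinterpreted pointer stays 16-byte aligned depends only on the base buffer's alignment and the byte offset, not on the view's size or stride.
```
ALIGNMENT = 16  # bytes

def reinterpret_keeps_alignment(offset_elems: int, dtype_nbytes: int) -> bool:
    # Assuming the underlying buffer's data pointer is 16-byte aligned, the
    # reinterpreted pointer is still aligned iff the byte offset is a
    # multiple of 16; the view's size and stride are irrelevant.
    return (offset_elems * dtype_nbytes) % ALIGNMENT == 0

assert reinterpret_keeps_alignment(4, 4)      # float32, offset 4 -> 16 bytes
assert not reinterpret_keeps_alignment(3, 4)  # float32, offset 3 -> 12 bytes
```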
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view
...
----------------------------------------------------------------------
Ran 6 tests in 27.952s
OK (skipped=4)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119649
Approved by: https://github.com/oulgen
This PR adds a new type of Triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans.
Because of this limitation, I need to be able to block fusions of split-scan
operations with reductions, so I chose to add a new `ir.SplitScan` node which
is identical but allows for differentiation in the scheduler.
The split-scan kernel is also the first to require an additional workspace buffer,
which is used to communicate between CUDA blocks. This is slightly tricky because
the exact scratch-space requirement isn't known until the grid size is calculated.
Here I work around the issue by setting a minimum RBLOCK size and always allocating
for the maximum possible grid size for a given input tensor.
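For context, a hedged sketch of the kind of workload this targets (requires a CUDA device; the exact size threshold at which Inductor picks the split-scan kernel is an internal heuristic):
```
import torch

@torch.compile
def prefix_sum(x):
    return torch.cumsum(x, dim=0)

# A long 1D scan whose scan dimension doesn't fit in a single block is the
# regime the split-scan kernel, with its cross-block workspace buffer, is for.
x = torch.randn(2**24, device="cuda")
y = prefix_sum(x)
```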
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
## Problem
A user-defined Triton kernel grid may use a sympy magic method like `Max`. This comes in the form of a `sympy.Expr`, namely an application of a `sympy.core.function.FunctionClass`.
Handling this is not trivial since `user_defined_kernel_grid_fn_code` is used in both eager and Inductor. Eager usage below.
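A hypothetical sketch of such eager usage (not the original snippet): Python's `max` in the grid function becomes `sympy.Max` once `n` is a traced dynamic size.
```
import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)

def copy(x):  # expects a 1D CUDA tensor
    out = torch.empty_like(x)
    n = x.numel()
    # Under Dynamo tracing with dynamic shapes, max(1, cdiv(n, BLOCK)) is
    # captured as a sympy expression involving Max.
    grid = lambda meta: (max(1, triton.cdiv(n, meta["BLOCK"])),)
    copy_kernel[grid](x, out, n, BLOCK=1024)
    return out
```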
## Approach
Pass in the wrapper when Inductor codegens a grid with ints / `sympy.Expr`s, so we can utilize wrapper functions such as `codegen_shape_tuple()`.
Differential Revision: D53367012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119165
Approved by: https://github.com/aakhundov
Differential Revision: D53398312
## Problem
Currently, if a sympy expression that uses a magic method like `Max` is passed as an argument to ProxyExecutor, then C++ compilation will fail. We need to use the `std::max` function instead.
```
// What we see
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...);
// What we want
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...);
```
## Approach
Use the C++ wrapper's expression printer to handle this conversion.
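For illustration, sympy's own C++ printer does an analogous conversion (the Inductor C++ wrapper uses its own internal expression printer, but the idea is the same):
```
from sympy import Max, Symbol, cxxcode

u1 = Symbol("u1", integer=True)
expr = Max(1025, u1)
# Prints something like: std::max(1025, u1) (argument order may differ).
print(cxxcode(expr))
```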
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166
Approved by: https://github.com/aakhundov
Summary: `generate_index_put_fallback` currently generates something like the following:
```
AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)};
```
The problem is that `wrap_with_raii_handle_if_needed` creates a `RAIIAtenTensorHandle` which only lives during this temporary array's initialization. After the initialization is done, the `RAIIAtenTensorHandle` dies and releases the underlying tensor, so when `tensor_handle_array_1` is later passed to `aoti_torch_index_put_out`, some of its `AtenTensorHandle` elements are invalid, causing a segfault.
Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963
Approved by: https://github.com/aakhundov
Like #110312, but we also run this check when backed symints are in the grid (e.g. `s1 / 512`).
### Why?
Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If we later run the lowered model with inputs such that `s1 = 0`, then we'll launch the kernel with a `0`-sized grid. This surfaces as `CUDA driver error: invalid argument`.
To avoid this, we check for a `0`-sized grid whenever there are symbolic shapes, which includes both backed and unbacked symints.
This adds non-zero overhead on the CPU. However, in return, we get better reliability when encountering this scenario, which happened when serving an internal model.
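A hypothetical sketch of the guard (the real check is emitted in the generated wrapper / launcher code):
```
import math

def launch_guarded(kernel, s1, *args, block=512, **kwargs):
    # With a grid like ceil(s1 / 512), an input with s1 == 0 produces a
    # zero-sized grid; launching that trips "CUDA driver error: invalid
    # argument", so the launch is skipped when the grid is empty.
    grid_x = math.ceil(s1 / block)
    if grid_x > 0:
        kernel[(grid_x,)](*args, **kwargs)
```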
### Test
```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
OK (skipped=3)
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols
# Before
Error: CUDA driver error: invalid argument
FAILED (errors=2, skipped=3)
# Now
OK (skipped=3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Summary:
Add runtime constant folding for AOTInductor.
This also includes the invocation of constant folding at load time.
The constant-folding lowering is a two-step process.
First, we split the graph into two modules, one of which is the constant module: it doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call the result the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for it.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). Compared with a normal lowering, the main module takes in one additional component, the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.
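For illustration, a hypothetical module with a constant-foldable subgraph (`self.a @ self.b` depends only on parameters, so it can be split into the constant module and computed once at load time):
```
import torch

class WithFoldableConstant(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Parameter(torch.randn(16, 16))
        self.b = torch.nn.Parameter(torch.randn(16, 16))

    def forward(self, x):
        w = self.a @ self.b   # input-independent: candidate for the constant module
        return x + w          # input-dependent: stays in the main module
```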
Test Plan: Unit tests included in commit.
Differential Revision: D53274382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
Summary:
### Context
It's possible for the args of a user-defined Triton kernel to be codegen-ed twice, but this only happens if the arg is a `ReinterpretView`.
* First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()`
* Second in `self.codegen_kwargs()`.
When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration.
```
auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
...
// There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0.
// And there's no reference to tmp_tensor_handle_0.
// Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't
// automatically cleaned-up like RAIIAtenTensorHandle
CUdeviceptr var_6;
aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6));
void* kernel_args_var_2[] = {..., &var_6, ...};
launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2);
```
### Solution
We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side-effects with `arg.codegen_reference()`.
Test Plan:
### Inspect device memory allocated
```
# Before diff
0 device memory 2048
1 device memory 2560
2 device memory 3072
3 device memory 3584
4 device memory 4096
5 device memory 4608
# With diff (memory usage doesn't grow)
0 device memory 1536
1 device memory 1536
2 device memory 1536
3 device memory 1536
4 device memory 1536
5 device memory 1536
```
Reviewed By: jingsh, tissue3
Differential Revision: D53190934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569
Approved by: https://github.com/oulgen
dmypy silently ignores `follow_imports = skip`, so to get parity between
dmypy and mypy we have to suck it up and `type: ignore` all of the sympy
typing problems.
The suppressions were added automatically with the following script generated by GPT-4:
```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
- Add `darwin` to the list of supported platforms
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor the Linux-specific constant compilation logic into `_compile_consts_linux`
- Add `_compile_consts_darwin`, which converts consts to a .S file that is linked into a shared library
- Patch the file using magic to avoid converting bytes to a large hexadecimal string
- Generate integer constants with the `LL` suffix on macOS (corresponding to the int64_t definition)
- Enable test_aot_inductor.py tests on macOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
Summary: Use faster binding, following https://github.com/pytorch/pytorch/pull/117500. `torch.utils.cpp_extension.load_inline` builds a lot of things and is very slow. With this change, we can later further reduce the included header files using the ABI-compatible mode and thus further speed up compilation.
Result:
```
python test/inductor/test_cuda_cpp_wrapper.py -k test_relu_cuda_cuda_wrapper
Before: Ran 1 test in 32.843s
After: Ran 1 test in 26.229s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117693
Approved by: https://github.com/jansel