Prior to this PR, `_inductor/codegen/cpp_prefix.h` was copied into a new temporary directory on every inductor run utilizing the CPP backend (i.e. CPU-only), then included in the output source code. Instead, this PR puts it in an appropriate place in the torch includes, and includes it from there. This allows us to precompile it in cpp_wrapper and AOT inductor mode, saving significant compilation time.
Due to difficulties getting this to work in FBCode, the precompilation itself is only enabled in OSS PyTorch.
Differential Revision: [D69420620](https://our.internmc.facebook.com/intern/diff/D69420620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144293
Approved by: https://github.com/desertfire
Summary:
This diff adds a justknobs check for static cuda launcher. In particular, it supports a fractional rollout where each mast job/version can be consistently enrolled in the config on or off.
It also adds a set_feature_use so we can track whether static cuda launcher is enabled on a given dynamo compile.
Test Plan: Existing unit tests. The justknobs in question are set to be disabled right now, so this diff does not launch the feature yet.
Differential Revision: D74599203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
When codegen'ed, it looks like:
```py
post_grad_custom_pass = <object at 0x12345678>
```
Which is not runnable at all. Some logic is also trying to deepcopy the
object, and not all of these objects are deepcopy-able.
This PR skips codegenning of these passes.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153243
Approved by: https://github.com/houseroad
Summary: This diff implements an AsyncManifoldCache class that performs cache write and update ttl operations in an async manner. Essentially we are ok with the fire and forget approach where we dont guarantee that we can observe our writes, this gives us better runtime latency.
Test Plan: added new unit test
Reviewed By: jamesjwu
Differential Revision: D73867797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152452
Approved by: https://github.com/jamesjwu
This PR:
- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides
Test Plan:
- tests + CI
disable padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
Summary: Previously D70489427 changed tanh impl to `.tanh()`, and this is causing some meta internal workload perf regression. This diff will introduce a config so we can set it based on need.
Differential Revision: D73909371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152542
Approved by: https://github.com/desertfire
This is a new pass to replace the pre-existing passes. It has the same
basic goal, to achieve communication overlap (latency hiding), but also
constrains the solution to not increase peak memory.
The principles of operation are detailed in code comments, but
summarized here:
- never reorder collectives relative to each other (TBD if we should
relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does
not increaes curr memory beyond peak memory
The pass logs a summary table for each graph to TORCH_LOGS=overlap.
e.g. (exact format may have been tweaked but this shows the idea).
```
rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node initial exposed final exposed improvement limiting factor moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ----------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------- --------------- ------------- ------------------- -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns) 12141.6 6514.53 5627.08 prefetch limit 75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns) 32265.8 28429.2 3836.61 data dependency 78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns) 10800.6 10732.3 68.254 peak memory 1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns) 10809.5 10809.5 0 data dependency 4
[rank
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
This PR:
- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides
Test Plan:
- tests + CI
disable padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #150495
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.
One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm gonna argue think that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find find is maybe a little log spew?, e.g., ` ResourceWarning: subprocess 2987680 is still running`
List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.
Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).
Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
Summary:
User defined Triton kernel sometimes rely on real inputs to determine
the path of execution. We need real inputs to invoke the correct
behavior of the user defined triton kernels (see example in test case,
where we have an early return for random inputs)
Test Plan:
Included in the commit.
python test/inductor/test_aot_inductor.py -k triton_autotuning
python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553
Approved by: https://github.com/davidberard98, https://github.com/eellison
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.
One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm gonna argue think that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find find is maybe a little log spew?, e.g., ` ResourceWarning: subprocess 2987680 is still running`
List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.
Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.
Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)
We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.
Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata).
For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime.
Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes.
Fixes#149448
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890
Approved by: https://github.com/jansel
Summary:
1. Check against the "0" char instead
2. We got the following error when using anything other than O0 flag: `error: Function ZN5torch12aot_inductorL22__check_inputs_outputsEPP16AtenTensorOpaqueS3 is too big to optimize [-Werror,-Wignored-optimization-argument]` So we use O0 flag in wrapper code when `aot_inductor.compile_wrapper_opt_level` is set to `O0`.
Test Plan:
```
buck run 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:ads_second_stage_dsnn_models_aoti_lowering_test -- -r AdsSecondStageDSNNModelsAOTILoweringTest
```
Differential Revision: D70670957
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148714
Approved by: https://github.com/desertfire