This PR does the following:
- Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`. The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`. It replaces the TorchScript graph caching that looked up a Fusion based on a stringified and canonicalized representation of the TorchScript graph with a prefix tree of statements in the `FusionDefinition`. The `FusionInterface` is an object that represents a Fusion in python. It can also query the cache based on id.
- The ability to print out a mechanically derived definition, in python, for the user to use when debugging was added.
- Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`.
- Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes.
- Adds a README file to explain how to use the Python Frontend
While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83267
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Code changes includes:
- codegen improvements:
1. Indexing refactor -> Remove reference tensor in predicate indexing logic
2. MMA Rfactor support for cross-warp and cross-CTA split on K dimension
3. Grouping grid allreduces across iterations
4. Swizzle op formulation for non-affine swizzles
5. Use scheduler_utils to cache inputs and outputs in schedulePointwise
- scheduler refactor
1. New compute at interface
- transformation propagation refactor on MaxInfoSpanningTree
1. Added sibling path that is required to generate consistent replay for some cases where `MaxInfoSpanningTree` is used with a selector.
2. Optimization to skip Transform propagator
3. SpanningTreePrinter for debugging
- parser update
1. Fixes `div`
2. Added `_to_copy`
3. Broadcast in dim with expand to support expanding to concrete size
4. Dropout prob extremal patch
- executor patch on caching strides for output allocation
Squashed commits to WAR github API
Commits that's actually in this PR from the devel branch:
```
3b87896706fc98aa4d5b5c811af034cc4dddfbab Fix allocation of work buffers and `fused_reduction::ParallelReduce` with unswitch (#1818)
4cae1227f666b68d275144afd6e4be1fa7aa0786 schedulePointwise cleanup: - computeAt + InlinePropagator (#1815)
3df97426adfb5ecc6fe2c12c43d56d59670e5020 Use scheduler_utils to cache inputs and outputs in schedulePointwise (#1811)
03180aa8facde51237dffa29f6632ffa870cf923 improve broadcast resolution (#1792)
bee6c69979d8c34d6d6ef7514f8886cf1416d64f bug fix (#1819)
4413c8f43a5a64dd0a6ddb0763523bbc7314f4b5 Support PYTORCH_NVFUSER_DUMP=transform_propagator (#1812)
de6b7ca5ce755061ae0d26e006c4403653627ab5 Fix negative position in InlinePropagator (#1813)
10a996cb4dce5d514f09fd0d49ffcd3b88869a28 Remove redundant check in schedulePointwise (#1810)
acd5ed4df825d4c25999e8c9041e0f8ca1a3448f Swizzle op formulation for non-affine swizzles (#1441)
3ed8330f881f429fe2df0e5af9000b91355a96da Kernel args patch to show zero_init buffer (#1809)
037a75a42048f1d8a9c30efb466f1ffbfd2894ad Dropout prob extremal patch (#1804)
282c42902bff07f759cddbbe619249cf5e7c5281 spam nvrtc options (#1783)
3ba6a5fe0a47044179cd36b5b62e628c75180da5 Broadcast in dim with expand (#1794)
fd4be1236ddfeb31ca0659e1b0df36546424c979 remove dead indexing code (#1806)
fa4e6a4739a9daaa0e4111fb4730704d79c91010 Check siblings in getMaxPosAll (#1805)
025c840c76d89b0d032b65a78a375719cab78d46 Grouping grid allreduces across iterations (#1755)
37c579e64f8145fc292273cdebb6519edeb9cf76 Temporarily disable test requring large shared memory. (#1802)
5f375d074524ab65cb78282eff7abe5846cc4203 More cleanup on InlinePropagator (#1800)
8d384da0cfb50a7c5082e91585c12f4c3a775e6c Indexing refactor stage 2 : Remove reference tensor in predicate indexing logic (#1784)
f008140e26335584a143f71c2cb9e91fd61ec530 MMA Rfactor support for cross-warp and cross-CTA split on K dimension (#1554)
76b3cca5cc9a18a56db8107d2f6c8e94851bb85c Add parsing support for `_to_copy` to handle AMP casts. (#1756)
ef04f6c4c0ee043979ac7aad4e5be6f22faeb547 Coding style cleanups (#1798)
38c7f3cf69ea58cc9480b0621506bbfd90a7c9d3 InlinePropagator please don't replay (#1797)
3f2c263ade35017be2d99fe8e4ec97fd0f14f754 validateDomain in TransformPropagator (#1796)
c07708520d99ef815ce15ec367bf7e98797d602b Use TransformPropagatorWithCheck in many tests (#1795)
d0d0908aee2e2b7615c28d04ee80a54b01a02bcd Some further cleanup for the new computeAt interface (#1793)
45f5203b5744cd3512d83263b3fb07c99795a271 Fix TransformReplay::getMatchedLeafPosWithoutReplay* (#1791)
28cbaf931870086cf59807dd60ce412d6dfad0fd New compute at interface (#1743)
635ebfc79bc016eea94d4cbde2c12324171b908b Add SpanningTreePrinter (#1786)
59f3c3223c48ea89549fe7d323f17cbecbebede0 Output allocate patch (#1790)
fe93bf5a6485696ffb36751606a84080349967b5 Transform propagator skip replay when possible (#1782)
ebf23a50f3adf3d28e824c3b3b4ed6ea6f9cf483 Fix isIntegralType error msg (#1789)
0c82ecf04d12b9fe5428af6824a7a978cf5e0ddd Disable register reuse across serial broadcast ops (#1787)
33a824d8d9ace7790a4a58d497e525a7a059579d Adding sibling path for MaxInfoSpanningTree (#1776)
86f46aad83cbb2aa06943419a7335d71a8798f2a Fix div(Val, TensorView) (#1778)
d3de227ade763bdac9e9df15ba8671be78565ee9 Fix FusionMaxRootDomainInfoSpanningTreePrintTwice_CUDA (#1781)
ecc7a87cdaaed66672d08bf819ad58d2980384cb Extend mma dimension and layout checking to support strided batched matmul and tensor contractions (#1761)
```
RUN_TORCHBENCH: nvfuser
Differential Revision: [D38043938](https://our.internmc.facebook.com/intern/diff/D38043938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81861
Approved by: https://github.com/davidberard98
Regarding issues reported in #79246, we notice that bias decomposition from conv/linear could actually hurt perf, due to the overhead of compilation. This PR changes it to make decomposition an explicit opt-in from user to avoid these regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81134
Approved by: https://github.com/davidberard98
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
A few bigger updates:
1. Initial support of cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: https://github.com/csarofeen/pytorch/pull/1440
Commits that's actually in this PR from the csarofeen branch
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
Summary:
Things changed in this PR that requires review:
test/forward_backward_compatibility/check_forward_backward_compatibility.py
Our previous function overload extension names were wrong and has been updated in this PR, hence the compatibility list updated.
nvfuser code updates with bug fixes towards failures we encountered in OpInfoTests as well as failures reported by AOTAutograd team.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627
Reviewed By: Chillee
Differential Revision: D34765458
Pulled By: davidberard98
fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
Summary:
Things changed in this PR that requires review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws scripting model sees autocast as decorator since it's not supported
nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor
Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127
Reviewed By: HamidShojanazeri
Differential Revision: D34113233
Pulled By: jbschlosser
fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D33336948
Pulled By: albanD
fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: jbschlosser, malfet
Differential Revision: D32596264
Pulled By: albanD
fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support, now we can support dimension collapsing with non-coherent input tensors with different memory format. e.g. channels last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.
Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943
Reviewed By: ngimel
Differential Revision: D32288709
Pulled By: dzhulgakov
fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
Summary:
Syncing nvfuser code base from devel branch, Listing a few of our development since last sync:
- Extends support to normalization and reduction kernels.
- Multiple kernel launch for single `CudaFusionGroup`. Hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalar into compile time constants, which are required by the codegen. (e.g. reduction axes).
To keep this PR simple and relatively review-free. We stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.
internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/coddgen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`
updates affecting integration:
1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745
Reviewed By: saketh-are
Differential Revision: D30752939
Pulled By: malfet
fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
Summary:
1. Added CudaFusionGuard as the custom TypeCheck for nvfuser; enabled dynamic shape support with profiling executor;
2. dropped support for legacy fuser;
3. re-enabled nvfuser tests;
4. added registration for profiling record to allow profiling on user specified nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46452
Reviewed By: zou3519, anjali411
Differential Revision: D24364642
Pulled By: ngimel
fbshipit-source-id: daf53a9a6b6636e1ede420a3a6d0397d4a8b450b
Summary:
A lot of changes are in this update, some highlights:
- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory form being loaded multiple times
- Improved sync threads placements with shared memory and removed read before write race
- Fixes to FP16 reduction fusions where output would come back as FP32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218
Reviewed By: ezyang
Differential Revision: D23905183
Pulled By: soumith
fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.
**Overall:**
- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.
**Integration:**
- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic
**Code Generation:**
- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tilling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129
Reviewed By: mrshenli
Differential Revision: D23162207
Pulled By: soumith
fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles of TensorExpressions and Halide, however the implementation is ground up. The fusion pass itself is similar to the default CUDA fuser, however, it has undergone some refactoring and is using the new code generation infrastructure. For those who are interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_ One of the largest differences between our approach and that of TVM/Halide, is the concept of "TensorView". TensorView from a high level should be thought of similarly to how we think of working with Tensors in PyTorch. It's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at of TVM, they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.
**Short term goals:**
Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of braodcast (broadcasted tensors are treated as tensors of the braodcasted size in the generated code)
- Dropout
**Mid-term goals:**
- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63