Commit Graph

67 Commits

Author SHA1 Message Date
jjsjann123
df741c589f [NVFuser] Upstream push 0809 (#83067)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. removes un-necessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch : fixes upstream #81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to WAR github API
Commits that's actually in this PR from the devel branch:

```
dfe02f3faed4c64477e5f5c678f21f33415d0195 Merge remote-tracking branch 'csarofeen/devel' into HEAD
16173732ecfafc4797e93c2449cfb778015a6c7a Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb7796bdcf055eb61d600b7b5c9df292950290 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6de62061d30781de50ef1862bbfb1615173 Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5bba3bc158d41ccbefa0ee2c5ceea7aedb Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522454aa715ef164c88a73fb8bdddc706805 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa219293a59e4166e258d76289fe13633ca Fix most inlined propagator for mismatched dims (#1875)
501f4aa270bf4dd47b0d2f4860bc6f23ebc32a38 Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d690f923047a85b5229a787118708f810741 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a61c87cd998e88ddd79a496548171c31e0 Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7a66b098f04c9d95a2d34ab2bceee151b3 fragment iteration to support fully unrolled mma ops (#1823)
a48270a18dc2d3accc2626758d14d5858ae55032 Merge all dims in pointwise scheduler (#1872)
172fb3673fb4aaf4c1e889922a4fc5c06cbd59f7 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a5ac2fcf57a177bf36b0f26c61a4e252a4 Allow trivial reduction to be merged (#1871)
440102bcda6eb1dcd42d5fa5aeab9d6b049956bc Symmetric API for BestEffortReplay (#1870)
d1caf330c08ea8002f7133ca655bbd5b28c4eb98 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda50be38eac96c00ba781340ac199d5a136 Remove some welford specific logic. (#1864)
51589d36be5a101d06e641fe0400b39028b7cb81 Some cleanups on tests and heuristics params (#1866)
a6b3e70da5dee51dbc246347228ea21384e46ac3 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9b5e562d6f0caba5e7319e83e5df64104f Add nullptr checks to IrBuilder (#1861)
1cd9451d7493f631c2837ba07c1ea93a74e83a15 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9b8c454f557ab9fcf5b1c3cef9b9e136d0 Add leaky_relu operation (#1852)
e842a9bab5e9f7289b7ce33ee37a682b22373f49 Minor cleanup in pointwise scheduler (#1858)
9ee850ca2f7f51dd5269bffb1255e485f809282d Fix stringstream usage (#1857)
20a36c1e4f28c4ff9837e56784be2686d17435f3 Improve nsight compute support (#1855)
405910308301097297b55c34d560aab6a360e897 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bfe8fdfacdbfdcfba9a624cdf900fe044d4 Misc cleanup (#1853)
5cc64943dc381a568223140bce0f22163c01e29f Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f0207e3a89fe90fd5cd3ffc575dfd766ba00 Cleanup normalization scheduler (#1845)
db89c6591a2f21130599a93675e0615e55564e41 Type inference patch (#1848)
102fe93a4605ca465cda26ebaee4ba1af2026901 Add debug dump for InlinePropagator (#1847)
b7a4d93d375a6e2ddef483763c93ffddc62ec452 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b256056d0e02877361b814ae6af32ca15f Upstream ci build fixes (#1842)
0b83645915029d67f9345aa4649b8c6f62b0061b Fix vectorization bug introduced in #1831 (#1840)
63630f1ae091180e541932a9d9dc598e0a9902dd Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a963c01d97ba34b1a7d2f106e78a13fd6651 Fix transpose benchmark dtype (#1839)
2c9a6c02312d5bf4f83cde653b847b4f85849432 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83067
Approved by: https://github.com/davidberard98
2022-08-10 21:02:56 +00:00
Kurt Mohler
5ca9b2b6fa Enable dim=None for torch.var (#82765)
### Description
Add support for `dim=None` in `torch.var`

### Issue
Part of #29137

### Testing
N/A
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82765
Approved by: https://github.com/albanD
2022-08-04 20:47:27 +00:00
Kurt Mohler
eb0e30e0bc Enable dim=None for torch.std (#81845)
Part of #29137

**BC Breaking Note**

This PR breaks C++ API backward compatibility for `at::std`. A call that has argument types `at::std(Tensor, OptionalIntArrayRef, int64_t, bool)` used to resolve to the `std.correction` overload, but now it resolves to the `std.dim` overload. In order to call the `std.correction` overload, the `int64_t` argument can be wrapped in a `c10::optional`, so that the call has the form `at::std(Tensor, OptionalIntArrayRef, optional<int64_t>, bool)`. The same is true for the corresponding arguments of the `std.out` and `std.correction_out` overloads of `at::std_out`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81845
Approved by: https://github.com/albanD
2022-08-04 01:49:13 +00:00
Kurt Mohler
2bfae07a79 Enable dim=None for torch.mean (#81286)
Part of #79525

This will require coordination with XLA before merging, just like #79881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81286
Approved by: https://github.com/albanD
2022-07-28 22:34:56 +00:00
jjsjann123
4a000ff03e [NVFuser] Upstream push 0714 (#81861)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. Indexing refactor -> Remove reference tensor in predicate indexing logic
  2. MMA Rfactor support for cross-warp and cross-CTA split on K dimension
  3. Grouping grid allreduces across iterations
  4. Swizzle op formulation for non-affine swizzles
  5. Use scheduler_utils to cache inputs and outputs in schedulePointwise
- scheduler refactor
  1. New compute at interface
- transformation propagation refactor on MaxInfoSpanningTree
  1. Added sibling path that is required to generate consistent replay for some cases where `MaxInfoSpanningTree` is used with a selector.
  2. Optimization to skip Transform propagator
  3. SpanningTreePrinter for debugging
- parser update
  1. Fixes `div`
  2. Added `_to_copy`
  3. Broadcast in dim with expand to support expanding to concrete size
  4. Dropout prob extremal patch
- executor patch on caching strides for output allocation

Squashed commits to WAR github API
Commits that's actually in this PR from the devel branch:

```
3b87896706fc98aa4d5b5c811af034cc4dddfbab Fix allocation of work buffers and `fused_reduction::ParallelReduce` with unswitch (#1818)
4cae1227f666b68d275144afd6e4be1fa7aa0786 schedulePointwise cleanup: - computeAt + InlinePropagator (#1815)
3df97426adfb5ecc6fe2c12c43d56d59670e5020 Use scheduler_utils to cache inputs and outputs in schedulePointwise (#1811)
03180aa8facde51237dffa29f6632ffa870cf923 improve broadcast resolution (#1792)
bee6c69979d8c34d6d6ef7514f8886cf1416d64f bug fix (#1819)
4413c8f43a5a64dd0a6ddb0763523bbc7314f4b5 Support PYTORCH_NVFUSER_DUMP=transform_propagator (#1812)
de6b7ca5ce755061ae0d26e006c4403653627ab5 Fix negative position in InlinePropagator (#1813)
10a996cb4dce5d514f09fd0d49ffcd3b88869a28 Remove redundant check in schedulePointwise (#1810)
acd5ed4df825d4c25999e8c9041e0f8ca1a3448f Swizzle op formulation for non-affine swizzles (#1441)
3ed8330f881f429fe2df0e5af9000b91355a96da Kernel args patch to show zero_init buffer (#1809)
037a75a42048f1d8a9c30efb466f1ffbfd2894ad Dropout prob extremal patch (#1804)
282c42902bff07f759cddbbe619249cf5e7c5281 spam nvrtc options (#1783)
3ba6a5fe0a47044179cd36b5b62e628c75180da5 Broadcast in dim with expand (#1794)
fd4be1236ddfeb31ca0659e1b0df36546424c979 remove dead indexing code (#1806)
fa4e6a4739a9daaa0e4111fb4730704d79c91010 Check siblings in getMaxPosAll (#1805)
025c840c76d89b0d032b65a78a375719cab78d46 Grouping grid allreduces across iterations (#1755)
37c579e64f8145fc292273cdebb6519edeb9cf76 Temporarily disable test requring large shared memory. (#1802)
5f375d074524ab65cb78282eff7abe5846cc4203 More cleanup on InlinePropagator (#1800)
8d384da0cfb50a7c5082e91585c12f4c3a775e6c Indexing refactor stage 2 : Remove reference tensor in predicate indexing logic (#1784)
f008140e26335584a143f71c2cb9e91fd61ec530 MMA Rfactor support for cross-warp and cross-CTA split on K dimension (#1554)
76b3cca5cc9a18a56db8107d2f6c8e94851bb85c Add parsing support for `_to_copy` to handle AMP casts. (#1756)
ef04f6c4c0ee043979ac7aad4e5be6f22faeb547 Coding style cleanups (#1798)
38c7f3cf69ea58cc9480b0621506bbfd90a7c9d3 InlinePropagator please don't replay (#1797)
3f2c263ade35017be2d99fe8e4ec97fd0f14f754 validateDomain in TransformPropagator (#1796)
c07708520d99ef815ce15ec367bf7e98797d602b Use TransformPropagatorWithCheck in many tests (#1795)
d0d0908aee2e2b7615c28d04ee80a54b01a02bcd Some further cleanup for the new computeAt interface (#1793)
45f5203b5744cd3512d83263b3fb07c99795a271 Fix TransformReplay::getMatchedLeafPosWithoutReplay* (#1791)
28cbaf931870086cf59807dd60ce412d6dfad0fd New compute at interface (#1743)
635ebfc79bc016eea94d4cbde2c12324171b908b Add SpanningTreePrinter (#1786)
59f3c3223c48ea89549fe7d323f17cbecbebede0 Output allocate patch (#1790)
fe93bf5a6485696ffb36751606a84080349967b5 Transform propagator skip replay when possible (#1782)
ebf23a50f3adf3d28e824c3b3b4ed6ea6f9cf483 Fix isIntegralType error msg (#1789)
0c82ecf04d12b9fe5428af6824a7a978cf5e0ddd Disable register reuse across serial broadcast ops (#1787)
33a824d8d9ace7790a4a58d497e525a7a059579d Adding sibling path for MaxInfoSpanningTree (#1776)
86f46aad83cbb2aa06943419a7335d71a8798f2a Fix div(Val, TensorView) (#1778)
d3de227ade763bdac9e9df15ba8671be78565ee9 Fix FusionMaxRootDomainInfoSpanningTreePrintTwice_CUDA (#1781)
ecc7a87cdaaed66672d08bf819ad58d2980384cb Extend mma dimension and layout checking to support strided batched matmul and tensor contractions (#1761)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38043938](https://our.internmc.facebook.com/intern/diff/D38043938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81861
Approved by: https://github.com/davidberard98
2022-07-28 02:32:16 +00:00
jjsjann123
8d753c8062 [WIP] Upstream push 0627 (#80355)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- TransformPropagator refactor: switched to Dijkstra instead of exhaustive enumeration on all possible paths to reduce compilation time on transform propagation;
- Indexing refactor: remove reference tensor creation in all tensor indexing logic (#1690)
- (more) generic grouped grid reduction kernel;
- Minor parser/fuser patches:
  1. zero-dim tensor reduction support
  3. no-op binary removal within fused graph
  4. expand supported in fusion

Squashed commits to WAR github API
Commits that's actually in this PR from the devel branch:

```
a054b3efcf5af58ea518de283f55aaf9fe06ff5f Refactor TransormPropagator to allow specifying a position and propagating to part of the DAG (#1775)
d67e1cda9b802036841a371318014a818a849b0a Indexing refactor stage 1: remove reference tensor creation in all tensor indexing logic (#1690)
1b6529956a1ace220898ad09dde0bf85e49827f7 Issue 1770 (#1774)
35b04276b648c9b55cdb6a67f3889f54e745c3d2 Avoid compilation errors like below: (#1773)
452c77326a340d2a4130b7802f4f319aec60e72a Ignore reductions of zero-dim tensors per PyTorch conventions (#1771)
31d6c56d88afba09ac53b2d5dd3493d625f8cd57 TransformPropagator refactor (#1769)
570c5a84b91a3cf67207331be9650d26a2d37e3d Merge pull request #1767 from csarofeen/upstream_merge_0621
9d6c3d84be86da643df6fd51695543938111f20d merging upstream 61305cd638
0ed815f76b08f285bda855dd500692ff10a8abce New TransformPropagator algorithm (#1763)
6c195200c0a92fb0f38c833431a8940ed07569b9 no-op binary removal (#1764)
ec7fa4187c177186527409dfc5c7b1754d30bc92 Proper propagation of IterType (#1762)
b263562dbc3c865007ad7d7d42a58a20be8d7922 Fix dimensionality check (#1759)
2d6343f6cc1e47b63ef20a50d1446f6480736478 More generic grouped grid reduction kernel (#1740)
64e2b56df2c8b9fd22a362d9cc05974a8607ef3d [nvfuser] prevent spamming warning message (#77777) (#1758)
0c431624ff15b6458b9f9b674a3852373fc426b1 [nvFuser] Improving bitwise ops support (#77158) (#1757)
b93a14777fde3b9b39684b9cf1715651a806b281 Parser expand (#1754)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80355
Approved by: https://github.com/davidberard98
2022-07-13 19:34:31 +00:00
Kurt Mohler
23bdb570cf Reland: Enable dim=None for torch.sum (#79881)
Part of #29137

Reland of #75845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79881
Approved by: https://github.com/albanD, https://github.com/kulinseth
2022-07-09 00:54:42 +00:00
jjsjann123
9e86796fe3 simple c10 implementation for std::call_once (#78051)
A long standing bug on std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
It could hang during re-entry after an exception handling.

Added a c10 implementation yielding a bulky mutex. Not the most efficient thing but at least it shouldn't hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051
Approved by: https://github.com/albanD
2022-06-28 15:47:03 +00:00
PyTorch MergeBot
ee6ebfc06b Revert "Enable dim=None for torch.sum (#75845)"
This reverts commit e79a51f7db.

Reverted https://github.com/pytorch/pytorch/pull/75845 on behalf of https://github.com/malfet due to Breaks MacOS builds, see e79a51f7db
2022-06-16 22:01:41 +00:00
Kurt Mohler
e79a51f7db Enable dim=None for torch.sum (#75845)
Part of #29137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75845
Approved by: https://github.com/ezyang
2022-06-16 20:17:07 +00:00
jjsjann123
c9c402eae9 [nvfuser_upstream_push] Reland: nvfuser code base bump 060822 (#79406)
Landing reverted PR #79147.

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Bug fixes and minor refactor

Squashed commits to WAR github API
Commits that's actually in this PR from the devel branch:

```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79406
Approved by: https://github.com/davidberard98
2022-06-16 17:52:21 +00:00
Michael Andreas Dagitses
606b234336 turn on -Werror=unused-function in our Bazel CPU build
Summary:
We also fix any existing issues. Note that we only do this for the CPU
build because nvcc is considered a C++ toolchain but it does not have
the same flag support. Adding flags to the GPU build will cause nvcc
errors.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79154

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-10 22:11:54 +00:00
PyTorch MergeBot
d28e9e145b Revert "[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)"
This reverts commit 49c41b87a2.

Reverted https://github.com/pytorch/pytorch/pull/79147 on behalf of https://github.com/janeyx99 due to Broke 11.3 builds on trunk 49c41b87a2
2022-06-10 20:55:10 +00:00
PyTorch MergeBot
bcd7a20953 Revert "turn on -Werror=unused-function in our Bazel CPU build"
This reverts commit 67d313a032.

Reverted https://github.com/pytorch/pytorch/pull/79154 on behalf of https://github.com/malfet due to Breaks bazel build: 67d313a032
2022-06-10 20:43:03 +00:00
jjsjann123
49c41b87a2 [nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Bug fixes and minor refactor

Squashed commits to WAR github API
Commits that's actually in this PR from the devel branch:

```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
2022-06-10 19:37:42 +00:00
Michael Andreas Dagitses
67d313a032 turn on -Werror=unused-function in our Bazel CPU build
Summary:
We also fix any existing issues. Note that we only do this for the CPU
build because nvcc is considered a C++ toolchain but it does not have
the same flag support. Adding flags to the GPU build will cause nvcc
errors.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79154

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-10 18:30:08 +00:00
jjsjann123
9e52ad28c9 [nvfuser_upstream_push] nvfuser code base bump 052422 (#78244)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: https://github.com/csarofeen/pytorch/pull/1440

Commits that's actually in this PR from the csarofeen branch
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
2022-06-07 17:30:51 -07:00
jjsjann123
17fbb85734 [nvfuser] prevent spamming warning message (#77777)
updating TORCH_WARN to TORCH_WARN_ONCE to prevent spamming the log
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77777
Approved by: https://github.com/davidberard98
2022-05-19 20:43:14 +00:00
jjsjann123
a2802ad0b9 Upstream master bump 0513 (#77471)
Updating nvfuser code base.

This should fix the indexing issue observed in https://github.com/pytorch/vision/issues/6015.

Running tests locally as well. Will update the description here at a later point

@bypass-github-export-checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77471
Approved by: https://github.com/seemethere, https://github.com/eellison
2022-05-18 11:48:50 -07:00
Xiang Gao
4eec865f58 [nvFuser] Improving bitwise ops support (#77158)
- Some renaming to better match PyTorch API:
  - `lshift` -> `bitwise_left_shift`
  - `rshift` -> `bitwise_right_shift`
  - `andOp` -> `bitwise_and`
  - `orOp` -> `bitwise_or`
  - `xorOp` -> `bitwise_xor`
  - `notOp` -> `bitwise_not`
- Fix type inferences and type checking of these ops
- Add `bitwise_*` to parser and python frontend
- Improve test coverage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77158
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123
2022-05-18 17:21:34 +00:00
jjsjann123
489818e7c6 disabling squeeze/unsqueeze; disabling BN/BN_BWD for perf concern (#77017)
Fixes #76883 (via disabling squeeze/unsqueeze)

Disabling BN fwd/bwd for our perf concern. I need to update our python tests. Awaiting build to finish so I can update tests accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77017
Approved by: https://github.com/csarofeen, https://github.com/davidberard98
2022-05-09 22:57:20 +00:00
jjsjann123
b4f3f9c651 Torchvision patch (#77001)
Fixes #76791

Note that this is a hot patch so we get to run upstream tests. I'm doing proper fix in our local repo and will update upstream code once those are merged/reviewed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77001
Approved by: https://github.com/davidberard98
2022-05-09 16:53:23 +00:00
Xiang Gao
104f0bf09e [Reland] Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend (#76769)
This reverts commit 4bb5944133.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76769
Approved by: https://github.com/csarofeen, https://github.com/mruberry
2022-05-07 21:26:00 +00:00
PyTorch MergeBot
4bb5944133 Revert "Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend"
This reverts commit 92d10decc4.

Reverted https://github.com/pytorch/pytorch/pull/76598 on behalf of https://github.com/malfet
2022-05-03 19:53:28 +00:00
Xiang Gao
92d10decc4 Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend
Fixes: https://github.com/csarofeen/pytorch/issues/1632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76598
Approved by: https://github.com/csarofeen, https://github.com/mruberry
2022-05-03 16:31:40 +00:00
jjsjann123
d23619b030 Permutation extended
Extended permutation support in integration (See more details on https://github.com/csarofeen/pytorch/issues/1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in aten, with some known limitation at the time.

The idea in this implementation is the same as with our existing code, which is to permute input/output tensors outside of codegen: For a simplified binary op scenario: `output = binaryOp(input0, input1)`

1. In a simple case where `input0` and `input1` come with the same rank & permutation order, our output would preserve the same permutation;
2. For cases where `input0` and `input1` come with different ranks but with **compatible** permutation, the tensor with the higher rank dictates the permutation of the output;
3. For cases where `input0` and `input1` come with different ranks but with **in-compatible** permutation, this is where permutation propagation fails and the output tensor will be contiguous.

By **compatible** permutation, it means that we can permute the higher rank tensor to contiguous format, and then apply a second permutation to the tensor with lower rank to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.

Some concrete example (note that we comply with eager propagation on cases 1-3, but diverge in behavior for cases 4, 5):
1. different rank & same permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1])  # stride (1, wc, c)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
2. different rank & compatible permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(c, h, w).cuda()  # stride (hw, w, 1)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
3. different rank & compatible permutation with broadcasting
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1)  # stride (1, 1, 1)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
4. different rank & in-compatible permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(h, w).cuda()  # stride (w, 1)
    jit_out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c)  # stride (hwc, wc, c, 1)  # nvfuser outputs contiguous tensor
    eager_out = eager_add(t0, t1)  # stride (hwc, 1, wc, c)  # stride (hwc, 1, wc, c)  # TI preserves memory format of LHS operand
```
5. different rank & in-compatible permutation
```
    t0 = torch.randn(c, h, w).cuda()  # stride (hw, w, 1)
    t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    jit_out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c)  # stride (hwc, 1, wc, c)  # nvfuser preserves memory format of highest rank tensors
    eager_out = eager_add(t0, t1)  # stride (hwc, 1, wc, c)  # stride (hwc, hw, w, 1)  # TensorIterator preserves memory format of LHS operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
2022-05-02 22:09:56 +00:00
Ryan Spring
e9f17da2cf Nvfuser - Type Promotion Fix
Fix Type Promotion failures in [Issue 76046](https://github.com/pytorch/pytorch/issues/76046)

1. Updated nvfuser type promotion rule for codegen kernel;
2. Updated casting for output of nvfuser kernel to respect profiling/TorchScript scalar type;
3. Updated type_inference.cpp to only update device/scalar_type when profiling information is missing.

Additional Type Promotion Fixes:
-  test_nvfuser_correctness_softmax_with_dtype_cuda_float32
-  test_nvfuser_correctness_softmax_with_dtype_cuda_bfloat16
-  test_nvfuser_correctness_softmax_with_dtype_cuda_float16
-  test_nvfuser_correctness_softmax_with_dtype_cuda_float32
-  test_nvfuser_correctness_log_softmax_dtype_cuda_bfloat16
-  test_nvfuser_correctness_log_softmax_dtype_cuda_bool
-  test_nvfuser_correctness_log_softmax_dtype_cuda_float16
-  test_nvfuser_correctness_log_softmax_dtype_cuda_float32
-  test_nvfuser_correctness_sum_cuda_int32
-  test_nvfuser_correctness_sum_to_size_cuda_int32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76343
Approved by: https://github.com/jjsjann123, https://github.com/mruberry
2022-04-28 16:08:38 +00:00
David Berard
f36d348f75 [NVFuser] multithreading nvfuser test
1) add multithreading tests
2) make IrParser thread safe with std::call_once (previously, registerJitOperator could get called twice simultaneously and segfault)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76259

Approved by: https://github.com/jjsjann123
2022-04-25 21:48:50 +00:00
Nikita Shulga
f6c275f55d Remove -Wno-unused-variable from utils.cmake (take 2) (#75538)
Summary:
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims, it got added for consistency with  top level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.

Modify violations in 50+ files that were added in the interim by either removing unused variables, or decorating the code with `C10_UNUSED` if local variable is likely used to extend object lifetime until the end of the block.

Caused preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538

Reviewed By: anjali411

Differential Revision: D35747333

Pulled By: malfet

fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
2022-04-20 17:41:59 +00:00
PyTorch MergeBot
5c56b2286b Revert "Remove -Wno-unused-variable from utils.cmake"
This reverts commit 018cbe1f5c.

Reverted https://github.com/pytorch/pytorch/pull/75538 on behalf of https://github.com/seemethere
2022-04-19 17:19:09 +00:00
Nikita Shulga
018cbe1f5c Remove -Wno-unused-variable from utils.cmake
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims, it got added for consistency with  top level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.

Modify violations in 50+ files that were added in the interim by either removing unused variables, or decorating the code with `C10_UNUSED` if local variable is likely used to extend object lifetime until the end of the block.

Caused preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Approved by: https://github.com/cpuhrsch
2022-04-19 15:26:55 +00:00
David Berard
ebb60a8b2f [NVFuser] don't decompose linear if we don't have shape info
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75770

Approved by: https://github.com/jjsjann123, https://github.com/robieta
2022-04-18 14:24:37 +00:00
Nirav Mehta
dfcaedeb1a Fix formatting issues
Summary:
Fixing clang-format errors using `arc f`

Changes already in github included https://github.com/pytorch/pytorch/pull/68460

Test Plan: test run in Signals

Reviewed By: osalpekar

Differential Revision: D35649381

fbshipit-source-id: 15f9cc7259c6425a14d2646200008f15ec47cbf0
(cherry picked from commit 6581afe58afae4dcc34d4024499c6cb61a56b448)
2022-04-14 23:29:13 +00:00
PyTorch MergeBot
db6165215e Revert "[ci] use lintrunner in CI"
This reverts commit 4c3ee53522.

Reverted https://github.com/pytorch/pytorch/pull/68460 on behalf of https://github.com/malfet
2022-04-14 23:27:27 +00:00
Michael Suo
4c3ee53522 [ci] use lintrunner in CI
This changes our lint workflows to use lintrunner for the linters that
are currently supported

+ some random fixes to make things lint clean on master
+ changes to Makefile to use lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68460

Approved by: https://github.com/t10-13rocket, https://github.com/seemethere, https://github.com/janeyx99
2022-04-14 17:43:41 +00:00
jjsjann123
692ebc8d8b baby steps on patching inf/nan behavior & aten::amin support in nvfuser
Fixes #75622

1. Instead of getting max/min_value for reduction init value, we go with (-)infinity instead so we can properly preserve inf inputs;
2. Adding inf/(-)inf/nan for float value.
3. Adding aten::amin in nvfuser (@kevinstephano @rdspring1 for review)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75646
Approved by: https://github.com/rdspring1, https://github.com/kevinstephano, https://github.com/ngimel
2022-04-13 15:51:17 +00:00
jiej
0203341bbd patching clamp for one sided clamp
Fixes #75088

The solution is just to avoid putting random value for non-specified clamp as pointed out in https://github.com/pytorch/pytorch/issues/75088#issuecomment-1093410036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75558
Approved by: https://github.com/ngimel
2022-04-12 03:02:32 +00:00
jjsjann123
873ced7cd0 Nvfuser code bump 030122 (#73627)
Summary:
Things changed in this PR that requires review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and has been updated in this PR, hence the compatibility list updated.

nvfuser code updates with bug fixes towards failures we encountered in OpInfoTests as well as failures reported by AOTAutograd team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
2022-03-31 08:18:22 +00:00
jiej
e4e19d5beb nvfuser parser skip api (#74520)
Summary:
added python API to disable nvfuser on certain opkind.

```
          "_jit_set_nvfuser_skip_node_kind",
          [](const std::string& op_name, bool flip = true) {
            return fuser::cuda::skipNode(op_name, flip);
          })
```

Args:
    `op_name`: Symbol of op;
    `flip`: flag indicating whether to flip the given op in the skip list.
Returns:
    a bool flag indicating if `op_name` was already in the skip list.

The python example that disables the fusion of `aten::add` afterwards.
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)  # returns False, as no op is in skip list by default`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520

Reviewed By: saketh-are

Differential Revision: D35046110

Pulled By: davidberard98

fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
2022-03-23 20:56:43 +00:00
jiej
2d110d514f Nvfuser code bump 2_1_2022 (#72127)
Summary:
Things changed in this PR that requires review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws scripting model sees autocast as decorator since it's not supported

nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor

Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127

Reviewed By: HamidShojanazeri

Differential Revision: D34113233

Pulled By: jbschlosser

fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
2022-02-15 00:43:16 +00:00
Ryan Spring
4f8b986e28 Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
def gelu(x, approximate : str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: VitalyFedyunin

Differential Revision: D33894937

Pulled By: jbschlosser

fbshipit-source-id: b65e8fb6ea66168af8f34f45ed50e92737a33851
(cherry picked from commit 6e986f91a9)
2022-02-14 03:40:32 +00:00
Nikita Shulga
74c44ba9d6 Revert D33850228: [pytorch][PR] Implement Tanh Gelu Approximation
Test Plan: revert-hammer

Differential Revision:
D33850228 (23d03025dc)

Original commit changeset: 3cc33fb298e4

Original Phabricator Diff: D33850228 (23d03025dc)

fbshipit-source-id: 9436e7df73c2b2e2011f321674f24973316d3692
(cherry picked from commit c9efb58223)
2022-01-31 17:44:19 +00:00
Ryan Spring
23d03025dc Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
def gelu(x, approximate : str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: cpuhrsch

Differential Revision: D33850228

Pulled By: jbschlosser

fbshipit-source-id: 3cc33fb298e480d7ecc5c67716da019d60c6ab33
(cherry picked from commit 3a53b3e94f)
2022-01-31 17:07:45 +00:00
Joel Schlosser
cb823d9f07 Revert D33744717: [pytorch][PR] Implement Tanh Gelu Approximation
Test Plan: revert-hammer

Differential Revision:
D33744717 (f499ab9cef)

Original commit changeset: d64532a562ed

Original Phabricator Diff: D33744717 (f499ab9cef)

fbshipit-source-id: 396c3f63de5865f894dbc353d0790a01a624be93
(cherry picked from commit e9fb2d1db1)
2022-01-28 18:35:01 +00:00
Ryan Spring
f499ab9cef Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
def gelu(x, approximate : str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: mikaylagawarecki

Differential Revision: D33744717

Pulled By: jbschlosser

fbshipit-source-id: d64532a562ed53247bb4fa52bb16722634d5c187
(cherry picked from commit 4713dd9cca)
2022-01-28 16:59:09 +00:00
CodemodService FBSourceClangFormatLinterBot
de2d9e2966 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33183467

fbshipit-source-id: d7c37f3522a38e85891524c544eab4fdb01270de
2021-12-17 09:45:20 -08:00
jiej
76d282d447 Nvfuser code bump 12 5 (#69964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964

Things added in this PR that requires review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:
1. perf turning on codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)

Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that is upstreamed in PR https://github.com/pytorch/pytorch/issues/68804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428

Reviewed By: ngimel

Differential Revision: D33073817

Pulled By: wconstab

fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
2021-12-16 08:28:54 -08:00
jjsjann123
0dc3f829d9 Nvfuser code bump 11 5 (#67943)
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support, now we can support dimension collapsing with non-coherent input tensors with different memory format. e.g. channels last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.

Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943

Reviewed By: ngimel

Differential Revision: D32288709

Pulled By: dzhulgakov

fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
2021-11-17 01:22:17 -08:00
soulitzer
4cdfceddd2 [Reland] Avoid saving self for softmax and log_softmax (#66018)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/65242

The last attempt of the reland automatically rebased onto stable, which did not yet have the revert commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66018

Reviewed By: albanD

Differential Revision: D31348822

Pulled By: soulitzer

fbshipit-source-id: 881d701b404530c1352ac9245bd67264e1652b8a
2021-10-03 21:35:01 -07:00
Michael Suo
ccf8d48f16 Revert D31317680: [pytorch][PR] Avoid saving self forsoftmax and log_softmax
Test Plan: revert-hammer

Differential Revision:
D31317680 (5f7cadc7aa)

Original commit changeset: b3b921e06775

fbshipit-source-id: 1bca0672383536a2c21243ceb52349c766a94344
2021-10-01 09:31:44 -07:00