Commit Graph

48 Commits

Author SHA1 Message Date
Jeff Daily
8db30255c3 [ROCm] set nvfuser default to disabled, keep CI (#86369)
Bug fix. nvfuser is functional for ROCm on gfx906, but some tests are failing for other gfx targets. Disable nvfuser until all features are verified. Users may still opt-in by setting the known env var PYTORCH_JIT_ENABLE_NVFUSER=1. This PR sets this env var for the github actions workflow for ROCm since all current CI hosts are gfx906.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86369
Approved by: https://github.com/huydhn
2022-10-11 20:55:58 +00:00
Edward Z. Yang
3638089755 Ported reshape to symints and added a shim for BC (#85998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85998
Approved by: https://github.com/ezyang
2022-10-02 17:46:00 +00:00
Jeff Daily
d09486ab23 [ROCm] enable nvfuser (#82498)
### Description
The nvfuser is enabled for ROCm.

### Testing
CI label ciflow/trunk covers the newly enabled ROCm functionality as well as any CUDA regressions caused by these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82498
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2022-08-30 21:50:39 +00:00
Edward Z. Yang
ad44670fa1 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173)
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"

Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9

Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159

Also update Metal registrations for C++ registration changes.

Also update NNPI registration to account for tightened schema checking

Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
2022-08-29 18:01:07 +00:00
PyTorch MergeBot
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
Edward Z. Yang
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00
jjsjann123
b21a6ff639 [NVFuser] Upstream push 0811 (#83239)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. double support in expression evaluator
- bug fixes:
  1. dropout fix - rework RNG to support broadcasted dropout (Fixes #82784)
  2. expand fix - Patch expand+reduction, expand+view, rework view analysis and guard
- scheduler:
  1. manual transpose schedule example
  2. WIP transpose scheduler

Commits that's in this PR from the devel branch:

```
b7435afcd22c917713c2f41a7237bc26e1183f14 Transpose scheduler, step 1 (#1854)
8a45dbf72034684eb8e18b1835b533e90b68f184 Add an example on how to manually schedule transpose (#1889)
83dbf56a9554b2efbd5416461d938fff477b0b27 Patch dropout fix (#1898)
69d3519a532250719b1aa8341b50e067b181b42d Expand+Reduction, Expand+View support, rework View analysis and guards (#1883)
15091c488e96343bdc49e3990acbf238a3b3da51 Rework RNG to correctly support broadcasted dropout (#1888)
aafe2d048aaac596e503596a41303423619f3954 Make ExpressionEvaluator support Double (#1885)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38657074](https://our.internmc.facebook.com/intern/diff/D38657074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83239
Approved by: https://github.com/davidberard98
2022-08-25 02:23:22 +00:00
PyTorch MergeBot
a7edf71360 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 8fae7027b3.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222
2022-08-25 00:49:40 +00:00
Edward Z. Yang
8fae7027b3 Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-23 22:04:07 +00:00
jjsjann123
8d753c8062 [WIP] Upstream push 0627 (#80355)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- TransformPropagator refactor: switched to Dijkstra instead of exhaustive enumeration on all possible paths to reduce compilation time on transform propagation;
- Indexing refactor: remove reference tensor creation in all tensor indexing logic (#1690)
- (more) generic grouped grid reduction kernel;
- Minor parser/fuser patches:
  1. zero-dim tensor reduction support
  3. no-op binary removal within fused graph
  4. expand supported in fusion

Squashed commits to WAR github API
Commits that's actually in this PR from the devel branch:

```
a054b3efcf5af58ea518de283f55aaf9fe06ff5f Refactor TransormPropagator to allow specifying a position and propagating to part of the DAG (#1775)
d67e1cda9b802036841a371318014a818a849b0a Indexing refactor stage 1: remove reference tensor creation in all tensor indexing logic (#1690)
1b6529956a1ace220898ad09dde0bf85e49827f7 Issue 1770 (#1774)
35b04276b648c9b55cdb6a67f3889f54e745c3d2 Avoid compilation errors like below: (#1773)
452c77326a340d2a4130b7802f4f319aec60e72a Ignore reductions of zero-dim tensors per PyTorch conventions (#1771)
31d6c56d88afba09ac53b2d5dd3493d625f8cd57 TransformPropagator refactor (#1769)
570c5a84b91a3cf67207331be9650d26a2d37e3d Merge pull request #1767 from csarofeen/upstream_merge_0621
9d6c3d84be86da643df6fd51695543938111f20d merging upstream 61305cd638
0ed815f76b08f285bda855dd500692ff10a8abce New TransformPropagator algorithm (#1763)
6c195200c0a92fb0f38c833431a8940ed07569b9 no-op binary removal (#1764)
ec7fa4187c177186527409dfc5c7b1754d30bc92 Proper propagation of IterType (#1762)
b263562dbc3c865007ad7d7d42a58a20be8d7922 Fix dimensionality check (#1759)
2d6343f6cc1e47b63ef20a50d1446f6480736478 More generic grouped grid reduction kernel (#1740)
64e2b56df2c8b9fd22a362d9cc05974a8607ef3d [nvfuser] prevent spamming warning message (#77777) (#1758)
0c431624ff15b6458b9f9b674a3852373fc426b1 [nvFuser] Improving bitwise ops support (#77158) (#1757)
b93a14777fde3b9b39684b9cf1715651a806b281 Parser expand (#1754)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80355
Approved by: https://github.com/davidberard98
2022-07-13 19:34:31 +00:00
jjsjann123
9e86796fe3 simple c10 implementation for std::call_once (#78051)
A long standing bug on std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
It could hang during re-entry after an exception handling.

Added a c10 implementation yielding a bulky mutex. Not the most efficient thing but at least it shouldn't hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051
Approved by: https://github.com/albanD
2022-06-28 15:47:03 +00:00
David Berard
459090e3ce [NVFuser] add "canBeEnabled" interface
If you try to enable NVFuser when it's not possible, it will error out.
This will allow you to check whether or not it's possible before trying
to enable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79648

Approved by: https://github.com/eellison
2022-06-17 16:15:04 +00:00
David Berard
38bc10ae25 retry - enable NVFuser by default
Enable NVFuser in OSS.

Retry of #77213, because it was breaking torchvision tests.

Fix in #77471 has been verified by jjsjann123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77579

Approved by: https://github.com/eellison, https://github.com/malfet, https://github.com/atalman, https://github.com/seemethere
2022-05-20 14:21:18 +00:00
jjsjann123
a2802ad0b9 Upstream master bump 0513 (#77471)
Updating nvfuser code base.

This should fix the indexing issue observed in https://github.com/pytorch/vision/issues/6015.

Running tests locally as well. Will update the description here at a later point

@bypass-github-export-checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77471
Approved by: https://github.com/seemethere, https://github.com/eellison
2022-05-18 11:48:50 -07:00
PyTorch MergeBot
2a905aef09 Revert "enable NVFuser by default"
This reverts commit 24f7dcd816.

Reverted https://github.com/pytorch/pytorch/pull/77213 on behalf of https://github.com/davidberard98
2022-05-16 18:23:39 +00:00
David Berard
e175065c4e [NVFuser] fix force-disable flag
This prevents the std::call_once() check from erroring if:

* PYTORCH_JIT_USE_NNC_NOT_NVFUSER=1
* PYTORCH_JIT_ENABLE_NVFUSER=1
* user has not set a flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77395

Approved by: https://github.com/eellison
2022-05-14 19:31:32 +00:00
David Berard
24f7dcd816 enable NVFuser by default
Enable NVFuser in OSS.

Tests are passing, and we've also run tests in [torchvision](https://github.com/pytorch/vision/pull/5959) and [torchaudio](https://github.com/pytorch/audio/pull/2372)

Retry of #76006, because that PR had GH1/ghstack issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77213

Approved by: https://github.com/eellison
2022-05-11 19:59:31 +00:00
David Berard
6fd14ba9db [NVFuser] Add environment variable to force disable NVFuser
PYTORCH_JIT_USE_NNC_NOT_NVFUSER=1 will force NVFuser to be disabled,
regardless of other environment variables or values set at runtime. It
will be used for guarding certain parts of the internal rollout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77168

Approved by: https://github.com/jjsjann123, https://github.com/eellison
2022-05-11 16:12:19 +00:00
David Berard
6c615a21a0 [NVFuser] prep for on-by-default
1. fix tests that expected nvfuser off-by-default behavior
2. skip nvfuser if getExecutorMode() == false

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76937

Approved by: https://github.com/eellison
2022-05-06 18:18:53 +00:00
David Berard
e33f3229a2 [NVFuser] environment variable to turn nvfuser on or off (#76485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76485

Adds an environment variable `PYTORCH_JIT_ENABLE_NVFUSER` for
controlling whether or not nvfuser is enabled. This required changing
the PassManager behavior to support the case where nvfuser gets enabled
by default when PYTORCH_JIT_ENABLE_NVFUSER=1.

Previously the solution for turning nvfuser on or off was to use the
PassManager to register or un-register the pass. That works fine if the
pass starts of _disabled_, but causes issues once we try to enable the
pass by default.

The main issue with enabling by default is with the validation check to
see whether NVFuser can be turned on. The check relies on
at::globalContext().hasCUDA(), which requires CUDAHooks to be registered
before hasCUDA() wil work correctly. At static initialization time it's
difficult to ensure that CUDAHooks will be registered _before_ we
attempt to register the nvfuser pass. In OSS it worked fine, but in
internal builds it would fail on ROCm builds.

To fix this, we switch the control of NVFuser enablement to a check in
the pass. i.e. previously, we enabled/disabled nvfuser by registering or
de-registering the pass in pass manager; now, the pass is always
registered in pass manager, and enablement is done by a check within the
nvfuser pass.

Remaining TODO: Connect this with NNC so that in cases where NNC is
available but not NVFuser (i.e. on AMD gpus), NNC can be turned on
automatically.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D35982618

Pulled By: davidberard98

fbshipit-source-id: fd5b76bc0b8c8716c96fdc04bebfb15026a7ef60
(cherry picked from commit ff14603ff5ac8d9b6c749c4f111f4a8be8023b7f)
2022-05-03 23:05:40 +00:00
Ryan Spring
e9f17da2cf Nvfuser - Type Promotion Fix
Fix Type Promotion failures in [Issue 76046](https://github.com/pytorch/pytorch/issues/76046)

1. Updated nvfuser type promotion rule for codegen kernel;
2. Updated casting for output of nvfuser kernel to respect profiling/TorchScript scalar type;
3. Updated type_inference.cpp to only update device/scalar_type when profiling information is missing.

Additional Type Promotion Fixes:
-  test_nvfuser_correctness_softmax_with_dtype_cuda_float32
-  test_nvfuser_correctness_softmax_with_dtype_cuda_bfloat16
-  test_nvfuser_correctness_softmax_with_dtype_cuda_float16
-  test_nvfuser_correctness_softmax_with_dtype_cuda_float32
-  test_nvfuser_correctness_log_softmax_dtype_cuda_bfloat16
-  test_nvfuser_correctness_log_softmax_dtype_cuda_bool
-  test_nvfuser_correctness_log_softmax_dtype_cuda_float16
-  test_nvfuser_correctness_log_softmax_dtype_cuda_float32
-  test_nvfuser_correctness_sum_cuda_int32
-  test_nvfuser_correctness_sum_to_size_cuda_int32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76343
Approved by: https://github.com/jjsjann123, https://github.com/mruberry
2022-04-28 16:08:38 +00:00
Horace He
5994d68484 Reland NVFuser guard changes
Reland of https://github.com/pytorch/pytorch/pull/75016 with `USE_CUDA` => `USE_NVFUSER`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75303
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2022-04-06 06:32:34 +00:00
PyTorch MergeBot
1352c6417a Revert "Nvfuser guard patch"
This reverts commit d86181f745.

Reverted https://github.com/pytorch/pytorch/pull/75016 on behalf of https://github.com/malfet
2022-04-01 23:45:55 +00:00
jjsjann123
d86181f745 Nvfuser guard patch
Fixes issue where CudaFusionGuard would return false on backward graph because `requires_grad` flag doesn't match.

This is due to the fact that autodiff uses GradMode switch to turn on/off requires_grad, which is not taken into consideration by nvfuser guard. We verified the implementation under `TensorType::matchTensor`.

- [x] Add python test to verify no fallback is observed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75016
Approved by: https://github.com/eellison
2022-04-01 14:23:48 +00:00
jjsjann123
873ced7cd0 Nvfuser code bump 030122 (#73627)
Summary:
Things changed in this PR that requires review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and has been updated in this PR, hence the compatibility list updated.

nvfuser code updates with bug fixes towards failures we encountered in OpInfoTests as well as failures reported by AOTAutograd team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
2022-03-31 08:18:22 +00:00
jiej
86c817cfa0 Requires grad guard
Adding CudaFusionGuard to guard on device/requires_grad of profiled tensor type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74780
Approved by: https://github.com/davidberard98
2022-03-29 19:23:10 +00:00
jiej
e4e19d5beb nvfuser parser skip api (#74520)
Summary:
added python API to disable nvfuser on certain opkind.

```
          "_jit_set_nvfuser_skip_node_kind",
          [](const std::string& op_name, bool flip = true) {
            return fuser::cuda::skipNode(op_name, flip);
          })
```

Args:
    `op_name`: Symbol of op;
    `flip`: flag indicating whether to flip the given op in the skip list.
Returns:
    a bool flag indicating if `op_name` was already in the skip list.

The python example that disables the fusion of `aten::add` afterwards.
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)  # returns False, as no op is in skip list by default`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520

Reviewed By: saketh-are

Differential Revision: D35046110

Pulled By: davidberard98

fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
2022-03-23 20:56:43 +00:00
jiej
2d110d514f Nvfuser code bump 2_1_2022 (#72127)
Summary:
Things changed in this PR that requires review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws scripting model sees autocast as decorator since it's not supported

nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor

Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127

Reviewed By: HamidShojanazeri

Differential Revision: D34113233

Pulled By: jbschlosser

fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
2022-02-15 00:43:16 +00:00
jjsjann123
e429a68478 Allow single node fusion for nvfuser (#70000)
Summary:
Setting `PYTORCH_NVFUSER_ONE_OP_FUSION=1` will take all nodes nvFuser support, instead of waiting for fusion opportunity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70000

Reviewed By: samdow

Differential Revision: D33292195

Pulled By: davidberard98

fbshipit-source-id: 8ed5ce5e82fbb6737e8ab5ce4223b038eaf47756
2021-12-23 17:07:57 -08:00
jjsjann123
0dc3f829d9 Nvfuser code bump 11 5 (#67943)
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support, now we can support dimension collapsing with non-coherent input tensors with different memory format. e.g. channels last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.

Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943

Reviewed By: ngimel

Differential Revision: D32288709

Pulled By: dzhulgakov

fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
2021-11-17 01:22:17 -08:00
jiej
321345d7c9 Revert "Revert D31227448: [pytorch][PR] fixing sorting in stride indices" (#66176)
Summary:
enabling https://github.com/pytorch/pytorch/issues/63940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66176

Reviewed By: ngimel

Differential Revision: D31423920

Pulled By: dzhulgakov

fbshipit-source-id: 06b1e0f757f4fb5b31ee1fa464bcd689df919b9c
2021-10-07 22:09:07 -07:00
jiej
127c9402d0 Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137)
Summary:
This reverts commit 03389dc851.

Attempt again for PR: https://github.com/pytorch/pytorch/issues/63745
Fixes the windows build failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d
2021-09-22 04:54:51 -07:00
Eli Uriegas
03389dc851 Revert D30752939: [pytorch][PR] nvfuser update
Test Plan: revert-hammer

Differential Revision:
D30752939 (cfaecaf40b)

Original commit changeset: ce122e80f01b

fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2
2021-09-15 17:38:47 -07:00
jiej
cfaecaf40b nvfuser update (#63745)
Summary:
Syncing nvfuser code base from devel branch, Listing a few of our development since last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launch for single `CudaFusionGroup`. Hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalar into compile time constants, which are required by the codegen. (e.g. reduction axes).

To keep this PR simple and relatively review-free. We stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/coddgen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`

updates affecting integration:

1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
2021-09-15 14:42:55 -07:00
Zhengxu Chen
ac99d63f83 [jit] Make operation call accept Stack& instead Stack* (#63414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414

Misuse of raw pointer in here where stack is never nullable.
ghstack-source-id: 136938318

Test Plan:
compiles.

Imported from OSS

Reviewed By: ejguan

Differential Revision: D30375410

fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
2021-08-30 11:49:20 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Richard Barnes
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Thomas Viehmann
ea087e2d92 JIT: guard DifferentiableGraph node (#49433)
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.

Fixes https://github.com/pytorch/pytorch/issues/49299

I still need to look into a handful of failing tests, but maybe it can be a discussion basis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433

Reviewed By: ngimel

Differential Revision: D25681374

Pulled By: Krovatkin

fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
2021-01-08 20:01:27 -08:00
jiej
a6fa3b2682 adding profile_ivalue (#47666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47666

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25255573

Pulled By: Krovatkin

fbshipit-source-id: 5d8753e4040a3d96105d28d26728125947c7a638
2020-12-09 15:29:15 -08:00
jiej
ac146c4820 [nvFuser] Switching to CudaFusionGuard from BailOut for nvfuser - update 2 (#46452)
Summary:
1. Added CudaFusionGuard as the custom TypeCheck for nvfuser; enabled dynamic shape support with profiling executor;
2. dropped support for legacy fuser;
3. re-enabled nvfuser tests;
4. added registration for profiling record to allow profiling on user specified nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46452

Reviewed By: zou3519, anjali411

Differential Revision: D24364642

Pulled By: ngimel

fbshipit-source-id: daf53a9a6b6636e1ede420a3a6d0397d4a8b450b
2020-10-19 15:44:31 -07:00
jjsjann123
99e0a87bbb [nvFuser] Latency improvements for pointwise + reduction fusion (#45218)
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory form being loaded multiple times
- Improved sync threads placements with shared memory and removed read before write race
- Fixes to FP16 reduction fusions where output would come back as FP32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
2020-09-24 23:17:20 -07:00
Sebastian Messmer
53af9df557 Unify boxed function signature between jit and c10 (#37034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034

c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.

This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069

Test Plan: unit tests

Differential Revision: D20567950

fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
2020-06-29 19:24:26 -07:00
Song Zhou
dabeff33b9 [pytorch] Fix fblearner flow compiling errors (#35902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35902

Move operator registration to anonymous namespace to avoid collision.

Reviewed By: soumith

Differential Revision: D20822382

fbshipit-source-id: 1ab00871491668b8b85e803ac877d96477f1688b
2020-04-02 14:52:48 -07:00
Soumith Chintala
d9dd353a00 fix clang-format (#35884)
Summary:
breakage introduced in PR that I landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35884

Differential Revision: D20817603

Pulled By: soumith

fbshipit-source-id: b0729bed81549d4c8e6a889c380baa19c73ef127
2020-04-02 12:12:27 -07:00
Christian Sarofeen
6d24f8fe21 Infrastructure for a new CUDA Fuser (#34785)
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles of TensorExpressions and Halide, however the implementation is ground up. The fusion pass itself is similar to the default CUDA fuser, however, it has undergone some refactoring and is using the new code generation infrastructure. For those who are interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_  One of the largest differences between our approach and that of TVM/Halide, is the concept of "TensorView". TensorView from a high level should be thought of similarly to how we think of working with Tensors in PyTorch. It's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at of TVM, they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.

**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.

**Short term goals:**

Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of braodcast (broadcasted tensors are treated as tensors of the braodcasted size in the generated code)
- Dropout

**Mid-term goals:**

- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785

Reviewed By: ZolotukhinM

Differential Revision: D20650977

Pulled By: soumith

fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
2020-04-02 09:22:42 -07:00