Commit Graph

316 Commits

Author SHA1 Message Date
David Berard
908daa8ae5 [nvfuser] avoid out of bounds error (#89584)
Summary: update OOB check (https://github.com/csarofeen/pytorch/pull/2218) and skip tests that OOM on internal machines.

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/torch/csrc/jit/codegen/cuda/test:nvfuser
```

Differential Revision: D41502369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89584
Approved by: https://github.com/jjsjann123
2022-11-29 02:03:59 +00:00
R Max Espinoza
3af5cf4de1 doc(typo): memroy -> memory (#89126)
Minor typo in comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89126
Approved by: https://github.com/kit1980
2022-11-17 01:03:34 +00:00
Wenzhe Xue
5314af5383 Set correct size of attr::output_layouts when the graph has multiple outputs in JIT oneDNN fuser (#88496)
Bug:
Previously, `initOutputLayouts()` was called after creating a graph and before merging other nodes, so `attr::output_layouts` was a vector with one element. When a graph contains multiple outputs (e.g. when compiling with AOTAutograd, as in my case), the layout_propagation pass tries to access out-of-range elements of that vector. This exposes the second bug, in `useOpaqueLayout()`: the out-of-range check compares the index against the updated output size instead of the size of the vector, and then uses `[]` to access the element, which is out of range.

Fixes the above two issues:

1. Check that the offset is within range using the size of the `attr::output_layouts` vector instead of a separate variable; this check now catches the error (see the sketch below).
2. Initialize `attr::output_layouts` after node merging. The graph may change during node merging, so the initialization is moved into the layout_propagation pass, where the complete graph is available.
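
A hypothetical Python stand-in for fix 1 (the real check lives in the C++ `useOpaqueLayout()`; the helper name and string layouts below are illustrative only):

```python
def use_opaque_layout(output_layouts, offset):
    # Bound-check against the size of the layouts vector itself, not a
    # separately tracked output count that may have grown after node merging.
    if offset >= len(output_layouts):
        raise IndexError(f"offset {offset} out of range for {len(output_layouts)} layouts")
    return output_layouts[offset] == "opaque"
```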

Added test time:
`Ran 1 test in 0.383s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88496
Approved by: https://github.com/jgong5, https://github.com/sanchitintel
2022-11-15 07:29:55 +00:00
Kazuaki Ishizaki
e0c194f10b Fix typos in messages under torch (#88961)
This PR fixes typos in messages and parameters in C++ source and header files under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88961
Approved by: https://github.com/albanD
2022-11-14 19:06:41 +00:00
jjsjann123
7b419e8513 [NVFuser] Upstream push 1026 (#87779)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

* codegen improvement:
    i. allow non-root trivial reductions, allow empty/no-op fusion
    ii. fixes vectorization checks and size calculation
    iii. bank conflict handling improvement
    iv. enables transpose scheduler

* misc:
    i. CI test failure fixes
    ii. cpp tests file clean up
    iii. trivial forwarding support added in codegen runtime
    iv. added factory methods support in codegen

Commits in this PR from the devel branch:

```
7117a7e37ebec372d9e802fdfb8abb7786960f4a patching nvfuser conv cudnn test numerics mismatch (#2048)
65af1a4e7013f070df1ba33701f2d524de79d096 Inserting sync for redundant parallel types is already done at the (#2023)
6ac74d181689c8f135f60bfc1ec139d88941c98c Fix sync map (#2047)
f5bca333355e2c0033523f3402de5b8aac602c00 Bank conflict checker improvements (#2032)
d2ca7e3fd203537946be3f7b435303c60fa7f51e Minor update on cp.async code generation. (#1901)
d36cf61f5570c9c992a748126287c4e7432228e0 Test file cleanup (#2040)
0b8e83f49c2ea9f04a4aad5061c1e7f4268474c6 Allow non-root trivial reductions (#2037)
a2dfe40b27cd3f5c04207596f0a1818fbd5e5439 Fix vectorize size calculation (#2035)
e040676a317fe34ea5875276270c7be88f6eaa56 Use withPredicate to replace setPredicate to maintain Exprs immutable (#2025)
197221b847ad5eb347d7ec1cf2706733aacbf97c removing ci workflow (#2034)
40e2703d00795526e7855860aa00b9ab7160755f Reduction rand like patch (#2031)
bc772661cbdb3b711d8e9854ae9b8b7052e3e4a3 Add utility for checking bank conflict of shared memory (#2029)
ddd1cf7695f3fb172a0e4bcb8e4004573617a037 Add back FusionReductionWithTrivialReduction_CUDA (#2030)
fbd97e5ef15fa0f7573800e6fbb5743463fd9e57 Revert "Cleanup trivial reduction workarounds (#2006)" (#2024)
bca20c1dfb8aa8d881fc7973e7579ce82bc6a894 Cleanup trivial reduction workarounds (#2006)
e4b65850eee1d70084105bb6e1f290651adde23e Trivial forwarding (#1995)
1a0e355b5027ed0df501989194ee8f2be3fdd37a Fix contiguity analysis of predicates to match updated contiguity. (#1991)
a4effa6a5f7066647519dc56e854f4c8a2efd2a7 Enable output allocation cache (#2010)
35440b7953ed8da164a5fb28f87d7fd760ac5e00 Patching bn inference (#2016)
0f9f0b4060dc8ca18dc65779cfd7e0776b6b38e8 Add matmul benchmark (#2007)
45045cd05ea268f510587321dbcc8d7c2977cdab Enable tests previously disabled due to an aliasing bug (#2005)
967aa77d2c8e360c7c01587522eec1c1d377c87e Contiguous indexing for View operations (#1990)
a43cb20f48943595894e345865bc1eabf58a5b48 Make inlining even more modular (#2004)
dc458358c0ac91dfaf4e6655a9b3fc206fc0c897 Test util cleanup (#2003)
3ca21ebe4d213f0070ffdfa4ae5d7f6cb0b8e870 More strict validation (#2000)
a7a7d573310c4707a9f381831d3114210461af01 Fix build problem (#1999)
fc235b064e27921fa9d6dbb9dc7055e5bae1c222 Just fixes comments (#1998)
482386c0509fee6edb2964c5ae72074791f3e43a cleanup (#1997)
4cbe0db6558a82c3097d281eec9c85ad2ea0893a Improve divisible split detection (#1970)
42ccc52bdc18bab0330f4b93ed1399164e2980c9 Minor build fix. (#1996)
fcf8c091f72d46f3055975a35afd06263324ede6 Cleanup of lower_utils.cpp: Isolate out GpuLower usage (#1989)
15f2f6dba8cbf408ec93c344767c1862c30f7ecc Move ConcretizedBroadcastDomains to shared_ptr in GpuLower. (#1988)
8f1c7f52679a3ad6acfd419d28a2f4be4a7d89e2 Minor cleanup lower_unroll.cpp (#1994)
1d9858c80319ca7f0037db7de5f04e47f540d76c Minor cleanup (#1992)
f262d9cab59f41c669f53799c6d4a6b9fc4267eb Add support for uniform RNG (#1986)
eb1dad10c73f855eb1ecb20a8b1f7b6edb0c9ea3 Remove non-const functions, remove GpuLower instance on build, pass in ca_map. (#1987)
634820c5e3586c0fe44132c51179b3155be18072 Add support for some empty fusion (#1981)
eabe8d844ad765ee4973faa4821d451ef71b83c3 Segment self mapping fusions (#1954)
e96aacfd9cf9b3c6d08f120282762489bdf540c8 Enable Transpose operation (#1882)
425dce2777420248e9f08893765b5402644f4161 Add a null scheduler that helps segmenting away no-op schedules (#1835)
306d4a68f127dd1b854b749855e48ba23444ba60 Fix canScheduleCompileTime check of transpose scheduler (#1969)
b1bd32cc1b2ae7bbd44701477bddbcfa6642a9be Minor fix (#1967)
bd93578143c1763c1e00ba613a017f8130a6b989 Enable transpose scheduler (#1927)
b7a206e93b4ac823c791c87f12859cf7af264a4c Move scheduler vectorize utilities into their own file (#1959)
d9420e4ca090489bf210e68e9912bb059b895baf View scheduling (#1928)
c668e13aea0cf21d40f95b48e0163b812712cdf2 Upstream push ci fixes (#1965)
c40202bb40ce955955bb97b12762ef3b6b612997 Fix dump effective bandwidth (#1962)
93505bcbb90a7849bd67090fe5708d867e8909e4 WAR on index mapping when exact and permissive maps differ (#1960)
45e95fd1d3c773ee9b2a21d79624c279d269da9f Allow splitting inner-most ID to create virtual innermost ID in transpose scheduler (#1930)
a3ecb339442131f87842eb56955e4f17c544e99f Improve the comments at the beginning of index_compute.h (#1946)
f7bc3417cc2923a635042cc6cc361b2f344248d6 Remove unused variables (#1955)
df3393adbb5cb0309d091f358cfa98706bd4d313 Some cleanup (#1957)
7d1d7c8724ab5a226fad0f5a80feeac04975a496 TVDomainGuard factory (#1953)
357ba224c0fb41ed3e4e8594d95599c973f4a0ca Fill allocation with nan on tests (#1956)
8eafc54685d406f5ac527bcbacc475fda4492d7a Fix detection of unmappable root domains (#1952)
90a51f282601ba8ebd4c84b9334efd7762a234bc Some indexing cleanups, Add eye support (#1940)
ddc01e4e16428aec92f9c84d698f959b6436a971 Exclude unsupported data types (#1951)
992e17c0688fe690c51b50e81a75803621b7e6aa test the groups the same order as they are merged (#1949)
208262b75d1fed0597a0329d61d57bc8bcd7ff14 Move detection of self mapping IDs to IterDomainGraph from (#1941)
ac4de38c6ee53b366e85fdfe408c3642d32b57df Merge pull request #1945 from csarofeen/master_merge_0828
631094891a96f715d8c9925fb73d41013ca7f2e3 Add full, full_like, zeros, zeros_like, ones, ones_like (#1943)
aab10bce4541204c46b91ff0f0ed9878aec1bfc4 Merge remote-tracking branch 'upstream/viable/strict' into HEAD
4c254c063bb55887b45677e3812357556a7aa80d Fix arange when step is negative (#1942)
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D40869846](https://our.internmc.facebook.com/intern/diff/D40869846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87779
Approved by: https://github.com/davidberard98
2022-11-04 20:04:34 +00:00
jjsjann123
b325c3fc25 [nvFuser] patches profiling on scalar arguments for std/var (#88165)
Fixes #86531

Added profiling on scalar values for aten::std & aten::var.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88165
Approved by: https://github.com/kevinstephano
2022-11-02 22:47:34 +00:00
Ivan Yashchuk
9ebb8d5232 Add ops.broadcast for nvFuser (#88080)
Having nvFuser's `broadcast` available alongside `broadcast_in_dim` would allow easier experimentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88080
Approved by: https://github.com/jjsjann123, https://github.com/kevinstephano, https://github.com/mruberry
2022-11-02 10:05:12 +00:00
Ivan Yashchuk
4a8382b58e Update caching of tensor arguments for nvFuser's fusion creation (#87860)
Previously, nvFuser's fusion definition was cached based on the concrete shapes and strides of tensor inputs, for simplicity and correctness. This PR changes the Python-side cache to key on the number of dimensions, which dimensions have size 1, and contiguity information derived from the given strides and shapes.
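
A rough sketch of the idea (not the actual implementation; the key construction below is hypothetical and simplifies contiguity down to `is_contiguous()`):

```python
import torch

def fusion_cache_key(t: torch.Tensor):
    # Key on rank, which dims have size 1, and contiguity -- not on the
    # concrete sizes/strides themselves.
    size_one_dims = tuple(i for i, s in enumerate(t.shape) if s == 1)
    return (t.ndim, size_one_dims, t.is_contiguous())

a = torch.randn(8, 1, 32)
b = torch.randn(16, 1, 64)
# Same rank, same size-1 dims, both contiguous -> same key, so a cached
# fusion definition could be reused even though the concrete shapes differ.
assert fusion_cache_key(a) == fusion_cache_key(b)
```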

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87860
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123, https://github.com/ngimel
2022-11-02 09:29:20 +00:00
Kazuaki Ishizaki
0131a66ab6 Fix typos under torch directory (#88172)
This PR fixes typos in '.md' files under the torch directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88172
Approved by: https://github.com/malfet
2022-11-01 22:58:22 +00:00
Kevin Stephano
8ef9bda1bf Fix nvFuser Fusion Definition printing of Squeeze and Permute (#88041)
NM

cc @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88041
Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry
2022-11-01 19:02:40 +00:00
jjsjann123
97b3eeac90 remove assert on tensor inputs to FusionGroup (#88018)
Fixes #86530 #86227 #85872
All issues seem to be duplicates of each other.

Removes the false-positive assert.

Fixes come from @kevinstephano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88018
Approved by: https://github.com/kevinstephano, https://github.com/soumith
2022-11-01 18:07:17 +00:00
Kevin Stephano
323c646ca9 Cleaned up the nvFuser Python Frontend Batch Norm printing (#88057)
* Removed `define_null_tensor` usage in favor of using optional arguments for binding.
* Re-ordered the non-State arguments for easier printing.
* Added a printing function to include booleans `training` and `channels_last`
* Fixed `define_tensor` to print `is_cpu`

cc @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88057
Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry
2022-11-01 05:05:15 +00:00
Richard Barnes
3ece9fb45d Check all CUDA API calls for errors in torch/ (#81560)
Summary:
Original commit changeset: 0bb770d2cdb2

Original Phabricator Diff: D35194935 (79e5b053b6)

Differential Revision: D35291874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81560
Approved by: https://github.com/ezyang
2022-10-28 00:40:48 +00:00
Nikita Shulga
82c8365c16 [BE] Delete TH_DISALLOW_COPY_AND_ASSIGN (#87743)
Replace it with `AT_DISALLOW_COPY_AND_ASSIGN` and delete the header that
contained this define

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87743
Approved by: https://github.com/atalman, https://github.com/ngimel
2022-10-26 03:31:56 +00:00
alexmsettle
00b8c7e63b New feature for issue #85575. (#86514)
Introduced the RECORD_OUTPUTS() macro that goes with RECORD_FUNCTION(). It is used to capture the output tensors from a kernel launch. The tensors are automatically passed to the profiler via record_function methods, which allows the profiler to track the tensors that flow into and out of each op.

Fixes #85575

cc @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86514
Approved by: https://github.com/robieta
2022-10-24 20:02:56 +00:00
Natalia Gimelshein
272747db36 attempted fix for nvrtc with lovelace (#87611)
Fixes #87595 (maybe?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87611
Approved by: https://github.com/malfet, https://github.com/atalman
2022-10-24 18:41:38 +00:00
Nikita Shulga
c28cdb53ea [BE] Delete BUILD_SPLIT_CUDA option (#87502)
We already link cuDNN and cuBLAS dynamically for all configs (statically linked cuDNN is a different library than the dynamically linked one, increases the default memory footprint, etc.), and libtorch_cuda, even when compiled for all GPU architectures, is no longer approaching the 2 GB binary size limit, so BUILD_SPLIT_CUDA can go away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87502
Approved by: https://github.com/atalman
2022-10-22 06:00:59 +00:00
Kazuaki Ishizaki
d80a5f9a96 Fix typo under torch directory (#87274)
This PR fixes typos in .md files under the torch directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87274
Approved by: https://github.com/albanD
2022-10-21 14:22:20 +00:00
Zachary DeVito
f56ce8dbad [allocator] Move getFreeMutex (#87237)
It isn't used by the allocators at all, and this change makes that clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87237
Approved by: https://github.com/wconstab
2022-10-19 18:00:40 +00:00
Nikita Shulga
3924aa75b1 [BE] Extend linter to detect DOS newlines (#86973)
Fix DOS newlines in `onednn/decompose_silu.[cpp|h]` introduced by https://github.com/pytorch/pytorch/pull/85591 as well as one in `.github/PULL_REQUEST_TEMPLATE.md`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86973
Approved by: https://github.com/huydhn, https://github.com/izaitsevfb
2022-10-15 00:20:42 +00:00
Ivan Yashchuk
fd80684784 Add nvFuser support for torch.Tensor.view (#84634)
This is an alternative to https://github.com/pytorch/pytorch/pull/83739. While PrimTorch has `view` as a reference, we would like to use nvFuser's implementation for `view` for now. Later we might transition to PrimTorch's `torch._refs.view`.

See `test_nvprims_view` for examples of things that are now sent to nvFuser. Note that nvFuser's `view` is a copy-like operation.
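
A hedged sketch of exercising this path, following the same pattern as the `var_mean` example later in this log (whether a particular graph actually fuses is up to the nvFuser executor, so treat this as illustrative):

```python
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    # view followed by a pointwise op, so there is something to fuse
    return a.view(10, 1024).relu()

a = torch.randn(10, 32, 32, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

out = execute(gm, a, executor="strictly_nvfuser")
```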
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84634
Approved by: https://github.com/kevinstephano, https://github.com/mruberry
2022-10-14 12:08:02 +00:00
sanchitintel
974ad8fa6c Add BFloat16 dtype support for oneDNN Graph JIT fuser (#85591)
## BFloat16 dtype support for faster inference with TorchScript using oneDNN Graph

Intel Xeon Cooper Lake platform & beyond support the `AVX512_BF16` ISA, which is essentially native BFloat16 support.
oneDNN Graph delivers high inference performance with BFloat16 on such machines.

While oneDNN Graph can still be used with BFloat16 on older machines that lack `avx512_bf16` ISA but support `avx512bw`, `avx512vl` & `avx512dq` ISAs, the BF16 performance on these older machines will be significantly poorer (probably even poorer than Float32), as they lack native BF16 support.

Currently, [AMP support for eager mode & JIT mode is divergent in PyTorch](https://github.com/pytorch/pytorch/issues/75956).
So, to use oneDNN Graph with BFloat16, eager-mode AMP should be leveraged while AMP for JIT mode is turned off with `torch._C._jit_set_autocast_mode(False)` in Python code, so as to avoid conflicts.

Please use the following environment variable to view JIT logs -
`PYTORCH_JIT_LOG_LEVEL=">>graph_helper:>>graph_fuser:>>kernel:>>interface"`

## Changes being made in this PR
1. This PR does NOT change the `oneDNN` commit or the `ideep` files. While the `ideep` commit is being updated, only the files pertaining to oneDNN Graph are changed. oneDNN Graph is being upgraded to version 0.5.2 (alpha patch release 2).
To put things into perspective, `ideep` is a git submodule of PyTorch. `oneDNN Graph` is a git submodule of `ideep` (`ideep/mkl-dnn`), and oneDNN is a git submodule of oneDNN Graph (`ideep/mkl-dnn/third_party/oneDNN`).
2. Unit-tests are being updated. We now use the [existing dtypes decorator](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_device_type.py#L123-L131).
3. Suggestions made by @eellison in the [FP32 PR](https://github.com/pytorch/pytorch/pull/68111#pullrequestreview-896719477) are being incorporated/addressed -

| Action-item | Status |
| :---                                             |          ---: |
|checkInputCompatibility follow up | Fixed |
|the mayConvertScalarInputToTensor logic we can consider | Added type promotion code |
|fix up fixConvOptionalBias| The current approach seems correct |
|Use opinfo tests| using dtypes decorator. Will use `OpInfo` in a subsequent PR, if that'd be possible. Should we create a list of ops from opDB that are supported by oneDNN Graph, and add it to `common_methods_invocations.py`? |
|inferDevice torch_check call | not necessary now, perhaps, as only CPU is supported, for now? We'd add it by the beta release of oneDNN Graph, though, so that by then, users might be able to use other fusers with oneDNN Graph (NNC/TensorExpr are already compatible with the oneDNN Graph fuser). We can still add it, if you'd insist. |
|not checking shapes of input mkldnn tensor to llga guard | Those checks should not be present because oneDNN Graph may use blocked or channels-last layout, so those strides would be different. They're only skipped if an LLGA subgraph's output is input to another LLGA subgraph, which enables LLGA to choose an optimal layout between them. |
|fix test failures with respect to unsupported inputs | We'll address them with the upcoming release of oneDNN Graph beta version|

4. More PyTorch ops are being mapped to oneDNN Graph.

## Example of using oneDNN Graph with BFloat16

```python
# Assuming we have a model of the name 'model'

example_input = torch.rand(1, 3, 224, 224)

# enable oneDNN Graph
torch.jit.enable_onednn_fusion(True)
# Disable AMP for JIT
torch._C._jit_set_autocast_mode(False)
with torch.no_grad(), torch.cpu.amp.autocast():
    model = torch.jit.trace(model, (example_input))
    model = torch.jit.freeze(model)
    # 2 warm-ups (2 for tracing/scripting with an example, 3 without an example)
    model(example_input)
    model(example_input)

    # speedup would be observed in subsequent runs.
    model(example_input)
```

## TorchBench based Benchmarks
**URL:** https://github.com/sanchitintel/benchmark/tree/onednn_graph_benchmark (instructions present at URL).
**Batch-size(s):** TorchBench-default for each model
**Baseline :** PyTorch JIT OFI FP32
**Machine:** Intel(R) Xeon(R) Platinum 8371HC (Cooper Lake)
**Sockets used**: 1
**Number of cores on one socket**: 26
Intel OpenMP & tcmalloc were preloaded

#### Benchmark results with single thread
| name                                             | latency of PyTorch JIT OFI FP32 (s) |   Latency of oneDNN Graph BF16 (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[alexnet-cpu-jit]                       |      1.063851 |        0.509820 |     -52.1% |
| test_eval[mnasnet1_0-cpu-jit]                    |      0.218435 |        0.107100 |     -51.0% |
| test_eval[mobilenet_v2-cpu-jit]                  |      0.114467 |        0.058359 |     -49.0% |
| test_eval[mobilenet_v3_large-cpu-jit]            |      0.233873 |        0.117614 |     -49.7% |
| test_eval[resnet18-cpu-jit]                      |      0.160584 |        0.075854 |     -52.8% |
| test_eval[resnet50-cpu-jit]                      |      1.652846 |        0.713373 |     -56.8% |
| test_eval[resnext50_32x4d-cpu-jit]               |      0.471174 |        0.209431 |     -55.6% |
|test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.310306 | 0.167090 | -46.2% |
| test_eval[squeezenet1_1-cpu-jit]                 |      0.161247 |        0.045684 |     -71.7% |
| test_eval[timm_efficientnet-cpu-jit]             |      1.643772 |        0.800099 |     -51.3% |
| test_eval[timm_regnet-cpu-jit]                   |      5.732272 |        2.333417 |     -59.3% |
| test_eval[timm_resnest-cpu-jit]                  |      1.366464 |        0.715252 |     -47.7% |
| test_eval[timm_vision_transformer-cpu-jit]       |      0.508521 |        0.271598 |     -46.6% |
| test_eval[timm_vovnet-cpu-jit]                   |      2.756692 |        1.125033 |     -59.2% |
| test_eval[vgg16-cpu-jit]                         |      0.711533 |        0.312344 |     -56.1% |

#### Benchmark results with 26 threads:
| name                                             | latency of PyTorch JIT OFI FP32 (s) |   Latency of oneDNN Graph BF16 (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[alexnet-cpu-jit]                       |      0.062871 |        0.034198 |     -45.6% |
| test_eval[mnasnet1_0-cpu-jit]                    |      0.022490 |        0.008172 |     -63.7% |
| test_eval[mobilenet_v2-cpu-jit]                  |      0.012730 |        0.005866 |     -53.9% |
| test_eval[mobilenet_v3_large-cpu-jit]            |      0.025948 |        0.010346 |     -60.1% |
| test_eval[resnet18-cpu-jit]                      |      0.011194 |        0.005726 |     -48.9% |
| test_eval[resnet50-cpu-jit]                      |      0.124662 |        0.045599 |     -63.4% |
| test_eval[resnext50_32x4d-cpu-jit]               |      0.034737 |        0.015214 |     -56.2% |
|test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.028820 | 0.012517 | -56.6% |
| test_eval[squeezenet1_1-cpu-jit]                 |      0.012557 |        0.003876 |     -69.1% |
| test_eval[timm_efficientnet-cpu-jit]             |      0.203177 |        0.051879 |     -74.5% |
| test_eval[timm_regnet-cpu-jit]                   |      0.452050 |        0.151113 |     -66.6% |
| test_eval[timm_resnest-cpu-jit]                  |      0.117072 |        0.052848 |     -54.9% |
| test_eval[timm_vision_transformer-cpu-jit]       |      0.046048 |        0.023275 |     -49.5% |
| test_eval[timm_vovnet-cpu-jit]                   |      0.213187 |        0.077482 |     -63.7% |
| test_eval[vgg16-cpu-jit]                         |      0.044726 |        0.021998 |     -50.8% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85591
Approved by: https://github.com/jgong5, https://github.com/frank-wei, https://github.com/chunyuan-w
2022-10-13 20:36:59 +00:00
Nikita Shulga
9eb4f9dd17 Tweak test tolerances to be compatible with A10G (#86538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86538
Approved by: https://github.com/ngimel
2022-10-11 23:31:48 +00:00
Jeff Daily
8db30255c3 [ROCm] set nvfuser default to disabled, keep CI (#86369)
Bug fix. nvfuser is functional for ROCm on gfx906, but some tests are failing for other gfx targets. Disable nvfuser until all features are verified. Users may still opt in by setting the known env var PYTORCH_JIT_ENABLE_NVFUSER=1. This PR sets this env var for the GitHub Actions workflow for ROCm, since all current CI hosts are gfx906.
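
For reference, a minimal opt-in sketch using that env var (setting it before importing torch is the conservative choice; this assumes nothing beyond the variable named above):

```python
import os

# Opt back in to nvfuser on ROCm; only gfx906 is fully verified at this point.
os.environ["PYTORCH_JIT_ENABLE_NVFUSER"] = "1"

import torch  # import after setting the env var so it is picked up
```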
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86369
Approved by: https://github.com/huydhn
2022-10-11 20:55:58 +00:00
jjsjann123
dd6dd03ff2 Enable output allocation cache (#86100)
Cherry-picked from devel branch: https://github.com/csarofeen/pytorch/pull/2010

Turns the accidentally disabled output allocation cache back on ([#2002](https://github.com/csarofeen/pytorch/issues/2002)).
Updates the safety check for the allocation cache by iterating over all IterDomains on the outputs, and enables cache re-use only when no extent value is a consumer of fusion inputs (i.e., output sizes do not depend on scalar inputs).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86100
Approved by: https://github.com/csarofeen
2022-10-10 23:31:21 +00:00
Kevin Stephano
b14f1d7bb8 Add Skip List for Aten Ops that are fused in nvFuser. (#86101)
This Skip List (tuple) is added under the nvprims context manager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86101
Approved by: https://github.com/jjsjann123, https://github.com/mruberry
2022-10-07 03:55:13 +00:00
Ivan Yashchuk
68a6113248 Add nvFuser support for torch.native_batch_norm (#85562)
This PR adds nvFuser's implementation for batch_norm as there's no reference yet (https://github.com/pytorch/pytorch/pull/81191) and no in-place copy support (https://github.com/pytorch/pytorch/pull/84545).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85562
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
2022-10-03 15:03:08 +00:00
Edward Z. Yang
3638089755 Ported reshape to symints and added a shim for BC (#85998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85998
Approved by: https://github.com/ezyang
2022-10-02 17:46:00 +00:00
PyTorch MergeBot
a0b1693996 Revert "Update amax/amin/norm/count_nonzero signatures with int[*]? dim (#83300)"
This reverts commit 1c0f0b33a0.

Reverted https://github.com/pytorch/pytorch/pull/83300 on behalf of https://github.com/jeffdaily due to The commit breaks nvfuser tests
2022-09-28 17:04:53 +00:00
Kurt Mohler
1c0f0b33a0 Update amax/amin/norm/count_nonzero signatures with int[*]? dim (#83300)
Changes the `dim` arg to use the `int[*]?` type for the following functions in `native_functions.yaml`:
* `amax`
* `amin`
* `norm`
* `frobenius_norm`
* `native_norm`
* `count_nonzero`

Part of #29137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83300
Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/kulinseth
2022-09-28 01:56:37 +00:00
PyTorch MergeBot
572dd862c4 Revert "Update amax/amin/norm/count_nonzero signatures with int[*]? dim (#83300)"
This reverts commit 8c7c7ed322.

Reverted https://github.com/pytorch/pytorch/pull/83300 on behalf of https://github.com/huydhn due to The commit pin breaks XLA test somehow
2022-09-28 01:36:43 +00:00
Kurt Mohler
8c7c7ed322 Update amax/amin/norm/count_nonzero signatures with int[*]? dim (#83300)
Changes the `dim` arg to use the `int[*]?` type for the following functions in `native_functions.yaml`:
* `amax`
* `amin`
* `norm`
* `frobenius_norm`
* `native_norm`
* `count_nonzero`

Part of #29137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83300
Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/kulinseth
2022-09-27 23:50:04 +00:00
S. Song
101f10d7ca Cherry pick sorting patch (#85620)
Fixes https://github.com/csarofeen/pytorch/issues/1947

Cherry-picked patch for torchbench issues where fusion segmenter asserts in nvfuser:
1. Test that the groups come in the same order as they are merged.
2. Fix detection of un-mappable root domains:
    ComputeAtRootDomainMap flags domains that should not be mapped due to
    reductions. Previously, the check for whether a domain potentially causes
    an invalid mapping was only done with one domain in each group of domains
    found to be mappable so far. That is not actually sufficient, as the
    unmappable domain set is created just once with no root-mapping
    information. The fix is to check all consumer domains of a producer
    tensor. Another small fix addresses a different problem discovered after
    the first fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85620
Approved by: https://github.com/csarofeen, https://github.com/davidberard98
2022-09-27 15:53:01 +00:00
jjsjann123
0e582fbfcc [NVFuser] Upstream push 0907 (#84626)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
i. improved view support on pointwise and transpose scheduler
ii. grouped grid welford added for better outer-norm grid persistence in normalization

- misc:
i. new composite ops added: variance_mean, arange
ii. fixes misaligned address for transpose scheduler
iii. refactor on separation of compilation API from execution API to prepare us for async compilation
iv. double type support on expression evaluator
v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits in this PR from the devel branch:
```
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
b2fd01ea9346712c6d6f623ca6addbc4888d008e arange support (#1933)
56c00fd3922dad7dfc57351ad7d780f0f2f8e4ed Double support on all expression evaluators (#1937)
371f28223e57fe3f6b5e50a0a45177e6a5c0785c Improve trivial reduction merge support (#1931)
1d0c26790e5647920b40d419d26815bbe310b3a6 Test `rand` in a fusion with zero tensor input (#1932)
0dab160fb2177d178eef3148c6a529e0855009e9 Fix softmax bwd sizes. (#1890)
ef98f360f6d3e3e1cc662ecb65202d88150f128d Fix a bug (#1936)
63132a0c56508c550084b07fb76a3df865102d00 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c88d78078ee4d8b21c4fc51645b5710a282 Map IterationDomains through view operations. (#1919)
c0a187a7619d7cf9dc920294e15461791e8d6d4d do not use deprecated functions (#1935)
88de85e758c5e4afb7b6e746573c0d9a53b4cea7 Upstream cherry pick fixes 0811 (#1934)
b247dcf7c57dc6ac3f7a799b0a6beb7770536a74 Separate kernel compilation API from kernel execution API (#1914)
b34e3b93ee1a8030730c14af3995dd95665af07d Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6707f43bf760494c238a46386d69830822 Nullary RNGOp (#1892)
3c3c89e638f5172cafb0761f22bacd1fd695eec3 Misc fixes/tuning for transpose scheduler (#1912)
20cf109c8b44d48f61977e35bae94368985144ac Grouped grid welford (#1921)
6cf7eb024c9e53c358cbe56597e117bad56efefd Transpose scheduler small dim sizes better support (#1910)
9341ea9a5bf42f9b14ccad0c94edbc79fc5bb552 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f66deeea816bb943d802a97c1b7e4414ab Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80339e4f794767a53eb8fdd61e64cf404a2 Add variance_mean function using Welford (#1907)
98febf6aa3b8c6fe4fdfb2864cda9e5d30089262 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33a5591b534cf587d347af11e48ba7a15d4 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953351f9dabfd1f991d8431cecb6c2ce684d Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a72385f6a4b145680a9dcc52d7e8293763 dopt is only available since nvrtc 11.7 (#1915)
2ec8fc711eafc72451eebf0f5e2a98a38bf3f6ef Kill computeAtBetween (#1911)
d0d106a1d9af118d71673173674e875be35d259d Improve view support on pointwise and transpose scheduler (#1906)
e71e1ecefe67219846070590bbed54bbc7416b79 Fix name clash of RNG with shared memory (#1904)
3381793a253689abf224febc73fd3fe2a0dbc921 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84626
Approved by: https://github.com/malfet
2022-09-23 20:29:48 +00:00
Ivan Yashchuk
308b26fe4d Add nvFuser support for transpose (#84629)
`torch._refs.t`, `torch._refs.transpose`, and `torch._refs.permute` should all be working now with the nvFuser executor. They would also work with graphs processed by AOT Autograd, as these functions are registered to the aten->ref mapping via the "register_decomposition" decorator:
07d398fb26/torch/_refs/__init__.py (L3125-L3126)
07d398fb26/torch/_refs/__init__.py (L3143-L3144)
07d398fb26/torch/_refs/__init__.py (L2548-L2549)
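
A hedged sketch analogous to the view example earlier in this log, this time with a transpose (illustrative only; actual fusion decisions are made by the executor):

```python
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    # transpose followed by a pointwise op
    return torch.transpose(a, 0, 1) * 2.0

a = torch.randn(64, 128, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

out = execute(gm, a, executor="strictly_nvfuser")
```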
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84629
Approved by: https://github.com/ngimel
2022-09-21 12:45:15 +00:00
Kevin Stephano
39f482acdf Add a reset() method to nvFuser FusionCache to enable proper resetting during tests. (#85319)
Fixes issue Jie found in his PR:

https://github.com/pytorch/pytorch/pull/84626#issuecomment-1250745334
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85319
Approved by: https://github.com/jjsjann123
2022-09-20 16:10:05 +00:00
Kevin Stephano
b8418e02eb Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#85045)
This PR does the following:

- Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`.  The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`.  It replaces the TorchScript graph caching that looked up a Fusion based on a stringified and canonicalized representation of the TorchScript graph with a prefix tree of statements in the `FusionDefinition`.  The `FusionInterface` is an object that represents a Fusion in python.  It can also query the cache based on id.
- The ability to print out a mechanically derived definition, in python, for the user to use when debugging was added.
- Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`.
- Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes.
- Adds a README file to explain how to use the Python Frontend

While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation.

An identical PR to #83267 to avoid tooling issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85045
Approved by: https://github.com/davidberard98
2022-09-17 10:52:54 +00:00
Aidyn-A
5271494ef2 [CUDA graphs] Fixes errors in RNG seed (#84967)
Fixes #84614

Prior to this PR, CUDAGraph did not store the RNG seed; that is why `torch.cuda.manual_seed(new_seed)` would only reset the offset but not update the seed, keeping whatever value was used during graph capture.
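
A minimal sketch of the scenario the fix addresses (assumes a CUDA device and the standard static-tensor capture pattern):

```python
import torch

g = torch.cuda.CUDAGraph()
x = torch.empty(4, device="cuda")

torch.cuda.manual_seed(1234)
with torch.cuda.graph(g):
    x.copy_(torch.rand(4, device="cuda"))

# Before this fix, reseeding only reset the offset and the replay kept the
# capture-time seed; with the fix, the new seed is respected on replay.
torch.cuda.manual_seed(4321)
g.replay()
torch.cuda.synchronize()
print(x)
```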

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84967
Approved by: https://github.com/ngimel
2022-09-14 19:56:12 +00:00
PyTorch MergeBot
94b67f4cd8 Revert "Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#83267)"
This reverts commit ec916bf6af.

Reverted https://github.com/pytorch/pytorch/pull/83267 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-09-14 17:40:22 +00:00
Kevin Stephano
ec916bf6af Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#83267)
This PR does the following:

- Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`.  The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`.  It replaces the TorchScript graph caching that looked up a Fusion based on a stringified and canonicalized representation of the TorchScript graph with a prefix tree of statements in the `FusionDefinition`.  The `FusionInterface` is an object that represents a Fusion in python.  It can also query the cache based on id.
- The ability to print out a mechanically derived definition, in python, for the user to use when debugging was added.
- Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`.
- Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes.
- Adds a README file to explain how to use the Python Frontend

While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83267
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2022-09-13 23:28:39 +00:00
jjsjann123
1a33e944b5 nvfuser torchbench patch (#84411)
1. Patching nvfuser_execute to fall back to the aten nvprim path when no CUDA tensors are provided as inputs.
2. Extending nvfuser Python API support to CPU scalar tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84411
Approved by: https://github.com/ngimel, https://github.com/kevinstephano, https://github.com/IvanYashchuk
2022-09-07 05:22:37 +00:00
Jeff Daily
6efadf7e7e [ROCm] guard ROCm-only files in NVFUSER_RUNTIME_FILES (#84312)
Addresses comment in #82498 as a follow-up PR.

https://github.com/pytorch/pytorch/pull/82498#discussion_r958745967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84312
Approved by: https://github.com/jjsjann123
2022-08-31 18:26:24 +00:00
Jeff Daily
d09486ab23 [ROCm] enable nvfuser (#82498)
### Description
nvfuser is enabled for ROCm.

### Testing
CI label ciflow/trunk covers the newly enabled ROCm functionality as well as any CUDA regressions caused by these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82498
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2022-08-30 21:50:39 +00:00
Ivan Yashchuk
90161c23cf Add nvfuser support for squeeze (#84117)
`_refs.squeeze` and `_refs.unsqueeze` now work with the nvfuser executor tests.

Similarly to `_refs.reshape` we need to explicitly save the concrete shape on the trace to pass that info to nvfuser, as it gets lost in translation (https://github.com/pytorch/pytorch/pull/83739#discussion_r950352124).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84117
Approved by: https://github.com/ngimel
2022-08-30 20:36:11 +00:00
Edward Z. Yang
ad44670fa1 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173)
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"

Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9

Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159

Also update Metal registrations for C++ registration changes.

Also update NNPI registration to account for tightened schema checking

Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
2022-08-29 18:01:07 +00:00
Ivan Yashchuk
3aae6ff1e1 Add nvprims.var_mean (#83508)
This PR adds an nvFuser-specific primitive: `var_mean`.
The interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`).

Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-28 18:45:25 +00:00
PyTorch MergeBot
b159a5230f Revert "Add nvprims.var_mean (#83508)"
This reverts commit 7e7694b661.

Reverted https://github.com/pytorch/pytorch/pull/83508 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-08-28 11:30:27 +00:00
Ivan Yashchuk
7e7694b661 Add nvprims.var_mean (#83508)
This PR adds an nvFuser-specific primitive: `var_mean`.
The interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`).

Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-27 09:05:20 +00:00
PyTorch MergeBot
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
Edward Z. Yang
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but riskier approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00