Commit Graph

364 Commits

Author SHA1 Message Date
Nikita Shulga
c40a009d66 Revert D35194935: Check all CUDA API calls for errors in torch/
Test Plan: revert-hammer

Differential Revision:
D35194935 (79e5b053b6)

Original commit changeset: f5ec5be87cdf

Original Phabricator Diff: D35194935 (79e5b053b6)

fbshipit-source-id: 0bb770d2cdb29b8e724c0b6a125c748f363d3358
(cherry picked from commit 04e5a73da4a53b0ec296f3df2c85626d19290c1f)
2022-03-31 05:48:30 +00:00
Richard Barnes
79e5b053b6 Check all CUDA API calls for errors in torch/ (#74923)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74923

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D35194935

fbshipit-source-id: f5ec5be87cdf775eb9c99f8c3baed6b0366dda49
(cherry picked from commit 7284c4ed7d57261d4936055e0c1a3f8f911fb1f0)
2022-03-31 05:08:55 +00:00
jiej
86c817cfa0 Requires grad guard
Adds a CudaFusionGuard to guard on the device/requires_grad of the profiled tensor type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74780
Approved by: https://github.com/davidberard98
2022-03-29 19:23:10 +00:00
jiej
e4e19d5beb nvfuser parser skip api (#74520)
Summary:
Added a Python API to disable nvfuser on certain op kinds.

```
          "_jit_set_nvfuser_skip_node_kind",
          [](const std::string& op_name, bool flip = true) {
            return fuser::cuda::skipNode(op_name, flip);
          })
```

Args:
    `op_name`: Symbol of op;
    `flip`: flag indicating whether to flip the given op in the skip list.
Returns:
    a bool flag indicating if `op_name` was already in the skip list.

A Python example that disables fusion of `aten::add` from that point on:
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)  # returns False, as no op is in skip list by default`
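
A sketch of the intended toggle flow, assuming the binding behaves as the Args/Returns contract above describes:

```
import torch

# add aten::add to the skip list; returns False because the list starts empty
torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)

# ... run scripted models here; nvfuser now leaves aten::add unfused ...

# flip it again to take aten::add back off the skip list; returns True
torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)
```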

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520

Reviewed By: saketh-are

Differential Revision: D35046110

Pulled By: davidberard98

fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
2022-03-23 20:56:43 +00:00
Michael Suo
e5bf87963d Revert D34584878: [pytorch][PR] Add JIT graph fuser for oneDNN Graph API (Preview4)
Test Plan: revert-hammer

Differential Revision:
D34584878 (7dd0823011)

Original commit changeset: ce817aa8cc90

Original Phabricator Diff: D34584878 (7dd0823011)

fbshipit-source-id: a941aaad34f8fe5f0c51f719f9f5c29b811c4d5b
(cherry picked from commit a43262ec7521b1665b02a64d3f279e72ee2344b9)
2022-03-21 23:07:14 +00:00
chunyuan
7dd0823011 Add JIT graph fuser for oneDNN Graph API (Preview4) (#68111)
Summary:
## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).

On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:

- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.

### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
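
A minimal end-to-end sketch of the intended inference flow (the toy model and shapes are placeholders):

```
import torch

torch.jit.enable_onednn_fusion(True)  # opt in; the pass is off by default

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
).eval()
scripted = torch.jit.script(model)

with torch.no_grad():
    x = torch.randn(32, 64)
    # warm-up runs let the profiling executor record tensor properties
    # before the fusion pass kicks in
    for _ in range(3):
        scripted(x)
    out = scripted(x)
```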

### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

  ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)

- SkyLake 8180 (single thread):

  ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)
 \* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
  \** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops

### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```

Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```

CMake changes for the integration code are in:
```
caffe2/CMakeLists.txt
```

## Limitations

- In this PR, we have only supported the optimization on Linux. Support for Windows and macOS will be enabled as a next step.
- We have only optimized the inference use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68111

Reviewed By: eellison

Differential Revision: D34584878

Pulled By: malfet

fbshipit-source-id: ce817aa8cc9052ee9ed930c9cf66be83449e61a4
(cherry picked from commit cd17683aa7d9c0947df45a1ab53627feff795587)
2022-03-21 22:12:19 +00:00
David Berard
890b1e8f9e [JIT] C10_EXPORT -> TORCH_API (#73818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73818

These all appear to be defined in libtorch_cpu.so, so they should be marked with TORCH_API. TORCH_API means that these symbols are exported from libtorch_cpu.so and no other libraries. In comparison, C10_EXPORT will export the symbol from _all_ built libraries in which it appears.

I think most of these were fine because most were only defined in cpp files (which would only be included in the targets for one .so file). However, the change in pass_manager.h affects behavior, since the class is defined in the .h file, which could result in two separate implementations of the same static functions. Previously we saw issues on Windows with this: https://github.com/pytorch/pytorch/pull/73742
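
Schematically, the difference for a symbol defined in libtorch_cpu (the function name is hypothetical):

```
// Before: C10_EXPORT marks the symbol as exported from every library whose
// build includes this header, risking duplicate definitions across .so files.
C10_EXPORT void registerMyPass();

// After: TORCH_API exports the symbol from libtorch_cpu only and imports it
// everywhere else, so there is a single definition.
TORCH_API void registerMyPass();
```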

Test Plan: Imported from OSS

Reviewed By: george-qi

Differential Revision: D34698175

Pulled By: davidberard98

fbshipit-source-id: cb871e861cf966bff596cfa8340a32a17fca0b66
(cherry picked from commit 6b9988e5688e6d4a9928c3e331efb74f000a9e4a)
2022-03-14 20:29:58 +00:00
David Berard
31b64fc3e6 [JIT] log extract tool - dump NVFuser fallbacks instead of fusion groups (#73881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73881

NVFuser fusion groups can contain nvfuser-only ops, e.g. `prim::reshape_copy`. Previously, we couldn't get a baseline performance measurement because the nvfuser-only ops would error out on the NNC and no-fusion runs. Instead, dump the fallback graphs, after they are converted into runnable fallbacks.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D34698307

Pulled By: davidberard98

fbshipit-source-id: c357b2736b789bfd347afe9c83a1b610b64881e0
(cherry picked from commit 5918d826502ff75fbc22d242844ae6435dd7d22a)
2022-03-08 16:38:17 +00:00
David Berard
b27ec57331 [JIT] script & logging for extracting IR from logs (#72889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72889

The script along with the GRAPH_EXPORT macro will allow for an easy way to extract IR from logs. One use case in this diff is to extract the fusion groups from nvfuser, so that the fusions can be tested individually.

Usage (e.g. for nvfuser test)

1. Write some test.py file that uses nvfuser
2. `PYTORCH_JIT_LOG_LEVEL=">>graph_fuser" python3 test.py 2>&1 | tee output.txt`
3. `python3 pytorch/scripts/jit/log_extract.py output.txt --nvfuser`

This will run with and without nvfuser to compare the output.

Alternatively, use `--output` to dump the IR so that it can be used in other applications.

Currently, only `--output` works (since generating input tensors is not supported)

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D34440189

Pulled By: davidberard98

fbshipit-source-id: fca0f619200ee37aba34bb39b69e6c640c263e26
(cherry picked from commit eb319166075db160f1628f0de545641fbecde8be)
2022-03-02 18:34:35 +00:00
Gabor Kertesz
c4ff49f4c7 Enable win-arm64
This patch enables building PyTorch from source with the Ninja and
'Visual Studio 16 2019' CMake generators on Windows on Arm.

Tests:
- Build from source: 'python setup.py develop'.
- Run a simple PyTorch example: passed
- python test\test_torch.py:
-- same results as on x64
-- Ran 1344 tests, failures=2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72424
2022-02-28 17:17:56 +00:00
CodemodService FBSourceClangFormatLinterBot
b9ccbe4ff2 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: bilalsou

Differential Revision: D34237270

fbshipit-source-id: f33c06e9cbbde8b1fa39b11f9addb716f3762c99
(cherry picked from commit 0db3686e9d)
2022-02-15 10:41:24 +00:00
jiej
2d110d514f Nvfuser code bump 2_1_2022 (#72127)
Summary:
Things changed in this PR that requires review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws when a scripted model uses autocast as a decorator, since it's not supported

nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar CPU tensor promotion to support inter-device operations between a CPU scalar tensor and a CUDA tensor

Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127

Reviewed By: HamidShojanazeri

Differential Revision: D34113233

Pulled By: jbschlosser

fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
2022-02-15 00:43:16 +00:00
Ryan Spring
4f8b986e28 Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def normcdf(x):
    # standard normal CDF: 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + torch.erf(x * 0.7071067811865476))

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```
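
For a quick sanity check of how closely the tanh form tracks the exact form, a small sketch using the function above (the range and print are illustrative):

```
import torch

x = torch.linspace(-4.0, 4.0, steps=1001)
exact = gelu(x)                       # x * normcdf(x)
approx = gelu(x, approximate='tanh')  # tanh-based approximation

# the two curves should agree closely over this range
print((exact - approx).abs().max())
```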

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: VitalyFedyunin

Differential Revision: D33894937

Pulled By: jbschlosser

fbshipit-source-id: b65e8fb6ea66168af8f34f45ed50e92737a33851
(cherry picked from commit 6e986f91a9)
2022-02-14 03:40:32 +00:00
Nikita Shulga
74c44ba9d6 Revert D33850228: [pytorch][PR] Implement Tanh Gelu Approximation
Test Plan: revert-hammer

Differential Revision:
D33850228 (23d03025dc)

Original commit changeset: 3cc33fb298e4

Original Phabricator Diff: D33850228 (23d03025dc)

fbshipit-source-id: 9436e7df73c2b2e2011f321674f24973316d3692
(cherry picked from commit c9efb58223)
2022-01-31 17:44:19 +00:00
Ryan Spring
23d03025dc Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def normcdf(x):
    # standard normal CDF: 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + torch.erf(x * 0.7071067811865476))

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: cpuhrsch

Differential Revision: D33850228

Pulled By: jbschlosser

fbshipit-source-id: 3cc33fb298e480d7ecc5c67716da019d60c6ab33
(cherry picked from commit 3a53b3e94f)
2022-01-31 17:07:45 +00:00
Nolan O'Brien
d68c314b13 [warnings][caffe2] Fix asserts yielding -Wstring-conversion warnings (#72013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72013

Find and replace `assert(!"` with `assert(false && "`
Excludes headers and paths that contain "third-party" or "external"

Clang raises a `-Wstring-conversion` warning when a string is treated as a boolean. This is not uncommon in asserts (e.g. `assert(!"should never happen")`). Clang does, however, permit `expr && "string"` to support these assertion use cases.
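
The before/after shape of the codemod as a tiny example (the message text is illustrative):

```
#include <assert.h>

void handle(int kind) {
  if (kind != 0) {
    /* Before: assert(!"unhandled kind");  -- the string literal is used as
       a boolean, which trips clang's -Wstring-conversion. */
    /* After: the string is ANDed with an explicit boolean, so there is no
       conversion warning and the message still prints on failure. */
    assert(false && "unhandled kind");
  }
}
```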

Test Plan: ci pass

Differential Revision: D33823092

fbshipit-source-id: 9a1af012215bdc91f8b4162ddb2df28d51539773
(cherry picked from commit 0286910350)
2022-01-29 00:48:06 +00:00
Joel Schlosser
cb823d9f07 Revert D33744717: [pytorch][PR] Implement Tanh Gelu Approximation
Test Plan: revert-hammer

Differential Revision:
D33744717 (f499ab9cef)

Original commit changeset: d64532a562ed

Original Phabricator Diff: D33744717 (f499ab9cef)

fbshipit-source-id: 396c3f63de5865f894dbc353d0790a01a624be93
(cherry picked from commit e9fb2d1db1)
2022-01-28 18:35:01 +00:00
Ryan Spring
f499ab9cef Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def normcdf(x):
    # standard normal CDF: 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + torch.erf(x * 0.7071067811865476))

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * normcdf(x)
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: mikaylagawarecki

Differential Revision: D33744717

Pulled By: jbschlosser

fbshipit-source-id: d64532a562ed53247bb4fa52bb16722634d5c187
(cherry picked from commit 4713dd9cca)
2022-01-28 16:59:09 +00:00
Will Constable
4523a73288 Fix usages of TORCH_CHECK/_INTERNAL_ASSERT without condition (#71879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71879

Two locations of improper macro usage were reported (https://github.com/pytorch/pytorch/issues/71848), and this diff fixes them. In both cases this is behavior-changing, since the incorrect usages would have passed the assertion by interpreting the error string as the condition; both cases should have been 'assert false'.
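
Sketched with a hypothetical message (TORCH_INTERNAL_ASSERT behaves analogously):

```
#include <ATen/core/Tensor.h>
#include <c10/util/Exception.h>

void check_input(const at::Tensor& tensor) {
  // Incorrect: the string literal becomes the "condition", which is always
  // truthy, so this check can never fail:
  //   TORCH_CHECK("expected a contiguous tensor");

  // Correct: condition first, then the message.
  TORCH_CHECK(tensor.is_contiguous(), "expected a contiguous tensor");
}

void unreachable_branch() {
  // For an intentional unconditional failure ('assert false'):
  TORCH_CHECK(false, "unreachable");
}
```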

Test Plan: Run CI

Reviewed By: alanwaketan

Differential Revision: D33800406

fbshipit-source-id: dfe3d9a6455e6eb96cb639022f8813a8bd6520c3
(cherry picked from commit ee551e5a16)
2022-01-27 04:20:55 +00:00
Nolan O'Brien
0fdb90da5e [warning] Fix TORCH_INTERNAL_ASSERT calls missing condition to check 1/x (#71711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71711

This will fix a ton of broken asserts that should always fire but never actually fire.
All would have been caught with `-Wstring-conversion` warnings enabled.

Test Plan: CI Pass

Differential Revision: D33743605

fbshipit-source-id: 062641f9d5d02c6e317c5a286fd01017cf77237f
(cherry picked from commit 639b42e04b)
2022-01-25 15:45:21 +00:00
CodemodService FBSourceClangFormatLinterBot
88012c7daf [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33577744

fbshipit-source-id: 7ecc8367998ee1dffde54c2f4dd3cfafe19a53c9
2022-01-14 06:10:57 -08:00
Mike Ruberry
3a0c680a14 Jiterates exp2, erfc, erfinv and entr and refactors code_template.h to ATen (#71295)
Summary:
Per title.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71295

Reviewed By: ngimel

Differential Revision: D33575885

Pulled By: mruberry

fbshipit-source-id: bc841b46fc0b5458a26a4d4465b18a7a54cd5a5b
2022-01-13 23:58:51 -08:00
Shintaro Iwasaki
5cae40c169 [pytorch][aten][cuda] move CUDAGeneratorImpl.h to ATen/cuda (#70650)
Summary:
This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h` to `ATen/cuda` as the following TODO comment in  `CUDAGeneratorImpl.h` suggests:
```
// TODO: this file should be in ATen/cuda, not top level
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650

Reviewed By: jianyuh, xw285cornell

Differential Revision: D33414890

Pulled By: shintaro-iwasaki

fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b
2022-01-10 22:27:04 -08:00
Scott Wolchok
ddea6980fe [PyTorch][JIT] Don't refcount Type singletons (#69579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69579

This should help us avoid reference counting overhead on singleton Type subclasses without a major rewrite of the Type subsystem.
ghstack-source-id: 146643993

Test Plan:
Ran //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark with arguments `--op empty -niter 40 --stressTestRecordFunction --captureRecordFunctionInputs` on devbig with turbo off.

Before:
```
I1206 13:47:15.037441 1201670 bench.cpp:144] Mean 0.737675
I1206 13:47:15.037463 1201670 bench.cpp:145] Median 0.736725
I1206 13:47:15.037468 1201670 bench.cpp:146] Min 0.722897
I1206 13:47:15.037473 1201670 bench.cpp:147] stddev 0.00508187
I1206 13:47:15.037482 1201670 bench.cpp:148] stddev / mean 0.00688903
```

After:
```
I1206 13:48:16.830123 1205612 bench.cpp:144] Mean 0.66988
I1206 13:48:16.830150 1205612 bench.cpp:145] Median 0.663956
I1206 13:48:16.830157 1205612 bench.cpp:146] Min 0.65986
I1206 13:48:16.830164 1205612 bench.cpp:147] stddev 0.0335928
I1206 13:48:16.830171 1205612 bench.cpp:148] stddev / mean 0.0501475
```

Static runtime startup is also improved; for CMF local_ro, time to initialize a predictor went from 10.01s to 9.59s.

(Note: I wish I had a production workload to demonstrate the advantage of this on. I tried ctr_mobile_feed local_ro net but it was neutral. Anything that manipulates types or List/Dict a lot might be promising.)

Reviewed By: suo

Differential Revision: D32923880

fbshipit-source-id: c82ed6689b3598e61047fbcb2149982173127ff0
2022-01-06 17:39:16 -08:00
Peter Bell
fa09099ba3 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D33336948

Pulled By: albanD

fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
2022-01-02 13:09:19 -08:00
jjsjann123
e429a68478 Allow single node fusion for nvfuser (#70000)
Summary:
Setting `PYTORCH_NVFUSER_ONE_OP_FUSION=1` will fuse every node nvFuser supports, instead of waiting for a fusion opportunity.
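
A sketch of setting the flag (the script name is a placeholder):

```
# fuse every node nvFuser supports, even single-op groups
PYTORCH_NVFUSER_ONE_OP_FUSION=1 python my_model.py
```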

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70000

Reviewed By: samdow

Differential Revision: D33292195

Pulled By: davidberard98

fbshipit-source-id: 8ed5ce5e82fbb6737e8ab5ce4223b038eaf47756
2021-12-23 17:07:57 -08:00
CodemodService FBSourceClangFormatLinterBot
181120f7d7 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33229251

fbshipit-source-id: 3a69bb459fa0a65888d6f9c8e70b5de032ddad97
2021-12-19 16:38:25 -08:00
jiej
78f06e0690 fixing conv2d decomposition and tests (#70127)
Summary:
The current implementation has a bug where the decomposed `add_optional` from `conv2d` is placed before its producer node, which causes a linter error on the graph.

Cherry-picked from https://github.com/csarofeen/pytorch/pull/1333
Fixing issue posted in https://github.com/csarofeen/pytorch/issues/1325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70127

Reviewed By: ejguan

Differential Revision: D33199018

Pulled By: jansel

fbshipit-source-id: bce1f14a443811b4d55116a04fd4daa86084cc47
2021-12-19 10:38:23 -08:00
Nikita Shulga
26e32988bd Revert D32596264: Codegen: TraceType only includes operators being registered
Test Plan: revert-hammer

Differential Revision:
D32596264 (e66a8ab4f5)

Original commit changeset: 2f28b62d7b99

Original Phabricator Diff: D32596264 (e66a8ab4f5)

fbshipit-source-id: 7d18c4e77ce30dd7817a95f9c39b565cb246cd12
2021-12-17 11:20:12 -08:00
Peter Bell
e66a8ab4f5 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, malfet

Differential Revision: D32596264

Pulled By: albanD

fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
2021-12-17 10:35:05 -08:00
CodemodService FBSourceClangFormatLinterBot
de2d9e2966 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33183467

fbshipit-source-id: d7c37f3522a38e85891524c544eab4fdb01270de
2021-12-17 09:45:20 -08:00
Nikita Shulga
92463573d8 Sanitize string before passing it as shell argument (#70070)
Summary:
Use `c10::printQuotedString` to escape any characters that might cause
the string to be interpreted as more than one argument by the shell.

Please note that this codepath is deprecated and is not accessible
through typical PyTorch usage workflows.

This issue was discovered by Daniel Lawrence of the Amazon Alexa team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70070

Reviewed By: suo

Differential Revision: D33172721

Pulled By: malfet

fbshipit-source-id: 9dbd17f6eb775aaa1a545da42cbc95864c1189ee
2021-12-17 08:08:28 -08:00
jiej
76d282d447 Nvfuser code bump 12 5 (#69964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964

Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:
1. performance tuning of the codegen scheduler.
2. permutation support has been extended beyond contiguous/channels-last (the improvements can be observed on the PW benchmark).

Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that is upstreamed in PR https://github.com/pytorch/pytorch/issues/68804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428

Reviewed By: ngimel

Differential Revision: D33073817

Pulled By: wconstab

fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
2021-12-16 08:28:54 -08:00
Peter Bell
b2e79ed5ec Remove WindowsTorchApiMacro.h in favor of Export.h (#69585)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/68095

This also changes the files from the ATen folder to include c10's `Export.h` instead since they can't ever be exporting `TORCH_PYTHON_API`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69585

Reviewed By: mrshenli

Differential Revision: D32958594

Pulled By: albanD

fbshipit-source-id: 1ec7ef63764573fa2b486928955e3a1172150061
2021-12-09 17:30:09 -08:00
Peter Bell
e279963eef Remove remaining THC code (#69039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69039

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872476

Pulled By: ngimel

fbshipit-source-id: 7972aacc24aef9450fb59b707ed6396c501bcb31
2021-12-08 12:18:08 -08:00
jjsjann123
0dc3f829d9 Nvfuser code bump 11 5 (#67943)
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; we can now support dimension collapsing with non-coherent input tensors of different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we are currently limiting memory formats to only Contiguous and Channels Last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separating the node merge and profile node APIs. Updated `profiling_record.cpp`.

Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943

Reviewed By: ngimel

Differential Revision: D32288709

Pulled By: dzhulgakov

fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
2021-11-17 01:22:17 -08:00
Alexander Grund
e4a9ee8d42 Deduplicate codegenOutputQuery to query maximum CUDA compute capabilities (#55901)
Summary:
There were two versions of the same code which were slightly different although functionally equivalent. When adding support for another CUDA/device version, both would need to be changed and kept in sync, so it is better to have a single version as the unique source of truth.

I chose the implementation that looks cleaner and easier to read, and added some minor enhancements and comments to further increase readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55901

Reviewed By: H-Huang

Differential Revision: D31636917

Pulled By: bertmaher

fbshipit-source-id: 622e1fabc39de4f3f1b1aa9a1544cfbd35a5cfd9
2021-10-18 07:42:15 -07:00
Scott Wolchok
2d885ab73d [jit] Reduce refcounting of Types (#65345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345

FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
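
The signature shape this implies, sketched in simplified form (not the full Type API):

```
// Taking shared_ptr forces callers holding a shared_ptr<FooType> to convert
// (copy + refcount bump) just to ask a question:
bool isSubtypeOf(const std::shared_ptr<Type>& rhs) const;

// Taking a const reference borrows instead: no ownership, no bump.
bool isSubtypeOf(const Type& rhs) const;
```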
ghstack-source-id: 140044165

Test Plan:
CI

perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.

Reviewed By: hlu1

Differential Revision: D31027361

fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
2021-10-08 09:03:04 -07:00
jiej
321345d7c9 Revert "Revert D31227448: [pytorch][PR] fixing sorting in stride indices" (#66176)
Summary:
enabling https://github.com/pytorch/pytorch/issues/63940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66176

Reviewed By: ngimel

Differential Revision: D31423920

Pulled By: dzhulgakov

fbshipit-source-id: 06b1e0f757f4fb5b31ee1fa464bcd689df919b9c
2021-10-07 22:09:07 -07:00
Bin Bao
6e06cb76ff [JIT] Initialize CUDA context before launching fused kernel (#65064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65064

The problem appears when nvfuser is triggered from LazyTensor.
Because LT maintains its own thread pool, the thread used for the first-time
compilation does CUDA context initialization properly, but later
cached execution may use a different thread which does not have
a proper CUDA context.

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D31269691

Pulled By: desertfire

fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2
2021-10-05 16:01:59 -07:00
Nikita Shulga
4c4525fa5c Compile without -Wno-unused-variable (take 2) (#66041)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variables in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Do not delete `caffe2::OperatorBase::Output` calls as they have side effects
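
The two suppression idioms mentioned above, in a small sketch (the surrounding code is illustrative):

```
#include <c10/macros/Macros.h>
#include <vector>

// a global whose constructor runs only for its side effect:
C10_UNUSED static const bool kRegistered = [] { return true; }();

void consume(const std::vector<int>& v) {
  // range loop whose variable is intentionally unused:
  for (const auto& item : v) {
    (void)item;  // silences -Wunused-variable without changing behavior
  }
}
```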

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041

Reviewed By: ngimel

Differential Revision: D31360142

Pulled By: malfet

fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
2021-10-04 20:39:39 -07:00
soulitzer
4cdfceddd2 [Reland] Avoid saving self for softmax and log_softmax (#66018)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/65242

The last reland attempt was automatically rebased onto stable, which did not yet have the revert commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66018

Reviewed By: albanD

Differential Revision: D31348822

Pulled By: soulitzer

fbshipit-source-id: 881d701b404530c1352ac9245bd67264e1652b8a
2021-10-03 21:35:01 -07:00
Nikita Shulga
e4ee5ca698 Revert D31326599: [pytorch][PR] Compile without -Wno-unused-variable
Test Plan: revert-hammer

Differential Revision:
D31326599 (a6280ab653)

Original commit changeset: 924155f1257a

fbshipit-source-id: b8ee5bc0298637443232f5ee9ec79e51ed256faf
2021-10-01 20:40:47 -07:00
Nikita Shulga
a6280ab653 Compile without -Wno-unused-variable (#65954)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variables in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954

Reviewed By: ngimel

Differential Revision: D31326599

Pulled By: malfet

fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
2021-10-01 17:40:47 -07:00
Michael Suo
ccf8d48f16 Revert D31317680: [pytorch][PR] Avoid saving self for softmax and log_softmax
Test Plan: revert-hammer

Differential Revision:
D31317680 (5f7cadc7aa)

Original commit changeset: b3b921e06775

fbshipit-source-id: 1bca0672383536a2c21243ceb52349c766a94344
2021-10-01 09:31:44 -07:00
soulitzer
5f7cadc7aa Avoid saving self for softmax and log_softmax (#65242)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64000
 - updates double backward formula to compute grad wrt output instead of self
 - ~~In some of the error messages, we still refer to the dtype of the input, even though we are now checking the dtype of the output~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65242

Reviewed By: malfet

Differential Revision: D31317680

Pulled By: soulitzer

fbshipit-source-id: b3b921e06775cfc12e5a97a9ee8d73aec3aac7c3
2021-10-01 07:49:07 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead.

- In the next PR
   - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Nikita Shulga
82e0bf44c0 Apply linter suggestions to #65137 (#65459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65459

Just ran the linter on the change and applied all suggestions

Test Plan: N/A

Reviewed By: seemethere

Differential Revision: D31102960

fbshipit-source-id: 04e1d07935690f2ddbc64533661b3e55379d13b5
2021-09-27 13:07:40 -07:00
CodemodService FBSourceClangFormatLinterBot
2a4d5e4c6d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31138547

fbshipit-source-id: ba134ae7f057c918eaefdc6310f7663e187e9749
2021-09-23 07:54:32 -07:00
jiej
127c9402d0 Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137)
Summary:
This reverts commit 03389dc851.

Attempt again for PR: https://github.com/pytorch/pytorch/issues/63745
Fixes the Windows build failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d
2021-09-22 04:54:51 -07:00
Eli Uriegas
03389dc851 Revert D30752939: [pytorch][PR] nvfuser update
Test Plan: revert-hammer

Differential Revision:
D30752939 (cfaecaf40b)

Original commit changeset: ce122e80f01b

fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2
2021-09-15 17:38:47 -07:00
jiej
cfaecaf40b nvfuser update (#63745)
Summary:
Syncing the nvfuser code base from the devel branch. Listing a few of our developments since the last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants, which are required by the codegen (e.g. reduction axes).

To keep this PR simple and relatively review-free. We stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

Internal updates are in files located under:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`

updates affecting integration:

1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
2021-09-15 14:42:55 -07:00
Zhengxu Chen
ac99d63f83 [jit] Make operation call accept Stack& instead Stack* (#63414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414

Misuse of a raw pointer here, where the stack is never nullable.
ghstack-source-id: 136938318

Test Plan:
compiles.

Imported from OSS

Reviewed By: ejguan

Differential Revision: D30375410

fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
2021-08-30 11:49:20 -07:00
Bert Maher
a709ab34a8 [nnc] Re-enable CPU fusion (#63665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63665

This reverts commit 125e2d02e5.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30471646

Pulled By: bertmaher

fbshipit-source-id: 4189869566f03b5f9ada78d78830f6a34946eed6
2021-08-23 12:42:42 -07:00
Alban Desmaison
125e2d02e5 Revert D30417370: [nnc] Enable CPU fusion
Test Plan: revert-hammer

Differential Revision:
D30417370 (b9fc656cf2)

Original commit changeset: 84ce7a578a36

fbshipit-source-id: cd23774cdc3273fd72f8a05f1900eaf36f373e6b
2021-08-20 12:30:21 -07:00
Bert Maher
b9fc656cf2 [nnc] Enable CPU fusion (#63545)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63545

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30417370

Pulled By: bertmaher

fbshipit-source-id: 84ce7a578a3678d5562bab99d1dc00330c4f72d1
2021-08-20 11:18:21 -07:00
Pruthvi Madugundu
ab7a472980 [ROCm] Update HIP_VERSION to TORCH_HIP_VERSION (#62786)
Summary:
- HIP_VERSION semantic versioning will change in ROCm 4.3. The changes essentially remove the dependency on the HIP_VERSION provided in the hip header, keeping the code compatible with older and newer versions of ROCm.
- TORCH_HIP_VERSION is derived from HIP_VERSION_MAJOR and HIP_VERSION_MINOR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62786

Reviewed By: bdhirsh

Differential Revision: D30281682

Pulled By: seemethere

fbshipit-source-id: e41e69fb9e13de5ddd1af99ba5bbdcbb7b64b673
2021-08-13 15:00:43 -07:00
Richard Barnes
d2594fa538 irange-ify 3 (#62112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62112

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879513

fbshipit-source-id: c01d18d34bb19014bf28d92c4d04b07e50a2770a
2021-07-26 12:56:58 -07:00
Richard Barnes
ee44d73e59 Modernize override (#61744)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61744

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717320

fbshipit-source-id: 6eea4295ee2e5572ab337620be412376fcc2f3cc
2021-07-23 23:04:46 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Natalia Gimelshein
6284d2a82b wrap cudaStreamSynchronize calls (#61889)
Summary:
This is a first step towards creating a context manager that errors out on synchronizing calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889

Reviewed By: albanD

Differential Revision: D29805280

Pulled By: ngimel

fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
2021-07-21 19:30:52 -07:00
Richard Barnes
59a5312ce6 Modernize fix deprecated header (#61736)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61736

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29716965

fbshipit-source-id: 314c2b557c240ac16bbfab114ab764beb189e78a
2021-07-20 10:06:11 -07:00
Richard Barnes
349f2f767c Modernize to default constructor and nullptr in torch (#61735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61735

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29716659

fbshipit-source-id: ec2a0a0b7e55d2e50b1d35f0b651bd40675ae7e8
2021-07-16 10:51:13 -07:00
Nikita Shulga
635d864b26 Fix modernize-use-equals-default nolint failures in torch/csrcs (#61142)
Summary:
Test-plan: Compile + clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61142

Reviewed By: VitalyFedyunin

Differential Revision: D29529372

Pulled By: malfet

fbshipit-source-id: 2ccde7712a51c28243b16bbb4d1d68086e0414a6
2021-07-06 09:46:46 -07:00
Mike Guo
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
Richard Barnes
b162d95e46 Fix a number of lint perf and safety issues in torch (#59897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59897

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29037012

fbshipit-source-id: 7c16286d5fc2b67964fb65f8374dfff4d1a7aefb
2021-06-15 13:14:51 -07:00
Richard Barnes
fbe65b16ae Use irange in torch/csrc/jit (#55716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55716

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27690245

fbshipit-source-id: 6052b0acd792a9527d131822453a17cdb7ae3ba5
2021-06-07 16:48:08 -07:00
Bert Maher
6309b342c3 [nnc] Enable CPU fuser inside FB, take 5 (#59461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59461

long tail test failures
ghstack-source-id: 130607578

Test Plan: fixed T92123560

Reviewed By: navahgar

Differential Revision: D28892885

fbshipit-source-id: 762a275b5aa14af0847e46cbf4036d3342b82189
2021-06-04 16:26:46 -07:00
Bert Maher
46d724c919 Revert D28859795: [nnc] Enable CPU fusion inside Facebook, take 4
Test Plan: revert-hammer

Differential Revision:
D28859795 (6baa66ece9)

Original commit changeset: 826801db24e8

fbshipit-source-id: c85a0fc7e88c95af939d5c0f50c0c8878e1174d3
2021-06-03 16:29:51 -07:00
Bert Maher
6baa66ece9 [nnc] Enable CPU fusion inside Facebook, take 4
Summary:
Fixed the awkward Configerator initialization issue that broke some
tests. Trying again.

Test Plan: predictor comparisons

Reviewed By: ZolotukhinM

Differential Revision: D28859795

fbshipit-source-id: 826801db24e86b1c3594a86e3ac32f0a84c496f7
2021-06-03 09:33:13 -07:00
Richard Barnes
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
Bert Maher
afd5237a4f Revert D28800692: [nnc] Enable CPU fusion inside Facebook, take 3
Test Plan: revert-hammer

Differential Revision:
D28800692 (6e7dae9cec)

Original commit changeset: d791c3b2ccd7

fbshipit-source-id: 5042fecfbab59181572013bf39760bc716e86430
2021-06-02 10:07:46 -07:00
Bert Maher
6e7dae9cec [nnc] Enable CPU fusion inside Facebook, take 3 (#59253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59253

Fixed a miscompilation exposed by multithreaded profiling collection; let's try again.
ghstack-source-id: 130286580

Test Plan: servicelab

Reviewed By: navahgar, huiguoo

Differential Revision: D28800692

fbshipit-source-id: d791c3b2ccd75fe5e6eca0859083d4cd67460147
2021-06-01 15:42:22 -07:00
Jeff Daily
ba694520e5 [ROCm] fix JIT codegen (#57400)
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.

- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400

Reviewed By: ejguan

Differential Revision: D28421065

Pulled By: malfet

fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
2021-05-27 11:45:07 -07:00
Bert Maher
a6b358d53b Revert D28461013: [nnc] Enable CPU fusion inside Facebook, take 2
Test Plan: revert-hammer

Differential Revision:
D28461013 (c76405d3b1)

Original commit changeset: 79a80b6ffb65

fbshipit-source-id: d9cc5c512542153f39664635fb080d797a9de7d0
2021-05-19 15:27:38 -07:00
Bert Maher
c76405d3b1 [nnc] Enable CPU fusion inside Facebook, take 2 (#58347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58347

Back out "Revert D27652484 (ac04cc775b): [nnc] Enable CPU fusion inside Facebook"
Original commit changeset: ecfef3ee1e71
ghstack-source-id: 129279584

Test Plan: Tests for bugfix included in this stack

Reviewed By: navahgar

Differential Revision: D28461013

fbshipit-source-id: 79a80b6ffb653ab952ff5efaa143d3362bb7d966
2021-05-18 21:45:48 -07:00
Bert Maher
c4c2039fb2 Revert D27652484: [nnc] Enable CPU fusion inside Facebook
Test Plan: revert-hammer

Differential Revision:
D27652484 (ac04cc775b)

Original commit changeset: a82681455dae

fbshipit-source-id: ecfef3ee1e7197148b172234691e72faf4b95cf8
2021-05-14 16:41:23 -07:00
Bert Maher
ac04cc775b [nnc] Enable CPU fusion inside Facebook (#58029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58029

We've been testing this for months, it's time.
ghstack-source-id: 128932738

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D27652484

fbshipit-source-id: a82681455dae0db19c8ac9918065b6e186c9e71a
2021-05-14 00:10:10 -07:00
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add a cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy.
Remove existing nolint warnings using the following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Ailing Zhang
0ecdbfebff s/InplaceOrView/ADInplaceOrView/g (#57372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57324

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28121821

Pulled By: ailzhang

fbshipit-source-id: f568dd2505f6279da9ffb93ce1d22e0f98c606bb
2021-05-01 22:56:18 -07:00
Nikita Shulga
eac02f85cf Fix more clang-tidy errors (#57235)
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change is autogenerated by `python tools/clang_tidy.py -s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235

Reviewed By: janeyx99

Differential Revision: D28084444

Pulled By: malfet

fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
2021-04-28 23:29:10 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Ailing Zhang
be7a943bb8 s/AutoDispatchBelowAutograd/AutoDispatchBelowInplaceOrView. (#56657)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56657

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27931526

Pulled By: ailzhang

fbshipit-source-id: 3af718df3435e2b0b30bc62070dbdc5aeeecdfb4
2021-04-23 15:50:00 -07:00
Aswin John Mathews
73eaa0a5f5 Fixing error in jit cuda on ROCm: non-constant-expression cannot be n… (#55243)
Summary:
On ROCm, compiling grid_reduction.cu failed with the error
"non-constant-expression cannot be narrowed from type 'int' to 'uint32_t'".

Added a typecast to fix the issue.

Also removed the test skip on ROCm, re-enabling the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55243

Reviewed By: malfet

Differential Revision: D27917066

Pulled By: ngimel

fbshipit-source-id: b0b7c5fc8ecd2624222b35fe060846f7d1670f07
2021-04-21 16:35:27 -07:00
Ailing Zhang
3d904b56ec s/AutoNonVariableTypeMode/AutoDispatchBelowAutograd/ (#56423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56423

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866606

Pulled By: ailzhang

fbshipit-source-id: e3942356dc3133d1c5722de40ec0d45e6a60f2f1
2021-04-20 17:17:46 -07:00
Jeff Daily
e1752ffa04 [reland][ROCm] use hiprtc precompiled header (#55965)
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header".  Reland PR https://github.com/pytorch/pytorch/issues/54350.

This reverts commit 204ac21bf1.

The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965

Reviewed By: jbschlosser

Differential Revision: D27755907

Pulled By: malfet

fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
2021-04-15 15:47:56 -07:00
Richard Barnes
d690973295 irange on int64_t (#55148)
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```
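
A concrete instance of the converted form (the function is illustrative):

```
#include <c10/util/irange.h>
#include <cstdint>
#include <vector>

void zero_all(std::vector<int64_t>& values) {
  // iterates i = 0, 1, ..., values.size() - 1 with a const index
  for (const auto i : c10::irange(values.size())) {
    values[i] = 0;
  }
}
```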

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27447811

fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
2021-04-05 16:14:00 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Alexander Golynski
204ac21bf1 Revert D27449031: [pytorch][PR] [ROCm] use hiprtc precompiled header
Test Plan: revert-hammer

Differential Revision:
D27449031 (2a7df657fe)

Original commit changeset: 81a8d7847a47

fbshipit-source-id: b7b970c8ea4110357fba3ad4d52a86fa5641d90c
2021-04-01 06:42:04 -07:00
Jeff Daily
2a7df657fe [ROCm] use hiprtc precompiled header (#54350)
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release.  Conditionally add support for this feature.  Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.

The use of this feature is conditionalized on a new ROCM_VERSION macro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350

Reviewed By: H-Huang

Differential Revision: D27449031

Pulled By: malfet

fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
2021-03-31 13:36:50 -07:00
Bert Maher
9db4802184 [fuser] Support bfloat16 (#54571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54571

Supports bfloat16 via a similar method to half: upconvert inputs to
fp32, do math, then downconvert outputs to bf16.

Resource strings are mostly derived from cuda-11 headers.

Fixes #53918, for the legacy fuser at least.
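
A minimal standalone sketch of that pattern (not the fuser's actual generated code; the scale op is illustrative):

```
#include <cuda_bf16.h>

__global__ void scale_bf16(const __nv_bfloat16* in, __nv_bfloat16* out,
                           float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float x = __bfloat162float(in[i]);     // upconvert input to fp32
    out[i] = __float2bfloat16(alpha * x);  // do math in fp32, store as bf16
  }
}
```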

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27328987

Pulled By: bertmaher

fbshipit-source-id: 5c0eae44164623faa0c75cb818e8bf0211579fdc
2021-03-25 15:59:15 -07:00
johnlu
36ce673f16 Disable the fusion group which is not supported by XPU device. (#54239)
Summary:
The XPU device doesn't support the fusion group. Disable it for XPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54239

Reviewed By: zou3519

Differential Revision: D27188735

Pulled By: ezyang

fbshipit-source-id: f28f62148e7aa12e8b08345df7eb0133216ce6a5
2021-03-22 07:43:28 -07:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Richard Barnes
29c4290a8d Use c10::irange for great good (#52153)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52153

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D26407087

fbshipit-source-id: ea8ce1c17299cb9d89621e4a39f31edc2faa9fd6
2021-02-24 18:43:50 -08:00
Jeff Daily
4df8e774e6 [ROCm] warn unsupported PYTORCH_CUDA_FUSER_DISABLE_FMA (#50508)
Summary:
nvcc's `--fmad=false` is not valid for the HIP compiler.  Upcoming ROCm releases will start treating unrecognized compiler flags as an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50508

Reviewed By: albanD

Differential Revision: D25920291

Pulled By: mrshenli

fbshipit-source-id: c0ff3b74dd07f3d0661ba29efafaab291ef3621c
2021-02-16 08:09:57 -08:00
Nikita Shulga
b8f3a658f9 Do not include "DynamicLibrary.h" into a top-level header (#52182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52182

DynamicLibrary provides very specific functionality, so there is no need to expose it to every project depending on `ATen.h`.

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26417404

Pulled By: malfet

fbshipit-source-id: f8318cacb07dcc8b2f95984f88ea1df4e5369b8b
2021-02-13 19:34:46 -08:00
Nikita Shulga
5499e839f1 [Fuser] Do not attempt to use OpenMP if build without OpenMP support (#51504)
Summary:
Clang from Xcode does not support the `-fopenmp` option, so there is no need to try compiling with it.
Infer whether OpenMP is supported by checking the _OPENMP define.
Also, use the clang compiler if the host app was compiled with clang rather than gcc.
Fix a few range-loop warnings and add static_asserts that range-loop variables are raw pointers.

This change makes fuser tests on OS X a bit faster.

Before:
```
% python3 test_jit.py -v  TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
warning: pytorch jit fuser failed to compile with openmp, trying without it...
ok

----------------------------------------------------------------------
Ran 1 test in 0.468s

OK
```

After:
```
% python3 test_jit.py -v  TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.435s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51504

Reviewed By: smessmer

Differential Revision: D26186875

Pulled By: malfet

fbshipit-source-id: 930b3bcf543fdfad0f493d687072aaaf5f9e2bfc
2021-02-02 15:31:59 -08:00
jjsjann123
392abde8e6 patch nvrtc API for cuda TK >= 11.1 (#50319)
Summary:
CUDA TK >= 11.1 provides a ptxjitcompiler that emits SASS instead of PTX.
1. This gives better backward compatibility: a future toolkit can work with an older driver, which might not be able to JIT-compile the generated PTX and would otherwise error out at runtime;
https://docs.nvidia.com/deploy/cuda-compatibility/#using-ptx
2. Meanwhile, SASS doesn't provide good forward compatibility, so for unsupported architectures we fall back to PTX to support future devices (see the sketch below).
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cubin-compatibility
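
A hedged sketch of the resulting dispatch; the arch check is illustrative, and the cubin entry points assume NVRTC from CUDA >= 11.1:
```
#include <nvrtc.h>
#include <vector>

// Fetch SASS when the target arch is supported by this toolkit, else PTX.
std::vector<char> getCompiledCode(nvrtcProgram prog, bool arch_supported) {
  size_t size = 0;
  std::vector<char> code;
  if (arch_supported) {
    // SASS (cubin): loads on older drivers without a runtime JIT step.
    nvrtcGetCUBINSize(prog, &size);
    code.resize(size);
    nvrtcGetCUBIN(prog, code.data());
  } else {
    // PTX: JIT-compiled by the driver, so it also runs on future devices.
    nvrtcGetPTXSize(prog, &size);
    code.resize(size);
    nvrtcGetPTX(prog, code.data());
  }
  return code;
}
```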

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50319

Reviewed By: malfet

Differential Revision: D26114475

Pulled By: ngimel

fbshipit-source-id: 046e9e7b3312d910f499572608a0bc1fe53feef5
2021-01-27 23:58:20 -08:00
Jane Xu
533cb9530e Introducing TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API to the code (#50627)
Summary:
Sub-step of my attempt to split up the torch_cuda library, as it is huge. Please look at https://github.com/pytorch/pytorch/issues/49050 for details on the split and which files are in which target.

This PR introduces two new macros for Windows DLL purposes, TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API. Both are defined as TORCH_CUDA_API for the time being.
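
Per the description, the interim definitions amount to:
```
// Both macros alias the existing export macro until the split lands.
#define TORCH_CUDA_CPP_API TORCH_CUDA_API
#define TORCH_CUDA_CU_API TORCH_CUDA_API
```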

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50627

Reviewed By: mruberry

Differential Revision: D25955441

Pulled By: janeyx99

fbshipit-source-id: ff226026833b8fb2fb7c77df6f2d6c824f006869
2021-01-21 19:09:11 -08:00
Scott Wolchok
4a0d17ba2d [PyTorch][codemod] Replace immediately-dereferenced expect calls w/expectRef (#50228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228

`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->'
'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we don't need a new `shared_ptr` and can
take a reference instead.
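
A sketch of the before/after shape of the codemod (the receiver expression is illustrative):
```
// Before: expect<T>() materializes a new shared_ptr that is immediately discarded.
auto sizes = value->type()->expect<TensorType>()->sizes();

// After: expectRef<T>() hands back a reference, skipping the refcount bump.
auto sizes2 = value->type()->expectRef<TensorType>().sizes();
```
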
ghstack-source-id: 119782961

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D25837374

fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
2021-01-13 16:13:55 -08:00
Chester Liu
9d8bd216f9 Use Unicode friendly API in fused kernel related code (#49781)
Summary:
See https://github.com/pytorch/pytorch/issues/47422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49781

Reviewed By: gchanan

Differential Revision: D25847993

Pulled By: ezyang

fbshipit-source-id: e683a8d5841885857ea3037ac801432a1a3eda68
2021-01-10 20:03:00 -08:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Thomas Viehmann
ea087e2d92 JIT: guard DifferentiableGraph node (#49433)
Summary:
This adds guarding for DifferentiableGraph nodes so that execution does not depend on unverified profiled tensor properties.
It also bails out when gradients are required for the CUDA fuser.

Fixes https://github.com/pytorch/pytorch/issues/49299

I still need to look into a handful of failing tests, but maybe this can serve as a basis for discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433

Reviewed By: ngimel

Differential Revision: D25681374

Pulled By: Krovatkin

fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
2021-01-08 20:01:27 -08:00
Scott Wolchok
480a756194 [PyTorch] IValue::toTensor can now return const Tensor& (#48868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48868

Building on the previous diff, we can make `toTensor()` return a
`const Tensor&`, which should make it easier to avoid reference
counting.
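
A hedged sketch of what this enables at call sites (function and variable names are illustrative):
```
#include <ATen/core/ivalue.h>

// Binding a reference to the Tensor stored inside the IValue avoids the
// atomic incref/decref that copying out a Tensor would pay.
void peek(const c10::IValue& iv) {
  const at::Tensor& t = iv.toTensor();  // borrows; no ownership transfer
  (void)t.sizes();
}
```
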
ghstack-source-id: 119327372

Test Plan: internal benchmarks.

Reviewed By: bwasti

Differential Revision: D25325379

fbshipit-source-id: ca699632901691bcee432f595f75b0a4416d55dd
2021-01-06 08:40:50 -08:00
Chester Liu
2ac180a5dd Fix cl.exe detection in cpu/fused_kernel.cpp (#50085)
Summary:
The command used here is essentially `where cl.exe`. By using `system()` we will not be able to find cl.exe unless we are running in a VS Developer Prompt, which makes `activate()` meaningless. Changing `system()` to `run()` fixes this.

Found during https://github.com/pytorch/pytorch/issues/49781.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50085

Reviewed By: smessmer

Differential Revision: D25782054

Pulled By: ezyang

fbshipit-source-id: e8e3cac903a73f3bd78def667ebe0e93201814c8
2021-01-06 07:16:41 -08:00
caozhong
aff0b68a58 Fix include files for out-of-tree compilation (#48827)
Summary:
Signed-off-by: caozhong <zhong.z.cao@intel.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48827

Reviewed By: agolynski

Differential Revision: D25375988

Pulled By: ailzhang

fbshipit-source-id: a8d5ab4572d991d6d96dfe758011517651ff0a6b
2020-12-15 14:40:44 -08:00
jiej
a6fa3b2682 adding profile_ivalue (#47666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47666

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25255573

Pulled By: Krovatkin

fbshipit-source-id: 5d8753e4040a3d96105d28d26728125947c7a638
2020-12-09 15:29:15 -08:00
Nikita Shulga
b9cd774e29 Get rid of printf in cuda fuser debugPrint() (#46994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46994

Reviewed By: raghuramank100, mruberry

Differential Revision: D25342954

Pulled By: malfet

fbshipit-source-id: 549b5b072f7f70877261a155e989a21072ec49d8
2020-12-04 15:13:26 -08:00
jiej
dabc286ab3 Remove output used only by sizes (#448) (#47665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47665

Re-enabled the pass to remove outputs from fusion that is only used by aten::size;
Added size computation for reduction op via new operator prim::ReductionSizes;

Test Plan: Imported from OSS

Reviewed By: navahgar, jamesr66a

Differential Revision: D25254675

Pulled By: Krovatkin

fbshipit-source-id: e9a057b0287ed0ac93b415647fd8e5e836ba9856
2020-12-03 11:14:30 -08:00
Lemo
85c1e8acdc Replace kernel resource strings with real .cu source files (#48283)
Summary:
Convert nvFuser's runtime CUDA sources (under `.../jit/codegen/cuda/runtime`) to string literals, then include the headers with the generated literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48283

Reviewed By: mrshenli

Differential Revision: D25163362

Pulled By: ngimel

fbshipit-source-id: 4e6c181688ddea78ce6f3c754fee62fa6df16641
2020-12-02 21:22:29 -08:00
Elias Ellison
1195403915 [NNC] Add cpu fusion gflag (#48682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48682

Reviewed By: Krovatkin, ngimel

Differential Revision: D25260205

Pulled By: eellison

fbshipit-source-id: df1655fd75f2a13bcf7c025b1f0a7becc2fd126a
2020-12-02 19:47:18 -08:00
jjsjann123
15fc66d6c8 fix nvrtc PTX architecture cap for CUDA toolkit (#48455)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48200

CUDA 11.0 only supports < sm_80 (https://docs.nvidia.com/cuda/archive/11.0/nvrtc/#group__options)

Note: the NVRTC documentation is not a reliable source for querying supported architectures. The rule of thumb is that nvrtc supports the same set of architectures as nvcc, so the best way to query them is something like `nvcc -h | grep -o "compute_[0-9][0-9]" | sort | uniq`
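
A minimal sketch of the cap (the helper name is assumed; the sm_80 limit for CUDA 11.0 comes from the note above):
```
#include <cuda.h>
#include <utility>

// Clamp the (major, minor) compute capability passed to NVRTC to what
// this toolkit can actually target.
std::pair<int, int> capArchForNVRTC(int major, int minor) {
#if CUDA_VERSION < 11010
  if (major >= 8) {
    return {7, 5};  // CUDA 11.0's NVRTC only accepts targets below sm_80
  }
#endif
  return {major, minor};
}
```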

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48455

Reviewed By: zhangguanheng66

Differential Revision: D25255529

Pulled By: ngimel

fbshipit-source-id: e84cf51ab50519b4c97dad063cc43c9194942bb2
2020-12-02 11:50:22 -08:00
Chester Liu
8177f63c91 Reorganize and refine the Windows.h import in C++ files (#48009)
Summary:
This PR aims to reduce the import overhead and symbol noise from the `windows.h` headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009

Reviewed By: gchanan

Differential Revision: D25045840

Pulled By: ezyang

fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
2020-11-20 14:21:09 -08:00
Scott Wolchok
4c9eb57914 [PyTorch] Narrow Device to 2 bytes by narrowing DeviceType and DeviceIndex (#47023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023

DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
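
A hedged sketch of the invariant this establishes (exact field layout aside):
```
#include <c10/core/Device.h>

// After the narrowing, a Device is just two one-byte fields.
static_assert(sizeof(c10::DeviceType) == 1, "DeviceType fits in one byte");
static_assert(sizeof(c10::DeviceIndex) == 1, "DeviceIndex fits in one byte");
static_assert(sizeof(c10::Device) == 2, "Device packs type + index");
```
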
ghstack-source-id: 116901430

Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect

Reviewed By: dzhulgakov

Differential Revision: D24605460

fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
2020-11-18 19:39:40 -08:00
Scott Wolchok
1bafff2366 [PyTorch][JIT] Skip unnecessary refcounting in TensorType::merge (#47959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47959

Taking a shared_ptr by value incurs refcounting overhead and should only be done if the callee needs to take ownership. Otherwise, `const T&` is more efficient. (Specifically, you will have to do an atomic decrement when the argument is destroyed and probably an atomic increment as well. Passing by `const T&` also takes one less register than passing `std::shared_ptr<T>`, but that's less important.)

This diff fixes just this one function, but I'd be happy to audit & fix this whole file in future diffs. Thoughts?
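
A sketch of the general guideline being applied (hypothetical types and signatures):
```
#include <memory>
#include <utility>

struct Node {};

// Read-only callee: const& avoids refcount traffic at every call site.
long useCount(const std::shared_ptr<Node>& n) { return n.use_count(); }

// Owning callee: take by value and move, paying exactly one refcount bump.
struct Holder {
  void store(std::shared_ptr<Node> n) { owned_ = std::move(n); }
  std::shared_ptr<Node> owned_;
};
```
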
ghstack-source-id: 116914899

Test Plan: build ATen-cpu

Reviewed By: Krovatkin

Differential Revision: D24970954

fbshipit-source-id: 6bdb4b710a94b8baf4ad63418fb38136134e0ef3
2020-11-18 17:49:16 -08:00
Sebastian Messmer
edf751ca2f Make empty c10-full (#46092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203

(Note: this ignores all push blocking failures!)

Test Plan:
vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/

after c10::optional fix:
https://www.internalfb.com/intern/fblearner/details/231391773/

Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression.

Reviewed By: ezyang

Differential Revision: D24219944

fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
2020-11-12 17:08:21 -08:00
Elias Ellison
f221a19a7f Force LLVM Compilation for CPU Tests (#46949)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46949

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805247

Pulled By: eellison

fbshipit-source-id: 4fcaf02d8a78cc5cbcbde36940d0a2c85fba3fc5
2020-11-12 11:12:08 -08:00
jiej
ac146c4820 [nvFuser] Switching to CudaFusionGuard from BailOut for nvfuser - update 2 (#46452)
Summary:
1. Added CudaFusionGuard as the custom TypeCheck for nvfuser; enabled dynamic shape support with the profiling executor;
2. dropped support for the legacy fuser;
3. re-enabled nvfuser tests;
4. added registration for profiling records to allow profiling on user-specified nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46452

Reviewed By: zou3519, anjali411

Differential Revision: D24364642

Pulled By: ngimel

fbshipit-source-id: daf53a9a6b6636e1ede420a3a6d0397d4a8b450b
2020-10-19 15:44:31 -07:00
Thomas Viehmann
d3d8da7a8e Enable CUDA Fuser for ROCm (#45965)
Summary:
This enables the cuda fuser on ROCm and enables its tests.

Part of this patch is based on the work of Rohith Nallamaddi, thank you.
Errors are my own, of course.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45965

Reviewed By: seemethere

Differential Revision: D24170457

Pulled By: walterddr

fbshipit-source-id: 3dd25b3501a41d2f00acba3ce8642ce51c49c9a6
2020-10-08 10:41:56 -07:00
Nikita Shulga
1558a3657b Add LazyNVRTC (#45674)
Summary:
Instead of dynamically loading `caffe2_nvrtc`, lazyNVRTC provides the same functionality by binding all the hooks to a lazy-binding implementation, very similar to shared-library jump tables:
On the first call, each function in the list gets a global handle to the respective shared library and replaces itself with the dynamically resolved symbol, using the following template:
```
  auto fn = reinterpret_cast<decltype(&NAME)>(getCUDALibrary().sym(C10_SYMBOLIZE(NAME)));
  if (!fn)
    throw std::runtime_error("Can't get " C10_SYMBOLIZE(NAME));
  lazyNVRTC.NAME = fn;
  return fn(...);
```
Fixes https://github.com/pytorch/pytorch/issues/31985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45674

Reviewed By: ezyang

Differential Revision: D24073946

Pulled By: malfet

fbshipit-source-id: 1479a75e5200e14df003144625a859d312885874
2020-10-05 16:27:40 -07:00
jjsjann123
99e0a87bbb [nvFuser] Latency improvements for pointwise + reduction fusion (#45218)
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved syncthreads placement with shared memory and removed a read-before-write race
- Fixed FP16 reduction fusions where the output would come back as FP32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
2020-09-24 23:17:20 -07:00
Bert Maher
6ec8fabc29 Fix frac in CUDA fuser (#44152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44152

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23528506

fbshipit-source-id: bfd468d72fa55ce317f88ae83e1f2d5eee041aa0
2020-09-09 11:10:08 -07:00
Elias Ellison
5bd2902796 [JIT] Remove references to no longer generated _tanh_backward and _sigmoid_backward (#44138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44138

If you look at the sigmoid and tanh backward functions, they are composed of other ops: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/symbolic_script.cpp#L786
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/symbolic_script.cpp#L164

So tanh_backward and sigmoid_backward are no longer generated; they are legacy ops.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23543603

Pulled By: eellison

fbshipit-source-id: ce8353e53043cf969b536aac47c9576d66d4ce02
2020-09-05 01:41:36 -07:00
Ashkan Aliabadi
4e39c310eb Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42503

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252331

Pulled By: AshkanAliabadi

fbshipit-source-id: 3c4c0e27b9a7eec8560e374c2a3ba5f1c65dae48
2020-08-29 17:47:00 -07:00
Bert Maher
0bf27d64f4 Fix NaN propagation in fuser's min/max implementation (#43590)
Summary:
fmax/fmin return the non-NaN argument when exactly one argument is NaN, which doesn't match eager-mode behavior, where the NaN propagates.
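
A hedged sketch of a NaN-propagating max matching eager semantics (not the exact fuser code, which emits device-side CUDA):
```
#include <cmath>
#include <limits>

// std::fmax(NaN, x) returns x, but eager-mode max propagates the NaN,
// so check for NaN explicitly before falling back to fmax.
float nanPropagatingMax(float a, float b) {
  if (std::isnan(a) || std::isnan(b)) {
    return std::numeric_limits<float>::quiet_NaN();
  }
  return std::fmax(a, b);
}
```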

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43590

Reviewed By: mruberry

Differential Revision: D23338664

Pulled By: bertmaher

fbshipit-source-id: b0316a6f01fcf8946ba77621efa18f339379b2d0
2020-08-26 17:31:06 -07:00
Nikita Shulga
d06f1818ad Fix codegen/cuda gcc-5.4 compilation issues (#43223)
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by missed copy elision.
This regression was introduced by https://github.com/pytorch/pytorch/pull/43129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
2020-08-18 17:19:07 -07:00
Christian Sarofeen
b3bda94393 [NVFuser] Enable E2E BCast-PWise-Reduction fusions (#43129)
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.

**Overall:**

- Enables pointwise fusion; single (but N-D) broadcast + pointwise fusion; and single (but N-D) broadcast + pointwise + single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
2020-08-18 09:10:08 -07:00
Nikita Shulga
4aa543ed2e Fix unordered-map-over-enum for GCC 5.4 (#41063)
Summary:
Forgot to add this to https://github.com/pytorch/pytorch/pull/41055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41063

Differential Revision: D22407451

Pulled By: malfet

fbshipit-source-id: 6f06653b165cc4817d134657f87caf643182832a
2020-07-06 23:26:31 -07:00
Nikita Shulga
50df097599 Fix CUDA jit codegen compilation with gcc-5.4 (#41055)
Summary:
It's a known gcc-5.4 bug that enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash of the key type.
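
A minimal sketch of the workaround (enum and map contents illustrative):
```
#include <cstddef>
#include <string>
#include <unordered_map>

struct EnumClassHash {
  template <typename T>
  std::size_t operator()(T t) const {
    return static_cast<std::size_t>(t);
  }
};

enum class ValType { TensorView, Scalar };

// gcc 5.4 can't synthesize std::hash for enum classes, so supply it explicitly.
std::unordered_map<ValType, std::string, EnumClassHash> val_type_names;
```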

Should fix regression caused by https://github.com/pytorch/pytorch/pull/40864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
2020-07-06 21:09:17 -07:00
Christian Sarofeen
b9b4f05abf [nvFuser] Working towards reductions, codegen improvements (#40864)
Summary:
Basic reduction fusion is working, and the code generator has been improved to approach the performance of eager-mode reductions. Coming soon will be pointwise-reduction fusions, done in a way that should prevent the possibility of hitting regressions. Also working on performant softmax kernels in the code generator, which may be our next fusion target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40864

Reviewed By: ngimel

Differential Revision: D22392877

Pulled By: soumith

fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
2020-07-06 14:52:49 -07:00
Jiayu Liu
0203d70c63 [nit] fix some typo within documentation (#40692)
Summary:
Apologies if this seems trivial, but I'd like to fix these typos as I read through some of the source code. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40692

Differential Revision: D22284651

Pulled By: mrshenli

fbshipit-source-id: 4259d1808aa4d15a02cfd486cfb44dd75fdc58f8
2020-06-30 19:24:44 -07:00
Sebastian Messmer
53af9df557 Unify boxed function signature between jit and c10 (#37034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034

c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.

This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069

Test Plan: unit tests

Differential Revision: D20567950

fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
2020-06-29 19:24:26 -07:00
Sergey Ionov
d14d47b9b5 Get rid of global constructors in cuda codegen (#40183)
Summary:
Use a switch statement instead of lookups in global `std::unordered_map`s to do enum-to-name conversions.
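
A sketch of the pattern (enum and names illustrative): a switch over string literals needs no static constructor at load time, unlike a global `std::unordered_map`.
```
enum class UnaryOpType { Abs, Neg };

// Returning string literals from a switch avoids a global map that would
// require a constructor to run at program startup.
const char* unary_op_name(UnaryOpType t) {
  switch (t) {
    case UnaryOpType::Abs:
      return "abs";
    case UnaryOpType::Neg:
      return "neg";
  }
  return "<unknown>";
}
```
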
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40183

Reviewed By: malfet

Differential Revision: D22117731

Pulled By: ionsphere

fbshipit-source-id: d150114cfae5b1222bb9142d815f2379072506c7
2020-06-18 13:54:11 -07:00
Christian Sarofeen
80e5ebf989 [nvFuser] Transform replay refactor and minor updates (#39579)
Summary:
We've got quite a few things going on, preparing a push back to upstream so we don't get too desynced.

- Major refactor of transform replay. It is now far more robust and fixes bugs discovered in reductions. Preparing for extension to explicit broadcast ops which will be the last major memory pattern for op coverage. Broadcast ops will allow us to express up to and potentially beyond norms and gemms.

- Initial runtime expression evaluator. This allows us to evaluate expressions at runtime; it will be useful for determining our grid/block layout at runtime, so we don't have to manually compute it from the code we're trying to generate.

- Moving to int64 and double for scalar representations to match PyTorch JIT.

- Improvements in the codegen interface, where we return a Tensor-like object instead of the parent class Val.

- Add `addcmul` and `lerp` ops

- General updates, fixes, test additions, test improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39579

Differential Revision: D21974001

Pulled By: soumith

fbshipit-source-id: 7f7ccc91593466e948f3ce90f8f9b7fbc5c28de2
2020-06-11 23:04:24 -07:00
Ilia Cherniavskii
abe2be2063 [resubmit] Use TensorMethods.cpp (#39385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39385

see https://github.com/pytorch/pytorch/pull/37639

Test Plan:
https://github.com/pytorch/pytorch/pull/37639

Imported from OSS

Differential Revision: D21833287

fbshipit-source-id: 9928d3f4122903d0de67ad312e349352d5f5c19c
2020-06-02 20:27:51 -07:00
Edward Yang
2fe0fc2684 Revert D21374247: Use TensorMethods.cpp
Test Plan: revert-hammer

Differential Revision:
D21374247

Original commit changeset: 076964415079

fbshipit-source-id: 732ec8c561d1f37475c1b5549ba79c718e3a6db8
2020-06-01 08:12:09 -07:00
Ilia Cherniavskii
68e62b9ab6 Use TensorMethods.cpp (#37639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37639

Changing TensorMethods.h to .cpp.
Necessary to avoid incomplete types in the dispatcher.

Test Plan:
CI

Imported from OSS

checked mobile size, no change, small reduction in size in fbios
fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -18.2 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -8.8 KiB

reran benchmark, no stat. significant difference

buck run mode/opt caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:benchmark_torchscript_model -- --model_file caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt --num_runs 3

╷ @  68592d0d  41 minutes ago  iliacher  D21374247
╭─╯  Use TensorMethods.cpp

Created 3 benchmark runs on aibench for caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt.
Links to the results:

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/1729113760

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/3867976782

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/2782186766

hg prev

@  7f501b42  Thursday at 14:26  bvaughan  D21764704
╷  short-circuit pow for complex 1 and 0 exponents

Created 3 benchmark runs on aibench for caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt.
Links to the results:

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/2155256332

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/1802057074

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/4119590830

Differential Revision: D21374247

fbshipit-source-id: 076964415079cf84fb57f1f7b43d087afed86e1d
2020-05-31 17:11:12 -07:00
lixinyu
a04fb2ab22 [Reland] add xenial + cuda 9.2 + gcc 5.4 CI test (#39036)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39036

Test Plan: Imported from OSS

Differential Revision: D21731026

Pulled By: glaringlee

fbshipit-source-id: ae678f786f95e3687ed6b3f176fe6736a436c721
2020-05-28 19:48:18 -07:00
Jeff Daily
1093e26d72 [ROCm] HIP version guard for occupancy API compatibility (#38551)
Summary:
CC ezyang xw285cornell

HIP from ROCm 3.5 renames `hipOccupancyMaxActiveBlocksPerMultiprocessor` to `hipModuleOccupancyMaxActiveBlocksPerMultiprocessor`.  In addition, the API parameter types now match CUDA.  Add these changes in a backwards-compatible manner.
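
A hedged sketch of the idea (the version macro and threshold are assumptions; the real change also reconciles the parameter-type differences noted above):
```
#include <hip/hip_runtime.h>

#if defined(HIP_VERSION) && HIP_VERSION >= 305
#define FUSER_HIP_OCCUPANCY hipModuleOccupancyMaxActiveBlocksPerMultiprocessor
#else
#define FUSER_HIP_OCCUPANCY hipOccupancyMaxActiveBlocksPerMultiprocessor
#endif

int maxActiveBlocks(hipFunction_t kernel, int block_size) {
  int max_blocks = 0;
  FUSER_HIP_OCCUPANCY(&max_blocks, kernel, block_size, /*dynSharedMemPerBlk=*/0);
  return max_blocks;
}
```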
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38551

Differential Revision: D21721832

Pulled By: ezyang

fbshipit-source-id: 6fc971845e363d7495d8be9550e76d0f082c3062
2020-05-27 10:09:06 -07:00
Jeff Daily
ccab142197 Add ROCm-specific half_support_literal for JIT. (#38899)
Summary:
CC ezyang xw285cornell sunway513 lcskrishna
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38899

Differential Revision: D21721855

Pulled By: ezyang

fbshipit-source-id: 3739c462f04cee40ff979f44387ef66b971f5303
2020-05-26 14:34:08 -07:00
Christian Sarofeen
8e69c3be17 [nvFuser] Reduction support in codegen, fp16 support (#38627)
Summary:
Adds reduction support for the code generator. Reductions are fully supported with the split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross-thread (intra-block) reduction support.

The two remaining pieces missing for reduction support are:
- Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
- Cross-block reduction: we will want inter-block reduction support to reach parity with TensorIterator

This PR also provides FP16 support for fusions: we insert casts from FP16 inputs to FP32, and casts back to FP16 on FP16 outputs.

Also working towards reduction support and shape inference for reductions in the fusion pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38627

Reviewed By: albanD

Differential Revision: D21663196

Pulled By: soumith

fbshipit-source-id: 3ff2df563f86c39cd5821ab9c1148149e5172a9e
2020-05-21 17:18:39 -07:00
anjali411
8e07b75cef Have DeviceType available in torch namespace (#38036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38036

Resolves: https://github.com/pytorch/pytorch/issues/36946

Test Plan: Imported from OSS

Differential Revision: D21463610

Pulled By: anjali411

fbshipit-source-id: c4aabfac2cd1f05f8b66745aae0a17c2af4d9c9b
2020-05-11 16:06:52 -07:00
Cloud Han
6e1e2a60dc fix compilation error with gcc 5.5 (#38112)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/38111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38112

Differential Revision: D21476876

Pulled By: malfet

fbshipit-source-id: 06d25e763eb73961f7b4e4cfbd2bb59f5ab96387
2020-05-08 13:23:36 -07:00
jiej
1667aa6451 [CUDA_FUSER] Expand operation support for cuda fuser (#37849)
Summary:
This PR added more supported operations to the CUDA fuser. We are covering the major pointwise operations supported in the legacy fuser.

In an attempt to adapt to the legacy executor:
1. added a naive shape propagation pass on the PyTorch JIT IR;
2. small refactor of graph partitioning;
3. fallback to interpreter execution of the fusion group;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37849

Reviewed By: yf225

Differential Revision: D21444320

Pulled By: soumith

fbshipit-source-id: 712e18ab8497f8d58a07e6f8d200cdab52cf0d74
2020-05-07 09:21:09 -07:00
Nikolay Korovaiko
6ecb5bb1f0 match old fuser rem to eager (#37196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37196

Reviewed By: zdevito

Differential Revision: D21223172

Pulled By: Krovatkin

fbshipit-source-id: 4d4ff1127d5dc69ab73f07ca79c1f5b0b4dd9732
2020-05-01 10:55:06 -07:00
Nikolay Korovaiko
4ed790d742 Adding symbolic sizes, contiguity, stride indices (#36101)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36101

Reviewed By: jamesr66a

Differential Revision: D20908711

Pulled By: Krovatkin

fbshipit-source-id: f90ce74acffeb645d7d906d07e293164d65ed7e6
2020-05-01 02:01:25 -07:00
Deyu Fu
346215caa4 [jit] Adding vectorized load/store support for JIT generated CUDA kernel (#36555)
Summary:
The JIT pointwise kernel currently does not do vectorized loads/stores, which may lead to suboptimal performance for shorter data types like half and int8.

In this PR, a fixed width of 4 elements per load/store is added for supported tensor shapes, implemented as a runtime check inside the kernel.

Supported tensor shapes:
- all input/output data pointers are aligned to 4*sizeof(dtype)
- the last dimension is contiguous (stride 1) and its size is a multiple of 4
- all other dimensions have strides that are multiples of 4

All test_jit* tests passed, and here are performance results on a simple `ax+by+c` fusion.
result before PR:
```
torch.float32 kernel time: 0.748 ms.
torch.float16 kernel time: 0.423 ms.
torch.int8 kernel time: 0.268 ms.
```
result after PR:
```
torch.float32 kernel time: 0.733 ms.
torch.float16 kernel time: 0.363 ms.
torch.int8 kernel time: 0.191 ms.
```
test code:
```
import torch
import time

# disable profiling to test all data types
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)

@torch.jit.script
def axpby(x, y):
    return  x * 2 - y * 3 + 1

for test_dtype in [torch.float32, torch.float16, torch.int8]:
    a = torch.randn(12345,4096, device="cuda").to(test_dtype)
    b = torch.randn(12345,4096, device="cuda").to(test_dtype)

    # warm up
    for _ in range(100):
        c = axpby(a,b)

    torch.cuda.synchronize()
    start = time.time()

    for _ in range(1000):
        c = axpby(a,b)

    torch.cuda.synchronize()
    end = time.time()
    print("{} kernel time: {:.3f} ms.".format(test_dtype, end-start))
```
Generated code:
[log_with_generated_code.txt](https://github.com/pytorch/pytorch/files/4472813/log_with_generated_code.txt)

Additional note:
the double type is excluded from the vectorized code path.

We can later improve this with dynamic vectorization-width support and fewer in-kernel checks once tensor shape information is available in codegen. For now, this implementation follows the existing caching through the TensorDesc mechanism, which does not carry enough compile-time information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36555

Differential Revision: D21142762

Pulled By: ngimel

fbshipit-source-id: 1cfdc5807a944c4670b040dc2d2dfa480377e7d7
2020-04-20 19:24:28 -07:00
Christian Sarofeen
f11c4f90c2 New CUDA Fuser: Unrolling support, interface refactor (#36435)
Summary:
Unrolling support has been added in a way that produces well-performing code on GPUs. Not sure how long this link will last, but an example of a generated unrolled kernel is:
https://godbolt.org/z/i0uAv3

What can be seen there is multiple "ld.global.f32" instructions without "st.global.f32" in between them (and vice versa). This means we are issuing multiple loads that can run in parallel, as well as multiple stores that can run in parallel. This can be a crucial optimization for memory-bound kernels. This was generally a point of concern in TVM, as an attempt at a similar kernel in TVM produces https://godbolt.org/z/Vu97vG, which surrounds load-store pairs in conditional branches, preventing the benefits of unrolling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36435

Reviewed By: ZolotukhinM

Differential Revision: D21024011

Pulled By: soumith

fbshipit-source-id: e852e282fa7a304aba962e1926f756098c011fe0
2020-04-16 09:20:24 -07:00
Christian Sarofeen
e551bfc8de New CUDA Fuser code lowering refactor (#36199)
Summary:
This PR completely refactors the code lowering process from our IR to CUDA. Before, we had one giant step that went from a relatively high-level IR straight to CUDA; now we first lower into concepts like ForLoop, IfThenElse, TensorIndex, and Allocate. This lowering will allow more complex code lowering like reductions and unrolling. Unrolling will quickly follow this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36199

Reviewed By: dzhulgakov

Differential Revision: D20925220

Pulled By: soumith

fbshipit-source-id: 8f621c694c68a1aad8653e625d7287fe2d8b35dc
2020-04-09 14:27:05 -07:00
Pavel Belevich
34b32ca914 Remove operator-> from at::Generator (#36027)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36027

Differential Revision: D20856462

Pulled By: pbelevich

fbshipit-source-id: 156fc23d51d8125d41e96b36b3b1312f13040588
2020-04-07 08:07:07 -07:00
Pavel Belevich
3328a2f903 Rename CPUGenerator to CPUGeneratorImpl and CUDAGenerator to CUDAGeneratorImpl (#36026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36026

Differential Revision: D20856458

Pulled By: pbelevich

fbshipit-source-id: 6d105593dca67640d508a4aebf7edf028d52af32
2020-04-07 08:05:23 -07:00
Nikita Shulga
e707cee501 Fix gcc-5.4 compilation (#35935)
Summary:
It needs a hint for how to hash an `enum class` in `std::unordered_map`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35935

Test Plan: CI

Differential Revision: D20837750

Pulled By: malfet

fbshipit-source-id: 4208ee4bfa2e3cfbedf5b92bf18031225bf9dfa1
2020-04-03 08:39:30 -07:00
Christian Sarofeen
aeb13f212b Make ValType hashable. (#35917)
Summary:
Build fix stemming from https://github.com/pytorch/pytorch/issues/34785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35917

Differential Revision: D20829353

Pulled By: soumith

fbshipit-source-id: 4ba84ecedd354efbc9ac47c9b0f0e3871b404f13
2020-04-03 00:16:56 -07:00
Song Zhou
dabeff33b9 [pytorch] Fix fblearner flow compiling errors (#35902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35902

Move operator registration to an anonymous namespace to avoid collisions.

Reviewed By: soumith

Differential Revision: D20822382

fbshipit-source-id: 1ab00871491668b8b85e803ac877d96477f1688b
2020-04-02 14:52:48 -07:00
Soumith Chintala
d9dd353a00 fix clang-format (#35884)
Summary:
Fixes breakage introduced in a PR that I landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35884

Differential Revision: D20817603

Pulled By: soumith

fbshipit-source-id: b0729bed81549d4c8e6a889c380baa19c73ef127
2020-04-02 12:12:27 -07:00
Christian Sarofeen
6d24f8fe21 Infrastructure for a new CUDA Fuser (#34785)
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide, but the implementation is built from the ground up. The fusion pass itself is similar to the default CUDA fuser's; however, it has undergone some refactoring and uses the new code generation infrastructure. For those interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_.

One of the largest differences between our approach and that of TVM/Halide is the concept of "TensorView". A TensorView should, at a high level, be thought of the way we think of working with Tensors in PyTorch: it's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at in TVM; they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated, as sketched below.
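
A hedged sketch in the spirit of _test/cpp/jit/test_gpu_fusion.cpp_ (API names approximate, not verified against this PR):
```
// Reshape a TensorView's iteration space and bind it to the GPU hierarchy.
TensorView* tv = add(tv0, tv1);                // tv0/tv1: inputs to the fusion
tv->split(0, 128);                             // [I] -> [I/128, 128]
tv->axis(0)->parallelize(ParallelType::BIDx);  // outer axis -> thread blocks
tv->axis(1)->parallelize(ParallelType::TIDx);  // inner axis -> threads
```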

**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate the infrastructure from the fusion capabilities. Once it is in, smaller incremental PRs will be submitted to expand the fuser's capabilities.

**Short term goals:**

Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout

**Mid-term goals:**

- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785

Reviewed By: ZolotukhinM

Differential Revision: D20650977

Pulled By: soumith

fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
2020-04-02 09:22:42 -07:00
Meghan Lele
6384c2d81b [JIT] clang-format JIT code (#35115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115

This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.

Testing:
Ran the script, CI.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D20568523

Pulled By: SplitInfinity

fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b
2020-03-26 11:24:51 -07:00
Pavel Belevich
5306713a36 Replace Generator* with Generator that holds std::shared_ptr<GeneratorImpl> (#34468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34468

This PR prepares `at::Generator` for pybind11's `type_caster<at::Generator>` which is required to implement custom RNG in python. The following changes are done:
1. `at::Generator` was moved to `c10::GeneratorImpl` (similar to `c10::TensorImpl`)
2. `at::Generator` was recreated as a holder of `std::shared_ptr<c10::GeneratorImpl>` (similar to `at::Tensor` that holds `c10::intrusive_ptr<c10::TensorImpl>`)
3. Most of `at::Generator*` usages were replaced with `at::Generator`

TBD: replacing `Generator generator = nullptr` with `{}` requires JIT changes (adding Generator to IValue?)

Differential Revision: D20549420

Pulled By: pbelevich

fbshipit-source-id: 4c92a40eab8f033b359bb6c93f4cd84b07ee8d4e
2020-03-21 17:36:10 -07:00
Edward Yang
cf8b728255 Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema. (#34588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34588

I constructed the patch by deleting OperatorOptions and then rerouting
all queries for AliasAnalysisKind to FunctionSchema.  Some of the
behavior is kind of bogus: we really shouldn't be mutating FunctionSchema
after the fact, but that won't get fixed until we actually switch to
true schema merging.

Reland of https://github.com/pytorch/pytorch/pull/34160

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20387079

Pulled By: ezyang

fbshipit-source-id: d189f7a6ad8cd186b88b6fbfa3f189994eea14e8
2020-03-11 20:59:46 -07:00
Edward Yang
6f8a8e4e47 Revert D20282846: Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema.
Test Plan: revert-hammer

Differential Revision:
D20282846

Original commit changeset: ba7bca6e8adc

fbshipit-source-id: b9e15d2b2c3d1dbc6e971ab3c0bdf380e769dcf1
2020-03-11 07:50:29 -07:00
Edward Yang
9d42177a31 Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema. (#34160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34160

I constructed the patch by deleting OperatorOptions and then rerouting
all queries for AliasAnalysisKind to FunctionSchema.  Some of the
behavior is kind of bogus: we really shouldn't be mutating FunctionSchema
after the fact, but that won't get fixed until we actually switch to
true schema merging.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20282846

Pulled By: ezyang

fbshipit-source-id: ba7bca6e8adc3365789639b88e54c4e881b1692e
2020-03-11 07:15:18 -07:00
Zachary DeVito
358450e02b improved TorchScript traceback (#33834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33834

This changes how we report Tracebacks to make them more clear when
there are both serialized and non-serialized ranges. It now looks like:

```
Traceback (most recent call last):
  File "foo.py", line 25, in <module>
    s2(a, b)
  File "/scratch/zdevito/pytorch/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 7, in forward
    x: Tensor,
    y: Tensor) -> Tensor:
    return (self).bar(x, y, )
            ~~~~~~~~~ <--- HERE
  def bar(self: __torch__.Moo,
    x: Tensor,
  File "code/__torch__.py", line 11, in bar
    x: Tensor,
    y: Tensor) -> Tensor:
    _0 = (self).baz(x, y, )
          ~~~~~~~~~ <--- HERE
    _1 = torch.ones([3], dtype=None, layout=None, device=None, pin_memory=None)
    return torch.add(_0, _1, alpha=1)
  File "code/__torch__.py", line 17, in baz
    x: Tensor,
    y: Tensor) -> Tensor:
    return torch.add(x, y, alpha=1)
           ~~~~~~~~~ <--- HERE

Traceback of TorchScript, original code (most recent call last):
  File "foo.py", line 11, in forward
    def forward(self, x, y):
        return self.bar(x, y)
               ~~~~~~~~ <--- HERE
  File "foo.py", line 9, in bar
    def bar(self, x, y):
        return self.baz(x, y) + torch.ones(3)
               ~~~~~~~~ <--- HERE
  File "foo.py", line 7, in baz
    def baz(self, x, y):
        return x + y
               ~~~~~ <--- HERE
RuntimeError: The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 1
```

It follows the Python convention of putting the most important information last
and reading from the bottom up.

Changes:
* Moved the error message to the end, to copy Python
* Report original traceback separate from serialized traceback
* Make sure root functions have names in the interpreter trace.

Test Plan: Imported from OSS

Differential Revision: D20126136

Pulled By: zdevito

fbshipit-source-id: fd01f9985e5d74e04c4d064c02e8bc320f4fac13
2020-03-03 12:27:38 -08:00
Michael Suo
dbe850af5b [jit] do the code reorg (#33851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33851

Rationale and context described in #33828.

Script to reproduce the move:
https://gist.github.com/suo/16cbefaaeb67ca5a7c6caffd49b7f6e9
ghstack-source-id: 99079645

Test Plan: Make sure CI passes

Reviewed By: jamesr66a

Differential Revision: D20133869

fbshipit-source-id: 390e9241a9c85366d9005c492ac31f10aa96488e
2020-02-27 13:02:51 -08:00