This PR relands #80584, but instead of adding the suppression in CMakeLists.txt it suppresses the warning directly in `llvm_codegen.cpp`, and only for a single header.
In general, it's better to avoid the `set_target_properties` pattern for suppressing warnings, as it makes the build brittle and hard to debug/understand.
Test plan: wait for `ciflow/binaries_wheel` to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81012
Approved by: https://github.com/huydhn, https://github.com/kit1980
There is a problem where pybind11 silently converts Python's complex scalars to `bool` and uses the `define_constant<bool>` overload. This went unnoticed because `0j` converts to `False`, so the tests passed; with a `2j` scalar, the tests for `_refs.where` would fail without proper bindings.
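A minimal repro sketch of the kind of failure this fixes, assuming the same `make_traced`/nvfuser executor setup used in the examples below (shapes and dtypes are illustrative):
```python
import torch
import torch._refs
from torch._prims.executor import make_traced

cond = torch.ones(3, 3, dtype=torch.bool, device='cuda')
b = torch.randn(3, 3, dtype=torch.complex64, device='cuda')

# Without proper complex bindings, the 2j scalar is silently passed to the
# define_constant<bool> overload, so the traced fusion gets the wrong constant.
func = lambda cond, b: torch._refs.where(cond, 2j, b)
assert make_traced(func)(cond, b, executor="nvfuser").dtype == torch.complex64
```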
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80522
Approved by: https://github.com/ngimel
This PR modifies the type promotion logic for nvFuser's `where` function when one of the arguments is a scalar. With the proposed change, the behavior now matches ATen's type promotion.
The following script fails on master and passes with this PR:
```py
import torch
import torch._refs
from torch._prims.executor import make_traced
a = torch.ones(3, 3, dtype=torch.bool, device='cuda')
b = torch.randn(3, 3, device='cuda')
func = lambda a, b: torch._refs.where(a, 0.0, b)
assert make_traced(func)(a, b, executor="nvfuser").dtype == torch.float32
```
This PR allows unskipping the nvFuser tests for `_refs.log_softmax`, which were failing with a dtype mismatch.
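A minimal sketch of the previously skipped case, assuming `_refs.log_softmax` is exposed under the name used here:
```python
import torch
import torch._refs
from torch._prims.executor import make_traced

a = torch.randn(3, 3, device='cuda')
func = lambda a: torch._refs.log_softmax(a, dim=-1)
# Before this PR the nvFuser-executed trace produced a mismatched dtype.
assert make_traced(func)(a, executor="nvfuser").dtype == torch.float32
```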
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80347
Approved by: https://github.com/ngimel
This PR fixes a bug in `broadcast_in_dim` that prevented reduction ops from being used before `broadcast_in_dim`.
With this PR it's possible to run:
```py
import torch
import torch._refs
from torch._prims.executor import make_traced
def foo(a):
    return torch._refs.mean(a, keepdim=False)
a = torch.randn(3, 3, device='cuda')
make_traced(foo)(a, executor="nvfuser")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79444
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Bug fixes and minor refactor
Squashed commits to WAR github API
Commits that are actually in this PR from the devel branch:
```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```
RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
Adding a link to the GitHub 1.12 release branch nvFuser README.md in the JIT doc.
Note that this PR is intended to be cherry-picked into the 1.12 release; we'll have a follow-up PR to update the link once this PR is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78160
Approved by: https://github.com/davidberard98
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
A few bigger updates:
1. Initial support of cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: https://github.com/csarofeen/pytorch/pull/1440
Commits that are actually in this PR from the csarofeen branch:
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
Sometimes the bias won't have shape info (e.g. in the added test, conv is run twice in a loop, each time with different shapes). In that case we should just skip the decomposition instead of erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77440
Approved by: https://github.com/jjsjann123
This PR primarily addresses augmenting the frontend to properly support `broadcast_in_dim`. This required making a new version of `define_tensor()` that takes in the `sizes` and `strides` of input tensors in order to properly determine broadcasts.
This PR also includes a fix for `python_example.py`, which broke when a new argument was added to reductions to allow the user to specify an output data type.
`define_tensor()` Interface Example:
```
fusion2 = Fusion()
input1 = torch.ones(1, 1, 4, device='cuda')
input2 = torch.ones(2, 3, 4, device='cuda')
with FusionDefinition(fusion2) as fd:
    t0 = fd.define_tensor(sizes=input1.size(), strides=input1.stride())
    t1 = fd.define_tensor(sizes=input2.size(), strides=input2.stride())

    fd.add_input(t0)
    fd.add_input(t1)

    t0_b = fd.Ops.broadcast_in_dim(t0, [2, 3, 4], [0, 1, 2])
    print("Broadcast TensorView", t0_b)
    t2 = fd.Ops.add(t0_b, t1)

    fd.add_output(t2)
```
Print statement of defined broadcast tensor:
```
Broadcast TensorView T2_l[ sbS6{1}, sbS7{1}, iS8{i2} ] DataType: float Contiguity: ttt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76790
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of #50256, the below improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
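A minimal end-to-end usage sketch of the API above; the model and example input are illustrative placeholders, not part of this PR:
```python
import torch
import torch.nn as nn

torch.jit.enable_onednn_fusion(True)  # turn on the oneDNN Graph fuser

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()  # placeholder model
example = torch.randn(1, 3, 32, 32)

with torch.no_grad():
    traced = torch.jit.trace(model, example)
    frozen = torch.jit.freeze(traced)  # freeze after tracing, as recommended
    frozen(example)                    # warm-up runs let the profiling executor fuse
    frozen(example)
    out = frozen(example)
```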
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
* SkyLake 8180 (1 socket of 28 cores):

* SkyLake 8180 (single thread):

* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support the PyTorch-oneDNN-Graph integration on the Linux platform. Support on Windows and macOS will be enabled as a next step.
* We have only optimized the inference use-case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76485
Adds an environment variable `PYTORCH_JIT_ENABLE_NVFUSER` for
controlling whether or not nvfuser is enabled. This required changing
the PassManager behavior to support the case where nvfuser gets enabled
by default when PYTORCH_JIT_ENABLE_NVFUSER=1.
Previously the solution for turning nvfuser on or off was to use the
PassManager to register or un-register the pass. That works fine if the
pass starts off _disabled_, but causes issues once we try to enable the
pass by default.
The main issue with enabling by default is with the validation check to
see whether NVFuser can be turned on. The check relies on
at::globalContext().hasCUDA(), which requires CUDAHooks to be registered
before hasCUDA() wil work correctly. At static initialization time it's
difficult to ensure that CUDAHooks will be registered _before_ we
attempt to register the nvfuser pass. In OSS it worked fine, but in
internal builds it would fail on ROCm builds.
To fix this, we switch the control of NVFuser enablement to a check in
the pass. i.e. previously, we enabled/disabled nvfuser by registering or
de-registering the pass in pass manager; now, the pass is always
registered in pass manager, and enablement is done by a check within the
nvfuser pass.
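A minimal usage sketch of the new control, assuming (per the description above) the variable is read when the always-registered nvfuser pass runs; the script body below is illustrative:
```python
# Launch with nvfuser enabled by default:
#   PYTORCH_JIT_ENABLE_NVFUSER=1 python this_script.py
import torch

@torch.jit.script
def f(x, y):
    return torch.relu(x + y)

x = torch.randn(8, 8, device="cuda")
y = torch.randn(8, 8, device="cuda")
for _ in range(3):  # profiling runs; fusion only kicks in if nvfuser is enabled
    f(x, y)
```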
Remaining TODO: Connect this with NNC so that in cases where NNC is
available but not NVFuser (i.e. on AMD gpus), NNC can be turned on
automatically.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D35982618
Pulled By: davidberard98
fbshipit-source-id: fd5b76bc0b8c8716c96fdc04bebfb15026a7ef60
(cherry picked from commit ff14603ff5ac8d9b6c749c4f111f4a8be8023b7f)
Extended permutation support in integration (See more details on https://github.com/csarofeen/pytorch/issues/1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in aten, with some known limitation at the time.
The idea in this implementation is the same as in our existing code, which is to permute input/output tensors outside of codegen. For a simplified binary op scenario, `output = binaryOp(input0, input1)`:
1. In a simple case where `input0` and `input1` come with the same rank & permutation order, our output would preserve the same permutation;
2. For cases where `input0` and `input1` come with different ranks but with **compatible** permutation, the tensor with the higher rank dictates the permutation of the output;
3. For cases where `input0` and `input1` come with different ranks but with **incompatible** permutation, permutation propagation fails and the output tensor will be contiguous.
By **compatible** permutation, we mean that we can permute the higher-rank tensor to contiguous format and then apply a second permutation to the lower-rank tensor to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.
Some concrete examples (note that we comply with eager propagation in cases 1-3, but diverge in behavior for cases 4 and 5):
1. different rank & same permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1]) # stride (1, wc, c)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
2. different rank & compatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
3. different rank & compatible permutation with broadcasting
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1) # stride (1, 1, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
4. different rank & incompatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w).cuda() # stride (w, 1)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, wc, c, 1) # nvfuser outputs contiguous tensor
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # TI preserves memory format of LHS operand
```
5. different rank & incompatible permutation
```
t0 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # nvfuser preserves memory format of highest rank tensors
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, hw, w, 1) # TensorIterator preserves memory format of LHS operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
I noticed that when `SymInt` was introduced, `jit_type_base.h` was
added as an include to the `Operator.h` template, which is supposed to
be kept extremely clean and only use forward declarations. Also,
the forward declarations for `OptionalArrayRef` were missing.
So, I've refactored the forward declarations into
`ATen/core/ATen_fwd.h` and cleaned up some of the `c10`
headers that were masking these missing declarations. I've also
re-generated the pre-compiled header so `SymInt` is included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76576
Approved by: https://github.com/albanD
Retry of #75983. The change is to handle cases where attr::cache_id is
not set. This can happen if compilation fails.
Original message:
1) remember when fusions fail; and on subsequent runs, always take the fallback.
2) during the first fallback, cache the Code object.
On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms
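A minimal Python sketch of points 1) and 2) above; the `Code`, `compile_fused`, and `compile_fallback` names are illustrative stand-ins, not the actual C++ internals:
```python
class Code:
    """Stand-in for a compiled fallback graph."""
    def __init__(self, fn):
        self.fn = fn
    def run(self, *args):
        return self.fn(*args)

failed_fusions = set()   # nodes whose fusion compilation failed once
fallback_cache = {}      # node -> cached fallback Code object

def run_node(node, compile_fused, compile_fallback, *inputs):
    if node in failed_fusions:
        # 1) After the first failure, always take the (cached) fallback.
        return fallback_cache[node].run(*inputs)
    try:
        return compile_fused(node).run(*inputs)
    except RuntimeError:
        failed_fusions.add(node)
        # 2) On the first fallback, compile the unfused graph once and cache it.
        fallback_cache[node] = compile_fallback(node)
        return fallback_cache[node].run(*inputs)
```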
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76505
Approved by: https://github.com/eellison
Fixes #75708
`--ptxas-options` only passes its immediate argument to ptxas. So we should have put that in front of every ptxas argument.
It's actually strange how this worked in CUDA TK 11.6. I'm following up with the nvrtc team on this internally; meanwhile, we should merge this PR to avoid register failures in generated kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76226
Approved by: https://github.com/davidberard98
1) remember when fusions fail; and on subsequent runs, always take the fallback.
2) during the first fallback, cache the Code object.
On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75983
Approved by: https://github.com/jjsjann123
Summary:
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims it got added for consistency with the top-level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.
Modifies violations in 50+ files that were added in the interim, by either removing unused variables or decorating the code with `C10_UNUSED` if a local variable is likely used to extend an object's lifetime until the end of the block.
Caused a preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Reviewed By: anjali411
Differential Revision: D35747333
Pulled By: malfet
fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
Summary:
Fixing clang-format errors using `arc f`
Changes already in GitHub included https://github.com/pytorch/pytorch/pull/68460
Test Plan: test run in Signals
Reviewed By: osalpekar
Differential Revision: D35649381
fbshipit-source-id: 15f9cc7259c6425a14d2646200008f15ec47cbf0
(cherry picked from commit 6581afe58afae4dcc34d4024499c6cb61a56b448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74361
This adds an optional validation after executing an NVFuser node, which checks that the output is the same as the unfused implementation. Then the outputs and the graph are reported via a callback.
```python
import torch
def callback(x, y, graph):
    for i in range(len(x) - amt, len(x)):  # `amt` is assumed to be defined by the surrounding harness
        print(x[i])
        print(y[i])
    print(graph)

with torch.jit.fuser("fuser2"):
    torch._C._jit_nvfuser_set_comparison_callback(True, callback)

    @torch.jit.script
    def g(x, y):
        z = torch.add(x, y)
        return torch.sin(z)

    def f(x, y, a):
        z = torch.add(x, y)
        return g(torch.relu(z), a)

    f_s = torch.jit.script(f)
    x = torch.rand((10, 10), dtype=torch.half).cuda()
    y = torch.rand((10, 10), dtype=torch.half).cuda()
    a = torch.rand((10, 10), dtype=torch.half).cuda()
    f_s(x, y, a)
    f_s(x, y, a)
    f_s(x, y, a)
```
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D34975310
Pulled By: davidberard98
fbshipit-source-id: 2379c9a6f371cd58da6a187c1f16882f3923ab24
(cherry picked from commit 96c87992c65f5e6bb1bdd51791682dd837af99b4)
Fixes an issue where CudaFusionGuard would return false on the backward graph because the `requires_grad` flag doesn't match.
This is due to the fact that autodiff uses the GradMode switch to turn requires_grad on/off, which is not taken into consideration by the nvfuser guard. We verified the implementation under `TensorType::matchTensor`.
- [x] Add python test to verify no fallback is observed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75016
Approved by: https://github.com/eellison
Summary:
Things changed in this PR that require review:
test/forward_backward_compatibility/check_forward_backward_compatibility.py
Our previous function overload extension names were wrong and have been updated in this PR, hence the update to the compatibility list.
nvfuser code updates with bug fixes for failures we encountered in OpInfo tests as well as failures reported by the AOTAutograd team.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627
Reviewed By: Chillee
Differential Revision: D34765458
Pulled By: davidberard98
fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
Summary:
Added a Python API to disable nvfuser for a given op kind.
```
"_jit_set_nvfuser_skip_node_kind",
[](const std::string& op_name, bool flip = true) {
return fuser::cuda::skipNode(op_name, flip);
})
```
Args:
`op_name`: Symbol of op;
`flip`: flag indicating whether to flip the given op in the skip list.
Returns:
a bool flag indicating if `op_name` was already in the skip list.
A Python example that disables the fusion of `aten::add` afterwards:
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True) # returns False, as no op is in skip list by default`
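A slightly fuller usage sketch based on the Args/Returns description above (the return values assume an initially empty skip list):
```python
import torch

# Add aten::add to the skip list; returns False because nothing was skipped yet.
assert not torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)

# flip=False only queries: aten::add is now reported as being in the skip list.
assert torch._C._jit_set_nvfuser_skip_node_kind("aten::add", False)

# Flip again to remove aten::add; returns True because it was in the skip list.
assert torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)
```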
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520
Reviewed By: saketh-are
Differential Revision: D35046110
Pulled By: davidberard98
fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
Summary:
## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:
- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

- SkyLake 8180 (single thread):

\* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is:
```
caffe2/CMakeLists.txt
```
## Limitations
- In this PR, we have only supported the optimization on the Linux platform. Support on Windows and macOS will be enabled as the next step.
- We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68111
Reviewed By: eellison
Differential Revision: D34584878
Pulled By: malfet
fbshipit-source-id: ce817aa8cc9052ee9ed930c9cf66be83449e61a4
(cherry picked from commit cd17683aa7d9c0947df45a1ab53627feff795587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73818
These all appear to be defined in libtorch_cpu.so, so they should be marked with TORCH_API. TORCH_API means that these symbols are exported from libtorch_cpu.so and no other libraries. In comparison, C10_EXPORT will export the symbol in _all_ built libraries, if it's available.
I think most of these were fine because most were only defined in cpp files (which would only be included in the targets for one .so file). However, the change in pass_manager.h affects behavior, since the class is defined in the .h file, which could result in two separate implementations of the same static functions. Previously we saw issues on Windows with this: https://github.com/pytorch/pytorch/pull/73742
Test Plan: Imported from OSS
Reviewed By: george-qi
Differential Revision: D34698175
Pulled By: davidberard98
fbshipit-source-id: cb871e861cf966bff596cfa8340a32a17fca0b66
(cherry picked from commit 6b9988e5688e6d4a9928c3e331efb74f000a9e4a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73881
NVFuser fusion groups can contain nvfuser-only ops, e.g. `prim::reshape_copy`. Previously, we couldn't get a baseline performance measurement because the nvfuser-only ops would error out on nnc- and no-fusion runs. Instead, dump the fallback graphs after the fallbacks are corrected into runnable fallbacks.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D34698307
Pulled By: davidberard98
fbshipit-source-id: c357b2736b789bfd347afe9c83a1b610b64881e0
(cherry picked from commit 5918d826502ff75fbc22d242844ae6435dd7d22a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72889
The script along with the GRAPH_EXPORT macro will allow for an easy way to extract IR from logs. One use case in this diff is to extract the fusion groups from nvfuser, so that the fusions can be tested individually.
Usage (e.g. for nvfuser test)
1. Write some test.py file that uses nvfuser
2. `PYTORCH_JIT_LOG_LEVEL=">>graph_fuser" python3 test.py 2>&1 | tee output.txt`
3. `python3 pytorch/scripts/jit/log_extract.py output.txt --nvfuser`
This will run with and without nvfuser to compare the output.
Alternatively, use `--output` to dump the IR so that it can be used in other applications.
Currently, only `--output` works (since generating input tensors is not supported)
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D34440189
Pulled By: davidberard98
fbshipit-source-id: fca0f619200ee37aba34bb39b69e6c640c263e26
(cherry picked from commit eb319166075db160f1628f0de545641fbecde8be)
This patch enables building PyTorch from source with the Ninja and
'Visual Studio 16 2019' CMake generators on Windows on Arm.
Tests:
- Build from source: 'python setup.py develop'.
- Run simple Pytorch example: passed
- python test\test_torch.py:
-- same results as on x64
-- Ran 1344 tests, failures=2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72424
Summary:
Things changed in this PR that require review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws when a scripted model uses autocast as a decorator, since it's not supported
nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor
Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127
Reviewed By: HamidShojanazeri
Differential Revision: D34113233
Pulled By: jbschlosser
fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72013
Find and replace `assert(!"` with `assert(false && "`
Excludes headers and paths that contain "third-party" or "external"
Clang raises a `-Wstring-conversion` warning when treating a string as a boolean. This is not uncommon for asserts though (e.g. `assert(!"should never happen")`). Clang does permit `expr && "string"` though in order to support these assertion use cases.
Test Plan: ci pass
Differential Revision: D33823092
fbshipit-source-id: 9a1af012215bdc91f8b4162ddb2df28d51539773
(cherry picked from commit 0286910350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71879
Two locations of improper macro usage were reported (https://github.com/pytorch/pytorch/issues/71848), and this diff fixes them. In both cases this is behavior-changing, since the incorrect usages would have passed the assertion due to interpreting the error string as the condition, and both cases should have been 'assert false'.
Test Plan: Run CI
Reviewed By: alanwaketan
Differential Revision: D33800406
fbshipit-source-id: dfe3d9a6455e6eb96cb639022f8813a8bd6520c3
(cherry picked from commit ee551e5a16)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71711
This will fix a ton of broken asserts that should always fire but never actually fire.
All would have been caught with `-Wstring-conversion` warnings enabled.
Test Plan: CI Pass
Differential Revision: D33743605
fbshipit-source-id: 062641f9d5d02c6e317c5a286fd01017cf77237f
(cherry picked from commit 639b42e04b)
Summary:
This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h`, to `ATen/cuda`, as the following TODO comment in `CUDAGeneratorImpl.h` suggests:
```
// TODO: this file should be in ATen/cuda, not top level
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650
Reviewed By: jianyuh, xw285cornell
Differential Revision: D33414890
Pulled By: shintaro-iwasaki
fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69579
This should help us avoid reference counting overhead on singleton Type subclasses without a major rewrite of the Type subsystem.
ghstack-source-id: 146643993
Test Plan:
Ran //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark with arguments `--op empty -niter 40 --stressTestRecordFunction --captureRecordFunctionInputs` on devbig with turbo off.
Before:
```
I1206 13:47:15.037441 1201670 bench.cpp:144] Mean 0.737675
I1206 13:47:15.037463 1201670 bench.cpp:145] Median 0.736725
I1206 13:47:15.037468 1201670 bench.cpp:146] Min 0.722897
I1206 13:47:15.037473 1201670 bench.cpp:147] stddev 0.00508187
I1206 13:47:15.037482 1201670 bench.cpp:148] stddev / mean 0.00688903
```
After:
```
I1206 13:48:16.830123 1205612 bench.cpp:144] Mean 0.66988
I1206 13:48:16.830150 1205612 bench.cpp:145] Median 0.663956
I1206 13:48:16.830157 1205612 bench.cpp:146] Min 0.65986
I1206 13:48:16.830164 1205612 bench.cpp:147] stddev 0.0335928
I1206 13:48:16.830171 1205612 bench.cpp:148] stddev / mean 0.0501475
```
Static runtime startup is also improved; for CMF local_ro, time to initialize a predictor went from 10.01s to 9.59s.
(Note: I wish I had a production workload to demonstrate the advantage of this on. I tried ctr_mobile_feed local_ro net but it was neutral. Anything that manipulates types or List/Dict a lot might be promising.)
Reviewed By: suo
Differential Revision: D32923880
fbshipit-source-id: c82ed6689b3598e61047fbcb2149982173127ff0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D33336948
Pulled By: albanD
fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: jbschlosser, malfet
Differential Revision: D32596264
Pulled By: albanD
fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
Summary:
Use `c10::printQuotedString` to escape any characters that might cause the
string to be interpreted as more than one argument by a shell script.
Please note that this codepath is deprecated and is not accessible
through typical PyTorch usage workflows.
This issue was discovered by Daniel Lawrence of the Amazon Alexa team.
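A Python analogue of the concern being fixed, with `shlex.quote` playing the role of `c10::printQuotedString`; the file name is illustrative:
```python
import shlex

filename = "weights; echo injected.pt"   # a name that would confuse a shell
unsafe = f"stat {filename}"              # the shell would see extra commands/arguments
safe = f"stat {shlex.quote(filename)}"   # quoted: treated as a single argument
print(unsafe)   # stat weights; echo injected.pt
print(safe)     # stat 'weights; echo injected.pt'
```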
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70070
Reviewed By: suo
Differential Revision: D33172721
Pulled By: malfet
fbshipit-source-id: 9dbd17f6eb775aaa1a545da42cbc95864c1189ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964
Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf tuning on the codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements can be observed on the PW benchmark.)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR https://github.com/pytorch/pytorch/issues/68804
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/68095
This also changes the files from the ATen folder to include c10's `Export.h` instead since they can't ever be exporting `TORCH_PYTHON_API`.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69585
Reviewed By: mrshenli
Differential Revision: D32958594
Pulled By: albanD
fbshipit-source-id: 1ec7ef63764573fa2b486928955e3a1172150061
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; now we can support dimension collapsing with non-coherent input tensors that have different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels Last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.
Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943
Reviewed By: ngimel
Differential Revision: D32288709
Pulled By: dzhulgakov
fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
Summary:
There were 2 versions of the same code which were slightly different although functionally equivalent.
When adding support for another CUDA / device version, both would need to be changed and kept in sync. So it is better to have only 1 version of it as the unique source of truth.
I chose the implementation which looks cleaner and easier to read and added some minor enhancements and comments to further increase readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55901
Reviewed By: H-Huang
Differential Revision: D31636917
Pulled By: bertmaher
fbshipit-source-id: 622e1fabc39de4f3f1b1aa9a1544cfbd35a5cfd9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
ghstack-source-id: 140044165
Test Plan:
CI
perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65064
The problem appears when nvfuser is triggered from LazyTensor.
Because LT maintains its own thread pool, the thread used for the first-time
compilation does CUDA context initialization properly, but later
cached execution may use a different thread which does not have
a proper CUDA context.
Test Plan: Imported from OSS
Reviewed By: saketh-are
Differential Revision: D31269691
Pulled By: desertfire
fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Do not delete `caffe2::OperatorBase::Output` calls as they have side effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041
Reviewed By: ngimel
Differential Revision: D31360142
Pulled By: malfet
fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
Summary:
Reland of https://github.com/pytorch/pytorch/pull/65242
The last attempt of the reland automatically rebased onto stable, which did not yet have the revert commit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66018
Reviewed By: albanD
Differential Revision: D31348822
Pulled By: soulitzer
fbshipit-source-id: 881d701b404530c1352ac9245bd67264e1652b8a
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954
Reviewed By: ngimel
Differential Revision: D31326599
Pulled By: malfet
fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64000
- updates double backward formula to compute grad wrt output instead of self
- ~~In some of the error messages, we still refer to the dtype of the input, even though we are now checking the dtype of the output~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65242
Reviewed By: malfet
Differential Revision: D31317680
Pulled By: soulitzer
fbshipit-source-id: b3b921e06775cfc12e5a97a9ee8d73aec3aac7c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610
- Replace HIP_PLATFORM_HCC with USE_ROCM
- Dont rely on CUDA_VERSION or HIP_VERSION and use USE_ROCM and ROCM_VERSION.
- In the next PR
- Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
- HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.
cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd
Reviewed By: jbschlosser
Differential Revision: D30909053
Pulled By: ezyang
fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65459
Just run linter on the change and apply all suggestions
Test Plan: N/A
Reviewed By: seemethere
Differential Revision: D31102960
fbshipit-source-id: 04e1d07935690f2ddbc64533661b3e55379d13b5
Summary:
Syncing nvfuser code base from the devel branch. Listing a few of our developments since the last sync:
- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants that are required by the codegen (e.g. reduction axes).
To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.
Internal updates are in files located under:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`
Updates affecting integration:
1. profile_ivalue enabled for nvfuser. Related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols in `aten/src/ATen/core/*` used by codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745
Reviewed By: saketh-are
Differential Revision: D30752939
Pulled By: malfet
fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414
Misuse of a raw pointer here, where the stack is never nullable.
ghstack-source-id: 136938318
Test Plan:
compiles.
Imported from OSS
Reviewed By: ejguan
Differential Revision: D30375410
fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
Summary:
- HIP_VERSION semantic versioning will change in ROCm4.3. The changes essentially remove the dependency on HIP_VERSION provided in the hip header to keep code compatible with older and newer versions of ROCm.
- TORCH_HIP_VERSION is derived from HIP_VERSION_MAJOR and HIP_VERSION_MINOR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62786
Reviewed By: bdhirsh
Differential Revision: D30281682
Pulled By: seemethere
fbshipit-source-id: e41e69fb9e13de5ddd1af99ba5bbdcbb7b64b673
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This is a first step towards creating a context manager that errors out on synchronizing calls.
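A usage sketch assuming the `torch.cuda.set_sync_debug_mode` API that this line of work exposes in current PyTorch (not necessarily in this exact commit):
```python
import torch

torch.cuda.set_sync_debug_mode("warn")     # "default", "warn", or "error"
x = torch.randn(4, device="cuda")
x.nonzero()                                # synchronizing call: warns (or raises in "error" mode)
torch.cuda.set_sync_debug_mode("default")  # restore the usual behavior
```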
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889
Reviewed By: albanD
Differential Revision: D29805280
Pulled By: ngimel
fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Fixed the awkward configerator initialization issue that broke some
tests. Trying again.
Test Plan: predictor comparisons
Reviewed By: ZolotukhinM
Differential Revision: D28859795
fbshipit-source-id: 826801db24e86b1c3594a86e3ac32f0a84c496f7
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.
- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
- NAN / POS_INFINITY / NEG_INFINITY
- Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400
Reviewed By: ejguan
Differential Revision: D28421065
Pulled By: malfet
fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58347
Back out "Revert D27652484 (ac04cc775b): [nnc] Enable CPU fusion inside Facebook"
Original commit changeset: ecfef3ee1e71
ghstack-source-id: 129279584
Test Plan: Tests for bugfix included in this stack
Reviewed By: navahgar
Differential Revision: D28461013
fbshipit-source-id: 79a80b6ffb653ab952ff5efaa143d3362bb7d966
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change was autogenerated by `python tools/clang_tidy.py -s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
On ROCm, the error when compiling grid_reduction.cu was "non-constant-expression cannot be narrowed from type 'int' to 'uint32_t'".
Added a typecast to fix the issue.
Also removed the test skip on ROCm: re-enabling the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55243
Reviewed By: malfet
Differential Revision: D27917066
Pulled By: ngimel
fbshipit-source-id: b0b7c5fc8ecd2624222b35fe060846f7d1670f07
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header". Reland PR https://github.com/pytorch/pytorch/issues/54350.
This reverts commit 204ac21bf1.
The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965
Reviewed By: jbschlosser
Differential Revision: D27755907
Pulled By: malfet
fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D27447811
fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release. Conditionally add support for this feature. Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.
The use of this feature is conditionalized on a new ROCM_VERSION macro.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350
Reviewed By: H-Huang
Differential Revision: D27449031
Pulled By: malfet
fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54571
Supports bfloat16 via a similar method to half: upconvert inputs to
fp32, do math, then downconvert outputs to bf16.
Resource strings are mostly derived from cuda-11 headers.
Fixes #53918, for the legacy fuser at least.
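A Python analogue of the upconvert/compute/downconvert pattern described above (illustrative only; the real work happens in the generated kernel):
```python
import torch

def fused_bf16_op(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a32, b32 = a.float(), b.float()      # upconvert bf16 inputs to fp32
    out32 = torch.sin(a32) * b32 + 1.0   # do the math in fp32
    return out32.bfloat16()              # downconvert the output to bf16

x = torch.randn(4, 4, dtype=torch.bfloat16)
y = torch.randn(4, 4, dtype=torch.bfloat16)
assert fused_bf16_op(x, y).dtype == torch.bfloat16
```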
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D27328987
Pulled By: bertmaher
fbshipit-source-id: 5c0eae44164623faa0c75cb818e8bf0211579fdc
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
nvcc's `--fmad=false` is not valid for the HIP compiler. Upcoming ROCm releases will start treating unrecognized compiler flags as an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50508
Reviewed By: albanD
Differential Revision: D25920291
Pulled By: mrshenli
fbshipit-source-id: c0ff3b74dd07f3d0661ba29efafaab291ef3621c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52182
DynamicLibrary provides a very specific functionality, so there is no need to exposes it to every project depending on `ATen.h`
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D26417404
Pulled By: malfet
fbshipit-source-id: f8318cacb07dcc8b2f95984f88ea1df4e5369b8b
Summary:
Clang from Xcode does not support the `-fopenmp` option, so there is no need to try to compile with it.
Infer whether OpenMP is supported by checking the `_OPENMP` define.
Also, use the clang compiler if the host app was compiled with clang rather than gcc.
Fix a few range-loop warnings and add static_asserts that range-loop variables are raw pointers.
This change makes the fuser tests on OS X a bit faster.
Before:
```
% python3 test_jit.py -v TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
warning: pytorch jit fuser failed to compile with openmp, trying without it...
ok
----------------------------------------------------------------------
Ran 1 test in 0.468s
OK
```
After:
```
% python3 test_jit.py -v TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.435s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51504
Reviewed By: smessmer
Differential Revision: D26186875
Pulled By: malfet
fbshipit-source-id: 930b3bcf543fdfad0f493d687072aaaf5f9e2bfc
Summary:
CUDA TK >= 11.1 provides a ptxjitcompiler that emits SASS instead of PTX.
1. This gives better backward compatibility, allowing a future TK to work with an older driver, which might not necessarily be able to load the generated PTX through JIT compilation and would error out at runtime;
https://docs.nvidia.com/deploy/cuda-compatibility/#using-ptx
2. Meanwhile, SASS doesn't provide good forward compatibility, so for unsupported archs we fall back to PTX to support future devices.
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cubin-compatibility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50319
Reviewed By: malfet
Differential Revision: D26114475
Pulled By: ngimel
fbshipit-source-id: 046e9e7b3312d910f499572608a0bc1fe53feef5
Summary:
Sub-step of my attempt to split up the torch_cuda library, as it is huge. Please look at https://github.com/pytorch/pytorch/issues/49050 for details on the split and which files are in which target.
This PR introduces two new macros for Windows DLL purposes, TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API. Both are defined as TORCH_CUDA_API for the time being.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50627
Reviewed By: mruberry
Differential Revision: D25955441
Pulled By: janeyx99
fbshipit-source-id: ff226026833b8fb2fb7c77df6f2d6c824f006869
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228
`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->'
'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we didn't need it, so we can take a
reference instead of a new `shared_ptr`.
ghstack-source-id: 119782961
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837374
fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.
Fixes https://github.com/pytorch/pytorch/issues/49299
I still need to look into a handful of failing tests, but maybe it can be a discussion basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433
Reviewed By: ngimel
Differential Revision: D25681374
Pulled By: Krovatkin
fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48868
Building on the previous diff, we can make `toTensor()` return a
`const Tensor&`, which should make it easier to avoid reference
counting.
ghstack-source-id: 119327372
Test Plan: internal benchmarks.
Reviewed By: bwasti
Differential Revision: D25325379
fbshipit-source-id: ca699632901691bcee432f595f75b0a4416d55dd
Summary:
The command used here is essentially `where cl.exe`. By using `system()` we will not be able to find cl.exe unless we are using the VS Developer Prompt, which makes `activate()` meaningless. Changing `system()` to `run()` fixes this.
Found during https://github.com/pytorch/pytorch/issues/49781.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50085
Reviewed By: smessmer
Differential Revision: D25782054
Pulled By: ezyang
fbshipit-source-id: e8e3cac903a73f3bd78def667ebe0e93201814c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47665
Re-enabled the pass that removes outputs from the fusion which are only used by aten::size;
Added size computation for reduction ops via the new operator prim::ReductionSizes;
Test Plan: Imported from OSS
Reviewed By: navahgar, jamesr66a
Differential Revision: D25254675
Pulled By: Krovatkin
fbshipit-source-id: e9a057b0287ed0ac93b415647fd8e5e836ba9856
Summary:
Convert NVFuser's runtime CUDA sources (under `.../jit/codegen/cuda/runtime`) to string literals, then include the headers with the generated literals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48283
Reviewed By: mrshenli
Differential Revision: D25163362
Pulled By: ngimel
fbshipit-source-id: 4e6c181688ddea78ce6f3c754fee62fa6df16641
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48200
CUDA 11.0 only supports < sm_80 (https://docs.nvidia.com/cuda/archive/11.0/nvrtc/#group__options)
Note: the NVRTC documentation is not a reliable source for querying supported architectures. The rule of thumb is that nvrtc supports the same set of archs as nvcc, so the best way to query that is something like `nvcc -h | grep -o "compute_[0-9][0-9]" | sort | uniq`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48455
Reviewed By: zhangguanheng66
Differential Revision: D25255529
Pulled By: ngimel
fbshipit-source-id: e84cf51ab50519b4c97dad063cc43c9194942bb2
Summary:
This PR aims to reduce the import overhead and symbol noise from the `windows.h` headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009
Reviewed By: gchanan
Differential Revision: D25045840
Pulled By: ezyang
fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023
DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430
Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect
Reviewed By: dzhulgakov
Differential Revision: D24605460
fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47959
Taking a shared_ptr by value incurs refcounting overhead and should only be done if the callee needs to take ownership. Otherwise, `const T&` is more efficient. (Specifically, you will have to do an atomic decrement when the argument is destroyed and probably an atomic increment as well. Passing by `const T&` also takes one less register than passing `std::shared_ptr<T>`, but that's less important.)
This diff fixes just this one function, but I'd be happy to audit & fix this whole file in future diffs. Thoughts?
ghstack-source-id: 116914899
Test Plan: build ATen-cpu
Reviewed By: Krovatkin
Differential Revision: D24970954
fbshipit-source-id: 6bdb4b710a94b8baf4ad63418fb38136134e0ef3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46092
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.
This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203
(Note: this ignores all push blocking failures!)
Test Plan:
vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/
after c10::optional fix:
https://www.internalfb.com/intern/fblearner/details/231391773/
Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression.
Reviewed By: ezyang
Differential Revision: D24219944
fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50