Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54571
Supports bfloat16 via a similar method to half: upconvert inputs to
fp32, do math, then downconvert outputs to bf16.
Resource strings are mostly derived from cuda-11 headers.
Fixes #53918, for the legacy fuser at least.
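As a rough illustration of the numerics (a sketch in eager Python rather than the generated CUDA resource strings), the pattern is:
```
import torch

# Sketch only: mimic the fuser's bf16 handling in eager mode --
# upconvert the bf16 inputs to fp32, do the math, downconvert the result.
x = torch.randn(8, dtype=torch.bfloat16)
y = torch.randn(8, dtype=torch.bfloat16)
out = (x.float() * y.float() + 1.0).to(torch.bfloat16)
print(out.dtype)  # torch.bfloat16
```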
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D27328987
Pulled By: bertmaher
fbshipit-source-id: 5c0eae44164623faa0c75cb818e8bf0211579fdc
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with errors like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that these tests are no longer flaky (see the TF32 sketch after this list).
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`
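For illustration, a rough sketch of what the TF32 handling automates (assuming an Ampere-class GPU; the real tests use the `tf32_on_and_off` decorator rather than toggling the backend flag by hand):
```
import torch

def max_matmul_error() -> float:
    a = torch.randn(64, 64, device="cuda")
    b = torch.randn(64, 64, device="cuda")
    ref = (a.double() @ b.double()).float()   # fp64 reference
    return (a @ b - ref).abs().max().item()

torch.backends.cuda.matmul.allow_tf32 = False
print(max_matmul_error())   # roughly 1e-5: fp32 matmul meets the tight default threshold

torch.backends.cuda.matmul.allow_tf32 = True
print(max_matmul_error())   # roughly 1e-2: TF32 needs the looser thresholds bumped above
```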
cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240
Reviewed By: mruberry
Differential Revision: D23882498
Pulled By: ngimel
fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43631
I added a new test for just profiler stuff - I don't think the test should go in test_jit.py. Maybe this should just go in test_tensorexpr_fuser, but I'm not really testing tensorexpr stuff either... LMK
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358810
Pulled By: eellison
fbshipit-source-id: 074238e1b60e4c4a919a052b7a5312b790ad5d82
Summary:
fmax/fmin propagate the number if one argument is NaN, which doesn't match the eager mode behavior.
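A quick eager-mode illustration of the semantic difference (numpy's `fmax` follows the C/CUDA `fmax` convention):
```
import numpy as np
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([float("nan"), 2.0])

# Eager PyTorch propagates NaN elementwise...
print(torch.max(a, b))                 # tensor([nan, nan])
# ...while C/CUDA-style fmax returns the non-NaN operand, so the fuser
# must not lower aten::max/aten::min to fmax/fmin.
print(np.fmax(a.numpy(), b.numpy()))   # [1. 2.]
```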
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43590
Reviewed By: mruberry
Differential Revision: D23338664
Pulled By: bertmaher
fbshipit-source-id: b0316a6f01fcf8946ba77621efa18f339379b2d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40142
test_jit is becoming huge again, which makes it hard for editors to load and
hard to write new tests; this splits out the tracer-related tests.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D22085035
Pulled By: wanchaol
fbshipit-source-id: 696bee84985ecfbfeac8e2ee5c27f1bdda8de394
Summary:
After an early return, we conditionalize all further execution. This means that currently the pattern of
`if return elif return elif return` generates better code than `if return if return if return`. It's obviously not good to have semantically equivalent code generate worse IR, so we should rewrite the graph to handle this case. This came up in https://github.com/pytorch/pytorch/pull/37171
```
@torch.jit.script
def test_foo(x: bool, y: bool):
    if x:
        return 1
    return 2

print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
    y: bool) -> int:
  _0 = uninitialized(int)
  if x:
    _1, _2 = True, 1
  else:
    _1, _2 = False, _0
  if _1:
    _3 = _2
  else:
    _3 = 2
  return _3
```
while
```
@torch.jit.script
def test_foo(x: bool, y: bool):
    if x:
        return 1
    else:
        return 2

print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
    y: bool) -> int:
  if x:
    _0 = 1
  else:
    _0 = 2
  return _0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38282
Differential Revision: D21576733
Pulled By: eellison
fbshipit-source-id: 80cf1ad7fbda6d8d58557abbfb21c90eafae7488
Summary:
The existing context manager only enabled profiling mode conditionally, which was counterintuitive. When we changed the default executor, it broke internal benchmarking as a result.
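A minimal sketch of an unconditional version, assuming the internal `torch._C._jit_set_profiling_*` flags and that the setters return the previous value (the real helper lives in the test utilities):
```
from contextlib import contextmanager

import torch

@contextmanager
def enable_profiling_mode():
    # Unconditionally switch the profiling executor on, restoring the previous
    # state on exit.
    prev_exec = torch._C._jit_set_profiling_executor(True)
    prev_mode = torch._C._jit_set_profiling_mode(True)
    try:
        yield
    finally:
        torch._C._jit_set_profiling_executor(prev_exec)
        torch._C._jit_set_profiling_mode(prev_mode)
```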
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37825
Differential Revision: D21404611
Pulled By: eellison
fbshipit-source-id: 306b3c333ef4eb44ab6a6e5ab4e0682e5ce312ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34258
This PR allows both atol and rtol to be specified, uses defaults based on the prior analysis (spreadsheet attached to https://github.com/pytorch/pytorch/pull/32538), but retains the absolute tolerance behavior in cases where precision was previously specified explicitly.
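The combined check this enables has the usual `isclose` form; a small sketch:
```
import torch

actual = torch.tensor([100.001, 1.0001])
expected = torch.tensor([100.0, 1.0])
atol, rtol = 1e-5, 1e-4

# |actual - expected| <= atol + rtol * |expected|
ok = (actual - expected).abs() <= atol + rtol * expected.abs()
print(ok)                                                        # tensor([True, True])
print(torch.allclose(actual, expected, rtol=rtol, atol=atol))    # True
```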
Test Plan: Imported from OSS
Differential Revision: D21110255
Pulled By: nairbv
fbshipit-source-id: 57b3a004c7d5ac1be80ee765f03668b1b13f4a7e
Summary:
This test was failing because caching resulted in a function with multiple execution plans rather than multiple functions with a single execution plan each, as the test writer intended.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35847
Differential Revision: D20839674
Pulled By: Krovatkin
fbshipit-source-id: 68f41610a823d94c1e744c85ac72652c741d73ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34980
We were passing sample inputs to `torch.jit.script` (as if it were
`torch.jit.trace`), but that argument was treated as the optional
`optimize` parameter. That parameter is deprecated, and that caused a
warning.
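Roughly, the fix amounts to the following pattern (sketch with a made-up `fn`):
```
import torch

def fn(x, y):
    return x + y

x, y = torch.randn(3), torch.randn(3)

# Before: example inputs were passed as if scripting were tracing; the extra
# argument landed in the deprecated `optimize` slot and emitted a warning.
# scripted = torch.jit.script(fn, (x, y))

# After: script takes just the callable; only trace needs example inputs.
scripted = torch.jit.script(fn)
traced = torch.jit.trace(fn, (x, y))
```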
Differential Revision: D20520369
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: 87b40a5e35bfc4a3d7a5d95494632bfe117e40b7
Summary:
With the profiling executor enabled the fuser won't be invoked until the second pass over a script function, so some of these tests weren't correctly comparing the fused output with the interpreter output. I've used the `checkScript` method where applicable, which seems to do the right thing.
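For reference, a sketch of the pattern (the test class and function here are hypothetical); `checkScript` runs the function both eagerly and as TorchScript and compares the outputs:
```
import torch
from torch.testing._internal.jit_utils import JitTestCase

class TestFuserExample(JitTestCase):   # hypothetical test class
    def test_relu_mul(self):
        def fn(x, y):
            return torch.relu(x * y)

        x = torch.randn(4, 4, device="cuda")
        y = torch.randn(4, 4, device="cuda")
        # Compares the eager result against the scripted (and possibly fused) result.
        self.checkScript(fn, (x, y))
```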
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33944
Test Plan: Locally inject obvious errors into the fuser and verify that the updated tests fail when they're supposed to.
Differential Revision: D20162320
Pulled By: bertmaher
fbshipit-source-id: 4a2f3f2d2ff1d81f23db504dc8cd0d5417bdcc50
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445
Create distributed and rpc directories under caffe/test for better management
of unit tests.
Differential Revision: D18702786
fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31071
Previously the profiler would think Tensors required grad even when the
no_grad flag was enabled during execution. This makes the profiling
and guards respect the no_grad flag, which eliminates extra differentiable
graphs that appear in the backward graph (where no_grad is typically enabled).
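A small example of the behavior being profiled (sketch; the interesting part is the recorded requires_grad of the output):
```
import torch

@torch.jit.script
def f(x):
    return x * 2

x = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = f(x)

# Although x requires grad, anything computed under no_grad does not, so the
# profiled graph should not be guarded as differentiable here.
print(y.requires_grad)  # False
```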
Test Plan: Imported from OSS
Differential Revision: D18915468
Pulled By: zdevito
fbshipit-source-id: 1ae816a16ab78ae5352825cc6b4a68ed7681a089
Summary:
These unit tests pass after landing all the warp size awareness patches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25963
Differential Revision: D17319124
Pulled By: bddppq
fbshipit-source-id: 22f5d5f1ca9c67e66a7ccf983b2d2f889a74e729
Summary:
As of ROCm 2.6, we support hiprtc - the HIP runtime compilation API. Enable the jit fusion feature depending on the existence of such an API. This entails
* new hipification rules for API_RTC
* add hiprtc APIs to the shim loader
* update cmake infrastructure to find the hiprtc library (it is part of the HIP package)
* enabling of unit tests in the jit_fuser test set
* special casing in resource strings for HIP - the typedefs CUDA requires would be redundant
* for now, disable the occupancy calculation we do not support yet and use a hard-coded value instead
Thanks to t-vi for working with me on getting this integration done!
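To give a flavor of the API_RTC hipification rules mentioned above, an illustrative (non-exhaustive) sketch of the kind of renames involved; the exact rule table lives in the hipify scripts:
```
# Illustrative only: NVRTC -> hiprtc renames of the kind the new API_RTC
# hipification rules cover (names taken from the public RTC APIs).
API_RTC_MAPPINGS = {
    "nvrtcCreateProgram": "hiprtcCreateProgram",
    "nvrtcCompileProgram": "hiprtcCompileProgram",
    "nvrtcGetProgramLog": "hiprtcGetProgramLog",
    "nvrtcGetPTX": "hiprtcGetCode",
    "nvrtcDestroyProgram": "hiprtcDestroyProgram",
}

def hipify_rtc(source: str) -> str:
    for cuda_name, hip_name in API_RTC_MAPPINGS.items():
        source = source.replace(cuda_name, hip_name)
    return source
```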
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22872
Differential Revision: D17207425
Pulled By: bddppq
fbshipit-source-id: 93409f3051ad0ea06afacc2239fd6c402152debe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23799
Before, we inlined as part of the initial IR generation process, which
has a few disadvantages:
1. It loses information about what nodes came from which function/method
calls. Other parties who want to implement transformations on the
function/module level don't have a reliable way of doing so.
2. It duplicates a ton of code if we are inlining the same
function/method a ton of times.
After this PR, inlining is deferred to the optimization stage, so
optimizations that rely on inlining will still work. But things get
serialized with the function/method calls left in.
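A quick way to see the new behavior (sketch using the internal inline pass, `torch._C._jit_pass_inline`):
```
import torch

def helper(x):
    return x + 1

@torch.jit.script
def f(x):
    return helper(x) * 2

g = f.graph
print(g)                         # the call to `helper` stays as prim::CallFunction
torch._C._jit_pass_inline(g)     # inlining is now an explicit optimization pass
print(g)                         # the callee's body has been folded in
```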
Differential Revision: D16652819
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Pulled By: suo
fbshipit-source-id: a11af82aec796487586f81f5a9102fefb6c246db
Summary:
This PR:
- Moves clamp from autodiff cpp to symbolic script
- Adds an additional tuple lowering pass to the graph executor
- Updates clamp backwards to be maximally gradient preserving
Moving clamp to symbolic script presented two challenges:
- When the backward graph is defined, the branch taken in the conditional is known, but communicating this information to the Jit is a little tricky. It turns out the Jit has a quirk where variables that can be None at the time of graph instantiation are treated as constants, so testing min and max against None lets the Jit instantiate only one branch. It might be more natural to select different backward functions for these cases, but that is not yet supported.
- Moving clamp to symbolic script introduced an extra tuple construction and immediate unpacking which prevented fusion. This was dealt with by adding an additional tuple removal pass. This issue could appear whenever a symbolic script's return value was defined in an if statement, which made the Jit see the unpacked tuple as being constructed from an if, not a TupleConstruct. The graph is later optimized but tuple lowering was not performed again after these optimizations.
Moving clamp to symbolic script also adds some explicit conversions to float in the graphs in which it appears, but these seem harmless.
If clamp were simply moved to symbolic script then its backward graphs would look like this:
```
graph(%0 : Float(*, *),
      %1 : AutogradZeroTensor,
      %2 : Float(*, *),
      %3 : int[]?,
      %4 : Scalar?,
      %5 : int):
  %6 : None = prim::Constant() # <string>:5:31
  %7 : float = aten::Float(%5) # <string>:12:37
  %8 : Float(*, *) = prim::FusionGroup_0(%0, %2, %7)
  %9 : (Float(*, *), None, None) = prim::TupleConstruct(%8, %6, %6)
  %10 : Float(*, *), %11 : None, %12 : None = prim::TupleUnpack(%9)
  return (%10)
with prim::FusionGroup_0 = graph(%0 : Float(*, *),
      %1 : Float(*, *),
      %2 : float):
  %3 : Bool(*, *) = aten::le(%1, %2) # <string>:12:29
  %mask.5 : Float(*, *) = aten::type_as(%3, %1) # <string>:12:29
  %5 : Float(*, *) = aten::mul(%0, %mask.5) # <string>:13:28
  return (%5)
```
Adding the additional pass to remove tuples eliminates the prim::TupleConstruct and prim::TupleUnpack. Keeping them would previously have caused test_fuser_iou to fail because multiple fusion groups would be created; since https://github.com/pytorch/pytorch/issues/23372 that test is disabled, however. When enabled, the relevant portion of its graph is now:
```
%59 : float = aten::Float(%26) # <string>:314:38
%60 : float = aten::Float(%27) # <string>:314:61
%61 : int[] = aten::size(%14) # <string>:41:99
%62 : int[] = aten::size(%11) # <string>:42:100
%63 : int[] = aten::size(%15) # <string>:41:99
%64 : int[] = aten::size(%12) # <string>:42:100
%65 : Tensor, %66 : Tensor, %67 : Tensor, %68 : Tensor, %69 : Tensor, %70 : Tensor, %71 : Tensor, %72 : Tensor, %73 : Double(*, *) = prim::FusionGroup_0(%w.1, %13, %16, %23, %h.1, %54, %inter.1, %0, %12, %15, %18, %17, %29, %11, %14, %60, %59)
%74 : Tensor = aten::_grad_sum_to_size(%73, %53)
%75 : Tensor = aten::_grad_sum_to_size(%73, %52)
%grad_self.10 : Tensor = aten::_grad_sum_to_size(%65, %61) # <string>:41:30
%grad_other.10 : Tensor = aten::_grad_sum_to_size(%66, %62) # <string>:42:31
%78 : Tensor = prim::FusionGroup_1(%grad_self.10, %74, %36)
%79 : Tensor = prim::FusionGroup_2(%grad_other.10, %75, %44)
%grad_self.14 : Tensor = aten::_grad_sum_to_size(%67, %21) # <string>:33:30
%grad_other.14 : Tensor = aten::_grad_sum_to_size(%68, %22) # <string>:34:31
%grad_self.12 : Tensor = aten::_grad_sum_to_size(%69, %63) # <string>:41:30
%grad_other.12 : Tensor = aten::_grad_sum_to_size(%70, %64) # <string>:42:31
%grad_self.16 : Tensor = aten::_grad_sum_to_size(%71, %19) # <string>:33:30
%grad_other.16 : Tensor = aten::_grad_sum_to_size(%72, %20) # <string>:34:31
%86 : Tensor, %87 : Tensor = prim::FusionGroup_3(%grad_self.12, %grad_self.16, %74, %39)
%88 : Tensor, %89 : Tensor = prim::FusionGroup_4(%grad_other.12, %grad_other.16, %75, %47)
return (%79, %88, %89, %78, %86, %87, %grad_self.14, %grad_other.14)
```
Which I think is expected/desired.
Finally, this implementation of clamp backwards is "maximally gradient preserving," which simply means that elements on the boundary now receive gradients. For example, if an element of a tensor is 5 and the clamp is to [2, 5], then that element will now receive a gradient. The prior implementation would zero these gradients. See https://github.com/pytorch/pytorch/issues/7002 for a discussion on preserving gradients.
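A sketch of the boundary behavior in plain Python (not the actual symbolic script):
```
import torch

def clamp_backward(grad_out, x, min_val: float, max_val: float):
    # "Maximally gradient preserving": use <=/>= so boundary elements keep a gradient.
    mask = (x >= min_val) & (x <= max_val)
    return grad_out * mask.to(grad_out.dtype)

x = torch.tensor([1.0, 2.0, 5.0, 6.0])
g = torch.ones_like(x)
print(clamp_backward(g, x, 2.0, 5.0))  # tensor([0., 1., 1., 0.]) -- 2 and 5 get gradients
```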
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23927
Test Plan: Existing tests provided sufficient coverage.
Differential Revision: D16739740
Pulled By: mruberry
fbshipit-source-id: c94291d20e1f3f25197afc7b74dc61aeb204b074
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/22833
grad_sum_to_size does not commute with AutogradAdd after all because it turns the broadcasting AutogradAdd into a broadcasting add.
Chillee actually did most of the work tracking this down to the fusion of grad_sum_to_size, and pinged me when he had found the cause. Thank you!
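A toy example of why the two don't commute (sketch using `Tensor.sum_to_size`; the shapes are made up):
```
import torch

a = torch.ones(2, 3)   # gradient that needs summing down to size [3]
b = torch.ones(3)      # gradient that is already size [3]

lhs = a.sum_to_size(3) + b        # sum first, then add: b is counted once
rhs = (a + b).sum_to_size(3)      # add first: b broadcasts and gets summed twice
print(lhs)   # tensor([3., 3., 3.])
print(rhs)   # tensor([4., 4., 4.])
```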
About the choice of removing the fusion completely instead of being more precise:
- We do have grad_sum_to_size elimination which works for cases where broadcasting does not actually happen in the forward, so the cases where the fusing of grad_sum_to_size is actually beneficial is much smaller than when initially proposed.
- There will be less fusion; in terms of the tests, IOU stops being fully fused. I vaguely think that it is a case we could handle with more refined logic.
- Keeping it would add complexity in checking when to merge fusion groups to the complexities that this PR removes.
- The future of fusion probably lies more in more complete solutions including reductions (TVM or KeOps or our own or ...).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23372
Differential Revision: D16489930
Pulled By: soumith
fbshipit-source-id: bc0431b0d3eda264c401b634675872c4ce46f0f4
Summary:
This pull request adds the necessary Windows DLL code to be able to support JIT fusion for CUDA. CPU JIT Fusion isn't supported. This also adds all the non-CPU JIT tests back in on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21861
Differential Revision: D15940939
Pulled By: soumith
fbshipit-source-id: e11f6af1ac258fcfd3a077e6e2f2e6fa38be4ef1
Summary:
When kwargs are specified in a test defined via common_method_invocations, it doesn't work if there isn't also a positional argument (`{'foo':'foo'}` without a positional arg generates a python call like: `self.method(, foo=foo)`, erroring on the `,`). I wanted to test something in a different PR and noticed I couldn't.
Also fixed some flake8 warnings I was seeing locally.
I replaced `lambda x: x` with `ident` since it seems a bit cleaner to me, but happy to revert that if others don't agree?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21499
Differential Revision: D15826974
Pulled By: nairbv
fbshipit-source-id: a3f37c80ba2303c7d9ae06241df06c7475b64e36
Summary:
This PR eliminates unneeded grad_sum_to_size calls and in particular speeds up the LSTM backward by allowing better fusion.
It consists of two parts:
- In AutoDiff, record broadcasting sizes only if the broadcast output size is different from the input size, otherwise record None.
- The specialization of Optional arguments (#18407) allows us to then eliminate `_grad_sum_to_size(t, None)` in the peephole optimization step.
Thus, in the LSTM case, no SumToSize nodes remain in the crucial fusion group. The trick here is that we can specialize on the runtime information from the forward.
I'm testing that different broadcasting situations lead to different graphs.
I didn't move all symbolic_script _grad_sum_to_size to the new logic, but it might be better to do this incrementally, anyway.
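A Python-level sketch of the recording rule (the real change is in the AutoDiff C++; the helper name here is made up):
```
from typing import List, Optional

def grad_sum_to_size_arg(input_size: List[int], output_size: List[int]) -> Optional[List[int]]:
    # Only record a target size if broadcasting actually happened in the forward;
    # otherwise record None so the peephole pass can drop _grad_sum_to_size(t, None).
    return None if input_size == output_size else input_size

print(grad_sum_to_size_arg([4, 3], [4, 3]))   # None -> the backward sum-to-size is eliminated
print(grad_sum_to_size_arg([1, 3], [4, 3]))   # [1, 3] -> a real broadcast, keep it
```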
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18697
Differential Revision: D15482076
Pulled By: wanchaol
fbshipit-source-id: 7f89367e35b8729910077c95c02bccefc8678afb
Summary:
I believe the existing check in FuseGraph was only `false` if PyTorch was built with NO_CUDA=1. Otherwise, we would create fusion groups even if we're on a CPU-only machine running CPU code. This is confusing. Instead I've made it so that the decision to fuse or not is dependent on if the producer Value is a known CPU tensor. If it is, we skip fusion.
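One way to observe the intended behavior (sketch; exact graph-inspection details may vary by executor):
```
import torch

@torch.jit.script
def f(x, y):
    return (x * y + y).sigmoid()

x, y = torch.randn(8), torch.randn(8)
for _ in range(3):   # a few runs so the executor optimizes the graph
    f(x, y)

graph = torch.jit.last_executed_optimized_graph()
# On a CPU-only run no prim::FusionGroup should be created for this function.
print(any(n.kind() == "prim::FusionGroup" for n in graph.nodes()))  # False on CPU
```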
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19342
Differential Revision: D15038351
Pulled By: jamesr66a
fbshipit-source-id: fce9d83929309a7bf14346833f84b996f3e7f6db
Summary:
Partially fuse layer_norm by decomposing it into the batchnorm kernel that computes the stats, and then fusing the affine operations that follow the reduce operations. This is similar to the batchnorm fusion that apaszke did; it also only works in inference mode for now.
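Numerically, the decomposition looks roughly like this (eager sketch, inference only; the fusible piece is the affine tail):
```
import torch

def layer_norm_decomposed(x, weight, bias, eps: float = 1e-5):
    # Stats reduction (the part handled by the batchnorm-style kernel)...
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # ...followed by the pointwise affine part, which the fuser can pick up.
    return x_hat * weight + bias

x = torch.randn(4, 16)
w, b = torch.ones(16), torch.zeros(16)
print(torch.allclose(layer_norm_decomposed(x, w, b),
                     torch.nn.functional.layer_norm(x, (16,), w, b),
                     atol=1e-6))  # True
```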
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18266
Differential Revision: D14879877
Pulled By: wanchaol
fbshipit-source-id: 0197d8f2a17ec438d3e53f4c411d759c1ae81efe