Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228
`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->' 'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
was never saved anywhere, so we don't need ownership and can take a
reference instead of creating a new `shared_ptr`.
ghstack-source-id: 119782961
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837374
fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48322
Disable the old fuser internally. I would like to find where we are inadvertently enabling the old fuser, but in the meantime I would like to land a diff that I know will guarantee it does not run, and verify that it fixes the issue.
Test Plan: sandcastle
Reviewed By: ZolotukhinM
Differential Revision: D25126202
fbshipit-source-id: 5a4d0742f5f829e536f50e7ede1256c94dd05232
Summary:
Raise and assert used to produce a hard-coded error message, "Exception"; the user-provided error message was ignored. This PR adds support for representing the user's error message in TorchScript.
This breaks backward compatibility because now we actually need to script the user's error message, which can potentially contain unscriptable expressions. Such programs can break when scripting, but saved models can still continue to work.
Increased an op count in test_mobile_optimizer.py because now we need aten::format to form the actual exception message.
This is built upon a WIP PR: https://github.com/pytorch/pytorch/pull/34112 by driazati
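As a rough sketch of the new behavior (function name and message are illustrative):
```
import torch

@torch.jit.script
def check_positive(x: torch.Tensor) -> torch.Tensor:
    if bool((x <= 0).any()):
        # Previously this message was replaced by a generic "Exception";
        # with this change it is scripted and surfaces in the raised error.
        raise ValueError("expected strictly positive values")
    return x
```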
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41907
Reviewed By: ngimel
Differential Revision: D22778301
Pulled By: gmagogsfm
fbshipit-source-id: 2b94f0db4ae9fe70c4cd03f4048e519ea96323ad
Summary:
Fixes an issue where
- the top-level fuser's block_ was captured by the callback due to the [&] capture, and
- recursive/nested fusers would erroneously compare against the top-level block_ instead of their own block_.
Closes https://github.com/pytorch/pytorch/issues/39810
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41560
Reviewed By: Krovatkin
Differential Revision: D22583196
Pulled By: wconstab
fbshipit-source-id: 8f543cd9ea00e116cf3e776ab168cdd9fed69632
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37106
Recomputing the aliasdb on every fusion iteration and in every subblock
is hugely expensive. Instead, update it in place when doing fusion.
The graph fuser pass operates by pushing nodes into a fusion group. So
we start with
```
x, y = f(a, b, c)
```
and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
  x_in, y_in = f(a_in, b_in, c_in)
  -> x_in, y_in
```
We destroy the `x` and `y` `Value*`s in the process. This operation is
easy to express as an update to the aliasDb: `x_out` just takes on all
the aliasing information `x` used to have. In particular, since we know
`f` and `prim::fusionGroup` are purely functional, we don't have to mess
with any write information.
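As a minimal Python sketch of that update, assuming alias information is tracked as per-value sets (names are hypothetical):
```
def transfer_aliasing(alias_sets, old_value, new_value):
    # new_value takes on all the aliasing information old_value had;
    # nothing else in the database needs to be recomputed.
    alias_sets[new_value] = alias_sets.pop(old_value)
```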
This PR is the bare minimum to get this working, in the interest of
unscrewing the compilation times ASAP.
Followups I want to do:
- We don't have a way of expressing deletion of values in AliasDb. In
`graph_fuser.cpp` we sometimes construct nodes that we end up throwing
away, and we are littering `MemoryDAG` with references to dangling
pointers. Because of the way the pass works, it's fine, but this is
fragile so I want to fix it.
- We should decouple alias analysis from write tracking, to simplify the
job of keeping the write caches consistent as we mutate the aliasing
information.
- The tensorexpr fuser doesn't do this and is thus incorrect today; we
need to update it to work the same way.
Test Plan: Imported from OSS
Differential Revision: D21219179
Pulled By: suo
fbshipit-source-id: 8ae5397b3a0ad90edec2fbc555647091f1ad5284
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115
This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.
Testing:
Ran the script, CI.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D20568523
Pulled By: SplitInfinity
fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33705
The fact that there were two overloads appears to be a historical
artifact dating back to when goldsborough originally added these
bindings. If TensorOptions is made optional,
then you only need one overload, not two, as they are exactly redundant
with each other. When MemoryFormat was added, collapsing the overloads
became a little harder, as the C++ syntax at::empty_like(t, memory_format)
would not have worked; but now it works because TensorOptions
supports MemoryFormat.
The upshot is that I can get rid of the extra overloads and have just one.
Amazingly, this change is backwards compatible, as the test attests. While
I was at it, I also deleted the overload name from the functions entirely.
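A quick Python-level illustration of the collapsed signature (values chosen arbitrarily): TensorOptions fields and memory_format now ride along in a single call.
```
import torch

t = torch.randn(2, 3)
u = torch.empty_like(t, dtype=torch.float64,
                     memory_format=torch.contiguous_format)
assert u.dtype == torch.float64 and u.shape == t.shape
```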
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20073355
Pulled By: bhosmer
fbshipit-source-id: c6a8908213b32ccf6737ea864d135e2cce34f56b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32682
This moves code around so that operator.h/cpp no longer requires a full
definition of Node*, nor does it include alias analysis or the pretty printer.
This should make it possible to include in the mobile build.
Functionality for checking whether an operator matches a Node, and for
looking up the operator for a Node, has moved to the Node object.
Test Plan: Imported from OSS
Differential Revision: D19615386
Pulled By: zdevito
fbshipit-source-id: e38bdf29971183597ef940d061c06ba56e71d9c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27561
Adds a memory_format keyword argument (positional for C++).
'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If not (1) and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) In all other cases, the output tensor will be contiguous.
---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.
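A small example of rule (1), using a channels-last input for illustration:
```
import torch

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
# x is non-overlapping and dense, so the output keeps x's strides.
y = torch.empty_like(x, memory_format=torch.preserve_format)
assert y.stride() == x.stride()
```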
Test Plan: Imported from OSS
Differential Revision: D17980316
Pulled By: VitalyFedyunin
fbshipit-source-id: 2a1d47571268673de0c6f5ae1b6d4f9110962ab0
Summary:
This PR excises the last of SymbolicVariable. There should be no change in functionality. One new test for addmm fusion was added. A case where the peephole optimizer might convert a scalar argument remains untested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25077
Test Plan: Refactors existing code so mostly covered by current tests. One test for addmm fusion was added.
Differential Revision: D17145334
Pulled By: mruberry
fbshipit-source-id: 6b68faf764f9ee8398b55c43110228ed9faf81eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24284
This PR finishes the unification of all Tensor types into a single object.
ProfiledTensorType is renamed to TensorType and the old TensorType is
deleted.
Notes:
* Fixes a bug in merge for VaryingShape by changing its representation to an
optional list of optional ints.
* Removes ProfiledTensorType::create(type) invocations that can now
simply be `expect` calls on TensorType.
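A Python sketch of the new VaryingShape representation; the merge rule shown is an assumption based on the description above:
```
from typing import List, Optional

# None = unknown rank; otherwise one entry per dimension, where each
# entry is a known size or None for an unknown size.
VaryingShape = Optional[List[Optional[int]]]

def merge(a: VaryingShape, b: VaryingShape) -> VaryingShape:
    if a is None or b is None or len(a) != len(b):
        return None  # unknown or mismatched rank
    # Dimensions that disagree become unknown (None).
    return [da if da == db else None for da, db in zip(a, b)]
```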
Test Plan: Imported from OSS
Differential Revision: D16794034
Pulled By: zdevito
fbshipit-source-id: 10362398d0bb166d0d385d74801e95d9b87d9dfc
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/22833
grad_sum_to_size does not commute with AutogradAdd after all because it turns the broadcasting AutogradAdd into a broadcasting add.
Chillee actually did most of the work here, tracking this down to the fusion of grad_sum_to_size and pinging me when he had found the cause. Thank you!
About the choice of removing the fusion completely instead of being more precise:
- We do have grad_sum_to_size elimination, which works for cases where broadcasting does not actually happen in the forward, so the set of cases where fusing grad_sum_to_size is actually beneficial is much smaller than when it was initially proposed.
- There will be less fusion; in terms of the tests, IOU stops being fully fused. I vaguely think that it is a case we could handle with more refined logic.
- Keeping it would add complexity in checking when to merge fusion groups, on top of the complexities that this PR removes.
- The future of fusion probably lies more in more complete solutions including reductions (TVM or KeOps or our own or ...).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23372
Differential Revision: D16489930
Pulled By: soumith
fbshipit-source-id: bc0431b0d3eda264c401b634675872c4ce46f0f4
Summary:
This PR eliminates unneeded grad_sum_to_size calls and in particular speeds up the LSTM backward by allowing better fusion.
It consists of two parts:
- In AutoDiff, record broadcasting sizes only if the broadcast output size differs from the input size; otherwise record None.
- The specialization of Optional arguments (#18407) then allows us to eliminate `_grad_sum_to_size(t, None)` in the peephole optimization step.
Thus, in the LSTM case, no SumToSize remains in the crucial fusion group. The trick here is that we can specialize on the runtime information from the forward.
I'm testing that different broadcasting situations lead to different graphs.
I didn't move all symbolic_script _grad_sum_to_size calls to the new logic, but it might be better to do this incrementally anyway.
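A minimal sketch of the recording rule (hypothetical helper; the real logic lives in AutoDiff):
```
from typing import List, Optional
import torch

def record_sum_to_size_arg(inp: torch.Tensor,
                           out: torch.Tensor) -> Optional[List[int]]:
    # Record the target size only when broadcasting actually changed the
    # shape; otherwise record None, so the peephole pass can eliminate
    # the resulting _grad_sum_to_size(t, None).
    if inp.size() == out.size():
        return None
    return list(inp.size())
```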
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18697
Differential Revision: D15482076
Pulled By: wanchaol
fbshipit-source-id: 7f89367e35b8729910077c95c02bccefc8678afb
Summary:
Move the insert_guard all the way up to the beginning of the decomposition. This fixes the case where we lose the insert-point context after decomposeCommonNormalization but still need to modify the graph.
Fixes #19502
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19646
Differential Revision: D15058040
Pulled By: wanchaol
fbshipit-source-id: ebdbf8623ebfe4556c461e1b650e94b905791adb
Summary:
I believe the existing check in FuseGraph was only `false` if PyTorch was built with NO_CUDA=1. Otherwise, we would create fusion groups even if we're on a CPU-only machine running CPU code, which is confusing. Instead, I've made the decision to fuse or not depend on whether the producer Value is a known CPU tensor. If it is, we skip fusion.
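A minimal sketch of that gating decision (hypothetical names; the real check lives in the C++ graph fuser):
```
def should_consider_fusion(producer_device) -> bool:
    # Skip fusion when the producer Value is a known CPU tensor; fuse
    # only when the device is unknown or not CPU.
    return producer_device is None or producer_device.type != "cpu"
```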
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19342
Differential Revision: D15038351
Pulled By: jamesr66a
fbshipit-source-id: fce9d83929309a7bf14346833f84b996f3e7f6db
Summary:
Fixes: #19253
Fixing pow(Tensor, float) is straightforward.
The breakage for pow(float, Tensor) is a bit more subtle to trigger, and the fix needs `torch.log` (`math.log` didn't work) from the newly merged #19115. (Thanks ngimel for pointing out that it has landed.)
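For context, the identity that makes a scriptable log necessary in the backward of pow(float, Tensor), checked here in eager mode:
```
import torch

b = 2.0
x = torch.randn(4, requires_grad=True)
(b ** x).sum().backward()
# d/dx b**x = b**x * ln(b), hence the need for torch.log.
expected = (b ** x.detach()) * torch.log(torch.tensor(b))
assert torch.allclose(x.grad, expected)
```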
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19324
Differential Revision: D15003531
Pulled By: ailzhang
fbshipit-source-id: 8b22138fa27a43806b82886fb3a7b557bbb5a865
Summary:
Partially fuse layer_norm by decomposing it into the batchnorm kernel that computes the stats, then fusing the affine operations that follow the reduce operations. This is similar to the batchnorm fusion that apaszke did; it likewise only works in inference mode for now.
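An eager-mode sketch of the decomposition (hypothetical function; the real pass works on the JIT graph):
```
import torch

def layer_norm_decomposed(x, weight, bias, eps: float = 1e-5):
    # Stats/normalization step: the reduce part, computed by the
    # batchnorm-style stats kernel.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    normalized = (x - mean) / torch.sqrt(var + eps)
    # Pointwise affine step: this is the part handed to the fuser.
    return normalized * weight + bias
```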
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18266
Differential Revision: D14879877
Pulled By: wanchaol
fbshipit-source-id: 0197d8f2a17ec438d3e53f4c411d759c1ae81efe
Summary:
This PR propagates where we use first-class module objects into the compiler. This creates a transitionary state where:
* compiler.cpp creates Graphs where `self` is a Module class and attributes/parameters/buffers/submodules are looked up with `prim::GetAttr`
* GraphExecutor still runs "lowered graphs" where the self object has been removed by a compiler pass `lower_first_class_method`.
* Tracing still creates "lowered graphs", and a pass "lift_lowered_method" creates a first-class method graph for things.
* This PR separates out Method and Function. A script::Function is a pure Graph with no `self` bound. Similar to Python, a script::Method is just a bound `self` and its underlying `script::Function`.
* This PR also separates CompilationUnit from Module. A CompilationUnit is just a list of named script::Functions. Classes have a CompilationUnit holding the class methods, and Modules also have a CompilationUnit holding their Methods. This avoids the weird circular case Module -has a-> Class -has a-> Module ...
Details:
* In this transitionary state, we maintain two copies of a Graph, first-class module and lowered. The first-class one has a self argument that is the module's class type. The lowered one is the lowered graph that uses the initial_ivalues inputs.
* When defining lowered methods using `_defined_lowered` we immediately create the first-class equivalent. The reverse is done lazily, creating lowered_methods on demand from the class.
* The two-way conversions will be deleted in a future PR when the executor itself runs first-class objects. However, this requires more changes to (1) the traces, (2) the python bindings, and (3) the onnx export pass, and would make this PR way too large.
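A small illustration using today's scripting API (the module is hypothetical): the method graph takes `self` as a module-typed first input and reads the parameter via `prim::GetAttr`.
```
import torch

class Scale(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(3))

    def forward(self, x):
        return x * self.weight

m = torch.jit.script(Scale())
print(m.forward.graph)  # shows %self and prim::GetAttr[name="weight"]
```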
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19167
Differential Revision: D14891966
Pulled By: zdevito
fbshipit-source-id: 0b5f03118aa65448a15c7a7818e64089ec93d7ea
Summary:
Bug fix for https://github.com/pytorch/pytorch/issues/15043, where a large fusion in JIT had so many kernel arguments that it exceeded the limit allowed by nvrtc on a CUDA device.
The fix is to check the number of arguments before a CUDA kernel is generated. If the number exceeds the limit, take the runFallBack() path.
Adds a reduced test from the original issue to keep the test time low. The test would fail without this fix.
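A sketch of the guard (the constant is an assumed value for illustration; the real limit comes from the device/nvrtc properties):
```
MAX_KERNEL_ARGS = 128  # assumed limit, for illustration only

def launch_or_fallback(args, run_fused, run_fallback):
    # Count kernel arguments before generating the CUDA kernel; if the
    # limit would be exceeded, take the fallback (runFallBack) path.
    if len(args) > MAX_KERNEL_ARGS:
        return run_fallback()
    return run_fused()
```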
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18063
Differential Revision: D14691401
Pulled By: soumith
fbshipit-source-id: b98829bc89ed7724e91eda82ae3a5a1151af721a
Summary:
This PR did two things:
1. Enable scalar->float specialization in symbolic script; AD formulas that contain a scalar in the schema should write `float` instead.
2. add addcmul, lerp to AD and fuser.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18081
Differential Revision: D14490493
Pulled By: wanchaol
fbshipit-source-id: b3b86d960d5f051b30733bc908b19786111cdaa4
Summary:
so that functions like `def fn(x, p: float)` can be fused. Fixes #9940 and #11186. Fuses only float (not integer) arguments, to simplify assembling arguments for the fusion launch.
CPU fusion is disabled in CI and this won't be tested, but I tested it locally.
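A toy example of such a function (the body is illustrative):
```
import torch

@torch.jit.script
def fn(x: torch.Tensor, p: float) -> torch.Tensor:
    # p is a float scalar argument; with this change it no longer
    # prevents the pointwise ops here from being fused.
    return (x * p + p).relu()
```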
cc t-vi, apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18087
Differential Revision: D14581206
Pulled By: wanchaol
fbshipit-source-id: ccb0cf79b1751706f9b2cdf1715115eae5a39fb6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17585
Create a sugared value that represents a class during initialization. This is
so that assignments to attributes correctly define attributes in __init__ but
raise an error elsewhere.
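An illustration of the intended behavior (the class is hypothetical):
```
import torch

@torch.jit.script
class Counter(object):
    def __init__(self):
        # OK: assignments here define the class's attributes.
        self.count = 0

    def bump(self):
        self.count += 1   # OK: attribute was defined in __init__
        # self.extra = 1  # would be an error: not defined in __init__
```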
Reviewed By: shannonzhu
Differential Revision: D14263403
fbshipit-source-id: 09b2feeb272302f00a79c2a0302fbdf5483aed6a
Summary:
Trying to land again: make prim::None into a case of prim::Constant. The previous landing was reverted because it broke an important ONNX export test.
https://github.com/pytorch/pytorch/pull/16160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17186
Differential Revision: D14115304
Pulled By: eellison
fbshipit-source-id: 161435fc30460b4e116cdd62c7b2e5b94581dcb7
Summary:
Reenables rand_like fusion if no tensor is broadcast in the fusion group. This is a sufficient but not necessary condition for fused rand_like to produce correct results, and it has the unpleasant side effect of falling back to the non-fused path if rand_like was optimistically included in the fusion group but there is a broadcast in the group not necessarily related to rand_like. E.g. before this PR, if the network had (biasAdd -> relu -> dropout), the fuser could fuse biasAdd and relu; now it will try to fuse the whole thing (if dropout is expressed via rand_like) and fall back every time.
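For reference, dropout expressed via rand_like (p is illustrative); nothing here broadcasts, so the rand_like stays eligible for fusion under this rule:
```
import torch

def dropout_via_rand_like(x: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    mask = (torch.rand_like(x) >= p).to(x.dtype)  # elementwise keep-mask
    return x * mask / (1.0 - p)
```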
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16087
Differential Revision: D13720232
Pulled By: zou3519
fbshipit-source-id: 1e19203bec4a59257bfc7078b054a19f00fab4ad