Subgraph matcher now handles the matching of non-Node arguments.
Here are the 4 cases (sketched in code after the example below):
- pn is a Node, gn is a Node: this goes through the regular _match_node() function
- pn is a Node, gn is not a Node: this is a match only if pn is a placeholder op
- pn is not a Node, gn is a Node: this is not a match
- pn is not a Node, gn is not a Node: this goes through the argument comparison.
With this change
```
def target(x):
    return foo(x, 3)

def pattern(x, y):
    return foo(x, y)
```
is a match
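For reference, here is a hedged sketch of the four-way dispatch described above (the function name and the call into `_match_node` are illustrative, not the actual matcher internals):
```
from torch.fx import Node

def match_argument(matcher, pn, gn) -> bool:
    # Illustrative only: `matcher` stands in for the SubgraphMatcher instance.
    if isinstance(pn, Node) and isinstance(gn, Node):
        return matcher._match_node(pn, gn)  # case 1: regular node-to-node matching
    if isinstance(pn, Node) and not isinstance(gn, Node):
        return pn.op == "placeholder"       # case 2: a pattern placeholder can bind a literal
    if not isinstance(pn, Node) and isinstance(gn, Node):
        return False                        # case 3: a literal never matches a node
    return pn == gn                         # case 4: compare the literal arguments directly
```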
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85456
Approved by: https://github.com/jerryzh168
Summary:
{F770932209}
Given the original execution order and the node dependency relationship (note that the same dependency graph can produce multiple valid execution orders, i.e. topological orders), after reunion we find that the new GraphModule's execution order differs from the original one, which is not what we want.
For example, assume that NewLeaf_1 is EmbeddingLookup (calling EmbeddingLookup is awaitable: we keep executing the following nodes rather than waiting for the result until we actually have to know it), and NewLeaf_4 is the node where we HAVE to have the lookup result in order to interact with NewLeaf_3. NewLeaf_1 launches a lookup kernel and an all2all communication stream to distribute the result to all ranks. In the meantime, we want to keep executing NewLeaf_2 and NewLeaf_3 to avoid meaningless waiting. However, with the new execution order, we have to wait for the lookup kernel and the all2all communication to finish because the next node, NewLeaf_4, needs the result; only then can we execute NewLeaf_2, etc. This loses the overlap between computation and the communication stream and hurts QPS a lot.
So while constructing the GraphModule, we have to switch from the topological order back to the original order.
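Conceptually, the reordering boils down to walking the nodes in their original order and chaining them back together; a minimal sketch, assuming `original_order` holds the graph's nodes in the original module's execution order:
```
import torch.fx

def restore_original_order(graph: torch.fx.Graph, original_order):
    # Move each node so it sits immediately after its predecessor in the
    # original execution order (Node.append relocates an existing node).
    cursor = None
    for node in original_order:
        if cursor is not None:
            cursor.append(node)
        cursor = node
    graph.lint()  # sanity-check that the new order is still topologically valid
```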
Test Plan:
Unit test
I'm not sure how to add tests in FX as there's no TARGETS, so I added them in the TorchRec folder
Differential Revision: D39567314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85188
Approved by: https://github.com/SherlockNoMad
Fixes some errors you run into in dynamo when turning on fake tensors. I'm waiting on flipping the switch because I need to also get some fixes into dynamo + do benchmarking.
I could manually turn off fake tensors in functorch in dynamo, and then turn it on here if requested, although the changes here are pretty minimal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84432
Approved by: https://github.com/Chillee
Summary:
Encountered `Error: bad label format` from dot (i.e. graphviz) when benchmarking models that have dict-like structure.
The root cause was that curly brackets were not properly escaped, like this example P522499127 (unescaped curly brackets in target= string)
This diff inserts the fix in FxGraphDrawer, since many of these graph generation code paths rely on that class.
(Modified summary before exporting to GitHub PR)
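Conceptually, the fix boils down to escaping the brackets before the label reaches dot; a minimal sketch (the helper name is hypothetical, not the actual FxGraphDrawer code):
```
def _escape_dot_label(label: str) -> str:
    # Curly brackets are special inside graphviz record-shaped labels and
    # trigger "bad label format"; escape them before emitting the dot source.
    return label.replace("{", r"\{").replace("}", r"\}")
```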
Test Plan:
```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt -c python.package_style=inplace //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --model-name={INSERT IFR QE MODEL NAME HERE} --batch-iter 100 --batch-size 768 --num-gpu 1 --lower-presets {INSERT ITS PRESET}
```
Will not encounter dot errors after this diff.
(Modified test plan before exporting to GitHub PR)
Reviewed By: yinghai
Differential Revision: D38758827
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83604
Approved by: https://github.com/yinghai, https://github.com/jianyuh
Summary: Currently `split_by_tags` determines submodule output order by iterating over `used_in_main`. Since this is a `Set`, insertion order is not retained so we run into problems with submodule output order being "randomized" & inconsistent between splits. By using `Dict[Node, None]` we can implement `used_in_main` as an ordered set so that output order is consistent when splitting the same model.
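For reference, a minimal sketch of the dict-as-ordered-set pattern (names are illustrative):
```
from typing import Dict, Iterable, List
from torch.fx import Node

def ordered_unique(nodes: Iterable[Node]) -> List[Node]:
    # Dict keys preserve insertion order, so this behaves like an ordered set.
    used_in_main: Dict[Node, None] = {}
    for n in nodes:
        used_in_main.setdefault(n)
    return list(used_in_main)
```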
Test Plan: CI
Differential Revision: D39039268
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84136
Approved by: https://github.com/houseroad
Summary: Before this change, wrapped_fn was only supposed to take mutating passes, but we don't actually have any way to detect whether a pass is mutating before running it. To avoid a precondition that depends on the PassManager run, we relax the precondition to accept any kind of pass and conditionally return the original pass based on the pass result.
Test Plan: eyes
Reviewed By: qihqi, angelayi
Differential Revision: D39086343
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84232
Approved by: https://github.com/angelayi
Example:
```
======================================================================
ERROR: test_pass_manager_error (fx.test_pass_infra.TestPassManager)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 285, in __call__
res = fn(module)
File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 164, in pass_fail
raise RuntimeError("bad")
RuntimeError: bad
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 170, in test_pass_manager_error
pm(traced_m)
File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 289, in __call__
raise RuntimeError(msg) from e
RuntimeError: An error occured when running the 'pass_fail' pass after the following passes: ['replace_add_with_mul_pass', 'replace_mul_with_div_pass']
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83933
Approved by: https://github.com/SherlockNoMad
There is already special handling in the reinplacing pass for removing `{view}_scatter` ops, but there is another case that needs special handling. In this code:
```
def f():
    a = torch.zeros(4, 4, 4)
    a[:, 2:] = torch.ones(4, 2, 4)
    return a
```
Tracing normally with `make_fx()` gives you:
```
def forward(self):
    zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False)
    ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False)
    slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None
    copy__default = torch.ops.aten.copy_.default(slice_tensor_1, ones); slice_tensor_1 = ones = None
    return zeros
```
Functionalizing it gives you:
```
def forward(self):
    zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False)
    ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False)
    slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None
    slice_tensor_2 = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, 9223372036854775807); slice_tensor_2 = ones = None
    slice_scatter_default_1 = torch.ops.aten.slice_scatter.default(zeros, slice_scatter_default, 0, 0, 9223372036854775807); zeros = slice_scatter_default = None
    return slice_scatter_default_1
```
Notice that there are not any functional ops to directly re-inplace! What actually happened is that functionalization turned the `copy_()` into a `copy()`, but the out-of-place `copy()` operator gets optimized away because it's a no-op (when the input and output metadata are the same, `out = copy(a, b)` just returns `b`).
What we actually want is to replace this line:
```
slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, ...);
```
with this:
```
new_slice = torch.ops.aten.slice.Tensor(slice_tensor_2, 1, 2, ...);
_ = torch.ops.aten.copy_.default(new_slice, ones)
```
In the above, we're taking a fresh slice of the "base" tensor, and performing a `copy_()` on the slice, adding back what functionalization removed.
We actually need to create a fresh "slice" node, because we're not guaranteed that one already exists in the graph (technically there should be one, but it might have been DCE'd by the time we hit re-inplacing).
I also updated the docs for re-inplacing to more closely match the order of the logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83846
Approved by: https://github.com/ezyang
Cleaned up some of the arg replacement logic to use tree_map, so it handles FX nodes that have nested containers.
See the added test: when you write a function that returns a list, the `output` node in the FX graph shows up as having `node.args = tuple(immutable_list(...))`
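A sketch of the pattern (the helper name is illustrative; `old_node`/`new_node` are the nodes being swapped):
```
from torch.utils._pytree import tree_map

def replace_node_arg(node, old_node, new_node):
    # Swap old_node for new_node wherever it appears in node.args/kwargs,
    # including inside nested containers such as the immutable_list held by
    # the output node.
    def swap(a):
        return new_node if a is old_node else a

    node.args = tree_map(swap, node.args)
    node.kwargs = tree_map(swap, node.kwargs)
```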
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83845
Approved by: https://github.com/ezyang
I'm testing out turning on re-inplacing + functionalization by default with the AOTAutograd + eager backend on torchbench + huggingface models. This PR contains a few bug fixes from turning re-inplacing on:
(1) Handle more gracefully when FakeTensorMode is already turned on when you call reinplace
(2) More robust detection for when an inplace variant of an op exists (the dumb bug was that `pow.Scalar` doesn't have an inplace variant, even though there are several overloads of `pow_`; none of them are eligible though)
(3) Avoid re-inplacing when it would require resizing the input buffer. This isn't allowed, because inplace ops aren't allowed to resize their inputs.
For the last one, I gave the two main examples in more detail in the comments. Important cases are:
```
# This should not be re-inplaced at all; the op broadcasts, so this would require resizing the self tensor
torch.add(tensor[1, 4], tensor[4, 4])
# This should not be re-inplaced, because the inplace and out-of-place variants of the op return different dtypes
torch.ge(a, b)
# However, this means that today when functionalization functionalizes a `torch.ge_(a, b)` call, reinplacing won't properly de-functionalize it. I mentioned in the comments that this optimization is worth adding later
```
(4) There's some logic around keeping `storage_to_nodes` up to date when we see a view op: if we re-inplace `out = a.add(...)` and later in the program we encounter a "later_node" `out.view(...)` that needs to be replaced with `a.view(...)`, we need to update some metadata structures. I had to fix that logic: specifically, if "later_node" isn't a dispatcher op (e.g. if it's an FX output node), I wasn't properly handling the case where the node's fake_meta info was not a tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83626
Approved by: https://github.com/ezyang
pseudo.any is a wildcard node that can be matched with any fx node with an arbitrary number of inputs and outputs.
For example, to match relu followed by one fx node:
```
def pattern(a):
    y = a.relu()
    z = torch.ops.pseudo.any(y)
    return z
```
pseudo.oneof is a special node that can be matched with an fx node whose target is in the permissible list.
`targets` must be a list of qualified names for operators, e.g. ["operator.add", "torch.sigmoid",
"torch.ops.aten.foo", "torch.ops.prims.bar"]
For example, using the following pattern with pseudo.oneof
```
def pattern(a):
    y = a.relu()
    z = torch.ops.pseudo.oneof(y, targets=["relu", "torch.sigmoid", "operator.add"])
    return z
```
It will have 3 matches in the following function
```
def forward(y):
    z = y.relu()
    x = z.relu()          # first match
    x = x.relu()
    x = torch.sigmoid(x)  # second match
    x = x.relu()
    return x + 1          # third match
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82853
Approved by: https://github.com/ezyang
This new version of subgraph matcher further supports
- optionally match with pattern's placeholder and output nodes
- patterns with multiple outputs
- filtering out non-containing matches
- filtering out overlapping matches
TODOs:
- [x] Update replace_pattern() to use this matcher
- [x] Fix cases with identical anchor
- [x] Introduce wildcard matching, such as Any, OneOf
- [ ] Improve node comparer to match args and kwargs values
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82090
Approved by: https://github.com/ezyang
Adds a "reinplacing" FX transform, that goes through an FX graph and tries to convert out-of-place op calls into inplace calls whenever possible.
Followups from this PR include:
- Set up torchbench, and run the whole torchbench suite using AOTAutograd + functionalize + re-inplacing transforms to surface any issues (this is what I'm currently working on). Right now, I have some basic unit tests just to sanity check that the general logic makes sense.
- Add any missing inplace ops. This is mostly the `*_scatter*` ops, e.g. `diagonal_scatter_`, because these ops will commonly show up in an FX graph after running functionalization.
The criteria for when you can swap an op `b = a.add(...)` with `a.add_(...)` is:
(1) An inplace variant of the operator with the same schema needs to exist (`aten.add` -> `aten.add_`)
(2) `a` (**or any of its aliases**) can't be used as an input to any other operators later on in the graph
(3) `a` can't be one of the inputs to the entire graph. It also can't be an **alias** of any of the inputs ***
*** One thing to note: (3) means that we can't technically guarantee that we'll get back **all** memory usage that we lost from functionalization. Functionalization converts input mutations into out-of-place calls, and then adds a `copy_()` to the end of the graph to preserve semantics.
I added logic to handle `copy_()` in this PR because it's a pretty important optimization in the context of functionalization: any program that performs input mutations will have a `copy_()` in it after running functionalization.
There are some examples in the test file, but I think staring at an example of where re-inplacing is/isn't allowed to run is helpful:
```
// Before functionalization
def foo(a):
    tmp1 = a.add_(1)
    tmp2 = a.add(2)

// After functionalization
def foo(a):
    tmp1 = a.add(1)
    tmp2 = a.add(2)
    ....
    a.copy_(tmp1)

// After re-inplacing
def foo(a):
    // first add() is safe to re-inplace even though a is a program input,
    // because a's data is overwritten later by a copy_()
    tmp1 = a.add_(1)
    // second add() is NOT safe to re-inplace, because:
    // (1) a and tmp1 are aliased. Note that they weren't aliased in the original program,
    //     but they are now that we've done some re-inplacing.
    // (2) tmp1 is used as an input later in the program
    tmp2 = a.add(2)
    ....
    a.copy_(tmp1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80897
Approved by: https://github.com/ezyang
PassManager is a class used to run multiple passes on a given graph module.
Class Attributes
* `passes: List[Callable]`: A list of callable passes
* `constraints: List[Callable]`: A list of constraints
* `run_checks_after_each_pass`: Flag for running checks after each pass
Class Methods:
* `__call__(graph_module: DispatchGraphModule)`:
* Runs the passes in the list until the graph stops changing, or until it has run `steps` times.
* Each time a pass is run, it will check that the graph module still maintains the required invariants by calling `check()` and will lint the graph to check that it’s well formed if the flag `run_checks_after_each_pass` is set.
* `check(graph_module: DispatchGraphModule)`: Runs various checks on the given graph module to make sure that it contains the needed data for passes
* `add_check(check: Callable)`: Adds the `check` function to the given pass manager instance
* `add_constraint(constraint: Callable)`: Adds a constraint to the current list of constraints
We can create a PassManager and run it by doing:
```
PassManager(passes=[pass1, pass2])(graph_module)
```
Differential Revision: [D37523159](https://our.internmc.facebook.com/intern/diff/D37523159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80531
Approved by: https://github.com/SherlockNoMad
Passes should now return a `PassResult`, which (for now) contains the following fields (a minimal example follows the list):
* `graph_module`: The graph module modified during the pass
* `modified`: A flag for if the graph module has been modified
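A minimal sketch of a pass that follows this contract (import path as in current torch.fx; the pass name is illustrative):
```
import torch
import torch.fx
from torch.fx.passes.infra.pass_base import PassResult

def replace_add_with_mul_pass(gm: torch.fx.GraphModule) -> PassResult:
    modified = False
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is torch.add:
            node.target = torch.mul
            modified = True
    gm.recompile()
    return PassResult(gm, modified)
```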
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81366
Approved by: https://github.com/SherlockNoMad
Summary:
Add an `ignore_parameters_and_buffers` parameter which will tell the graph drawer
to leave off adding parameter and buffer nodes in the dot graph.
This is useful for large networks, where we want to view the graph to get an idea of
the topology and the shapes without needing to see every detail. Removing these nodes
de-clutters the graph significantly without losing much information.
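A hedged usage sketch (the traced module is an assumption; the flag name comes from this summary):
```
from torch.fx.passes.graph_drawer import FxGraphDrawer

# `traced` is assumed to be a torch.fx.GraphModule of the model.
drawer = FxGraphDrawer(traced, "my_model", ignore_parameters_and_buffers=True)
drawer.get_dot_graph().write_svg("topology.svg")  # pydot renders the de-cluttered graph
```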
Reviewed By: jfix71
Differential Revision: D37317917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79982
Approved by: https://github.com/jfix71
This PR introduces two components.
CapabilityBasedPartitioner for FX graphs: given a list of supported operators, this partitioner tries to form the largest subgraphs that only contain the supported ops.
Fuser utility: given a list of nodes in an FX graph, it lifts them into a sub-GraphModule within the original graph.
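A hedged usage sketch of the partitioner (import path and method names follow the current torch.fx API; treat exact signatures as assumptions):
```
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner

# `gm` is an fx.GraphModule; `op_support` is an OperatorSupportBase describing
# which nodes the backend can handle.
partitioner = CapabilityBasedPartitioner(gm, op_support)
partitions = partitioner.propose_partitions()       # largest supported subgraphs
fused_gm = partitioner.fuse_partitions(partitions)  # each partition becomes a sub-GraphModule
```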
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79439
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
Summary: If the model contains a ModuleList, it's possible that we get some of the weight attributes as module.sub.0.weight. `getattr` doesn't work in this case and we have a dedicated function `getattr_recursive` for that. Just use that.
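Roughly what such a recursive getattr does (a sketch, not necessarily the exact helper in the codebase):
```
def getattr_recursive(obj, qualified_name: str):
    # Plain getattr(model, "sub.0.weight") fails because of the dots; walk the
    # dotted path one attribute at a time instead (nn.Module resolves the
    # ModuleList index "0" through its _modules lookup).
    for atom in qualified_name.split("."):
        obj = getattr(obj, atom)
    return obj
```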
Reviewed By: houseroad
Differential Revision: D37326955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80011
Approved by: https://github.com/houseroad
Summary: We were handling constant attrs in a few different ways before, leading to confusion and missed handling for fused dtypes. This diff consolidates some of that code and unbreaks the current breakage.
Test Plan: CI. Recently broken tests now pass.
Differential Revision: D36335238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77401
Approved by: https://github.com/jaybean-dev, https://github.com/jamesr66a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74972
This diff
* adds PassManager and supporting logic
Test Plan:
CI and
```
buck test //caffe2/torch/fx/passes:test_pass_manager
```
```
Building: finished in 3.1 sec (100%) 124/124 jobs, 30/124 updated
Total time: 3.7 sec
More details at https://www.internalfb.com/intern/buck/build/4f947267-671c-48bc-ad07-190e5a731d2d
BUILD SUCCEEDED
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 1423fed7-4674-44ce-9b84-c634f28a0406
Trace available for this run at /tmp/tpx-20220309-144735.217835-1423fed7-4674-44ce-9b84-c634f28a0406/trace.log
RemoteExecution session id: reSessionID-1423fed7-4674-44ce-9b84-c634f28a0406-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/6473924544097816
✓ ListingSuccess: caffe2/torch/fx/passes:test_pass_manager : 3 tests discovered (0.639)
✓ Pass: caffe2/torch/fx/passes:test_pass_manager - test_these_before_those_pass_constraint (caffe2.torch.fx.passes.tests.test_pass_manager.TestPassManager) (0.335)
✓ Pass: caffe2/torch/fx/passes:test_pass_manager - test_this_before_that_pass_constraint (caffe2.torch.fx.passes.tests.test_pass_manager.TestPassManager) (0.336)
✓ Pass: caffe2/torch/fx/passes:test_pass_manager - test_pass_manager_builder (caffe2.torch.fx.passes.tests.test_pass_manager.TestPassManager) (0.344)
Summary
Pass: 3
ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/6473924544097816
```
Reviewed By: yuhc, wushirong
Differential Revision: D31484770
fbshipit-source-id: 7a8cde4c23727ff612bf7bf0d7b7db5d0c47b1a9
(cherry picked from commit c281c288fe870624574d34cfc93d732d4607f7d0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74891
As title, otherwise the below error is thrown:
```
TypeError: '>=' not supported between instances of 'int' and 'str'
```
Test Plan: easy
Reviewed By: jackm321
Differential Revision: D35206473
fbshipit-source-id: 200c83b9a19b6aae6f0da03abe99121e55893fd3
(cherry picked from commit 20744d2ce59ea07ecdb2570929dd5344c65b751a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73815
Add `skip_node_names_in_args` (default=`True`) which will skip including node names in args/kwargs during graph drawing.
Test Plan:
Default (`skip_node_names_in_args=True`):
{F707455583}
Vs. `skip_node_names_in_args=False`:
{F707046375}
Reviewed By: wushirong
Differential Revision: D34659144
fbshipit-source-id: 9f0bd7bee98dc1ca8eecdabc960804564d83777b
(cherry picked from commit a0ed64b51f0187115586f4001dc81148c7ed18b9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73564
While maintaining API backward compatibility, add an optional output parameter to split_module() that returns a mapping from the new qualified names in the modules after the split to the old qualified names in the original module.
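A hedged usage sketch (the traced module, root module, and split callback are assumptions; the keyword name follows the current split_module signature):
```
from torch.fx.passes.split_module import split_module

qualname_map = {}  # filled by split_module: new qualified name -> original qualified name
split_gm = split_module(traced, my_module, split_callback, qualname_map=qualname_map)
```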
Test Plan:
1. Added a test (test_split_qualname_mapping) to test_fx_experimental.py to check the returned qualname mapping
```
$ python test_fx_experimental.py
...
Ran 1084 tests in 73.464s
OK (skipped=531, expected failures=4)
```
2. Ask test_fx.py to accept split_module's new signature
```
$ python test_fx.py --accept
```
Reviewed By: jamesr66a
Differential Revision: D34541792
fbshipit-source-id: e8ec7e77ec884e4db7cad0c0593e31861c76e42d
(cherry picked from commit d2e5a95a353ee5fb52cdba065f127489e9df47ae)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73519
Move `getattr_recursive()` and `setattr_recursive()` to fx main.
Test Plan: CI
Reviewed By: khabinov
Differential Revision: D34524723
fbshipit-source-id: a656e821d9dc1d446aa80cdc03a923bf0c05aeb5
(cherry picked from commit 4835965ac72d299487be14687823ea62394f4079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73464
- Improve formatting of graph by centering everything
- Add num_users
- Add args/kwargs
- Don't print more than 10 of any list/tuple by default (this is necessary for very large concats)
Test Plan: tested locally
Reviewed By: khabinov
Differential Revision: D34492256
fbshipit-source-id: 8073992edb3efddcf8bfd72e2d3db49cc242db10
(cherry picked from commit b1b802965c143fdb0d308b70f51aa741f7d90f78)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71933
Add the functionalities provided by split.py to splitter_base.
- Propagate submodule inputs
- Create SplitResult to hold the split results.
Then removed split.py; to me this makes navigating the lowering code a bit easier.
Added default split and trace functions for use.
Next step is to add better error handling for each stage during lowering and create unit tests for each stage. I'll probably make some bootcamp tasks for the unit tests.
Test Plan: CI
Reviewed By: frank-wei, wushirong
Differential Revision: D33794322
fbshipit-source-id: f991893047a3701177f54cf22d9a6e48e0529472
(cherry picked from commit 1f3e13efba)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67261
Adds a `Pass(callable)`, `PassManager`, `PassConstraint(callable)`, `PassManagerBuilder`.
The idea is that a `Pass` modifies an IR in-place. `PassConstraint`s define a partial ordering on `Pass`es as a less-than callable. `PassManager` manages the collection of `Pass`es and `PassConstraint`s and ensures validation before execution. `PassManagerBuilder` creates `PassManager`s (example usage in a follow-up diff).
These are very loosely typed, so they could be applied to different IRs as well as to transformations between IRs.
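As a hedged sketch, a "this must run before that" constraint could look like the following (illustrative, not necessarily the exact helper shipped in torch/fx/passes/pass_manager.py):
```
def this_before_that_pass_constraint(this, that):
    def constraint(a, b):
        # For two passes scheduled in the order (a, b), the ordering is
        # invalid only when `that` would run before `this`.
        return not (a is that and b is this)
    return constraint
```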
Test Plan:
```
buck test mode/opt //caffe2/torch/fx/passes:test_pass_manager
```
```
More details at https://www.internalfb.com/intern/buck/build/210
BUILD SUCCEEDED
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: c635415b-cdc4-4574-9739-a16d2b93ad3a
Trace available for this run at /tmp/tpx-20220203-114748.608700/trace.log
RemoteExecution session id: reSessionID-c635415b-cdc4-4574-9739-a16d2b93ad3a-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1970324927640328
✓ ListingSuccess: caffe2/torch/fx/passes:test_pass_manager : 3 tests discovered (0.332)
✓ Pass: caffe2/torch/fx/passes:test_pass_manager - test_this_before_that_pass_constraint (caffe2.torch.fx.passes.tests.test_pass_manager.TestPassManager) (0.232)
✓ Pass: caffe2/torch/fx/passes:test_pass_manager - test_these_before_those_pass_constraint (caffe2.torch.fx.passes.tests.test_pass_manager.TestPassManager) (0.231)
✓ Pass: caffe2/torch/fx/passes:test_pass_manager - test_pass_manager_builder (caffe2.torch.fx.passes.tests.test_pass_manager.TestPassManager) (0.231)
Summary
Pass: 3
ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/1970324927640328
```
Reviewed By: jfix71, kflu
Differential Revision: D31316086
fbshipit-source-id: 4302c39e221cfa43e2b2eda9f26d6d78da4db0f1
(cherry picked from commit 13c981ab00)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72145
- Added a predicate that allows us not to lower nodes with specific names.
- Added an observer function to help with the debugging
Reviewed By: jasonjk-park, houseroad
Differential Revision: D33785834
fbshipit-source-id: 7bdb7f33851da1118763c85f8e2121d01e4914a2
(cherry picked from commit 4e2268ed45)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71790
If a leaf module is specified, it means we should treat it as a blackbox and we should just avoid rewriting it too.
Test Plan:
```
buck test caffe2/test:test_fx_acc_tracer
```
with a new unit test.
Reviewed By: jfix71, houseroad, wushirong
Differential Revision: D33731903
fbshipit-source-id: 0560d9e8435b40f30d9b99dc3b2f47d1a04eb38b
(cherry picked from commit 747e9e44ee)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71016
I found out that `split_module` doesn't preserve default values for arguments. In trying to fix that, I noticed that `Graph.placeholder` doesn't make it easy to add a default argument when making a placeholder. This PR addresses both of those issues
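A minimal sketch of the placeholder-with-default part (keyword name as in the current torch.fx API):
```
import torch
import torch.fx

g = torch.fx.Graph()
# `default_value` lets the placeholder carry a default, which split_module can
# then preserve when rebuilding submodule signatures.
x = g.placeholder("x", default_value=None)
g.output(x)
gm = torch.fx.GraphModule(torch.nn.Module(), g)
print(gm())  # prints None: the placeholder's default is used when no arg is passed
```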
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D33482218
Pulled By: jamesr66a
fbshipit-source-id: 57ebcdab25d267333fb1034994e08fc1bdb128ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71172
Break down div to smaller ops to make those div ops look like all other elementwise ops.
Use operator div ops instead of torch.div when possible to avoid converting literal numbers to torch tensors (as in the following).
```
a = 1
b = 2
// `c` would be 0.5
c = a / b
// `c` would be torch.tensor([0.5])
c = torch.div(a, b)
```
The problem we saw on shufflenet is that there's a size op followed by a div op, which results in int64 tensors in the acc-traced graph (the acc tracer turns operator.div into acc_ops.div, which uses torch.div). And the TRT splitter splits out the reshape op that consumes the div op, because we have a rule to split out ops that take int64 tensors as inputs.
Test Plan: Unit tests.
Reviewed By: wushirong
Differential Revision: D33482231
fbshipit-source-id: 508a171520c4e5b4188cfc5c30c1370ba9db1c55
Summary:
In the [docstring](https://github.com/pytorch/pytorch/blob/master/torch/fx/passes/graph_drawer.py#L54-L60) we mention `get_dot_graph` but it is not defined, so I defined it here.
I'm not sure if this is preferred, or whether we should update the docstring to use `get_main_dot_graph` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70541
Test Plan:
```
g = FxGraphDrawer(symbolic_traced, "resnet18")
with open("a.svg", "w") as f:
    f.write(g.get_dot_graph().create_svg())
```
Reviewed By: khabinov
Differential Revision: D33378080
Pulled By: mostafaelhoushi
fbshipit-source-id: 7feea2425a12d5628ddca15beff0fe5110f4a111
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68818
OperatorSupport was blocking all nodes with dtype int64 from lowering. This diff eases the condition, allowing inputs coming from get_attr nodes (which are known not to be used for TRT compute) to have dtype int64.
Reviewed By: brad-mengchi, 842974287
Differential Revision: D32609457
fbshipit-source-id: ea255f3281349a4254cb6abdeed671ab2c0216ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68303
The result of the splitter is run either on an accelerator or directly on the GPU; rename the GPU part of the graph to run_on_gpu.
Test Plan: buck test mode/opt caffe2/test:trt_tools_test
Reviewed By: 842974287
Differential Revision: D32392492
fbshipit-source-id: b085376c00c1097752e856e22c631d74a0fbc38f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67569
splitter_base assumes that the first subgraph after the split must be the CPU subgraph if there exists a CPU node. This is wrong: the starting subgraph should be determined by which subgraph has a zero-dependency node.
Also add a unit test for the splitter.
Reviewed By: yinghai
Differential Revision: D32012549
fbshipit-source-id: e2639ccd7774b4295ca05c2ddbefff9726702b3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66562
Adding shape inference for `acc_ops.quantize_per_channel`, and fixing some bugs.
Bugs were related to the fact that `quantize_per_channel` arguments `scales` and `zero_points` take tensors, so when we fetch the values (which needs to be done using `.tolist()` instead of `.item()`) we may get either a list or a scalar value.
Test Plan:
# Test Quantized Resnet
From sandbox with GPU that supports quantized types (tested with V100)
`buck run mode/opt -c python.package_style=inplace caffe2:fx2trt_quantized_resnet_test`
Output
```
...
[TensorRT] INFO: [MemUsageSnapshot] Builder end: CPU 0 MiB, GPU 1548 MiB
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 0 MiB, GPU 1548 MiB
[TensorRT] VERBOSE: Using cublasLt a tactic source
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.1.0
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 0, GPU 1556 (MiB)
[TensorRT] VERBOSE: Using cuDNN as a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 0, GPU 1564 (MiB)
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.5
[TensorRT] VERBOSE: Total per-runner device memory is 23405056
[TensorRT] VERBOSE: Total per-runner host memory is 73760
[TensorRT] VERBOSE: Allocated activation device memory of size 154140672
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 0 MiB, GPU 1736 MiB
trt fp16 time (ms/iter) 1.252899169921875
trt int8 time (ms/iter) 1.3774776458740234
trt implicit int8 time (ms/iter) 1.3835883140563965
PyTorch time (CUDA) (ms/iter) 4.34483528137207
PyTorch time (CPU) (ms/iter) 55.687150955200195
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 0, GPU 1918 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 0, GPU 1866 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 0, GPU 1738 (MiB)
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1012 12:07:23.556475 711816 DynoConfigLoader.cpp:32] Failed to read config: No dyno config client
```
# Test shape inference
`buck test mode/opt glow/fb/fx/acc_tracer:test_acc_shape_inference`
Output
```
...
Summary
Pass: 95
ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/1407375092088240
```
Reviewed By: jfix71, jerryzh168
Differential Revision: D31457323
fbshipit-source-id: 8ccc4a9b0ca655fb30838e88575aff2bf3a387a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65542
Add docstring for torch.fx.passes.split_module that conforms to Google Python Style conventions.
Changed original example to the example from this diff:
https://www.internalfb.com/diff/D24925283 (9734c042b8)
Test Plan:
Ran buck test //caffe2/test:fx. No errors detected
https://pxl.cl/1QCch
Reviewed By: jamesr66a
Differential Revision: D31145694
fbshipit-source-id: 8e54f3b1be3dca1c4d414fdeeab71b9f2b5d9f3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65933
We use `split_module` to split the input model that we want to const fold into const and non-const subgraphs. Previously we were taking the non-const graph and trying to hack it back into the same signature as the input model. However this was complex/buggy.
Instead, refactor to just keep using the base split module that contains both const and non-const graphs. This means we:
- Inline the non-const graph into the split module
- Remove the const graph from the module and replace it with a getattr that will be run to insert that attr when we `run_folding`
Test Plan: Added test coverage to cover newly supported folding, and updated other tests for new strategy.
Reviewed By: yinghai
Differential Revision: D31293307
fbshipit-source-id: 6e283a8c7222cf07b14e30e74dffc8ae5ee8b55f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64787
This PR added support for lowering per-channel quantization and dequantization operators
in fx2trt. It also extends TensorMeta with extra arguments corresponding to per-channel quantized Tensors.
Initially I was thinking of adding a qparam that can capture everything, but currently we still have some lowering support
for fbgemm ops (which have scale and zero_point in the operator interface). I think we can move everything to qparams
after we deprecate lowering support for fbgemm ops in the future.
Test Plan:
Test for per channel weight:
```
python torch/fx/experimental/fx2trt/example/quantized_resnet_test.py
```
change BC compatibility test expect for TensorMeta
```
python test/test_fx.py TestFXAPIBackwardCompatibility.test_class_member_back_compat --accept
```
Imported from OSS
Reviewed By: jfix71, mrshenli, 842974287
Differential Revision: D30879848
fbshipit-source-id: 76c3804bb1d9343183ae53d9f02c1a3bf6c79e1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65848
This diff includes:
* [fix]: The initialization of `OperatorSupport._support_dict` makes it a class variable, so we need to move its initialization into the constructor.
* Add an abstract class (more of an interface) `OperatorSupportBase`, since `OperatorSupport`'s purpose is too specific.
* [refactor]: What `TRTOperatorSupport` really does is populate an `OperatorSupport._support_dict`, so there really is no reason for subclassing. Remove it, and instead instantiate an `OperatorSupport` with a properly populated `_support_dict`.
* Add a framework for defining simple, basic op-support logic and composing it into more complex rules (a usage sketch follows below):
  1. `create_op_support` wraps a function into an `OperatorSupportBase` instance
  2. `chain` can combine several simple `OperatorSupportBase` instances into more complex ones
  3. `OpSupports` provides a set of pre-defined, simple `OperatorSupportBase` instances that can be composed together using `chain`.
     1. Currently the only pre-defined one is `decline_if_input_dtype(..)`, which declares a node unsupported if its args are of a user-specified dtype
* Fix `TRTOperatorSupport` so that it not only looks for registered converters, but also declines a node if its arg is of int64
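As referenced above, a hedged sketch of composing these helpers (import path and names taken from this summary; the extra rule is hypothetical):
```
import torch
from torch.fx.passes.operator_support import OpSupports, chain, create_op_support

def _decline_call_modules(submodules, node) -> bool:
    # Hypothetical extra rule, just to show composition: only support
    # call_function / call_method nodes.
    return node.op != "call_module"

op_support = chain(
    OpSupports.decline_if_input_dtype(torch.int64),  # pre-defined rule
    create_op_support(_decline_call_modules),        # wrap a plain function
)
```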
Test Plan: linter and CI
Reviewed By: 842974287
Differential Revision: D31275525
fbshipit-source-id: bbc02f7ccf4902a7912bb98ba5be2c2fbd53b606
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65136
Opportunistically add type annotation for operator_support.py
Test Plan: run linter, CI
Reviewed By: yinghai
Differential Revision: D30928464
fbshipit-source-id: 615c75152b9938792f03cdceb2a113bda6ab28c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64286
During graph splitting, `_SplitterBase` supports taking into consideration whether the subnet boundary nodes
produce "supported" outputs that will cross the acc/non-acc boundary. Specifically, if the backend only
supports Tensor-based data passing across the boundary, then we cannot split the graph at a place where the node
output is a non-Tensor type (e.g., `Tuple[Tensor]`).
There's currently a bug in this logic: it does not correctly detect the output type of a Node. Instead of
using `Node.meta['tensor_meta']`, we should check `Node.meta['type']`.
`Node.meta['tensor_meta']` is not appropriate because this key will exist if the node output is an iterable
and one of the elements is of type `Tensor`, so `Tuple[Tensor]` would be wrongly considered "supported".
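Roughly what the corrected check looks like (a sketch; the helper name is illustrative):
```
import torch
from torch.fx import Node

def is_node_output_tensor(node: Node) -> bool:
    # Rely on the recorded Python type rather than the presence of
    # 'tensor_meta', which also exists for e.g. Tuple[Tensor] outputs.
    node_type = node.meta.get("type", None)
    return node_type is not None and issubclass(node_type, torch.Tensor)
```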
Test Plan:
arc lint
run CI tests
Reviewed By: yinghai, 842974287
Differential Revision: D30617147
fbshipit-source-id: e8ba70dfaddc05cafb8037d58fca73b7ccbb1a49
Summary: Add logging so we know which nodes are currently being visited
Test Plan: lint & SC tests
Reviewed By: 842974287
Differential Revision: D30509865
fbshipit-source-id: 09e77e44c97c825242e0b24f90463b50f3ca19c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62234
There was a typo that we didn't catch until recently, hence this fix.
Reviewed By: 842974287
Differential Revision: D29924190
fbshipit-source-id: ee6259fcd41358aefe9680b419acc87c0c2821cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60972
For PyTorch model memory requirement calculation, requires_grad is needed. Output tensors with requires_grad are saved in the module context and increase memory during the forward pass.
Test Plan: Existing test cases
Reviewed By: jamesr66a
Differential Revision: D29024932
fbshipit-source-id: def990f8c6ff6fa4537bfc377c646b9d44464ebd
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern in question.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58699
Make `call_function`/`call_method` nodes random colors based on their target name. This coloring is stable according to the name of the target. Also handle tensor_meta more elegantly for quantized types, including printing q_scale/q_zero_point if they're used.
Test Plan: Tested locally
Reviewed By: chenccfb, 842974287
Differential Revision: D28580333
fbshipit-source-id: ad9961e1106a1bfa5a018d009b0ddb8802d2163c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57483
Pull Request resolved: https://github.com/pytorch/glow/pull/5622
Quantized linear has packed parameters. We want to unpack it so that it would be easier for graph optimization and importer to deal with the weight and bias. A customized remapping function is used to unpack quantized linear and map it to acc_op.linear.
Test Plan: `buck test glow/fb/fx/nnpi_importer:test_importer`
Reviewed By: gcatron, jfix71, khabinov
Differential Revision: D27451237
fbshipit-source-id: e46e961734788fd5333e227ca6143fd37c33204e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57280
We've found an issue where a fusion group can result in a circular dependency. For example:
```
a -> b -> c -> d
|              ^
+--------------+
```
Only a has a non-tensor output, and currently we would create a fusion group (a, b, d). This results in a circular dependency, because the fusion group now depends on c while c depends on the fusion group as well.
This diff implements the solution discussed before: when we add a node to a fusion group, we also add all the nodes that sit between the fusion group and the newly added node.
Use the same logic in minimizer to build fusion group.
Test Plan: split_tests and net_min_tests
Reviewed By: khabinov
Differential Revision: D27917432
fbshipit-source-id: a3d99fe5929dbc9f8eb0f45bccd83fd7b173795a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57279
Added an option "return_intermediate". If true, when building the submodule we want to run, we will replace the output with all the nodes, so that the intermediate results of all the nodes are returned as output.
This is recommended to use with `run_node()` function.
Test Plan: `buck test glow/fb/nnpi/lowering:net_min_tests`
Reviewed By: khabinov
Differential Revision: D27913887
fbshipit-source-id: 5a3eab02da05214fb9adeb25656c267b58075b1d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56201
Refactor Splitter and Minimizer into the superclasses `_SplitterBase` and `_MinimizerBase` and move them to OSS. This is needed to create an OSS example of GPU lowering with those tools.
Test Plan: CI
Reviewed By: jackm321
Differential Revision: D27629598
fbshipit-source-id: 0d4da02105ca509b31f1a6c4a39b1122c2bc7bf0
Summary:
Commandeered from https://github.com/pytorch/pytorch/pull/54563
Primary changes from first PR:
1. Refactored primary `normalize_function` logic into `operator_schemas.py` so that non-FX users can use it.
2. Refactored tests a bit, and added a path to call `normalize_function` directly.
3. Moved check for `boolean_dispatch` so that `torch.lu` also gets properly handled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55992
Reviewed By: mruberry
Differential Revision: D27774396
Pulled By: Chillee
fbshipit-source-id: 7f65632e1d608e4abd55aec5ccbfdc3f67f52b8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56212
The current design doesn't make it easy to use `node.copy()`. Explicitly copy over the node's meta.
Test Plan: Updated `test_subgraph_creation` in `test_fx_experimental`
Reviewed By: jamesr66a
Differential Revision: D27808477
fbshipit-source-id: 7fe7b6428c830307dbd1e395f16fa2774936d3b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55405
Pull Request resolved: https://github.com/pytorch/glow/pull/5516
Allows FXIRImport to import quantized model.
This diff doesn't include support for per-channel weights, linear and conv. Will address them in the next diff.
Test Plan: buck test glow/fb/fx/nnpi_importer:test_importer
Reviewed By: jackm321, jfix71
Differential Revision: D27313543
fbshipit-source-id: bf5c96ef5f2ff1835c09db981e0ceefaec56dd5b