Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55995
Normalization is kind of broken currently, but making default arguments visible still works and is useful functionality to be able to rely on. This adds an option to `NormalizeArgs`'s `__init__` called `normalize_to_only_use_kwargs`, which defaults to true; if set to false, the pass keeps the same positional/keyword split as provided, but additionally populates the default arguments in each node's kwargs.
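A hypothetical usage sketch of the new flag (assuming `NormalizeArgs` lives in `torch.fx.experimental.normalize` and that the traced calls normalize successfully):
```python
import torch
import torch.fx as fx
from torch.fx.experimental.normalize import NormalizeArgs

class M(torch.nn.Module):
    def forward(self, x):
        # `inplace` is left to its default here
        return torch.nn.functional.relu(x)

traced = fx.symbolic_trace(M())

# Default: rewrite every call so all arguments (including defaults) are kwargs.
only_kwargs = NormalizeArgs(traced).transform()

# New flag: keep the positional/keyword split as written, but still make the
# default arguments visible by populating them in each node's kwargs.
keep_split = NormalizeArgs(traced, normalize_to_only_use_kwargs=False).transform()

for node in keep_split.graph.nodes:
    print(node.op, node.target, node.args, node.kwargs)
```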
Test Plan: Added test to `test_fx_experimental`.
Reviewed By: 842974287
Differential Revision: D27759448
fbshipit-source-id: 620061fcf46d8549ac70b62aede8b6740aee3778
Summary:
Commandeered from https://github.com/pytorch/pytorch/pull/54563
Primary changes from first PR:
1. Refactored primary `normalize_function` logic into `operator_schemas.py` so that non-FX users can use it (see the sketch after this list).
2. Refactored tests a bit, and added a path to call `normalize_function` directly.
3. Moved check for `boolean_dispatch` so that `torch.lu` also gets properly handled.
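A hedged sketch of calling `normalize_function` directly; the signature is assumed from the refactor described above, and ambiguous overloads may require passing `arg_types`/`kwarg_types`:
```python
import torch
from torch.fx.operator_schemas import normalize_function

# Normalize a call to torch.add so that every argument, including defaults,
# is expressed as a kwarg. Returns None if the call could not be resolved.
result = normalize_function(
    torch.add,
    args=(torch.randn(3), torch.randn(3)),
    normalize_to_only_use_kwargs=True,
)
if result is not None:
    new_args, new_kwargs = result
    print(new_kwargs.keys())  # e.g. dict_keys(['input', 'other', 'alpha'])
```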
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55992
Reviewed By: mruberry
Differential Revision: D27774396
Pulled By: Chillee
fbshipit-source-id: 7f65632e1d608e4abd55aec5ccbfdc3f67f52b8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56212
The current design doesn't make it easy to use `node.copy()`. Explicitly copy over the node's meta.
Test Plan: Updated `test_subgraph_creation` in `test_fx_experimental`
Reviewed By: jamesr66a
Differential Revision: D27808477
fbshipit-source-id: 7fe7b6428c830307dbd1e395f16fa2774936d3b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55405
Pull Request resolved: https://github.com/pytorch/glow/pull/5516
Allows FXIRImport to import quantized models.
This diff doesn't include support for per-channel weights, linear, or conv; those will be addressed in the next diff.
Test Plan: buck test glow/fb/fx/nnpi_importer:test_importer
Reviewed By: jackm321, jfix71
Differential Revision: D27313543
fbshipit-source-id: bf5c96ef5f2ff1835c09db981e0ceefaec56dd5b
Summary:
Part of https://github.com/pytorch/pytorch/issues/48209
Taken from the docstring:
Performs a set of optimization passes to optimize a model for the purposes of inference. Specifically, the passes that are run are:
1. Conv/BN fusion
2. Dropout removal
3. MKL layout optimizations
The third optimization takes a function `use_mkl_heuristic` that's used to determine whether a subgraph should be explicitly run in MKL layout.
I implemented 2 heuristics (a usage sketch follows the list):
1. Uses MKL layout if the subgraph is larger than 2 nodes.
2. Benchmarks each subgraph with and without the MKL layout, and keeps the MKL version if it's faster.
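A hedged sketch of plugging in a custom heuristic; the module path, `pass_config` keys, and the subgraph object's `.nodes` attribute are assumptions based on this description, not verified API:
```python
import torch
import torchvision.models as models
from torch.fx.experimental.optimization import optimize_for_inference

# Mirror heuristic 1 above: run a fused subgraph in MKL layout only when it
# contains more than two nodes. (Assumed: the heuristic receives a subgraph
# object exposing a `.nodes` list and returns a bool.)
def my_mkl_heuristic(subgraph) -> bool:
    return len(subgraph.nodes) > 2

model = models.resnet18().eval()
optimized = optimize_for_inference(
    model,
    pass_config={"mkldnn_layout_optimize": {"heuristic": my_mkl_heuristic}},
)

with torch.no_grad():
    out = optimized(torch.randn(10, 3, 224, 224))
```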
### Batch size of 10 and multi-threaded.
Results with the second heuristic are generally as strong as the "jit.freeze" version, except in `densenet` and `vgg`, where it's faster, likely due to the heuristic being better. With the first heuristic, there are some notable gaps, particularly on `inception_v3` and `alexnet`.
```
model          Eager       FX          FX Auto     jit.mkldnn   threads
------------   ---------   ---------   ---------   ----------   -------
custom         0.195614    0.14686     0.15929     0.156442     6
resnet18       0.172012    0.114007    0.119678    0.12945      6
resnet50       0.486463    0.294308    0.299518    0.318121     6
densenet161    0.955309    0.893502    0.882798    1.29315      6
inception_v3   0.38454     0.307076    0.239513    0.233083     6
googlenet      0.229388    0.237486    0.170458    0.174106     6
shufflenet     0.0513613   0.0286739   0.0292908   0.0267209    6
alexnet        0.0709602   0.0768137   0.0660831   0.0650399    6
vgg16          1.053993    0.9013264   0.9360212   1.082820     6
mobilenet      0.12264     0.0970935   0.0936568   0.106314     6
mnasnet        0.0989875   0.0412083   0.0424499   0.0472336    6
resnext        0.476811    0.315428    0.314422    0.343156     6
```
For single-threaded (still running...)
```
model          eager       FX          FX auto     mkl         threads
------------   ---------   ---------   ---------   ---------   -------
custom         0.0401415   0.259863    0.0263152   0.200667    1
resnet18       0.499931    0.382113    0.383711    0.396335    1
resnet50       1.10353     0.911865    0.923645    0.992125    1
densenet161    2.20158     2.39421     2.08204     2.30124     1
inception_v3   0.79161     0.849207    0.703546    0.724492    1
googlenet      0.66896     0.820965    0.515927    0.529414    1
shufflenet     0.0987308   0.0689343   0.0629298   0.0617193   1
alexnet        0.198795    0.19862     0.19325     0.211934    1
vgg16          3.744       3.2499      3.28503     3.31576     1
mobilenet      0.152725    0.14505     0.135555    0.159754    1
mnasnet        0.141983    0.089406    0.089599    0.0956167   1
resnext        1.13778     0.97016     0.955417    0.965376    1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53805
Reviewed By: gmagogsfm
Differential Revision: D27424611
Pulled By: Chillee
fbshipit-source-id: a39137159de962fba7ca15121dfa9e78c1e01223
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53444
GraphModule construction has two options when constructing the base nn.Module: a dict of names to attrs to assign to the GraphModule, or another nn.Module to copy attrs from. (A minimal sketch of the dict case follows the list.)
- For the dict case, add logic to explicitly register `torch.Tensor`s that are not `nn.Parameter`s as buffers on the GraphModule, else fall back to `__setattr__`.
- For the other `nn.Module` case, check in the other module whether the attr to copy is a buffer and, if so, register it as a buffer; else fall back to `__setattr__`.
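A minimal sketch of the dict-based construction path (hypothetical attribute name `scale`; after this change the plain tensor should be registered as a buffer):
```python
import torch
from torch.fx import Graph, GraphModule

# Build a tiny graph that reads a `scale` attribute and multiplies the input by it.
graph = Graph()
x = graph.placeholder("x")
scale = graph.get_attr("scale")
graph.output(graph.call_function(torch.mul, (x, scale)))

# Dict-based construction: `scale` is a plain (non-Parameter) tensor.
gm = GraphModule({"scale": torch.ones(3)}, graph)

print(dict(gm.named_buffers()))  # 'scale' should now show up as a buffer
print(gm(torch.randn(3)))
```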
Test Plan: Added tests for fetching params and buffers from a GraphModule using both dict and module `__init__`s
Reviewed By: jamesr66a
Differential Revision: D26860055
fbshipit-source-id: 8d9999f91fef20aaa10969558006fc356247591f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51974
Right now, when an FX `Graph` references an external object, we will emit
code like:
import foo
def forward(input: foo.bar.baz):
...
This is problematic in a world with `torch.package`, since the name
`foo.bar.baz` may reference a name from any number of packages.
This PR lays the groundwork for FX-package integration by separating the
resolution of external references from the generation of the function
code.
When generating a Graph's Python source, we keep track of all external
references and assign them unique names. At the end, we have a
dictionary mapping names -> actual objects. This becomes the `globals`
namespace we pass to `exec` when installing the forward function in a
`GraphModule`. This is nice because we can always be sure that `exec` is
seeing the same objects that were referenced from the `Graph`, no import
statements needed.
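A toy illustration of this approach (not the actual FX codegen): the generated source refers to externals through generated names, and the real objects are supplied via the `globals` dict handed to `exec`, so no imports are needed in the emitted code.
```python
import torch

# Hypothetical generated source: `_torch_relu` and `_some_constant` stand in
# for external references; the names and values here are made up for the demo.
src = """
def forward(self, x):
    return _torch_relu(x) + _some_constant
"""

# The "globals" namespace maps the generated names to the actual objects.
globals_dict = {"_torch_relu": torch.relu, "_some_constant": 1}
exec(compile(src, "<generated>", "exec"), globals_dict)

forward_fn = globals_dict["forward"]
print(forward_fn(None, torch.tensor([-1.0, 2.0])))  # tensor([1., 3.])
```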
At serialization time, we use a `ModuleEnv` to resolve the globals dict
to a set of import statements that can be run to reproduce the `globals`
namespace. This is only used during serialization/deserialization, and those
functions are expected to check that the import statements are producing
the correct results.
Concretely, the code above will now look like:
from foo.bar import baz as foo_bar_baz
def forward(input: foo_bar_baz):
...
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D26340593
Pulled By: suo
fbshipit-source-id: fe247f75205d0a03fd067bdd0f95491e8edf1436
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50151
**Summary**
This commit adds a graph transformation pass that merges several matrix
multiplications that use the same RHS operand into one large matrix
multiplication. The LHS operands from all of the smaller matrix multiplications
are concatenated together and used as an input in the large matrix multiply,
and the result is split in order to obtain the same products as the original
set of matrix multiplications.
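A small numerical illustration of the rewrite (plain tensor ops, not the FX pass itself; shapes are made up for the demo):
```python
import torch

a = torch.randn(3, 4)      # LHS of the first matmul
b = torch.randn(5, 4)      # LHS of the second matmul
rhs = torch.randn(4, 6)    # shared RHS operand

# Concatenate the LHS operands, do one large matmul, then split the result.
merged = torch.cat([a, b], dim=0) @ rhs
out_a, out_b = torch.split(merged, [3, 5], dim=0)

assert torch.allclose(out_a, a @ rhs, atol=1e-5)
assert torch.allclose(out_b, b @ rhs, atol=1e-5)
```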
**Test Plan**
This commit adds a simple unit test with two matrix multiplications that share
the same RHS operand.
`python test/test_fx_experimental.py -k merge_matmul -v`
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D25809409
Pulled By: SplitInfinity
fbshipit-source-id: fb55c044a54dea9f07b71aa60d44b7a8f3966ed0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50120
This commit adds a graph transformation pass that merges several matrix
multiplications that use the same RHS operand into one large matrix
multiplication. The LHS operands from all of the smaller matrix multiplications
are concatenated together and used as an input in the large matrix multiply,
and the result is split in order to obtain the same products as the original
set of matrix multiplications.
Test Plan:
This commit adds a simple unit test with two matrix multiplications that share
the same RHS operand.
`buck test //caffe2/test:fx_experimental`
Reviewed By: jamesr66a
Differential Revision: D25239967
fbshipit-source-id: fb99ad25b7d83ff876da6d19dc4abd112d13001e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47935
Fetch the parameters that are needed for lowering from `nn.Module` to `fx.Node` for leaf modules.
Test Plan: A test `test_fetch` is added to test_fx_experimental.py.
Reviewed By: jfix71
Differential Revision: D24957142
fbshipit-source-id: a349bb718bbcb7f543a49f235e071a079da638b7
Summary:
Add a unit test for the situation where a single node is too large to fit onto any device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48938
Reviewed By: zhangguanheng66
Differential Revision: D25402967
Pulled By: scottxu0730
fbshipit-source-id: a2e2a3dc70d139fa678865ef03e67fa57eff4a1d
Summary:
Add a unit test for the situation where the devices do not have enough memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48858
Reviewed By: malfet, gcatron
Differential Revision: D25341254
Pulled By: scottxu0730
fbshipit-source-id: c0524c22717b6c8afd67f5b0ad0f1851b973e4b7
Summary:
Given an FX module `foo`, you can call `foo.to_folder('foo_folder', 'Foo')` to dump the current FX module into runnable Python code.
That is
```
foo = <fxModule>
foo.to_folder('bar', 'Foo')
from bar import Foo
foo2 = Foo()
# for all x: foo2(x) == foo(x)
```
This has several use cases, largely lifted from jamesr66a's doc here: https://fb.quip.com/U6KHAFaP2cWa (FB-internal).
1. As we apply more heavy-weight function transformations with FX, figuring out what's going on can be quite difficult. In particular, things that can typically be used for debugging (like `print` or `import pdb; pdb.set_trace()`) no longer work. This is particularly relevant if you're using an FX transform like `grad` or `vmap`. With this, you simply open up the dumped file and add `print`/`pdb` statements wherever you'd like.
2. This also provides an immense amount of user control. Some potential use-cases:
- Let's say an existing FX transform has some bug, or generates suboptimal code. Instead of needing to modify that FX transform, write another FX pass that fixes the suboptimal code, or give up on FX entirely, users can work around it by simply modifying the resulting code themselves.
- This allows users to check in their FX modules into source control.
- You could even imagine using this as part of some code-gen type workflow, where you write a function, `vmap` it to get the function you actually want, and then simply copy the output of the `vmap` function without needing FX at all in the final code.
An example:
```python
class Test(nn.Module):
def __init__(self):
super(Test, self).__init__()
self.W = torch.nn.Parameter(torch.randn(2))
self.linear = nn.Linear(2, 2)
self.attr = torch.randn(2)
self.attr2 = torch.randn(2)
def forward(self, x):
return self.linear(self.W + (self.attr + self.attr2) + x)
mod = fx.symbolic_trace(Test())
mod.to_folder('foo', 'Foo')
```
results in
```python
import torch
class Foo(torch.nn.Module):
def __init__(self):
super().__init__()
state_dict = torch.load('foo/state_dict.pt')
self.linear = torch.load('foo/linear.pt') # Linear(in_features=2, out_features=2, bias=True)
self.__tensor_constant0 = state_dict['__tensor_constant0']
self.W = torch.nn.Parameter(state_dict['W'])
def forward(self, x):
w = self.W
tensor_constant0 = self.__tensor_constant0
add_1 = w + tensor_constant0
add_2 = add_1 + x
linear_1 = self.linear(add_2)
return linear_1
```
Some current issues:
1. How do you actually ... save things like modules or parameters? I don't think FX is in the business of tracking initializations and such. Thus, the only way I see to do it is to dump the parameters/modules as blobs, and then load them in the generated initialization. This is a somewhat subpar user experience, and perhaps prevents it from being used in some cases (e.g., you would need to check the blobs into source control to save the model).
2. Currently, the only "atomic" modules we have are those in `torch.nn`. However, if we want to allow flexibility in this, and for example, allow "atomic" modules that are user-defined, then it's not clear how to allow those to be dumped in a way that we can then load elsewhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47544
Reviewed By: jamesr66a
Differential Revision: D25232917
Pulled By: Chillee
fbshipit-source-id: fd2b61a5f40e614fc94256a2957ed1d57fcf5492
Summary:
This PR adds support for AOT-based partitioning. Given each node and its corresponding partition ID, it generates the partitions, submodules, and DAG.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48336
Reviewed By: gcatron
Differential Revision: D25226899
Pulled By: scottxu0730
fbshipit-source-id: 8afab234afae67c6fd48e958a42b614f730a61d9
Summary:
Some interesting stuff going on. All benchmarks are run with both my implementation and the current quantized fuser.
For these benchmarks, things like using MKLDNN/FBGEMM make a big difference.
## Manual compilation (everything turned off)
In the small case, things look good
```
non-fused: 1.174886703491211
fused: 0.7494957447052002
```
However, for `torchvision.resnet18`, we see
```
non-fused: 1.2272708415985107
fused: 3.7183213233947754
```
This is because Conv (no bias) -> Batch Norm is actually faster than Conv (bias) if you don't have any libraries...
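For reference, a minimal numerical sketch of the standard Conv/BN folding being benchmarked here (textbook algebra, not the code from this PR):
```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold the (eval-mode) batch-norm affine transform into the conv's
    # weight and bias: scale = gamma / sqrt(running_var + eps).
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused

conv, bn = nn.Conv2d(3, 8, 3, bias=False), nn.BatchNorm2d(8)
with torch.no_grad():  # give the BN non-trivial statistics for the check
    bn.weight.uniform_(0.5, 1.5); bn.bias.uniform_(-0.5, 0.5)
    bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 1.5)
bn.eval()

x = torch.randn(1, 3, 16, 16)
assert torch.allclose(bn(conv(x)), fold_conv_bn(conv, bn)(x), atol=1e-5)
```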
## Nightly (CPU)
```
Toy
non-fused: 0.45807552337646484
fused: 0.34779977798461914
resnet18
non-fused: 0.14216232299804688
fused: 0.13438796997070312
resnet50
non-fused: 0.2999534606933594
fused: 0.29364800453186035
densenet161
non-fused: 0.6558926105499268
fused: 0.6190280914306641
inception_v3
non-fused: 1.2804391384124756
fused: 1.181272029876709
```
with MKLDNN.
We see a small performance gain across the board, with more significant performance gains for smaller models.
## Nightly (CUDA)
```
M
non-fused: 1.2220964431762695
fused: 1.0833759307861328
resnet18
non-fused: 0.09721899032592773
fused: 0.09089207649230957
resnet50
non-fused: 0.2053072452545166
fused: 0.19138741493225098
densenet161
non-fused: 0.6830024719238281
fused: 0.660109281539917
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47657
Reviewed By: eellison
Differential Revision: D25127546
Pulled By: Chillee
fbshipit-source-id: ecdf682038def046045fcc09faf9aeb6c459b5e3
Summary:
This is a partition search based on the Kernighan-Lin algorithm. First, the graph is partitioned using `size_based_partition`; then nodes from different partitions are swapped until the cost reaches a minimum.
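A toy sketch of that refinement loop (greedy pairwise swaps in the spirit of Kernighan-Lin, with a made-up cost function; not the Partitioner implementation):
```python
import itertools

def kl_refine(partitions, cost):
    """Repeatedly try swapping a pair of nodes across two partitions and keep
    the swap only if it strictly lowers the overall cost."""
    improved = True
    while improved:
        improved = False
        for p0, p1 in itertools.combinations(range(len(partitions)), 2):
            for a in list(partitions[p0]):
                for b in list(partitions[p1]):
                    before = cost(partitions)
                    partitions[p0].remove(a); partitions[p1].remove(b)
                    partitions[p0].add(b); partitions[p1].add(a)
                    if cost(partitions) < before:
                        improved = True
                        break  # `a` has moved; try the next candidate node
                    # revert the swap
                    partitions[p0].remove(b); partitions[p1].remove(a)
                    partitions[p0].add(a); partitions[p1].add(b)
    return partitions

# Toy example: balance the total "node size" of two partitions.
parts = [{1, 9, 10}, {2, 3, 4}]
balance = lambda ps: abs(sum(ps[0]) - sum(ps[1]))
print(kl_refine(parts, balance), balance(parts))
```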
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48197
Reviewed By: gcatron
Differential Revision: D25097065
Pulled By: scottxu0730
fbshipit-source-id: 3a11286bf4e5a712ab2848b92d0b98cd3d6a89be
Summary:
This PR fixes `add_node` and `remove_node` in the Partition class and also adds a unit test for node manipulation in a partition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48016
Reviewed By: gcatron
Differential Revision: D24996368
Pulled By: scottxu0730
fbshipit-source-id: 0ddffd5ed3f95e5285fffcaee8c4b671929b4df3
Summary:
Change Partitioner.py file name to partitioner.py
Change GraphManipulation.py file name to graph_manipulation.py
Move test_replace_target_nodes_with() to test_fx_experimental.py
Remove the unnecessary argument in size_based_partition() in Partitioner class
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47914
Reviewed By: gcatron
Differential Revision: D24956653
Pulled By: scottxu0730
fbshipit-source-id: 25b65be7dc7d64e90ffdc59cf394446fee83c3e6
Summary:
In `cost_aware_partition`, check for circular dependencies in `try_combining_partitions`. Also fix the calculation of communication time between partitions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47856
Reviewed By: gcatron
Differential Revision: D24926591
Pulled By: scottxu0730
fbshipit-source-id: c634608675ac14b13b2370a727e4fb05e1bb94f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47763
Changing the name due to the discussion in
https://github.com/pytorch/pytorch/pull/47399.
Test Plan:
```
python test/test_utils.py TestAssert.test_assert_true
python test/test_fx.py TestFX.test_symbolic_trace_assert
python test/test_fx_experimental.py
```
Imported from OSS
Reviewed By: ezyang
Differential Revision: D24891767
fbshipit-source-id: 01c7a5acd83bf9c962751552780930c242134dd2
Summary:
[WIP] This PR adds the `cost_aware_partition` method to the Partitioner class. The method partitions the FX graph module based on the latency of the whole graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47673
Reviewed By: gcatron
Differential Revision: D24896685
Pulled By: scottxu0730
fbshipit-source-id: 1b1651fe82ce56554f99d68da116e585c74099ed
Summary:
Add a `get_device_to_partitions_mapping` function to the Partitioner class to make `size_based_partition` more modular and organized. This function will also be used by the future `cost_aware_partition`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47361
Reviewed By: gcatron
Differential Revision: D24760911
Pulled By: scottxu0730
fbshipit-source-id: 8cdda51b9a1145f9d13ebabbb98b4d9df5ebb6cd
Summary:
This PR adds support for calculating the cost of a partitioned graph, partition by partition, based on the node cost. In a partitioned graph, top partitions (partitions without parents) are collected as starting points, and then DFS is used to find the critical path among all partitions in the graph.
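A small sketch of that critical-path computation on a toy partition DAG (made-up latencies, not the Partitioner code):
```python
from functools import lru_cache

# Toy partition DAG: p0 feeds p1 and p2, which both feed p3.
children = {"p0": ["p1", "p2"], "p1": ["p3"], "p2": ["p3"], "p3": []}
latency = {"p0": 1.0, "p1": 4.0, "p2": 2.0, "p3": 3.0}

@lru_cache(maxsize=None)
def path_latency(p: str) -> float:
    # Latency of `p` plus the most expensive path through its children.
    return latency[p] + max((path_latency(c) for c in children[p]), default=0.0)

top_partitions = ["p0"]  # partitions without parents are the starting points
print(max(path_latency(p) for p in top_partitions))  # 1 + 4 + 3 = 8.0
```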
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47280
Reviewed By: gcatron
Differential Revision: D24735932
Pulled By: scottxu0730
fbshipit-source-id: 96653a8208554d2c3624e6c8718628f7c13e320b
Summary:
This PR adds a node-by-node cost function. Given a partition of nodes, the `get_latency_of_one_partition` function finds the critical path in the partition and returns its latency. A unit test is also provided, in which a graph module is partitioned into two partitions and the latency of each partition is checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47009
Reviewed By: gcatron
Differential Revision: D24692542
Pulled By: scottxu0730
fbshipit-source-id: 64c20954d842507be0d1afa2516d88f705e11224
Summary:
Placeholders and constants in a partition were counted twice when combining two partitions. This PR fixes that by adding an `add_node` function to the Partition class. A unit test is also updated to check that the partition size is correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47083
Reviewed By: gcatron
Differential Revision: D24634368
Pulled By: scottxu0730
fbshipit-source-id: ab408f29da4fbf729fd9741dcb3bdb3076dc30c4
Summary:
WIP: Add support for different memory sizes in `size_based_partition`, so it can handle logical devices with different memory sizes. Compared to the original `size_based_partition`, the new one also supports partition-to-logical-device mapping: multiple partitions can be mapped to one device if its memory size allows. A unit test, `test_different_size_partition`, is also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46919
Reviewed By: gcatron, VitalyFedyunin
Differential Revision: D24603511
Pulled By: scottxu0730
fbshipit-source-id: 1ba37338ae054ad846b425fbb7e631d3b6c500b6
Summary:
WIP: This PR adds `sparse_nn_partition` to the Partitioner class. It includes logical device assignment for all DAG nodes. The basic idea is to do `size_based_partition` separately for embedding nodes and non-embedding nodes. A unit test is also added in test_fx_experimental.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46390
Reviewed By: gcatron
Differential Revision: D24555415
Pulled By: scottxu0730
fbshipit-source-id: 8772af946d5226883759a02a1c827cfdfce66097
Summary:
Reopen the PR: https://github.com/pytorch/pytorch/pull/45837
This PR adds a new feature to the Partitioner class called `size_based_partition`. Given a list of devices with the same memory size, this function distributes graph nodes across the different devices. To implement this feature, several helper functions are created in Partitioner.py and GraphManipulation.py.
A unit test is also added in test/test_fx_experimental.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46282
Reviewed By: gcatron
Differential Revision: D24288470
Pulled By: scottxu0730
fbshipit-source-id: e81b1e0c56e34f61e497d868882126216eba7538