Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66972
Add an API to view how many custom classes we have and what their names are
Test Plan: unit test
Reviewed By: cccclai
Differential Revision: D31811337
fbshipit-source-id: 9f8ca1fc578a0a5360c9cd8f95475acc33f250e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66659
Original message: We added and registered a new operator, `static_runtime::fused_sigrid_transforms`, and modified the original `sigrid_transforms` to handle the non-fused case only
Note: this diff was commandeered from a bootcamper. Some final touches were needed.
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: swolchok
Differential Revision: D31550307
fbshipit-source-id: 287380be0cca20ee6e145bcc7217547bd58cf6d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67097
All delegated models have `is_nonzero` ops by default; making the op native and consumable without dispatch eases the portability of such models.
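For illustration, a minimal sketch of what a dispatch-free `is_nonzero` amounts to (semantics per `aten::is_nonzero`; the helper name and error message are assumptions, not the actual kernel):
```
#include <ATen/ATen.h>
#include <stdexcept>

// Sketch only: a single-element tensor converts to bool; anything else
// is an error, mirroring aten::is_nonzero's contract.
bool is_nonzero_native(const at::Tensor& self) {
  if (self.numel() != 1) {
    throw std::runtime_error(
        "Boolean value of Tensor with more than one value is ambiguous");
  }
  return self.item<double>() != 0.0;  // read the single element, compare
}
```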
ghstack-source-id: 141375082
Test Plan:
`buck test caffe2/test/cpp/jit:jit -- BackendTest.TestComposite`
```
[~/fbsource/fbcode] cd ~/fbsource/fbcode/ && buck test caffe2/test:jit -- test_trace_arange
Parsing buck files: finished in 0.5 sec
Building: finished in 9.4 sec (100%) 16035/16035 jobs, 0/16035 updated
Total time: 10.0 sec
More details at https://www.internalfb.com/intern/buck/build/1e55eea5-2adb-41d1-96ae-cbf4b446d6c6
BUILD SUCCEEDED
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 46eedba2-ae17-4e88-b205-93bd1332665d
Trace available for this run at /tmp/tpx-20211015-113905.235421/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1970324912349177
✓ ListingSuccess: caffe2/test:jit - main (12.372)
✓ Pass: caffe2/test:jit - test_trace_arange (jit.test_tracer.TestTracer) (13.748)
✓ Pass: caffe2/test:jit - test_trace_arange_with_grad (jit.test_tracer.TestTracer) (13.892)
Summary
Pass: 2
ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/1970324912349177
```
Reviewed By: iseeyuan
Differential Revision: D31656842
fbshipit-source-id: c0e6c798478a2783c0e17e6e9100ba5ce044da78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001
The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.
This new overload will probably take a forwarding reference, so to avoid problems with overloading on a forwarding reference and to simplify the interface, it's best to remove this unused one.
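For context, a minimal sketch of the pitfall (toy types, not SR's actual classes): during overload resolution, a forwarding-reference template deduces an exact match for nearly any argument, so a concrete overload alongside it is easily out-competed.
```
#include <iostream>
#include <vector>

struct Runner {
  // Forwarding reference: deduces an exact-match parameter type for
  // almost any argument.
  template <class ArgList>
  void operator()(ArgList&& /*args*/) {
    std::cout << "template overload\n";
  }
  // Concrete overload: binding to const& is a worse match than the
  // template's exact match for non-const lvalues and for rvalues.
  void operator()(const std::vector<float>& /*tensors*/) {
    std::cout << "vector<float> overload\n";
  }
};

int main() {
  Runner run;
  std::vector<float> v{1.0f};
  run(v);                         // prints "template overload"
  run(std::vector<float>{2.0f});  // also prints "template overload"
}
```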
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
`buck test caffe2/test:static_runtime`
Reviewed By: hlu1
Differential Revision: D31821990
fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67125
Using explicit template instantiations in D31659973 (f2582a59d0) was a bad idea. The problem is that the lvalue instantiation was for a `const` vector of `IValue`, meaning that if you tried to pass SR a non-const vector of arguments, the linker would fail to find the symbol.
The reason we didn't catch this in D31659973 (f2582a59d0) is that predictor always passes a `const` reference anyway. But we should fix this to prevent unexpected problems in the future.
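A minimal repro sketch of the failure mode (toy names, not the SR code):
```
// lib.h
#include <vector>
template <class IValueList>
void run(IValueList&& args);

// lib.cpp: the only explicit instantiation is the const-lvalue one.
// With IValueList = const std::vector<int>&, reference collapsing makes
// the parameter type const std::vector<int>&.
#include "lib.h"
template <class IValueList>
void run(IValueList&& args) { /* ... */ }
template void run<const std::vector<int>&>(const std::vector<int>&);

// main.cpp
#include "lib.h"
int main() {
  std::vector<int> args{1, 2, 3};
  run(args);  // deduces IValueList = std::vector<int>& (non-const), a
              // symbol that was never instantiated: undefined reference
}
```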
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D31873406
fbshipit-source-id: 5ab5a03334bed925cec11facadcedf9bec9b90ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66337
Right now, the assembly code generated for a given method from the model is named `wrapper` or `func` by default. The function name is then replaced with a proper kernel_func_name after target-specific assembly is generated.
This PR propagates the desired kernel_func_name from the aotCompiler API so that the generated function has the needed name from the start and doesn't need to be renamed later.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31514095
Pulled By: priyaramani
fbshipit-source-id: b70c8e2c733600a435cd4e8b32092d37b7bf7de5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66149
The updated logic can infer the rank of a slice output when only the rank of the slice input is known. This enables cases where `ConstantValueMap::HasRank(input)` is `True` while `ConstantValueMap::HasShape(input)` is `False`.
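A hedged sketch of the rank-only path: `ConstantValueMap::HasRank`/`HasShape` come from the description above, while the `GetRank`/`SetRank` accessors and the helper itself are assumptions for illustration.
```
#include <torch/csrc/jit/passes/onnx/constant_map.h>

// Slicing never changes dimensionality, so a known input rank can be
// propagated to the output even when the concrete shape is unknown.
void propagateSliceRank(const std::string& input, const std::string& output) {
  using torch::jit::ConstantValueMap;
  if (!ConstantValueMap::HasShape(input) && ConstantValueMap::HasRank(input)) {
    auto rank = ConstantValueMap::GetRank(input);  // assumed accessor
    if (rank.has_value()) {
      ConstantValueMap::SetRank(output, rank.value());  // assumed setter
    }
  }
}
```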
Test Plan: Imported from OSS
Reviewed By: jansel
Differential Revision: D31423840
Pulled By: malfet
fbshipit-source-id: 17b2b24aa63435d5212ebe6bdf66ae3c348c4e3b
Co-authored-by: BowenBao <bowbao@microsoft.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66140
* Add a new argument to the export API to enable users to specify `nn.Module` classes that they wish to be exported as local functions in the ONNX model.
* Refactor `torch/csrc/jit/serialization/export.cpp`, and remove redundant `EncoderBase` class.
* ~~Contains changes from #63268~~
* Depends on #63716 to update onnx submodule.
Test Plan: Imported from OSS
Reviewed By: jansel
Differential Revision: D31424098
fbshipit-source-id: c949d0b01c206c30b4182c2dd1a5b90e32b7a0d3
Co-authored-by: BowenBao <bowbao@microsoft.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648
Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:
1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```
The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
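A sketch of how that forwarding looks (shape inferred from the description above, with stand-in types rather than the real SR classes):
```
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct IValue {};  // stand-in for c10::IValue, illustration only

struct StaticRuntime {
  // lvalue argument vectors are copied into the input slots...
  void set_inputs(const std::vector<IValue>& args) { inputs_ = args; }
  // ...while rvalue vectors are moved, skipping the refcount bumps.
  void set_inputs(std::vector<IValue>&& args) { inputs_ = std::move(args); }

  template <class IValueList>
  void operator()(
      IValueList&& args,
      const std::unordered_map<std::string, IValue>& kwargs) {
    // std::forward preserves the caller's value category, so the right
    // set_inputs overload is chosen at compile time.
    set_inputs(std::forward<IValueList>(args));
    (void)kwargs;  // kwargs handling elided in this sketch
    // ... run the graph ...
  }

  std::vector<IValue> inputs_;
};
```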
Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.
Reviewed By: hlu1
Differential Revision: D31659973
fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66288
This change makes it so `UseVariadicOp` can transform ops with multiple Tensor list inputs.
Input pattern:
```
%output : Type = op(%list_1, %arg_1, %list_2, %list_3)
```
Output pattern:
```
%output : Type = variadic_op(%list_11, ..., %list_1N, %arg_1, %list_21, ..., %list_2M, %list_31, ..., %list_3K, N, M, K)
```
The length of each list is passed at the end of the variadic op so that the op implementation can process the inputs appropriately. This also frees us from needing to update `hasVarArgs` in static runtime each time we add a variadic op.
This diff also makes `UseVariadicOp` more robust. Before, `list_idx` was passed as an argument. Now, `VariadicUpdater` determines `list_idx` from the node's schema.
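A sketch of how an op implementation can recover the original lists from the flattened inputs using the trailing lengths (the helper and stand-in types are hypothetical):
```
#include <cassert>
#include <cstdint>
#include <vector>

struct IValue {  // stand-in for c10::IValue, illustration only
  int64_t toInt() const { return i; }
  int64_t i = 0;
};

// inputs = [list1_1..list1_N, arg_1, list2_1..list2_M, list3_1..list3_K,
//           N, M, K] -- the last three entries are the list lengths.
void unpackVariadicInputs(const std::vector<IValue>& inputs) {
  assert(inputs.size() >= 3);
  // Read the lengths off the tail of the input vector.
  const int64_t N = inputs[inputs.size() - 3].toInt();
  const int64_t M = inputs[inputs.size() - 2].toInt();
  const int64_t K = inputs[inputs.size() - 1].toInt();
  // list_1 occupies [0, N); arg_1 sits at index N; list_2 occupies
  // [N + 1, N + 1 + M); list_3 occupies [N + 1 + M, N + 1 + M + K).
  std::vector<IValue> list_1(inputs.begin(), inputs.begin() + N);
  std::vector<IValue> list_2(
      inputs.begin() + N + 1, inputs.begin() + N + 1 + M);
  std::vector<IValue> list_3(
      inputs.begin() + N + 1 + M, inputs.begin() + N + 1 + M + K);
  (void)list_1; (void)list_2; (void)list_3;  // consume as needed
}
```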
Test Plan:
Existing variadic ops do not break:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31450811
fbshipit-source-id: 808fcc3ae8940b9e602586f38f8cf9154c9a6462
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65860
Re-enable peepholes like `x + 0 == x`. These were at one point enabled, then disabled because they did not properly account for aliasing, and then re-enabled while reconstructing the alias db every time, which is slow: O(n^2). I've added correctness conditions, and I've also made it so that we avoid using stale aliasing properties for either the input or output of nodes we optimize.
Some of the other code that we have written to avoid re-instantiating the alias db involves internally mutating it; however, this is tricky to reason about, and we would probably have to add some extra invariants...
cc navahgar relevant to graph opts and d1jang alias analysis relevant here
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D31352382
Pulled By: eellison
fbshipit-source-id: 441a27f17dc623d6c24538d1d43cba0412c3c482
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66974
`D31591785 (67e003f09b)` started carrying a function object to be executed and `FunctionKind` for the type of the function *separately*, and this caused a bug fixed by D31783028 (79803b199f).
This change bundles them together again, as swolchok originally had it, to reduce the chances of such a mistake in the future. They always need to be carried together, since `FunctionKind` identifies the type of the function object.
Note that `struct Function` is a POD type, so accessing its fields (`first`, `second`) shouldn't cause any extra overhead in `ProcessedNode::run()`.
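A sketch of the bundling, with the field names taken from the `first`/`second` note above; the enum values and exact types are assumed:
```
// Illustration only: kind and callable travel together, so they can
// never get out of sync again.
struct ProcessedNode;  // forward declaration, stand-in for the real class

enum class FunctionKind { kOutVariant, kNativeFunction, kInterpreterFallback };

struct Function {
  FunctionKind first;              // type tag for the callable
  void (*second)(ProcessedNode*);  // the function to execute
};
// ProcessedNode::run() can read fn.first / fn.second directly; a plain
// aggregate like this adds no indirection or overhead.
```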
Test Plan:
Confirmed that the managed memory metrics remain the same before/after this diff on inline_cvr:
```
#AFTER
# inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
# inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)
# inline_cvr/remote_ro
First iter time: 12.0344 ms
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```
```
#BEFORE
# inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
#inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)
#inline_cvr_remote_ro
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```
Reviewed By: mikeiovine
Differential Revision: D31798419
fbshipit-source-id: fd4301b6731e402be0820729654735c791511aba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66854
A diff tool and script to test the correctness of the flatbuffer format.
Test Plan:
`./verify_flatbuffer.sh | pastry`
P463163180
Reviewed By: zhxchen17
Differential Revision: D31752696
fbshipit-source-id: bea00102b21e62c02367853c8bec2742b483fbda
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45255
Mostly straightforward. The only downside in this PR is the lack of a more scalable way to check for all newly-created nodes in `callPySymbolicFunction`. The other options were:
* Create a scope within the node's scope and loop through all nodes that correspond to the scope. The code would still need to loop through all nodes.
* Add extra state to the graph (no good reason to do so).
* Add extra state to the ONNX exporter, since Python calls go back to `g.op(...)` (no good reason to do so, and also not very pythonic).
cc BowenBao neginraoof
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45256
Reviewed By: malfet, houseroad
Differential Revision: D31744281
Pulled By: msaroufim
fbshipit-source-id: 1b63f6e7f02ed61b3a9b7ac3d0be0a3a203c8ff6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67021
When applying the equally split optimization, we still need to delete the list unpack node.
I did an accuracy test yesterday but didn't catch this issue because my diffs were not properly synced between devservers (I use hlu1's devbig for testing and it had an old version of "Add FuseListUnpackV2"). But I did another test this morning and realized that there was an issue.
This is not affecting anything in prod right now since D31742293 has not landed.
Reviewed By: hlu1
Differential Revision: D31827278
fbshipit-source-id: c7b05e3d8ec942632adcff4bdfebb8c27c1a7a39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66990
NNC fusion groups currently show up as "TensorExpr" in the profiler,
which is true but not super useful since it obscures what's actually happening
in the fusion group. This change will log them as `fused_XXX` where XXX is a
(length-limited) series of ops describing the subgraph, for instance
`fused_mul_add` to represent a group containing `aten::mul`, `aten::add`.
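A sketch of how such a name could be assembled (assumed helper, not the actual NNC code):
```
#include <string>
#include <vector>

// Build "fused_mul_add" from the ops in a fusion group, capping the
// total length so pathological groups don't produce huge names.
std::string fusionGroupName(const std::vector<std::string>& opNames,
                            size_t maxLen = 80) {
  std::string name = "fused";
  for (const auto& op : opNames) {
    // Strip the "aten::" namespace prefix, keeping the base op name.
    const auto pos = op.rfind("::");
    const std::string base =
        (pos == std::string::npos) ? op : op.substr(pos + 2);
    if (name.size() + 1 + base.size() > maxLen) {
      break;  // length-limited: stop appending once we'd exceed the cap
    }
    name += "_" + base;
  }
  return name;
}
// fusionGroupName({"aten::mul", "aten::add"}) == "fused_mul_add"
```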
Test Plan: New unit test to check the output of autograd profiler.
Reviewed By: dzhulgakov
Differential Revision: D31762087
fbshipit-source-id: 3fadbdc67b054faa01aa42e5b6ea2c4a6bc3481f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66509
Like `FuseListUnpack`, but instead of adding arguments to the fused node's outputs, inserts a new fused op.
By using a new fused op, we can avoid runtime `is_fused` checks. This will make the op implementations significantly cleaner. Eventually, we will migrate all ops to `V2` and delete the old pass.
`FuseListUnpackV2` also fixes the bug described in T103159043.
Test Plan: I've made some changes to D31550307 locally and verified that everything works.
Reviewed By: hlu1
Differential Revision: D31492017
fbshipit-source-id: 4f90fcbc17e4c70a3d65985bee836fabf868a22c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66098
`cat` is somewhat special-cased right now because currently we only have lists of Tensor inputs where the list is constructed in the JIT IR graph. While that is generally true for fusion (e.g. why we have ConstantChunk), it may not be true for shape analysis generally, so I'm waiting a bit before generalizing.
Test Plan: Imported from OSS
Reviewed By: navahgar, anjali411
Differential Revision: D31797467
Pulled By: eellison
fbshipit-source-id: ca761e214dfd7f3bba8d189f3b3f42ffec064f63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66097
Adding logic to generate runtime shapes for nodes with multiple outputs. It generalizes the existing flow (looking at a node, getting its shape graph, inlining it, and adding a mapping from the output to the new value in the stitched shape compute graph) to loop over multiple outputs.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31797468
Pulled By: eellison
fbshipit-source-id: 2c182b71a46b36d33f23ad35b89790a4a5d4471c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65575
This is needed for lowering an NNC model to mobile. It is also the last class of unhandled ops which NNC fuses, and we need to integrate this for computing output symbolic shapes.
The graph with two dynamic shape inputs produces:
```
graph(%x.1 : Tensor(SS(-2), 2, 3),
%y.1 : Tensor(SS(-3), 2, 3)):
%5 : int = prim::Constant[value=0]()
%4 : Tensor[] = prim::ListConstruct(%x.1, %y.1)
%6 : Tensor(SS(-4), 2, 3) = aten::cat(%4, %5) # /private/home/eellison/pytorch/test/jit/test_symbolic_shape_analysis.py:290:19
return (%6)
```
With a partial eval graph of
```
Done with partial evaluation
graph(%129 : int[],
%130 : int[],
%dim.14 : int):
%738 : int = prim::Constant[value=3]()
%737 : int = prim::Constant[value=2]()
%132 : int = prim::Constant[value=0]()
%392 : int = aten::__getitem__(%129, %132) # <string>:339:44
%417 : int = aten::__getitem__(%130, %132) # <string>:339:44
%cat_dim_size.48 : int = aten::add(%392, %417) # <string>:339:29
%result_size.5 : int[] = prim::ListConstruct(%cat_dim_size.48, %737, %738)
return (%result_size.5)
```
To handle cat, I essentially make the cat shape op variadic,
replacing
```
torch.cat([x, y])
...
def cat_shape_op(tensors: List[List[int]], dim: int):
...
op(tensors)
```
with
```
def cat_shape_op(x: List[int], y: List[int], dim: int):
tensors = [x, y]
op(tensors)
```
This reuses the existing input Tensor properties partial evaluation path and avoids having to add special handling to optimize out `len(tensors)` calls in the IR.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31797471
Pulled By: eellison
fbshipit-source-id: 62c794533d5fabfd3fad056d7e5fe3e8781b22c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65573
When we remove mutation on
```
x = [0, 1, 3, 4]
x[-2] = 4
```
we have a safety check that the new (normalized) index will be in bounds. In practice this should always be the case; otherwise you would get a runtime error. Within that check (not within the actual adjustment) we were using the wrong input's length, preventing the optimization from firing.
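For reference, a minimal sketch of the normalization that this check guards (hypothetical helper):
```
#include <cstdint>

// A negative list index is adjusted by the length of the list being
// mutated; the guard must use that same list's length.
int64_t normalizeIndex(int64_t index, int64_t list_len) {
  return index < 0 ? index + list_len : index;
}
// x = [0, 1, 3, 4]; x[-2] = 4  =>  normalizeIndex(-2, 4) == 2, in bounds.
```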
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31797469
Pulled By: eellison
fbshipit-source-id: 02a1686b9f6016eb5aeb87ed342c043c203dcd0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65148
No functional changes; this factors out optimizations and renames the `graph` in symbolic shape analysis to `shape_compute_graph`, as ZolotukhinM suggested.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31797447
Pulled By: eellison
fbshipit-source-id: 60d322da040245dd7b47ee7c8996239572fd11c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66554
In native_functions.yaml, the schemas for batch_norm and instance_norm
are incorrect: the inputs `running_mean` and `running_var` are mutated,
but are not marked as such in the function schema. Since `(a!)?`
annotations are currently not working (see #65760), this instead adds a
special case to `alias_analysis.cpp`. If the value of `training` or
`use_input_stats` is known to be `false`, then `alias_analysis` will
mark the input as _not_ being written to.
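A minimal sketch of the shape of that special case (the helper and argument index are assumed, not the exact `alias_analysis.cpp` code):
```
#include <torch/csrc/jit/ir/ir.h>

// In aten::batch_norm's schema, `training` is argument index 5; for
// aten::instance_norm the analogous flag is `use_input_stats`.
bool mutatesRunningStats(const torch::jit::Node* node, size_t flag_index) {
  // If the flag is a compile-time constant `false`, running_mean and
  // running_var are only read, never written.
  auto flag = torch::jit::toIValue(node->input(flag_index));
  if (flag && flag->isBool() && !flag->toBool()) {
    return false;  // statically known eval mode: skip the write annotation
  }
  return true;  // unknown or true: conservatively assume mutation
}
```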
Test Plan:
Removed the `skip` annotation on the following test, and added a special
exception in `check_alias_annotations`:
```
python test/test_ops.py -k test_variant_consistency_jit_nn_functional_batch_norm
```
Also:
```
./build/bin/test_jit --gtest_filter="*BatchAndInstanceNormFixture*"
```
Imported from OSS
Reviewed By: eellison
Differential Revision: D31612339
fbshipit-source-id: 12ca61b782b9e41e06883ba080a276209dc435bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66917
The reported ratio of 'out' variant nodes to total nodes is now 100% for all the models, which is obviously not true.
Reviewed By: swolchok, mikeiovine
Differential Revision: D31783028
fbshipit-source-id: e0bc2c6614aa3c3a235283c9125de1b339f42585
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66098
`cat` is somewhat special-cased right now because currently we only have lists of Tensor inputs where the list is constructed in the JIT IR graph. While that is generally true for fusion (e.g. why we have ConstantChunk), it may not be true for shape analysis generally, so I'm waiting a bit before generalizing.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31732415
Pulled By: eellison
fbshipit-source-id: 7f513cea355f1e4c1d2ca7c32c06690a9bdcb050
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66097
Adding logic to generate runtime shapes for nodes with multiple outputs. It generalizes the existing flow (looking at a node, getting its shape graph, inlining it, and adding a mapping from the output to the new value in the stitched shape compute graph) to loop over multiple outputs.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31732418
Pulled By: eellison
fbshipit-source-id: 767698d031b1daf002678a025b270e0ede429061
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65575
This is needed for lowering an NNC model to mobile. It is also the last class of unhandled ops which NNC fuses, and we need to integrate this for computing output symbolic shapes.
The graph with two dynamic shape inputs produces:
```
graph(%x.1 : Tensor(SS(-2), 2, 3),
%y.1 : Tensor(SS(-3), 2, 3)):
%5 : int = prim::Constant[value=0]()
%4 : Tensor[] = prim::ListConstruct(%x.1, %y.1)
%6 : Tensor(SS(-4), 2, 3) = aten::cat(%4, %5) # /private/home/eellison/pytorch/test/jit/test_symbolic_shape_analysis.py:290:19
return (%6)
```
With a partial eval graph of
```
Done with partial evaluation
graph(%129 : int[],
%130 : int[],
%dim.14 : int):
%738 : int = prim::Constant[value=3]()
%737 : int = prim::Constant[value=2]()
%132 : int = prim::Constant[value=0]()
%392 : int = aten::__getitem__(%129, %132) # <string>:339:44
%417 : int = aten::__getitem__(%130, %132) # <string>:339:44
%cat_dim_size.48 : int = aten::add(%392, %417) # <string>:339:29
%result_size.5 : int[] = prim::ListConstruct(%cat_dim_size.48, %737, %738)
return (%result_size.5)
```
To handle cat, I essentially make the cat shape op variadic,
replacing
```
torch.cat([x, y])
...
def cat_shape_op(tensors: List[List[int]], dim: int):
...
op(tensors)
```
with
```
def cat_shape_op(x: List[int], y: List[int], dim: int):
tensors = [x, y]
op(tensors)
```
This reuses the existing input Tensor properties partial evaluation path and avoids having to add special handling to optimize out `len(tensors)` calls in the IR.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31732416
Pulled By: eellison
fbshipit-source-id: 6d93ddf62c34846ec238159f75229632515530b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65573
When we remove mutation on
```
x = [0, 1, 3, 4]
x[-2] = 4
```
we have a safety check that the new (normalized) index will be in bounds. In practice this should always be the case; otherwise you would get a runtime error. Within that check (not within the actual adjustment) we were using the wrong input's length, preventing the optimization from firing.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31732417
Pulled By: eellison
fbshipit-source-id: dd734254c0212ca459c1c135da262974de5299be