Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72871
We do this same trick in the native MHA implementation; backport it for purposes of fair comparison.
ghstack-source-id: 149526858
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D34176090
fbshipit-source-id: 8b578c29c4dcf0d85bae74dfbbb82db9a8f32dc7
(cherry picked from commit fd50170935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72934
Before this PR, DBR quantization had a limitation on handling user
code which iterates over all module children. For example, imagine
a forward function such as
```
def forward(self, x):
    for module in self:
        x = module(x)
    return x
```
Before this PR, this code would break with DBR quantization, because
we attach `AutoQuantizationState` objects to each child, and those
objects live in the child's module hierarchy and will appear in
these kinds of iterations, changing the meaning of the user program.
This PR reduces the scope of this problem to just the top level module.
Instead of attaching `AutoQuantizationState` objects to each child,
we register them in a map on the parent. Here is a before and after:
```
// toy model
model
|--> child1
// toy model with AutoQuantizationState objects, before this PR
model
|--> child1
|    |--> _auto_quant_state
|--> _auto_quant_state
// toy model with AutoQuantizationState objects, after this PR
model
|--> child1
|--> _fqn_to_auto_quant_state_map
     |--> ( ) --> _auto_quant_state // of `model`
     |--> (child1) --> _auto_quant_state // of `model.child1`
```
Note: `child1._auto_quant_state` works as before for convenience,
but the `child1` object now stores a soft link to its `_auto_quant_state`
instead of properly registering it in its module hierarchy. This is
somewhat hacky. If we need to improve this in the future, we could
remove this soft link and refactor the code to call the FQN map
instead.
Note: if the top level module iterates over its children, things will
still be broken. This is less likely, and we will recommend that the
user work around this by wrapping their model, or checking for the
`AutoQuantizationStateModuleDict` type in their iteration loop.
The impact of this change should be an improvement of coverage
of user models. In fact, we expect this to drive our coverage of
torchbenchmark models from 89% to 100%.
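A minimal sketch of why the parent-side map helps, in plain Python. `ToyModule`, `ToyQuantState`, and the helper functions are hypothetical stand-ins and not the DBR quantization code; the sketch only assumes that iterating a module yields its registered children.

```python
class ToyModule:
    """Hypothetical stand-in for a module that tracks registered children."""

    def __init__(self):
        self._children = {}

    def add_child(self, name, child):
        self._children[name] = child

    def __iter__(self):
        # Mirrors user code such as `for module in self:`.
        return iter(self._children.values())


class ToyQuantState:
    """Placeholder for an AutoQuantizationState-like object."""


def build_toy_model():
    model, child1 = ToyModule(), ToyModule()
    child1.add_child("0", ToyModule())
    child1.add_child("1", ToyModule())
    model.add_child("child1", child1)
    return model, child1


def attach_per_child(model):
    # Layout before the PR: state is registered as a child of every module.
    for child in list(model):
        child.add_child("_auto_quant_state", ToyQuantState())
    model.add_child("_auto_quant_state", ToyQuantState())


def attach_via_parent_map(model):
    # Layout after the PR: one map on the top-level module, keyed by FQN.
    fqn_map = {"": ToyQuantState()}
    for name in list(model._children):
        fqn_map[name] = ToyQuantState()
    model.add_child("_fqn_to_auto_quant_state_map", fqn_map)


model, child1 = build_toy_model()
attach_per_child(model)
children_seen_before = len(list(child1))  # user modules plus the injected state

model, child1 = build_toy_model()
attach_via_parent_map(model)
children_seen_after = len(list(child1))  # user modules only
top_level_seen_after = len(list(model))  # child1 plus the map
```

In this toy, iteration inside `child1` is unaffected by the second layout, while the top-level module still sees one extra entry, which matches the remaining limitation noted above.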
Test Plan:
```
// previously disabled test cases with user code iterating
// over module children are now enabled, with wrappers
python test/test_quantization.py -k test_module_calls_items
python test/test_quantization.py -k test_vovnet_sequential
```
Reviewed By: dzdang
Differential Revision: D34281074
Pulled By: vkuzo
fbshipit-source-id: 0e25fc1ec529c47f72478a1875fe43219feac6b1
(cherry picked from commit 4008f89967)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.
**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).
To address this, I
- added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected SPSD use case where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that the multiple parameter group method for construction works even on a single rank.
**Test Plan**
- I checked both tests for CPU, 1 GPU, 2 GPUs, 4 GPUs, and 8 GPUs.
- I added the `ciflow/win` label to run the failing Windows CI test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932
Reviewed By: rohan-varma
Differential Revision: D34281482
Pulled By: awgu
fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit 6bea9bcc63)
Summary:
Potentially fixes https://github.com/pytorch/pytorch/issues/71385. A similar docstring change could also fix https://github.com/pytorch/pytorch/issues/71384.
Updated the docs for `torch.linalg.inv` to explain its equivalence to `torch.linalg.solve`.
The update is below:
```
.. note::
Consider using :func:`torch.linalg.solve` if possible for multiplying a matrix on the left by
the inverse, as::
linalg.solve(A, B) == linalg.inv(A) @ B # When B is a matrix
It is always preferred to use :func:`~solve` when possible, as it is faster and more
numerically stable than computing the inverse explicitly.
```
IvanYashchuk, please let me know whether this is the right direction or an over-extrapolation. I can apply the same changes to the `tensorinv` doc to fix https://github.com/pytorch/pytorch/issues/71384. Also, https://github.com/pytorch/pytorch/issues/71384 mentions updating the `torch.matmul` error message to indicate the proper tensor shapes; I could do that in this PR as well if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71769
Reviewed By: H-Huang
Differential Revision: D34242541
Pulled By: mruberry
fbshipit-source-id: 40e98dad4d821928d1dea72d4512ee579b690a32
(cherry picked from commit a0321a5de9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71538
`reportMemoryUsage` is kind of awful. It does a bunch of string writes and such that make it VERY expensive. Just moving that work off the hot path reduces the overhead for `profile_memory` from ~6.5 us to ~1.2 us. (85% reduction in the kineto contribution to profiling overhead.)
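The general pattern of moving string work off the hot path can be sketched as follows. `MemoryEventRecorder` and its field names are hypothetical and are not the kineto or profiler API; the sketch only illustrates recording raw values first and formatting them once, later.

```python
import json


class MemoryEventRecorder:
    """Hypothetical recorder: raw values on the hot path, strings built later."""

    def __init__(self):
        self._events = []

    def report(self, ptr, alloc_size, total_allocated, device_index):
        # Hot path: append one tuple, no string formatting.
        self._events.append((ptr, alloc_size, total_allocated, device_index))

    def materialize(self):
        # Cold path: run once, e.g. when profiling stops.
        return [
            json.dumps(
                {"addr": hex(ptr), "bytes": size, "total": total, "device": dev}
            )
            for ptr, size, total, dev in self._events
        ]


recorder = MemoryEventRecorder()
recorder.report(0x10, 64, 64, 0)
recorder.report(0x50, -64, 0, 0)
formatted = recorder.materialize()
```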
Test Plan: Ran ubenchmark with `--op empty --stressTestKineto --kinetoProfileMemory`
Reviewed By: swolchok
Differential Revision: D32730167
fbshipit-source-id: fe18e8fa3881967cad8fa1c26c71c805e9b034e5
(cherry picked from commit 0d394cb252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72469
1. Implement the framework to allow users to choose among `state_dict`, `local_state_dict`, and `sharded_state_dict`.
2. Implement ShardedTensor-compatible local_state_dict() and load_local_state_dict().
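One way such a selection framework could look, as a rough sketch in plain Python. `StateDictKind`, `ToyShardedModule`, and the meaning given to each kind are simplifying assumptions for illustration and do not reflect the actual implementation in this PR.

```python
from contextlib import contextmanager
from enum import Enum, auto


class StateDictKind(Enum):
    FULL = auto()
    LOCAL = auto()
    SHARDED = auto()


class ToyShardedModule:
    """Hypothetical module whose single weight is split across ranks."""

    def __init__(self, all_shards, rank):
        self._all_shards = [list(s) for s in all_shards]  # simulated ranks
        self._rank = rank
        self._kind = StateDictKind.FULL

    @contextmanager
    def state_dict_type(self, kind):
        # Temporarily switch which flavor state_dict() returns.
        previous, self._kind = self._kind, kind
        try:
            yield
        finally:
            self._kind = previous

    def state_dict(self):
        local = list(self._all_shards[self._rank])
        if self._kind is StateDictKind.FULL:
            return {"weight": [v for s in self._all_shards for v in s]}
        if self._kind is StateDictKind.LOCAL:
            return {"weight": local}
        offset = sum(len(s) for s in self._all_shards[: self._rank])
        return {"weight": {"offset": offset, "values": local}}


module = ToyShardedModule([[1, 2], [3, 4]], rank=1)
full = module.state_dict()
with module.state_dict_type(StateDictKind.LOCAL):
    local = module.state_dict()
with module.state_dict_type(StateDictKind.SHARDED):
    sharded = module.state_dict()
restored = module.state_dict()
```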
ghstack-source-id: 149559985
Test Plan: CI
Reviewed By: rohan-varma
Differential Revision: D33919683
fbshipit-source-id: c9f1b43ce04da7db65c4aebf6ac2c7a0ac5e9de8
(cherry picked from commit 55fd6230c9)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72765.
- [x] Improved `NotImplementedError` verbosity.
- [x] Automate the docstring generation process
## Improved `NotImplementedError` verbosity
### Code
```python
import torch
dist = torch.distributions
torch_normal = dist.Normal(loc=0.0, scale=1.0)
torch_mixture = dist.MixtureSameFamily(
dist.Categorical(torch.ones(5,)
),
dist.Normal(torch.randn(5,), torch.rand(5,)),
)
dist.kl_divergence(torch_normal, torch_mixture)
```
#### Output before this PR
```python
NotImplementedError:
```
#### Output after this PR
```python
NotImplementedError: No KL(p || q) is implemented for p type Normal and q type MixtureSameFamily
```
## Automate the docstring generation process
### Docstring before this PR
```python
Compute Kullback-Leibler divergence :math:`KL(p \| q)` between two distributions.
.. math::
KL(p \| q) = \int p(x) \log\frac {p(x)} {q(x)} \,dx
Args:
p (Distribution): A :class:`~torch.distributions.Distribution` object.
q (Distribution): A :class:`~torch.distributions.Distribution` object.
Returns:
Tensor: A batch of KL divergences of shape `batch_shape`.
Raises:
NotImplementedError: If the distribution types have not been registered via
:meth:`register_kl`.
```
### Docstring after this PR
```python
Compute Kullback-Leibler divergence :math:`KL(p \| q)` between two distributions.
.. math::
KL(p \| q) = \int p(x) \log\frac {p(x)} {q(x)} \,dx
Args:
p (Distribution): A :class:`~torch.distributions.Distribution` object.
q (Distribution): A :class:`~torch.distributions.Distribution` object.
Returns:
Tensor: A batch of KL divergences of shape `batch_shape`.
Raises:
NotImplementedError: If the distribution types have not been registered via
:meth:`register_kl`.
KL divergence is currently implemented for the following distribution pairs:
* :class:`~torch.distributions.Bernoulli` and :class:`~torch.distributions.Bernoulli`
* :class:`~torch.distributions.Bernoulli` and :class:`~torch.distributions.Poisson`
* :class:`~torch.distributions.Beta` and :class:`~torch.distributions.Beta`
* :class:`~torch.distributions.Beta` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.Beta` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.Beta` and :class:`~torch.distributions.Gamma`
* :class:`~torch.distributions.Beta` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.Beta` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.Beta` and :class:`~torch.distributions.Uniform`
* :class:`~torch.distributions.Binomial` and :class:`~torch.distributions.Binomial`
* :class:`~torch.distributions.Categorical` and :class:`~torch.distributions.Categorical`
* :class:`~torch.distributions.Cauchy` and :class:`~torch.distributions.Cauchy`
* :class:`~torch.distributions.ContinuousBernoulli` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.ContinuousBernoulli` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.ContinuousBernoulli` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.ContinuousBernoulli` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.ContinuousBernoulli` and :class:`~torch.distributions.Uniform`
* :class:`~torch.distributions.Dirichlet` and :class:`~torch.distributions.Dirichlet`
* :class:`~torch.distributions.Exponential` and :class:`~torch.distributions.Beta`
* :class:`~torch.distributions.Exponential` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.Exponential` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.Exponential` and :class:`~torch.distributions.Gamma`
* :class:`~torch.distributions.Exponential` and :class:`~torch.distributions.Gumbel`
* :class:`~torch.distributions.Exponential` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.Exponential` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.Exponential` and :class:`~torch.distributions.Uniform`
* :class:`~torch.distributions.ExponentialFamily` and :class:`~torch.distributions.ExponentialFamily`
* :class:`~torch.distributions.Gamma` and :class:`~torch.distributions.Beta`
* :class:`~torch.distributions.Gamma` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.Gamma` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.Gamma` and :class:`~torch.distributions.Gamma`
* :class:`~torch.distributions.Gamma` and :class:`~torch.distributions.Gumbel`
* :class:`~torch.distributions.Gamma` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.Gamma` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.Gamma` and :class:`~torch.distributions.Uniform`
* :class:`~torch.distributions.Geometric` and :class:`~torch.distributions.Geometric`
* :class:`~torch.distributions.Gumbel` and :class:`~torch.distributions.Beta`
* :class:`~torch.distributions.Gumbel` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.Gumbel` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.Gumbel` and :class:`~torch.distributions.Gamma`
* :class:`~torch.distributions.Gumbel` and :class:`~torch.distributions.Gumbel`
* :class:`~torch.distributions.Gumbel` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.Gumbel` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.Gumbel` and :class:`~torch.distributions.Uniform`
* :class:`~torch.distributions.HalfNormal` and :class:`~torch.distributions.HalfNormal`
* :class:`~torch.distributions.Independent` and :class:`~torch.distributions.Independent`
* :class:`~torch.distributions.Laplace` and :class:`~torch.distributions.Beta`
* :class:`~torch.distributions.Laplace` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.Laplace` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.Laplace` and :class:`~torch.distributions.Gamma`
* :class:`~torch.distributions.Laplace` and :class:`~torch.distributions.Laplace`
* :class:`~torch.distributions.Laplace` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.Laplace` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.Laplace` and :class:`~torch.distributions.Uniform`
* :class:`~torch.distributions.LowRankMultivariateNormal` and :class:`~torch.distributions.LowRankMultivariateNormal`
* :class:`~torch.distributions.LowRankMultivariateNormal` and :class:`~torch.distributions.MultivariateNormal`
* :class:`~torch.distributions.MultivariateNormal` and :class:`~torch.distributions.LowRankMultivariateNormal`
* :class:`~torch.distributions.MultivariateNormal` and :class:`~torch.distributions.MultivariateNormal`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.Beta`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.Gamma`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.Gumbel`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.Laplace`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.Normal` and :class:`~torch.distributions.Uniform`
* :class:`~torch.distributions.OneHotCategorical` and :class:`~torch.distributions.OneHotCategorical`
* :class:`~torch.distributions.Pareto` and :class:`~torch.distributions.Beta`
* :class:`~torch.distributions.Pareto` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.Pareto` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.Pareto` and :class:`~torch.distributions.Gamma`
* :class:`~torch.distributions.Pareto` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.Pareto` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.Pareto` and :class:`~torch.distributions.Uniform`
* :class:`~torch.distributions.Poisson` and :class:`~torch.distributions.Bernoulli`
* :class:`~torch.distributions.Poisson` and :class:`~torch.distributions.Binomial`
* :class:`~torch.distributions.Poisson` and :class:`~torch.distributions.Poisson`
* :class:`~torch.distributions.TransformedDistribution` and :class:`~torch.distributions.TransformedDistribution`
* :class:`~torch.distributions.Uniform` and :class:`~torch.distributions.Beta`
* :class:`~torch.distributions.Uniform` and :class:`~torch.distributions.ContinuousBernoulli`
* :class:`~torch.distributions.Uniform` and :class:`~torch.distributions.Exponential`
* :class:`~torch.distributions.Uniform` and :class:`~torch.distributions.Gamma`
* :class:`~torch.distributions.Uniform` and :class:`~torch.distributions.Gumbel`
* :class:`~torch.distributions.Uniform` and :class:`~torch.distributions.Normal`
* :class:`~torch.distributions.Uniform` and :class:`~torch.distributions.Pareto`
* :class:`~torch.distributions.Uniform` and :class:`~torch.distributions.Uniform`
```
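Both changes can be illustrated with a small registry sketch in plain Python. The toy `Normal` and `Uniform` classes, `_KL_REGISTRY`, and `add_kl_info` are hypothetical, and the lookup here matches exact types only, which is a simplification rather than a description of how `torch.distributions` dispatches.

```python
import math

_KL_REGISTRY = {}


def register_kl(type_p, type_q):
    def decorator(fn):
        _KL_REGISTRY[(type_p, type_q)] = fn
        return fn
    return decorator


def kl_divergence(p, q):
    """Compute KL(p || q) for registered type pairs."""
    fn = _KL_REGISTRY.get((type(p), type(q)))
    if fn is None:
        # Name both types so the failure is actionable.
        raise NotImplementedError(
            f"No KL(p || q) is implemented for p type {type(p).__name__} "
            f"and q type {type(q).__name__}"
        )
    return fn(p, q)


def add_kl_info():
    # Append the registered pairs to the docstring, sorted by class name.
    rows = ["KL divergence is currently implemented for the following distribution pairs:"]
    pairs = sorted(_KL_REGISTRY, key=lambda pq: (pq[0].__name__, pq[1].__name__))
    rows += [f"    * {p.__name__} and {q.__name__}" for p, q in pairs]
    kl_divergence.__doc__ = (kl_divergence.__doc__ or "") + "\n\n" + "\n".join(rows)


class Normal:
    def __init__(self, loc, scale):
        self.loc, self.scale = loc, scale


class Uniform:
    def __init__(self, low, high):
        self.low, self.high = low, high


@register_kl(Normal, Normal)
def _kl_normal_normal(p, q):
    var_ratio = (p.scale / q.scale) ** 2
    t1 = ((p.loc - q.loc) / q.scale) ** 2
    return 0.5 * (var_ratio + t1 - 1 - math.log(var_ratio))


add_kl_info()
```

Generating the list from the registry keeps the docstring in sync with whatever pairs are registered at import time.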
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72845
Reviewed By: mikaylagawarecki
Differential Revision: D34344551
Pulled By: soulitzer
fbshipit-source-id: 7a603613a2f56f71138d56399c7c521e2238e8c5
(cherry picked from commit 6b2a51c796)
Summary:
Adds documentation about compiling extensions with CUDA 11.5 on Windows.
Example of failure: https://github.com/pytorch/pytorch/runs/4408796098?check_suite_focus=true
Note: Don't use torch/extension.h with CUDA 11.5 on Windows in your C++ code.
Use the ATen interface instead of the torch interface in all CUDA 11.5 code on Windows. The torch interface has been failing with errors due to a bug in nvcc.
Example use:
>>> #include <ATen/ATen.h>
>>> at::Tensor SigmoidAlphaBlendForwardCuda(....)
Instead of:
>>> #include <torch/extension.h>
>>> torch::Tensor SigmoidAlphaBlendForwardCuda(...)
Currently open issue for nvcc bug: https://github.com/pytorch/pytorch/issues/69460
Complete Workaround code example: cb170ac024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73013
Reviewed By: malfet, seemethere
Differential Revision: D34306134
Pulled By: atalman
fbshipit-source-id: 3c5b9d7a89c91bd1920dc63dbd356e45dc48a8bd
(cherry picked from commit 87098e7f17)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73084
I'm wrapping up the conversion of type comments to type annotations
in caffe2. The last remaining "bulk" codemod has test failures that
are hard for me to understand, so I'm going to submit PRs for each
module individually which makes it easier to see what's causing
problems.
All the codemods were produced via LibCST and then manually cleaned up.
Test Plan: Wait for github CI
Reviewed By: H-Huang
Differential Revision: D34344289
fbshipit-source-id: e8e3a13c3d95f6804829f1818fb7f0605e5ba137
(cherry picked from commit 92d47d9cd5)
Currently we are unable to utilize ONNX's SpaceToDepth operator due to the lack of the mode_s attribute, hence we add an alternative symbolic in opset 9 to support pixel_unshuffle.
- Adds support for pixel_unshuffle in opset9
- Adds support for dynamic input shapes for pixel_shuffle and pixel_unshuffle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72449
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72944
Doesn't make sense to develop it in core right now.
ghstack-source-id: 149456040
Test Plan:
CI
run MHA benchmark in benchmark_transformers.py to make sure it doesn't crash
Reviewed By: zrphercule
Differential Revision: D34283104
fbshipit-source-id: 4f0c7a6bc066f938ceac891320d4cf4c3f8a9cd6
(cherry picked from commit b9df65e97c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72836
Replacing increment iterator loops with ranged loops. It allows loops such as `for(int i=0;i<10;i++)` to be expressed as `for(const auto i : c10::irange(10))`. This auto-types the loops and adds const-safety to the iteration variable.
Reviewed By: albanD
Differential Revision: D34136539
fbshipit-source-id: 760a70ad43ce6f05630ba8fea261d4dbb699e62e
(cherry picked from commit 0428408d88)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72946
The passes to replace ops with copy variants are run after TensorExpr fusion. Because of this, the resulting graph does not conform to the assumptions made in the fuser.
So, even if the flags `use_copy_variants` and `use_maybe_copy_variants` are turned on, the corresponding passes will not be executed if TensorExpr fusion is enabled.
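The gating described above could be sketched like this in plain Python. The function and the pass names are hypothetical placeholders and not Static Runtime's actual pass list; only the two flag names come from the description.

```python
def select_passes(enable_tensorexpr_fusion, use_copy_variants, use_maybe_copy_variants):
    """Return the ordered list of (hypothetical) pass names that would run."""
    passes = []
    if enable_tensorexpr_fusion:
        passes.append("tensorexpr_fusion")
        # Copy-variant passes are skipped: the fused graph would no longer
        # match the assumptions made in the fuser.
        return passes
    if use_copy_variants:
        passes.append("replace_with_copy")
    if use_maybe_copy_variants:
        passes.append("replace_with_maybe_copy")
    return passes
```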
ghstack-source-id: 149429753
Test Plan: Tested locally.
Reviewed By: mikeiovine
Differential Revision: D34283842
fbshipit-source-id: 74edea517a00c85dff0319f9c8b3ac8befe09018
(cherry picked from commit 3798af7f1b)
Extend shape inference support for `Expand` when the value of argument `shape` is unknown. If the shape of argument `shape` is known, infer the rank of the output of `Expand` and set its shape to dynamic.
Without this, shape inference aborts, and falls back to the static shape provided by tracer, which is incorrect in many cases.
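A rough sketch of the rank inference in plain Python, assuming the usual broadcasting rule that the output rank of `Expand` is the larger of the input rank and the length of the `shape` tensor. The function name and the use of `None` for a dynamic dimension are illustrative choices, not the exporter's code.

```python
from typing import List, Optional


def infer_expand_output_shape(
    input_rank: int, shape_tensor_len: Optional[int]
) -> Optional[List[Optional[int]]]:
    """Return a fully dynamic output shape, or None if even the rank is unknown."""
    if shape_tensor_len is None:
        # Nothing is known about `shape`; no inference is possible.
        return None
    out_rank = max(input_rank, shape_tensor_len)
    # Every dimension is dynamic because the values in `shape` are unknown.
    return [None] * out_rank
```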
Co-authored-by: BowenBao <bowbaomicrosoft.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72985
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72431
Adds support for a fused QAT observed module for `Linear` followed by
`BatchNorm1d`. In this PR, only the support for prepared module with
fake_quants in the right places is added.
A future PR will add support for `convert`, and tests for eager and FX
graph mode workflows.
Similar to conv-bn, we rescale the weight before applying the fake
quant, and undo the rescaling after the linear operation.
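The rescale-then-undo step can be sketched for a single output channel in plain Python. The toy `fake_quant` grid, the function names, and the scale factor `gamma / sqrt(running_var + eps)` follow the conv-bn pattern as an assumption; this is not the module added in this PR.

```python
import math


def fake_quant(values, step=0.05):
    # Toy fake-quantization: snap each value to a uniform grid.
    return [round(v / step) * step for v in values]


def linear_bn_qat_channel(w_row, x, gamma, running_var, eps=1e-5, fake_quant_fn=fake_quant):
    """One output channel of a Linear whose weight is fake-quantized in BN-scaled space."""
    scale_factor = gamma / math.sqrt(running_var + eps)
    scaled_w = [w * scale_factor for w in w_row]    # rescale before fake quant
    quant_w = fake_quant_fn(scaled_w)
    out = sum(w * xi for w, xi in zip(quant_w, x))  # linear op on the scaled weight
    return out / scale_factor                       # undo the rescaling
```

With an identity `fake_quant_fn`, the result reduces to the plain linear output, which is a quick sanity check that the rescaling is undone correctly.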
Test Plan:
```
python test/test_quantization.py TestQuantizeEagerQATNumerics.test_linear_bn
```
Imported from OSS
Reviewed By: jerryzh168, raghuramank10000
Differential Revision: D34044427
fbshipit-source-id: 47a519173939ca4824d2c6e6ea7a599764a8ed10
(cherry picked from commit bfc75fe078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72875
This diff contains changes from several PRs landed to lazy_tensor_staging branch.
* generates 'fallback' overrides for each codegenned op, useful for debugging
* supports operators which are missing aten:: symbols for op names, instead using their string counterpart
* makes the IR class a base class instead of hardcoding the assumption of TS
It also resolves lint issues and in particular cleans up the following:
* {Type}s shouldn't be passed into isValueType, and using the catch-all base class of CType is nicer than specifying a list of types.
Fixes #72852
Test Plan: test manually on lazy_tensor_staging branch
Reviewed By: shunting314
Differential Revision: D34250357
fbshipit-source-id: aa7d589f605055d5d02bc77c77fa6f1182ff7497
(cherry picked from commit 2f8f5e4971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73032
Currently, ptvsc2_predictor_bench reports nothing when the input size is zero. However, Static Runtime's module creation has some useful information even after loading a model.
This change reports static op statistics when the given input's size is zero. In addition to that, this enables it to report the out variant coverage percentage, which is crucial to establish the baseline performance of Static Runtime.
Test Plan: - Ran `ptvsc2_predictor_bench` with this change as seen above.
Reviewed By: mikeiovine
Differential Revision: D34294803
fbshipit-source-id: 80c02199075dae9280657d6edecc7c679c1c27f4
(cherry picked from commit 83aec141a2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73028
A typical use case for `TensorExprKernel` is to create the kernel once and call it multiple times, possibly in parallel. For the parallel calls to work, we need to ensure that the run() method calls do not change any state in `TensorExprKernel`.
Before this change, the `run()` method was modifying the sizes and strides vectors when dynamic shapes were present. This manifested as a data race when running a model with Static Runtime.
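The thread-safe pattern can be sketched in plain Python: per-call sizes and strides are kept in locals instead of on the kernel object. `ToyKernel` and the helpers are hypothetical and only illustrate the idea, not `TensorExprKernel` itself.

```python
from concurrent.futures import ThreadPoolExecutor


def contiguous_strides(sizes):
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides


class ToyKernel:
    """Hypothetical kernel: built once, run many times, possibly in parallel."""

    def run(self, sizes):
        # Per-call buffers are locals, so concurrent calls share no mutable state.
        local_sizes = list(sizes)
        local_strides = contiguous_strides(local_sizes)
        return {"sizes": local_sizes, "strides": local_strides}


def run_in_parallel(kernel, shapes, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel.run, shapes))


results = run_in_parallel(ToyKernel(), [[2, 3, 4], [5, 6]] * 50)
```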
ghstack-source-id: 149398820
Test Plan:
```
buck build mode/dev-asan //caffe2/test/cpp/tensorexpr:tensorexpr
./buck-out/dev/gen/caffe2/test/cpp/tensorexpr/tensorexpr --gtest_filter="DynamicShapes.MultiThreadedExecution"
```
Reviewed By: eellison
Differential Revision: D34287960
fbshipit-source-id: d311f3c5a66c5d5de4e1deaeaa01816b53e9906e
(cherry picked from commit 161568bfae)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72925
Reland with fix to add the owner string in test file
ghstack-source-id: 149280348
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34273858
fbshipit-source-id: 2174c1d71fcc5148282d94e375071a50b92114f2
(cherry picked from commit 158762bbb3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587
This pattern frequently appears in a few graphs:
```
%result = prim::If(%condition)
block0():
-> (%a)
block1():
-> (%b)
```
This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.
This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:
```
%result = prim::IfThenElse(%condition, %a, %b)
```
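The rewrite can be sketched on a toy IR in plain Python. The `Node` and `Block` classes and `replace_trivial_ifs` are hypothetical stand-ins for the TorchScript graph API; the sketch only rewrites `prim::If` nodes whose two blocks are empty and return one value each.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    nodes: List["Node"] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


@dataclass
class Node:
    kind: str
    inputs: List[str]
    outputs: List[str]
    blocks: List[Block] = field(default_factory=list)


def replace_trivial_ifs(nodes):
    rewritten = []
    for node in nodes:
        is_trivial_if = (
            node.kind == "prim::If"
            and len(node.blocks) == 2
            and len(node.outputs) == 1
            and all(not b.nodes and len(b.outputs) == 1 for b in node.blocks)
        )
        if is_trivial_if:
            then_val, else_val = (b.outputs[0] for b in node.blocks)
            # Fold condition and both branch values into one ternary-like op.
            rewritten.append(
                Node("prim::IfThenElse", [node.inputs[0], then_val, else_val], list(node.outputs))
            )
        else:
            rewritten.append(node)
    return rewritten


trivial = Node("prim::If", ["%condition"], ["%result"],
               [Block(outputs=["%a"]), Block(outputs=["%b"])])
busy = Node("prim::If", ["%c2"], ["%r2"],
            [Block(nodes=[Node("aten::relu", ["%x"], ["%y"])], outputs=["%y"]),
             Block(outputs=["%b"])])
new_graph = replace_trivial_ifs([trivial, busy])
```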
Test Plan: New unit tests
Reviewed By: eellison
Differential Revision: D34091789
fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72866
https://github.com/pytorch/pytorch/pull/71597 adds a wrapper `torch.jit.LoweredWrapper` and it breaks the model dump. Fix the model_dump in the notebook.
ghstack-source-id: 149311636
Test Plan:
CI and test with N509022
Before:
{F701413403}
After:
{F701412963}
Reviewed By: iseeyuan
Differential Revision: D34247216
fbshipit-source-id: 695b02b03675fae596bb450441b327e4cdcffe9c
(cherry picked from commit d46a82a4c1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72592
Only code paths that are not perf-critical read `ProcessedNode::num_outputs_`, and it is also a static feature of the op that the `ProcessedNode` instance is executing.
Therefore, it's better to move `ProcessedNode::num_outputs_` into `ProcessedFunction::num_outputs_` and let `ProcessedNode` access it via `ProcessedNode::fn_` for its occasional use. Note that this prevents duplicating num_outputs_ per node & per Static Runtime instance since `ProcessedFunction` instances are shared across all runtime instances.
Local instrumentation confirms that this change reduces `sizeof(ProcessedNode)` by 14%:
- Before
-- sizeof(ProcessedNode): 56
- After
-- sizeof(ProcessedNode): 48
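The data layout change can be sketched in plain Python: the per-op constant lives on a shared function object and each node reaches it through its `fn` reference. The `*Sketch` classes are hypothetical, and Python object sizes do not correspond to the C++ `sizeof` numbers above.

```python
class ProcessedFunctionSketch:
    """Shared across nodes and runtime instances; holds static per-op data."""

    __slots__ = ("op_name", "num_outputs")

    def __init__(self, op_name, num_outputs):
        self.op_name = op_name
        self.num_outputs = num_outputs


class ProcessedNodeSketch:
    """Per-node state only; static data is read through `fn`."""

    __slots__ = ("fn", "inputs")

    def __init__(self, fn, inputs):
        self.fn = fn
        self.inputs = inputs

    @property
    def num_outputs(self):
        # Occasional, non-perf-critical access goes through the shared function.
        return self.fn.num_outputs


shared_fn = ProcessedFunctionSketch("aten::add", 1)
nodes = [ProcessedNodeSketch(shared_fn, [i, i + 1]) for i in range(3)]
```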
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: mikeiovine
Differential Revision: D33984792
fbshipit-source-id: e29ffc97b799e679215f42e1e85cd3fcd7e88983
(cherry picked from commit 0f7003f4df)