Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73198
Previously, if an arg to an FX node was a subclass of tuple, it was effectively sanitized back to that base class. For example, an arg set to a TensorMetadata object, which is a NamedTuple, would end up stored as a plain tuple instead.
- Change `map_aggregate` to repack the tuple as `type(a)` when it's not directly a tuple (best-effort, via try/except; see the sketch after this list)
- During codegen, call `add_global` for `type(a)` if it's not directly a tuple.
- Add an option for an arg to provide a `_custom_fx_repr_fn` used when stringifying via `_format_arg`
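A minimal sketch of the repacking idea, using a hypothetical helper and a stand-in NamedTuple (the actual change lives inside `map_aggregate`):
```python
from typing import NamedTuple

class TensorMetadata(NamedTuple):  # stand-in for the real FX TensorMetadata
    shape: tuple
    dtype: str

def map_and_repack(a, fn):
    # Map fn over the elements, then try to rebuild the original tuple subclass
    # (e.g. a NamedTuple) instead of collapsing it to a plain tuple.
    mapped = tuple(fn(elem) for elem in a)
    if type(a) is tuple:
        return mapped
    try:
        return type(a)(*mapped)  # NamedTuple constructors take positional fields
    except TypeError:
        return mapped            # best-effort fallback to a plain tuple

meta = TensorMetadata(shape=(2, 3), dtype="float32")
print(type(map_and_repack(meta, lambda x: x)).__name__)  # TensorMetadata
```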
Test Plan: Added unit test coverage, where we inline the named tuple into arg/kwarg.
Reviewed By: jamesr66a
Differential Revision: D34381888
fbshipit-source-id: bd672a8542e2bba5aa604b448bec920efc256440
(cherry picked from commit 68f99c12dd)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73215
Fixes an issue in the `_multi_tensor` optimizers, specifically in `sgd_mt`, introduced in 2cb03e926f
Reviewed By: mikaylagawarecki
Differential Revision: D34389034
fbshipit-source-id: ede153d52dca15909c6c022853589707f18dc8d1
(cherry picked from commit cc8a58e584)
Summary:
This PR is a follow-up to the following PRs:
https://github.com/pytorch/pytorch/pull/69942
https://github.com/pytorch/pytorch/pull/72682
We are adding support for Navi21 GPUs, which have a warp size of 32. We cannot rely on a compile-time constant, so we have to dynamically look up the warp size when launching the kernel on the host side. Inside device functions this is not needed, and the compiler can correctly detect the warp size to replace the C10_WARP_SIZE constant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72809
Reviewed By: mruberry
Differential Revision: D34400737
Pulled By: ngimel
fbshipit-source-id: 1a1374465d4006e485d4d11531a4c78ddb178cdf
(cherry picked from commit 94211fe1f0)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72910
`last_dim_size` is the expected output size for the Hermitian-compressed dimension and must be > 0. The confusingly named `ld` represents the input's last dim size, which is calculated as `last_dim_size / 2 + 1` and so can never be 0.
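As a quick illustration of the size relationship referred to above (not the code changed by this PR):
```python
import torch

n = 5                                               # requested real output size, must be > 0
x = torch.randn(n // 2 + 1, dtype=torch.complex64)  # Hermitian-compressed spectrum
y = torch.fft.irfft(x, n=n)                         # inverse real FFT
print(x.shape, y.shape)                             # torch.Size([3]) torch.Size([5])
```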
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73012
Reviewed By: ngimel
Differential Revision: D34387147
Pulled By: mruberry
fbshipit-source-id: 6b410088efe2a9e117a5c6d8beefda370363dbb0
(cherry picked from commit f8d771ed36)
Summary:
Avoids the following deprecation warning:
```python
loss.backward(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/torch/tensor.py:245: in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:147: in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py:89: in apply
return self._forward_cls.backward(self, *args) # type: ignore
/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/_functions.py:34: in backward
return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/_functions.py:45: in forward
return comm.reduce_add_coalesced(grads_, destination)
/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/comm.py:143: in reduce_add_coalesced
flat_result = reduce_add(flat_tensors, destination)
/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/comm.py:96: in reduce_add
nccl.reduce(inputs, output=result, root=root_index)
/usr/local/lib/python3.7/dist-packages/torch/cuda/nccl.py:69: in reduce
_check_sequence_type(inputs)
/usr/local/lib/python3.7/dist-packages/torch/cuda/nccl.py:48: in _check_sequence_type
if not isinstance(inputs, collections.Container) or isinstance(inputs, torch.Tensor):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
name = 'Container'
def __getattr__(name):
# For backwards compatibility, continue to make the collections ABCs
# through Python 3.6 available through the collections module.
# Note, no new collections ABCs were added in Python 3.7
if name in _collections_abc.__all__:
obj = getattr(_collections_abc, name)
import warnings
warnings.warn("Using or importing the ABCs from 'collections' instead "
"of from 'collections.abc' is deprecated since Python 3.3,"
"and in 3.9 it will stop working",
> DeprecationWarning, stacklevel=2)
E DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
/usr/lib/python3.7/collections/__init__.py:52: DeprecationWarning
```
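The fix follows the standard replacement of the deprecated `collections` ABC aliases with `collections.abc`; a minimal sketch of that kind of change (the exact code in `torch/cuda/nccl.py` may differ):
```python
from collections import abc

import torch

def _check_sequence_type(inputs):
    # collections.abc.Container replaces the deprecated collections.Container alias
    if not isinstance(inputs, abc.Container) or isinstance(inputs, torch.Tensor):
        raise TypeError("Inputs should be a collection of tensors")
```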
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72239
Reviewed By: ngimel
Differential Revision: D34387815
Pulled By: mruberry
fbshipit-source-id: 30c9b4fe518351bc9a6f211269e27ee3ab73a13c
(cherry picked from commit 1f68cdfac5)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72834
This removes the upper bound on librosa's pin and updates the scipy pin, since librosa 0.9 requires SciPy 1.2 or newer.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D34386898
Pulled By: mruberry
fbshipit-source-id: db654bd337b474cd5a2ff8dbb9a659ed272728cf
(cherry picked from commit 4790e8180c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72833
Closes #72550
The latest version of librosa breaks backward compatibility in two
ways:
- Everything except the input tensor is now keyword-only
- `pad_mode` now defaults to `'constant'` for zero-padding
https://librosa.org/doc/latest/generated/librosa.stft.html
This changes the test to match the old behavior even when using the new library, and updates the documentation to explicitly say that `torch.stft` doesn't exactly follow the librosa API. This was always true (`torch.stft` has new arguments, a different default window, and supports complex input), but it can't hurt to be explicit.
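For reference, a small example of driving the new librosa API while keeping the old behavior (assumes librosa >= 0.9 is installed; parameter values are arbitrary):
```python
import numpy as np
import librosa

y = np.random.randn(4096).astype(np.float32)

# Everything after the signal is keyword-only in librosa 0.9, and pad_mode now
# defaults to 'constant'; passing it explicitly restores the old reflect padding.
spec = librosa.stft(y, n_fft=512, hop_length=128, pad_mode='reflect')
print(spec.shape)  # (257, n_frames)
```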
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D34386897
Pulled By: mruberry
fbshipit-source-id: 6adc23f48fcb368dacf70602e9197726d6b7e0c1
(cherry picked from commit b5c5ed4196)
Summary:
A small bug: `lazy` was missing in `tensor.__deepcopy__`, which results in a segmentation fault when deepcopying a lazy model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73197
Reviewed By: jbschlosser
Differential Revision: D34394482
Pulled By: wconstab
fbshipit-source-id: c84fdb9b3a827677971fd3477a92679d7dbce3c0
(cherry picked from commit c003d150ce)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73251
Scipy's chisquare test requires that the observed frequencies sum to the same number as the expected frequencies. This modifies `_check_sampler_discrete` to ensure that the two match. See https://github.com/scipy/scipy/issues/12282 for details.
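For reference, a small standalone illustration of the SciPy requirement (not the actual test code):
```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 20, 45])  # counts drawn from the sampler
expected = np.array([25, 25, 25, 25])  # expected counts under the null

# scipy.stats.chisquare requires the observed and expected frequencies to sum to
# the same total (within tolerance), so rescale the expected counts to match.
expected = expected * observed.sum() / expected.sum()
chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)
```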
Test Plan: Unit tests pass on platform010
Reviewed By: r-barnes
Differential Revision: D34402314
fbshipit-source-id: 995b4ddf668cfb551176d3bd21fb8415dfe96cc1
(cherry picked from commit d81a133b0d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72522
Ref #72263 for cpp_custom_type_hack removal.
These overloads were deprecated in #35787, which shipped in the PyTorch 1.6
release, so the BC period has long since expired.
cc jamesr66a
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D34111271
Pulled By: albanD
fbshipit-source-id: 0078564188133625ca67137975fd5dd2fa2b4827
(cherry picked from commit 4f9c5a3ed7)
Summary:
These are formatting changes, applied automatically with `arc f`, to deal with issues landing the ONNX changes in this stack.
{F703786210}
Test Plan: yeah_sandcastle
Reviewed By: malfet
Differential Revision: D34402111
fbshipit-source-id: 06eb352d1e4f8b1439a580148fe1060fb5c9e102
(cherry picked from commit 7bbf29ed8e)
Summary:
Added a check that the dispatch keys present in native_functions.yaml are part of the fixed set of supported dispatch keys; if not, signal an error. I also removed two dispatch keys from the function schema for `copy_`, because they are not supported (SparseHIP, SparseXPU).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67961
Test Plan:
This function schema (for example) in native_functions.yaml
```
- func: native_norm(Tensor self, Scalar p=2) -> Tensor
dispatch:
SparseCPU, SparseCUDA, SparseHIP: norm_sparse
```
now generates this error during codegen: `AssertionError: SparseHIP is not a supported dispatch key.`
Fixes https://github.com/pytorch/pytorch/issues/66190
Reviewed By: albanD
Differential Revision: D34327853
Pulled By: ezyang
fbshipit-source-id: 6959d14a7752aefd025baa482d56547b4ed69b4c
(cherry picked from commit 26bea380af)
Summary:
There are various possible approaches, but the approach chosen minimizes disruption to source control blame.
Addresses:
```
error: Function _ZN23FunctionalTest_Pad_Test8TestBodyEv is too big to optimize [-Werror,-Wignored-optimization-argument]
```
Test Plan: buck2 build mode/opt caffe2/test/cpp/api:functional
Reviewed By: jamesr66a
Differential Revision: D34027291
fbshipit-source-id: 9dfd771ad56d3d4bc0d41b38b04654c8dae7c006
(cherry picked from commit d43b5a7ed6)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/71280
We used to use `from pkg_resources import packaging`. To recap, this has
three potential problems:
1) `pkg_resources` is a really slow import
2) We have an undeclared runtime dependency on `setuptools`
3) We're relying on `pkg_resources`'s secret vendored copy of
`packaging`. This is obviously not part of the public API of
`pkg_resources`.
In https://github.com/pytorch/pytorch/issues/71345 this was made a lazy import, which is great! It means we don't
run into these problems as long as users don't use `torch.__version__`.
This change further addresses problems 1 and 3 by directly importing `packaging` if it is present, and only falling back to the vendored copy in `pkg_resources`.
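A minimal sketch of the import strategy described above (the exact code in torch may differ):
```python
try:
    from packaging.version import Version  # fast path: standalone 'packaging' package
except ImportError:
    # slow fallback: the copy vendored inside pkg_resources
    from pkg_resources import packaging  # type: ignore[attr-defined]
    Version = packaging.version.Version
```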
Benchmark for speed difference in a virtual environment with a couple
hundred packages installed:
```
λ hyperfine -w 2 'python -c "from pkg_resources import packaging"' 'python -c "import packaging.version"'
Benchmark 1: python -c "from pkg_resources import packaging"
Time (mean ± σ): 706.7 ms ± 77.1 ms [User: 266.5 ms, System: 156.8 ms]
Range (min … max): 627.9 ms … 853.2 ms 10 runs
Benchmark 2: python -c "import packaging.version"
Time (mean ± σ): 53.8 ms ± 8.5 ms [User: 34.8 ms, System: 14.4 ms]
Range (min … max): 46.3 ms … 72.3 ms 53 runs
'python -c "import packaging.version"' ran
13.14 ± 2.52 times faster than 'python -c "from pkg_resources import packaging"'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71902
Reviewed By: mikaylagawarecki
Differential Revision: D34343145
Pulled By: malfet
fbshipit-source-id: a6bd7ecf0cbb6b5c20ab18a22576aa2df9eb3324
(cherry picked from commit 0a249044c8)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70140
[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights
Introduces the ExpandedWeights mechanism and user-facing API without any custom-implemented faster rules.
- User facing API is in `_stateless.py` (with documentation)
- Testing is in test_expanded_weights
- The rest is the implementation of the erroring fallback + the mechanism for being able to register faster per sample grad rules. Only linear is implemented here, but they are all implemented in #70141
Test Plan: Imported from OSS
Reviewed By: mikaylagawarecki
Differential Revision: D34350950
Pulled By: samdow
fbshipit-source-id: 69c664b0bc3dff6951358d79d7e5d94882f7aef2
(cherry picked from commit ae1620d3b6)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72871
We do this same trick in the native MHA implementation; backport it for purposes of fair comparison.
ghstack-source-id: 149526858
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D34176090
fbshipit-source-id: 8b578c29c4dcf0d85bae74dfbbb82db9a8f32dc7
(cherry picked from commit fd50170935)
Creates the superuser group for GHF to allow for any changes reviewed by
these individuals to be automatically merged using our GHF tooling
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73221
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72934
Before this PR, DBR quantization had a limitation on handling user
code which iterates over all module children. For example, imagine
a forward function such as
```
def forward(self, x):
    for module in self:
        x = module(x)
    return x
```
Before this PR, this code would break with DBR quantization, because
we attach `AutoQuantizationState` objects to each child, and those
objects live in the child's module hierarchy and will appear in
these kinds of iterations, changing the meaning of the user program.
This PR reduces the scope of this problem to just the top level module.
Instead of attaching `AutoQuantizationState` objects to each child,
we register them in a map on the parent. Here is a before and after:
```
// toy model
model
|--> child1
// toy model with AutoQuantizationState objects, before this PR
model
|--> child1
| |--> _auto_quant_state
|--> _auto_quant_state
// toy model with AutoQuantizationState objects, after this PR
model
|--> child1
|--> _fqn_to_auto_quant_state_map
|--> ( ) --> _auto_quant_state // of `model`
|--> (child1) --> _auto_quant_state // of `model.child1`
```
Note: `child1._auto_quant_state` works as before for convenience,
but the `child1` object now stores a soft link to its `_auto_quant_state`
instead of properly registering it in its module hierarchy. This is
somewhat hacky. If we need to improve this in the future, we could
remove this soft link and refactor the code to call the FQN map
instead.
Note: if the top level module iterates over its children, things will
still be broken. This is less likely, and we will recommend that the
user work around this by wrapping their model, or checking for the
`AutoQuantizationStateModuleDict` type in their iteration loop.
The impact of this change should be an improvement of coverage
of user models. In fact, we expect this to drive our coverage of
torchbenchmark models from 89% to 100%.
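A hypothetical sketch of the parent-level registration described above (the real `AutoQuantizationState` and `AutoQuantizationStateModuleDict` classes differ in detail):
```python
import torch.nn as nn

class AutoQuantizationState(nn.Module):  # stand-in for the real DBR state object
    def forward(self, x):
        return x

model = nn.Sequential()
model.add_module("child1", nn.Identity())

# All per-module state lives in a single ModuleDict on the top-level module, keyed
# by FQN, so iterating over a child's children no longer sees the extra objects.
model._fqn_to_auto_quant_state_map = nn.ModuleDict({
    "root": AutoQuantizationState(),    # state for `model` itself
    "child1": AutoQuantizationState(),  # state for `model.child1`
})

# the child keeps a plain (unregistered) attribute pointing at its state
object.__setattr__(model.child1, "_auto_quant_state",
                   model._fqn_to_auto_quant_state_map["child1"])
```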
Test Plan:
```
// previously disabled test cases with user code iterating
// over module children are now enabled, with wrappers
python test/test_quantization.py -k test_module_calls_items
python test/test_quantization.py -k test_vovnet_sequential
```
Reviewed By: dzdang
Differential Revision: D34281074
Pulled By: vkuzo
fbshipit-source-id: 0e25fc1ec529c47f72478a1875fe43219feac6b1
(cherry picked from commit 4008f89967)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.
**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).
To address this, I
- added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected SPSD use case where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that the multiple parameter group method for construction works even on a single rank.
**Test Plan**
- I checked both tests for CPU, 1 GPU, 2 GPUs, 4 GPUs, and 8 GPUs.
- I added the `ciflow/win` label to run the failing Windows CI test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932
Reviewed By: rohan-varma
Differential Revision: D34281482
Pulled By: awgu
fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit 6bea9bcc63)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 51344755fe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73061
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: jspark1105, jiecaoyu
Differential Revision: D34331487
fbshipit-source-id: 39cc6d4c0c7a0c8ee26cb385966123990f9e6eda
(cherry picked from commit 53919f8173)
Summary:
Potentially fixes https://github.com/pytorch/pytorch/issues/71385; a similar docstring change could also fix https://github.com/pytorch/pytorch/issues/71384.
Updated the doc for `torch.linalg.inv` to include nuance around the equivalence to `torch.linalg.solve`.
The update is below:
```
.. note::
    Consider using :func:`torch.linalg.solve` if possible for multiplying a matrix on the left by
    the inverse, as::

        linalg.solve(A, B) == linalg.inv(A) @ B  # When B is a matrix

    It is always preferred to use :func:`~solve` when possible, as it is faster and more
    numerically stable than computing the inverse explicitly.
```
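A quick runnable check of that equivalence (illustrative only):
```python
import torch

A = torch.randn(3, 3, dtype=torch.float64)
B = torch.randn(3, 2, dtype=torch.float64)

# Solving directly avoids forming the inverse explicitly; it is faster and more
# numerically stable, and the results agree.
x_solve = torch.linalg.solve(A, B)
x_inv = torch.linalg.inv(A) @ B
print(torch.allclose(x_solve, x_inv))
```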
IvanYashchuk, please advise whether this is the right direction or an over-extrapolation. I can apply the same changes to the `tensorinv` doc to fix https://github.com/pytorch/pytorch/issues/71384. Also, in https://github.com/pytorch/pytorch/issues/71384 there was a mention of updating the `torch.matmul` error message to indicate the proper tensor shapes; I could also potentially do that in this PR if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71769
Reviewed By: H-Huang
Differential Revision: D34242541
Pulled By: mruberry
fbshipit-source-id: 40e98dad4d821928d1dea72d4512ee579b690a32
(cherry picked from commit a0321a5de9)
Added tests for the lite interpreter. By default run_tests.sh will use the lite interpreter, unless BUILD_LITE_INTERPRETER=0 is set manually.
Also fixed the model generation script for the Android instrumentation test and the README.
Verified the tests pass for both full JIT and the lite interpreter. Also tested on an emulator and a real device using different ABIs.
Lite interpreter
```
./scripts/build_pytorch_android.sh x86
./android/run_tests.sh
```
Full JIT
```
BUILD_LITE_INTERPRETER=0 ./scripts/build_pytorch_android.sh x86
BUILD_LITE_INTERPRETER=0 ./android/run_tests.sh
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72736
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71538
`reportMemoryUsage` is kind of awful. It does a bunch of string writes and such that makes it VERY expensive. Just moving that work off the hot path reduces the overhead for `profile_memory` from ~6.5 us to ~1.2 us. (85% reduction in the kineto contribution to profiling overhead.)
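For reference, a minimal example of the kind of memory-profiling run this affects (illustrative only, not the ubenchmark from the Test Plan):
```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    for _ in range(100):
        torch.empty(128)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```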
Test Plan: Ran ubenchmark with `--op empty --stressTestKineto --kinetoProfileMemory`
Reviewed By: swolchok
Differential Revision: D32730167
fbshipit-source-id: fe18e8fa3881967cad8fa1c26c71c805e9b034e5
(cherry picked from commit 0d394cb252)