Summary:
I've added parsing of an optional first line under the `precomputed` keyword in native_functions.yaml for arguments that are precomputed without replacing an existing argument. This line is optional, must come first, and does not contain an arrow.
These new fields are computed in the meta function as before and added to the precompute struct it returns. For now I've put them as the last args of the impl function, where they can be reused.
Example:
native_functions.yaml:
```
...
precomputed:
- int numBatch, int numPlanes, int inputT, int inputH, int inputW  # <- new
- kernel_size -> int poolSizeT, int poolSizeH, int poolSizeW
- output_size -> int outputT, int outputH, int outputW
```
meta:
```
TORCH_PRECOMPUTE_META_FUNC(fractional_max_pool3d)(
const at::Tensor& input_,
IntArrayRef pool_size,
IntArrayRef output_size,
const at::Tensor& randomSamples
) {
...
return TORCH_PRECOMPUTE_STRUCT(fractional_max_pool3d)().set_numBatch(numBatch).set_numPlanes(numPlanes).set_inputT(inputT).set_inputH(inputH).set_inputW(inputW)
.set_poolSizeT(poolSizeT) ...
}
```
impl:
```
TORCH_IMPL_FUNC(fractional_max_pool3d_out_cpu)(
const at::Tensor& input_,
int64_t poolSizeT,
int64_t poolSizeH,
int64_t poolSizeW,
int64_t outputT,
int64_t outputH,
int64_t outputW,
const at::Tensor& randomSamples,
const at::Tensor& output,
const at::Tensor& indices,
int64_t numBatch, // <- for now I've put them here
int64_t numPlanes,
int64_t inputT,
int64_t inputH,
int64_t inputW) {
```
Fixes https://github.com/pytorch/pytorch/issues/71314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71368
Reviewed By: zou3519
Differential Revision: D33683984
Pulled By: bdhirsh
fbshipit-source-id: 33066dd92b8743aadf0dc8102f6bf0689f843242
(cherry picked from commit 64e46af6a4)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71933
Add the functionality provided by split.py to splitter_base.
- Propagate submodule inputs
- Create SplitResult to hold the split results.
Then removed split.py; to me this makes navigating the lowering code a bit easier.
Added default split and trace functions for use.
Next step is to add better error handling for each stage during lowering and create unit tests for each stage. I'll probably make some bootcamp tasks for unit tests.
Test Plan: CI
Reviewed By: frank-wei, wushirong
Differential Revision: D33794322
fbshipit-source-id: f991893047a3701177f54cf22d9a6e48e0529472
(cherry picked from commit 1f3e13efba)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71102
This graph pass is causing a major perf regression on some models. Ideally we would introduce maybe_copy variants for all these ops. But since those are tricky to write, I've introduced a flag to just turn the pass off for now.
ghstack-source-id: 148541673
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: navahgar
Differential Revision: D33510080
fbshipit-source-id: bb4847f26561197ea5e6bbad0a4d25db4ef468eb
(cherry picked from commit 8f333d3e81)
Summary:
Original commit changeset: 4ce347cb0f30
Original Phabricator Diff: D34043182 (8315c9b885)
Test Plan: It's a backout of a backout
Reviewed By: pbelevich, jaceyca
Differential Revision: D34060843
fbshipit-source-id: 6aaf62ce74330cbf142ab483b2a31eccba775ca9
(cherry picked from commit 046b1dbb72)
Summary:
Let's make `torch.sparse.sampled_addmm` searchable in the PyTorch documentation.
This PR shall be cherry-picked for the next 1.11 release.
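For reference, a minimal usage sketch of the op being documented (illustrative only; early versions may restrict the supported devices and layouts):
```
import torch

# sampled_addmm computes beta*input + alpha*(mat1 @ mat2), evaluated only at
# the sparsity pattern of the sparse CSR tensor `inp`
inp = torch.eye(3).to_sparse_csr()
mat1 = torch.randn(3, 5)
mat2 = torch.randn(5, 3)
out = torch.sparse.sampled_addmm(inp, mat1, mat2)
print(out.to_dense())
```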
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72312
Reviewed By: davidberard98
Differential Revision: D34045230
Pulled By: cpuhrsch
fbshipit-source-id: c1b1dc907443284857f48c8ce1efab22c6701bbe
(cherry picked from commit 225929ecf2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72391
The temporary array can be reused within the loop; this saves memory reallocations and uninitialized_copy calls for the vector.
Test Plan: CI
Reviewed By: jspark1105
Differential Revision: D34030993
fbshipit-source-id: 40708e3144c6c8f8ac3a6a45d668b34b5e52e095
(cherry picked from commit 859e126aef)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71794
mvlgamma(inp, p) requires that all the elements of inp are > (p-1)/2.
The opinfo test was occasionally producing inputs with elements == (p-1)/2, which would generate errors like:
```
ERROR: test_nnc_correctness_mvlgamma_mvlgamma_p_5_cpu_bfloat16 (__main__.TestNNCOpInfoCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/path/pytorch/torch/testing/_internal/common_device_type.py", line 381, in instantiated_test
raise rte
File "/path/pytorch/torch/testing/_internal/common_device_type.py", line 376, in instantiated_test
result = test(self, **param_kwargs)
File "/path/pytorch/torch/testing/_internal/common_device_type.py", line 753, in test_wrapper
return test(*args, **kwargs)
File "/path/pytorch/torch/testing/_internal/common_device_type.py", line 907, in only_fn
return fn(slf, *args, **kwargs)
File "/path/pytorch/test/test_jit_fuser_te.py", line 2293, in test_nnc_correctness
ref = variant(*clone_inputs((sample.input, *sample.args)), **sample.kwargs)
RuntimeError: All elements must be greater than (p-1)/2
```
repro example: https://gist.github.com/davidberard98/9da688e31cdfbaed7e990746b28a4ba2
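For context, a minimal sketch of the domain constraint (values here are made up):
```
import torch

p = 5
# mvlgamma requires every element to be strictly greater than (p - 1) / 2
x = torch.full((3,), (p - 1) / 2)  # boundary values: invalid input
# torch.mvlgamma(x, p)             # raises: All elements must be greater than (p-1)/2
torch.mvlgamma(x + 0.5, p)         # strictly inside the domain: fine
```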
Test Plan: Imported from OSS
Reviewed By: qihqi
Differential Revision: D33780905
Pulled By: davidberard98
fbshipit-source-id: c9afd443bc90ce68f33b97498921b447e4f7d1d8
(cherry picked from commit a974b03f07)
Summary:
We noticed that on M1 Macs, Transformer network profiles are dominated by scalar `exp` and `erff` functions (for softmax and GELU).
The NEON `Vectorized<float>` implementation does not use SLEEF functions in order to compile on mobile platforms. However, SLEEF is already compiled on macOS ARM64 and is safe to use there. This change adds another implementation of `Vectorized<float>` that uses SLEEF functions. This implementation is only used on macOS ARM64.
This change speeds up e.g. prediction of spaCy transformer models by 20% on M1 Macs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70354
Reviewed By: albanD
Differential Revision: D33659540
Pulled By: kimishpatel
fbshipit-source-id: b8f02a61321873fc60778190a005c466c7d0cc0c
(cherry picked from commit 71286a207c)
Summary:
This PR was opened as copy of https://github.com/pytorch/pytorch/pull/68812 by request https://github.com/pytorch/pytorch/pull/68812#issuecomment-1030215862.
-----
Fixes https://github.com/pytorch/pytorch/issues/67693.
Reference LAPACK (used in OpenBLAS) changed the `info` error code for svd when inputs contain non-finite numbers. In PyTorch, we raise an internal assert error for negative `info` error codes because usually that would indicate a bug in the implementation. However, this is no longer the case with SVD in newer versions of LAPACK. MKL (tried 2021.4.0) still gives a positive error code for this kind of input. This change aligns our code with both the OpenBLAS and MKL behavior.
MKL 2022 uses the latest reference LAPACK behavior and returns the same `info` as OpenBLAS 0.3.15+.
This PR also fixes https://github.com/pytorch/pytorch/issues/71645 that is due to the updated MKL version in CI.
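A rough repro sketch of the failure mode this fixes (exact error text varies by LAPACK backend):
```
import torch

a = torch.randn(3, 3)
a[0, 0] = float("nan")
try:
    torch.linalg.svd(a)
except Exception as e:
    # with this change, a non-finite input surfaces as a regular error
    # instead of tripping an internal assert on a negative `info` code
    print(e)
```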
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72357
Reviewed By: albanD
Differential Revision: D34012245
Pulled By: ngimel
fbshipit-source-id: 2b66c173cc3458d8c766b542d0d569191cdce310
(cherry picked from commit fa29e65611)
Summary:
`include_directories` is old-style CMake, which adds the include path to every file being compiled. This instead makes `python`, `numpy`, and `pybind11` into targets that only `torch_python` and `caffe2_pybind_state` are linked to. So, Python libraries can't be accidentally included elsewhere.
Resubmit of https://github.com/pytorch/pytorch/issues/65654; closes https://github.com/pytorch/pytorch/issues/65828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69085
Reviewed By: anjali411
Differential Revision: D33776456
Pulled By: malfet
fbshipit-source-id: 018b0f6cd5a4f8c9e36df961deff832bc4afd479
(cherry picked from commit 57063107d6)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72425
Not sure how it worked before
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D34050484
Pulled By: malfet
fbshipit-source-id: 91e2660d4f4e3b8c04bddd07bac434fcba630c0f
(cherry picked from commit b652b25d39)
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in https://github.com/pytorch/pytorch/issues/65813, I added the QR factorization to powerSGD_hook.py
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half precision. Moreover, in my tests, it is faster when the rank is less than 3.
This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods (timing figure not included here).
### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. From my tests, its performance is similar to _torch.linalg.qr_.
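A minimal sketch of the two orthogonalization routes (shapes are made up; half-precision inputs would still need the Gram-Schmidt path):
```
import torch

m = torch.randn(1024, 4)           # tall low-rank matrix, as in PowerSGD
q1, _ = torch.linalg.qr(m)         # torch.linalg.qr path (no fp16 support)
q2 = torch.orgqr(*torch.geqrf(m))  # geqrf/orgqr alternative mentioned above
```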
### Additional context
_No response_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72043
Reviewed By: albanD
Differential Revision: D34042781
Pulled By: cbalioglu
fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
(cherry picked from commit f64bf3839a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70465
These tests check to ensure that
(a) the result after NNC fusion (of a single op) is the same as the unfused op, and
(b) for certain ops where fusion is expected to occur, that fusion does actually occur.
Test Plan: Imported from OSS
Reviewed By: wenleix
Differential Revision: D33595240
Pulled By: davidberard98
fbshipit-source-id: e2e17a921bc30c313e92e8e5bbc6c1b5fcd14bc1
(cherry picked from commit b1ba221acc)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69241
Implement FlatParameter to track the information of a flat parameter, including the sharding information.
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32432503
fbshipit-source-id: b4aabba6cef29e825b45869895709c79e69c211d
(cherry picked from commit 0e5505f70b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71656
Customized `__getstate__`/`__setstate__` didn't call super (torch.nn.Module) and won't restore attributes (e.g. `_modules`) after being serialized and deserialized via torch.package.
After a few iterations, as it turns out, pack/unpack of linear params is already supported in the torchbind class, so there is no need to hack the torch module anymore.
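A minimal, hypothetical sketch of the pitfall (plain deepcopy stands in for the torch.package round trip):
```
import copy
import torch

class Bad(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)

    # Overriding __getstate__/__setstate__ without preserving the full
    # __dict__ drops nn.Module bookkeeping such as `_modules`.
    def __getstate__(self):
        return {"extra": 1}

    def __setstate__(self, state):
        self.__dict__.update(state)

m = copy.deepcopy(Bad())     # serialization round trip via get/setstate
print(hasattr(m, "linear"))  # False: the submodule was lost
```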
Test Plan: `buck test caffe2/test/:quantization -- test_linear_api`
Reviewed By: jerryzh168
Differential Revision: D33711086
fbshipit-source-id: 3a36d10c64b7da414d3657d2ef766bb9a9290ea9
(cherry picked from commit 6337b6c207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72410
Fixes
```
caffe2/caffe2/operators/cross_entropy_op.cu(330): warning: parameter "outer_size" was declared but never referenced
caffe2/caffe2/operators/cross_entropy_op.cu(191): warning: parameter "outer_size" was declared but never referenced
caffe2/caffe2/operators/generate_proposals_op_util_nms.h(347): warning: variable "order" was declared but never referenced
caffe2/caffe2/operators/segment_reduction_op_gpu.cu(319): warning: parameter "N" was declared but never referenced
detected during:
instantiation of "__nv_bool caffe2::CUDASparseLengthsWeightedSumOp<T, Context, SparseFused>::DoRunWithType<IndexType>() [with T=float, Context=caffe2::CUDAContext, SparseFused=true, IndexType=int32_t]"
caffe2/caffe2/core/operator.h(1304): here
instantiation of "__nv_bool caffe2::DispatchHelper<caffe2::TensorTypes<FirstType, Types...>, ExtraArgs...>::call(Op *, caffe2::TypeMeta) [with FirstType=int32_t, Types=<int64_t>, ExtraArgs=<>, Op=caffe2::CUDASparseLengthsWeightedSumOp<float, caffe2::CUDAContext, true>]"
caffe2/caffe2/core/operator.h(1304): here
instantiation of "__nv_bool caffe2::DispatchHelper<caffe2::TensorTypes<FirstType, Types...>, ExtraArgs...>::call(Op *, const caffe2::Tensor &) [with FirstType=int32_t, Types=<int64_t>, ExtraArgs=<>, Op=caffe2::CUDASparseLengthsWeightedSumOp<float, caffe2::CUDAContext, true>]"
(786): here
caffe2/caffe2/operators/segment_reduction_op_gpu.cu(96): warning: parameter "len_length" was declared but never referenced
detected during:
instantiation of "__nv_bool caffe2::CUDASparseLengthsSumGradientWithIndicesOp<T, Context>::RunOnDevice() [with T=float, Context=caffe2::CUDAContext]"
(1296): here
caffe2/caffe2/sgd/adagrad_fused_op_gpu.cu(1226): warning: variable "N" was declared but never referenced
detected during:
instantiation of "__nv_bool caffe2::DispatchHelper<caffe2::TensorTypes2<FirstType, Types...>, ExtraArgs...>::call(Op *, caffe2::TypeMeta) [with FirstType=float, Types=<c10::Half>, ExtraArgs=<int32_t>, Op=caffe2::CUDARowWiseSparseAdagradFusedWithSparseLengthsSumGradientExactOp<float, int, false, caffe2::CUDAContext>]"
caffe2/caffe2/sgd/adagrad_fused_op_gpu.cu(259): warning: parameter "indices" was declared but never referenced
detected during:
instantiation of "__nv_bool caffe2::CUDARowWiseSparseAdagradFusedWithSparseLengthsSumGradientExactOp<T, TLengths, is_mean, Context>::DoRunWithType2<IndexType,TParam>() [with T=float, TLengths=int, is_mean=false, Context=caffe2::CUDAContext, IndexType=int32_t, TParam=float]"
caffe2/caffe2/core/operator.h(1308): here
caffe2/caffe2/operators/piecewise_linear_transform_op.cu(15): warning: parameter "num_grp" was declared but never referenced
caffe2/caffe2/operators/piecewise_linear_transform_op.cu(50): warning: parameter "M" was declared but never referenced
caffe2/caffe2/operators/piecewise_linear_transform_op.cu(51): warning: parameter "num_grp" was declared but never referenced
caffe2/caffe2/operators/piecewise_linear_transform_op.cu(78): warning: parameter "num_grp" was declared but never referenced
```
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D34034404
fbshipit-source-id: b834088d6a3e204e94bbffe3ac6fdccf9d0176b8
(cherry picked from commit 0148d0de04)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72411
Fixes
```
caffe2/caffe2/operators/max_pool_with_index.cu(16): warning: type qualifier specified more than once
caffe2/caffe2/operators/max_pool_with_index.cu(28): warning: type qualifier specified more than once
caffe2/caffe2/operators/max_pool_with_index.cu(61): warning: type qualifier specified more than once
caffe2/caffe2/operators/max_pool_with_index.cu(62): warning: type qualifier specified more than once
caffe2/caffe2/operators/max_pool_with_index.cu(74): warning: type qualifier specified more than once
```
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D34034382
fbshipit-source-id: 2b73c55358632090baf673b32b800656ae874040
(cherry picked from commit ab3f3f9a79)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72348
**Overview**
#43307 changed `_test_accumulate_gradients_no_sync()` to add a `num_iters` argument. However, I think the change misconstrued the test logic slightly.
61ab04e1db/torch/testing/_internal/distributed/distributed_test.py (L4369-L4397)
- `iteration % num_iters == 0` evaluates to `True` only for `iteration == 0`, since `iteration` comes from `for iteration in range(num_iters)`.
- IIUC, the intention is to alternate between accumulating gradients (using `no_sync()`) and synchronizing gradients normally. In the existing implementation, any iterations following the second one are non-productive since gradients are in sync, meaning it reduces to testing normal DDP.
- This PR changes the check back to `iteration % 2 == 0` to restore the alternating behavior (sketched below).
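A sketch of the restored alternating pattern (`model` is assumed to be a DistributedDataParallel instance; `batches` and `optimizer` are hypothetical):
```
for iteration, batch in enumerate(batches):
    if iteration % 2 == 0:
        with model.no_sync():              # accumulate gradients locally
            model(batch).sum().backward()
    else:
        model(batch).sum().backward()      # gradients all-reduced here
        optimizer.step()
        optimizer.zero_grad()
```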
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34011559
Pulled By: awgu
fbshipit-source-id: 4ba771e45b28a343167a324462571e4b8e25ae72
(cherry picked from commit 8492a8b803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72304
This is a no-op change that simply moves files around in preparation for moving linear algebra into its own dynamically loadable module.
This also simplifies the torch_cuda_cu build rules, as all the linalg files it needs are now in their own folder.
Bazel CUDA rules are in some disarray (a wildcard needed to be added there, as they ignore files mentioned in build_variables.bzl), and a similar wildcard needs to be added to the internal build system.
Test Plan: Imported from OSS
Reviewed By: dagitses, ngimel
Differential Revision: D33992796
Pulled By: malfet
fbshipit-source-id: 3f4fa1c224016d03e1a982a7ae5ac7807bc772e2
(cherry picked from commit 6a5a1b0c3f)
Summary:
This reverts the previous PR and adds some comments to make it clear what the intent is.
It also removes some extra static_asserts that are not needed (at least for the compilers I tried).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72336
Reviewed By: r-barnes
Differential Revision: D34006722
Pulled By: albanD
fbshipit-source-id: 290fb89a2d2c66a0d1c3651198b31d21216ec230
(cherry picked from commit 76f0aaa765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72284
This update adds the prod op to the fx2trt tool, which is used to create a TensorRT engine for a PyTorch model.
Test Plan:
A new unit test was added to check that the op was added to the acc tracer. This test can be run using the following command: buck test --debug //caffe2/test:test_fx_acc_tracer -- --exact 'caffe2/test:test_fx_acc_tracer - test_prod (fx_acc.test_acc_tracer.AccTracerTest)'
A new suite of unit tests were also added for the conversion to tensorRT and can be tested using the following command: buck test mode/dev-nosan //caffe2/test/fx2trt/converters:test_prod
Please note that, unlike other PyTorch reduce ops such as sum, the PyTorch prod function does not support reducing more than one dimension at a time (the dim arg cannot be a tuple; only a single int is accepted for prod). Therefore prod cannot utilize all of the reduce_op code.
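A quick illustration of that limitation:
```
import torch

x = torch.randn(2, 3, 4)
torch.sum(x, dim=(0, 1))     # sum accepts a tuple of dims
torch.prod(x, dim=0)         # prod accepts only a single int dim
# torch.prod(x, dim=(0, 1))  # error: prod cannot reduce multiple dims at once
```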
https://pxl.cl/1Xpn8
https://pxl.cl/1Xpn9
Reviewed By: 842974287
Differential Revision: D33875336
fbshipit-source-id: f9340db3685d681b1cf4ffc3b9fd25d16914e231
(cherry picked from commit cfe48d3737)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72344
ATen core is mostly compliant already, so we can just add the flag to the build system. The only exception is interned strings, which include symbols like `aten::add` generated for each operator.
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D34010820
Pulled By: albanD
fbshipit-source-id: ef1a625d96f30457b5e6beffc5e630516e54f9b4
(cherry picked from commit b90c262a92)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71781
The previous PR added information about fusions found in the subgraphs.
This PR uses that information for:
1. inserting observers at the end of fusions and not in the middle
2. during inference, replacing the original op with the fused op. The
way this is implemented is that the base op is replaced with the fused op,
and all other ops are replaced with identity functions.
Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_fusion_functions
```
Reviewed By: jerryzh168
Differential Revision: D33775097
Pulled By: vkuzo
fbshipit-source-id: 12249b85b2f7ba7545a54872aeb5f1ff2fc928cf
(cherry picked from commit 0db4324ea9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71780
Adds support for matching operator.add -> torch.relu in FX graph
mode quantization.
It would be nice to support torch.relu better in general, but I'm saving that for a future PR to keep PRs small.
This is useful for DBR quant because we have some test cases in DBR
quant which use add-relu, and we'd like to match them to FX.
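For reference, a minimal sketch of the pattern being matched (function name is made up):
```
import operator
import torch

def add_relu(x, y):
    # the operator.add -> torch.relu pattern, i.e. torch.relu(x + y)
    return torch.relu(operator.add(x, y))
```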
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_add_relu
python test/test_quantization.py TestQuantizeFxOps.test_mul_relu
```
Reviewed By: jerryzh168
Differential Revision: D33775096
Pulled By: vkuzo
fbshipit-source-id: 889d9b41d3758ecbbb6d7eab67f64ce3d4892d24
(cherry picked from commit c1f9f38ca1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71764
For DBR quant, adds the code for matching seen ops to function fusion
patterns. After we have the full DAG, we have a separate pass over the
dag and add matched fusion patterns to the seen op data structure.
This is the first PR in the stack which implements matching and
recording the match results. Future PRs in this stack will use
the match results to modify observer insertion and inference.
Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_fusion_functions
```
Reviewed By: jerryzh168
Differential Revision: D33775098
Pulled By: vkuzo
fbshipit-source-id: 488aac902bf568d41c863ee49248990411ed9c53
(cherry picked from commit 4ad1ca1abc)