* Don't include view ops in autodiff graphs
* Skip view ops in autodiff testing
* Two more tests
* Appease clang-format
* Pacify clang-format
Co-authored-by: eellison <eellison@fb.com>
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Previously when analyzing a TupleConstruct, we ignored the aliasing
information of the inputs and simply marked all elements of the returned
tuple as wildcards. But since we can fully reason about the contents of
a tuple statically, we should be able to assign them aliasing
information.
This analysis was not only incomplete but produced incorrect results:
if `a` is not a wildcard, then `a noalias wildcard`. So if we looked at
`tuple(a)` and reported its aliasing info as `tuple(wildcard)`, we would
conclude `tuple[0] noalias a`, which is... wrong.
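For illustration, a minimal TorchScript sketch (not taken from the PR's test suite) of the kind of program where concluding `tuple[0] noalias a` would be unsound:
```
import torch

@torch.jit.script
def f(a: torch.Tensor) -> torch.Tensor:
    t = (a, a + 1)  # prim::TupleConstruct; t[0] aliases a
    b = t[0]
    b.add_(1)       # mutation through the alias must stay visible via a
    return a

x = torch.zeros(2)
print(f(x))  # tensor([1., 1.]); an optimizer assuming b noalias a could break this
```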
This PR:
- Renames `torch.set_deterministic` to `torch._set_deterministic`
- Renames `torch.is_deterministic` to `torch._is_deterministic`
- Modifies the docstrings for both to indicate that the feature is not
yet complete.
We would like to do this because this feature is experimental and the
docstrings before this PR are misleading.
This PR does not have an accompanying change in master. That is because
there is still discussion about what the eventual state of the feature
should be: https://github.com/pytorch/pytorch/issues/15359. I expect
that there will be a better plan for this once 1.7 rolls around.
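For reference, a minimal sketch against the renamed (release-branch) entry points; this experimental API may move again in later releases:
```
import torch

torch._set_deterministic(True)    # was torch.set_deterministic(True)
print(torch._is_deterministic())  # was torch.is_deterministic()
```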
Test Plan:
- wait for CI
* Add optimizer_for_mobile doc into the Python API root doc
* Apply suggestions from code review
Remove all references to `optimization_blacklist` as it's missing in 1.6
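A hedged usage sketch of the documented helper, assuming the `torch.utils.mobile_optimizer.optimize_for_mobile` entry point and (per the note above) no `optimization_blacklist` argument:
```
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
scripted = torch.jit.script(model)
mobile_ready = optimize_for_mobile(scripted)  # run the mobile optimization passes
mobile_ready.save("model_mobile.pt")
```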
Co-authored-by: Nikita Shulga <nshulga@fb.com>
Summary:
In short, we messed up. The SHM and CMA backends of TensorPipe are Linux-specific and thus they are guarded by a #ifdef in the agent's code. Due to a mishap with CMake (caused by the fact that TensorPipe has two CMake files, one for PyTorch and a "standalone" one), we were not correctly propagating some flags, and these #ifdefs were always false. This means that these two backends have always been disabled and have thus never been covered by our OSS CI. It would be irresponsible to enable them now in v1.6, so instead we remove any mention of them from the docs.
Note that this is perhaps not as bad as it sounds. These two backends were providing higher performance (latency) when the two endpoints were on the same machine. However, I suspect that most RPC users will only do transfers across machines, for which SHM and CMA wouldn't have played any role.
Original PR against master: #41200 (merged as dde3d5f4a8)
Test Plan: Docs only
Summary:
Add `torch._C._cuda_getArchFlags()` that returns the list of architectures `torch_cuda` was compiled for
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that return the architecture list and gencode flags PyTorch was compiled with
Print a warning if any of the GPUs is not compatible with any of the CUBINs
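For example, a quick check of the new introspection API (sketch; output depends on the build):
```
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_arch_list())      # e.g. ['sm_37', 'sm_60', 'sm_70', ...]
    print(torch.cuda.get_gencode_flags())  # the -gencode flags used at build time
```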
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173
Differential Revision: D22459998
Pulled By: malfet
fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41047.
Some CPU kernel implementations don't call `cast_outputs()`, so when CPU temporaries were created to hold their outputs, they weren't copied back to the out parameters correctly. Instead of fixing that issue, for simplicity this PR disables the behavior. The corresponding test in test_type_promotion.py is expanded with more operations to verify that unary ops can no longer have out arguments with different dtypes than their inputs (except in special cases like torch.abs, which maps complex inputs to float outputs, and torch.deg2rad, which is secretly torch.mul).
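A rough sketch of the behavior being enforced (the op and dtypes are illustrative, not taken from the test):
```
import torch

x = torch.randn(3)                          # float32 input
out = torch.empty(3, dtype=torch.float64)   # mismatched out dtype
try:
    torch.exp(x, out=out)                   # expected to be rejected after this change
except RuntimeError as err:
    print("rejected:", err)
```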
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41097
Differential Revision: D22422352
Pulled By: mruberry
fbshipit-source-id: 8e61d34ef1c9608790b35cf035302fd226fd9421
Co-authored-by: Mike Ruberry <mruberry@devfair044.maas>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40115
Closes https://github.com/pytorch/pytorch/issues/37790
Closes https://github.com/pytorch/pytorch/issues/37944
A user may wish to run DDP's forward + backwards step under a non-default CUDA stream, such as one created by `torch.cuda.Stream()` and entered via `with torch.cuda.stream(stream)`. In this case, the user should be responsible for synchronizing events on this stream with other streams used in the program (per the documentation at https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics), but currently DDP has a bug which causes DDP under non-default streams to fail.
If a user does the following:
```
model = DDP(...)
loss = model(input).sum()
loss.backward()
grad = model.module.weight.grad
average = grad.clone()
dist.all_reduce(average)
```
There is a chance that `average` and `grad` will not be equal. This is because the CUDA kernels corresponding to the `all_reduce` call may run before `loss.backward()`'s kernels are finished. Specifically, in DDP we copy the allreduced gradients back to the model parameter gradients in an autograd engine callback, but this callback runs on the default stream. Note that this can also be fixed by the application synchronizing on the current stream, although this should not be expected, since the application is not using the current stream at all.
This PR fixes the issue by passing the current stream into DDP's callback.
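A hedged sketch of the usage pattern this fix is meant to support (requires CUDA and an initialized process group; setup omitted, names illustrative):
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model: DDP, inputs: torch.Tensor) -> None:
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):   # run forward/backward on a non-default stream
        loss = model(inputs).sum()
        loss.backward()
    # The application is still responsible for synchronizing its own streams
    # before consuming the gradients.
    torch.cuda.current_stream().wait_stream(stream)
```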
Tested by adding a UT `test_DistributedDataParallel_non_default_stream` that fails without this PR
ghstack-source-id: 106481208
Differential Revision: D22073353
fbshipit-source-id: 70da9b44e5f546ff8b6d8c42022ecc846dff033e
* Move OperatorSchema default inference function implementations to .cc file (#40845)
Summary:
This prevents the implementations of those functions (as lambdas) from being embedded as weak symbols into every shared library that includes this header.
Combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces the size of `libcaffe2_module_test_dynamic.so` from 500KB to 50KB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845
Differential Revision: D22334779
Pulled By: malfet
fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455
* Move `OperatorBase::AddRelatedBlobInfo` implementation to .cc file (#40844)
Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes this header, along with all of its dependencies.
This was one of the reasons why the size of libcaffe2_module_test_dynamic.so was 500KB (the AddRelatedBlobInfo implementation pulled in a quarter of libprotobuf.a with it).
Combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces the size of `libcaffe2_module_test_dynamic.so` from 500KB to 50KB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844
Differential Revision: D22334725
Pulled By: malfet
fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
Summary:
Right now it is used to check whether `math.remainder` exists, which is the case for both Python 3.7 and 3.8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40868
Differential Revision: D22343454
Pulled By: malfet
fbshipit-source-id: 6b6d4869705b64c4b952309120f92c04ac7e39fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40624
Previously we didn't clone the schema, so the default schema was used, which
caused issues for some models.
Test Plan: Imported from OSS
Differential Revision: D22259519
fbshipit-source-id: e2a393a54cb18f55da0c7152a74ddc22079ac350
* [quant] Make aten::repeat work for quantized tensors (#40644)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40644
Test Plan: Imported from OSS
Differential Revision: D22268558
fbshipit-source-id: 3bc9a129bece1b547c519772ecc6b980780fb904
* [quant][graphmode][fix] remove unsupported ops in the list (#40653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40653
Test Plan: Imported from OSS
Differential Revision: D22271413
fbshipit-source-id: a01611b5d90849ac673fa5a310f910c858e907a3
* [quant][graphmode][fix] dequantize propagation for {add/mul}_scalar (#40596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40596
Previously the fusion patterns for {add/mul}_scalar were inconsistent, since the op pattern
produces a non-quantized tensor while the op replacement graph produces a quantized tensor.
Test Plan: Imported from OSS
Differential Revision: D22251072
fbshipit-source-id: e16eb92cf6611578cca1ed8ebde961f8d0610137
* [quant][graphmode] Support quantization for `aten::append` (#40743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40743
`aten::append` modifies its input in place and its output is ignored; such ops are not
supported right now, so we first need to make `aten::append` non-inplace
by changing
```
ignored = aten::append(list, x)
```
to
```
x_list = aten::ListConstruct(x)
result = aten::add(list, x_list)
```
and then quantize the `aten::add` instead.
Test Plan:
TestQuantizeJitOps.test_general_shape_ops
Imported from OSS
Differential Revision: D22302151
fbshipit-source-id: 931000388e7501e9dd17bec2fad8a96b71a5efc5
We need an easy way to quickly visually grep binary sizes from builds,
and then a way to test out those binaries quickly.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
(cherry picked from commit 66813515d4dec66f319442ba967c64b87c0286cd)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40931
Fix docstrings for dynamic quantized Linear/LSTM and associated classes
ghstack-source-id: 107064446
Test Plan: Docs show up correctly
Differential Revision: D22360787
fbshipit-source-id: 8e357e081dc59ee42fd7f12ea5079ce5d0cc9df2
* properly skip legacy tests regardless of the default executor (#40381)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40381
Differential Revision: D22173938
Pulled By: Krovatkin
fbshipit-source-id: 305fc4484977e828cc4cee6e053a1e1ab9f0d6c7
* [JIT] Switch executor from Simple to Legacy.
This is done for 1.6 only, in order to recover from the performance regressions
caused by the Legacy->Simple switch that was done in 1.5. On master we
still plan to use the Simple executor and fix the performance issues in 1.7
without falling back to the Legacy executor.
Co-authored-by: Nikolay Korovaiko <korovaikon@gmail.com>
* Re-apply PyTorch pthreadpool changes
Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.
Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`
Reviewed By: xcheng16
Differential Revision: D22199952
fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5
* Enable XNNPACK ops on iOS and macOS.
Test Plan: buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/pytext/pytext_mobile_inference.json --platform ios --framework pytorch --remote --devices D221AP-12.0.1
Reviewed By: xta0
Differential Revision: D21886736
fbshipit-source-id: ac482619dc1b41a110a3c4c79cc0339e5555edeb
* Respect user-set thread count. (#40707)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40707
Test Plan: Imported from OSS
Differential Revision: D22318197
Pulled By: AshkanAliabadi
fbshipit-source-id: f11b7302a6e91d11d750df100d2a3d8d96b5d1db
* Fix and reenable threaded QNNPACK linear (#40587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40587
Previously, this was causing a divide-by-zero, but only in the multithreaded
empty-batch case, while calculating tiling parameters for the threads.
In my opinion, the bug here is using a value that is allowed to be zero
(batch size) for an argument that should not be zero (tile size), so I
fixed the bug by bailing out right before the call to
pthreadpool_compute_4d_tiled.
Test Plan: TestQuantizedOps.test_empty_batch
Differential Revision: D22264414
Pulled By: dreiss
fbshipit-source-id: 9446d5231ff65ef19003686f3989e62f04cf18c9
* Fix batch size zero for QNNPACK linear_dynamic (#40588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40588
Two bugs were preventing this from working. One was a divide-by-zero
when multithreading was enabled, fixed similarly to the fix for static
quantized linear in the previous commit. The other was the computation of
min and max to determine qparams. FBGEMM uses [0, 0] for [min, max] of
empty input; do the same here.
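A minimal sketch of the zero-batch case addressed here (engine selection is guarded, since QNNPACK availability depends on the build):
```
import torch

if "qnnpack" in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = "qnnpack"

linear = torch.nn.quantized.dynamic.Linear(4, 8)
x = torch.empty(0, 4)      # batch size zero
print(linear(x).shape)     # torch.Size([0, 8])
```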
Test Plan: Added a unit test.
Differential Revision: D22264415
Pulled By: dreiss
fbshipit-source-id: 6ca9cf48107dd998ef4834e5540279a8826bc754
Co-authored-by: David Reiss <dreiss@fb.com>