* Don't include view ops in autodiff graphs
* Skip view ops in autodiff testing
* Two more tests
* Appease clang-format
* Pacify clang-format
Co-authored-by: eellison <eellison@fb.com>
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Previously when analyzing a TupleConstruct, we ignored the aliasing
information of the inputs and simply marked all elements of the returned
tuple as wildcards. But since we can fully reason about the contents of
a tuple statically, we should be able to assign them aliasing
information.
This analysis was not only incomplete but produced incorrect results:
if `a` is not a wildcard, then `a noalias wildcard`. So if we looked at
`tuple(a)` and reported the aliasing info as `tuple(wildcard)`, then
`tuple[0] noalias a`, which is...wrong.
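For illustration, a minimal TorchScript sketch of the situation (the function `f` is hypothetical):
```
import torch

@torch.jit.script
def f(a: torch.Tensor):
    t = (a,)      # TupleConstruct: t[0] must alias `a`
    return t[0]   # reporting t[0] as a wildcard would wrongly imply `t[0] noalias a`
```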
Summary:
Add `torch._C._cuda_getArchFlags()` that returns the list of architectures `torch_cuda` was compiled for
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that return the architecture list and gencode flags PyTorch was compiled with
Print a warning if any of the GPUs is not compatible with any of the CUBINs
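A quick usage sketch of the new Python-level APIs (output values are illustrative):
```
import torch

if torch.cuda.is_available():
    # Architectures this build of torch_cuda was compiled for, e.g. ['sm_70', 'sm_75']
    print(torch.cuda.get_arch_list())
    # The matching NVCC gencode flags
    print(torch.cuda.get_gencode_flags())
```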
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173
Differential Revision: D22459998
Pulled By: malfet
fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40115
Closes https://github.com/pytorch/pytorch/issues/37790
Closes https://github.com/pytorch/pytorch/issues/37944
A user may wish to run DDP's forward + backwards step under a non-default CUDA stream, such as one created by `torch.cuda.Stream()` and entered via `with torch.cuda.stream(s)`. In this case, the user should be responsible for synchronizing events on this stream with other streams used in the program (per the documentation at https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics), but currently DDP has a bug which causes DDP under non-default streams to fail.
If a user does the following:
```
model = DDP(...)
loss = model(input).sum()
loss.backward()
grad = model.module.weight.grad          # .grad is an attribute, not a method
average = grad.clone()
dist.all_reduce(average)                 # all_reduce reduces in-place
average /= dist.get_world_size()         # cross-rank average; should equal `grad`
```
There is a chance that `average` and `grad` will not be equal. This is because the CUDA kernels corresponding to the `all_reduce` call may run before `loss.backward()`'s kernels are finished. Specifically, in DDP we copy the allreduced gradients back to the model parameter gradients in an autograd engine callback, but this callback runs on the default stream. Note that this can also be fixed by the application synchronizing on the current stream, although this should not be expected, since the application is not using the current stream at all.
This PR fixes the issue by passing the current stream into DDP's callback.
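For completeness, a hedged sketch of the user-side stream discipline, continuing the snippet above:
```
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    loss = model(input).sum()
    loss.backward()
# Make the default stream wait for `s` before reading gradients.
torch.cuda.current_stream().wait_stream(s)
grad = model.module.weight.grad
```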
Tested by adding a UT `test_DistributedDataParallel_non_default_stream` that fails without this PR
ghstack-source-id: 106481208
Differential Revision: D22073353
fbshipit-source-id: 70da9b44e5f546ff8b6d8c42022ecc846dff033e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40624
Previously we didn't clone the schema, so the default schema was used; this was
causing issues for some models.
Test Plan: Imported from OSS
Differential Revision: D22259519
fbshipit-source-id: e2a393a54cb18f55da0c7152a74ddc22079ac350
* [quant] Make `aten::repeat` work for quantized tensors (#40644)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40644
Test Plan: Imported from OSS
Differential Revision: D22268558
fbshipit-source-id: 3bc9a129bece1b547c519772ecc6b980780fb904
* [quant][graphmode][fix] remove unsupported ops in the list (#40653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40653
Test Plan: Imported from OSS
Differential Revision: D22271413
fbshipit-source-id: a01611b5d90849ac673fa5a310f910c858e907a3
* [quant][graphmode][fix] dequantize propagation for {add/mul}_scalar (#40596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40596
Previously the fusion patterns for {add/mul}_scalar were inconsistent: the op pattern
produces a non-quantized tensor while the op replacement graph produces a quantized tensor.
Test Plan: Imported from OSS
Differential Revision: D22251072
fbshipit-source-id: e16eb92cf6611578cca1ed8ebde961f8d0610137
* [quant][graphmode] Support quantization for `aten::append` (#40743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40743
`aten::append` modifies its input in place and its output is ignored; such ops are not
supported right now, so we first need to make `aten::append` non-inplace
by changing
```
ignored = aten::append(list, x)
```
to
```
x_list = aten::ListConstruct(x)
result = aten::add(list, x_list)
```
and then quantize the aten::add instead.
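The rewrite leans on the usual list equivalence; in plain Python terms:
```
xs = [1, 2]
xs.append(3)        # aten::append: in-place, output ignored
ys = [1, 2] + [3]   # aten::ListConstruct + aten::add
assert xs == ys
```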
Test Plan:
TestQuantizeJitOps.test_general_shape_ops
Imported from OSS
Differential Revision: D22302151
fbshipit-source-id: 931000388e7501e9dd17bec2fad8a96b71a5efc5
* properly skip legacy tests regardless of the default executor (#40381)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40381
Differential Revision: D22173938
Pulled By: Krovatkin
fbshipit-source-id: 305fc4484977e828cc4cee6e053a1e1ab9f0d6c7
* [JIT] Switch executor from Simple to Legacy.
This is done for 1.6 only in order to recover performance regressions
caused by the Legacy->Simple switch that was done in 1.5. On master we
still plan to use Simple executor and fix the performance issues in 1.7
without falling back to the Legacy executor.
Co-authored-by: Nikolay Korovaiko <korovaikon@gmail.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40549
Previously we didn't check whether `%weight_t` is produced by `aten::t`; this would fuse some `matmul`/`addmm`
ops that are not 2-D into `aten::linear`, which is incorrect.
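A hedged illustration of the distinction (shapes chosen arbitrarily):
```
import torch

x = torch.randn(2, 3)
w2d = torch.randn(4, 3)
y = torch.matmul(x, w2d.t())                # weight produced by aten::t: safe to fuse to aten::linear
w3d = torch.randn(5, 4, 3)
z = torch.matmul(x, w3d.transpose(-2, -1))  # 3-D "weight": must not become aten::linear
```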
Test Plan: Imported from OSS
Differential Revision: D22225921
fbshipit-source-id: 9723e82fdbac6d8e1a7ade22f3a9791321ab12b6
* [WIP][JIT] Add ScriptModule._reconstruct (#39979)
Summary:
**Summary**
This commit adds an instance method `_reconstruct` that permits users
to reconstruct a `ScriptModule` from a given C++ `Module` instance.
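A minimal usage sketch (rebuilding a wrapper around its own C++ module, just to show the call):
```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

sm = torch.jit.script(M())
# Rebuild the Python-side ScriptModule from the underlying C++ Module (`sm._c`).
sm._reconstruct(sm._c)
```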
**Testing**
This commit adds a unit test for `_reconstruct`.
**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33912.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39979
Differential Revision: D22172323
Pulled By: SplitInfinity
fbshipit-source-id: 9aa6551c422a5a324b822a09cd8d7c660f99ca5c
* [quant][graphmode] Enable inplace option for top level API (#40414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40414
Now that `_reconstruct` is supported in RecursiveScriptModule (https://github.com/pytorch/pytorch/pull/39979),
we can support the inplace option in the quantization API.
Test Plan: Imported from OSS
Differential Revision: D22178326
fbshipit-source-id: c78bc2bcf2c42b06280c12262bb31aebcadc6c32
Co-authored-by: Meghan Lele <meghanl@fb.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40495
As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.
1) Rank 0 would go into DDP construction, hold the GIL, and wait for the broadcast
inside DDP construction.
2) Rank 3 is a little slower and performs an RRef fetch call before the DDP
construction.
3) The RRef fetch call is done on Rank 0 and tries to acquire the GIL.
4) We now have a deadlock, since Rank 0 is waiting for Rank 3 to enter the
collective and Rank 3 is waiting for Rank 0 to release the GIL.
ghstack-source-id: 106534442
Test Plan:
1) Ran ddp_under_dist_autograd 500 times.
2) waitforbuildbot
Differential Revision: D22205180
fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
Co-authored-by: Pritam Damania <pritam.damania@fb.com>
* [JIT] Update type of the unsqueeze's output in shape analysis.
* [JIT] Fix shape analysis for aten::masked_select.
The reference says that this op always returns a 1-D tensor, even if
the input and the mask are 0-D.
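For example:
```
import torch

x = torch.tensor(5.0)        # 0-D input
mask = torch.tensor(True)    # 0-D mask
out = torch.masked_select(x, mask)
print(out.shape)             # torch.Size([1]): always 1-D
```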
- fixes #38034
- works around missing slice functionality in Sequential
by casting to tuple and slicing that instead
- supports iterating on the resulting slice but not call()
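A rough sketch of the pattern this enables in script mode (module and shapes are arbitrary):
```
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

    def forward(self, x):
        # Iterating over a slice works; calling the slice itself does not.
        for layer in self.seq[1:]:
            x = layer(x)
        return x

m = torch.jit.script(M())
print(m(torch.randn(3, 4)).shape)  # torch.Size([3, 2])
```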
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461
It turned out `:inherited-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable,
because pybind11 generates docstrings that type `self` as the parent class, `rpc.PyRRef`.
As a workaround, I am pulling the docstrings of the parent class, `PyRRef`, into the subclass, `RRef`, and doing surgery on the docstrings generated by pybind11.
ghstack-source-id: 106472496
P134031188
Differential Revision: D7933834
fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247
Co-authored-by: Shihao Xu <shihaoxu@fb.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39962
Adding a simple ref-counted wrapper for CUDA events that
destroys the CUDA event after the last copy is destroyed.
Test Plan: CI cuda profiler tests
Differential Revision: D22027092
Pulled By: ilia-cher
fbshipit-source-id: e0810388aa60b2291eb010896e13af1fad92e472
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40312
As part of https://github.com/pytorch/pytorch/issues/40255, we
realized that GPU support for distributed autograd was broken as part of our
multithreaded autograd change.
To fix this in the short term for 1.6, this PR includes the following changes:
1) Long lived CPU thread in DistEngine to execute GPU->CPU continuations in the
autograd graph.
2) The long lived CPU thread has its own ready_queue and this queue is used for
all GraphTasks created by DistEngine.
3) In thread_main(), the CPU thread cannot exit once the GraphTask is done
processing because of the new CPU thread added in 1).
4) To resolve this, thread_main() now has a parameter `device_thread` instead
of `reentrant_thread`. When device_thread is True, we expect this to be a long
lived device thread that does not exit.
5) When device_thread is False, thread_main is expected to run a GraphTask and
return once done.
ghstack-source-id: 106391329
Test Plan: waitforbuildbot
Differential Revision: D22146183
fbshipit-source-id: dd146b7a95f55db75f6767889b7255e9d62d5825
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40389
The `mpt_uv` channel MultiPlexes over a Transport, namely the UV one. What this means is that it takes a tensor, chunks it into equal parts and sends each of them on a separate UV connection, each running in a separate UV loop. Thus they each have their own socket and thread. This allows them to reach bandwidths that go beyond what a simple single-threaded approach can do, which is necessary to reach the high bandwidths of some modern NICs.
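A hedged Python sketch of the chunking idea (names hypothetical; the real implementation is in C++):
```
import torch

def split_for_connections(t, num_conns):
    # One roughly equal chunk per UV connection/thread, so the payload
    # can use more than one socket's worth of bandwidth.
    return list(torch.chunk(t.flatten(), num_conns))

parts = split_for_connections(torch.randn(1024), 4)  # four chunks of 256 elements
```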
ghstack-source-id: 106375511
Test Plan: Ran a few manual tests myself, for the rest relied on the PyTorch RPC tests.
Differential Revision: D22144380
fbshipit-source-id: ef555fa04c6f13a4acf3bd5f7b03d04d02460d38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40383
The debug option is not supported for these cases, so we print a warning when this occurs.
Test Plan: Imported from OSS
Differential Revision: D22164071
fbshipit-source-id: 90459530f4efdd6d255df4f015606cb0e9070cd3
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.
This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) support debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`):
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master. In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.
Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358
Differential Revision: D22165785
Pulled By: albanD
fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40066
Builds on top of the previous PR to ensure that all remotely profiled events are prefixed with the key for the RPC that generated them.
The key is generated by the result of `_build_rpc_profiling_key` in `rpc/internal.py` and prefixed onto the event name. In order to do this, we set the current-key when creating the RPC in Python, retrieve the currently-set key in C++ and save a GloballyUniqueId -> key mapping to an in-memory map. When we receive an RPC with profiling information, we expect to receive this ID back, and look up the corresponding profiling key in the map.
The key is then added to all the remote events.
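A hedged sketch of the bookkeeping described above (all names illustrative, not the real C++ API):
```
# GloballyUniqueId -> profiling key, recorded when the RPC is created.
profiling_keys = {}

def on_rpc_created(rpc_id, key):
    profiling_keys[rpc_id] = key

def on_remote_events_received(rpc_id, event_names):
    # Look up the key for this RPC and prefix every remote event with it.
    key = profiling_keys.pop(rpc_id)
    return ["{}#{}".format(key, name) for name in event_names]
```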
Tested by adding tests to ensure the key is added to all the remote events. Also added a UT which tests this under a multi-threading scenario, to ensure that the mapping's correctness is maintained when several RPCs are in the process of being created at once.
ghstack-source-id: 106316106
Test Plan: Unit test
Differential Revision: D22040035
fbshipit-source-id: 9215feb06084b294edbfa6e03385e13c1d730c43
Summary:
Previously, large tensor data in attributes and subgraphs was not stored externally, so ONNX could not serialize models whose total size summed to >= 2GB. This PR enables that.
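A minimal export sketch, assuming the `use_external_data_format` flag as it existed around the 1.6 timeframe:
```
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
dummy = torch.randn(1, 8)
# Store tensor payloads in side files so the .onnx protobuf stays under 2 GB.
torch.onnx.export(model, dummy, "model.onnx", use_external_data_format=True)
```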
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38793
Reviewed By: hl475
Differential Revision: D22111092
Pulled By: houseroad
fbshipit-source-id: 355234e50825d576754de33c86a9690161caaeaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38490
A meta tensor is a tensor that is a lot like a normal tensor,
except it doesn't actually have any data associated with it.
You can use them to carry out shape/dtype computations without
actually running the underlying kernels; for example, this could
be used to do shape inference in a JIT analysis pass.
Check out the description in DispatchKey.h for more information.
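A sketch of the intended usage (per the limitations listed below, only `torch.add` is expected to work at this point):
```
import torch

# Meta tensors carry shape/dtype but no storage; running an op on them
# performs only the shape/dtype computation.
a = torch.empty(2, 3, device="meta")
b = torch.empty(2, 3, device="meta")
c = torch.add(a, b)
print(c.shape, c.device)  # torch.Size([2, 3]) meta
```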
Meta tensors are part of a larger project to rationalize how we
write kernels so that we don't have to duplicate shape logic
in CPU kernel, CUDA kernel and meta kernel (this PR makes the
duplication problem worse!) However, that infrastructure can
be built on top of this proof of concept, which just shows how
you can start writing meta kernels today even without this
infrastructure.
There are a lot of things that don't work:
- I special cased printing for dense tensors only; if you try to
allocate a meta sparse / quantized tensor things aren't going
to work.
- The printing formula implies that torch.tensor() can take an
ellipsis, but I didn't add this.
- I wrote an example formula for binary operators, but it isn't
even right! (It doesn't do type promotion or memory layout
correctly.) The most future-proof way to do it right is to
factor the relevant computation out of TensorIterator,
as it is quite involved.
- Nothing besides torch.add works right now
- Meta functions are ALWAYS included in mobile builds (selective
build doesn't work on them). This isn't a big deal for now
but will become more pressing as more meta functions are added.
One reason I'm putting up this PR now is to check with Yinghai Lu
if we can unblock shape inference for accelerators, while we are
still working on a long term plan for how to unify all shape
computation across our kernels.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21935609
Pulled By: ezyang
fbshipit-source-id: f7d8636eeb8516b6bc296db99a16e56029972eee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40100
ELU has a range of (-1, inf). In the original PR which added
the quantized operator we decided to pass the quantization params
from the input. However, it makes more sense to require observation
for this op.
This PR changes the API to require observation. Next PRs in this stack
will add the eager and graph mode handling.
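A rough eager-mode sketch of why observation is needed (qparams computed by hand here; a real observer does this):
```
import torch

x = torch.randn(10)
y = torch.nn.functional.elu(x)
# Outputs can be negative, so output qparams must come from observing y,
# not reused from x as before.
scale = float((y.max() - y.min()) / 255)
zero_point = int(round(-float(y.min()) / scale))
qy = torch.quantize_per_tensor(y, scale, zero_point, torch.quint8)
```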
Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_qelu
```
Imported from OSS
Differential Revision: D22075083
fbshipit-source-id: 0ea0fd05a00cc7a5f122a2b1de09144bbd586f32