* Don't include view ops in autodiff graphs
* Skip view ops in autodiff testing
* Two more tests
* Appease clang-format
* Pacify clang-format
Co-authored-by: eellison <eellison@fb.com>
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40312
As part of https://github.com/pytorch/pytorch/issues/40255, we
realized that GPU support for distributed autograd was broken as part of our
multithreaded autograd change.
To fix this in the short term for 1.6, this PR includes the following changes:
1) A long-lived CPU thread in DistEngine to execute GPU->CPU continuations in the
autograd graph.
2) The long-lived CPU thread has its own ready_queue, and this queue is used for
all GraphTasks created by DistEngine.
3) In thread_main(), a CPU thread can no longer exit as soon as its GraphTask is
done processing, because of the new long-lived CPU thread added in 1).
4) To resolve this, thread_main() now has a parameter `device_thread` instead
of `reentrant_thread`. When device_thread is True, we expect this to be a
long-lived device thread that does not exit.
5) When device_thread is False, thread_main is expected to run a GraphTask and
return once done.
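The pattern in 1) and 2) can be sketched in Python for intuition (the real implementation is C++ inside DistEngine; all names below are illustrative, not the actual API):
```python
import queue
import threading

class DistEngineSketch:
    """Illustrative only: one long-lived CPU thread with its own ready
    queue, shared by all GraphTasks created by this engine."""

    def __init__(self):
        self.cpu_ready_queue = queue.Queue()
        # device_thread=True semantics: this thread never exits on its own.
        self.cpu_thread = threading.Thread(target=self._thread_main, daemon=True)
        self.cpu_thread.start()

    def _thread_main(self):
        while True:  # long-lived: keep serving tasks across GraphTasks
            continuation = self.cpu_ready_queue.get()
            continuation()  # run a GPU->CPU continuation of the autograd graph

    def submit(self, continuation):
        self.cpu_ready_queue.put(continuation)
```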
ghstack-source-id: 106391329
Test Plan: waitforbuildbot
Differential Revision: D22146183
fbshipit-source-id: dd146b7a95f55db75f6767889b7255e9d62d5825
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40396
Removes activation and normalization modules from eager mode QAT.
These were incorrectly added, but we don't actually need them.
Test Plan:
```
python test/test_quantization.py TestQuantizationAwareTraining
```
Imported from OSS
Differential Revision: D22169768
fbshipit-source-id: b5bd753dafe92e90e226fb773eb18c6aae179703
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40066
Builds on top of the previous PR to ensure that all remotely profiled events are prefixed with the key for the RPC that generated them.
The key is generated by the result of `_build_rpc_profiling_key` in `rpc/internal.py` and prefixed onto the event name. To do this, we set the current key when creating the RPC in Python, retrieve the currently-set key in C++, and save a GloballyUniqueId -> key mapping in an in-memory map. When we receive an RPC with profiling information, we expect to receive this ID back, and look up the corresponding profiling key in the map.
The key is then added to all the remote events.
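A hedged Python sketch of this bookkeeping (the real map is a C++ in-memory structure; every name here is hypothetical):
```python
import threading

# Hypothetical stand-in for the C++ GloballyUniqueId -> profiling-key map.
_profiling_keys = {}
_profiling_keys_lock = threading.Lock()

def record_profiling_key(rpc_id, key):
    # Called when the RPC is created: remember which key belongs to this id.
    with _profiling_keys_lock:
        _profiling_keys[rpc_id] = key

def prefix_remote_events(rpc_id, event_names):
    # Called when profiling info comes back: look up the key and prefix it.
    with _profiling_keys_lock:
        key = _profiling_keys.pop(rpc_id)
    return [f"{key}#{name}" for name in event_names]
```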
Tested by adding tests to ensure the key is added to all the remote events. Also added a UT which tests this under a multi-threading scenario, to ensure that the mapping's correctness is maintained when several RPCs are in the process of being created at once.
ghstack-source-id: 106316106
Test Plan: Unit test
Differential Revision: D22040035
fbshipit-source-id: 9215feb06084b294edbfa6e03385e13c1d730c43
Summary: NVIDIA's Apex is updating to no longer rely on this behavior, but we're reverting this Python2->Python3 update to unblock internal apex users.
Test Plan: Sandcastle + OSS CI.
Reviewed By: ngimel
Differential Revision: D22146782
fbshipit-source-id: f9483d2cbf9dc3a469ad48a6c863edea3ae51070
Summary:
Make `common_utils.TestCase.precision` a property, because it is overridden as such in `common_device_type`.
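The underlying pitfall is mixing a plain class attribute with a property override in a subclass; a minimal illustration (not the actual test-suite code):
```python
class Base:
    precision = 1e-5          # plain class attribute

class Derived(Base):
    @property
    def precision(self):      # overridden as a (read-only) property
        return 1e-2

d = Derived()
print(d.precision)            # 1e-2: the property wins
d.precision = 1e-3            # AttributeError: can't set attribute
```
Defining `precision` as a property (with a setter) on the base class keeps the two interfaces consistent.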
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40057
Differential Revision: D22138385
Pulled By: malfet
fbshipit-source-id: 0e7c14654bf60f18f585efc61f96fdd0af23346f
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39677
Test Plan:
Moved a test class suite between files; since this is a simple code refactor with no functional change, tested to make sure the test output was the same before and after the refactor.
Image below shows the output of TestGraphModePostTrainingStatic before refactor
{F239676498}
This image shows the output of TestQuantizeScript (renamed version that is in test_quantize_script.py instead of test_quantize.py)
{F239676509}
Differential Revision: D21940638
Pulled By: edmundw314
fbshipit-source-id: 54160a5151aadf3a34bdac2bcaeb52904e6653ed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38748
This diff contains the message scaffolding and profiler changes needed to remotely run the profiler across different nodes and aggregate the results on a single node.
As discussed, we have implemented this by creating new message types that, similar to the autograd messages, wrap the profiling information with the original message and send this new message over the wire. On the receiving end, this wrapped message is detected, we fetch the original message from it, and process the original message with the profiler enabled. When sending a response with profiling information, we serialize the profiled `Events` and send them back over RPC. When such a message is received, the events profiled on the remote node are stored (added back to the local profiler).
Changes in this PR:
- New message types (run_with_profiling_req, run_with_profiling_resp) to send profiling info over the wire. Message parsing logic is added to handle these wrapped types.
- Handling of sending profiler data over the wire, in particular, the attributes of the `ProfilerConfig` and the serialized profiled `Event`s
- The logic for wrapping RPC messages is deduped with that in `rpc_with_autograd`, and the common payload wrapping/unwrapping logic is moved to helper functions in `rpc/utils.cpp`
- Changes in `autograd/utils.cpp` to detect if we have enabled the profiler and are sending an RPC, if so, uses the above new message types
- Changes in request_callback to parse and turn on the profiler in a thread-local fashion
- Serialization and deserialization of profiling `Events`, and support to add the remote events to the thread-local profiler
- Introduction of the concept of `node_id`, which, as discussed with ilia-cher, will be used along with the `Event`'s handle attribute to distinguish between events. When there are events from different nodes, this node information is rendered in the profile output (e.g. when printing tables); otherwise it is not, since it is irrelevant.
- Some changes to profiler.cpp to add useful helper methods/guards
- toHere() is now profiled for RRefs
- Unittests
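The wrapping/unwrapping flow can be modeled with a short, self-contained Python sketch (all type and function names here are hypothetical stand-ins for the new message types):
```python
from dataclasses import dataclass

@dataclass
class RunWithProfilingReq:
    # Hypothetical wrapper: the profiler config travels with the payload.
    profiler_config: dict
    original_message: str

@dataclass
class RunWithProfilingResp:
    # Hypothetical wrapper: serialized remote events ride back with the reply.
    events: list
    original_response: str

def handle(msg, process):
    """Receiving side: unwrap, run with profiling on, wrap the reply."""
    if isinstance(msg, RunWithProfilingReq):
        response = process(msg.original_message)
        events = [f"remote event for {msg.original_message!r}"]  # stand-in
        return RunWithProfilingResp(events=events, original_response=response)
    return process(msg)

# Usage: the caller gets remote events aggregated back locally.
resp = handle(RunWithProfilingReq({"state": "CPU"}, "add(2, 3)"), process=str.upper)
print(resp.events, resp.original_response)
```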
ghstack-source-id: 106134626
Test Plan: Added unittests, existing profiler unittests.
Differential Revision: D19510010
fbshipit-source-id: 044347af992f19a9e3b357c9567f6fc73e988157
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39203
Adds logic and test coverage for optional weights and biases for
the quantized normalization operators. This was broken before this
PR because the `TORCH_LIBRARY` registration had these as required
parameters; this PR removes that requirement and cleans up the call sites.
Note: consolidating the registrations in `native_functions.yaml` as opposed to `library.cpp`,
after a discussion with ezyang.
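For reference, optionality is expressed in operator schemas with `Tensor?`; a hypothetical post-PR schema for the layer norm op might read (parameter names are illustrative, not the exact registration):
```
quantized::layer_norm(Tensor input, int[] normalized_shape,
                      Tensor? weight, Tensor? bias,
                      float eps, float output_scale, int output_zero_point) -> Tensor
```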
Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_qlayer_norm
python test/test_quantization.py TestQuantizedOps.test_group_norm
python test/test_quantization.py TestQuantizedOps.test_instance_norm
python test/test_quantization.py TestStaticQuantizedModule.test_layer_norm
python test/test_quantization.py TestStaticQuantizedModule.test_group_norm
python test/test_quantization.py TestStaticQuantizedModule.test_instance_norm
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_layer_norm
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_group_norm
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_instance_norm
```
Imported from OSS
Differential Revision: D21885259
fbshipit-source-id: 978c7b8bd6c11a03e9e5fdb68f154cb80cc43599
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40162
The only public option is `num_worker_threads`. The other ones are private (as indicated by the leading underscore; is that enough?) and allow specifying a different set and order of transports/channels. They can thus be used to disable a backend (by not specifying it) or to force one (by raising its priority), and therefore to work around defective backends, should we find any post-release.
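For example (a hedged sketch; the transport/channel names and the exact option surface may differ):
```python
import torch.distributed.rpc as rpc

# Public knob plus the private escape hatches described above.
opts = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=16,
    _transports=["uv"],    # force a specific transport (disables the others)
    _channels=["basic"],   # likewise for channels
)
rpc.init_rpc("worker0", rank=0, world_size=1,
             backend=rpc.BackendType.TENSORPIPE, rpc_backend_options=opts)
```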
ghstack-source-id: 106103238
Test Plan: Built //caffe2:ifbpy and, using TensorPipe's verbose logging, verified that the transports/channels I specified were indeed the ones that were being registered.
Differential Revision: D22090661
fbshipit-source-id: 789bbe3bde4444cfa20c40276246e4ab67c50cd0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40142
test_jit is becoming huge again, which makes it hard for editors to load and
for us to write new tests; this splits out the tracer-related tests.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D22085035
Pulled By: wanchaol
fbshipit-source-id: 696bee84985ecfbfeac8e2ee5c27f1bdda8de394
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40141
This rref timeout test could be flaky because we could end up processing `RRefUserDelete` messages on the owner node before processing the to_here message. This would result in a hang in `ProcessGroupAgent::sync()` that eventually results in a timeout.
The rough sequence of what happens is:
0) Node 0 creates RRef on node 1 with rpc.remote() call
1) rref.to_here() is called with a timeout. Because of delay injection, the processing of this message can be delayed (this is also technically possible in applications without delay injection)
2) At some point, callbacks corresponding to rpc.remote() runs and confirms the rref, adding it as a confirmed user
3) RPC shutdown starts, as part of which we send out RRef user deletes. In this case, 0 sends an RRef user delete to 1, and node 1 removes the owner from the `owners_` field.
4) The `to_here()` message is finally processed by node 1. But since we have deleted the `owner_`, while processing this message we create a future that will be completed when the owner exists (this is to account for the case of `to_here()` arriving before `rpc.remote`). But this future will never complete, since the owner is already deleted, so we hang indefinitely.
As a workaround for now, we can force `to_here()` to run before RPC shutdown by adding a blocking `to_here()` call with no timeout.
A more robust, longer-term fix would be to detect if an owner has been previously deleted (such as by an RRefUserDelete). Then, we know that the future corresponding to owner creation on the remote end will never complete, and we can error out when processing a `to_here()`.
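In test form, the workaround is roughly (a sketch; assumes an initialized RPC group with a "worker1" peer):
```python
import torch
import torch.distributed.rpc as rpc

rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
# Blocking to_here() with no timeout: forces the owner to process the
# to_here message before shutdown starts sending RRefUserDelete messages.
result = rref.to_here()
rpc.shutdown()
```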
ghstack-source-id: 106036796
Differential Revision: D22084735
fbshipit-source-id: fe7265a4fe201c4d6d2f480f64fe085cd59dbfb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40167
In v1.6 TensorPipe will not support transferring GPU tensors, so, just like the other agents, it should raise the appropriate errors when the user attempts to do so. One such error is raised when sending the arguments, another when sending back the result.
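From user code, the two failure points look roughly like this (a sketch; worker names are illustrative and an initialized RPC group is assumed):
```python
import torch
import torch.distributed.rpc as rpc

# Error when sending the arguments:
rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2, device="cuda"), 1))

def make_cuda_tensor():
    return torch.ones(2, device="cuda")

# Error when sending back the result:
rpc.rpc_sync("worker1", make_cuda_tensor)
```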
ghstack-source-id: 106059723
Test Plan: Re-enabled the test for this
Differential Revision: D22091737
fbshipit-source-id: 23dda98bc006333c6179361e8cfaf00ecda06408
Summary:
BC-breaking note:
If a user is using one of these dunders directly, they will no longer be available. Users should update to the Python 3 compatible dunders.
Original PR note:
`__div__` (and `__idiv__` and `__rdiv__`) are no longer special dunders in Python 3. This PR replaces them with the `__truediv__` (`__itruediv__`, `__rtruediv__`) dunders, since we no longer support Python 2.
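For example, a class that previously defined `__div__` for Python 2 should now spell it like this (a minimal sketch):
```python
class Ratio:
    def __init__(self, value):
        self.value = value

    def __truediv__(self, other):   # Python 3: invoked by the / operator
        return Ratio(self.value / other)

    def __rtruediv__(self, other):  # reflected: other / Ratio(...)
        return Ratio(other / self.value)

    def __itruediv__(self, other):  # in-place: r /= other
        self.value /= other
        return self
```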
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39151
Differential Revision: D22075713
Pulled By: mruberry
fbshipit-source-id: d318b47b51f7cc4c3728b1606a34d81e49ba0fa1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40101
Create three tests for LSTMs:
1. test_qlstm: checks the numerics of the quantized LSTM operator.
2. test_lstm_api: checks the LSTM module and compares it with the quantized LSTM op.
3. test_quantized_rnn: checks the dynamic quantization workflow, scriptability, and serialization of the quantized LSTM.
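The workflow these tests exercise is ordinary dynamic quantization of an LSTM (a hedged sketch using the public eager-mode API):
```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=4, hidden_size=8, num_layers=1)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.LSTM}, dtype=torch.qint8)
x = torch.randn(5, 3, 4)  # (seq_len, batch, input_size)
out, (h, c) = quantized(x)
```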
ghstack-source-id: 105997268
(Note: this ignores all push blocking failures!)
Test Plan:
buck test caffe2/test:quantization -- 'test_lstm_api \(quantization\.test_quantized_module\.TestDynamicQuantizedModule\)' --print-passing-details
buck test caffe2/test:quantization -- 'test_quantized_rnn \(quantization\.test_quantize\.TestPostTrainingDynamic\)'
buck test caffe2/test:quantization -- 'test_qlstm \(quantization\.test_quantized_op\.TestDynamicQuantizedRNNOp\)' --print-passing-details
Differential Revision: D22070826
fbshipit-source-id: 46c333e19b9eab8fa5cab6f132e89b80a635791a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39909
As described in https://github.com/pytorch/pytorch/issues/33583,
ProcessGroupAgent initializes the default process group, and this causes issues
if the user initializes the default process group themselves: either the RPC
initialization or the user's process group initialization would fail.
To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.
Closes: https://github.com/pytorch/pytorch/issues/33583
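After this change, a pattern like the following should work (hedged, single-process sketch):
```python
import os
import torch.distributed as dist
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# The user initializes the default process group themselves...
dist.init_process_group("gloo", rank=0, world_size=1)
# ...and RPC initialization no longer conflicts with it, since
# ProcessGroupAgent creates its own ProcessGroupGloo internally.
rpc.init_rpc("worker0", rank=0, world_size=1)
rpc.shutdown()
```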
ghstack-source-id: 105953303
Test Plan: waitforbuildbot
Differential Revision: D22011868
fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
Summary:
Currently compare_with_numpy requires a device and dtype, but these arguments are ignored if a tensor is provided. This PR updates the function to only take device and dtype if a tensor-like object is given. This should prevent the confusion where you could, for example, pass a CPU float tensor but provide a CUDA device and integer dtype.
Several tests are updated to reflect this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40064
Differential Revision: D22058072
Pulled By: mruberry
fbshipit-source-id: b494bb759855977ce45b79ed3ffb0319a21c324c
Summary:
Create three tests for LSTMs:
1. test_qlstm: checks the numerics of the quantized LSTM operator.
2. test_lstm_api: checks the LSTM module and compares it with the quantized LSTM op.
3. test_quantized_rnn: checks the dynamic quantization workflow, scriptability, and serialization of the quantized LSTM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38851
ghstack-source-id: 105945574
(Note: this ignores all push blocking failures!)
Test Plan:
buck test caffe2/test:quantization -- 'test_lstm_api \(quantization\.test_quantized_module\.TestDynamicQuantizedModule\)' --print-passing-details
buck test caffe2/test:quantization -- 'test_quantized_rnn \(quantization\.test_quantize\.TestPostTrainingDynamic\)'
buck test caffe2/test:quantization -- 'test_qlstm \(quantization\.test_quantized_op\.TestDynamicQuantizedRNNOp\)' --print-passing-details
Differential Revision: D21628596
fbshipit-source-id: 4aeda899f2e5f14bfbe3d82096cb4ce89c725fa1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39964
The "[fut.wait() for fut in futs]" idiom can introduce up to
O(len(futs)) thread switches, which may be excessive for large N.
This plumbs through the new c++ c10::collectAll() to Python space
so that we only employ a single jit-side wait.
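From Python the replacement pattern looks roughly like this (a sketch, assuming the `torch.futures.collect_all` binding exposed by this plumbing):
```python
import torch

futs = [torch.futures.Future() for _ in range(3)]
for i, fut in enumerate(futs):
    fut.set_result(i)

# One aggregate wait instead of len(futs) individual waits.
done = torch.futures.collect_all(futs).wait()
print([f.wait() for f in done])  # [0, 1, 2]
```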
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn
Differential Revision: D22027412
fbshipit-source-id: 4e344a19a09638ee46e7fc478df80a41941b84ce
Summary:
Remove PY3 and PY34 checks from `torch/testing/_internal/common_utils.py`
Remove PY35 global var from `torch.jit.annotations`
Always call `try_get_real_signature` in `torch/jit/annotations.py`
Use `map` instead of `imap`: since Python 2 is no longer supported, `map` is always lazy.
Remove all pre Python-3.6 checks from `torch/_six.py` and `torch/_appdirs.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39879
Differential Revision: D22037811
Pulled By: malfet
fbshipit-source-id: af0c79f976569c2059d39ecb49c6b8285161734f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974
# Problem
When this assertion happens, I don't know
- which worker_id it is on, even with the worker_name "trainer:0".
- which rref is throwing this exception.
```shell
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers
trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp>
trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception
raise result.exception_type(result.msg)
RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0')
Traceback (most recent call last):
File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke
return getattr(rref.local_value(), func_name)(*args, **kwargs)
RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0
```
Changes:
- Stringify WorkerInfo.
- Make the localValue() assertion message clearer about this case.
ghstack-source-id: 105840918
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork
Reviewed By: mrshenli
Differential Revision: D5690653
fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39933
Fix the RRef-related alias annotation to ensure it does not get erased by
the JIT DCE.
Test Plan: Imported from OSS
Differential Revision: D22015426
Pulled By: wanchaol
fbshipit-source-id: 3e74d49fa9f88abaf662bde7be5284f01f621b98
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39790
The "[fut.wait() for fut in futs]" idiom can introduce up to
O(len(futs)) thread switches, which may be excessive for large N.
This plumbs through the new c++ c10::collectAll() to Python space
so that we only employ a single jit-side wait.
ghstack-source-id: 105779443
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn
Reviewed By: kiukchung
Differential Revision: D21976891
fbshipit-source-id: 253c61f503f4ffb9be784e6c49a0656cede139fb
Summary:
Enhance FileCheck util to check for highlighted source ranges. This is useful when writing tests regarding generated error messages that require source code highlighting.
Here is how the error looks in different cases:
- When the needed source code token is not found at all in the input string:
```
RuntimeError: Expected to find "invalid_token" but did not find it
Searched string:
... <--- HERE
def to_list_missing_type_annotation(x):
# type: (torch.Tensor) -> List[float]
From CHECK-SOURCE-HIGHLIGHTED: invalid_token
```
- When the source code token is found but not highlighted:
```
Traceback (most recent call last):
File "test_range.py", line 11, in <module>
FileCheck().check_source_highlighted("x.tolist()").run(s)
RuntimeError: Expected to find "~~~~~~~~~~" but did not find it
Searched string:
# type: (torch.Tensor) -> List[float]
li = x.tolist()
~~~~~~~~~ <--- HERE
~~~~~~~~~~~~~~~~~~~... <--- HERE
return li
```
It is a bit confusing since both the input text (usually an error message) and the generated error message have their own highlighted portions, but this is consistent with previous behavior. Another option would be to generate plain error messages without additional range highlighting on the input text.
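Typical usage of the new directive, following the traceback above (a sketch; exact matching semantics may differ):
```python
from torch.testing import FileCheck

source = """
li = x.tolist()
     ~~~~~~~~~~ <--- HERE
"""
# Passes only if "x.tolist()" appears and is underlined with '~' markers.
FileCheck().check_source_highlighted("x.tolist()").run(source)
```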
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39692
Test Plan:
Added unit test.
Closes https://github.com/pytorch/pytorch/issues/38698
Differential Revision: D22001765
Pulled By: gmagogsfm
fbshipit-source-id: 6681441eee5853ab061d198ccfe55ebffddca202