Summary:
A small quality-of-life improvement to the NVTX Python bindings that we're using internally and that would be useful to other folks using NVTX annotations via PyTorch. (And my first potential PyTorch contribution.)
Instead of needing to be careful with try/finally to make sure every `range_push` is matched by a `range_pop`:
```python
nvtx.range_push("Some event")
try:
    # Code here...
finally:
    nvtx.range_pop()
```
you can simply do:
```python
with nvtx.range("Some event"):
    # Code here...
```
or even use it as a decorator:
```python
class MyModel(nn.Module):
    # Other methods here...

    @nvtx.range("MyModel.forward()")
    def forward(self, *input):
        # Forward pass code here...
```
A couple small open questions:
1. I also added the ability to call `msg.format()` inside `range()`. The intention is that, if nothing is listening to NVTX events, we could skip the string formatting entirely to lower the overhead in that case. If you like that idea, I can add the actual "skip string formatting if nobody is listening to events" logic; we can also leave it as is, or I can remove it if you folks don't like it. (In the first two cases, should we add it to `range_push()` and `mark()` too?) Just let me know which option you prefer, and I'll update the pull request.
2. I don't think there are many places for bugs to hide in that function, but I can certainly add a quick test, if you folks want.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42925
Reviewed By: gchanan
Differential Revision: D24476977
Pulled By: ezyang
fbshipit-source-id: 874882818d958e167e624052e42d52fae3c4abf1
Summary:
For reasons similar to those documented in the `[Sync Streams]` note. For a current example, `ProcessGroupNCCL::allgather` must also call `recordStream` and already does so.
The output tensor is created on the default stream (by the application). NCCL/RCCL internally uses another stream (i.e., ncclStream). If we do not record the output tensor on the ncclStream, there is a chance that the output tensor might be deallocated while NCCL/RCCL is using it.
The application is not aware of the ncclStream, since it is internal to ProcessGroupNCCL, so it cannot record the output tensor on that stream itself.
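For context, a rough Python-level illustration of the hazard (the actual fix is in the C++ `ProcessGroupNCCL` code, not this snippet; the side stream here merely stands in for the internal ncclStream):
```python
import torch

side_stream = torch.cuda.Stream()        # stands in for the internal ncclStream
out = torch.empty(1024, device="cuda")   # "output tensor" created on the default stream

with torch.cuda.stream(side_stream):
    out.add_(1)                          # work enqueued on the side stream

# Without this call, the caching allocator only tracks the default stream;
# if `out` were freed, its memory could be reused while the side stream is
# still writing to it.
out.record_stream(side_stream)
```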
Patch originally developed by sarunyap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46603
Reviewed By: srinivas212
Differential Revision: D24458530
fbshipit-source-id: b02e74d1c3a176ea1b9bbdd7dc671b221fcadaef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46345
Allow users to add more fusion mappings.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D24317439
fbshipit-source-id: 3b144bbc305e41efbdf3e9fb25dbbeaad9e86c6a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46669
Make `Graph`'s deepcopy behavior iterative rather than recursive. This prevents stack overflow issues with very large `Graph`s.
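For illustration only, the general pattern being applied (a toy linked chain, not the FX `Graph` code itself): replace recursion with an explicit iterative walk so very deep structures don't exhaust the Python stack.
```python
import copy

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def deepcopy_chain_iterative(head):
    # Walk the chain with a loop instead of letting copy.deepcopy recurse
    # through every `next` pointer, which would overflow on long chains.
    dummy = Node(None)
    tail = dummy
    cur = head
    while cur is not None:
        tail.next = Node(copy.deepcopy(cur.value))
        tail = tail.next
        cur = cur.next
    return dummy.next
```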
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D24455120
Pulled By: jamesr66a
fbshipit-source-id: 5c37db5acabe313b9a7a464bebe2a82c59e4e2e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44090
This is an initial commit pulling in the torchgpipe fork at
https://github.com/facebookresearch/fairscale.
The purpose of this commit is to just pull in the code and ensure all tests and
builds work fine. We will slowly modify this to match our intended API
mentioned in https://fb.quip.com/txurAV3zIFox#RPZACAfAKMq. Follow-up PRs will
address further changes needed on top of this initial commit.
We're pulling the code into the `torch.distributed._pipeline.sync` package. The
package is private on purpose since there is a lot of work (ex: docs, API
changes etc.) that needs to go in before we can actually officially support
this.
ghstack-source-id: 114864254
Test Plan:
1) waitforbuildbot
2) Ran all tests on my devgpu
Reviewed By: mrshenli
Differential Revision: D23493316
fbshipit-source-id: fe3c8b7dadeeb86abdc00e8a8652491b0b16743a
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal.
This makes the affected code more readable and possibly faster. Some care has to be taken because `list(map(f, xs))` builds the whole list immediately, while `(f(x) for x in xs)` is a generator expression that is evaluated lazily. This laziness is a benefit in cases where the full list never needs to exist in memory (e.g. when the result is passed to `tuple`, `extend`, or `join`).
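For illustration (not code from this PR), the eager/lazy difference looks like this:
```python
words = ["a", "b", "c"]

eager = ", ".join([w.upper() for w in words])  # materializes a temporary list
lazy = ", ".join(w.upper() for w in words)     # items are produced one at a time
assert eager == lazy == "A, B, C"
```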
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372
Currently, in `_run_function`, we catch an exception from the Python
function that is run and report it back to the master. However, in some
large-scale training jobs, it would be valuable to also log the error on the
trainer itself for faster debugging.
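A rough sketch of the behavior described above (the names here are illustrative, not the exact internals of the RPC module):
```python
import logging

logger = logging.getLogger(__name__)

def _run_function(python_udf):
    try:
        return python_udf.func(*python_udf.args, **python_udf.kwargs)
    except Exception:
        # Log locally on the trainer for faster debugging, then re-raise so
        # the error is still reported back to the master as before.
        logger.exception("Exception while running %s on the trainer", python_udf.func)
        raise
```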
Test Plan: Added unittest.
Reviewed By: pritamdamania87
Differential Revision: D24324578
fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304
When a single process operates on only one GPU, we can avoid this
scatter and instead replace it with a recursive version of `to` that
transfers the input tensors to the correct device.
The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
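A simplified sketch of such a recursive `to` (hypothetical code, not the exact implementation): tensors nested in lists, tuples, and dicts are moved, while unrecognized custom types are passed through unchanged, matching the previous scatter conventions.
```python
import torch

def _recursive_to(obj, device):
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, (list, tuple)):
        return type(obj)(_recursive_to(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: _recursive_to(v, device) for k, v in obj.items()}
    return obj  # custom types keep their tensors where they are
```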
ghstack-source-id: 114896677
Test Plan: Added unittest, and CI
Reviewed By: pritamdamania87
Differential Revision: D24296377
fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
Summary:
References https://github.com/pytorch/pytorch/issues/42515
> Enable integer -> float unary type promotion for ops like sin
Will follow up with other such ops once this PR is merged.
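For example, with this promotion an integer tensor no longer has to be cast manually before calling such an op (the resulting dtype shown is an assumption based on the default float dtype):
```python
import torch

t = torch.tensor([0, 1, 2])   # int64
print(torch.sin(t).dtype)     # expected: torch.float32 (promoted from int64)
```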
cc: mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45733
Reviewed By: zou3519
Differential Revision: D24431194
Pulled By: mruberry
fbshipit-source-id: db600bc5de0e535b538d2aa301c3526b7c75ed17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45318
When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.
Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp. This is because the type of the future value in that context is a pyobject rather than a TensorList, and casting it would require pybind, introducing a Python dependency that should not be allowed in the c10d library.
I considered creating a util function in a separate file to support this type casting and placing it under the torch/csrc directory, where a Python dependency is allowed. However, torch/csrc depends on c10d, so this would create a circular dependency.
Finally, a `record_stream_cb_` member is added to FutureNCCL with a default value of nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper`, where a Python dependency is allowed.
In addition, a few lines are reformatted by lint.
caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted.
Closes: https://github.com/pytorch/pytorch/issues/44203
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl
Reviewed By: pritamdamania87
Differential Revision: D23910257
fbshipit-source-id: 66920746c41f3a27a3689f22e2a2d9709d0faa15
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46501
Gradients in this method will not be modified.
ghstack-source-id: 114851646
Test Plan: waitforbuildbot
Reviewed By: pritamdamania87
Differential Revision: D24374300
fbshipit-source-id: a2941891008f9f197a5234b50260218932d2d37d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46601
* except excluded tests and magic methods.
https://github.com/pytorch/pytorch/issues/38731
Previously, we'd only run these tests for inplace operations. Since this runs a lot more tests, the following issues that came up were fixed:
- Updated the schema of conj() to reflect existing behaviour.
- Updated the deepEquals method in check_alias_annotation.cpp to reuse the overloaded == operator. The previous implementation did not cover all types of IValues.
- Corrected the order in which inputs are passed during autograd testing of 'view' & 'reshape'.
- Substituted `aten::ger` with the function it's aliased to, `aten::outer`, for testing, since the alias annotation checking code doesn't handle aliased operators properly.
ghstack-source-id: 114830903
Test Plan: Ran all tests in test:jit and verified they pass.
Reviewed By: eellison
Differential Revision: D24424955
fbshipit-source-id: 382d7e2585911b81b1573f21fff1d54a5e9a2054
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46657
This is used to simulate the fake quantize operation for ops with fixed quantization parameters, e.g. hardsigmoid.
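For illustration, a sketch of what fake quantization with fixed parameters looks like; the scale/zero_point values below are assumptions chosen for hardsigmoid's [0, 1] output range, not necessarily those used by the new module:
```python
import torch

x = torch.nn.functional.hardsigmoid(torch.randn(4))
scale, zero_point = 1.0 / 256.0, 0   # fixed, not derived from observers
y = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)
```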
Test Plan:
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D24451406
fbshipit-source-id: 26cc140c00f12bdec9a8f9dc880f4c425f4d4074
Summary:
There's some code which uses `six.PY3`, similar to:
```python
if six.PY3:
    print("Python 3+ code")
else:
    print "Python 2 code"
```
Where:
```python
PY3 = sys.version_info[0] == 3
```
When run on Python 4, this will run the Python 2 code! Instead, use `six.PY2` and avoid `six.PY3`.
---
Similarly, there are some `sys.version_info[0] == 3` checks, better done as `sys.version_info[0] >= 3`.
---
Also, it's better to avoid comparing the `sys.version` string, as it makes assumptions that each version component is exactly one character long, which will break in Python 3.10:
```pycon
>>> sys.version
'3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53) \n[Clang 6.0 (clang-600.0.57)]'
>>> sys.version < "3.3"
False
>>> fake_v3_10 = '3.10.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53) \n[Clang 6.0 (clang-600.0.57)]'
>>> fake_v3_10 < "3.3"
True
```
---
Finally, I think the intention here is to skip when the Python version is < 3.6:
```python
unittest.skipIf(sys.version_info[0] < 3 and sys.version_info[1] < 6, "dict not ordered")
```
However, it will really skip for Python 0.0-0.5, 1.0-1.5 and 2.0-2.5. It's best to compare to the `sys.version_info` tuple and not `sys.version_info[1]`:
```python
unittest.skipIf(sys.version_info < (3, 6), "dict not ordered")
```
---
Found using https://github.com/asottile/flake8-2020:
```console
$ pip install -U flake8-2020
$ flake8 --select YTT
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32389
Reviewed By: zou3519
Differential Revision: D24424662
Pulled By: ezyang
fbshipit-source-id: 1266c4dbcc8ae4d2e2e9b1d7357cba854562177c
Summary:
Fixes issues when building certain PyTorch extensions where the cpp files do NOT compile if flags such as `__HIP_NO_HALF_CONVERSIONS__` are defined.
cc jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46273
Reviewed By: zou3519
Differential Revision: D24422463
Pulled By: ezyang
fbshipit-source-id: 7a43d1f7d59c95589963532ef3bd3c68cb8262be
Summary:
This PR makes it possible to cast the parameters of nn.Module to complex dtypes.
The following code works with the proposed changes.
```python
In [1]: import torch
In [2]: lin = torch.nn.Linear(5, 1).to(torch.complex64)
In [3]: lin(torch.zeros(3, 5, dtype=torch.complex64))
Out[3]:
tensor([[-0.1739+0.j],
        [-0.1739+0.j],
        [-0.1739+0.j]], grad_fn=<AddmmBackward>)
```
Fixes https://github.com/pytorch/pytorch/issues/43477.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44788
Reviewed By: zou3519
Differential Revision: D24307225
Pulled By: anjali411
fbshipit-source-id: dacc4f5c8c9a99303f74d1f5d807cd657b3b69b5
Summary:
Resolves one item in https://github.com/pytorch/pytorch/issues/46321
This PR sets up DistExamplesTest, which will be used as the class to implement future tests for examples and which runs as part of CI. It also creates a dist_examples folder and includes the [batch server example](https://github.com/pytorch/examples/blob/master/distributed/rpc/batch/parameter_server.py), which is slightly modified so that it can be tested.
Run test:
pytest test/distributed/rpc/test_tensorpipe_agent.py -k test_batch_updating_parameter_server -vs
pytest test/distributed/rpc/test_process_group_agent.py -k test_batch_updating_parameter_server -vs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46510
Reviewed By: mrshenli
Differential Revision: D24379296
Pulled By: H-Huang
fbshipit-source-id: 1c102041e338b022b7a659a51894422addc0e06f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46573
Original commit changeset: 7dd709b585f8
ghstack-source-id: 114730143
Test Plan: Verified on CircleCI that the previously broken test is fixed.
Reviewed By: zdevito
Differential Revision: D24413096
fbshipit-source-id: 439568c631c4556b8ed6af20fcaa4b1375e554cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46356
Adding the flag `-Werror=cast-function-type` to ensure we don't allow
any invalid casts (ex: PyCFunction casts).
For more details see: https://github.com/pytorch/pytorch/issues/45419
ghstack-source-id: 114632980
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D24319759
fbshipit-source-id: 26ce4650c220e8e9dd3550245f214c7e6c21a5dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45538
This is used to simulate the fake quantize operation for ops with fixed quantization parameters, e.g. hardsigmoid.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D24004795
fbshipit-source-id: fc4797f80842daacd3b3584c5b72035774634edd