Summary:
- Clarify that `torch.distributed.autograd.backward()` does not use the current thread-local autograd context; instead, it looks the context up from the `context_id` passed in (see the sketch below)
- Clarify the same for `torch.distributed.optim.DistributedOptimizer.step()`
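A minimal runnable sketch of the documented behavior, assuming a single-process RPC setup (the worker name and port are illustrative, not part of this patch):
```
import os
import torch
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
rpc.init_rpc("worker0", rank=0, world_size=1)

t = torch.rand(2, 2, requires_grad=True)
with dist_autograd.context() as context_id:
    loss = t.sum()
    # The autograd context is looked up from context_id,
    # not from any thread-local state.
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)

rpc.shutdown()
```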
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34670
Differential Revision: D20427645
Pulled By: rohan-varma
fbshipit-source-id: a1a88de346cdd4dbe65fb2b7627157f86fd2b6a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33711
Fixes #33480
This makes `dist_autograd.backward` and `dist_optimizer.step` functional by requiring the user to explicitly pass in the `context_id`, rather than relying on the confusing thread-local context_id.
This diff incorporates these API changes and updates all call sites of these functions.
More concretely, this code:
```
with dist_autograd.context():
    # Forward pass.
    dist_autograd.backward([loss.sum()])
    dist_optim.step()
```
should now be written as follows:
```
with dist_autograd.context() as context_id:
    # Forward pass.
    dist_autograd.backward(context_id, [loss.sum()])
    dist_optim.step(context_id)
```
Test Plan: Ensuring all existing dist_autograd and dist_optimizer tests pass with the new API. Also added a new test case for input checking.
Differential Revision: D20011710
fbshipit-source-id: 216e12207934a2a79c7223332b97c558d89d4d65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31404
Multiple "trainers" could each create different instances of DistributedOptimizer, which means we can still have a race condition unless we do a trully global per worker lock.
ghstack-source-id: 95874624
Test Plan: run unit tests -- unfortunately, due to the non-deterministic behavior, it's not clear how to unit test this properly.
Differential Revision: D19154248
fbshipit-source-id: fab6286c17212f534f1bd1cbdf9f0de002d48c74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30062
This makes it possible to catch exceptions raised during optimizer creation.
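A hedged sketch of the idea, with illustrative helper names: creating each remote optimizer via `rpc_async` and waiting on the futures re-raises any constructor exception locally, unlike a fire-and-forget `rpc.remote()` call.
```
import torch.distributed.rpc as rpc

def _new_local_optimizer(optim_cls, param_rrefs, *args, **kwargs):
    # Runs on the worker that owns the parameters; any exception raised
    # here propagates back through the future below.
    return rpc.RRef(
        optim_cls([r.local_value() for r in param_rrefs], *args, **kwargs)
    )

def create_remote_optimizers(optim_cls, params_by_worker, *args, **kwargs):
    futures = [
        rpc.rpc_async(
            worker,
            _new_local_optimizer,
            args=(optim_cls, param_rrefs) + args,
            kwargs=kwargs,
        )
        for worker, param_rrefs in params_by_worker.items()
    ]
    # fut.wait() surfaces creation errors eagerly instead of on first use.
    return [fut.wait() for fut in futures]
```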
ghstack-source-id: 94232436
Test Plan: new unit test.
Differential Revision: D18586108
fbshipit-source-id: 71cfdf337fe803dbea8787b4c68e5a52b70a1f68
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29304
Implements a simple Python distributed optimizer that takes RRefs to the parameters to be optimized.
It keeps optimizer instances on the remote workers, and calling step() on the distributed optimizer calls step() on each of the remote optimizers in parallel.
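A hedged usage sketch matching the API as introduced here (a later commit, see above, changed `backward` and `step` to take an explicit `context_id`); it assumes RPC is already initialized and that `rref1` and `rref2` are RRefs to remote parameters requiring grad:
```
import torch.distributed.autograd as dist_autograd
from torch import optim
from torch.distributed.optim import DistributedOptimizer

# One SGD instance is created remotely on each worker that owns parameters.
dist_optim = DistributedOptimizer(optim.SGD, [rref1, rref2], lr=0.05)

with dist_autograd.context():
    loss = rref1.to_here().sum() + rref2.to_here().sum()
    dist_autograd.backward([loss])  # original API, before context_id existed
    dist_optim.step()  # runs step() on each remote optimizer in parallel
```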
ghstack-source-id: 93564364
Test Plan: unit tests.
Differential Revision: D18354586
fbshipit-source-id: 85d4c8bfec4aa38d2863cda704d024692511cff5