Commit Graph

8 Commits

Author SHA1 Message Date
Wanchao Liang
32c355af5b [dist_optim] introduce distributed functional optimizer (#45221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221

This PR introduces a distributed functional optimizer, so that the
distributed optimizer can reuse the functional optimizer APIs and
maintain its own states. This enables a TorchScript-compatible
functional optimizer when using the distributed optimizer, which helps
get rid of the GIL and improves overall training performance, especially
for distributed model parallel training.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935256

Pulled By: wanchaol

fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
2020-09-25 17:13:10 -07:00
Shen Li
f05abd1259 Fix example block format in Distributed Optimizer API doc (#34919)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34919

Test Plan: Imported from OSS

Differential Revision: D20500013

Pulled By: mrshenli

fbshipit-source-id: d28cbdd1ec207e1e8501ce389b7040fb764f12ca
2020-03-17 17:44:09 -07:00
Rohan Varma
f933fa3613 [docs][1.5] update RPC docs to reflect correct use of dist_autograd backwards and dist_optim step() (#34670)
Summary:
- Clarify that `torch.distributed.autograd.backward()` does not use the current thread-local autograd context; instead, it looks the context up based on the `context_id` passed in
- Clarify the same for `torch.distributed.optim.DistributedOptimizer.step()` (see the sketch below)
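As a self-contained illustration of the documented behavior, the following sketch passes the `context_id` explicitly to both calls. The single-process RPC setup and the parameter values are assumptions for demonstration only; a real job would span multiple workers.

```
import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.distributed.optim import DistributedOptimizer

# Single-process RPC setup, assumed purely for demonstration.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker0", rank=0, world_size=1)

param = torch.randn(4, requires_grad=True)
param_rref = rpc.RRef(param)  # RRef to a (here: local) parameter
dist_optim = DistributedOptimizer(torch.optim.SGD, [param_rref], lr=0.05)

with dist_autograd.context() as context_id:
    loss = (param * 2).sum()
    # Both calls take the context_id explicitly; neither relies on a
    # thread-local autograd context.
    dist_autograd.backward(context_id, [loss])
    dist_optim.step(context_id)

rpc.shutdown()
```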
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34670

Differential Revision: D20427645

Pulled By: rohan-varma

fbshipit-source-id: a1a88de346cdd4dbe65fb2b7627157f86fd2b6a3
2020-03-13 14:09:23 -07:00
Omkar Salpekar
24dd800e6a [Dist Autograd] Functional API for Dist Autograd and Dist Optimizer (#33711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33711

Fixed #33480

This makes `dist_autograd.backward` and `dist_optimizer.step` functional by requiring the user to explicitly pass in the `context_id`, as opposed to relying on the confusing thread-local context_id.

This diff incorporates these API changes and updates all places where these functions are called.

More concretely, this code:

```
with dist_autograd.context():
    # Forward pass.
    dist_autograd.backward([loss.sum()])
    dist_optim.step()
```

should now be written as follows:

```
with dist_autograd.context() as context_id:
    # Forward pass.
    dist_autograd.backward(context_id, [loss.sum()])
    dist_optim.step(context_id)
```

Test Plan: Ensuring all existing dist_autograd and dist_optimizer tests pass with the new API. Also added a new test case for input checking.

Differential Revision: D20011710

fbshipit-source-id: 216e12207934a2a79c7223332b97c558d89d4d65
2020-02-26 19:08:28 -08:00
Pritam Damania
359c39b3c2 Use global lock instead of per instance lock. (#31404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31404

Multiple "trainers" could each create different instances of DistributedOptimizer, which means we can still have a race condition unless we do a trully global per worker lock.
ghstack-source-id: 95874624
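A minimal sketch of the difference, using illustrative names rather than the actual PyTorch internals: a per-instance lock only serializes calls on one optimizer object, while a module-level lock serializes the critical section across every instance on the worker.

```
import threading

# Module-level lock: shared by all optimizer instances on this worker, so
# concurrent calls from different trainers are serialized.
_global_lock = threading.Lock()

class LocalOptimizerSketch:
    def __init__(self):
        # A per-instance lock (self._lock = threading.Lock()) would NOT
        # protect state shared across instances; the module-level lock does.
        with _global_lock:
            self.step_count = 0  # stand-in for non-thread-safe setup work

    def step(self):
        with _global_lock:
            self.step_count += 1  # stand-in for the real optimizer step
```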

Test Plan: run unit tests -- unfortunately, due to the non-deterministic behavior, it's not clear how to unit test this properly.

Differential Revision: D19154248

fbshipit-source-id: fab6286c17212f534f1bd1cbdf9f0de002d48c74
2019-12-18 09:22:54 -08:00
Alisson Gusatti Azzolini
07e14c7cd0 DistributedOptimizer: wait for all workers to finish _LocalOptimizer constructor (#30062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30062

This allows us to catch exceptions raised during optimizer creation.
ghstack-source-id: 94232436
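The pattern, sketched with illustrative helper names (the real constructor goes through helpers internal to DistributedOptimizer and holds the remote optimizers via RRefs): launch the remote constructors asynchronously, then wait on every returned future so that an exception raised on any worker propagates to the caller at construction time instead of at the first step().

```
import torch.distributed.rpc as rpc

def _build_optimizer(optim_cls, *args, **kwargs):
    # Runs on the remote worker; any exception raised here is captured in
    # the future returned by rpc_async and re-raised at wait().
    return optim_cls(*args, **kwargs)

def create_remote_optimizers(worker_names, optim_cls, *args, **kwargs):
    # Fire off the constructor on every worker without blocking ...
    futures = [
        rpc.rpc_async(name, _build_optimizer, args=(optim_cls,) + args, kwargs=kwargs)
        for name in worker_names
    ]
    # ... then wait for all of them, so a failing constructor surfaces here.
    return [fut.wait() for fut in futures]
```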

Test Plan: new unit test.

Differential Revision: D18586108

fbshipit-source-id: 71cfdf337fe803dbea8787b4c68e5a52b70a1f68
2019-11-19 18:30:00 -08:00
Pritam Damania
5d69bc1eda Add docs for distributed optimizer. (#29971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29971

ghstack-source-id: 94132160

Test Plan: waitforbuildbot

Differential Revision: D18554631

fbshipit-source-id: c4485f7cff5159f423d0f35d1caf71074b62dc28
2019-11-18 18:51:26 -08:00
Alisson Gusatti Azzolini
b0cf43b2dd Simple distributed optimizer (#29304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29304

Implements a simple Python distributed optimizer that takes RRefs to the parameters that will be optimized.
It keeps optimizer instances on the remote workers, and calling step on the distributed optimizer calls step on each of the remote optimizers in parallel.
ghstack-source-id: 93564364
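A hedged sketch of that structure, with illustrative names and simplifications (the real `DistributedOptimizer` fetches gradients through distributed autograd, which is omitted here): parameter RRefs are grouped by owning worker, one local optimizer is created per owner, and step fans out to all of them in parallel.

```
import torch.distributed.rpc as rpc

def _create_local_optim(optim_cls, param_rrefs, *args, **kwargs):
    # Runs on the owning worker: build a regular optimizer over the
    # locally held parameters.
    params = [rref.local_value() for rref in param_rrefs]
    return optim_cls(params, *args, **kwargs)

def _local_step(optim_rref):
    # Runs on the owning worker: one local optimizer step.
    optim_rref.local_value().step()

class DistributedOptimizerSketch:
    def __init__(self, optim_cls, param_rrefs, *args, **kwargs):
        # Group parameter RRefs by the worker that owns them.
        by_owner = {}
        for rref in param_rrefs:
            by_owner.setdefault(rref.owner().name, []).append(rref)
        # One remote local optimizer per owning worker, held via an RRef.
        self.remote_optims = [
            rpc.remote(owner, _create_local_optim,
                       args=(optim_cls, rrefs) + args, kwargs=kwargs)
            for owner, rrefs in by_owner.items()
        ]

    def step(self):
        # Call step on every remote optimizer in parallel, then wait.
        futs = [
            rpc.rpc_async(opt.owner(), _local_step, args=(opt,))
            for opt in self.remote_optims
        ]
        for fut in futs:
            fut.wait()
```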

Test Plan: unit tests.

Differential Revision: D18354586

fbshipit-source-id: 85d4c8bfec4aa38d2863cda704d024692511cff5
2019-11-11 12:02:24 -08:00