Commit Graph

9 Commits

Author SHA1 Message Date
Sergii Dymchenko
d083b44818 Remove unused rank from _AllGatherBase backward (#81515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81515
Approved by: https://github.com/mrshenli
2022-07-15 15:30:07 +00:00
pritam
500fb24715 Ensure tensors are contiguous in functional all_gather.
We called `tensor.contiguous()` in the forward pass; however, this was
after `out_tensor_list` was built, which left `out_tensor_list`
containing non-contiguous tensors and caused errors.

Fix this by moving the `contiguous()` call above the construction of `out_tensor_list`.
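A minimal single-process sketch of the ordering fix; `functional_all_gather` and its `world_size` argument are illustrative stand-ins, not the real c10d API:

```python
import torch

def functional_all_gather(tensor, world_size):
    # Make the input contiguous BEFORE allocating out_tensor_list, so
    # every entry inherits the contiguous layout.
    tensor = tensor.contiguous()
    out_tensor_list = [torch.empty_like(tensor) for _ in range(world_size)]
    # dist.all_gather(out_tensor_list, tensor) would run here in real code.
    return out_tensor_list

t = torch.arange(6.0).reshape(2, 3).t()  # transposed view: non-contiguous
assert not t.is_contiguous()
outs = functional_all_gather(t, world_size=2)
assert all(o.is_contiguous() for o in outs)
```

Calling `contiguous()` after building the list would leave each `empty_like` output mirroring the non-contiguous layout, which is what the bug was.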

Differential Revision: [D37222870](https://our.internmc.facebook.com/intern/diff/D37222870/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79747

Approved by: https://github.com/fduwjj, https://github.com/wanchaol
2022-06-17 01:27:11 +00:00
pritam
b9e3d722c4 Use appropriate dtype for sharded linear implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79255

We use several collective operations in our sharded linear
implementation, and for many collectives we do not set the `dtype` of the
output tensor appropriately. As a result, using a dtype such as `torch.float16`
(rather than the default `torch.float32`) results in errors.

Fixing this across the board and adding appropriate tests.
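A hedged sketch of the fix; `alloc_gather_output` is a hypothetical helper, but it shows the idea: allocate collective output tensors from the input's dtype and device instead of the `torch.float32` default:

```python
import torch

def alloc_gather_output(inp, world_size):
    # empty_like propagates the input's dtype/device to each output,
    # so float16 inputs get float16 outputs.
    return [torch.empty_like(inp) for _ in range(world_size)]

x = torch.ones(4, 8, dtype=torch.float16)
outs = alloc_gather_output(x, world_size=2)
assert all(o.dtype == torch.float16 for o in outs)
```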

Differential Revision: [D37059752](https://our.internmc.facebook.com/intern/diff/D37059752/)

Approved by: https://github.com/fduwjj, https://github.com/wanchaol
2022-06-10 07:32:15 +00:00
pritam
44aa4ad894 Use _all_gather_base and fuse matmul for sharded linear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78477

Use `_all_gather_base` instead of `all_gather` for column-wise sharding,
since `_all_gather_base` returns a single fused tensor that can be used to
perform one matmul instead of looping through the gathered shards and
performing multiple matmuls.

This improves performance for col-wise sharding.
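A single-process sketch of why the fused gather helps (shapes are illustrative, and `torch.cat` stands in for the collectives): with `all_gather` you get a list of shards and run one matmul per shard, while `_all_gather_base` hands back one flat tensor, so a single matmul covers all shards:

```python
import torch

world_size, n, in_features, out_features = 4, 2, 8, 3
shards = [torch.randn(n, in_features) for _ in range(world_size)]
weight_t = torch.randn(in_features, out_features)

# Looped path (all_gather-style): one matmul per gathered shard.
looped = torch.cat([s @ weight_t for s in shards], dim=0)

# Fused path (_all_gather_base-style): one flat tensor, one matmul.
fused_input = torch.cat(shards, dim=0)  # stand-in for the fused output
fused = fused_input @ weight_t

assert torch.allclose(looped, fused)
```

Both paths compute the same result; the fused path simply replaces a Python loop of small matmuls with one larger, better-utilized matmul.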

Differential Revision: [D36754385](https://our.internmc.facebook.com/intern/diff/D36754385/)

Approved by: https://github.com/aazzolini, https://github.com/wanchaol
2022-06-01 17:17:34 +00:00
Alban Desmaison
da3c848dfa Make distributed raise ImportError when not available
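A short sketch of the guard this enables for callers (the branch structure is illustrative): builds compiled without distributed support now surface an `ImportError` up front instead of failing later, and `torch.distributed.is_available()` lets code check first:

```python
import torch

# On builds without distributed support, importing distributed-only
# modules raises ImportError after this change; check availability first.
if torch.distributed.is_available():
    import torch.distributed as dist
else:
    dist = None  # distributed APIs are unusable on this build
```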
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75975

Approved by: https://github.com/mrshenli
2022-04-20 13:05:18 +00:00
Sherlock Huang
752ab799bf Support noncontiguous inputs for torch.distributed.nn.functional.all_gather/reducescatter/gather
Fixes #73515

The backward for AllGather is ReduceScatter. I am wondering whether there is a deeper reason why it is currently implemented as All2All with an explicit sum.

ReduceScatter also has a lower communication payload than All2All.

In addition, `dist.reduce_scatter` accepts a non-contiguous `input_tensor_list`.
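A pure-Python, single-process simulation of the two ways to compute AllGather's backward (the values and helper structure are made up for illustration):

```python
world_size = 3
# grads[r][k]: gradient on rank r w.r.t. the k-th gathered slice.
grads = [[(r + 1) * (k + 1) for k in range(world_size)]
         for r in range(world_size)]

# Old path -- all_to_all then explicit local sum: rank k receives
# grads[src][k] from every src, then sums the chunks itself.
received = [[grads[src][k] for src in range(world_size)]
            for k in range(world_size)]
a2a_result = [sum(chunks) for chunks in received]

# New path -- reduce_scatter: the sum happens inside the collective and
# only the reduced slice is delivered, so less data crosses the wire.
rs_result = [sum(grads[src][k] for src in range(world_size))
             for k in range(world_size)]

assert a2a_result == rs_result
```

Both paths yield the same per-rank gradient; ReduceScatter just moves one reduced slice per rank instead of `world_size` unreduced chunks.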

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75276
Approved by: https://github.com/H-Huang
2022-04-15 02:35:45 +00:00
Junjie Wang
7c2489bdae [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68786

To enable autograd for the sharded linear, we found we needed to make some changes to the current NN functional API (the c10d API with autograd enabled), so we made the following changes:

1. Add a new API, `reduce_scatter`, since we need it for row-wise sharding.
2. Modify the `all_to_all` API to make sure it is consistent with the one in `distributed_c10d.py`.
3. Found that the C++ signature of `reduce_scatter` was missing an input parameter; added more unit tests to cover these cases.
4. Sync the NN tests from Gloo to NCCL.
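An illustrative single-process model of the duality the differentiable API relies on, with made-up values and helper names (this is not the c10d implementation): `reduce_scatter`'s backward is `all_gather`, and vice versa:

```python
WORLD_SIZE = 2

def reduce_scatter(inputs_per_rank):
    # inputs_per_rank[r][k] is the slice rank r contributes toward rank k;
    # rank k receives the sum of its column.
    return [sum(inputs_per_rank[r][k] for r in range(WORLD_SIZE))
            for k in range(WORLD_SIZE)]

def all_gather(per_rank_values):
    # Every rank ends up holding the full list of per-rank values.
    return [list(per_rank_values) for _ in range(WORLD_SIZE)]

out = reduce_scatter([[1, 2], [3, 4]])  # rank k receives the k-th sum
assert out == [4, 6]
assert all_gather(out) == [[4, 6], [4, 6]]
```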
ghstack-source-id: 144860208

Test Plan: CI + Unit Test

Reviewed By: pritamdamania87

Differential Revision: D32569674

fbshipit-source-id: 9bd613f91bbf7a39eede0af32a5a5db0f2ade43b
2021-12-06 13:38:58 -08:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Emilio Castillo
233e4ebdb6 Implement autograd functions for c10d communication operations (#40762)
Summary:
Closes https://github.com/pytorch/pytorch/issues/40702, Fixes https://github.com/pytorch/pytorch/issues/40690

Currently WIP, but I would appreciate some feedback. Functions should be double-differentiable.

Unlike b35cdc5200/torch/nn/parallel/_functions.py, this PR generates a list of tensors instead of aggregating the received data into a single tensor. Is this behavior correct?

Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40762

Reviewed By: glaringlee

Differential Revision: D24758889

Pulled By: mrshenli

fbshipit-source-id: 79285fb4b791cae3d248f34e2aadb11c9ab10cce
2021-01-26 07:52:51 -08:00