12 Commits

Author SHA1 Message Date
Yifan Xiong
c7eaec86f0 [NCCL] Patch bfloat16 support (#67843)
Summary:
Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 added bfloat16 support, but it is
not yet complete enough to enable bfloat16 for allreduce in end-to-end training.

This patch does the following:
* fix the minimum NCCL version from 2.9.7 to 2.10; NCCL added bf16 support in
  v2.10.3-1 (commit 7e51592)
* update bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
  operations like all reduce can use it
* enable unit tests for bfloat16 datatype if possible
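
The version gate described in the first bullet can be illustrated with a minimal Python sketch. The constant and function names below are hypothetical and are not taken from the patch; per the message, the actual change is in `csrc/cuda/nccl.cpp`. The sketch only assumes that the NCCL version is available as a `(major, minor, ...)` tuple.

```python
from typing import Tuple

# Hypothetical constant: the minimum NCCL version named in the commit message.
MIN_NCCL_FOR_BF16 = (2, 10)


def nccl_supports_bfloat16(version: Tuple[int, ...]) -> bool:
    """Return True if a (major, minor, ...) NCCL version is at least 2.10."""
    # Compare only major and minor, so (2, 10, 3) and (2, 10) both qualify.
    return tuple(version[:2]) >= MIN_NCCL_FOR_BF16
```

A test could use such a predicate to decide whether to include bfloat16 in its list of datatypes, which is one way to read "enable unit tests for bfloat16 datatype if possible".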

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843

Reviewed By: H-Huang

Differential Revision: D32248132

Pulled By: mrshenli

fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
2021-11-09 13:46:13 -08:00
Jane Xu
34051d74da Add test owner to distributed files starting with test_ (#66797)
Summary:
Action based on https://github.com/pytorch/pytorch/issues/66232

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797

Reviewed By: gchanan

Differential Revision: D31761389

Pulled By: janeyx99

fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c
2021-10-19 10:55:20 -07:00
Pritam Damania
82d81455ae [2/N] Remove unittest.skip across all of torch.distributed. (#61887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887

1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests are simply reported as passing on sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`

The overall goal is to avoid using skips, since sandcastle tags these tests as
continuously skipping.
ghstack-source-id: 134382237
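
The idea behind such a decorator can be sketched as follows. This is an illustration under stated assumptions, not the implementation from this PR: the name `skip_or_pass_if` and the explicit `on_sandcastle` flag are hypothetical, standing in for however the real `sandcastle_skip_if` detects its environment.

```python
import functools
import sys
import unittest


def skip_or_pass_if(condition, reason, on_sandcastle):
    """Skip a test when `condition` holds, or turn it into a passing no-op.

    `on_sandcastle` is a hypothetical flag for "running in the CI system
    that flags continuously skipped tests".
    """
    def decorator(func):
        if not condition:
            return func  # nothing to do, run the test normally
        if on_sandcastle:
            @functools.wraps(func)
            def passing_stub(*args, **kwargs):
                # The test body is not executed; the test simply passes.
                print(f"Skipping {func.__name__}: {reason}", file=sys.stderr)
            return passing_stub
        # Elsewhere, fall back to a regular unittest skip.
        return unittest.skip(reason)(func)
    return decorator
```

With this shape, a test file only changes its decorator, while the skip-versus-pass decision stays in one place.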

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29784152

fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
2021-07-27 10:53:23 -07:00
Pritam Damania
306eb3def7 Additional error checking for torch.cuda.nccl APIs. (#43247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43247

`torch.cuda.nccl` APIs didn't throw appropriate errors when called
with inputs/outputs of the wrong type, which resulted in some cryptic
errors instead.

Adding some error checks with explicit error messages for these APIs.
ghstack-source-id: 110683546
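
As an illustration of the kind of check described above, here is a minimal sketch. The helper name, its parameters, and the message wording are hypothetical and are not the checks added by this PR; it only shows the pattern of validating argument types up front and raising a `TypeError` with an explicit message.

```python
from collections.abc import Sequence


def check_collection_of(values, element_type, name):
    """Raise a descriptive TypeError unless `values` is a sequence of `element_type`.

    Hypothetical helper: a real check for `torch.cuda.nccl` would pass the
    tensor type as `element_type`.
    """
    # A bare string is a Sequence too, but it is never a valid collection here.
    if not isinstance(values, Sequence) or isinstance(values, (str, bytes)):
        raise TypeError(
            f"{name} should be a collection of {element_type.__name__}, "
            f"got {type(values).__name__}"
        )
    for index, value in enumerate(values):
        if not isinstance(value, element_type):
            raise TypeError(
                f"{name}[{index}] should be {element_type.__name__}, "
                f"got {type(value).__name__}"
            )
```

Failing early with the argument name and the offending type is what replaces the cryptic downstream error.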

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23206069

fbshipit-source-id: 8107b39d27f4b7c921aa238ef37c051a9ef4d65b
2020-08-26 13:50:00 -07:00
Pritam Damania
872237c1f2 Output to stderr in distributed tests. (#42139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139

A bunch of tests were failing with buck since we would output to
stdout and buck would fail parsing stdout in some cases.

Moving these print statements to stderr fixes this issue.
ghstack-source-id: 108606579
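
A minimal sketch of the change in style, assuming a hypothetical helper named `log_progress` (the tests touched by this PR may simply pass `file=sys.stderr` at each call site):

```python
import sys


def log_progress(message):
    """Write a diagnostic line to stderr, leaving stdout untouched."""
    # stdout stays clean for any tool that parses it; diagnostics go to stderr.
    print(message, file=sys.stderr)
```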

Test Plan: Run the offending unit tests.

Reviewed By: mrshenli

Differential Revision: D22779135

fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
2020-07-29 19:23:34 -07:00
Sandeep Kumar Pani
e179966248 [caffe2][tpx] log to stderr (#42162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42162

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22791440

fbshipit-source-id: 14f16cd7a94a57161c5724177b518527f486232d
2020-07-28 07:50:27 -07:00
Mike Ruberry
6bd88f581a Revert D22790238: [caffe2][tpx] Use logger instead of print
Test Plan: revert-hammer

Differential Revision:
D22790238 (3c6fae6567)

Original commit changeset: c0a801cdf7f0

fbshipit-source-id: cadfbd22f7d3ce656624483c9a19062f7c9a5b61
2020-07-28 06:11:30 -07:00
Sandeep Kumar Pani
3c6fae6567 [caffe2][tpx] Use logger instead of print
Test Plan: CI?

Differential Revision: D22790238

fbshipit-source-id: c0a801cdf7f0da489c67708a0eb1b498ff104c64
2020-07-28 04:26:51 -07:00
Jithun Nair
eea535742f Add bfloat16 support for nccl path (#38515)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38515

Differential Revision: D22420896

Pulled By: ezyang

fbshipit-source-id: 80d2d0c2052c91c9035e1e025ebb14e210cb0100
2020-07-07 18:07:06 -07:00
Jithun Nair
545a3e1eca Remove test_nccl from ROCM_BLACKLIST and enable only a couple of test_nccl tests (#39354)
Summary:
All individual test_nccl unit tests were disabled for ROCm in bf9395438f.
test_nccl was also added to the ROCM_BLACKLIST in 87b198d309.
However, the issue only arises when running the test_nccl suite as a whole (as opposed to any one test individually). More details in comments here: https://github.com/pytorch/pytorch/pull/38689

This PR enables the test_nccl suite with only two tests so as to work around the as-yet unresolved issue above, while allowing at least one test_nccl collective test to run on ROCm. This is also needed as a precursor for: https://github.com/pytorch/pytorch/pull/38515
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39354

Differential Revision: D21843194

Pulled By: mrshenli

fbshipit-source-id: b28d1e073d8d0fdc1b59928fc3b00187cfd02a35
2020-06-05 13:52:23 -07:00
Pritam Damania
bf9395438f Disable test_nccl for ROCm (#38801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38801

NCCL-specific tests that shouldn't be run on ROCm
ghstack-source-id: 104481245

Test Plan: waitforbuildbot

Differential Revision: D21667348

fbshipit-source-id: a3e558185d9b74e1eac5fae27d97d5d026baa0a1
2020-05-21 11:15:08 -07:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00