pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Yifan Xiong	c7eaec86f0	[NCCL] Patch bfloat16 support (#67843) Summary: Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is not yet complete enough to enable bfloat16 for allreduce in end-to-end training. This patch does the following: * fix minimum NCCL version from 2.9.7 to 2.10; NCCL adds bf16 support in v2.10.3-1 (commit 7e51592) * update bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL operations like all reduce can use it * enable unit tests for bfloat16 datatype if possible cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843 Reviewed By: H-Huang Differential Revision: D32248132 Pulled By: mrshenli fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525	2021-11-09 13:46:13 -08:00
Jane Xu	34051d74da	Add test owner to distributed files starting with test_ (#66797) Summary: Action based on https://github.com/pytorch/pytorch/issues/66232 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797 Reviewed By: gchanan Differential Revision: D31761389 Pulled By: janeyx99 fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c	2021-10-19 10:55:20 -07:00
Pritam Damania	82d81455ae	[2/N] Remove unittest.skip across all of torch.distributed. (#61887) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887 1) Introduced a `sandcastle_skip_if` decorator that ensures these tests just get passed on sandcastle. 2) Fixed all test files under `test/distributed` to not use `unittest.skip` Overall goal is to avoid using skips since sandcastle tags these tests as continuously skipping. ghstack-source-id: 134382237 Test Plan: waitforbuildbot Reviewed By: SciPioneer Differential Revision: D29784152 fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d	2021-07-27 10:53:23 -07:00
Pritam Damania	306eb3def7	Additional error checking for `torch.cuda.nccl` APIs. (#43247) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43247 `torch.cuda.nccl` APIs didn't throw appropriate errors when called with inputs/outputs of the wrong type, which resulted in cryptic errors instead. Adding some error checks with explicit error messages for these APIs. ghstack-source-id: 110683546 Test Plan: waitforbuildbot Reviewed By: rohan-varma Differential Revision: D23206069 fbshipit-source-id: 8107b39d27f4b7c921aa238ef37c051a9ef4d65b	2020-08-26 13:50:00 -07:00
Pritam Damania	872237c1f2	Output to stderr in distributed tests. (#42139) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139 A bunch of tests were failing with buck because they printed to stdout, which buck failed to parse in some cases. Moving these print statements to stderr fixes this issue. ghstack-source-id: 108606579 Test Plan: Run the offending unit tests. Reviewed By: mrshenli Differential Revision: D22779135 fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad	2020-07-29 19:23:34 -07:00
Sandeep Kumar Pani	e179966248	[caffe2][tpx] log to stderr (#42162) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42162 Test Plan: CI Reviewed By: mrshenli Differential Revision: D22791440 fbshipit-source-id: 14f16cd7a94a57161c5724177b518527f486232d	2020-07-28 07:50:27 -07:00
Mike Ruberry	6bd88f581a	Revert D22790238: [caffe2][tpx] Use logger instead of print Test Plan: revert-hammer Differential Revision: D22790238 (`3c6fae6567`) Original commit changeset: c0a801cdf7f0 fbshipit-source-id: cadfbd22f7d3ce656624483c9a19062f7c9a5b61	2020-07-28 06:11:30 -07:00
Sandeep Kumar Pani	3c6fae6567	[caffe2][tpx] Use logger instead of print Test Plan: CI? Differential Revision: D22790238 fbshipit-source-id: c0a801cdf7f0da489c67708a0eb1b498ff104c64	2020-07-28 04:26:51 -07:00
Jithun Nair	eea535742f	Add bfloat16 support for nccl path (#38515) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38515 Differential Revision: D22420896 Pulled By: ezyang fbshipit-source-id: 80d2d0c2052c91c9035e1e025ebb14e210cb0100	2020-07-07 18:07:06 -07:00
Jithun Nair	545a3e1eca	Remove test_nccl from ROCM_BLACKLIST and enable only a couple of test_nccl tests (#39354) Summary: All individual test_nccl unit tests have been disabled for ROCm in `bf9395438f`. test_nccl was also added to the ROCM_BLACKLIST in `87b198d309`. However, the issue only arises when running the test_nccl suite as a whole (as opposed to any one test individually). More details in comments here: https://github.com/pytorch/pytorch/pull/38689 This PR enables the test_nccl suite with only two tests so as to work around the as-yet unresolved issue above, while allowing at least one test_nccl collective test to run on ROCm. This is also needed as a precursor for: https://github.com/pytorch/pytorch/pull/38515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39354 Differential Revision: D21843194 Pulled By: mrshenli fbshipit-source-id: b28d1e073d8d0fdc1b59928fc3b00187cfd02a35	2020-06-05 13:52:23 -07:00
Pritam Damania	bf9395438f	Disable test_nccl for ROCm (#38801) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38801 NCCL specific tests that shouldn't be run on ROCm ghstack-source-id: 104481245 Test Plan: waitforbuildbot Differential Revision: D21667348 fbshipit-source-id: a3e558185d9b74e1eac5fae27d97d5d026baa0a1	2020-05-21 11:15:08 -07:00
Pritam Damania	f050b16dd9	Move pytorch distributed tests to separate folder for contbuild. (#30445) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445 Create distributed and rpc directories under caffe/test for better management of unit tests. Differential Revision: D18702786 fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606	2020-01-22 21:16:59 -08:00

12 Commits