5 Commits

Author SHA1 Message Date
Teng Li
b4bc55beef TCP init method race condition fix (#15684)
Summary:
This PR fixes a race condition in the TCP init method: the master rank can exit earlier than the slave ranks, so the TCP daemon thread gets shut down before the other ranks are able to access it.

With this change, every rank (process) writes a special key to the store to mark that it has completed (and is thus about to exit). The master rank (which is the server) always waits until all ranks have completed before completing itself.
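The handshake described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ToyStore` is an in-memory stand-in for the real store, threads stand in for processes, and the `done/<rank>` key name is a hypothetical choice.

```python
import threading
import time


class ToyStore:
    """In-memory stand-in for a key-value store (assumption: not the real TCP store)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def wait(self, keys, timeout=5.0):
        """Block until every key in `keys` is present, or raise on timeout."""
        with self._cond:
            ok = self._cond.wait_for(
                lambda: all(k in self._data for k in keys), timeout=timeout
            )
        if not ok:
            raise TimeoutError("timed out waiting for completion keys")


def run_rank(store, rank, world_size, log):
    if rank != 0:
        time.sleep(0.01 * rank)      # simulate ranks that finish after the master
        log.append(rank)             # last piece of work done by this rank
    store.set(f"done/{rank}", b"1")  # hypothetical completion key
    if rank == 0:
        # The master hosts the store, so it waits for every rank's key
        # before it is allowed to shut down.
        store.wait([f"done/{r}" for r in range(world_size)])
        log.append(0)


def simulate(world_size=4):
    store, log = ToyStore(), []
    threads = [
        threading.Thread(target=run_rank, args=(store, r, world_size, log))
        for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In this sketch the master always appears last in the returned log, even though it reaches the end of its own work first.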

This should fix: https://github.com/pytorch/pytorch/issues/15638

Tested using the repro from https://github.com/pytorch/pytorch/issues/15638 and it works fine. test_distributed and test_c10d should also already cover this.

I had to set the world size of the rendezvous test in c10d to 1, since it is single-process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684

Differential Revision: D13570904

Pulled By: teng-li

fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
2019-01-18 02:29:38 -08:00
Teng Li
0d3cb91d8c Make env init_method support both env and args for rank and size (#14494)
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446

This was supported behavior in the old torch.distributed. We want to support it in the new release.

The tests should cover all combinations of scenarios in which either the env or the arg is set for rank, size, or both.
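As an illustration of the behavior being tested, a lookup that accepts either source might look like the sketch below. The helper name `resolve_setting`, the use of `None` for "not passed", and the rule that an explicit argument wins over the environment are all assumptions made for this example, not the actual implementation.

```python
import os


def resolve_setting(name, arg_value=None, environ=None):
    """Return the explicit argument if given, else the environment variable.

    Hypothetical helper: `name` is e.g. "RANK" or "WORLD_SIZE".
    """
    environ = os.environ if environ is None else environ
    if arg_value is not None:
        return int(arg_value)        # assumption: the argument takes precedence
    if name in environ:
        return int(environ[name])
    raise ValueError(
        f"{name} was not passed as an argument and is not set in the environment"
    )
```

For example, `resolve_setting("RANK", None, {"RANK": "3"})` falls back to the environment, while `resolve_setting("WORLD_SIZE", 4, {"WORLD_SIZE": "8"})` uses the argument.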
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494

Differential Revision: D13253433

Pulled By: teng-li

fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
2018-11-29 18:48:20 -08:00
Teng Li
97036d3c30 FileStore auto deletes file and FileStore::add bug fix (#13708)
Summary:
This addresses: https://github.com/pytorch/pytorch/issues/11874

With this change, the file init_method behaves identically to the previous THD file init.

Also the FileStore::add bug is pretty annoying.

Two bugs:
(1) Add doesn't append to the end of the file.
(2) Cache doesn't get updated.

Both are fixed and covered by tests.
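To make the two bugs concrete, here is a toy file-backed counter in Python. It is only a sketch: the real FileStore is C++, and the `key=value` record format, the class name `ToyFileStore`, and the lack of file locking are simplifications assumed for this example.

```python
import os


class ToyFileStore:
    """Toy append-only store; assumption: one 'key=value' record per line."""

    def __init__(self, path):
        self.path = path
        self.cache = {}

    def _refresh(self):
        # Re-read the file so records written by other instances are seen.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("=", 1)
                self.cache[key] = value  # later records override earlier ones

    def add(self, key, amount):
        self._refresh()
        new_value = int(self.cache.get(key, "0")) + amount
        # (1) Open in append mode so the record goes to the end of the file.
        with open(self.path, "a") as f:
            f.write(f"{key}={new_value}\n")
        # (2) Keep the local cache in sync with what was just written.
        self.cache[key] = str(new_value)
        return new_value

    def get(self, key):
        self._refresh()
        return self.cache[key]
```

Two instances pointing at the same path then see each other's increments, which is the property that both bugs would break.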

I examined /tmp to ensure that all temp files are automatically deleted after running test_c10d.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13708

Reviewed By: pietern

Differential Revision: D12972810

Pulled By: teng-li

fbshipit-source-id: 917255390aa52845f6b0ad0f283875a7a704da48
2018-11-14 01:34:22 -08:00
Pieter Noordhuis
52472508e9 Add env:// rendezvous test (#11782)
Summary:
A missing environment variable raised a missing key error. Now it
raises a more descriptive error of the actual problem, for example:

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
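A check that produces a message of this shape could be sketched as follows. The helper name `require_env` and its signature are hypothetical; only the message text is taken from the example above.

```python
import os


def require_env(name, environ=None):
    """Return the value of `name`, or raise a descriptive ValueError.

    Hypothetical helper; replaces the bare KeyError that a plain
    `os.environ[name]` lookup would raise.
    """
    environ = os.environ if environ is None else environ
    if name not in environ:
        raise ValueError(
            "Error initializing torch.distributed using env:// rendezvous: "
            f"environment variable {name} expected, but not set"
        )
    return environ[name]
```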

Pull Request resolved: https://github.com/pytorch/pytorch/pull/11782

Differential Revision: D9888962

Pulled By: pietern

fbshipit-source-id: 5947e7a7bf7aa45f13bbd7b5e997529f26cc92d6
2018-09-19 09:56:06 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use the c10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00