
13 Commits

Author SHA1 Message Date
Omkar Salpekar
6b65b3cbd8 [Distributed] DeleteKey API for c10d TCP Store (#45401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.

Reviewed By: mrshenli

Differential Revision: D23955730

fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
2020-09-28 15:30:39 -07:00
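
For illustration only, the test plan in the entry above (delete a key, then check the key count) can be mirrored with a toy in-memory store. This is a minimal sketch, not the c10d TCPStore: the class `ToyStore`, its method names, and the boolean return value of `delete_key` are all assumptions made up for this example.

```python
class ToyStore:
    """Hypothetical in-memory key-value store (NOT the c10d TCPStore)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete_key(self, key):
        """Remove key; return True if it existed (assumed semantics)."""
        if key in self._data:
            del self._data[key]
            return True
        return False

    def num_keys(self):
        """Number of keys currently held by the store."""
        return len(self._data)


# Same shape as the described test: set, delete, then check the count.
store = ToyStore()
store.set("key0", b"value0")
store.set("key1", b"value1")
count_before = store.num_keys()
deleted_first = store.delete_key("key0")
deleted_again = store.delete_key("key0")  # key is already gone
count_after = store.num_keys()
```

The real store runs over TCP between processes and may count keys differently; the sketch only shows the bookkeeping that such a test would check.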
Natalia Gimelshein
78caa028b6 Revert D23009117: [Distributed] DeleteKey API for c10d TCP Store
Test Plan: revert-hammer

Differential Revision: D23009117 (addf94f2d6)

Original commit changeset: 1a0d95b43d79

fbshipit-source-id: ad3fe5501267e1a0a7bf23410766f1e92b34b24d
2020-09-27 12:04:42 -07:00
Omkar Salpekar
addf94f2d6 [Distributed] DeleteKey API for c10d TCP Store (#43963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.

Reviewed By: jiayisuse

Differential Revision: D23009117

fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
2020-09-26 00:54:21 -07:00
Omkar Salpekar
304e1d1e19 [Distributed] getNumKeys API to c10d TCPStore (#43962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962

TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761

Test Plan: Adding tests to C++ Store Tests

Reviewed By: pritamdamania87

Differential Revision: D22985085

fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
2020-09-26 00:49:00 -07:00
Tianshu Bao
eb29485ed8 Support customized timeout when fetching blob from KVStore (#13582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13582

Worker nodes sometimes hit timeout failures when getting the session_id blob from Zeus, due to delays in the master node setting the blob.
This diff adds the flexibility to specify a longer timeout for getting blobs from Zeus.

Reviewed By: pietern

Differential Revision: D12926156

fbshipit-source-id: b1a4d1d9cf7de084785bfa4a8a0cd3cfd095ba5c
2018-11-06 18:54:56 -08:00
Yangqing Jia
713e706618 Move exception to C10 (#12354)
Summary:
There is still some work to be done:

- Move logging and unify AT_WARN with LOG(ERROR).
- A few header files are still being plumbed through, need cleaning.
- caffe2::EnforceNotMet aliasing is not done yet.
- need to unify the macros. See c10/util/Exception.h

This is mainly a codemod and causes no functional changes. If your job fails and you trace it back to this diff, it can usually be fixed by one of the following approaches:

(1) add //caffe2/c10:c10 to your dependency (or transitive dependency).
(2) change objects such as at::Error, at::Optional to the c10 namespace.
(3) change functions to the c10 namespace. Especially, caffe2::MakeString is not overridden by the unified c10::str function. Nothing else changes.

Please kindly consider not reverting this diff - it involves multiple rounds of rebasing and the fix is usually simple. Contact jiayq@ or AI Platform Dev for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/12354

Reviewed By: orionr

Differential Revision: D10238910

Pulled By: Yangqing

fbshipit-source-id: 7794d5bf2797ab0ca6ebaccaa2f7ebbd50ff8f32
2018-10-15 13:33:18 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Andrew Dye
b5721c2d9d Throw timeout exception from StoreHandler::wait() and catch in CreateCommonWorldOp
Summary: Define StoreHandlerTimeoutException() for timeouts in StoreHandler::wait(). Update all StoreHandler implementations. Catch new exception in CreateCommonWorldOp and store failure blob.

Reviewed By: akyrola

Differential Revision: D5095625

fbshipit-source-id: dc6f8351cc129cd1fac72bd4b2c8e6b684b21f31
2017-05-19 15:01:23 -07:00
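
As a rough sketch of the pattern the entry above describes (a `wait()` that raises a dedicated timeout exception, and a caller that catches it and records the failure), here is a toy Python version. It is not the Caffe2 C++ code: `StoreTimeoutError`, `ToyStoreHandler`, `rendezvous`, and the 30-second default are hypothetical names and values chosen for illustration.

```python
import threading
import time


class StoreTimeoutError(Exception):
    """Hypothetical stand-in for a store-handler timeout exception."""


class ToyStoreHandler:
    """Hypothetical in-memory store handler with a blocking wait()."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, name, value):
        with self._cond:
            self._data[name] = value
            self._cond.notify_all()  # wake any waiters so they re-check

    def wait(self, names, timeout=30.0):  # default value is an assumption
        """Block until all names are set, or raise StoreTimeoutError."""
        deadline = time.monotonic() + timeout
        with self._cond:
            while True:
                missing = [n for n in names if n not in self._data]
                if not missing:
                    return
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise StoreTimeoutError(
                        f"timed out waiting for keys: {missing}"
                    )
                self._cond.wait(remaining)


def rendezvous(handler, names, timeout):
    """Catch the timeout and report it instead of propagating it."""
    try:
        handler.wait(names, timeout=timeout)
        return True, ""
    except StoreTimeoutError as exc:
        return False, str(exc)


handler = ToyStoreHandler()
handler.set("rank0", b"addr0")
ok_result = rendezvous(handler, ["rank0"], timeout=0.05)
failed_result = rendezvous(handler, ["rank0", "rank1"], timeout=0.05)
```

A dedicated exception type lets the caller tell a rendezvous timeout apart from other failures and record it, which is the role the failure blob plays in the entry above.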
Pieter Noordhuis
4662781099 Include hint about run ID in store handler assertion
Summary:
TSIA

Also see https://github.com/caffe2/caffe2/issues/476

Differential Revision: D5002728

fbshipit-source-id: 2c301cacc395cfed4eec11dffedc3dba0e180e72
2017-05-04 15:21:12 -07:00
Andrew Dye
f07ec699ee Add rendezvous timeout parameter and defaults to StoreHandler::wait()
Summary: Add default rendezvous timeout for RedisStoreHandler and FileStoreHandler.

Reviewed By: pietern

Differential Revision: D4911678

fbshipit-source-id: e69dd03d96214449944d583b20941540cc0b6643
2017-04-24 15:52:25 -07:00
Pieter Noordhuis
a3942b2d64 Add store ops and tests
Summary: Basic ops to set/get/check/wait against a StoreHandler.

Differential Revision: D4248059

fbshipit-source-id: cc53061fcc13823d4b9eed6b7c1c346b9e8ec991
2016-12-05 11:53:26 -08:00
Pieter Noordhuis
f3403a1110 Add RedisStoreHandler
Summary:
Add store handler implementation backed by a Redis server.

This allows for easy rendezvous when participating machines have no
access to a shared filesystem.

Differential Revision: D4241715

fbshipit-source-id: 4ce881df3a96af24f7efbb02d1050b3b2b9bc3c0
2016-12-05 11:53:26 -08:00