Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162
Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
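
The check described in this test plan can be sketched roughly as follows. This is a minimal, self-contained illustration and not the TCPStore implementation: `InMemoryStore`, its method signatures, the return types, and the key names are all hypothetical stand-ins chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical in-memory stand-in for a key-value store. It is NOT the
// TCPStore API; names and return types are assumptions for illustration.
class InMemoryStore {
 public:
  void set(const std::string& key, const std::string& value) {
    data_[key] = value;
  }

  // Returns true if the key existed and was removed, false otherwise.
  bool deleteKey(const std::string& key) {
    return data_.erase(key) > 0;
  }

  bool check(const std::string& key) const {
    return data_.count(key) > 0;
  }

  // Number of keys currently held by the store.
  int64_t getNumKeys() const {
    return static_cast<int64_t>(data_.size());
  }

 private:
  std::unordered_map<std::string, std::string> data_;
};

// Mirrors the test plan: set keys, delete one, then verify which keys
// remain and that the key count tracks the deletion.
inline void checkDeleteAndCount() {
  InMemoryStore store;
  store.set("key0", "a");
  store.set("key1", "b");
  assert(store.getNumKeys() == 2);
  assert(store.deleteKey("key0"));
  assert(!store.check("key0"));
  assert(store.check("key1"));
  assert(store.getNumKeys() == 1);
  assert(!store.deleteKey("key0"));  // deleting a missing key reports false
}
```

Calling `checkDeleteAndCount()` exercises the same sequence the test plan describes; a real store would do this over the network rather than against a local map.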
Reviewed By: mrshenli
Differential Revision: D23955730
fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762
Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
Reviewed By: jiayisuse
Differential Revision: D23009117
fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962
TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761
Test Plan: Adding tests to C++ Store Tests
Reviewed By: pritamdamania87
Differential Revision: D22985085
fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13582
Worker nodes sometimes hit timeout failures when getting the session_id blob from Zeus, due to delays in the master node setting the blob.
This diff adds the flexibility to specify a longer timeout for getting blobs from Zeus.
Reviewed By: pietern
Differential Revision: D12926156
fbshipit-source-id: b1a4d1d9cf7de084785bfa4a8a0cd3cfd095ba5c
Summary:
There is still some work to be done:
- Move logging and unify AT_WARN with LOG(ERROR).
- A few header files are still being plumbed through and need cleaning.
- caffe2::EnforceNotMet aliasing is not done yet.
- Need to unify the macros. See c10/util/Exception.h
This is mainly a codemod and does not cause functional changes. If you find your job failing and trace it back to this diff, it can usually be fixed by one of the following approaches:
(1) add //caffe2/c10:c10 to your dependency (or transitive dependency).
(2) change objects such as at::Error, at::Optional to the c10 namespace.
(3) change functions to the c10 namespace. Especially, caffe2::MakeString is not overridden by the unified c10::str function. Nothing else changes.
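
For readers unfamiliar with the kind of helper that approach (3) refers to, here is a minimal sketch of a variadic string builder that streams its arguments into one string, similar in spirit to c10::str. The `sketch::str` name and its implementation are assumptions made for illustration; this is not the code in c10.

```cpp
#include <cassert>
#include <sstream>
#include <string>

namespace sketch {

// Hypothetical stand-in: streams every argument into a single std::string.
// This is an illustration only, not the c10 implementation.
template <typename... Args>
std::string str(const Args&... args) {
  std::ostringstream oss;
  // Expand the parameter pack left to right (C++11-compatible alternative
  // to a fold expression).
  int expand[] = {0, ((oss << args), 0)...};
  (void)expand;
  return oss.str();
}

}  // namespace sketch

// Small self-check showing mixed string and integer arguments.
inline void checkStr() {
  assert(sketch::str("blob ", 3, " of ", 10) == "blob 3 of 10");
}
```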
Please kindly consider not reverting this diff - it involves multiple rounds of rebasing and the fix is usually simple. Contact jiayq@ or AI Platform Dev for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12354
Reviewed By: orionr
Differential Revision: D10238910
Pulled By: Yangqing
fbshipit-source-id: 7794d5bf2797ab0ca6ebaccaa2f7ebbd50ff8f32
Summary: Define StoreHandlerTimeoutException() for timeouts in StoreHandler::wait(). Update all StoreHandler implementations. Catch new exception in CreateCommonWorldOp and store failure blob.
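
The pattern described in this summary, a wait() that throws a dedicated timeout exception which the caller catches and records as a failure, might look roughly like the sketch below. All names here (`StoreTimeoutError`, `PollingStore`, `waitOrRecordFailure`) are hypothetical stand-ins, and the single-threaded polling loop is a simplification; it is not the StoreHandler code, which would need proper synchronization.

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for a dedicated timeout exception type.
struct StoreTimeoutError : public std::runtime_error {
  explicit StoreTimeoutError(const std::string& msg)
      : std::runtime_error(msg) {}
};

// Hypothetical single-threaded store used only to illustrate the pattern.
class PollingStore {
 public:
  void set(const std::string& key) { keys_.insert(key); }

  bool check(const std::string& key) const { return keys_.count(key) > 0; }

  // Polls until `key` is present; throws StoreTimeoutError once the
  // caller-supplied timeout has elapsed.
  void wait(const std::string& key, std::chrono::milliseconds timeout) const {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!check(key)) {
      if (std::chrono::steady_clock::now() >= deadline) {
        throw StoreTimeoutError("Wait timeout for key: " + key);
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

 private:
  std::set<std::string> keys_;
};

// Caller-side handling: turn the timeout exception into a failure status
// instead of letting it propagate.
inline bool waitOrRecordFailure(const PollingStore& store,
                                const std::string& key,
                                std::chrono::milliseconds timeout) {
  try {
    store.wait(key, timeout);
    return true;
  } catch (const StoreTimeoutError&) {
    return false;
  }
}

// Small self-check: a present key succeeds, a missing key reports failure.
inline void checkTimeoutHandling() {
  PollingStore store;
  store.set("ready");
  assert(waitOrRecordFailure(store, "ready", std::chrono::milliseconds(5)));
  assert(!waitOrRecordFailure(store, "missing", std::chrono::milliseconds(5)));
}
```

Using a distinct exception type lets the caller tell a timeout apart from other store errors and react to it specifically.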
Reviewed By: akyrola
Differential Revision: D5095625
fbshipit-source-id: dc6f8351cc129cd1fac72bd4b2c8e6b684b21f31
Summary: Basic ops to set/get/check/wait against a StoreHandler.
Differential Revision: D4248059
fbshipit-source-id: cc53061fcc13823d4b9eed6b7c1c346b9e8ec991
Summary:
Add store handler implementation backed by a Redis server.
This allows for easy rendezvous when participating machines have no
access to a shared filesystem.
Differential Revision: D4241715
fbshipit-source-id: 4ce881df3a96af24f7efbb02d1050b3b2b9bc3c0