pytorch/caffe2/python
Aarti Basant 8af9f0da99 Saving checkpoint failure should not cause job failure
Summary:
If we encounter failures while writing a checkpoint, ensure that the job does
not fail.
A job can make progress even if writing a checkpoint fails.
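
The behavior described in this summary can be illustrated with a minimal sketch. It is not the code in checkpoint.py: the names `save_checkpoint_safely`, `run_job`, `train_fn`, and `save_fn` are hypothetical and chosen only for illustration. The sketch assumes that a failed checkpoint write surfaces as a Python exception, which is caught and logged so the training loop can continue.

```python
import logging

logger = logging.getLogger(__name__)


def save_checkpoint_safely(save_fn, epoch):
    """Attempt one checkpoint write; report failure instead of raising.

    `save_fn` is a hypothetical callable that writes the checkpoint for
    `epoch` and raises an exception if the write fails.
    """
    try:
        save_fn(epoch)
        return True
    except Exception:
        # Log the full traceback, but do not let the failure abort the job.
        logger.exception("Checkpoint for epoch %d failed; continuing", epoch)
        return False


def run_job(num_epochs, train_fn, save_fn):
    """Run `train_fn` for each epoch and return the epochs that were saved."""
    saved_epochs = []
    for epoch in range(1, num_epochs + 1):
        train_fn(epoch)
        if save_checkpoint_safely(save_fn, epoch):
            saved_epochs.append(epoch)
    return saved_epochs
```

With this shape, a failed write costs only that one checkpoint: the job keeps training, and a later resume would start from the most recent epoch that was saved successfully.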

Reviewed By: anshulverma, boryiingsu

Differential Revision: D6615163

fbshipit-source-id: 01f790422e1a81bab1fe73f86750eaf75a72bb77
2017-12-21 10:32:55 -08:00
docs Re-license to Apache 2017-09-28 16:22:00 -07:00
examples Allow shifting of activations / ops to other GPUs in data parallel model 2017-11-29 21:17:00 -08:00
helpers Add if and while ops to brew 2017-12-05 17:33:34 -08:00
layers print exception in layers 2017-12-15 12:12:28 -08:00
mint Re-license to Apache 2017-09-28 16:22:00 -07:00
mkl More extensions 2017-11-20 17:18:51 -08:00
modeling Added inverted FP16 Initializer 2017-10-27 10:20:04 -07:00
models Explicitly set default data type in seq2seq/translate.py 2017-12-07 11:21:01 -08:00
operator_test Add Min and MinGradient Op in Caffe2 2017-12-20 14:49:55 -08:00
predictor Record workflow run id for inference. 2017-12-18 15:33:19 -08:00
rnn Re-license to Apache 2017-09-28 16:22:00 -07:00
test Async executor with less polling 2017-11-28 18:50:32 -08:00
_import_c_extension.py Re-license to Apache 2017-09-28 16:22:00 -07:00
allcompare_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
attention.py Re-license to Apache 2017-09-28 16:22:00 -07:00
benchmark_generator.py Re-license to Apache 2017-09-28 16:22:00 -07:00
binarysize.py Re-license to Apache 2017-09-28 16:22:00 -07:00
brew_test.py Add if and while ops to brew 2017-12-05 17:33:34 -08:00
brew.py Add if and while ops to brew 2017-12-05 17:33:34 -08:00
build.py Expose CMake options in the binary 2017-10-04 02:33:02 -07:00
cached_reader.py Cached reader 2017-11-15 12:38:49 -08:00
caffe_translator_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
caffe_translator.py Translating Crop to Slice 2017-10-03 17:18:32 -07:00
checkpoint_test.py Saving checkpoint failure should not cause job failure 2017-12-21 10:32:55 -08:00
checkpoint.py Saving checkpoint failure should not cause job failure 2017-12-21 10:32:55 -08:00
CMakeLists.txt CMake completions work 2017-01-11 16:59:22 -08:00
cnn.py Re-license to Apache 2017-09-28 16:22:00 -07:00
context_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
context.py Re-license to Apache 2017-09-28 16:22:00 -07:00
control_ops_grad.py Backpropagation for While op 2017-12-18 16:03:45 -08:00
control_ops_util.py Backpropagation for While op 2017-12-18 16:03:45 -08:00
control_test.py Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" 2017-10-12 20:21:52 -07:00
control.py Re-license to Apache 2017-09-28 16:22:00 -07:00
convnet_benchmarks_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
convnet_benchmarks.py Re-license to Apache 2017-09-28 16:22:00 -07:00
core_gradients_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
core_test.py Py3 test fixes 2017-12-05 10:34:41 -08:00
core.py Backpropagation for While op 2017-12-18 16:03:45 -08:00
crf.py Re-license to Apache 2017-09-28 16:22:00 -07:00
data_parallel_model_test.py Skip DeviceShiftTest if host has < 4 GPU devices 2017-12-03 16:02:05 -08:00
data_parallel_model_utils.py Allow shifting of activations / ops to other GPUs in data parallel model 2017-11-29 21:17:00 -08:00
data_parallel_model.py Allow shifting of activations / ops to other GPUs in data parallel model 2017-11-29 21:17:00 -08:00
data_workers_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
data_workers.py move print to logger 2017-11-17 18:03:44 -08:00
dataio_test.py Cached reader 2017-11-15 12:38:49 -08:00
dataio.py Re-license to Apache 2017-09-28 16:22:00 -07:00
dataset.py Re-license to Apache 2017-09-28 16:22:00 -07:00
db_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
device_checker.py Re-license to Apache 2017-09-28 16:22:00 -07:00
dyndep.py Re-license to Apache 2017-09-28 16:22:00 -07:00
embedding_generation_benchmark.py Re-license to Apache 2017-09-28 16:22:00 -07:00
experiment_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
extension_loader.py Re-license to Apache 2017-09-28 16:22:00 -07:00
functional_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
functional.py Re-license to Apache 2017-09-28 16:22:00 -07:00
gradient_check_test.py Backpropagation for While op 2017-12-18 16:03:45 -08:00
gradient_checker.py Re-license to Apache 2017-09-28 16:22:00 -07:00
gru_cell.py Integrated GRU implementation into C2 2017-11-14 16:18:50 -08:00
hsm_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
hypothesis_test_util.py Add check for Travis in executor test 2017-10-05 11:40:23 -07:00
hypothesis_test.py Add CUDA implementation for ReplaceNaNOp 2017-12-05 13:34:51 -08:00
layer_model_helper.py add maybe_add_global_constant 2017-12-18 22:14:00 -08:00
layer_model_instantiator.py Re-license to Apache 2017-09-28 16:22:00 -07:00
layer_parameter_sharing_test.py Add shape checks and print more info in parameter sharing 2017-10-27 01:22:06 -07:00
layer_test_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
layers_test.py Add rank loss for retrieval models with random negative sample 2017-10-25 16:19:41 -07:00
lengths_reducer_rowwise_8bit_ops_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
load_save_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
lstm_benchmark.py Re-license to Apache 2017-09-28 16:22:00 -07:00
memonger_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
memonger.py Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" 2017-10-12 20:21:52 -07:00
mkl_test_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
model_device_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
model_helper.py add sanity check to model_helper.TensorProtosDBInput 2017-11-21 10:28:25 -08:00
modifier_context.py Re-license to Apache 2017-09-28 16:22:00 -07:00
mpi_python.cc Upgrade to 2.2.1 2017-10-22 13:26:56 -07:00
muji_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
muji.py Re-license to Apache 2017-09-28 16:22:00 -07:00
net_builder_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
net_builder.py Minor documentation fix in NetBuiler 2017-11-15 16:22:22 -08:00
net_drawer.py Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" 2017-10-12 20:21:52 -07:00
net_printer_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
net_printer.py Unit test that compares net snippets after parallelization 2017-11-08 15:55:27 -08:00
observer_test.py Attach observers to operators inside step net 2017-11-14 15:06:38 -08:00
optimizer_context.py Re-license to Apache 2017-09-28 16:22:00 -07:00
optimizer_test_util.py momentum sgd 2017-11-03 16:17:17 -07:00
optimizer_test.py Support RMSProp in Caffe2. 2017-11-08 16:43:18 -08:00
optimizer.py Revert changes in blob name in optimizer 2017-12-04 19:32:45 -08:00
parallel_workers_test.py Add shutdown_fun to parallel_workers 2017-10-10 12:02:24 -07:00
parallel_workers.py Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" 2017-10-12 20:21:52 -07:00
parallelize_bmuf_distributed_test.py BMUF cpu support 2017-11-19 23:41:25 -08:00
pipeline_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
pipeline.py Re-license to Apache 2017-09-28 16:22:00 -07:00
predictor_constants.py Re-license to Apache 2017-09-28 16:22:00 -07:00
pybind_state_gpu.cc Upgrade to 2.2.1 2017-10-22 13:26:56 -07:00
pybind_state_mkl.cc Re-license to Apache 2017-09-28 16:22:00 -07:00
pybind_state.cc Add slice and gather syntax 2017-12-19 19:17:01 -08:00
pybind_state.h Polling async net executor 2017-11-03 07:27:44 -07:00
python_op_test.py Throw Python exception from PythonOp instead of logging 2017-11-20 09:03:17 -08:00
queue_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
record_queue.py Re-license to Apache 2017-09-28 16:22:00 -07:00
recurrent.py Remove scoping assertion because it is not useful and causing errors 2017-12-11 18:03:45 -08:00
regularizer_context.py Re-license to Apache 2017-09-28 16:22:00 -07:00
regularizer.py Re-license to Apache 2017-09-28 16:22:00 -07:00
rnn_cell.py LayerConfigMILSTMCell 2017-12-14 10:17:53 -08:00
schema_test.py add struct get method 2017-12-19 12:35:56 -08:00
schema.py add struct get method 2017-12-19 12:35:56 -08:00
scope_test.py Add a EmptyDeviceScope (i.e. allow setting CurrentDeviceScope() to None) 2017-11-02 11:25:48 -07:00
scope.py Add a EmptyDeviceScope (i.e. allow setting CurrentDeviceScope() to None) 2017-11-02 11:25:48 -07:00
session_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
session.py Re-license to Apache 2017-09-28 16:22:00 -07:00
sparse_to_dense_mask_test.py Skip negative indices 2017-10-09 16:09:50 -07:00
task.py Re-license to Apache 2017-09-28 16:22:00 -07:00
test_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
text_file_reader.py Re-license to Apache 2017-09-28 16:22:00 -07:00
timeout_guard.py Re-license to Apache 2017-09-28 16:22:00 -07:00
toy_regression_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
tt_core_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
tt_core.py Re-license to Apache 2017-09-28 16:22:00 -07:00
utils.py Re-license to Apache 2017-09-28 16:22:00 -07:00
visualize.py Re-license to Apache 2017-09-28 16:22:00 -07:00
workspace_test.py Compute flops in conv based on output image size 2017-12-09 21:32:08 -08:00
workspace.py Add ONNX exporter for glcgan 2017-11-14 10:09:44 -08:00