Commit Graph

29 Commits

Author SHA1 Message Date
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Qinqing Zheng
1288c4fd79 refactor epoch_limiter (#2389)
* refactor epoch_limiter

* fix test
2018-03-22 20:32:13 -07:00
Qinqing Zheng
b3fdfa7bd6 [DT] [4/n] Make epoch_group explicit for JobRunner (#2018)
2018-02-23 10:41:52 -08:00
Kittipat Virochsiri
6f533fd8b8 Only overwrite path_prefix & path_type when not None
Summary: This breaks internal functionality

Reviewed By: aartibasant

Differential Revision: D6975222

fbshipit-source-id: ce751950b4b9217d8ea5de703690451e98642f00
2018-02-13 14:40:35 -08:00
Aarti Basant
28f42cc8e7 separating set_params and init() for checkpoint managers.
Summary: separating set_params and init() for checkpoint managers.

Reviewed By: anshulverma

Differential Revision: D6852255

fbshipit-source-id: 061f16ce0c49953ca8a5fe9546af5c9945a3be48
2018-02-05 18:03:21 -08:00
Qinqing Zheng
90a3363f29 Return an empty TaskGroup if node managers exist in MultiNodeCheckpointManager
Summary: Currently, MultiNodeCheckpointManager returns None in this case, yet JobRunner assumes this function returns a valid task group, i.e. it calls session.run(self.checkpoint_manager.init(...)) directly. This fails when we use LocalHostScheduler and reuse a MultiNodeCheckpointManager.

Reviewed By: azzolini

Differential Revision: D6843450

fbshipit-source-id: a7ec942cfe692f19e8751b0078ae6a6108f29e54
2018-01-30 19:20:50 -08:00
Aarti Basant
fc56e86c7d Introduce init API for the optional Checkpoint Metadata Handler object
Summary:
Every call to the checkpoint_metadata_handler write() API requires us to pass all params such as db_prefix, db_type, etc.
This introduces an init API in the checkpoint_metadata_handler so that such params can be saved and need not be passed in every API call.

Reviewed By: mraway, anshulverma

Differential Revision: D6792651

fbshipit-source-id: 059fa4309e8fce1ee5ab009af3e0570573c24245
2018-01-24 15:19:55 -08:00
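The init pattern described in fc56e86c7d can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the actual handler: its name, the `written` list, and the record layout are assumptions made for illustration. Only the idea of saving db_prefix and db_type once in init() so that write() no longer needs them comes from the commit message.

```python
class CheckpointMetadataHandler:
    """Hypothetical stand-in for illustration; not the real caffe2 class."""

    def __init__(self):
        self._params = None   # filled in by init()
        self.written = []     # records "written" so far (in memory only)

    def init(self, db_prefix, db_type):
        # Save the parameters once so later calls do not need them.
        self._params = {"db_prefix": db_prefix, "db_type": db_type}

    def write(self, epoch):
        # write() only takes what changes per call: the epoch.
        if self._params is None:
            raise RuntimeError("init() must be called before write()")
        record = dict(self._params, epoch=epoch)
        self.written.append(record)
        return record


handler = CheckpointMetadataHandler()
handler.init(db_prefix="/tmp/checkpoints", db_type="minidb")
handler.write(epoch=1)
handler.write(epoch=2)
```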
Wei Zhang
1d4e996b87 Separate parameter downloading tasks from training tasks and run them in a different group
Summary:
At the end of distributed training, the trainer needs to download the parameters back from the parameter servers to save the model. Currently, this parameter download happens at the end of the job's epoch task group, which creates several problems when checkpointing is enabled for distributed training:

1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks run to collect parameters, but we do not save the model until the true end of training, so resources are wasted.
2. After trainer0 downloads the parameters, they take a lot of memory, so trainer0 can easily run out of memory in the next epoch of training.

Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group.

Reviewed By: azzolini

Differential Revision: D6765393

fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
2018-01-22 14:04:12 -08:00
Aarti Basant
33d734fcf1 Generalize construction of db_name in checkpoint manager
Summary:
Instead of constructing db_name as a member of checkpoint_manager, generalize
this function

Reviewed By: anshulverma

Differential Revision: D6671088

fbshipit-source-id: c528538def66933619f2fdf67820bca5d13571ea
2018-01-10 11:49:17 -08:00
Aarti Basant
8af9f0da99 Saving checkpoint failure should not cause job failure
Summary:
If we encounter failures while writing a checkpoint, ensure that the job does
not fail.
A job can make progress even if writing a checkpoint fails

Reviewed By: anshulverma, boryiingsu

Differential Revision: D6615163

fbshipit-source-id: 01f790422e1a81bab1fe73f86750eaf75a72bb77
2017-12-21 10:32:55 -08:00
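The behavior described in 8af9f0da99 amounts to catching a checkpoint write error, logging it, and letting training continue. The following is a minimal sketch under assumptions: `save_checkpoint_safely` and `save_fn` are hypothetical names and do not appear in the commit.

```python
import logging

logger = logging.getLogger(__name__)


def save_checkpoint_safely(save_fn, epoch):
    """Call save_fn(epoch); log and swallow any failure.

    Returns True if the checkpoint was written, False otherwise.
    save_fn is a hypothetical callable that writes one checkpoint.
    """
    try:
        save_fn(epoch)
        return True
    except Exception:
        # Record the traceback, but do not let the job fail.
        logger.exception("Checkpoint save failed for epoch %d; continuing", epoch)
        return False


def broken_save(epoch):
    raise IOError("disk full")


def working_save(epoch):
    pass


results = [save_checkpoint_safely(broken_save, 1),
           save_checkpoint_safely(working_save, 2)]
```

Returning a flag rather than re-raising lets the caller decide whether to retry on the next epoch.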
Aarti Basant
5de880f3e1 Resume from epoch instead of re-starting a workflow from scratch when we retry
Reviewed By: anshulverma

Differential Revision: D6354076

fbshipit-source-id: d2bee93a1136fb07c46942649e90110d2e3ccb0e
2017-11-17 12:51:07 -08:00
Anshul Verma
4b8669b087 Write checkpoint info to XDB at the end of an epoch
Summary: In this diff I am making sure that the checkpoint metadata is written out to the db for every epoch. This will allow us to automatically resume from an epoch if a workflow fails.

Reviewed By: aartibasant

Differential Revision: D6234832

fbshipit-source-id: f09a4de118f2eac25f663556476ac6313925fdf3
2017-11-09 11:13:24 -08:00
Lei Chen
58bcf76ba3 Have model downloading as a separate plan
Summary:
For distributed offline training, downloading parameters from trainer_0 is part of the epoch plan. However, for distributed realtime training, we publish the model at a specific time interval, so we need to run multiple iterations of the epoch plan before publishing the model.

In this diff, I split parameter downloading out of the epoch plan into a separate plan, so we can explicitly execute it before model publishing for distributed online training.

Reviewed By: boryiingsu

Differential Revision: D5995122

fbshipit-source-id: 47d61d7b8c57cfae156e79b7ec32068ef579d7c3
2017-10-16 16:03:48 -07:00
Dmytro Dzhulgakov
2972a6ca02 Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger"
Summary:
This reverts commit 95c634872ac02be721257169e38c8fead04cd66b

bypass-lint

Differential Revision: D6026557

fbshipit-source-id: 663c28583ce3b01070ff5449115ed7e222f71776
2017-10-12 20:21:52 -07:00
Luke Yeager
75bece6ede Fix "No handlers could be found for logger"
Summary: Closes https://github.com/caffe2/caffe2/pull/1316

Differential Revision: D6026557

Pulled By: Yangqing

fbshipit-source-id: 95c634872ac02be721257169e38c8fead04cd66b
2017-10-10 22:32:13 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Andrew Dye
2070467c57 Allow CheckpointManager init() and load() to use a different db type with path_prefix
Summary: CheckpointManager already accepts a path_prefix override for init() and load(), but it assumes the same db_type passed in __init__(). This change adds an optional path_type for each call.

Reviewed By: boryiingsu

Differential Revision: D5888152

fbshipit-source-id: 21cd31a62a0188fe0e0b19b43c3b232c2342d0a8
2017-09-22 09:48:29 -07:00
Aarti Basant
77a02eaa7f Enable reader checkpoint
Summary:
Reader checkpointing was disabled due to the bug captured in T21143272.
Now that we have resolved that issue, re-enabling reader checkpointing.

Reviewed By: boryiingsu, rayleichen

Differential Revision: D5730545

fbshipit-source-id: 7fae48b03e07eaf530bfc9e8e8b6683d8ed4e206
2017-09-05 14:21:25 -07:00
Bor-Yiing Su
b3536a3a6d Adds checkpoint taskgroups to the online trainer.
Summary:
1. Uses the upload_builder in the offline training.
2. Adds the checkpoint taskgroups to the online trainer.
3. Changes the naming rules so that the model checkpoint has the format of
<directory>/<entity_id>_<snapshot_id>.<node_name>.<snapshot_id>

Reviewed By: rayleichen

Differential Revision: D5665068

fbshipit-source-id: a8103aed2ca195a506174d2a1d50611d2f1d9c35
2017-08-19 04:09:47 -07:00
Bor-Yiing Su
1d70a2276d Changes the checkpoint naming rules.
Summary: So far we format the epoch name with 6 digits, but this is constraining. In order to have consistent naming, we can simply append the epoch to the suffix. Then we will have consistent naming rules for small and for large epoch numbers.

Reviewed By: azzolini

Differential Revision: D5653871

fbshipit-source-id: acdf26a14b731347bb85fe2f33c1b89e2ba83bdd
2017-08-17 22:16:42 -07:00
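The constraint mentioned in 1d70a2276d can be seen in a small sketch. The function names and the `prefix.epoch` layout below are assumptions for illustration only; the commit states just that the epoch was formatted with 6 digits and is now appended as-is.

```python
def padded_name(prefix, epoch):
    # Assumed old scheme: zero-pad the epoch to a fixed width of 6 digits.
    return "%s.%06d" % (prefix, epoch)


def appended_name(prefix, epoch):
    # Assumed new scheme: append the epoch number without padding.
    return "%s.%d" % (prefix, epoch)


small = (padded_name("ckpt", 5), appended_name("ckpt", 5))
large = (padded_name("ckpt", 1234567), appended_name("ckpt", 1234567))
```

With padding, names keep a fixed width only up to epoch 999999; beyond that the width changes anyway, so the unpadded form gives one rule for every epoch number.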
Bor-Yiing Su
49ec942825 Temporarily disables the checkpoints for the readers.
Summary:
The hive reader checkpoints are broken because of D5582328.
This breaks our offline simulator test as well.
This is a temporary fix that disables the checkpoints for readers.

Reviewed By: azzolini

Differential Revision: D5637719

fbshipit-source-id: 4f31ae534cb7e981fcacbb721cbb2420249fad91
2017-08-15 19:36:11 -07:00
Bor-Yiing Su
404f8ee9b4 Extends the jobrunner to support uploading checkpoints.
Summary:
1. Adds one more step in the JobRunner class to upload checkpoints.
2. Adds one function to return the name of the checkpoint given
the name of the node.

Reviewed By: andrewwdye

Differential Revision: D5597130

fbshipit-source-id: 570a55785e6227859e1115326d6cab077f0e7f72
2017-08-11 14:17:17 -07:00
Bor-Yiing Su
81a55f441c Adds interfaces to check the existence of a DB
Summary:
To evaluate on checkpoints, we often need to load from multiple checkpoints.
However, it is inconvenient if we always need to check the existence of
a checkpoint manually. Adds interfaces to check the existence of a DB
so that we can find available checkpoints automatically.

Reviewed By: azzolini

Differential Revision: D4823876

fbshipit-source-id: e5a65b736ac2addd0447c4add81dbd0986f422e7
2017-04-11 14:07:49 -07:00
Bor-Yiing Su
8f9cd757db Skips the initialization phase of the individual checkpoint objects.
Summary:
The initialization phase of each checkpoint object simply loads the names of
the blobs in the checkpoints. When we load from the checkpoints, the names of
the blobs are given. We can skip this init step.

Reviewed By: azzolini

Differential Revision: D4808114

fbshipit-source-id: 4c740049c1014f3e93b4b87f43e3937afdefa25a
2017-03-31 10:10:56 -07:00
Bor-Yiing Su
0e6413f8ea Fix flaky test
Summary:
Somehow the stress-runs flag does not work as I expected.
Now the test finally passes.

Reviewed By: azzolini

Differential Revision: D4797559

fbshipit-source-id: 1e46844e9ae55c331c2e265a59dc550983274213
2017-03-29 16:48:20 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Bor-Yiing Su
7fa4acab9b Loads only the model blobs from the checkpoints.
Summary:
To evaluate from checkpoints, we need to load a model from the checkpoints.
However, the checkpoints store way more blobs than the blobs needed by the
model. This function enables the model builder to load only the blobs
associated with the model to the workspace. After that, the model builder
can evaluate the model from the populated workspace.

Reviewed By: azzolini

Differential Revision: D4751414

fbshipit-source-id: a7a420228d681fc2dcfd8573cf69a97b1abc2ef3
2017-03-27 10:02:11 -07:00
Alisson Gusatti Azzolini
6ff05fd49d Fix issues pickling jobs
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates a concept of "compiled" Job, that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling.

Reviewed By: dzhulgakov

Differential Revision: D4554799

fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984
2017-02-21 20:47:27 -08:00
Alisson Gusatti Azzolini
14a5b35805 Snapshot -> Checkpoint
Summary: As per kennyhorror's request.

Reviewed By: kennyhorror

Differential Revision: D4473177

fbshipit-source-id: 6cab6ccf247b09aab8f6f056c807bd3ed27ee6a5
2017-01-27 22:29:32 -08:00