Summary:
Instead of constructing db_name as a member of checkpoint_manager, generalize
this function.
Reviewed By: anshulverma
Differential Revision: D6671088
fbshipit-source-id: c528538def66933619f2fdf67820bca5d13571ea
Summary:
If we encounter failures while writing a checkpoint, ensure that the job does
not fail.
A job can make progress even if writing a checkpoint fails.
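A minimal sketch of the intended behavior, not the code in this diff. The names `try_save_checkpoint`, `run_job`, `train_one_epoch`, and `save_checkpoint` are hypothetical; the point is only that a failed checkpoint write is logged and the job keeps going.

```python
import logging

logger = logging.getLogger(__name__)


def try_save_checkpoint(save_checkpoint, epoch):
    """Attempt a checkpoint write and report failure instead of raising.

    `save_checkpoint` is a hypothetical callable that takes the epoch number.
    Returns True if the write succeeded, False otherwise.
    """
    try:
        save_checkpoint(epoch)
        return True
    except Exception:
        # The job should keep making progress, so log the error and move on.
        logger.exception("Checkpoint write failed for epoch %d", epoch)
        return False


def run_job(num_epochs, train_one_epoch, save_checkpoint):
    """Run every epoch; return the epochs whose checkpoint was written."""
    saved = []
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(epoch)
        if try_save_checkpoint(save_checkpoint, epoch):
            saved.append(epoch)
    return saved
```

With this shape, a caller can still inspect which epochs have a usable checkpoint, while training itself is never interrupted by a write error.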
Reviewed By: anshulverma, boryiingsu
Differential Revision: D6615163
fbshipit-source-id: 01f790422e1a81bab1fe73f86750eaf75a72bb77
Summary: This diff ensures that the checkpoint metadata is written out to the db for every epoch. This will allow us to automatically resume from an epoch if a workflow fails.
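As an illustration only, the resume logic could look like the sketch below. The key-value `metadata_db` (a plain dict here), the record layout, and both function names are assumptions and do not come from this diff.

```python
def write_checkpoint_metadata(metadata_db, job_name, epoch):
    """Record the latest completed epoch for `job_name`.

    `metadata_db` is a hypothetical key-value store (a dict in this sketch).
    """
    metadata_db[job_name] = {"last_epoch": epoch}


def epoch_to_resume_from(metadata_db, job_name):
    """Return the last recorded epoch, or 0 if nothing was recorded."""
    record = metadata_db.get(job_name)
    return record["last_epoch"] if record else 0
```

Because the metadata is rewritten after every epoch, a restarted workflow only needs one lookup to find where to pick up.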
Reviewed By: aartibasant
Differential Revision: D6234832
fbshipit-source-id: f09a4de118f2eac25f663556476ac6313925fdf3
Summary:
For distributed offline training, downloading parameters from trainer_0 is part of the epoch plan. However, for distributed realtime training, we publish the model at a specific time interval, so we need to run multiple iterations of the epoch plan before publishing the model.
In this diff, I split downloading parameters out of the epoch plan into a separate plan, so we can explicitly execute it before model publishing for distributed online training.
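A rough sketch of the resulting control flow for online training, assuming hypothetical callables `run_epoch_plan`, `run_download_plan`, and `publish_model`. A fixed iteration count stands in for the time interval to keep the example deterministic.

```python
def run_online_training(run_epoch_plan, run_download_plan, publish_model,
                        iterations_per_publish, num_publishes):
    """Run several epoch-plan iterations, then download and publish.

    All three callables are hypothetical stand-ins for the real plans.
    """
    for _ in range(num_publishes):
        for _ in range(iterations_per_publish):
            run_epoch_plan()
        # Downloading parameters is now its own plan, run only before publishing.
        run_download_plan()
        publish_model()
```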
Reviewed By: boryiingsu
Differential Revision: D5995122
fbshipit-source-id: 47d61d7b8c57cfae156e79b7ec32068ef579d7c3
Summary: CheckpointManager already accepts a path_prefix override for init() and load(), but it assumes the same db_type passed in __init__(). This change adds an optional path_type for each call.
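A simplified sketch of per-call overrides. `CheckpointPaths` and `resolve` are hypothetical stand-ins, not the CheckpointManager API, and the `<node_name>.<epoch>` file name is an assumption made for illustration.

```python
import os


class CheckpointPaths:
    """Hypothetical helper that resolves where a checkpoint DB lives."""

    def __init__(self, db_prefix, db_type):
        self.db_prefix = db_prefix
        self.db_type = db_type

    def resolve(self, node_name, epoch, path_prefix=None, path_type=None):
        """Return (db_name, db_type), honoring optional per-call overrides."""
        filename = "{}.{}".format(node_name, epoch)
        if path_prefix:
            db_name = path_prefix + filename
        else:
            db_name = os.path.join(self.db_prefix, filename)
        # Fall back to the type given at construction when no override is passed.
        db_type = path_type or self.db_type
        return db_name, db_type
```

Keeping both overrides optional means existing callers that pass neither argument see no change in behavior.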
Reviewed By: boryiingsu
Differential Revision: D5888152
fbshipit-source-id: 21cd31a62a0188fe0e0b19b43c3b232c2342d0a8
Summary:
Reader checkpointing was disabled due to a bug captured in T21143272.
Now that we have resolved that issue, this diff re-enables reader checkpointing.
Reviewed By: boryiingsu, rayleichen
Differential Revision: D5730545
fbshipit-source-id: 7fae48b03e07eaf530bfc9e8e8b6683d8ed4e206
Summary:
1. Uses the upload_builder in the offline training.
2. Adds the checkpoint taskgroups to the online trainer.
3. Changes the naming rules so that the model checkpoint has the format of
<directory>/<entity_id>_<snapshot_id>.<node_name>.<snapshot_id>
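The naming rule in item 3 can be sketched as a small formatter. The function name and argument names are hypothetical; only the layout of the resulting string follows the format stated above.

```python
def model_checkpoint_name(directory, entity_id, snapshot_id, node_name):
    """Build <directory>/<entity_id>_<snapshot_id>.<node_name>.<snapshot_id>."""
    return "{d}/{e}_{s}.{n}.{s}".format(
        d=directory, e=entity_id, s=snapshot_id, n=node_name
    )
```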
Reviewed By: rayleichen
Differential Revision: D5665068
fbshipit-source-id: a8103aed2ca195a506174d2a1d50611d2f1d9c35
Summary: So far we have formatted the epoch name with 6 digits, but this is constraining. To keep naming consistent, we can simply append the epoch to the suffix. Then we will have consistent naming rules for both small and large epoch numbers.
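To illustrate the difference, here is a sketch of the two schemes. The exact old format string and the `<node_name>.<epoch>` layout are assumptions, and both function names are hypothetical.

```python
def padded_name(node_name, epoch):
    """Assumed old scheme: epoch zero-padded to six digits."""
    return "{}.{:06d}".format(node_name, epoch)


def appended_name(node_name, epoch):
    """New scheme: the epoch number is appended as-is."""
    return "{}.{}".format(node_name, epoch)
```

With padding, an epoch above 999999 overflows the fixed width, so names stop having a uniform length; appending the raw number applies the same rule at every magnitude.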
Reviewed By: azzolini
Differential Revision: D5653871
fbshipit-source-id: acdf26a14b731347bb85fe2f33c1b89e2ba83bdd
Summary:
The hive reader checkpoints are broken because of D5582328.
This breaks our offline simulator test as well.
This is a temporary fix that disables the checkpoints for readers.
Reviewed By: azzolini
Differential Revision: D5637719
fbshipit-source-id: 4f31ae534cb7e981fcacbb721cbb2420249fad91
Summary:
1. Adds one more step in the JobRunner class to upload checkpoints.
2. Adds one function to return the name of the checkpoint given
the name of the node.
Reviewed By: andrewwdye
Differential Revision: D5597130
fbshipit-source-id: 570a55785e6227859e1115326d6cab077f0e7f72
Summary:
To evaluate on checkpoints, we often need to load from multiple checkpoints.
However, it is inconvenient if we always need to check the existence of
a checkpoint manually. Adds interfaces to check the existence of a DB
so that we can find available checkpoints automatically.
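A minimal sketch of the idea, assuming checkpoints are plain files named `<node_name>.<epoch>` under a directory. The function name and the use of `os.path.exists` are illustrative assumptions and not the interface added by this diff.

```python
import os


def available_checkpoints(db_prefix, node_name, epochs):
    """Return the epochs whose checkpoint file exists under `db_prefix`."""
    found = []
    for epoch in epochs:
        path = os.path.join(db_prefix, "{}.{}".format(node_name, epoch))
        if os.path.exists(path):
            found.append(epoch)
    return found
```

An evaluation job can then iterate over the returned epochs instead of probing each checkpoint by hand.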
Reviewed By: azzolini
Differential Revision: D4823876
fbshipit-source-id: e5a65b736ac2addd0447c4add81dbd0986f422e7
Summary:
The initialization phase of each checkpoint object simply loads the names of
the blobs in the checkpoints. When we load from the checkpoints, the names of
the blobs are given. We can skip this init step.
Reviewed By: azzolini
Differential Revision: D4808114
fbshipit-source-id: 4c740049c1014f3e93b4b87f43e3937afdefa25a
Summary:
Somehow the stress-runs flag does not work as I expected.
Now the test finally passes.
Reviewed By: azzolini
Differential Revision: D4797559
fbshipit-source-id: 1e46844e9ae55c331c2e265a59dc550983274213
Summary:
To evaluate from checkpoints, we need to load a model from the checkpoints.
However, the checkpoints store way more blobs than the blobs needed by the
model. This function enables the model builder to load only the blobs
associated with the model to the workspace. After that, the model builder
can evaluate the model from the populated workspace.
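The selective load can be pictured with plain dicts standing in for the checkpoint and the workspace. Everything here (the function name, the dict-based workspace, the error on missing blobs) is a hypothetical sketch and not the function added by this diff.

```python
def load_model_blobs(checkpoint_blobs, model_blob_names, workspace):
    """Copy only the blobs the model needs from a checkpoint into a workspace.

    `checkpoint_blobs` and `workspace` are dicts mapping blob name to value.
    """
    missing = [name for name in model_blob_names if name not in checkpoint_blobs]
    if missing:
        raise KeyError("Blobs not found in checkpoint: {}".format(missing))
    for name in model_blob_names:
        workspace[name] = checkpoint_blobs[name]
    return workspace
```

Blobs that exist only for training, such as optimizer state, stay out of the evaluation workspace.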
Reviewed By: azzolini
Differential Revision: D4751414
fbshipit-source-id: a7a420228d681fc2dcfd8573cf69a97b1abc2ef3
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates the concept of a "compiled" Job, which stores essentially only the protobufs of the Jobs to be executed, avoiding any pickling issues.
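The pickling problem and the fix can be illustrated in isolation. In this sketch a job is a list of hypothetical `(serialized_bytes, callback)` pairs; `compile_job` and `can_pickle` are made-up names, and the byte strings stand in for serialized protobufs.

```python
import pickle


def compile_job(steps):
    """Keep only each step's serialized bytes and drop Python-side objects.

    `steps` is a hypothetical list of (serialized_bytes, callback) pairs.
    """
    return {"serialized_steps": [data for data, _callback in steps]}


def can_pickle(obj):
    """Return True if `obj` survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False
```

A job holding lambdas or other unpicklable Python objects fails `can_pickle`, while its compiled form, made of bytes only, can be pickled and shipped to the operator that executes it.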
Reviewed By: dzhulgakov
Differential Revision: D4554799
fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984