Commit Graph

14 Commits

Author SHA1 Message Date
Yiming Wu
6e22427929 fix tci complaining test - test_load_model_from_checkpoints
Summary:
Travis CI is complaining about test_load_model_from_checkpoints in recent PRs.
E: AssertionError: 'trainer:1/task/GivenTensorInt64Fill:0, a C++ native class of type nullptr (uninitialized).' != array([103])
See for example https://travis-ci.org/caffe2/caffe2/jobs/265665119
Reason unkown yet. First disable this then try to fix it

Reviewed By: Yangqing

Differential Revision: D5655068

fbshipit-source-id: 10949339ec92b0a4c2f0e59246040f1b0510be12
2017-08-17 17:50:42 -07:00
Bor-Yiing Su
30616ee309 Fixes the broken checkpoint test.
Summary:
Since we temporarily disable checkpointing the readers, we need to
rename all the node names in the test to make it pass.

Reviewed By: azzolini

Differential Revision: D5640930

fbshipit-source-id: 1e61be31ddf9b6e28efd2eb8e6e91e63dcd83154
2017-08-16 11:24:50 -07:00
Bor-Yiing Su
8a5bdc383e Fixes the flaky upload test
Summary:
The LocalSession does not work with the multi-node definitions.
The test becomes flaky because of that. The fix is to create
different LocalSession for each Node(), and run each node
sequentially.

Differential Revision: D5617857

fbshipit-source-id: a8079a90291b4c8b5aa6b471c33c06d18e59976c
2017-08-11 18:58:24 -07:00
Bor-Yiing Su
404f8ee9b4 Extends the jobrunner to support uploading checkpoints.
Summary:
1. Adds one more step in the JobRunner class to upload checkpoints.
2. Adds one function to return the name of the checkpoint given
the name of the node.

Reviewed By: andrewwdye

Differential Revision: D5597130

fbshipit-source-id: 570a55785e6227859e1115326d6cab077f0e7f72
2017-08-11 14:17:17 -07:00
Alisson Gusatti Azzolini
7d482742fd Allow tasks/execution_steps to be cloned at runtime
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the python side: no need to clone nets and add prefixes to blob names
- Faster start-up: we had cases of complex plans that took up to 30min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime.

Reviewed By: dzhulgakov

Differential Revision: D5100730

fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
2017-06-20 22:32:07 -07:00
Bor-Yiing Su
c1420330b2 Fixes the checkpoint test.
Summary:
Diff D5224410 initializes the should_stop_blob explicitly. With that, we will
have one more blob when executing the job. Adjusts the check accordingly.

Reviewed By: azzolini

Differential Revision: D5228398

fbshipit-source-id: 439b186c30b0b1d0e41e513babbcccd85e7a1b4a
2017-06-12 12:19:14 -07:00
Thomas Dudziak
60c78d6160 Fixes range/xrange for Python 3
Summary: As title

Differential Revision: D5151894

fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638
2017-06-07 00:04:26 -07:00
Bor-Yiing Su
81a55f441c Adds interfaces to check the existence of a DB
Summary:
To evaluate on checkpoints, we often need to load from multiple checkpoints.
However, it is inconvenient if we always need to check the existence of
a checkpoint manually. Adds interfaces to check the existence of a DB
so that we can find available checkpoints automatically.

Reviewed By: azzolini

Differential Revision: D4823876

fbshipit-source-id: e5a65b736ac2addd0447c4add81dbd0986f422e7
2017-04-11 14:07:49 -07:00
Bor-Yiing Su
8f9cd757db Skips the initialization phase of the individual checkpoint objects.
Summary:
The initialization phase of each checkpoint object simply loads the nanmes of
the blobs in the checkpoints. When we load from the checkpoints, the names of
the blobs are given. We can skip this init step.

Reviewed By: azzolini

Differential Revision: D4808114

fbshipit-source-id: 4c740049c1014f3e93b4b87f43e3937afdefa25a
2017-03-31 10:10:56 -07:00
Bor-Yiing Su
0e6413f8ea Fix flaky test
Summary:
Somehow the stress-runs flag does not work as what I expected.
Now the test finally passes.

Reviewed By: azzolini

Differential Revision: D4797559

fbshipit-source-id: 1e46844e9ae55c331c2e265a59dc550983274213
2017-03-29 16:48:20 -07:00
Bor-Yiing Su
a03d956b56 Fixes the flaky test. Although we create nets in three different nodes,
Reviewed By: azzolini

Differential Revision: D4788418

fbshipit-source-id: bdf90c5674b5dbb8b3bda21cf85ea33fedb36fa6
2017-03-28 13:48:07 -07:00
Bor-Yiing Su
7fa4acab9b Loads only the model blobs from the checkpoints.
Summary:
To evaluate from checkpoints, we need to load a model from the checkpoints.
However, the checkpoints store way more blobs than the blobs needed by the
model. This function enables the model builder to load only the blobs
associated with the model to the workspace. After that, the model builder
can evaluate the model from the populated workspace.

Reviewed By: azzolini

Differential Revision: D4751414

fbshipit-source-id: a7a420228d681fc2dcfd8573cf69a97b1abc2ef3
2017-03-27 10:02:11 -07:00
Alisson Gusatti Azzolini
6ff05fd49d Fix issues pickling jobs
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates a concept of "compiled" Job, that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling.

Reviewed By: dzhulgakov

Differential Revision: D4554799

fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984
2017-02-21 20:47:27 -08:00
Alisson Gusatti Azzolini
14a5b35805 Snapshot -> Checkpoint
Summary: As per kennyhorror request.

Reviewed By: kennyhorror

Differential Revision: D4473177

fbshipit-source-id: 6cab6ccf247b09aab8f6f056c807bd3ed27ee6a5
2017-01-27 22:29:32 -08:00