Summary: Current MultiNodeCheckpointManager return None in this case, yet in JobRunner we assume this function returns a valid task group, i.e. we call session.run(self.checkpoint_manager.init(...)) directly. This will fail the case we use LocalHostScheduler and reuse a MultiNodeCheckpointManager
Reviewed By: azzolini
Differential Revision: D6843450
fbshipit-source-id: a7ec942cfe692f19e8751b0078ae6a6108f29e54
Summary:
Convert from PlanDef ProtoBuf into python Plan object by recursively creating
Nets and ExecutionSteps.
Also support running Plan object directly in Session.
Reviewed By: azzolini
Differential Revision: D5608393
fbshipit-source-id: c0ae3b6da743a759af6db3b614a5a3935fe0b34c
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the python side: no need to clone nets and add prefixes to blob names
- Faster start-up: we had cases of complex plans that took up to 30min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime.
Reviewed By: dzhulgakov
Differential Revision: D5100730
fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
Summary:
At the moment LocalSession creates a new workspace if none if provided. As a
result anything that have been executed in local session is not going to be
avaiable to the external caller, i.e. everything that is using SingleRunner can
only observe side-effects and not really access intermediate blobs.
This diff is modifying LocalSession to run in current workspace instead (unless
it gots some really weird effects because we rely on privateness of the
workspace it should work).
Differential Revision: D4634743
fbshipit-source-id: 975bed154c7ca215dc3fc0d60f05a7c092711482
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates a concept of "compiled" Job, that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling.
Reviewed By: dzhulgakov
Differential Revision: D4554799
fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984