Summary: Set up a server node that periodically gathers values of all nodes' perf counters, allowing to publish them at once.
Reviewed By: dzhulgakov
Differential Revision: D4555116
fbshipit-source-id: 8e49ac8353b52b2be82aedf305762478e7fa687a
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates a concept of "compiled" Job, that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling.
Reviewed By: dzhulgakov
Differential Revision: D4554799
fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984
Summary:
Previously we had several limitations for a reporter net:
- needed to be a net, not an execution step
- only one allowed per execution step, with a single interval
Now, "reporter nets" become repoter steps and multiple of them can be specified with different timeouts.
Reviewed By: dzhulgakov
Differential Revision: D4583686
fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d
Summary:
Outline of changes:
- add single-operator support to Caffe2-Flow integration (based on Alisson's suggestions)
- because of above support we can move graph construction to the main workflow body and pass the job to the Flow operator doing running, similarly to the distributed case
- after that it's easy to unify code even more
- there's some trickery required to make sure model exporting doesn't pollute Cluster info (as TaskGroup.to_task() creates new tasks)
Important: this diff changes train_local behavior by introducing queue between preprocessing and trainer (before we did everything on trainer thread). It doesn't seem to impact perf much (even slightly positive), so I guess it's fine. It also allows for better unification.
I'll follow up with a separate diff that moves max_examples gating to multi_reader (including train_local) and then we can enable checkpointing.
Reviewed By: xianjiec
Differential Revision: D4526079
fbshipit-source-id: 8c44044f45e7738e9b13e5b3acfbb994bc5a3d72
Summary:
- NetBuilder now honors its name
- When Nets are created in the context of a NetBuilder, they take NetBuilder's name as prefix
- When a NetBuilder is created in the context of a Task, it takes the Tasks's name.
- pipe() now tries to find a good name based on its processor's, output or input queue's name.
- RPC tries to find a name from its handler's name.
- Better names in DataStream
- net_printer prints the name of Tasks and Steps
- net_printer optionally factors out common prefixes form blob names.
Differential Revision: D4527578
fbshipit-source-id: 5d3d1237c186e9576313c5aa01cc8800a9051217
Summary: This allows to have a task-local report net before the Task is created. To be used in global counter (diff soon)
Reviewed By: dzhulgakov
Differential Revision: D4497771
fbshipit-source-id: 24ec7c8e95466abbd83fbea79b58717d81201857
Summary: See distributed.py for example of usage
Reviewed By: xianjiec
Differential Revision: D4467723
fbshipit-source-id: c74f71bebaa1751098379838d3da55945aac62bd