Summary:
The comments say "experimental: don't use it", but these functions are used in the critical path from pipeline.py, so it seems better to remove the comment.
Also changed the if-else to check for None first. Although Python does not crash with getattr(None, "x", default) (the two-argument form would raise AttributeError), it is confusing.
Also fixed some lint issues.
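A small illustration of the None check mentioned in this summary. It assumes the original code used the three-argument form of getattr, which the summary does not show; the helper names lookup_attr and two_arg_getattr_raises are hypothetical and not part of pipeline.py.

```python
def lookup_attr(obj, name, default=None):
    """Return obj.<name>, handling a None obj explicitly.

    Hypothetical helper: checking for None first makes the intent obvious,
    instead of relying on getattr's default to absorb the None case.
    """
    if obj is None:
        return default
    return getattr(obj, name, default)


def two_arg_getattr_raises(obj, name):
    """Report whether the two-argument getattr raises AttributeError."""
    try:
        getattr(obj, name)
    except AttributeError:
        return True
    return False


class Box(object):
    # Hypothetical object with one attribute, used only for the demonstration.
    x = 3
```

With the explicit check, a reader does not have to remember which form of getattr tolerates None.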
Reviewed By: azzolini
Differential Revision: D5853639
fbshipit-source-id: 977de5ba0ea3ae26343ae5fcacac883faf892b0e
Summary:
Funnily, the biggest issue when trying to increase the number of trainers from 5 to 20 is not model convergence (it is worse but still converges without tuning) but the initialization time: it took around 30 min to generate the job.
After this diff, job creation time for the standard 5-7 setup goes from 125s to 8s (about a 15x speedup).
Another improvement is that ##net_printer.to_string(job)## becomes less complex.
This brings the startup time for 20 trainers down to 32s, which is still not ideal.
Next step will be to allow passing num_instances to Node as well. This way we'll be able to create only one reader and one trainer prototype and let the framework take care of the scheduling. For this one we will need to move some DataStream and PS initialization code to C++ first. (c.c. aartibasant)
Reviewed By: dzhulgakov
Differential Revision: D5100788
fbshipit-source-id: 7b76bce108f527a96b2bfe7ed43a22ea8679b682
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the Python side: no need to clone nets and add prefixes to blob names.
- Faster start-up: we had cases of complex plans that took up to 30 min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up the possibility of dynamic scheduling: the number of threads per task can be increased on the fly, at runtime.
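The isolation point above can be pictured with a toy model. This is only a sketch in plain Python, not the Caffe2 workspace API: a dict stands in for the parent workspace, a ChainMap stands in for a child workspace, and run_instances and count_task are hypothetical names. Instances run sequentially here to keep the example short.

```python
from collections import ChainMap


def run_instances(task, parent_workspace, num_instances):
    """Run `task` once per instance, each in its own child namespace.

    Reads fall through to the parent; writes stay in the child's own dict,
    so instances cannot overwrite each other's blobs and no name prefixes
    are needed.
    """
    local_blobs = []
    for instance_id in range(num_instances):
        child = ChainMap({}, parent_workspace)
        child["instance_id"] = instance_id
        task(child)
        local_blobs.append(child.maps[0])
    return local_blobs


def count_task(workspace):
    # Reads the shared "step" blob, writes a private "counter" blob.
    workspace["counter"] = workspace.get("counter", 0) + workspace["step"]


parent = {"step": 2}
results = run_instances(count_task, parent, 3)
```

Each instance ends with its own "counter" while the parent mapping is left untouched, which is the behavior the prefix-based cloning had to emulate by renaming blobs.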
Reviewed By: dzhulgakov
Differential Revision: D5100730
fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
Summary: This diff fixes an issue with running the same reader in the same workspace multiple times. To get correct behavior from the execution step, we have to explicitly initialize should_stop_blob to False.
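A toy model of why the explicit initialization matters. It is a sketch under assumptions, not the actual execution step: a dict stands in for the workspace, the "should_stop" key stands in for should_stop_blob, and run_execution_step is a hypothetical name.

```python
def run_execution_step(workspace, reader, init_should_stop=True):
    """Loop over `reader` until the stop flag in `workspace` is set.

    When init_should_stop is False, a flag left over from a previous run
    in the same workspace is reused as-is.
    """
    if init_should_stop:
        workspace["should_stop"] = False
    processed = []
    while not workspace.get("should_stop", False):
        item = next(reader, None)
        if item is None:
            # Reader exhausted: signal the loop to stop.
            workspace["should_stop"] = True
        else:
            processed.append(item)
    return processed


ws = {}
first = run_execution_step(ws, iter([1, 2]), init_should_stop=False)
stale = run_execution_step(ws, iter([3, 4]), init_should_stop=False)
fixed = run_execution_step(ws, iter([3, 4]), init_should_stop=True)
```

Without the reset, the second run sees the stale flag from the first run and exits before reading anything; with the reset it processes its input normally.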
Reviewed By: kennyhorror
Differential Revision: D5224410
fbshipit-source-id: 4ad2740e187b62b0a1f5612ea3eef223dcc8a799
Summary: Adds timers to collect the average runtime for each pipe stage.
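A minimal sketch of per-stage average timing, written in plain Python rather than with the timer operators this diff actually adds. The StageTimers class and its method names are hypothetical.

```python
import time
from collections import defaultdict


class StageTimers(object):
    """Accumulate total runtime and call count per pipe stage."""

    def __init__(self):
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def record(self, stage, seconds):
        self._total[stage] += seconds
        self._count[stage] += 1

    def timed(self, stage, fn):
        """Wrap `fn` so that every call is timed under `stage`."""
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                self.record(stage, time.perf_counter() - start)
        return wrapper

    def average(self, stage):
        """Average runtime in seconds, or 0.0 if the stage never ran."""
        count = self._count.get(stage, 0)
        return self._total[stage] / count if count else 0.0


timers = StageTimers()
timers.record("read", 1.0)
timers.record("read", 3.0)
process = timers.timed("process", lambda x: x + 1)
output = process(1)
```

Keeping a running total and count per stage is enough to report an average without storing every sample.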
Reviewed By: azzolini
Differential Revision: D5083958
fbshipit-source-id: 42536bd70c80c2453d98d872286525388f6164c3
Summary:
- NetBuilder now honors its name
- When Nets are created in the context of a NetBuilder, they take NetBuilder's name as prefix
- When a NetBuilder is created in the context of a Task, it takes the Task's name.
- pipe() now tries to find a good name based on the name of its processor, output queue, or input queue.
- RPC tries to find a name from its handler's name.
- Better names in DataStream
- net_printer prints the name of Tasks and Steps
- net_printer optionally factors out common prefixes from blob names.
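The prefix factoring in the last item can be sketched as follows. This is not net_printer's implementation; factor_prefix is a hypothetical helper, and the "/" separator between name components is an assumption.

```python
import os


def factor_prefix(names, sep="/"):
    """Split off the longest shared prefix that ends on a separator.

    Returns (prefix, shortened_names). The prefix is cut back to the last
    separator so a name component is never split in the middle.
    """
    names = list(names)
    prefix = os.path.commonprefix(names)
    prefix = prefix[:prefix.rfind(sep) + 1]
    return prefix, [name[len(prefix):] for name in names]


prefix, short = factor_prefix(
    ["trainer_0/fc/w", "trainer_0/fc/b", "trainer_0/loss"])
no_prefix, unchanged = factor_prefix(["reader/a", "readme/b"])
```

Cutting at the separator matters because a character-wise common prefix such as "read" in the second call is not a meaningful scope name.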
Differential Revision: D4527578
fbshipit-source-id: 5d3d1237c186e9576313c5aa01cc8800a9051217
Summary: stop_if() was not being honored in ProcessingReader.
Reviewed By: dzhulgakov
Differential Revision: D4497784
fbshipit-source-id: 1c967c6252f832149800796e2c26aadf10b74850
Summary:
Rewrite D3993337 based on new stack.
Compared to the old one, we need more readers to achieve the same speed. So far the speed is the same, and the new bottleneck is the write bandwidth of the trainer. Model quality is the same as the base.
Reviewed By: azzolini
Differential Revision: D4310803
fbshipit-source-id: 6d04ae8040c1ee7caa9aea5287f054e73fbe325a