Commit Graph

27 Commits

Author SHA1 Message Date
Derek Kim
1e425d1a47 A trivial typo fix in caffe2.python (#15907)
Summary:
blobl -> globl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15907

Differential Revision: D13709586

Pulled By: ezyang

fbshipit-source-id: 9d3ad76b7fea76c7934407d3c164417b4157e234
2019-01-17 04:57:34 -08:00
Tristan Rice
e650a84872 caffe2/python/task: added __repr__ methods to all task definitions (#15250)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15250

This adds `__repr__` methods to all of the classes under task.py. This makes the objects much easier to interact with when using them in an interactive manner, such as in a Jupyter notebook.

The default `__repr__` method just returns the object ID which is very unhelpful.

Reviewed By: hanli0612

Differential Revision: D13475758

fbshipit-source-id: 6e1b166ec35163b9776c797b6a2e0d002560cd29
2018-12-17 16:02:16 -08:00
Hassan Eslami
e392d428b1 Allowing TaskGroups to carry remote nets (#14342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14342

Sometimes, when we are creating a TaskGroup, we are in fact creating a TaskGroup for a distributed job. In some cases, we may want to register a few nets as "remote" to a TaskGroup. The remote net should have sufficient attributes on where they should be executed later on.

This diff adds the remote net attribute to the TaskGroup class. It exposes two minimal functionalities: adding a remote net, and getting all remote nets added to a TaskGroup.

Reviewed By: d4l3k

Differential Revision: D13188320

fbshipit-source-id: efe947aec30817e9512a5e18be985713b9356bdc
2018-11-27 13:34:11 -08:00
Shihao Xu
b834d9107e Revert D9566744: [New Checkpoint] Kill the dummy TaskOutput when task.get_step() (#11164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11164

Revert D9566744

Reviewed By: enosair

Differential Revision: D9620272

fbshipit-source-id: 6a78c46929f66bd11969840cb6b107f734be0c02
2018-08-31 22:25:57 -07:00
Shihao Xu
ad1670cf54 Kill the dummy TaskOutput when task.get_step() (#11048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739

I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. Also a dummy net `task:output` was also added along with it. See https://fburl.com/937lf2yk

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

This adding a dummy TaskOutput when user specifies no TaskOutput is a hack.
The reason for this is that ZMQ socket can't send empty blob list.
As a result, if the Task on the Worker had no output,
The master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOuput is at user layer. The hack shouldn't be exposed to user layer, polluting user workspaces.

Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.

Reviewed By: mraway

Differential Revision: D9566744

fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
2018-08-29 20:11:29 -07:00
Zhanibek Datbayev
22e3b2c9c3 Revert D9413150: [New Checkpoint] Kill the dummy TaskOutput when task.get_step()
Differential Revision:
D9413150

Original commit changeset: 51aaf3201e26

fbshipit-source-id: ac7c4c0960db03f344fe3eb2ad7f0e034db2371a
2018-08-29 14:39:49 -07:00
Shihao Xu
6ca28984c7 Kill the dummy TaskOutput when task.get_step() (#10739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739

I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. Also a dummy net `task:output` was also added along with it. See https://fburl.com/937lf2yk

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

This adding a dummy TaskOutput when user specifies no TaskOutput is a hack.
The reason for this is that ZMQ socket can't send empty blob list.
As a result, if the Task on the Worker had no output,
The master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOuput is at user layer. The hack shouldn't be exposed to user layer, polluting user workspaces.

Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.

Reviewed By: mraway

Differential Revision: D9413150

fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
2018-08-28 20:41:46 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Lei Chen
14950a9082 Support session in distributed realtime trainer
Summary:
Convert from PlanDef ProtoBuf into python Plan object by recursively creating
Nets and ExecutionSteps.

Also support running Plan object directly in Session.

Reviewed By: azzolini

Differential Revision: D5608393

fbshipit-source-id: c0ae3b6da743a759af6db3b614a5a3935fe0b34c
2017-08-16 10:28:55 -07:00
Thomas Dudziak
5355634dac Dict fixes/improvements and unittest targets for Python 3 in caffe2 core
Summary: As title

Reviewed By: salexspb

Differential Revision: D5316104

fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30
2017-06-29 17:05:41 -07:00
Alisson Gusatti Azzolini
7d482742fd Allow tasks/execution_steps to be cloned at runtime
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the python side: no need to clone nets and add prefixes to blob names
- Faster start-up: we had cases of complex plans that took up to 30min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime.

Reviewed By: dzhulgakov

Differential Revision: D5100730

fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
2017-06-20 22:32:07 -07:00
Xiaolong Wang
b133c214ce fix potential bug in task.py
Summary: as titled

Differential Revision: D5225166

fbshipit-source-id: 9247fe44922c097752c6996ee9192ec72b7e7d88
2017-06-11 10:40:47 -07:00
Xiaolong Wang
827a0ac2fe Fix comment mistakes in task.py
Summary: as titled

Reviewed By: kennyhorror

Differential Revision: D5225154

fbshipit-source-id: 99a9547e15e0d5a4c81b6339ce75406160a7fc07
2017-06-11 10:17:07 -07:00
Thomas Dudziak
47e921ba49 Remove map() and filter() in favor of comprehensions
Summary: These return views in Python 3 which would not do anything in a lot of usages currently present in Caffe2. This diff simply removes (almost) all usages of these two in Caffe2 and sub projects in favor of comprehensions which are also easier to read/understand

Reviewed By: akyrola

Differential Revision: D5142049

fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
2017-05-30 15:32:58 -07:00
Alisson Gusatti Azzolini
310f505da7 Remove application-specific comment.
Summary: This comment is not relevant for open-source.

Differential Revision: D5070835

fbshipit-source-id: 8e2dadae85566e7f6684d42f921daf7d345dc065
2017-05-16 12:17:03 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Alisson Gusatti Azzolini
59f0454621 Gather perf counters for distributed jobs
Summary: Set up a server node that periodically gathers values of all nodes' perf counters, allowing to publish them at once.

Reviewed By: dzhulgakov

Differential Revision: D4555116

fbshipit-source-id: 8e49ac8353b52b2be82aedf305762478e7fa687a
2017-02-21 22:06:25 -08:00
Alisson Gusatti Azzolini
6ff05fd49d Fix issues pickling jobs
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates a concept of "compiled" Job, that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling.

Reviewed By: dzhulgakov

Differential Revision: D4554799

fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984
2017-02-21 20:47:27 -08:00
Alisson Gusatti Azzolini
8fa156d082 Improve "reporter net" design
Summary:
Previously we had several limitations for a reporter net:
 - needed to be a net, not an execution step
 - only one allowed per execution step, with a single interval

Now, "reporter nets" become repoter steps and multiple of them can be specified with different timeouts.

Reviewed By: dzhulgakov

Differential Revision: D4583686

fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d
2017-02-21 20:17:40 -08:00
Dmytro Dzhulgakov
335b73221c Unify train_local and train_with_distributed_readers
Summary:
Outline of changes:

- add single-operator support to Caffe2-Flow integration (based on Alisson's suggestions)
- because of above support we can move graph construction to the main workflow body and pass the job to the Flow operator doing running, similarly to the distributed case
- after that it's easy to unify code even more
- there's some trickery required to make sure model exporting doesn't pollute Cluster info (as TaskGroup.to_task() creates new tasks)

Important: this diff changes train_local behavior by introducing queue between preprocessing and trainer (before we did everything on trainer thread). It doesn't seem to impact perf much (even slightly positive), so I guess it's fine. It also allows for better unification.

I'll follow up with a separate diff that moves max_examples gating to multi_reader (including train_local) and then we can enable checkpointing.

Reviewed By: xianjiec

Differential Revision: D4526079

fbshipit-source-id: 8c44044f45e7738e9b13e5b3acfbb994bc5a3d72
2017-02-09 20:46:35 -08:00
Alisson Gusatti Azzolini
039ac56a68 Better names for nets, steps and tasks
Summary:
- NetBuilder now honors its name
- When Nets are created in the context of a NetBuilder, they take NetBuilder's name as prefix
- When a NetBuilder is created in the context of a Task, it takes the Tasks's name.
- pipe() now tries to find a good name based on its processor's, output or input queue's name.
- RPC tries to find a name from its handler's name.
- Better names in DataStream
- net_printer prints the name of Tasks and Steps
- net_printer optionally factors out common prefixes form blob names.

Differential Revision: D4527578

fbshipit-source-id: 5d3d1237c186e9576313c5aa01cc8800a9051217
2017-02-09 16:33:54 -08:00
Alisson Gusatti Azzolini
33c0e5619b Add Task.REPORT_NET attribute
Summary: This allows to have a task-local report net before the Task is created. To be used in global counter (diff soon)

Reviewed By: dzhulgakov

Differential Revision: D4497771

fbshipit-source-id: 24ec7c8e95466abbd83fbea79b58717d81201857
2017-02-03 18:44:50 -08:00
Alisson Gusatti Azzolini
1d3834eeb2 Nodes to support resource requirements and outputs
Summary: See distributed.py for example of usage

Reviewed By: xianjiec

Differential Revision: D4467723

fbshipit-source-id: c74f71bebaa1751098379838d3da55945aac62bd
2017-01-30 11:29:25 -08:00
Alisson Gusatti Azzolini
6618d7462d Improvements+fixes for NetBuilder
Summary: Title.

Reviewed By: dzhulgakov

Differential Revision: D4358227

fbshipit-source-id: 21afe5107bed27eec2027f16f2c77db62c70c6e8
2017-01-03 16:59:24 -08:00
Yangqing Jia
589398950f fbsync at f5a877 2016-11-18 15:41:06 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00