Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15250
This adds `__repr__` methods to all of the classes under task.py. This makes the objects much easier to interact with when using them in an interactive manner, such as in a Jupyter notebook.
The default `__repr__` method just returns the object ID which is very unhelpful.
Reviewed By: hanli0612
Differential Revision: D13475758
fbshipit-source-id: 6e1b166ec35163b9776c797b6a2e0d002560cd29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14342
Sometimes, when we are creating a TaskGroup, we are in fact creating a TaskGroup for a distributed job. In some cases, we may want to register a few nets as "remote" to a TaskGroup. The remote net should have sufficient attributes on where they should be executed later on.
This diff adds the remote net attribute to the TaskGroup class. It exposes two minimal functionalities: adding a remote net, and getting all remote nets added to a TaskGroup.
Reviewed By: d4l3k
Differential Revision: D13188320
fbshipit-source-id: efe947aec30817e9512a5e18be985713b9356bdc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. Also a dummy net `task:output` was also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
This adding a dummy TaskOutput when user specifies no TaskOutput is a hack.
The reason for this is that ZMQ socket can't send empty blob list.
As a result, if the Task on the Worker had no output,
The master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOuput is at user layer. The hack shouldn't be exposed to user layer, polluting user workspaces.
Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.
Reviewed By: mraway
Differential Revision: D9566744
fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. Also a dummy net `task:output` was also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
This adding a dummy TaskOutput when user specifies no TaskOutput is a hack.
The reason for this is that ZMQ socket can't send empty blob list.
As a result, if the Task on the Worker had no output,
The master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOuput is at user layer. The hack shouldn't be exposed to user layer, polluting user workspaces.
Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.
Reviewed By: mraway
Differential Revision: D9413150
fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
Summary:
Convert from PlanDef ProtoBuf into python Plan object by recursively creating
Nets and ExecutionSteps.
Also support running Plan object directly in Session.
Reviewed By: azzolini
Differential Revision: D5608393
fbshipit-source-id: c0ae3b6da743a759af6db3b614a5a3935fe0b34c
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the python side: no need to clone nets and add prefixes to blob names
- Faster start-up: we had cases of complex plans that took up to 30min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime.
Reviewed By: dzhulgakov
Differential Revision: D5100730
fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
Summary: These return views in Python 3 which would not do anything in a lot of usages currently present in Caffe2. This diff simply removes (almost) all usages of these two in Caffe2 and sub projects in favor of comprehensions which are also easier to read/understand
Reviewed By: akyrola
Differential Revision: D5142049
fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
Summary: Set up a server node that periodically gathers values of all nodes' perf counters, allowing to publish them at once.
Reviewed By: dzhulgakov
Differential Revision: D4555116
fbshipit-source-id: 8e49ac8353b52b2be82aedf305762478e7fa687a
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates a concept of "compiled" Job, that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling.
Reviewed By: dzhulgakov
Differential Revision: D4554799
fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984
Summary:
Previously we had several limitations for a reporter net:
- needed to be a net, not an execution step
- only one allowed per execution step, with a single interval
Now, "reporter nets" become repoter steps and multiple of them can be specified with different timeouts.
Reviewed By: dzhulgakov
Differential Revision: D4583686
fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d
Summary:
Outline of changes:
- add single-operator support to Caffe2-Flow integration (based on Alisson's suggestions)
- because of above support we can move graph construction to the main workflow body and pass the job to the Flow operator doing running, similarly to the distributed case
- after that it's easy to unify code even more
- there's some trickery required to make sure model exporting doesn't pollute Cluster info (as TaskGroup.to_task() creates new tasks)
Important: this diff changes train_local behavior by introducing queue between preprocessing and trainer (before we did everything on trainer thread). It doesn't seem to impact perf much (even slightly positive), so I guess it's fine. It also allows for better unification.
I'll follow up with a separate diff that moves max_examples gating to multi_reader (including train_local) and then we can enable checkpointing.
Reviewed By: xianjiec
Differential Revision: D4526079
fbshipit-source-id: 8c44044f45e7738e9b13e5b3acfbb994bc5a3d72
Summary:
- NetBuilder now honors its name
- When Nets are created in the context of a NetBuilder, they take NetBuilder's name as prefix
- When a NetBuilder is created in the context of a Task, it takes the Tasks's name.
- pipe() now tries to find a good name based on its processor's, output or input queue's name.
- RPC tries to find a name from its handler's name.
- Better names in DataStream
- net_printer prints the name of Tasks and Steps
- net_printer optionally factors out common prefixes form blob names.
Differential Revision: D4527578
fbshipit-source-id: 5d3d1237c186e9576313c5aa01cc8800a9051217
Summary: This allows to have a task-local report net before the Task is created. To be used in global counter (diff soon)
Reviewed By: dzhulgakov
Differential Revision: D4497771
fbshipit-source-id: 24ec7c8e95466abbd83fbea79b58717d81201857
Summary: See distributed.py for example of usage
Reviewed By: xianjiec
Differential Revision: D4467723
fbshipit-source-id: c74f71bebaa1751098379838d3da55945aac62bd