Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48340
This changes the context managed classes from using a decorator to define them to using inheritance. Inheritance allows the python static type checking to work correctly.
```
@context.define_context()
class Bar(object): ...
@context.define_context(allow_default=True)
class Foo(object): ...
```
becomes
```
class Bar(context.Managed): ...
class Foo(context.DefaultManaged): ...
```
Behavior differences:
* arg_name has been removed since it's not used anywhere
* classes need to call `super()` in `__enter__/__exit__` methods if they override (none do)
This also defines a context.pyi file to add types for Python 3. Python 2 support should not be affected.
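A minimal sketch of the inheritance pattern, for illustration only. The class names `Managed` and `DefaultManaged` come from this change, but the bodies below are an assumption about how such a base class could work, not the actual caffe2 code; `Scope` and `Config` are hypothetical subclasses. It also shows an overriding `__enter__` calling `super()`.

```python
import threading


class Managed:
    """Illustrative base class: each subclass gets a per-thread stack."""

    _local = threading.local()

    @classmethod
    def _stack(cls):
        # Lazily create the per-thread mapping from class to stack.
        stacks = getattr(cls._local, "stacks", None)
        if stacks is None:
            stacks = cls._local.stacks = {}
        return stacks.setdefault(cls, [])

    @classmethod
    def current(cls):
        stack = cls._stack()
        if not stack:
            raise RuntimeError("no active %s context" % cls.__name__)
        return stack[-1]

    def __enter__(self):
        type(self)._stack().append(self)
        return self

    def __exit__(self, *exc_info):
        type(self)._stack().pop()
        return False


class DefaultManaged(Managed):
    """Like Managed, but creates a default instance when none is active."""

    @classmethod
    def current(cls):
        stack = cls._stack()
        if not stack:
            stack.append(cls())
        return stack[-1]


class Scope(Managed):  # hypothetical subclass
    def __init__(self):
        self.entered = False

    def __enter__(self):
        self.entered = True
        # Overrides must delegate, or the instance is never pushed.
        return super().__enter__()


class Config(DefaultManaged):  # hypothetical subclass
    pass
```

Because `Scope` and `Config` are ordinary subclasses, a static type checker can see `current()`, `__enter__` and `__exit__` on them without understanding a decorator.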
Test Plan:
ci
buck test //caffe2/caffe2/python:context_test //caffe2/caffe2/python:checkpoint_test
Reviewed By: dongyuzheng
Differential Revision: D25133469
fbshipit-source-id: 16368bf723eeb6ce3308d6827f5ac5e955b4e29a
Summary: `__repr__` calling self.tasks() ends up marking the instance as "used", which doesn't seem appropriate. I was debugging a value being passed around and then ran into `Cannot add Task to an already used TaskGroup.` because the value had been logged once.
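A simplified sketch of the problem and the fix. The class below is a hypothetical stand-in, not the real TaskGroup; the attribute names `_tasks` and `_already_used` are assumptions made for the example.

```python
class TaskGroup:
    """Toy stand-in: calling tasks() freezes the group."""

    def __init__(self):
        self._tasks = []
        self._already_used = False

    def add(self, task):
        if self._already_used:
            raise RuntimeError("Cannot add Task to an already used TaskGroup.")
        self._tasks.append(task)

    def tasks(self):
        # Public accessor: handing out the tasks marks the group as used.
        self._already_used = True
        return list(self._tasks)

    def __repr__(self):
        # Read the private list directly so that logging has no side effect.
        return "TaskGroup(tasks={!r}, used={})".format(
            self._tasks, self._already_used
        )
```

With this shape, `repr(group)` can be logged any number of times and `group.add(...)` still works afterwards.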
Test Plan:
Added a unit test -- didn't see a clean public method to test it, but I'm happy to add one if that makes sense.
Will wait for sandcastle to trigger everything else; I'm not at all familiar with this code so any other recommendations would be great!
Reviewed By: cryptopic
Differential Revision: D23541198
fbshipit-source-id: 5d1ec674a1ddaedf113140133b90e0da6afa7270
Summary:
The `2to3` tool has a `future` fixer that you can target specifically to remove these imports. The `caffe2` directory has the most redundant imports:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15250
This adds `__repr__` methods to all of the classes under task.py. This makes the objects much easier to interact with when using them in an interactive manner, such as in a Jupyter notebook.
The default `__repr__` method just returns the object ID which is very unhelpful.
Reviewed By: hanli0612
Differential Revision: D13475758
fbshipit-source-id: 6e1b166ec35163b9776c797b6a2e0d002560cd29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14342
Sometimes, when we are creating a TaskGroup, we are in fact creating a TaskGroup for a distributed job. In some cases, we may want to register a few nets as "remote" to a TaskGroup. Each remote net should carry sufficient attributes describing where it should be executed later on.
This diff adds the remote net attribute to the TaskGroup class. It exposes two minimal functionalities: adding a remote net, and getting all remote nets added to a TaskGroup.
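The two functionalities could look roughly like the sketch below. The class name `RemoteNetRegistry` and the method names are hypothetical, chosen for the example rather than taken from the diff, and the nets are represented as plain dicts.

```python
class RemoteNetRegistry:
    """Toy stand-in for the remote-net part of a TaskGroup."""

    def __init__(self):
        self._remote_nets = []

    def add_remote_net(self, net):
        # The net is only recorded here; it is executed elsewhere, later.
        self._remote_nets.append(net)

    def remote_nets(self):
        # Return a copy so callers cannot mutate the registry by accident.
        return list(self._remote_nets)
```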
Reviewed By: d4l3k
Differential Revision: D13188320
fbshipit-source-id: efe947aec30817e9512a5e18be985713b9356bdc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net `task:output` is also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for it is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the Worker had no output,
the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.
Instead, we should move the creation of the dummy blob to some deeper layer,
and remove the dummy blob from the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and has no side effects for users.
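The idea can be sketched as follows. This is an illustration under simplifying assumptions, not the code in this diff: the workspace is a plain dict, `send` is any callable standing in for the transport, and the blob name `_tmp/dummy_output` is hypothetical.

```python
def send_outputs(workspace, output_names, send):
    """Send task outputs, padding with a temporary blob when there are none."""
    names = list(output_names)
    dummy = None
    if not names:
        # The transport cannot carry an empty list, so add a placeholder.
        dummy = "_tmp/dummy_output"
        workspace[dummy] = 0
        names = [dummy]
    try:
        send([(name, workspace[name]) for name in names])
    finally:
        # Remove the placeholder so the user's workspace is left unchanged.
        if dummy is not None:
            del workspace[dummy]
```

The placeholder exists only for the duration of the send, so comparing the workspace before and after yields an exact match.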
Reviewed By: mraway
Differential Revision: D9566744
fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net `task:output` is also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for it is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the Worker had no output,
the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.
Instead, we should move the creation of the dummy blob to some deeper layer,
and remove the dummy blob from the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and has no side effects for users.
Reviewed By: mraway
Differential Revision: D9413150
fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
Summary:
Convert from PlanDef ProtoBuf into python Plan object by recursively creating
Nets and ExecutionSteps.
Also support running Plan object directly in Session.
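A rough sketch of the recursive conversion. Plain dicts stand in for the protobuf messages, and the `Plan` and `Step` classes as well as the keys `execution_step`, `substep` and `network` are assumptions made for the example, not the actual API.

```python
class Step:
    def __init__(self, name, nets, substeps):
        self.name = name
        self.nets = nets
        self.substeps = substeps


class Plan:
    def __init__(self, name, steps):
        self.name = name
        self.steps = steps


def step_from_def(step_def):
    """Build a Step, recursing into nested substeps."""
    return Step(
        name=step_def["name"],
        nets=list(step_def.get("network", [])),
        substeps=[step_from_def(s) for s in step_def.get("substep", [])],
    )


def plan_from_def(plan_def):
    """Build a Plan from its top-level execution steps."""
    steps = [step_from_def(s) for s in plan_def.get("execution_step", [])]
    return Plan(plan_def["name"], steps)
```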
Reviewed By: azzolini
Differential Revision: D5608393
fbshipit-source-id: c0ae3b6da743a759af6db3b614a5a3935fe0b34c
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the python side: no need to clone nets and add prefixes to blob names
- Faster start-up: we had cases of complex plans that took up to 30min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime.
Reviewed By: dzhulgakov
Differential Revision: D5100730
fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
Summary: These return views in Python 3, which would not do anything in many of the usages currently present in Caffe2. This diff simply removes (almost) all usages of these two in Caffe2 and subprojects in favor of comprehensions, which are also easier to read and understand.
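A small illustration of the pitfall, assuming "these two" refers to the built-ins `map` and `filter` (the summary does not name them). In Python 3 both return lazy iterators, so a call made only for its side effects does nothing.

```python
seen = []

# In Python 3 this only builds a lazy iterator; seen.append is never called.
map(seen.append, [1, 2, 3])
calls_after_map = len(seen)

# An explicit loop runs eagerly and performs the side effect.
for x in [1, 2, 3]:
    seen.append(x)

# Comprehensions produce real lists and read more directly.
evens = [x for x in range(6) if x % 2 == 0]  # instead of filter(...)
squares = [x * x for x in evens]             # instead of map(...)
```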
Reviewed By: akyrola
Differential Revision: D5142049
fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
Summary: Set up a server node that periodically gathers the values of all nodes' perf counters, allowing them to be published at once.
Reviewed By: dzhulgakov
Differential Revision: D4555116
fbshipit-source-id: 8e49ac8353b52b2be82aedf305762478e7fa687a
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates the concept of a "compiled" Job, which pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling.
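A toy sketch of the idea, not the code in this diff. The `Job` class and `compile_job` function below are hypothetical, and UTF-8 bytes stand in for serialized protobufs.

```python
import pickle


class Job:
    """Hypothetical job object that holds something pickle cannot handle."""

    def __init__(self, name, step_names):
        self.name = name
        self.step_names = step_names
        self.on_exit = lambda: None  # a lambda attribute defeats pickle


def compile_job(job):
    """Keep only plain, already-serialized data from the job."""
    return {
        "name": job.name,
        "steps": [s.encode("utf-8") for s in job.step_names],
    }
```

Calling `pickle.dumps` on the `Job` fails, while the result of `compile_job` round-trips through pickle because it contains only strings and bytes.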
Reviewed By: dzhulgakov
Differential Revision: D4554799
fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984
Summary:
Previously we had several limitations for a reporter net:
- needed to be a net, not an execution step
- only one allowed per execution step, with a single interval
Now, "reporter nets" become reporter steps, and multiple of them can be specified with different timeouts.
Reviewed By: dzhulgakov
Differential Revision: D4583686
fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d
Summary:
Outline of changes:
- add single-operator support to Caffe2-Flow integration (based on Alisson's suggestions)
- because of the above support, we can move graph construction to the main workflow body and pass the job to the Flow operator that does the running, similarly to the distributed case
- after that it's easy to unify code even more
- there's some trickery required to make sure model exporting doesn't pollute Cluster info (as TaskGroup.to_task() creates new tasks)
Important: this diff changes train_local behavior by introducing a queue between preprocessing and the trainer (before, we did everything on the trainer thread). It doesn't seem to impact perf much (if anything, slightly positively), so I guess it's fine. It also allows for better unification.
I'll follow up with a separate diff that moves max_examples gating to multi_reader (including train_local) and then we can enable checkpointing.
Reviewed By: xianjiec
Differential Revision: D4526079
fbshipit-source-id: 8c44044f45e7738e9b13e5b3acfbb994bc5a3d72
Summary:
- NetBuilder now honors its name
- When Nets are created in the context of a NetBuilder, they take NetBuilder's name as prefix
- When a NetBuilder is created in the context of a Task, it takes the Task's name.
- pipe() now tries to find a good name based on the name of its processor, output queue, or input queue.
- RPC tries to find a name from its handler's name.
- Better names in DataStream
- net_printer prints the name of Tasks and Steps
- net_printer optionally factors out common prefixes from blob names.
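Factoring out a common prefix from blob names can be sketched as below. This is an illustration, not the net_printer code; the function name `factor_out_prefix` and the assumption that names are `/`-separated are mine.

```python
import os


def factor_out_prefix(names):
    """Split names into (shared prefix, list of shortened names)."""
    prefix = os.path.commonprefix(names)
    # Cut back to the last separator so a name component is never split.
    prefix = prefix[: prefix.rfind("/") + 1]
    return prefix, [name[len(prefix):] for name in names]
```

For example, `["task/net/a", "task/net/ab"]` shares the character prefix `task/net/a`, which is trimmed back to `task/net/` before the names are shortened.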
Differential Revision: D4527578
fbshipit-source-id: 5d3d1237c186e9576313c5aa01cc8800a9051217
Summary: This allows a task-local report net to be set up before the Task is created. To be used in the global counter (diff soon).
Reviewed By: dzhulgakov
Differential Revision: D4497771
fbshipit-source-id: 24ec7c8e95466abbd83fbea79b58717d81201857
Summary: See distributed.py for example of usage
Reviewed By: xianjiec
Differential Revision: D4467723
fbshipit-source-id: c74f71bebaa1751098379838d3da55945aac62bd