Commit Graph

34 Commits

Bugra Akyildiz
27c7158166 Remove __future__ imports for legacy Python 2 support (#45033)
Summary:
The `2to3` tool has a `future` fixer that removes these imports specifically; the `caffe2` directory has the most redundant imports:

```
2to3 -f future -w caffe2
```
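On Python 3 these `__future__` imports are no-ops, which is why the fixer can drop them safely; a minimal illustration:

```python
# On Python 3, __future__ features are already the default, so these
# imports change nothing and `2to3 -f future` can remove them safely.
from __future__ import absolute_import, division, print_function

# True division is the default in Python 3 regardless of the import.
print(1 / 2)  # 0.5
```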

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033

Reviewed By: seemethere

Differential Revision: D23808648

Pulled By: bugra

fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
2020-09-23 17:57:02 -07:00
Shihao Xu
b834d9107e Revert D9566744: [New Checkpoint] Kill the dummy TaskOutput when task.get_step() (#11164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11164

Revert D9566744

Reviewed By: enosair

Differential Revision: D9620272

fbshipit-source-id: 6a78c46929f66bd11969840cb6b107f734be0c02
2018-08-31 22:25:57 -07:00
Shihao Xu
ad1670cf54 Kill the dummy TaskOutput when task.get_step() (#11048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739

I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added, along with a dummy net `task:output`. See https://fburl.com/937lf2yk

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.

Instead, we should move the creation of the dummy blob to some deeper layer,
and remove the dummy blob from the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent, with no side effects for users.
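The workaround described above can be sketched in plain Python (stand-in names, not the actual caffe2/ZMQ code): the worker injects a sentinel output when the task has none, and the receiving side strips it before results reach the user layer.

```python
SENTINEL = "task:output/ConstIntFill:0"  # dummy blob name from the summary

def worker_outputs(task_outputs):
    # The transport can't send an empty blob list, so inject a dummy
    # output when the user-defined task produced nothing.
    return task_outputs if task_outputs else [SENTINEL]

def master_receive(blobs):
    # Strip the dummy blob before results reach the user workspace,
    # so the workaround stays invisible to callers.
    return [b for b in blobs if b != SENTINEL]

assert master_receive(worker_outputs([])) == []
assert master_receive(worker_outputs(["loss"])) == ["loss"]
```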

Reviewed By: mraway

Differential Revision: D9566744

fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
2018-08-29 20:11:29 -07:00
Zhanibek Datbayev
22e3b2c9c3 Revert D9413150: [New Checkpoint] Kill the dummy TaskOutput when task.get_step()
Differential Revision: D9413150

Original commit changeset: 51aaf3201e26

fbshipit-source-id: ac7c4c0960db03f344fe3eb2ad7f0e034db2371a
2018-08-29 14:39:49 -07:00
Shihao Xu
6ca28984c7 Kill the dummy TaskOutput when task.get_step() (#10739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739

I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added, along with a dummy net `task:output`. See https://fburl.com/937lf2yk

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.

Instead, we should move the creation of the dummy blob to some deeper layer,
and remove the dummy blob from the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent, with no side effects for users.

Reviewed By: mraway

Differential Revision: D9413150

fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
2018-08-28 20:41:46 -07:00
Bram Wasti
aa56a1211d Update from facebook (#6871)
* Track checkpoint performance in scuba

As title.

* [C2/CUDA]: fix cross entropy sigmoid with logits

when adding log_d_trick, I forgot to add it to the cuda impl; this diff fixes
it.

* Back out "[caffe2] Unregister MKL fallbacks for NCHW conversions"

Original commit changeset: 8918dd40205a
Will land after @jongsoo's diff https://phabricator.intern.facebook.com/D7596315 lands

* [Easy][C2] Don't add blob to external outputs from output_record if it's already external output

As desc.

* On Mobile phones, call GlobalInit with no arguments in predictor in case we need to perform initialization

FACEBOOK:

The QPL logger needs the initialization code. In the past, the initialization code was put in the pipelines calling Caffe2. However, those places become obsolete quickly, as product teams change where they call Caffe2 from time to time. We also need to track which teams use Caffe2 so that we can put the initialization code there.

With this diff, the initialization code is put in the predictor constructor, only enabled for mobile phones. This way, we can always enable QPL logging.

Once we do this, we can check how many times Caffe2 inference is called in production, and which models are more popular in production. This way, we can prioritize our effort supporting those models.

Will clean up the old code calling the init in the product in a separate diff.

* add padding op for sparse length tensor

to pad length-based sparse tensor with padding_value

* Add conv_op with cudaconvnet engine

Add conv_op with cudaconvnet engine

* [numa] Fix simple NUMA copy benchmark

Move XavierFill into init_net and also compute BW

* call roundf (device function) instead of round (host function)

* [caffe2_benchmark][observer] Make caffe2_benchmark use its own observer

1. Add ClearGlobalNetObservers()
2. Make caffe2_benchmark use its own observer and observer_reporter

* [detectron] Use roundf instead of round in the detectron module ops

* allow K larger than number of elements in top k op

One use case is to use this op together with PackSegments for sparse tensors, where the number of elements in each slice is not statically defined.

* add ChannelShuffle DNNLOWP op

* fixup math_cpu.cc break
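The padding op for length-based sparse tensors mentioned above can be illustrated with a small Python sketch (hypothetical helper, not the actual caffe2 operator):

```python
def pad_sparse_lengths(values, lengths, padding_value=0):
    """Pad each length-delimited segment of `values` to the longest
    segment, returning a dense list of equal-sized rows."""
    rows, offset = [], 0
    max_len = max(lengths) if lengths else 0
    for n in lengths:
        segment = values[offset:offset + n]
        rows.append(segment + [padding_value] * (max_len - n))
        offset += n
    return rows

# Segments of lengths 2, 1 and 3 padded to length 3.
assert pad_sparse_lengths([1, 2, 3, 4, 5, 6], [2, 1, 3]) == \
    [[1, 2, 0], [3, 0, 0], [4, 5, 6]]
```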
2018-04-23 15:01:56 -07:00
Qinqing Zheng
90586d925f [DT] [38/n] Rename add_stop_signal to add_stop_condition (#6825)
att
2018-04-23 10:39:37 -07:00
Qinqing Zheng
66791f54d5 Update the compile function of Job (#6323) 2018-04-09 22:44:23 -07:00
Qinqing Zheng
fd2e7cb487 Change JobRunner's __call__ function to train (#6205) 2018-04-02 21:04:36 -07:00
Qinqing Zheng
365652229d Back out "Revert D7372460: [DT] [28/n] Lift epoch_limiter"
Original commit changeset: b0a986d16c3b
2018-03-30 21:00:44 -07:00
Andrey Malevich
f8eb8a66e2 Revert D7372460: [DT] [28/n] Lift epoch_limiter
This reverts commit 05bd9bec10fad5ff9dc40be88836fd7274d50ce9

@bypass-lint

An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
2018-03-30 21:00:44 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Qinqing Zheng
1288c4fd79 refactor epoch_limiter (#2389)
* refactor epoch_limiter

* fix test
2018-03-22 20:32:13 -07:00
Qinqing Zheng
90a3363f29 Return an empty TaskGroup if node managers exist in MultiNodeCheckpointManager
Summary: Currently MultiNodeCheckpointManager returns None in this case, yet in JobRunner we assume this function returns a valid task group, i.e. we call session.run(self.checkpoint_manager.init(...)) directly. This fails when we use LocalHostScheduler and reuse a MultiNodeCheckpointManager.
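The fix follows the null-object pattern: returning an empty, runnable task group instead of `None` lets callers run the result unconditionally. A minimal sketch with stand-in classes (not the actual caffe2 API):

```python
class TaskGroup:
    def __init__(self, tasks=None):
        self.tasks = tasks or []

    def run(self):
        for task in self.tasks:
            task()

class MultiNodeCheckpointManager:
    def __init__(self, node_managers=None):
        self._node_managers = node_managers

    def init(self):
        # Before: returned None when node managers already existed,
        # crashing callers that did session.run(manager.init()).
        # After: return an empty TaskGroup that is safe to run.
        if self._node_managers is not None:
            return TaskGroup()
        self._node_managers = {}
        return TaskGroup([lambda: None])  # stand-in for real init tasks

mgr = MultiNodeCheckpointManager(node_managers={})
mgr.init().run()  # harmless no-op instead of AttributeError on None
```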

Reviewed By: azzolini

Differential Revision: D6843450

fbshipit-source-id: a7ec942cfe692f19e8751b0078ae6a6108f29e54
2018-01-30 19:20:50 -08:00
Wei Zhang
1d4e996b87 Separate parameter downloading tasks from training tasks and run them in a different group
Summary:
At the end of distributed training, trainer needs to download the parameters back from parameter servers for saving the model. Currently, this parameter downloading happens at the end of job's epoch task group, which creates several problems when checkpointing is enabled for distributed training:

1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks run to collect parameters, but we won't save the model until the true end of training, so this is a big waste of resources.
2. After trainer0 downloads the parameters, they take a lot of memory, so trainer0 can easily run out of memory in the next training epoch.

Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group.
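The resulting job structure can be sketched as an ordered list of task groups (hypothetical names mirroring the summary, not the actual caffe2 Job API):

```python
def build_job_groups(num_epochs):
    # Before: parameter download ran at the end of every epoch group.
    # After: it runs exactly once, between the training epochs and the
    # exit group that saves the model.
    groups = ["epoch_group"] * num_epochs
    groups.append("download_group")   # download parameters exactly once
    groups.append("exit_group")       # save the model, clean up
    return groups

assert build_job_groups(3) == [
    "epoch_group", "epoch_group", "epoch_group",
    "download_group", "exit_group",
]
```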

Reviewed By: azzolini

Differential Revision: D6765393

fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
2018-01-22 14:04:12 -08:00
Aarti Basant
33d734fcf1 Generalize construction of db_name in checkpoint manager
Summary:
Instead of constructing db_name as a member of checkpoint_manager, generalize
this function

Reviewed By: anshulverma

Differential Revision: D6671088

fbshipit-source-id: c528538def66933619f2fdf67820bca5d13571ea
2018-01-10 11:49:17 -08:00
Aarti Basant
8af9f0da99 Saving checkpoint failure should not cause job failure
Summary:
If we encounter failures while writing a checkpoint, ensure that the job does
not fail.
A job can still make progress even if writing a checkpoint fails.
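A minimal sketch of the behavior (hypothetical helper names): failures while writing a checkpoint are caught and logged instead of propagating and killing the job.

```python
import logging

def save_checkpoint_best_effort(save_fn, epoch):
    """Attempt to write a checkpoint; on failure, log and keep going.

    A job can still make progress even if a checkpoint write fails,
    so the exception must not propagate.
    """
    try:
        save_fn(epoch)
        return True
    except Exception:
        logging.exception("Checkpoint save failed at epoch %d; continuing", epoch)
        return False

def broken_save(epoch):
    raise IOError("disk full")

assert save_checkpoint_best_effort(broken_save, 7) is False
assert save_checkpoint_best_effort(lambda e: None, 8) is True
```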

Reviewed By: anshulverma, boryiingsu

Differential Revision: D6615163

fbshipit-source-id: 01f790422e1a81bab1fe73f86750eaf75a72bb77
2017-12-21 10:32:55 -08:00
Bor-Yiing Su
e0fa72455d Fixes the checkpoint test.
Summary:
We need to use Cluster to isolate the definition of the nodes.
Otherwise, the contexts are polluted and the run becomes
stateful.

Reviewed By: Yangqing

Differential Revision: D6140404

fbshipit-source-id: 09d1c86ef12bb01eaa16b1dade4d2e1e93be287a
2017-10-26 13:18:21 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Bor-Yiing Su
1d70a2276d Changes the checkpoint naming rules.
Summary: So far we format the epoch name with 6 digits, but this is constraining. In order to have consistent naming, we can simply append the epoch to the suffix. Then we will have consistent naming rules for both small and large epoch numbers.
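The change can be illustrated as follows (hypothetical helpers; the "." separator is an assumption, only the 6-digit format comes from the summary):

```python
def db_name_old(prefix, epoch):
    # Old rule: zero-pad the epoch to 6 digits -- naming becomes
    # inconsistent once epochs exceed 999999.
    return "{}.{:06d}".format(prefix, epoch)

def db_name_new(prefix, epoch):
    # New rule: simply append the epoch to the suffix, so small and
    # large epoch numbers follow the same naming convention.
    return "{}.{}".format(prefix, epoch)

assert db_name_old("model_checkpoint", 5) == "model_checkpoint.000005"
assert db_name_new("model_checkpoint", 5) == "model_checkpoint.5"
assert db_name_new("model_checkpoint", 1234567) == "model_checkpoint.1234567"
```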

Reviewed By: azzolini

Differential Revision: D5653871

fbshipit-source-id: acdf26a14b731347bb85fe2f33c1b89e2ba83bdd
2017-08-17 22:16:42 -07:00
Yiming Wu
6e22427929 fix tci complaining test - test_load_model_from_checkpoints
Summary:
Travis CI is complaining about test_load_model_from_checkpoints in recent PRs.
E: AssertionError: 'trainer:1/task/GivenTensorInt64Fill:0, a C++ native class of type nullptr (uninitialized).' != array([103])
See for example https://travis-ci.org/caffe2/caffe2/jobs/265665119
Reason unknown yet. Disable the test first, then try to fix it.

Reviewed By: Yangqing

Differential Revision: D5655068

fbshipit-source-id: 10949339ec92b0a4c2f0e59246040f1b0510be12
2017-08-17 17:50:42 -07:00
Bor-Yiing Su
30616ee309 Fixes the broken checkpoint test.
Summary:
Since we temporarily disable checkpointing the readers, we need to
rename all the node names in the test to make it pass.

Reviewed By: azzolini

Differential Revision: D5640930

fbshipit-source-id: 1e61be31ddf9b6e28efd2eb8e6e91e63dcd83154
2017-08-16 11:24:50 -07:00
Bor-Yiing Su
8a5bdc383e Fixes the flaky upload test
Summary:
The LocalSession does not work with the multi-node definitions.
The test becomes flaky because of that. The fix is to create
a different LocalSession for each Node(), and run each node
sequentially.

Differential Revision: D5617857

fbshipit-source-id: a8079a90291b4c8b5aa6b471c33c06d18e59976c
2017-08-11 18:58:24 -07:00
Bor-Yiing Su
404f8ee9b4 Extends the jobrunner to support uploading checkpoints.
Summary:
1. Adds one more step in the JobRunner class to upload checkpoints.
2. Adds one function to return the name of the checkpoint given
the name of the node.

Reviewed By: andrewwdye

Differential Revision: D5597130

fbshipit-source-id: 570a55785e6227859e1115326d6cab077f0e7f72
2017-08-11 14:17:17 -07:00
Alisson Gusatti Azzolini
7d482742fd Allow tasks/execution_steps to be cloned at runtime
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the python side: no need to clone nets and add prefixes to blob names
- Faster start-up: we had cases of complex plans that took up to 30min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime.

Reviewed By: dzhulgakov

Differential Revision: D5100730

fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
2017-06-20 22:32:07 -07:00
Bor-Yiing Su
c1420330b2 Fixes the checkpoint test.
Summary:
Diff D5224410 initializes the should_stop_blob explicitly. With that, we will
have one more blob when executing the job. Adjust the check accordingly.

Reviewed By: azzolini

Differential Revision: D5228398

fbshipit-source-id: 439b186c30b0b1d0e41e513babbcccd85e7a1b4a
2017-06-12 12:19:14 -07:00
Thomas Dudziak
60c78d6160 Fixes range/xrange for Python 3
Summary: As title

Differential Revision: D5151894

fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638
2017-06-07 00:04:26 -07:00
Bor-Yiing Su
81a55f441c Adds interfaces to check the existence of a DB
Summary:
To evaluate on checkpoints, we often need to load from multiple checkpoints.
However, it is inconvenient to always check for the existence of
a checkpoint manually. This adds interfaces to check the existence of a DB
so that we can find available checkpoints automatically.
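The idea can be sketched as a probe loop over candidate checkpoint DBs (stand-in `db_exists` and naming callbacks, not the actual interface):

```python
def find_available_checkpoints(db_exists, db_name_for_epoch, max_epoch):
    """Probe candidate checkpoint DBs and return the epochs that exist,
    so callers no longer have to check for each checkpoint manually."""
    return [epoch for epoch in range(1, max_epoch + 1)
            if db_exists(db_name_for_epoch(epoch))]

present = {"ckpt.1", "ckpt.2", "ckpt.4"}
found = find_available_checkpoints(
    db_exists=lambda name: name in present,
    db_name_for_epoch=lambda e: "ckpt.{}".format(e),
    max_epoch=5,
)
assert found == [1, 2, 4]
```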

Reviewed By: azzolini

Differential Revision: D4823876

fbshipit-source-id: e5a65b736ac2addd0447c4add81dbd0986f422e7
2017-04-11 14:07:49 -07:00
Bor-Yiing Su
8f9cd757db Skips the initialization phase of the individual checkpoint objects.
Summary:
The initialization phase of each checkpoint object simply loads the names of
the blobs in the checkpoints. When we load from the checkpoints, the names of
the blobs are given. We can skip this init step.

Reviewed By: azzolini

Differential Revision: D4808114

fbshipit-source-id: 4c740049c1014f3e93b4b87f43e3937afdefa25a
2017-03-31 10:10:56 -07:00
Bor-Yiing Su
0e6413f8ea Fix flaky test
Summary:
Somehow the stress-runs flag does not work as I expected.
Now the test finally passes.

Reviewed By: azzolini

Differential Revision: D4797559

fbshipit-source-id: 1e46844e9ae55c331c2e265a59dc550983274213
2017-03-29 16:48:20 -07:00
Bor-Yiing Su
a03d956b56 Fixes the flaky test. Although we create nets in three different nodes,
Reviewed By: azzolini

Differential Revision: D4788418

fbshipit-source-id: bdf90c5674b5dbb8b3bda21cf85ea33fedb36fa6
2017-03-28 13:48:07 -07:00
Bor-Yiing Su
7fa4acab9b Loads only the model blobs from the checkpoints.
Summary:
To evaluate from checkpoints, we need to load a model from the checkpoints.
However, the checkpoints store way more blobs than the blobs needed by the
model. This function enables the model builder to load only the blobs
associated with the model to the workspace. After that, the model builder
can evaluate the model from the populated workspace.
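The filtering step can be sketched as follows (hypothetical helper; a checkpoint is modeled as a plain dict of blob name to value):

```python
def load_model_blobs(checkpoint_blobs, model_blob_names):
    """Copy into the workspace only the blobs the model needs, ignoring
    the many extra blobs (optimizer state, etc.) stored in a checkpoint."""
    wanted = set(model_blob_names)
    return {name: blob for name, blob in checkpoint_blobs.items()
            if name in wanted}

checkpoint = {"w": [1.0], "b": [0.1], "optimizer/momentum_w": [9.9]}
workspace = load_model_blobs(checkpoint, ["w", "b"])
assert workspace == {"w": [1.0], "b": [0.1]}
```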

Reviewed By: azzolini

Differential Revision: D4751414

fbshipit-source-id: a7a420228d681fc2dcfd8573cf69a97b1abc2ef3
2017-03-27 10:02:11 -07:00
Alisson Gusatti Azzolini
6ff05fd49d Fix issues pickling jobs
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This introduces the concept of a "compiled" Job, which stores only protobufs of the Jobs to be executed, avoiding any issues with pickling.
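The "compiled" Job idea can be sketched as: serialize the execution plans to byte strings up front, so the object that crosses process boundaries holds only plain, picklable data (stand-in types; the real code would use protobuf `SerializeToString()` rather than `str.encode`).

```python
import pickle

def compile_job(plans):
    """Return a 'compiled' job: just the serialized plan payloads
    (stand-in: strings encoded to bytes), which always pickle cleanly."""
    return {"plans": [p.encode("utf-8") for p in plans]}

compiled = compile_job(["init_plan", "epoch_plan", "exit_plan"])
# Unlike a live Job holding nets and sessions, plain bytes round-trip
# through pickle without trouble.
assert pickle.loads(pickle.dumps(compiled)) == compiled
```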

Reviewed By: dzhulgakov

Differential Revision: D4554799

fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984
2017-02-21 20:47:27 -08:00
Alisson Gusatti Azzolini
14a5b35805 Snapshot -> Checkpoint
Summary: As per kennyhorror's request.

Reviewed By: kennyhorror

Differential Revision: D4473177

fbshipit-source-id: 6cab6ccf247b09aab8f6f056c807bd3ed27ee6a5
2017-01-27 22:29:32 -08:00