pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Andrey Malevich	f8eb8a66e2	Revert D7372460: [DT] [28/n] Lift epoch_limiter This reverts commit 05bd9bec10fad5ff9dc40be88836fd7274d50ce9 @bypass-lint An infra SEV is better than not reverting this diff. If you copy this password, see you in SEV Review! @cause_a_sev_many_files	2018-03-30 21:00:44 -07:00
Orion Reblitz-Richardson	1d5780d42c	Remove Apache headers from source. * LICENSE file contains details, so removing from individual source files.	2018-03-27 13:10:18 -07:00
Qinqing Zheng	1288c4fd79	refactor epoch_limiter (#2389) * refactor epoch_limiter * fix test	2018-03-22 20:32:13 -07:00
Qinqing Zheng	90a3363f29	Return an empty TaskGroup if node managers exist in MultiNodeCheckpointManager Summary: Currently, MultiNodeCheckpointManager returns None in this case, yet in JobRunner we assume this function returns a valid task group, i.e. we call session.run(self.checkpoint_manager.init(...)) directly. This fails when we use LocalHostScheduler and reuse a MultiNodeCheckpointManager. Reviewed By: azzolini Differential Revision: D6843450 fbshipit-source-id: a7ec942cfe692f19e8751b0078ae6a6108f29e54	2018-01-30 19:20:50 -08:00
Wei Zhang	1d4e996b87	Separate parameter downloading tasks from training tasks and run them in a different group Summary: At the end of distributed training, the trainer needs to download the parameters back from the parameter servers to save the model. Currently, this parameter downloading happens at the end of the job's epoch task group, which creates several problems when checkpointing is enabled for distributed training: 1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks run to collect parameters, but we won't save the model until the true end of training, so the intermediate downloads waste resources. 2. After trainer0 downloads the parameters, these parameters take a lot of memory, so trainer0 can easily run out of memory in the next epoch of training. Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group. Reviewed By: azzolini Differential Revision: D6765393 fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49	2018-01-22 14:04:12 -08:00
Aarti Basant	33d734fcf1	Generalize construction of db_name in checkpoint manager Summary: Instead of constructing db_name as a member of checkpoint_manager, generalize this function Reviewed By: anshulverma Differential Revision: D6671088 fbshipit-source-id: c528538def66933619f2fdf67820bca5d13571ea	2018-01-10 11:49:17 -08:00
Aarti Basant	8af9f0da99	Saving checkpoint failure should not cause job failure Summary: If we encounter failures while writing a checkpoint, ensure that the job does not fail. A job can make progress even if writing a checkpoint fails Reviewed By: anshulverma, boryiingsu Differential Revision: D6615163 fbshipit-source-id: 01f790422e1a81bab1fe73f86750eaf75a72bb77	2017-12-21 10:32:55 -08:00
Bor-Yiing Su	e0fa72455d	Fixes the checkpoint test. Summary: We need to use Cluster to isolate the definition of the nodes. Otherwise, the contexts are polluted and the run becomes stateful. Reviewed By: Yangqing Differential Revision: D6140404 fbshipit-source-id: 09d1c86ef12bb01eaa16b1dade4d2e1e93be287a	2017-10-26 13:18:21 -07:00
Yangqing Jia	8286ce1e3a	Re-license to Apache Summary: Closes https://github.com/caffe2/caffe2/pull/1260 Differential Revision: D5906739 Pulled By: Yangqing fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902	2017-09-28 16:22:00 -07:00
Bor-Yiing Su	1d70a2276d	Changes the checkpoint naming rules. Summary: So far we have formatted the epoch name with 6 digits, but this is constraining. In order to have consistent naming, we can simply append the epoch to the suffix. Then we will have consistent naming rules for small and for large epoch numbers. Reviewed By: azzolini Differential Revision: D5653871 fbshipit-source-id: acdf26a14b731347bb85fe2f33c1b89e2ba83bdd	2017-08-17 22:16:42 -07:00
Yiming Wu	6e22427929	fix tci complaining test - test_load_model_from_checkpoints Summary: Travis CI is complaining about test_load_model_from_checkpoints in recent PRs. E: AssertionError: 'trainer:1/task/GivenTensorInt64Fill:0, a C++ native class of type nullptr (uninitialized).' != array([103]) See for example https://travis-ci.org/caffe2/caffe2/jobs/265665119 Reason unknown yet. First disable this test, then try to fix it. Reviewed By: Yangqing Differential Revision: D5655068 fbshipit-source-id: 10949339ec92b0a4c2f0e59246040f1b0510be12	2017-08-17 17:50:42 -07:00
Bor-Yiing Su	30616ee309	Fixes the broken checkpoint test. Summary: Since we temporarily disable checkpointing the readers, we need to rename all the node names in the test to make it pass. Reviewed By: azzolini Differential Revision: D5640930 fbshipit-source-id: 1e61be31ddf9b6e28efd2eb8e6e91e63dcd83154	2017-08-16 11:24:50 -07:00
Bor-Yiing Su	8a5bdc383e	Fixes the flaky upload test Summary: The LocalSession does not work with the multi-node definitions. The test becomes flaky because of that. The fix is to create a different LocalSession for each Node(), and run each node sequentially. Differential Revision: D5617857 fbshipit-source-id: a8079a90291b4c8b5aa6b471c33c06d18e59976c	2017-08-11 18:58:24 -07:00
Bor-Yiing Su	404f8ee9b4	Extends the jobrunner to support uploading checkpoints. Summary: 1. Adds one more step in the JobRunner class to upload checkpoints. 2. Adds one function to return the name of the checkpoint given the name of the node. Reviewed By: andrewwdye Differential Revision: D5597130 fbshipit-source-id: 570a55785e6227859e1115326d6cab077f0e7f72	2017-08-11 14:17:17 -07:00
Alisson Gusatti Azzolini	7d482742fd	Allow tasks/execution_steps to be cloned at runtime Summary: Advantages of cloning the tasks/execution_steps at runtime: - Less complexity on the python side: no need to clone nets and add prefixes to blob names - Faster start-up: we had cases of complex plans that took up to 30min to be created. - Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs. - Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime. Reviewed By: dzhulgakov Differential Revision: D5100730 fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc	2017-06-20 22:32:07 -07:00
Bor-Yiing Su	c1420330b2	Fixes the checkpoint test. Summary: Diff D5224410 initializes the should_stop_blob explicitly. With that, we will have one more blob when executing the job. Adjusts the check accordingly. Reviewed By: azzolini Differential Revision: D5228398 fbshipit-source-id: 439b186c30b0b1d0e41e513babbcccd85e7a1b4a	2017-06-12 12:19:14 -07:00
Thomas Dudziak	60c78d6160	Fixes range/xrange for Python 3 Summary: As title Differential Revision: D5151894 fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638	2017-06-07 00:04:26 -07:00
Bor-Yiing Su	81a55f441c	Adds interfaces to check the existence of a DB Summary: To evaluate on checkpoints, we often need to load from multiple checkpoints. However, it is inconvenient if we always need to check the existence of a checkpoint manually. Adds interfaces to check the existence of a DB so that we can find available checkpoints automatically. Reviewed By: azzolini Differential Revision: D4823876 fbshipit-source-id: e5a65b736ac2addd0447c4add81dbd0986f422e7	2017-04-11 14:07:49 -07:00
Bor-Yiing Su	8f9cd757db	Skips the initialization phase of the individual checkpoint objects. Summary: The initialization phase of each checkpoint object simply loads the names of the blobs in the checkpoints. When we load from the checkpoints, the names of the blobs are given. We can skip this init step. Reviewed By: azzolini Differential Revision: D4808114 fbshipit-source-id: 4c740049c1014f3e93b4b87f43e3937afdefa25a	2017-03-31 10:10:56 -07:00
Bor-Yiing Su	0e6413f8ea	Fix flaky test Summary: Somehow the stress-runs flag does not work as I expected. Now the test finally passes. Reviewed By: azzolini Differential Revision: D4797559 fbshipit-source-id: 1e46844e9ae55c331c2e265a59dc550983274213	2017-03-29 16:48:20 -07:00
Bor-Yiing Su	a03d956b56	Fixes the flaky test. Although we create nets in three different nodes, Reviewed By: azzolini Differential Revision: D4788418 fbshipit-source-id: bdf90c5674b5dbb8b3bda21cf85ea33fedb36fa6	2017-03-28 13:48:07 -07:00
Bor-Yiing Su	7fa4acab9b	Loads only the model blobs from the checkpoints. Summary: To evaluate from checkpoints, we need to load a model from the checkpoints. However, the checkpoints store way more blobs than the blobs needed by the model. This function enables the model builder to load only the blobs associated with the model to the workspace. After that, the model builder can evaluate the model from the populated workspace. Reviewed By: azzolini Differential Revision: D4751414 fbshipit-source-id: a7a420228d681fc2dcfd8573cf69a97b1abc2ef3	2017-03-27 10:02:11 -07:00
Alisson Gusatti Azzolini	6ff05fd49d	Fix issues pickling jobs Summary: We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session. This creates a concept of "compiled" Job, that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling. Reviewed By: dzhulgakov Differential Revision: D4554799 fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984	2017-02-21 20:47:27 -08:00
Alisson Gusatti Azzolini	14a5b35805	Snapshot -> Checkpoint Summary: As per kennyhorror request. Reviewed By: kennyhorror Differential Revision: D4473177 fbshipit-source-id: 6cab6ccf247b09aab8f6f056c807bd3ed27ee6a5	2017-01-27 22:29:32 -08:00

24 Commits