Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63068
The caffe2 core.Net constructor can accept a caffe2_pb2.NetDef proto, but it always creates a copy. This is wasteful when we can prove that the proto being passed to it will not be used anywhere else. So we add an "inplace" argument to the `core.Net` constructor that allows clients to give away ownership of the passed proto without copying. We default this argument to `False`, ensuring that behavior does not change unless explicitly requested.
Test Plan: Let CI run.
Differential Revision: D29976510
fbshipit-source-id: 26e13ca76f3431b8ef0de51f08bbf263491d323e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47530
`Net.AddExternalInput` should raise if there are duplicate names. The previous code would only raise if the addition of duplicates was in separate calls, but not if it was in the same call.
Test Plan:
Added two new regression tests
```
✓ Pass: caffe2/caffe2/python:core_test - testSetInputRecordWithBlobs (caffe2.caffe2.python.core_test.TestExternalInputs) (9.622)
✓ Pass: caffe2/caffe2/python:core_test - testAddExternalInputShouldRaiseIfDuplicate (caffe2.caffe2.python.core_test.TestExternalInputs) (9.639)
✓ Pass: caffe2/caffe2/python:core_test - testSetInputRecordWithoutBlobs (caffe2.caffe2.python.core_test.TestExternalInputs) (9.883)
✓ Pass: caffe2/caffe2/python:core_test - testAddExternalInputShouldRaiseIfDuplicateInSameCall (caffe2.caffe2.python.core_test.TestExternalInputs) (10.153)
```
Test trained 2 models. No issues
f230755456
f230754926
Reviewed By: dzhulgakov
Differential Revision: D24763586
fbshipit-source-id: c87088441d76f7198f8b07508b2607aec13521ed
Summary: Similar to If operator, AsyncIf also contains nets in args. It needs the same handling.
Test Plan:
New unit test test_control_op_remap
`buck test caffe2/caffe2/python:core_test`
Also it worked end to end in prototype of dist bulk eval workflow f226680903
Reviewed By: yyetim
Differential Revision: D24451775
fbshipit-source-id: 50594e2ab9bb457329ed8da7b035f7409461b5f6
Summary:
There is a module called `2to3` which you can target for future specifically to remove these, the directory of `caffe2` has the most redundant imports:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Found while trying to get RocM Caffe2 CI green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42168
Reviewed By: seemethere
Differential Revision: D22791879
Pulled By: malfet
fbshipit-source-id: 8f7ef9711bdc5941b2836e4c8943bb95c72ef8af
Summary:
Goal of this PR is to unify cuda and hip device types in caffe2 python front end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14221
Differential Revision: D13148564
Pulled By: bddppq
fbshipit-source-id: ef9bd2c7d238200165f217097ac5727e686d887b
Summary:
Original commit changeset: f5614a5d2607
D9986213 is causing Multifeed Aggregator a [huge performance different](https://our.intern.facebook.com/intern/ads/analyze_canary/412951953278781781/) and is blocking aggregator push since last Friday night: https://fburl.com/feedtools/b6izvwjz
We need to land this revert ASAP to unblock aggregator push.
Reviewed By: orionr
Differential Revision: D10123245
fbshipit-source-id: d83da8e00a1250f5d09811a0a587c127e377aab2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. Also a dummy net `task:output` was also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
This adding a dummy TaskOutput when user specifies no TaskOutput is a hack.
The reason for this is that ZMQ socket can't send empty blob list.
As a result, if the Task on the Worker had no output,
The master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOuput is at user layer. The hack shouldn't be exposed to user layer, polluting user workspaces.
Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.
Reviewed By: mraway
Differential Revision: D9566744
fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. Also a dummy net `task:output` was also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
This adding a dummy TaskOutput when user specifies no TaskOutput is a hack.
The reason for this is that ZMQ socket can't send empty blob list.
As a result, if the Task on the Worker had no output,
The master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOuput is at user layer. The hack shouldn't be exposed to user layer, polluting user workspaces.
Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.
Reviewed By: mraway
Differential Revision: D9413150
fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10528
adding 2 features to core and model_helper
- reroute_tensor which supports op insertion on net level
- model_helper complete net and cut net used for full graph analysis
Differential Revision: D9330345
fbshipit-source-id: 56341d3f500e72069ee306e20266c8590ae7985a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9636
Make sure that the blobs are registered to the net
Reviewed By: pjh5
Differential Revision: D8924883
fbshipit-source-id: f09422a2d4d5ba8bf6cfbfd00172097b5ab1fcd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9438
Current implementation of create_from_proto doesn't work as expected: it
duplicates networks and execution steps by copying original PlanDef first and
adding each step one-by-one later.
Reviewed By: pjh5
Differential Revision: D8850316
fbshipit-source-id: 9b02836d6e6ee1c91cfdd3b4c4804f14137dc22b
* Adding instance weight to batch distill loss
as title
* add bfloat 16-31
added bfloat 16-31 and their respective unit tests
* [CUDA9] Upgrade - fbcode
CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But with time growing it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching fbcode TARGETS file (adding nvcc flag). These two should be a bit easier to rebase (for detailed procedure see "Test Plan").
This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)
* Share intermediate int32 buffer across Conv ops
Adding a known type
* [C2 fix] infer function for ensure_cpu_output_op
this is adding the missing device funtion for ensure_cpu_output_op
* [int8] Add blob serializer/deserializer for Int8TensorCPU
To export to logfiledb
* [nomnigraph] Add try catch block to optimization passes in predictor
This will catch failures that happen in the optimization pass.
* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE
CAFFE_ENFORCE uses strack trace fetcher. Which is currently a
global static variable. If at static initialization time CAFFE_ENFORCE
is used, this is a SIOF. Recently CAFFE_ENFORCE was added into init
functions registration, so we started to see this.
Meyers singleton is going to provide safety here. If stacktrace
fetcher was not registered yet, it will just use a dummy one.
* NUMA support in SparseNN CPU benchmark
Adding support for NUMA in SparseNN CPU benchmark
* [mobile-roofline] Add logging needed for roofline model
This should be all that's needed
* Let the operators using the same input if the operators are not chained
or else, we have to change the input data dims
* fix null-pointer-use UBSAN errors in in reshape_op.h
* revert previous fix on input blob name
as title
* Adding flag to let MineHardNegative automatically extract single value from dict
Model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allow automatic extraction of the single element of dict, which is a common use case.
* Reverting change that broke internal tests back to OSS compatible state
* [bootcamp] Improve "Shape" operator to support axes specification
To improve .shape operator of Caffe2 to support x.shape(tensor, axes), which takes an optional int array "axes" as input. For example, x.shape(tensor, [1, 0]) will return the dimension for axis 1 and 0 following the specified order. For current version, "axes" input allows duplications and can have arbitrary length.
* Back out "Add barrier net that runs before training nets"
Original commit changeset: b373fdc9c30f. Need additional changes to some callers to support barrier failures.
* Change warning to verbose log to reduce log spam
The `LOG(WARNING)` was a bit spammy for regular use so lets just make it a `VLOG`.
* Extract the shared code from different caffe2_benchmark binaries
The OSS benchmark and Internal benchmark will share most functions in the benchmark.
* Support MFR in sequence training
As titled.
* Make knowledge distillation work with using logged prediction feature as teacher label.
1) Add loading raw dense feature as teacher label.
2) Optional calibration function for teacher label
3) Add teacher label into generic unit test
4) Deprecated TTSN workflow version using feature_options to config teacher label
* [C2/CUDA]: unjoined cross entropy sigmoid
as desc
* Add async_scheduling executor into deferrable_net_exec_test
Add async_scheduling into tests and fix some exception cases
* Fix Event disabled error
When disabling event in RNN ops make sure we don't call Finish on disabled
event from op's RunAsync
* cuda ensure cpu output op can handle both TensorCPU and TensorCUDA
as desc.
* [C2 Core] Infer input device option in C2 hypothesis_test checkers
Improve how we default input blob device options.
Previously it defaults as where op lives but it is not necessarily the case.
For example:
CopyCPUToGPU
* [C2 Op]SplitByLengthsOp CPU/GPU implementation
[C2 Op]SplitByLengthsOp CPU/GPU implementation
* fix undefined symbol error
not sure why we're getting undefined symbol even with link_whole = True
Need to figure out why but need this workaround for now
* Add tools in DAIPlayground platform to help debugging models
Add additional tools to allow Plauground override individual method defined in AnyExp. This will allow user to create module that specificly change certain default method behavior. An example included in this diff is deactivating test model and checkpointing. When debugging any model problems, switching off components helps me quickly narrow down the location of the bug. The technique is extensively used in task T27038712 (Steady memory increase in EDPM, eventually resulting in gloo/cuda.cu:34: out of memory)
* add shape and type inference for int8 conversion operator
* Fix flaky test for group_norm
Fix flaky test for group_norm
* Fix group_norm_op_test flaky
Fix group_norm_op_test flaky
* Implementation of composite learning rate policy
In many state-of-the-arts deep learning works, people use a simple trick to
schedule the learning rate: use a fixed learning rate until error plateaus
and then switch to a different fixed learning rate, and so on. In this diff,
we implemented a simple version of the composite learning rate. The user gives
a set of learning rates policies and corresponding iteration nums, and the
optimizer will change the learning rate policy based on the number of iterations so far.
For example, the user give two learning rate policies, one is FixedLearningRate
and PolyLearningRate, with an iteration number of 1k. Then the first 1k iteration,
we use FixedLearningRate. For the following iterations, we use PolyLearningRate.
* Split two use cases of CachedReader into two classes, DBFileReader and CachedReader
# Use Cases:
1). input: DB file -> output: DatasetReader.
Use DBFileReader.
2). input: Reader -> build cache DB file -> output: DatasetReader.
Use CachedReader.
# Changes to CachedReader:
1). Move db_path to the constructor.
Because in mock reader. cache will always be built ahead.
# Changes to tests:
1). Make a separate TestCase class for CachedReader and DBFileReader.
2). Make it possible to add more test functions by adding setUp, tearDown and _make_temp_path.
3). Make delete db_path more general. `db_path` could be a file for `log_file_db`, but could also be a directory for `leveldb`.
* Back out "On Mobile phones, call GlobalInit with no arguments in predictor in case we need to perform initialization"
Original commit changeset: 4489c6133f11
* Fix LARS bug
Fixed a bug in the LARS implementation which caused all subsequent blobs not using LARS to have the LARS learning rate multiplier applied to them.
* [tum] support sparse init & add uniformFill option
as title
* Propagate exception for async nets
Capture the exception when an exception is thrown in async nets and re-throw it after wait(). This allows exceptions to be propagated up to the caller.
This diff was a part of D7752068. We split the diff so that C2 core files changes are in a separate diff.
* Automatic update of fbcode/onnx to 69894f207dfcd72d1e70497d387201cec327efbc
Previous import was 403ccfbd0161c38f0834413d790bad0874afbf9a
Included changes:
- **[69894f2](https://github.com/onnx/onnx/commit/69894f2)**: Use op schema.all tensor types in random like definitions (#865) <Scott McKay>
- **[b9d6b90](https://github.com/onnx/onnx/commit/b9d6b90)**: Clarify random like operators (#846) <Scott McKay>
- **[fc6b5fb](https://github.com/onnx/onnx/commit/fc6b5fb)**: Refactor shape inference implementation (#855) <anderspapitto>
- **[b7d8dc8](https://github.com/onnx/onnx/commit/b7d8dc8)**: fix cmake warning message (#863) <Eric S. Yu>
- **[f585c5d](https://github.com/onnx/onnx/commit/f585c5d)**: add pytorch-operator test for tile (#831) <Wenhao Hu>
- **[993fe70](https://github.com/onnx/onnx/commit/993fe70)**: add install step (#832) <Eric S. Yu>
- **[68bc26c](https://github.com/onnx/onnx/commit/68bc26c)**: add type inference for traditional ml ops except classifier ops. (#857) <Ke Zhang>
- **[9cc0cda](https://github.com/onnx/onnx/commit/9cc0cda)**: fix string representation of scalar types (#858) <G. Ramalingam>
- **[1078925](https://github.com/onnx/onnx/commit/1078925)**: fix y in pow test case to scalar (#852) <Wenhao Hu>
- **[c66fb6f](https://github.com/onnx/onnx/commit/c66fb6f)**: Add some math function shape inference (#845) <anderspapitto>
- **[ff667d1](https://github.com/onnx/onnx/commit/ff667d1)**: Refactor return type and docs for ONNXIFI_BACKEND_DIRECTX_ID (#853) <Marat Dukhan>
- **[11c6876](https://github.com/onnx/onnx/commit/11c6876)**: clear initializer names when clear initializer (#849) <Wenhao Hu>
- **[73c34ae](https://github.com/onnx/onnx/commit/73c34ae)**: Clarify FeatureVectorizer description. (#843) <Scott McKay>
- **[1befb9b](https://github.com/onnx/onnx/commit/1befb9b)**: Remove useless text in docs (#850) <Lu Fang>
- **[e84788f](https://github.com/onnx/onnx/commit/e84788f)**: Fix SELU attributes' default values (#839) <Lu Fang>
- **[ebac046](https://github.com/onnx/onnx/commit/ebac046)**: Add tile test case (#823) <Wenhao Hu>
- **[8b7a925](https://github.com/onnx/onnx/commit/8b7a925)**: a few more shape inference functions (#772) <anderspapitto>
- **[9718f42](https://github.com/onnx/onnx/commit/9718f42)**: Make the coefficient non optional for LinearClassifier (#836) <Jaliya Ekanayake>
- **[ef083d0](https://github.com/onnx/onnx/commit/ef083d0)**: Add save_tensor and load_tensor functions for Protos (#770) <Lu Fang>
- **[45ceb55](https://github.com/onnx/onnx/commit/45ceb55)**: Check if CMAKE_BUILD_TYPE set before project(). (#812) <Sergii Dymchenko>
- **[4b3d2b0](https://github.com/onnx/onnx/commit/4b3d2b0)**: [WIP] reenable shape inference tests (#834) <anderspapitto>
- **[22d17ee](https://github.com/onnx/onnx/commit/22d17ee)**: RNN tests: LSTM, GRU, SimpleRNN (#739) <Peyman Manikashani>
- **[de65b95](https://github.com/onnx/onnx/commit/de65b95)**: dimension denotation (#443) <Tian Jin>
- **[eccc76e](https://github.com/onnx/onnx/commit/eccc76e)**: fix field number issue in onnx operator proto and enable its build (#829) <Ke Zhang>
- **[d582beb](https://github.com/onnx/onnx/commit/d582beb)**: disable shape inference test to unbreak ci (#830) <Lu Fang>
- **[485b787](https://github.com/onnx/onnx/commit/485b787)**: function proto for composite op. (#802) <Ke Zhang>
- **[cd58928](https://github.com/onnx/onnx/commit/cd58928)**: specify defaults for attributes of Affine op (#820) <G. Ramalingam>
- **[7ee2cf9](https://github.com/onnx/onnx/commit/7ee2cf9)**: merge the dummy backend back into the main one (#743) <anderspapitto>
- **[1c03a5a](https://github.com/onnx/onnx/commit/1c03a5a)**: [Proposal] ONNX Interface for Framework Integration (previously ONNX Backend API) header and docs (#551) <Marat Dukhan>
- **[3769a98](https://github.com/onnx/onnx/commit/3769a98)**: Rename real model test case from VGG-16 to ZFNet (#821) <Lu Fang>
* [C2]ReluN Op
relu n op.
tf reference: https://www.tensorflow.org/api_docs/python/tf/nn/relu6
* Call destructor when assigning a blob value
* Add executor overrides
Add executor overrides flag to enable migration to async_scheduling executor
* Add barrier net that runs before training nets - attempt #2
Add a synchonize barrier net that is run before training nets. With this net, shards that are faster will wait for other shards before start training. This reduce chances of the faster shards timing out during GLOO AllReduce.
Removed explicit data_parallel_model.py.synchronize call in holmes workflow.
This change was landed previously but caused errors for some EDPM workflows - See https://fb.facebook.com/groups/1426530000692545/permalink/1906766366002237/ - because EDPM assumes any call to CreateOrCloneCommonWorld and Gloo ops are wrapped in exception handlers but in this case exception thrown in the barrier init net is not handled.
To address this issue, we add _CreateOrCloneCommonWorld to the param_init_net instead of a new barrier init net. Since errors for param_init_net run is handled gracefully and re-rendezvous, it should fixes the problem.
* Handle empty nets in async_scheduling
Make sure we don't get stuck on empty nets
* use CUDA_ARCH for conditional compile
* [C2 fix] infer function for ensure_cpu_output_op
* Update group_norm test to reduce flaky test
* Fix lr_multiplier for GPU
* [fix] Re-enable events in RNN ops
We have earlier added event disabling in RNN ops as back then we didn't use
events, with current use cases this is no longer true
(https://fburl.com/8vd0lp8y)
* use ops with cude impl
* Revert D7729695: [caffe2][fix] Re-enable events in RNN ops
This reverts commit 4b215c7496fb724656ff4c776933a15bdbbcde5e
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [observer] Clean up observer_config.h
#accept2ship
* [1/n] Refactor dataio_test.py
Replace code duplication with a common function
* Add barrier net that runs before training nets
Add a synchonize barrier net that is run before training nets. With this net, shards that are faster will wait for other shards before start training. This reduce chances of the faster shards timing out during GLOO AllReduce.
Removed explicit data_parallel_model.py.synchronize call in holmes workflow. Similar change in speech/asr_training workflow will come in another diff.
* Support the dnnlowp backend in caffe2_benchmark
This is for SHARE operator latency evaluation
* Migrate integral_image_op to main caffe2
migrate integral_image_op(GPU version) given by https://fburl.com/yvqezigi
to caffe2/caffe2/operators and implement its CPU version. Write up a test
using the hypothesis_test mechanism
* [pos_disc, fbcode] Implement unjoined lr loss
As explained in https://our.intern.facebook.com/intern/wiki/Model_Based_Calibration/, when the dataset is an joined data set, where labels might change later, we need to use unjoined logloss.
The implementation is almost the same as in Sigrid (https://fburl.com/1trngsls), where
loss = y (log(p) - log(1-p)) + (1-y)(log(1-p)) = xy - (1-y)x - (1-y)log(1+exp(-x))
For x < 0, to ensure stability and avoid overflow, we reformulate the above exp as
loss = xy - (1-y)x - (1-y)x + (1-y)log(1+exp(x)) = xy + (1-y)log(1+exp(x))
Then the final expression becomes
loss = xy + (y - 1) x (x >= 0) - (1 - y) log(1 + exp(x - 2 x (x >= 0)))
where y is the true label, x is the dot product and p = logistic(x).
This kind of implementation is align with the current implementation of the original cross entropy in
https://phabricator.intern.facebook.com/diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/cross_entropy_op.cc;0bae3b5d0f825897c5e0dd0ff10f489d7271bf25$7-13
* Keep the array to fix the conflict
* [C2] Compute Adagrad effective LR
The AdagradWithLR op outputs an extra blob which is contains the average effective learning rate across all weights in this blob.
* Open-source extractMetaNetDef & runGlobalInitialization, add new Predictor constructor from db file, and add run_map_outputs
1. Open-source extractMetaNetDef and runGlobalInitialization, for use in
2. new Predictor constructor from db file.
3. Add new run function that returns outputs as TensorMap
* Disable eigen cpu
Disable eigen cpu in transpose and reduce
* Introduce request_only/object_only property of ModelLayer
by default this is False
* A simple TC Caffe2 benchmark
We can run tunner, get MappingOptions and then use them to
compare against cuBLAS
currently broken due to LLVM issues. How to run:
hg checkout eec1ab31b59c03b8deded1c755a9abaf8c45be01
add D7401202
add D7434625
add D7506031
add D7540728
buck run @mode/dev-nosan tc/tc/benchmarks_python:caffe2_benchmark
* Move Caffe2 feature_maps_ops to open source
Need feature maps operators in open source project facebookresearch/BlueWhale
* Manually fix the conflicts in channel shuffle op
* Fix the inconsistency between different gh and fbcode
* Skip Adagrad GPU Test (Because some gpu implementation is missing)
* Fix another test to make sure it won't run on gpu when implementation is not available yet
* [GanH][Easy]: Add assertion to adaptive weighting layer
0 weight causes numeric instability and exploding ne
* [Easy] Add cast op before computing norm in diagnose options
As LpNorm only takes floats we add a manual casting here.
* Introduce a new caching device allocator
`cudaMalloc` and `cudaFree` calls are slow, and become slower the
more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock
because GPU memory is transparently shared across all GPUs. Normally, this
isn't much of a concern since workloads allocate memory upfront, and reuse it
during later computation.
However, under some computation models (specifically, memory conserving
approaches like checkpoint-and-recompute, see
https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9)
this assumption is no longer true. In these situations, `cudaMalloc` and
`cudaFree` are common and frequent. Furthermore, in data parallel contexts,
these calls happen at nearly the same time from all GPUs worsening lock
contention.
A common solution to this problem is to add a custom allocator. In fact,
nVIDIA provides one out of the box: CUB, which Caffe2 already supports.
Unfortunately, the CUB allocator suffers from very high fragmentation. This is
primarily because it is a "buddy" allocator which neither splits nor merges
free cached blocks. Study
https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you
want to convince yourself.
This diff adapts a caching allocator from the Torch codebase
https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp
which does splitting and merging and ends up working really well, at least for
workloads like the checkpoint-and-recompute computation models noted above.
I simplified the implementation a little bit, made it a bit more C++-like. I
also removed a bunch of stream synchronization primitives for this diff. I
plan to add them back in subsequent diffs.
* Report reader progress in fblearner workflows
Integrate with fblearner progress reporting API and add support to report training progress from reader nodes.
If reader is constructed with batch limits, report based on finished batch vs total batch. The finished batch may be more than total batch because we evaludate if we should stop processing everytime we dequeue a split.
If no limit for the reader, report based on finished splits (Hive files) vs total splits. This is fairly accurate.
* [GanH][Diagnose]: fix plotting
1. ganh diagnose needs to set plot options
2. modifier's blob name is used for metric field can need to be fixed before
generating net
* Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8
* Make CompositeReader stops as soon as one reader finishes
Previously, CompositeReader calls all readers before stopping. It results in flaky test since the last batch may be read by different threads; resulting in dropped data.
* [dper] make sure loss is not nan
as desc.
* [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign
Thanks for finding this @stzpz and @wangyanghan. Looks like NHWC is more
optimized. For OCR though it doesn't yet help since NHWC uses more mem b/w but
will soon become important.
* Intra-op parallel FC operator
Intra-op parallel FC operator
* [C2 Proto] extra info in device option
passing extra information in device option
design doc: https://fb.quip.com/yAiuAXkRXZGx
* Unregister MKL fallbacks for NCHW conversions
* Tracing for more executors
Modified Tracer to work with other executors and add more tracing
* Remove ShiftActivationDevices()
* Check for blob entry iff it is present
When processing the placeholders ops, ignore if the blob is not present in the blob_to_device.
* Internalize use of eigen tensor
Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries.
* feature importance for transformed features.
* - Fix unused parameter warnings
The changes in this diff comments out unused parameters.
This will allow us to enable -Wunused-parameter as error.
#accept2ship
* add opencv dependencies to caffe2
The video input op requires additional opencv packages. This is to add them to
cmake so that it can build
* Add clip_by_value option in gradient clipping
Add clip_by_value option in gradient clipping
when the value is bigger than max or smaller than min, do the clip
* std::round compat
This is required to support placeholder/decorator ops which does not have operator schema. Note that the change is made in such a way that it is a no-op if placeholder Ops are not used.
Changes:
1. Since the placeholder ops always run on CPU, added a utility to infer placeholder ops blob devices.
2. Placeholder op's input/output blobs should be on CPU as well. This change takes care of dealing with output blobs - i.e. use blobs on CPU.
3. Added a Unit test - test_inject_copy_placeholder_ops
Changes:
=======
1. Added device inference functions for Concat and Split Ops.
2. Added a unit test to validate the change. See, test_device_inference_function in core_test.py
3. Fixed some formatting.
Summary: Currently, the device_option equality is done in a specialized private function. Ideally, we should be able to test the equality from other places in the code and have a more detailed check for the equality.
Reviewed By: akyrola
Differential Revision: D6316608
fbshipit-source-id: c3fd085583e535d7936d05e4c8b15d2eff91c744
Summary: If a blob is copy from device A to device B in the init_net, and then is used as an external_input in the train_net, we want the train_net to correctly use the blob already on device B instead of copying it over and over again.
Reviewed By: akyrola
Differential Revision: D5800870
fbshipit-source-id: d93f44bba80e4ed70eb03183d552496b54a966b5
Summary:
Convert from PlanDef ProtoBuf into python Plan object by recursively creating
Nets and ExecutionSteps.
Also support running Plan object directly in Session.
Reviewed By: azzolini
Differential Revision: D5608393
fbshipit-source-id: c0ae3b6da743a759af6db3b614a5a3935fe0b34c
Summary: fixing the case where the init net will initialize same blob twice. I made an exception by allowing inplace blob among ops if the blob keeps on the same device. This should fix this problem in a generalized way as most of our training is only on CPU now.
Reviewed By: dzhulgakov
Differential Revision: D5450564
fbshipit-source-id: 525c4c9a2e5216a70dbd1229da2d9f8a58b89e47
Summary:
Last time I used uuid filled into OperatorDef. And operator_tracebacks was populated using traceback.extract_stack. There were several issues with this approach:
1. A random field in OperatorDef breaks workflows relying on memoization, i.e. when computation is skipped based on already computed result before.
2. Adding one more field revealed RNNs being non forward compatible wrt to new fields in there. prototxt format seems to not allow forward compatibility (thanks jamesr66a for the investigation!). For RNNs we need to swtich them to a more resilient approach. azzolini's proposed change to OperatorDef / NetDef would allow that by just nesting NetDef dirrectly inside OperatorDef without need for extra serialization.
3. traceback.extract_stack is very slow when executable is on a remote filesystem. It does one or more os.stat for each frame on the stack. For some cases it ended up being up to 15 extra minutes on model construction.
In this diff I use a different approach which should fix all those problems above.
1.2. are solved by not adding a new field at all. Instead I report operator idx wrt to a net it runs in. Thanks akyrola and dzhulgakov for the idea. Downside here is that operator list manipulation breaks the logic and separately created ops are not covered at all.
3. I solved this by operating on raw frames without using traceback and inspect modules which end up doing a lot of file system calls. See function extract_stacktace in core.py with additional comments.
Reviewed By: dzhulgakov
Differential Revision: D5286285
fbshipit-source-id: 626dd0f5f6b8b1d86bd6bf519078b122f43ddcaa
Summary:
a few issues:
1. Randomization hurts memoization
1. Even if we make it non random, then we can get key colisions when loading it back.
2. RNNs use prototxt for step net and apparently its not forward compatible like normal protobuf is
I am thinking of a better less invasive solution now.
Reviewed By: jamesr66a
Differential Revision: D5272118
fbshipit-source-id: ab577fad04fbfc632e1fceffa923377a0d3da1be
Summary: This is going to show a python Caffe2 user where a failed operator was created. Motivation for having this information not right in protobuf is to avoid having it too verboose and keep ability to read protobufs of a net after a simple print() call.
Reviewed By: jamesr66a
Differential Revision: D5226047
fbshipit-source-id: 7edfe850e05a2ec209577142aa3368664a57a108
Summary: Recently people find that this test is too strict because of proto string matching. Thus, I change it to compare fields so that this test will not complain even if protobuf chnaged in future.
Reviewed By: dzhulgakov
Differential Revision: D5229855
fbshipit-source-id: 54efcd7a0f9e5dbba1ddeb480801abcb859e07bd
Summary:
`brew_test.py` is just plain broken. `core_test.py` doesn't work with pytest. `apmeter_test.py` and `top_k_test.py` don't work for CUDA builds.
Closes https://github.com/caffe2/caffe2/pull/765
Differential Revision: D5211817
Pulled By: Yangqing
fbshipit-source-id: 78ec5af35a3fa870978e4c9590210ade9e3bc5ac
Summary:
This diff plan to attack the problem where we want to just annotate device option for operators and leave Caffe2 to help us inject cross device copy functions. This feature would be useful for mixed device training and multi device training with several nets, where previously we do the heavy lifting of adding copy functions ourselves.
Ideally, this feature will happen like this:
//construct your nets first
core.InjectDeviceCopyAmongNets([train_init, train_net, ...])
My ideas are written in comments. I will update them here as well later.
Reviewed By: dzhulgakov
Differential Revision: D5134103
fbshipit-source-id: 173f7da9d1773d1c50ccdc27f1b5cd3067b04af5
Summary: Infer input and output device from OperatorDef through OperatorSchema. This is inspired by shape inference. With this feature, we can easily analysis device information for all blobs in the net in a generic way. It is really helpful for auto cross device execution.
Reviewed By: akyrola, dzhulgakov
Differential Revision: D5161065
fbshipit-source-id: ee656123112171a4ca00f2fb3f6940f32ddf3135
Summary:
Make it easier for users by returning from ExtractPredictorNet the list of blobs that must be saved/exported to run a predictor net. Added a test for ExtractPredictorNet
Codemod.
Reviewed By: asaadaldien
Differential Revision: D5176097
fbshipit-source-id: b1af42132459487b8d94fcdde0e4c514da608243