Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53319
Noticed these in profiles.
Also switch to `unordered_map`.
Test Plan: Unit tests.
Reviewed By: swolchok
Differential Revision: D26504408
fbshipit-source-id: 9e14d55909a4af019058b8c27c67ee2348cd02a9
Summary:
Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.
Manually edited some references to the removed `CAFFE2_API` in:
* `CONTRIBUTING.md`
* `caffe2/proto/CMakeLists.txt`
* `cmake/ProtoBuf.cmake`
* `c10/macros/Export.h`
* `torch/csrc/WindowsTorchApiMacro.h`
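After the merge, the arrangement in `c10/macros/Export.h` reduces to a pattern roughly like the sketch below; the build-guard name is an assumption, not the verbatim header.
```cpp
// Hedged sketch, not the actual header: C10_EXPORT / C10_IMPORT wrap the
// platform visibility attributes (stubbed here for self-containment).
#ifndef C10_EXPORT
#define C10_EXPORT __attribute__((__visibility__("default")))
#define C10_IMPORT C10_EXPORT
#endif

// A single TORCH_API now marks the symbols CAFFE2_API used to cover.
#ifdef CAFFE2_BUILD_MAIN_LIB
#define TORCH_API C10_EXPORT
#else
#define TORCH_API C10_IMPORT
#endif

struct TORCH_API Example {};  // visible to both torch and caffe2 consumers
```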
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49496
Reviewed By: malfet, samestep
Differential Revision: D25600726
Pulled By: janeyx99
fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23096
Nets can have state that depends on the rest of the state in the Workspace. Hence, they should be destructed first.
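A minimal sketch of the destruction-order fix, with member names assumed:
```cpp
// Hedged sketch (member names assumed): destroy nets before blobs so a
// net's state never outlives the Workspace state it depends on.
#include <map>
#include <memory>
#include <string>

struct Blob {};
struct NetBase { /* may hold pointers into the Workspace's blobs */ };

class Workspace {
 public:
  ~Workspace() {
    net_map_.clear();   // nets first...
    blob_map_.clear();  // ...then the state they may reference
  }
 private:
  std::map<std::string, std::unique_ptr<NetBase>> net_map_;
  std::map<std::string, std::unique_ptr<Blob>> blob_map_;
};
```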
Reviewed By: ajyu
Differential Revision: D16382987
fbshipit-source-id: 3fd030ba206e2d0e897abb9e31c95bdaeb9482b7
Summary: Some automation to fix uninitialized members in caffe2 code. Ran a canary to make sure I don't have any regressions in prod, but I'm not sure how to test comprehensively for caffe2.
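The class of fix looks roughly like this; the type and members below are hypothetical, purely for illustration:
```cpp
// Hedged example: give every scalar member an in-class initializer so no
// instance is ever constructed with indeterminate values.
struct ConvArgs {
  int stride_h = 1;       // was: int stride_h;
  int stride_w = 1;       // was: int stride_w;
  bool legacy_pad = false;
  float* bias = nullptr;  // raw pointers default to null
};
```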
Reviewed By: ezyang
Differential Revision: D13776185
fbshipit-source-id: fb2a479971cc0276d8784be1c44f01252410bd24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714
This is a short change to enable the c10 namespace in caffe2. We did not enable
it before due to gflags global-variable confusion, but that should be mostly
cleaned up now. The plan on record is that namespace caffe2 and namespace aten
will be full supersets of namespace c10.
Most of the diff is codemod; the only two non-codemod changes are in caffe2/core/common.h, where
```
using namespace c10;
```
is added, and in Flags.h, where instead of creating aliasing variables in the c10 namespace, we put them directly in the global namespace to match gflags (with the same behavior when gflags is not built in).
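A minimal, self-contained sketch of the resulting arrangement (all names here are hypothetical, for illustration only):
```cpp
// Hedged sketch: caffe2 re-exports c10, and flag variables live in the
// global namespace so they line up with the symbols gflags would generate.
namespace c10 {
inline int example_util() { return 42; }
} // namespace c10

namespace caffe2 {
using namespace c10; // caffe2 becomes a superset of c10
} // namespace caffe2

// Matches gflags' convention: FLAGS_* in the global namespace, whether or
// not gflags itself is built in.
bool FLAGS_caffe2_example_flag = false;

int main() {
  return caffe2::example_util() == 42 && !FLAGS_caffe2_example_flag ? 0 : 1;
}
```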
Reviewed By: dzhulgakov
Differential Revision: D10390486
fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b
Summary:
This does 7 things:
- adds c10/util/Registry.h as the unified registry util
- cleans up some APIs such as the export condition
- fully removes aten/core/registry.h
- fully removes caffe2/core/registry.h
- removes a bogus aten/registry.h
- unifies all macros
- sets up registry testing in c10
Also, an important note: we used to mark the templated Registry class as EXPORT. This should not happen, because one should almost never export a template class; this PR fixes that.
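A minimal sketch of the unified pattern, with simplified signatures (the real header also handles priorities, help messages, and registration macros):
```cpp
// Hedged sketch of a creator registry (simplified from the real header).
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

template <class ObjectT, class... Args>
class Registry {
 public:
  using Creator = std::function<std::unique_ptr<ObjectT>(Args...)>;

  void Register(const std::string& key, Creator creator) {
    registry_[key] = std::move(creator);
  }
  std::unique_ptr<ObjectT> Create(const std::string& key, Args... args) {
    auto it = registry_.find(key);
    return it == registry_.end() ? nullptr : it->second(args...);
  }

 private:
  // The template itself carries no export macro; only the functions that
  // hand out singleton registry instances need to be exported.
  std::unordered_map<std::string, Creator> registry_;
};
```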
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12077
Reviewed By: ezyang
Differential Revision: D10050771
Pulled By: Yangqing
fbshipit-source-id: 417b249b49fed6a67956e7c6b6d22374bcee24cf
Summary:
TSIA. Right now we should basically use C10_EXPORT and C10_IMPORT for explicitly marking dllexport and dllimport, as a continued effort of the C10 unification.
This is a codemod, mechanically applying the following changes:
```
CAFFE2_{EXPORT,IMPORT} -> C10_{EXPORT,IMPORT}
AT_CORE_{EXPORT,IMPORT} -> C10_{EXPORT,IMPORT}
```
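The target macros reduce to the usual platform visibility attributes, roughly as in this sketch (the real header also handles static builds and per-library guards):
```cpp
// Hedged sketch of the C10 export macros, not the exact header.
#ifdef _WIN32
#define C10_EXPORT __declspec(dllexport)
#define C10_IMPORT __declspec(dllimport)
#else
#define C10_EXPORT __attribute__((__visibility__("default")))
#define C10_IMPORT C10_EXPORT
#endif

C10_EXPORT int answer() { return 42; }  // symbol visible across the DSO boundary
```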
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12019
Reviewed By: ezyang, teng-li
Differential Revision: D10016276
Pulled By: Yangqing
fbshipit-source-id: a420d62c43d1110105fc88f9e9076e28a3203164
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12043
Re-trying D9979976, this time with all call sites fixed.
It seems D9979976 got reverted because a call site wasn't covered by sandcastle.
I fixed it and used `grep` to ensure there aren't any more call sites in fbsource.
Reviewed By: ezyang
Differential Revision: D10026392
fbshipit-source-id: cd341514a8e53a40147ea0ee3e52f63bb6444157
Summary: Original commit changeset: 2ea17724e223
Differential Revision: D10026321
Ninja: stable broken
fbshipit-source-id: faf87cb7cc0f78c2c10d4aa6fceea279cd27acd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11923
This is pre-work to allow moving Blob to ATen/core, which cannot depend on caffe2 anymore.
(1) Removing the Blob -> Tensor dependency allows us to move Blob to ATen/core and use it inside IValue without having to wait for the Tensor merge to be complete.
(2) In the final Blob design, we want it to be a very small class that doesn't have any special treatment for Tensor (or, to be more correct, doesn't allow storing Tensor anymore), so this is the direction we want to go in any case.
This changes call sites that will have to be moved to IValue later. They cannot be moved to IValue directly because, for that, IValue first needs to be able to store Blob, which in turn requires this diff and some other changes coming up in future diffs.
Codemods:
```
$ codemod --extensions h,hpp,c,cpp,cc "([a-zA-Z0-9_]+)\\.IsTensorType\\(" "BlobIsTensorType(\\1, "
$ codemod --extensions h,hpp,c,cpp,cc "([a-zA-Z0-9_]+)->IsTensorType\\(" "BlobIsTensorType(*\\1, "
$ codemod --extensions h,hpp,c,cpp,cc "([a-zA-Z0-9_]+)\\.GetMutableTensor\\(" "BlobGetMutableTensor(\\1, "
$ codemod --extensions h,hpp,c,cpp,cc "([a-zA-Z0-9_]+)->GetMutableTensor\\(" "BlobGetMutableTensor(*\\1, "
```
It is, however, not only these codemods, because regex-based refactoring was only able to match a small number of the call sites. To catch more, I would have needed an AST-aware tool like clangr, which I didn't figure out how to use.
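For reference, a hedged sketch of the shape of the new call sites, with stand-in types rather than the real caffe2 declarations:
```cpp
#include <memory>
#include <typeinfo>

struct Tensor {};  // stand-in; the real class is much richer

// Stand-in Blob: type-erased storage with no Tensor-specific interface.
class Blob {
 public:
  template <class T>
  bool IsType() const { return type_ == &typeid(T); }
  template <class T>
  T* GetMutable() {  // get-or-construct semantics
    if (!IsType<T>()) { ptr_ = std::make_shared<T>(); type_ = &typeid(T); }
    return static_cast<T*>(ptr_.get());
  }
 private:
  std::shared_ptr<void> ptr_;  // shared_ptr keeps the correct deleter
  const std::type_info* type_ = nullptr;
};

// The former Blob methods become free functions, so Blob itself no longer
// depends on Tensor and can move to ATen/core.
bool BlobIsTensorType(const Blob& blob) { return blob.IsType<Tensor>(); }
Tensor* BlobGetMutableTensor(Blob* blob) { return blob->GetMutable<Tensor>(); }
```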
Reviewed By: ezyang
Differential Revision: D9979976
fbshipit-source-id: 2ea17724e223b5b73b44f99362727759ca689e61
Summary:
Since ONNX opset version >5, Reshape changed semantics to take a shape tensor as input instead of relying on the `shape` attribute to decide what shape to reshape to. The ONNXIFI op has been postponing this change because some backends, such as TensorRT, were not ready. Now that the backends have adopted the new semantics, we can remove the legacy mode and output opset version 7 ONNX models.
This change also flushes out some bugs and new requirements:
- Convert shape info into an int64 tensor
- Fix a bug where we output the shape tensor in the mapped workspace instead of the original workspace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10848
Reviewed By: houseroad
Differential Revision: D9495121
Pulled By: yinghai
fbshipit-source-id: a6f44a89274c35b33fae9a429813ebf21d9a3d1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10766
Added a `Workspace::ForEach(...)` API for accessing the global set of
existing Workspace instances. This is used in the signal handler to print blob
info on the thread receiving a fatal signal.
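A minimal sketch of how such global tracking can work, assuming a mutex-guarded registry (the actual `ForEach` signature may differ):
```cpp
// Hedged sketch: every Workspace registers itself in a global set, and
// ForEach visits all live instances, e.g. to dump blob info on a crash.
#include <functional>
#include <mutex>
#include <unordered_set>

class Workspace {
 public:
  Workspace() { std::lock_guard<std::mutex> g(mu_); all_.insert(this); }
  ~Workspace() { std::lock_guard<std::mutex> g(mu_); all_.erase(this); }

  static void ForEach(const std::function<void(Workspace*)>& fn) {
    std::lock_guard<std::mutex> g(mu_);
    for (Workspace* ws : all_) fn(ws);
  }

 private:
  static std::mutex mu_;
  static std::unordered_set<Workspace*> all_;
};
std::mutex Workspace::mu_;
std::unordered_set<Workspace*> Workspace::all_;
```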
Reviewed By: mraway
Differential Revision: D9147768
fbshipit-source-id: a94d0b5e6c88390a969ef259ecb8790173af01a4
Summary:
A continuation of https://github.com/pytorch/pytorch/pull/10504 for GPU, torch, etc. builds.
I was testing with
```
FULL_CAFFE2=1 python setup.py build_deps | tee ~/log.txt
cat ~/log.txt | egrep 'undefined refer' | sort | less
```
I'll rebase on master when Yangqing's changes in 10504 land, but I'm putting this up for some testing.
cc mingzhe09088 anderspapitto ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10507
Reviewed By: Yangqing
Differential Revision: D9359606
Pulled By: orionr
fbshipit-source-id: c2a3683b3ea5839689f5d2661da0bc9055a54cd2
Summary:
Properly annotated all APIs for the CPU front end. Checked with cmake using
```
cmake -DUSE_ATEN=ON -DUSE_CUDA=OFF -DBUILD_ATEN=ON
```
and the resulting libcaffe2.so has about 11k symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10504
Reviewed By: ezyang
Differential Revision: D9316491
Pulled By: Yangqing
fbshipit-source-id: 215659abf350af7032e9a4b0f28a856babab2454
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites; the core implementations will change later.
Before, the Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we make the device a runtime property (stored inside the tensor) but preserve the same semantics. For example, one has to specify a device type in order to create a Tensor: there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. `Tensor(DeviceType type)`.
2. The semantics of the constructor `Tensor(const Tensor<SrcContext>& src, ContextForCopy* context)` changed: the second context is passed in to enable us to call the templated Copy function. Previously it could be a different context from source and target; now we enforce that the context, if provided, has the same device type as `src`.
3. To preserve the 'get-or-construct' semantics of Blob, we added a specialized getter `Blob::GetMutableTensor` that verifies both that the Blob contains a Tensor and that it is of the correct type.
4. The Tensor type is not default-constructible any more (as we don't have unknown-device tensors), so some of the code handling STL containers needs to change.
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
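A hedged, heavily simplified sketch of the resulting shape of the class (the real Tensor also carries storage, dims, and so on):
```cpp
// Device type is now a runtime field rather than a template parameter.
enum class DeviceType { CPU, CUDA };

class Tensor {
 public:
  explicit Tensor(DeviceType type) : type_(type) {}  // no default ctor:
                                                     // no unknown-device tensors
  DeviceType GetDeviceType() const { return type_; }
 private:
  DeviceType type_;
};

int main() {
  Tensor t(DeviceType::CPU);  // device chosen at runtime, not compile time
  return t.GetDeviceType() == DeviceType::CPU ? 0 : 1;
}
```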
Reviewed By: ezyang, houseroad
Differential Revision: D9024330
fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba
Summary:
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites; the core implementations will change later.
Before, the Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we make the device a runtime property (stored inside the tensor) but preserve the same semantics. For example, one has to specify a device type in order to create a Tensor: there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. `Tensor(DeviceType type)`.
2. The semantics of the constructor `Tensor(const Tensor<SrcContext>& src, ContextForCopy* context)` changed: the second context is passed in to enable us to call the templated Copy function. Previously it could be a different context from source and target; now we enforce that the context, if provided, has the same device type as `src`.
3. To preserve the 'get-or-construct' semantics of Blob, we added a specialized getter `Blob::GetMutableTensor` that verifies both that the Blob contains a Tensor and that it is of the correct type.
4. The Tensor type is not default-constructible any more (as we don't have unknown-device tensors), so some of the code handling STL containers needs to change.
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: xw285cornell
Differential Revision: D8121878
fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81
Summary: Adds support for backprop to While op, fixes gradient computation for Pow
Reviewed By: azzolini
Differential Revision: D6456875
fbshipit-source-id: 9f660317ad6f3898ff7d8ce43098f85c3426409b
Summary:
Pretransposing FCs seems to offset the losses we get from low
batch sizes in AdIndexer. First I confirmed this on local benchmarks (see the
previous diff). Then in https://fburl.com/yuo49onj I showed how this
change saves 19% of FC time on AdIndexer, which is already $0.4M in
capital expenditure and over 3 years gives 5x more ROI.
We can also reuse this code for later, more efficient gemm
implementations. For example, msmelyan is working on a new fp16 gemm that
would cut bandwidth usage 2x; we can reuse the code in this diff for the
repacking required by the new gemm.
In this diff I had to take care of memory usage. Here are several
possible approaches to the transformation:
1. Perform the transpose on the fly, copying the memory. This is what is done in skinny gemm (FC with engine SKINNY).
   Cons: slow first execution; memory is replicated for each thread.
2. Copy the weights in the operator constructor; in dbg mode, verify on the fly that the hash of the original weights is unchanged.
   Cons: memory is still replicated for each thread.
3. Copy the weights in the Predictor constructor.
   Cons: if we have 2 predictors sharing the same weight blob (via PredictorContainer), we still get 3x more memory, i.e. the original weights plus one copy for each of the predictors in a container.
4. Replace the weights in the Predictor constructor, taking care of the mapping to support weight sharing within a Predictor container.
   This is the approach taken in this diff (see the sketch below); it solves the issues above and doesn't create any memory overhead.
   Cons: the logic became complex and requires a mutex at initialization time.
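A hedged sketch of approach 4, with all names hypothetical: repack each weight once under a mutex, keyed by the original blob, so predictors sharing a weight also share the repacked copy.
```cpp
#include <map>
#include <memory>
#include <mutex>
#include <vector>

struct Matrix { std::vector<float> data; int rows = 0, cols = 0; };

Matrix Transpose(const Matrix& w) {
  Matrix t{std::vector<float>(w.data.size()), w.cols, w.rows};
  for (int r = 0; r < w.rows; ++r)
    for (int c = 0; c < w.cols; ++c)
      t.data[c * w.rows + r] = w.data[r * w.cols + c];
  return t;
}

// Get-or-create the pretransposed copy for a weight blob. The mutex is
// only taken at predictor initialization time, as noted above.
std::shared_ptr<Matrix> GetOrCreatePretransposed(const Matrix* original) {
  static std::mutex mu;
  static std::map<const Matrix*, std::weak_ptr<Matrix>> cache;
  std::lock_guard<std::mutex> g(mu);
  if (auto cached = cache[original].lock()) return cached;
  auto repacked = std::make_shared<Matrix>(Transpose(*original));
  cache[original] = repacked;
  return repacked;
}
```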
Reviewed By: akyrola
Differential Revision: D6214593
fbshipit-source-id: 25da6ba7bfd39fc8f4b578094d3f334c7957490d
Summary: Adds the ability to create a local blob in the workspace even if the blob exists in the parent workspace. This is to support cases where a user wants to create a local copy of the blob and hide the blob from the parent workspace.
Reviewed By: akyrola
Differential Revision: D6194386
fbshipit-source-id: 92c064159ac635ee76c211abc013b72bd8752447
Summary:
Adding backward pass support for If operator:
- Implemented necessary changes to Do operator and generation of gradient Do operator to properly forward gradient blobs in and out of subnet
- Using WorkspaceManager to keep track of workspaces used by Do, in case we need to have access to local blobs to compute gradients (also important for loop's backprop)
- Update to Workspace to handle blob binding from multiple parent workspaces
- Implemented generation of gradient If operator
- Unit test to build and train a net with If control op
Reviewed By: azzolini
Differential Revision: D5745096
fbshipit-source-id: 1023c90a2113716254424d1e50b9e560fe9083e5
Summary:
Better isolation for workspaces, allowing selected blobs to be forwarded from a parent to a child workspace, possibly under new names. Used for proper isolation of subnets (loops, then/else branches, etc.) from the outer workspace.
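A hedged sketch of the forwarding mechanism, with API names assumed: the child resolves selected names, possibly renamed, to blobs owned by the parent.
```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct Blob {};

class Workspace {
 public:
  explicit Workspace(Workspace* parent = nullptr) : parent_(parent) {}

  // Make the parent's blob `parent_name` visible here as `local_name`.
  void ForwardBlob(const std::string& local_name, const std::string& parent_name) {
    if (parent_) forwarded_[local_name] = parent_->GetBlob(parent_name);
  }

  Blob* GetBlob(const std::string& name) {
    auto fwd = forwarded_.find(name);
    if (fwd != forwarded_.end()) return fwd->second;
    auto own = blobs_.find(name);
    return own != blobs_.end() ? own->second.get() : nullptr;
  }

  Blob* CreateBlob(const std::string& name) {
    auto& slot = blobs_[name];
    if (!slot) slot = std::make_unique<Blob>();
    return slot.get();
  }

 private:
  Workspace* parent_;
  std::unordered_map<std::string, std::unique_ptr<Blob>> blobs_;
  std::unordered_map<std::string, Blob*> forwarded_;  // possibly renamed
};
```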
Reviewed By: azzolini
Differential Revision: D5681667
fbshipit-source-id: e61a2c7c98ee2abf1f0761905f4bfae47c201c32
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
Last time I used a uuid filled into OperatorDef, and operator_tracebacks was populated using traceback.extract_stack. There were several issues with this approach:
1. A random field in OperatorDef breaks workflows relying on memoization, i.e. when computation is skipped based on an already-computed result.
2. Adding one more field revealed that RNNs are not forward compatible with respect to new fields there. The prototxt format seems not to allow forward compatibility (thanks jamesr66a for the investigation!). For RNNs we need to switch to a more resilient approach. azzolini's proposed change to OperatorDef / NetDef would allow that by nesting NetDef directly inside OperatorDef without the need for extra serialization.
3. traceback.extract_stack is very slow when the executable is on a remote filesystem. It does one or more os.stat calls for each frame on the stack. In some cases this added up to 15 extra minutes of model construction.
In this diff I use a different approach, which should fix all the problems above.
1 and 2 are solved by not adding a new field at all. Instead I report the operator's index with respect to the net it runs in. Thanks akyrola and dzhulgakov for the idea. The downside is that operator-list manipulation breaks the logic, and separately created ops are not covered at all.
3 is solved by operating on raw frames without using the traceback and inspect modules, which end up doing a lot of filesystem calls. See the function extract_stacktrace in core.py with additional comments.
Reviewed By: dzhulgakov
Differential Revision: D5286285
fbshipit-source-id: 626dd0f5f6b8b1d86bd6bf519078b122f43ddcaa
Summary:
a few issues:
1. Randomization hurts memoization.
2. Even if we make it non-random, we can get key collisions when loading it back.
3. RNNs use prototxt for the step net, and apparently it is not forward compatible the way normal protobuf is.
I am thinking of a better, less invasive solution now.
Reviewed By: jamesr66a
Differential Revision: D5272118
fbshipit-source-id: ab577fad04fbfc632e1fceffa923377a0d3da1be
Summary: This is going to show a python Caffe2 user where a failed operator was created. The motivation for not putting this information right in the protobuf is to avoid making it too verbose and to keep the ability to read the protobufs of a net after a simple print() call.
Reviewed By: jamesr66a
Differential Revision: D5226047
fbshipit-source-id: 7edfe850e05a2ec209577142aa3368664a57a108
Summary: This RunPlan code is getting complex and confusing. The first step in cleaning it up is to move it out of workspace.cc to better mark the separation of concerns.
Reviewed By: kennyhorror
Differential Revision: D5100721
fbshipit-source-id: 4be0559eba1abb8bb1ddc3818698763c2e014ef2
Summary:
This is from a discussion with dzhulgakov: as a step towards revisiting core.Net autonaming, we will first guard against accidental overwrites of existing networks in the workspace.
ajtulloch since we are doing Predictors in mobile, this should be safe right?
azzolini - I assume this would be safe, but would love to get your approval.
akyrola - would this hurt xray?
Reviewed By: dzhulgakov
Differential Revision: D4897725
fbshipit-source-id: aa41271927ad6671f07a53b9505283623f8c49e5
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so that they don't need to be stored in the step workspace. This is done by modifying the backward step to automatically include all operators that are needed to produce the output that is to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace; making it a class field won't work, since the lifecycle of the blob does not match the lifecycle of the operator.
For basic LSTM, the performance hit is quite modest (about 15% with one setting), but your mileage may vary. For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.
For basic LSTM, the memory saving is wonderful: each forward workspace only has 4 bytes (for the timestep).
I also modified the neural_mt LSTM cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.
Added options to LSTM, MILSTM and LSTMAttention to enable the memory mode.
Reviewed By: urikz
Differential Revision: D4853890
fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
Summary: Instead of reporting the total number of elements of a tensor, report the number of bytes; specifically, report the capacity of the tensor, not the bytes currently in use.
Reviewed By: jamesr66a, salexspb
Differential Revision: D4851633
fbshipit-source-id: 464d552f41f1b5f25753b0e7001d299b6dac1966
Summary:
Added the Caffe2 command-line option `--caffe2_print_blob_sizes_at_exit=1` which, when enabled, prints all tensor sizes in the workspace destructor. Handy especially when using sub-workspaces, as with RNNs. Note that the sizes are numbers of elements, not bytes. The output is designed to be easily copy-pasteable into Excel.
TODO: add sorting
Reviewed By: jamesr66a
Differential Revision: D4844628
fbshipit-source-id: 11608a1710ae5c89bbd741edb506d25496606185
Summary:
The main idea is that on the backward pass we don't need to keep all the backward outputs in memory. This diff addresses only the ones used internally in each private workspace, by creating a shared workspace that shares them all within the backward pass.
Another thing we can do is get rid of the state_grad blobs, but that would be a separate effort.
See the comments for a more detailed description.
Reviewed By: urikz
Differential Revision: D4784900
fbshipit-source-id: 2dd8fe1b1215217ce92c09d918582d76c3051630
Summary:
When the execution step represents something like:
```
for loop
  execution_step
    net1
  execution_step
    net2
    net3
```
the preparation cost for the execution step is too high.
This diff moves most of the shared information into the CompiledExecutionStep to save time.
After the change, the benchmark result for the parameter server handler is as follows (be aware that the first two have some variance):
```
INFO:__main__:==Summary==
INFO:__main__:Time <function case_if at 0x7f7160c32938> 0.0752924203873
INFO:__main__:Time <function case_loop at 0x7f7160c329b0> 0.0677666187286
INFO:__main__:Time <function case_simple_net at 0x7f7160c32a28> 0.0605396509171
INFO:__main__:Time <function case_one_loop at 0x7f7160c32aa0> 0.0611681699753
```
Before the change:
```
INFO:__main__:==Summary==
INFO:__main__:Time <function case_if at 0x7f19d079f848> 0.100815701485
INFO:__main__:Time <function case_loop at 0x7f19d079f8c0> 0.0864136457443
INFO:__main__:Time <function case_simple_net at 0x7f19d079f938> 0.0614696979523
INFO:__main__:Time <function case_one_loop at 0x7f19d079f9b0> 0.0598972082138
```
Reviewed By: azzolini
Differential Revision: D4643926
fbshipit-source-id: 5a4b97230ba778e0ff5cbafc8a216335a191068a
Summary:
This is a bit of a large diff, sorry about that. It includes basic shape- and type-inference functionality, based on YQ's Schema scaffolding. I added some helper functions to make it easier to write simple translations.
A bigger refactoring was needed for ConvPoolBase so that we could use the shape inference already there in the schema.
I annotated enough operators to be able to infer forward-pass shapes for a basic convnet, and added a test for that. I intend to bootcamp some annotations and annotate enough to handle ResNets fully. Need to think about gradients and whether they could be annotated in an easier way.
Only shapes are exposed to Python for now; types will follow later. Also, the inference is not yet called anywhere but the unit test.
Also, I am not sure everything is in the best location in the code, but it shouldn't be hard to move stuff around.
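A loose sketch of how per-operator inference functions can hang off the schema (names modeled on the approach described, not the exact API):
```cpp
#include <functional>
#include <vector>

struct TensorShape { std::vector<long> dims; };
using ShapeInferenceFn =
    std::function<std::vector<TensorShape>(const std::vector<TensorShape>&)>;

// Minimal schema object holding an optional shape-inference function.
struct OpSchema {
  OpSchema& TensorInferenceFunction(ShapeInferenceFn fn) {
    shape_fn = std::move(fn);
    return *this;
  }
  ShapeInferenceFn shape_fn;
};

// Example annotation: elementwise ops propagate the input shape unchanged.
OpSchema relu_schema = OpSchema().TensorInferenceFunction(
    [](const std::vector<TensorShape>& in) {
      return std::vector<TensorShape>{in[0]};
    });
```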
Reviewed By: dzhulgakov
Differential Revision: D4436818
fbshipit-source-id: eebee5937ccc9ac09c245465302388a1fae6933c