Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16929
Separate CPU reduce functions from math
i-am-not-moving-c2-to-c10
Reviewed By: houseroad
Differential Revision: D13999469
fbshipit-source-id: bd628b15a6e3c1f04cc62aefffb0110690e1c0d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16175
Separate Moments from math and optimize it
i-am-not-moving-c2-to-c10
Reviewed By: houseroad
Differential Revision: D13742472
fbshipit-source-id: 90757d908d38c98ca69818855aaf68315e525992
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16135
Separate affine_channel from math and optimize it
i-am-not-moving-c2-to-c10
Reviewed By: houseroad
Differential Revision: D13727606
fbshipit-source-id: 8980af4afadaf964a18a9da581106fe30896a7e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15884
Codemod generated with clangr shard mode, 25 files per diff,
To eliminiate partially initialized Tensor, we split the initialization of local Tensor variables into two steps, first declare un uninitialized Tensor, and
call `ReinitializeTensor` to initialize it.
motivation: https://github.com/pytorch/pytorch/pull/12407
Reviewed By: hyuen
Differential Revision: D13586737
fbshipit-source-id: dc8e49e9f29505b8898bb19f84c1a983f2d811ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12733
Conv in NHWC layout only works for 2D images. This has been a pain point when implementing quantized 3D convolution because we need NHWC layout for best performance (note that NHWC layout in general gives better performance in CPU not just for quantized operators). For example, our quantized ops have a functionality to measure quantized error operator by operator but this needs running a shadow fp32 operator, but this is not easy when there's no 3D conv in NHWC layout is available (currently we're doing layout conversion on the fly for the shadow fp32 operator which is error prone). Some of Caffe2 frameworks like brew generates error when we try to create a 3D conv op in NHWC layout. This was also a blocker for using aibench because aibench is using brew.
i-am-not-moving-c2-to-c10
Reviewed By: houseroad
Differential Revision: D10333829
fbshipit-source-id: 2d203ee1db833cd3f9d39353219e3894b46c4389
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12306
In a future diff, I'm going to introduce non-placement constructor and destructor to TypeMeta.
To make it less ambigous, this diff is first renaming the existing ones to PlacementXXX.
Reviewed By: dzhulgakov
Differential Revision: D10184117
fbshipit-source-id: 119120ebc718048bdc1d66e0cc4d6a7840e666a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10888
Add cuda version of SpatialBNOp also optimize SpatialBN on CPU
Reviewed By: houseroad
Differential Revision: D9512435
fbshipit-source-id: 6f828c88d56d30dc9a2f98a297a161c35cc511b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11060
Adding synthetic data generation to the filler.h file (the exact distribution to be replaced later on).
Reviewed By: highker
Differential Revision: D9417594
fbshipit-source-id: 5d66dfbcb254a5961c36b7d3a081332c7372dac7
Summary:
Let's run CI tests to see what fails given the changes that just landed in https://github.com/pytorch/pytorch/pull/10624
cc mingzhe09088 ezyang Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10692
Reviewed By: mingzhe09088
Differential Revision: D9423617
Pulled By: orionr
fbshipit-source-id: 3bda1f118d13f8dd8e823727c93167cae747d8cf
Summary:
A continuation of https://github.com/pytorch/pytorch/pull/10504 for GPU, torch, etc. builds.
I was testing with
```
FULL_CAFFE2=1 python setup.py build_deps | tee ~/log.txt
cat ~/log.txt | egrep 'undefined refer' | sort | less
```
I'll rebase on master when Yangqing's changes in 10504 land, but putting up for some testing.
cc mingzhe09088 anderspapitto ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10507
Reviewed By: Yangqing
Differential Revision: D9359606
Pulled By: orionr
fbshipit-source-id: c2a3683b3ea5839689f5d2661da0bc9055a54cd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10439
Update Im2Col related to make preparation for group conv in NHWC order.
Reviewed By: houseroad
Differential Revision: D9285344
fbshipit-source-id: 1377b0243acb880d2ad9cf73084529a787dcb97d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later
Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are:
1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)),
2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: ezyang, houseroad
Differential Revision: D9024330
fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba
Summary:
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later
Before Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify device type in order to create a Tensor - there are no uninitialized tensors. More specifically the changes are:
1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)),
2. Semantics of constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); is changed, in this constructor, the second context is passed in to enable us to call the templated Copy function, it could be in a different context as source and target previously, now we'll enforce that the context should have same device type as src, if it is provided.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: xw285cornell
Differential Revision: D8121878
fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9520
Add random data filler to predictor bench to support production nets
Reviewed By: salexspb
Differential Revision: D8712757
fbshipit-source-id: 2c732b2ba71ab210f9222adf94d08442ca71dc03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9299
Onnx has ReduceL1 and ReduceL2 operators that would facilitate this, so allow pytorch to export those and allow caffe2 to run them.
I only implemented this on CPU so far.
Reviewed By: pjh5
Differential Revision: D8757381
fbshipit-source-id: 68afc9e2f90042a70929b73ace05a499b5c670c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9350
Re-apply #9270
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. Fix is to isolate Eigen from headers included by cu files and processed by nvcc. This was worked on with smessmer.
Reviewed By: mingzhe09088
Differential Revision: D8794431
fbshipit-source-id: de656334af46c697802073f8e8d9a6aeb9ca65a7
Summary:
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. Fix is to isolate Eigen from headers included by cu files and processed by nvcc. This was worked on with smessmer.
cc mingzhe09088 smessmer BIT-silence Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9270
Reviewed By: mingzhe09088
Differential Revision: D8768025
Pulled By: orionr
fbshipit-source-id: 5b34017aeb67e35a1b5938d962181ccd4cd37591
* Update elementwise ops to support numpy style boradcast
Update elementwise ops to support numpy style boradcast
* Fix sqrt_op
* Fix compare ops
* Fix gradient test
* Fix optimizer legacy broadcast
* Fix legacy broadcast for elementwise ops
* Skip flaky test
* Fix eigen simple binary op
* Fix attention test
* Fix rnn test
* Fix LSTM test
* Fix tan grad
* Fix schema check
* Add full impl of GroupNorm
* Fix comments in math.h
* Remove unsed buffers
* Add #include <array> in gpu version
* Remove unused moments_buffer_
* Make inverse std to be a template.
* Add detailed comments
* [GanH][Easy]: Add assertion to adaptive weighting layer
0 weight causes numeric instability and exploding ne
* [Easy] Add cast op before computing norm in diagnose options
As LpNorm only takes floats we add a manual casting here.
* Introduce a new caching device allocator
`cudaMalloc` and `cudaFree` calls are slow, and become slower the
more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock
because GPU memory is transparently shared across all GPUs. Normally, this
isn't much of a concern since workloads allocate memory upfront, and reuse it
during later computation.
However, under some computation models (specifically, memory conserving
approaches like checkpoint-and-recompute, see
https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9)
this assumption is no longer true. In these situations, `cudaMalloc` and
`cudaFree` are common and frequent. Furthermore, in data parallel contexts,
these calls happen at nearly the same time from all GPUs worsening lock
contention.
A common solution to this problem is to add a custom allocator. In fact,
nVIDIA provides one out of the box: CUB, which Caffe2 already supports.
Unfortunately, the CUB allocator suffers from very high fragmentation. This is
primarily because it is a "buddy" allocator which neither splits nor merges
free cached blocks. Study
https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you
want to convince yourself.
This diff adapts a caching allocator from the Torch codebase
https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp
which does splitting and merging and ends up working really well, at least for
workloads like the checkpoint-and-recompute computation models noted above.
I simplified the implementation a little bit, made it a bit more C++-like. I
also removed a bunch of stream synchronization primitives for this diff. I
plan to add them back in subsequent diffs.
* Report reader progress in fblearner workflows
Integrate with fblearner progress reporting API and add support to report training progress from reader nodes.
If reader is constructed with batch limits, report based on finished batch vs total batch. The finished batch may be more than total batch because we evaludate if we should stop processing everytime we dequeue a split.
If no limit for the reader, report based on finished splits (Hive files) vs total splits. This is fairly accurate.
* [GanH][Diagnose]: fix plotting
1. ganh diagnose needs to set plot options
2. modifier's blob name is used for metric field can need to be fixed before
generating net
* Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8
* Make CompositeReader stops as soon as one reader finishes
Previously, CompositeReader calls all readers before stopping. It results in flaky test since the last batch may be read by different threads; resulting in dropped data.
* [dper] make sure loss is not nan
as desc.
* [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign
Thanks for finding this @stzpz and @wangyanghan. Looks like NHWC is more
optimized. For OCR though it doesn't yet help since NHWC uses more mem b/w but
will soon become important.
* Intra-op parallel FC operator
Intra-op parallel FC operator
* [C2 Proto] extra info in device option
passing extra information in device option
design doc: https://fb.quip.com/yAiuAXkRXZGx
* Unregister MKL fallbacks for NCHW conversions
* Tracing for more executors
Modified Tracer to work with other executors and add more tracing
* Remove ShiftActivationDevices()
* Check for blob entry iff it is present
When processing the placeholders ops, ignore if the blob is not present in the blob_to_device.
* Internalize use of eigen tensor
Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries.
* feature importance for transformed features.
* - Fix unused parameter warnings
The changes in this diff comments out unused parameters.
This will allow us to enable -Wunused-parameter as error.
#accept2ship
* add opencv dependencies to caffe2
The video input op requires additional opencv packages. This is to add them to
cmake so that it can build
* Add clip_by_value option in gradient clipping
Add clip_by_value option in gradient clipping
when the value is bigger than max or smaller than min, do the clip
* std::round compat
* Update ReduceMean
* Add reduce mean to math
* Update cuda flag
* Update Eigen::Tensor ctor
* Remove unused variables
* Skip ReduceTensorGPUTest if no gpus
* Add NOMINMAX for windows
* Fix lpnorm_op in windows