Commit Graph

95 Commits

Xiaomeng Yang
278d398748 Add GPU version of math::Transpose
Summary: Add GPU version of math::Transpose

Reviewed By: Yangqing

Differential Revision: D6747958

fbshipit-source-id: 7047107609386c1ab53492381ca9bcf8bccd2924
2018-01-24 14:18:02 -08:00
Xiaomeng Yang
0a8a18ca01 Fix GemmBatched
Summary: Fix GemmBatched

Reviewed By: Yangqing

Differential Revision: D6678168

fbshipit-source-id: 132117633573600d4e31c1959a0ccbe34416e1f1
2018-01-10 18:16:52 -08:00
Xian Li
c1d9694f42 Backed out changeset 6f532bad5824
Summary: D6636282 caused a regression test failure for the NMT model used in prod; see 24949620 for the bisect history.

Reviewed By: pietern

Differential Revision: D6671602

fbshipit-source-id: d863013964666727cf488a6ac5b01f5216f149d9
2018-01-05 19:34:38 -08:00
Xiaomeng Yang
2cda295244 Adds cpu version of transpose util function in math.
Summary: Adds the CPU version of transpose to prepare for the LC layer.

Reviewed By: Yangqing

Differential Revision: D6641358

fbshipit-source-id: 1825b4c270dea2c0049ba334303abcbf50b22ee7
2018-01-04 23:05:40 -08:00
Xiaomeng Yang
68726df0ac Fix GemmBatchedOp
Summary: Fix GemmBatchedOp to prepare for LC Layer.

Reviewed By: Yangqing

Differential Revision: D6636282

fbshipit-source-id: 6f532bad582442ebf3da843e973eb85405371c02
2018-01-03 21:16:18 -08:00
Yangqing Jia
77484ecc45 Manually applying cudnn5 pull request.
Summary: TSIA. Closes #1631

Reviewed By: pietern, Maratyszcza

Differential Revision: D6626887

fbshipit-source-id: 1a2dc7c47bc6ce794fdf598fbd547c04029edce4
2018-01-02 15:31:33 -08:00
Yangqing Jia
59b2654544 reapply header change after xplat move
Summary: This is a reapplication of the earlier PR after the xplat move. The original author is Christoph Conrads <christoph.conrads@fluent.ai> (christoph-conrads).

Reviewed By: houseroad

Differential Revision: D6379736

fbshipit-source-id: b7482ecf3b9487a528c15e92976e915791210002
2017-11-22 13:04:37 -08:00
Xianjie Chen
d1c73eb407 use size_t for rand fill functions in math
Summary: The number of elements in a caffe2 blob can be larger than int32 can hold. Use size_t to prevent overflow (a sketch follows this entry).

Reviewed By: ajtulloch

Differential Revision: D6278363

fbshipit-source-id: 356e294c667a53360d8a65b56a63a39d5ce3384e
2017-11-09 18:44:46 -08:00
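
The overflow guarded against above is easy to demonstrate on the host side. A minimal C++ sketch with a hypothetical blob shape; this is illustrative only, not caffe2 code:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical blob shape whose element count exceeds what int32 can hold.
  const std::size_t dims[] = {70000, 70000};  // ~4.9e9 elements

  std::size_t n = 1;       // size_t keeps the full element count
  std::int32_t n32 = 1;    // a 32-bit count silently wraps
  for (std::size_t d : dims) {
    n *= d;
    n32 = static_cast<std::int32_t>(static_cast<std::int64_t>(n32) * d);
  }

  std::cout << "size_t count: " << n << "\n"
            << "int32 count:  " << n32 << " (wrapped)\n";
  return 0;
}
```
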
Ilia Cherniavskii
1dbbef6b48 Fix crash in blob deallocation
Summary: We have to use the copy constructor in Concat when copying non-primitive types.

Reviewed By: Yangqing

Differential Revision: D6002883

fbshipit-source-id: 0aebc955079975bb6423291589ed09ce0660acf3
2017-10-10 19:03:01 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Aapo Kyrola
fb45383ed6 resubmission of PR1175: fp16 BatchMatMul
Summary: PR 1175 caused a build error because gemmBatched was only defined under a specific #ifdef. It is now placed outside the #ifdef, and things work.

Reviewed By: asaadaldien

Differential Revision: D5834868

fbshipit-source-id: 072a64c8f4b259ff7504104121766115b46b8aa0
2017-09-14 21:46:05 -07:00
Yangqing Jia
f0d0361609 Revert D5794634: [caffe2][PR] fp16: BatchMatMul
Summary:
This reverts commit 911c462824edec3de529a5a4385a4c437e24bf59

bypass-lint

Differential Revision: D5794634

fbshipit-source-id: 1863b02282329cbee6b10e5870f03051b4bb6c58
2017-09-13 18:46:47 -07:00
Luke Yeager
3cfc6f26e7 fp16: BatchMatMul
Summary:
Was https://github.com/caffe2/caffe2/pull/1151
Closes https://github.com/caffe2/caffe2/pull/1175

Reviewed By: Yangqing

Differential Revision: D5794634

Pulled By: akyrola

fbshipit-source-id: 911c462824edec3de529a5a4385a4c437e24bf59
2017-09-13 14:35:25 -07:00
Wojciech Glogowski
e27431ddf5 New math.h functions required by YellowFin
Summary: New math.h functions required by YellowFin

Reviewed By: akyrola

Differential Revision: D5695258

fbshipit-source-id: b21a23b7f9647004173f8eb4f8ba9a852370d97a
2017-08-25 18:09:34 -07:00
Yangqing Jia
5954211ed9 Fix #997
Summary:
cc phg1024
Closes https://github.com/caffe2/caffe2/pull/998

Differential Revision: D5538341

Pulled By: Yangqing

fbshipit-source-id: 2df69e03c8c94c67628ab8051d2a863e93f49692
2017-08-01 11:21:00 -07:00
Wojciech Glogowski
f656e002a7 CosineSimilarity GPU
Reviewed By: asaadaldien, akyrola

Differential Revision: D5476812

fbshipit-source-id: d931a7d8e4a4dfdf22ee18f8b9c755cc21b0e75b
2017-07-25 13:34:01 -07:00
Matt Uyttendaele
7f28a891f3 added sincos function to caffe2/utils/math
Summary: In situations where both sin and cos need to be computed, the joint SinCos function is faster than computing them individually. Both MKL and CUDA support this function, so it is exposed here (a sketch follows this entry).

Reviewed By: kmatzen

Differential Revision: D5465588

fbshipit-source-id: 7686498e4f2d4b5862d83a1ecf14fcc88ea53640
2017-07-21 09:55:21 -07:00
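
A minimal CPU-side sketch of the kind of fused helper described above. The SinCos signature is illustrative, not the actual caffe2 math API; the vendor-fused versions it stands in for are MKL's vsSinCos/vdSinCos and CUDA's sincosf/sincos, which share work between the two results instead of computing them in separate elementwise calls:

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Fills both outputs in a single pass over x; a vendor-fused version would
// compute sin and cos together rather than via two separate calls.
void SinCos(std::size_t n, const float* x, float* s, float* c) {
  for (std::size_t i = 0; i < n; ++i) {
    s[i] = std::sin(x[i]);
    c[i] = std::cos(x[i]);
  }
}

int main() {
  std::vector<float> x = {0.0f, 0.5f, 1.0f}, s(x.size()), c(x.size());
  SinCos(x.size(), x.data(), s.data(), c.data());
  for (std::size_t i = 0; i < x.size(); ++i)
    std::cout << x[i] << ": sin=" << s[i] << " cos=" << c[i] << "\n";
  return 0;
}
```
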
Junjie Bai
4fddc04054 Use the same schema of switching to device reduce sum for SumSqrElements
Summary: Based on the benchmark script located at `caffe2/experiments/python/device_reduce_sum_bench.py`, device reduce sum is slower for N <= 10000, so we only switch to device reduce for large N in SumElements. This diff applies the same scheme to SumSqrElements (a sketch of the dispatch follows this entry).

Reviewed By: jamesr66a

Differential Revision: D5369868

fbshipit-source-id: ae13a611aff9d3464d1c4950ee155c740a2da339
2017-07-05 10:52:17 -07:00
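
A hedged sketch of the size-based dispatch described above. The threshold is the one cited in the summary; the function shape and names are illustrative, and both reduction paths are stand-in CPU lambdas rather than the real kernels (on the GPU the large-N path would be a device-wide reduction such as cub::DeviceReduce::Sum):

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

// Simple reductions win for small inputs; device-wide reduction wins for
// large ones, so dispatch on the element count.
constexpr std::size_t kDeviceReduceThreshold = 10000;

float SumSqrElements(
    const std::vector<float>& x,
    const std::function<float(const std::vector<float>&)>& simple_path,
    const std::function<float(const std::vector<float>&)>& device_reduce_path) {
  return x.size() <= kDeviceReduceThreshold ? simple_path(x)
                                            : device_reduce_path(x);
}

int main() {
  // Both paths use the same CPU lambda here; only the dispatch is the point.
  auto sum_sqr = [](const std::vector<float>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0f,
                           [](float a, float v) { return a + v * v; });
  };
  std::vector<float> small(100, 1.0f), large(20000, 1.0f);
  std::cout << SumSqrElements(small, sum_sqr, sum_sqr) << " "
            << SumSqrElements(large, sum_sqr, sum_sqr) << "\n";  // 100 20000
  return 0;
}
```
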
Marat Dukhan
2ac9ff5c96 Cos, Sin, and Abs operators
Summary: add Cos, Sin, and Abs operators

Reviewed By: akyrola

Differential Revision: D5307632

fbshipit-source-id: 743c9d289e4d3fd439e4b5385841cdff87d9247a
2017-07-03 22:18:32 -07:00
Junjie Bai
f3a59aedff Use cub::DeviceReduce for faster math::Sum CUDA version
Summary: Port SumElements and softmax_ops.cu to use device reduce sum

Reviewed By: akyrola

Differential Revision: D5351881

fbshipit-source-id: ca9604186c261ffcb1480da2a17baab8a4809372
2017-06-30 15:04:06 -07:00
Jeff Johnson
3f860af050 Implement TopKOp for GPU
Summary:
This is a real implementation (not GPUFallbackOp) of the TopKOp for GPU.

There are two algorithm implementations:

- For k <= 512, it maps to a warp-wide min-heap implementation, which requires only a single scan of the input data.
- For k > 512, it maps to a multi-pass radix selection algorithm that I originally wrote in cutorch. I took the recent cutorch code and removed some cutorch-specific things as it made sense.

Also added several utility files that one or the other implementation uses, some from the Faiss library and some from the cutorch library. (A brief CPU sketch of the TopK contract follows this entry.)

Reviewed By: jamesr66a

Differential Revision: D5248206

fbshipit-source-id: ae5fa3451473264293516c2838f1f40688781cf3
2017-06-17 08:47:38 -07:00
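
The GPU kernels themselves are not reproduced here; below is a minimal CPU sketch of the contract both paths implement, returning the indices of the k largest elements (illustrative only, not the operator code):

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

// Indices of the k largest elements of x, largest first. The GPU operator
// produces the same result, choosing between a warp-wide min-heap (k <= 512)
// and multi-pass radix selection (k > 512) as described above.
std::vector<int> TopKIndices(const std::vector<float>& x, int k) {
  std::vector<int> idx(x.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) { return x[a] > x[b]; });
  idx.resize(k);
  return idx;
}

int main() {
  std::vector<float> x = {0.1f, 2.5f, -1.0f, 3.3f, 0.7f};
  for (int i : TopKIndices(x, 2)) std::cout << i << ":" << x[i] << " ";  // 3:3.3 1:2.5
  std::cout << "\n";
  return 0;
}
```
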
Ahmed Taei
b294aadc66 fp16 support for FullyConnected op (Fixed)
Summary: This diff resolves some issues from the reverted PR 246.

Differential Revision: D4911821

fbshipit-source-id: 0a6fa47f4c2405475697e40fb926758c534f8ef7
2017-04-19 12:49:12 -07:00
Aapo Kyrola
9ab077dc9d Revert D4871248: [caffe2][PR] fp16 support for FullyConnected op
Summary: This reverts commit 6a991c2c993dcf0b1e18aa3f2ffbe19e693dbadd

Differential Revision: D4871248

fbshipit-source-id: b6d812d09a00c83e363432e84742c503abfed65b
2017-04-17 21:31:20 -07:00
Simon Layton
1082db600e fp16 support for FullyConnected op
Summary:
Includes math lib support, removal of double-precision.
Closes https://github.com/caffe2/caffe2/pull/246

Reviewed By: Yangqing

Differential Revision: D4871248

Pulled By: asaadaldien

fbshipit-source-id: 6a991c2c993dcf0b1e18aa3f2ffbe19e693dbadd
2017-04-17 12:07:57 -07:00
Aapo Kyrola
092c1440a2 SumSqrElements
Summary:
Added SumSqrElements, since it lets us avoid the large temporary blob that is needed when doing Sqr + SumElements (a one-line sketch follows this entry).

Also moved it to reduction_ops, because utility_ops has grown too big.

Reviewed By: jamesr66a

Differential Revision: D4844172

fbshipit-source-id: 032eec45e24d6724f0d5fb83f4ec1c771d1146e5
2017-04-10 16:16:52 -07:00
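
The temporary-blob point above in miniature (a sketch, not the operator code): folding the square into the reduction needs no intermediate buffer the size of the input.

```cpp
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  std::vector<float> x = {1.f, 2.f, 3.f};
  // Sqr + SumElements would first materialize {1, 4, 9} in a temporary blob
  // the size of x; squaring inside the reduction needs no temporary at all.
  float sum_sqr = std::accumulate(x.begin(), x.end(), 0.0f,
                                  [](float a, float v) { return a + v * v; });
  std::cout << sum_sqr << "\n";  // 14
  return 0;
}
```
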
Aapo Kyrola
ed44e87f98 use striped batch add for the recurrent network gradient
Summary: Instead of calling batch-size many math::Adds, added a new function that does a batch of additions (sketched after this entry). For CPU there is no difference, but for CUDA we do everything in one kernel. I don't think this has a huge performance impact, but it at least makes the CUDA profiling look better with fewer kernel launches.

Reviewed By: jamesr66a

Differential Revision: D4798411

fbshipit-source-id: 44ac65b2da5a615971219809b9298b4e122085cd
2017-03-30 08:57:16 -07:00
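
A CPU-side sketch of the batched add described above. The name and signature are illustrative rather than the exact caffe2 math function; the point is that one call covers every stripe, which on CUDA becomes a single kernel launch instead of `batch` launches:

```cpp
#include <iostream>
#include <vector>

// Accumulates `batch` slices of length n, laid out `stripe` elements apart
// in x, into the single output y.
void AddStripedBatch(int n, const float* x, float* y, int stripe, int batch) {
  for (int b = 0; b < batch; ++b) {
    const float* xb = x + b * stripe;
    for (int i = 0; i < n; ++i) y[i] += xb[i];
  }
}

int main() {
  // Two stripes of length 3, stored back to back (stripe == n here).
  std::vector<float> x = {1, 2, 3, 10, 20, 30}, y(3, 0.f);
  AddStripedBatch(3, x.data(), y.data(), /*stripe=*/3, /*batch=*/2);
  std::cout << y[0] << " " << y[1] << " " << y[2] << "\n";  // 11 22 33
  return 0;
}
```
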
Ahmed Taei
e41d35909a Conv-ND NCHW CPU/CUDA implementation
Summary: Migrate caffe1 ConvNd implementation to caffe2.

Reviewed By: Yangqing

Differential Revision: D4659868

fbshipit-source-id: 14b178af3faa2c0b12e5a9f7aa76c1d8945419ea
2017-03-20 14:01:07 -07:00
Yangqing Jia
1741fd839f Re-apply windows diff D4657831
Summary:
(Note: previous revert was due to a race condition between D4657831 and
D4659953 that I failed to catch.)

After this, we should have contbuild guarding the Windows build both with
and without CUDA.

This includes a series of changes that are needed to make Windows build,
specifically:

(1) Various flags that are needed in the cmake system, especially dealing
with /MD, /MT, cuda, cudnn, whole static linking, etc.
(2) Contbuild scripts based on AppVeyor.
(3) For Windows build, note that one will need to use "cmake --build" to
build stuff so that the build type is consistent between configuration and
actual build. see scripts\build_windows.bat for details.
(4) In logging.h, ERROR is already defined by Windows. I don't have a good
solution now, and as a result, LOG(ERROR) on windows is going to be
LOG(INFO).
(5) Variable-length arrays are not supported by MSVC (and they are not part of
the C++ standard). As a result I replaced them with vectors (see the sketch
after this entry).
(6) sched.h is not available on Windows, so akyrola 's awesome simple
async net might encounter some slowdown due to no affinity setting on
Windows.
(7) MSVC has a bug with template calls inside a templated function call,
which is a known issue that should be fixed in
MSVC 2017. However for now this means changes to conv_op_impl.h and
recurrent_net_op.h. No actual functionalities are changed.
(8) std host function calls are not supported in CUDA8+MSVC, so I changed
lp_pool (and maybe a few others) to use cuda device functions.
(9) The current Scale and Axpy have heavy templating that does not work
well with MSVC. As a result I reverted azzolini 's changes to the Scale
and Axpy interface, moved the fixed-length version to ScaleFixedSize and
AxpyFixedSize.
(10) CUDA + MSVC does not deal with Eigen well, so I guarded all Eigen
parts to only the non-CUDA part.
(11) In conclusion, it is fun but painful to deal with visual c++.

Differential Revision: D4666745

fbshipit-source-id: 3c9035083067bdb19a16d9c345c1ce66b6a86600
2017-03-07 11:02:12 -08:00
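
Item (5) in the list above in miniature: a C99-style variable-length array compiles under GCC/Clang as an extension but is rejected by MSVC, and a std::vector is the portable replacement. A sketch under that assumption, not the actual caffe2 code that was changed:

```cpp
#include <iostream>
#include <vector>

float SumFirstN(const float* data, int n) {
  // float buf[n];                         // VLA: GCC/Clang extension, rejected by MSVC
  std::vector<float> buf(data, data + n);  // portable, dynamically sized replacement
  float s = 0.f;
  for (float v : buf) s += v;
  return s;
}

int main() {
  float data[] = {1.f, 2.f, 3.f, 4.f};
  std::cout << SumFirstN(data, 3) << "\n";  // 6
  return 0;
}
```
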
Avani Nandini
039c3cf0ba Revert D4657831: [caffe2][PR] Changes for Windows build to pass.
Summary: This reverts commit 070ded372ed78a7e3e3919fdffa1d337640f146e

Differential Revision: D4657831

fbshipit-source-id: 3a0fb403936a9257776d637ce3ba5dbd81e1119f
2017-03-06 21:02:36 -08:00
Yangqing Jia
7b8c7b11d2 Changes for Windows build to pass.
Summary:
After this, we should have contbuild guarding the Windows build both with
and without CUDA.

This includes a series of changes that are needed to make Windows build,
specifically:

(1) Various flags that are needed in the cmake system, especially dealing
with /MD, /MT, cuda, cudnn, whole static linking, etc.
(2) Contbuild scripts based on AppVeyor.
(3) For Windows build, note that one will need to use "cmake --build" to
build stuff so that the build type is consistent between configuration and
actual build. see scripts\build_windows.bat for details.
(4) In logging.h, ERROR is already defined by Windows. I don't have a good
solution now, and as a result, LOG(ERROR) on windows is going to be
LOG(INFO).
(5) Variable-length arrays are not supported by MSVC (and they are not part of
the C++ standard). As a result I replaced them with vectors.
(6) sched.h is not available on Windows, so akyrola 's awesome simple
async net might encounter some slowdown due to no affinity setting on
Windows.
(7) MSVC has a
Closes https://github.com/caffe2/caffe2/pull/183

Reviewed By: ajtulloch

Differential Revision: D4657831

Pulled By: Yangqing

fbshipit-source-id: 070ded372ed78a7e3e3919fdffa1d337640f146e
2017-03-06 20:03:37 -08:00
Kittipat Virochsiri
718786add7 UniqueUniformFillOp
Summary: This is like `UniformIntFill` but guarantees unique elements in the output, excluding the optional elements to avoid (semantics sketched after this entry).

Reviewed By: xianjiec

Differential Revision: D4511814

fbshipit-source-id: 5dc98ee580616e60e46ee74ebb3f5ddd29a09965
2017-02-15 16:00:44 -08:00
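
A sketch of the operator's semantics as described above, using simple rejection sampling. The function name, parameters, and sampling strategy are illustrative, not the actual implementation:

```cpp
#include <cstddef>
#include <iostream>
#include <random>
#include <unordered_set>
#include <vector>

// Draws `count` distinct integers uniformly from [lo, hi], never returning a
// value from `avoid`. The caller must leave enough distinct values in the
// range, otherwise this loop never terminates.
std::vector<int> UniqueUniformFill(int lo, int hi, std::size_t count,
                                   const std::unordered_set<int>& avoid) {
  std::unordered_set<int> seen(avoid);
  std::vector<int> out;
  std::mt19937 gen(0);
  std::uniform_int_distribution<int> dist(lo, hi);
  while (out.size() < count) {
    int v = dist(gen);
    if (seen.insert(v).second) out.push_back(v);  // keep only unseen values
  }
  return out;
}

int main() {
  for (int v : UniqueUniformFill(0, 9, 5, /*avoid=*/{3, 4})) std::cout << v << " ";
  std::cout << "\n";
  return 0;
}
```
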
Yangqing Jia
d87edd39e7 math gemm interface fix
Summary:
I don't know why I introduced this embarrassing bug that swapped the order of
ldb and beta in the gemm interface. This fixes that (see the sketch after this entry).

Differential Revision: D4014493

fbshipit-source-id: 1aec950b6e9d57e947654d4044e50930f2db1344
2016-12-19 10:45:20 -08:00
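
A sketch of why this kind of bug slips through. The reference (C)BLAS prototype places ldb before beta; since both are plain numeric parameters, swapping them typically compiles without complaint and only shows up as wrong results. The naive gemm below follows the same argument order and is illustrative only, not caffe2's math::Gemm:

```cpp
#include <iostream>
#include <vector>

// Row-major, no-transpose gemm with the BLAS-style parameter order
//   (M, N, K, alpha, A, lda, B, ldb, beta, C, ldc).
void NaiveSgemm(int M, int N, int K, float alpha, const float* A, int lda,
                const float* B, int ldb, float beta, float* C, int ldc) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) acc += A[i * lda + k] * B[k * ldb + j];
      C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
    }
  }
}

int main() {
  std::vector<float> A = {1, 2, 3, 4}, B = {5, 6, 7, 8}, C(4, 0.f);
  NaiveSgemm(2, 2, 2, 1.f, A.data(), 2, B.data(), 2, 0.f, C.data(), 2);
  // Swapping ldb and beta, as in the bug above, still compiles:
  // NaiveSgemm(2, 2, 2, 1.f, A.data(), 2, B.data(), 0.f, 2, C.data(), 2);
  std::cout << C[0] << " " << C[1] << " " << C[2] << " " << C[3] << "\n";  // 19 22 43 50
  return 0;
}
```
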
Xianjie Chen
dea27ca4ca use TIndex for set in math.h
Summary: as desc

Differential Revision: D4271900

fbshipit-source-id: 92f7cbbe33e0ce4fcc21a8af9ded4f436afb43e2
2016-12-05 11:53:27 -08:00
Yangqing Jia
589398950f fbsync at f5a877 2016-11-18 15:41:06 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
Yangqing Jia
b23e51d467 chunky sync 2016-09-06 15:55:19 -07:00
Yangqing Jia
05512d1e10 sync 2016-08-10 11:02:15 -07:00
Yangqing Jia
6463eebc7b chunky sync - build scripts to be written 2016-07-21 10:16:42 -07:00
Yangqing Jia
559053d3a8 chunky sync 2016-05-13 14:43:48 -07:00
Yangqing Jia
4ae1bbbd7e bugfix 2016-03-11 10:30:16 -08:00
Yangqing Jia
50874dc746 relu and pool wip 2016-02-01 14:08:10 -08:00
Yangqing Jia
1740974347 average pooling wrapper: without this the NHWC path would throw an error as the order is not passed along. 2016-01-22 09:31:49 -08:00
Yangqing Jia
98c5b86ef7 A few changes:
(1) cudnn for conv
(2) cublas: after going through the work I feel it's better to use HOST pointer mode, so changed it.
(3) storage order: although googlenet and multibox use NHWC, it seems better to keep
    NCHW as the default to be consistent with caffe and cudnn; moved to NCHW as default.
2015-10-21 22:37:11 -07:00
Yangqing Jia
648d1b101a A consolidation of a couple of random weekend work items.
(1) various bugfixes.
(2) Tensor is now a class independent of its data type. This allows us
    to write type-independent operators more easily.
(3) code convention changes a bit: dtype -> T, Tensor<*Context> -> Tensor* alias.
(4) ParallelNet -> DAGNet to be more consistent with what it does.
(5) Caffe's own flags library instead of gflags.
(6) Caffe's own logging library instead of glog, but glog can be chosen with
    compile-time definition -DCAFFE2_USE_GOOGLE_GLOG. As a result, glog macros
    like CHECK, DCHECK now have prefix CAFFE_, and LOG(*) now becomes
    CAFFE_LOG_*.
(7) an optional protobuf inclusion, which can be chosen with USE_SYSTEM_PROTOBUF
    in build_env.py.
2015-10-11 23:14:06 -07:00
Yangqing Jia
2ed1077a83 A clean init for Caffe2, removing my earlier hacky
commits.
2015-06-25 16:26:01 -07:00