Summary: D6636282 caused a regression test failure for the NMT model used in prod; see 24949620 for the bisect history.
Reviewed By: pietern
Differential Revision: D6671602
fbshipit-source-id: d863013964666727cf488a6ac5b01f5216f149d9
Summary: Adds transpose CPU version to prepare for LC layer.
Reviewed By: Yangqing
Differential Revision: D6641358
fbshipit-source-id: 1825b4c270dea2c0049ba334303abcbf50b22ee7
Summary: This is a reapplication of the earlier PR due to the xplat move. The original author is Christoph Conrads <christoph.conrads@fluent.ai> (christoph-conrads).
Reviewed By: houseroad
Differential Revision: D6379736
fbshipit-source-id: b7482ecf3b9487a528c15e92976e915791210002
Summary: The number of elements in the caffe2 blob can be larger than int32. Use size_t to prevent overflow.
Reviewed By: ajtulloch
Differential Revision: D6278363
fbshipit-source-id: 356e294c667a53360d8a65b56a63a39d5ce3384e
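The overflow described in D6278363 can be illustrated with a minimal sketch. It is not the Caffe2 code: the helper names and the 50000 x 50000 shape are assumptions chosen for illustration, and `ctypes` is used only to mimic fixed-width C integer truncation.

```python
import ctypes


def element_count_int32(dims):
    """Multiply dimensions, truncating to a signed 32-bit int after each step."""
    count = 1
    for d in dims:
        # c_int32 does no overflow checking; it keeps only the low 32 bits.
        count = ctypes.c_int32(count * d).value
    return count


def element_count_size_t(dims):
    """Same product, but truncated to the platform's size_t width."""
    count = 1
    for d in dims:
        count = ctypes.c_size_t(count * d).value
    return count


dims = (50000, 50000)  # hypothetical blob shape with 2.5e9 elements
print(element_count_int32(dims))   # wraps around to a negative number
print(element_count_size_t(dims))  # keeps the true element count
```

A count of 2.5e9 exceeds the int32 maximum of 2147483647, so the 32-bit product is no longer usable as a size, while the size_t product is.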
Summary: We have to use the copy constructor in Concat when copying non-primitive types.
Reviewed By: Yangqing
Differential Revision: D6002883
fbshipit-source-id: 0aebc955079975bb6423291589ed09ce0660acf3
Summary: PR 1175 caused a build error because gemmBatched was only defined under a specific #ifdef. It is now outside the #ifdef, and things work.
Reviewed By: asaadaldien
Differential Revision: D5834868
fbshipit-source-id: 072a64c8f4b259ff7504104121766115b46b8aa0
Summary: When both sin and cos need to be computed, the joint SinCos function is faster than computing them individually. Both MKL and CUDA support this function, so this diff exposes it here.
Reviewed By: kmatzen
Differential Revision: D5465588
fbshipit-source-id: 7686498e4f2d4b5862d83a1ecf14fcc88ea53640
Summary: Based on the benchmark script located at `caffe2/experiments/python/device_reduce_sum_bench.py`, device reduce sum is slower for N <= 10000, so SumElements only switches to device reduce for large N. This diff applies the same scheme to SumSqrElements.
Reviewed By: jamesr66a
Differential Revision: D5369868
fbshipit-source-id: ae13a611aff9d3464d1c4950ee155c740a2da339
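The size-based switch described in D5369868 can be sketched on the CPU as follows. This is an illustration only: `tree_reduce_sum` is a stand-in for a device-wide reduce, and the function names are hypothetical. The 10000 crossover is the one quoted in the commit message.

```python
DEVICE_REDUCE_MIN_N = 10000  # crossover reported by the benchmark in the commit message


def simple_sum(xs):
    """Straightforward accumulation, standing in for the original small-N kernel."""
    total = 0
    for x in xs:
        total += x
    return total


def tree_reduce_sum(xs):
    """Pairwise (tree) reduction, standing in for a device-wide reduce."""
    level = list(xs)
    if not level:
        return 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # carry the unpaired last element forward
        level = paired
    return level[0]


def choose_reduce_path(n):
    """Return which path would be used for an input of n elements."""
    return "device_reduce" if n > DEVICE_REDUCE_MIN_N else "simple"


def sum_elements(xs):
    """Dispatch on input size, as the commit describes for SumElements."""
    if choose_reduce_path(len(xs)) == "device_reduce":
        return tree_reduce_sum(xs)
    return simple_sum(xs)
```

Both paths return the same value; only the cost profile differs, which is why the choice can be made purely on N.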
Summary: Port SumElements and softmax_ops.cu to use device reduce sum
Reviewed By: akyrola
Differential Revision: D5351881
fbshipit-source-id: ca9604186c261ffcb1480da2a17baab8a4809372
Summary:
This is a real implementation (not GPUFallbackOp) of the TopKOp for GPU.
There are two algorithm implementations:
- For k <= 512, it maps to a warp-wide min-heap implementation, which requires only a single scan of the input data.
- For k > 512, it maps to a multi-pass radix selection algorithm that I originally wrote in cutorch. I took the recent cutorch code and removed cutorch-specific things where it made sense.
Also added several utility files that one or the other implementation uses, some from the Faiss library and some from the cutorch library.
Reviewed By: jamesr66a
Differential Revision: D5248206
fbshipit-source-id: ae5fa3451473264293516c2838f1f40688781cf3
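The two selection strategies named in D5248206 can be sketched on the CPU as follows. This is not the GPU implementation: it is a single-threaded illustration restricted to non-negative integers below 2**32, and all function names are hypothetical. Only the k <= 512 cutoff comes from the commit message.

```python
import heapq

HEAP_K_LIMIT = 512  # cutoff quoted in the commit message


def topk_heap(values, k):
    """Single scan that keeps the k largest values in a min-heap."""
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # evict the current smallest survivor
    return sorted(heap, reverse=True)


def kth_largest_radix(values, k, bits=32, radix_bits=4):
    """Multi-pass radix selection of the k-th largest value (1 <= k <= len)."""
    mask = (1 << radix_bits) - 1
    candidates = list(values)
    # Fix one digit per pass, starting from the most significant digit.
    for shift in range(bits - radix_bits, -1, -radix_bits):
        counts = [0] * (mask + 1)
        for v in candidates:
            counts[(v >> shift) & mask] += 1
        for digit in range(mask, -1, -1):
            if counts[digit] >= k:
                candidates = [v for v in candidates if (v >> shift) & mask == digit]
                break
            k -= counts[digit]  # skip a whole bucket of larger values
    return candidates[0]  # all remaining candidates are equal


def topk_radix(values, k):
    """Find the k-th largest value, then gather everything above it."""
    threshold = kth_largest_radix(values, k)
    greater = [v for v in values if v > threshold]
    ties = [threshold] * (k - len(greater))
    return sorted(greater + ties, reverse=True)


def topk(values, k):
    """Dispatch on k the way the commit message describes."""
    k = min(k, len(values))
    if k <= 0:
        return []
    return topk_heap(values, k) if k <= HEAP_K_LIMIT else topk_radix(values, k)
```

The heap path touches each input once, while the radix path rescans a shrinking candidate set once per digit, which is the trade-off behind switching algorithms as k grows.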
Summary: This diff resolves some issues in the reverted PR246.
Differential Revision: D4911821
fbshipit-source-id: 0a6fa47f4c2405475697e40fb926758c534f8ef7
Summary:
Added SumSqrElements, which avoids the large temporary blob needed when doing Sqr + SumElements.
Also moved it to reduction_ops, because utility_ops has grown too big.
Reviewed By: jamesr66a
Differential Revision: D4844172
fbshipit-source-id: 032eec45e24d6724f0d5fb83f4ec1c771d1146e5
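The saving described in D4844172 can be shown with a small sketch. The function names are hypothetical and plain Python lists stand in for blobs; the point is only that the fused version never materializes the squared values.

```python
def sum_sqr_two_step(xs):
    """Sqr followed by SumElements: builds a temporary as large as the input."""
    squared = [x * x for x in xs]  # temporary blob
    return sum(squared)


def sum_sqr_fused(xs):
    """Fused sum of squares: accumulates on the fly, no temporary."""
    total = 0
    for x in xs:
        total += x * x
    return total
```

Both functions return the same result; the fused one needs only a single accumulator regardless of input size.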
Summary: Instead of calling math::Add once per batch element, added a new function that does a batch of additions. For CPU there is no difference, but for CUDA we do everything in one kernel. I don't think this has a huge performance impact, but at least it makes the CUDA profiling look better with fewer kernel launches.
Reviewed By: jamesr66a
Differential Revision: D4798411
fbshipit-source-id: 44ac65b2da5a615971219809b9298b4e122085cd
Summary:
(Note: previous revert was due to a race condition between D4657831 and
D4659953 that I failed to catch.)
After this, we should have contbuild guarding the Windows build both with
and without CUDA.
This includes a series of changes that are needed to make Windows build,
specifically:
(1) Various flags that are needed in the cmake system, especially those dealing
with /MD, /MT, cuda, cudnn, whole static linking, etc.
(2) Contbuild scripts based on appveyor.
(3) For Windows build, note that one will need to use "cmake --build" to
build stuff so that the build type is consistent between configuration and
actual build. See scripts\build_windows.bat for details.
(4) In logging.h, ERROR is already defined by Windows. I don't have a good
solution now, and as a result, LOG(ERROR) on windows is going to be
LOG(INFO).
(5) Variable-length arrays are not supported by MSVC (and are not part of
the C++ standard). As a result I replaced them with vectors.
(6) sched.h is not available on Windows, so akyrola's awesome simple
async net might encounter some slowdown due to no affinity setting on
Windows.
(7) MSVC has a bug that does not work very well with template calls inside
a templated function call, which is a known issue that should be fixed in
MSVC 2017. However for now this means changes to conv_op_impl.h and
recurrent_net_op.h. No actual functionality is changed.
(8) std host function calls are not supported in CUDA8+MSVC, so I changed
lp_pool (and maybe a few others) to use cuda device functions.
(9) The current Scale and Axpy have heavy templating that does not work
well with MSVC. As a result I reverted azzolini's changes to the Scale
and Axpy interface, and moved the fixed-length version to ScaleFixedSize and
AxpyFixedSize.
(10) CUDA + MSVC does not deal with Eigen well, so I guarded all Eigen
parts to only the non-CUDA part.
(11) In conclusion, it is fun but painful to deal with visual c++.
Differential Revision: D4666745
fbshipit-source-id: 3c9035083067bdb19a16d9c345c1ce66b6a86600
Summary:
After this, we should have contbuild guarding the Windows build both with
and without CUDA.
This includes a series of changes that are needed to make Windows build,
specifically:
(1) Various flags that are needed in the cmake system, especially those dealing
with /MD, /MT, cuda, cudnn, whole static linking, etc.
(2) Contbuild scripts based on appveyor.
(3) For Windows build, note that one will need to use "cmake --build" to
build stuff so that the build type is consistent between configuration and
actual build. See scripts\build_windows.bat for details.
(4) In logging.h, ERROR is already defined by Windows. I don't have a good
solution now, and as a result, LOG(ERROR) on windows is going to be
LOG(INFO).
(5) Variable-length arrays are not supported by MSVC (and are not part of
the C++ standard). As a result I replaced them with vectors.
(6) sched.h is not available on Windows, so akyrola's awesome simple
async net might encounter some slowdown due to no affinity setting on
Windows.
(7) MSVC has a
Closes https://github.com/caffe2/caffe2/pull/183
Reviewed By: ajtulloch
Differential Revision: D4657831
Pulled By: Yangqing
fbshipit-source-id: 070ded372ed78a7e3e3919fdffa1d337640f146e
Summary: This is like `UniformIntFill`, but it guarantees that the output elements are unique and excludes the optional elements to avoid.
Reviewed By: xianjiec
Differential Revision: D4511814
fbshipit-source-id: 5dc98ee580616e60e46ee74ebb3f5ddd29a09965
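The behavior described in D4511814 can be sketched with rejection sampling. This is an assumption about one reasonable approach, not the operator's actual code: the function name, the inclusive `low`/`high` bounds, and the `seed` parameter are all hypothetical.

```python
import random


def unique_uniform_int_fill(count, low, high, avoid=(), seed=None):
    """Draw `count` distinct integers from [low, high], skipping `avoid`."""
    rng = random.Random(seed)
    excluded = {a for a in avoid if low <= a <= high}
    available = (high - low + 1) - len(excluded)
    if count > available:
        raise ValueError("not enough distinct values left in the range")
    seen = set(excluded)  # values that may no longer be emitted
    out = []
    while len(out) < count:
        candidate = rng.randint(low, high)  # inclusive on both ends
        if candidate not in seen:
            seen.add(candidate)
            out.append(candidate)
    return out


sample = unique_uniform_int_fill(5, 0, 9, avoid=[1, 2, 3], seed=0)
print(sample)
```

The up-front feasibility check matters: without it, a request for more unique values than the range can supply would loop forever.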
Summary:
I don't know why I introduced this embarrassing bug, which swapped the order of
ldb and beta in the gemm interface. This fixes that.
Differential Revision: D4014493
fbshipit-source-id: 1aec950b6e9d57e947654d4044e50930f2db1344
(1) cudnn for conv
(2) cublas: after going through the work I feel it's better to use HOST pointer mode, so changed it.
(3) storage order: although googlenet and multibox use NHWC, it seems better to keep
NCHW as the default, to be consistent with caffe and cudnn; moved to NCHW as default.
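For readers unfamiliar with the two storage orders in item (3), the difference is only in how a 4-D index maps to a flat offset. A minimal sketch, with hypothetical helper names and assuming dense row-major storage:

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat offset in NCHW order: width varies fastest, then height, then channel."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat offset in NHWC order: channel varies fastest."""
    return ((n * H + h) * W + w) * C + c


# Moving to the next channel at a fixed pixel:
print(nchw_offset(0, 1, 0, 0, C=3, H=4, W=5))  # jumps a whole H*W plane
print(nhwc_offset(0, 1, 0, 0, C=3, H=4, W=5))  # moves to the adjacent element
```

In NCHW each channel is a contiguous plane, whereas in NHWC all channels of one pixel sit next to each other.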
(1) various bugfixes.
(2) Tensor is now a class independent from its data type. This allows us
to write easier type-independent operators.
(3) code convention changes a bit: dtype -> T, Tensor<*Context> -> Tensor* alias.
(4) ParallelNet -> DAGNet to be more consistent with what it does.
(5) Caffe's own flags library instead of gflags.
(6) Caffe's own logging library instead of glog, but glog can be chosen with
compile-time definition -DCAFFE2_USE_GOOGLE_GLOG. As a result, glog macros
like CHECK, DCHECK now have prefix CAFFE_, and LOG(*) now becomes
CAFFE_LOG_*.
(7) an optional protobuf inclusion, which can be chosen with USE_SYSTEM_PROTOBUF
in build_env.py.