Commit Graph

60 Commits

Author SHA1 Message Date
Eddie Yan
bac33ea8b6 [CUDA] Drop CUDA 10 support (#89582)
CC @ptrblck @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89582
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-01-05 05:11:53 +00:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead (see the sketch below).

- In the next PR
   - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.
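
A minimal sketch of the intended guard pattern, assuming ROCM_VERSION encodes major*10000 + minor*100 + patch (the version thresholds here are hypothetical):

#if defined(USE_ROCM)
#if ROCM_VERSION >= 40200  // hypothetical minimum ROCm version
// ROCm-specific code path
#endif
#else
#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000
// CUDA-specific code path
#endif
#endif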

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Jane Xu
533cb9530e Introducing TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API to the code (#50627)
Summary:
Sub-step of my attempt to split up the torch_cuda library, as it is huge. Please look at https://github.com/pytorch/pytorch/issues/49050 for details on the split and which files are in which target.

This PR introduces two new macros for Windows DLL purposes, TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API. Both are defined as TORCH_CUDA_API for the time being.
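
Per the paragraph above, the new macros are plain aliases for now; a minimal sketch (the exported function is a hypothetical example):

// Both new macros currently forward to the existing one:
#define TORCH_CUDA_CPP_API TORCH_CUDA_API
#define TORCH_CUDA_CU_API TORCH_CUDA_API

// Usage on a symbol that must cross the torch_cuda DLL boundary:
TORCH_CUDA_CPP_API void check_device_capability();  // hypothetical function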

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50627

Reviewed By: mruberry

Differential Revision: D25955441

Pulled By: janeyx99

fbshipit-source-id: ff226026833b8fb2fb7c77df6f2d6c824f006869
2021-01-21 19:09:11 -08:00
Richard Barnes
c44300884e Clarify timing of GetDeviceProperty() (#46715)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46715

Test Plan: N/A

Reviewed By: ezyang

Differential Revision: D24455538

fbshipit-source-id: 1770807d178f618ef6338e28f669f09e4cbd2009
2020-10-22 11:29:31 -07:00
Natalia Gimelshein
9c19a12965 fix asserts in cuda code (#39047)
Summary:
Gets rid of some in-kernel asserts where they can be replaced with static_asserts.
Replaces a bare in-kernel `assert` in one case with `CUDA_KERNEL_ASSERT` where necessary.
Replaces host code `assert`s with `TORCH_INTERNAL_ASSERT`.
Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (https://github.com/pytorch/pytorch/issues/39044); the problems there are not just the asserts.
I've audited the remaining cases of in-kernel asserts; they behave more like `TORCH_INTERNAL_ASSERT`, so they should not be triggered by invalid user data. I think it's ok to leave them as is.
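
A minimal sketch of the resulting convention (the kernel is hypothetical, not code from this PR):

// Device code: CUDA_KERNEL_ASSERT survives release builds and traps on bad input.
__global__ void bounds_check_kernel(const int64_t* idx, int64_t n, int64_t bound) {
  int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i < n) {
    CUDA_KERNEL_ASSERT(idx[i] >= 0 && idx[i] < bound);
  }
}
// Host code: compile-time invariants become static_assert(cond, "msg");
// runtime invariants use TORCH_INTERNAL_ASSERT(cond) instead of bare assert.
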
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39047

Differential Revision: D21750392

Pulled By: ngimel

fbshipit-source-id: e9417523a2c672284de3515933cb7ed166e56719
2020-05-28 15:51:38 -07:00
Natalia Gimelshein
ba14a701dc restore proper cuda assert behavior with DNDEBUG (#38943)
Summary:
Per title. https://github.com/pytorch/pytorch/issues/32719 essentially disabled asserts in cuda kernels in release builds. Asserts in cuda kernels are typically used to prevent invalid reads/writes, so without them invalid reads/writes become silent errors in most cases (sometimes they would still cause "illegal memory access" errors, but because of the caching allocator this usually won't happen).
We don't need two macros (CUDA_ALWAYS_ASSERT and CUDA_KERNEL_ASSERT), because all current asserts in cuda kernels are important for preventing illegal memory accesses and should never be disabled.
This PR removes the CUDA_ALWAYS_ASSERT macro and instead makes CUDA_KERNEL_ASSERT (the one commonly used in kernels) assert in both release and debug builds.
Fixes https://github.com/pytorch/pytorch/issues/38771
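
Roughly how this works (an illustrative definition, not the verbatim macro): the standard device-side assert compiles away under NDEBUG, so the macro calls the device-side __assert_fail directly:

#ifdef __CUDA_ARCH__
#define CUDA_KERNEL_ASSERT(cond)                                    \
  if (!(cond)) {                                                    \
    __assert_fail(#cond, __FILE__,                                  \
                  static_cast<unsigned int>(__LINE__), __func__);   \
  }
#endif
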
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38943

Differential Revision: D21723767

Pulled By: ngimel

fbshipit-source-id: d88d8aa1b047b476d5340e69311e65aff4da5074
2020-05-26 18:11:00 -07:00
Xiang Gao
5e2d8745c8 RIP CUDA <9.2: circleci, aten, and caffe2 (#36846)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36846

Test Plan: Imported from OSS

Differential Revision: D21620850

Pulled By: ngimel

fbshipit-source-id: 7ad1676a12f86250f301095ffc6f365a3b370f34
2020-05-18 13:41:05 -07:00
Edward Yang
38986e1dea Split libtorch.so back into libtorch_{cpu,cuda,hip} (#30315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30315

The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.
This is a reland of https://github.com/pytorch/pytorch/pull/29731 but
I've extracted all of the prep work into separate PRs which can be
landed before this one.

Some things of note:

* torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
* The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
* In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
* A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly
* I had to make torch_cpu/torch_cuda caffe2_interface_library targets so that they get whole-archive linked into torch when you statically link. And I had to do this in an *exported* fashion because torch needs to depend on torch_cpu_library. In the end I exported everything and removed the redefinition in Caffe2Config.cmake. I am not too sure why the old code did it the way it did in the first place; however, switching it this way doesn't seem to have broken anything.
* There are some uses of `__HIP_PLATFORM_HCC__` still in `torch_cpu` code, so I had to apply it to that library too (UGH). This manifests as a failure when trying to run the CUDA fuser. It doesn't really matter substantively right now because we still in-place HIPify, but it would be good to fix eventually. This was a bit difficult to debug because of an unrelated HIP bug; see https://github.com/ROCm-Developer-Tools/HIP/issues/1706

Fixes #27215 (as our libraries are smaller), and executes on
part of the plan in #29235.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18790941

Pulled By: ezyang

fbshipit-source-id: 01296f6089d3de5e8365251b490c51e694f2d6c7
2019-12-04 08:04:57 -08:00
Junjie Bai
352731bd6e Revert D18632773: Split libtorch.so back into libtorch_{cpu,cuda,hip}
Test Plan: revert-hammer

Differential Revision:
D18632773

Original commit changeset: ea717c81e0d7

fbshipit-source-id: 18601439f9f81c9f389020e5a0e4e04adb21772d
2019-11-21 15:01:09 -08:00
Edward Yang
ec30d9028a Split libtorch.so back into libtorch_{cpu,cuda,hip} (#29731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29731

The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.

Some subtleties about the patch:
- There were a few functions that crossed the CPU-CUDA boundary without API macros. I just added them, easy enough. An inverse situation was aten/src/THC/THCTensorRandom.cu, where we weren't supposed to put API macros directly in a cpp file.
- DispatchStub wasn't getting all of the symbols related to its static members exported properly. I tried a few fixes, but in the end I just moved everyone off using DispatchStub to dispatch CUDA/HIP (so they just use normal dispatch for those cases). Additionally, there were some places that incorrectly failed to actually import the declaration of the dispatch stub, so I added includes for those cases.
- torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
- The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
- In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
- A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly. This situation also happens with custom C++ extensions.
- There's a ROCm compiler bug where extern "C" on functions is not respected. There's a little workaround to handle this.
- Because I was too lazy to check whether HIPify was converting TORCH_CUDA_API into TORCH_HIP_API, I just made the HIP build also trigger the TORCH_CUDA_API macro. Eventually, we should translate it and keep the nature of TORCH_CUDA_API constant in all cases.

Fixes #27215 (as our libraries are smaller), and executes on
part of the plan in #29235.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18632773

Pulled By: ezyang

fbshipit-source-id: ea717c81e0d7554ede1dc404108603455a81da82
2019-11-21 11:27:33 -08:00
Karl Ostmo
49481d576d Torch rename (#20774)
Summary:
This renames the CMake `caffe2` target to `torch`, as well as renaming `caffe2_gpu` to `torch_gpu` (and likewise for other gpu target variants).  Many intermediate variables that don't manifest as artifacts of the build remain for now with the "caffe2" name; a complete purge of `caffe2` from CMake variable names is beyond the scope of this PR.

The shell `libtorch` library that had been introduced as a stopgap in https://github.com/pytorch/pytorch/issues/17783 is again flattened in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20774

Differential Revision: D15769965

Pulled By: kostmo

fbshipit-source-id: b86e8c410099f90be0468e30176207d3ad40c821
2019-06-12 20:12:34 -07:00
Edward Yang
1e6acc676f Replace caffe2::DeviceGuard with c10::cuda::CUDAGuard (#17623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17623

Despite its generic-sounding name, caffe2::DeviceGuard actually
only worked on CUDA devices.  Rename it to something that more
clearly spells out its applicability.

I'm not sure if it's the right call, but in this patch I added
'using CUDAGuard = c10::cuda::CUDAGuard', as this seems to be more
in line with how the Caffe2 codebase is currently written.  More
idiomatic c10 namespace style would be to say cuda::CUDAGuard.
Willing to change this if people shout.
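
For reference, a minimal usage sketch of the RAII guard (the device index is a hypothetical example):

#include <c10/cuda/CUDAGuard.h>

void launch_on(int device_index) {
  c10::cuda::CUDAGuard guard(device_index);  // switches the current CUDA device
  // ... enqueue kernels on device_index ...
}  // previous device restored when guard goes out of scope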

This is a respin of D13156470 (#14284)

Reviewed By: dzhulgakov

Differential Revision: D14285504

fbshipit-source-id: 93b8ab938b064572b3b010c307e1261fde0fff3d
2019-03-06 10:48:15 -08:00
Dmytro Dzhulgakov
96ea2594d8 Don't call cudaStreamDestroy at destruction time (#15692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15692

It was leading to occasional crashes with dynamically linked CUDA because the runtime was already destroyed.

Also, unique_ptr<T[]> is more suitable than deque<T> for the purpose.
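
A sketch of the resulting pattern, under the reasoning above (the type name is illustrative):

#include <memory>
#include <cuda_runtime.h>

struct CUDAStreamPool {
  std::unique_ptr<cudaStream_t[]> streams_;  // was a std::deque<cudaStream_t>
  ~CUDAStreamPool() {
    // Intentionally do NOT call cudaStreamDestroy here: at static-destruction
    // time the (dynamically linked) CUDA runtime may already be gone.
  }
};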

Reviewed By: Yangqing

Differential Revision: D13571988

fbshipit-source-id: 37eb26dfbe361c49160367b53f87bd037c6c0e46
2019-01-11 12:36:41 -08:00
hbraun@nvidia.com
3fdf567752 Adding CUDA version for C2 operators generate proposals and nms (#13694)
Summary:
Related to issue #13684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13694

Reviewed By: wat3rBro

Differential Revision: D13017791

Pulled By: newstzpz

fbshipit-source-id: 4bdc58e474d8e1f6cd73a02bf51f91542a2b9d0b
2018-12-20 14:39:09 -08:00
rohithkrn
7e2b074219 Integrate rocBLAS fp16 api into Caffe2 (#14882)
Summary:
This PR integrates the rocBLAS half and mixed-precision APIs into Caffe2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14882

Differential Revision: D13407840

Pulled By: bddppq

fbshipit-source-id: 75cb0d74da066776fa66575f1d255e879d36121e
2018-12-10 17:54:06 -08:00
Junjie Bai
0d7a986da1 Change hip filename extension to .hip (#14036)
Summary:
xw285cornell

- To give hip files a unique filename extension, we change them from _hip.cc to .hip (it's the only blessed option other than .cu in hipcc 3d51a1fb01/bin/hipcc (L552)).
- Change to use the host compiler to compile .cc|.cpp files. Previously we used hcc to compile them, which was unnecessary.
- Change the hipify script to not replace "gpu" with "hip" in the filenames of the generated hipified files. Previously we did this because hcc has a bug when linking files that have the same filename. We have now changed to use the host linker, so this is no longer necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14036

Reviewed By: xw285cornell

Differential Revision: D13091813

Pulled By: bddppq

fbshipit-source-id: ea3d887751d8abb39d75f5d5104aa66ce66b9ee0
2018-11-16 11:55:59 -08:00
Edward Yang
fed8d8975a Various improvements to hipify_python.py (#13973)
Summary:
- Speed up hipify_python.py by blacklisting useless (and quite large)
  directory trees that it would otherwise recurse into

- Pass around relative paths instead of absolute paths.  This makes it
  easier to do filename matches based on the root of the tree.

- Redo the streaming output to contain more useful information

- Make it handle c10/cuda correctly, rewrite c10::cuda to
  c10::hip, and the header name from CUDAMathCompat.h to
  CUDAHIPCompat.h

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13973

Differential Revision: D13062374

Pulled By: ezyang

fbshipit-source-id: f0858dd18c94d449ff5dbadc22534c695dc0f8fb
2018-11-14 17:11:24 -08:00
Junjie Bai
95ca66763d Add math functions overloaded over different numeric types for cuda and hip (#13602)
Summary:
petrex ashishfarmer rohithkrn iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13602

Reviewed By: dzhulgakov

Differential Revision: D12935797

Pulled By: bddppq

fbshipit-source-id: a49ec66fb60bfd947c63dd2133d431884df62235
2018-11-06 01:40:31 -08:00
Xiaodong Wang
e6b6cc06ee caffe2/core hipify (#13457)
Summary:
Small edits to caffe2/core hipify to make it compile in fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13457

Reviewed By: bddppq

Differential Revision: D12883472

Pulled By: xw285cornell

fbshipit-source-id: 1da231d721311d105892db13ed726240398ba49e
2018-11-01 15:49:56 -07:00
Xiaomeng Yang
017b91f861 Optimize channel_shuffle_op on GPU (#13066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13066

Optimize channel_shuffle_op on GPU

Reviewed By: houseroad

Differential Revision: D10639281

fbshipit-source-id: 394b937403e5d4e9df93548bbf87285bffaa55a9
2018-10-30 12:18:27 -07:00
Sergei Nikolaev
1c7832c854 CUDA 10 warnings fixed (#12442)
Summary:
Deprecation warning against `cudaPointerAttributes`, where the `memoryType` field has been deprecated in favor of `type`; see https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__UNIFIED.html#contents-end for details.
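
The fix amounts to reading the new field; a compatibility sketch (error handling elided):

#include <cuda_runtime.h>

cudaMemoryType pointer_kind(const void* ptr) {
  cudaPointerAttributes attr;
  cudaPointerGetAttributes(&attr, ptr);
#if CUDART_VERSION >= 10000
  return attr.type;        // new field name
#else
  return attr.memoryType;  // deprecated as of CUDA 10
#endif
}
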
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12442

Differential Revision: D10251239

Pulled By: zou3519

fbshipit-source-id: 500f1e02aa8e11c510475953ef5244d5fb13bf9e
2018-10-11 00:25:22 -07:00
Mingzhe Li
aa8cd7319a Enable build_test on windows (#11802)
Summary:
This PR enables BUILD_TEST for Caffe2 on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11802

Reviewed By: orionr

Differential Revision: D9951223

Pulled By: mingzhe09088

fbshipit-source-id: 7cdc1626b999daadeae482bd569eebdbd53eb6d4
2018-09-19 20:17:49 -07:00
Mingzhe Li
a7cbcb1bb9 Enable build_python on windows (#11385)
Summary:
The PR aims to resolve issues related to BUILD_PYTHON and BUILD_TEST after FULL_CAFFE2 is removed on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11385

Reviewed By: orionr

Differential Revision: D9884906

Pulled By: mingzhe09088

fbshipit-source-id: fc114c0cbff6223f1ec261161e4caecc1fef5dd6
2018-09-17 21:40:03 -07:00
Yangqing Jia
68613cf5a2 Windows DLL build with Caffe2 code (#11266)
Summary:
This is an experimental build on top of what orionr and mingzhe09088 built.

Essentially, the idea is that we will need separate *_API versions for different shared libraries. If this theory is right, I'll try to clean up the design a bit and document it properly.
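
A sketch of the per-library pattern being described (macro and flag names are illustrative):

// Each shared library gets its own *_API macro, switched by a per-library
// build flag: export when building the DLL, import when consuming it.
#ifdef _WIN32
#if defined(CAFFE2_BUILD_MAIN_LIB)
#define CAFFE2_API __declspec(dllexport)
#else
#define CAFFE2_API __declspec(dllimport)
#endif
#else
#define CAFFE2_API __attribute__((visibility("default")))
#endif
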
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11266

Reviewed By: orionr

Differential Revision: D9682942

Pulled By: Yangqing

fbshipit-source-id: c79653199e67a1500c9174f39f8b0357324763f3
2018-09-06 15:12:20 -07:00
Lukasz Wesolowski
e2ecf3914a Change default CUDA block size from 512 to 128 (#10090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10090

Decreasing the block size improves GPU utilization for use cases with small input sizes (e.g. 10000)
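
Why this helps, in rough numbers (assuming the usual grid-size computation; the names are illustrative):

constexpr int kNumThreads = 128;  // previously 512
inline int NumBlocks(const int n) {
  return (n + kNumThreads - 1) / kNumThreads;
}
// For n = 10000: 512 threads/block yields only 20 blocks, while 128 yields 79,
// spreading small inputs across many more SMs.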

Reviewed By: pjh5

Differential Revision: D9093573

fbshipit-source-id: c8f995b773a00b1bea3a3809c0f6557133efd9dd
2018-08-02 15:40:13 -07:00
Xiaomeng Yang
9243b64bff [Caffe2] Update elementwise ops to support numpy style broadcast (#8070)
* Update elementwise ops to support numpy style broadcast

Update elementwise ops to support numpy style broadcast
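
For context, a minimal sketch of the numpy broadcast rule being adopted (a hypothetical helper, not code from this PR):

#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Align shapes from the right; each dimension pair must match or be 1.
std::vector<int64_t> BroadcastShape(const std::vector<int64_t>& a,
                                    const std::vector<int64_t>& b) {
  std::vector<int64_t> out(std::max(a.size(), b.size()));
  for (size_t i = 0; i < out.size(); ++i) {
    const int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
    const int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
    if (da != db && da != 1 && db != 1) {
      throw std::runtime_error("shapes are not broadcastable");
    }
    out[out.size() - 1 - i] = std::max(da, db);
  }
  return out;
}
// BroadcastShape({8, 1, 6, 1}, {7, 1, 5}) == {8, 7, 6, 5}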

* Fix sqrt_op

* Fix compare ops

* Fix gradient test

* Fix optimizer legacy broadcast

* Fix legacy broadcast for elementwise ops

* Skip flaky test

* Fix eigen simple binary op

* Fix attention test

* Fix rnn test

* Fix LSTM test

* Fix tan grad

* Fix schema check
2018-06-05 15:49:16 -07:00
Xiaomeng Yang
a61d4a3374 [Caffe2] Refactor reduce ops to take flexible input types (#7164)
* Refactor reduce ops to take flexible input types

* Add DISPATCH_FUNCTION macros in common_gpu.h

* Use macros to reduce switch case in dispatching cuda functions
2018-05-02 12:08:38 -07:00
Xiaomeng Yang
2ef23b6241 [caffe2] Update transpose with compile time dimension (#6614)
* Update transpose with compile time dimension

* Change return to break
2018-04-15 19:20:39 -07:00
Xiaomeng Yang
cd2112717c [caffe2] Update math functions with params on host. (#6602)
* Update ReduceMean

Add reduce mean to math

* sync reduce_ops_test

* Update math_gpu.cu
2018-04-14 21:41:41 -07:00
Xianjie Chen
6ed9a0c3f2 fix cuda elementwise ops for empty batch
CUDA will fail to launch an empty kernel.
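
The fix pattern, sketched (the kernel and block size are hypothetical):

__global__ void elementwise_kernel(int n, float* data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

void launch(int N, float* data) {
  if (N > 0) {  // a grid of 0 blocks is an invalid launch configuration
    const int threads = 128;
    const int blocks = (N + threads - 1) / threads;
    elementwise_kernel<<<blocks, threads>>>(N, data);
  }
}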
2018-03-27 18:10:39 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Yangqing Jia
ced2c7e2b2 Remove Set/GetDefaultGPUID and move to use current gpu id instead.
Summary:
Reason for this change:

(1) Setting/Getting default gpu id doesn't seem to be used at all.
(2) It actually is confusing compared to the CUDA_VISIBLE_DEVICES options etc.
(3) When setting cuda_gpu_id=-1 in the CUDAContext arg, it used to use the
default gpu id, but we should probably use the current gpu instead - so that the
caller is able to control the device placement.

One use case is for TensorRT - if we have a custom callback layer, then it would
be easier for TRT or whatever caller to set the running device.

Reviewed By: dzhulgakov

Differential Revision: D6740357

fbshipit-source-id: 2ea710e434b10220d5a198e31c93847304636863
2018-01-19 18:03:21 -08:00
Xianjie Chen
d1c73eb407 use size_t for rand fill functions in math
Summary: The number of elements in the caffe2 blob can be larger than int32. Use size_t to prevent overflow.
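
A before/after sketch of the signature change (the function name is illustrative):

// Before: the element count overflows for blobs with more than 2^31 - 1 elements.
void RandUniformFill(int n, float* out);
// After:
void RandUniformFill(size_t n, float* out);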

Reviewed By: ajtulloch

Differential Revision: D6278363

fbshipit-source-id: 356e294c667a53360d8a65b56a63a39d5ce3384e
2017-11-09 18:44:46 -08:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Yangqing Jia
cf769a7b6f Avoid race condition in get device properties.
Summary: TSIA

Reviewed By: salexspb

Differential Revision: D5898125

fbshipit-source-id: 1822ef2a017719442045fa446321d007b9d544b8
2017-09-23 16:01:23 -07:00
Yangqing Jia
26f0943130 Do CaffeCudaSetDevice and CaffeCudaGetDevice
Summary:
These are wrapper functions so that if we run in a Caffe2-only mode, we can
turn the flag on and get some small speedup on cuda device switches.

The purpose of the diff is to allow us to quickly assess the overhead of cuda
device switch functions. Ideally, the caching behavior shall live in the cuda
driver, which is the only safe place to ensure correctness.

If other code is running aside Caffe2 and does not properly do device guard,
this functionality will fail as separate cudaSetDevice() calls will not update
Caffe2's thread local device id. As a result, the functionality is only enabled
when/if one explicitly sets the flag.

This might not be safe, so use with caution (see the sketch after the timing numbers below).

- cudaGetDevice can go from 90ns to 2ns
- when setting the same device, we can go from 100ns to 2 ns
- when setting a different device, things are the same (1ns overhead on top of 143ns)
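
A sketch of the caching wrappers described above (illustrative; per the caveats, only active when the opt-in flag is set):

thread_local int cached_device = -1;  // Caffe2's thread-local device id

void CaffeCudaSetDevice(const int id) {
  if (id != cached_device) {  // skip the redundant cudaSetDevice call
    CUDA_ENFORCE(cudaSetDevice(id));
    cached_device = id;
  }
}

int CaffeCudaGetDevice() {
  if (cached_device < 0) {
    CUDA_ENFORCE(cudaGetDevice(&cached_device));
  }
  return cached_device;
}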

Reviewed By: azzolini

Differential Revision: D5709398

fbshipit-source-id: 6255f17a3d41f59a30327436383f306a2287896e
2017-08-25 18:20:14 -07:00
Yangqing Jia
93e12e75df Allow caffe2 to detect if cuda lib has been linked, and also fix oss build error.
Summary: Closes https://github.com/caffe2/caffe2/pull/1114

Reviewed By: pietern

Differential Revision: D5686557

Pulled By: Yangqing

fbshipit-source-id: 6b7245ebbe4eeb025ce9d0fe8fda427a0c3d9770
2017-08-23 18:41:15 -07:00
Yangqing Jia
5d24a4eeef Early design for a general Event abstraction cross-devices.
Summary:
There are ad-hoc efforts to avoid excessive device synchronizations, such as
async_dag, singlethread_async, etc. This diff aims to provide an early design
for a general Event class, that can achieve the following:

(1) It is device agnostic, essentially using a vtable to do cross-device record,
wait, and synchronization.
(2) Created new functions WaitEvent and Record in the Context class for
interacting with Events.
(3) Exposed the corresponding WaitEvent and Record functions in the OperatorBase
class as well.

An example use case is that, after potential future refactoring, one can achieve
a real async execution per operator by running

op.WaitEvent(previous_event);
op.RunAsync();
op.RecordEvent(this_op_event);

and the next op can do

next_op.WaitEvent(this_op_event);

Right now, I changed async_dag net implementation so that it uses the general
event design. The old Event class is assimilated to the general Event class and
the old Stream class is now essentially taken over by the Context class itself.

Reviewed By: harouwu

Differential Revision: D5648463

fbshipit-source-id: 58bd84d06e4a9977b0b835110ddb2f18be3b7cbc
2017-08-18 15:46:51 -07:00
Pieter Noordhuis
692f4e4e3b Disable -Wstrict-aliasing when including cuda_fp16.h
Summary:
The cuda_fp16.h header in CUDA 9 RC triggers this diagnostic.
It is included by cusparse.h as well, so guarding the
inclusion of only cuda_fp16.h is not enough.
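
The guard pattern this implies (a sketch, applied wherever the header can be reached):

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstrict-aliasing"
#include <cuda_fp16.h>  // also pulled in transitively via cusparse.h
#pragma GCC diagnostic pop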

Reviewed By: Yangqing

Differential Revision: D5651995

fbshipit-source-id: 4778a8a793761e7a1dbebf3792b85b33a3e26219
2017-08-17 14:15:32 -07:00
Simon Layton
85788a0f65 Add TensorCore support
Summary:
Add support for TensorCore convolution and gemm on Volta hardware.

Currently built on top of #1055
Closes https://github.com/caffe2/caffe2/pull/1056

Differential Revision: D5604068

Pulled By: Yangqing

fbshipit-source-id: 100f67e26ed5fabb1dbb31dcd77f7ecb84de4ee7
2017-08-10 20:16:48 -07:00
Yangqing Jia
475eff5281 Allow peer access only in groups of 8
Summary:
This is the hardware limit set by NVIDIA. Basically, on Amazon P2 machines that
have 16 gpus, the previous setting would trigger an error. This fixes the issue
but is pending verification from Amazon.
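
A sketch of group-limited peer access (illustrative; error handling elided):

#include <algorithm>
#include <cuda_runtime.h>

void EnablePeerAccessInGroupsOf8(const int num_gpus) {
  for (int i = 0; i < num_gpus; ++i) {
    cudaSetDevice(i);
    const int group_start = (i / 8) * 8;             // peer groups: [0,8), [8,16), ...
    const int group_end = std::min(num_gpus, group_start + 8);
    for (int j = group_start; j < group_end; ++j) {
      if (j != i) cudaDeviceEnablePeerAccess(j, 0);  // flags must be 0
    }
  }
}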

Differential Revision: D4888402

fbshipit-source-id: 8d26a24d6e0476f895b9afdb979144eb8e6b9321
2017-04-14 09:47:48 -07:00
Yangqing Jia
81d5461973 cuda check -> enforce
Summary:
In the past we have moved most of the CHECKs to CAFFE_ENFORCE (exceptions).
However, we kept the name "*_CHECK" for cuda calls, and that caused some
confusion, especially in the destructor calls: while our destructors are not
written to handle exceptions, these CUDA_CHECKs could actually throw some
exceptions.

As a result, this diff

(1) Renames all cuda related "*_CHECK" to "*_ENFORCE"
(2) Explicitly marked the destructor of core Caffe2 classes as noexcept
(3) Added proper, really-CHECK cuda check macros, and used those in the
corresponding destructors.

This should not change any existing functionality.
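
Illustrative definitions of the distinction (not the verbatim macros):

#include <cstdlib>
#include <stdexcept>
#include <cuda_runtime.h>

// Throws an exception: fine in ordinary code paths.
#define CUDA_ENFORCE(expr)                                    \
  do {                                                        \
    const cudaError_t err_ = (expr);                          \
    if (err_ != cudaSuccess)                                  \
      throw std::runtime_error(cudaGetErrorString(err_));     \
  } while (0)

// Really checks: aborts instead of throwing, so it is safe to
// use inside noexcept destructors.
#define CUDA_CHECK(expr)                                      \
  do {                                                        \
    if ((expr) != cudaSuccess) std::abort();                  \
  } while (0)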

Reviewed By: dzhulgakov

Differential Revision: D4656368

fbshipit-source-id: 32e3056e66c0400156c5ca0187b6151cf3d52404
2017-03-05 22:46:22 -08:00
Yangqing Jia
589398950f fbsync at f5a877 2016-11-18 15:41:06 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
Yangqing Jia
6463eebc7b chunky sync - build scripts to be written 2016-07-21 10:16:42 -07:00
Yangqing Jia
559053d3a8 chunky sync 2016-05-13 14:43:48 -07:00
Yangqing Jia
137b880aac cuda initialization.
This makes it callable multiple times, but the
actual code only runs once. TODO: make it thread-safe.
I am too lazy for now.
2016-03-15 12:52:05 -07:00
Yangqing Jia
78aa266770 Fix 2016-01-19 14:49:48 -08:00
Yangqing Jia
d84545c5fb fp16: allow one to override. 2016-01-19 14:39:26 -08:00
Yangqing Jia
fa59b90c72 misc updates 2016-01-13 21:00:56 -08:00