Commit Graph

39 Commits

Jeff Daily
5379b5f927 [ROCm] use hipblas instead of rocblas (#105881)
- BatchLinearAlgebraLib.cpp is now split, with one additional file:
  - BatchLinearAlgebraLib.cpp uses only cusolver APIs
  - BatchLinearAlgebraLibBlas.cpp uses only cublas APIs
  - hipify operates at the file level and cannot mix cusolver and cublas APIs within the same file
- cmake changes to link against hipblas instead of rocblas
- hipify mapping changes to map cublas -> hipblas instead of rocblas (sketched below)
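A minimal sketch of what the updated mapping emits; the wrapper function and leading dimensions are illustrative, and the header location varies across hipBLAS versions:

```
#include <hipblas/hipblas.h>  // some hipBLAS versions install this as <hipblas.h>

// Illustrative only: the CUDA source would call
//   cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, ...);
// with the updated mapping, hipify emits hipblas calls instead of rocblas ones.
void gemm_after_hipify(hipblasHandle_t handle, int m, int n, int k,
                       const float* A, const float* B, float* C) {
  const float alpha = 1.0f, beta = 0.0f;
  hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, m, n, k,
               &alpha, A, m, B, k, &beta, C, m);
}
```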

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105881
Approved by: https://github.com/albanD
2023-07-31 20:42:55 +00:00
Xiaodong Wang
025cd69a86 [AMD] Fix some legacy hipify script (#70594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70594

Pull Request resolved: https://github.com/facebookincubator/gloo/pull/315

Fix some outdated hipify scripts:
* python -> python3 (fb internal)
* rocblas return code
* gloo makefile for hip clang

Test Plan: Sandcastle + OSS build

Reviewed By: malfet, shintaro-iwasaki

Differential Revision: D33402839

fbshipit-source-id: 5893039451bcf77bbbb1b88d2e46ae3e39caa154
2022-01-05 11:34:25 -08:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead (see the sketch after this list).

- In the next PR:
   - Remove the mapping from CUDA_VERSION to HIP_VERSION and from CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.
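A minimal sketch of the guard style this moves toward; the version thresholds and code paths are illustrative:

```
// Illustrative only: gate ROCm code on USE_ROCM/ROCM_VERSION instead of
// reusing CUDA_VERSION via hipify's (soon-to-be-removed) remapping.
#if defined(USE_ROCM) && ROCM_VERSION >= 40300  // hypothetical threshold
  // ROCm-specific path
#elif defined(CUDA_VERSION) && CUDA_VERSION >= 11000
  // CUDA-specific path
#endif
```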

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Dmytro Dzhulgakov
06d978a9ad [c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (#42249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249

Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning | throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other cleanup changes:
* cache device_count() always in a static variable (sketched below)
* move all asan macros into c10
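A minimal sketch, not the actual c10 implementation, of the static caching; the warning/exception plumbing is elided:

```
#include <cuda_runtime.h>

// Sketch only: a function-local static runs the query exactly once;
// subsequent calls return the cached value.
int device_count() {
  static int count = [] {
    int c = 0;
    if (cudaGetDeviceCount(&c) != cudaSuccess) {
      return 0;  // per the table above: return 0 here, throw on real init
    }
    return c;
  }();
  return count;
}
```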

Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
# device_count() should not throw; broken setups return 0 (plus a warning)
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
# moving to CUDA triggers lazy init, which should throw the detailed message
x = x.cuda()
print('moved to cuda')
```

Reviewed By: ngimel

Differential Revision: D22824329

fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
2020-08-05 11:39:31 -07:00
Xiang Gao
5e2d8745c8 RIP CUDA <9.2: circleci, aten, and caffe2 (#36846)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36846

Test Plan: Imported from OSS

Differential Revision: D21620850

Pulled By: ngimel

fbshipit-source-id: 7ad1676a12f86250f301095ffc6f365a3b370f34
2020-05-18 13:41:05 -07:00
iotamudelta
4fe857187c switch to rocThrust for thrust/cub APIs (#25620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/25602

Enable rocThrust with hipCUB and rocPRIM for ROCm. They are the ROCm implementations of the thrust and cub APIs and replace the older hip-thrust and cub-hip packages going forward. ROCm 2.5 is the first release to contain the new packages as an option; as of 2.6 they will be the only available option.

Add hipification rules to correctly hipify thrust::cuda to thrust::hip and cub:: to hipcub:: going forward. Add hipification rules to hipify specific cub headers to the general hipcub header.
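A minimal sketch of what the new rules produce after hipify; the function itself is illustrative:

```
#include <hip/hip_runtime.h>
#include <hipcub/hipcub.hpp>  // specific cub headers map to this general header
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

// Illustrative only: thrust::cuda::par.on(stream) in the CUDA source is
// hipified to thrust::hip::par.on(stream), and cub:: calls become hipcub::.
void sort_async(float* data, int n, hipStream_t stream) {
  thrust::sort(thrust::hip::par.on(stream), data, data + n);
}
```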

Infrastructure work to correctly find, include and link against the new packages. Add the macro definition to choose the HIP backend to Thrust.

Since include chains are now a little different from CUDA's Thrust, add includes for functionality used where applicable.

Skip four tests that fail with the new rocThrust for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21864

Reviewed By: xw285cornell

Differential Revision: D16940768

Pulled By: bddppq

fbshipit-source-id: 3dba8a8f1763dd23d89eb0dd26d1db109973dbe5
2019-09-03 22:16:30 -07:00
Andrew Naguib
3cba9e8aaa Error Message Paraphrasing (#22369)
Summary:
Saying `I` in an error message is too subjective to be used in a framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22369

Differential Revision: D16067712

Pulled By: soumith

fbshipit-source-id: 2a390646bd5b15674c99f65e3c460a7272f508b6
2019-06-30 00:13:02 -07:00
Edward Yang
1a9602d5db Delete caffe2_cuda_full_device_control (#14283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14283

According to Yangqing, this code was only used by us to do some end-to-end
performance experiments on the impact of cudaSetDevice and cudaGetDevice.  Now
that the frameworks are merged, there are a lot of bare calls to those functions
which are not covered by this flag.  It doesn't seem like a priority to restore
this functionality, so I am going to delete it for now.  If you want to bring
it back, you'll have to make all get/set calls go through this particular
interface.

Reviewed By: dzhulgakov

Differential Revision: D13156472

fbshipit-source-id: 4c6d2cc89ab5ae13f7c816f43729b577e1bd985c
2018-11-29 18:33:22 -08:00
Junjie Bai
0d7a986da1 Change hip filename extension to .hip (#14036)
Summary:
xw285cornell

- To give hip files a unique filename extension, we change them from _hip.cc to .hip (it's the only blessed option other than .cu in hipcc 3d51a1fb01/bin/hipcc (L552)).
- Change to use the host compiler to compile .cc|.cpp files. Previously we used hcc to compile them, which was unnecessary.
- Change the hipify script to not replace "gpu" with "hip" in the filenames of the generated hipified files. Previously we did this because hcc had a bug when linking files that have the same filename. We now use the host linker for linking, so this is no longer necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14036

Reviewed By: xw285cornell

Differential Revision: D13091813

Pulled By: bddppq

fbshipit-source-id: ea3d887751d8abb39d75f5d5104aa66ce66b9ee0
2018-11-16 11:55:59 -08:00
Xiaodong Wang
e6b6cc06ee caffe2/core hipify (#13457)
Summary:
Small edits to caffe2/core hipify to make it compile in fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13457

Reviewed By: bddppq

Differential Revision: D12883472

Pulled By: xw285cornell

fbshipit-source-id: 1da231d721311d105892db13ed726240398ba49e
2018-11-01 15:49:56 -07:00
Junjie Bai
883da952be Hipify caffe2/core (#13148)
Summary:
petrex ashishfarmer iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13148

Reviewed By: xw285cornell

Differential Revision: D10862276

Pulled By: bddppq

fbshipit-source-id: 1754834ec50f7dd2f752780e20b2a9cf19d03fc4
2018-10-26 15:27:32 -07:00
Yangqing Jia
7d5f7ed270 Using c10 namespace across caffe2. (#12714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714

This is a short change to enable the c10 namespace in caffe2. We did not enable
it before due to gflags global variable confusion, but that has mostly been
cleaned up now. Right now, the plan on record is that namespace caffe2 and
namespace aten will be full supersets of namespace c10.

Most of the diff is codemod; the only two non-codemod changes are in caffe2/core/common.h, where

```
using namespace c10;
```

is added, and in Flags.h, where instead of creating aliasing variables in the c10 namespace, we directly put them in the global namespace to match gflags (and the same behavior when building without gflags).

Reviewed By: dzhulgakov

Differential Revision: D10390486

fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b
2018-10-17 12:57:19 -07:00
Sergei Nikolaev
1c7832c854 CUDA 10 warnings fixed (#12442)
Summary:
Fixes a deprecation warning against `cudaPointerAttributes`, whose `memoryType` field has been deprecated in favor of `type` (fix sketched below); see https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__UNIFIED.html#contents-end
for details.
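A minimal sketch of the fix; the helper function is illustrative:

```
#include <cuda_runtime.h>

// Read the new `type` field instead of the deprecated `memoryType`.
bool is_device_pointer(const void* ptr) {
  cudaPointerAttributes attr;
  if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
    return false;
  }
  return attr.type == cudaMemoryTypeDevice;  // previously: attr.memoryType
}
```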
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12442

Differential Revision: D10251239

Pulled By: zou3519

fbshipit-source-id: 500f1e02aa8e11c510475953ef5244d5fb13bf9e
2018-10-11 00:25:22 -07:00
Yangqing Jia
38f3d1fc40 move flags to c10 (#12144)
Summary:
still in flux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12144

Reviewed By: smessmer

Differential Revision: D10140176

Pulled By: Yangqing

fbshipit-source-id: 1a313abed022039333e3925d19f8b3ef2d95306c
2018-10-04 02:09:56 -07:00
Yangqing Jia
9c49bb9ddf Move registry fully to c10 (#12077)
Summary:
This does 6 things:

- add c10/util/Registry.h as the unified registry util (usage sketched below)
  - cleaned up some APIs such as export condition
- fully remove aten/core/registry.h
- fully remove caffe2/core/registry.h
- remove a bogus aten/registry.h
- unifying all macros
- set up registry testing in c10

Also, an important note: we used to mark the templated Registry class as EXPORT. This should not happen, because one should almost never export a template class. This PR fixes that.
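A minimal usage sketch, assuming the C10_* macro names; the Widget types are illustrative:

```
#include <c10/util/Registry.h>

// A registry maps string keys to factory functions for a base type.
class Widget {
 public:
  virtual ~Widget() = default;
};

C10_DECLARE_REGISTRY(WidgetRegistry, Widget);
C10_DEFINE_REGISTRY(WidgetRegistry, Widget);

class BlueWidget : public Widget {};
C10_REGISTER_CLASS(WidgetRegistry, BlueWidget, BlueWidget);

// Usage: auto w = WidgetRegistry()->Create("BlueWidget");
```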
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12077

Reviewed By: ezyang

Differential Revision: D10050771

Pulled By: Yangqing

fbshipit-source-id: 417b249b49fed6a67956e7c6b6d22374bcee24cf
2018-09-27 03:09:54 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Yangqing Jia
91d76f5dbd Reapply Windows fix
Summary:
The last fix was uncommitted due to a bug in the internal build (CAFFE2_API causing an error). This one re-applies it, along with a few more fixes, especially enabling gtest.

Earlier commit message: Basically, this should make windows {static_lib, shared_lib} * {static_runtime, shared_runtime} * {cpu, gpu} work, other than gpu shared_lib, for which willyd kindly pointed out a symbol limit problem. A few highlights:
(1) Updated to the newest protobuf.
(2) use protoc dllexport command to ensure proper symbol export for windows.
(3) various code updates to make sure that C2 symbols are properly shown
(4) cmake file changes to make build proper
(5) option to choose static runtime and shared runtime similar to protobuf
(6) revert to Visual Studio 2015 as current CUDA and MSVC 2017 do not play well together.
(7) enabled gtest and fixed testing bugs.

Earlier PR is #1793

Closes https://github.com/caffe2/caffe2/pull/1827

Differential Revision: D6832086

Pulled By: Yangqing

fbshipit-source-id: 85f86e9a992ee5c53c70b484b761c9d6aed721df
2018-01-29 10:03:28 -08:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Yangqing Jia
cf769a7b6f Avoid race condition in get device properties.
Summary: TSIA

Reviewed By: salexspb

Differential Revision: D5898125

fbshipit-source-id: 1822ef2a017719442045fa446321d007b9d544b8
2017-09-23 16:01:23 -07:00
Yangqing Jia
26f0943130 Do CaffeCudaSetDevice and CaffeCudaGetDevice
Summary:
These are wrapper functions so that if we run in a Caffe2-only mode, we can
turn the flag on and get some small speedup on cuda device switches.

The purpose of the diff is to allow us to quickly assess the overhead of cuda
device switch functions. Ideally, the caching behavior shall live in the cuda
driver, which is the only safe place to ensure correctness.

If other code is running alongside Caffe2 and does not properly do device guard,
this functionality will fail, as separate cudaSetDevice() calls will not update
Caffe2's thread local device id. As a result, the functionality is only enabled
when/if one explicitly sets the flag.

This might not be safe, so use with caution.

- cudaGetDevice can go from 90ns to 2ns
- when setting the same device, we can go from 100ns to 2 ns
- when setting a different device, things are the same (1ns overhead on top of 143ns)
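A minimal sketch of the caching these wrappers provide; beyond the wrapper names, the thread-local and the elided flag check are assumptions:

```
#include <cuda_runtime.h>

// Sketch only: cache the current device per thread so repeated get/set
// calls skip the CUDA runtime. The flag check described above is elided.
static thread_local int cached_device = -1;  // -1 means "unknown"

int CaffeCudaGetDevice() {
  if (cached_device >= 0) return cached_device;  // ~2ns instead of ~90ns
  int device = 0;
  cudaGetDevice(&device);
  cached_device = device;
  return device;
}

void CaffeCudaSetDevice(int device) {
  if (cached_device == device) return;  // same-device fast path, ~2ns
  cudaSetDevice(device);
  cached_device = device;
}
```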

Reviewed By: azzolini

Differential Revision: D5709398

fbshipit-source-id: 6255f17a3d41f59a30327436383f306a2287896e
2017-08-25 18:20:14 -07:00
Yangqing Jia
93e12e75df Allow caffe2 to detect if cuda lib has been linked, and also fix oss build error.
Summary: Closes https://github.com/caffe2/caffe2/pull/1114

Reviewed By: pietern

Differential Revision: D5686557

Pulled By: Yangqing

fbshipit-source-id: 6b7245ebbe4eeb025ce9d0fe8fda427a0c3d9770
2017-08-23 18:41:15 -07:00
Simon Layton
85788a0f65 Add TensorCore support
Summary:
Add support for TensorCore convolution and gemm on Volta hardware.
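The message doesn't say how; presumably via the CUDA 9 / cuDNN 7 math-mode switches for Volta. A sketch under that assumption, with an illustrative function name:

```
#include <cublas_v2.h>
#include <cudnn.h>

// Assumed mechanism: opt handles/descriptors into Tensor Core math.
void enable_tensor_cores(cublasHandle_t blas,
                         cudnnConvolutionDescriptor_t conv) {
  cublasSetMathMode(blas, CUBLAS_TENSOR_OP_MATH);           // gemm path
  cudnnSetConvolutionMathType(conv, CUDNN_TENSOR_OP_MATH);  // conv path
}
```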

Currently built on top of #1055
Closes https://github.com/caffe2/caffe2/pull/1056

Differential Revision: D5604068

Pulled By: Yangqing

fbshipit-source-id: 100f67e26ed5fabb1dbb31dcd77f7ecb84de4ee7
2017-08-10 20:16:48 -07:00
Jeff Johnson
3f860af050 Implement TopKOp for GPU
Summary:
This is a real implementation (not GPUFallbackOp) of the TopKOp for GPU.

There are two algorithm implementations (dispatch sketched below):

- for k <= 512, it maps to a warp-wide min-heap implementation, which requires only a single scan of the input data.
- for k > 512, it maps to a multi-pass radix selection algorithm that I originally wrote in cutorch. I took the recent cutorch code and removed some cutorch-specific things as it made sense.

Also added several utility files that one or the other implementations use, some from the Faiss library and some from the cutorch library.
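A minimal sketch of the k-based dispatch; the threshold symbol and kernel names are illustrative, and the kernel bodies are elided:

```
// Sketch only: single-pass warp min-heap for small k, multi-pass radix
// selection (ported from cutorch) for large k.
template <typename T>
void RunTopK(const T* input, int n, int k, T* out_vals, int* out_inds) {
  constexpr int kWarpHeapMaxK = 512;
  if (k <= kWarpHeapMaxK) {
    // launchWarpHeapTopK(input, n, k, out_vals, out_inds);
  } else {
    // launchRadixSelectTopK(input, n, k, out_vals, out_inds);
  }
}
```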

Reviewed By: jamesr66a

Differential Revision: D5248206

fbshipit-source-id: ae5fa3451473264293516c2838f1f40688781cf3
2017-06-17 08:47:38 -07:00
Yangqing Jia
81d5461973 cuda check -> enforce
Summary:
In the past we have moved most of the CHECKs to CAFFE_ENFORCE (exceptions).
However, we kept the name "*_CHECK" for cuda calls, and that caused some
confusion, especially in destructor calls: while our destructors are not
written to handle exceptions, these CUDA_CHECKs could actually throw some
exceptions.

As a result, this diff

(1) Renames all cuda related "*_CHECK" to "*_ENFORCE"
(2) Explicitly marked the destructor of core Caffe2 classes as noexcept
(3) Added proper, really-CHECK cuda check macros, and used those in the
corresponding destructors.

This should not change any of existing functionality.
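A minimal sketch of the distinction, with illustrative macro bodies (not Caffe2's exact definitions):

```
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>

// ENFORCE-style: throws an exception, for regular code paths.
#define CUDA_ENFORCE(expr)                                     \
  do {                                                         \
    cudaError_t err_ = (expr);                                 \
    if (err_ != cudaSuccess) {                                 \
      throw std::runtime_error(std::string("CUDA error: ") +   \
                               cudaGetErrorString(err_));      \
    }                                                          \
  } while (0)

// CHECK-style: aborts instead of throwing, safe in noexcept destructors.
#define CUDA_CHECK(expr)                                       \
  do {                                                         \
    cudaError_t err_ = (expr);                                 \
    if (err_ != cudaSuccess) {                                 \
      std::fprintf(stderr, "CUDA error: %s\n",                 \
                   cudaGetErrorString(err_));                  \
      std::abort();                                            \
    }                                                          \
  } while (0)
```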

Reviewed By: dzhulgakov

Differential Revision: D4656368

fbshipit-source-id: 32e3056e66c0400156c5ca0187b6151cf3d52404
2017-03-05 22:46:22 -08:00
Yangqing Jia
107966b059 add error message for asan
Summary:
This makes sure that we have a useful CUDA error message in asan mode. Also
made an fb-specific task pass by explicitly marking it not asan-able.

Reviewed By: dzhulgakov

Differential Revision: D4243471

fbshipit-source-id: 2ce303b97b3b4728c05575a8e7e21eb5960ecbc7
2016-11-29 15:18:39 -08:00
Yangqing Jia
589398950f fbsync at f5a877 2016-11-18 15:41:06 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
Yangqing Jia
6463eebc7b chunky sync - build scripts to be written 2016-07-21 10:16:42 -07:00
Yangqing Jia
559053d3a8 chunky sync 2016-05-13 14:43:48 -07:00
Yangqing Jia
137b880aac cuda initialization.
This makes it callable multiple times but the actual code only runs once.
TODO: make it thread safe. I am too lazy for now.
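A minimal sketch of the pattern described, with assumed names; as the TODO says, the static flag is not thread safe (std::call_once would fix that):

```
// Sketch only: repeated calls are no-ops after the first. Two threads
// racing on `initialized` can both run the body, hence the TODO.
static bool initialized = false;

void CudaInit() {
  if (initialized) {
    return;
  }
  initialized = true;
  // ... one-time CUDA setup (device queries, etc.) goes here ...
}
```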
2016-03-15 12:52:05 -07:00
Yangqing Jia
fa59b90c72 misc updates 2016-01-13 21:00:56 -08:00
Yangqing Jia
05eda208a5 Last commit for the day. With all the previous changes this should give an exact reference speed that TensorFlow with CuDNN3 should achieve in the end. 2016-01-05 09:55:21 -08:00
Yangqing Jia
648d1b101a A consolidation of a couple random weekend work.
(1) various bugfixes.
(2) Tensor is now a class independent from its data type. This allows us
    to write easier type-independent operators.
(3) code convention changes a bit: dtype -> T, Tensor<*Context> -> Tensor* alias.
(4) ParallelNet -> DAGNet to be more consistent with what it does.
(5) Caffe's own flags library instead of gflags.
(6) Caffe's own logging library instead of glog, but glog can be chosen with
    compile-time definition -DCAFFE2_USE_GOOGLE_GLOG. As a result, glog macros
    like CHECK, DCHECK now have prefix CAFFE_, and LOG(*) now becomes
    CAFFE_LOG_*.
(7) an optional protobuf inclusion, which can be chosen with USE_SYSTEM_PROTOBUF
    in build_env.py.
2015-10-11 23:14:06 -07:00
Yangqing Jia
5b9584c227 carpet bombing 2015-09-15 21:30:23 -07:00
Yangqing Jia
d72cfcebaf fixes to allow more consistent build tests 2015-09-06 22:34:22 +00:00
Yangqing Jia
d2ff13d332 put a peer access pattern function to caffe2. 2015-09-06 08:59:04 -07:00
Yangqing Jia
ec069cb3ea Use a global init function: it seems that with the multiple components optionally linked in, it is best to just enable a registering mechanism for inits. 2015-09-06 08:59:03 -07:00
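A minimal sketch of such a registering mechanism, with assumed names:

```
#include <functional>
#include <utility>
#include <vector>

// Components register callbacks at static-initialization time; the single
// global init function then runs them all.
std::vector<std::function<void()>>& InitRegistry() {
  static std::vector<std::function<void()>> registry;
  return registry;
}

struct InitRegisterer {
  explicit InitRegisterer(std::function<void()> fn) {
    InitRegistry().push_back(std::move(fn));
  }
};

void GlobalInit() {
  for (auto& fn : InitRegistry()) {
    fn();
  }
}

// In a component file: static InitRegisterer reg([] { /* setup */ });
```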
Yangqing Jia
a12a471b2d suppress compiler warning. 2015-08-28 14:02:53 -07:00
Yangqing Jia
2ed1077a83 A clean init for Caffe2, removing my earlier hacky
commits.
2015-06-25 16:26:01 -07:00