Commit Graph

24 Commits

Author SHA1 Message Date
albanD
09cbf34e93 [BE] Preserve caller source location in the error message (#162808)
Summary:
Currently, C10_CUDA_CHECK only reports the source location of c10_cuda_check_implementation itself in the CUDAException, like below:
```
Exception raised from c10_cuda_check_implementation at fbcode/caffe2/c10/cuda/CUDAException.cpp:44
```
which is not terribly useful.

Checking the original diff D39619861, which introduced c10_cuda_check_implementation, shows that the original macro reported the caller's source location correctly, but c10_cuda_check_implementation broke it.

This diff propagates the caller's source location into c10_cuda_check_implementation to fix the issue.
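
For illustration, a minimal sketch of the idea (signatures and the c10::Error constructor usage are assumptions, not the exact diff): the macro already captures `__FILE__`/`__func__`/`__LINE__` at the call site, so the implementation can build the exception from that forwarded location instead of its own location in CUDAException.cpp.
```
#include <cuda_runtime.h>

#include <c10/util/Exception.h>

#include <cstdint>
#include <string>

// Sketch only: construct the error with the *caller's* location so the
// "Exception raised from ..." suffix points at the real call site.
void c10_cuda_check_implementation(
    int32_t err,
    const char* filename,       // caller's file
    const char* function_name,  // caller's function
    uint32_t line_number) {     // caller's line
  if (err == cudaSuccess) {
    return;
  }
  throw c10::Error(
      {function_name, filename, line_number}, // forwarded SourceLocation
      std::string("CUDA error: ") +
          cudaGetErrorString(static_cast<cudaError_t>(err)));
}
```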

Test Plan:
CI

Observed desired error message after the change:
```
CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Device-side assertion tracking was not enabled by user.
Exception raised from operator() at fbcode/sigrid/predictor/aed/AedContainer.cpp:659 (most recent call first):
```

Note that the last line now reports the actual caller location.

Rollback Plan:

Reviewed By: Raymo111

Differential Revision: D81880552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162808
Approved by: https://github.com/janeyx99
2025-09-15 13:29:43 +00:00
Richard Barnes
fddabc6e0b C10_UNUSED to [[maybe_unused]] (#6357) (#138364)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-10-19 13:17:43 +00:00
Richard Barnes
8dd575faf6 [BE] Modernize C10_UNUSED (#138102)
[`[[maybe_unused]]`](https://en.cppreference.com/w/cpp/language/attributes/maybe_unused) is part of the C++17 standard
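
A minimal before/after sketch of the modernization (the helper here is hypothetical, purely for illustration):
```
// Hypothetical helper, just to give the variable something to bind to.
static int compute_flags() {
  return 0;
}

void example() {
  // Before: project-specific macro
  //   C10_UNUSED int flags = compute_flags();
  // After: the standard C++17 attribute, no custom macro needed
  [[maybe_unused]] int flags = compute_flags();
}
```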

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138102
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet, https://github.com/eqy
2024-10-18 16:33:01 +00:00
Nikita Shulga
53e32d12c4 [c10] Use nested namespace in c10/cuda (#116464)
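For reference, the C++17 nested-namespace form this switches to (the declaration shown is a placeholder):
```
// Before (C++14 style):
//   namespace c10 { namespace cuda {
//   void placeholder();
//   }} // namespace c10::cuda

// After (C++17 nested namespace definition):
namespace c10::cuda {
void placeholder();
} // namespace c10::cuda
```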
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116464
Approved by: https://github.com/Skylion007
2023-12-27 23:14:00 +00:00
Richard Barnes
eadbf762fc Fix CUDA error not getting captured by handler (#92227)
Fixes #91758. Still leaves functions on the hotpath.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92227
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-01-17 00:16:29 +00:00
Richard Barnes
ad188a227e Introduce CUDA Device Assertions Infrastructure (#84609)
Summary:
This diff introduces a set of changes that makes it possible for the host to get assertions from CUDA devices. This includes the introduction of

**`CUDA_KERNEL_ASSERT2`**

A preprocessor macro to be used within a CUDA kernel that, upon an assertion failure, writes the assertion message, file, line number, and possibly other information to UVM (Managed memory). Once this is done, the original assertion is triggered, which places the GPU in a Bad State requiring recovery. In my tests, data written to UVM appears there before the GPU reaches the Bad State and is still accessible from the host after the GPU is in this state.

Messages are written to a multi-message buffer which can, in theory, hold many assertion failures. I've done this as a precaution in case there are several, but I don't actually know whether that is possible and a simpler design which holds only a single message may well be all that is necessary.

**`TORCH_DSA_KERNEL_ARGS`**

This preprocessor macro is added as an _argument_ to a kernel function's signature. It expands to supply the standardized names of all the arguments needed by `C10_CUDA_COMMUNICATING_KERNEL_ASSERTION` to handle device-side assertions. This includes, e.g., the name of the pointer to the UVM memory the assertion would be written to. This macro abstracts the arguments so there is a single point of change if the system needs to be modified.

**`c10::cuda::get_global_cuda_kernel_launch_registry()`**

This host-side function returns a singleton object that manages the host's part of the device-side assertions. Upon allocation, the singleton allocates sufficient UVM (Managed) memory to hold information about several device-side assertion failures. The singleton also provides methods for getting the current traceback (used to identify when a kernel was launched). To avoid consuming all the host's memory, the singleton stores launches in a circular buffer; a unique "generation number" is used to ensure that kernel launch failures map to their actual launch points (in the case that the circular buffer wraps before the failure is detected).

**`TORCH_DSA_KERNEL_LAUNCH`**

This host-side preprocessor macro replaces the standard
```
kernel_name<<<blocks, threads, shmem, stream>>>(args)
```
invocation with
```
TORCH_DSA_KERNEL_LAUNCH(blocks, threads, shmem, stream, args);
```
Internally, it fetches the UVM (Managed) pointer and generation number from the singleton and appends these to the standard argument list. It also checks to ensure the kernel launches correctly. This abstraction on kernel launches can be modified to provide additional safety/logging.
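
Putting these macros together, a hedged usage sketch (the toy kernel, launch configuration, header paths, and the exact `TORCH_DSA_KERNEL_LAUNCH` argument order — including passing the kernel name explicitly — are assumptions for illustration):
```
#include <c10/cuda/CUDADeviceAssertion.h>      // CUDA_KERNEL_ASSERT2 (assumed header)
#include <c10/cuda/CUDADeviceAssertionHost.h>  // launch registry (assumed header)

#include <cstdint>

// Toy kernel: TORCH_DSA_KERNEL_ARGS appends the standardized DSA arguments
// (UVM buffer pointer, generation number, ...) to the signature.
__global__ void check_positive_kernel(
    const float* data,
    int64_t n,
    TORCH_DSA_KERNEL_ARGS) {
  const int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // On failure, writes message/file/line to UVM, then triggers the assert.
    CUDA_KERNEL_ASSERT2(data[i] > 0.0f);
  }
}

void check_positive(const float* data, int64_t n, cudaStream_t stream) {
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);
  // Replaces kernel<<<blocks, threads, shmem, stream>>>(args); also verifies
  // that the launch itself succeeded and records it in the launch registry.
  TORCH_DSA_KERNEL_LAUNCH(
      check_positive_kernel, blocks, threads, /*shmem=*/0, stream, data, n);
}
```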

**`c10::cuda::c10_retrieve_device_side_assertion_info`**

This host-side function checks, when called, that no kernel assertions have occurred. If one has, it raises an exception with:
1. Information (file, line number) about what kernel was launched.
2. Information (file, line number, message) about the device-side assertion.
3. Information (file, line number) about where the failure was detected.

**Checking for device-side assertions**

Device-side assertions are most likely to be noticed by the host when a CUDA API call such as `cudaDeviceSynchronize` is made and fails with a `cudaError_t` indicating
> CUDA error: device-side assert triggered CUDA kernel errors

Therefore, we rewrite `C10_CUDA_CHECK()` to include a call to `c10_retrieve_device_side_assertion_info()`. To make the code cleaner, most of the logic of `C10_CUDA_CHECK()` is now contained within a new function `c10_cuda_check_implementation()` to which `C10_CUDA_CHECK` passes the preprocessor information about filenames, function names, and line numbers. (In C++20 we can use `std::source_location` to eliminate macros entirely!)
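
A condensed sketch of that failure path (function names as introduced above; the argument-free call and the header path are assumptions, and error handling is simplified):
```
#include <cuda_runtime.h>

#include <c10/cuda/CUDADeviceAssertionHost.h>  // assumed header
#include <c10/util/Exception.h>

// Sketch of the failure path only: surface any recorded device-side
// assertion first, then report the CUDA error itself.
void check_cuda_result(cudaError_t err) {
  if (err == cudaSuccess) {
    return;
  }
  // Raises with kernel-launch site, assertion message, and detection site
  // if an assertion was written to the UVM buffer.
  c10::cuda::c10_retrieve_device_side_assertion_info();
  TORCH_CHECK(false, "CUDA error: ", cudaGetErrorString(err));
}
```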

# Notes on special cases

* Multiple assertions from the same block are recorded
* Multiple assertions from different blocks are recorded
* Launching kernels from many threads on many streams seems to be handled correctly
* If two processes are using the same GPU and one of the processes fails with a device-side assertion, the other process continues without issue
* X Multiple assertions from separate kernels on different streams seem to be recorded, but we can't reproduce the test condition
* X Multiple assertions from separate devices should all be shown upon exit, but we've been unable to generate a test that produces this condition

Differential Revision: D37621532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84609
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-08 01:26:07 +00:00
Richard Barnes
e59d307e2f Improve perf by avoiding implicit string creation in c10_cuda_check_implementation (#88350)
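One plausible shape of such a change (an assumption based on the title, not the exact diff): take `const char*` parameters so that callers passing string literals do not construct `std::string` temporaries on the success path.
```
#include <cstdint>

// Before (hypothetical): every call converts literals to std::string,
// even when the check succeeds.
//   void c10_cuda_check_implementation(
//       int32_t err, const std::string& file, const std::string& func, int line);

// After: nothing is allocated unless an error actually occurs.
void c10_cuda_check_implementation(
    int32_t err, const char* file, const char* func, int line);
```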
Test Plan: Sandcastle

Differential Revision: D40949947

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88350
Approved by: https://github.com/Skylion007, https://github.com/soumith
2022-11-03 02:48:41 +00:00
Richard Barnes
f0869cc8d0 Make CUDA exceptions unlikely and isolate C10_CUDA_CHECK body (#85256)
This marks CUDA exception checks as unlikely, which might have a positive performance impact.

It further isolates part of `C10_CUDA_CHECK` into a separate function and file so that the code can be made more expressive in subsequent diffs without bloating functions that use the check or creating readability issues.
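
A hedged sketch of the pattern (named `_SKETCH` to make clear it is not the real macro body; `C10_UNLIKELY` wraps a `__builtin_expect`-style branch hint):
```
#include <cuda_runtime.h>

#include <c10/macros/Macros.h>

#include <cstdint>

namespace c10::cuda {
// The cold path lives out of line so callers of the macro stay small.
void c10_cuda_check_implementation(
    int32_t err, const char* file, const char* func, uint32_t line);
} // namespace c10::cuda

#define C10_CUDA_CHECK_SKETCH(EXPR)                 \
  do {                                              \
    const cudaError_t __err = (EXPR);               \
    if (C10_UNLIKELY(__err != cudaSuccess)) {       \
      c10::cuda::c10_cuda_check_implementation(     \
          static_cast<int32_t>(__err),              \
          __FILE__,                                 \
          __func__,                                 \
          static_cast<uint32_t>(__LINE__));         \
    }                                               \
  } while (0)
```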

Test Plan: Sandcastle

Differential Revision: D39619861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85256
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-09-22 23:15:10 +00:00
Richard Barnes
1249d490de Add additional CUDA error handling macros (#74865)
Summary:
Introduces additional ways of handling CUDA errors that allow automated linters to detect if errors are being handled.
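
The general pattern is to force error handling through a named macro so a linter can flag raw CUDA calls; a hypothetical sketch (the macro name here is illustrative, not necessarily one added by this PR):
```
#include <cuda_runtime.h>

// Illustrative only: a wrapper that makes "this error was deliberately
// ignored" visible to tooling, instead of silently discarding the result.
#define EXAMPLE_CUDA_IGNORE_ERROR(EXPR)                      \
  do {                                                       \
    const cudaError_t __err = (EXPR);                        \
    if (__err != cudaSuccess) {                              \
      (void)cudaGetLastError(); /* clear the sticky error */ \
    }                                                        \
  } while (0)

// A linter can then require every cudaFoo(...) call to appear inside either
// C10_CUDA_CHECK(...) or an explicit ignore/handled wrapper like the above.
```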

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74865

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D35194530

fbshipit-source-id: f4fe61594edbfd81e97a4b605935961b893df167
(cherry picked from commit 919ddf677c5b9b46c5e493ed64346a5f2527bf08)
2022-03-29 18:03:03 +00:00
Richard Barnes
11aa1961c1 Use (void)error_unused to avoid unused warning (#71000)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71000

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D33470600

fbshipit-source-id: 868a6ee33a04846bd1efbe06ab306fbaad3bf9db
2022-01-07 23:39:30 -08:00
Natalia Gimelshein
6284d2a82b wrap cudaStreamSynchronize calls (#61889)
Summary:
This is a first step towards creating a context manager that errors out on synchronizing calls.
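
A plausible shape for such a wrapper (a sketch under the assumption of a process-wide "sync debug mode" flag; the enum, global, and function names are illustrative):
```
#include <cuda_runtime.h>

#include <c10/cuda/CUDAException.h>
#include <c10/util/Exception.h>

// Illustrative: funnel all stream syncs through one place so a debug mode
// can warn or error whenever synchronization happens.
enum class SyncDebugMode { Disabled, Warn, Error };
static SyncDebugMode g_sync_debug_mode = SyncDebugMode::Disabled;  // hypothetical global

inline void stream_synchronize(cudaStream_t stream) {
  if (g_sync_debug_mode == SyncDebugMode::Warn) {
    TORCH_WARN("called a synchronizing CUDA operation");
  } else if (g_sync_debug_mode == SyncDebugMode::Error) {
    TORCH_CHECK(false, "called a synchronizing CUDA operation");
  }
  C10_CUDA_CHECK(cudaStreamSynchronize(stream));
}
```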

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889

Reviewed By: albanD

Differential Revision: D29805280

Pulled By: ngimel

fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
2021-07-21 19:30:52 -07:00
Nikita Shulga
d125694d0b Move CUDA async warning to suffix (#59467)
Summary:
After the change async error warnings look as follows:
```
$ python -c "import torch;torch.eye(3,3,device='cuda:777')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59467

Reviewed By: ngimel

Differential Revision: D28904360

Pulled By: malfet

fbshipit-source-id: 2a8fa5affed5b4ffcaa602c8ab2669061cde7db0
2021-06-04 17:26:28 -07:00
Atul Jangra
3948ce2fd9 [Caffe2] Introduce c10::CudaError for CUDA Exceptions (#57609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57609

Throw c10::CudaError for CUDA exceptions, to allow better classification of errors.
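
A minimal sketch of what such a classification enables (assuming the error type derives from c10::Error; the exact declaration lives in the c10 headers):
```
#include <c10/util/Exception.h>

// Sketch only: a dedicated subclass lets callers distinguish CUDA failures
// from other c10::Error exceptions.
class CudaError : public c10::Error {
  using Error::Error;  // inherit c10::Error's constructors
};

// Callers can then classify:
//   try { some_cuda_op(); }
//   catch (const CudaError& e)  { /* CUDA-specific handling */ }
//   catch (const c10::Error& e) { /* everything else */ }
```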

Test Plan: Test locally by running some workflows

Reviewed By: dzhulgakov

Differential Revision: D28209356

fbshipit-source-id: 19a5fc8548433238dc224ea81a5f63a945fc5cc3
2021-05-06 14:28:45 -07:00
Scott Wolchok
44cc873fba [PyTorch] Autoformat c10 (#56830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830

Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D27979080

fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
2021-04-30 21:23:28 -07:00
Kurtis David
b824fc4de2 [pytorch] [PR] Rename cuda kernel checks to C10 (#48615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48615

Rename the macro from `TORCH_CUDA_KERNEL_LAUNCH_CHECK` to `C10_CUDA_KERNEL_LAUNCH_CHECK`, since it is now accessible through c10, not just torch.
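
For reference, the typical usage pattern: call the check right after a kernel launch (the toy kernel and launch configuration below are made up; the macro amounts to checking `cudaGetLastError()`):
```
#include <c10/cuda/CUDAException.h>  // C10_CUDA_KERNEL_LAUNCH_CHECK

#include <cstdint>

__global__ void fill_ones(float* out, int64_t n) {  // toy kernel for illustration
  const int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = 1.0f;
  }
}

void launch_fill_ones(float* out, int64_t n, cudaStream_t stream) {
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);
  fill_ones<<<blocks, threads, 0, stream>>>(out, n);
  // Catches invalid launch configurations and other launch-time errors.
  C10_CUDA_KERNEL_LAUNCH_CHECK();
}
```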

Test Plan:
```
buck build //caffe2/caffe2:caffe2_cu
buck build //caffe2/aten:ATen-cu
buck test //caffe2/test:kernel_launch_checks -- --print-passing-details
```

Reviewed By: jianyuh

Differential Revision: D25228727

fbshipit-source-id: 9c65feb3d0ea3fbd31f1dcaecdb88ef0534f9121
2020-12-01 18:19:07 -08:00
Kurtis David
7be30d1883 Move CUDA kernel check to c10 (#48277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48277

We move `TORCH_CUDA_KERNEL_LAUNCH_CHECK` from `//caffe2/aten/src/ATen/cuda/Exceptions.h` to `//caffe2/c10/cuda/CUDAException.h`.

The primary reason is for allowing us to use this MACRO in other subdirectories of //caffe2, not just in ATen. Refer to D24309971 (353e7f940f) for context.

An example of this use case is D24868557, where we add these checks to `//caffe2/caffe2/sgd`.

Also, this should not affect current files, because `Exceptions.h` includes `CUDAException.h`.

Test Plan:
```
buck build //caffe2/aten:ATen-cu
```
- https://fburl.com/buck/oq3rxbir

Also wait for sandcastle tests.

Reviewed By: ngimel

Differential Revision: D25101720

fbshipit-source-id: e2b05b39ff1413a21e64949e26ca24c8f7d0400f
2020-11-20 14:58:15 -08:00
Mike Ruberry
a024e1e091 Creates Torch-friendly Event class and adds Stream tracking to autograd (#25130)
Summary:
Resubmission of https://github.com/pytorch/pytorch/issues/23424 because previous PR was borked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25130

Test Plan: Two tests were added to cuda_stream_test for this functionality.

Differential Revision: D17145538

Pulled By: mruberry

fbshipit-source-id: 2546c5907c038412e03aa0d3328a972b0164c455
2019-09-01 12:37:52 -07:00
SsnL
8482efb203 pin_memory malloc now uses existing context if available. (#22229)
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` as a way to check whether a primary context exists on a device. It is not too slow, as shown by this benchmark of a single call on CUDA 10.1 (Titan Xp, driver 415.27):
```
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState        301 ns        301 ns    2319746
```
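
For context, a hedged sketch of the check via the driver API (the helper name is illustrative and error handling is simplified):
```
#include <cuda.h>  // CUDA driver API

// Returns true if `dev` already has an active primary context, without
// creating one as a side effect (as most runtime-API calls would).
bool has_primary_context(CUdevice dev) {
  unsigned int flags = 0;
  int active = 0;
  const CUresult res = cuDevicePrimaryCtxGetState(dev, &flags, &active);
  return res == CUDA_SUCCESS && active != 0;
}
```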

Commits:

1. Add `CUDAHooks::getDeviceWithPrimaryContext` which returns a device index with primary context (if exists).
    Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check primary context in `pin_memory`.
    Fix `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
    Add test for this in that file.

Fixes https://github.com/pytorch/pytorch/issues/21081.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22229

Differential Revision: D16170194

Pulled By: zou3519

fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d
2019-07-16 10:18:30 -07:00
Edward Yang
535c5540bc Back out "Back out "[pytorch][PR] Move thnvrtc and DynamicLibrary to ATen"" (#22794)
Summary:
Original commit changeset: 227df3b85316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/22794
ghstack-source-id: 86400904

Differential Revision: D16222777

fbshipit-source-id: 0b198ac59e640df0b8204b4ed30f8e822c15fd9a
2019-07-15 06:28:56 -07:00
Edward Yang
ac78a86e1d Back out "[pytorch][PR] Move thnvrtc and DynamicLibrary to ATen" (#22749)
Summary:
Original commit changeset: add2ee8a8865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/22749
ghstack-source-id: 86323899

Differential Revision: D16203552

fbshipit-source-id: 227df3b85316315c15d2cb7b6a5c884096a82e9e
2019-07-11 12:21:21 -07:00
SsnL
31d821e267 Move thnvrtc and DynamicLibrary to ATen (#22362)
Summary:
Having the NVRTC stub in ATen is necessary to call driver APIs in ATen. This is currently blocking https://github.com/pytorch/pytorch/pull/22229.

`DynamicLibrary` is also moved as it is used in the stub code, and seems general enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22362

Differential Revision: D16131787

Pulled By: ezyang

fbshipit-source-id: add2ee8a8865229578aa00001a00d5a6671e0e73
2019-07-09 07:28:27 -07:00
Junjie Bai
212024282b Mark cudaGetLastError return value unused in C10_CUDA_CHECK
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17605
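Roughly this pattern (a sketch; the actual macro body may differ):
```
#include <cuda_runtime.h>

void clear_cuda_error() {
  // Discard the result deliberately so compilers don't flag it as unused;
  // cudaGetLastError() also clears CUDA's sticky error state.
  (void)cudaGetLastError();
}
```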

Reviewed By: xw285cornell

Differential Revision: D14277586

Pulled By: bddppq

fbshipit-source-id: 38879208f2ab83cf39d8a8a61b288cd09fcafd9a
2019-03-01 00:05:46 -08:00
Soumith Chintala
3a47d56946 Fix static linkage cases and NO_DISTRIBUTED=1 + CUDA (#16705) (#17337)
Summary:
Attempt #2 (attempt 1 is https://github.com/pytorch/pytorch/pull/16705 and got reverted because of CI failures)

Fixes https://github.com/pytorch/pytorch/issues/14805
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17337

Differential Revision: D14175626

Pulled By: soumith

fbshipit-source-id: 66f2e10e219a1bf88ed342ec5c89da6f2994d8eb
2019-02-21 16:12:02 -08:00
Edward Yang
48099c23b4 Move AT_CUDA_CHECK to c10
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13910

Reviewed By: smessmer

Differential Revision: D13046201

fbshipit-source-id: 8d360a0e4d6c2edf070d130e600c6b04f0ee0058
2018-11-19 08:20:10 -08:00