Summary:
Currently, C10_CUDA_CHECK only reports the source location of the check implementation itself in the raised CUDAException, like below:
```
Exception raised from c10_cuda_check_implementation at fbcode/caffe2/c10/cuda/CUDAException.cpp:44
```
which is not terribly useful.
Checking the original diff D39619861, which introduced c10_cuda_check_implementation, the original macro reported the caller's source location correctly; the refactor into c10_cuda_check_implementation broke it.
This diff propagates the caller's source location to c10_cuda_check_implementation to fix the issue.
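In sketch form (hypothetical helper name, simplified signature; the real code lives in c10/cuda/CUDAException.cpp), the fix forwards the location the macro already captures into the raised error instead of discarding it:
```
#include <c10/util/Exception.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <string>

// Sketch: raise the error with the *caller's* file/func/line, which
// the C10_CUDA_CHECK macro captured via __FILE__/__func__/__LINE__,
// so the "Exception raised from ..." frame points at the CUDA call
// site rather than at this implementation function.
inline void cuda_check_impl_sketch(
    cudaError_t err, const char* file, const char* func, uint32_t line) {
  if (err != cudaSuccess) {
    throw c10::Error(
        /*source_location=*/{func, file, line},
        std::string("CUDA error: ") + cudaGetErrorString(err));
  }
}
```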
Test Plan:
CI
Observed desired error message after the change:
```
CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Device-side assertion tracking was not enabled by user.
Exception raised from operator() at fbcode/sigrid/predictor/aed/AedContainer.cpp:659 (most recent call first):
```
Note the last line reports actual caller location.
Reviewed By: Raymo111
Differential Revision: D81880552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162808
Approved by: https://github.com/janeyx99
Summary:
This diff introduces a set of changes that makes it possible for the host to receive assertion failures from CUDA devices. This includes the introduction of:
**`CUDA_KERNEL_ASSERT2`**
A preprocessor macro to be used within a CUDA kernel that, upon an assertion failure, writes the assertion message, file, line number, and possibly other information to UVM (Managed memory). Once this is done, the original assertion is triggered, which places the GPU in a Bad State requiring recovery. In my tests, data written to UVM appears there before the GPU reaches the Bad State and is still accessible from the host after the GPU is in this state.
Messages are written to a multi-message buffer which can, in theory, hold many assertion failures. I've done this as a precaution in case there are several, but I don't actually know whether that is possible and a simpler design which holds only a single message may well be all that is necessary.
**`TORCH_DSA_KERNEL_ARGS`**
This preprocessor macro is added as an _argument_ to a kernel function's signature. It expands to supply the standardized names of all the arguments needed by `C10_CUDA_COMMUNICATING_KERNEL_ASSERTION` to handle device-side assertions. This includes, e.g., the name of the pointer to the UVM memory the assertion would be written to. This macro abstracts the arguments so there is a single point of change if the system needs to be modified.
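As a rough illustration (the kernel itself is hypothetical, and the header path is my reading of this PR), a kernel opts in by adding the macro to its signature and asserting with `CUDA_KERNEL_ASSERT2`:
```
#include <c10/cuda/CUDADeviceAssertion.h>

// Hypothetical kernel: TORCH_DSA_KERNEL_ARGS expands to the extra,
// standardized parameters (UVM assertion-buffer pointer, generation
// number, ...) that CUDA_KERNEL_ASSERT2 expects to find in scope.
__global__ void divide_kernel(
    float* out, const float* num, const float* den, int n,
    TORCH_DSA_KERNEL_ARGS) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) {
    return;
  }
  // On failure this writes the message, file, and line number to UVM
  // and then triggers the underlying assertion, trapping the GPU.
  CUDA_KERNEL_ASSERT2(den[i] != 0.0f);
  out[i] = num[i] / den[i];
}
```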
**`c10::cuda::get_global_cuda_kernel_launch_registry()`**
This host-side function returns a singleton object that manages the host's part of the device-side assertions. Upon allocation, the singleton allocates sufficient UVM (Managed) memory to hold information about several device-side assertion failures. The singleton also provides methods for getting the current traceback (used to identify when a kernel was launched). To avoid consuming all the host's memory, the singleton stores launches in a circular buffer; a unique "generation number" is used to ensure that kernel launch failures map to their actual launch points (in the case that the circular buffer wraps before the failure is detected).
**`TORCH_DSA_KERNEL_LAUNCH`**
This host-side preprocessor macro replaces the standard
```
kernel_name<<<blocks, threads, shmem, stream>>>(args)
```
invocation with
```
TORCH_DSA_KERNEL_LAUNCH(kernel_name, blocks, threads, shmem, stream, args);
```
Internally, it fetches the UVM (Managed) pointer and generation number from the singleton and appends these to the standard argument list. It also checks that the kernel launched correctly. This abstraction over kernel launches can be modified to provide additional safety/logging.
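Continuing the hypothetical kernel above (stream shown as a c10 stream wrapper, which is my assumption about what the macro expects):
```
#include <c10/cuda/CUDAStream.h>

// Launches the hypothetical divide_kernel above; previously this would
// have been a raw divide_kernel<<<blocks, threads, 0, stream>>>(...)
// followed by a hand-written launch check.
void launch_divide(float* out, const float* num, const float* den,
                   int n, c10::cuda::CUDAStream stream) {
  const dim3 threads(256);
  const dim3 blocks((n + 255) / 256);
  // The macro takes the kernel name first, fetches the UVM buffer
  // pointer and generation number from the singleton, appends them to
  // the argument list for TORCH_DSA_KERNEL_ARGS, and then verifies
  // that the launch succeeded.
  TORCH_DSA_KERNEL_LAUNCH(divide_kernel, blocks, threads, /*shmem=*/0,
                          stream, out, num, den, n);
}
```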
**`c10::cuda::c10_retrieve_device_side_assertion_info`**
This host-side function checks, when called, that no kernel assertions have occurred. If one has, it raises an exception with:
1. Information (file, line number) about which kernel was launched.
2. Information (file, line number, message) about the device-side assertion.
3. Information (file, line number) about where the failure was detected.
**Checking for device-side assertions**
Device-side assertions are most likely to be noticed by the host when a CUDA API call such as `cudaDeviceSynchronize` is made and fails with a `cudaError_t` indicating
> CUDA error: device-side assert triggered
Therefore, we rewrite `C10_CUDA_CHECK()` to include a call to `c10_retrieve_device_side_assertion_info()`. To make the code cleaner, most of the logic of `C10_CUDA_CHECK()` is now contained within a new function `c10_cuda_check_implementation()` to which `C10_CUDA_CHECK` passes the preprocessor information about filenames, function names, and line numbers. (In C++20 we can use `std::source_location` to eliminate macros entirely!)
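In sketch form (simplified; the real macro and signature live in c10/cuda/CUDAException.h):
```
// Only the cheap error capture stays inline; the preprocessor-captured
// location is handed to the out-of-line implementation, which formats
// the message, appends any device-side assertion report, and throws.
#define C10_CUDA_CHECK(EXPR)                      \
  do {                                            \
    const cudaError_t __err = EXPR;               \
    c10::cuda::c10_cuda_check_implementation(     \
        static_cast<int32_t>(__err),              \
        __FILE__, __func__, __LINE__,             \
        /*include_device_assertions=*/true);      \
  } while (0)
```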
# Notes on special cases
* Multiple assertions from the same block are recorded
* Multiple assertions from different blocks are recorded
* Launching kernels from many threads on many streams seems to be handled correctly
* If two processes are using the same GPU and one of them fails with a device-side assertion, the other process continues without issue
* X Multiple assertions from separate kernels on different streams seem to be recorded, but we can't reproduce the test condition
* X Multiple assertions from separate devices should all be shown upon exit, but we've been unable to generate a test that produces this condition
Differential Revision: D37621532
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84609
Approved by: https://github.com/ezyang, https://github.com/malfet
This marks CUDA exception checks as unlikely, which might have a positive performance impact.
It further isolates part of `C10_CUDA_CHECK` into a separate function and file so that the code can be made more expressive in subsequent diffs without bloating functions that use the check or creating readability issues.
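For illustration, a minimal sketch of the branch hint (`C10_UNLIKELY` is the existing `__builtin_expect` wrapper from c10/macros/Macros.h; the checking function here is hypothetical):
```
#include <cuda_runtime.h>
#include <stdexcept>

#define C10_UNLIKELY(expr) (__builtin_expect(static_cast<bool>(expr), 0))

void check_sketch(cudaError_t err) {
  // The compiler lays the throwing branch out of the hot path,
  // keeping the common success case fall-through and compact.
  if (C10_UNLIKELY(err != cudaSuccess)) {
    throw std::runtime_error("CUDA error");
  }
}
```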
Test Plan: Sandcastle
Differential Revision: D39619861
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85256
Approved by: https://github.com/ezyang, https://github.com/ngimel
Summary:
Introduces additional ways of handling CUDA errors that allow automated linters to detect if errors are being handled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74865
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D35194530
fbshipit-source-id: f4fe61594edbfd81e97a4b605935961b893df167
(cherry picked from commit 919ddf677c5b9b46c5e493ed64346a5f2527bf08)
Summary:
This is a first step towards creating a context manager that errors out on synchronizing calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889
Reviewed By: albanD
Differential Revision: D29805280
Pulled By: ngimel
fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
Summary:
After the change, async error warnings look as follows:
```
$ python -c "import torch;torch.eye(3,3,device='cuda:777')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59467
Reviewed By: ngimel
Differential Revision: D28904360
Pulled By: malfet
fbshipit-source-id: 2a8fa5affed5b4ffcaa602c8ab2669061cde7db0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57609
Throw c10::CudaError for CUDA exceptions, for better classification of errors.
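A hedged sketch of what the classification enables at call sites (exception type spelled as in this summary; the exact header and hierarchy are not shown here):
```
#include <torch/torch.h>

void run_with_cuda_error_handling() {
  try {
    auto t = torch::eye(3, torch::TensorOptions().device(torch::kCUDA));
  } catch (const c10::CudaError& e) {
    // CUDA-specific handling: log device health, retry on CPU, ...
  } catch (const c10::Error& e) {
    // any other c10 failure
  }
}
```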
Test Plan: Test locally by running some workflows
Reviewed By: dzhulgakov
Differential Revision: D28209356
fbshipit-source-id: 19a5fc8548433238dc224ea81a5f63a945fc5cc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830
Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.
Test Plan: CI
Reviewed By: zertosh
Differential Revision: D27979080
fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48615
Rename the macro from `TORCH_CUDA_KERNEL_LAUNCH_CHECK` to `C10_CUDA_KERNEL_LAUNCH_CHECK`, since it is now accessible through c10, not just torch.
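Usage is unchanged apart from the prefix; the macro goes immediately after a raw kernel launch (the kernel here is hypothetical):
```
#include <c10/cuda/CUDAException.h>

__global__ void fill_kernel(float* out, float value, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = value;
  }
}

void fill(float* out, float value, int n, cudaStream_t stream) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  fill_kernel<<<blocks, threads, 0, stream>>>(out, value, n);
  // Raises immediately if the launch itself failed (bad configuration,
  // sticky prior error, ...) instead of at some later CUDA call.
  C10_CUDA_KERNEL_LAUNCH_CHECK();
}
```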
Test Plan:
```
buck build //caffe2/caffe2:caffe2_cu
buck build //caffe2/aten:ATen-cu
buck test //caffe2/test:kernel_launch_checks -- --print-passing-details
```
Reviewed By: jianyuh
Differential Revision: D25228727
fbshipit-source-id: 9c65feb3d0ea3fbd31f1dcaecdb88ef0534f9121
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48277
We move `TORCH_CUDA_KERNEL_LAUNCH_CHECK` from `//caffe2/aten/src/ATen/cuda/Exceptions.h` to `//caffe2/c10/cuda/CUDAException.h`.
The primary reason is to allow us to use this macro in other subdirectories of //caffe2, not just in ATen. Refer to D24309971 (353e7f940f) for context.
An example of this use case is D24868557, where we add these checks to `//caffe2/caffe2/sgd`.
Also, this should not affect current files, because `Exceptions.h` includes `CUDAException.h`.
Test Plan:
```
buck build //caffe2/aten:ATen-cu
```
- https://fburl.com/buck/oq3rxbir
Also wait for sandcastle tests.
Reviewed By: ngimel
Differential Revision: D25101720
fbshipit-source-id: e2b05b39ff1413a21e64949e26ca24c8f7d0400f
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` to check whether a primary context exists on a device, without creating one. It is not too slow; a benchmark of a single call (CUDA 10.1, Titan Xp, driver 415.27) gives:
```
------------------------------------------------------------------
Benchmark                         Time           CPU    Iterations
------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState   301 ns        301 ns       2319746
```
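A minimal sketch of the check using the standard CUDA driver API (the wrapper function name is illustrative):
```
#include <cuda.h>

// Returns true iff `dev` already has an active primary context,
// without creating one the way a runtime-API call would.
bool has_primary_context(CUdevice dev) {
  unsigned int flags = 0;
  int active = 0;
  const CUresult status = cuDevicePrimaryCtxGetState(dev, &flags, &active);
  return status == CUDA_SUCCESS && active != 0;
}
```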
Commits:
1. Add `CUDAHooks::getDeviceWithPrimaryContext`, which returns a device index with a primary context (if one exists).
Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check primary context in `pin_memory`.
Fix `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
Add test for this in that file.
Fixes https://github.com/pytorch/pytorch/issues/21081.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22229
Differential Revision: D16170194
Pulled By: zou3519
fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d
Summary:
Having the NVRTC stub in ATen is necessary to call driver APIs in ATen. This is currently blocking https://github.com/pytorch/pytorch/pull/22229.
`DynamicLibrary` is also moved as it is used in the stub code, and seems general enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22362
Differential Revision: D16131787
Pulled By: ezyang
fbshipit-source-id: add2ee8a8865229578aa00001a00d5a6671e0e73