pytorch/caffe2/core
hongxyan 66a76516bf [ROCm] Disabling Kernel Asserts for ROCm by default - fix and clean up and refactoring (#114660)
Related to #103973  #110532 #108404 #94891

**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198),
kernel asserts are enabled by default for CUDA and disabled for ROCm.
However, this logic was somewhat broken, and kernel asserts were still enabled for ROCm.

Disabling kernel asserts is also needed for users who do not have PCIe atomics support. These community users have verified that disabling kernel asserts in PyTorch on the ROCm platform fixed their PyTorch workflows, such as torch.sum scripts and stable-diffusion (see the related issues).

**Changes:**

This pull request serves the following purposes:
* Refactor and clean up the logic, making it simpler to enable and disable kernel asserts for ROCm.
* Fix the bug that kernel asserts for ROCm were not disabled by default.

Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name aligns better with the `#define CUDA_KERNEL_ASSERT` macro.
(3) With `USE_` in front of the name, we can easily control this feature with an environment variable at build time (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` enables kernel asserts for a ROCm build).
- Got rid of `ROCM_FORCE_ENABLE_GPU_ASSERTS` to simplify the logic and make it easier to understand and maintain.
- Added `#cmakedefine` to carry the CMake variable over to C++, as sketched below.
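
For illustration, below is a minimal sketch of how a `#cmakedefine` in a configured header can gate the kernel-assert macro. The header layout, the `USE_ROCM` guard, and the macro body are simplified assumptions for this sketch, not necessarily the exact code in this PR:

```
// macros.h.in (template processed by CMake's configure_file()):
// when USE_ROCM_KERNEL_ASSERT is ON, the next line becomes
// "#define C10_USE_ROCM_KERNEL_ASSERT"; when OFF, it is commented out.
#cmakedefine C10_USE_ROCM_KERNEL_ASSERT

// In a consuming C++ header (simplified, illustrative only):
#if defined(USE_ROCM) && !defined(C10_USE_ROCM_KERNEL_ASSERT)
// ROCm build with kernel asserts disabled: the macro expands to nothing,
// so device-side assertions are compiled out entirely.
#define CUDA_KERNEL_ASSERT(cond)
#else
// Kernel asserts enabled: a failed condition aborts the kernel.
#define CUDA_KERNEL_ASSERT(cond)                        \
  if (!(cond)) {                                        \
    __assert_fail(#cond, __FILE__, __LINE__, __func__); \
  }
#endif
```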

**Tests:**
(1) Build in the default mode and verify that `USE_ROCM_KERNEL_ASSERT` is OFF (0) and kernel asserts are disabled:

```
python setup.py develop
```
Verify that CMakeCache.txt has the correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in both a ROCm build and a CUDA build, expecting different return codes.

```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from the unit test below to work around the fact that this unit test is currently skipped for ROCm. (We will look into enabling this unit test in the future.)

```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```
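
For reference, here is a rough sketch of the kind of device-side check this test exercises; the kernel name and signature are illustrative and not the actual PyTorch kernel:

```
// Illustrative CUDA/HIP kernel: when kernel asserts are compiled in, a zero
// input triggers a device-side assertion failure and the process aborts
// (nonzero return code). When CUDA_KERNEL_ASSERT expands to nothing, the
// check is compiled out and the process exits normally (return code 0).
template <typename scalar_t>
__global__ void assert_nonzero_kernel(const scalar_t* input) {
  CUDA_KERNEL_ASSERT(input[0] != 0);
}
```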

Ran the following script, expecting `r == 0` since `CUDA_KERNEL_ASSERT` is defined as nothing:
```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```

(2) Enable kernel asserts by building with `USE_ROCM_KERNEL_ASSERT=1` or `USE_ROCM_KERNEL_ASSERT=ON`:
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```

Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```

Run the assert test, expecting a nonzero return code (here -6, i.e. the subprocess was terminated by SIGABRT).

```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp            :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

>>> r
-6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
2023-12-13 15:44:53 +00:00
hip Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032) 2022-07-26 01:20:44 +00:00
nomnigraph [BE] Enforce missing override keyword (#104032) 2023-06-24 02:34:24 +00:00
__init__.py
allocator.cc
allocator.h
blob_gpu_test.cc
blob_serialization_gpu.cc
blob_serialization.cc [caffe2] Don't copy Tensor dims during deserialization (#79471) 2022-07-12 21:36:26 +00:00
blob_serialization.h [caffe2] Don't copy Tensor dims during deserialization (#79471) 2022-07-12 21:36:26 +00:00
blob_serializer_base.h
blob_stats.cc
blob_stats.h
blob_test.cc Fix sign-compare in caffe2 cpp tests 2022-04-05 00:08:05 +00:00
blob.h [caffe2] Micro-optimizations in BlobGetMutableTensor (#98103) 2023-04-10 19:43:02 +00:00
CMakeLists.txt Remove caffe2 mobile (#84338) 2022-09-08 01:49:55 +00:00
common_cudnn.cc Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032) 2022-07-26 01:20:44 +00:00
common_cudnn.h
common_gpu.cc [ROCm] use hipblas instead of rocblas (#105881) 2023-07-31 20:42:55 +00:00
common_gpu.h [CUDA] Drop CUDA 10 support (#89582) 2023-01-05 05:11:53 +00:00
common_omp.h
common_test.cc [Reland] Eliminate invocations of c10::stoi,c10::stod,c10::stoull,c10::stoll (#109566) 2023-09-19 07:15:25 +00:00
common.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
common.h
context_base.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
context_base.h
context_gpu_test.cc Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032) 2022-07-26 01:20:44 +00:00
context_gpu.cu Revert "Add function to materialize COW storages (#113396)" 2023-11-20 10:26:01 +00:00
context_gpu.h [caffe2] dont call cudnnDestroy on thread exit (crashes on windows with cuda 11/12) (#95382) 2023-03-10 06:42:51 +00:00
context_test.cc cleanup unused include (#93359) 2023-02-04 02:15:50 +00:00
context.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
context.h use irange for loops 2 (#66746) 2021-12-10 04:26:23 -08:00
cudnn_wrappers.h Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032) 2022-07-26 01:20:44 +00:00
db.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
db.h use irange for loops 2 (#66746) 2021-12-10 04:26:23 -08:00
distributions_stubs.h
event_cpu.h
event_gpu_test.cc
event_gpu.cc cast return of cudaGetLastError() to void when discarding (#62518) 2021-08-03 11:17:22 -07:00
event_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
event.cc Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
event.h
export_c10_op_to_caffe2.cc Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
export_c10_op_to_caffe2.h use irange for loops 2 (#66746) 2021-12-10 04:26:23 -08:00
export_caffe2_op_to_c10.h Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
flags.h
graph_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
graph.cc
graph.h
init_denormals.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
init_intrinsics_check.cc Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
init_omp.cc Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
init_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
init.cc Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
init.h
int8_serialization.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
logging.h
macros.h
macros.h.in [ROCm] Disabling Kernel Asserts for ROCm by default - fix and clean up and refactoring (#114660) 2023-12-13 15:44:53 +00:00
memonger.cc
memonger.h
module_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
module.cc
module.h
net_async_base.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
net_async_base.h
net_async_scheduling.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
net_async_scheduling.h
net_async_task_future.cc
net_async_task_future.h
net_async_task_graph.cc
net_async_task_graph.h
net_async_task.cc
net_async_task.h
net_async_tracing_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
net_async_tracing.cc [Reland] Eliminate invocations of c10::stoi,c10::stod,c10::stoull,c10::stoll (#109566) 2023-09-19 07:15:25 +00:00
net_async_tracing.h
net_dag_utils_test.cc Fix warnings (#62930) 2021-08-11 14:07:10 -07:00
net_dag_utils.cc [caffe2] Replace CAFFE_ prefixes in static_tracepoint.h macros with TORCH_ (#106380) 2023-08-03 21:51:36 +00:00
net_dag_utils.h
net_gpu_test.cc Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032) 2022-07-26 01:20:44 +00:00
net_parallel.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
net_parallel.h
net_simple_refcount_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
net_simple_refcount.cc [caffe2] Replace CAFFE_ prefixes in static_tracepoint.h macros with TORCH_ (#106380) 2023-08-03 21:51:36 +00:00
net_simple_refcount.h
net_simple.cc [caffe2] Replace CAFFE_ prefixes in static_tracepoint.h macros with TORCH_ (#106380) 2023-08-03 21:51:36 +00:00
net_simple.h
net_test.cc Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032) 2022-07-26 01:20:44 +00:00
net.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
net.h
numa.cc
numa.h
observer_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
observer.h
operator_gpu_test.cc
operator_gradient.h
operator_schema_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
operator_schema.cc [PyTorch] Remove unnecessary iostream includes in headers (#61500) 2021-08-19 18:54:51 -07:00
operator_schema.h Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
operator_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
operator.cc [caffe2] Remove OperatorBase::newstyle_outputs_ (#67093) 2023-01-23 22:41:59 +00:00
operator.h [BE] Enforce missing override keyword (#104032) 2023-06-24 02:34:24 +00:00
parallel_net_test.cc Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032) 2022-07-26 01:20:44 +00:00
plan_executor_test.cc Fix broken caffe2 test: PlanExecutorTest.BlockingErrorPlan (#64401) 2021-09-02 08:30:29 -07:00
plan_executor.cc some reference and move fixes (#95942) 2023-03-10 03:44:09 +00:00
plan_executor.h
prof_dag_counters.cc turn on -Werror=type-limits in our Bazel CPU build 2022-06-10 10:04:08 +00:00
prof_dag_counters.h
qtensor_serialization.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
qtensor_serialization.h [BE] Enforce missing override keyword (#104032) 2023-06-24 02:34:24 +00:00
qtensor.cc
qtensor.h Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032) 2022-07-26 01:20:44 +00:00
scope_guard.h
serialization_test.cc Fix sign-compare in caffe2 cpp tests 2022-04-05 00:08:05 +00:00
stats_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
stats.cc
stats.h [caffe2] Replace CAFFE_ prefixes in static_tracepoint.h macros with TORCH_ (#106380) 2023-08-03 21:51:36 +00:00
storage.h
tensor_impl.h
tensor_int8.cc
tensor_int8.h
tensor.cc Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
tensor.h [PyTorch] Further reduce cost of TypeMeta::_typeMetaData (by 10x!) (#98105) 2023-04-12 17:44:48 +00:00
test_utils.cc
test_utils.h use irange for loops 2 (#66746) 2021-12-10 04:26:23 -08:00
timer_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
timer.h
transform_test.cc Fix sign-compare in caffe2 cpp tests 2022-04-05 00:08:05 +00:00
transform.cc Revert "Use missing-prototypes in torch_cpu (#103725)" 2023-06-22 18:30:31 +00:00
transform.h
types.cc Speed up DataTypeToTypeMeta (#66113) 2021-10-07 08:06:09 -07:00
types.h Speed up DataTypeToTypeMeta (#66113) 2021-10-07 08:06:09 -07:00
workspace_test.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
workspace.cc Disable avoid-non-const-global-variables lint check (#62008) 2021-07-22 18:04:40 -07:00
workspace.h