[Inductor][Autotune] Gracefully restart the autotune process after ULF failure (#166073)

This PR partially fixes https://github.com/pytorch/torchtitan/issues/1791, as it works only with the `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1` setting.
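For context, a minimal sketch of opting into subprocess autotuning, either via the environment variable above or, assuming the usual Inductor config knob backing it (`torch._inductor.config.autotune_in_subproc`), from Python:

```python
# Sketch: enable max-autotune with subprocess autotuning so a launch failure
# cannot corrupt the main process. Assumes the standard Inductor config knob
# backing TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1; requires a CUDA GPU.
import torch
import torch._inductor.config as inductor_config

inductor_config.autotune_in_subproc = True  # or: TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1

model = torch.nn.Linear(4096, 4096).cuda().half()
compiled = torch.compile(model, mode="max-autotune")
out = compiled(torch.randn(8, 4096, device="cuda", dtype=torch.half))
```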

The core of the problem: in `max-autotune` mode, Inductor runs multiple benchmarks to determine the best config. If one of these benchmarks fails with `cudaErrorLaunchFailure` (an unspecified launch failure, ULF), every subsequent CUDA call in the same process fails as well, including the remaining benchmarks.

The solution: Restart the child process gracefully and continue benchmarking.

Unfortunately, if autotuning is done in the main process, the whole program ends up in an unrecoverable state. In that case, the only way to finish successfully is to prevent the ULF in the first place.
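As an illustration only (not the actual Inductor code), the restart pattern looks roughly like this: run each candidate in a worker process, and if the worker reports a `cudaErrorLaunchFailure`, tear it down and start a fresh one before moving on to the next candidate. `BenchmarkWorker`, `_worker_loop`, and `benchmark` below are hypothetical names.

```python
# Illustrative sketch only: a benchmark worker process plus a parent-side
# restart when an unspecified launch failure is reported.
import multiprocessing as mp
from typing import Callable


def _worker_loop(conn) -> None:
    # Child process: receive callables, run them, report a result or an error string.
    while True:
        fn = conn.recv()
        try:
            conn.send(("ok", fn()))
        except Exception as e:  # noqa: BLE001
            conn.send(("err", str(e)))


class BenchmarkWorker:
    """Hypothetical stand-in for a tuning subprocess: owns one child process."""

    def __init__(self) -> None:
        self.start()

    def start(self) -> None:
        self.parent_conn, child_conn = mp.Pipe()
        self.proc = mp.Process(target=_worker_loop, args=(child_conn,), daemon=True)
        self.proc.start()

    def restart(self) -> None:
        # Tear the child down and bring up a fresh one, which gets a
        # brand-new (uncorrupted) CUDA context.
        self.proc.terminate()
        self.proc.join()
        self.start()

    def benchmark(self, fn: Callable[[], float]) -> float:
        # fn must be picklable (e.g. a top-level function).
        self.parent_conn.send(fn)
        status, payload = self.parent_conn.recv()
        if status == "err":
            if "cudaErrorLaunchFailure" in payload:
                self.restart()  # the child's CUDA context is unusable
            return float("inf")  # ignore this candidate, keep autotuning
        return payload
```

The actual change below follows the same idea: it adds a `restart()` method to `TuningProcess` and calls it from `TuningProcessPool` when the benchmark exception mentions `cudaErrorLaunchFailure`.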

Here is some info from [CUDA documentation](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html):
> cudaErrorLaunchFailure = 719
> An exception occurred on the device while executing a kernel. [...] This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166073
Approved by: https://github.com/syed-ahmed, https://github.com/drisspg
Aidyn-A 2025-10-25 10:40:59 +00:00 committed by PyTorch MergeBot
parent b0e9c86971
commit 798a6d2be1
2 changed files with 18 additions and 4 deletions


@@ -231,6 +231,13 @@ class TuningProcess:
             self.process.kill()
         self.close()
 
+    def restart(self) -> None:
+        """
+        Gracefully restarts the child process.
+        """
+        self.shutdown(wait=True)
+        self.start()
+
 
 class TuningProcessPool:
     """
@@ -311,11 +318,16 @@ class TuningProcessPool:
             )
             # Set to INF so this choice will be ignored
             return float("inf")
-        except Exception:
+        except Exception as process_exception:
             warnings.warn(
                 f"Failed to benchmark choice '{choice}'. It will be ignored. "
                 "Please debug the root cause in case the choice can bring perf gains."
             )
+            # An unspecified launch failure (cudaErrorLaunchFailure) corrupts the
+            # CUDA context, making it unrecoverable. All subsequent CUDA calls will
+            # fail as well. The process must be restarted to restore CUDA functionality.
+            if "cudaErrorLaunchFailure" in str(process_exception):
+                process.restart()
             # Set to INF so this choice will be ignored
             return float("inf")
         finally:


@@ -3287,9 +3287,11 @@ class AlgorithmSelectorCache(PersistentCache):
                 msg = str(e)
                 if "invalid argument" in msg:
                     msg += "\n\nThis may mean this GPU is too small for max_autotune mode.\n\n"
-                else:
-                    if "illegal memory access" in msg:
-                        msg += "\n\nEither error in template or triton bug.\n"
+                elif "illegal memory access" in msg:
+                    msg += "\n\nEither error in template or triton bug.\n"
+                elif "unspecified launch failure" in msg:
+                    msg += "\n\nAn unrecoverable unspecified launch failure was caught during autotuning."
+                    msg += "\nPlease try re-running with TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1.\n\n"
 
                 if isinstance(choice, CUDATemplateCaller):
                     log.debug(