pytorch/torch
fduwjj 06e9deabb6 [c10d][fr] Improve FR dump robustness with all watchdog broadcast wait and more frequent store check (#150652)
While debugging missing FR dumps and missing dump logs, I made a couple of initial findings:
1. On the same rank, if a second watchdog timeout fires on a different PG (or sub-PG), that watchdog thread immediately throws an exception instead of sleeping. We fix this by making that watchdog thread also wait for 1 minute.
2. The FR dump takes about 900ms to 1200ms, so we were not checking the store frequently enough. Rather than changing the check frequency from 1s to 300ms, we ultimately decided to have all ranks sleep for 1 minute universally instead of using a promise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150652
Approved by: https://github.com/kwen2501
2025-04-07 16:33:27 +00:00
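The waiting pattern the PR describes can be sketched in miniature. This is a hypothetical illustration, not the actual c10d implementation: the function name, the flag, and the durations are all stand-ins for the real watchdog/store machinery.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical sketch of the fix described above: on a collective timeout,
// a watchdog thread polls a shared "dump finished" flag for up to a fixed
// window instead of throwing immediately, so the flight-recorder dump
// (which takes roughly a second) has time to complete.
bool waitForFlightRecorderDump(
    std::atomic<bool>& dumpDone,
    std::chrono::milliseconds window,        // the PR waits ~1 minute
    std::chrono::milliseconds pollInterval)  // check the store frequently
{
  const auto deadline = std::chrono::steady_clock::now() + window;
  while (std::chrono::steady_clock::now() < deadline) {
    if (dumpDone.load()) {
      return true;  // dump completed within the window
    }
    std::this_thread::sleep_for(pollInterval);
  }
  return dumpDone.load();  // final check once the window expires
}
```

With this shape, a second watchdog thread that times out simply waits out the same window as the first rather than throwing, which matches the "all ranks sleep universally" decision in the PR description.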
_awaits
_C Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)" (#150542) 2025-04-03 21:15:38 +00:00
_C_flatbuffer
_custom_op
_decomp Remove aten.elu core ATen decomp because it is now core ATen (#149780) 2025-03-25 01:59:57 +00:00
_dispatch [BE][PYFMT] migrate PYFMT for torch._dynamo to ruff format (#144549) 2025-02-28 03:03:53 +00:00
_dynamo Generalize compile collective to avoid cuda-bias (#150405) 2025-04-07 01:54:20 +00:00
_export [export] Make aoti_call_delegate hop traceable (#148804) 2025-04-03 20:44:31 +00:00
_functorch Make CompileEventLogger more defensive w.r.t to AOTAutogradCache and FXGraphCache (#150423) 2025-04-04 01:55:13 +00:00
_higher_order_ops [export] Make aoti_call_delegate hop traceable (#148804) 2025-04-03 20:44:31 +00:00
_inductor Add RECORD_FUNCTION for AOTI (#150150) 2025-04-07 15:12:29 +00:00
_lazy
_library [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555) 2025-04-01 18:45:48 +00:00
_logging [export] Beef up guard_added logs (#149465) 2025-03-20 23:02:07 +00:00
_numpy
_prims Support torch.compile rng selective activation checkpointing with cudagraph (#146878) 2025-02-28 00:47:03 +00:00
_prims_common PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
_refs [export] fix stft decomp and making it consistent with cpp impl. (#149232) 2025-03-19 18:40:35 +00:00
_strobelight Enable strobelight profiling specific compile frame ids using COMPILE_STROBELIGHT_FRAME_FILTER (#147549) 2025-02-22 03:44:53 +00:00
_subclasses [aoti] Fix cannot determine truth value of Relation error when propagating unbacked symint in lowering (#150570) 2025-04-03 20:06:15 +00:00
_vendor
accelerator Move get accelerator to use build time flags when possible (#146098) 2025-03-10 13:17:58 +00:00
amp [MPS] grad scaler (#150255) 2025-04-06 17:06:55 +00:00
ao [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/ (#149595) 2025-04-03 23:50:13 +00:00
autograd Compare device name of profiler dynamically (#150396) 2025-04-02 06:06:06 +00:00
backends [ROCm] change preferred blas lib defaults (#150212) 2025-03-29 03:33:07 +00:00
compiler [dynamo] add reason field to torch.compiler.disable (#150341) 2025-04-02 04:26:48 +00:00
contrib
cpu [CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935) 2025-02-20 18:50:55 +00:00
csrc [c10d][fr] Improve FR dump robustness with all watchdog broadcast wait and more frequent store check (#150652) 2025-04-07 16:33:27 +00:00
cuda [ROCm][TunableOp] Stricter unit tests for online and offline tuning (#150142) 2025-03-31 04:12:08 +00:00
distributed [torchrec] update local_shards_wrapper to latest version (#150469) 2025-04-07 13:00:52 +00:00
distributions [typing] Add type hints to __init__ methods in torch.distributions. (#144197) 2025-04-06 17:50:35 +00:00
export [export] specialize for aten.to (#149235) 2025-04-03 05:20:10 +00:00
fft
func
futures PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
fx [MTIA] Map names to operand indices when folding submodules (#150692) 2025-04-06 03:11:14 +00:00
jit scriptfunction: Make sure we have valid __name__ and __qualname__ (#147906) 2025-02-28 23:25:47 +00:00
legacy
lib [codemod] Fix missing field initializer in caffe2/torch/lib/libshm/manager.cpp +1 (#148393) 2025-03-04 04:20:04 +00:00
linalg Implement gradient for the residuals of torch.linalg.lstsq (#148526) 2025-03-10 12:35:09 +00:00
masked Use variadic length tuple for torch.masked.DimOrDims (#149870) 2025-03-31 07:06:58 +00:00
monitor
mps [MPS] Make torch.mps.compile_shader public (#148972) 2025-03-11 20:20:58 +00:00
mtia [MTIA] Add _mtia_maybeExchangeDevice to MTIA module (#149340) 2025-03-18 15:15:12 +00:00
multiprocessing
nested [aotd] Guess tangents stride as output strides (#144579) 2025-03-20 15:41:36 +00:00
nn Move formulas on separate line in loss.py (#150565) 2025-04-03 20:47:35 +00:00
onnx [export] refactor _Dim into Dim (#149891) 2025-03-28 06:19:03 +00:00
optim [MPS] grad scaler (#150255) 2025-04-06 17:06:55 +00:00
package Remove code for Python < 3.9 (#147097) 2025-02-14 03:22:49 +00:00
profiler [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190) 2025-03-06 20:37:06 +00:00
quantization
signal
sparse Fix spelling (#149277) 2025-03-20 01:02:32 +00:00
special
testing cpp_wrapper: Fix even more tests (#147225) 2025-04-07 14:20:06 +00:00
utils Revert "bound sympy accuracy (#150383)" 2025-04-04 16:26:00 +00:00
xpu xpu: torch.xpu.get_arch_list() to return [] if xpu not compiled (#147431) 2025-02-24 01:35:54 +00:00
__config__.py
__future__.py
__init__.py Fix #149806 : Fix path lookup in _preload_cuda_deps (#149808) 2025-03-25 23:03:47 +00:00
_appdirs.py
_classes.py
_compile.py
_custom_ops.py
_deploy.py
_environment.py
_guards.py [dynamo] Always trace into tensor subclass __torch_function__ (#149792) 2025-04-02 20:57:00 +00:00
_jit_internal.py [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546) 2025-02-27 20:46:16 +00:00
_linalg_utils.py
_lobpcg.py [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546) 2025-02-27 20:46:16 +00:00
_lowrank.py
_meta_registrations.py enable torch.compile for torch._scaled_mm nvfp4 recipe (#150462) 2025-04-02 01:08:40 +00:00
_namedtensor_internals.py
_ops.py Add Any return annotation to __getattr__ methods that return a union of types. (#150204) 2025-04-02 05:25:07 +00:00
_python_dispatcher.py
_size_docs.py
_sources.py
_storage_docs.py
_streambase.py
_tensor_docs.py Add type hints to _tensor_docs.add_docstr_all (#150715) 2025-04-06 22:25:34 +00:00
_tensor_str.py add torch.float4_e2m1fn_x2 to PyTorch (#148791) 2025-03-27 17:32:20 +00:00
_tensor.py Revert "Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)" 2025-02-18 19:01:27 +00:00
_thread_safe_fork.py
_torch_docs.py Optimize torch.equal description (#149618) 2025-03-21 03:44:49 +00:00
_utils_internal.py [ROCm] OCP FP8 Support for new GPUs (#146632) 2025-02-24 22:47:52 +00:00
_utils.py Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786) 2025-03-06 12:04:32 +00:00
_VF.py
_vmap_internals.py
_weights_only_unpickler.py Add sparse tensors constructed via legacy constructor to _sparse_tensors_to_validate (#147759) 2025-02-25 23:51:12 +00:00
CMakeLists.txt Add new dependences for gen_pyi.py (#150391) 2025-04-03 14:18:18 +00:00
custom_class_detail.h
custom_class.h Remove unneeded Clang-tidy suppression (#148246) 2025-03-01 16:51:54 +00:00
extension.h
functional.py Fix invalid nested int guarding in broadcast_shapes() (#145957) 2025-03-11 00:53:13 +00:00
hub.py [BE][CI][Easy] bump ruff to 0.9.0: long statements in docstrings (#146509) 2025-02-24 19:56:08 +00:00
library.h [pytorch] add experimental TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT (#150537) 2025-04-03 22:36:17 +00:00
library.py [Docs] Make torch.Library's kind have no default value to be consistent with the code (#149390) 2025-03-21 04:42:10 +00:00
overrides.py Use Python 3.9 typing (#148157) 2025-03-04 03:09:55 +00:00
py.typed
quasirandom.py
random.py
README.md Rename README.txt to README.md (#149811) 2025-03-24 22:33:33 +00:00
return_types.py
script.h
serialization.py Move get accelerator to use build time flags when possible (#146098) 2025-03-10 13:17:58 +00:00
storage.py add torch.float4_e2m1fn_x2 to PyTorch (#148791) 2025-03-27 17:32:20 +00:00
torch_version.py
types.py
version.py.tpl

Note [TH abstraction violation]
TH/THC provide some hpp headers, which are proper C++ headers rather than
C headers.  These headers serve double duty as *internal implementation
detail* headers, whose contents should largely not be used by external
clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.
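The public-header-vs-internal-header split can be shown in miniature. The names below are illustrative stand-ins, not the real TH declarations: `Tensor` plays the role of the struct defined in an internal `.hpp`, and `Tensor_nDimension` plays the role of a public accessor from the corresponding `.h`.

```cpp
// Hypothetical illustration of the abstraction boundary described in the
// note above; names are stand-ins for THTensor and its accessors.

// What an internal .hpp exposes: the struct layout (implementation detail).
struct Tensor {
  int ndim;  // external clients should not read this field directly
};

// What the public .h exposes: accessor functions over the handle, so the
// struct layout can change without breaking external clients.
inline int Tensor_nDimension(const Tensor* t) {
  return t->ndim;
}
```

In this sketch, code outside torch/csrc would call `Tensor_nDimension(t)` rather than touching `t->ndim`; the sites marked with a pointer to this note are the known places that reach through the boundary anyway.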