Commit Graph

2 Commits

Author SHA1 Message Date
cyy
94e12f97dc [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404)
Follows #137072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404
Approved by: https://github.com/Skylion007
2024-10-10 18:05:34 +00:00
Tristan Rice
bebf5302ba TCPStoreLibUvBackend: trace operations (#136320)
Summary:
This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values

This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if they would never be printed which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead

Test Plan:
```
TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo"
```

```
I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500.
I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running
I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500.
I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500).
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500
I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646
I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646
I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646
I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646
I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646
I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646
I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646
I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646
```

Differential Revision: D62924454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
2024-09-20 00:53:21 +00:00