Enable fr_trace to read local traces from multiple hosts. (#159490)

Summary: For training jobs, particularly those from GenAI, NCCL trace dumps are generated in the format `<hostname>.pci3_rank_<rank>`. In multi-node training jobs, the hostname varies across traces. The current prefix-matching logic requires each file name to start with the prefix, so it can't handle this case.
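
To illustrate the mismatch, here is a minimal sketch of the two matching rules applied to a handful of file names. The helper names `matches_old` and `matches_new` are hypothetical and exist only for this example; they mirror the conditions in the diff rather than any function in fr_trace.

```python
def matches_old(filename: str, prefix: str) -> bool:
    # Old rule: the prefix must sit at index 0 of the file name.
    return filename.find(prefix) == 0


def matches_new(filename: str, prefix: str) -> bool:
    # New rule: the prefix may appear anywhere, e.g. after "<hostname>.".
    return filename.find(prefix) != -1


names = ["pci3_rank_0", "host0.pci3_rank_0", "host1.pci3_rank_1", "notes.txt"]
old_hits = [n for n in names if matches_old(n, "pci3_rank_")]
new_hits = [n for n in names if matches_new(n, "pci3_rank_")]

print(old_hits)  # ['pci3_rank_0']
print(new_hits)  # ['pci3_rank_0', 'host0.pci3_rank_0', 'host1.pci3_rank_1']
```

Under the old rule only the un-prefixed name is picked up, which is why a folder containing only `<hostname>.pci3_rank_<rank>` files yields no matches.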

Test Plan:
Create a local folder `dumps` containing several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1`. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```

Before this diff, fr_trace cannot locate any trace files and fails with the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```

After this diff, fr_trace locates the trace files and then fails with exceptions like
```
    dump = pickle.load(infile)
           ^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).
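
The fixture and the resulting error can be reproduced outside of buck with a short script. This is only a sketch: it builds the four empty files in a temporary directory (an assumption made for cleanliness, where the test plan uses a local `dumps` folder) and calls `pickle.load` on each one directly instead of going through fr_trace.

```python
import os
import pickle
import tempfile

# File names taken from the test plan above.
names = [
    "host0.pci3_rank_0",
    "host0.pci3_rank_1",
    "host1.pci3_rank_0",
    "host1.pci3_rank_1",
]

with tempfile.TemporaryDirectory() as tmp:
    dumps = os.path.join(tmp, "dumps")
    os.mkdir(dumps)
    for name in names:
        # Create an empty file; it holds no pickle data at all.
        open(os.path.join(dumps, name), "wb").close()

    created = sorted(os.listdir(dumps))
    errors = []
    for name in created:
        with open(os.path.join(dumps, name), "rb") as infile:
            try:
                pickle.load(infile)
            except EOFError as exc:
                # Unpickling an empty stream fails immediately.
                errors.append(str(exc))

print(created)
print(errors)
```

Every file produces the same `EOFError`, so seeing it in the fr_trace output indicates the files were found and opened, which is the behavior this diff is meant to enable.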

Rollback Plan:

Differential Revision: D79224727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
Tianhao Huang 2025-08-06 03:15:30 +00:00 committed by PyTorch MergeBot
parent 8ce81bcee1
commit 14c7358c64

@@ -78,9 +78,9 @@ def read_dir(args: argparse.Namespace) -> tuple[dict[str, dict[str, Any]], str]:
         if prefix is None:
             prefix = _determine_prefix(files)
         for f in files:
-            if f.find(prefix) != 0:
+            if (offset := f.find(prefix)) == -1:
                 continue
-            details[f] = read_dump(prefix, os.path.join(root, f))
+            details[f] = read_dump(f[:offset] + prefix, os.path.join(root, f))
             filecount += 1
             if not version:
                 version = str(details[f]["version"])
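
The second changed line passes `f[:offset] + prefix` rather than the bare prefix. A plausible reading, sketched below, is that the reader recovers the rank by stripping its prefix argument from the file name, so the hostname part has to be folded into the prefix. That behavior of `read_dump` is an assumption, since its body is not part of this hunk, and `split_dump_name` is a hypothetical helper written for this example only.

```python
def split_dump_name(filename: str, prefix: str) -> tuple[str, int]:
    """Return (full_prefix, rank) for names like '<hostname>.pci3_rank_<rank>'."""
    offset = filename.find(prefix)
    if offset == -1:
        raise ValueError(f"{filename!r} does not contain {prefix!r}")
    # Everything before the match (e.g. "host1.") is folded into the prefix,
    # mirroring the f[:offset] + prefix expression in the diff.
    full_prefix = filename[:offset] + prefix
    # Assumption: the rank is the integer suffix that follows the full prefix.
    rank = int(filename[len(full_prefix):])
    return full_prefix, rank


print(split_dump_name("host1.pci3_rank_1", "pci3_rank_"))  # ('host1.pci3_rank_', 1)
print(split_dump_name("pci3_rank_7", "pci3_rank_"))        # ('pci3_rank_', 7)
```

When the file name has no hostname part, `offset` is 0 and the full prefix equals the original prefix, so single-host dumps behave as before.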