Enable fr_trace to read local traces from multiple hosts. (#159490)
Summary: For training jobs, particularly those from GenAI, NCCL trace dumps are generated in the format `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic requires the prefix to start at the beginning of the filename, so it cannot handle this case.
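To illustrate the limitation, here is a minimal sketch that compares a strict starts-at-zero check with a substring check. The helper names `matches_at_start` and `matches_anywhere` are hypothetical and exist only for this example; the prefix `pci3_rank_` is taken from the error message in the test plan.

```python
def matches_at_start(filename: str, prefix: str) -> bool:
    # Strict check: the prefix must begin at index 0 of the filename.
    return filename.find(prefix) == 0


def matches_anywhere(filename: str, prefix: str) -> bool:
    # Relaxed check: the prefix may appear later, e.g. after a hostname.
    return filename.find(prefix) != -1


# Illustrative file names only; "notes.txt" stands in for an unrelated file.
files = ["host0.pci3_rank_0", "host1.pci3_rank_1", "pci3_rank_2", "notes.txt"]
prefix = "pci3_rank_"

strict = [f for f in files if matches_at_start(f, prefix)]
relaxed = [f for f in files if matches_anywhere(f, prefix)]

print(strict)   # ['pci3_rank_2']
print(relaxed)  # ['host0.pci3_rank_0', 'host1.pci3_rank_1', 'pci3_rank_2']
```

Under the strict check, every hostname-qualified file is skipped, which is consistent with the "no files loaded" assertion reported in the test plan.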
Test Plan:
Create a local folder `dumps` containing several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1`. Then run:
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```
Before this diff, fr_trace could not locate any trace files and failed with the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```
After this diff, fr_trace locates the trace files and then fails with exceptions like
```
dump = pickle.load(infile)
^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).
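For readers without access to the buck target, the setup can be approximated in plain Python. This is a rough sketch, not the fr_trace tool itself: it creates the four empty files in a temporary directory (an assumption made to keep the example self-cleaning) and shows that unpickling an empty file raises `EOFError`.

```python
import os
import pickle
import tempfile

names = [
    "host0.pci3_rank_0",
    "host0.pci3_rank_1",
    "host1.pci3_rank_0",
    "host1.pci3_rank_1",
]

errors = {}
with tempfile.TemporaryDirectory() as tmp:
    dumps = os.path.join(tmp, "dumps")
    os.makedirs(dumps)
    for name in names:
        # Create an empty placeholder file, as in the test plan.
        open(os.path.join(dumps, name), "wb").close()

    for name in sorted(os.listdir(dumps)):
        with open(os.path.join(dumps, name), "rb") as infile:
            try:
                pickle.load(infile)
            except EOFError as exc:
                # An empty file has no pickle data to read.
                errors[name] = exc

print(sorted(errors))
```

Every file ends up in `errors`, so reaching the `EOFError` indicates that the files were found and opened, which is the behavior the test plan checks for.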
Rollback Plan:
Differential Revision: D79224727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
parent 8ce81bcee1
commit 14c7358c64
@@ -78,9 +78,9 @@ def read_dir(args: argparse.Namespace) -> tuple[dict[str, dict[str, Any]], str]:
         if prefix is None:
             prefix = _determine_prefix(files)
         for f in files:
-            if f.find(prefix) != 0:
+            if (offset := f.find(prefix)) == -1:
                 continue
-            details[f] = read_dump(prefix, os.path.join(root, f))
+            details[f] = read_dump(f[:offset] + prefix, os.path.join(root, f))
             filecount += 1
             if not version:
                 version = str(details[f]["version"])
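The change passes `f[:offset] + prefix` to `read_dump`, so the hostname becomes part of the prefix. The sketch below shows why that could matter, under the assumption (not confirmed by the hunk above, which does not show `read_dump`) that the reader recovers the rank by slicing the prefix off the filename. `split_trace_name` and `parse_rank` are hypothetical helpers written for this example.

```python
from typing import Optional


def split_trace_name(filename: str, prefix: str) -> Optional[tuple[str, str]]:
    """Return (full_prefix, remainder), or None if the prefix is absent."""
    offset = filename.find(prefix)
    if offset == -1:
        return None
    # Everything before the match (e.g. "host1.") is folded into the prefix.
    full_prefix = filename[:offset] + prefix
    return full_prefix, filename[len(full_prefix):]


def parse_rank(filename: str, prefix: str) -> Optional[int]:
    # ASSUMPTION: the rank is the integer text that follows the full prefix.
    parts = split_trace_name(filename, prefix)
    if parts is None:
        return None
    return int(parts[1])


print(split_trace_name("host1.pci3_rank_3", "pci3_rank_"))  # ('host1.pci3_rank_', '3')
print(parse_rank("host1.pci3_rank_3", "pci3_rank_"))        # 3
print(parse_rank("pci3_rank_7", "pci3_rank_"))              # 7
print(parse_rank("notes.txt", "pci3_rank_"))                # None
```

With the bare prefix, slicing `"host1.pci3_rank_3"` by `len("pci3_rank_")` would leave `"_rank_3"`, which is not an integer; including the hostname in the prefix avoids that mismatch.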