IvanKobzarev
|
25c170b72e
|
[inductor] Runtime estimations: use nccl estimator; mm only benchmark mode (#161405)
During comms reordering and iterative wait sinking, the previous runtime estimations turned out to be pretty far off for collectives and mms.
Adding optional usage of:
- c10d.time_estimator for collectives, which is based on the NCCL estimator
- Benchmark mode only for matmuls, as they are highly dependent on the mm backend

The logic is mostly copied from Ruisi's PRs for inductor simple_fsdp: https://github.com/pytorch/pytorch/pull/157572
These estimation corrections are in the default `BaseSchedulerNode.estimate_runtime()`.
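
The dispatch described above can be illustrated with a minimal, standalone sketch. This is not the inductor code: `estimate_runtime_ms`, `benchmark_ms`, the `kind` strings, and both optional callables are hypothetical names chosen for illustration. It assumes only that a collective estimator and an mm benchmark are optional, and that everything else falls back to the existing analytical estimate.

```python
import time
from typing import Callable, Optional


def benchmark_ms(fn: Callable[[], object], warmup: int = 1, iters: int = 5) -> float:
    """Average wall-clock time of fn in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1e3 / iters


def estimate_runtime_ms(
    kind: str,
    analytical_ms: float,
    *,
    collective_estimator: Optional[Callable[[], float]] = None,
    mm_benchmark: Optional[Callable[[], object]] = None,
) -> float:
    """Pick an estimation source per node kind (illustrative only)."""
    # Collectives: prefer a dedicated estimator when one is provided.
    if kind == "collective" and collective_estimator is not None:
        return collective_estimator()
    # Matmuls: benchmark, since runtime depends heavily on the mm backend.
    if kind == "mm" and mm_benchmark is not None:
        return benchmark_ms(mm_benchmark)
    # Everything else, or when the optional sources are disabled:
    # keep the existing analytical estimate.
    return analytical_ms
```

Keeping both sources optional means the previous behavior is preserved whenever neither is supplied.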
Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161405
Approved by: https://github.com/eellison
|
2025-09-08 14:33:19 +00:00 |
|