Commit Graph

5 Commits

Author SHA1 Message Date
Yifu Wang
27ffede878 [reland] Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken, and the errors have been silently swallowed by Inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497 (a rough sketch of this kind of estimate follows below).
- Adds white-box testing so future issues can be surfaced in tests.
- Adds support for native funcol IRs.
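
For intuition, here is a minimal, hedged sketch of the kind of bandwidth-based estimate involved. The function name, the default bandwidth constant, and the ring-traffic model are illustrative assumptions, not the actual `estimate_nccl_collective_runtime` implementation:

```
# Rough, illustrative estimate of ring all-reduce time (not Inductor's code).
def rough_all_reduce_runtime_ms(tensor_bytes: int,
                                world_size: int,
                                bus_bandwidth_gb_per_s: float = 300.0) -> float:
    # A ring all-reduce moves roughly 2 * (N - 1) / N of the payload per rank.
    traffic_bytes = tensor_bytes * 2 * (world_size - 1) / world_size
    seconds = traffic_bytes / (bus_bandwidth_gb_per_s * 1e9)
    return seconds * 1e3

# Example: a 64 MiB gradient bucket across 8 ranks.
print(rough_all_reduce_runtime_ms(64 * 2**20, world_size=8))
```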

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #119102
2024-02-12 18:48:06 +00:00
PyTorch MergeBot
7315ec7505 Revert "Fix estimate_nccl_collective_runtime (#118986)"
This reverts commit 0dab6fb352.

Reverted https://github.com/pytorch/pytorch/pull/118986 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118986#issuecomment-1934680463))
2024-02-08 18:11:53 +00:00
Yifu Wang
0dab6fb352 Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken, and the errors have been silently swallowed by Inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Adds support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #118910, #118911, #118437
2024-02-07 18:02:51 +00:00
Will Feng
e9804aaacc Fix unit tests and add logging for Inductor intra-graph reordering (#111981)
1. Fix code to make unit tests pass (incl. the collect_env issue called out by @int3 in https://github.com/pytorch/pytorch/pull/108091#discussion_r1362901686).
2. Add logging for Inductor intra-graph reordering passes (`TORCH_LOGS="overlap"`) for easier debugging (a short usage note on enabling it follows the log). Example log:
```
[rank0]:[2023-10-24 16:28:26,446] [0/0] torch._inductor.comms.__overlap: [DEBUG] ==== Visualize overlap before reordering pass <function reorder_compute_for_overlap at 0x7fa68c5568e0> ====
[rank0]:[2023-10-24 16:28:26,446] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf0)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf1)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] InPlaceHint (size=[4, 4], stride=[4, 1]) (buf2)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] AllReduce (size=[4, 4], stride=[4, 1]) (buf3)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] Wait (size=[4, 4], stride=[4, 1]) (buf4)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf5)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] InPlaceHint (size=[4, 4], stride=[4, 1]) (buf6)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] AllReduce (size=[4, 4], stride=[4, 1]) (buf7)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] Wait (size=[4, 4], stride=[4, 1]) (buf8)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf9)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf10)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf11)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] Est. runtime (ms): 0.000228

[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] ==== Visualize overlap after reordering pass <function reorder_compute_for_overlap at 0x7fa68c5568e0> ====
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] InPlaceHint (size=[4, 4], stride=[4, 1]) (buf2)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] AllReduce (size=[4, 4], stride=[4, 1]) (buf3)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] | ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf0)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] | ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf1)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] | ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf9)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] Wait (size=[4, 4], stride=[4, 1]) (buf4)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf5)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] InPlaceHint (size=[4, 4], stride=[4, 1]) (buf6)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] AllReduce (size=[4, 4], stride=[4, 1]) (buf7)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] Wait (size=[4, 4], stride=[4, 1]) (buf8)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf10)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf11)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] Est. runtime (ms): 0.000217
```
A `| SomeComputeOp` line means that the compute op is overlapped with the comm op above it.
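
As a hedged usage note, the snippet below shows one way to turn these logs on from Python; the PR itself only documents the `TORCH_LOGS="overlap"` environment-variable form, and the ordering assumption is called out in the comments:

```
# The PR exposes these logs through the TORCH_LOGS environment variable, e.g.
#   TORCH_LOGS="overlap" python your_training_script.py
# Setting it from inside Python is an assumption about ordering: the variable
# has to be set before torch's logging machinery reads it.
import os

os.environ["TORCH_LOGS"] = "overlap"  # set before importing torch

import torch  # overlap logs are emitted when Inductor runs its reordering passes
```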

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111981
Approved by: https://github.com/wanchaol
2023-10-25 18:19:43 +00:00
Will Feng
b28cb43f5c Intra-graph reordering pass on Inductor scheduler IR (based on #100762) (#108091)
This PR implements an intra-graph communication reordering pass on the Inductor scheduler IR, based on Horace's previous PR #100762.

Main algorithm (a simplified sketch follows the list):
1. Greedily moves waits as late as possible (i.e. until we reach a use).
2. Greedily moves comms as early as possible (i.e. until we reach an input).
3. Moves compute ops according to simple heuristics to improve overlap.
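
As a rough illustration (not the Inductor scheduler code), the sketch below shows the greedy idea behind steps 1 and 2 on a flat list of nodes; the `Node` class and `sink_waits` helper are hypothetical:

```
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                                # "compute", "comm", or "wait"
    deps: set = field(default_factory=set)   # names of nodes this one reads

def sink_waits(order):
    """Step 1: delay each wait until just before its first user."""
    result, pending_waits = [], []
    for node in order:
        if node.kind == "wait":
            pending_waits.append(node)
            continue
        # Flush any buffered wait whose result this node actually needs.
        for w in list(pending_waits):
            if w.name in node.deps:
                result.append(w)
                pending_waits.remove(w)
        result.append(node)
    result.extend(pending_waits)             # waits with no user go last
    return result

# Step 2 (raising comms toward their inputs) is the mirror image: walk the list
# in reverse and buffer comm nodes until a producer of one of their inputs is hit.

# Tiny example: comm -> wait -> unrelated compute -> user of the wait.
nodes = [
    Node("ar", "comm"),
    Node("w", "wait", {"ar"}),
    Node("mm", "compute"),
    Node("use", "compute", {"w"}),
]
print([n.name for n in sink_waits(nodes)])   # ['ar', 'mm', 'w', 'use']
```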

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2023-10-14 14:51:24 +00:00