Summary: This PR solves two issues:
1. When lowering the all_reduce op, Inductor expects to convert it to the in-place version, all_reduce_, but it was calling ir._AllReduceKernel.create_inplace instead of ir._AllReduce_Kernel.create_inplace. This triggers a tricky bug in AOIT because it generates cpp call to the functional version aoti_torch_cpu__c10d_functional_all_reduce, but later corresponding wait operation will still wait on the input to aoti_torch_cpu__c10d_functional_all_reduce instead of the output from aoti_torch_cpu__c10d_functional_all_reduce. This causes unwaited tensor leading to memory leak.
2. Since AOTI generates the inplace version aoti_torch_cpu__c10d_functional_all_reduce_ now. The return tensor from aoti_torch_cpu__c10d_functional_all_reduce_ doesn't get used. It will be released when the program exists, so it's not a memory leak but it will unnecessarily hold that tensor which causes high memory water mark. This PR generates tensor delete operation right after calling aoti_torch_cpu__c10d_functional_all_reduce_.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159818
Approved by: https://github.com/henryhu6, https://github.com/yushangdi
Implementations:
1. Move collective ops to c10d namespace, so that we can call them externally.
2. Add AOTI shims for collective ops.
Testing
1. Add c10d functional UT for cpu.
2. Include the above one in cpp wrapper UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154492
Approved by: https://github.com/desertfire
```
NOTE [lowering-time collective optimization]
In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.
To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:
- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.
To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.
For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```
### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).
### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029
Approved by: https://github.com/Chillee
ghstack dependencies: #138028