pytorch/caffe2
Yifu Wang 67416a2996 [c10d] Introduce a util for detecting DMA connectivity among devices (#129510)
This PR introduces `_detect_dma_connectivity` - a utility for detecting DMA connectivity among devices.

"DMA connectivity" in this context is more stringent than the ability to perform memory copies without CPU involvement. We define it as the ability of a device to issue load/store instructions and perform atomic operations on memory that resides on connected devices. This in turn makes it possible to run most ATen GPU operations with operands backed by remote memory. `_detect_dma_connectivity` can help PyTorch and its users determine whether certain DMA-based optimizations are possible.

`_detect_dma_connectivity` takes a `(device_type, connection_type)` pair and returns a matrix describing the connectivity. Connectivity detectors are statically registered on a `(device_type, connection_type)` basis. This PR implements the detector for `(CUDA, "nvlink")`. Later, detectors for pairs such as `(ROCM, "infinity_fabric")` can be introduced.
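The registration scheme described above can be pictured with a small pure-Python sketch. This is not the c10d implementation; the names `register_detector`, `detect_dma_connectivity`, and `_fake_nvlink_detector`, the string keys, and the placeholder 2-device matrix are all hypothetical and only illustrate dispatching on a `(device_type, connection_type)` key.

```python3
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; this only illustrates keyed registration.
Matrix = List[List[int]]
Detector = Callable[[], Matrix]

_DETECTORS: Dict[Tuple[str, str], Detector] = {}


def register_detector(device_type: str, connection_type: str):
    """Register a detector under a (device_type, connection_type) key."""

    def wrap(fn: Detector) -> Detector:
        key = (device_type, connection_type)
        if key in _DETECTORS:
            raise ValueError(f"detector already registered for {key}")
        _DETECTORS[key] = fn
        return fn

    return wrap


def detect_dma_connectivity(device_type: str, connection_type: str) -> Matrix:
    """Look up the detector for the given pair and run it."""
    key = (device_type, connection_type)
    if key not in _DETECTORS:
        raise LookupError(f"no detector registered for {key}")
    return _DETECTORS[key]()


@register_detector("cuda", "nvlink")
def _fake_nvlink_detector() -> Matrix:
    # Placeholder data for an imaginary 2-device system; a real detector
    # would query the driver instead of returning constants.
    return [[0, 18], [18, 0]]
```

Under this sketch, supporting another pair would mean registering one more function under its own key, leaving the lookup code unchanged.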

Example:

```python3
>>> from torch._C._autograd import DeviceType
>>> from torch._C._distributed_c10d import _detect_dma_connectivity
>>> connectivity = _detect_dma_connectivity(DeviceType.CUDA, "nvlink")
>>> for row in connectivity.matrix:
...     print(row)
...
[0, 18, 18, 18, 18, 18, 18, 18]
[18, 0, 18, 18, 18, 18, 18, 18]
[18, 18, 0, 18, 18, 18, 18, 18]
[18, 18, 18, 0, 18, 18, 18, 18]
[18, 18, 18, 18, 0, 18, 18, 18]
[18, 18, 18, 18, 18, 0, 18, 18]
[18, 18, 18, 18, 18, 18, 0, 18]
[18, 18, 18, 18, 18, 18, 18, 0]
```
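A caller might reduce such a matrix to a yes/no answer before enabling a DMA-based optimization. The sketch below works on a plain list of lists and assumes that a nonzero off-diagonal entry means the two devices are connected; that interpretation, and the helper names `is_fully_connected` and `connected_peers`, are assumptions made for illustration and not part of this PR.

```python3
from typing import List


def is_fully_connected(matrix: List[List[int]]) -> bool:
    """Return True if every pair of distinct devices has a nonzero entry."""
    n = len(matrix)
    return all(
        matrix[i][j] > 0
        for i in range(n)
        for j in range(n)
        if i != j
    )


def connected_peers(matrix: List[List[int]], device: int) -> List[int]:
    """Return the indices of devices that `device` has a nonzero entry for."""
    return [
        peer
        for peer, value in enumerate(matrix[device])
        if peer != device and value > 0
    ]


# Same shape as the example output above: 8 devices, 18 between each pair.
example = [[0 if i == j else 18 for j in range(8)] for i in range(8)]
print(is_fully_connected(example))  # True
print(connected_peers(example, 0))  # [1, 2, 3, 4, 5, 6, 7]
```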

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129510
Approved by: https://github.com/weifengpy
2024-06-27 23:02:07 +00:00
core Don't install remaining caffe2 python files (#129067) 2024-06-27 17:25:59 +00:00
perfkernels Don't install remaining caffe2 python files (#129067) 2024-06-27 17:25:59 +00:00
serialize [3/N] Remove inclusion of c10/util/string_utils.h (#128504) 2024-06-15 06:38:40 +00:00
utils [Caffe2] Remove remaining unused perfkernels (#128477) 2024-06-12 22:19:36 +00:00
.clang-format
CMakeLists.txt [c10d] Introduce a util for detecting DMA connectivity among devices (#129510) 2024-06-27 23:02:07 +00:00
unexported_symbols.lds Hide all symbols in llvm namespace (#63272) 2021-08-15 11:29:43 -07:00
version_script.lds Hide all symbols in llvm namespace (#63272) 2021-08-15 11:29:43 -07:00