Commit Graph

7640 Commits

Author SHA1 Message Date
Zitong Zhan
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimum example of usage:
```
import torch

if __name__ == '__main__':
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code.  The advantage of NVTX3 over NVTX is that it is a header-only library so that linking with NVTX3 can greatly simplify our CMake and other building scripts for finding libraries in user environments. In addition, NVTX are indeed still present in the latest CUDA versions, but they're no longer a compiled library: It's now a header-only library. That's why there isn't a .lib file anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
Mikayla Gawarecki
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
PyTorch MergeBot
fa1d7b0262 Revert "Remove unused Caffe2 macros (#132979)"
This reverts commit da65cfbdea.

Reverted https://github.com/pytorch/pytorch/pull/132979 on behalf of https://github.com/ezyang due to these are apparently load bearing internally ([comment](https://github.com/pytorch/pytorch/pull/132979#issuecomment-2284666332))
2024-08-12 18:34:56 +00:00
xinan.lin
92f650c5b3 [Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255. (#126678)
[Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126678
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/eellison
2024-08-10 14:33:39 +00:00
cyy
da65cfbdea Remove unused Caffe2 macros (#132979)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132979
Approved by: https://github.com/ezyang
2024-08-09 04:48:20 +00:00
Yan Zhiwei
fe4f8e97cd [Intel GPU] xpu-ops codegen via backend whitelist (#130082)
# Motivation

This PR intends to enhance the codegen to allow generate codes for XPU backend.

XPU operators need be registered in an hand-written way currently. Developers have no chance to take the advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels).  Manually porting code is erro-prone and may lead to high maintaining efforts.

We utilize the backend_whitelist argument in `gen.py` to generate XPU needed headers and source codes.

# Usage
XPU ops lie in `third_pary/torch-xpu-ops`, the codegen process is triggered before the complation of `torch-xpu-ops`

We use the following commands to generate XPU operators

` python -m torchgen.gen --source-path path/to/yaml/of/xpu   --install-dir  build/xpu    --per-operator-headers    --static-dispatch-backend     --backend-whitelist=XPU`

The diff lies at `backend-whitelist=XPU`.  The backend-whitelist key is an existent argument in torchgen.

The input of `gen.py` are code templates and operators yaml. We share the same templates in `aten`. A simplified yaml lies in `third_party/torch-xpu-ops`, which only includes the supported xpu operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entry is added, the format is same as the one in `aten`

# Result

All operators headers are generated in `build/xpu/ATen/ops` independently, which would not affect operators declared/defined by CPU/CUDA or any other backend.  XPU operators only include headers in this folder.

# Verification

* In `third-party/torch-xpu-ops`, we migrate all supported kernels to structured kernels style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
2024-07-31 16:31:38 +00:00
Cui, Yifeng
df0494bbba Clean redundant link libraries for XPU (#131322)
`torch_xpu` should link to `libtorch_cpu.so` instead of `torch_cpu_library`, otherwise redundant link libraries will contaminate `torch_xpu`, especially when there are MKL in both CPU and XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131322
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-07-31 02:15:15 +00:00
PyTorch MergeBot
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9d.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
Mikayla Gawarecki
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
angelayi
b90aa18569 [aoti] Add initial custom op support (#127034)
Re-land of https://github.com/pytorch/pytorch/pull/125242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127034
Approved by: https://github.com/malfet
2024-07-24 20:29:55 +00:00
Nikita Shulga
62e566b345 [BE] Remove suppression of inconsistent missing overrides (#131524)
This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204

- Remove global `-Wno-error=inconsistent-missing-override`
- Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED`
- Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524
Approved by: https://github.com/atalman
2024-07-24 10:07:36 +00:00
PyTorch MergeBot
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
Mikayla Gawarecki
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
PyTorch MergeBot
7ce5b5767c Revert "Make c10::string_view an alias of std::string_view (#130417)"
This reverts commit c9551a3f50.

Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/izaitsevfb due to depends on #130009 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2224212227))
2024-07-12 00:37:04 +00:00
cyy
c9551a3f50 Make c10::string_view an alias of std::string_view (#130417)
Follows #130009 to further facilitate the mitigation from c10::string_view to std::string_view. The old c10::string_view was renamed to c10::string_view_ext.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417
Approved by: https://github.com/ezyang
2024-07-11 12:31:06 +00:00
cyy
d31f866b33 [BE] [CMake] Remove AT_CORE_STATIC_WINDOWS option (#130409)
AT_CORE_STATIC_WINDOWS was inherited from torch and is not used anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130409
Approved by: https://github.com/malfet
2024-07-10 15:50:47 +00:00
cyy
a6345d3477 [CMake] [3/N] Remove unused code (#130322)
Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322
Approved by: https://github.com/r-barnes
2024-07-09 19:33:33 +00:00
Valentin Andrei
b139b5090f [pytorch] Name threads in thread pools for better debugging (#130270)
Threads inside the thread pools are not named, so they inherit the main process name or the name of the first thread. In our case if we set `pt_main_thread` as the thread name when a thread does `import torch`, this name will be inherited by all the threads in the created pools.

This PR names the threads in the pools I was able to find. There are other pools created, like OpenMP ones and we need to follow-up on those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130270
Approved by: https://github.com/d4l3k, https://github.com/albanD
2024-07-09 08:03:47 +00:00
cyy
07b06f0f0a [2/N] Remove outdated CMake code (#130006)
Follows #129851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130006
Approved by: https://github.com/drisspg
2024-07-04 06:24:22 +00:00
cyy
46366888d7 Remove outdated CMake code (#129851)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129851
Approved by: https://github.com/ezyang
2024-07-02 00:40:37 +00:00
Yifu Wang
67416a2996 [c10d] Introduce a util for detecting DMA connectivity among devices (#129510)
This PR introduces `_detect_dma_connectivity` - a utility for detecting DMA connectivity among devices.

The "DMA connectivity" in this context is more stringent than the ability to perform memory copy without CPU involvement. We define it as the ability for a device to issue load/store instructions and perform atomic operations on memory that resides on connected devices. The ability translates to the ability to run most aten GPU operations with operands backed by remote memory. `_detect_dma_connectivity` can help PyTorch and its users to determine whether certain DMA-based optimizations are possible.

`_detect_dma_connectivity` takes a `(device_type, connection_type)` pair and returns a matrix describing the connectivity. Connectivity detectors are statically registered on a `(device_type, connection_type)` basis. This PR implements the detector for `(CUDA, "nvlink")`. Later, detectors for pairs such as `(ROCM, "infinity_fabric")` can be introduced.

Example:

```python3
>>> from torch._C._autograd import DeviceType
>>> from torch._C._distributed_c10d import _detect_dma_connectivity
>>> connectivity = _detect_dma_connectivity(DeviceType.CUDA, "nvlink")
>>> for row in connectivity.matrix:
...     print(row)
...
[0, 18, 18, 18, 18, 18, 18, 18]
[18, 0, 18, 18, 18, 18, 18, 18]
[18, 18, 0, 18, 18, 18, 18, 18]
[18, 18, 18, 0, 18, 18, 18, 18]
[18, 18, 18, 18, 0, 18, 18, 18]
[18, 18, 18, 18, 18, 0, 18, 18]
[18, 18, 18, 18, 18, 18, 0, 18]
[18, 18, 18, 18, 18, 18, 18, 0]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129510
Approved by: https://github.com/weifengpy
2024-06-27 23:02:07 +00:00
Aleksei Nikiforov
9174d14551 Don't install remaining caffe2 python files (#129067)
It is assumed that they are no longer needed.
And keeping their installation as is breaks
"python setup.py develop --user" workflow
when non-root user is used.

This change is follow up for 3d617333e7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129067
Approved by: https://github.com/cyyever, https://github.com/r-barnes
2024-06-27 17:25:59 +00:00
Chirag Pandya
64f1111d38 Expose nholmann json to torch (#129570)
Summary:

Expose nlohmann json library so that it can be used from inside Pytorch. The library already exists in the `third_party` directory. This PR is making `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D59035246

Pulled By: c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k, https://github.com/malfet
2024-06-26 21:59:26 +00:00
cyy
479ce5e2f4 Remove outdated CUDA code from CMake (#128801)
It's possible to simplify some CUDA handling logic in CMake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128801
Approved by: https://github.com/r-barnes, https://github.com/malfet
2024-06-21 15:00:00 +00:00
Yifu Wang
217aac96d7 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-21 08:49:11 +00:00
PyTorch MergeBot
63a724d8e1 Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 8771e3429c.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2181656181))
2024-06-20 22:31:29 +00:00
PaliC
236fbcbdf4 [Split Build] Test split build in pull CI workflow (#126813)
This PR builds the split build in the pull workflow and runs the appropriate tests against them. A single linux cpu and single gpu build were chosen arbitrarily to not add too many tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126813
Approved by: https://github.com/atalman
ghstack dependencies: #127934
2024-06-19 15:57:21 +00:00
Yifu Wang
8771e3429c Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-19 03:38:58 +00:00
PyTorch MergeBot
77830d509f Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 7a39755da2.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2176685232))
2024-06-18 18:11:43 +00:00
Yifu Wang
7a39755da2 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-15 10:20:21 +00:00
cyy
e4c32d14a8 [3/N] Remove inclusion of c10/util/string_utils.h (#128504)
Follows #128372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128504
Approved by: https://github.com/malfet
2024-06-15 06:38:40 +00:00
cyy
3008644297 [Caffe2] Remove remaining unused perfkernels (#128477)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128477
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-12 22:19:36 +00:00
cyy
2126ae186e Remove caffe2/perfkernels files (#128186)
These files are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128186
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-10 23:40:18 +00:00
Xinya Zhang
d34075e0bd Add Efficient Attention support on ROCM (#124885)
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation

Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.

This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129

`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change.  [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
2024-06-08 22:41:05 +00:00
Aaron Orenstein
dcfa7702c3 Flip default value for mypy disallow_untyped_defs [1/11] (#127838)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838
Approved by: https://github.com/oulgen
2024-06-08 18:16:33 +00:00
sdp
b4a0161449 Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390)
Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase.

-------
As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU.

This PR  enables XPU build on Windows as the first step of #126719:

- Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows.
- Build oneDNN GPU library on Windows.

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang
2024-06-06 01:41:06 +00:00
cyy
3d617333e7 Simplify CMake code (#127683)
Due to the recent adoption of find(python), it is possible to further simplify some CMake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127683
Approved by: https://github.com/ezyang
2024-06-05 15:17:31 +00:00
cyy
df75a9dc80 Remove Caffe2/onnx (#127991)
Remove Caffe2/onnx since it is not used. Other tiny fixes are also applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127991
Approved by: https://github.com/ezyang
2024-06-05 15:10:12 +00:00
Nikita Shulga
d48c25c7d1 [BE] Fix missing-prototypes errors in Metal backend (#127994)
By declaring a bunch of functions static.
Removed `USE_PYTORCH_METAL` from list of flags that suppress `-Werror=missing-prototypes`. This  will prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/127942 to sneak past CI, that builds PyTorch with Metal support.
Use nested namespaces
Remove spurious semicolon after TORCH_LIBRARY declaration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127994
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi
2024-06-05 14:58:19 +00:00
Tristan Rice
597922ba21 Reapply "distributed debug handlers (#126601)" (#127805)
This reverts commit 7646825c3e.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805
Approved by: https://github.com/PaliC
2024-06-04 19:44:30 +00:00
cyy
059cae6176 [Caffe2] Remove Caffe2 proto and other files (#127655)
Remove Caffe2 proto files altogether.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127655
Approved by: https://github.com/ezyang
2024-06-04 14:22:21 +00:00
cyy
e7cb43a2d2 Check unused variables in tests (#127498)
Enables unused variable checks in CMake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127498
Approved by: https://github.com/ezyang
2024-06-04 05:35:25 +00:00
Xinya Zhang
ef9451ac8d Move the build of AOTriton to base ROCM docker image. (#127012)
Mitigates #126111

AOTrtion, as a Math library, takes long time to build. However, this library itself is not moving as fast as PyTorch itself and it is not cost-efficient to build it for every CI check.

This PR moves the build of AOTriton from PyTorch to its base docker image, avoids duplicated and long build time.

Pre-this-PR:
* PyTorch base docker build job duration: 1.1-1.3h
* PyTorch build job duration: 1.4-1.5hr (includes AOTriton build time of 1hr6min on a linux.2xlarge node)

Post-this-PR:
* PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node)
* PyTorch build job duration: <20 min

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn
2024-06-03 20:35:22 +00:00
cyy
a6bae1f6db Remove more caffe2 files (#127511)
Remove more caffe2 files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127511
Approved by: https://github.com/r-barnes
2024-05-31 11:26:27 +00:00
PyTorch MergeBot
7646825c3e Revert "distributed debug handlers (#126601)"
This reverts commit 3d541835d5.

Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))
2024-05-31 01:21:24 +00:00
cyy
d44daebdbc [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-31 01:20:45 +00:00
Tristan Rice
3d541835d5 distributed debug handlers (#126601)
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-05-30 02:21:08 +00:00
PyTorch MergeBot
67739d8c6f Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 699db7988d.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))
2024-05-30 01:16:57 +00:00
cyy
84b5aa9a68 [Caffe2] [Reland] Remove Caffe2 proto files (#127394)
Reland of #126134, which was reverted due to the wrong base. Now that #126705 has been relanded, it's time to remand this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127394
Approved by: https://github.com/r-barnes
2024-05-29 20:37:02 +00:00