Commit Graph

373 Commits

Author SHA1 Message Date
fduwjj
ffc6cbfaf7 [symm_mem] Move all symm mem code into a dedicated folder (#155573)
We have reached a point where many files are related to symmetric memory and are scattered around on the C++ side. Let's first put all symmetric-memory-related code into a separate folder. We can do further refactoring later if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155573
Approved by: https://github.com/fegin, https://github.com/d4l3k
2025-06-10 22:30:11 +00:00
Bin Bao
13044b2b04 Move c10/macros/Export.h to torch/standalone (#154850)
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner.

Test Plan: CI

Differential Revision: D75756619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154850
Approved by: https://github.com/janeyx99
2025-06-03 06:18:59 +00:00
Peter Y. Yeh
43390d8b13 ROCm Sparsity through HipSparseLT (#150578)
TLDR:

- This pull request introduces support for hipSPARSELt in ROCm; the current use case is semi-structured sparsity (see the sketch below).
- Requires **ROCm 6.4** && **gfx942/gfx950**.
- The average performance uplift (compared to dense operations) is ~20% on ROCm 6.4, with further performance improvements expected along the way.
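
As a rough illustration of the semi-structured (2:4) sparsity flow this backend serves, here is a minimal sketch using the existing `torch.sparse.to_sparse_semi_structured` API; the tensor shapes and tolerances are illustrative assumptions, not part of this PR:

```python
import torch
from torch.sparse import to_sparse_semi_structured

# 2:4 semi-structured sparse matrix (two non-zeros in every group of four).
# On ROCm 6.4 with gfx942/gfx950, this path would be served by hipSPARSELt.
A = torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()  # (128, 128), 50% sparse
B = torch.rand(128, 128).half().cuda()

A_sparse = to_sparse_semi_structured(A)   # compress into the 2:4 format
out_sparse = torch.mm(A_sparse, B)        # sparse GEMM (hipSPARSELt / cuSPARSELt)
out_dense = torch.mm(A, B)                # dense reference
torch.testing.assert_close(out_sparse, out_dense, rtol=1e-2, atol=1e-2)
```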

### Dense vs. Sparse Performance Comparison

#### **NT (Row-major)**
**Average Uplift**: `1.20`

| M     | N      | K      | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-------|--------|--------|-------------------------|-------------------------------|--------|
| 14336 | 8      | 4096   | 20.05                   | 25.3                          | 1.26   |
| 4096  | 8      | 14336  | 21.07                   | 25.28                         | 1.20   |
| 3072  | 3072   | 10240  | 299.05                  | 351.82                        | 1.18   |
| 3072  | 1536   | 768    | 18.56                   | 20.05                         | 1.08   |
| 3072  | 17664  | 768    | 163.13                  | 173.91                        | 1.07   |
| 3072  | 196608 | 768    | 1717.30                 | 1949.63                       | 1.14   |
| 3072  | 24576  | 768    | 206.84                  | 242.98                        | 1.17   |
| 3072  | 6144   | 768    | 53.90                   | 56.88                         | 1.06   |
| 3072  | 98304  | 768    | 833.77                  | 962.28                        | 1.15   |
| 768   | 1536   | 768    | 8.53                    | 19.65                         | 2.30   |
| 768   | 17664  | 768    | 46.02                   | 46.84                         | 1.02   |
| 768   | 196608 | 768    | 463.15                  | 540.46                        | 1.17   |
| 768   | 24576  | 768    | 54.32                   | 59.55                         | 1.10   |
| 768   | 6144   | 768    | 19.47                   | 20.15                         | 1.03   |
| 768   | 98304  | 768    | 231.88                  | 258.73                        | 1.12   |

---

#### **NN (Row-major)**
**Average Uplift**: `1.13`

| M   | N      | K     | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-----|--------|-------|-------------------------|-------------------------------|--------|
| 768 | 1536   | 3072  | 27.50                   | 28.78                         | 1.05   |
| 768 | 17664  | 3072  | 125.06                  | 158.94                        | 1.27   |
| 768 | 196608 | 3072  | 1568.38                 | 1767.12                       | 1.13   |
| 768 | 24576  | 3072  | 171.05                  | 203.49                        | 1.19   |
| 768 | 6144   | 3072  | 58.72                   | 60.39                         | 1.03   |
| 768 | 98304  | 3072  | 787.15                  | 887.60                        | 1.13   |

-------------------------

This pull request introduces support for hipSPARSELt in ROCm, alongside various updates and improvements to the codebase and test suite. The changes primarily involve adding configuration flags, updating conditional checks, and ensuring compatibility with hipSPARSELt.

### ROCm and hipSPARSELt Support:

* [`BUILD.bazel`](diffhunk://#diff-7fc57714ef13c3325ce2a1130202edced92fcccc0c6db34a72f7b57f60d552a3R292): Added `@AT_HIPSPARSELT_ENABLED@` substitution to enable hipSPARSELt support.
* [`aten/CMakeLists.txt`](diffhunk://#diff-0604597797bb21d7c39150f9429d6b2ace10b79ab308514ad03f76153ae8249bR104-R110): Introduced a conditional flag to enable hipSPARSELt support based on ROCm version.
* [`aten/src/ATen/CMakeLists.txt`](diffhunk://#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777R37): Added `AT_HIPSPARSELT_ENABLED` configuration.
* [`aten/src/ATen/cuda/CUDAConfig.h.in`](diffhunk://#diff-8bb82da825ca87c28233abacffa1b0566c73a54990b7a77f3f5108d3718fea15R11): Defined `AT_HIPSPARSELT_ENABLED` macro.
* `caffe2/CMakeLists.txt`, `cmake/Dependencies.cmake`, `cmake/public/LoadHIP.cmake`: Included hipSPARSELt in the ROCm dependencies. [[1]](diffhunk://#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1380) [[2]](diffhunk://#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72L1084-R1084) [[3]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5R153)

### Codebase Updates:

* [`aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp`](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6): Added hipSPARSELt support checks and initialization functions. Updated various methods to conditionally handle hipSPARSELt. [[1]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6) [[2]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R22-R67) [[3]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R78-R85) [[4]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R97-R109) [[5]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R183-R188) [[6]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L134-R200) [[7]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R213-R222) [[8]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L217-R285)

### Test Suite Updates:

* [`test/test_sparse_semi_structured.py`](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65): Added checks for hipSPARSELt availability and updated test conditions to skip tests not supported on ROCm. [[1]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65) [[2]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR228) [[3]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR239) [[4]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR250) [[5]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR579) [[6]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR624) [[7]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR661) [[8]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR695) [[9]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR730) [[10]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR755) [[11]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR771) [[12]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR809) [[13]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR844) [[14]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cL840-R854) [[15]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR1005)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150578
Approved by: https://github.com/jeffdaily
2025-05-31 02:03:40 +00:00
zhxchen17
ffd58293f7 [dynamo] Guard serialization for FUNCTORCH_STACK_MATCH (#152616)
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.

## Test Cases:

0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested functionalize
5. torch.compile() nested in vmap + grad
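
A minimal sketch of what test case 1 above exercises (illustrative only, using the eager backend; the actual guard-serialization tests live in the PR):

```python
import torch

@torch.compile(backend="eager")
def g(t):
    return t.sin() + 1

def f(x):
    # The compiled region runs while a vmap interpreter is on the functorch
    # stack, so its guards must capture that state (FUNCTORCH_STACK_MATCH).
    return g(x)

x = torch.randn(4, 8)
print(torch.vmap(f)(x).shape)  # torch.Size([4, 8]) -- compile nested in vmap
```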

Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
2025-05-05 18:05:56 +00:00
Ke Wen
a7f1ddc184 [SymmMem] Experimental NVSHMEM integration (#151261)
Adding NVSHMEM as a backend for `SymmetricMemory`, the implementation of which is in `NVSHMEMSymmetricMemory.cu`.

Moving some helper functions in `CUDASymmetricMemory.cu` to `CUDASymmetricMemoryUtils.cpp`, so that they can be shared by `NVSHMEMSymmetricMemory`. These functions are mostly side-band exchange helpers (`store_all_gather`, `IpcChannel`, etc).

Adding `TORCH_SYMMEM` to control which implementation to use for CUDA tensors; currently supported values: `CUDA` (in-house impl) and `NVSHMEM`.

The NVSHMEM feature is gated by the build-time flag `USE_NVSHMEM=1`, and setting `NVSHMEM_HOME` is required (TODO).
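
A rough usage sketch of these knobs (the NVSHMEM install path below is a placeholder, and whether the env var is read at import time or at first rendezvous is not specified here):

```python
# Build time: compile PyTorch with NVSHMEM support (paths are placeholders), e.g.
#   USE_NVSHMEM=1 NVSHMEM_HOME=/opt/nvshmem python setup.py develop
import os

# Runtime: choose which SymmetricMemory implementation backs CUDA tensors.
# "CUDA" selects the in-house implementation, "NVSHMEM" the new backend.
os.environ.setdefault("TORCH_SYMMEM", "NVSHMEM")

import torch  # noqa: E402 -- imported after setting the env var
```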

Ported most code from #146593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151261
Approved by: https://github.com/fegin, https://github.com/fduwjj
2025-05-01 05:24:50 +00:00
Jared Hance
055e59e709 [bazel] Build flatbuffers within bazel (#151364)
This is similar to how we handle protobufs and it makes it more convenient for bazel users to handle their version of flatbuffers. (Flatbuffers is very picky about the generated code matching the runtime). Instead of using the checked in generated code, we generate it on the fly.

This is relevant to https://github.com/pytorch/pytorch/issues/112903, because having the version of flatbuffers tied to pytorch will make pytorch difficult to use as an external workspace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151364
Approved by: https://github.com/malfet
2025-04-17 18:33:51 +00:00
Huy Do
681894546b Fix bazel job after #144489 (#146840)
This is currently failing in trunk with the following error https://github.com/pytorch/pytorch/actions/runs/13246034191/job/36972742610

### Testing

Bazel job passing https://github.com/pytorch/pytorch/actions/runs/13247495161/job/36977571965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146840
Approved by: https://github.com/atalman
2025-02-10 22:17:36 +00:00
Nikhil Gupta
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to a KleidiAI clone issue.

1. This reverts commit 0940eb6d44 (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
albanD
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
PyTorch MergeBot
6c713ccb5e Revert "Make functionalization ViewMeta serializable with pickle. (#143712)"
This reverts commit b8abdaa286.

Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))
2025-01-17 00:52:50 +00:00
Yukio Siraichi
b8abdaa286 Make functionalization ViewMeta serializable with pickle. (#143712)
Fix: #141974

This PR makes the `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that once instantiated
`ViewMeta` should now create a new specialized class that inherits from
`ViewMeta`. Therefore, this PR also uses codegen for creating these
specializations.

In summary, these are the changes this PR introduces:

- `ViewMeta` is turned into an abstract class (see
  _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
  functions that need to be implemented. `to_out_index` should be
  implemented by operations that might return more than 1 output.

- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
  created (see _FunctionalizeFallbackKernel.h_).

- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
  declaration and definition of the `ViewMeta` specializations, which
  are automatically generated in the ATen codegen (see _gen.py_).

- New `_functionalization` Python sub-module is created (see
  _Module.cpp_). It serves as namespace for the `ViewMeta`
  specializations and `InverseReturnMode` enum.

- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
  the automatically generated Python bindings for the `ViewMeta`
  specialization, which are generated in the torch codegen (see
  _generate_code.py_).

Note that this PR makes use of codegen at 2 different moments:

- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generates the Python bindings for
  them.
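
A minimal Python sketch of the pattern described above (illustrative only: the real classes are C++, generated by the ATen/torch codegen, and `NarrowViewMeta` and its fields are hypothetical names):

```python
import pickle
from abc import ABC, abstractmethod

class ViewMeta(ABC):
    """Abstract base: each view op supplies its own forward/reverse pair."""

    @abstractmethod
    def forward(self, base):
        ...

    @abstractmethod
    def reverse(self, base, mutated_view):
        ...

class NarrowViewMeta(ViewMeta):
    """Specialization for a narrow() view; pickles via its constructor args."""

    def __init__(self, dim, start, length):
        self.dim, self.start, self.length = dim, start, length

    def forward(self, base):
        return base.narrow(self.dim, self.start, self.length)

    def reverse(self, base, mutated_view):
        out = base.clone()
        out.narrow(self.dim, self.start, self.length).copy_(mutated_view)
        return out

# Because a specialization is a plain class holding only its arguments,
# a ViewMeta sequence built from such objects can round-trip through pickle.
meta = NarrowViewMeta(dim=0, start=1, length=2)
assert pickle.loads(pickle.dumps(meta)).start == 1
```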

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
2025-01-16 19:41:41 +00:00
Scott Wolchok
b46d00c1b7 Shard RegisterDispatchKey (#144364)
Should fix https://github.com/pytorch/pytorch/issues/143952 .

Testing: built PyTorch on Raspberry Pi 5; this seemed to alleviate the high peak memory requirement. (I did increase shard counts for other generated files along the way, but I need to go back and figure out how much of that was strictly necessary vs. needing to use -j1 or -j2.)

Differential Revision: [D67925496](https://our.internmc.facebook.com/intern/diff/D67925496/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144364
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #144363
2025-01-10 18:21:19 +00:00
Catherine Lee
bb5e439f2d Add networkx as bazel dep to fix CI failure (#143995)
Add networkx as a dependency for test_bazel

Example failure: https://github.com/pytorch/pytorch/actions/runs/12551752021/job/34996706301

```

INFO: From Testing //:test_bazel:
==================== Test output for //:test_bazel:
Traceback (most recent call last):
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 33, in <module>
    test_simple_compile_eager()
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 27, in test_simple_compile_eager
    opt_foo1 = torch.compile(foo, backend="eager")
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2533, in compile
    backend = _TorchCompileWrapper(backend, mode, options, dynamic)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2342, in __init__
    self.compiler_fn = lookup_backend(backend)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 66, in lookup_backend
    _lazy_import()
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 102, in _lazy_import
    import_submodule(backends)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/utils.py", line 2797, in import_submodule
    importlib.import_module(f"{mod.__name__}.{filename[:-3]}")
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/execroot/pytorch/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/common.py", line 12, in <module>
    from torch._functorch.aot_autograd import (
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/aot_autograd.py", line 147, in <module>
    from .partitioners import default_partition
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/partitioners.py", line 31, in <module>
    from ._activation_checkpointing.graph_info_provider import GraphInfoProvider
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/_activation_checkpointing/graph_info_provider.py", line 3, in <module>
    import networkx as nx
ModuleNotFoundError: No module named 'networkx'
```

No periodic runs on this PR or its main branch commit, but I'm pretty sure it started with https://togithub.com/pytorch/pytorch/pull/143539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143995
Approved by: https://github.com/huydhn
2025-01-02 19:42:18 +00:00
Nikhil Gupta
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the in_features and out_features.
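
A rough sketch of steps 1-2 above (per-channel symmetric int4 quantization with two nibbles packed per uint8). This helper is illustrative only; the exact packed layout and the scales/zeros format expected by `torch.ops.aten._dyn_quant_pack_4bit_weight` are defined by that op (see the API Usage link below), not by this sketch:

```python
import torch

def quantize_4bit_channelwise(w: torch.Tensor):
    """Symmetric per-channel int4 quantization; packs two 4-bit values per uint8."""
    out_features, in_features = w.shape
    assert in_features % 2 == 0
    scales = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0  # per output channel
    q = torch.round(w / scales).clamp(-8, 7).to(torch.int8)           # signed int4 range
    q_u = (q + 8).to(torch.uint8)                                     # shift to 0..15 for packing
    packed = (q_u[:, 0::2] & 0xF) | ((q_u[:, 1::2] & 0xF) << 4)       # two nibbles per byte
    return packed, scales

w = torch.randn(64, 128)
packed, scales = quantize_4bit_channelwise(w)
print(packed.shape, packed.dtype, scales.shape)  # torch.Size([64, 64]) torch.uint8 torch.Size([64, 1])
```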

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
PyTorch MergeBot
8136daff5a Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit 4b82251011.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))
2024-12-19 23:33:17 +00:00
Nikhil Gupta
4b82251011 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-19 18:51:26 +00:00
PyTorch MergeBot
14fe1f7190 Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit d3ff2d42c2.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))
2024-12-19 01:05:11 +00:00
Nikhil Gupta
d3ff2d42c2 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-18 22:30:07 +00:00
PyTorch MergeBot
964655bf0c Revert "Remove THC from OSS build (#134969)"
This reverts commit 9c7660be0e.

Reverted https://github.com/pytorch/pytorch/pull/134969 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking the installation of https://github.com/facebookresearch/detectron2/blob/main/detectron2/layers/csrc/deformable/deform_conv_cuda_kernel.cu#L76 ([comment](https://github.com/pytorch/pytorch/pull/134969#issuecomment-2499275378))
2024-11-26 00:00:12 +00:00
cyy
9c7660be0e Remove THC from OSS build (#134969)
THC is not used in the OSS version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134969
Approved by: https://github.com/albanD
2024-11-25 00:39:42 +00:00
Mengwei Liu
a02e88d19c [miniz] Bump miniz version to 3.0.2 and add patch for zip64 (#140041)
Summary:
Bump miniz version from 2.1.0 to 3.0.2 and apply these patches:

* #79636 patches internal BUCK and bazel build
* #138959 adds `bool compute_crc32` argument
* miniz PR: https://github.com/richgel999/miniz/pull/324 to support
  zip64

Anyone bumping miniz version again, please apply these patches as well.

Test Plan:
Rely on unit test

Imported from OSS

Differential Revision: D65586230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140041
Approved by: https://github.com/mikaylagawarecki
2024-11-09 00:13:16 +00:00
Yifu Wang
1659e241c8 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Differential Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-08 23:28:25 +00:00
PyTorch MergeBot
36e0f119d0 Revert "[experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)"
This reverts commit 5203138483.

Reverted https://github.com/pytorch/pytorch/pull/139227 on behalf of https://github.com/yifuwang due to Need to address internal build failure D65605027 ([comment](https://github.com/pytorch/pytorch/pull/139227#issuecomment-2463204467))
2024-11-07 21:01:36 +00:00
Yifu Wang
5203138483 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-07 03:43:12 +00:00
PyTorch MergeBot
ec5fbee6c0 Revert "Drop caffe2 string_utils (#139217)"
This reverts commit 1797a2035d.

Reverted https://github.com/pytorch/pytorch/pull/139217 on behalf of https://github.com/huydhn due to Chatting with @r-barnes, this is still used in lots of place internally ([comment](https://github.com/pytorch/pytorch/pull/139217#issuecomment-2448568071))
2024-10-30 22:23:32 +00:00
Richard Barnes
1797a2035d Drop caffe2 string_utils (#139217)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139217
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2024-10-30 01:13:16 +00:00
PyTorch MergeBot
564d00f364 Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
This reverts commit 7cfd23636c.

Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))
2024-09-13 16:42:37 +00:00
cyy
7cfd23636c Fix clang-tidy warnings in Caffe2 code (#134935)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935
Approved by: https://github.com/ezyang
2024-09-12 03:27:09 +00:00
CaoE
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
Ke Wen
3645634f3c [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management built for the NCCL pre- and post-kernels.
Also, on visualizers they then show up in the NCCL stream line. Otherwise, if they show up in the compute line, users may get confused (my code does not have these kernels).

The check is thus moved after the point where the NCCL stream is made to depend on the last compute kernel.

Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
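
A minimal sketch of the stream-ordering idea described above, using generic PyTorch CUDA stream APIs; `comm_stream` below merely stands in for the NCCL stream and is not ProcessGroupNCCL's internal handle:

```python
import torch

compute_stream = torch.cuda.current_stream()
comm_stream = torch.cuda.Stream()  # stands in for the NCCL stream

t = torch.randn(1 << 20, device="cuda")  # tensor produced by the last compute kernel

# Make the comm stream depend on the last compute kernel, then run the NaN
# check *on the comm stream*, so its ordering/lifetime matches the NCCL
# pre/post kernels and it shows up on the NCCL line in profilers.
comm_stream.wait_stream(compute_stream)
with torch.cuda.stream(comm_stream):
    has_nan = torch.isnan(t).any()
    # ... the NCCL collective would be enqueued here on the same stream ...
assert not has_nan.item()
```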

Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 08:28:49 +00:00
PyTorch MergeBot
cbf5ba1e97 Revert "[1/N] Move NaN check onto NCCL stream (#134300)"
This reverts commit 94caba4899.

Reverted https://github.com/pytorch/pytorch/pull/134300 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
Ke Wen
94caba4899 [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management built for the NCCL pre- and post-kernels.
Also, on visualizers they then show up in the NCCL stream line. Otherwise, if they show up in the compute line, users may get confused (my code does not have these kernels).

The check is thus moved after the point where the NCCL stream is made to depend on the last compute kernel.

Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-27 16:02:27 +00:00
cyyever
c638a40a93 [Caffe2] Remove unused AVX512 code (#133160)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133160
Approved by: https://github.com/albanD
2024-08-23 23:16:16 +00:00
Yifu Wang
78d69bfe11 [SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)
### Summary
- Added multicast support to SymmetricMemory. If the CUDA runtime and CUDA driver have multicast support, SymmetricMemory associates all peer buffers with a multicast object and exposes the multicast virtual address.
- Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants show different performance characteristics for different message sizes. We plan to use Inductor for collective algo selection (and the required symmetric memory buffer allocation).

### Benchmark

8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support.

![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947)

![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe)

Differential Revision: [D61682507](https://our.internmc.facebook.com/intern/diff/D61682507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424
Approved by: https://github.com/yf225, https://github.com/weifengpy
2024-08-23 20:09:20 +00:00
PyTorch MergeBot
cedfac20c7 Revert "[SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)"
This reverts commit 66d3eb783c.

Reverted https://github.com/pytorch/pytorch/pull/133424 on behalf of https://github.com/jeanschmidt due to Broke internal ADS builds, see D61611517 ([comment](https://github.com/pytorch/pytorch/pull/133424#issuecomment-2304676328))
2024-08-22 13:29:27 +00:00
Yifu Wang
66d3eb783c [SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)
### Summary
- Added multicast support to SymmetricMemory. If the CUDA runtime and CUDA driver have multicast support, SymmetricMemory associates all peer buffers with a multicast object and exposes the multicast virtual address.
- Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants show different performance characteristics for different message sizes. We plan to use Inductor for collective algo selection (and the required symmetric memory buffer allocation).

### Benchmark

8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support.

![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947)

![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424
Approved by: https://github.com/yf225, https://github.com/weifengpy
2024-08-21 05:11:21 +00:00
cyy
05e8e87a69 [Submodule] Remove foxi (#132976)
It is not used after removal of Caffe2 code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132976
Approved by: https://github.com/ezyang
2024-08-09 03:46:52 +00:00
PyTorch MergeBot
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9d.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
Mikayla Gawarecki
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
PyTorch MergeBot
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
Mikayla Gawarecki
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
Chirag Pandya
64f1111d38 Expose nlohmann json to torch (#129570)
Summary:

Expose the nlohmann json library so that it can be used from inside PyTorch. The library already exists in the `third_party` directory. This PR makes the `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D59035246

Pulled By: c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k, https://github.com/malfet
2024-06-26 21:59:26 +00:00
Yifu Wang
217aac96d7 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-21 08:49:11 +00:00
PyTorch MergeBot
63a724d8e1 Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 8771e3429c.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2181656181))
2024-06-20 22:31:29 +00:00
Yifu Wang
8771e3429c Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-19 03:38:58 +00:00
PyTorch MergeBot
77830d509f Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 7a39755da2.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2176685232))
2024-06-18 18:11:43 +00:00
Yifu Wang
7a39755da2 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-15 10:20:21 +00:00
cyy
3008644297 [Caffe2] Remove remaining unused perfkernels (#128477)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128477
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-12 22:19:36 +00:00
cyy
2126ae186e Remove caffe2/perfkernels files (#128186)
These files are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128186
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-10 23:40:18 +00:00
Tristan Rice
597922ba21 Reapply "distributed debug handlers (#126601)" (#127805)
This reverts commit 7646825c3e.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805
Approved by: https://github.com/PaliC
2024-06-04 19:44:30 +00:00