Commit Graph

420 Commits

Author SHA1 Message Date
PyTorch MergeBot
5ae9daa4a2 Revert "[AOTI] support freezing for MKLDNN (#124350)"
This reverts commit 654afb6f3a.

Reverted https://github.com/pytorch/pytorch/pull/124350 on behalf of https://github.com/clee2000 due to Seems to have broken inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_freezing_non_abi_compatible_cpu 654afb6f3a https://github.com/pytorch/pytorch/actions/runs/9224838183/job/25382780192 ([comment](https://github.com/pytorch/pytorch/pull/124350#issuecomment-2129889809))
2024-05-24 16:03:07 +00:00
Wu, Chunyuan
654afb6f3a [AOTI] support freezing for MKLDNN (#124350)
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451.

This PR requires https://github.com/pytorch/pytorch/pull/122472 to land firstly.

We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during the compilation and restore the opaque tensor when loading the compiled .so.
ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time.

### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```

### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` will cause performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor which uses `TORCH_CHECK`. This may need further discussion how to address (`AOTI_TORCH_CHECK` is introduced in
 https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI compatible mode will work with the support in this PR. While for ABI compatible mode, we need to firstly address this issue: `AssertionError: None, i.e. optional output is not supported`.
6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-05-24 13:34:04 +00:00
PyTorch MergeBot
1b29c16e5e Revert "Introduce ProcessGroupCudaP2P (#122163)"
This reverts commit 2dd2699860.

Reverted https://github.com/pytorch/pytorch/pull/122163 on behalf of https://github.com/jithunnair-amd due to This is breaking ROCm distributed CI on trunk ([comment](https://github.com/pytorch/pytorch/pull/122163#issuecomment-2127518473))
2024-05-23 16:06:14 +00:00
Yifu Wang
2dd2699860 Introduce ProcessGroupCudaP2P (#122163)
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via
Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.

The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.

To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.

*NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved.*

## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose.

Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">

Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">

## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.

`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
    # Using ProcessGroupCudaP2P
    dist.init_process_group(backend="cuda_p2p", ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
    pg_options = ProcessGroupCudaP2P.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
    pg_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying both
    # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
    pg_options = ProcessGroupCudaP2P.Options()
    pg_options.nccl_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Down-casting the backend to access p2p buffers for cuda_p2p specific
    # optimizations
    if is_cuda_p2p_group(group):
        backend = get_cuda_p2p_backend(group)
        if required_p2p_buffer_size > backend.get_buffer_size():
            # fallback
        p2p_buffer = backend.get_p2p_buffer(...)
    else:
        # fallback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
2024-05-22 09:33:05 +00:00
Tristan Rice
ac51920656 Reapply "c10d: add Collectives abstraction (#125978)" (#126695)
This reverts commit d9c3485146.

Reapplies #125978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126695
Approved by: https://github.com/c-p-i-o
2024-05-21 18:00:09 +00:00
PyTorch MergeBot
d9c3485146 Revert "c10d: add Collectives abstraction (#125978)"
This reverts commit 4b2ae2ac33.

Reverted https://github.com/pytorch/pytorch/pull/125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/125978#issuecomment-2119858015))
2024-05-20 07:40:41 +00:00
Tristan Rice
4b2ae2ac33 c10d: add Collectives abstraction (#125978)
This adds a new `Collectives` API for doing distributed collectives operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debugable primitives.

Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit

The standard implementation is using `StoreCollectives` but other more performant backends will be added in a follow up PR.

Test plan:

```
python test/distributed/test_collectives.py -v
```

This tests both functionality using multiple threads as well as timeout behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125978
Approved by: https://github.com/shuqiangzhang
2024-05-17 05:09:11 +00:00
Shuqiang Zhang
c860df5a9d [c10d] Add an option for NAN check on every collective (#125726)
Summary:
The NAN CHECK is done through device side assert without copying needed
from GPU to CPU
Test Plan:
Unit test for collectives that should experience run time error

(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

.
----------------------------------------------------------------------
Ran 1 test in 7.723s

OK

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726
Approved by: https://github.com/kwen2501
2024-05-16 04:35:15 +00:00
haozhe.zhu
f9d107af66 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-16 01:11:51 +00:00
FEI
b950217f19 Support third-party devices emit a range for each autograd operator (#125822)
Fixes #125752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125822
Approved by: https://github.com/aaronenyeshi
2024-05-15 05:06:24 +00:00
PyTorch MergeBot
20aa7cc678 Revert "[c10d] Add an option for NAN check on every collective (#125726)"
This reverts commit 6db3271007.

Reverted https://github.com/pytorch/pytorch/pull/125726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new test is failing on both multigpu and rocm distributed, i.e. c712b0f8a3 ([comment](https://github.com/pytorch/pytorch/pull/125726#issuecomment-2110646075))
2024-05-14 16:26:34 +00:00
Wang, Eikan
d1f254dce8 Add a cache mechanism to accelerate torch.compile-for-eager (#116368)
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.

In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.

The persistent config consists of two parts - meta_info and kernel_path.

- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.

When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map;  the meta_info is key, and the kernel library is the value.

Currently, this PR only supports static shape to guard the kernel.

Take a `mul` as an example.
```python
class MulKernel:
  def __init__(self) -> None:
    pass

  def __call__(self, *args: Any, **kwargs: Any) -> Any:
    with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
      opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
          "aot_inductor.eager_mode": True,
          "aot_inductor.eager_op_name": "mul_Tensor"
        }
      )
      return opt_fn(*args, **kwargs)

torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")

_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
  reg_name = f"{reg_name}.{schema.overload_name}"
  torch_compile_op_lib_impl.impl(
    reg_name,
    MulKernel(),
    "CUDA",
    compile_mode=True)

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
    fn(a, b)

# Collect performance
beg = time.time()
for _ in range(iter):
    fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.

```json
[
    {
        "meta_info": [
            {
                "is_symbolic": false,
                "device_type": "cuda",
                "dtype": "torch.float32",
                "sizes": [1024, 1024],
                "strides": [1024, 1]
            },
            {
                "is_symbolic": false,
                "device_type": "cuda",
                "dtype": "torch.float32",
                "sizes": [1024, 1024],
                "strides": [1024, 1]
            }
        ],
        "kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
    }
]
```

Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.

- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms

Hardware:

- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10

Software:

- PyTorch Version: 39df084001
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0

Differential Revision: [D57216427](https://our.internmc.facebook.com/intern/diff/D57216427)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116368
Approved by: https://github.com/jansel, https://github.com/atalman
2024-05-14 15:43:48 +00:00
Shuqiang Zhang
6db3271007 [c10d] Add an option for NAN check on every collective (#125726)
Summary:
The NAN CHECK is done through device side assert without copying needed
from GPU to CPU
Test Plan:
Unit test for collectives that should experience run time error

(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

.
----------------------------------------------------------------------
Ran 1 test in 7.723s

OK

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726
Approved by: https://github.com/kwen2501
2024-05-14 00:05:41 +00:00
PyTorch MergeBot
bd3cbdba2f Revert "[optim] add fused_adagrad support for CPU device (#124905)"
This reverts commit 1c3fe84033.

Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk 1c3fe84033 ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))
2024-05-13 20:53:22 +00:00
haozhe.zhu
1c3fe84033 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-13 01:16:20 +00:00
Wang, Eikan
978b572652 Add registration API for torch.compile-eager (#121387)
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.

In this PR, we intend to provide a registration API dedicated to eager-through-torch.compile. The major workflow of this API will be as follows.

- Load cache
- Check cache according to the input tensors
  - Cache Hit: Run the cached kernel directly
  - Cache Miss: Run the AOTI to produce kernel and run the produced kernel. If AOTI fails to produce the kernel, invoke the python fallback function.

Currently, this PR always fallback to python kernel now and cache mechanism will be implemented in another PR - https://github.com/pytorch/pytorch/pull/116368

Differential Revision: [D57164385](https://our.internmc.facebook.com/intern/diff/D57164385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121387
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/zou3519, https://github.com/jgong5
2024-05-10 00:30:27 +00:00
PyTorch MergeBot
ca0f070065 Revert "Add registration API for torch.compile-eager (#121387)"
This reverts commit 61e937f3d6.

Reverted https://github.com/pytorch/pytorch/pull/121387 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/121387#issuecomment-2087541956))
2024-04-30 22:13:04 +00:00
Wang, Eikan
61e937f3d6 Add registration API for torch.compile-eager (#121387)
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.

In this PR, we intend to provide a registration API dedicated to eager-through-torch.compile. The major workflow of this API will be as follows.

- Load cache
- Check cache according to the input tensors
  - Cache Hit: Run the cached kernel directly
  - Cache Miss: Run the AOTI to produce kernel and run the produced kernel. If AOTI fails to produce the kernel, invoke the python fallback function.

Currently, this PR always fallback to python kernel now and cache mechanism will be implemented in another PR - https://github.com/pytorch/pytorch/pull/116368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121387
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/zou3519, https://github.com/jgong5
2024-04-27 12:49:58 +00:00
egienvalue
73744a2c00 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-26 16:17:54 +00:00
PyTorch MergeBot
e04c7b19f4 Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit 381653de63.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
egienvalue
381653de63 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-24 20:51:20 +00:00
egienvalue
408aa0182c Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> None: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream.
- currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag.
- elapsedTime API is added to c10::Event

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-04-24 20:51:17 +00:00
haozhe.zhu
3c964ad1ca add fused_sgd_kernel support for CPU device (#123629)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9
https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_sgd time: 0.2301 seconds
_fused_sgd time: 0.0925 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_sgd time: 2.6195 seconds
_fused_sgd time: 1.7543 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```
Looks like we already have some PRs under this issue https://github.com/pytorch/pytorch/issues/123451 to unified the UTs, I did not modified UT in this PR.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-23 08:28:19 +00:00
Ashwin Hari
5f5778476a rename ort to maia (#123265)
Fixes #123264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265
Approved by: https://github.com/albanD
2024-04-23 00:33:25 +00:00
PyTorch MergeBot
0feab7d6c3 Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)"
This reverts commit cb17721899.

Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
PyTorch MergeBot
929242a15c Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit d7e1bf9ff9.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
Jane Xu
b412b75b42 [optim] add fused_adam/adamw_kernel support for CPU device (#123074)
On par with `CUDA` implementation.

For `autocast` logic, same with `CUDA` + `Fused Adam`:
 - check inf in `gradscalar.step`
 - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param.

**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused

# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict

# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```

**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)

```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```

**Note: Fused kernel accuracy**
The accuracy failure in CI shows a little higher than default tolerance
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debug it step by step and unfortunately we may not able to make the `fused kernel` exactly same with `non fused` one due to compiler optimizations.
For example, in non-fused impl
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in fused impl
```
  exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
  //  std::cout << "exp_avg_sq " <<   exp_avg_sq_ptr[d] << std::endl;
  exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
      scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep `std::cout`, I can get exactly same results in UT
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment out it, there will be a difference
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than default one.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-19 11:14:04 +00:00
egienvalue
d7e1bf9ff9 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------
@exported-using-ghexport

Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-18 17:38:06 +00:00
egienvalue
cb17721899 Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> None: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream.
- currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag.
- elapsedTime API is added to c10::Event

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
2024-04-18 17:35:09 +00:00
blegouix
ccfc87b199 include scheduler_on_plateau in optim.h (#121722)
Fixes #121593
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121722
Approved by: https://github.com/albanD
2024-03-27 19:45:25 +00:00
mingfeima
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
mingfeima
a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speedup https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00
PyTorch MergeBot
0c07c0c15f Revert "add int4 packed gemm support on CPU device (#117475)"
This reverts commit 30befa592e.

Reverted https://github.com/pytorch/pytorch/pull/117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/117475#issuecomment-1977474686))
2024-03-04 21:20:57 +00:00
mingfeima
30befa592e add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speedup https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-02 00:17:34 +00:00
Simon Fan
82b356193d Move VariableInfo into its own file to avoid circular dependency (#120732)
VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). Another way could have been to make a `compiled_autograd.cpp` and forward declare VariableInfo, but this VariableInfo was also being used in other nodes like PyNode so it felt cleaner to do it this way.

Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732
Approved by: https://github.com/jansel
2024-03-01 08:48:13 +00:00
soulitzer
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
Yu, Guangye
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event` which handles the status of an operation that is being executed. Typically, in some circumstances, we can fine-grain control of the operation execution via `Event`.

# Design
`XPUEvent` is a movable but not a copyable wrapper around sycl event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or all the submitted kernels on an `XPUStream` to complete. Align to the other backend, the C++ files related to `Event` will be placed in `aten/src/ATen/xpu` folder. For frontend code, `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and Python code will be placed in `torch/xpu/streams.py` respectively.

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA.

lack of the below APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
Joel Schlosser
9ec8dd2467 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-14 22:00:43 +00:00
Jeff Daily
0e6eee3c89 [ROCm] TunableOp (#114894)
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one know which implementation is the fastest and should be chosen? That's what TunableOp provides.

See the README.md for additional details.

TunableOp was ported from onnxruntime starting from commit 08dce54266.  The content was significantly modified and reorganized for use within PyTorch.  The files copied and their approximate new names or source content location within aten/src/ATen/cuda/tunable include the following:

- onnxruntime/core/framework/tunable.h -> Tunable.h
- onnxruntime/core/framework/tuning_context.h -> Tunable.h
- onnxruntime/core/framework/tuning_context_impl.h -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/gemm_common.h -> GemmCommon.h
- onnxruntime/core/providers/rocm/tunable/gemm_hipblaslt.h -> GemmHipblaslt.h
- onnxruntime/core/providers/rocm/tunable/gemm_rocblas.h -> GemmRocblas.h
- onnxruntime/core/providers/rocm/tunable/gemm_tunable.cuh -> TunableGemm.h
- onnxruntime/core/providers/rocm/tunable/rocm_tuning_context.cc -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/util.h -> StreamTimer.h
- onnxruntime/core/providers/rocm/tunable/util.cc -> StreamTimer.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114894
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-02-14 19:03:49 +00:00
albanD
ca777fbbb7 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-02-13 23:15:24 +00:00
PyTorch MergeBot
214f06ae3a Revert "Add Accelerator device and shell hooks (#119329)"
This reverts commit 4b9568a360.

Reverted https://github.com/pytorch/pytorch/pull/119329 on behalf of https://github.com/huydhn due to Breaks internal build and requires OSS file update to fix it ([comment](https://github.com/pytorch/pytorch/pull/119329#issuecomment-1940278598))
2024-02-13 02:23:45 +00:00
PyTorch MergeBot
9f8c84a399 Revert "Add missing include for internal build (#119721)"
This reverts commit e0cabebad9.

Reverted https://github.com/pytorch/pytorch/pull/119721 on behalf of https://github.com/huydhn due to This fixes the build failures but there is still an issue with the missing libcaffe2_torch_fb_sparsenn_sparsenn_operators_gpu.so on D53686094 ([comment](https://github.com/pytorch/pytorch/pull/119721#issuecomment-1940191340))
2024-02-13 01:56:12 +00:00
albanD
e0cabebad9 Add missing include for internal build (#119721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119721
Approved by: https://github.com/huydhn
2024-02-12 22:36:16 +00:00
PyTorch MergeBot
24bdd03d23 Revert "Reify view_func() closures as ViewFuncs (#118404)"
This reverts commit d5a6762263.

Reverted https://github.com/pytorch/pytorch/pull/118404 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/118404#issuecomment-1938600260))
2024-02-12 12:38:51 +00:00
Yu, Guangye
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers stream-related APIs, including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related with `Event`.

The differences with CUDA:
no default and external stream in XPU and lack of below APIs:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `toch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
albanD
4b9568a360 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang
2024-02-09 18:54:28 +00:00
Joel Schlosser
d5a6762263 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-09 18:51:36 +00:00
Yu, Guangye
5c46600f84 [RELAND] refactor lazy init to device-agnostic (#119248)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No need more UTs.
This is a reland PR, the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846).
This is a common PR, and does not trigger xpu ciflow.

Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman
2024-02-07 15:58:51 +00:00
William Wen
ae4e866bba [dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438)
Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090.

Changes:
- Move CacheEntry and ExtraState to C++
- Use pybind to control reference counting
- Use std::list instead of manually implementing a linked list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438
Approved by: https://github.com/jansel
2024-02-06 20:48:11 +00:00
PyTorch MergeBot
ab613a4019 Revert "refactor lazy init to device-agnostic (#118846)"
This reverts commit 520771d7b3.

Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11  ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))
2024-02-05 18:06:30 +00:00
Yu, Guangye
520771d7b3 refactor lazy init to device-agnostic (#118846)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No need more UTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
2024-02-02 12:10:39 +00:00
Yu, Guangye
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR  covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in python frontend, including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`

# Additional Context
We will implement the support of lazy initialization in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00
CaoE
29516bd2a0 add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
Step1 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-16 15:25:08 +00:00
Bin Bao
79ba39710e [AOTI] Forward fix a Windows build failure (#116790)
Summary: forward fix https://github.com/pytorch/pytorch/pull/116269
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116790
Approved by: https://github.com/khabinov, https://github.com/huydhn
2024-01-05 06:00:58 +00:00
Bin Bao
70f3a530d7 [AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269)
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.

Test Plan: CI

Reviewed By: khabinov

Differential Revision: D52339116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
2024-01-04 18:58:24 +00:00
Bin Bao
2dce364634 [AOTI][refactor] Remove model_container_runner_cuda.cpp (#116113)
Differential Revision: [D52301272](https://our.internmc.facebook.com/intern/diff/D52301272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116113
Approved by: https://github.com/khabinov
ghstack dependencies: #116047
2023-12-21 14:56:25 +00:00
Bin Bao
fabf9433e7 [AOTI][refactor] Organize model runner files (#116022)
Summary: Move runner util files into a subdirectory and put AOTIModelContainerRunnerCpu into a separate file

Differential Revision: [D52300693](https://our.internmc.facebook.com/intern/diff/D52300693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116022
Approved by: https://github.com/khabinov
2023-12-20 15:35:34 +00:00
Yifu Wang
6e1ba79b7f [re-land] Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001) (#116125)
This is an attempt to re-land https://github.com/pytorch/pytorch/pull/114001. The previous attempt used `std::array` in cuda kernels which wasn't compatible with Meta's internal build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116125
Approved by: https://github.com/yf225
2023-12-20 07:13:50 +00:00
PyTorch MergeBot
91e184fd74 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit 4edc921857.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/jeanschmidt due to Breaking multiple internal tests, might be flakiness but multiple retries did not elicit an improvement, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1863036417))
2023-12-19 16:01:19 +00:00
Yifu Wang
4edc921857 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultanesouly read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initialize a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topoloies from the nNVLink connection mesh:
- Fully connected: all GPU pairs has direct NVLink connection (e.g. NVSwitch or fully connected sub-set of hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  -  `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fallback to NCCL.

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. Accroding to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` enable device-level peer access.
  - PyTorch uses multiple process to drive multiple devices. With cuda IPC, a device can only share a specific region to other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-15 08:17:35 +00:00
PyTorch MergeBot
7ecddaef23 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit adfbd2b219.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/atalman due to OSSCI oncall, breaks periodic jobs ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1856539040))
2023-12-14 20:33:10 +00:00
Yifu Wang
adfbd2b219 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultanesouly read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initialize a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topoloies from the nNVLink connection mesh:
- Fully connected: all GPU pairs has direct NVLink connection (e.g. NVSwitch or fully connected sub-set of hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  -  `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fallback to NCCL.

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. Accroding to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` enable device-level peer access.
  - PyTorch uses multiple process to drive multiple devices. With cuda IPC, a device can only share a specific region to other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-14 08:13:08 +00:00
soulitzer
4d8ad4fb82 Move SingletonSymNodeImpl from c10 to aten (#114895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114895
Approved by: https://github.com/jbschlosser
2023-12-13 20:01:18 +00:00
Antonio Kim
7fc292930c Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-21 23:07:21 +00:00
Mu-Chu Lee
eddce3c054 [AOTInductor] Rename model_runner to model_container_runner (#111324)
Summary:
We rename the model_runner to model_container_runner to prepare for
adding tests of pure model without container.

Test Plan:
commit itself is a test.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111324
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-11-16 19:14:22 +00:00
PyTorch MergeBot
252e68a83b Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 54493fe8c4.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))
2023-11-15 00:51:23 +00:00
Antonio Kim
54493fe8c4 Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-13 23:18:14 +00:00
PyTorch MergeBot
9a28a7b498 Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 27e31ab6e8.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1799003164))
2023-11-07 15:53:32 +00:00
soulitzer
d5007d8d8e Split out input_metadata.cpp from input_metadata.h (#113031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113031
Approved by: https://github.com/albanD
2023-11-07 00:03:21 +00:00
Antonio Kim
27e31ab6e8 Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-06 21:27:02 +00:00
jjsjann123
9d23440c81 Nvfuser code base nuke (#111447)
removing nvfuser code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111447
Approved by: https://github.com/albanD
2023-11-01 20:53:14 +00:00
Yifu Wang
ec18ef62f4 Native c10d_functional ops (#110570)
This PR introduces a native version of c10d_functional ops. The main goal is to add collective support in AOTInductor and allow collective ops to work in multi-threaded native runtimes.

The native version also incorporated API improvements we wished to implement in Python c10d_functional:

- Removed `ranks` and `group_size` from collective op signatures which were proven to be redundant.
- Use tensor storage as opposed to `void*` to resolve in-flight work.

The native process group registration/resolution mechansim is only used for native c10d_functional in the PR. It will become the single source of truth in upcoming PRs.

The upcoming PRs will implement Inductor/AOTInductor support for c10d_functional, after which native c10d_functional will replace Python c10d_functional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110570
Approved by: https://github.com/wanchaol
2023-10-25 22:56:06 +00:00
PyTorch MergeBot
5ce8002d24 Revert "Remove deprecated fbgemm operators (#104535)"
This reverts commit 57c7aa12db.

Reverted https://github.com/pytorch/pytorch/pull/104535 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/104535#issuecomment-1779650412))
2023-10-25 16:34:16 +00:00
jjsjann123
39c09d4da6 Revert "Revert "Nvfuser code removal (#111093)"" (#111604)
This reverts commit 715dfced72.

The original PR #111093 is reverted due to broken internal build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111604
Approved by: https://github.com/davidberard98
2023-10-23 18:32:41 +00:00
Peter Bell
57c7aa12db Remove deprecated fbgemm operators (#104535)
These operators are not used and have been deprecated since #72690 (Feb 2022). Additionally, the `torch.jit.quantized` interface has been deprecated since #40102 (June 2020).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104535
Approved by: https://github.com/ezyang
2023-10-22 06:10:09 +00:00
PyTorch MergeBot
715dfced72 Revert "Nvfuser code removal (#111093)"
This reverts commit 572628e520.

Reverted https://github.com/pytorch/pytorch/pull/111093 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, @albanD please help to support the author with the next steps to get this diff merged ([comment](https://github.com/pytorch/pytorch/pull/111093#issuecomment-1771434853))
2023-10-19 17:39:49 +00:00
jjsjann123
572628e520 Nvfuser code removal (#111093)
Removes the existing integration code & build of nvfuser in TorchScript.

Note that I intentionally left the part where we wipe out `third_party/nvfuser` repo. I'll do that in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111093
Approved by: https://github.com/albanD
2023-10-18 01:00:47 +00:00
PyTorch MergeBot
1e70f4d02c Revert "Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)"
This reverts commit bb1424d46e.

Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))
2023-10-16 23:03:26 +00:00
PyTorch MergeBot
f68d6e8108 Revert "Move at::{Refcounted,}MapAllocator to c10 (#109881)"
This reverts commit 68a1219f74.

Reverted https://github.com/pytorch/pytorch/pull/109881 on behalf of https://github.com/kit1980 due to breaking internal builds, undefined symbol: _ZN3c1022RefcountedMapAllocator6decrefEv ([comment](https://github.com/pytorch/pytorch/pull/109881#issuecomment-1761950014))
2023-10-13 17:57:53 +00:00
Will Constable
bb1424d46e Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)
This reverts commit 314a502eb0.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
2023-10-12 16:59:23 +00:00
Peter Bell
68a1219f74 Move at::{Refcounted,}MapAllocator to c10 (#109881)
`libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`,
 so it makes sense to move it to c10 along with the other memory allocators.

This means `libshm.so` only depends on `c10` and we don't need to relink
`libshm.so` for every ATen change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881
Approved by: https://github.com/albanD
2023-10-12 10:51:13 +00:00
Bin Bao
3058700f7f [aotinductor] Add AOTIModelRunner as a utility class (#110891)
Summary: Introduce a utility class AOTIModelRunner to take care of running an AOTInductor compiled model. It does things like dlopen a model, initialize the model container, setup inputs and outputs, and destroy the model container.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110891
Approved by: https://github.com/chenyang78
ghstack dependencies: #110652
2023-10-11 15:58:28 +00:00
PyTorch MergeBot
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
PyTorch MergeBot
02a02a23ee Revert "Move at::{Refcounted,}MapAllocator to c10 (#109881)"
This reverts commit 0341deb1c7.

Reverted https://github.com/pytorch/pytorch/pull/109881 on behalf of https://github.com/albanD due to It does break buck build ([comment](https://github.com/pytorch/pytorch/pull/109881#issuecomment-1756195823))
2023-10-10 20:39:12 +00:00
Will Constable
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b038.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
Peter Bell
0341deb1c7 Move at::{Refcounted,}MapAllocator to c10 (#109881)
`libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`,
 so it makes sense to move it to c10 along with the other memory allocators.

This means `libshm.so` only depends on `c10` and we don't need to relink
`libshm.so` for every ATen change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881
Approved by: https://github.com/albanD
2023-10-09 23:53:47 +00:00
George White
f4796df914 Add support for generators on the IPU device (#110704)
This change adds hooks similar to those used on other device types, to allow the Torch to create and use generators provided by the IPU backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110704
Approved by: https://github.com/ezyang
2023-10-06 21:36:14 +00:00
PyTorch MergeBot
ff0358b038 Revert "[C10] PG observability hooks. (#108815)"
This reverts commit 0c7a877745.

Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))
2023-10-06 19:49:49 +00:00
Rodrigo Kumpera
0c7a877745 [C10] PG observability hooks. (#108815)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-10-06 18:52:46 +00:00
zdevito
3fe3439242 Use LLVMSymbolizer directly for unwind inside fbcode (#108800)
Using LLVMSymbolizer directly avoids having to call fork which has caused timeouts in some circumstances.

Differential Revision: [D49070589](https://our.internmc.facebook.com/intern/diff/D49070589/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108800
Approved by: https://github.com/aaronenyeshi
2023-10-04 04:04:08 +00:00
cyy
7f5fd92372 Reland use std::make_unique after internal changes (#109742)
check internal
follow up of #109780
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109742
Approved by: https://github.com/ezyang
2023-09-28 17:24:08 +00:00
rzou
8124a6c40c [TORCH_LIBRARY] Add impl_abstract_pystub (#109529)
We want users to be able to define custom ops in C++ but put the
abstract impl in Python (since it is easier to write them in Python and
the abstract impl better models device semantics and data-dependent
operators).

`m.impl_abstract_pystub(opname, python_module, context)` declares the
abstract_impl of the operator to exist in the given python module.
When the abstract_impl needs to be accessed (either via FakeTensor or
Meta), and it does not exist, the PyTorch Dispatcher will yell
with a descriptive error message.

Some details:
- We construct a new global AbstractImplPyStub mapping in
  Dispatcher.cpp. Read/write to this map is protected by the Dispatcher
  lock.
- We add a new Meta Tensor fallback kernel. The fallback errors out if there is
  no meta kernel, but also offers a nicer error message if we see that there is
  a pystub.
- We create a `torch._utils_internal.throw_abstract_impl_not_imported_error`
  helper function to throw errors. This way, we can throw different error
  messages in OSS PyTorch vs internal PyTorch. To invoke this from C++, we
  added a PyInterpreter::throw_abstract_impl_not_imported_error.

Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753/)

Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109529
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-09-22 04:55:36 +00:00
Bin Bao
9c2715bbb2 [inductor] Clean up AOTInductor runtime ABI (#109678)
Summary: Change the AOTInductor runtime interface to avoid referring to aten data structures directly, mostly at::Tensor and ProxyExecutor. This a combination of https://github.com/pytorch/pytorch/pull/109436,  https://github.com/pytorch/pytorch/pull/109498, https://github.com/pytorch/pytorch/pull/109450, https://github.com/pytorch/pytorch/pull/109606, plus a few internal build changes.

Reviewed By: frank-wei

Differential Revision: D49374820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109678
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2023-09-21 00:25:24 +00:00
mingfeima
a683bc54fd move transform_bias_rescale_qkv vectorized code to cpu sub folder (#109095)
`at::vec::Vectorized<>` will not be properly vectorized under the folder of `aten/src/ATen/native/transformers`, move the vectorized code to `aten/src/ATen/native/cpu` where the macros of `CPU_CAPABILITY_AVX2`, `CPU_CAPABILITY_AVX512` etc. are defined.

Here is the vtune log before and after this patch on `transform_bias_rescale_qkv_cpu`
1. before:
![transformer_bioas_rescale_qkv_before](https://github.com/pytorch/pytorch/assets/20233731/582f6873-d86e-47a6-bd2a-620b97acc5b1)
2. after:
![transformer_bioas_rescale_qkv_after](https://github.com/pytorch/pytorch/assets/20233731/949004ab-3cbc-4a1d-a03d-9a17efa981ae)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109095
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-09-18 08:40:06 +00:00
PyTorch MergeBot
525e4f42d0 Revert "replace torch::make_unique with std::make_unique (#108866)"
This reverts commit 03e35efbf7.

Reverted https://github.com/pytorch/pytorch/pull/108866 on behalf of https://github.com/clee2000 due to Sorry but I found more usages of `torch::make_unique` internally, I can go change all of these, but I'd prefer if that gets done before this gets merged ([comment](https://github.com/pytorch/pytorch/pull/108866#issuecomment-1722577925))
2023-09-17 21:57:30 +00:00
Bin Bao
0f646b1d15 [inductor] Add a C shim layer for libtorch (#109391)
Summary:
This PR adds a limited C shim layer for libtorch. The ultimate goal is to ban any direct reference to aten/c10 data structures or functions, to avoid ABI breakage by providing stable C interfaces.

To make the review and landing easier, we broke the changes into several steps. In this PR (a combination of https://github.com/pytorch/pytorch/pull/109022 and https://github.com/pytorch/pytorch/pull/109351), we add C interfaces for certain libtorch functions and modify the wrapper codegen to generate calls to those interfaces. There are a few other items to be addressed in future PRs:

* The AOTInductor runtime interface still takes lists of aten tensors as input and output
* The interaction with ProxyExecutor (general fallback support) needs to move away from aten tensor
* Remove all references to aten/c10 headers in the AOTInductor-generated code

Differential Revision: D49302669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109391
Approved by: https://github.com/chenyang78
2023-09-16 16:46:26 +00:00
cyy
03e35efbf7 replace torch::make_unique with std::make_unique (#108866)
It should be safe to remove the old torch::make_unique functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108866
Approved by: https://github.com/albanD
2023-09-14 20:52:26 +00:00
drisspg
ad90ab31f2 Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That beings said I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was need to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmakee
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFlops are reported here are on a100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-09-13 13:59:05 +00:00
Kurt Mohler
4c5e43574c Reland 2: Add PyObject preservation for UntypedStorage (#109039)
Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039
Approved by: https://github.com/ezyang
2023-09-12 22:26:05 +00:00
PyTorch MergeBot
59f605be57 Revert "Reland 2: Add PyObject preservation for UntypedStorage (#109039)"
This reverts commit 419e4e17a2.

Reverted https://github.com/pytorch/pytorch/pull/109039 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing linter job in trunk, probably due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/109039#issuecomment-1715147020))
2023-09-12 07:26:11 +00:00
Kurt Mohler
419e4e17a2 Reland 2: Add PyObject preservation for UntypedStorage (#109039)
Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039
Approved by: https://github.com/ezyang
2023-09-12 01:19:40 +00:00
PyTorch MergeBot
68238606f3 Revert "Reland: Add PyObject preservation for UntypedStorage (#103907)"
This reverts commit 56b848157c.

Reverted https://github.com/pytorch/pytorch/pull/103907 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing torchdistx build which uses check_pyobj here 9c1b9f5cb2/src/python/torchdistx/_C/deferred_init.cc (L87) ([comment](https://github.com/pytorch/pytorch/pull/103907#issuecomment-1712121158))
2023-09-08 19:27:07 +00:00
Huy Do
a9c663c269 Revert "Flash Attention v2 (#105602)" (#108827)
This reverts commit add45aea1c.

There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually.

The diff has been reverted internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827
Approved by: https://github.com/kit1980
2023-09-08 07:43:04 +00:00
PyTorch MergeBot
e45b290127 Revert "Revert "Flash Attention v2 (#105602)" (#108827)"
This reverts commit 24e9bbe22a.

Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))
2023-09-08 03:25:45 +00:00
Huy Do
24e9bbe22a Revert "Flash Attention v2 (#105602)" (#108827)
This reverts commit add45aea1c.

There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually.

The diff has been reverted internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827
Approved by: https://github.com/kit1980
2023-09-08 02:54:20 +00:00
Kurt Mohler
56b848157c Reland: Add PyObject preservation for UntypedStorage (#103907)
This relands #97470 after #102553 reverted it. This PR attempts to fix the internal failure by avoiding an unnecessary intermediate storage buffer allocation in `c10::newStorageImplFromRefcountedDataPtr`.

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103907
Approved by: https://github.com/ezyang
2023-09-07 04:24:11 +00:00
David Berard
06b173780d [dynamo] "TorchDynamo Cache Lookup" event: use C++ api (#108436)
**Background**: "TorchDynamo Cache Lookup" events appear in traces to indicate a dynamo cache lookup; it's useful to check when cache lookups are taking a long time. To add a profiler event, one can use the `torch.profiler.record_function` context manager, or the C++ equivalent. Previously, the python version was used; first, when the profiler was enabled, callbacks for record_function_enter and record_function_exit were registered; then those would be called before and after every cache lookup.

**This PR**: Instead of calling the python bindings for `torch.profiler.record_function`, directly call the C++ implementation. This simplifies a lot of the code for binding C/C++. It also improves performance; previously there was a lot of overhead in the "TorchDynamo Cache Lookup" event, making the event artificially take a long time. After this change the events now appear shorter, because there's less overhead in starting/stopping the event: in other words, the profiler no longer distorts the results as much.

**Performance results**:
I ran using the script below on a cpu-only 1.6GHz machine. I report the median time (from 100 measurements) of a "TorchDynamo Cache Lookup" event before and after this PR. I think it is reasonable to consider the difference to be due to a reduction in overhead.

<details>

<summary>Benchmarking script</summary>

```python
def fn(x, y):
    return (x * y).relu()

a, b = [torch.rand((4, 4), requires_grad=True) for _ in range(2)]

opt_fn = torch.compile(fn)

opt_fn(a, b)
opt_fn(a, b)

with torch.profiler.profile() as prof:
    opt_fn(a, b)
```

</details>

Median before PR: 198-228 us (median of 100, measured 5 times)
Median after PR: 27us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108436
Approved by: https://github.com/anijain2305, https://github.com/jansel
2023-09-04 04:37:26 +00:00
drisspg
add45aea1c Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That beings said I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was need to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmakee
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFlops are reported here are on a100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-09-01 22:14:44 +00:00
PyTorch MergeBot
d569e506ab Revert "Flash Attention v2 (#105602)"
This reverts commit 9df3d882c8.

Reverted https://github.com/pytorch/pytorch/pull/105602 on behalf of https://github.com/huydhn due to I think we miss a case here for sm80 build on inductor workflow as it is now OOM on trunk https://github.com/pytorch/pytorch/actions/runs/6042843139 ([comment](https://github.com/pytorch/pytorch/pull/105602#issuecomment-1701974862))
2023-09-01 01:15:01 +00:00
drisspg
9df3d882c8 Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That beings said I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was need to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmakee
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFlops are reported here are on a100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-08-31 16:02:20 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
Liao, Xuan
71632d4d24 [cpu] add sdpa choice and UT (#105131)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.

Write an SDPA selecting function for CPU to automatically choose one SDPA implementation among several ones. There are two CPU implementations which could be chosen: the unfused SDPA and flash attention. In general, flash attention has a higher priority than the unfused SDPA. For cases where flash attention is not applicable, such as manually disabling flash attention or the inputs not 4 dimensional, the unfused SDPA is chosen.

## Performance of the stack

### NanoGPT's SDPA kernel
Using benchmark [repo](https://github.com/mingfeima/bench_sdpa/blob/main/README.md), with one socket.
Shape: Batch size 1, Sequence length 1024, Head number 25, Head size 64.
Machine: SPR.

| Dtype    | Causal   | Mode      | SDPA            | Time (ms per iter) | Speedup |
| -------- | -------- | -------   | -------         | -------            | ------- |
| float32  | FALSE    | Inference | Unfused         | 3.081              |         |
|          |          |           | Flash attention | 1.665              | **1.85045** |
| float32  | TRUE     | Inference | Unfused         | 3.463              |         |
|          |          |           | Flash attention | 1.662              | **2.083634**|
| bfloat16 | FALSE    | Inference | Unfused         | 1.203              |         |
|          |          |           | Flash attention | 1.154              | **1.042461**|
| bfloat16 | TRUE     | Inference | Unfused         | 1.543              |         |
|          |          |           | Flash attention | 1.154              | **1.337088**|
| float32  | FALSE    | Training  | Unfused         | 54.938             |         |
|          |          |           | Flash attention | 23.029             | **2.385601**|
| float32  | TRUE     | Training  | Unfused         | 58.266             |         |
|          |          |           | Flash attention | 17.835             | **3.266947**|
| bfloat16 | FALSE    | Training  | Unfused         | 18.924             |         |
|          |          |           | Flash attention | 18.886             | **1.002012**|
| bfloat16 | TRUE     | Training  | Unfused         | 21.08              |         |
|          |          |           | Flash attention | 14.172             | **1.48744** |

### Stable Diffusion
Following model's [BKM](https://github.com/intel-innersource/frameworks.ai.models.intel-models/blob/develop/quickstart/diffusion/pytorch/stable_diffusion/inference/cpu/README.md).
Mode: Inference; Machine: SPR.

| Dtype    | SDPA                    | Throughput (fps) | Speedup SDPA | Total Time (ms) | Speedup |
| -------- | --------                | -------          | -------      | -------         | ------- |
| float32  | Unfused                 | 1.63             |              | 1139            |         |
|          | Flash attention         | 1.983            | 1.216564     | 547.488         | **2.080411**|
| bfloat16 | Flash attention in IPEX | 4.784            |              | 429.051         |         |
|          | Flash attention         | 4.857            | 1.015259     | 408.823         | **1.049479**|

### LLM models of Torchbench

Dtype: float32; Mode: Inference, single socket; Machine: CPX.
Model   name | SDPA | Inductor_new | Inductor_old | Inductor   Ratio(old/new)
-- | -- | -- | -- | --
hf_Albert | Unfused -> Flash attention | 0.048629309 | 0.05591545 | **1.14983024**
hf_Bert | Unfused -> Flash attention | 0.053156243 | 0.060732115 | **1.142520841**
hf_Bert_large | Unfused -> Flash attention | 0.141089502 | 0.155190077 | **1.099940636**
llama | Unfused -> Flash attention | 0.033250106 | 0.033720745 | **1.01415451**

Dtype: bfloat16; Mode: Inference, single socket; Machine: SPR.
Model   name | SDPA | Inductor_new | Inductor_old | Inductor   Ratio(old/new)
-- | -- | -- | -- | --
hf_Albert | Unfused -> Flash attention | 0.020681298 | 0.020718282 | **1.001788324**
hf_Bert | Unfused -> Flash attention | 0.019932816 | 0.019935424 | **1.000130842**
hf_Bert_large | Unfused -> Flash attention | 0.047949174 | 0.048312502 | **1.007577355**
llama | Unfused -> Flash attention | 0.018528057 | 0.01861126 | **1.0044907**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105131
Approved by: https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826, #104693, #104863, #107128
2023-08-20 08:56:21 +00:00
Liao, Xuan
5516fe12ec [cpu] implement scaled dot product flash attention (#103826)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.

The flash attention CPU kernel is added, for forward path FP32. Blocking is applied on dimensions of query length and kv length and the fusion of gemm + softmax update + gemm is done at once for each block. Parallelization is on the dimensions of batch size, head number and query length. In addition, the causal attention mask is supported. As the attention is masked for the unseen tokens, early termination is applied and we only calculate the blocks in the lower triangular part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103826
Approved by: https://github.com/drisspg, https://github.com/jgong5
ghstack dependencies: #104583, #104584
2023-08-20 08:43:48 +00:00
shibo19
7047d132fd add context support for custom device (#105056)
Fixes #ISSUE_NUMBER
as the title, add context support for custom device and testcase.
And in the future, we may want to refactor these hooks for different device to unify the APIs, would you agree my
idea? @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105056
Approved by: https://github.com/albanD
2023-07-29 12:56:03 +00:00
Rodrigo Kumpera
2636751fb9 [C10d] Add skeleton of LibUV backend. (#105672)
This commit hooks up tcpstore creation and build flags.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105672
Approved by: https://github.com/fduwjj
2023-07-28 13:19:06 +00:00
Rodrigo Kumpera
fe284b0d97 [C10D] Extract some bits of TCPStore into TCPStoreBackend. (#105163)
This moves BackgroundThread to TCPStoreBackend.hpp. This will eventually be the
interface shared between the current TCPStore backend and the new libuv one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105163
Approved by: https://github.com/fduwjj, https://github.com/H-Huang
2023-07-25 17:59:15 +00:00
Jason Ansel
c902b84e0b Compiled autograd (#103822)
This branch:
1) converts the autograd tape into an FX graph
2) caches that conversion using a "shadow" graph
3) compiles and runs the generated FX graph instead of the normal autograd

What works currently:
1) Caching, capture, and initial integration
2) Backwards hooks
3) Inlining AotAutograd generated subgraphs
4) torch.compiling the generated FX graph
5) Auto-detecting dynamic shapes based on changes

Future work
1) Larger scale testing
1) Boxed calling convention, so memory can be freed incrementally
1) Support hooks on SavedTensor
1) Additional testing by running eager autograd tests under compiled_autograd.enable()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103822
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-24 21:12:05 +00:00
Bin Bao
528ab477ce [reland][inductor] Register an op for mm_plus_mm (#105153)
Summary: Reland https://github.com/pytorch/pytorch/pull/104835 after fixing internal build issues

Test Plan: CI

Differential Revision: D47442849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105153
Approved by: https://github.com/clee2000
2023-07-14 14:35:29 +00:00
Catherine Lee
c36dca7bc5 Revert "[inductor] Register an op for mm_plus_mm (#104835)" (#105150)
This reverts commit 9c46a1620c.

Actual revert referenced in https://github.com/pytorch/pytorch/pull/105149

#104835 is causing internal builds to fail

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105150
Approved by: https://github.com/atalman
2023-07-13 17:13:45 +00:00
Bin Bao
9c46a1620c [inductor] Register an op for mm_plus_mm (#104835)
Summary: Currently the aten version of mm_plus_mm has no cpp
implementation, and thus cpp_wrapper can not generate the correct cpp
function call for it.

Differential Revision: [D47372057](https://our.internmc.facebook.com/intern/diff/D47372057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104835
Approved by: https://github.com/jansel, https://github.com/SherlockNoMad
2023-07-12 02:34:02 +00:00
Louis Feng
5847cb55e4 [PyPer][ET] Refactor EG to ET (#99694)
Summary:
Change execution graph to execution trace.
See post: https://fb.workplace.com/groups/873291503156329/permalink/1529496217535851/

Test Plan: Run a job.

Reviewed By: chaekit

Differential Revision: D44121392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99694
Approved by: https://github.com/chaekit
2023-06-22 19:41:54 +00:00
leslie-fang-intel
9832cfbbfe Quantization oneDNN backend only support VNNI CPU (#103653)
**Summary**

- Update the quantization document that default qconfig with oneDNN backend is recommended to be used on CPUs with Vector Neural Network Instruction support.
- Add the warning message when user uses default qconfig with oneDNN backend on CPU without Vector Neural Network Instruction support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103653
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-19 09:50:07 +00:00
xuanqi
b27c3558a4 [RFC]: Create aten native op for constrain_range (#103346)
At high current implementation of constrains functions (constrain_as_**) will raise exception for the following code snippets:
```
def f(x):
    a = x.item()
    constrain_as_size(a, 4, 7)
    return torch.empty((a, 4))

inp = torch.tensor([5])
ep = torch._export.export(f, (inp,))
```

The reason is because current constrain logic is:
1) Purely python so it won't survive AOT export (the full node is gone after AOT export since AOT export only maintains aten level op).
2) Utilize side effect to add range constraints for traced symbol's shape env ([code](9591e52880/torch/fx/experimental/symbolic_shapes.py (L370-L372))).
3) If runtime assertion is turned on (by default). [`_AddRuntimeAssertionsForConstraintsPass`](9591e52880/torch/_export/passes/add_runtime_assertions_for_constraints_pass.py (L98-L100)) will try to append assertion node based on range constrains extracted from shape env of symbol during another interpretation round.
4). However, since 1), in the round of AOT export, range constraints logic won't run for symbols generated during this round. And later there is no range constrains information available for assertion round and caused issue.
5) As a result of above, it will failure at `torch.empty((a, 4))` (there is no constrains for `a` that it must be positive).

The fix here is just to implement range constrain logic as a native aten op (CPU implementation as no-op) to make it be able to survive AOT export.

**NOTE:**
[Logic](2d745b95d7/torch/fx/experimental/symbolic_shapes.py (L350-L365C15)) within [`constrain_range`](2d745b95d7/torch/fx/experimental/symbolic_shapes.py (LL313C74-L313C74)) is split out as `constrain_range_int` to capture case when non `SymInt` is passed in and reused in the new `_constrain_range`. The reason is when non `SymInt` is provided:
* If it directly calls `sym_constrain_range`, the C++ version will be called which will be no-op.
* So in this case it calls `constrain_range_int` instead to be able to capture issue like user provides a input whose tensor's shape could be out of range during exporting, like the following for above code example:
```
...
inp = torch.tensor([10])
ep = torch._export.export(f, (inp,)) # immediately raise error
```

Differential Revision: [D46734204](https://our.internmc.facebook.com/intern/diff/D46734204)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103346
Approved by: https://github.com/tugsbayasgalan
2023-06-16 14:55:40 +00:00
mingfeima
69b09eca5a optimize reflection padding performance on CPU (#102254)
This patch improves reflection padding performance on CPU.

Original kernel has nested paralleled loops, e.g. first on dim of **batch** and then on dim of **channels**, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope.

The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.

### single core inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms;
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms;

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms;
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms;
```

### single socket inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms;
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms;

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms;
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102254
Approved by: https://github.com/cpuhrsch
2023-06-14 17:18:51 +00:00
PyTorch MergeBot
2c313e7b99 Revert "Record view stacks if running anomaly mode (#103185)"
This reverts commit a02c573a89.

Reverted https://github.com/pytorch/pytorch/pull/103185 on behalf of https://github.com/izaitsevfb due to Breaks internal builds, see D46629734 ([comment](https://github.com/pytorch/pytorch/pull/103185#issuecomment-1588258206))
2023-06-12 23:52:10 +00:00
Edward Z. Yang
a02c573a89 Record view stacks if running anomaly mode (#103185)
Now, when you do an inplace mutation and the view is naughty, you get this message:

```
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). To find out where this view was allocated, run your entire forward region under anomaly mode (torch.autograd.detect_anomaly(check_nan=False)).
```

When you run under anomaly mode, you get:

```
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). This view was allocated at:
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4299, in arglebargle
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4306, in test_anomaly_gives_view_stack
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 591, in run
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2266, in _run_with_retry
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2337, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 650, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/runner.py", line 184, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 271, in runTests
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 101, in __init__
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 894, in run_tests
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 11209, in <module>
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103185
Approved by: https://github.com/zdevito
2023-06-09 16:56:28 +00:00
Shiyan Deng
685505353a Back out "Add PyObject preservation for UntypedStorage (#97470)" (#102553)
Summary:
Original commit changeset: c24708d18ccb

Original Phabricator Diff: D46159983

Test Plan: SL tests and CI

Differential Revision: D46284986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102553
Approved by: https://github.com/DanilBaibak
2023-06-01 17:23:43 +00:00
Kurt Mohler
5fe629e314 Add PyObject preservation for UntypedStorage (#97470)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97470
Approved by: https://github.com/ezyang
2023-05-23 01:27:30 +00:00
Richard Li
c523d7d899 Add a new hook (#99854)
Differential Revision: D45220984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99854
Approved by: https://github.com/albanD
2023-04-26 23:00:38 +00:00
William Wen
785676ccb0 [dynamo 3.11] refactor cpython function defs out of eval_frame.c (#99947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99947
Approved by: https://github.com/voznesenskym, https://github.com/albanD
2023-04-26 00:18:12 +00:00
Elias Ellison
d881b2978c Make autocast cache and buffer stealing aware of cudagraph static output tensors (#99368)
In this stack of PRs we adding caching to output tensors for cudagraph trees after we've done initial recording. On initial recording we do not cache tensor outputs because this prevents memory from being reclaimed. On subsequent exeuctions we do cache them to avoid overhead. However, because there is an extra reference around, this caused divergent recording & execution behavior in both autocast caching and autograd gradient stealing. Divergent recording & execution would keep on re-recording and eventually stabilize, but it's not what you want to see happen.

This pr makes the autocast cache and buffer stealing aware of the cudagraph static output tensors.

I will add this to the other cudagraph impl in another pr.

Not sure if this should be in autograd or in autocast since it affects both.. Or somewhere else

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99368
Approved by: https://github.com/albanD, https://github.com/ezyang
2023-04-24 20:23:12 +00:00
Rodrigo Kumpera
38e964056b Reland python ops (#99170)
Waiting for the revert to land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99170
Approved by: https://github.com/albanD
2023-04-18 15:15:46 +00:00
PyTorch MergeBot
1c042a2137 Revert "Reland python ops (#99170)"
This reverts commit d4de64ae8d.

Reverted https://github.com/pytorch/pytorch/pull/99170 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-18 11:37:43 +00:00
Rodrigo Kumpera
d4de64ae8d Reland python ops (#99170)
Waiting for the revert to land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99170
Approved by: https://github.com/albanD
2023-04-17 21:53:41 +00:00
Rodrigo Kumpera
a910045add [PATCH] Back out "Move functional collectives implementation to python. (#98595) (#99168)
Summary:
Original commit changeset: ba36f8751adc

Original Phabricator Diff: D44788697

Test Plan: model loading is fine after reverting the diff

Reviewed By: zyan0, sayitmemory

Differential Revision: D44921259
---

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99168
Approved by: https://github.com/izaitsevfb
2023-04-14 23:48:19 +00:00
Rodrigo Kumpera
24d9001527 Move functional collectives implementation to python. (#98595)
This simplifies a lot the work we need to add new ops.

This relands the previous PR, not sure why it was reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98595
Approved by: https://github.com/wconstab
2023-04-07 21:48:05 +00:00
PyTorch MergeBot
67d1a77086 Revert "Move functional collectives implementation to python. (#98315)"
This reverts commit 8b0374f83c.

Reverted https://github.com/pytorch/pytorch/pull/98315 on behalf of https://github.com/huydhn due to Sorry for reverting for PR. This is failing in trunk probably due to a landrace
2023-04-06 16:49:40 +00:00
Rodrigo Kumpera
8b0374f83c Move functional collectives implementation to python. (#98315)
This simplifies a lot the work we need to add new ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98315
Approved by: https://github.com/albanD, https://github.com/wconstab, https://github.com/Neilblaze
2023-04-06 14:06:16 +00:00
PyTorch MergeBot
f4f1a5b5b3 Revert "Move functional collectives to the right namespace (#97793)"
This reverts commit 184bfbc3d7.

Reverted https://github.com/pytorch/pytorch/pull/97793 on behalf of https://github.com/atalman due to breaks internal builds
2023-03-31 16:02:07 +00:00
Rodrigo Kumpera
184bfbc3d7 Move functional collectives to the right namespace (#97793)
This moves them from `torch._C._nn` to `torch._C._dist`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97793
Approved by: https://github.com/albanD
2023-03-30 22:18:13 +00:00
Shuming Hu
b45880c537 Optionally ignore utf-8 decoding error when converting std::string to python str. (#97282)
Summary: When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97282
Approved by: https://github.com/davidberard98
2023-03-23 01:19:08 +00:00
Zachary DeVito
e74f70d212 Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"" (#96878)
This reverts commit e1ea584b1c.
Adds __has_include check to fix fbcode build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878
Approved by: https://github.com/ezyang
2023-03-16 04:12:54 +00:00
PyTorch MergeBot
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
Ramin Azarmehr
234df29901 [MPS] Add C++ API support for MPS backend (#96668)
- This enables the APIs `torch::mps::is_available()/synchronize()/manual_seed()` for use in PyTorch C++.
- Added test case for C++ APIs to `mps_test_allocator.cpp`

Fixes #96425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96668
Approved by: https://github.com/kulinseth, https://github.com/albanD, https://github.com/malfet
2023-03-14 20:27:40 +00:00
Zachary DeVito
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling
    in python+cuda to make a generic facility to generate combined stack
    traces.

    The generic facility (combined_traceback.h) does not require
    python to be around to work, but will return python stacks if it is
    present.

    This facility is then used to add support for stack trace gathering in memory profiling that
    happens directly from C++.

    It is also used to expose a python API for gathering and symbolizing
    combineds stacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
PyTorch MergeBot
a07817ad8f Revert "[MPS] Add C++ API support for MPS backend (#96668)"
This reverts commit 069ace131c.

Reverted https://github.com/pytorch/pytorch/pull/96668 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-14 12:43:04 +00:00
Nikita Shulga
82daf98151 [Sparse] Move SparseTensorUtils.* to native/ (#96696)
Fixes internal linking problem after `DECLARE_DISPATCH` was introduced in SparseTensorUtils.cpp, but implemented inside the native library.

Also, fix `sign-unsigned` compare in `_flatten_indices_impl`
Followups:
 Move code declared/implemented in `SparseTensorUtils.*` to `at::native` namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96696
Approved by: https://github.com/albanD
2023-03-14 02:56:52 +00:00
Ramin Azarmehr
069ace131c [MPS] Add C++ API support for MPS backend (#96668)
- This enables the APIs `torch::mps::is_available()/synchronize()/manual_seed()` for use in PyTorch C++.
- Added test case for C++ APIs to `mps_test_allocator.cpp`

Fixes #96425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96668
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-03-13 23:15:37 +00:00
Zachary DeVito
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's backward because it
cache fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
Xunsong, Huang
b053a0f2ba [XPU][Profiler] Add API support for XPU profiler to Kineto path (#94502)
This patch is aimed to add support to XPU profiler which will co-work with Kineto. After this PR, kineto will follow these API to fit itself. Also, the development of interface in python is near done.

Signed-off-by: Huang, Xunsong <xunsong.huang@intel.com>

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94502
Approved by: https://github.com/ezyang
2023-03-10 12:17:14 +00:00
Nikita Vedeneev
d0f4d62961 flatten_indices: remove syncs (#94401)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94401
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-03-10 12:03:26 +00:00
David Berard
53b4f6c0f6 Revert "[jit] Add c++ stacktraces for jit::ErrorReport (#94842)" (#95886)
This reverts commit 70029214f3.

It broke some internal tests.

Differential Revision: [D43735833](https://our.internmc.facebook.com/intern/diff/D43735833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95886
Approved by: https://github.com/malfet, https://github.com/qihqi
2023-03-03 05:49:40 +00:00
David Berard
70029214f3 [jit] Add c++ stacktraces for jit::ErrorReport (#94842)
**Summary**: This PR adds C++ stacktraces to jit::ErrorReports. After this PR, if you run with `TORCH_SHOW_CPP_STACKTRACES=1` environment variable and a jit::ErrorReport is thrown, then the C++ stacktrace should be displayed.

**More background**: This behavior already occurs for c10::Error; but not for jit::ErrorReport. jit::ErrorReport _does_ usually have a python stacktrace for the python source, but it is sometimes still helpful to know where in the C++ codebase the error came from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94842
Approved by: https://github.com/qihqi
2023-02-28 22:37:51 +00:00
Rodrigo Kumpera
e22d791287 [PTD] Introduce tracing friendly collectives. (#93990)
This change adds torch.distributed.traceable_collectives.

This experimental API enables collectives to be fully traced by dynamo and FX.

See #93173 for the RFC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93990
Approved by: https://github.com/wconstab, https://github.com/wanchaol, https://github.com/H-Huang
2023-02-16 15:35:01 +00:00
Ramin Azarmehr
bdd8f518d7 [MPS] Add Python Module Bindings for the MPS backend (#94417)
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
  `torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`

Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
2023-02-12 21:22:30 +00:00
PyTorch MergeBot
4fe365774a Revert "[MPS] Add Python Module Bindings for the MPS backend (#94417)"
This reverts commit beb4f5bf39.

Reverted https://github.com/pytorch/pytorch/pull/94417 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to break MacOS test in trunk bae397ec63
2023-02-11 05:24:45 +00:00
Ramin Azarmehr
beb4f5bf39 [MPS] Add Python Module Bindings for the MPS backend (#94417)
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
  `torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`

Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
2023-02-10 23:18:41 +00:00
Maxwell Nuyens
0d0ebcdfe5 feature: adding the ability to restore shapes after loading a traced model (#90744)
Adds the ability to store inputs used in tracing models when calling torch.jit.save and restore the input shapes using torch.jit.load if the appropriate variables are set.

Fixes [89185](https://github.com/pytorch/pytorch/issues/89185)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90744
Approved by: https://github.com/davidberard98
2023-02-10 17:12:52 +00:00
mingfeima
c620ece726 port sparse_mm.reduce to pytorch and optimize it on CPU (#83727)
### Motivation of this PR

This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300

**GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes:

* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`

The reduce type can be choose from: "max", "mean", "max",  "min".

extend `torch.sparse.mm` with an `reduce` argument, maps to `torch.sparse_mm.reduce` internally.
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:
* `out` - the actual output
* `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated.

### Performance

Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch.

Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused.

#### before:
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
       torch_sparse::spmm_sum        97.09%       56.086s        97.09%       56.088s        6.232s             9
                 aten::linear         0.00%      85.000us         1.38%     795.485ms      88.387ms             9
                 aten::matmul         0.00%      57.000us         1.38%     795.260ms      88.362ms             9
                     aten::mm         1.38%     795.201ms         1.38%     795.203ms      88.356ms             9
                   aten::relu         0.00%      50.000us         0.76%     440.434ms      73.406ms             6
              aten::clamp_min         0.76%     440.384ms         0.76%     440.384ms      73.397ms             6
                   aten::add_         0.57%     327.801ms         0.57%     327.801ms      36.422ms             9
            aten::log_softmax         0.00%      23.000us         0.10%      55.503ms      18.501ms             3
```

#### after
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::spmm_sum        87.35%       11.826s        87.36%       11.827s        1.314s             9
                 aten::linear         0.00%      92.000us         5.87%     794.451ms      88.272ms             9
                 aten::matmul         0.00%      62.000us         5.87%     794.208ms      88.245ms             9
                     aten::mm         5.87%     794.143ms         5.87%     794.146ms      88.238ms             9
                   aten::relu         0.00%      53.000us         3.35%     452.977ms      75.496ms             6
              aten::clamp_min         3.35%     452.924ms         3.35%     452.924ms      75.487ms             6
                   aten::add_         2.58%     348.663ms         2.58%     348.663ms      38.740ms             9
                 aten::argmax         0.42%      57.473ms         0.42%      57.475ms      14.369ms             4
            aten::log_softmax         0.00%      22.000us         0.39%      52.605ms      17.535ms             3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu
2023-02-10 15:56:40 +00:00
jjsjann123
c11b301bcd [NVFUSER] refactor nvfuser build (#89621)
This PR is the first step towards refactors the build for nvfuser in order to have the coegen being a standalone library.

Contents inside this PR:
1. nvfuser code base has been moved to `./nvfuser`, from `./torch/csrc/jit/codegen/cuda/`, except for registration code for integration (interface.h/interface.cpp)
2. splits the build system so nvfuser is generating its own `.so` files. Currently there are:
    - `libnvfuser_codegen.so`, which contains the integration, codegen and runtime system of nvfuser
    - `nvfuser.so`, which is nvfuser's python API via pybind. Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser`
3. nvfuser cpp tests is currently being compiled into `nvfuser_tests`
4. cmake is refactored so that:
    - nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`.
    - nvfuser backend code is not compiled inside `libtorch_cuda_xxx` any more
    - nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end after torch is built.
    - since nvfuser has dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids circular dependency in cmake, which will be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary`

Future work that's scoped in following PR:
- Currently since nvfuser codegen has dependency on torch, we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet
- Since we moved nvfuser into a cmake build, we effectively disabled bazel build for nvfuser. This could impact internal workload at Meta, so we need to put support back. cc'ing @vors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621
Approved by: https://github.com/davidberard98
2023-01-26 02:50:44 +00:00
Kurt Mohler
4d9920fa9c Move PyInterpreter code in python_variable.cpp to its own files (#92647)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92647
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-01-24 23:08:23 +00:00
Elias Ellison
70f4b3551c Add Hook to store arbitrary python objects that are copied over in tls (#89169)
For the cudagraphs implementation, we would like to reuse objects that are defined in python across the forward and backward. The backward is run in a different thread, so to handle this we add an api for copying over arbitrary python objects in pytorch's thread local state, in the same way that C++ objects are copied over currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89169
Approved by: https://github.com/albanD
2023-01-24 05:24:57 +00:00
soulitzer
1bc60c6b31 [reland] Improve hooks ordering behavior (#92559)
This reverts commit e525f433e1.

Original PR:  #85849
Fixes #ISSUE_NUMBER

In addition to reverting the revert, this PR:
- defines the virtual destructor of FunctionPreHook in the header. Why? Presumably the internal build imports the header from somewhere, but does not have function_hooks.cpp (where the virtual destructor was previously defined) in the same compilation unit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92559
Approved by: https://github.com/albanD
2023-01-19 08:17:32 +00:00
yanbing-j
94a7c01159 Enable oneDNN implementation in LSTM op (#91158)
### Description
This PR is to enable oneDNN implementation in LSTM op to improve the performance of it. Both FP32 and BF16 are supported.

### Performance improvement
In CPX 28C, with setting iomp and jemalloc.
We choose 8 LSTM input options (including input_size, hidden_size, num_layers, bidirectional, bias, batch_first, dropout, batch_size, seq_len), and the final option is a real input from train-clean-100 in LibriSpeech dataset. The performance improvements are shown in the following figures. We can see that LSTM with oneDNN implementation can perform better than the original.

In single socket:
![image](https://user-images.githubusercontent.com/61222868/211182994-833debec-518a-4b35-8504-6b0fadb17930.png)

![image](https://user-images.githubusercontent.com/61222868/211183012-31e1253f-2c60-4c92-a656-c239a971b453.png)

In single core:
![image](https://user-images.githubusercontent.com/61222868/211183017-186e5d47-cb9a-4c1e-914f-fa718e769f1c.png)

![image](https://user-images.githubusercontent.com/61222868/211183022-53266857-5a9e-4a95-b300-33fa34811d08.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91158
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-01-18 04:41:18 +00:00
mingfeima
3ab58fd5ed optimize sampled_addmm performance on CPU (SparseCSR) (#90978)
### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-01-12 12:04:07 +00:00
Nikita Shulga
6cef59487a [BE] Move internal only non-globbed lists to OSS (#91513)
Summary:
Should prevent internal only fixes that were required for https://github.com/pytorch/pytorch/pull/91104
Just moves the list to `build_variables.bzl` and makes it a sublist of aten_cpu_source_non_codegen_list

Test Plan: CI

Differential Revision: D42281502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91513
Approved by: https://github.com/kit1980, https://github.com/atalman
2022-12-31 00:02:43 +00:00
Ramin Azarmehr
eeb9154b27 [MPS] Add MPSHooks interface to enable accessing MPS functions globally (#91104)
This PR is a prerequisite to the upcoming MPSGenerator changes required for Random Ops.

Add `MPSHooksInterface.cpp` to `aten_cpu_source_non_codegen_list`

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91104
Approved by: https://github.com/kulinseth, https://github.com/malfet
2022-12-21 17:37:09 +00:00
min-jean-cho
6d2b0cbb40 [Re-landing 86706] [JIT] Frozen Graph Linear-BatchNormNd Folding (#91020)
Re-landing #86706

This PR adds linear-batchnormNd folding for JIT frozen graphs.

**Performance benchmark**
A preliminary benchmark with a simple model of linear+bn1d tested on first socket, physical cores of skylake machine.

**FP32, JIT**
without linear-bn folding
![Screenshot (1368)](https://user-images.githubusercontent.com/93151422/195168944-cfc5b920-bc82-4be1-a221-d194c8fa6c18.png)

with linear-bn folding
![Screenshot (1367)](https://user-images.githubusercontent.com/93151422/195168926-267b0515-45a1-4f08-922d-c150845199ae.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91020
Approved by: https://github.com/davidberard98
2022-12-21 08:00:32 +00:00
Salil Desai
8c80a4684b [Vulkan + Profiler] Report Vulkan Events to Profiler in QueryPool (#90670)
@bypass-github-export-checks

With this change, we see Vulkan events reported on the generated chrometrace with proper names and durations.

However, their start/end times are not yet synced with the cpu event timeline, and their parent/child relationships are not established properly. These concerns will be addressed in future diffs

Differential Revision: [D39834807](https://our.internmc.facebook.com/intern/diff/D39834807/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39834807/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90670
Approved by: https://github.com/kimishpatel
2022-12-19 19:56:28 +00:00
PyTorch MergeBot
31b8dc7542 Revert "[JIT] Frozen Graph Linear-BatchNormNd Folding (#86706)"
This reverts commit e585156c59.

Reverted https://github.com/pytorch/pytorch/pull/86706 on behalf of https://github.com/davidberard98 due to possibly causing internal build failures, will revert and investigate later
2022-12-16 00:49:54 +00:00
min-jean-cho
e585156c59 [JIT] Frozen Graph Linear-BatchNormNd Folding (#86706)
This PR adds linear-batchnormNd folding for JIT frozen graphs.

**Performance benchmark**
A preliminary benchmark with a simple model of linear+bn1d tested on first socket, physical cores of skylake machine.

**FP32, JIT**
without linear-bn folding
![Screenshot (1368)](https://user-images.githubusercontent.com/93151422/195168944-cfc5b920-bc82-4be1-a221-d194c8fa6c18.png)

with linear-bn folding
![Screenshot (1367)](https://user-images.githubusercontent.com/93151422/195168926-267b0515-45a1-4f08-922d-c150845199ae.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86706
Approved by: https://github.com/davidberard98
2022-12-14 23:24:50 +00:00
Yuxin Wu
c8ed84ad06 Fix a static initialization order fiasco in c10d (#90149)
The `TORCH_LIBRARY_IMPL` registrations in `OpsImpl.cpp` needs to happen after `ProcessGroup` is registered as a torch class -- which happens in `Ops.cpp`. However, the order of the registrations is undefined between the two files.

If the registration in `OpsImpl.cpp` runs before `Ops.cpp`, we get a crash at program launch similar to #83255 . This happens in our internal build.

This PR moves `OpsImpl.cpp` to the end of `Oops.cpp`. Because according to the omniscient lord of chatGPT:
<img width="600" alt="2022-12-04_19-25" src="https://user-images.githubusercontent.com/1381301/205542847-3535b319-3c2a-4e8e-bc11-27913f6afb39.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90149
Approved by: https://github.com/kwen2501, https://github.com/H-Huang, https://github.com/soumith
2022-12-12 08:21:54 +00:00
Han Qi (qihqi)
25eb7c3ae3 Clean up dependancy for flatbuffer_loader (#86041)
Test Plan: waitforsandcastle

Differential Revision: D38445936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86041
Approved by: https://github.com/cccclai
2022-12-08 03:48:04 +00:00
Richard Zou
4b1053497c [vmap] Prepend "legacy" to files for old vmap implementation (#90324)
We have an older torch.vmap implementation. It is no longer supported.
It still needs to exist somewhere for the sake of BC with
torch.autograd.functional.

This PR makes it clear what files are meant for implementing the old
vmap implementation. I've seen a couple of PRs recently adding support
for the old vmap implementation, so this will lessen the confusion.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90324
Approved by: https://github.com/samdow
2022-12-07 18:46:15 +00:00
Emilio Castillo
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during the code execution. This will allow us to use RMM, use CUDA managed memory for some portions of the code that do not fit in GPU memory. Write static memory allocators to reduce fragmentation while training models and improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   std::cout<<"alloc "<< size<<std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout<<"free "<<std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
Chen Lai
bca75fd2d3 Move xnnpack taget to fb code base (#88909)
1. Move the source file list to the `build_variables.bzl`, as it's the source of truth for both internal buck build and oss build
2. Move target definitions to `fb` internal folder
3. Some changes are triggered from auto format.

Differential Revision: [D40906961](https://our.internmc.facebook.com/intern/diff/D40906961/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40906961/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88909
Approved by: https://github.com/mcr229
2022-11-13 12:04:35 +00:00
Antoni Viros i Martin
c77368d416 Implement a constructor for nested_tensor that is similar to torch.tensor() (#88213)
Summary: This diff merges both previous implementations of constructors for nested tensors, the one from lists of tensors and the one with arbitrary python lists, adn implements it in pytorch core so no extensions are needed to construct NT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88213
Approved by: https://github.com/cpuhrsch
2022-11-08 00:03:18 +00:00
jjsjann123
7b419e8513 [NVFuser] Upstream push 1026 (#87779)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

* codegen improvement:
    i. allow non-root trivial reductions, allow empty/no-op fusion
    ii. fixes vectorization checks and size calculation
    iii. bank conflict handle improvement
    iv. enables transpose scheduler

* misc:
    i. CI tests failure fixes
    ii. cpp tests file clean up
    iii. trivial forwarding supports added in codegen runtime
    iv. added factory methods support in codegen

Commits that's in this PR from the devel branch:

```
7117a7e37ebec372d9e802fdfb8abb7786960f4a patching nvfuser conv cudnn test numerics mismatch (#2048)
65af1a4e7013f070df1ba33701f2d524de79d096 Inserting sync for redundant parallel types is already done at the (#2023)
6ac74d181689c8f135f60bfc1ec139d88941c98c Fix sync map (#2047)
f5bca333355e2c0033523f3402de5b8aac602c00 Bank conflict checker improvements (#2032)
d2ca7e3fd203537946be3f7b435303c60fa7f51e Minor update on cp.async code generation. (#1901)
d36cf61f5570c9c992a748126287c4e7432228e0 Test file cleanup (#2040)
0b8e83f49c2ea9f04a4aad5061c1e7f4268474c6 Allow non-root trivial reductions (#2037)
a2dfe40b27cd3f5c04207596f0a1818fbd5e5439 Fix vectorize size calculation (#2035)
e040676a317fe34ea5875276270c7be88f6eaa56 Use withPredicate to replace setPredicate to maintain Exprs immutable (#2025)
197221b847ad5eb347d7ec1cf2706733aacbf97c removing ci workflow (#2034)
40e2703d00795526e7855860aa00b9ab7160755f Reduction rand like patch (#2031)
bc772661cbdb3b711d8e9854ae9b8b7052e3e4a3 Add utility for checking bank conflict of shared memory (#2029)
ddd1cf7695f3fb172a0e4bcb8e4004573617a037 Add back FusionReductionWithTrivialReduction_CUDA (#2030)
fbd97e5ef15fa0f7573800e6fbb5743463fd9e57 Revert "Cleanup trivial reduction workarounds (#2006)" (#2024)
bca20c1dfb8aa8d881fc7973e7579ce82bc6a894 Cleanup trivial reduction workarounds (#2006)
e4b65850eee1d70084105bb6e1f290651adde23e Trivial forwarding (#1995)
1a0e355b5027ed0df501989194ee8f2be3fdd37a Fix contiguity analysis of predicates to match updated contiguity. (#1991)
a4effa6a5f7066647519dc56e854f4c8a2efd2a7 Enable output allocation cache (#2010)
35440b7953ed8da164a5fb28f87d7fd760ac5e00 Patching bn inference (#2016)
0f9f0b4060dc8ca18dc65779cfd7e0776b6b38e8 Add matmul benchmark (#2007)
45045cd05ea268f510587321dbcc8d7c2977cdab Enable tests previously disabled due to an aliasing bug (#2005)
967aa77d2c8e360c7c01587522eec1c1d377c87e Contiguous indexing for View operations (#1990)
a43cb20f48943595894e345865bc1eabf58a5b48 Make inlining even more modular (#2004)
dc458358c0ac91dfaf4e6655a9b3fc206fc0c897 Test util cleanup (#2003)
3ca21ebe4d213f0070ffdfa4ae5d7f6cb0b8e870 More strict validation (#2000)
a7a7d573310c4707a9f381831d3114210461af01 Fix build problem (#1999)
fc235b064e27921fa9d6dbb9dc7055e5bae1c222 Just fixes comments (#1998)
482386c0509fee6edb2964c5ae72074791f3e43a cleanup (#1997)
4cbe0db6558a82c3097d281eec9c85ad2ea0893a Improve divisible split detection (#1970)
42ccc52bdc18bab0330f4b93ed1399164e2980c9 Minor build fix. (#1996)
fcf8c091f72d46f3055975a35afd06263324ede6 Cleanup of lower_utils.cpp: Isolate out GpuLower usage (#1989)
15f2f6dba8cbf408ec93c344767c1862c30f7ecc Move ConcretizedBroadcastDomains to shared_ptr in GpuLower. (#1988)
8f1c7f52679a3ad6acfd419d28a2f4be4a7d89e2 Minor cleanup lower_unroll.cpp (#1994)
1d9858c80319ca7f0037db7de5f04e47f540d76c Minor cleanup (#1992)
f262d9cab59f41c669f53799c6d4a6b9fc4267eb Add support for uniform RNG (#1986)
eb1dad10c73f855eb1ecb20a8b1f7b6edb0c9ea3 Remove non-const functions, remove GpuLower instance on build, pass in ca_map. (#1987)
634820c5e3586c0fe44132c51179b3155be18072 Add support for some empty fusion (#1981)
eabe8d844ad765ee4973faa4821d451ef71b83c3 Segment self mapping fusions (#1954)
e96aacfd9cf9b3c6d08f120282762489bdf540c8 Enable Transpose operation (#1882)
425dce2777420248e9f08893765b5402644f4161 Add a null scheduler that helps segmenting away no-op schedules (#1835)
306d4a68f127dd1b854b749855e48ba23444ba60 Fix canScheduleCompileTime check of transpose scheduler (#1969)
b1bd32cc1b2ae7bbd44701477bddbcfa6642a9be Minor fix (#1967)
bd93578143c1763c1e00ba613a017f8130a6b989 Enable transpose scheduler (#1927)
b7a206e93b4ac823c791c87f12859cf7af264a4c Move scheduler vectorize utilities into their own file (#1959)
d9420e4ca090489bf210e68e9912bb059b895baf View scheduling (#1928)
c668e13aea0cf21d40f95b48e0163b812712cdf2 Upstream push ci fixes (#1965)
c40202bb40ce955955bb97b12762ef3b6b612997 Fix dump effective bandwidth (#1962)
93505bcbb90a7849bd67090fe5708d867e8909e4 WAR on index mapping when exact and permissive maps differ (#1960)
45e95fd1d3c773ee9b2a21d79624c279d269da9f Allow splitting inner-most ID to create virtual innermost ID in transpose scheduler (#1930)
a3ecb339442131f87842eb56955e4f17c544e99f Improve the comments at the beginning of index_compute.h (#1946)
f7bc3417cc2923a635042cc6cc361b2f344248d6 Remove unused variables (#1955)
df3393adbb5cb0309d091f358cfa98706bd4d313 Some cleanup (#1957)
7d1d7c8724ab5a226fad0f5a80feeac04975a496 TVDomainGuard factory (#1953)
357ba224c0fb41ed3e4e8594d95599c973f4a0ca Fill allocation with nan on tests (#1956)
8eafc54685d406f5ac527bcbacc475fda4492d7a Fix detection of unmappable root domains (#1952)
90a51f282601ba8ebd4c84b9334efd7762a234bc Some indexing cleanups, Add eye support (#1940)
ddc01e4e16428aec92f9c84d698f959b6436a971 Exclude unsupported data types (#1951)
992e17c0688fe690c51b50e81a75803621b7e6aa test the groups the same order as they are merged (#1949)
208262b75d1fed0597a0329d61d57bc8bcd7ff14 Move detection of self mapping IDs to IterDomainGraph from (#1941)
ac4de38c6ee53b366e85fdfe408c3642d32b57df Merge pull request #1945 from csarofeen/master_merge_0828
631094891a96f715d8c9925fb73d41013ca7f2e3 Add full, full_like, zeros, zeros_like, ones, ones_like (#1943)
aab10bce4541204c46b91ff0f0ed9878aec1bfc4 Merge remote-tracking branch 'upstream/viable/strict' into HEAD
4c254c063bb55887b45677e3812357556a7aa80d Fix arange when step is negative (#1942)
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D40869846](https://our.internmc.facebook.com/intern/diff/D40869846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87779
Approved by: https://github.com/davidberard98
2022-11-04 20:04:34 +00:00
Digant Desai
0fc7de3986 [profiler] Add Linux Perf support (#87866)
* Add support to use Linux kernel perf subsystem via the profiler.
* For now the perf configurability is quite limited to just event names. Threading etc. to come later.
* Given we want to support variety of different cpu types, number of events list (in addition to the standard set of events) is also limited.
* Rather than failing with unsupported feature for non-Linux platforms, it returns zeros for all the event counts.
* For now, max event counts is capped at 4, time multiplexing is not allowed.
* Threadpool recreate hack is restricted to mobile only - need to add better support for threading in general

Differential Revision: [D40238033](https://our.internmc.facebook.com/intern/diff/D40238033/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40238033/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87866
Approved by: https://github.com/SS-JIA
2022-11-02 13:42:24 +00:00
Christian Puhrsch
6fe41e76a9 Create separate files for NT Unary, Binary and Matmul ops (#88091)
Improves code organization and code share.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88091
Approved by: https://github.com/drisspg
2022-10-31 20:10:07 +00:00
Edward Z. Yang
d3c01c722d Fix pybind11 problems with c10::SymInt unregistered (#88011)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88011
Approved by: https://github.com/weiwangmeta, https://github.com/albanD
2022-10-29 07:55:45 +00:00
Taylor Robie
fb64f7b804 [Profiler][Trivial] Move ID assignment code to data_flow.cpp (#87670)
ID assignment has become a very complex facet of the profiler. The existing code has grown organically as I've discovered various refinements and has become very difficult to understand or reason about. (With more complexity coming in https://github.com/pytorch/pytorch/pull/87133)

I want to take a step back and add some structure and additional comments to the ID assignment algorithm. Before I do, however, it's time to move it out of `collection.cpp` to a dedicated data flow file.

Differential Revision: [D40666360](https://our.internmc.facebook.com/intern/diff/D40666360/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40666360/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87670
Approved by: https://github.com/slgong-fb
2022-10-28 18:40:18 +00:00
Edward Z. Yang
1ff52225f1 Unify SymIntNode and SymFloatNode into SymNode (#87817)
This refactor was prompted by challenges handling mixed int/float
operations in C++.  A previous version of this patch
added overloads for each permutation of int/float and was unwieldy
https://github.com/pytorch/pytorch/pull/87722/  This PR takes a different
approach.

The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode.  This is type erased; we
no longer know statically at C++ if we have an int/float and have to test
it with the is_int()/is_float() virtual methods.  This has a number of
knock on effects.

- We no longer have C++ classes to bind to Python.  Instead, we take an
  entirely new approach to our Python API, where we have a SymInt/SymFloat
  class defined entirely in Python, which hold a SymNode (which corresponds
  to the C++ SymNode).  However, SymNode is not pybind11-bound; instead,
  it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode
  when it goes into C++.  This implies a userland rename.

  In principle, it is also possible for the canonical implementation of SymNode
  to be written in C++, and then bound to Python with pybind11 (we have
  this code, although it is commented out.)  However, I did not implement
  this as we currently have no C++ implementations of SymNode.

  Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
  code needs to know how to find these classes.  Currently, this is done
  just by manually importing torch and getting the attributes.

- Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now
  takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
  __torch_dispatch__ works.

Some miscellaneous improvements:

- SymInt now has a constructor that takes SymNode.  Note that this
  constructor is ambiguous if you pass in a subclass of SymNode,
  so an explicit downcast is necessary.  This means toSymFloat/toSymInt
  are no more.  This is a mild optimization as it means rvalue reference
  works automatically.

- We uniformly use the caster for c10::SymInt/SymFloat, rather than
  going the long way via the SymIntNode/SymFloatNode.

- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
  functions, pretty sure this doesn't do anything.

- guard_int is now a free function, since to guard on an int you cannot
  assume the method exists.  A function can handle both int and SymInt
  inputs.

- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
  ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
  plain methods; this is to help avoid confusion between the two types.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
2022-10-27 20:56:02 +00:00
Jiewen Tan
2205f56f46 [LTC] Remove lazy::View (#87822)
Summary:
This is the first part to remove the whole view and aliasing infrastructure in LTC, which is deprecated in favor of functionalization. It mainly removes things that use lazy::View.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87822
Approved by: https://github.com/JackCaoG, https://github.com/antoniojkim, https://github.com/wconstab
2022-10-27 20:39:30 +00:00
Antoni Viros i Martin
d94e33f041 Add support for .to() for NestedTensor backends (#87146)
Summary: This commit adds support for moving NestedTensors from CPU to GPU and back. The implementation includes requires implementing empty_like(), which is based on PR#83140.

Test Plan: Added a new unit test based on the unit test for the main .to() implementation. All unit tests must pass, as well as every sandcastle job.

Differential Revision: D40437585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87146
Approved by: https://github.com/drisspg
2022-10-20 03:46:50 +00:00
PyTorch MergeBot
8eb579e362 Revert "[Profiler] Move legacy profiler out of torch/csrc/autograd (#85512)"
This reverts commit 157a3d2a7c.

Reverted https://github.com/pytorch/pytorch/pull/85512 on behalf of https://github.com/DanilBaibak due to Due to files were deleted, the internal build failed. Please re-submit via codev.
2022-10-14 14:56:59 +00:00
Taylor Robie
157a3d2a7c [Profiler] Move legacy profiler out of torch/csrc/autograd (#85512)
The legacy profiler is an eyesore in the autograd folder. At this point the implementation is almost completely decoupled from the rest of profiler, and it is in maintaince mode pending deprecation.

As a result, I'm moving it to `torch/csrc/profiler/standalone`. Unfortuantely BC requires that the symbols remain in `torch::autograd::profiler`, so I've put some basic forwarding logic in `torch/csrc/autograd/profiler.h`.

One strange bit is that `profiler_legacy.h` forward declares `torch::autograd::Node`, but doesn't seem to do anything with it. I think we can delete it, but I want to test to make sure.

(Note: this should not land until https://github.com/pytorch/torchrec/pull/595 is landed.)

Differential Revision: [D39108648](https://our.internmc.facebook.com/intern/diff/D39108648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85512
Approved by: https://github.com/aaronenyeshi
2022-10-14 05:38:48 +00:00
Taylor Robie
35fb007749 [Profiler][Minor] Separate standalone profilers from the main PyTorch profiler. (#85511)
There are a number of instrumentation utils which have been added to the profiler toolkit. They are generally small and self contained, often wrapping vendor APIs. (NVTX, ITT)

They don't really interact with the much more expansive machinery of the PyTorch profiler beyond registration / unregistration, minor util sharing, and reusing the profiler base class. Just as in the case of stubs, it makes sense to group them in a dedicated subfolder.

Differential Revision: [D39108649](https://our.internmc.facebook.com/intern/diff/D39108649/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39108649/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85511
Approved by: https://github.com/albanD
2022-10-14 05:38:48 +00:00
Taylor Robie
b8f14b7877 [Profiler][Minor] Group and consolidate stub APIs (#85510)
There is a concept in profiler of a stub that wraps a profiling API. It was introduced for CUDA profiling before Kineto, and ITT has adopted it to call into VTune APIs. However for the most part we don't really interact with them when developing the PyTorch profiler.

Thus it makes sense to unify the fallback registration mechanism and create a subfolder to free up real estate in the top level `torch/csrc/profiler` directory.

Differential Revision: [D39108647](https://our.internmc.facebook.com/intern/diff/D39108647/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85510
Approved by: https://github.com/aaronenyeshi
2022-10-14 05:38:46 +00:00
Jason Ansel
f1fdb6efbd Manual changes for moving dynamo to core (#86621)
This is the subset of the changes in #86461 not auto-generated by `copy_to_core.sh`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86621
Approved by: https://github.com/albanD
2022-10-11 23:01:21 +00:00
Zachary DeVito
736adc0808 Memory snapshots from C++ (#86190)
Sometimes the driving process want to save memory snapshots but isn't Python.
Add a simple API to turn it on without python stack traces. It still
saves to the same format for the vizualization and summary scripts, using
the C++ Pickler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86190
Approved by: https://github.com/ezyang
2022-10-05 07:36:39 +00:00
Horace He
0e256c2550 removed compile cache and static argnums (#85783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85783
Approved by: https://github.com/wconstab
2022-09-28 08:33:59 +00:00
Howard Huang
ccac8d13d5 [3/N] [Dispatchable Collectives] Update broadcast_ with CPU and CUDA implementations (#83735)
### About this PR
* Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
* Add test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83735
Approved by: https://github.com/kwen2501
2022-09-28 03:24:06 +00:00
cpuhrsch
6a04df3ac8 Get flash_attn to compile for CUDA 11.6 linux nightly build (#84941)
This PR only attempts to get this code to compile for all archs so that we can dispatch to it in https://github.com/pytorch/pytorch/pull/84653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84941
Approved by: https://github.com/drisspg, https://github.com/malfet
2022-09-26 20:49:19 +00:00
nikitaved
12ae3bea43 Faster mul(sparse, sparse) with broadcasting in dense dims. (#85336)
This is a combo PR of https://github.com/pytorch/pytorch/pull/84929 and ~https://github.com/pytorch/pytorch/pull/83428~.

Preliminary benchmarks (square matrices of shape (n, n)).

<details>

<summary>Script</summary>

```python
import torch
import math
from IPython import get_ipython
from itertools import product, repeat
import pickle
from torch.utils.benchmark import Timer, Compare

torch.manual_seed(13)

problem_dims = (
    # n > nnz
    (10000, 100),
    (100000, 1000),
    (1000000, 10000),
    # n < nnz
    (10, 100),
    (10, 1000),
    (10, 10000),
    (100, 1000),
    (100, 10000),
    (1000, 10000),
    (1000, 100000),
    (1000, 1000000),
    #(1000000, 1000000000),
)

name = "PR"
device = "cuda"
results = []

for n, nnz in problem_dims:
    def gen_tensor(coalesce=False):
        shape = (n, n)
        nrows, ncols = shape
        rowidx = torch.randint(low=0, high=nrows, size=(nnz,), device=device)
        colidx = torch.randint(low=0, high=ncols, size=(nnz,), device=device)
        itemidx = torch.vstack((rowidx, colidx))
        xvalues = torch.randn(nnz, device=device)
        itemidx = torch.hstack((itemidx, itemidx))
        xvalues = torch.hstack((xvalues, xvalues))
        res = torch.sparse_coo_tensor(itemidx, xvalues, size=shape)
        if coalesce:
            return res.coalesce()
        else:
            return res

    for x_coalesce, y_coalesce in product(*repeat((True, False), 2)):
        x = gen_tensor(x_coalesce)
        y = gen_tensor(y_coalesce)
        smtp = "x * y"
        timer = Timer(smtp,
                      globals=globals(),
                      label="coo.mul",
                      description=f"{name}: mul, device: {device}",
                      sub_label=f"n={n}, nnz={nnz}, coalesce=({x_coalesce, y_coalesce})",
                      num_threads=torch.get_num_threads())
        results.append(timer.blocked_autorange())

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

with open(f"{name}_{device}_mul.pickle", 'wb') as f:
    pickle.dump(results, f)

```

</details>

<details>

<summary>Gather results</summary>

```python
import pickle
from torch.utils.benchmark import Timer, Compare

files = [
        "PR",
        "master"
        ]

device = 'cuda'

timers = []
for name in files:
    with open("{}_{}_mul.pickle".format(name, device), 'rb') as f:
        timers += pickle.load(f)

compare = Compare(timers)
compare.trim_significant_figures()
compare.print()

```

</details>

<details>

<summary>CUDA</summary>

```
[------------------------------------------------- coo.mul -------------------------------------------------]
                                                       |  PR: mul, device: cuda  |  master: mul, device: cuda
24 threads: -------------------------------------------------------------------------------------------------
      n=10000, nnz=100, coalesce=((True, True))        |             95          |                91
      n=10000, nnz=100, coalesce=((True, False))       |             87          |               242
      n=10000, nnz=100, coalesce=((False, True))       |             87          |               226
      n=10000, nnz=100, coalesce=((False, False))      |            130          |               371
      n=100000, nnz=1000, coalesce=((True, True))      |            100          |               521
      n=100000, nnz=1000, coalesce=((True, False))     |             90          |               649
      n=100000, nnz=1000, coalesce=((False, True))     |            100          |               659
      n=100000, nnz=1000, coalesce=((False, False))    |            200          |               781
      n=1000000, nnz=10000, coalesce=((True, True))    |            100          |              4861
      n=1000000, nnz=10000, coalesce=((True, False))   |            100          |              5012
      n=1000000, nnz=10000, coalesce=((False, True))   |             98          |              5010
      n=1000000, nnz=10000, coalesce=((False, False))  |            384          |              5174
      n=10, nnz=100, coalesce=((True, True))           |            100          |                79
      n=10, nnz=100, coalesce=((True, False))          |            100          |               221
      n=10, nnz=100, coalesce=((False, True))          |            100          |               221
      n=10, nnz=100, coalesce=((False, False))         |            100          |               350
      n=10, nnz=1000, coalesce=((True, True))          |            100          |               100
      n=10, nnz=1000, coalesce=((True, False))         |            100          |               240
      n=10, nnz=1000, coalesce=((False, True))         |            100          |               254
      n=10, nnz=1000, coalesce=((False, False))        |            100          |               392
      n=10, nnz=10000, coalesce=((True, True))         |            100          |               110
      n=10, nnz=10000, coalesce=((True, False))        |            110          |               286
      n=10, nnz=10000, coalesce=((False, True))        |            110          |               286
      n=10, nnz=10000, coalesce=((False, False))       |            271          |               455
      n=100, nnz=1000, coalesce=((True, True))         |            110          |               851
      n=100, nnz=1000, coalesce=((True, False))        |            110          |              1000
      n=100, nnz=1000, coalesce=((False, True))        |            110          |               990
      n=100, nnz=1000, coalesce=((False, False))       |            140          |              1124
      n=100, nnz=10000, coalesce=((True, True))        |            110          |              5137
      n=100, nnz=10000, coalesce=((True, False))       |            110          |              5391
      n=100, nnz=10000, coalesce=((False, True))       |            100          |              5405
      n=100, nnz=10000, coalesce=((False, False))      |            249          |              5539
      n=1000, nnz=10000, coalesce=((True, True))       |            100          |              8598
      n=1000, nnz=10000, coalesce=((True, False))      |            100          |              8800
      n=1000, nnz=10000, coalesce=((False, True))      |            100          |              8782
      n=1000, nnz=10000, coalesce=((False, False))     |            255          |              8956
      n=1000, nnz=100000, coalesce=((True, True))      |            120          |             84500
      n=1000, nnz=100000, coalesce=((True, False))     |            200          |             88560
      n=1000, nnz=100000, coalesce=((False, True))     |            160          |             89000
      n=1000, nnz=100000, coalesce=((False, False))    |            373          |             89000
      n=1000, nnz=1000000, coalesce=((True, True))     |            312          |            606400
      n=1000, nnz=1000000, coalesce=((True, False))    |           1340          |            609200
      n=1000, nnz=1000000, coalesce=((False, True))    |           1340          |            609100
      n=1000, nnz=1000000, coalesce=((False, False))   |           4408          |            611400

Times are in microseconds (us).
```

</details>

<details>

<summary>CPU</summary>

```
[------------------------------------------------ coo.mul ------------------------------------------------]
                                                       |  PR: mul, device: cpu  |  master: mul, device: cpu
24 threads: -----------------------------------------------------------------------------------------------
      n=10000, nnz=100, coalesce=((True, True))        |              8         |                8
      n=10000, nnz=100, coalesce=((True, False))       |             32         |               34
      n=10000, nnz=100, coalesce=((False, True))       |             32         |               34
      n=10000, nnz=100, coalesce=((False, False))      |             41         |               56
      n=100000, nnz=1000, coalesce=((True, True))      |             24         |               24
      n=100000, nnz=1000, coalesce=((True, False))     |             90         |              100
      n=100000, nnz=1000, coalesce=((False, True))     |             87         |              100
      n=100000, nnz=1000, coalesce=((False, False))    |            231         |              255
      n=1000000, nnz=10000, coalesce=((True, True))    |            190         |              200
      n=1000000, nnz=10000, coalesce=((True, False))   |            908         |             2023
      n=1000000, nnz=10000, coalesce=((False, True))   |            800         |             2036
      n=1000000, nnz=10000, coalesce=((False, False))  |           3684         |             3989
      n=10, nnz=100, coalesce=((True, True))           |              8         |                7
      n=10, nnz=100, coalesce=((True, False))          |             34         |               30
      n=10, nnz=100, coalesce=((False, True))          |             33         |               30
      n=10, nnz=100, coalesce=((False, False))         |             44         |               50
      n=10, nnz=1000, coalesce=((True, True))          |              8         |                7
      n=10, nnz=1000, coalesce=((True, False))         |            100         |              100
      n=10, nnz=1000, coalesce=((False, True))         |            130         |              100
      n=10, nnz=1000, coalesce=((False, False))        |            746         |              210
      n=10, nnz=10000, coalesce=((True, True))         |              8         |                7
      n=10, nnz=10000, coalesce=((True, False))        |           1000         |             1500
      n=10, nnz=10000, coalesce=((False, True))        |           1000         |             1510
      n=10, nnz=10000, coalesce=((False, False))       |           3063         |             2457
      n=100, nnz=1000, coalesce=((True, True))         |             25         |               25
      n=100, nnz=1000, coalesce=((True, False))        |            180         |              130
      n=100, nnz=1000, coalesce=((False, True))        |            200         |              130
      n=100, nnz=1000, coalesce=((False, False))       |            271         |              255
      n=100, nnz=10000, coalesce=((True, True))        |            100         |              100
      n=100, nnz=10000, coalesce=((True, False))       |           2444         |             2290
      n=100, nnz=10000, coalesce=((False, True))       |           2455         |             2357
      n=100, nnz=10000, coalesce=((False, False))      |           5316         |             3783
      n=1000, nnz=10000, coalesce=((True, True))       |            204         |              211
      n=1000, nnz=10000, coalesce=((True, False))      |           2457         |             2480
      n=1000, nnz=10000, coalesce=((False, True))      |           2448         |             2539
      n=1000, nnz=10000, coalesce=((False, False))     |           3665         |             4801
      n=1000, nnz=100000, coalesce=((True, True))      |           2293         |             2374
      n=1000, nnz=100000, coalesce=((True, False))     |           9000         |            24620
      n=1000, nnz=100000, coalesce=((False, True))     |           8000         |            25080
      n=1000, nnz=100000, coalesce=((False, False))    |          26500         |            47650
      n=1000, nnz=1000000, coalesce=((True, True))     |          10000         |            13000
      n=1000, nnz=1000000, coalesce=((True, False))    |          80000         |           362200
      n=1000, nnz=1000000, coalesce=((False, True))    |          78050         |           392600
      n=1000, nnz=1000000, coalesce=((False, False))   |         312100         |           766900

Times are in microseconds (us).
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85336
Approved by: https://github.com/cpuhrsch
2022-09-23 23:31:19 +00:00
jjsjann123
0e582fbfcc [NVFuser] Upstream push 0907 (#84626)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
i. improved view support on pointwise and transpose scheduler
ii. grouped grid welford added for better outer-norm grid persistence in normalization

- misc:
i. new composite ops added: variance_mean , arange,
ii. fixes misaligned address for transpose scheduler
iii. refactor on separation of compilation API from execution API to prepare us for async compilation
iv. double type support on expression evaluator
v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits that's in this PR from the devel branch:
```
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
b2fd01ea9346712c6d6f623ca6addbc4888d008e arange support (#1933)
56c00fd3922dad7dfc57351ad7d780f0f2f8e4ed Double support on all expression evaluators (#1937)
371f28223e57fe3f6b5e50a0a45177e6a5c0785c Improve trivial reduction merge support (#1931)
1d0c26790e5647920b40d419d26815bbe310b3a6 Test `rand` in a fusion with zero tensor input (#1932)
0dab160fb2177d178eef3148c6a529e0855009e9 Fix softmax bwd sizes. (#1890)
ef98f360f6d3e3e1cc662ecb65202d88150f128d Fix a bug (#1936)
63132a0c56508c550084b07fb76a3df865102d00 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c88d78078ee4d8b21c4fc51645b5710a282 Map IterationDomains through view operations. (#1919)
c0a187a7619d7cf9dc920294e15461791e8d6d4d do not use deprecated functions (#1935)
88de85e758c5e4afb7b6e746573c0d9a53b4cea7 Upstream cherry pick fixes 0811 (#1934)
b247dcf7c57dc6ac3f7a799b0a6beb7770536a74 Separate kernel compilation API from kernel execution API (#1914)
b34e3b93ee1a8030730c14af3995dd95665af07d Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6707f43bf760494c238a46386d69830822 Nullary RNGOp (#1892)
3c3c89e638f5172cafb0761f22bacd1fd695eec3 Misc fixes/tuning for transpose scheduler (#1912)
20cf109c8b44d48f61977e35bae94368985144ac Grouped grid welford (#1921)
6cf7eb024c9e53c358cbe56597e117bad56efefd Transpose scheduler small dim sizes better support (#1910)
9341ea9a5bf42f9b14ccad0c94edbc79fc5bb552 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f66deeea816bb943d802a97c1b7e4414ab Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80339e4f794767a53eb8fdd61e64cf404a2 Add variance_mean function using Welford (#1907)
98febf6aa3b8c6fe4fdfb2864cda9e5d30089262 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33a5591b534cf587d347af11e48ba7a15d4 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953351f9dabfd1f991d8431cecb6c2ce684d Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a72385f6a4b145680a9dcc52d7e8293763 dopt is only available since nvrtc 11.7 (#1915)
2ec8fc711eafc72451eebf0f5e2a98a38bf3f6ef Kill computeAtBetween (#1911)
d0d106a1d9af118d71673173674e875be35d259d Improve view support on pointwise and transpose scheduler (#1906)
e71e1ecefe67219846070590bbed54bbc7416b79 Fix name clash of RNG with shared memory (#1904)
3381793a253689abf224febc73fd3fe2a0dbc921 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84626
Approved by: https://github.com/malfet
2022-09-23 20:29:48 +00:00
Richard Zou
5e5c319549 Move functorch python bindings to torch/csrc (#85426)
This moves functorch's python bindings to torch/csrc/functorch/init.cpp.
Coming next is the torchdim move. I didn't do torchdim yet because
moving functorch's python bindings unblocks some other things that I
want to do first.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85426
Approved by: https://github.com/ezyang
2022-09-22 18:47:12 +00:00
Mikayla Gawarecki
77f1f98479 Re-introduce torch.Tensor.to_padded_tensor (#85293)
Differential Revision: [D39629004](https://our.internmc.facebook.com/intern/diff/D39629004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85293
Approved by: https://github.com/cpuhrsch
2022-09-21 18:45:56 +00:00