Commit Graph

1178 Commits

Author SHA1 Message Date
xinan.lin
854d83133b [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR add XPU support for AOT Inductor, and reuse the corresponding UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268
2024-12-07 19:22:04 +00:00
eqy
8fc6d3a5d8 [SDPA] Allow user-specified priority order with context manager (#140467)
TODO: docs changes?
For better debuggability of issues like https://github.com/pytorch/pytorch/issues/139298

Better testing, current sketch:

``` Python
import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(64, 1024, 8, 64, dtype=torch.half, device='cuda')
print(torch._C._get_sdp_priority_order())

orders = [[SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH]]
import time
times = list()
for order in orders:
    print(order)
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    times.append(t1 - t0)
print(times)
assert times[0] < times[1]
assert times[0] > times[2]
assert times[1] > times[2]
print(torch._C._get_sdp_priority_order())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140467
Approved by: https://github.com/drisspg
2024-12-06 07:56:35 +00:00
Yukio Siraichi
12d28a5929 Move overlapping guards to C++. (#140013)
This PR moves the logic for computing the overlapping relations between input tensors that
share a storage instance to C++.

In summary, this PR:

- Moves both `tensors_definitely_do_not_overlap` and part of `compute_overlapping_tensors`
to C++
- Introduces a `check_overlapping` function that re-runs `compute_overlapping_tensors`,
checking that the result is consistent with what is expected
- Introduces the `StorageOverlapChecker` class
    - Keeps track of overlapping and non-overlapping tensors
    - Actually checks the overlapping relation (call `check_overlapping`) when all tensors
    are collected
- Introduces the `STORAGE_OVERLAPPING` relational guard
    - Has a reference to a `StorageOverlapChecker`
    - Stores the to-be-checked tensors in the checker, and triggers its check
- Introduces `install_storage_overlapping_guard` python function
    - Creates an instance of `StorageOverlapChecker`
    - Creates 2 instances of the `STORAGE_OVERLAPPING` guard (for overlapping and
    non-overlapping tensors), referencing the same `StorageOverlapChecker` instance

**Why is `StorageOverlapChecker` needed?**

The way `GuardManager` is implemented, we have no control over the order in which the
check methods are called, i.e. no control over the order the tensors are collected. So, we
can't easily split them in "overlapping" and non-overlapping kinds.

Instead, we create 2 instances of `STORAGE_OVERLAPPING` guard, each of which helps
collecting the tensors for one of the kinds mentioned above. They are then used in a
single `StorageOverlapChecker` instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140013
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555
2024-12-05 14:43:58 +00:00
Yu, Guangye
8dd4673cea Support torch.xpu.mem_get_info API (#141230)
# Motivate
Fix https://github.com/pytorch/pytorch/issues/130599
This PR intends to add a new API, `torch.xpu.mem_get_info,` which is widely used in popular model workloads.
For example, [here](403c0714d1/src/accelerate/utils/modeling.py (L721)) we need to get current GPU memory usage to split or load the model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-05 08:17:25 +00:00
Benjamin Glass
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
Hyunho Yeo
d70b7029c8 [MTIA] Support torch.mtia.empty_cache() (#141533)
Summary: As title

Test Plan:
Passed a local unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testrun/4785074861101240

Reviewed By: nautsimon

Differential Revision: D66481778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141533
Approved by: https://github.com/nautsimon
2024-11-28 02:24:19 +00:00
Simon Fan
db4e8a1d8a [ca] expose option to collect sizes as dynamic (#141153)
This is to address recompiles from eager nodes that saved dynamic activations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141153
Approved by: https://github.com/jansel
ghstack dependencies: #141152
2024-11-22 19:26:27 +00:00
Animesh Jain
fb529c2c84 [dynamo] skip_guard_eval_unsafe stance for power users (#140251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140251
Approved by: https://github.com/jansel
ghstack dependencies: #140223, #140250
2024-11-21 06:28:58 +00:00
Syed Tousif Ahmed
e0482fdf95 Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-21 01:40:11 +00:00
Animesh Jain
9d229f08f4 [dynamo][guards] Introduce a diff_guard_manager (#140250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140250
Approved by: https://github.com/jansel
ghstack dependencies: #140223
2024-11-20 17:59:30 +00:00
FFFrog
a9e54f64ee Remove unused Python API named _set_torch_function_mode (#141023)
Detailed description:

As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141023
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-20 09:48:03 +00:00
PyTorch MergeBot
9fac5a16fd Revert "[PGNCCL] Add an API to get the status/error code of each PG (#140087)"
This reverts commit 80aa19a622.

Reverted https://github.com/pytorch/pytorch/pull/140087 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/140087#issuecomment-2486912231))
2024-11-19 22:53:46 +00:00
PyTorch MergeBot
496c1e78c5 Revert "Implements user buffer registration using MemPool (#133603)"
This reverts commit 25d9be37be.

Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))
2024-11-19 22:42:26 +00:00
Animesh Jain
f4ce9ac29d [dynamo] Dont erase the cache line on invalidation (#140821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140821
Approved by: https://github.com/jansel
2024-11-19 19:11:10 +00:00
Yifu Wang
ab5c8857ef [SymmetricMemory] support specifying group_name at rendezvous time (#139529)
Before this PR, users need to call `empty_strided_p2p()` with a `group_name`:

```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device, group_name="0")
symm_mem = _SymmetricMemory.rendezvous(tensor)
```

Users can now omit `group_name` at allocation time and specify it later at rendezvous time:

```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device)
symm_mem = _SymmetricMemory.rendezvous(tensor, group_name="0")
```

Rationales for this change:
- This allows the same allocation to establish symmetric memory under different groups
- Specifying `group_name` at rendezvous time instead of allocation time is a more natural UX

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139529
Approved by: https://github.com/lw
2024-11-17 09:31:17 +00:00
Ryan Guo
ac6684ebbc [dynamo] Identify pre-existing captured cells by cell id rather than content id (#140436)
In `match_nested_cell`, Dynamo tried to identify pre-existing captured
cells by `(cell_name, id(cell_contents))`. This works in most cases, but
as the test added in this patch shows, it's not a complete solution.

This patch
1. changes `match_nested_cell` to `lookup_variable_for_captured_cell`,
   and does the lookup based on id of cell objects, not their contents.
   This requires plumbing a tuple of captured cell objects from
   different CPython versions all the way to
   `InstructionTranslator.__init__`, where we store a mapping from the
   ids of these cell objects, and use it later in
   `UserFunctionVariable.bind_args` to look for these unboxed cells.
2. builds off (1) -- rather than using a `VariableTracker` that
   represents the content of the unboxed cells, use `ClosureVariable`,
   which enables codegen in case these cells escape as closure of a
   `NestedUserFunctionVariable`.

The patch adds a regression test for each of the scenarios above:
1. `test_write_to_cells_with_name_shadowing` where Dynamo mistakenly
   thought the program is writing to a cell captured by root frame (which
   it doesn't support atm), which resulted in
```
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/symbolic_convert.py", line 3340, in STORE_DEREF
    unimplemented("write to __closure__ while inlining")
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/exc.py", line 313, in unimplemented
    raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: write to __closure__ while inlining
```
2. `test_existing_func_that_creates_capturing_nested_func` where Dynamo
   ended up trying to codegen a `NestedUserFunctionVariable` that
   captures a cell which was also captured by the root frame, so it was
   unboxed and ends up emitting `LOAD_DEREF` rather than
   `LOAD_FAST/LOAD_CLOSURE` during codegen, resulting in
```
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/variables/functions.py", line 105, in _create_nested_fn
    func = FunctionType(code, f_globals, name, defaults, closure)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: arg 5 (closure) expected cell, found int
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140436
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #140330, #140152
2024-11-15 17:17:30 +00:00
Ryan Guo
85dd7b84cf [dynamo] Add a DynamoFrameType type above Python frame object (#140330)
This patch introduces a `DynamoFrameType` to serve as a layer between
Dynamo and different versions of Python frame object. In
`DynamoFrameType`, we only register attributes Dynamo cares about (e.g.,
`f_code`, `f_locals`, etc.

This will be helpful when it comes to adding new attributes to this
`DynamoFrameType`, or dealing with Python version changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140330
Approved by: https://github.com/jansel, https://github.com/williamwen42
2024-11-15 17:17:30 +00:00
Syed Tousif Ahmed
25d9be37be Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-15 12:47:49 +00:00
Shuqiang Zhang
80aa19a622 [PGNCCL] Add an API to get the status/error code of each PG (#140087)
Summary:
If unhealthy, the user should be able to get the type of errors, e.g.,
timeout,nccl error or remote error.

This API is applied to PG level, compared to the work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle the PG failure as a whole, e.g, restarting the PG.

Error handling at the work level is still useful for users to attach work specific context and debug the RC of the specific failing work/collective

Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG, the broadcast is done through TCPStore currently

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087
Approved by: https://github.com/kwen2501
2024-11-15 04:11:00 +00:00
Zhengxu Chen
3ef2dfc1ba [export] Implement cpp deserializer. (#136398)
Differential Revision: D63206258

This diff introduces a mechanism to generate a json-compatible deserializer in cpp using nlohmann json (already being used by AOTI).

Why we need this? Because there will be a lot of cases where people don't want to use Python to load the graph (e.g. cpp runtime), and instead they can use this header to deserialize the JSON graph.

Every time we call update_schema.py to update the schema, the header will be auto generated and included into the source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136398
Approved by: https://github.com/angelayi
2024-11-14 16:34:59 +00:00
Aaron Orenstein
82597d07aa type annotations for meta_utils (#140203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140203
Approved by: https://github.com/ezyang
2024-11-13 20:07:47 +00:00
Yu, Guangye
659d2132be Add architecture to XPU device property (#138186)
# Motivation
Add `architecture` to XPU device property.
In some cases, low-level application code can use special features or do specific optimizations depending on the device architecture, and this PR enables such applications.
Modified from https://github.com/pytorch/pytorch/pull/129675/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138186
Approved by: https://github.com/ezyang
2024-11-13 03:35:13 +00:00
Yifu Wang
0a0915fb5e [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 18:49:22 +00:00
PyTorch MergeBot
5f4a21dc58 Revert "[SymmetricMemory] improve the API for stream_write_value32 (#139934)"
This reverts commit 2f3a5a15ef.

Reverted https://github.com/pytorch/pytorch/pull/139934 on behalf of https://github.com/malfet due to Broke distributed tests, see https://github.com/pytorch/pytorch/actions/runs/11770673088/job/32784210441 ([comment](https://github.com/pytorch/pytorch/pull/139934#issuecomment-2468641512))
2024-11-11 17:02:07 +00:00
Yifu Wang
2f3a5a15ef [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 01:54:35 +00:00
Yifu Wang
1659e241c8 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunks_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consumes the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an interger multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Differential temp Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-08 23:28:25 +00:00
Animesh Jain
e6c5a77485 [dynamo][guards] Profile guard manager in C++ (#140110)
This should remove the pybind noise from the profiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140110
Approved by: https://github.com/jansel
ghstack dependencies: #139953
2024-11-08 18:44:08 +00:00
Luca Wehrstedt
5f287df422 Add type information for FakeProcessGroup (#133211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133211
Approved by: https://github.com/Skylion007
2024-11-08 11:18:52 +00:00
PyTorch MergeBot
36e0f119d0 Revert "[experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)"
This reverts commit 5203138483.

Reverted https://github.com/pytorch/pytorch/pull/139227 on behalf of https://github.com/yifuwang due to Need to address internal build failure D65605027 ([comment](https://github.com/pytorch/pytorch/pull/139227#issuecomment-2463204467))
2024-11-07 21:01:36 +00:00
Yifu Wang
5203138483 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunks_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consumes the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an interger multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-07 03:43:12 +00:00
CaoE
9e14d86573 [Inductor][CPP] Add oneDNN BRGEMM config for Half cpp gemm template (#136255)
`kernel_micro_gemm` generated using BRGEMM:
```
template <bool accum>
inline void kernel_micro_gemm(
    const half* __restrict__ A,
    const half* __restrict__ B,
    float* __restrict__ C,
    int64_t M,
    int64_t N,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc
) {
    at::native::cpublas::brgemm(
      M, N, K,
      lda, ldb, ldc,
      1.f, accum ? 1.f : 0.f,
      A,
      B,
      C);
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136255
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 05:33:29 +00:00
Shivam Raikundalia
ac7acfb894 [Profiler] Create Auto-Trace Frontend for Trace ID (#139310)
Summary:
This PR adds Auto-Trace implementation for Trace ID. By default, the python side will generate a uuid in the same format as the one set in the backend by kineto. Upon running an auto-trace, the python generated trace id will overwrite the one set in kineto using the Config variable. Since we don't expect users to generate on-demand traces after an auto-trace we can simply keep overwriting the backend trace id whenever autotrace is ran. If we one day want to eventually do something like this, we simply have to add a call in kineto on the backend to generate a new ID upon start of profiling.

We also implement a custom callback in the frontend such that users can generate their own trace ids if they wish to. This works similarly as the default, only difference being that they have to manually set this callback after a profiler is generated. We use a specific call to set this rather then putting it in the frontend initializer in case users want to change the trace_id for different repeats.

Test Plan: Tested both default and custom callbacks using the verbose prints added. Trace ids on the frontend and the prints on the backend for the manifold upload matched.

Differential Revision: D65178308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139310
Approved by: https://github.com/shengfukevin
2024-10-31 19:02:57 +00:00
Richard Zou
ef380f7b8e [real tensor prop] Add some asserts for custom ops (#139212)
When we see a custom op:
- check that its mutation annotations are correct
- check that its aliasing constraints matches our constraints for custom
  ops.

Otherwise, there may be undefined behavior.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139212
Approved by: https://github.com/angelayi
2024-10-30 19:29:11 +00:00
Will Feng
4ee514144b [c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

----

**Update**: Did two items to prevent regression to existing use cases:

1. Added memory-stressed test case to test_c10d_nccl.py `test_unwaited` to cover existing user's "not calling work.wait() for non-functional collective" use case
2. Gated all new `register_work()` / `unregister_work()` calls with `c10d::allow_inflight_collective_as_graph_input()` check, which is a new context manager that requires explicit user enablement (i.e. not on by default, so should not affect existing users).

The risk of this new version of PR causing regression should be very low.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-29 03:31:19 +00:00
Simon Fan
fd9f4e6770 Back out "[compiled autograd] tls access helpers (#138061)" and Back out "[compiled autograd] Compiled autograd configs in TLS (#137821)" (#139086)
Summary:
Original commit changeset: 9bf80c1492d7

Original Phabricator Diff: D64796226

Original commit changeset: aa1d9ef8f6e6

Original Phabricator Diff: D64796212

Differential Revision: D65072644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139086
Approved by: https://github.com/malfet
2024-10-28 23:37:05 +00:00
William Wen
35be6aef69 [dynamo] add some cpython debugging methods (#138030)
This PR enables you to inspect PyObjects in C using `INSPECT(...)` without requiring https://docs.python.org/3/howto/gdb_helpers.html. `torch._dynamo.eval_frame.raise_sigtrap` can also be used to set gdb breakpoints while running Python code, e.g.

```python
x = x + 1
torch._dynamo.eval_frame.raise_sigtrap();
# can breakpoint on ceval.c:CALL to breakpoint the `sin` call in C.
x = torch.sin(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138030
Approved by: https://github.com/jansel
2024-10-28 22:25:21 +00:00
PyTorch MergeBot
e5595f10c8 Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)"
This reverts commit a688c57033.

Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/yf225 due to Seems to have bad interaction with latest commits on trunk, reverting to be safe ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2442527696))
2024-10-28 20:13:46 +00:00
Will Feng
a688c57033 [c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-28 18:11:23 +00:00
Wouter Devriendt
bae3426af7 reimport pr137735 due to merging check issues (#138959)
This is  a cherry-pick from #137735 by @mikaylagawarecki , that cannot be merged due to a (wrongly) failing check for codev

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959
Approved by: https://github.com/mikaylagawarecki
2024-10-27 16:31:34 +00:00
Yu, Guangye
40c098f731 Introduce a device-agnostic runtime API design (#132204)
# Motivation
According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design.
I personally prefer the **Simple Version** APIs that no longer accept the device type as an input argument. It means we will leverage `getAccelerator` to fetch the current accelerator. And it is flexible to expand these APIs to handle multiple types of accelerator scenarios. The design does **NOT** break the previous design philosophies.
I also believe that namespace torch.accelerator is better. It lets users know that the APIs they are calling are running on an accelerator rather than CPU. This is important. Meanwhile, we can follow a simple API design principle:
1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter.
2. Device-specific APIs should be placed under device-specific submodules.
3. APIS required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter.

Also, I list the pros and cons of **Simple Version** here:
Pros:
- `torch.accelerator.foo` will have the same input argument as `torch.xxx.foo`, bringing a better user experience;
- more concise, facilitate the developer to write a device-agnostic code.

Cons:
- no obvious drawbacks.

# Additional Context
I list the new APIs here:
```python
torch.accelerator.is_available() -> bool:
torch.accelerator.current_accelerator() -> torch.device:
torch.accelerator.device_count() -> int:
torch.accelerator.current_device_idx() -> int:
torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None:
torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream:
torch.accelerator.set_stream(stream: torch.Stream) -> None:
torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None:
```
According to the discussion with Alban, we decide to change the API name `set_device` to `set_device_idx` and `current_device` to `current_device_idx` for more explicit. And will submit other PR to support device and stream context manager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204
Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD
2024-10-27 10:37:09 +00:00
Yifu Wang
ed9169df98 Removed the typing information for already deleted ProcessGroupCudaP2P (#138753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138753
Approved by: https://github.com/weifengpy
2024-10-25 00:32:07 +00:00
Shuqiang Zhang
4c91481656 [c10d] allow sub group to be eagerly inited even if default one is not (#138665)
Summary:
Currently, eager mode is applied either to all PGs or NONE of them.
There are cases where we don't want to initialize the comms for default
PG, but we still want to initialize the comms for sub PG. Now with a
device_id passed to new group, we can achieve this case
Test Plan:
newly added UT

Tags:

Resolves https://github.com/pytorch/pytorch/issues/137018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138665
Approved by: https://github.com/kwen2501
ghstack dependencies: #138781
2024-10-24 23:51:28 +00:00
Simon Fan
49fa437097 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreaded doesn't work yet, this adds python side TLS only for the python side state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-22 08:03:52 +00:00
Syed Tousif Ahmed
03c72976a5 Properly uses ref-counting for torch.cuda.use_mem_pool (#133600)
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.

The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold a an extra reference as well.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-10-22 03:21:53 +00:00
Jeff Daily
3f3b692a00 [ROCm] CK-based GEMM (#131004)
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()`
- reference CK gemm implementations for float, bfloat16, and half types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony

Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-10-20 02:57:43 +00:00
Yu, Guangye
8cda774a03 Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773)
# Motivation
Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-10-18 02:28:08 +00:00
PyTorch MergeBot
dd32a32cb6 Revert "Expose option to disable CRC-32 computation during torch.save (#137735)"
This reverts commit 534fa96f2d.

Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))
2024-10-16 17:03:06 +00:00
PyTorch MergeBot
361f42bc42 Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)"
This reverts commit 9aba0b91c8.

Reverted https://github.com/pytorch/pytorch/pull/137821 on behalf of https://github.com/wdvr due to Reverting this for now, it is failing test_public_bindings in trunk ([comment](https://github.com/pytorch/pytorch/pull/137821#issuecomment-2417351788))
2024-10-16 16:38:29 +00:00
William Wen
4c8718d8e7 [dynamo] add torch.compiler.set_stance (#137504)
Attempt # 2 at https://github.com/pytorch/pytorch/pull/132926 to implement https://github.com/pytorch/pytorch/issues/123771.

Implement a new `torch.compiler.set_stance` function that can force `torch.compile` regions to run eagerly.

See added tests for usage examples.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137504
Approved by: https://github.com/yf225, https://github.com/jansel
2024-10-16 16:18:25 +00:00
Simon Fan
9aba0b91c8 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreaded doesn't work yet, this adds python side TLS only for the python side state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-16 09:28:32 +00:00
PyTorch MergeBot
4557f6e339 Revert "[Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)"
This reverts commit bf0b670598.

Reverted https://github.com/pytorch/pytorch/pull/137669 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing test_public_bindings in trunk, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/137669#issuecomment-2415331274))
2024-10-15 23:22:58 +00:00
Michael Lazos
bf0b670598 [Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)
Fixes https://github.com/pytorch/pytorch/issues/114369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669
Approved by: https://github.com/anijain2305
2024-10-15 20:52:58 +00:00
Mikayla Gawarecki
534fa96f2d Expose option to disable CRC-32 computation during torch.save (#137735)
Option only works in open source, not internal

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735
Approved by: https://github.com/albanD
2024-10-15 19:30:02 +00:00
Ke Wen
fe148024fe [c10d][experimental] Add _abort_process_group (#132291)
Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797

This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such).
One can find an example from: https://github.com/NVIDIA/nccl/issues/1013

## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`.

## What's new in `_abort_process_group`?
It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). `Group abort` semantic is added in NCCL 2.22.

## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs).

In any case, it may not be a bad idea to experiment the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291
Approved by: https://github.com/eqy
2024-10-11 05:04:17 +00:00
Aditya Tewari
575f260229 Extend vectorization with SVE(ARM) with Torch Compile (Inductor) (#134672)
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571

* This PR enables vectorization for codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths

I've done some comparisons against existing NEON implementation with SVE vectorization enabled route for `torch.compile`
Test results are for 8 cores on ARM Neoverse_V1

<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">

It's worth mentioning, for standalone `SiLU op` there's a `~1.8x` speedup with `torch.compile`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-10 13:20:40 +00:00
Michael Lazos
38afac2917 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-09 02:29:40 +00:00
eellison
8893881867 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-10-09 00:05:52 +00:00
PyTorch MergeBot
e5f9131327 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)"
This reverts commit f9d69cde88.

Reverted https://github.com/pytorch/pytorch/pull/137116 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
PyTorch MergeBot
796c3c3415 Revert "Disallow FakeTensor.data_ptr access in eager mode (#137221)"
This reverts commit 7e13e7dd7e.

Reverted https://github.com/pytorch/pytorch/pull/137221 on behalf of https://github.com/jovianjaison due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/137221#issuecomment-2397957081))
2024-10-07 21:46:13 +00:00
Michael Lazos
f9d69cde88 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-07 18:55:26 +00:00
Jeff Daily
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
rzou
7e13e7dd7e Disallow FakeTensor.data_ptr access in eager mode (#137221)
Previously we raised a deprecation warning (beginning PyTorch 2.4). Now
that we are on 2.6, we're completing the deprecation and disallowing
this behavior.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137221
Approved by: https://github.com/albanD, https://github.com/eellison
2024-10-03 23:47:55 +00:00
PyTorch MergeBot
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 7001907480.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
Jeff Daily
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
William Wen
2157e396a3 [dynamo] attempt run only mode when dynamo cache limit is hit (#136655)
Implement https://github.com/pytorch/pytorch/issues/135458.

Try run-only mode when dynamo cache limit is hit. If no valid cache entries are found, then skip code recursively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136655
Approved by: https://github.com/jansel
2024-09-27 17:15:05 +00:00
Animesh Jain
289df45cee Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)" (#136590)
This reverts commit 7743149b2b.

Reverts
* https://github.com/pytorch/pytorch/pull/135503
* https://github.com/pytorch/pytorch/pull/135502
* https://github.com/pytorch/pytorch/pull/135422

This passes this test. Earlier, the getitem would stay like a getitem in the Fx graph. But now the fake tensor propagations fails saying that .item is called. It seems that torch function is not getting triggered while fake tensor propagation.

```
import torch
from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops
from torch.nn.attention.flex_attention import create_block_mask

torch.set_default_device('cuda')

flex_attention = torch.compile(flex_attention, dynamic=False)

prefix_lengths = torch.arange(8)
def prefix_lm(b, h, q, kv):
    return prefix_lengths[b] >= kv

mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136590
Approved by: https://github.com/Chillee
2024-09-25 21:10:43 +00:00
Jianyu Huang
0a35986cdb Add option to configure reduced precision math backend for SDPA (#135964)
Summary: Address https://github.com/pytorch/pytorch/issues/135778 by adding a global flag to configure whether using high precision or low precision for math backend of SDPA.

Test Plan: buck2 run mode/opt //scripts/feikou/llm:run_attn_kernels

Differential Revision: D62625515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135964
Approved by: https://github.com/jbschlosser
2024-09-24 07:11:38 +00:00
PyTorch MergeBot
783c5ba80a Revert "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)"
This reverts commit 0b81f700aa.

Reverted https://github.com/pytorch/pytorch/pull/132765 on behalf of https://github.com/ezyang due to implementation is not correct, needs full rewrite ([comment](https://github.com/pytorch/pytorch/pull/132765#issuecomment-2364160452))
2024-09-20 17:10:27 +00:00
Trung Truong
cc365fdd7b [MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889)
Summary:
Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn

At the moment, both the major and minor version are just 0

Test Plan:
Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/

Differential Revision: D62595296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889
Approved by: https://github.com/egienvalue
2024-09-17 17:42:56 +00:00
fduwjj
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
Michael Lazos
8df01c8258 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 18:52:22 +00:00
PyTorch MergeBot
72b868d034 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit e77bd0ebd2.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:54 +00:00
Michael Lazos
e77bd0ebd2 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 02:41:16 +00:00
PyTorch MergeBot
dc71e7a7d4 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit c56728b643.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
Michael Lazos
c56728b643 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-13 08:41:32 +00:00
angelayi
cd9ee49a69 [aoti] Add cpp loader (#135374)
* Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python...
* Added a new config, `aot_inductor.package_cpp_only` which will **not** package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users.
* Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config.
* Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`.
* `load_package` will load a singular model, given the model name.
* The loader doesn't support windows for now, I think I need to add some more casing to make the build commands work on windows?

Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374
Approved by: https://github.com/desertfire, https://github.com/malfet
2024-09-11 03:00:01 +00:00
Yu, Guangye
b53d97c7be [Intel GPU] Add XPU memory-related APIs (#129919)
# Motivation
According to https://github.com/pytorch/pytorch/issues/116322, we will help unify the device allocator. So we introduce a simple xpu device allocator only with the key functionality first. And expect to add some memory statistics-related functionality after the unification.
But now, some memory statistic-related APIs listed in https://github.com/pytorch/pytorch/issues/127929 are requested. We need more time to unify the device allocator. In order to facilitate the user experience, we expect to support these memory statistic-related APIs before the unification.

# Additional Context
Fixes: #127929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919
Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #130923
2024-09-07 11:15:17 +00:00
William Wen
95e976a63f [dynamo] recursively skip frames when Dynamo cache limit is hit (#135144)
Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723).

In general, adds `SkipCodeRecursiveException` to Dynamo - when raised in Dynamo, convert_frame will return a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-06 21:38:53 +00:00
sanchitintel
43dcb4bb61 Revise CPU vectorization ISA support API (#135075)
Revising (mostly renaming) CPU vectorization ISA support API (non-frontend-user-facing). Also added AVX512_BF16 ISA detection API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135075
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/ezyang
2024-09-05 12:14:56 +00:00
PyTorch MergeBot
351ba3e67c Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"
This reverts commit 65864d0134.

Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))
2024-08-30 16:27:40 +00:00
fduwjj
65864d0134 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
2024-08-29 22:40:12 +00:00
Syed Tousif Ahmed
4655eb3ee2 Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685)
Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685
Approved by: https://github.com/ezyang
2024-08-29 03:56:31 +00:00
Laith Sakka
d6091c8726 Add compile time instruction count metric (#133834)
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
as of this diff, compile_time_instruction_count counts the number of instruction from within
convert_frame.compile_inner
```
update_hint_regression,compile_time_instruction_count,10522459165
```
 will add result from CI once populated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
2024-08-27 23:29:02 +00:00
Shivam Raikundalia
0b81f700aa [PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)
Summary:
We want to add compile IDs and frames to each Torch-Compiled Region in order to help users cross reference the section they are checking alongside data obtained from tools, such as tlparse.
This diff operates on the assumption that each graph section will enter and exit a CompileContext before it is ran to either compile the graph or look it up in the cache. Based on this assuption, we can save the value of the graph section from the exited CompileContext in eval_frame.c using a Python C API. After this, we can create a new interface in cpp shim to wrap around the record_function in order to pass in the new keyword argument for "context".

Test Plan:
Enhance test_profiler_dynamo_compiled_region to look for kwinputs as well as a name to see that the context is now labeled. Also changed test to run graph with more contexts so that we test a wider range of profiling.

Differential Revision: D60803317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132765
Approved by: https://github.com/anijain2305
2024-08-27 04:55:04 +00:00
FFFrog
a5ef04a3b8 add relevant function (#133946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133946
Approved by: https://github.com/ezyang
2024-08-21 23:04:59 +00:00
Jesse Cai
255cd75a97 [sparse] Add cuSPARSELt as a backend (#128534)
Summary:

This PR adds in cuSPARSELt as a backend to PyTorch.

It is now possible to see if cuSPARSELt is available and the version if
it is with
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```

Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed
2024-08-21 22:06:07 +00:00
Michael Lazos
48ee0984ac Add C API to return all torch function disablement status (#133136)
This PR adds a C function to check if all torch function is disabled.
Recall that there are three torch function enablement states:
* All disabled
* Torch Function Subclass disabled
* All enabled

The API before this change provides two functions:
* `_is_torch_function_enabled` - returns True iff the current TF state is All enabled
* `_is_torch_function_mode_enabled` - returns True iff the state is not All disabled and the torch function mode stack is non-empty.

The crux of why a new API is needed is the following: If dynamo enters a frame with the torch function mode stack empty, `_is_torch_function_enabled` == False, it is impossible to determine if after a new mode is pushed whether we should enter the mode or not. This is because we don't know if the enablement state is All disabled or only subclass disabled. Adding this API to check if All disabled is True allows us to disambiguate this case.

In the next PR, Dynamo InstructionTranslator will have clearer flags than the underlying C API:
* A flag to indicate if subclasses are disabled (ie All disabled or Subclass Disabled is the current state)
* A flag to indicate if modes are disabled (ie if All disabled is the current state)
* A symbolic stack which can be checked if any modes are present

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134
2024-08-20 07:15:04 +00:00
Michael Lazos
09e366cb57 [Dynamo] Add torch function mode stack guard to dynamo (#133130)
This PR adds a guard on the torch function mode stack state at the beginning of tracing. The way this is implemented is via a new leaf guard which is passed the initial stack state at construction and compares it to the stack state at the time the guard is run.

Details:
The stack state is extracted via popping all modes, appending them to a list, and pushing all modes back. This list is stored on the output graph and read during guard construction to pass to the stack mode guard. There the length and types of the modes are recorded. Next time the guard is run it compares this recorded state to the current mode stack state.

To implement this in python a helper function was added to utils.py and this is used if cpp guards are not enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133130
Approved by: https://github.com/anijain2305
2024-08-20 07:14:33 +00:00
Zhengxu Chen
59b3f5911d [sigmoid] Support custom obj deserialization. (#133463)
Summary:
It seems we have multiple places deserializing torchbind objects. Moving the code around so that every load essentially share the same implementation.

Also added a test case "package_reader_testing" which load back the archive file in Python and eagerly validate the numerical result.

Test Plan: buck test mode/opt sigmoid/inference/test:e2e_test_cpu

Reviewed By: SherlockNoMad

Differential Revision: D61235770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133463
Approved by: https://github.com/ydwu4
2024-08-15 17:58:44 +00:00
Mikayla Gawarecki
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
Shivam Raikundalia
d2ecdcb2f7 [Profiler] Add API for Dynamic Activity Toggling [2/n] (#133035)
Summary: During PT2 there are many GPU/CPU events that are unneccessary to profile in between a given step. To remedy this, we can add an API that takes in a list of activities and an arg to toggle said activies or not. For this diff we are adding the profiler API to propogate down to kineto (and in the future the collection.cpp logic). Subsequent diffs will be added for CPU toggling and e2e testing.

Test Plan: Tested by toggling backward gpu traces off and got following trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jul_31_13_40_55.3251726.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi

Differential Revision: D60541767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133035
Approved by: https://github.com/aaronenyeshi
2024-08-09 21:54:54 +00:00
Xuehai Pan
24dee99cb7 Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-08 10:20:25 +00:00
Howard Huang
0f90ffe94a Remove ProcessGroupRoundRobin (#132888)
`_round_robin_process_groups` is deprecated and should be removed.

258f47fc0b/torch/csrc/distributed/c10d/ProcessGroupRoundRobin.cpp (L10-L12)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132888
Approved by: https://github.com/Skylion007, https://github.com/wanchaol, https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-08 01:07:40 +00:00
PyTorch MergeBot
ff81ca8e0c Revert "Populate submodules of torch._C to sys.modules recursively (#132216)"
This reverts commit 672ce4610e.

Reverted https://github.com/pytorch/pytorch/pull/132216 on behalf of https://github.com/PaliC due to was breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/132216#issuecomment-2274112397))
2024-08-07 18:45:00 +00:00
Matthew Hoffman
258f47fc0b Add padding_side to pad_sequence with "left" and "right" options ("right" as default) (#131884)
Fixes #10536

Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!

As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.

The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for for Hugging Face tokenizers](9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)), which requires only a very small change in the C++ code.

Here are the benchmarks of the new implementation!

`float32`:

![eaaa95ef-9384-45d2-be56-6898bc1d3514](https://github.com/user-attachments/assets/3b0eb309-e5a0-4a4d-97bb-4e3298783dbb)

`bool`:

![892f32da-8d9a-492b-9507-18d3f0a41e8e](https://github.com/user-attachments/assets/6824ea15-7d4e-4b89-95f0-8546635f0c2e)

Code:

```python
from __future__ import annotations

import random
import time
from typing import Literal

import numpy as np
import torch

def pad_sequence_with_flips(
    sequences: list[torch.Tensor],
    batch_first: bool = False,
    padding_value: int | float | bool = 0.0,
    padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
    if padding_side == 'right':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
    elif padding_side=='left':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value)  # pyright: ignore[reportArgumentType]
        padded_sequence = padded_sequence.flip(int(batch_first))
    else:
        raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")

    return padded_sequence

sequence_lengths: list[int] = []

flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []

left_pad_times: list[float] = []
left_pad_times_std: list[float] = []

RUNS_PER_LOOP: int = 100

for i in range(1, 7):
    sequence_length = i * int(1e6) // 6
    sequence_lengths.append(sequence_length)

    sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]

    inner_left_pad_times: list[float] = []
    inner_right_pad_times: list[float] = []

    inner_flip_left_pad_times: list[float] = []
    inner_flip_right_pad_times: list[float] = []

    for _ in range(RUNS_PER_LOOP):

        start = time.perf_counter()
        torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_left_pad_times.append(end - start)

        start = time.perf_counter()
        pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_flip_left_pad_times.append(end - start)

    left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
    left_pad_times_std.append(np.std(inner_left_pad_times))

    flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
    flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))

    print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")

import matplotlib.pyplot as plt

plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')

plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')

plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")

# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```

Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884
Approved by: https://github.com/ezyang
2024-08-07 15:53:07 +00:00
Apurva Jain
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
Nicolas Macchioni
527f104a69 add L2 cache size to device properties (#132819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132819
Approved by: https://github.com/eellison
2024-08-07 04:55:06 +00:00
Brian Hirsh
e3394e5548 torch.autograd.graph.increment_version: accept List[Tensor], use in AOTDispatcher (#132652)
The regression from https://github.com/pytorch/pytorch/issues/132281 pinpoints e4ace1a396 as the cause. The main delta that commit introduces is that we now manually check `is_inference()` and call `increment_version()` (a pybind call) on every mutated input tensor to the graph.

This PR attempts to reduce overhead a bit by bundling up all of those checks into a single pybind call, by:

(1) updating `torch.autograd.graph.increment_version()` to accept a `Union[Tensor, List[Tensor]]`

(2) updating its semantics to no-op if you pass in a tensor with no version counter, instead of erroring

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132652
Approved by: https://github.com/albanD
2024-08-06 17:46:48 +00:00
fduwjj
4e610924d4 [c10d] Add a new API for adding ephemeral timeout for one local rank and the timeout will reset when the first collective finishes (#130905)
We provide an API for user to add ephemeral timeout across all PGs within one rank and the timeout will reset when the first collective issued after the timeout added finishes.

Each extension only covers collectives after the issue and before the first collective finished. The diagram below shows how the timeout changes:

<img width="1174" alt="image" src="https://github.com/user-attachments/assets/354923b7-581c-40de-ae0f-1cd3da273ccc">

While this feature provides flexibility in specific scenarios, it introduces statefulness to timeout setting. Therefore, it is advisable to use this API sparingly and consider alternative approaches, such as directly setting the timeout or utilizing a barrier collective (one can set any timeout to the barrier), whenever feasible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130905
Approved by: https://github.com/ezyang
2024-08-06 03:47:58 +00:00
Xuehai Pan
4226ed1585 [BE] Format uncategorized Python files with ruff format (#132576)
Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #132574
2024-08-04 17:13:31 +00:00
PyTorch MergeBot
bcb4f7c172 Revert "Grouped Query Attention (#128898)"
This reverts commit 6b28af1b79.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))
2024-08-02 18:58:46 +00:00
Xuehai Pan
672ce4610e Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-01 12:04:59 +00:00
Syed Tousif Ahmed
7c89ec0f7c Implements torch.cuda.MemPool() API (#131152)
In this PR:
- Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change.
- MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator.
- MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-01 01:29:30 +00:00
jainapurva
6b28af1b79 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-31 22:58:51 +00:00
Luca Wehrstedt
f4f7aba75d Expose function to probe whether PyTorch was built with FlashAttention (#131894)
This is needed by downstream projects (e.g., xFormers) to determine whether they can count on FlashAttention in PyTorch or whether they need to build it themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131894
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-07-31 11:33:09 +00:00
Matthew Hoffman
deb788f6cc Merge torch.nn.utils.rnn type stubs (#131872)
I want to re-attempt:

* #61467

See:

* https://github.com/pytorch/pytorch/issues/10536#issuecomment-2251948730

and this is one of the files I would touch.

quoting @ezyang:

* https://github.com/pytorch/pytorch/issues/91648#issuecomment-1372010129

> The back story here is that in https://github.com/pytorch/pytorch/pull/19089 we added pyi stubs for nn modules, but when we got off Python 2 we started merging the pyi stubs directly into the py files, e.g., as in https://github.com/pytorch/pytorch/pull/43044. But not all the modules got the treatment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131872
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-07-31 02:24:59 +00:00
Shreyans Pathak
12b67bd998 Fix pyi annotation for ProcessGroupGloo.Options (#132080)
This PR fixes the pyi annotation for `ProcessGroupGloo.Options` based on the definition in the `torch/csrc/distributed/c10d/init.cpp` file.

Fixes #132054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132080
Approved by: https://github.com/Skylion007
2024-07-30 13:52:31 +00:00
PyTorch MergeBot
499ead96ff Revert "Grouped Query Attention (#128898)"
This reverts commit d039b14207.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))
2024-07-30 13:11:24 +00:00
jainapurva
d039b14207 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-29 21:49:06 +00:00
Simon Mahns
dcb03106b7 [Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007)
Summary: as title

Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962

Differential Revision: D60335413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007
Approved by: https://github.com/hanzlfs, https://github.com/egienvalue
2024-07-29 20:47:18 +00:00
Edward Z. Yang
6c6fbb4691 Fix pyi annotation for ProcessGroupNCCL.Options (#130957)
Probably all the other options need updating too, but this is the one I
needed.  The accurate annotation was determined by reading
torch/csrc/distributed/c10d/init.cpp

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130957
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-07-29 17:46:01 +00:00
PyTorch MergeBot
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9d.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
Boyuan Feng
40cc5c0697 [AOT Autograd] Donated Buffer (#130580)
Implements donated buffer feature and adds unit tests. Donated buffer is a saved tensor that is not aliased with forward inputs, fw_outputs (except saved tensors), and bw_outputs. We detect donated buffers during `aot_dispatch_autograd` and store donated buffers in `ViewAndMutationMetadata`, such that it can be accssed in inductor.

Fixes #129496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580
Approved by: https://github.com/bdhirsh
2024-07-26 17:14:34 +00:00
Boyuan Feng
16d7cb5049 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 06:14:06 +00:00
PyTorch MergeBot
03f49c9523 Revert "[CUDAGraph] Type annotation for cudagraph_trees.py (#131621)"
This reverts commit 16699c7d84.

Reverted https://github.com/pytorch/pytorch/pull/131621 on behalf of https://github.com/atalman due to lint is failing, please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/131621#issuecomment-2251831163))
2024-07-26 02:08:45 +00:00
Boyuan Feng
16699c7d84 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 01:40:23 +00:00
PyTorch MergeBot
b343644f3a Revert "MTIA equivalent of torch.cuda.memory_stats (#131673)"
This reverts commit 513ce5f69a.

Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))
2024-07-26 00:54:37 +00:00
Mikayla Gawarecki
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
Simon Mahns
513ce5f69a MTIA equivalent of torch.cuda.memory_stats (#131673)
Summary: Adding MTIA equivalent of `torch.cuda.memory_stats`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673
Approved by: https://github.com/egienvalue
2024-07-25 21:59:59 +00:00
Animesh Jain
e2b941a1b4 [dynamo] Rename TENSOR_ALIASING to OBJECT_ALIASING. Permit OBJECT_ALIASING for dict guards (#131480)
Fixes https://github.com/pytorch/pytorch/issues/129667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131480
Approved by: https://github.com/williamwen42
ghstack dependencies: #131347, #131367, #131378, #131389, #131405
2024-07-24 00:06:53 +00:00
Yifu Wang
161c18ed0b SymmetricMemory-based, low contention intra-node all-gather and reduce-scatter (#130583)
```python
# NOTE [low-contention collectives]
# When a collective is overlapped with abundant compute, it makes sense to
# prioritize reducing the contention between the collective and the overlapped
# compute, even at the cost of a slightly slower collective.
#
# Common collective implementations (e.g., NCCL without user buffer
# registration) optimize for throughput with no ambient compute. However, such
# implementations may not be optimal when they are overlapped with compute:
# - These impls typically fuse the entire collective into a single kernel and
# reserve SM resources based on the most demanding portion of the collective,
# even when a large portion of the collective does not require this much
# resource.
# - These implementations typically fuse the entire collective into a single
# kernel and reserve SM resources based on the most demanding portion of the
# collective, even when a large portion of the collective does not require this
# much resource.
# - These implementations often use SM-based P2P copy as opposed to copy
# engine-based P2P copy. Copy engine-based P2P copy may not have a significant
# advantage when there's no ambient compute. However, it may significantly
# improve overall resource utilization in the presence of ambient compute.
#
# When overlapped with intensive compute (e.g., persistent matmul kernels), the
# SM-usage of a collective can lead to inefficient overlapping.
#
# Low-contention collectives achieve their goals with the following strategies:
# - Use copy engine-based copy whenever possible.
# - Break down portions of a collective with different resource requirements
# into multiple kernels. This improves the overlapping efficiency at the cost
# of additional launching overhead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130583
Approved by: https://github.com/weifengpy
2024-07-23 23:37:48 +00:00
PyTorch MergeBot
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
Oguz Ulgen
4ca8705035 Add mypy typing to fx node (#131434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131434
Approved by: https://github.com/zou3519
2024-07-23 17:00:31 +00:00
Mikayla Gawarecki
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
Florian
1614891946 [Profiler] exclude gpu_user_annotation when accumulating cuda time total (#130733)
Fixes #[130730](https://github.com/pytorch/pytorch/issues/130730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130733
Approved by: https://github.com/aaronenyeshi
2024-07-22 04:35:21 +00:00
Aaron Orenstein
b193894b94 FakeTensor cache SymInt support (#127596)
Adds support for SymInts in the FakeTensor cache.

A couple notes:
1. When a SymInt is present in the input key for a FakeTensor operation we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits but there's diminishing return on how much work we can do before the cache becomes more of a burden than a gain.
2. We need to be careful that when we cache an output SymInt that is a direct copy from the input that when we have a cache-hit we copy the SymNode from the input to the output. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph so constructing a same-content-but-different-id SymNode will fail.
3. In the cache key we store SymInts as a _PySymInputStub. These represent SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt do not).
4. In the cache entry we store SymInts as a _SymIntOutputStub.

Perf example:
```
python benchmarks/dynamo/timm_models.py --ci --accuracy --timing
--explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda
--training --amp --total-partitions 2 --partition-id 0 --output
/tmp/training_timm_models.csv --filter crossvit_9_240
```
fake tensor cache before:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 68137
INFO:   cache_misses: 837
INFO:   cache_bypasses:
INFO:     symbolic shape:            48224
INFO:     CompositeImplicitAutograd: 917
INFO:     non-fake tensor:           70
INFO:     non-FakeTensor output:     62
INFO:     non-builtin:               8
INFO:     dynamic output shape:      1
```
and after:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 88187
INFO:   cache_misses: 14233
INFO:   cache_bypasses:
INFO:     CompositeImplicitAutograd: 1037
INFO:     non-FakeTensor output:     602
INFO:     non-fake tensor:           70
INFO:     unsafe view:               36
INFO:     non-builtin:               8
INFO:     dynamic output shape:      1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596
Approved by: https://github.com/eellison
ghstack dependencies: #131014, #129780
2024-07-21 19:26:38 +00:00
Wu, Chunyuan
a8319698b3 [inductor] [cpp] improve cache blocking with CPU info (#129348)
## Description
For single thread case, this PR improves the cache blocking in CPP GEMM template with the CPU info (the L1 and L2 cache size). `Mc_blocks` and `Kc_blocks` are calculated based on the below condition:
     - size_of_B < L1
     - size_of_A < 0.5 * L2

For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations.

## Performance
No regressions. Models with > 3% performance speedup are listed below:

### BF16 single thread (measured on CPU with AMX support)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | detectron2_fasterrcnn_r_101_dc5| 4%

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | detectron2_fasterrcnn_r_101_dc5| 4%

### FP32 single thread (measured on Ice Lake)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | basic_gnn_edgecnn| 10%

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | basic_gnn_edgecnn| 10%

### Next step
The E2E level improvement is limited due to the below reasons:

- For several HF models, we can observe performance improvement at kernel level for the gemm template kernel but since the performance is either still worse than ATen kernel (thus won't be selected during autotune) or improved from worse than ATen to similar to ATen, so we don't see E2E level performance change.

- There're models where the gemm template kernel could get > 10% performance improvement with this PR but since the kernel time is only about 3% of the E2E time, we don't observe significant E2E level improvement.

We will continue to find possible optimizations in the gemm template kernel in follow-up PRs.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
2024-07-20 06:53:31 +00:00
PyTorch MergeBot
7c299b46ca Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)"
This reverts commit 8390843eba.

Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/izaitsevfb due to breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2240516202))
2024-07-19 22:58:51 +00:00
Isuru Fernando
8390843eba Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-16 14:29:29 +00:00
PyTorch MergeBot
78799e82b0 Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)"
This reverts commit 1bc390c5f5.

Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/jithunnair-amd due to test test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946 1bc390c5f5. Test was introduced by fa5f572748 which is before the merge base ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2229508737))
2024-07-15 21:59:46 +00:00
Shuqiang Zhang
77fb5b0e23 [c10d] a new Pytorch API (split_group) to create a process group (#130507)
This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407

ncclCommSplit
Summary:
In current Pytorch/c10d, the new_group API is used to create a new
process group from the default pg.  When device_id is specified in
init_process_group and nccl is used as the backend, the new_group call
will use ncclCommSplit to create the nccl communicators to save
communicator resources. It has a few drawbacks:

Redundant calls
Suppose the default group has 256 ranks, we need to have 32 children PGs
and each child PG has 8 ranks. in this case, each rank needs to call
new_group and ncclCommSplit 32 times because of how we implement
new_group API and the collective requirement of ncclCommSplit. For a
specific global rank, 31 calls of ncclCommSplit would be no_color split,
and only 1 of them is colored split. With the proposed new split_group
API, we expect only 1 call of split_group/ncclCommSplit is needed per
rank in the above example case

new_group can only split from default_pg
Ideally, a new pg should be able to be split from any pg

With the new split_group API, users can create new PGs using
ncclCommSplit with less number of calls and initialize the PG eagerly.
This is also useful in the cases of creating many P2P communicators.
Test Plan:
New UTs:
e.g., python test/distributed/test_c10d_nccl.py -k
test_comm_split_group_larger_scale
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507
Approved by: https://github.com/wconstab
2024-07-15 21:26:43 +00:00
Isuru Fernando
1bc390c5f5 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-15 04:16:17 +00:00
Shivam Raikundalia
6f275ae4d0 Add kwinputs to Kineto Traces (#130373)
Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces.

Test Plan: Added unit testing for both chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs since test already had kwargs being passed in but not tested.

Differential Revision: D59472345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373
Approved by: https://github.com/davidberard98
2024-07-14 00:40:59 +00:00
Aaron Orenstein
567482973d typing fake_tensor.py (#128041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128041
Approved by: https://github.com/eellison
ghstack dependencies: #129182
2024-07-13 06:07:40 +00:00
Wang, Eikan
f52b2ee90f Modularize aten parameter parser and checker (#125308)
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure intends to be used to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can extend other parameter-type support in a more modularize way, like `string`, `int`, `double`.

Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman
2024-07-11 13:17:25 +00:00
Ivan Zaitsev
6b3460ae0d fix discrepancy from the export of #126601 (#130296)
#126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130296
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-07-10 17:26:44 +00:00
Xuehai Pan
3f50e197c4 [BE] annotate torch.autograd.graph (#129558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129558
Approved by: https://github.com/soulitzer
2024-07-06 18:14:16 +00:00
Xuehai Pan
d1d0a7080f [torchgen] reference generated comment to actual location of the generator and template (#130020)
As per title.

```diff
# torch/_VF.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/return_types.pyi

- # @generated from torch/_C/return_types.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in
```

```diff
# torch/_C/__init__.pyi

- # @generated from torch/_C/__init__.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in
```

```diff
# torch/_C/_nn.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in
```

```diff
# torch/_C/_VariableFunctions.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/nn/functional.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020
Approved by: https://github.com/ezyang
2024-07-05 21:47:14 +00:00
Ramana Cherukuri
f6a0be5023 Add warpSize to Device properties (#128449)
Adding warp_size to CudaDeviceProperties.

>>> import torch
>>> prop = torch.cuda.get_device_properties(torch.cuda.current_device())
>>> prop.warp_size
64
>>>

@jeffdaily @pruthvistony @jithunnair-amd @ROCmSupport

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128449
Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/jithunnair-amd, https://github.com/malfet
2024-07-01 09:13:32 +00:00
eqy
f845a7a91a [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-30 19:22:16 +00:00
Xuehai Pan
56935684c3 Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)
------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419
Approved by: https://github.com/ezyang
ghstack dependencies: #129375, #129376
2024-06-29 09:23:39 +00:00
PyTorch MergeBot
83caf4960f Revert "Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)"
This reverts commit e40f50cb87.

Reverted https://github.com/pytorch/pytorch/pull/129419 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:24 +00:00
FEI
59e4e92556 sdp::SDPBackend::flash_attention support PrivateUse1 (#126392)
Fixes https://github.com/pytorch/pytorch/issues/124271

cc  @cpuhrsch @drisspg @albanD @soulitzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126392
Approved by: https://github.com/drisspg
2024-06-28 17:48:40 +00:00
Xuehai Pan
e40f50cb87 Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)
------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419
Approved by: https://github.com/ezyang
ghstack dependencies: #129375, #129376
2024-06-28 15:37:57 +00:00
PyTorch MergeBot
999eec8dea Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit b7e7a4cb01.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))
2024-06-28 06:03:54 +00:00
Jeff Daily
169b4ca07e add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet
2024-06-27 23:53:13 +00:00
Chen, Zejun
a028e5862d [profiler] Directly use end_ns to create the FunctionEvent instead of using start_ns + duration_ns in pytorch profiler post processing for checking parent-child precisely (#129554)
Use the raw end_ns directly, instead of the sum of start_ns and duration_ns, in order to avoid negative CPU time in profiler.

Fix https://github.com/pytorch/pytorch/issues/101861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129554
Approved by: https://github.com/gujinghui, https://github.com/aaronenyeshi
2024-06-27 10:46:05 +00:00
Eddie Yan
b7e7a4cb01 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-26 00:49:18 +00:00
Yifu Wang
bbd47f7b2f Remove ProcessGroupCudaP2P and change async-TP to use SymmetricMemory (#128762)
This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762
Approved by: https://github.com/wanchaol
2024-06-25 22:32:21 +00:00
Xuehai Pan
93a33bf3ac [BE] update type annotations for basic utilities in torch/__init__.py (#129001)
Changes:

1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations., especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
2024-06-24 18:04:38 +00:00