PyTorch MergeBot
3d1fa40ae1
Revert "[BC-Breaking] Remove long-deprecated casting functions from native_functions.yaml ( #164641 )"
...
This reverts commit 64108bdbed .
Reverted https://github.com/pytorch/pytorch/pull/164641 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164641#issuecomment-3386346474 ))
2025-10-09 15:42:51 +00:00
Yuanyuan Chen
64108bdbed
[BC-Breaking] Remove long-deprecated casting functions from native_functions.yaml ( #164641 )
...
This PR removes `torch._cast_XXX` from generated OPs. They were deprecated in PyTorch 1
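For reference, a minimal migration sketch, assuming callers replace the generated `torch._cast_*` ops with the public casting APIs (`Tensor.to` and the dtype shorthands); the variable names below are illustrative:
```py
import torch

x = torch.arange(4, dtype=torch.int64)

# Instead of a generated cast op such as torch._cast_Float(x),
# use the public casting APIs:
y = x.to(torch.float32)  # explicit dtype cast
z = x.float()            # shorthand for the same cast
```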
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164641
Approved by: https://github.com/albanD , https://github.com/justinchuby
2025-10-08 08:27:58 +00:00
AaronWang04
04a393507b
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked against the unfused torch implementation and the torch.compile implementation: around a 9x speedup over the unfused implementation on CUDA, and slightly faster than the Inductor-compiled version on a 5090.
```py
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed


def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    print("-" * 50)


if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]
    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the Triton compile tests run on a 5090 (comparing this branch vs. main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)

        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()

        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/albanD
2025-07-22 22:25:44 +00:00
PyTorch MergeBot
35f1b4ad9e
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit 15ef4f28df .
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking tests internally. @albanD can you please help land this change? You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts . See D78599667 for more info ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3097690935 ))
2025-07-21 17:31:42 +00:00
AaronWang04
15ef4f28df
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked against the unfused torch implementation and the torch.compile implementation: around a 9x speedup over the unfused implementation on CUDA, and slightly faster than the Inductor-compiled version on a 5090.
```py
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed


def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    print("-" * 50)


if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]
    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the Triton compile tests run on a 5090 (comparing this branch vs. main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)

        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()

        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/eqy , https://github.com/albanD
2025-07-18 23:24:21 +00:00
Xuehai Pan
6d5c789ad5
[BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format ( #144555 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144555
Approved by: https://github.com/ezyang
ghstack dependencies: #144551 , #144554
2025-06-24 04:53:54 +00:00
Ting Lu
c2bc7e2827
API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ ( #150536 )
...
Changing the bool to an int to express split_k_mode. Before cuSPARSELt 0.7.0 there were only two `cusparseLtSplitKMode_t` enum values, ONE_KERNEL and TWO_KERNELS, so a boolean was enough, but since 0.7.0 there are more.
For Blackwell, a minor change to the `split_k_one_kernel` parameter (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103 ) is needed, since new values were introduced to the [cusparseLtSplitKMode_t](https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t ) enum and a bool is no longer sufficient (it has to be replaced with an integer).
Error seen without the change:
```
RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))`
To execute this test, run the following from the base repo dir:
python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150536
Approved by: https://github.com/jcaip , https://github.com/atalman
2025-05-14 23:36:53 +00:00
Aaron Gokaslan
3555ebb63d
[BE]: Update ruff to 0.11.8 ( #153249 )
...
Fixes a ton of false negatives throughout the codebase. Ruff now properly validates NOQA comments, and most of the changes fix typos in those comments or remove file-wide flake8 suppressions that were also silencing ruff issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever , https://github.com/albanD , https://github.com/seemethere
2025-05-12 18:30:52 +00:00
Ke Wen
35c45a4a31
[Reland] Launch kernel on current stream & remove record_stream entirely ( #150398 )
...
Relanding #148590 due to merge conflict.
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460 ).
5. Profiling in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
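A minimal sketch of the user-facing distinction described above, assuming an already-initialized NCCL process group and a tensor on this rank's GPU:
```py
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already been called
# and this rank's default CUDA device is set.
t = torch.ones(8, device="cuda")

# async_op=False: the collective is launched on the current stream, so later
# kernels on that stream are ordered after it without extra event syncs.
dist.all_reduce(t)

# async_op=True: a Work handle is returned; work.wait() makes the current
# stream wait for the collective before downstream kernels consume `t`.
work = dist.all_reduce(t, async_op=True)
work.wait()
```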
Squashed contents:
* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820 )
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU+GPU time from **230us to 195us (by 15%)**.
* [PGNCCL] Make avoid-record-stream default
* [c10d] Add asyncOp argument to Ops
* Change python side wait
* Pass asyncOp at ProcessGroup level
* Watchdog unstashing tensors as a safety net
* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753
* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079
* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
PyTorch MergeBot
afa1eda901
Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely ( #148590 )"
...
This reverts commit ef6296e7f2 .
Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626 ))
2025-03-17 22:43:15 +00:00
Ke Wen
ef6296e7f2
[PGNCCL] Launch kernel on current stream & remove record_stream entirely ( #148590 )
...
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460 ).
5. Profiling in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy , https://github.com/Aidyn-A , https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
PyTorch MergeBot
a95eb0c0a7
Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely ( #148590 )"
...
This reverts commit 2149f6c684 .
Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270 ))
2025-03-10 22:38:40 +00:00
Ke Wen
2149f6c684
[PGNCCL] Launch kernel on current stream & remove record_stream entirely ( #148590 )
...
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460 ).
5. Profiling in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy , https://github.com/Aidyn-A , https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
Lu Fang
45d35f5f5a
Clean up op BC check list ( #146577 )
...
Summary: Remove the expired ones
Test Plan: ci
Differential Revision: D69226556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146577
Approved by: https://github.com/hl475
2025-02-07 22:40:49 +00:00
soulitzer
c72b536420
Add manual override flag for core ATen op detection during bc check ( #146052 )
...
Fixes https://github.com/pytorch/pytorch/issues/146049
Today the BC detection logic ignores the allow_list for core ATen ops (a PR landed 4 months ago to enable this). The problem is that if a PR removes an op, the script can no longer check whether that op is a core ATen op (today we just error out).
With my fix: (1) we conservatively assume the op is a core ATen op in such cases, and (2) the user can specify in their ALLOW_LIST entry that their op is not a core ATen op.
Test plan:
- This is tested 2 PRs above
016bdafdcb/test/forward_backward_compatibility/check_forward_backward_compatibility.py (L129-L137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146052
Approved by: https://github.com/albanD
2025-01-30 23:57:01 +00:00
IvanKobzarev
f85e238186
[aotd] capture rrelu_with_noise noise mutation in compile ( #141867 )
...
Rebased copy of the long-standing, already-approved PR https://github.com/pytorch/pytorch/pull/138503 that was blocked from landing by XLA build issues.
Opened a new PR with the same content (ghstack checkout was failing due to changed submodules).
Corresponding xla PR:
https://github.com/pytorch/xla/pull/8363
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141867
Approved by: https://github.com/bdhirsh
2024-12-04 12:18:58 +00:00
Mengwei Liu
7365a57dc0
[BC] Add check for core ATen opset schema BC ( #137664 )
...
Summary: Based on core ATen opset BC policy: https://dev-discuss.pytorch.org/t/core-aten-opset-backward-forward-compatibility-policy/1772
Enforcing this policy in `check_forward_backward_compatibility.py`.
Basically, the script will error out if any BC-breaking schema change occurs to a core ATen operator.
Test Plan:
Run `python test/forward_backward_compatibility/dump_all_function_schemas.py --filename nightly_schemas.txt`
Manually added an argument to the `convolution` schema in `nightly_schemas.txt`; see the following error:
```
[WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:329] Can NOT find backward compatible schemas after changes for schema aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor from the following candidates:
[
aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups) -> Tensor
aten::convolution.out(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, *, Tensor(a!) out) -> Tensor(a!)
]. Please contact PyTorch team to confirm if this BC breaking change is safe or not.
...
[WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:342] The PR is introducing backward incompatible changes to core ATen operators. Please contact PyTorch team to confirm whether this change is wanted or not.
Broken ops: [
aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor
]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137664
Approved by: https://github.com/albanD
2024-10-18 01:58:33 +00:00
Lu Fang
d170c410f2
Clean up op BC check list ( #137634 )
...
Summary: Remove some stale items
Test Plan: CI
Differential Revision: D64133246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137634
Approved by: https://github.com/hl475
2024-10-10 04:29:21 +00:00
Huamin Li
fd494dd426
Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix ( #135401 )
...
Summary: In https://github.com/pytorch/pytorch/pull/134232 , we added two new ops, wrapped_linear_prepack and wrapped_quantized_linear_prepacked. Based on the review comments and offline discussion, we are changing them to private by adding `_` as a prefix.
Differential Revision: D62325142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401
Approved by: https://github.com/houseroad
2024-09-08 04:16:24 +00:00
leslie-fang-intel
86e2d16ba0
[Inductor][Quant] Change the schema of QLinear Binary ( #129049 )
...
**Summary**
We change the schema of QLinear Binary so that it is easier to enable the corresponding GEMM template.
- The extra input of the binary post-op is a tensor that needs to be an input node for autotuning, so we move it in front of `output_scale`, which is a scalar.
- We also move it in front of `bias`, since `bias` is an optional tensor for this fusion while `other` is required for linear binary fusion.
**Test Plan**
```
python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129049
Approved by: https://github.com/jgong5 , https://github.com/jansel
ghstack dependencies: #128825 , #129048
2024-07-02 12:36:38 +00:00
drisspg
fc2913fb80
Remove amax return from _scaled_mm ( #128683 )
...
# Summary
The primary reason for the change was the lack of a current use case and the need to work around two Inductor issues:
- Tensor arguments as kwarg only
- multiple outputs from Triton templates
If the need for the amax return type arises, we can consider either adding it back or, more likely, creating a separate op.
In principle PyTorch is moving away from ops that bundle lots of functionality into "mega ops". We instead rely upon the compiler to generate appropriate fused kernels.
### Changes:
- This removes the amax return type from scaled_mm. We have found that the common use case is to return in "high precision" (a type with more precision than fp8); amax is only relevant when returning in low precision.
- We currently still allow fp8 returns and a scaled result. Perhaps we should ban this as well...
New signature:
```Python
def meta_scaled_mm(
self: torch.Tensor,
mat2: torch.Tensor,
scale_a: torch.Tensor,
scale_b: torch.Tensor,
bias: Optional[torch.Tensor] = None,
scale_result: Optional[torch.Tensor] = None,
out_dtype: Optional[torch.dtype] = None,
use_fast_accum: bool = False,
) -> torch.Tensor:
```
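A hedged usage sketch of the op after this change, following the signature above; `_scaled_mm` is a private op, so the exact layout/dtype requirements and defaults here are assumptions:
```py
import torch

# Illustrative only: shapes, the fp8 dtype, and the transposed second operand
# are assumptions about the op's requirements, not guarantees.
a = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(64, 32, device="cuda").to(torch.float8_e4m3fn).t()
scale_a = torch.tensor(1.0, device="cuda")
scale_b = torch.tensor(1.0, device="cuda")

# After this PR a single Tensor is returned (no amax), here in a higher-precision dtype.
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```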
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128683
Approved by: https://github.com/vkuzo
2024-06-17 16:48:00 +00:00
dan_the_3rd
4a384d813b
[SDPA/memeff] Backport changes from xFormers to PT ( #127090 )
...
Backporting a few fixes from xFormers:
* Bug fixes for local attention (which is not exposed in PT at the moment)
* Massively reduced memory usage on the BW pass (see also https://github.com/facebookresearch/xformers/pull/1028 )
Essentially this will also make the xFormers build process much easier, as we will be able to use mem-eff from PyTorch (if the user has a recent enough version) rather than building it at xFormers install time.
The goal is to have the source of truth for these files in PT moving forward, and remove them from xFormers eventually once our users have a recent-enough version of PT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127090
Approved by: https://github.com/drisspg
2024-06-05 07:33:27 +00:00
Tobias Ringwald
4f1a56cd42
Switched from parameter in can_cast to from_. ( #126030 )
...
Fixes #126012 .
`from` is a reserved keyword in Python, so we can't expose the C++ impl with `from` as a function parameter. This PR changes the name to `from_` and also adjusts the docs.
If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
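A short usage sketch of the renamed keyword (assuming the landed parameter name is `from_`, as described above):
```py
import torch

# Positional use is unchanged; the keyword form uses the new `from_` name.
assert torch.can_cast(torch.int32, torch.float64)
assert torch.can_cast(from_=torch.int32, to=torch.float64)
```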
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030
Approved by: https://github.com/albanD
2024-05-16 20:58:24 +00:00
PyTorch MergeBot
315389bfed
Revert "Remove deprecated _aminmax operator ( #125995 )"
...
This reverts commit 0116ffae7f .
Reverted https://github.com/pytorch/pytorch/pull/125995 on behalf of https://github.com/huydhn due to Sorry for reverting your change but we need to reland this after I get rid of all usage of _aminmax internally in Meta ([comment](https://github.com/pytorch/pytorch/pull/125995#issuecomment-2113769497 ))
2024-05-16 01:45:37 +00:00
PyTorch MergeBot
6065a4d46e
Revert "Switched from parameter in can_cast to from_. ( #126030 )"
...
This reverts commit 06d6bb4eba .
Reverted https://github.com/pytorch/pytorch/pull/126030 on behalf of https://github.com/huydhn due to Sorry for reverting your change but i need to revert it to avoid a diff train conflict with https://github.com/pytorch/pytorch/pull/125995 . Please help rebase and I will reland the change ([comment](https://github.com/pytorch/pytorch/pull/126030#issuecomment-2113757469 ))
2024-05-16 01:42:23 +00:00
Tobias Ringwald
06d6bb4eba
Switched from parameter in can_cast to from_. ( #126030 )
...
Fixes #126012 .
`from` is a reserved keyword in Python, so we can't expose the C++ impl with `from` as a function parameter. This PR changes the name to `from_` and also adjusts the docs.
If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030
Approved by: https://github.com/albanD
2024-05-16 00:09:54 +00:00
cyy
0116ffae7f
Remove deprecated _aminmax operator ( #125995 )
...
It has been deprecated for a long time.
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125995
Approved by: https://github.com/ezyang
2024-05-12 17:50:17 +00:00
leslie-fang-intel
c337395cdb
[Inductor][Quant] Change the QConv output scale name ( #124246 )
...
**Summary**
Change the name of QConv output scale from `inv_output_scale` to `output_scale` after we move the optimization of quant/dequant from decomposition to lowering phase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124246
Approved by: https://github.com/jgong5 , https://github.com/peterbell10
ghstack dependencies: #124041
2024-05-09 08:44:00 +00:00
PyTorch MergeBot
ca579c177b
Revert "[Inductor][Quant] Change the QConv output scale name ( #124246 )"
...
This reverts commit 9ba9f7fa82 .
Reverted https://github.com/pytorch/pytorch/pull/124246 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change 33e6791645 ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558 ))
2024-05-09 01:34:19 +00:00
leslie-fang-intel
9ba9f7fa82
[Inductor][Quant] Change the QConv output scale name ( #124246 )
...
**Summary**
Change the name of QConv output scale from `inv_output_scale` to `output_scale` after we move the optimization of quant/dequant from decomposition to lowering phase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124246
Approved by: https://github.com/jgong5 , https://github.com/peterbell10
ghstack dependencies: #124041
2024-05-09 00:57:10 +00:00
eqy
a866bfff45
[cuDNN] cuDNN SDPA (Flash Attention) Backward ( #122510 )
...
#113713
Currently passing trivial smoke tests, but I just pattern-matched bits and pieces of the autograd defs.
Will also collect benchmark data.
CC @drisspg
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510
Approved by: https://github.com/drisspg
2024-04-27 04:15:49 +00:00
PyTorch MergeBot
7a6813b7b3
Revert "[cuDNN] cuDNN SDPA (Flash Attention) Backward ( #122510 )"
...
This reverts commit 64af899fdf .
Reverted https://github.com/pytorch/pytorch/pull/122510 on behalf of https://github.com/jeanschmidt due to Breaking amd gpu builds ([comment](https://github.com/pytorch/pytorch/pull/122510#issuecomment-2076743868 ))
2024-04-25 09:22:37 +00:00
eqy
64af899fdf
[cuDNN] cuDNN SDPA (Flash Attention) Backward ( #122510 )
...
#113713
Currently passing trivial smoke tests, but I just pattern-matched bits and pieces of the autograd defs.
Will also collect benchmark data.
CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510
Approved by: https://github.com/drisspg
2024-04-24 01:00:34 +00:00
Ashwin Hari
5f5778476a
rename ort to maia ( #123265 )
...
Fixes #123264
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265
Approved by: https://github.com/albanD
2024-04-23 00:33:25 +00:00
Aleksandar Samardžić
f5331aade5
Simplify ATen sparse semi-structured operators based on CUTLASS ( #123473 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473
Approved by: https://github.com/cpuhrsch
2024-04-14 06:57:41 +00:00
PyTorch MergeBot
97261be0a8
Revert "Simplify ATen sparse semi-structured operators based on CUTLASS ( #123473 )"
...
This reverts commit b2a0b8c446 .
Reverted https://github.com/pytorch/pytorch/pull/123473 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/123473#issuecomment-2053561077 ))
2024-04-13 07:47:32 +00:00
Aleksandar Samardžić
b2a0b8c446
Simplify ATen sparse semi-structured operators based on CUTLASS ( #123473 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473
Approved by: https://github.com/cpuhrsch
2024-04-11 11:56:27 +00:00
Arun Pa
266e278ccf
UFMT formatting on test/distributions, test/error_messages, test/forward_backward_compatability ( #123527 )
...
Partially addresses #123062
UFMT formatting on
- test/distributions
- test/error_messages, test/forward_backward_compatability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123527
Approved by: https://github.com/huydhn
2024-04-09 16:03:46 +00:00
PyTorch MergeBot
df048f4da4
Revert "[RELAND] Remove deprecated fbgemm operators ( #112153 )"
...
This reverts commit 19e8ba95e5 .
Reverted https://github.com/pytorch/pytorch/pull/112153 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112153#issuecomment-1921965780 ))
2024-02-01 18:35:19 +00:00
Peter Bell
19e8ba95e5
[RELAND] Remove deprecated fbgemm operators ( #112153 )
...
These operators are not used and have been deprecated since #72690
(Feb 2022).
BC-breaking message:
`TorchScript` models that were exported with the deprecated
`torch.jit.quantized` API will no longer be loadable, as the required
internal operators have been removed.
Please re-export your models using the newer `torch.ao.quantization` API
instead.
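A hedged re-export sketch using the `torch.ao.quantization` API; the model and the dynamic-quantization choice below are illustrative assumptions, not the only supported path:
```py
import torch
import torch.ao.quantization as tq

# Illustrative model; real users would load or rebuild their original float model here.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())

# Quantize with the torch.ao.quantization API instead of torch.jit.quantized,
# then re-export via TorchScript.
qmodel = tq.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
scripted = torch.jit.script(qmodel)
scripted.save("quantized_model.pt")
```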
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112153
Approved by: https://github.com/jerryzh168
2024-01-30 16:32:37 +00:00
Aleksandar Samardžić
f081c45a34
Add out_dtype support for sparse semi-structured CUTLASS back-end ( #116519 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116519
Approved by: https://github.com/cpuhrsch
2024-01-03 16:23:17 +00:00
Lu Fang
d791074c81
Clean up PyTorch op BC check list ( #116468 )
...
Summary: Remove the expired items.
Test Plan: CI
Differential Revision: D52435764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116468
Approved by: https://github.com/feikou
2023-12-28 06:05:59 +00:00
soulitzer
8885128dcc
Fix backward for SDPA NT jagged layout ( #115576 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115576
Approved by: https://github.com/jbschlosser , https://github.com/ani300
2023-12-12 18:35:40 +00:00
Antoni Viros
d47f715d29
Expose Flash attn to autograd ( #114378 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114378
Approved by: https://github.com/drisspg
2023-12-01 23:42:06 +00:00
drisspg
c46fc46dba
expose mem-eff to autograd ( #110495 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110495
Approved by: https://github.com/jbschlosser
2023-11-13 17:47:40 +00:00
Ke Wen
bb7ac12cbf
[ProcessGroupNCCL] Avoid recording stream for broadcast and scatter ( #112896 )
...
Summary: Follows PR #111431 ; saves memory for DTensor init
Test Plan: Sandcastle
Reviewed By: wanchaol
Differential Revision: D50985365
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112896
Approved by: https://github.com/wanchaol
2023-11-07 15:44:04 +00:00
leslie-fang-intel
b6fc7af8a0
Enable oneDNN QConv FP32/BF16 output ( #112010 )
...
**Summary**
- PR 1 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640 .
- Enable QConv (relu, add, add_relu) with BFloat16 or Float32 output.
**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qconv1d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv3d_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_relu_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_float_output_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112010
Approved by: https://github.com/jerryzh168 , https://github.com/jgong5
2023-11-03 08:16:45 +00:00
chunyuan
29f3d392bf
Inductor cpp wrapper: support QLinear ( #112378 )
...
Align the type of `post_op_args` in the schema of `onednn::qlinear_pointwise` to be the same as other fusion OPs like qconv, conv, conv_transpose, linear by changing from `float[]` to `Scalar?[]`:
cb942ef2b1/aten/src/ATen/native/quantized/library.cpp (L260-L266)
cb942ef2b1/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp (L48-L59)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112378
Approved by: https://github.com/jgong5 , https://github.com/desertfire
ghstack dependencies: #112373
2023-11-01 06:22:16 +00:00
PyTorch MergeBot
5ce8002d24
Revert "Remove deprecated fbgemm operators ( #104535 )"
...
This reverts commit 57c7aa12db .
Reverted https://github.com/pytorch/pytorch/pull/104535 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/104535#issuecomment-1779650412 ))
2023-10-25 16:34:16 +00:00
PyTorch MergeBot
b0087b4cf7
Revert "record_function: remove legacy internal operators ( #72303 )"
...
This reverts commit 0be84bb41e .
Reverted https://github.com/pytorch/pytorch/pull/72303 on behalf of https://github.com/izaitsevfb due to Apparently _record_function_enter is still used internally at Meta in several places and in lots of internal tests. ([comment](https://github.com/pytorch/pytorch/pull/72303#issuecomment-1777942975 ))
2023-10-24 20:01:14 +00:00