Kevin Fu
858fb80b9b
[PT2]: Add Static Dispatch Kernel for wrapped_fbgemm_linear_fp16_weight ( #160451 )
...
Summary: Add static dispatch kernel for wrapped_fbgemm_linear_fp16_weight. This optimization should improve perf for all Ads DSNN models using Sigmoid.
Test Plan:
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=892669089
SNAPSHOT_ID=37
OTHER_MODEL_ENTITY_ID=892669089
OTHER_SNAPSHOT_ID=36
MODULES=(mix prepare_float_features object user)
SUFFIXES=(.predictor.local .predictor.precompute.prepare_float_features .predictor.precompute.remote_object_only .predictor.precompute.remote_request_only)
for i in "${!MODULES[@]}"; do
MODULE=${MODULES[i]}
SUFFIX=${SUFFIXES[i]}
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkAB --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --otherNetFile=/data/users/$USER/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true
```
Before: P1900475429
I0810 19:29:22.782902 2717337 load_net_predictor_lib.cpp:1807] Average latency A: 0.0843 ms
I0810 19:29:22.782905 2717337 load_net_predictor_lib.cpp:1807] Average latency B: 0.0989 ms
After: P1900825771
I0811 15:42:34.866408 2311279 load_net_predictor_lib.cpp:1807] [36mAverage latency A: 0.0854 ms[0m
I0811 15:42:34.866411 2311279 load_net_predictor_lib.cpp:1807] [36mAverage latency B: 0.092 ms[0m
Still has some regression but the gap is smaller...
Rollback Plan:
Reviewed By: henryoier, muchulee8
Differential Revision: D80042054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160451
Approved by: https://github.com/henryoier
2025-08-15 04:06:17 +00:00
PyTorch MergeBot
641ee74781
Revert "Add label_smoothing param in nn.BCELoss and nn.BCEWithLogitsLoss ( #150282 )"
...
This reverts commit f990490a23 .
Reverted https://github.com/pytorch/pytorch/pull/150282 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150282#issuecomment-3182844949 ))
2025-08-13 09:01:52 +00:00
zeshengzong
f990490a23
Add label_smoothing param in nn.BCELoss and nn.BCEWithLogitsLoss ( #150282 )
...
Fixes #91545
## Changes
- Add `label_smoothing` param and docs
- Add test case for `label_smoothing`
- Remove duplicate description in `nn.BCELoss` and `nn.BCEWithLogitsLoss`
## Test Result
```bash
pytest -s test/test_nn.py -k test_bce
```



Pull Request resolved: https://github.com/pytorch/pytorch/pull/150282
Approved by: https://github.com/cyyever , https://github.com/mikaylagawarecki
2025-08-12 09:37:03 +00:00
Mikayla Gawarecki
7f649ed4f8
Add basic torch.hash_tensor op ( #154149 )
...
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.
- The hash is always uint64.
- Integers will be casted to uint64 before performing the xor_sum reduction
- Floats will be upcasted to double and then bitcasted to uint64 before performing the xor_sum reduction
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
2025-07-23 22:28:03 +00:00
AaronWang04
04a393507b
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.
```py
import torch
import torch.nn as nn
class RMSNorm(nn.Module):
def __init__(self, dim, eps=1e-5):
super().__init__()
self.eps = eps
self.scale = nn.Parameter(torch.ones(dim))
def forward(self, x):
norm_x = x.norm(2, dim=-1, keepdim=True)
rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
x_normed = x / (rms_x + self.eps)
return self.scale * x_normed
def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
input_data = torch.randn(input_shape, device='cuda', dtype=dtype)
for _ in range(warmup_iterations):
_ = rms_norm_layer(input_data)
torch.cuda.synchronize()
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for _ in range(num_iterations):
_ = rms_norm_layer(input_data)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = elapsed_time_ms / num_iterations
print(f"--- RMSNorm CUDA Benchmark ---")
print(f"Input Shape: {input_shape}")
print(f"Normalized Dimension: {normalized_dim}")
print(f"Benchmark Iterations: {num_iterations}")
print(f"--- Fused Implementation ---")
print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
for _ in range(warmup_iterations):
_ = compiled_rms_norm(input_data)
torch.cuda.synchronize()
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for _ in range(num_iterations):
_ = compiled_rms_norm(input_data)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = elapsed_time_ms / num_iterations
print(f"--- TorchCompile Implementation ---")
print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
print("-" * 50)
if __name__ == '__main__':
parameter_sets = [
{'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
{'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
{'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
{'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
{'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
]
num_benchmark_iterations = 200
num_warmup_iterations = 20
for params in parameter_sets:
batch_size = params['batch_size']
sequence_length = params['sequence_length']
hidden_features = params['hidden_features']
data_type = params.get('dtype', torch.float16)
shape = (batch_size, sequence_length, hidden_features)
norm_dim_to_normalize = hidden_features
print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
benchmark_rmsnorm_cuda(input_shape=shape,
normalized_dim=norm_dim_to_normalize,
num_iterations=num_benchmark_iterations,
warmup_iterations=num_warmup_iterations,
dtype=data_type)
```
Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code
torch.manual_seed(0)
device = torch.device("cuda")
for batch in range(0, 9):
for i in range(9, 16):
normalized_shape_arg = (2**batch, 2**i)
input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)
model = torch.nn.functional.rms_norm
compiled_model = torch.compile(model)
loss = torch.randn_like(input_tensor)
num_iter = 5
for j in range(num_iter):
output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
output.backward(loss)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
num_iter = 10
for j in range(num_iter):
output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
output.backward(loss)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = round(elapsed_time_ms / num_iter, 5)
print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/albanD
2025-07-22 22:25:44 +00:00
PyTorch MergeBot
35f1b4ad9e
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit 15ef4f28df .
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking tests internally. @albanD can you please help land this change?You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts . See D78599667 for more info ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3097690935 ))
2025-07-21 17:31:42 +00:00
Yukio Siraichi
a10f15718d
[DLPack] Add support for missing keyword-arguments. ( #150218 )
...
This PR introduces the rest of the keyword-arguments added in DLPack
version 2023.12: `dl_device` and `copy`.
In summary, we handle these arguments in the C++ implementation of
`to_dlpack(...)` at _torch/csrc/Module.cpp_, by calling the
`maybeCopyTensor` function at _aten/src/ATen/DLConvertor.cpp_. It also
introduces the following changes:
- Add a new Python API `torchDeviceToDLDevice()`, which is simply a
refactoring of the `getDLDevice()` function at
_aten/src/ATen/DLConvertor.cpp_.
- Add both keyword-arguments to the `from_dlpack()` function at
_torch/utils/dlpack.py_ and to the `Tensor.__dlpack__()` dunder
method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150218
Approved by: https://github.com/albanD
ghstack dependencies: #150216 , #150217
2025-07-20 00:46:20 +00:00
AaronWang04
15ef4f28df
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.
```py
import torch
import torch.nn as nn
class RMSNorm(nn.Module):
def __init__(self, dim, eps=1e-5):
super().__init__()
self.eps = eps
self.scale = nn.Parameter(torch.ones(dim))
def forward(self, x):
norm_x = x.norm(2, dim=-1, keepdim=True)
rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
x_normed = x / (rms_x + self.eps)
return self.scale * x_normed
def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
input_data = torch.randn(input_shape, device='cuda', dtype=dtype)
for _ in range(warmup_iterations):
_ = rms_norm_layer(input_data)
torch.cuda.synchronize()
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for _ in range(num_iterations):
_ = rms_norm_layer(input_data)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = elapsed_time_ms / num_iterations
print(f"--- RMSNorm CUDA Benchmark ---")
print(f"Input Shape: {input_shape}")
print(f"Normalized Dimension: {normalized_dim}")
print(f"Benchmark Iterations: {num_iterations}")
print(f"--- Fused Implementation ---")
print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
for _ in range(warmup_iterations):
_ = compiled_rms_norm(input_data)
torch.cuda.synchronize()
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for _ in range(num_iterations):
_ = compiled_rms_norm(input_data)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = elapsed_time_ms / num_iterations
print(f"--- TorchCompile Implementation ---")
print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
print("-" * 50)
if __name__ == '__main__':
parameter_sets = [
{'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
{'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
{'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
{'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
{'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
]
num_benchmark_iterations = 200
num_warmup_iterations = 20
for params in parameter_sets:
batch_size = params['batch_size']
sequence_length = params['sequence_length']
hidden_features = params['hidden_features']
data_type = params.get('dtype', torch.float16)
shape = (batch_size, sequence_length, hidden_features)
norm_dim_to_normalize = hidden_features
print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
benchmark_rmsnorm_cuda(input_shape=shape,
normalized_dim=norm_dim_to_normalize,
num_iterations=num_benchmark_iterations,
warmup_iterations=num_warmup_iterations,
dtype=data_type)
```
Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code
torch.manual_seed(0)
device = torch.device("cuda")
for batch in range(0, 9):
for i in range(9, 16):
normalized_shape_arg = (2**batch, 2**i)
input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)
model = torch.nn.functional.rms_norm
compiled_model = torch.compile(model)
loss = torch.randn_like(input_tensor)
num_iter = 5
for j in range(num_iter):
output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
output.backward(loss)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
num_iter = 10
for j in range(num_iter):
output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
output.backward(loss)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = round(elapsed_time_ms / num_iter, 5)
print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/eqy , https://github.com/albanD
2025-07-18 23:24:21 +00:00
Xuehai Pan
4cc8b60d1b
[BE][1/16] fix typos in torch/ ( #156311 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156311
Approved by: https://github.com/albanD
2025-07-09 11:02:22 +00:00
PyTorch MergeBot
6401d1d53d
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit e1aee86646 .
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/davidberard98 due to causing build failures on main branch [GH job link](https://github.com/pytorch/pytorch/actions/runs/16007148842/job/45156382001 ) [HUD commit link](e1aee86646 ) ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3025146176 ))
2025-07-01 18:46:45 +00:00
AaronWang04
e1aee86646
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.
```py
import torch
import torch.nn as nn
class RMSNorm(nn.Module):
def __init__(self, dim, eps=1e-5):
super().__init__()
self.eps = eps
self.scale = nn.Parameter(torch.ones(dim))
def forward(self, x):
norm_x = x.norm(2, dim=-1, keepdim=True)
rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
x_normed = x / (rms_x + self.eps)
return self.scale * x_normed
def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
input_data = torch.randn(input_shape, device='cuda', dtype=dtype)
for _ in range(warmup_iterations):
_ = rms_norm_layer(input_data)
torch.cuda.synchronize()
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for _ in range(num_iterations):
_ = rms_norm_layer(input_data)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = elapsed_time_ms / num_iterations
print(f"--- RMSNorm CUDA Benchmark ---")
print(f"Input Shape: {input_shape}")
print(f"Normalized Dimension: {normalized_dim}")
print(f"Benchmark Iterations: {num_iterations}")
print(f"--- Fused Implementation ---")
print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
for _ in range(warmup_iterations):
_ = compiled_rms_norm(input_data)
torch.cuda.synchronize()
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for _ in range(num_iterations):
_ = compiled_rms_norm(input_data)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = elapsed_time_ms / num_iterations
print(f"--- TorchCompile Implementation ---")
print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
print("-" * 50)
if __name__ == '__main__':
parameter_sets = [
{'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
{'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
{'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
{'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
{'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
]
num_benchmark_iterations = 200
num_warmup_iterations = 20
for params in parameter_sets:
batch_size = params['batch_size']
sequence_length = params['sequence_length']
hidden_features = params['hidden_features']
data_type = params.get('dtype', torch.float16)
shape = (batch_size, sequence_length, hidden_features)
norm_dim_to_normalize = hidden_features
print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
benchmark_rmsnorm_cuda(input_shape=shape,
normalized_dim=norm_dim_to_normalize,
num_iterations=num_benchmark_iterations,
warmup_iterations=num_warmup_iterations,
dtype=data_type)
```
Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code
torch.manual_seed(0)
device = torch.device("cuda")
for batch in range(0, 9):
for i in range(9, 16):
normalized_shape_arg = (2**batch, 2**i)
input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)
model = torch.nn.functional.rms_norm
compiled_model = torch.compile(model)
loss = torch.randn_like(input_tensor)
num_iter = 5
for j in range(num_iter):
output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
output.backward(loss)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
num_iter = 10
for j in range(num_iter):
output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
output.backward(loss)
end_event.record()
torch.cuda.synchronize()
elapsed_time_ms = start_event.elapsed_time(end_event)
avg_time_ms = round(elapsed_time_ms / num_iter, 5)
print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel
2025-07-01 18:22:24 +00:00
Yukio Siraichi
b54eac2a5e
Upgrade to DLPack 1.0. ( #145000 )
...
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPAck capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-30 16:58:06 +00:00
PyTorch MergeBot
b4442f42a9
Revert "Upgrade to DLPack 1.0. ( #145000 )"
...
This reverts commit 6e185c5312 .
Reverted https://github.com/pytorch/pytorch/pull/145000 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/145000#issuecomment-2992055400 ))
2025-06-20 15:32:47 +00:00
Yukio Siraichi
6e185c5312
Upgrade to DLPack 1.0. ( #145000 )
...
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPAck capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-19 16:27:42 +00:00
Sean McGovern
297805fd8f
Typo fixes for "overridden" in comments and function names ( #155944 )
...
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
2025-06-14 03:37:38 +00:00
Oguz Ulgen
d1947a8707
Migrate from lru_cache to cache ( #155613 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
PaulZhang12
3ed5f1fb77
[CUDA][cuBLAS] Aten GEMM overload for FP32 output from FP16/BF16 inputs ( #150812 )
...
Enable FP32 output from FP16/BF16 GEMMs in aten with cuBLAS. Accumulation for these GEMMs are generally already done in FP32. Adds the functionality to the following aten operators:
* mm
* bmm
* addmm
* baddmm
Follow up of customer issue: https://github.com/pytorch/pytorch/issues/146241#issuecomment-2781889390
Differential Revision: [D73126191](https://our.internmc.facebook.com/intern/diff/D73126191 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150812
Approved by: https://github.com/ngimel , https://github.com/eqy
2025-04-18 01:53:26 +00:00
Thomas Adams
8494d5582a
Propagate callable parameter types using ParamSpec ( #142306 ) ( #151014 )
...
Partially addresses #142306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151014
Approved by: https://github.com/Skylion007
2025-04-13 20:38:11 +00:00
cyy
98bf2f1170
Use Python 3.9 typing ( #148157 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148157
Approved by: https://github.com/janeyx99
2025-03-04 03:09:55 +00:00
Xuehai Pan
c73a92fbf5
[BE][CI] bump ruff to 0.9.2: multiline assert statements ( #144546 )
...
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
> f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
Jack Zhang
ed309b9156
Re-add stft option to align window for center = false ( #146379 )
...
Skips advancing the fc window on https://github.com/pytorch/pytorch/pull/145437 , since I just found that there were non-trivial efforts to do so a while ago that eventually was reverted: https://github.com/pytorch/pytorch/pull/73434
Works around the issue by keeping the stft sans center overload
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146379
Approved by: https://github.com/justinchuby , https://github.com/iseeyuan
2025-02-06 14:07:13 +00:00
PyTorch MergeBot
ad36f4f42c
Revert "Add generator parameter to rand*_like functions ( #136780 )"
...
This reverts commit c7b2f7dd14 .
Reverted https://github.com/pytorch/pytorch/pull/136780 on behalf of https://github.com/izaitsevfb due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/136780#issuecomment-2613191933 ))
2025-01-24 19:00:21 +00:00
Aaron Orenstein
f2cfe8b59f
PEP585 update - mostly toplevels ( #145178 )
...
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145178
Approved by: https://github.com/bobrenjc93
2025-01-22 02:21:14 +00:00
Sam
c7b2f7dd14
Add generator parameter to rand*_like functions ( #136780 )
...
Fixes #128786
Fixes #101974
Fixes #27072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136780
Approved by: https://github.com/Chillee , https://github.com/ezyang
2025-01-15 21:16:52 +00:00
gasoonjia
29e985b7b0
[dim_order] raised runtime error when tensor has ambiguous dim order ( #141632 )
...
This diff makes tensor.dim_order() raise error when tensor's dim order is ambiguous. Detail discussion can be found https://fb.workplace.com/groups/894363187646754/permalink/2039987243084337/
Differential Revision: [D65133579](https://our.internmc.facebook.com/intern/diff/D65133579/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141632
Approved by: https://github.com/larryliu0820
2024-12-08 23:16:57 +00:00
Yukio Siraichi
f8c212a925
Transform unbacked int expressions into a fresh unbacked int. ( #141917 )
...
Fix : #141419
This PR introduces the `torch.sym_fresh_size` API, which transforms an unbacked int
expression into a fresh unbacked int.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141917
Approved by: https://github.com/ezyang
2024-12-05 16:53:44 +00:00
Donald Tolley
c1e7d85ce6
Add Weighted Loss Functions to PyTorch : WMSE, WMAE, and Weighted Huber Loss ( #132049 )
...
#### Summary
This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, important for imbalanced data or when certain samples are more significant than others.
#### Changes
- **`weighted_huber_loss`**: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter.
- **`wmse_loss`** (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks.
- **`wmae_loss`** (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers.
#### Code Details
- **Input Validation**: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors.
- **Reduction Options**: Supports `none`, `mean`, and `sum` reductions to suit various computational needs.
- **Backward Compatibility**: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument.
#### Usage Example
```python
import torch
input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32)
target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32)
weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32)
loss = weighted_huber_loss(input, target, weights, delta=1.0)
print(loss)
```
---
Feedback on these implementations is welcome; please let me know if further modifications are required.
Resolves #132465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049
Approved by: https://github.com/mikaylagawarecki
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2024-10-31 21:59:43 +00:00
Aaron Gokaslan
5d074746e9
[BE]: Add better optional typing ( #138426 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138426
Approved by: https://github.com/XuehaiPan , https://github.com/malfet
2024-10-27 14:19:00 +00:00
Laith Sakka
ed313a5ca2
Introduce torch.sym_add, variadic add ( #138660 )
...
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
main change is
```
if min is None and max is None:
torch._check_is_size(size)
return
```
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang , https://github.com/bobrenjc93
2024-10-23 17:42:41 +00:00
Gabriel Ferns
2b4af6fa74
Mark torch.get_device as overridable at the python level ( #132706 )
...
Summary:
- add a value to `get_testing_overrides` function for `torch.get_device()`
- remove `torch.get_device()` from the `get_ignored_functions` list
Test Plan:
Existing override testing infra, which should pick up the updates to these two variables.
Closes the loop on:
https://github.com/pytorch/pytorch/pull/132567
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132706
Approved by: https://github.com/ezyang
2024-10-22 07:20:42 +00:00
Michael Lazos
a20a17fd6f
[Dynamo] Disable torch function compilation during guard execution and in compiled bytecode ( #137669 )
...
Fixes https://github.com/pytorch/pytorch/issues/114369
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669
Approved by: https://github.com/anijain2305
2024-10-19 04:12:45 +00:00
PyTorch MergeBot
16a2c2cfd4
Revert "Introduce torch.sym_sum ( #136429 )"
...
This reverts commit 90bed32b98 .
Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147 ))
2024-10-09 20:08:01 +00:00
Edward Z. Yang
90bed32b98
Introduce torch.sym_sum ( #136429 )
...
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
update_hint_regression benchmark, before and after:
```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
2024-10-08 18:12:57 +00:00
Joel Schlosser
6374a19a6e
Fix wrapper subclass serialization with custom sizes / strides ( #137030 )
...
Fixes #130154
This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.
The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.
Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
2024-10-02 18:55:03 +00:00
PyTorch MergeBot
e9d2765ec8
Revert "Add deterministic path for CUDA cumsum ( #136224 )"
...
This reverts commit d1bb8e828f .
Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Break internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2379214226 ))
2024-09-27 12:54:47 +00:00
Kurt Mohler
d1bb8e828f
Add deterministic path for CUDA cumsum ( #136224 )
...
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.
Fixes #89492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang , https://github.com/justinchuby
2024-09-26 04:52:05 +00:00
PyTorch MergeBot
e3b89ca124
Revert "Add deterministic path for CUDA cumsum ( #136224 )"
...
This reverts commit b1a02bf708 .
Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to Failing internall CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2374201626 ))
2024-09-25 14:11:01 +00:00
Kurt Mohler
b1a02bf708
Add deterministic path for CUDA cumsum ( #136224 )
...
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.
Fixes #89492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang , https://github.com/justinchuby
2024-09-24 21:34:43 +00:00
PyTorch MergeBot
fd182b90a7
Revert "Add deterministic path for CUDA cumsum ( #136224 )"
...
This reverts commit d45b0151e5 .
Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internall CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135 ))
2024-09-23 19:57:13 +00:00
Kurt Mohler
d45b0151e5
Add deterministic path for CUDA cumsum ( #136224 )
...
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.
Fixes #89492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang , https://github.com/justinchuby
2024-09-20 02:41:56 +00:00
Huamin Li
fd494dd426
Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix ( #135401 )
...
Summary: In https://github.com/pytorch/pytorch/pull/134232 , we added two new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked. From the review comments and offline discussion, we are changing them to private by adding `_` as prefix
Differential Revision: D62325142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401
Approved by: https://github.com/houseroad
2024-09-08 04:16:24 +00:00
Huamin Li
311af3b988
Add new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked ( #134232 )
...
Summary:
This diff adds two new operators torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. It is a decomposition of the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff.
We decomposed in this way as packed weight could be computed early so we don;t need to do it in every forward in AOTI
Reviewed By: jerryzh168
Differential Revision: D61395887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232
Approved by: https://github.com/houseroad
2024-08-23 04:54:26 +00:00
Avik Chaudhuri
b454c51060
remove dynamic_dim ( #134211 )
...
Summary: As promised in https://github.com/pytorch/pytorch/pull/134045 .
Test Plan: existing
Differential Revision: D61646937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134211
Approved by: https://github.com/angelayi
2024-08-23 04:13:03 +00:00
Xuehai Pan
4226ed1585
[BE] Format uncategorized Python files with ruff format ( #132576 )
...
Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576
Approved by: https://github.com/ezyang , https://github.com/Skylion007
ghstack dependencies: #132574
2024-08-04 17:13:31 +00:00
Oguz Ulgen
72d2dba992
Add None return type to init ( #132335 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
Jun Luo
00e19ae97a
[MTIA] Support module.mtia() ( #131499 )
...
Summary: Following other device backends' implementation to support module.mtia() API.
Test Plan: OSS and Internal CIs.
Differential Revision: D60076584
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131499
Approved by: https://github.com/mikaylagawarecki
2024-07-25 04:23:48 +00:00
Xuehai Pan
973037be6a
[BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() ( #130199 )
...
This PR changes the empty collection factory call to Python literals:
- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`
The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:
```bash
$ python3 -m dis - <<EOS
import collections
d1 = {}
d2 = dict()
dict = collections.OrderedDict
d3 = dict()
EOS
```
```text
0 0 RESUME 0
1 2 LOAD_CONST 0 (0)
4 LOAD_CONST 1 (None)
6 IMPORT_NAME 0 (collections)
8 STORE_NAME 0 (collections)
3 10 BUILD_MAP 0
12 STORE_NAME 1 (d1)
4 14 PUSH_NULL
16 LOAD_NAME 2 (dict)
18 CALL 0
26 STORE_NAME 3 (d2)
6 28 LOAD_NAME 0 (collections)
30 LOAD_ATTR 8 (OrderedDict)
50 STORE_NAME 2 (dict)
7 52 PUSH_NULL
54 LOAD_NAME 2 (dict)
56 CALL 0
64 STORE_NAME 5 (d3)
66 RETURN_CONST 1 (None)
```
The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Xuehai Pan
f85d1e845a
[BE] enable UFMT for torch/nn/*.py ( #128593 )
...
Part of #123062
- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593
Approved by: https://github.com/mikaylagawarecki
2024-06-23 16:05:13 +00:00
PyTorch MergeBot
cc8193c707
Revert "[BE] enable UFMT for torch/nn/functional.py ( #128592 )"
...
This reverts commit f6e6e55fa7 .
Reverted https://github.com/pytorch/pytorch/pull/128592 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128592#issuecomment-2181783936 ))
2024-06-21 00:44:16 +00:00
Xuehai Pan
f6e6e55fa7
[BE] enable UFMT for torch/nn/functional.py ( #128592 )
...
Part of #123062
- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #128596 , #128594
2024-06-17 16:29:29 +00:00