When bicubic interpolation was added to grid_sampler in #44780, `GridSampleFuncOptions` was not updated to allow a user to use bicubic mode in LibTorch, even though the function could handle it. This PR fixes the parity such that LibTorch's `torch::nn::functional::grid_sample` behaves the same as PyTorch's `torch.nn.functional.grid_sample`.
Existing users can directly use `torch::grid_sampler` but must know what int to pass for the interpolation (2 for bicubic) and padding mode parameters, which is not ideal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150817
Approved by: https://github.com/Skylion007
Similar to #140425, we are making the implementation usable via header-only code sharing.
Review note: #62546 by @yanbing-j removed expm1 usage from this path. I don't know why and expm1 should be more efficient, so I've put it back. Please let me know if there is a good reason I shouldn't.
Testing: existing correctness tests should cover.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149673
Approved by: https://github.com/cyyever, https://github.com/Skylion007
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376
To enable more balanced sharding, had to push 148ae19935
Added `@xfailIfSM89` to the following tests:
- test_fp8_pattern_2
- test_original_aten_preserved_split_addmm
- test_sparse_semi_structured_scaled_mm
- test_sparse_semi_structured_scaled_mm_fp8
- test_sparse_fp8fp8_mm
Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`
Skipped following inductor tests (that either flaky OOMs or timeouts):
- test_reduction_fn_std_float64
- test_reduction_fn_var_mean_float64
- test_multi_output_unbacked_custom_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on.
For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627
Approved by: https://github.com/malfet
Fixes#127920
This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.
**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
431 | __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
| ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally
By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.
**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
Fixes#10536
Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!
As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.
The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for for Hugging Face tokenizers](9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)), which requires only a very small change in the C++ code.
Here are the benchmarks of the new implementation!
`float32`:

`bool`:

Code:
```python
from __future__ import annotations
import random
import time
from typing import Literal
import numpy as np
import torch
def pad_sequence_with_flips(
sequences: list[torch.Tensor],
batch_first: bool = False,
padding_value: int | float | bool = 0.0,
padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
if padding_side == 'right':
padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
elif padding_side=='left':
padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value) # pyright: ignore[reportArgumentType]
padded_sequence = padded_sequence.flip(int(batch_first))
else:
raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")
return padded_sequence
sequence_lengths: list[int] = []
flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []
left_pad_times: list[float] = []
left_pad_times_std: list[float] = []
RUNS_PER_LOOP: int = 100
for i in range(1, 7):
sequence_length = i * int(1e6) // 6
sequence_lengths.append(sequence_length)
sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]
inner_left_pad_times: list[float] = []
inner_right_pad_times: list[float] = []
inner_flip_left_pad_times: list[float] = []
inner_flip_right_pad_times: list[float] = []
for _ in range(RUNS_PER_LOOP):
start = time.perf_counter()
torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
end = time.perf_counter()
inner_left_pad_times.append(end - start)
start = time.perf_counter()
pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
end = time.perf_counter()
inner_flip_left_pad_times.append(end - start)
left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
left_pad_times_std.append(np.std(inner_left_pad_times))
flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))
print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")
import matplotlib.pyplot as plt
plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')
plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')
plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")
# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```
Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884
Approved by: https://github.com/ezyang
Copy of #126089, with some additional fixes & tests
Partial fix for #125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation.
The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable. What changes:
* (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying
* (still kind of wrong): views won't actually alias each other after deepcopying.
* (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias.
BC breaking: C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++.
Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126126
Approved by: https://github.com/ezyang
This PR makes libtorch behave the same as PyTorch when loading optimizer state from archive. With PyTorch, options of parameter groups are loaded from the archive, which is missing currently in libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125215
Approved by: https://github.com/janeyx99
This PR updates the error message in autograd when an input tensor does not set to `require_grad`. The original message does not contain the index info, making users hard to debug.
The error message style consists with that on line 105-109.
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123154
Approved by: https://github.com/soulitzer
As this is the oldest gcc that is fully compatible with C++17 standard.
- Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report```
Fixes https://github.com/pytorch/pytorch/issues/101839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
As this is the oldest gcc that is fully compatible with C++17 standard.
- Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report```
Fixes https://github.com/pytorch/pytorch/issues/101839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD