This is needed for the HSTU model.
Details:
* ~~NT `chunk` now calls into NT `split_with_sizes` since the latter is more general~~ (removed; they're totally separate)
* Throws for backward
* Only operates over the last dim (`dim=-1`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97446
Approved by: https://github.com/cpuhrsch
# Summary
There exists an optimization within the scaled_dot_product_efficieint bacwkard attention path to, under the right conditions, output grad_q, grad_k, grad_v all as aliases of the same storage. This was done to optimize for the hot path where mha does packed linear_projection -> chunk -> (view stuff) -> sdpa. The thought was that chunk-> would be able to "trivially" cat inputs to chunk.backward(). However upon closer inspection chunk.backward will call ` cat` irregardless of the inputs so this is not being utilized.
I validated this by profiling on main and then this branch and the traces produced the same both with `split.backward()` calling into cat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96880
Approved by: https://github.com/cpuhrsch
Fixes internal linking problem after `DECLARE_DISPATCH` was introduced in SparseTensorUtils.cpp, but implemented inside the native library.
Also, fix `sign-unsigned` compare in `_flatten_indices_impl`
Followups:
Move code declared/implemented in `SparseTensorUtils.*` to `at::native` namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96696
Approved by: https://github.com/albanD
follow-up https://github.com/pytorch/pytorch/pull/93901.
Unexpected numerical mismatches observed in some foreach functions' backward result seemed to be caused by the wrong order of `IndexRangeGenerator::range` call.
This pr has `args_with_derivatives` have the same or similar order of `foreach_native_function.func.arguments.flat_non_out`
---
what the current master generates for `_foreach_mul.List`:
```cpp
variable_list ForeachMulBackward0List::apply(variable_list&& grads) {
std::lock_guard<std::mutex> lock(mutex_);
TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE);
TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE);
IndexRangeGenerator gen;
auto other_ix = gen.range(other_size_);
auto self_ix = gen.range(self_size_);
variable_list grad_inputs(gen.size());
auto other = unpack_list(other_);
auto self = unpack_list(self_);
if (task_should_compute_output({ other_ix })) {
std::vector<Tensor> grad_result;
grad_result.reserve(grads.size());
for (const auto & i : c10::irange(grads.size())) {
grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type()));
}
copy_range(grad_inputs, other_ix, grad_result);
}
if (task_should_compute_output({ self_ix })) {
std::vector<Tensor> grad_result;
grad_result.reserve(grads.size());
for (const auto & i : c10::irange(grads.size())) {
grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type()));
}
copy_range(grad_inputs, self_ix, grad_result);
}
return grad_inputs;
}
```
with this PR the generated backward is
```cpp
variable_list ForeachMulBackward0List::apply(variable_list&& grads) {
std::lock_guard<std::mutex> lock(mutex_);
TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE);
TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE);
IndexRangeGenerator gen;
auto self_ix = gen.range(self_size_); <----- diff
auto other_ix = gen.range(other_size_); <----- diff
variable_list grad_inputs(gen.size());
auto self = unpack_list(self_);
auto other = unpack_list(other_);
if (task_should_compute_output({ other_ix })) {
std::vector<Tensor> grad_result;
grad_result.reserve(grads.size());
for (const auto & i : c10::irange(grads.size())) {
grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type()));
}
copy_range(grad_inputs, other_ix, grad_result);
}
if (task_should_compute_output({ self_ix })) {
std::vector<Tensor> grad_result;
grad_result.reserve(grads.size());
for (const auto & i : c10::irange(grads.size())) {
grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type()));
}
copy_range(grad_inputs, self_ix, grad_result);
}
return grad_inputs;
}
```
The change is to fix the order of `self_ix` and `other_ix`.[](url)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95263
Approved by: https://github.com/soulitzer
Not only is this change usually shorter and more readable, it also can yield better performance. size() is not always a constant time operation (such as on LinkedLists), but empty() always is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
Another PR towards solving #89205.
What's in this PR:
* The implementation of forward `logcumsumexp` for complex numbers in CPU & CUDA
* The tests on forward call of `logcumsumexp` for complex numbers
* The implementation of backward `logcumsumexp` for complex numbers
What's missing:
* The test on backward gradient of `logcumsumexp` (it complaints `RuntimeError: logcumsumexp does not support automatic differentiation for outputs with complex dtype.` and I don't know how to solve the error and I don't know where to put the test for the backward computation). If possible, I'd like this to be done in this PR.
It's really tricky to handle the edge cases here (i.e. the ones involving `inf`), but I've tried my best to put some comments explaining the reasonings of my decisions in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90847
Approved by: https://github.com/albanD
`log1p` offers better precision near zero since `(1 + x) - 1` truncates any
values less than the float epsilon to zero. For `soft_margin_loss` this also
requires one fewer kernel invocation which for numel=1e7 gives me a 1.2x speedup
on CUDA and a 1.1x speedup on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92114
Approved by: https://github.com/ngimel, https://github.com/lezcano
Ref #70924
This addresses part 1 of the issue, allowing `torch.squeeze` to be
passed a tuple of dimensions. e.g.
```python
x.squeeze(0).squeeze(0)
```
can now be written
```python
x.squeeze((0, 1))
```
(assuming x has at least 2 dimensions)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89017
Approved by: https://github.com/albanD
The eager implementation of softmax supports computation along zero dimensions, but many of the other implementations did not, including:
* decompositions & refs (this was causing dynamo failures)
* forward AD for logsumexp
* MPS log_softmax_backward
This PR handles the `input.numel() == 0` cases separately to avoid running `amax()`, which fails for zero dimensions, and updates opinfos.
example of "computation along zero dimensions":
```python
# example of where
import torch
t = torch.rand((4, 0, 0))
print("~")
print(torch.nn.functional.softmax(t, dim=-1)) # this passes
print("~")
torch._refs.softmax(t, dim=-1) # this fails
print("~")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91322
Approved by: https://github.com/lezcano
Ref #70924
This addresses part 1 of the issue, allowing `torch.squeeze` to be
passed a tuple of dimensions. e.g.
```python
x.squeeze(0).squeeze(0)
```
can now be written
```python
x.squeeze((0, 1))
```
(assuming x has at least 2 dimensions)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89017
Approved by: https://github.com/albanD
Applies various automated fixes that reduces the number of spurious copies in torch, aten, and c10. I also inlined any default dtors that would have made the type trivially destructible.
Follow up to #89000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90629
Approved by: https://github.com/ezyang
We have an older torch.vmap implementation. It is no longer supported.
It still needs to exist somewhere for the sake of BC with
torch.autograd.functional.
This PR makes it clear what files are meant for implementing the old
vmap implementation. I've seen a couple of PRs recently adding support
for the old vmap implementation, so this will lessen the confusion.
Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90324
Approved by: https://github.com/samdow