Fixes #88940
According to the [doc](https://pytorch.org/docs/stable/generated/torch.index_select.html):
1. "The returned tensor has the same number of dimensions as the original tensor (`input`). "
2. "The `dim`th dimension has the same size as the length of `index`; other dimensions have the same size as in the original tensor."
These two conditions cannot both hold if `input` is a scalar and `index` has multiple values: by property 1 the output must also be a scalar, and a scalar holds exactly one element, so "The `dim`th dimension has the same size as the length of `index`" cannot be satisfied.
However, if we currently do so, we get one of the following:
1. A buffer overflow reported by ASAN;
2. Or (without ASAN) a silently returned output that is inconsistent with the doc (`x.index_select(0, torch.Tensor([0, 0, 0]).int())` returns `x`).
As a result, we should explicitly reject such cases.
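The shape rule quoted above can be sketched in plain Python. This is only an illustration of the documented contract, not the actual ATen check; the helper name `index_select_output_shape` and its error messages are hypothetical.

```python
def index_select_output_shape(input_shape, dim, index_len):
    """Return the output shape implied by the documentation, or raise if none exists."""
    ndim = len(input_shape)
    if ndim == 0:
        # A 0-dim (scalar) input holds exactly one element, so the output
        # must also be a scalar and only a single index can fit.
        if dim not in (0, -1):
            raise IndexError("dim out of range for a scalar input")
        if index_len != 1:
            raise ValueError("index to a scalar must have exactly 1 value")
        return ()
    if not -ndim <= dim < ndim:
        raise IndexError("dim out of range")
    out = list(input_shape)
    out[dim] = index_len  # negative dim is handled by Python indexing
    return tuple(out)


print(index_select_output_shape((3, 4), 1, 2))  # same rank, dim 1 resized
print(index_select_output_shape((), 0, 1))      # scalar in, scalar out
```

With a scalar input and three indices, the helper raises instead of returning a shape that cannot exist.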
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94347
Approved by: https://github.com/malfet
As found in #92709, thanks to @ngimel and @jansel, `torch.Tensor.fn` currently points to `UserDefinedObjectVariable` rather than `TorchVariable`. The root cause is described in https://github.com/pytorch/pytorch/pull/92709#pullrequestreview-1273357406. To prevent this, build the `TorchVariable` of `torch.Tensor.fn` so that it points to `torch.ops.aten.fn`.
Without this fix, calling `torch.Tensor.fn` causes a graph break with `nopython=True`.
```python
import torch
import torch._dynamo as dynamo
# op = torch.ops.aten.abs_  # no graph break
op = torch.Tensor.abs_  # graph break
args = torch.empty(10)

def foo(args):
    return op(args)

opt_foo = dynamo.optimize("inductor", nopython=True)(foo)
y_ = opt_foo(args)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93243
Approved by: https://github.com/jansel
Exponential distribution is continuous. Fixes CPU MKL exponential implementation to exclude integer dtypes.
```python
import torch
dtypes = [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64]
for dtype in dtypes:
    x = torch.empty(10000, dtype=dtype).exponential_()  # should fail!
    print("dtype: ", x.dtype, "sum: ", x.sum())
```
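To see why integer dtypes make little sense for a continuous distribution, here is a small standard-library sketch. It does not reproduce the MKL code path; it only assumes a hypothetical integer cast that truncates each sample, and shows that the truncated values no longer have the mean of the exponential distribution they were drawn from.

```python
import random

random.seed(0)  # arbitrary seed, only for repeatability

# Draw from Exp(rate=1), whose mean is 1.
samples = [random.expovariate(1.0) for _ in range(10000)]

# Hypothetical integer storage: truncate every sample toward zero.
truncated = [int(s) for s in samples]

float_mean = sum(samples) / len(samples)
int_mean = sum(truncated) / len(truncated)

print("float mean:    ", float_mean)
print("truncated mean:", int_mean)
```

The truncated mean is always lower, because every fractional part is discarded.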
### Additional Context
Related to #92709. This issue propagates to OpInfo of exponential.
```
AssertionError: The supported dtypes for exponential on device type cpu are incorrect!
The following dtypes worked in forward but are not listed by the OpInfo: {torch.int64, torch.uint8, torch.int8, torch.int16, torch.int32}.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92891
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/ngimel
Fixes #88470
I added the "method" keyword in `aten/src/ATen/native/native_functions.yaml` for the function `where` with Scalar Overload.
This way, you can now use `Tensor.where()` with a scalar parameter the same way `torch.where()` can.
I added a test in `test/test_torch.py` as requested.
It uses the `where()` method on a tensor and then checks it has the same results as the `torch.where()` function.
The test is roughly the same as the one provided by the author of the issue.
PS: this is the second PR I have made to resolve this issue. The first one, #92747, had trouble with commit signatures and was therefore closed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92849
Approved by: https://github.com/albanD
Another PR towards solving #89205.
What's in this PR:
* The implementation of forward `logcumsumexp` for complex numbers in CPU & CUDA
* The tests on forward call of `logcumsumexp` for complex numbers
* The implementation of backward `logcumsumexp` for complex numbers
What's missing:
* The test on the backward gradient of `logcumsumexp` (it complains `RuntimeError: logcumsumexp does not support automatic differentiation for outputs with complex dtype.`; I don't know how to solve the error, nor where to put the test for the backward computation). If possible, I'd like this to be done in this PR.
It's really tricky to handle the edge cases here (i.e. the ones involving `inf`), but I've tried my best to add comments explaining the reasoning behind my decisions in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90847
Approved by: https://github.com/albanD
`TORCH_CHECK_TENSOR_ALL(cond, ...)` is a wrapper around `TORCH_CHECK` which allows the condition argument to be a tensor, batched or unbatched. `cond` can be a boolean tensor of any size. If any element is False, or if `cond.numel() == 0`, then `TORCH_CHECK_TENSOR_ALL` raises an error.
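The semantics described above can be mimicked in plain Python, with nested lists of booleans standing in for a boolean tensor. This is only a sketch of the stated behaviour; `check_tensor_all`, `flatten`, and the default message are hypothetical and not part of the C++ macro.

```python
def flatten(cond):
    """Flatten a bool or an arbitrarily nested list of bools into a flat list."""
    if isinstance(cond, bool):
        return [cond]
    flat = []
    for item in cond:
        flat.extend(flatten(item))
    return flat


def check_tensor_all(cond, message="Expected cond to be True, but got False"):
    """Raise if any element is False or if there are no elements at all."""
    elements = flatten(cond)
    if len(elements) == 0 or not all(elements):
        raise RuntimeError(message)


check_tensor_all(True)                     # unbatched condition passes
check_tensor_all([[True, True], [True]])   # batched condition passes
```

A single False anywhere in the nested structure, or an empty structure, makes the check raise.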
Part of #72948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89097
Approved by: https://github.com/zou3519
### Motivation
When `dim` is -1 and the slice of the source or result is noncontiguous, the original `index_add` is slow: it calls add on each sliced tensor, running serially over the index and in parallel over the sliced tensor to avoid write conflicts. Parallelizing over the sliced tensor is not optimal, because the slice may be too small to parallelize well and because it launches a separate parallel region for every index.
`scatter_add` is used to speed up this case, as `scatter_add` parallelizes over the outer dimensions of the input and is serial over the inner dimension to avoid write conflicts. `scatter_add` needs only one parallel region, and the outer dimensions are larger, which makes them better suited to parallelization.
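The two loop orders described above can be sketched on 2-D nested lists. This is a pure-Python illustration of the iteration order only, not the ATen kernels; both helper names are hypothetical, and the comments mark which loop would be parallelized in each strategy.

```python
def index_add_last_dim(result, index, source):
    """Serial over index; each step updates one column across all rows."""
    out = [row[:] for row in result]
    for j, target in enumerate(index):      # serial on index
        for r in range(len(out)):           # the part that would run in parallel
            out[r][target] += source[r][j]
    return out


def scatter_add_last_dim(result, index, source):
    """Outer loop over rows (parallelizable once); inner loop over index is serial."""
    out = [row[:] for row in result]
    for r in range(len(out)):               # rows are independent of each other
        for j, target in enumerate(index):  # serial, so repeated indices cannot conflict
            out[r][target] += source[r][j]
    return out


result = [[0.0] * 4 for _ in range(3)]
index = [1, 3, 1]  # index 1 appears twice, so writes to column 1 accumulate
source = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

a = index_add_last_dim(result, index, source)
b = scatter_add_last_dim(result, index, source)
print(a == b)
```

Both orders accumulate the same values; they differ only in which loop is outermost, and therefore in how many parallel regions are needed and how much work each one has.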
### Testing
- Single core:
Before:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.82E-03 | 2.11E-03
[10, 128, 50, 50] | 0.023604 | 0.023794
After:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 9.30E-04 | 1.66E-03
[10, 128, 50, 50] | 0.005995 | 0.010003
- Single socket (28 cores):
Before:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.96E-03 | 2.52E-03
[10, 128, 50, 50] | 0.012208 | 0.012568
After:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 7.44E-05 | 1.33E-04
[10, 128, 50, 50] | 0.000333 | 0.000469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88729
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
# Motivation
We need to add XPU backend support to `torch.save` and `torch.load` when the parameter `_use_new_zipfile_serialization=False`.
# Solution
Our design wraps the data as a tensor:
>1. use an in-place copy for H2D
>2. directly call `tensor.to()` for D2H.
This helps us:
>1. unify the generic code for all backends.
>2. support all the non-CPU device backends.
# Additional Context
No additional unit tests are needed.
test/test_serialization.py will cover this code change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89679
Approved by: https://github.com/ezyang
Avoids
```
$ python foo.py
Traceback (most recent call last):
  File "foo.py", line 3, in <module>
    a = torch.cuda.Stream()
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
```
And now gets
```
$ python foo.py
Traceback (most recent call last):
  File "foo.py", line 3, in <module>
    a = torch.cuda.Stream()
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/_utils.py", line 44, in err_fn
    raise RuntimeError(
RuntimeError: Tried to instantiate dummy base class Stream
```
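The difference between the two tracebacks can be reproduced with a small standard-library sketch. It assumes the dummy base class defines its own `__new__` that raises; `make_dummy_type` and `DummyStreamBase` are hypothetical names used only for illustration, not the helpers in `torch`.

```python
def make_dummy_type(name):
    """Build a placeholder class that refuses to be instantiated."""
    def err_fn(cls, *args, **kwargs):
        raise RuntimeError(f"Tried to instantiate dummy base class {name}")

    # Installing err_fn as __new__ means keyword arguments never reach
    # object.__new__, which would reject them with a confusing TypeError.
    return type(name, (object,), {"__new__": err_fn})


DummyStreamBase = make_dummy_type("Stream")


class Stream(DummyStreamBase):
    def __new__(cls, priority=0, **kwargs):
        return super().__new__(cls, priority=priority, **kwargs)


try:
    Stream()
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

Because the placeholder accepts arbitrary arguments before raising, the user sees the descriptive `RuntimeError` rather than a `TypeError` from `object.__new__()`.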
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89592
Approved by: https://github.com/soumith
See strategy at PythonOpRegistrationTrampoline.cpp for the
big picture.
Along the way, I made OperatorHandle support == and hashing,
and slightly changed the low level python_dispatch impl API
to disallow empty strings for dispatch key, which had the knock-on
effect of requiring us to explicitly make sure we pass in
CompositeImplicitAutograd if we would have passed in "" (I didn't apply
this to the rest of the file because I'm lazy.)
Test strategy is we delete the logic for preventing Python op
registrations in torch from being skipped in a torchdeploy context
and show CI still works.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87162
Approved by: https://github.com/anjali411, https://github.com/bdhirsh