Commit Graph

1730 Commits

Author SHA1 Message Date
Heitor Schueroff
f32f85e6da Implemented torch.corrcoef (#60420)
Summary:
Implements `torch.corrcoef` similar to [`np.corrcoef`](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html) using `torch.cov` implemented in https://github.com/pytorch/pytorch/pull/58311.

closes https://github.com/pytorch/pytorch/issues/1254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60420

Reviewed By: mruberry

Differential Revision: D29474687

Pulled By: heitorschueroff

fbshipit-source-id: f3c7c5610363aebd88274a51fc77e3cf879cb611
2021-06-30 12:36:02 -07:00
Victor Bittorf
91c076eadc Add TorchVitals for DataLoader (#60959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60959

Add TorchVitals for Dataloader, this indicates that the data loader was enabled.

This is a no-op if TORCH_VITALS environment variable is not set.

Test Plan: buck test mode/dbg caffe2/test:torch -- --regex vitals

Reviewed By: VitalyFedyunin

Differential Revision: D29445146

fbshipit-source-id: d5778fff3dafb3c0463fec7a498bff4905597518
2021-06-29 14:08:32 -07:00
Heitor Schueroff
ec9c03c234 Implemented torch.cov (#58311)
Summary:
Based from https://github.com/pytorch/pytorch/pull/50466

Adds the initial implementation of `torch.cov` similar to `numpy.cov`. For simplicity, we removed support for many parameters in `numpy.cov` that are either redundant such as `bias`, or have simple workarounds such as `y` and `rowvar`.

cc PandaBoi

closes https://github.com/pytorch/pytorch/issues/19037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58311

Reviewed By: jbschlosser

Differential Revision: D29431651

Pulled By: heitorschueroff

fbshipit-source-id: 167dea880f534934b145ba94291a9d634c25b01b
2021-06-29 14:02:39 -07:00
kshitij12345
956faea585 [fix] cauchy sampling inf on cuda (#60186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59144

As pointed by ngimel, the issue is indeed with calling `tan`.

However the C++ `std::tan` [documenation](https://en.cppreference.com/w/cpp/numeric/math/tan) states that

```
The function has mathematical poles at π(1/2 + n); however no common floating-point representation
is able to represent π/2 exactly, thus there is no value of the argument for which a pole error occurs.
```

All `torch.tan`,`numpy.tan` and `math.tan` are compliant with the above statement.

<details>

```python
import torch
import math
import numpy as np

# Single Precision
print(torch.tan(torch.tensor(math.pi, device='cuda', dtype=torch.float32) * 0.5))
print(np.tan(np.array(np.pi, dtype=np.float32) * 0.5))

# Double Precision
print(math.tan(math.pi * 0.5))
print(torch.tan(torch.tensor(math.pi, device='cuda', dtype=torch.double) * 0.5))
print(np.tan(np.array(np.pi, dtype=np.float64) * 0.5))
```

Output
```
tensor(-22877334., device='cuda:0')
-22877332.42885646
1.633123935319537e+16
tensor(1.6331e+16, device='cuda:0', dtype=torch.float64)
1.633123935319537e+16
```

</details>

So this issue stems from the use of `__tanf` faster approximation of tan from CUDA library (for float16, bfloat16 and float).

8a839c5478/aten/src/ATen/NumericUtils.h (L91-L100)

The fix in the PR is to use the **slower** but more correct version.

Benchmark::
```
[ cauchy : input dtype torch.float16 device cuda ]
                             |  Before  |  After
1 threads: -------------------------------------
      (128,)                 |    3.8   |    4.3
      (256, 128)             |    3.8   |    4.2
      (2, 512, 256)          |    3.8   |    4.2
      (2, 64, 256, 128)      |   22.8   |   29.6
      (4, 2, 512, 256, 128)  |  649.6   |  869.3

Times are in microseconds (us).

[ cauchy : input dtype torch.bfloat16 device cuda ]
                             |  Before  |  After
1 threads: -------------------------------------
      (128,)                 |    3.8   |    4.3
      (256, 128)             |    3.8   |    4.3
      (2, 512, 256)          |    3.8   |    4.3
      (2, 64, 256, 128)      |   23.8   |   30.8
      (4, 2, 512, 256, 128)  |  682.5   |  904.2

Times are in microseconds (us).

[ cauchy : input dtype torch.float32 device cuda ]
                             |  Before  |  After
1 threads: --------------------------------------
      (128,)                 |     3.8  |     4.2
      (256, 128)             |     3.7  |     4.2
      (2, 512, 256)          |     3.7  |     4.2
      (2, 64, 256, 128)      |    35.3  |    37.1
      (4, 2, 512, 256, 128)  |  1020.0  |  1058.3

Times are in microseconds (us).

[- cauchy : input dtype torch.float64 device cuda ]
                             |   Before  |   After
1 threads: ----------------------------------------
      (128,)                 |      3.8  |      4.2
      (256, 128)             |      8.0  |      8.0
      (2, 512, 256)          |     46.0  |     46.0
      (2, 64, 256, 128)      |    669.2  |    669.4
      (4, 2, 512, 256, 128)  |  21255.0  |  21262.1

Times are in microseconds (us).
```

<details>

Benchmark Script:
```python
import torch
import itertools
import time
from torch.utils.benchmark import Timer
from torch.utils.benchmark import Compare
import sys
import pickle

print('Using pytorch %s' % (torch.__version__))

cuda_shapes = [(128,), (256, 128), (2, 512, 256), (2, 64, 256, 128), (4, 2, 512, 256, 128)]
cuda_dtypes = [torch.half, torch.bfloat16, torch.float, torch.double]
results = []
repeats = 10

for device in ['cuda']:
    dtypes = cuda_dtypes
    shapes = cuda_shapes

    for dtype in dtypes:
        for shape in shapes:
            t = torch.randn(shape, device=device, dtype=dtype) * 10

            tasks = [("t.cauchy_()", "After", "")]
            timers = [Timer(stmt=stmt, label=f"cauchy : input dtype {dtype} device {device}", sub_label=f"{(shape)}", description=desc, globals=globals()) for stmt, desc, label in tasks]

            for i, timer in enumerate(timers * repeats):
                results.append(
                    timer.blocked_autorange()
                )
                print(f"\r{i + 1} / {len(timers) * repeats}", end="")
                sys.stdout.flush()

with open('after-pr.pkl', 'wb') as f:
    pickle.dump(results, f)

comparison = Compare(results)
comparison.print()
```

Compare Script:
```
import torch
import itertools
import time
from torch.utils.benchmark import Timer
from torch.utils.benchmark import Compare
import sys
import pickle

with open('before-pr.pkl', 'rb') as f:
    after_results = pickle.load(f)

with open('after-pr.pkl', 'rb') as f:
    before_results = pickle.load(f)

comparison = Compare(after_results + before_results)
comparison.print()
```

</details>

TODO:
* [x] Add comment

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60186

Reviewed By: jbschlosser

Differential Revision: D29433897

Pulled By: ngimel

fbshipit-source-id: 9c5f14b83e3372bed72369f70eed9256c04385c6
2021-06-28 12:49:30 -07:00
Victor Bittorf
8b6487c650 Add CUDA Vital (#58059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58059

Add CUDA.used vital sign which is true only if CUDA was "used" which technically means the context was created.

Also adds the following features:
- Force vitals to be written even if vitals are disabled, to enable testing when the env variable is not set from the start of execution
- Add a read_vitals call for python to read existing vital signs.

Test Plan: buck test mode/dbg caffe2/test:torch -- --regex basic_vitals

Reviewed By: xuzhao9

Differential Revision: D28357615

fbshipit-source-id: 681bf9ef63cb1458df9f1c241d301a3ddf1e5252
2021-06-25 16:31:11 -07:00
Masaki Kozuki
a404cc9a7b CUDA addcmul and addcdiv do math in float for 16 bits I/O (#60715)
Summary:
Currently foreach `addcmul` and `addcdiv` cast scalar to float so that actual math is done in FP32 when tensor dtype is Float16/BFloat16 while regular `addcmul` and `addcdiv`, not.

### Reproducible steps to see the behavioral difference
```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'

In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)

In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)

In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```

### How foreach casts?
Foreach addcmul and addcdiv cast scalar to `opmath_t` (almost equivalent to acc_type) here: 42c8439b6e/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu (L30) and cast inputs and results here:
42c8439b6e/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L133-L135)

Related to https://github.com/pytorch/pytorch/issues/58833 #60227 https://github.com/pytorch/pytorch/issues/60454
cc ptrblck mcarilli ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60715

Reviewed By: albanD

Differential Revision: D29385715

Pulled By: ngimel

fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603
2021-06-25 10:21:35 -07:00
Ilqar Ramazanli
90cd57ee16 To add edge_order=2 and documentation for gradient operator (#58165)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56036
Fixes https://github.com/pytorch/pytorch/issues/56130

* All the interior points are computed using second order accurate central differences method for gradient operator. However, currently we only have first order method computation for edge points. In this PR we are adding second order methods for edge points as well.

* Currently, there is no detailed description of how gradient operator computed using second order method, and how to use parameters correctly. We add detailed explanation of meaning of each parameter, and return of the gradient operator, meanwhile giving description of the second-order computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58165

Reviewed By: mruberry

Differential Revision: D29305321

Pulled By: iramazanli

fbshipit-source-id: 0e0e418eed801c8510b8babe2ad3d064479fb4d6
2021-06-23 03:35:15 -07:00
Philip Meier
0c916c8a4e up the priority of numpy array comparisons in self.assertEqual (#59067)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58988.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59067

Reviewed By: jbschlosser

Differential Revision: D28986642

Pulled By: heitorschueroff

fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0
2021-06-22 13:07:07 -07:00
praneeth
9b30fb8528 add support for constant (#60166)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58739 Add support for constants according to python array API stipulation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60166

Reviewed By: anjali411

Differential Revision: D29253958

Pulled By: mruberry

fbshipit-source-id: 0bc86b74d3a4eb3ec4a65c941ec2710747402db1
2021-06-21 20:47:21 -07:00
Thomas J. Fan
c16f87949f ENH Adds nn.ReflectionPad3d (#59791)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/27655

This PR adds a C++ and Python version of ReflectionPad3d with structured kernels. The implementation uses lambdas extensively to better share code from the backward and forward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59791

Reviewed By: gchanan

Differential Revision: D29242015

Pulled By: jbschlosser

fbshipit-source-id: 18e692d3b49b74082be09f373fc95fb7891e1b56
2021-06-21 10:53:14 -07:00
Peter Bell
e8e3394ea8 Recognize transposed dense tensors as a form of partial overlap (#59014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59014

Fixes #48401

`assert_no_overlap` currently has a false-negative where it recognizes
the transpose of a contiguous tensor as fully overlapping. This happens because
the memory regions do fully overlap, but of course the strides are different so
the actual elements don't all overlap.

This goes slightly in the other direction, by requiring strides to exactly
match we get false-positives for some unusual situations, e.g.
```
torch.add(a, a, out=a.view([1, *a.shape]))
```
Or replacing strides of length-1 dimensions, etc. However, I think these are
sufficiently obscure that it's okay to error and the common cases like
inplace operations still work as before.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D29040928

Pulled By: ngimel

fbshipit-source-id: 5a636c67536a3809c83f0d3117d2fdf49c0a45e6
2021-06-18 16:29:25 -07:00
Mike Ruberry
92513038e8 Revert D28994140: [pytorch][PR] Implemented torch.cov
Test Plan: revert-hammer

Differential Revision:
D28994140 (23c232554b)

Original commit changeset: 1890166c0a9c

fbshipit-source-id: 73dfe1b00464e38f004f99960cdeeb604ed4b20a
2021-06-13 02:33:37 -07:00
Heitor Schueroff
23c232554b Implemented torch.cov (#58311)
Summary:
Based from https://github.com/pytorch/pytorch/pull/50466

Adds the initial implementation of `torch.cov` similar to `numpy.cov`. For simplicity, we removed support for many parameters in `numpy.cov` that are either redundant such as `bias`, or have simple workarounds such as `y` and `rowvar`.

cc PandaBoi

TODO

- [x] Improve documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58311

Reviewed By: mruberry

Differential Revision: D28994140

Pulled By: heitorschueroff

fbshipit-source-id: 1890166c0a9c01e0a536acd91571cd704d632f44
2021-06-11 09:40:50 -07:00
Kimish Patel
4f79270b89 [PyTorch ] Thread parallel bmm across batch dim (#59596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59596

Parallelize batch matmul across batch dim. This was found to improve perf for
some usecases on mobile.
ghstack-source-id: 130989569

Test Plan: CI unit tests

Reviewed By: albanD

Differential Revision: D26833417

fbshipit-source-id: 9b84d89d29883a6c9d992d993844dd31a25f76b1
2021-06-10 08:25:40 -07:00
Yukio Siraichi
84061dadad Add reduce variants for scatter operation. (#57015)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56463 #56464

- Add reduce variants for `scatter` in both _native_functions.yaml_ and _TensorAdvancedIndexing.cpp_
- Add `OpInfo` tests and reduce tests in _test_torch.py_
- Fix default reduce argument for `scatter_` in __tensor_docs.py_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57015

Reviewed By: mrshenli

Differential Revision: D28162657

Pulled By: ezyang

fbshipit-source-id: 4d37ed1569ce8560aca1085c9cf5349f11427c4f
2021-06-08 13:37:26 -07:00
Mike Ruberry
de40c8e495 Adds remaining OpInfos and removes redundant test generators (#55558)
Summary:
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55558

Reviewed By: ngimel

Differential Revision: D28922522

Pulled By: mruberry

fbshipit-source-id: 89cefd93788bc8aa0683f4583cf5caa81aa2dc93
2021-06-06 14:52:26 -07:00
Natalia Gimelshein
344ecb2e71 flip via TI (#59509)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/58747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59509

Reviewed By: mruberry

Differential Revision: D28918665

Pulled By: ngimel

fbshipit-source-id: b045c7b35eaf22e53b1bc359ffbe5a4fda05dcda
2021-06-05 15:43:29 -07:00
Natalia Gimelshein
5117ac3bb4 Revert D28877076: [pytorch][PR] torch.flip via TI
Test Plan: revert-hammer

Differential Revision:
D28877076 (d82bc3feb8)

Original commit changeset: 4fa6eb519085

fbshipit-source-id: c81e7d3283ff6822db913bf9f49a1533268755d0
2021-06-04 23:03:53 -07:00
lezcano
d82bc3feb8 torch.flip via TI (#58747)
Summary:
Implements an idea by ngimel to improve the performance of `torch.flip` via a clever hack into TI to bypass the fact that TI is not designed to work with negative indices.

Something that might be added is vectorisation support on CPU, given how simple the implementation is now.

Some low-hanging fruits that I did not implement:
- Write it as a structured kernel
- Migrate the tests to opinfos
- Have a look at `cumsum_backward` and `cumprod_backward`,  as I think that they could be implemented faster with `flip`, now that `flip` is fast.

**Edit**
This operation already has OpInfos and it cannot be migrated to a structured kernel because it implements quantisation

Summary of the PR:
- x1.5-3 performance boost on CPU
- x1.5-2 performance boost on CUDA
- Comparable performance across dimensions, regardless of the strides (thanks TI)
- Simpler code

<details>
<summary>
Test Script
</summary>

```python
from itertools import product

import torch
from torch.utils.benchmark import Compare, Timer

def get_timer(size, dims, num_threads, device):
    x = torch.rand(*size, device=device)

    timer = Timer(
        "torch.flip(x, dims=dims)",
        globals={"x": x, "dims": dims},
        label=f"Flip {device}",
        description=f"dims: {dims}",
        sub_label=f"size: {size}",
        num_threads=num_threads,
    )

    return timer.blocked_autorange(min_run_time=5)

def get_params():
    sizes = ((1000,)*2, (1000,)*3, (10000,)*2)
    for size, device in product(sizes, ("cpu", "cuda")):
        threads = (1, 2, 4) if device == "cpu" else (1,)
        list_dims = [(0,), (1,), (0, 1)]
        if len(size) == 3:
            list_dims.append((0, 2))
        for num_threads, dims in product(threads, list_dims):
            yield size, dims, num_threads, device

def compare():
    compare = Compare([get_timer(*params) for params in get_params()])
    compare.trim_significant_figures()
    compare.colorize()
    compare.print()

compare()
```
</details>

<details>
<summary>
Benchmark PR
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139954-81e46d80-ba3b-11eb-9aad-e825e515d41b.png)

</details>

<details>
<summary>
Benchmark master
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139915-76914200-ba3b-11eb-9aa8-84b3ca220c93.png)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58747

Reviewed By: agolynski

Differential Revision: D28877076

Pulled By: ngimel

fbshipit-source-id: 4fa6eb519085950176cb3a9161eeb3b6289ec575
2021-06-04 20:13:38 -07:00
Elton Leander Pinto
2119efd234 reflection_pad1d_backward: Port to structured (#59103)
Summary:
Tracking Issue: https://github.com/pytorch/pytorch/issues/55070
Port `reflection_pad1d_backward` to structured kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59103

Test Plan: Pre-existing tests

Reviewed By: jbschlosser

Differential Revision: D28836043

Pulled By: ezyang

fbshipit-source-id: 4c3b0880edf305896f540113dcab70c8af24253b
2021-06-04 10:23:53 -07:00
Edward Yang
f05d5bec48 Preserve PyObject even when it goes dead (#56017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56017

Fixes #55686

This patch is seemingly straightforward but some of the changes are very
subtle.  For the general algorithmic approach, please first read the
quoted issue.  Based on the algorithm, there are some fairly
straightforward changes:

- New boolean on TensorImpl tracking if we own the pyobj or not
- PythonHooks virtual interface for requesting deallocation of pyobj
  when TensorImpl is being released and we own its pyobj, and
  implementation of the hooks in python_tensor.cpp
- Modification of THPVariable to MaybeOwned its C++ tensor, directly
  using swolchok's nice new class

And then, there is python_variable.cpp.  Some of the changes follow the
general algorithmic approach:

- THPVariable_NewWithVar is simply adjusted to handle MaybeOwned and
  initializes as owend (like before)
- THPVariable_Wrap adds the logic for reverting ownership back to
  PyObject when we take out an owning reference to the Python object
- THPVariable_dealloc attempts to resurrect the Python object if
  the C++ tensor is live, and otherwise does the same old implementation
  as before
- THPVariable_tryResurrect implements the resurrection logic.  It is
  modeled after CPython code so read the cited logic and see if
  it is faithfully replicated
- THPVariable_clear is slightly updated for MaybeOwned and also to
  preserve the invariant that if owns_pyobj, then pyobj_ is not null.
  This change is slightly dodgy: the previous implementation has a
  comment mentioning that the pyobj nulling is required to ensure we
  don't try to reuse the dead pyobj.  I don't think, in this new world,
  this is possible, because the invariant says that the pyobj only
  dies if the C++ object is dead too.  But I still unset the field
  for safety.

And then... there is THPVariableMetaType.  colesbury explained in the
issue why this is necessary: when destructing an object in Python, you
start off by running the tp_dealloc of the subclass before moving up
to the parent class (much in the same way C++ destructors work).  The
deallocation process for a vanilla Python-defined class does irreparable
harm to the PyObject instance (e.g., the finalizers get run) making it
no longer valid attempt to resurrect later in the tp_dealloc chain.
(BTW, the fact that objects can resurrect but in an invalid state is
one of the reasons why it's so frickin' hard to write correct __del__
implementations).  So we need to make sure that we actually override
the tp_dealloc of the bottom most *subclass* of Tensor to make sure
we attempt a resurrection before we start finalizing.  To do this,
we need to define a metaclass for Tensor that can override tp_dealloc
whenever we create a new subclass of Tensor.  By the way, it was totally
not documented how to create metaclasses in the C++ API, and it took
a good bit of trial error to figure it out (and the answer is now
immortalized in https://stackoverflow.com/q/67077317/23845 -- the things
that I got wrong in earlier versions of the PR included setting
tp_basicsize incorrectly, incorrectly setting Py_TPFLAGS_HAVE_GC on
the metaclass--you want to leave it unset so that it inherits, and
determining that tp_init is what actually gets called when you construct
a class, not tp_call as another not-to-be-named StackOverflow question
suggests).

Aside: Ordinarily, adding a metaclass to a class is a user visible
change, as it means that it is no longer valid to mixin another class
with a different metaclass.  However, because _C._TensorBase is a C
extension object, it will typically conflict with most other
metaclasses, so this is not BC breaking.

The desired new behavior of a subclass tp_dealloc is to first test if
we should resurrect, and otherwise do the same old behavior.  In an
initial implementation of this patch, I implemented this by saving the
original tp_dealloc (which references subtype_dealloc, the "standard"
dealloc for all Python defined classes) and invoking it.  However, this
results in an infinite loop, as it attempts to call the dealloc function
of the base type, but incorrectly chooses subclass type (because it is
not a subtype_dealloc, as we have overridden it; see
b38601d496/Objects/typeobject.c (L1261) )
So, with great reluctance, I must duplicate the behavior of
subtype_dealloc in our implementation.  Note that this is not entirely
unheard of in Python binding code; for example, Cython
c25c3ccc4b/Cython/Compiler/ModuleNode.py (L1560)
also does similar things.  This logic makes up the bulk of
THPVariable_subclass_dealloc

To review this, you should pull up the CPython copy of subtype_dealloc
b38601d496/Objects/typeobject.c (L1230)
and verify that I have specialized the implementation for our case
appropriately.  Among the simplifications I made:

- I assume PyType_IS_GC, because I assume that Tensor subclasses are
  only ever done in Python and those classes are always subject to GC.
  (BTW, yes!  This means I have broken anyone who has extend PyTorch
  tensor from C API directly.  I'm going to guess no one has actually
  done this.)

- I don't bother walking up the type bases to find the parent dealloc;
  I know it is always THPVariable_dealloc.  Similarly, I can get rid
  of some parent type tests based on knowledge of how
  THPVariable_dealloc is defined

- The CPython version calls some private APIs which I can't call, so
  I use the public PyObject_GC_UnTrack APIs.

- I don't allow the finalizer of a Tensor to change its type (but
  more on this shortly)

One alternative I discussed with colesbury was instead of copy pasting
the subtype_dealloc, we could transmute the type of the object that was
dying to turn it into a different object whose tp_dealloc is
subtype_dealloc, so the stock subtype_dealloc would then be applicable.
We decided this would be kind of weird and didn't do it that way.

TODO:

- More code comments

- Figure out how not to increase the size of TensorImpl with the new
  bool field

- Add some torture tests for the THPVariable_subclass_dealloc, e.g.,
  involving subclasses of Tensors that do strange things with finalizers

- Benchmark the impact of taking the GIL to release C++ side tensors
  (e.g., from autograd)

- Benchmark the impact of adding a new metaclass to Tensor (probably
  will be done by separating out the metaclass change into its own
  change)

- Benchmark the impact of changing THPVariable to conditionally own
  Tensor (as opposed to unconditionally owning it, as before)

- Add tests that this actually indeed preserves the Python object

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27765125

Pulled By: ezyang

fbshipit-source-id: 857f14bdcca2900727412aff4c2e2d7f0af1415a
2021-06-03 10:50:36 -07:00
Thomas J. Fan
7f2e620105 FIX Validates that weights are 2d in embedding (#59314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59314

Reviewed By: H-Huang

Differential Revision: D28837753

Pulled By: jbschlosser

fbshipit-source-id: 683378244c61b0937c95563f91ef87ab09fd1653
2021-06-02 12:52:21 -07:00
Natalia Gimelshein
12418a4f86 Back out "Revert D28664514: [pytorch][PR] various TensorIterator speed improvements"
Summary: Original commit changeset: fcad039b7dc8

Test Plan: Existing tests

Reviewed By: mruberry

Differential Revision: D28720186

fbshipit-source-id: 14ac99ee2d7cafb86b20c979f8917beeefd616e1
2021-05-26 12:22:48 -07:00
Edward Yang
17fb651a3b Make torch.Tensor(torch.tensor(1.0)) work (#58885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58885

Fixes #58884

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28687510

Pulled By: ezyang

fbshipit-source-id: 81325f501cc3e83cbac02f7c44ded9d396356bb8
2021-05-26 11:33:05 -07:00
Natalia Gimelshein
8398ebaa86 Revert D28664514: [pytorch][PR] various TensorIterator speed improvements
Test Plan: revert-hammer

Differential Revision:
D28664514 (8a28bbeeb9)

Original commit changeset: 2e03cf90b37a

fbshipit-source-id: fcad039b7dc823fec8afa694ab74a7ac5011f8ab
2021-05-26 10:49:58 -07:00
Xiang Gao
c88333484f [resubmit] masked_scatter thrust->cub (#58865)
Summary:
See ae7760cf50bb2cddff4663a07b9d68decf4b6c75 for the fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58865

Reviewed By: mruberry

Differential Revision: D28657940

Pulled By: ngimel

fbshipit-source-id: 9155c710b0e18ebb3bfa2dabfdd117355ac30840
2021-05-25 11:00:50 -07:00
Natalia Gimelshein
8a28bbeeb9 various TensorIterator speed improvements (#58810)
Summary:
1) remove pushing back to strides vector for 1D tensors, those strides are never used in the loop anyway
2) avoid calling get_data_ptrs unless necessary
3) don't call into assert_no_partial_overlap if tensorImpls are the same (assert_no_partial_overlap has this comparison too, but after a couple of nested function calls)
4) is_non_overlapping_and_dense instead of is_contiguous in memory overlap (which, for some reason, is faster than is_contiguous, though I hoped after is_contiguous is non-virtualized, it should be the same).

Altogether, brings instruction count down from ~110K to 102735 for the following binary inplace benchmark:
```
In [2]:  timer = Timer("m1.add_(b);", setup="at::Tensor m1=torch::empty({1}); at::Tensor b = torch::empty({1});", language="c++", timer=timeit.default_timer)
   ...:  stats=timer.collect_callgrind(number=30, repeats=3)
   ...:  print(stats[1].as_standardized().stats(inclusive=False))
```
similar improvements for unary inplace.

Upd: returned stride packing for now, counts is now 104295, so packing is worth ~ 52 instructions, we should think about how to remove it  safely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58810

Reviewed By: bhosmer

Differential Revision: D28664514

Pulled By: ngimel

fbshipit-source-id: 2e03cf90b37a411d9994a7607402645f1d8f3c93
2021-05-25 10:44:51 -07:00
Serhat Yilmaz
b4f3a989da [torch][repeat_interleave] Fix ambigious function call (#58881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58881

recently added new parameter to the function with PR: https://github.com/pytorch/pytorch/pull/58417

However, this introduced ambiguity when making call below:
  some_tensor.repeat_interleave(some_integer_value)

Making it optional to avoid the issue.

Reviewed By: ezyang, ngimel

Differential Revision: D28653820

fbshipit-source-id: 5bc0b1f326f069ff505554b51e3b24d60e69c843
2021-05-25 00:31:32 -07:00
Yu Guo
74c12da451 add deterministic path for scatter_add_cuda for 1D tensors (#58761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58761

previously we implemented deterministic path for gather_backward in https://github.com/pytorch/pytorch/pull/55573, which replaced non-deterministic scatter_add_cuda.

It's better to move it inside scatter_add so scatter_add can benefit from the deterministic path.

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_scatter_add_one_dim_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (5.063)
    ✓ Pass: caffe2/test:torch_cuda - test_scatter_add_one_dim_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (30.909)
    ✓ Pass: caffe2/test:torch_cuda - main (30.909)
Summary
  Pass: 2
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_gather_backward

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (4.613)
    ✓ Pass: caffe2/test:torch_cuda - test_gather_backward_deterministic_path_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.369)

buck test mode/opt //caffe2/test:torch_cuda -- test_nondeterministic_alert

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (5.356)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_CTCLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_accumulate_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_scatter_add_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_NLLLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_median_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_gather_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_bincount_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_histc_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bicubic_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_MaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_EmbeddingBag_max_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_trilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_kthvalue_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_linear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - main (28.146)
Summary
  Pass: 30
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D28585659

fbshipit-source-id: 1ad003d4130501ceff5f6a7a870ca3dbc9a3f1f2
2021-05-23 21:36:02 -07:00
kshitij12345
ee3ea31f12 OpInfo: split, split_with_sizes (#58184)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58184

Reviewed By: ngimel

Differential Revision: D28627271

Pulled By: mruberry

fbshipit-source-id: e6c0d2b005904ddebc9dab76685403530a6f6519
2021-05-23 15:47:35 -07:00
Serhat Yilmaz
4ca4640bae [torch][repeat_interleave] remove stream syncronization if output size is given (#58417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58417

Same as title.

Test Plan:
Rely on CI signal.

Update unit test to exercise new code path as well.

Reviewed By: ngimel

Differential Revision: D28482927

fbshipit-source-id: 3ec8682810ed5c8547b1e8d3869924480ce63dcd
2021-05-22 20:53:28 -07:00
Natalia Gimelshein
9e261de630 Revert D28547564: [pytorch][PR] masked_scatter thrust->cub
Test Plan: revert-hammer

Differential Revision:
D28547564 (5152cf8647)

Original commit changeset: 83aeddfaf702

fbshipit-source-id: d5259afb584e0f6c0a11de4d4cb3d56a2a562eb7
2021-05-21 09:18:34 -07:00
Xiang Gao
5152cf8647 masked_scatter thrust->cub (#56750)
Summary:
Benchmark:

```python
import torch
import itertools

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

run50_sync(lambda: torch.randperm(1000000, device='cuda'))

def benchmark(M):
    a = torch.randn(M, device='cuda')
    m = torch.randint(1, (M,), dtype=torch.long, device='cuda').bool()
    v = torch.randn(M, device='cuda')

    torch.cuda.synchronize()

    %timeit run50_sync(lambda:a.masked_scatter_(m, v))

for M in (100, 1000, 100000, 10000000):
    print(M)
    benchmark(M)
```

Before:
```
100
8.65 ms ± 80.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1000
8.75 ms ± 72.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000
9.27 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000
33.6 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

After
```
100
8.04 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1000
8.09 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000
8.63 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000
31.9 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56750

Reviewed By: ailzhang

Differential Revision: D28547564

Pulled By: ngimel

fbshipit-source-id: 83aeddfaf7023f9f9501c6b1e2faf91e8b6277b1
2021-05-20 10:27:58 -07:00
lezcano
452569dffb cfloat and cdouble functions (#58137)
Summary:
This adds the methods `Tensor.cfloat()` and `Tensor.cdouble()`.

I was not able to find the tests for `.float()` functions. I'd be happy to add similar tests for these functions  once someone points me to them.

Fixes https://github.com/pytorch/pytorch/issues/56014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58137

Reviewed By: ejguan

Differential Revision: D28412288

Pulled By: anjali411

fbshipit-source-id: ff3653cb3516bcb3d26a97b9ec3d314f1f42f83d
2021-05-13 21:13:37 -07:00
kshitij12345
6b1eeef601 OpInfo: squeeze (#58080)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58080

Reviewed By: agolynski

Differential Revision: D28379485

Pulled By: mruberry

fbshipit-source-id: 2b288036f595a5bd6b948a072494ee87f82322ce
2021-05-12 21:29:31 -07:00
Yu Guo
8a45006765 enable deterministic path for index_copy_cuda with index_put (#58144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58144

reland D28291041 (14badd9929), which was reverted due to a type error from Tuple[torch.Tensor], seems that mypy requires Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_copy_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (9.229)
    ✓ Pass: caffe2/test:torch_cuda - test_index_copy_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.750)
    ✓ Pass: caffe2/test:torch_cuda - main (25.750)

Reviewed By: ngimel

Differential Revision: D28383178

fbshipit-source-id: 38896fd6ddd670cfcce36e079aee7ad52adc2a28
2021-05-12 16:26:50 -07:00
kshitij12345
d09abf004c OpInfo: narrow (#58082)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58082

Reviewed By: agolynski

Differential Revision: D28379371

Pulled By: mruberry

fbshipit-source-id: 484e560b1e6ceba234e497585ed308a27cd8b7a0
2021-05-12 15:39:15 -07:00
Mike Ruberry
c911c30520 Revert D28291041: enable deterministic path for index_copy_cuda with index_put
Test Plan: revert-hammer

Differential Revision:
D28291041 (14badd9929)

Original commit changeset: 7f0cf3ec7280

fbshipit-source-id: 6117bc6e5b2044ce70d4e4a19bccd8c183ea3702
2021-05-12 03:33:57 -07:00
Kurt Mohler
c7fb0a0e82 Remove beta warning for use_deterministic_algorithms (#58074)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58074

Reviewed By: ngimel

Differential Revision: D28373676

Pulled By: mruberry

fbshipit-source-id: cae9a92ebbf6ac5f8d3008aa6a6a9cd5c1041c9f
2021-05-12 03:30:12 -07:00
Yu Guo
14badd9929 enable deterministic path for index_copy_cuda with index_put (#57870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57870

this is similar to index_add_cuda with index_put accumulate = True

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_copy_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (9.229)
    ✓ Pass: caffe2/test:torch_cuda - test_index_copy_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.750)
    ✓ Pass: caffe2/test:torch_cuda - main (25.750)

Reviewed By: ngimel

Differential Revision: D28291041

fbshipit-source-id: 7f0cf3ec72805f3617fd1de9ff03e1d49114fed8
2021-05-12 00:32:35 -07:00
Yu Guo
a07a0190f9 enable deterministic path for index_put with accumulate=False on CPU and CUDA (#57839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57839

we reuse the `index_put_accum_kernel`, rename it to  `index_put_deterministic_kernel` and add a bool `accumulate` in `index_backward_kernel`

Test Plan:
buck test mode/opt //caffe2/test:torch -- test_index_put_non_accumulate_deterministic

    ✓ Pass: caffe2/test:torch - test_index_put_non_accumulate_deterministic_cpu (test_torch.TestTorchDeviceTypeCPU) (5.120)
Summary
  Pass: 1
  Skip: 1
    ↻ caffe2/test:torch - test_index_put_non_accumulate_deterministic_meta (test_torch.TestTorchDeviceTypeMETA)
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_index_put_non_accumulate_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (6.397)
    ✓ Pass: caffe2/test:torch_cuda - test_index_put_non_accumulate_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (26.030)
    ✓ Pass: caffe2/test:torch_cuda - main (26.030)
Summary
  Pass: 2
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D28290699

fbshipit-source-id: df8bbe7af2e72017566161b05b85737fda4ceb3f
2021-05-12 00:31:19 -07:00
Ilqar Ramazanli
8b816e9010 To implement gradient for Pytorch (#54617)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54617

Reviewed By: anjali411

Differential Revision: D28057452

Pulled By: iramazanli

fbshipit-source-id: 9bd86679282d34f5e5393e6447121586517eb4f0
2021-05-11 18:52:20 -07:00
kshitij12345
502eb664ae OpInfo: chunk (#57935)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57935

Reviewed By: ngimel

Differential Revision: D28346217

Pulled By: mruberry

fbshipit-source-id: 331995aa18fd2983fc2122a9af31fba43ab9839c
2021-05-11 10:16:10 -07:00
Edward Yang
da8cc355a3 Relax tp_new so that it is OK to call (#57544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57544

Instead of removing tp_new from the superclass (which causes
super().__new__ to not work), I now still install tp_new on the
superclass, but verify that you are not trying to directly
construct _TensorBase.

Fixes https://github.com/pytorch/pytorch/issues/57421

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28189475

Pulled By: ezyang

fbshipit-source-id: 9397a3842a77f5428d182dd62244b42425bca827
2021-05-05 09:04:39 -07:00
Peter Bell
33eea146ee torch.clamp with tensor min and max (#52695)
Summary:
Fixes gh-2793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52695

Reviewed By: mruberry

Differential Revision: D27395977

Pulled By: ezyang

fbshipit-source-id: f86aa240feb034d42e4c45447e72218f6a773c24
2021-05-03 12:56:16 -07:00
kshitij12345
154eca0309 OpInfo: ravel, view, view_as (#56910)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56910

Reviewed By: ngimel

Differential Revision: D28141867

Pulled By: mruberry

fbshipit-source-id: bff49d40d7e3bb36bc83d1405bd77f5529eeffe9
2021-05-02 22:10:36 -07:00
Ivan Yashchuk
eaf00bf7d4 Skip linalg.qr saved mode check if compiled without LAPACK (#56284)
Summary:
This PR also removes qr and eig tests from test/test_torch.py. They were not skipped if compiled without LAPACK and they are now replaced with OpInfos.

Fixes https://github.com/pytorch/pytorch/issues/55929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56284

Reviewed By: ejguan

Differential Revision: D27827077

Pulled By: mruberry

fbshipit-source-id: 1dceb955810a9fa34bb6baaccbaf0c8229444d3a
2021-05-02 16:07:07 -07:00
kshitij12345
41099ef71c OpInfo: mvlgamma (#56907)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56907

Reviewed By: astaff

Differential Revision: D28118669

Pulled By: mruberry

fbshipit-source-id: f54ad6dc64ddb6bcfca5c5c7fd8f395cd9761128
2021-05-01 20:51:01 -07:00
Wenlei Xie
20085f6d23 Support auto generation of device check (#56872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56872

ghstack-source-id: 127914018

Test Plan: auto test

Reviewed By: ezyang

Differential Revision: D27986429

fbshipit-source-id: 0da8413b0b8e6810fcea27ed1de499f11f68bd1f
2021-05-01 12:02:09 -07:00
Emilio Castillo
0a9c9cc674 Update DLPack to 0.4 (#55365)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55090

I included the header directly, but I am not sure if we should add this as a git submodule, what do you guys think?
Also regarding the implementation, in ATen lanes seems not to be supported, but from CuPy complex types are exported with 2 lanes, I am not sure wether this is correct or not. However, in PyTorch this seems to be working properly, so I forgive 2 lanes for complex datatypes.

TODO: add tests for complex and bfloat

Easy test script against cupy

```python
import cupy
import torch

from torch.utils.dlpack import to_dlpack
from torch.utils.dlpack import from_dlpack

# Create a PyTorch tensor.
tx1 = torch.tensor(
    [2 + 1j, 3 + 2j, 4 + 3j, 5 + 4j], dtype=torch.complex128
).cuda()

# Convert it into a DLPack tensor.
dx = to_dlpack(tx1)

# Convert it into a CuPy array.
cx = cupy.fromDlpack(dx)

# Convert it back to a PyTorch tensor.
tx2 = from_dlpack(cx.toDlpack())
torch.testing.assert_allclose(tx1, tx2)
```

Thanks to leofang who updated CuPy's dlpack version and his PR served me as the guide for this one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55365

Reviewed By: ngimel

Differential Revision: D27724923

Pulled By: mruberry

fbshipit-source-id: 481eadb882ff3dd31e7664e08e8908c60a960f66
2021-04-30 10:30:05 -07:00