Commit Graph

38059 Commits

Ilqar Ramazanli
f0e972a481 To add Nesterov Adam algorithm for multi-tensor optimizers API (#59165)
Summary:
Previously, in PR https://github.com/pytorch/pytorch/issues/59009, we added NAdam to the optimizers. Here in this PR we propose a multi-tensor version of NAdam for PyTorch.

NAdam was proposed by Timothy Dozat in the paper https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ and the report http://cs229.stanford.edu/proj2015/054_report.pdf.

It has become one of the most widely used algorithms in the deep learning community.

It is worth noting that the implementation of NAdam is inspired by the Keras implementation:
f9d3868495/tensorflow/python/keras/optimizer_v2/nadam.py
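
As a quick orientation, a minimal usage sketch (assuming the multi-tensor variant is exposed under `torch.optim._multi_tensor`, mirroring the single-tensor `torch.optim.NAdam` API):

```python
import torch
from torch import nn
from torch.optim import _multi_tensor  # assumed module path for the multi-tensor optimizers

model = nn.Linear(10, 2)
# Hypothetical usage; constructor arguments mirror the single-tensor NAdam.
optimizer = _multi_tensor.NAdam(model.parameters(), lr=2e-3)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()       # applies the NAdam update to all parameters at once
optimizer.zero_grad()
```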

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59165

Reviewed By: vincentqb

Differential Revision: D29360577

Pulled By: iramazanli

fbshipit-source-id: 0fe14016303b2df2cb8cc31912a2674acf63d1e5
2021-06-27 17:00:41 -07:00
Mikhail Zolotukhin
3bfe15085d [TensorExpr] Add a mechanism to register custom TS->NNC lowerings in TensorExprKernel. (#60804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60804

The lowerings are stored as a map c10::Symbol -> std::function, and the
signature of those functions matches the signature of
`computeOperandValue`. Custom lowerings take priority over the
standard ones, i.e. we can redefine how a given op is lowered.

In general this feature is aimed at unblocking users whose models
contain ops that are not yet supported by NNC - it allows them to quickly add
a custom lowering for a given op.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D29409580

Pulled By: ZolotukhinM

fbshipit-source-id: e8e8dc9d3cb9155cfbf5c08a4216ba1b5b791a60
2021-06-27 15:27:22 -07:00
Ilqar Ramazanli
5563f4bda0 To add Rectified Adam algorithm for multi-tensor optimizers API (#59161)
Summary:
Previously, in PR https://github.com/pytorch/pytorch/issues/58968, we added RAdam to the optimizers. Here in this PR we propose a multi-tensor version of RAdam for PyTorch.

RAdam was proposed in the paper https://arxiv.org/pdf/1908.03265.pdf by Liyuan Liu et al.

It has become one of the most widely used algorithms in the deep learning community.

Differing from the paper, we set the variance tractability cut-off to 5 instead of 4, as is common practice; a sketch of how this cut-off enters the update follows below.
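
A rough sketch of the rectification logic and where the cut-off appears (illustrative Python following the paper's formulas, not the actual PyTorch kernel):

```python
import math

def radam_rectification(step, beta2=0.999, cutoff=5.0):
    # rho_inf: maximum length of the approximated simple moving average (SMA)
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    # rho_t: SMA length at the current step
    rho_t = rho_inf - 2.0 * step * beta2**step / (1.0 - beta2**step)
    if rho_t > cutoff:  # variance is tractable: use the rectified adaptive step
        return math.sqrt(
            (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
    return None  # otherwise fall back to an un-adapted (momentum-only) step
```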

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59161

Reviewed By: vincentqb

Differential Revision: D29360576

Pulled By: iramazanli

fbshipit-source-id: 7ccdbf12b1ee7f12e66f7d7992123a70cc818b6b
2021-06-27 13:01:20 -07:00
Ansley Ussery
0fbc471d10 Support default values on NamedTuple fields (#54682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54682
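
A minimal sketch of what the feature enables in TorchScript (the class and field names here are illustrative):

```python
import torch
from typing import NamedTuple

class Config(NamedTuple):
    size: int
    scale: float = 1.0  # default value on a NamedTuple field

@torch.jit.script
def make_config(size: int) -> Config:
    # The scripted function can now omit fields that have defaults.
    return Config(size)

print(make_config(3))  # size=3, scale=1.0 (scale filled from the default)
```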

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27327241

Pulled By: ansley

fbshipit-source-id: 76546f1770d50ebc3435bba3b74540e3c6be8a1c
2021-06-26 15:18:21 -07:00
Rong Rong (AI Infra)
6b53792f18 fix cuda mem leak check not properly run on master_builds (#60742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60742

Improved the CI_MASTER flag check logic, since it can be unset, true, or false.

Test Plan:
search for `PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK` in logs below:

- Before adding ci/master:
  - build workflow (`PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=1`): https://circleci.com/api/v1.1/project/github/pytorch/pytorch/14394913/output/107/0?file=true&allocation-id=60d5fd2fa55ae50282aec997-0-build%2F10295B30
- After adding ci/master label:
  - build workflow (`PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=0`): https://circleci.com/api/v1.1/project/github/pytorch/pytorch/14398213/output/107/0?file=true&allocation-id=60d61cf8bb9d097afc7a11aa-0-build%2F400138F1
  - master build workflow (`PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=0`): https://circleci.com/api/v1.1/project/github/pytorch/pytorch/14398198/output/107/0?file=true&allocation-id=60d61ca3467438480c963290-0-build%2F2999C909

Reviewed By: ngimel

Differential Revision: D29405732

Pulled By: walterddr

fbshipit-source-id: 09dd653cbb47ca61b1f8872851bda6db8db671b9
2021-06-26 07:05:32 -07:00
Hao Lu
e3abccec8a [Static Runtime] Remove output type constraints (#60669)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60669

Test Plan: Added unit test to check for nested outputs.

Reviewed By: ajyu

Differential Revision: D29322025

fbshipit-source-id: a3c8d3c5f0bb7cf7fda4bc5f579adb8fa7bc3724
2021-06-26 02:36:27 -07:00
Takeshi Watanabe
dae25c2002 Fix missing spaces in error of constant_pad_nd (#60729)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60729

Reviewed By: ZolotukhinM

Differential Revision: D29404422

Pulled By: ngimel

fbshipit-source-id: c40458c7a6ae33f61c680bff8de778a80658c250
2021-06-25 19:20:03 -07:00
Richard Barnes
9a08e87d8b Modernize for-loops in aten (#59598)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59598

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28946826

fbshipit-source-id: 9f3f7e38833c2bc33d27243cef16ab0118c65f3a
2021-06-25 19:02:00 -07:00
Xiong Wei
7e3a694b23 supports non-leaf inputs for autograd.backward() function (#60521)
Summary:
Close https://github.com/pytorch/pytorch/issues/60268
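
A rough sketch of the call pattern this enables (hedged; see the linked issue for the exact semantics):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2        # non-leaf tensor
z = y.sum()

# Passing a non-leaf tensor via `inputs=` used to be rejected; with this change
# it is accepted, and the gradient w.r.t. y is accumulated into y.grad.
torch.autograd.backward(z, inputs=[y])
print(y.grad)    # tensor([1., 1., 1.])
```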

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60521

Reviewed By: ngimel

Differential Revision: D29393586

Pulled By: albanD

fbshipit-source-id: 2dd2de427ecfecca8d544237bacf690e0b7c918c
2021-06-25 18:57:26 -07:00
albanD
056a8e0d5c Remove un-used parameter in _trilinear backward (#60673)
Summary:
This argument is only important for speed and memory usage. So it is ok to ignore it during the backward.
As discussed, we might want to change this to speed up backward in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60673

Reviewed By: soulitzer

Differential Revision: D29370125

Pulled By: albanD

fbshipit-source-id: ad50b3ea530aeb194f5a51845523b517a50f2c71
2021-06-25 17:47:10 -07:00
Yi Wang
f262217101 [Model Averaging] Move step out of model averaging API (#60632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60632

Address the comment https://github.com/pytorch/pytorch/pull/60320#discussion_r654845062
ghstack-source-id: 132340278

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29355609

fbshipit-source-id: 50a6f13ed70b5a5b5b92ead2f3d7082c11277af5
2021-06-25 17:20:52 -07:00
Ivan Yashchuk
c5f0692b6e Sparse CSR: increase dtype test coverage (#60656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60656

This PR uses `torch.testing.get_all_dtypes()` for dtype parametrisation
of tests in `test_sparse_csr.py`. It adds the bool, half, bfloat16, and
complex dtypes that were previously excluded from the tests.
`torch.complex32` is omitted due to lack of coverage and lack of a
specialized `AT_DISPATCH...`. The process of adding more dtypes to the
tests revealed that `.to_dense()` doesn't work for all dtypes.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29408058

Pulled By: cpuhrsch

fbshipit-source-id: 319b6f51b9786d6957d508f51657657a6d00267a
2021-06-25 17:11:21 -07:00
mingfeima
dd045ab540 add channels last for AdaptiveMaxPool2d (#48920)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48920
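
A small usage sketch of the memory format this adds support for (illustrative):

```python
import torch
from torch import nn

pool = nn.AdaptiveMaxPool2d((7, 7))
# NCHW-shaped tensor stored in channels-last (NHWC) memory layout
x = torch.randn(8, 3, 32, 32).to(memory_format=torch.channels_last)
out = pool(x)
# With channels-last support the kernel can operate on (and return) the NHWC layout
print(out.is_contiguous(memory_format=torch.channels_last))
```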

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D25399467

Pulled By: VitalyFedyunin

fbshipit-source-id: d9d2cc728cc7a18a26983e96d3c3e81a23659e89
2021-06-25 16:36:20 -07:00
Will Constable
367aff91d8 Fix missing #pragma once in jit/method.h
Summary: it seems to be accidentally missing

Test Plan: run CI

Reviewed By: suo

Differential Revision: D29335990

fbshipit-source-id: 2790bc10d141f9484a0807ff7800024a02fd9cfa
2021-06-25 16:32:54 -07:00
Victor Bittorf
8b6487c650 Add CUDA Vital (#58059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58059

Add a CUDA.used vital sign, which is true only if CUDA was "used", which technically means the CUDA context was created.

Also adds the following features:
- Force vitals to be written even if vitals are disabled, to enable testing when the env variable is not set from the start of execution
- Add a read_vitals call for python to read existing vital signs.

Test Plan: buck test mode/dbg caffe2/test:torch -- --regex basic_vitals

Reviewed By: xuzhao9

Differential Revision: D28357615

fbshipit-source-id: 681bf9ef63cb1458df9f1c241d301a3ddf1e5252
2021-06-25 16:31:11 -07:00
Brian Hirsh
9134b0e42f add a boxed CPU fallback kernel (#58065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58065

This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback.

Current state: there are a couple of different design ideas that I want to point out, but the logic for the actual kernel is mostly done and passing tests.

### Design

To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes:
* Confirm whether or not we can remove all C++ logging info directly in the yaml.

**Current Design**

All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have a corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1).

There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels.

```
// xla_cpu_fallback.h
#include <ATen/native/CPUFallback.h>
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);
...
```
```
// xla_cpu_fallback.cpp
#include "torch_xla/csrc/aten_cpu_fallback.h"
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Do custom logging here
  ...
  // Call the actual boxed CPU fallback.
  at::native::cpu_fallback(op, stack);
}

TORCH_LIBRARY_IMPL(_, XLA, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>());
}
```

Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, for which we provide a utility function. E.g.:
```
#include <ATen/native/CPUFallback.h>

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha);
  }
  ...
}
```

That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands.

**Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?**
We could change the api to use `at::redispatch`, which would make it look something like this:
```
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha);
  }
  ...
}
```
Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though!

Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. Something like this:
```
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha);
  }
  ...
}
```

Writing that out, I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding it yet, but if people like it I'm happy to try it out.

**More alternatives**
The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides:
* Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback.
* Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later.

To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback.

### Performance impact

While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer.

I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind.

I also attached the full callgrind output for each benchmark; the full output is actually pretty noisy and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that the noise is due to some heavyweight async startup processing that xla does.

`at::add`:
before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001
after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273
delta: ~15.5% increase

`at::add_out`:
before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz
after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227
delta: ~14.5% increase

High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case.

For structured, functional ops that require a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. So the extra work that we end up doing is:
* An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
* An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
* An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel).
* unboxing->boxing->unboxing logic (this is the only strictly required piece)

There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](https://github.com/pytorch/pytorch/issues/55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later.

Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere).

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28833085

Pulled By: bdhirsh

fbshipit-source-id: 537ebd5d7fb5858f1158764ff47132d503c3b92b
2021-06-25 16:26:50 -07:00
Hongbo Zhang
ad69e2fd11 [torch] Module fix on the support of LazyModule on bug #60132 (#60517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60517

This fixes module support for LazyModuleMixin, as reported in bug #60132.
Check the link: https://github.com/pytorch/pytorch/issues/60132

We will have to update lazy_extension, given its dependency on module.py, and update the unit test as well.

Test Plan:
Unit test passes

torchrec test passes

Reviewed By: albanD

Differential Revision: D29274068

fbshipit-source-id: 1c20f7f0556e08dc1941457ed20c290868346980
2021-06-25 16:20:19 -07:00
Basil Hosmer
cab926b2c0 faster generate_square_subsequent_mask in nn.Transformer (#60631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60631

Per #48360, speed up `Transformer.generate_square_subsequent_mask`. The new impl is informally ~5x faster, though the absolute difference is probably small.

The PR includes Python and C++ versions, as well as updates to a couple of places where the previous impl had been copied around.
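
A sketch of the kind of simplification involved (assumed shape of the new implementation; the exact code lives in the PR):

```python
import torch

def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # Zero on and below the diagonal, -inf strictly above it, built in one shot
    # instead of the old bool-triangle + transpose + two masked_fill calls.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

print(generate_square_subsequent_mask(4))
```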

Test Plan: Imported from OSS

Reviewed By: jbschlosser, albanD

Differential Revision: D29356673

Pulled By: bhosmer

fbshipit-source-id: 4c062ba0ead61a445aeef451c78777bf0b3a631e
2021-06-25 16:07:01 -07:00
Ansley Ussery
7585783b8d Remove Optional[None] annotations (#60704)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60704

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29380281

Pulled By: ansley

fbshipit-source-id: 055c17329a35375de4ebd058ee6d127475aad373
2021-06-25 15:53:58 -07:00
David Riazati
5ed7400b75 Fix doc preview source directory (#60792)
Summary:
`merge` is the directory with the actual changes, not `master`. Verified by downloading artifacts from https://github.com/pytorch/pytorch/pull/60777/checks and searching through the result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60792

Reviewed By: walterddr

Differential Revision: D29405288

Pulled By: driazati

fbshipit-source-id: 419c943727c00429945c1f116645bfa22fb12456
2021-06-25 15:46:30 -07:00
Basil Hosmer
7b933cd9ea configurable pre/post LayerNorm in nn.Transformer (#60593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60593

Per #55270, this PR makes it configurable whether to run LayerNorm before or after other operations in Transformer layers.

However, it leaves for a separate PR the removal of the LayerNorm performed after the final encoder/decoder layer has run, which is redundant when LayerNorm has been run after the other in-layer operations (problem described in #24930 #50086 #51447).

Note: this means that transformers built with `nn.Transformer()` are now configurable, but will still contain a redundant LayerNorm when configured as before. However, callers of the `TransformerEncoder` and `TransformerDecoder` classes have always been able to avoid this redundancy.
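
A usage sketch, assuming the option surfaces as a `norm_first`-style constructor flag (the exact argument name is defined by the PR):

```python
import torch
from torch import nn

# norm_first=True runs LayerNorm before attention/feed-forward (pre-LN);
# the default keeps the original post-LN behavior.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
out = encoder(torch.randn(10, 32, 512))  # (seq, batch, d_model)
```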

Reviewer notes:
1. Ran across this during other work, don't know if anybody's working on it already (most recent conversation in issues seems to be from early April). Happy to abandon if so.
2. Was looking for a quick way to add tests, but it looks like the existing ones in test_nn just compare against snapshots. I could add something similar, but I'm curious if there's any prepackaged way to add a test that LayerNorm-first (the new option) yields a model that trains properly, etc.
3. New code in the `forward`s was written to minimize diff churn rather than maximize beauty :P happy to pretty it up if desired.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29356590

Pulled By: bhosmer

fbshipit-source-id: 308669326990b8923aab5fcd96e03b582fb21f24
2021-06-25 15:43:35 -07:00
angelayi
e13a9587b4 Revert "Revert D29135358: [quant] Input-Weight Equaliaztion - convert modifications" (#60646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60646

This reverts commit e60f9cfc58.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D29361191

Pulled By: angelayi

fbshipit-source-id: 275d8691d8e47da4ab80bb21b51d77ec25a0f714
2021-06-25 15:37:05 -07:00
Mikhail Zolotukhin
7188d84ccf [Tools] Update path in clang_format_utils after #60473 (#60782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60782

PR #60473 introduced a new folder nesting level; this change updates
clang_format_utils.py to adjust the way it sets up the root path
accordingly.

Test Plan: Imported from OSS

Reviewed By: zhxchen17

Differential Revision: D29403622

Pulled By: ZolotukhinM

fbshipit-source-id: 6404271615c2d263834cf538ab0153c4d41cc5c3
2021-06-25 14:30:45 -07:00
Adam Simpkins
394f60b0fc [caffe2] update make_cifar_db to move the string into DB::Put() (#60692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60692

Update make_cifar_db.cc to work with the DB API changes in D29204425 (00896cb9ed).

Test Plan: buck build caffe2/binaries:make_cifar_db

Differential Revision: D29374754

fbshipit-source-id: 23d2acd24031d11071791e398433b537215ffd38
2021-06-25 14:02:24 -07:00
Ilqar Ramazanli
e1bd4963e2 To introduce Functional API for multi-tensor (#60735)
Summary:
In this PR we change Multi-Tensor Optimizers to Functional API.

We can see that in the file https://github.com/pytorch/pytorch/blob/master/torch/optim/_functional.py a functional API has been defined for most optimizers. However, we do not have a similar file / functionality for the multi-tensor optimizers:
https://github.com/pytorch/pytorch/tree/master/torch/optim/_multi_tensor

Therefore we are adding it in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60735

Reviewed By: vincentqb

Differential Revision: D29392253

Pulled By: iramazanli

fbshipit-source-id: cebc8e7b07ab11156370f5297cfb419cd9f20b46
2021-06-25 13:09:26 -07:00
Richard Barnes
8f16a38067 Add missing kernel checks (#60635)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60635

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29355747

fbshipit-source-id: 20bae292703a54b2895a33c11e6f1b8b9a9d8195
2021-06-25 12:54:40 -07:00
lezcano
dfc8247d33 Faster cumsum and cumprod backwards (#60642)
Summary:
Piggybacking on https://github.com/pytorch/pytorch/pull/58747, now we can implement the backwards of `cumsum` and `cumprod` without tricks. This minimises the number of kernels that are launched on the GPU, so we see a reasonable speed-up on GPU. We should also get better stability for ill-conditioned inputs, as we do not perform any numerical tricks to get the result.

Note that the benchmarks test forward + backward, so the true speed-up on the backward alone should be even larger. Even more so for `cumsum`, as its backward requires fewer operations than the backward of `cumprod`.
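
For intuition, the backward of `cumsum` is just a reversed cumulative sum of the incoming gradient, which is what allows it to run with so few kernel launches (a sketch of the identity, not the literal kernel code):

```python
import torch

x = torch.randn(5, requires_grad=True)
grad_out = torch.randn(5)
x.cumsum(0).backward(grad_out)

# Each x[i] contributes to y[i], y[i+1], ..., so d(cumsum)/dx is the reversed
# cumulative sum of the output gradient along the same dimension.
manual = grad_out.flip(0).cumsum(0).flip(0)
print(torch.allclose(x.grad, manual))  # True
```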

<details>
<summary>
Test Script
</summary>

```python
from itertools import product

import torch
from torch.utils.benchmark import Compare, Timer

def get_timer(ndims, prod_dim, dim, num_threads, device):
    size = [500]*ndims
    size[dim] = prod_dim

    x = torch.rand(*size, device=device, requires_grad=True)
    # Make sure there are no zeros as the formula for the backward
    # that we are testing is for when the backward has no zeros
    with torch.no_grad():
        x.add_(1e-3)
    grad = torch.ones_like(x)

    timer = Timer(
        "torch.autograd.grad([x.cumprod(dim)], [x], grad_outputs=[grad])",
        globals={"x": x, "dim": dim, "grad": grad},
        label=f"Cumprod + Backwards {device}",
        description=f"dim: {dim}",
        sub_label=f"prod_dim: {prod_dim}",
        num_threads=num_threads,
    )

    return timer.blocked_autorange(min_run_time=5)

def get_params():
    ndims = 3
    dims = range(ndims)
    prod_dims = [10, 100, 500]
    for dim, prod_dim, device in product(dims, prod_dims, ("cpu", "cuda")):
        threads = (1, 2, 4) if device == "cpu" else (1,)
        for num_threads in threads:
            yield ndims, prod_dim, dim, num_threads, device

compare = Compare([get_timer(*params) for params in get_params()])
compare.trim_significant_figures()
compare.print()
```

</details>

<details>
<summary>
Benchmark PR
</summary>

```
[------------ Cumprod + Backwards cpu -------------]
                     |  dim: 0  |  dim: 1  |  dim: 2
1 threads: -----------------------------------------
      prod_dim: 10   |     11   |     14   |     12
      prod_dim: 100  |    260   |    270   |    260
      prod_dim: 500  |   1400   |   1550   |   1360
2 threads: -----------------------------------------
      prod_dim: 10   |      6   |      6   |      6
      prod_dim: 100  |    170   |    166   |    167
      prod_dim: 500  |    902   |    950   |    858
4 threads: -----------------------------------------
      prod_dim: 10   |      4   |      3   |      3
      prod_dim: 100  |    110   |    108   |    106
      prod_dim: 500  |    576   |    590   |    547

Times are in milliseconds (ms).

[------------ Cumprod + Backwards cuda ------------]
                     |  dim: 0  |  dim: 1  |  dim: 2
1 threads: -----------------------------------------
      prod_dim: 10   |    562   |    566   |   1075
      prod_dim: 100  |   5388   |   5394   |   6697
      prod_dim: 500  |  28170   |  27580   |  30740

Times are in microseconds (us).
```

</details>

<details>
<summary>
Benchmark master
</summary>

```
[------------ Cumprod + Backwards cpu -------------]
                     |  dim: 0  |  dim: 1  |  dim: 2
1 threads: -----------------------------------------
      prod_dim: 10   |     11   |     13   |     12
      prod_dim: 100  |    270   |    270   |    256
      prod_dim: 500  |   1500   |   1590   |   1300
2 threads: -----------------------------------------
      prod_dim: 10   |      6   |      6   |      6
      prod_dim: 100  |    170   |    170   |    164
      prod_dim: 500  |    911   |    940   |    840
4 threads: -----------------------------------------
      prod_dim: 10   |      4   |      4   |      4
      prod_dim: 100  |    111   |    109   |    105
      prod_dim: 500  |    570   |    590   |    536

Times are in milliseconds (ms).

[------------ Cumprod + Backwards cuda ------------]
                     |  dim: 0  |  dim: 1  |  dim: 2
1 threads: -----------------------------------------
      prod_dim: 10   |    616   |    597   |   1109
      prod_dim: 100  |   5976   |   5723   |   7017
      prod_dim: 500  |  31110   |  29160   |  32320

Times are in microseconds (us).
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60642

Reviewed By: ngimel

Differential Revision: D29366368

Pulled By: albanD

fbshipit-source-id: b0d692ce030352965c2f152e0f92fbb61fc5ebde
2021-06-25 12:44:12 -07:00
David Riazati
d3bec9f4d2 Use S3 for documentation previews (#60711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60711

We already build the docs on each PR; this adds a step to push the relevant folder of the docs. (We build the entire website for pytorch.github.io, which clocks in at around 500 MB, but we really only need the "master" docs, not every version. The master docs by themselves are around 50 MB, which is more reasonable.) It uses the same S3 bucket as the artifacts but places the items at the `pytorch/pytorch/pr-previews/<pr number>` prefix. The bucket has a rule to expire resources in that prefix after 1 month.

On the AWS side the bucket has static hosting enabled with CloudFront directing to the docs preview prefix, so you can see the output at `https://d28slxzaq48q8t.cloudfront.net/<pr number>/`, e.g. https://d28slxzaq48q8t.cloudfront.net/60711/. For advertising we could link this on the HUD PR page as well as in the Dr. CI comment. We could add a CNAME on CloudFront to make this be `pr-preview.pytorch.org/<pr number>` or something, but having random PRs be able to host content on the pytorch.org domain seems sketchy.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29398818

Pulled By: driazati

fbshipit-source-id: 24032854d83815853b3650d8e54f60b684707f76
2021-06-25 12:12:26 -07:00
Edward Yang
aacc722aec Dispatch to Python via __torch_dispatch__ (#59760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59760

See https://github.com/pytorch/pytorch/issues/59049

There are some moving parts to this PR, I'll structure this explanation so the straightforward parts go first, and then the less straightforward parts.

**The actual dispatch to Python.** The core logic of dispatch to Python lives in `concrete_dispatch_fn` in `torch/csrc/autograd/python_variable.cpp`. It takes the input IValue stack, scans all the arguments for Tensor arguments, and defers most of the heavy lifting to `handle_torch_function_no_python_arg_parser` which actually does all of the logic for calling out to torch dispatch (in particular, this function handles multiple dispatch situations for you). Because we have a different function name than regular `__torch_function__` handling, `handle_torch_function_no_python_arg_parser` is generalized to accept a magic method name to look for when testing if Tensors have custom handling or not. Unlike `__torch_function__`, by default there is no `__torch_dispatch__` on Tensor classes.

**Maintaining the Python dispatch key.** In order to get to the dispatch-to-Python logic, we must tag Tensors that have the `__torch_dispatch__` magic method with the newly added Python dispatch key (separated from PythonFuncTorch to allow for a transitional period while they migrate to this mechanism). We expose a new private property `_is_python_dispatch` that assists in debugging whether a Tensor is participating in Python dispatch or not. We apply the Python dispatch key the first time a PyObject for a Tensor is constructed (THPVariable_NewWithVar), testing if `__torch_dispatch__` exists with the newly added `check_has_torch_dispatch`.

**Shallow copy and detach.** For the simple examples tested in this PR, most creations of Tensor route through the dispatcher. The exception to this is `shallow_copy_and_detach`, which bypasses the dispatcher and is used when saving tensors for backwards. When a Tensor is Python dispatch, we override the behavior of `shallow_copy_and_detach` to instead directly call into `__torch_dispatch__` to perform a `detach` operation (in the same way it would be invoked if you called `detach` directly). Because this Python call is triggered directly from c10::TensorImpl, it must be indirected through `PyInterpreter::detach`, which is the general mechanism for dynamic dispatching to the Python interpreter associated with a TensorImpl.

**torchdeploy compatibility.** The dispatch to Python logic cannot be directly registered to the dispatcher as it is compiled in the Python library, which will get loaded multiple times per torchdeploy interpreter. Thus, we must employ a two phase process. First, we register a fallback inside a non-Python library (aten/src/ATen/core/PythonFallbackKernel.cpp). Its job is to determine the appropriate PyInterpreter to handle the Python dispatch by going through all of the arguments and finding the first argument that has a PyObject/PyInterpreter. With this PyInterpreter, it makes another dynamic dispatch via "dispatch" which will go to the correct torchdeploy interpreter to handle dispatching to actual Python.

**Testing.** We provide a simple example of a LoggingTensor for testing, which can be used to generate TorchScript-like traces to observe what operations are being called when a Tensor is invoked. Although a LoggingTensor would be better implemented via an is-a relationship rather than a has-a relationship (as is done in the test), we've done it this way to show that arbitrarily complex compositions of tensors inside a tensor work properly.
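
A condensed sketch of such a wrapper subclass (simplified from the actual LoggingTensor used in the test; the wrapping details here are illustrative):

```python
import torch

class LoggingTensor(torch.Tensor):
    # Has-a wrapper: `elem` holds the real data; every aten op that reaches the
    # dispatcher for this subclass is logged and re-run on the plain tensors.
    __torch_function__ = torch._C._disabled_torch_function_impl

    @staticmethod
    def __new__(cls, elem):
        r = torch.Tensor._make_subclass(cls, elem, elem.requires_grad)
        r.elem = elem
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        def unwrap(a):
            return a.elem if isinstance(a, LoggingTensor) else a
        print(f"op: {func}")
        out = func(*[unwrap(a) for a in args], **(kwargs or {}))
        return LoggingTensor(out) if isinstance(out, torch.Tensor) else out

x = LoggingTensor(torch.randn(3))
y = x + x  # prints the aten op that was dispatched
```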

**Known limitations.**

* We haven't adjusted any operator code, so some patterns may not work (as they lose the Python subclass in an unrecoverable way)
* `__torch_function__` must be explicitly disabled with `_disabled_torch_function_impl` otherwise things don't work quite correctly (in particular, what is being disabled is default subclass preservation behavior.)
* We don't ever populate kwargs, even when an argument is kwarg-only

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D29017912

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Pulled By: ezyang

fbshipit-source-id: a67714d9e541d09203a8cfc85345b8967db86238
2021-06-25 11:50:32 -07:00
Aswin John Mathews
a53d7f8f7c Remove test linalg test skips from MAGMA integration (#58232)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55552; majority of cases in https://github.com/pytorch/pytorch/issues/51303

Tests in torch/testing/_internal/common_methods_invocations.py (tested through test_ops) cannot be fully removed, since the machines seem to be running out of GPU memory during the test; this needs further analysis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58232

Reviewed By: ngimel

Differential Revision: D29394021

Pulled By: malfet

fbshipit-source-id: f108a70af33beec908ac1c0b58467f8744e6fe87
2021-06-25 11:44:49 -07:00
Elton Leander Pinto
8216da1f23 Use python3.6 compatible APIs in clang_tidy.py (#60659)
Summary:
This PR makes `tools/clang_tidy.py` use Python 3.6-compatible APIs for `asyncio` and `shlex`.

I ran into some issues when running this script with the `-j` flag inside of the clang-tidy docker image (which uses Python 3.6). Specifically, the functions `asyncio.run` and `shlex.join` are not available in Python 3.6.

This change does not affect CI because we do not run the clang-tidy job in parallel.
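
For reference, the kind of substitutions involved (a sketch of Python 3.6-compatible equivalents, not the exact diff):

```python
import asyncio
import shlex

# shlex.join (3.8+) -> quote and join manually, which also works on 3.6
args = ["clang-tidy", "-p", "build", "torch/csrc/fx/fx_init.cpp"]
command = " ".join(shlex.quote(arg) for arg in args)

async def run_shell(cmd: str) -> int:
    proc = await asyncio.create_subprocess_shell(cmd)
    return await proc.wait()

# asyncio.run (not available on 3.6) -> drive the event loop explicitly
loop = asyncio.get_event_loop()
returncode = loop.run_until_complete(run_shell("echo " + shlex.quote(command)))
```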

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60659

Reviewed By: albanD

Differential Revision: D29377851

Pulled By: 1ntEgr8

fbshipit-source-id: 92ab7ee6782b78d40ffccd03f1718ede4204d948
2021-06-25 10:35:03 -07:00
Edgar Andrés Margffoy Tuay
6322f66878 Add python version and cuda-specific folder to store extensions (#60592)
Summary:
See https://github.com/pytorch/pytorch/issues/55267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60592

Reviewed By: albanD

Differential Revision: D29353368

Pulled By: ezyang

fbshipit-source-id: 1fbcd021f1030132c0f950f33ce4a3a2fef351e0
2021-06-25 10:27:04 -07:00
Masaki Kozuki
a404cc9a7b CUDA addcmul and addcdiv do math in float for 16 bits I/O (#60715)
Summary:
Currently, foreach `addcmul` and `addcdiv` cast the scalar to float so that the actual math is done in FP32 when the tensor dtype is Float16/BFloat16, while the regular `addcmul` and `addcdiv` do not.

### Reproducible steps to see the behavioral difference
```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'

In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)

In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)

In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```

### How foreach casts?
Foreach addcmul and addcdiv cast the scalar to `opmath_t` (almost equivalent to acc_type) here: 42c8439b6e/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu (L30) and cast inputs and results here:
42c8439b6e/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L133-L135)

Related to https://github.com/pytorch/pytorch/issues/58833 #60227 https://github.com/pytorch/pytorch/issues/60454
cc ptrblck mcarilli ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60715

Reviewed By: albanD

Differential Revision: D29385715

Pulled By: ngimel

fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603
2021-06-25 10:21:35 -07:00
Rohan Varma
0be65cd52a [c10d] Fix test_collective_hang flakiness (#60662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60662

Fixes this flaky test. Basically, sometimes a rank can exit the test
early before rank 0 calls into allreduce. In this case Gloo will throw a
connection reset error on all other ranks.
ghstack-source-id: 132363151

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29364806

fbshipit-source-id: ce0c292a2166edad57ea0dbb76df12cfd560a10d
2021-06-25 10:15:18 -07:00
Elton Leander Pinto
474bdaf54d Add --print-include-paths option to tools/linter/clang_tidy.py (#60744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60744

Fixes #60739

Test Plan:
Run this command:
```
python3 tools/linter/clang_tidy.py --paths torch/csrc/fx --print-include-paths
```

Output (varies from machine to machine):
```
(clang-tidy output)
.
.
.

clang -cc1 version 11.0.0 based upon LLVM 11.0.0 default target x86_64-unknown-linux-gnu
ignoring nonexistent directory "nccl/include"
ignoring nonexistent directory "/include"
ignoring duplicate directory ".."
ignoring duplicate directory "../aten/src"
ignoring duplicate directory "../third_party/onnx"
ignoring duplicate directory ".."
ignoring duplicate directory ".."
ignoring duplicate directory "../torch/lib"
ignoring duplicate directory "../torch/../third_party/gloo"
  as it is a non-system directory that duplicates a system directory
ignoring duplicate directory "../third_party/ideep/mkl-dnn/src/../include"
  as it is a non-system directory that duplicates a system directory
#include "..." search starts here:
#include <...> search starts here:
 aten/src
 ../aten/src
 .
 ..
 ../cmake/../third_party/benchmark/include
 caffe2/contrib/aten
 ../third_party/onnx
 third_party/onnx
 ../third_party/foxi
 third_party/foxi
 ../torch/../aten/src/TH
 caffe2/aten/src
 third_party
 ../torch/../third_party/valgrind-headers
 ../torch/csrc
 ../torch/csrc/api/include
 ../torch/lib
 ../torch/lib/libshm
 ../torch/csrc/api
 third_party/ideep/mkl-dnn/include
 ../third_party/fmt/include
 third_party/gloo
 ../torch/../third_party/gloo
 ../cmake/../third_party/googletest/googlemock/include
 ../cmake/../third_party/googletest/googletest/include
 ../third_party/protobuf/src
 /data/users/eltonpinto/miniconda3/envs/pytorch/include
 ../third_party/gemmlowp
 ../third_party/neon2sse
 ../third_party/XNNPACK/include
 ../third_party
 ../cmake/../third_party/eigen
 /home/eltonpinto/local/miniconda3/envs/pytorch/include/python3.8
 /home/eltonpinto/local/miniconda3/envs/pytorch/lib/python3.8/site-packages/numpy/core/include
 ../cmake/../third_party/pybind11/include
 /usr/local/cuda-11.3/include
 ../third_party/ideep/mkl-dnn/src/../include
 ../third_party/ideep/include
 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8
 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/x86_64-redhat-linux
 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/backward
 /usr/local/include
 /usr/lib64/clang/11.0.0/include
 /usr/include

.
.
.
(more clang-tidy output)
```

Imported from OSS

Reviewed By: ngimel

Differential Revision: D29395398

fbshipit-source-id: e92077a9c4e9dee7f9d7e05df180d552e3763540
2021-06-25 10:12:15 -07:00
Elton Leander Pinto
608f12b818 Fix --dry-run option in tools/linter/clang_tidy.py (#60744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60744

Fixes #60741

Test Plan:
Run this command:
```
python3 tools/linter/clang_tidy.py --paths torch/csrc/fx --dry-run
```
Output:
```
clang-tidy -p build -config '{"InheritParentConfig": true, "Checks": " bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, -bugprone-reserved-identifier, cppcoreguidelines-*, -cppcoreguidelines-avoid-magic-numbers, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-macro-usage, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, -facebook-hte-RelativeInclude, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-concat-nested-namespaces, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, -performance-unnecessary-value-param, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null}' torch/csrc/fx/fx_init.cpp
```

Reviewed By: ngimel

Differential Revision: D29394538

Pulled By: 1ntEgr8

fbshipit-source-id: b824bc2aa63631f074e9ad17092e4e063d347395
2021-06-25 09:53:29 -07:00
lezcano
3a838e4ce3 Parametrizations depending on several inputs (#60530)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/58488

There was a line that had been changed in `test_nn.py` as caught in https://github.com/pytorch/pytorch/pull/58488#discussion_r651267668

I reverted that line, which should never have been changed. I reckon that should solve the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60530

Reviewed By: ngimel

Differential Revision: D29329865

Pulled By: albanD

fbshipit-source-id: 8dfd0cd968fe26a3924dae7ca366af2c8a8639b3
2021-06-25 09:16:57 -07:00
Kevin Tse
8cba365378 Fix incorrect doc about the dtype for torch.randint described in issue #56347 (#60507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60507

Fix incorrect documentation about the dtype for `torch.randint` described in issue #56347
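
For context, `torch.randint` defaults to an integer dtype rather than following the global default (floating-point) dtype, which is the behavior the documentation needed to reflect:

```python
import torch

print(torch.randint(0, 10, (3,)).dtype)  # torch.int64, not the global default dtype
print(torch.randn(3).dtype)              # torch.float32 (the global default dtype)
```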

Test Plan: Review documentation to make sure formatting is right

Reviewed By: bdhirsh

Differential Revision: D29321181

fbshipit-source-id: caae69a9bbb30052da518a3f5d22a7ed3504cdd2
2021-06-25 07:51:36 -07:00
Martin Yuan
d8c3d555e4 [Delegate] Support composite of lowered sub modules of the same backend (#59921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59921

Test Plan: Imported from OSS

Reviewed By: raziel

Differential Revision: D29091143

Pulled By: iseeyuan

fbshipit-source-id: 9ffcd18681917ece8ec73a34866c53701bdee1bc
2021-06-25 07:18:32 -07:00
Ilqar Ramazanli
7c2938bf67 To refactor Sparse Adam algorithm for functional form (#59171)
Summary:
Adds Functional Interface for Sparse Adam Optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59171

Reviewed By: vincentqb

Differential Revision: D29360582

Pulled By: iramazanli

fbshipit-source-id: 5ceffd7f4b7abd1e0b758a5b8445abdf5555eba0
2021-06-25 06:35:39 -07:00
Xiaomeng Yang
963c983366 Improve numerical stability of LayerNorm (#59987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59987

Similar to GroupNorm, improve the numerical stability of LayerNorm by using Welford's algorithm and pairwise summation.
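
For intuition, Welford's algorithm computes mean and variance in one pass without the catastrophic cancellation of the naive E[x^2] - E[x]^2 formula (illustrative Python, not the actual CPU/CUDA kernel):

```python
def welford_mean_var(values):
    # Numerically stable single-pass running mean and (population) variance.
    mean, m2, count = 0.0, 0.0, 0
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)  # uses the updated mean
    return mean, m2 / count

# Large offset + tiny spread is where the naive formula loses precision.
print(welford_mean_var([1e8 + 1.0, 1e8 + 2.0, 1e8 + 3.0]))  # (100000002.0, 0.666...)
```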

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

Reviewed By: ngimel

Differential Revision: D29115235

fbshipit-source-id: 5183346c3c535f809ec7d98b8bdf6d8914bfe790
2021-06-25 02:22:42 -07:00
Protonu Basu
5b1f5c8f17 When creating a single partition skip the output nodes, but process possible nodes after it. (#60370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60370

When creating a single partition, skip the output nodes, but process possible nodes after it.

Test Plan: Run all CI tests.

Reviewed By: jfix71

Differential Revision: D29265278

fbshipit-source-id: 2242009973a54498d8027cce5a294558a1206fdf
2021-06-24 23:50:30 -07:00
Hao Lu
2b51a8a935 [BackwardCompatibility] Remove aten::to from allow_list (#60147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60147

Remove aten::to from allow_list now that the aten::to schema change has landed (D29121620 (eda2ddb5b0)).

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D29187314

fbshipit-source-id: abdb5a560287a861f3858732f7b3da342ee4aa55
2021-06-24 22:57:57 -07:00
kshitij12345
3ca28656fa [special] erfcx cuda support (#60519)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345
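
A small usage sketch (assuming the op is exposed as `torch.special.erfcx`, per the torch.special tracking issue linked above):

```python
import torch

x = torch.tensor([0.0, 1.0, 10.0, 100.0], device="cuda")
# erfcx(x) = exp(x**2) * erfc(x): the scaled form stays finite for large x,
# where evaluating exp(x**2) and erfc(x) separately would overflow/underflow.
print(torch.special.erfcx(x))  # approximately [1.0000, 0.4276, 0.0561, 0.0056]
```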

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60519

Reviewed By: ngimel

Differential Revision: D29353105

Pulled By: mruberry

fbshipit-source-id: 2f525a347a22f96411739a16e354c7291e863f95
2021-06-24 21:50:37 -07:00
Garrett Cramer
46d27a53fe cuda rpc backward sparse tensor fix (#59609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59609

quick fix for https://github.com/pytorch/pytorch/issues/58755

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D29335722

Pulled By: gcramer23

fbshipit-source-id: 0de7e0399b30f0934320f1e9abb1b92a45bcf929
2021-06-24 21:40:43 -07:00
Mike Ruberry
561132f902 Revert D29330585: [pytorch][PR] add BFloat16 support for arange on CPU
Test Plan: revert-hammer

Differential Revision:
D29330585 (375d201086)

Original commit changeset: b8a04cee0c3f

fbshipit-source-id: dc138f9613becd083848e82d15c138d3883493c8
2021-06-24 20:57:43 -07:00
David Reiss
d63c236fb3 Introduce quantized convolution serialization format 3 (#60241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60241

We're going to make a forward-incompatible change to this serialization
format soon, so I'm taking the opportunity to do a little cleanup.

- Use int for version.  This was apparently not possible when V2
  was introduced, but it works fine now as long as we use int64_t.
  (Note that the 64-bits are only used in memory.  The serializer will
  use 1 byte for small non-negative ints.)
- Remove the "packed params" tensor and replace it with a list of ints.
- Replace the "transpose" field with "flags" to allow more binary flags
  to be packed in.
- Unify required and optional tensors.  I just made them all optional
  and added an explicit assertion for the one we require.

A bit of a hack: I added an always-absent tensor to the front of the
tensor list.  Without this, when passing unpacked params from Python to
the ONNX JIT pass, the type would be inferred to `List[Tensor]` if all
tensors were present, making it impossible to cast to
`std::vector<c10::optional<at:Tensor>>` without jumping through hoops.

The plan is to ship this, along with another diff that adds a flag to
indicate numerical requirements, wait a few weeks for an FC grace
period, then flip the serialization version.

Test Plan: CI.  BC tests.

Reviewed By: vkuzo, dhruvbird

Differential Revision: D29349782

Pulled By: dreiss

fbshipit-source-id: cfef5d006e940ac1b8e09dc5b4c5ecf906de8716
2021-06-24 20:52:43 -07:00
Peter Bell
42c8439b6e TH: Clean up dead code (#60655)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60655

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29371717

Pulled By: ngimel

fbshipit-source-id: faa71b1d4a15450c78e12aa917daec853057bce9
2021-06-24 19:42:16 -07:00
Peter Bell
4a7d281119 Migrate THAllocator to ATen (#60325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60325

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29371715

Pulled By: ngimel

fbshipit-source-id: 78ec8368a48e1a4690d0664a0b02d2a235af98ff
2021-06-24 19:42:14 -07:00
Peter Bell
d586248544 Migrate THStorage_resizeBytes to ATen (CPU) (#60324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60324

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29371716

Pulled By: ngimel

fbshipit-source-id: 056aee0ec87722090c133777b6948c28b03b37e4
2021-06-24 19:41:02 -07:00