Commit Graph

6316 Commits

Author SHA1 Message Date
Ivan Yashchuk
528158af47 Updated derivatives for complex mm, mv, ger, bmm, triangular_solve (#45737)
Summary:
This PR updates derivatives for a few functions so that `gradgradcheck` for `torch.cholesky` is passed ([ref](https://github.com/pytorch/pytorch/pull/45267#discussion_r494439967)).

Some tests (that call to `bmm_cuda`) fail with with `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexDouble`
until PR https://github.com/pytorch/pytorch/issues/42553 is merged.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45737

Reviewed By: bdhirsh

Differential Revision: D24279917

Pulled By: anjali411

fbshipit-source-id: 7b696d2cfc2ef714332c2e3e5d207e257be67744
2020-10-15 11:27:30 -07:00
Elias Ellison
908c23579d [JIT] Revert Freezing shared type PR (#46285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45902 by reverting https://github.com/pytorch/pytorch/pull/42457

The test case introduced by https://github.com/pytorch/pytorch/pull/42457 was fixed by https://github.com/pytorch/pytorch/pull/46250, which I'm assuming is the real source of the bug.

In the future it would be good to provide repro's for freezing issues without including a quantization dependency; there was another another issue in freezing (see: https://github.com/pytorch/pytorch/pull/46054) who's root cause was the same quantization issue https://github.com/pytorch/pytorch/pull/46250.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46285

Reviewed By: bdhirsh

Differential Revision: D24288739

Pulled By: eellison

fbshipit-source-id: b69ee8c713f749cd93d5eba370c3eafed86568bb
2020-10-15 10:57:30 -07:00
Yanan Cao
86abc8cd48 [JIT] Make InsertInstruction overflow check a warning instead of fatal (#46369)
Summary:
This diff restores previous behavior of silently allow overflowing when inserting instructions. The behavior was changed recently in https://github.com/pytorch/pytorch/issues/45382. But it started to break some existing use cases that haver overflow problems.

Restoring original behavior but throw a warning to to unblock existing use cases where overflowing happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46369

Reviewed By: kwanmacher, wanchaol, fbhuba

Differential Revision: D24324345

Pulled By: gmagogsfm

fbshipit-source-id: 1c0fac421d4de38f070e21059bbdc1b788575bdf
2020-10-14 23:09:53 -07:00
Alexander Golynski
e7e919fc34 Add warning on ProcessGroup and ProcessGroup::Work APIs (#46220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46220

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24294437

Pulled By: gmagogsfm

fbshipit-source-id: 198f8e5760beeb1d18740f971647d2537afb3dd6
2020-10-14 16:27:37 -07:00
BowenBao
b28b5d3c68 [ONNX] Update squeeze test for opset 9 (#45369)
Summary:
Only under static axes does opset 9 supports no-op squeeze when dim is not 1.
Updating the test case where it was setting dynamic axes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45369

Reviewed By: anjali411

Differential Revision: D24280180

Pulled By: bzinodev

fbshipit-source-id: d7cda88ab338a1c41a68052831dcebe739a3843c
2020-10-14 12:53:13 -07:00
shubhambhokare1
9d389b1dcc [ONNX] Preprocess index_put with bool inputs to masked_scatter/masked_fill (#45584)
Summary:
When the input to an indexing operation is a boolean, for example array[True] = value,
the subsequent index_put node formed needs to be converted to masked_scatter/masked_fill node based on the type of val the indexing node is equated. If that value is just a single scalar, then we use the masked_fill functionality and if value is a tensor of appropriate size, we use the masked_scatter functionality.

Fixes https://github.com/pytorch/pytorch/issues/34054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45584

Reviewed By: VitalyFedyunin

Differential Revision: D24116921

Pulled By: bzinodev

fbshipit-source-id: ebd66e06d62e15f0d49c8191d9997f55edfa520e
2020-10-14 10:58:55 -07:00
Mikhail Zolotukhin
d790ec6de0 [JIT] Update comment in jit_log.h. (#46301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46301

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24295281

Pulled By: ZolotukhinM

fbshipit-source-id: a4f84c773029845065895a81f9d753a9c82a99e0
2020-10-13 23:42:28 -07:00
Rohan Varma
f7398759b4 Only populate grad accumulator to var mapping for find_unused_parameters=True in DDP (#45942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45942

We only need to keep track of this for traversing the autograd graph
when find_unused_parameters=True. Without that, we populate and keep this
mapping in memory, which occupies sizeof(pointer) * number of grad accumulators
of extra memory.
ghstack-source-id: 114219289

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D24154407

fbshipit-source-id: 220d723e262f36590a03a3fd2dab47cbfdb87d40
2020-10-13 21:12:59 -07:00
olegfaust
ac3f23deb0 Fixed usage of std::move function (#46199)
Summary:
Removed std::move in situations when move wasn't really possible (therefore std::move didn't move anything but created copy instead).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46199

Reviewed By: bdhirsh

Differential Revision: D24287408

Pulled By: glaringlee

fbshipit-source-id: f88b9500e7bbaa709bff62b845966e2adc7fa588
2020-10-13 19:13:30 -07:00
Martin Yuan
173363f31a Use tensor's quantized properties directly in pickler (#46267)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46267

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24283008

Pulled By: iseeyuan

fbshipit-source-id: 76c8410d428a5fc487381e65a9f3a789a9f04eb0
2020-10-13 19:05:52 -07:00
Pritam Damania
f89498f3f8 Allow RPC framework to use rank in addition to WorkerInfo and name. (#46221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221

The RPC framework only allowed sending RPCs based on provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks since DDP doesn't support names yet.

As a result, support a `to` parameter in the RPC APIs which allow for
specifying a rank as well would be helpful.
ghstack-source-id: 114207172

Test Plan:
1) waitforbuildbot
2) Unit Tests

Reviewed By: mrshenli

Differential Revision: D24264989

fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
2020-10-13 17:52:54 -07:00
Michael Ranieri
b1d24dded1 make a way to disable callgrind (#46116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46116

Ideally I would just use one of the existing preprocessor flags such as `FBCODE_CAFFE2`, but this implies a whole bunch of other things elsewhere, so it is not really a solution for ovrsource.

Test Plan: CI green, we are able to disable it internally with `-DNVALGRIND`

Reviewed By: malfet

Differential Revision: D24227360

fbshipit-source-id: 24a3b393cf46d6a16acca0a9ec52610d4bb8704f
2020-10-13 16:18:04 -07:00
Supriya Rao
95ccf34fb9 [quant][graph][fix] Set type for GetAttr nodes in remapTypes (#46250)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46250

Previously the type of GetAttr nodes was getting set incorrectly and wasn't matching the module type

Test Plan:
Existing quantization tests

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24279872

fbshipit-source-id: 2b2e3027f6e9ad8ba9e9b7937bd5cc5daaf6e17c
2020-10-13 12:59:28 -07:00
chengjun
5741de883a Define the record_stream method in native_functions.yaml (#44301)
Summary:
The record_stream method was hard coded for CUDA device. Define the record_stream in the native_functions.yaml to enable the dynamic dispatch to different end device.

Fixes https://github.com/pytorch/pytorch/issues/36556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301

Reviewed By: glaringlee

Differential Revision: D23763954

Pulled By: ezyang

fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
2020-10-13 09:15:22 -07:00
Brian Hirsh
a3caa719af fix #45552 - adding add_done_callback(fn) to torch.futures.Future (#45675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45675

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24055353

Pulled By: bdhirsh

fbshipit-source-id: 9233c8e17acc878f0fecbe740a4397fb55cf722f
2020-10-13 07:47:36 -07:00
Tao Xu
a277c097ac [iOS][GPU] Add Metal/MPSCNN support on iOS (#46112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112

### Summary

This PR adds the support of running torchscript models on iOS GPU via Metal (Inference only). The feature is currently in prototype state, API changes are expected. The tutorial and the documents will be added once it goes to beta.

allow-large-files

- Users API

```
  auto module = torch::jit::load(model);
  module.eval();
  at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
  auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
    - Person Segmentation v106 (FB Internal)
    - Mobilenetv2

- Supported Operators
    - aten::conv2d
    - aten::addmm
    - aten::add.Tensor
    - aten::sub.Tensor
    - aten::mul.Tensor
    - aten::relu
    - aten::hardtanh
    - aten::hardtanh_
    - aten::sigmoid
    - aten::max_pool2d
    - aten::adaptive_avg_pool2d
    - aten::reshape
    - aten::t
    - aten::view
    - aten::log_softmax.int
    - aten::upsample_nearest2d.vec

- Supported Devices
    - Apple A9 and above
    - iOS 10.2 and above

- CMake scripts
    - `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`

### Test Plan

- Circle CI

ghstack-source-id: 114155638

Test Plan:
1. Sandcastle CI
2. Circle CI

Reviewed By: dreiss

Differential Revision: D23236555

fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
2020-10-13 01:46:56 -07:00
Nikita Shulga
ba1e0a88bb Use const-references in nodes_to_rewrite range loop
Test Plan: CI

Reviewed By: supriyar

Differential Revision: D24267389

fbshipit-source-id: c56d6bf1924b4c4c993fdf1328cfd5ab0d890869
2020-10-12 20:08:34 -07:00
Nick Gibson
f3db68776c [NNC] Fix two more bugs in Cuda Half support (#46129)
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen which could break when using Half floats:

1. The Registerizer will generate new scalars with the type of the load being replaced, and doesn't have Cuda specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load casting rules.

2. We're not handling explicit casts to Half inserted by the user (in the report the user being the JIT). Addressing this by replacing these with casts to Float since thats the type we do Half math in.

Fixes https://github.com/pytorch/pytorch/issues/45953.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129

Reviewed By: glaringlee

Differential Revision: D24253639

Pulled By: nickgg

fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
2020-10-12 13:31:07 -07:00
Brian Hirsh
c02efdefa8 adding complex support for distributed functions and . fix #45760 (#45879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45879

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24127949

Pulled By: bdhirsh

fbshipit-source-id: 8061b14fa1c0adbe22b9397c2d7f92618556d223
2020-10-12 12:44:47 -07:00
partypyro
8d5256e6dd Made exception message for torch.LongTensor() legacy constructor more readable (#46147)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46085

Made exception message for torch.LongTensor() legacy constructor more
readable

![exception_screenshot](https://user-images.githubusercontent.com/13827698/95664789-e3387b80-0aff-11eb-8e8e-bd2ee449cd7e.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46147

Reviewed By: glaringlee

Differential Revision: D24252617

Pulled By: mrshenli

fbshipit-source-id: 6c03b66fef50cf18f9d37c7047d3b98c847ae287
2020-10-12 11:26:38 -07:00
Gregory Chanan
2070834b9e Improve error checking of Storage._writeFile. (#46036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46036

Previously, this function didn't do error-bounds checking on the GetItem (GET_ITEM) calls, which led to issues like https://github.com/pytorch/pytorch/issues/46020.

A better solution would be to use pybind, but given writing the file is going to dominate bounds checking, this is strictly better.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24228370

Pulled By: gchanan

fbshipit-source-id: f5d0a3d21ff12b4380beefe1e9954fa81ea2f567
2020-10-12 11:10:04 -07:00
Nick Gibson
2fa91fa305 [NNC] Fix crash when simplifying certain subtractions (#46108)
Summary:
Fixes a crash bug in the IRSimplifier when the LHS is a Term (e.g. 2x) and the RHS is a Polynomial (e.g. 2x+1).

This case crashes 100% of the time so I guess it's not very common in models we've been benchmarking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46108

Reviewed By: agolynski

Differential Revision: D24226593

Pulled By: nickgg

fbshipit-source-id: ef454c855ff472febaeba16ec34891df932723c0
2020-10-09 15:15:55 -07:00
Ailing Zhang
0ddcc0ce35 Add alias dispatch key DefaultBackend. (#45718)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45718

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165892

Pulled By: ailzhang

fbshipit-source-id: ed28bf62b7c6320d966fd10b7a44b14efffe2f62
2020-10-09 12:02:44 -07:00
Rohan Varma
62554a3bd2 Prioritize raising error message about unused parameters when rebuild_buckets fails (#45933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933

Occasionally users run DDP with models with unused params, in this
case we would like to surface an error message telling them to run with
find_unused_params=True. However, a recent change to rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size mismatch error when this happens, but the
information about unused parameters is likely to be more useful and likely to
be the most common case of failure. Prefer raising this error over the
subsequent size mismatch errors.
ghstack-source-id: 113914759

Test Plan: Added unittest

Reviewed By: mrshenli

Differential Revision: D24151256

fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
2020-10-09 09:16:45 -07:00
generatedunixname89002005325676
9fb8e33a5b [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D24215555

fbshipit-source-id: 21d10bd60ab302c7cf7e245979b2d2ef0a142a1c
2020-10-09 08:37:54 -07:00
Raghavan Raman
a5c0dbc519 Add support for Softmax. (#45286)
Summary:
This PR adds support for Softmax in NNC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45286

Reviewed By: mrshenli

Differential Revision: D24042901

Pulled By: navahgar

fbshipit-source-id: 120bafe17586d3ecf0918f9aee852a7c3a8f4990
2020-10-08 23:57:02 -07:00
Mingzhe Li
8cd3857bc7 [NCCL] Add torch::cuda::nccl::send/recv (#45926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45926

torch/csrc/cuda/nccl.cpp is compiled as part of torch_cuda library and thus by calling this function from ProcessGroupNCCCL.cpp it avoids linking 2nd instance of libnccl.a into torch_python
Fixes similiar issue as https://github.com/pytorch/pytorch/issues/42517

ghstack-source-id: 113910530

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24147802

fbshipit-source-id: d8901fdb31bdc22ddca2364f8050844639a1beb3
2020-10-08 19:20:40 -07:00
Shen Li
96d48178c8 Make pipeWrite and pipeRead noexcept (#45783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783

After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently timeout. I noticed this
as some tests take more than 60s to finish.

This commit extract the tensor device checking logic out of pipeWrite,
and make sure the error is thrown before the active call count is
incremented.

Differential Revision: D24094803

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: mrshenli

fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
2020-10-08 18:53:51 -07:00
Supriya Rao
31888b2e77 [quant][pyper] Rename the sparse argument for embedding_bag ops (#46003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003

sparse is confusing because itt is used in training for sparse gradients

Test Plan: Imported from OSS

Reviewed By: radkris-git, qizzzh

Differential Revision: D24178248

fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
2020-10-08 16:15:28 -07:00
Supriya Rao
8c80ee8ba5 [quant] Set sparse to False for embedding_bag ops in graph mode (#45997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45997

The current sparse field using in the float module is for sparse gradients, which is not applicable
to inference. The sparse field in the quantizd ops denotes pruned weights.

Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D24176543

fbshipit-source-id: a05b4ff949e0375462ae411947f68076e1b460d2
2020-10-08 16:13:12 -07:00
huaidong.xiong
e3112e3ed6 aten::set_grad_enabled should not push as it does not return a value (#45559)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45558

This assertion failure is caused by the incorrect implementation of ``aten::set_grad_enabled`` in [torch/csrc/jit/runtime/register_special_ops.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_special_ops.cpp#L436). The current implementation is:

```cpp
Operator(
    "aten::set_grad_enabled(bool val) -> ()",
    [](Stack* stack) {
      torch::GradMode::set_enabled(pop(stack).toBool());
      push(stack, IValue());
    },
    aliasAnalysisConservative()),
```

which push a ``None`` on to the evaluation stack after calling ``set_enabled``. But according to the signature, the behavior is incorrect as the signature says this function won't return a value. I guess the original author might be confused by the behavior of Python, which pushes a ``None`` on to the evaluation stack when the function definition does not end with a return statement with an explicit result value.

If ``aten::set_grad_enabled`` pushes a ``None`` on to the evaluation stack, each time it's called, the evaluation stack will accumulate an extra ``None``. In our case, ``with torch.no_grad():`` will cause ``aten::set_grad_enabled`` to be called twice, so when the ``forward`` method finishes, the evaluation stack will be ``[None, None, Tensor]``. But the return statement of ``GraphFunction::operator()`` in [torch/csrc/jit/api/function_impl.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/api/function_impl.cpp#L51) is ``return stack.front();`` which will try to extract a tensor out of a ``None`` thus causes the assertion failure.

The solution is simple, just remove the push in the implementation of ``aten::set_grad_enabled``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45559

Reviewed By: albanD

Differential Revision: D24142153

Pulled By: SplitInfinity

fbshipit-source-id: 75aad0e38bd912a437f7e1a1ee89ab4445e35b5d
2020-10-08 14:42:11 -07:00
Nick Gibson
402abdfdf4 [NNC] cacheAccesses transform (cache_reads + cache_writes) (#45869)
Summary:
Adds a new transform to the NNC compiler, which adds support for buffer access caching. All accesses within a provided scope are redirected to a cache which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:

Before:
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int j_1 = 0; j_1 < 10; j_1++) {
    B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
  }
```

After `cacheAccesses(A->buf(), "A_local", j_loop);`

```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int i_2 = 0; i_2 < 2; i_2++) {
    for (int j_1 = 0; j_1 < 11; j_1++) {
      A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
    }
  }
  for (int j_2 = 0; j_2 < 10; j_2++) {
    B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
  }
}
```

Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
  sum[l1] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
}
```

After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:

```
for (int l1 = 0; l1 < 4; l1++) {
  Allocate(d_local, float, {1});
  sum[l1] = 0.f;
  d_local[0] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
  sum[l1] = (sum[l1]) + (d_local[0]);
  Free(d_local);
}
```

I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. Didn't feel too useful so I removed them, but let me know.

This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.

While working on this I realized that it overlaps heavily with `computeAt`: which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM opinions on this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869

Reviewed By: mruberry

Differential Revision: D24195276

Pulled By: nickgg

fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
2020-10-08 14:13:28 -07:00
Elias Ellison
1197a38a63 [JIT] Bind log1p and lgamma (#45791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45791

Most of the lowering for log1p and lgamma already existed, add JIT integration.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24169536

Pulled By: eellison

fbshipit-source-id: a009c77a3471f3b5d378bad5de6d8e0880e9da3c
2020-10-08 12:06:34 -07:00
Elias Ellison
338283057b [JIT] [3/3] Make sure fusion occurs in test_tensorexpr (#45790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45790

Making sure that more tests invoke a run with a Fusion Group.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24169534

Pulled By: eellison

fbshipit-source-id: a2666df53fbb12c64571e960f59dbe94df2437e4
2020-10-08 12:06:25 -07:00
Elias Ellison
564296f051 [2/3] [JIT] Make sure fusion occurs in test_tensorexpr (#45789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45789

Making sure that more tests invoke a run with a Fusion Group.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D24169535

Pulled By: eellison

fbshipit-source-id: 54d7af434772ba52144b12d15d32ae30460c0c3c
2020-10-08 12:06:16 -07:00
Elias Ellison
1b97ffa07a [1/3] [JIT] Make sure fusion occurs in test_tensorexpr file (#45788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788

We were only running the traced graph once, which would not yet have been fused at that point. We should run for num_profiled_runs + 1, and also assert that all nodes in the graph  were fused.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24169537

Pulled By: eellison

fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
2020-10-08 12:02:57 -07:00
Heitor Schueroff de Souza
636eb18029 Fixed median nan propagation and implemented nanmedian (#45847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847

Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24136629

Pulled By: heitorschueroff

fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
2020-10-08 11:20:21 -07:00
Thomas Viehmann
d3d8da7a8e Enable CUDA Fuser for ROCm (#45965)
Summary:
This enables the cuda fuser on ROCm and enables tests for them.

Part of this patch is based on work of Rohith Nallamaddi, thank you.
Errors are my own, of course.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45965

Reviewed By: seemethere

Differential Revision: D24170457

Pulled By: walterddr

fbshipit-source-id: 3dd25b3501a41d2f00acba3ce8642ce51c49c9a6
2020-10-08 10:41:56 -07:00
Hao Lu
ea4fbb2e5e [StaticRuntime] Replace hashtable based workspace with vector<IValue> (#45892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45892

Previously we were using hashtable (`std::unordered_map` in OSS, `folly::F14FastMap` in fb) for workspace, a container for all the IValues in the graph. Hashtable based lookups can be expensive. This diff replaces the hashtable with `std::vector` and extra bookkeepings are introduced to keep track of the indices of graph inputs/outputs in `StaticRuntime` and op inputs/outputs in `ProcessedNode`.

Reviewed By: dzhulgakov

Differential Revision: D24098763

fbshipit-source-id: 337f835ee144985029b5fa2ab98f9bcc5e3606b6
2020-10-08 09:50:30 -07:00
Mikhail Zolotukhin
6e4de44501 [TensorExpr] LoopNest: add a constructor that takes Stmt instead of list of Tensors. (#45949)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45949

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24156001

Pulled By: ZolotukhinM

fbshipit-source-id: 6f4f050b04e802e274c42ed64be74c21ba79c29f
2020-10-08 00:58:13 -07:00
Mikhail Zolotukhin
1036b77416 [TensorExpr] LoopNest: replace output_tensors_ with output_bufs_. (#45948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45948

No functionality changes expected, it's just a preparation for further
changes in the LoopNest interface.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24156000

Pulled By: ZolotukhinM

fbshipit-source-id: f95ab07aac0aba128bc4ed5376a3251ac9c31c06
2020-10-08 00:58:10 -07:00
Mikhail Zolotukhin
29da553dd9 [TensorExpr] Loopnest: unify intermediate_tensors_ and temp_bufs_. (#45947)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45947

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24155999

Pulled By: ZolotukhinM

fbshipit-source-id: d82acf6aba570f6a675eea683c306088e2a41f91
2020-10-08 00:58:08 -07:00
Mikhail Zolotukhin
598caddd93 [TensorExpr] Add shorthand versions for splitWith{Mask,Tail} functions. (#45946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45946

Also, make these functions static - they are not using anything from
`LoopNest` and can be applied to any `Stmt`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24156002

Pulled By: ZolotukhinM

fbshipit-source-id: 1c7d205f85a2a1684e07eb836af662f10d0a50fc
2020-10-08 00:58:06 -07:00
Mikhail Zolotukhin
b65ffa365c [TensorExpr] Nuke Function class and directly use Tensor instead. (#45936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45936

`Tensor` has been a view into a `Function` that was supposed to be used
for a more general case when we have multiple computations over the same
domain (aka multiple output functions). We have never got to a point
where we need this and now have other ideas in mind on how to support
this case if need be. For now, let's just nuke `Function` to reduce the
overall system complexity.

The change should not affect any existing behavior.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24153214

Pulled By: ZolotukhinM

fbshipit-source-id: 26d5f11db5d661ff5e1135f4a49eff1c6d4c1bd5
2020-10-08 00:55:31 -07:00
Nikita Shulga
c19b9cd18d Add torch::cuda::ncll::all2all (#45900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900

Use `torch:cuda::nccl:all2all` from `ProcesGroupNCCL.cpp`

Fixes https://github.com/pytorch/pytorch/issues/42517

Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```
When static library is linked into a dynamic library or an executable, linker is removes all unused/duplicate symbols from that library, unless `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514 all nccl call made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`
But adding `ncclSend`|`ncclRecv` to ProcesGroupNCCL.cpp forced linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.

This PR adds `nccl[Send|Recv]` call to `torch_cuda.so` by implementing `all2all` in `torch_cuda` and thus avoids double linking the static library.

More involved, but prone solution, would be to use wrappers exported in `torch::cuda::nccl` namespace, instead of making direct NCCL API calls.

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
2020-10-07 23:56:31 -07:00
Nick Gibson
19da1d22fe [NNC] Registerizer V2, supporting partial and conditional replacement (#45574)
Summary:
This is a rewrite of the Registerizer, supporting scalar replacement in *vastly* more situations. As a refresher, the registerizer does this:

Before:
``` A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```
After:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```

Which can greatly reduce the number of accesses to main memory in a kernel. There are cases where doing this gets complicated, and the existing implementation bails out whenever encountering multiple partial overlaps of the same buffer, or conditional accesses under any circumstances. This makes it much less useful in the presence of complex (ie. real world not example) kernels. This new version should work optimally in almost all cases (I have a few minor follow ups).

I tested this version extensively, and found quite a few bugs in the original implementation I'd prefer not to back port fixes for - so I'm in favor of landing this even if we don't immediately see a perf win. I believe the killer app for this kind of optimization is fused reductions and we haven't enabled many examples of that yet.

It is safe to move two accesses of the same Tensor element to a local scalar Var if between all usages of the element there are no other Loads or Stores that may refer to it. In the comments I refer to this as overlapping the access, or "cutting" the existing AccessInfo. In the case where a candidate for registerization is cut, it may be possible to finalize the access early by writing it back to the Tensor and then create a new scalar variable after the overlapping access is complete. We will attempt to do this when it saves memory accesses.

There are a few cases that make this more challenging:

 - For: Loops change the number of real usages of a buffer by the loop extent, but only if we can pull the definition and finalization of the scalar variable out of the loop block. For loops often create accesses which are conditional on a loop var and will overlap large ranges of elements.

E.g. Before:
```
A[0] = 2;
for (int x1 = 0; x1 < 10; x1++) {
  A[0] = (A[0]) + x1;
}
for (int x2 = 1; x2 < 10; x2++) {
  A[x2] = A[x2 - 1];
}
for (int x3 = 0; x3 < 10; x3++) {
  A[0] = (A[0]) + x3;
}
```
After:
```
int A_1 = 2;
for (int x1 = 0; x1 < 10; x1++) {
  A_1 = A_1 + x1;
}
A[0] = A_1;
for (int x2 = 1; x2 < 10; x2++) {
  A[x2] = A[x2 - 1];
}
int A_2 = A[0];
for (int x3 = 0; x3 < 10; x3++) {
  A_2 = A_2 + x3;
}
A[0] = A_2;
```
- Cond: Conditions complicate lifting scalars out of internal scopes. Generally we cannot lift an access outside of a conditional scope unless there is already a reference to that same access at the higher scope, since we don't know if the condition was guarding an array access not safe at the higher scope. In the comments I refer to this as the condition "hiding" the access, and the outer access "unhiding" it.

E.g. this example:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
A[x] = (A[x]) + 1;
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```
The A[x] access can be registerized due to the unconditional access between the two conditions:
```
int A_1 = A[x];
if (x<5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A_1 = A_1 + 1;
if (x>5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A[x] = A_1;
```
But this example has no accesses that can be registerized:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```

- IfThenElse: Same situation as Cond, except since IfThenElse is an Expr rather than a Stmt we cannot insert the scalar definition or finalizer within the conditional scope. Accesses inside an IfThenElse can be safely combined with external accesses but cannot exist completely within.

E.g in this example the `B[x]` cannot be registerized as there is no safe place to define it.
```
A[x] = IfThenElse(x<3 ? 1 : 0, (B[x]) + (B[x]), B[x]);
```

But the equivalent kernel using Cond can be registerized:
```
if (x<3 ? 1 : 0) {
  float B_1 = B[x];
  A[x] = B_1 + B_1;
} else {
  A[x] = B[x];
}
```
- Let: Accesses dependent on local variables via Let Stmts, or loop vars, cannot be raised outside of the scope of the dependent var.

E.g. no accesses in this example can be registerized:
```
for (int x = 0; x < 10; x++) {
  int y = 30;
  A[y] = x + (A[y]);
}
```

But they can in this example:
```
int y = 30;
for (int x = 0; x < 10; x++) {
  A[y] = x + (A[y]);
}
```

**Testing**

The majority of this PR is tests, over 3k lines of them, because there are many different rules to consider and they can interact together more or less arbitrarily. I'd greatly appreciate any ideas for situations we could encounter that are not covered by the tests.

**Performance**

Still working on it, will update. In many FastRRNS sub kernels this diff reduces the number of total calls to Store or Load by 4x, but since those kernels use Concat very heavily (meaning a lot of branches) the actual number encountered by any particular thread on GPU is reduced only slightly. Overall perf improved by a very small amount.

Reductions is where this optimization should really shine, and in particular the more complex the kernel gets (with extra fusions, etc) the better this version of the registerizer should do compared the existing version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45574

Reviewed By: albanD

Differential Revision: D24151517

Pulled By: nickgg

fbshipit-source-id: 9f0b2d98cc213eeea3fda16fee3d144d49fd79ae
2020-10-07 18:17:27 -07:00
Elias Ellison
c86655a815 [JIT] Fix Dict bug in constant hashing (#45929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45929

We were checking `and` when we should have been checking `or`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24148804

Pulled By: eellison

fbshipit-source-id: 9c394ea10ac91a588169d934b1e3208512c71b9d
2020-10-07 17:40:17 -07:00
Elias Ellison
72e4f51bc0 [JIT] fix dict update (#45857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45857

Fix for https://github.com/pytorch/pytorch/issues/45627

Op was calling `insert` instead of `insert_or_assign`, so it wouldn't overwrite an existing key.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24148805

Pulled By: eellison

fbshipit-source-id: bf39c71d5d928890b82cff1a9a0985dc47c1ffac
2020-10-07 17:36:02 -07:00
Michael Carilli
5640b79bf8 Allow consumer ops to sync on GraphRoot's gradient (#45787)
Summary:
Currently, a GraphRoot instance doesn't have an associated stream.  Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream.  If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.

The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
    # and the side_stream context is irrelevant.  GraphRoot's interaction with its first consumer(s) is the spot where
    # the side_stream context causes a problem.
```

This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (i think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.)

The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.

With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787

Reviewed By: nairbv

Differential Revision: D24138376

Pulled By: albanD

fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
2020-10-07 08:53:53 -07:00
James Reed
be45c3401a [JIT] Make objects throw Python AttributeError on nonexistant attr access (#45911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45911

Test Plan: Imported from OSS

Reviewed By: robieta

Differential Revision: D24140971

Pulled By: jamesr66a

fbshipit-source-id: 046a2cffff898efad5bcc36a41bf992f36f555f9
2020-10-07 01:57:29 -07:00