Commit Graph

351 Commits

Taylor Robie
a4ca394f8a Revert "Revert D26907093: Add repeats to Timer.collect_callgrind(...)" (#54484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54484

Re-land of https://github.com/pytorch/pytorch/pull/53295. (With fixed unit tests.)

This reverts commit 0dc5abfaa9.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27255201

Pulled By: robieta

fbshipit-source-id: 4e9fed7522631d66c5cd7e27ace9b5ffc3a0bbfc
2021-03-23 21:58:17 -07:00
Edward Yang
e0aebe241d Refactor tensor_new.cpp to use TensorOptions instead of DispatchKey (#54034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54034

Fixes #53544

I had to touch a bunch of lines but the refactoring was fairly
mechanical.  Here's how it works.

The basic concept behind this PR is that tensor_new.cpp was previously
abusing DispatchKey when it actually meant TensorOptions.  The provided
DispatchKey argument to most of the constructor functions typically
comes from torch::tensors::get_default_dispatch_key();  it doesn't
really make sense for people to set the default dispatch key, but
this got grandfathered in due to the old API set_default_tensor_type
(where the "Type" concept got refactored into "DispatchKey" concept
over time).  See also #53124.  But the upshot is that, semantically,
what we refer to as the default dispatch key really is more like
torch.set_default_tensor_type(torch.Tensor) versus
torch.set_default_tensor_type(torch.cuda.Tensor): clearly the user
wants to do something about *construction* of the tensor, and
TensorOptions captures that exactly.
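
To make the construction semantics concrete, here is a small illustration (not part of this PR's diff; it only uses the long-standing Python API mentioned above):

```
import torch

torch.set_default_tensor_type(torch.FloatTensor)         # construction default: CPU float32
print(torch.ones(3).dtype, torch.ones(3).device)          # torch.float32 cpu

if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)  # construction default: CUDA float32
    print(torch.ones(3).device)                             # cuda:0
    torch.set_default_tensor_type(torch.FloatTensor)        # restore the CPU default
```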

So, how exactly to translate from one to the other?
- Sources (things that used to PRODUCE DispatchKey)
  - Most top level functions take a DispatchKey as their argument.  I
    use the new function dispatchKeyToTensorOptions to convert it into
    a TensorOptions
  - typeIdWithDefault now produces a TensorOptions (probably could do
    with a rename, though I didn't)
- Sinks (things that used to CONSUME DispatchKey)
  - Previously, the function options() was typically used to convert the
    DispatchKey into a TensorOptions.  Now its replacement build_options
    just takes a TensorOptions and sets some extra fields on it.
    Irritatingly, I can't just replace
    `build_options(options, scalar_type, device)` with
    `options.dtype(scalar_type).device(device)` because the semantics
    are slightly different: if device is nullopt, we should preserve
    the usage of the device specified in options (what options.device()
    does is overwrite the device unconditionally; e.g., if device is
    nullopt, unset device from options)
  - The other major sink for DispatchKey was `internal_new_from_data`,
    but it turns out it only really extracts the device type from
    the dispatch key.  Now it just pulls out the device from
    TensorOptions.
- To actually do the translation of DispatchKey to TensorOptions, I
  introduce new functions dispatchKeyToLayout (replicating
  layout_from_backend--there are still a few uses of this function
  so I couldn't delete it) and dispatchKeyToDeviceType (replacing
  computeDeviceType)
- In all internal functions, whenever DispatchKey is taken as an argument,
  I instead take TensorOptions as an argument, and pass it along.
- Anywhere `legacyExtractDispatchKey(other.key_set())` equality was
  previously used, I now do `other.options().type_equal()`, which
  is the intended BC for doing "backend to backend" comparisons
- There are a few places in the sparse constructors where we allocated
  a tensor for values, and then read out the dispatch key from the
  result to allocate the keys.  As best as I can tell, this is totally
  equivalent to just passing in the options to both values and indices
  (the only difference is dtype, which is captured via a separate
  argument)

This refactor doesn't really go far enough: for example, there are now
functions that take both TensorOptions and ScalarType, when really
the TensorOptions can capture this all.  I kept it solely just
s/DispatchKey/TensorOptions/ to reduce the number of possible bugs;
also, a lot of this will be mooted by a proper fix to #53124.

Even with this limited refactor, the payoff is sweet.  I can delete:

- backendToCPU
- backendToXPU
- backendToCUDA
- backendToHIP
- backendToBackendOfDeviceType

The reason I can do this is because I can simply overwrite layout in TensorOptions
to do the conversion, rather than having to type out each backend case
explicitly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27109509

Pulled By: ezyang

fbshipit-source-id: 91d16cfbc390127770362ac04fb43f7e070077e9
2021-03-19 09:08:32 -07:00
Pavel Belevich
0dc5abfaa9 Revert D26907093: Add repeats to Timer.collect_callgrind(...)
Test Plan: revert-hammer

Differential Revision:
D26907093 (74993dcf7b)

Original commit changeset: 72e5b4889691

fbshipit-source-id: 80779ec895920a4e9b33daa56f32b587f8912ed6
2021-03-17 20:14:21 -07:00
Taylor Robie
74993dcf7b Add repeats to Timer.collect_callgrind(...) (#53295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53295

A lot of the time spent in `collect_callgrind` is spinning up Valgrind and executing the initial `import torch`. In most cases the actual run loop is a much smaller fraction. As a result, we can reuse the same process to do multiple replicates and do a much better job amortizing that startup cost. This also tends to result in more stable measurements: the kth run is more repeatable than the first because everything has been given a chance to settle into a steady state. The instruction microbenchmarks lean heavily on this behavior. I found that in practice doing several `n=100` replicates is more reliable than one monolithic 10,000+ iteration run. (Since rare cases like memory consolidation will just contaminate that one replicate, as opposed to getting mixed into the entire long run.)
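
A hedged usage sketch of the replicate behavior described above (assumptions: the new keyword is named `repeats` and one `CallgrindStats` result is returned per replicate):

```
from torch.utils.benchmark import Timer

timer = Timer("y = x + 1", setup="import torch; x = torch.ones((8,))")
# All replicates share one Valgrind process, amortizing the `import torch` startup cost.
stats = timer.collect_callgrind(number=100, repeats=5)
for s in stats:
    print(s.counts(denoise=True))
```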

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26907093

Pulled By: robieta

fbshipit-source-id: 72e5b48896911f5dbde96c8387845d7f9882fdb2
2021-03-17 18:05:13 -07:00
mattip
54a2498919 Modify tests to use assertWarnsOnceRegex instead of maybeWarnsRegex (#52387)
Summary:
Related to https://github.com/pytorch/pytorch/issues/50006

Follow on for https://github.com/pytorch/pytorch/issues/48560 to ensure TORCH_WARN_ONCE warnings are caught. Most of this is straight-forward find-and-replace, but I did find one place where the TORCH_WARN_ONCE warning was not wrapped into a python warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52387

Reviewed By: albanD

Differential Revision: D26773387

Pulled By: mruberry

fbshipit-source-id: 5be7efbc8ab4a32ec8437c9c45f3b6c3c328f5dd
2021-03-08 03:32:14 -08:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
kshitij12345
c4c77e2001 [special] add torch.special namespace (#52296)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

 * Add `torch.special` namespace
* Add `torch.special.gammaln` (alias to `torch.lgamma`)

TODO:
* Add proper entries for docs.
   * [x] Add .rst file entry
   * [x] Add documentation
   * [x] Update `lgamma` OpInfo entry for alias to `special.gammaln`.
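
A small usage sketch of the alias described above (assumes a build containing this PR):

```
import torch

x = torch.tensor([0.5, 1.5, 5.0])
print(torch.special.gammaln(x))                                    # new namespace entry
print(torch.allclose(torch.special.gammaln(x), torch.lgamma(x)))   # True: same underlying op
```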

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52296

Reviewed By: ngimel

Differential Revision: D26754890

Pulled By: mruberry

fbshipit-source-id: 73479f68989d6443ad07b7b02763fa98973c15f6
2021-03-04 00:04:36 -08:00
Nikita Shulga
a0a1bb074b Make NumPy dependency dynamic (#52794)
Summary:
Move NumPy initialization from `initModule()` to a singleton inside
the `torch::utils::is_numpy_available()` function.
This singleton will print a warning that NumPy integration is not
available, rather than failing the torch import altogether.
The warning will be printed only once, and will look something like the
following:
```
UserWarning: Failed to initialize NumPy: No module named 'numpy.core' (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:66.)
```

This is helpful if PyTorch was compiled with the wrong NumPy version, or if
NumPy is not commonly available on the platform (which is often the case
on AARCH64 or Apple M1).

Test that PyTorch is usable after NumPy is uninstalled, at the end of the
`_test1` CI config.
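
Illustration of the behavior described above, assuming NumPy is absent from the environment (assumption: NumPy-dependent calls raise a RuntimeError after the one-time warning rather than breaking the torch import):

```
import torch             # succeeds even without NumPy installed

t = torch.ones(3)
try:
    t.numpy()            # first NumPy-dependent call triggers the lazy initialization
except RuntimeError as err:
    print("NumPy interop unavailable:", err)
```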

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52794

Reviewed By: seemethere

Differential Revision: D26650509

Pulled By: malfet

fbshipit-source-id: a2d98769ef873862c3704be4afda075d76d3ad06
2021-02-25 19:45:00 -08:00
Bel H
db33afbf9f Change cmake to allow building with MLC kick-off build (#51326)
Summary:
- Allows the build process to build with MLC enabled if the subrepo folder `mlc` is in the path and we can link against ML Compute on macOS Big Sur
- To build with MLC enabled, you will need to clone the `mlc` repo inside the pytorch repository.
- We need both this change and https://github.com/pytorch/pytorch/pull/50634 on pytorch/pytorch to enable the `mlc` device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51326

Reviewed By: glaringlee

Differential Revision: D26533138

Pulled By: malfet

fbshipit-source-id: 0baa06b4eb2d62dbfc0f6fc922096cb0db1cc7d1
2021-02-19 13:04:25 -08:00
Kimish Patel
a6e94d274f [Pytorch] Add python binding to use mobile cpu allocator. (#52323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52323

Using the default cpu allocator for ops executed on the qnnpack backend will result in
asan failures with heap overflow since qnnpack (and xnnpack) can access input
beyond its end and/or beginning.

Here we are enabling this feature specifically to enable the dynamic sparse linear op test
using the qnnpack engine. In the dynamic linear op, the fp32 bias is not packed and
hence can result in out-of-bounds access.

Test Plan: test_set_default_mobile_cpu_allocator.py

Reviewed By: z-a-f

Differential Revision: D26263481

fbshipit-source-id: a49227cac7e6781b0db4a156ca734d7671972d9f
2021-02-17 08:42:23 -08:00
mattip
b97a040f71 ENH: toggle TORCH_WARN_ONCE to TORCH_WARN for tests (#48560)
Summary:
Toward fixing https://github.com/pytorch/pytorch/issues/47624

~Step 1: add `TORCH_WARN_MAYBE` which can either warn once or every time in c++, and add a c++ function to toggle the value.
Step 2 will be to expose this to python for tests. Should I continue in this PR or should we take a different approach: add the python level exposure without changing any c++ code and then over a series of PRs change each call site to use the new macro and change the tests to make sure it is being checked?~

Step 1: add a python and c++ toggle to convert TORCH_WARN_ONCE into TORCH_WARN so the warnings can be caught in tests
Step 2: add a python-level decorator to use this toggle in tests
Step 3: (in future PRs): use the decorator to catch the warnings instead of `maybeWarnsRegex`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48560

Reviewed By: ngimel

Differential Revision: D26171175

Pulled By: mruberry

fbshipit-source-id: d83c18f131d282474a24c50f70a6eee82687158f
2021-02-08 08:21:19 -08:00
Will Constable
f2e41257e4 Back out "Revert D26077905: Back out "Revert D25850783: Add torch::deploy, an embedded torch-python interpreter"" (#51267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51267

Original commit changeset: b70185916502

Test Plan: test locally, oss ci-all, fbcode incl deferred

Reviewed By: suo

Differential Revision: D26121251

fbshipit-source-id: 4315b7fd5476914c8e5d6f547e1cfbcf0c227781
2021-01-28 19:30:45 -08:00
Richard Zou
1379842f4a Add private mechanism to toggle vmap fallback warnings (#51218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51218

Fixes #51144.

Context
=======

Users have complained about warning spam from batched gradient
computation. This warning spam happens because warnings in C++ don't
correctly get turned into Python warnings when those warnings arise from
the autograd engine.

To work around that, this PR adds a mechanism to toggle vmap warnings.
By default, the vmap fallback will not warn when it is invoked. However,
by using `torch._C._debug_only_display_vmap_fallback_warnings(enabled)`,
one can toggle the existence of vmap fallback warnings.

This API is meant to be a private, debug-only API. The goal is to be
able to non-intrusively collect feedback from users to improve
performance on their workloads.
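
A minimal sketch of the toggle described above:

```
import torch

# Debug-only toggle named in this PR; off by default, so the fallback stays silent.
torch._C._debug_only_display_vmap_fallback_warnings(True)
# ... run vmap'ed code that hits the fallback; warnings are now emitted ...
torch._C._debug_only_display_vmap_fallback_warnings(False)
```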

What this PR does
=================

This PR adds an option to toggle vmap warnings. The mechanism is
toggling a bool in ATen's global context.

There are some other minor changes:
- This PR adds a more detailed explanation of performance cliffs to the
autograd.functional.{jacobian, hessian} documentation
- A lot of the vmap tests in `test_vmap.py` rely on the fallback warning
to test the presence of the fallback. In test_vmap, I added a context
manager to toggle on the fallback warning while testing.

Alternatives
============

I listed a number of alternatives in #51144. My favorite one is having a new
"performance warnings mode" (this is currently a WIP by some folks on
the team). This PR is to mitigate the problem of warning spam before
a "performance warnings mode" gets shipped into PyTorch

Concerns
========

I am concerned that we are advertising a private API
(`torch._C._debug_only_display_vmap_fallback_warnings(enabled)`) in the
PyTorch documentation. However, I hope the naming makes it clear to
users that they should not rely on this API (and I don't think they have
any reason to rely on the API).

Test Plan
=========

Added tests in `test_vmap.py` to check:
- by default, the fallback does not warn
- we can toggle whether the fallback warns or not

Test Plan: Imported from OSS

Reviewed By: pbelevich, anjali411

Differential Revision: D26126419

Pulled By: zou3519

fbshipit-source-id: 95a97f9b40dc7334f6335a112fcdc85dc03dcc73
2021-01-28 13:05:00 -08:00
Mike Ruberry
12a434abbc Revert D26077905: Back out "Revert D25850783: Add torch::deploy, an embedded torch-python interpreter"
Test Plan: revert-hammer

Differential Revision:
D26077905 (dc2a44c4fc)

Original commit changeset: fae83bf9822d

fbshipit-source-id: b70185916502ba9ebe16d781cf0659b9f7865c9a
2021-01-27 19:53:29 -08:00
Will Constable
dc2a44c4fc Back out "Revert D25850783: Add torch::deploy, an embedded torch-python interpreter" (#51124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51124

Original commit changeset: 1c7133627da2

Test Plan: Test locally with interpreter_test and on CI

Reviewed By: suo

Differential Revision: D26077905

fbshipit-source-id: fae83bf9822d79e9a9b5641bc5191a7f3fdea78d
2021-01-27 16:49:42 -08:00
Mike Ruberry
e843974a6e Revert D25850783: Add torch::deploy, an embedded torch-python interpreter
Test Plan: revert-hammer

Differential Revision:
D25850783 (3192f9e4fe)

Original commit changeset: a4656377caff

fbshipit-source-id: 1c7133627da28fb12848da7a9a46de6d3b2b67c6
2021-01-26 02:07:44 -08:00
Will Constable
3192f9e4fe Add torch::deploy, an embedded torch-python interpreter (#50458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50458

libinterpreter.so contains a frozen python distribution including
torch-python bindings.

Freezing refers to serializing bytecode of python standard library modules as
well as the torch python library and embedding them in the library code.  This
library can then be dlopened multiple times in one process context, each
interpreter having its own python state and GIL.  In addition, each python
environment is sealed off from the filesystem and can only import the frozen
modules included in the distribution.

This change relies on newly added frozenpython, a cpython 3.8.6 fork built for this purpose.  Frozenpython provides libpython3.8-frozen.a which
contains frozen bytecode and object code for the python standard library.

Building on top of frozen python, the frozen torch-python bindings are added in
this diff, providing each embedded interpreter with a copy of the torch
bindings.  Each interpreter is intended to share one instance of libtorch and
the underlying tensor libraries.

Known issues

- Autograd is not expected to work with the embedded interpreter currently, as it manages
its own python interactions and needs to coordinate with the duplicated python
states in each of the interpreters.
- Distributed and CUDA functionality is disabled in the libinterpreter.so build; this needs to be revisited
- __file__ is not supported in the context of embedded python since there are no
files for the underlying library modules.
- __version__ is not properly supported in the embedded torch-python; just a
workaround for now

Test Plan: tested locally and on CI with cmake and buck builds running torch::deploy interpreter_test

Reviewed By: ailzhang

Differential Revision: D25850783

fbshipit-source-id: a4656377caff25b73913daae7ae2f88bcab8fd88
2021-01-25 15:14:28 -08:00
Kurt Mohler
8ab1a1495d Rename set_deterministic to use_deterministic_algorithms (#49904)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49100
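
Minimal sketch of the renamed API:

```
import torch

# New name introduced by this PR (previously torch.set_deterministic).
torch.use_deterministic_algorithms(True)
```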

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49904

Reviewed By: ezyang, mrshenli

Differential Revision: D25956761

Pulled By: mruberry

fbshipit-source-id: 86a59289d50825a0ebbd7c358b483c8d8039ffa6
2021-01-22 11:27:07 -08:00
Taylor Robie
d31a760be4 move has_torch_function to C++, and make a special case object_has_torch_function (#48965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48965

This PR pulls `__torch_function__` checking entirely into C++, and adds a special `object_has_torch_function` method for ops which only have one arg as this lets us skip tuple construction and unpacking. We can now also do away with the Python side fast bailout for `Tensor` (e.g. `if any(type(t) is not Tensor for t in tensors) and has_torch_function(tensors)`) because they're actually slower than checking with the Python C API.

Test Plan: Existing unit tests. Benchmarks are in #48966

Reviewed By: ezyang

Differential Revision: D25590732

Pulled By: robieta

fbshipit-source-id: 6bd74788f06cdd673f3a2db898143d18c577eb42
2021-01-10 19:23:35 -08:00
Qifan Lu
cfc3db0ca9 Remove THPWrapper (#49871)
Summary:
Remove `THPWrapper` from the PyTorch C code since it is not used anymore. Because we have dropped Python 2 compatibility, its usage can be replaced by capsule objects (`PyCapsule_New`, `PyCapsule_CheckExact`, `PyCapsule_GetPointer` and `PyCapsule_GetDestructor`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49871

Reviewed By: mruberry

Differential Revision: D25715038

Pulled By: albanD

fbshipit-source-id: cc3b6f967bbe0dc42c692adf76dff4e4b667fdd5
2020-12-30 03:01:52 -08:00
peterjc123
815d38395a PyLong_{As/From}{Long/UnsignedLong} lint checks (#49280)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49280

Reviewed By: mruberry

Differential Revision: D25592330

Pulled By: ezyang

fbshipit-source-id: 5c16d6aed88ad1feaa7f129b4cd44c0561be2de2
2020-12-17 09:32:08 -08:00
Michael Carilli
c068180a17 [CUDA graphs] Cuda RNG-safe graph capture and replay bindings (#48875)
Summary:
Part 2 of https://github.com/pytorch/pytorch/pull/46148 refactor.  (part 1 was https://github.com/pytorch/pytorch/pull/48694.)
Contains
- a few more CUDAGeneratorImpl diffs to clean up graph capture interaction
- Capture and replay bindings that interact correctly with CUDAGeneratorImpl
- Tests.

Diffs compile and tests pass on my machine (ubuntu 20.04, cuda 11.0) but it needs finetuning for many CI builds.

See Note [CUDA Graph-safe RNG states] in `aten/src/ATen/CUDAGeneratorImpl.h` (commit 02d89f9f1d, L13-L85) for the strategy, based on https://github.com/pytorch/pytorch/pull/46148#issuecomment-724414794.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48875

Reviewed By: zou3519

Differential Revision: D25482654

Pulled By: ngimel

fbshipit-source-id: 634dbc4c6c9d7d0d9a62dc81a52d430561f905fe
2020-12-14 10:51:58 -08:00
Taylor Robie
27905dfe9c Expose CXX_FLAGS through __config__ (#47861)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47861

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25199263

Pulled By: robieta

fbshipit-source-id: 3cfdb0485d686a03a68dd0907d1733634857963f
2020-12-01 19:58:29 -08:00
Nikita Shulga
2b6a720eb1 Update pybind to 2.6.0 (#46415)
Summary:
Preserve PYBIND11 configuration options in `torch._C._PYBIND11_COMPILER_TYPE` and use them when building extensions

Also, use f-strings in `torch.utils.cpp_extension`

"Fixes" https://github.com/pytorch/pytorch/issues/46367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46415

Reviewed By: VitalyFedyunin

Differential Revision: D24605949

Pulled By: malfet

fbshipit-source-id: 87340f2ed5308266a46ef8f0317316227dab9d4d
2020-10-29 10:53:47 -07:00
Nikita Shulga
42a51148c1 Use f-strings in torch.utils.cpp_extension (#47025)
Summary:
Plus two minor fixes to `torch/csrc/Module.cpp`:
 - Use iterator of type `Py_ssize_t` for array indexing in `THPModule_initNames`
 - Fix clang-tidy warning of unneeded defaultGenerator copy by capturing it as `const auto&`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47025

Reviewed By: samestep

Differential Revision: D24605907

Pulled By: malfet

fbshipit-source-id: c276567d320758fa8b6f4bd64ff46d2ea5d40eff
2020-10-28 21:32:33 -07:00
albanD
143d1fd9f5 Namespace cleanup for 1.7 Part 2 (#46673)
Summary:
make valgrind_toggle and valgrind_supported_platform private functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46673

Reviewed By: gchanan

Differential Revision: D24458133

Pulled By: albanD

fbshipit-source-id: 6f3fad9931d73223085edbd3cd3b7830c569570c
2020-10-22 07:57:51 -07:00
Pritam Damania
2b221a9599 Remove PyCFunction casts as much as possible. (#46227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46227

Follow up from https://github.com/pytorch/pytorch/issues/45419, in
this PR I've removed as many PyCFunction casts as I could from the codebase.

The only ones I didn't remove were the ones with `METH_VARARGS | METH_KEYWORDS`
which have 3 parameters instead of 2 and had to be cast. Example: `
{"copy_", (PyCFunction)(void(*)(void))THPStorage_(copy_), METH_VARARGS |
METH_KEYWORDS, nullptr},`
ghstack-source-id: 114632704

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D24269435

fbshipit-source-id: 025cfd43a9a2a3e59f6b2951c1a78749193d77cf
2020-10-20 15:01:51 -07:00
Michael Ranieri
b1d24dded1 make a way to disable callgrind (#46116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46116

Ideally I would just use one of the existing preprocessor flags such as `FBCODE_CAFFE2`, but this implies a whole bunch of other things elsewhere, so it is not really a solution for ovrsource.

Test Plan: CI green, we are able to disable it internally with `-DNVALGRIND`

Reviewed By: malfet

Differential Revision: D24227360

fbshipit-source-id: 24a3b393cf46d6a16acca0a9ec52610d4bb8704f
2020-10-13 16:18:04 -07:00
chengjun
5741de883a Define the record_stream method in native_functions.yaml (#44301)
Summary:
The record_stream method was hard-coded for the CUDA device. Define record_stream in native_functions.yaml to enable dynamic dispatch to different end devices.

Fixes https://github.com/pytorch/pytorch/issues/36556
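
For context, a minimal sketch of the Python-level `record_stream` usage that this change routes through the dispatcher (illustrative only; the call itself predates this PR):

```
import torch

if torch.cuda.is_available():
    side_stream = torch.cuda.Stream()
    x = torch.empty(1024, device="cuda")    # allocated on the current (default) stream
    with torch.cuda.stream(side_stream):
        y = x * 2                           # x is consumed on a different stream
        x.record_stream(side_stream)        # tell the caching allocator about that use
```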

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301

Reviewed By: glaringlee

Differential Revision: D23763954

Pulled By: ezyang

fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
2020-10-13 09:15:22 -07:00
Pritam Damania
6e43f0db8b Use correct signatures for METH_NOARGS. (#45528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45528

As described in https://github.com/pytorch/pytorch/issues/45419,
resolving a bunch of cpython signature issues.

#Closes: https://github.com/pytorch/pytorch/issues/45419
ghstack-source-id: 113385726

Test Plan: sentinel

Reviewed By: albanD

Differential Revision: D24000626

fbshipit-source-id: d334596f1f0256063691aa044c8fb2face260817
2020-10-02 10:43:58 -07:00
Supriya Rao
04526a49d3 [quant] creating quint4x2 dtype for quantized tensors (#44678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44678

This is a prototype PR that introduces 4-bit qtensors. The new dtype added for this is c10::quint4x2.
The underlying storage for this is still uint8_t, so we pack 2 4-bit values in a byte while quantizing it.

This change uses most of the existing scaffolding for qtensor storage. We allocate storage
based on the dtype before creating a new qtensor.

It also adds a dispatch mechanism for this dtype so we can use this to get the bitwidth, qmin and qmax info
while quantizing and packing the qtensor (when we add 2-bit qtensor)

Kernels that use this dtype should be aware of the packing format.

Test Plan:
Locally tested
```
x = torch.ones((100, 100), dtype=torch.float)
qx_8bit = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint8)
qx = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint4x2)

torch.save(x, "temp.p")
print('Size float (B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx_8bit, "temp.p")
print('Size quantized 8bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx, "temp.p")
print('Size quantized 4bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
```

Size float (B): 40760
Size quantized 8bit(B): 10808
Size quantized 4bit(B): 5816

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23993134

fbshipit-source-id: 073bf262f9680416150ba78ed2d932032275946d
2020-10-01 23:53:34 -07:00
Taylor Robie
2b13d9413e Re-land: Add callgrind collection to Timer #44717 (#45586)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45586

Test Plan: The unit test has been softened to be less platform sensitive.

Reviewed By: mruberry

Differential Revision: D24025415

Pulled By: robieta

fbshipit-source-id: ee986933b984e736cf1525e1297de6b21ac1f0cf
2020-09-30 17:43:06 -07:00
Mike Ruberry
51d0ae9207 Revert D24010742: [pytorch][PR] Add callgrind collection to Timer
Test Plan: revert-hammer

Differential Revision:
D24010742 (9b27e0926b)

Original commit changeset: df6bc765f8ef

fbshipit-source-id: 4c1edd57ea932896f7052716427059c924222501
2020-09-30 10:15:46 -07:00
Taylor Robie
9b27e0926b Add callgrind collection to Timer (#44717)
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:

A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.

Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
    "x.backward()",
    setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()

for c, fn in counts[:20]:
    print(f"{c:>12}  {fn}")
```

```
      812800  ???:_dl_update_slotinfo
      355600  ???:update_get_addr
      308300  work/Python/ceval.c:_PyEval_EvalFrameDefault'2
      304800  ???:__tls_get_addr
      196059  ???:_int_free
      152400  ???:__tls_get_addr_slow
      138400  build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
      126526  work/Objects/dictobject.c:_PyDict_LoadGlobal
      114268  ???:malloc
      101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
       85900  work/Python/ceval.c:_PyEval_EvalFrameDefault
       79946  work/Objects/typeobject.c:_PyType_Lookup
       72000  build/../c10/core/Device.h:c10::Device::validate()
       70000  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
       66400  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
       63000  ???:pthread_mutex_lock
       61200  work/Objects/dictobject.c:PyDict_GetItem
       59800  ???:free
       58400  work/Objects/tupleobject.c:tupledealloc
       56707  work/Objects/dictobject.c:lookdict_unicode_nodummy
```

Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions:  {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
    _ = count_dict.setdefault(fn, 0)
    count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
    print(f"{c:>8}  {fn}")
```

```
Head instructions: 7609547
1.6 instructions:  6059648
  169600  ???:_dl_update_slotinfo
  101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
   74200  ???:update_get_addr
   63600  ???:__tls_get_addr
   46800  work/Python/ceval.c:_PyEval_EvalFrameDefault
   33512  work/Objects/dictobject.c:_PyDict_LoadGlobal
   31800  ???:__tls_get_addr_slow
   31700  build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
   28300  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
   27800  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
   27401  work/Objects/dictobject.c:lookdict_unicode_nodummy
   24115  work/Objects/typeobject.c:_PyType_Lookup
   24080  ???:_int_free
   21700  work/Objects/dictobject.c:PyDict_GetItemWithError
   20700  work/Objects/dictobject.c:PyDict_GetItem
          ...
   -3200  build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
   -3400  build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
   -3500  /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
   -3700  build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
   -4207  work/Objects/obmalloc.c:PyMem_Calloc
   -4500  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
   -4800  build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
   -5000  build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
   -5300  work/Objects/listobject.c:PyList_New
   -5400  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
   -5600  /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
   -6231  work/Objects/obmalloc.c:PyMem_Free
   -6300  work/Objects/listobject.c:list_repeat
  -11200  work/Objects/listobject.c:list_dealloc
  -28900  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```

Remaining TODOs:
  * Include a timer in the generated script for cuda sync.
  * Add valgrind to CircleCI machines and add a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717

Reviewed By: soumith

Differential Revision: D24010742

Pulled By: robieta

fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
2020-09-30 05:52:54 -07:00
gunandrose4u
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
Fixes #{issue number}
This is a resubmit of PR https://github.com/pytorch/pytorch/issues/42897, together with a fix for the Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
Mike Ruberry
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
gunandrose4u
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

The test case part will be committed to this PR later.

mrshenli, please help to review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
Gao, Xiang
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
Richard Zou
e8f4b04d9a vmap: temporarily disable support for random functions (#42617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42617

While we figure out the random plan, I want to initially disable
support for random operations. This is because there is an ambiguity in
what randomness means. For example,

```
tensor = torch.zeros(B0, 1)
vmap(lambda t: t.normal_())(tensor)
```

in the above example, should tensor[0] and tensor[1] be equal (i.e.,
use the same random seed), or should they be different?

The mechanism for disabling random support is as follows:
- We add a new dispatch key called VmapMode
- Whenever we're inside vmap, we enable VmapMode for all tensors.
This is done via at::VmapMode::increment_nesting and
at::VmapMode::decrement_nesting.
- DispatchKey::VmapMode's fallback kernel is the fallthrough kernel.
- We register kernels that raise errors for all random functions on
DispatchKey::VmapMode. This way, whenever someone calls a random
function on any tensor (not just BatchedTensors) inside of a vmap block,
an error gets thrown.

Test Plan: - pytest test/test_vmap.py -v -k "Operators"

Reviewed By: ezyang

Differential Revision: D22954840

Pulled By: zou3519

fbshipit-source-id: cb8d71062d4087e10cbf408f74b1a9dff81a226d
2020-08-11 07:19:51 -07:00
Mike Ruberry
9c8021c0b1 Adds torch.linalg namespace (#42664)
Summary:
This PR adds the `torch.linalg` namespace as part of our continued effort to be more compatible with NumPy. The namespace is tested by adding a single function, `torch.linalg.outer`, and testing it in a new test suite, test_linalg.py. It follows the same pattern that https://github.com/pytorch/pytorch/pull/41911, which added the `torch.fft` namespace, did.
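
A usage sketch against a build containing this PR (assumption: `torch.linalg.outer` behaves like the existing `torch.ger`; later releases expose `torch.outer` instead, so treat this as illustrative only):

```
import torch

a = torch.arange(3.0)
b = torch.arange(4.0)
print(torch.linalg.outer(a, b))   # 3x4 outer product
```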

Future PRs will likely:

- add more functions to torch.linalg
- expand the testing done in test_linalg.py, including legacy functions, like torch.ger
- deprecate existing linalg functions outside of `torch.linalg` in preference to the new namespace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42664

Reviewed By: ngimel

Differential Revision: D22991019

Pulled By: mruberry

fbshipit-source-id: 39258d9b116a916817b3588f160b141f956e5d0b
2020-08-07 10:18:30 -07:00
Mike Ruberry
ccfce9d4a9 Adds fft namespace (#41911)
Summary:
This PR creates a new namespace, torch.fft (torch::fft) and puts a single function, fft, in it. This function is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.

Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:

```
import torch.fft

t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```

See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911

Reviewed By: glaringlee

Differential Revision: D22941894

Pulled By: mruberry

fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
2020-08-06 00:20:50 -07:00
Hameer Abbasi
3d46e02ea1 Add __torch_function__ for methods (#37091)
Summary:
According to pytorch/rfcs#3

From the goals in the RFC:

1. Support subclassing `torch.Tensor` in Python (done here)
2. Preserve `torch.Tensor` subclasses when calling `torch` functions on them (done here)
3. Use the PyTorch API with `torch.Tensor`-like objects that are _not_ `torch.Tensor`
   subclasses (done in https://github.com/pytorch/pytorch/issues/30730)
4. Preserve `torch.Tensor` subclasses when calling `torch.Tensor` methods. (done here)
5. Propagating subclass instances correctly also with operators, using
   views/slices/indexing/etc. (done here)
6. Preserve subclass attributes when using methods or views/slices/indexing. (done here)
7. A way to insert code that operates on both functions and methods uniformly
   (so we can write a single function that overrides all operators). (done here)
8. The ability to give external libraries a way to also define
   functions/methods that follow the `__torch_function__` protocol. (will be addressed in a separate PR)

This PR makes the following changes:

1. Adds the `self` argument to the arg parser.
2. Dispatches on `self` as well if `self` is not `nullptr`.
3. Adds a `torch._C.DisableTorchFunction` context manager to disable `__torch_function__`.
4. Adds a `torch::torch_function_enabled()` and `torch._C._torch_function_enabled()` to check the state of `__torch_function__`.
5. Dispatches all `torch._C.TensorBase` and `torch.Tensor` methods via `__torch_function__`.
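
A minimal sketch of what these changes enable (the `LoggedTensor` subclass is hypothetical; the `DisableTorchFunction` and `_torch_function_enabled` names are the ones listed above):

```
import torch

class LoggedTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"intercepted {func.__name__}")
        with torch._C.DisableTorchFunction():   # skip re-dispatch inside the handler
            return func(*args, **kwargs)

t = torch.ones(3).as_subclass(LoggedTensor)
s = t.sum()                                  # Tensor *methods* now dispatch via __torch_function__
print(torch._C._torch_function_enabled())    # query the global state
```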

TODO:

- [x] Sequence Methods
- [x] Docs
- [x] Tests

Closes https://github.com/pytorch/pytorch/issues/28361

Benchmarks in https://github.com/pytorch/pytorch/pull/37091#issuecomment-633657778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37091

Reviewed By: ngimel

Differential Revision: D22765678

Pulled By: ezyang

fbshipit-source-id: 53f8aa17ddb8b1108c0997f6a7aa13cb5be73de0
2020-08-05 20:44:13 -07:00
Xiang Gao
23174ca71b [reland] Enable TF32 support for cuBLAS (#41498)
Summary:
fix rocm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41498

Reviewed By: mruberry

Differential Revision: D22560572

Pulled By: ngimel

fbshipit-source-id: 5ee79e96cb29e70d9180830d058efb53d1c6c041
2020-07-15 21:00:55 -07:00
Shen Li
3a63a939d4 Revert D22517785: [pytorch][PR] Enable TF32 support for cuBLAS
Test Plan: revert-hammer

Differential Revision:
D22517785 (288ece89e1)

Original commit changeset: 87334c893561

fbshipit-source-id: 0a0674f49c1bcfc98f7f88af5a8c7de93b76e458
2020-07-15 08:15:48 -07:00
Xiang Gao
288ece89e1 Enable TF32 support for cuBLAS (#40800)
Summary:
Benchmark on a fully connected network and torchvision models (time in seconds) on GA100:

| model              | batch size | forward(TF32) | forward(FP32) | backward(TF32) | backward(FP32) |
|--------------------|------------|---------------|---------------|----------------|----------------|
| FC 512-128-32-8    | 512        | 0.000211      | 0.000321      | 0.000499       | 0.000532       |
| alexnet            | 512        | 0.0184        | 0.0255        | 0.0486         | 0.0709         |
| densenet161        | 128        | 0.0665        | 0.204         | 0.108          | 0.437          |
| googlenet          | 256        | 0.0925        | 0.110         | 0.269          | 0.326          |
| inception_v3       | 256        | 0.155         | 0.214         | 0.391          | 0.510          |
| mnasnet1_0         | 512        | 0.108         | 0.137         | 0.298          | 0.312          |
| mobilenet_v2       | 512        | 0.114         | 0.294         | 0.133          | 0.303          |
| resnet18           | 512        | 0.0722        | 0.100         | 0.182          | 0.228          |
| resnext50_32x4d    | 256        | 0.170         | 0.237         | 0.373          | 0.479          |
| shufflenet_v2_x1_0 | 512        | 0.0463        | 0.0473        | 0.125          | 0.123          |
| squeezenet1_0      | 512        | 0.0870        | 0.0948        | 0.205          | 0.214          |
| vgg16              | 256        | 0.167         | 0.234         | 0.401          | 0.502          |
| wide_resnet50_2    | 512        | 0.186         | 0.310         | 0.415          | 0.638          |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40800

Reviewed By: mruberry

Differential Revision: D22517785

Pulled By: ngimel

fbshipit-source-id: 87334c8935616f72a6af5abbd3ae69f76923dc3e
2020-07-14 13:21:10 -07:00
Kurt Mohler
124cdf2290 Add experimental deterministic flag (#38683)
Summary:
Adds `torch.experimental.deterministic` flag to enforce deterministic algorithms across all of pytorch.
Adds `torch.experimental.deterministic_error_level` to allow users to choose between error/warning/silent if determinism for an operation is not available.
Adds `torch.experimental.alert_not_deterministic()` which should be called within operations that are not deterministic.
Offers both Python and ATen interfaces

Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38683

Differential Revision: D21998093

Pulled By: ezyang

fbshipit-source-id: 23aabbddd20f6199d846f97764ff24d728163737
2020-06-12 08:44:06 -07:00
Wanchao Liang
d493918436 [dist_autograd] expose distributed backward C++ API (#38656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38656

Test Plan: Imported from OSS

Differential Revision: D21940441

Pulled By: wanchaol

fbshipit-source-id: e9d35201825912f5e7d7e1d0a71586abe5a6f71c
2020-06-08 19:42:21 -07:00
Edward Yang
4d880c0693 Device and torch._C function cleanup (#38173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38173

- Introduce torch.types.Device representing all "device-like" types
- Stubbed torch.device.__reduce__
- Stubbed all torch._C functions comprehensively
- Deleted _safe_call which is unused throughout the codebase

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21497399

Pulled By: ezyang

fbshipit-source-id: 1f534442b0ec9a70d556545d072f2c06a08b9d15
2020-06-03 19:17:22 -07:00
Nikita Shulga
a864dbb360 Make _C extension a thin C wrapper (#39375)
Summary:
It just depends on a single `torch_python` library.
The C library does not depend on the standard C++ library, and as a result it closes https://github.com/pytorch/pytorch/issues/36941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39375

Reviewed By: orionr

Differential Revision: D21840645

Pulled By: malfet

fbshipit-source-id: 777c189feee9d6fc686816d92cb9f109b8aac7ca
2020-06-02 13:11:59 -07:00
David Reiss
6d642a6f6c Remove (most) Python 2 support from C++ code (#35614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35614

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well.

Test Plan: CI

Differential Revision: D20842876

Pulled By: dreiss

fbshipit-source-id: 18abf0d324ed2185ec6d27c864e935d856dcc6ad
2020-05-14 15:01:49 -07:00
Peter Bell
5137827ad0 Lazily initialise thread local num_threads value (#37461)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37259, fixes https://github.com/pytorch/pytorch/issues/20156

This lazily calls `at::init_num_threads` once for each thread by adding a call to `lazy_init_num_threads` in `at::parallel_for` and `at::parallel_reduce`.

If this solution is okay, then we should add the same to guard other places that might use MKL or OpenMP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37461

Reviewed By: ezyang

Differential Revision: D21472763

Pulled By: ilia-cher

fbshipit-source-id: 889d6664f5bd4080037ade02ee324b1233992915
2020-05-11 13:24:45 -07:00
anjali411
1f09f7ea44 Python API for Complex Storage and storage copy logic (#35771)
Summary:
Following up on https://github.com/pytorch/pytorch/pull/35851: cross-dtype storage copy is not being used internally, so I have not included cross-dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771

Differential Revision: D21319650

Pulled By: anjali411

fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
2020-05-01 11:47:22 -07:00
Gregory Chanan
287f3b746e Remove Backend -> THPLayout mapping. (#37527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37527

This is yet another place that needs to be updated for adding a new "Backend" and is unnecessary.  Instead, just use layout_from_backend and have a map from Layout -> THPLayout.

Other changes:
- rename torch::getDtype and torch::getLayout to torch::getTHPDtype and torch::getTHPLayout since e.g. for layout you are both passing in and returning a "layout" type.
- add NumOptions to Layout to match the dtype/ScalarType formulation.

Test Plan: Imported from OSS

Differential Revision: D21309836

Pulled By: gchanan

fbshipit-source-id: ede0e4f3bf7ff2cd04a9b17df020f0d4fd654ba3
2020-04-30 11:11:09 -07:00
Edward Yang
9e3605de98 [RELAND] New operator registration API (#35061) (#35629)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/35061 ; removed
the get qualified type name magic from debug strings to work around
MSVC 2017 bug.

Main points of the new API:

- You can register implementations (impl) without having to specify a schema.
- Registrations are commutative, so no matter what order your static
  initializers run, you end up with the same end result.

op_registration_test.cpp contains a reasonably comprehensive accounting
for the available API surface

How does this implementation proceed?  The basic concept is to relax the
internal invariants of Dispatcher data structures to allow the
possibility that a FunctionSchema is not specified in an Operator.

- DispatchKeyExtractor has an uninitialized state where it doesn't look
  for dispatch keys in any arguments of the stack.  It can have a
  schema (de)registered to itself post facto with
  registerSchema/unregisterSchema.
- DispatchTable has a new constructor taking only an OperatorName for
  the uninitialized state.  It can have a schema (de)registered to itself
  post facto with registerSchema/unregisterSchema
- OperatorDef maintains counts of both defs as well as defs_and_impls.
  defs_and_impls keeps track of the outstanding impl registrations; you
  may have impl registrations but no defs.  If there are no defs (no
  schema), the operator is not returned by findSchema.  A new
  findOperatorByName function unconditionally returns the OperatorHandle
  even if there's no schema.  OperatorHandle::hasSchema can be used
  to check if the operator has schema.
- Replaced 'registerKernel' with 'registerImpl', which is the new
  interface for directly registering kernels without implementations.
- Because 'registerImpl' no longer requires an OperatorHandle, change
  'registerDef' to only return a RegistrationHandleRAII.  This is marginally
  less efficient (since we're doing two hash table lookups on a registration
  now), but this won't matter in the long term, and probably doesn't
  matter now either.
- Rename registerBackendFallbackKernel to registerFallback (this exposed
  a bunch of places where we're improperly directly interfacing with Dispatcher;
  we need to add this capability to the true public API)
- All code generated internal registrations are switched to use the new
  API.  This includes VariableType registrations (which previously
  weren't converted) and the mobile autograd stuff
- Switch the new-style def()/impl() APIs to interact directly with Dispatcher,
  rather than indirecting through the old API
- We deleted alias analysis kind merging entirely.  As a nod to BC, it's
  possible to define a full schema with alias analysis kind, and then
  later do another full schema def with missing alias analysis kind, but
  the opposite direction is not allowed.  We can remove this entirely
  following the plan at https://github.com/pytorch/pytorch/issues/35040
- Schema matching is moved inside the dispatcher, because we might not
  be able to immediately schema match at the point of an impl() (because
  we don't have the schema yet).  To do this, we store the inferred
  function schema inside a KernelEntry, so we can check it when we get
  the real schema.
- Registered kernel functions now store a debug string which
  can be used to more easily identify them.  Tests use this to
  distinguish between multiple distinct registrations; regular
  invocations get only very basic information.

Because we need our static initializers to work no matter what order
they're run, the testing strategy on this PR is quite involved.

The general concept:
- Bind a (very gimped) version of the dispatcher API from Python,
  so that we can easily write a more complex testing harness
  using expect tests.
- For series of registrations we want to test, exhaustively
  test every possible permutation of registrations (and
  deregistrations), and show that the intermediate states
  agree no matter what path is taken.
- Intermediate states are rendered using a new dumpState()
  debugging method that prints the internal state of the
  dispatcher.  This method may be generally useful for people
  who want to see what's in the dispatcher.
- Simultaneously, add a new invariant testing function which
  checks that the internal invariants of the dispatcher are
  upheld (so we don't have to print internal implementation
  details of the dispatcher)

The testing framework found a few bugs in development.  For example,
here is a case where we registered schema too early, before checking
if it was valid:

```
Traceback (most recent call last):
  File "test/test_dispatch.py", line 164, in test_def_impl_schema_mismatch
    ], raises=True)
  File "test/test_dispatch.py", line 135, in commute
    results=results, raises=raises)
  File "test/test_dispatch.py", line 83, in run_permutation
    .format(ctor_order[:i], op_ix))
  File "test/test_dispatch.py", line 59, in check_invariants
    .format(expected_provenance, actual_provenance)
AssertionError: 'name[16 chars]ema: (none)\ncatchall: boxed unboxed :: (Tenso[18 chars]0)\n' != 'name[16 chars]ema: test::foo(Tensor x, Tensor y) -> (Tensor)[53 chars]0)\n'
  name: test::foo
- schema: (none)
+ schema: test::foo(Tensor x, Tensor y) -> (Tensor)
  catchall: boxed unboxed :: (Tensor _0) -> (Tensor _0)
 : expected from running ctors (1,); actual from running ctors (1,) and then failing to run ctor 0 (did this failure leave the dispatcher in a wedged state? it shouldn't!)
```

There are also C++ smoketests for the API.  These tests comprehensively
cover the C++ API surface of the new operator registration API, but
don't check very hard if the API does the right thing (that's what
test_dispatch.py is for)

Some miscellaneous changes which could have been split into other
PRs, but I was too lazy to do so:

- Add torch::jit::parseName (mirroring parseSchema/parseSchemaOrName)
- Add cloneWithName functionality to FunctionSchema
- Unconditionally generate schema registration, even when type_method_dispatch
  is a dict.  The one exception is for manual registrations....
- Add fallback, CppFunction::makeFallthrough and
  CppFunction::makeFromBoxedFunction to public API of op_registration, so we can
  stop calling internal registerImpl directly
- Add new syntax sugar dispatch_autograd for registering autograd kernels
- Minor OperatorName cleanup, storing OperatorName in DispatchTable
  and defining operator<< on OperatorName
- Refactored the op registration API to take FunctionSchema directly.
  We now do namespacing by post facto fixing up the OperatorName
  embedded in FunctionSchema.  This also means that you can
  now do torch::import("ns1").def("ns2::blah") and have the ns2
  override ns1 (although maybe this is not the correct behavior.)
- New torch::schema public API, for attaching alias analysis kind
  annotation kinds.  This meant we had to template up some function
  signatures which previously took const char*.  There's now a nice
  comment explaining this strategy.
- torch::import now takes std::string which means we can use
  the namespacing from Python

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35629

Differential Revision: D20724551

Pulled By: ezyang

fbshipit-source-id: befa46a1affb4ec4ae1fb39e3564a63695a6ca41
2020-03-29 19:48:29 -07:00
Edward Yang
227beb9095 Revert D20680520: New operator registration API
Test Plan: revert-hammer

Differential Revision:
D20680520

Original commit changeset: 5d39a28e4ec7

fbshipit-source-id: 5b2497ffc24db9a05b01d526f161bc0164f9f707
2020-03-28 14:49:56 -07:00
Edward Yang
28ab8c6ff8 New operator registration API (#35061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35061

Main points of the new API:

- You can register implementations (impl) without having to specify a schema.
- Registrations are commutative, so no matter what order your static
  initializers run, you end up with the same end result.

op_registration_test.cpp contains a reasonably comprehensive accounting
for the available API surface

How does this implementation proceed?  The basic concept is to relax the
internal invariants of Dispatcher data structures to allow the
possibility that a FunctionSchema is not specified in an Operator.

- DispatchKeyExtractor has an uninitialized state where it doesn't look
  for dispatch keys in any arguments of the stack.  It can have a
  schema (de)registered to itself post facto with
  registerSchema/unregisterSchema.
- DispatchTable has a new constructor taking only an OperatorName for
  the uninitialized state.  It can have a schema (de)registered to itself
  post facto with registerSchema/unregisterSchema
- OperatorDef maintains counts of both defs as well as defs_and_impls.
  defs_and_impls keeps track of the outstanding impl registrations; you
  may have impl registrations but no defs.  If there are no defs (no
  schema), the operator is not returned by findSchema.  A new
  findOperatorByName fucntion unconditionally returns the OperatorHandle
  even if there's no schema.  OperatorHandle::hasSchema can be used
  to check if the operator has schema.
- Replaced 'registerKernel' with 'registerImpl', which is the new
  interface for directly registering kernels without implementations.
- Because 'registerImpl' no longer requires an OperatorHandle, change
  'registerDef' to only return a RegistrationHandleRAII.  This is marginally
  less efficient (since we're doing two hash table lookups on a registration
  now), but this won't matter in the long term, and probably doesn't
  matter now either.
- Rename registerBackendFallbackKernel to registerFallback (this exposed
  a bunch of places where we're improperly directly interfacing with Dispatcher;
  we need to add this capability to the true public API)
- All code generated internal registrations are switched to use the new
  API.  This includes VariableType registrations (which previously
  weren't converted) and the mobile autograd stuff
- Switch the new-style def()/impl() APIs to interact directly with Dispatcher,
  rather than indirecting through the old API
- We deleted alias analysis kind merging entirely.  As a nod to BC, it's
  possible to define a full schema with alias analysis kind, and then
  later do another full schema def with missing alias analysis kind, but
  the opposite direction is not allowed.  We can remove this entirely
  following the plan at https://github.com/pytorch/pytorch/issues/35040
- Schema matching is moved inside the dispatcher, because we might not
  be able to immediately schema match at the point of an impl() (because
  we don't have the schema yet).  To do this, we store the inferred
  function schema inside a KernelEntry, so we can check it when we get
  the real schema.
- Registered kernel functions now store a debug string which
  can be used to more easily identify them.  There's some best
  effort stuff based on __FUNCSIG__ but this is only really
  capable of reporting types and not function symbols.  Tests
  use this to distinguish between multiple distinct registrations.

Because we need our static initializers to work no matter what order
they're run, the testing strategy on this PR is quite involved.

The general concept:
- Bind a (very gimped) version of the dispatcher API from Python,
  so that we can easily write a more complex testing harness
  using expect tests.
- For series of registrations we want to test, exhaustively
  test every possible permutation of registrations (and
  deregistrations), and show that the intermediate states
  agree no matter what path is taken.
- Intermediate states are rendered using a new dumpState()
  debugging method that prints the internal state of the
  dispatcher.  This method may be generally useful for people
  who want to see what's in the dispatcher.
- Simultaneously, add a new invariant testing function which
  checks that the internal invariants of the dispatcher are
  upheld (so we don't have to print internal implementation
  details of the dispatcher)
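
In code, the exhaustive-permutation idea is roughly the following sketch; the
`ctors`, `dump_state` and `check_invariants` callables stand in for the actual
Python bindings used by test_dispatch.py, whose names differ.

```python
import itertools

def commute(ctors, expected_state, dump_state, check_invariants):
    """Run every registration order and every deregistration order, asserting
    that the dispatcher reaches the same state and never breaks an invariant."""
    n = len(ctors)
    for ctor_order in itertools.permutations(range(n)):
        for dtor_order in itertools.permutations(range(n)):
            handles = [None] * n
            for i in ctor_order:
                handles[i] = ctors[i]()      # perform one def()/impl()/fallback registration
                check_invariants()           # invariants must hold after every step
            assert dump_state() == expected_state  # end state is order-independent
            for i in dtor_order:
                handles[i].destroy()         # drop one registration (RAII in the real API)
                check_invariants()
```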

The testing framework found a few bugs in development.  For example,
here is a case where we registered schema too early, before checking
if it was valid:

```
Traceback (most recent call last):
  File "test/test_dispatch.py", line 164, in test_def_impl_schema_mismatch
    ], raises=True)
  File "test/test_dispatch.py", line 135, in commute
    results=results, raises=raises)
  File "test/test_dispatch.py", line 83, in run_permutation
    .format(ctor_order[:i], op_ix))
  File "test/test_dispatch.py", line 59, in check_invariants
    .format(expected_provenance, actual_provenance)
AssertionError: 'name[16 chars]ema: (none)\ncatchall: boxed unboxed :: (Tenso[18 chars]0)\n' != 'name[16 chars]ema: test::foo(Tensor x, Tensor y) -> (Tensor)[53 chars]0)\n'
  name: test::foo
- schema: (none)
+ schema: test::foo(Tensor x, Tensor y) -> (Tensor)
  catchall: boxed unboxed :: (Tensor _0) -> (Tensor _0)
 : expected from running ctors (1,); actual from running ctors (1,) and then failing to run ctor 0 (did this failure leave the dispatcher in a wedged state? it shouldn't!)
```

There are also C++ smoketests for the API.  These tests comprehensively
cover the C++ API surface of the new operator registration API, but
don't check very hard if the API does the right thing (that's what
test_dispatch.py is for)

Some miscellaneous changes which could have been split into other
PRs, but I was too lazy to do so:

- Add torch::jit::parseName (mirroring parseSchema/parseSchemaOrName)
- Add cloneWithName functionality to FunctionSchema
- Unconditionally generate schema registration, even when type_method_dispatch
  is a dict.  The one exception is for manual registrations....
- Add fallback, CppFunction::makeFallthrough and
  CppFunction::makeFromBoxedFunction to public API of op_registration, so we can
  stop calling internal registerImpl directly
- Add new syntax sugar dispatch_autograd for registering autograd kernels
- Minor OperatorName cleanup, storing OperatorName in DispatchTable
  and defining operator<< on OperatorName
- Refactored the op registration API to take FunctionSchema directly.
  We now do namespacing by post facto fixing up the OperatorName
  embedded in FunctionSchema.  This also means that you can
  now do torch::import("ns1").def("ns2::blah") and have the ns2
  override ns1 (although maybe this is not the correct behavior.)
- New torch::schema public API, for attaching alias analysis kind
  annotation kinds.  This meant we had to template up some function
  signatures which previously took const char*.  There's now a nice
  comment explaining this strategy.
- torch::import now takes std::string which means we can use
  the namespacing from Python

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20680520

Pulled By: ezyang

fbshipit-source-id: 5d39a28e4ec7c73fe4b1fb2222e865ab65e188f5
2020-03-28 10:52:49 -07:00
Omkar Salpekar
4025729e88 [1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636

Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072

Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.

In order to effectively test whether these RRef messages and distributed autograd cleanup work with network failures/retries, I implemented an RPC agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish).

This faulty RPC agent is pretty configurable. The tests can configure which message types to fail, and how many messages to fail, but going forward, other RPC functionality can be overridden with faulty methods to test with failures injected.
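
Conceptually, the fault injection is just a thin wrapper around send(); the class
and attribute names below are illustrative only, not the actual test backend.

```python
class FaultySendWrapper:
    """Drop the first `num_fail_sends` outgoing messages whose type is listed
    in `messages_to_fail`; everything else is forwarded to the real agent."""

    def __init__(self, agent, messages_to_fail, num_fail_sends):
        self.agent = agent
        self.messages_to_fail = set(messages_to_fail)
        self.failures_left = num_fail_sends

    def send(self, to, message):
        if message.type in self.messages_to_fail and self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError(f"injected failure for {message.type}")  # retry layer must recover
        return self.agent.send(to, message)
```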

Differential Revision: D20019236

fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
2020-03-20 20:07:47 -07:00
Kimish Patel
4c30fc7238 Integrate XNNPACK with custom class for packing weights. (#34047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34047

This PR integrates the added xnnpack conv2d and linear op via
custom class registration for packed weights. The packed struct
is serializable.

Test Plan:
python test test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20185657

fbshipit-source-id: fc7e692d8f913e493b293b02d92f4e78536d7698
2020-03-14 12:51:56 -07:00
Peter Bell
5fc5cf6571 Stop using ctypes to interface with CUDA libraries. (#33678)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33016, Continuation of https://github.com/pytorch/pytorch/issues/31160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33678

Differential Revision: D20249187

Pulled By: ezyang

fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed
2020-03-11 07:22:46 -07:00
Michael Suo
dbe850af5b [jit] do the code reorg (#33851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33851

Rationale and context described in #33828.

Script to reproduce the move:
https://gist.github.com/suo/16cbefaaeb67ca5a7c6caffd49b7f6e9
ghstack-source-id: 99079645

Test Plan: Make sure CI passes

Reviewed By: jamesr66a

Differential Revision: D20133869

fbshipit-source-id: 390e9241a9c85366d9005c492ac31f10aa96488e
2020-02-27 13:02:51 -08:00
Jithun Nair
718c538ff9 Add ability to enable/disable MIOpen at runtime (#33118)
Summary:
1. Set `torch._C.has_cudnn` to `True` for ROCm
2. Make MIOpen invocations respect value of `cudnn_enabled` or `at::globalContext().userEnabledCuDNN()`
3. `torch/backends/cudnn/__init__.py`: Add hip-specific changes (use "hide whitespace changes" option to view simpler diff)
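
For reference, on a ROCm build the familiar cuDNN switch now controls MIOpen,
so toggling it looks the same as on a CUDA build:

```python
import torch

# On a ROCm build, torch.backends.cudnn now reports on and controls MIOpen.
torch.backends.cudnn.enabled = False   # MIOpen-backed kernels are bypassed
# ... run convolutions with the fallback implementations ...
torch.backends.cudnn.enabled = True    # re-enable MIOpen
```
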
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33118

Differential Revision: D19977719

Pulled By: bddppq

fbshipit-source-id: 64d4dd1d78afcf96201360d85b8be5950f96dfad
2020-02-20 10:47:57 -08:00
Peter Bell
44af8ee6cd Add pybind11 exception translator (#30588)
Summary:
Closes https://github.com/pytorch/pytorch/issues/30027

The idea here is that you can bind a function with `pybind11` in a single line and without modifying the function:
```cpp
m.def("foo", foo, py::call_guard<torch::PyWarningHandler>());
```
Where warnings are handled by the [`call_guard`](https://pybind11.readthedocs.io/en/stable/advanced/functions.html#call-guard) and exceptions are handled by the `pybind11` exception translator. To do this, I have added support for handling C++ exceptions in `torch::PyWarningHandler`'s destructor without setting the python error state before hand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30588

Differential Revision: D19905626

Pulled By: albanD

fbshipit-source-id: 90c0a5e298b123cc0c8ab9c52c91be4e96ea47c6
2020-02-18 11:33:29 -08:00
Basil Hosmer
fb159b5236 Some work on eager op binding codegen (gen_python_functions.py) (#29986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29986

Previously in addition to generating a python binding for each op,
we would generate an almost-trivial helper for each overload.
This PR eliminates the helpers, simplifying codegen logic a bit and
reducing the source-level indirection by a step.
Perf should be unchanged.

codegen diff: 1f2f07fb60

Note: in the interests of keeping the diff contained, there's only
some light cleanup here beyond what's necessary for the codegen changes.
Plan is to do some more substantial refactoring in followup PRs that
leave generated code unchanged.

Test Plan: Imported from OSS

Differential Revision: D18567980

Pulled By: bhosmer

fbshipit-source-id: eb9a81babb4489abd470842757af45580d4c9906
2020-01-30 00:29:53 -08:00
Pavel Belevich
62b06b9fae Rename TensorTypeId to DispatchKey (#32154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32154

TensorTypeId -> DispatchKey
	c10/core/TensorTypeId.h -> c10/core/DispatchKey.h
	c10/core/TensorTypeId.cpp -> c10/core/DispatchKey.cpp
	TensorTypeId::* -> DispatchKey::*
	TensorTypeId type_id -> DispatchKey dispatch_key
		type_id -> dispatch_key
	TensorTypeId::NumTensorIds -> DispatchKey::NumDispatchKeys
	RealTensorTypeId -> RealDispatchKey
TensorTypeSet -> DispatchKeySet
	TensorTypeIds -> DispatchKeys
	c10/core/TensorTypeSet.h -> c10/core/DispatchKeySet.h
	c10/core/TensorTypeSet.cpp -> c10/core/DispatchKeySet.cpp
	type_set() -> key_set()
	type_set_ -> key_set_
	typeSet -> keySet
ExcludeTensorTypeIdGuard -> ExcludeDispatchKeyGuard
IncludeTensorTypeIdGuard -> IncludeDispatchKeyGuard
LocalTensorTypeSet -> LocalDispatchKeySet
	c10/core/impl/LocalTensorTypeSet.h -> c10/core/impl/LocalDispatchKeySet.h
	c10/core/impl/LocalTensorTypeSet.cpp -> c10/core/impl/LocalDispatchKeySet.cpp
	tls_local_tensor_type_set -> tls_local_dispatch_key_set
	tls_is_tensor_type_id_excluded -> tls_is_dispatch_key_excluded
	tls_set_tensor_type_id_excluded -> tls_set_dispatch_key_excluded
	tls_is_tensor_type_id_included -> tls_is_dispatch_key_included
	tls_set_tensor_type_id_included -> tls_set_dispatch_key_included
MultiDispatchTensorTypeSet -> MultiDispatchKeySet
	multi_dispatch_tensor_type_set -> multi_dispatch_key_set
tensorTypeIdToBackend -> dispatchKeyToBackend
backendToTensorTypeId -> backendToDispatchKey
initForTensorTypeSet -> initForDispatchKeySet
inferred_type_set -> inferred_key_set
computeTensorTypeId -> computeDispatchKey
PODLocalTensorTypeSet raw_local_tensor_type_set -> PODLocalDispatchKeySet raw_local_dispatch_key_set
get_default_tensor_type_id -> get_default_dispatch_key
inferred_type_id -> inferred_dispatch_key
actual_type_id -> actual_dispatch_key
typeSetToDispatchKey_ -> dispatchKeySetToDispatchKey_
get_type_id() -> get_dispatch_key()
legacyExtractTypeId -> legacyExtractDispatchKey
extractTypeId -> extractDispatchKey

Test Plan: Imported from OSS

Differential Revision: D19398900

Pulled By: pbelevich

fbshipit-source-id: 234ad19f93d33e00201b61e153b740a339035776
2020-01-15 11:16:08 -08:00
Richard Zou
bcb0bb7e0e Remove unnecessary ATen/core/EnableNamedTensor.h (#31117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31117

After this diff, we will have completely removed the named tensor
feature flagging. This means that named tensors are always on and that
there is no mechanism to turn them off. There should be no more follow-up
diffs.

I performed the deletion of the header with
```
find . -type f -print0 | xargs -0 sed -i '/#include <ATen\/core\/EnableNamedTensor.h>/d'
```

Test Plan: - wait for CI

Differential Revision: D18934952

Pulled By: zou3519

fbshipit-source-id: 253d059074b910fef15bdf885ebf71e0edf5bea5
2019-12-12 09:53:07 -08:00
Richard Zou
9047d4df45 Remove all remaining usages of BUILD_NAMEDTENSOR (#31116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31116

Changelist:
- remove BUILD_NAMEDTENSOR macro
- remove torch._C._BUILD_NAMEDTENSOR
- remove all python behavior that relies on torch._C._BUILD_NAMEDTENSOR

Future:
- In the next diff, I will remove all usages of
ATen/core/EnableNamedTensor.h since that header doesn't do anything
anymore
- After that, we'll be done with the BUILD_NAMEDTENSOR removal.

Test Plan: - run CI

Differential Revision: D18934951

Pulled By: zou3519

fbshipit-source-id: 0a0df0f1f0470d0a01c495579333a2835aac9f5d
2019-12-12 09:53:03 -08:00
Richard Zou
e05ee4c421 Remove BUILD_NAMEDTENSOR macros (#30894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30894

This PR begins the process of removing BUILD_NAMEDTENSOR macros. There
will be followups.

Reasons for removing the macros:
- BUILD_NAMEDTENSOR is always on and has been on since pytorch 1.3.0.
- Since we don't test building without it, it is useless to keep around.
- Code becomes nicer to read without the macros

Reasons for not removing the macros:
- potential for feature flagging

Now, I argue against needing to feature flag. The main reason why we
might want to feature flag is if we need to disable the feature.
We'd need a fast switch to disable the feature if someone discovers
in the future that named tensors caused some regression in some existing workflows.

In https://github.com/pytorch/pytorch/pull/25798, I did a variety of
macro- and micro- benchmarks to determine the performance impact of named
tensors on regular tensors.

[The
microbenchmarks](https://github.com/pytorch/pytorch/pull/25798#issuecomment-529014810)
were not very stable, and running the
microbenchmarks for more iterations doesn't actually help because the
noise is not distributed in a nice way. Instead of microbenchmarks I ran
a [profiler
(perf)](https://github.com/pytorch/pytorch/pull/25798#issuecomment-555707645)
to estimate how much overhead named tensors add to unnamed code. I
estimated the overhead to be less than 100ns for `add` and even smaller
for `mm`; there are ways to optimize even further if we find this to be a
problem.

[Initial
macrobenchmarks](https://github.com/pytorch/pytorch/pull/25798#issuecomment-530539104)
were also not very stable. I ran imagenet for some number of epochs. To
make them more stable, I got rid of the data loading (which seemed to
vary between runs). [In some benchmark runs without data
loading](https://github.com/pytorch/pytorch/pull/25798#issuecomment-562214053),
we can see that the results are less noisy now. These results support
no noticeable regressions in speed.

Test Plan: - wait for CI

Differential Revision: D18858543

Pulled By: zou3519

fbshipit-source-id: 08bf3853a9f506c6b084808dc9ddd1e835f48c13
2019-12-10 07:54:05 -08:00
Edward Yang
0c91ebb694 Delete all trivial uses of make_variable. (#29213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29213

A trivial use of make_variable is one where requires_grad=False.  This
transformation is not technically semantics preserving, as make_variable
will create a shallow copy of the tensor in question; however, I
am guessing that we have the invariant that we don't actually make
use of this shallow copy in a nontrivial way.

There were some cases where the surrounding code expected a Variable proper
to be returned; I retained those sites.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18353503

Pulled By: ezyang

fbshipit-source-id: 57fe34d82e009c0cc852266fb0b79d6d9c62bb03
2019-11-13 07:43:41 -08:00
Alban Desmaison
9b875e1256 Buffer python warning to avoid deadlocks
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26613

Test Plan: Imported from OSS

Differential Revision: D18249633

Pulled By: albanD

fbshipit-source-id: 863f52400e1b97943a67a9e1abb09ae8d045e7f0
2019-11-07 08:35:06 -08:00
albanD
cb3232fdb9 Fix clang-tidy errors in csrc/Module.cpp
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28283

Test Plan: Imported from OSS

Differential Revision: D18249632

Pulled By: albanD

fbshipit-source-id: 0c7c71b3b7c74d338a90850e06c841b399f5709f
2019-11-07 08:34:58 -08:00
Supriya Rao
45391ccecb Update qengine flag in python to string (#26620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26620

This change updates torch.backends.quantized.engine to accept a string ("fbgemm"/"qnnpack"/"none" for now).
set_qengine and get_qengine return an int which represents the at::QEngine enum.
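
For reference, the resulting Python surface looks roughly like this (the set of
available engines depends on how PyTorch was built):

```python
import torch

print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'fbgemm'], build-dependent
torch.backends.quantized.engine = 'fbgemm'         # select FBGEMM kernels for quantized ops
print(torch.backends.quantized.engine)             # 'fbgemm'
```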

Test Plan:
python test/test_torch.py

Imported from OSS

Differential Revision: D17533582

fbshipit-source-id: 5103263d0d59ff37d43dec27243cb76ba8ba633f
2019-09-23 17:56:50 -07:00
Jerry Zhang
2667493f4c Expose supportedQEngines to python (#26474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26474

att

Test Plan:
python test/test_torch.py

Imported from OSS

Differential Revision: D17517373

fbshipit-source-id: af931761d6ee31a88808d05f686002a83b6b25af
2019-09-21 10:36:13 -07:00
Jerry Zhang
8f50ea0f5c Add NoQEngine to QEngine and refactor the name of set/get qengine (#26471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26471

att

Test Plan:
.

Imported from OSS

Differential Revision: D17491215

fbshipit-source-id: 5790aa0113bfdbeeb838f3d1530397606ccaa1e9
2019-09-19 17:42:09 -07:00
Ailing Zhang
b1ecf4bc82 Revert D17464904: Add NoQEngine to QEngine and refactor the name of set/get qengine
Test Plan: revert-hammer

Differential Revision:
D17464904

Original commit changeset: d8f2cebb978f

fbshipit-source-id: 8feb86f7347f455eb51538ce7893d4a096ba0ba4
2019-09-18 20:04:58 -07:00
Jerry Zhang
4f7292f7ee Add NoQEngine to QEngine and refactor the name of set/get qengine (#26330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26330

att

Test Plan:
.

Imported from OSS

Differential Revision: D17464904

fbshipit-source-id: d8f2cebb978fcbc478bc7e111ba24bc71a6f8915
2019-09-18 19:38:59 -07:00
Supriya Rao
bb1efb3bee Adding quantized::linear function for pytorch mobile in c10 (#26135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26135

This change adds the support to call QNNPACK using the refactored API for Linear operators (Fully Connected)
It also has certain cmake changes to enable building and using pytorch_qnnpack inside aten
I have disabled USE_QNNPACK in CMakeLists.txt. Enabling it results in picking kernels from third_party/QNNPACK during runtime since the function names are the same.

Test Plan:
python test/test_quantized.py TestQNNPackOps.test_qlinear_qnnpack

Imported from OSS

Differential Revision: D17434885

fbshipit-source-id: 084698026938f4529f61d12e86dfe82534ec73dd
2019-09-17 16:16:39 -07:00
Richard Zou
caed485873 Turn on BUILD_NAMEDTENSOR permanently (#26060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26060

This PR enables BUILD_NAMEDTENSOR by default. This is done via including
a header, `c10/core/EnableNamedTensor`, that sets `BUILD_NAMEDTENSOR`.
In the future, the plan is to get rid of the flag entirely: we can
incrementally delete usages after this PR goes in.

This PR also maintains the namedtensor ci vs regular ci distinction.
`test/test_namedtensor.py` only runs if TEST_NAMEDTENSOR=1 is specified.
TEST_NAMEDTENSOR=1 is set on the namedtensor ci. I'll remove this
distinction later and send out an announcement about it; devs will be
responsible for named tensor failures after that.

The initial reason why we had the BUILD_NAMEDTENSOR flag was so that we
could quickly prototype named tensor features without worrying about
adding overhead to the framework. The overheads can be categorized as
memory overhead and performance overhead.

Memory overhead: named tensors adds 1 additional word per Tensor. This
is because TensorImpl stores a `unique_ptr<NamedTensorMetaInterface>`
field. This is not a lot of overhead.

Performance overhead: At all entry points to name inference, we check
if inputs to an op are named. If inputs are not named, we short-circuit
and don't do name inference. These calls should therefore be as
efficient as error-checking code and not take up a lot of time.

My plan is to benchmark a few functions and then post the results in a
comment to this PR.

Test Plan: - [namedtensor ci]

Differential Revision: D17331635

Pulled By: zou3519

fbshipit-source-id: deed901347448ae2c26066c1fa432e3dc0cadb92
2019-09-17 08:25:00 -07:00
Ralf Gommers
1b4951d3a5 Fix remaining invalid function cast warnings that show up with GCC 8/9 (#26104)
Summary:
Follow-up to gh-25483, more of the same fixes for warnings like:

```
../torch/csrc/autograd/python_variable.cpp:503:31: warning: cast between incompatible function types from ‘PyObject* (*)(THPVariable*)’ {aka ‘_object* (*)(THPVariable*)’} to ‘getter’ {aka ‘_object* (*)(_object*, void*)’} [-Wcast-function-type]
  503 |   {"_backward_hooks", (getter)THPVariable_get_backwards_hooks, (setter)THPVariable_set_backwards_hooks, nullptr, nullptr},
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

This takes the build log output for a full rebuild with GCC 9.1 from ~10,000 to ~7,000 lines.

`clang-tidy` is going to complain, no way around that - see discussion at the end of gh-25483.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26104

Differential Revision: D17396831

Pulled By: ezyang

fbshipit-source-id: d71696bfe4dbe25519e4bcb7753151c118bd39f7
2019-09-17 07:43:37 -07:00
Supriya Rao
24d5b5f5f9 Add Runtime flag for quantized backend. (#25680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25680

Add a runtime flag to choose between FBGEMM and QNNPACK when compiled with both.

The flag can be set by using torch.backends.quantized.engine = torch.fbgemm/torch.qnnpack or ctx::setPreferredQuantizedEngine(at::QEngine)
ghstack-source-id: 89935643

Test Plan: Verified torch.backends.quantized.engine works

Differential Revision: D17198233

fbshipit-source-id: e5449d06f4136385e0e6d18bd4237f8654a61672
2019-09-11 21:37:36 -07:00
jiayisun
b9bf91feb8 Add torch.backends.mkldnn.enabled flag (#25459)
Summary:
This PR adds the torch.backends.mkldnn.enabled flag described in https://github.com/pytorch/pytorch/issues/25186, which can be used to disable MKL-DNN at runtime, just like torch.backends.cudnn.enabled.
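
A quick illustration of the new switch:

```python
import torch

print(torch.backends.mkldnn.is_available())  # True if this build has MKL-DNN support
torch.backends.mkldnn.enabled = False        # fall back to the native CPU kernels
# ... run the model without MKL-DNN ...
torch.backends.mkldnn.enabled = True
```
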
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25459

Differential Revision: D17258926

Pulled By: ezyang

fbshipit-source-id: e179ad364cc608fdaa7d0f37e2e762ceb5eda598
2019-09-11 12:09:40 -07:00
Gregory Chanan
716815e3de Stop initializing THNN backend. (#25352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25352

It doesn't appear to be necessary anymore; assuming this works I'll kill the codegen in a follow-up PR.

Test Plan: Imported from OSS

Differential Revision: D17101573

Pulled By: gchanan

fbshipit-source-id: bd3d1724ee5c659185a161b1e291e30af52f0a8a
2019-08-30 07:42:17 -07:00
Pritam Damania
7818e7e5d4 Basic framework for Distributed Autograd context. (#24875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24875

As per https://github.com/pytorch/pytorch/issues/23110, each autograd pass
would be assigned a unique autograd_context_id. In this change we introduce a
DistAutogradContainer per worker which holds information for each autograd pass
currently running.

DistAutogradContainer has a map from the autograd_context_id to
DistAutogradContext (which holds all the relevant information for the autograd
pass). DistAutogradContext currently only stores the autograd_context_id and
more information would be added to it later as we build out the rest of the
framework.

The autograd_context_id is a 64 bit globally unique integer where the first 16
bits are the worker_id and next 48 bits are auto-incrementing for uniqueness.
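
A minimal sketch of that id layout (plain arithmetic, not the actual C++ implementation):

```python
WORKER_ID_BITS = 16
LOCAL_ID_BITS = 48

def make_context_id(worker_id, local_counter):
    """Pack a 16-bit worker id and a 48-bit auto-incrementing counter
    into one 64-bit autograd_context_id."""
    assert 0 <= worker_id < (1 << WORKER_ID_BITS)
    assert 0 <= local_counter < (1 << LOCAL_ID_BITS)
    return (worker_id << LOCAL_ID_BITS) | local_counter

ctx_id = make_context_id(3, 4)                    # worker 3, 4th autograd pass
assert ctx_id >> LOCAL_ID_BITS == 3               # recover the worker id
assert ctx_id & ((1 << LOCAL_ID_BITS) - 1) == 4   # recover the local counter
```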

Sample python code on how this would be used for distributed autograd:

```
import torch.distributed.autograd as dist_autograd
worker_id = 0
dist_autograd.init(worker_id)
with dist_autograd.context() as context_id:
     # forward pass...
     # backward pass...
     # optimizer step...
```
ghstack-source-id: 89119248

Test Plan: unit tests.

Differential Revision: D16356694

fbshipit-source-id: d1a8678da0c2af611758dbb5d624d554212330ce
2019-08-28 18:51:56 -07:00
Shen Li
02d3c302d8 Fix build failure on OSX (#23998)
Summary:
https://github.com/pytorch/pytorch/pull/23228 caused a build failure on OSX, because rpc.h is included as long as USE_DISTRIBUTED=1, but rpc/init.cpp (and others) are only included when NOT APPLE. So, it cannot find python_functions defined in init.cpp on macOS. This PR attempts to fix it by wrapping rpc.h with USE_C10D, which is only set when NOT APPLE.

I tried this fix locally and it works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23998

Differential Revision: D16706087

Pulled By: mrshenli

fbshipit-source-id: d04fe6717a181a3198289cdef51439708c2e291d
2019-08-07 22:05:41 -07:00
Shen Li
8b349073ce sync and async torch.distributed.rpc for builtin operators (#23228)
Summary:
Features:

* sync and async RPC for builtin operators
* RpcAgent API
* ProcessGroupAgent implementation

Goal:

* have a minimum working and testable RPC implementation
* make sure the RpcAgent API is sufficient for future ThriftAgent and TensorPipeAgent implementation
  * For the tensor pipe implementation, it might allocate multiple underlying communication channels with different types, and might also use streaming serialization/deserialization for large tensors. To support this requirement, the current implementation only converts a BuiltinOp into a Message which contains a byte vector and a tensor table. It is up to the RpcAgent implementation to determine how it would like to serialize a Message object.
  * For ThriftAgent, as Thrift has its own request/response matching solution, the Message.id is no longer necessary. Hence the id can be dropped during serialization. All it needs to do is to pass the response Message object to the Future returned by send(...).
* support blocking and non-blocking RequestCallback
  * blocking means the callback won't return before sending out the response
  * non-blocking can be achieved by enqueuing the `(from, request, RpcAgent&)` tuple and using a different thread to process it. That is why there is an `RpcAgent&` arg in the param list.
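
From the caller's point of view, the sync/async split looks roughly like the sketch
below; the rpc_sync/rpc_async names follow the later public torch.distributed.rpc API
and assume init_rpc has already set up both workers, so treat the exact surface as an
assumption relative to this PR.

```python
import torch
import torch.distributed.rpc as rpc

def remote_adds():
    x, y = torch.ones(2, 2), torch.ones(2, 2)
    z = rpc.rpc_sync("worker1", torch.add, args=(x, y))     # blocks until the result arrives
    fut = rpc.rpc_async("worker1", torch.add, args=(x, y))  # returns a Future immediately
    return z, fut.wait()
```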

We are not exporting this diff until we finalize distributed autograd design and publish the API review publicly.

https://fb.quip.com/FabTAZKVgQpf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/23228
ghstack-source-id: 87816717

Reviewed By: zhaojuanmao

Differential Revision: D15194693

fbshipit-source-id: 7adb600796613cde6073db6c227451b89940ecaf
2019-08-06 16:03:01 -07:00
Richard Zou
8e466b7e21 Add torch._C._BUILD_NAMEDTENSOR() (#23623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23623

This is a quick, not-user-facing check for if pytorch was built with BUILD_NAMEDTENSOR=1.

Test Plan:
- run tests [namedtensor ci]

gh-metadata: pytorch pytorch 23623 gh/zou3519/85/head

Differential Revision: D16621829

Pulled By: zou3519

fbshipit-source-id: d7e1161dc176bab2c1f953265722daeba1e63102
2019-08-02 11:37:25 -07:00
Iurii Zdebskyi
3a8d7463bd Enabled BFloat16 storage (#21523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21523
ghimport-source-id: 698b3cbd6b21c09b9ff8bf8011980df8e35c33b0

Test Plan: Imported from OSS

Differential Revision: D15819368

Pulled By: izdeby

fbshipit-source-id: f6b3bba7b3ca8ee677bd80a231dbb3920c07d61c
2019-07-09 21:51:06 -07:00
Roy Li
9c8f9f0ecb Remove many usages of Type (#21941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21941
ghimport-source-id: f20cca6229daba9eb8652adb3d959266ae081ef1

Test Plan: Imported from OSS

Differential Revision: D15893331

Pulled By: li-roy

fbshipit-source-id: c988b16008ff0e2725a88c6025afd4aabdaca45a
2019-06-30 04:11:28 -07:00
Alexander Sidorov
f51de8b61a Back out "Revert D15435461: [pytorch][PR] PyTorch ThroughputBenchmark" (#22185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22185

Original commit changeset: 72a0eac1658b

Differential Revision: D15981928

fbshipit-source-id: d2455d79e81c26ee90d41414cde8ac0f9b703bc3
2019-06-26 16:05:51 -07:00
Ailing Zhang
e8bc992b03 print device when it's not on default device (#22094)
Summary:
We used to not print the device when a tensor is on XLA. This is sometimes confusing, as it looks the same as a CPU tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22094

Differential Revision: D15975405

Pulled By: ailzhang

fbshipit-source-id: f19ceb9e26f5f2f6e7d659de12716f0dfe065f42
2019-06-25 20:28:50 -07:00
Pieter Noordhuis
6ff0c6ca3f Remove THD (#22065)
Summary:
It's been ~9 months since moving THD to the `torch.distributed.deprecated` namespace (see https://github.com/pytorch/pytorch/issues/11405) and we haven't seen issues related to it, so it's time to remove it.

Closes https://github.com/pytorch/pytorch/issues/18967.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22065

Reviewed By: mrshenli

Differential Revision: D15983669

Pulled By: pietern

fbshipit-source-id: 2a2f5866f9a63040bc7cef3956d5fd215aba7165
2019-06-25 12:19:13 -07:00
Soumith Chintala
08060e898b Revert D15435461: [pytorch][PR] PyTorch ThroughputBenchmark
Differential Revision:
D15435461

Original commit changeset: db08829dc3f4

fbshipit-source-id: 72a0eac1658b2d3f885bc9a21c49fcc23030ae3e
2019-06-23 22:55:05 -07:00
Alexander Sidorov
9b45237618 PyTorch ThroughputBenchmark (#20766)
Summary:
This is useful for measuring inference performance of your
models. This is a very basic benchmark for now. We don't support
batching on the benchmark side; no inter- and intra-op parallelism is
supported yet, just caller-based parallelism.

The main philosophy here is that the user should be able to provide inputs
from Python and just stack them within the benchmark. The API should be
exactly the same as passing inputs to module.forward.
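
Roughly, usage looks like the sketch below; the module path and the benchmark()
keyword names are best-effort recollections of the Python wrapper and should be
treated as assumptions.

```python
import torch
from torch.utils.throughput_benchmark import ThroughputBenchmark

module = torch.jit.script(torch.nn.Linear(8, 4))
bench = ThroughputBenchmark(module)
bench.add_input(torch.randn(1, 8))             # same signature as module.forward
stats = bench.benchmark(num_calls_per_thread=100,
                        num_warmup_iters=10,
                        num_worker_threads=4)  # caller-based parallelism only
print(stats)
```
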
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20766

Test Plan: Added a new unit test

Differential Revision: D15435461

Pulled By: salexspb

fbshipit-source-id: db08829dc3f4398bb1d8aa16cc4a58b6c72f16c6
2019-06-23 13:03:18 -07:00
Jerry Zhang
94f903654c Add qscheme() method (#20608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20608

Exposing QScheme in python as Python objects like `torch.qscheme.per_tensor_affine` etc.

Reviewed By: zafartahirov

Differential Revision: D15364354

fbshipit-source-id: 4d6a96d67e9ead051cf4a8f934553a8c7232fdb7
2019-06-14 16:29:29 -07:00
Syed Tousif Ahmed
ae342fd076 Refactor Random Number Generators in ATen (#21364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21364
ghimport-source-id: ca7d37e10190ba46dc8512f437404ca9216d3369

Differential Revision: D15696497

Pulled By: ezyang

fbshipit-source-id: 2e713b8566ae915e175b5a79ac1dd9b86cc2a23d
2019-06-12 13:01:30 -07:00
davidriazati
f172fadd80 Make warnings be UserWarnings with source file info (#21231)
Summary:
Redo of #15201, this makes `warnings.warn` calls match their Python
behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21231

Pulled By: driazati

Differential Revision: D15605266

fbshipit-source-id: 5931fd720b0c40d52dd492fbd1f5a76abefaab5c
2019-06-05 11:09:11 -07:00
Jerry Zhang
277bf69fa0 Add torch.load/torch.save for QTensor (#20830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20830

att

Reviewed By: dzhulgakov

Differential Revision: D15340701

fbshipit-source-id: 677038c8101f66dec4856c2eccf9f9e394012226
2019-05-30 20:52:19 -07:00
Dmytro Dzhulgakov
c25e33789e Lightweight at-most-once logging for API usage (#20745)
Summary:
Resubmit #20698 which got messed up.

The idea is that when PyTorch is used in a custom build environment (e.g. Facebook), it's useful to track usage of various APIs centrally. This PR introduces a simple, very lightweight mechanism to do so - only the first invocation of a trigger point is logged. This is significantly more lightweight than #18235, and thus we can afford to put logging in e.g. TensorImpl.

Also adds an initial list of trigger points. Trigger points are added in such a way that no static initialization triggers them, i.e. just linking with libtorch.so will not cause any logging. Further suggestions of what to log are welcomed.
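
The at-most-once behavior is essentially the pattern below, shown as a self-contained
Python sketch of the idea rather than the actual C++ hook:

```python
import threading

_seen = set()
_lock = threading.Lock()
_logger = print  # stand-in for the centrally configured logging callback

def log_api_usage_once(event):
    """Log `event` the first time it fires; subsequent hits are (nearly) free."""
    with _lock:
        if event in _seen:
            return
        _seen.add(event)
    _logger(event)

log_api_usage_once("tensor.create")  # logged
log_api_usage_once("tensor.create")  # ignored
```
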
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20745

Differential Revision: D15429196

Pulled By: dzhulgakov

fbshipit-source-id: a5e41a709a65b7ebccc6b95f93854e583cf20aca
2019-05-23 23:17:59 -07:00
Will Feng
8cde4c4d22 Remove Variable::Impl and DifferentiableViewImpl (#17072)
Summary:
As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR:
1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class
2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()`
3. Remove `Variable.data()` API
4. Add `Variable.variable_data()` that matches `tensor.data` in Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history.

After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't.

**Note that this PR is BC-breaking in the following use cases:**

**Use Case 1:**
Previously, `x.data = y` works even if `x` and `y` are of different TensorImpl type (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl type, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl type.
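
Concretely, Use Case 1 amounts to something like this (minimal illustration; the exact
error message isn't the point):

```python
import torch

x = torch.randn(2, 2)               # dense CPU tensor (TensorImpl)
y = torch.randn(2, 2).to_sparse()   # sparse CPU tensor (SparseTensorImpl)

x.data = y   # worked before this PR; now fails because the two impls differ
```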

**Use Case 2:**
If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. This is better illustrated with the following example:
```python
params = torch.tensor([1.5, 1.5]).requires_grad_()
with torch.no_grad():
    # Change gradient to a sparse tensor
    params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.]))

grad_saved = params.grad
params.backward(torch.tensor([1.5, 1.5]))
assert id(grad_saved) == id(params.grad)  # This will fail after this PR
```
The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072

Differential Revision: D14075257

Pulled By: yf225

fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957
2019-05-23 21:09:04 -07:00
Edward Z. Yang
9b1dbffba5
Re-sync with internal repository (#20702) 2019-05-20 09:22:57 -04:00
Dmytro Dzhulgakov
d3059b9c49 Lightweight logging for once-only API usage 2019-05-19 23:04:40 -07:00