Commit Graph

102 Commits

Author SHA1 Message Date
Richard Zou
4bd03b0242 Add python mode (#63496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63496

This PR adds a (private) enable_python_mode context manager.
(see torch/utils/_python_dispatch.py).
enable_python_mode accepts the type of a __torch_dispatch__ object
as its argument. Whenever an operator gets called inside of the
context manager, it dispatches to the __torch_dispatch__ of
the passed-in type.

Example usage:
```
with enable_python_mode(LoggingTensor):
    z = torch.empty([])
    assert isinstance(z, LoggingTensor)
```

There are quite a few changes that were made to support this.

First, we added TorchDispatchTypeObject, a C++ struct that represents the
type of a `__torch_dispatch__` object (e.g. LoggingTensor).
It holds both the PyObject* representing the class and a PyInterpreter*
so we know which Python interpreter it came from.

Next, we updated the concrete_dispatch_fn in python_variable.cpp to accept
a `const std::shared_ptr<TorchDispatchTypeObject>&` argument. When this
is null, dispatching happens as usual. When it is non-null, we prepend
the TorchDispatchTypeObject's PyObject* to the overloaded args list so that
it is considered first for dispatch.

To get that to work, we changed how `handle_torch_dispatch_no_python_arg_parser`
works. The "overloaded args list" previously only consisted of Tensor PyObjects,
but now it can have types in addition to Tensors!
- We renamed `append_overloaded_arg` to `append_overloaded_tensor`
- We added a new `append_overloaded_type` that appends a type to
overloaded_args
- We added special handling in `handle_torch_dispatch_no_python_arg_parser`
and `append_overloaded_arg` to handle types in addition to Tensors.
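
As a rough illustration of an overloaded args list that holds types as well as instances, here is a pure-Python toy model. It is not the C++ implementation: `LoggingTensor` is a stand-in class, and `append_overloaded_type` and `dispatch` here are hypothetical Python analogues written only for this sketch.

```python
class LoggingTensor:
    """Stand-in for a tensor subclass that defines __torch_dispatch__."""

    @classmethod
    def __torch_dispatch__(cls, func, args):
        return f"{cls.__name__} handled {func}"


def append_overloaded_type(overloaded_args, typ):
    # Hypothetical analogue: record a *type* rather than an instance.
    if hasattr(typ, "__torch_dispatch__") and typ not in overloaded_args:
        overloaded_args.append(typ)


def dispatch(func, args, mode_type=None):
    overloaded_args = []
    if mode_type is not None:
        # The mode's type goes first so it is considered first for dispatch.
        append_overloaded_type(overloaded_args, mode_type)
    for arg in args:
        has_handler = hasattr(type(arg), "__torch_dispatch__")
        if has_handler and all(arg is not o for o in overloaded_args):
            overloaded_args.append(arg)
    for entry in overloaded_args:
        # An entry is either a type or an instance; look up the handler on the type.
        cls = entry if isinstance(entry, type) else type(entry)
        result = cls.__torch_dispatch__(func, args)
        if result is not NotImplemented:
            return result
    return f"default kernel for {func}"
```

With no tensor arguments at all, `dispatch("empty", [], mode_type=LoggingTensor)` still reaches `LoggingTensor.__torch_dispatch__`, which is the situation the `torch.empty([])` example above relies on.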

Then, there is PythonMode and PythonModeTLS.
- We reuse the DispatchKey::Python dispatch key as a mode key
- We use PythonMode::enter and PythonMode::exit to enable/disable
DispatchKey::Python and set the PythonModeTLS.
- PythonModeTLS stores a TorchDispatchTypeObject as metadata.
- PythonMode is in libtorch_python, and PythonModeTLS is in ATen.
This split is due to the libtorch_python library boundary (because we need
to save TLS in ATen/ThreadLocalState)
- We modify the PythonFallbackKernel to look up
the relevant TorchDispatchTypeObject (if Python Mode is active) and
dispatch using it.
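
The enter/exit and thread-local bookkeeping described above can be pictured with a small pure-Python sketch. The names `enable_python_mode_sketch` and `current_mode` are made up for this illustration, and the explicit error on nesting is an assumption about how "no nesting" might be enforced.

```python
import threading
from contextlib import contextmanager

_tls = threading.local()  # plays the role of PythonModeTLS in this sketch


def current_mode():
    # Returns the active mode type for this thread, or None.
    return getattr(_tls, "mode_type", None)


@contextmanager
def enable_python_mode_sketch(mode_type):
    if current_mode() is not None:
        # Assumption: nesting is rejected outright rather than stacked.
        raise RuntimeError("nested modes are not supported in this sketch")
    _tls.mode_type = mode_type  # analogous to PythonMode::enter
    try:
        yield
    finally:
        _tls.mode_type = None  # analogous to PythonMode::exit
```

Replacing the single slot with a list used as a stack would be one way to allow nesting later.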

There are two more miscellaneous changes:
- internal_new_from_data (torch/csrc/utils/tensor_new.cpp) gets an
exclude guard. enable_python_mode currently does not handle
torch.tensor and the exclude guard is to prevent a bug.

Future:
- This PR does not allow for the nesting of Python modes. In the future we
should be able to enable this with a more sane no_dispatch API and by changing
the TLS to a stack. For now I did not need this for CompositeImplicitAutograd testing.

Test Plan: - new tests

Reviewed By: malfet, albanD

Differential Revision: D30543236

Pulled By: zou3519

fbshipit-source-id: ef5444d96a5a957d1657b7e37dce80f9a497d452
2021-08-30 18:44:35 -07:00
Aaron Bockover
c78ab28441 Add support for the ONNX Runtime Eager Mode backend (#58248)
Summary:
This PR implements the necessary hooks/stubs/enums/etc for complete ONNX Runtime (ORT) Eager Mode integration. The actual extension will live out of tree at https://github.com/pytorch/ort.

We have been [working on this at Microsoft](https://github.com/microsoft/onnxruntime-pytorch/tree/eager-ort/torch_onnxruntime) for the last few months, and are finally ready to contribute the PyTorch core changes upstream (nothing major or exciting, just the usual boilerplate for adding new backends).

The ORT backend will allow us to ferry [almost] all torch ops into granular ONNX kernels that ORT will eagerly execute against any devices it supports (therefore, we only need a single ORT backend from a PyTorch perspective).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58248

Reviewed By: astaff

Differential Revision: D30344992

Pulled By: albanD

fbshipit-source-id: 69082b32121246340d686e16653626114b7714b2
2021-08-20 11:17:13 -07:00
Kimish Patel
1b04d99f55 [Pytorch Profiler] Introduce scopes to enableProfiler (#62417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62417

This diff adds an option to make enableProfiler enable callbacks only
for certain RecordScopes.
Why?
Profiling has some overhead when we repeatedly execute callbacks for
all scopes. On the mobile side, where we often have small quantized models,
this overhead can be large. We observed that by profiling only the top-level
op and skipping the other ATen ops called within it, we can limit
this overhead. For example, instead of profiling at::conv2d -> at::convolution ->
at::convolution_, plus any further ops such as transpose that are called,
we skip profiling of the nested ops. Of course this limits visibility, but
at least this way we get a choice.
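
The idea of enabling callbacks only for selected scopes can be sketched in a few lines of plain Python. This is a toy model, not the profiler's API: `Scope`, `ToyProfiler`, and `run_conv2d` are hypothetical names, and the two scope members are illustrative rather than the real RecordScope values.

```python
from enum import Enum


class Scope(Enum):
    # Hypothetical scope names; the real RecordScope enum has its own members.
    TOP_LEVEL_OP = 0
    NESTED_OP = 1


class ToyProfiler:
    def __init__(self, scopes=None):
        # None means "profile every scope".
        self.scopes = set(Scope) if scopes is None else set(scopes)
        self.events = []

    def record(self, scope, name):
        # The callback only fires for scopes that were enabled.
        if scope in self.scopes:
            self.events.append(name)


def run_conv2d(profiler):
    # One top-level op that internally calls two nested ops.
    profiler.record(Scope.TOP_LEVEL_OP, "conv2d")
    profiler.record(Scope.NESTED_OP, "convolution")
    profiler.record(Scope.NESTED_OP, "transpose")
```

A profiler built with `ToyProfiler({Scope.TOP_LEVEL_OP})` records only `conv2d`, trading visibility for fewer callback invocations.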

Test Plan: Imported from OSS

Reviewed By: ilia-cher

Differential Revision: D29993659

fbshipit-source-id: 852d3ae7822f0d94dc6e507bd4019b60d488ef69
2021-08-13 21:40:15 -07:00
Kimish Patel
54f2eb6e7e [Pytorch Profiler] Add support for adding module hierarchy to (#61792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61792

KinetoEvent

This PR adds module hierarchy information to events.
What is the module hierarchy information attached to events?
While profiling a TorchScript module, when events are added, we ask the JIT
for the module hierarchy associated with the node being
executed. At the time that node executes, there might be multiple
frames in the interpreter stack. For each frame, we find the
corresponding node and query its module hierarchy.
The module hierarchy of a node is associated with the node's
InlinedCallStack, which tracks the path via which the
node is inlined. Thus, during the inlining process we annotate
module information corresponding to the CallMethod nodes being inlined.

With this PR, chrome trace will contain additional metadata:
"Module Hierarchy". This can look like this:
TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward
It contains the module instance, type name, and method name for each
entry in the call stack.
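
To make the string format concrete, the following sketch rebuilds the example above from a list of frames. The helper `format_module_hierarchy` is hypothetical; only the output format is taken from the example.

```python
def format_module_hierarchy(frames):
    """Join (instance, type, method) frames, outermost first."""
    return ".".join(f"{inst}({typ})::{meth}" for inst, typ, meth in frames)


frames = [
    ("TOP", "ResNet", "forward"),
    ("SELF", "ResNet", "_forward_impl"),
    ("layer1", "Sequential", "forward"),
    ("0", "BasicBlock", "forward"),
    ("conv1", "Conv2d", "forward"),
    ("SELF", "Conv2d", "_conv_forward"),
]
hierarchy = format_module_hierarchy(frames)
```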

Test Plan:
test_profiler

Imported from OSS

Reviewed By: raziel, ilia-cher

Differential Revision: D29745442

fbshipit-source-id: dc8dfaf7c5b8ab256ff0b2ef1e5ec265ca366528
2021-08-13 21:39:10 -07:00
Victor Quach
ed7ece389d Forbid inplace modification of a saved tensor's pack_hook input (#62717)
Summary:
When using saved tensors hooks (especially default hooks),
if the user defines a `pack_hook` that modifies its input,
it can cause some surprising behavior.

The goal of this PR is to prevent future user headache by catching
inplace modifications of the input of `pack_hook` and raising an error if
applicable.
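
One plausible way to catch such modifications is to compare a version counter before and after the hook runs. The sketch below is a pure-Python toy under that assumption; `ToyTensor` and `checked_pack` are hypothetical and say nothing about how the check is actually implemented in this PR.

```python
class ToyTensor:
    """Minimal stand-in with a version counter bumped by in-place ops."""

    def __init__(self, data):
        self.data = list(data)
        self._version = 0

    def add_(self, value):
        # In-place update: mutate the data and bump the version.
        self.data = [x + value for x in self.data]
        self._version += 1
        return self


def checked_pack(pack_hook, tensor):
    version_before = tensor._version
    packed = pack_hook(tensor)
    if tensor._version != version_before:
        raise RuntimeError("pack_hook modified its input in place")
    return packed
```

A hook such as `lambda t: t.add_(1)` would trip the check, while `lambda t: list(t.data)` would not.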

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62717

Reviewed By: albanD

Differential Revision: D30255243

Pulled By: Varal7

fbshipit-source-id: 8d73f1e1b50b697a59a2849b5e21cf0aa7493b76
2021-08-12 12:40:10 -07:00
Rishi Puri
324673a537 rebase for autocast updates to include device_type and dtype flags (#61002)
Summary:
Fixes #55374
https://github.com/pytorch/pytorch/issues/55374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61002

Reviewed By: malfet, mruberry

Differential Revision: D30016812

Pulled By: ngimel

fbshipit-source-id: 6e09a29f539d28e9aea5cd9489b1e633cc588033
2021-08-10 20:03:12 -07:00
Ilia Cherniavskii
773a8eede4 [profiler][refactor] Refactor the usage of legacy profiler implementation (#61931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61931

This PR consolidates the profiling code around a new C++ implementation
(profiler_kineto.h/cpp) and uses it unconditionally from
torch.autograd.profiler/torch.profiler:
1. Always use profiler_kineto.h/cpp as the C++ implementation
2. Simplify profiler.py to remove unneeded parts depending on legacy
impl
3. Move some of the legacy logic into profiler_legacy.py (to be fully
deleted later)

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake
python test/test_profiler.py -v
USE_KINETO=0 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake
python test/test_profiler.py -v

Imported from OSS

Reviewed By: gdankel

Differential Revision: D29801599

fbshipit-source-id: 9794d29f2af38dddbcd90dbce4481fc8575fa29e
2021-08-03 18:51:29 -07:00
Victor Quach
b161ac541d [reland] Add default Saved Variable hooks (#62563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62563

Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.

Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.

A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.

For instance, to perform something like #28997, one could define a pack function that saves to disk whenever the tensor size is too big and returns a filename, then unpack simply reads the content of the file and outputs a tensor, e.g.:

```
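import os
import uuid

import torch

# tmp_dir is assumed to be an existing, writable directory defined by the caller.
def pack(x):
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)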
```
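
The "only when the tensor is too big" part is not shown in the snippet above. The following standard-library sketch illustrates one way such a threshold could look, using plain lists and `pickle` in place of tensors and `torch.save`/`torch.load`. `SIZE_THRESHOLD` and the `OnDisk` wrapper are assumptions made for this example.

```python
import os
import pickle
import tempfile
import uuid

tmp_dir = tempfile.mkdtemp()
SIZE_THRESHOLD = 4  # hypothetical cutoff, in number of elements


class OnDisk:
    """Marker wrapper so unpack can tell a file path from a kept object."""

    def __init__(self, path):
        self.path = path


def pack(x):
    if len(x) <= SIZE_THRESHOLD:
        return x  # small enough: keep it in memory
    path = os.path.join(tmp_dir, str(uuid.uuid4()))
    with open(path, "wb") as f:
        pickle.dump(x, f)
    return OnDisk(path)


def unpack(packed):
    if isinstance(packed, OnDisk):
        with open(packed.path, "rb") as f:
            return pickle.load(f)
    return packed
```

Wrapping the path in a marker type avoids ambiguity when the saved object could itself be a string.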

Relanding previous PR: https://github.com/pytorch/pytorch/pull/61834

Original PR led to timeout error in: https://www.internalfb.com/mast/job/yuguo-release_canary_offline_training-inlinecvrp_a-canary_offline_train_28a7ecfc

Now passing: https://www.internalfb.com/mast/job/quach-release_canary_offline_training-inlinecvrp_a-canary_offline_train_9bb57e98

The difference with the new version is we don't need to acquire the GIL when calling `PyDefaultSavedVariableHooks::get_hooks`.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30045405

Pulled By: Varal7

fbshipit-source-id: 7f6c07af3a56fe8835d5edcc815c15ea4fb4e332
2021-08-02 11:30:26 -07:00
Yu Guo
5c47038d12 Back out D29792193 "Add default Saved Variable hooks" (#62415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62415

test error

Differential Revision: D29990361

fbshipit-source-id: 99c87dec6c5be6496c9db5c9205c3cb72a953dd9
2021-07-29 16:31:00 -07:00
Alex Suhan
b176feec1e Add device and key for lazy tensors (#61621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61621

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D29912934

Pulled By: asuhan

fbshipit-source-id: 493c32063a3e756d93cbf1d876563a35eaafb537
2021-07-26 23:00:22 -07:00
Victor Quach
be17d6eadf Add default Saved Variable hooks (#61834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61834

Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.

Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.

A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.

For instance, to perform something like #28997, one could define a pack function that saves to disk whenever the tensor size is too big and returns a filename, then unpack simply reads the content of the file and outputs a tensor, e.g.:

```
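import os
import uuid

import torch

# tmp_dir is assumed to be an existing, writable directory defined by the caller.
def pack(x):
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)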
```

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29792193

Pulled By: Varal7

fbshipit-source-id: 33e931230ef59faa3ec8b5d11ef7c05539bce77c
2021-07-26 08:14:32 -07:00
Ilia Cherniavskii
41f7a9dac0 [profiler][refactor] Avoid using legacy event in profiler (#61721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61721

Remove dependency on LegacyEvent from the profiler

Test Plan:
python test/test_profiler.py -v

Imported from OSS

Reviewed By: kimishpatel, gdankel

Differential Revision: D29716769

fbshipit-source-id: 2c2b48f2ee096adcbde09821e0cc7c0fcb94d19f
2021-07-23 18:28:08 -07:00
cyy
59d6e07ada fix forward_idx check (#59911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59911

Reviewed By: dzhulgakov

Differential Revision: D29829020

Pulled By: albanD

fbshipit-source-id: f685063061dab499368a272d6b94a44e89f9a143
2021-07-22 08:37:33 -07:00
Victor Quach
ee5a97de11 Register Saved Tensors hooks (#60663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60663

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29466223

fbshipit-source-id: 65dc3a935c18a0e6b93a37e24543c696e6ae0321
2021-07-15 08:09:55 -07:00
Victor Quach
a5e2ea4345 Add noop register hook (#60685)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60685

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29466224

fbshipit-source-id: 68c8aa022ccffeefd45062f1443d15c9a6824f3d
2021-06-30 07:46:34 -07:00
Victor Quach
f54290fd72 Expose raw saved tensors for custom functions (#60551)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60551

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29466228

fbshipit-source-id: 7565f6cc3f2488c7e444cf81c7eb37a60c75b0e8
2021-06-29 17:21:52 -07:00
Edward Yang
aacc722aec Dispatch to Python via __torch_dispatch__ (#59760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59760

See https://github.com/pytorch/pytorch/issues/59049

There are some moving parts to this PR, I'll structure this explanation so the straightforward parts go first, and then the less straightforward parts.

**The actual dispatch to Python.** The core logic of dispatch to Python lives in `concrete_dispatch_fn` in `torch/csrc/autograd/python_variable.cpp`. It takes the input IValue stack, scans all the arguments for Tensor arguments, and defers most of the heavy lifting to `handle_torch_function_no_python_arg_parser` which actually does all of the logic for calling out to torch dispatch (in particular, this function handles multiple dispatch situations for you). Because we have a different function name than regular `__torch_function__` handling, `handle_torch_function_no_python_arg_parser` is generalized to accept a magic method name to look for when testing if Tensors have custom handling or not. Unlike `__torch_function__`, by default there is no `__torch_dispatch__` on Tensor classes.
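
As a rough picture of "scan the arguments, then try handlers for a given magic method name", here is a pure-Python toy model. It is not the C++ function: `collect_overloaded_args`, `dispatch`, `Base`, and `Derived` are hypothetical, and the subclass-first ordering is borrowed from the `__torch_function__` protocol as an assumption.

```python
def collect_overloaded_args(args, magic_name):
    """Pick one argument per distinct type that defines magic_name."""
    overloaded = []
    for arg in args:
        arg_type = type(arg)
        if not hasattr(arg_type, magic_name):
            continue
        if any(type(o) is arg_type for o in overloaded):
            continue
        # Place subclasses ahead of their base classes; otherwise keep order.
        index = len(overloaded)
        for i, other in enumerate(overloaded):
            if issubclass(arg_type, type(other)):
                index = i
                break
        overloaded.insert(index, arg)
    return overloaded


def dispatch(func, args, magic_name="__torch_dispatch__"):
    for arg in collect_overloaded_args(args, magic_name):
        result = getattr(arg, magic_name)(func, args)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no handler for {func}")


class Base:
    def __torch_dispatch__(self, func, args):
        return f"Base:{func}"


class Derived(Base):
    def __torch_dispatch__(self, func, args):
        return f"Derived:{func}"
```

Passing a different `magic_name` reuses the same machinery for another protocol, which is the generalization the paragraph above describes.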

**Maintaining the Python dispatch key.** In order to get to the dispatch to Python logic, we must tag Tensors with the `__torch_dispatch__` magic method with the newly added Python dispatch key (separated from PythonFuncTorch to allow for a transitional period while they migrate to this mechanism). We expose a new private property `_is_python_dispatch` that assists in debugging if a Tensor is participating in Python dispatch or not. We apply the Python dispatch key the first time a PyObject for a Tensor is constructed (THPVariable_NewWithVar), testing if `__torch_dispatch__` exists with the newly added `check_has_torch_dispatch`.

**Shallow copy and detach.** For the simple examples tested in this PR, most creations of Tensor route through the dispatcher. The exception to this is `shallow_copy_and_detach`, which bypasses the dispatcher and is used when saving tensors for backwards. When a Tensor is Python dispatch, we override the behavior of `shallow_copy_and_detach` to instead directly call into `__torch_dispatch__` to perform a `detach` operation (in the same way it would be invoked if you called `detach` directly). Because this Python call is triggered directly from c10::TensorImpl, it must be indirected through `PyInterpreter::detach`, which is the general mechanism for dynamic dispatching to the Python interpreter associated with a TensorImpl.

**torchdeploy compatibility.** The dispatch to Python logic cannot be directly registered to the dispatcher as it is compiled in the Python library, which will get loaded multiple times per torchdeploy interpreter. Thus, we must employ a two phase process. First, we register a fallback inside a non-Python library (aten/src/ATen/core/PythonFallbackKernel.cpp). Its job is to determine the appropriate PyInterpreter to handle the Python dispatch by going through all of the arguments and finding the first argument that has a PyObject/PyInterpreter. With this PyInterpreter, it makes another dynamic dispatch via "dispatch" which will go to the correct torchdeploy interpreter to handle dispatching to actual Python.

**Testing.** We provide a simple example of a LoggingTensor for testing, which can be used to generate TorchScript-like traces to observe what operations are being called when a Tensor is invoked. Although a LoggingTensor would be better implemented via an is-a relationship rather than a has-a relationship (as is done in the test), we've done it this way to show that arbitrarily complex compositions of tensors inside a tensor work properly.

**Known limitations.**

* We haven't adjusted any operator code, so some patterns may not work (as they lose the Python subclass in an unrecoverable way)
* `__torch_function__` must be explicitly disabled with `_disabled_torch_function_impl` otherwise things don't work quite correctly (in particular, what is being disabled is default subclass preservation behavior.)
* We don't ever populate kwargs, even when an argument is kwarg-only

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision:
D29017912

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Pulled By: ezyang

fbshipit-source-id: a67714d9e541d09203a8cfc85345b8967db86238
2021-06-25 11:50:32 -07:00
Ilia Cherniavskii
11aa5e4f66 Add underscores to some internal names (#59027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59027

Add underscores to some of the internal names

Test Plan:
python test/test_profiler.py -v

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28724294

fbshipit-source-id: 1f6252e4befdf1928ac103d0042cbbf40616f74a
2021-05-27 09:39:28 -07:00
leslie-fang-intel
0ede83db7a enable torch.cpu.amp.autocast (#57386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57386

Here is the PR for what's discussed in the RFC https://github.com/pytorch/pytorch/issues/55374 to enable autocast for the CPU device. Currently, this PR only enables BF16 as the lower precision datatype.

Changes:
1.  Enable new API `torch.cpu.amp.autocast` for autocast on CPU device, including the Python API, C++ API, new DispatchKey, etc.
2.  Consolidate the implementation for each cast policy sharing between CPU and GPU devices.
3.  Add the operation lists to corresponding cast policy for cpu autocast.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572219

Pulled By: ezyang

fbshipit-source-id: db3db509973b16a5728ee510b5e1ee716b03a152
2021-05-20 17:48:36 -07:00
Jeffrey Wan
e71b526e7e Add inference mode python bindings and tests (#58045)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56608

 - Adds binding to the `c10::InferenceMode` RAII class in `torch._C._autograd.InferenceMode` through pybind. Also binds the `torch.is_inference_mode` function.
 - Adds context manager `torch.inference_mode` to manage an instance of `c10::InferenceMode` (global).  Implemented in `torch.autograd.grad_mode.py` to reuse the `_DecoratorContextManager` class.
 - Adds some tests based on those linked in the issue + several more for just the context manager
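
The shape of a guard that works both as a context manager and as a decorator can be sketched with the standard library alone. This is a toy model, not the real binding: `inference_mode_sketch` and `is_inference_mode_enabled` are names invented here, and a thread-local flag stands in for the state that `c10::InferenceMode` manages.

```python
import threading
from contextlib import ContextDecorator

_state = threading.local()  # stand-in for the state the C++ guard toggles


def is_inference_mode_enabled():
    return getattr(_state, "enabled", False)


class inference_mode_sketch(ContextDecorator):
    def __init__(self, mode=True):
        self.mode = mode
        self._previous = []  # a stack, so the same instance can be re-entered

    def __enter__(self):
        self._previous.append(is_inference_mode_enabled())
        _state.enabled = self.mode
        return self

    def __exit__(self, *exc):
        # Restore whatever was active before, like an RAII guard's destructor.
        _state.enabled = self._previous.pop()
        return False
```

Used as `with inference_mode_sketch():` or as `@inference_mode_sketch()` on a function, the flag is restored on exit even if the body raises.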

Issues/todos (not necessarily for this PR):
- Improve short inference mode description
- Small example
- Improved testing since there is no direct way of checking TLS/dispatch keys

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58045

Reviewed By: agolynski

Differential Revision: D28390595

Pulled By: soulitzer

fbshipit-source-id: ae98fa036c6a2cf7f56e0fd4c352ff804904752c
2021-05-13 08:55:35 -07:00
Ilia Cherniavskii
6997e7bd39 Update Kineto submodule (#58179)
Summary:
Update Kineto submodule, minor api changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58179

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D28391369

Pulled By: ilia-cher

fbshipit-source-id: 61fbf63d9ec2db66fac203944679e4b99cb0d568
2021-05-13 04:03:04 -07:00
Ilia Cherniavskii
2b99bce1d7 [profiler] CUDA event fallback (#58133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58133

Adding CUDA event fallback for cases when CUPTI tracing is not
available; this corresponds to the legacy profiler's GPU profiling

Test Plan: python test/test_profiler.py -v

Reviewed By: gdankel

Differential Revision: D28379596

Pulled By: ilia-cher

fbshipit-source-id: 2db3b2cd8c1c3e6e596784ab00a226c69db2ef27
2021-05-13 03:41:03 -07:00
Sujoy Saraswati
3c973de543 HABANA Device registration key and Autograd key addition (#57094)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57094

Reviewed By: mruberry

Differential Revision: D28355895

Pulled By: wconstab

fbshipit-source-id: 5d8b5762a69f444f4fe7f476891150fa5483d893
2021-05-12 13:07:33 -07:00
Ilia Cherniavskii
f1defeaea4 [profiler][resend] Add cuda memory and distributed metadata (#58010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58010

Resending https://github.com/pytorch/pytorch/pull/57252

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D28345161

Pulled By: ilia-cher

fbshipit-source-id: 18be07b275403205f5b5487ae3589bd39a8eac96
2021-05-12 02:04:48 -07:00
Ilia Cherniavskii
c714596027 [kineto] Update Kineto submodule, cupti library paths (#57789)
Summary:
Update kineto submodule, improve cupti detection

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57789

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28297175

Pulled By: ilia-cher

fbshipit-source-id: 5895270fae160097ae8872a592984d0e4a1b187b
2021-05-10 19:15:59 -07:00
Alban Desmaison
036167111d Revert D28294662: [pytorch][PR] add cuda memory and distributed metadata
Test Plan: revert-hammer

Differential Revision:
D28294662 (98fcdb8005)

Original commit changeset: 3c71ffa333e3

fbshipit-source-id: 7c96e13b227fe0dff60ccb1c57cfd6790f8591b7
2021-05-10 15:28:53 -07:00
Mike Guo
98fcdb8005 add cuda memory and distributed metadata (#57252)
Summary:
Implementation for https://github.com/pytorch/kineto/issues/155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57252

Reviewed By: gdankel

Differential Revision: D28294662

Pulled By: ilia-cher

fbshipit-source-id: 3c71ffa333e341ff8113e891681a4905f54802dc
2021-05-10 13:29:18 -07:00
Rohan Varma
7175d49122 [Dist profiling] Add is_async field (#57253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (https://github.com/pytorch/pytorch/pull/56963 tried to do so as well but this is a better approach).
ghstack-source-id: 128021158

Test Plan: CI

Reviewed By: walterddr, ilia-cher

Differential Revision: D28086719

fbshipit-source-id: 4473db4aed939a71fbe9db5d6655f3008347cb29
2021-05-04 17:44:28 -07:00
albanD
95dc2b6e9b Remove unused forward AD flag (#57058)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57058

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28071504

Pulled By: albanD

fbshipit-source-id: df694ac6b9fbb4aed269d61cd9522f8602fdae0c
2021-04-30 07:32:56 -07:00
Ilia Cherniavskii
3115728cba [profiler] Support for trace metadata (#56575)
Summary:
Adding support for user defined trace metadata

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56575

Test Plan: python test/test_profiler.py TestProfiler.test_profiler_metadata

Reviewed By: gdankel

Differential Revision: D27957876

Pulled By: ilia-cher

fbshipit-source-id: 8b6c254cca97eca23fc418e37e5772b207b0525a
2021-04-28 05:12:34 -07:00
Mike Guo
28f52649d8 add dtype information for input (#55358)
Summary:
Add dtype information for all inputs, in addition to the input dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55358

Reviewed By: heitorschueroff

Differential Revision: D27862346

Pulled By: ilia-cher

fbshipit-source-id: 656c5d6c9f23d723b27b44f0afc1a249ce1f3e44
2021-04-21 15:25:08 -07:00
Ilia Cherniavskii
728d18f976 Enable USE_KINETO (#51273)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51273

Reviewed By: malfet

Differential Revision: D26119144

fbshipit-source-id: eab0d17789c1eab89a7369f0574d3b4c2767c98a
2021-03-30 09:39:11 -07:00
Philip Meier
b0afe945a7 Fix pylint error torch.tensor is not callable (#53424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53424

Fixes https://github.com/pytorch/pytorch/issues/24807 and supersedes the stale https://github.com/pytorch/pytorch/issues/25093 (Cc Microsheep). If you now run the reproduction

```python
import torch

if __name__ == "__main__":
    t = torch.tensor([1, 2, 3], dtype=torch.float64)
```

with `pylint==2.6.0`, you get the following output

```
test_pylint.py:1:0: C0114: Missing module docstring (missing-module-docstring)
test_pylint.py:4:8: E1101: Module 'torch' has no 'tensor' member; maybe 'Tensor'? (no-member)
test_pylint.py:4:38: E1101: Module 'torch' has no 'float64' member (no-member)
```

Now `pylint` doesn't recognize `torch.tensor` at all, but it is promoted in the stub. Given that it also doesn't recognize `torch.float64`, I think fixing this is out of scope of this PR.

 ---

## TL;DR

This is BC-breaking only for users that rely on unintended behavior. Since `torch/__init__.py` loaded `torch/tensor.py`, it was populated in `sys.modules`. `torch/__init__.py` then overwrote `torch.tensor` with the actual function. As a result, `import torch.tensor as tensor` does not fail, but returns the function rather than the module. Users that rely on this import need to change it to `from torch import tensor`.
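
The import behavior described here can be reproduced without PyTorch by registering a throwaway package by hand. In this sketch `fakepkg` and its `tensor` submodule are invented stand-ins for `torch` and `torch/tensor.py`; it illustrates the mechanism only.

```python
import sys
import types

# Build a throwaway package "fakepkg" with a submodule "fakepkg.tensor".
pkg = types.ModuleType("fakepkg")
pkg.__path__ = []  # marks the module as a package
submodule = types.ModuleType("fakepkg.tensor")
sys.modules["fakepkg"] = pkg
sys.modules["fakepkg.tensor"] = submodule


def tensor(data):
    return list(data)


# Like the package's __init__, overwrite the attribute with a function.
pkg.tensor = tensor

# Succeeds because "fakepkg.tensor" is in sys.modules, but the alias is
# resolved through the package attribute, so it ends up being the function.
import fakepkg.tensor as alias

# The unambiguous spelling.
from fakepkg import tensor as direct
```

Here `alias` is the function, not the module object stored in `sys.modules`, which is why code relying on the old import keeps running but gets something different from what it asked for.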

Reviewed By: zou3519

Differential Revision: D26223815

Pulled By: bdhirsh

fbshipit-source-id: 125b9ff3d276e84a645cd7521e8d6160b1ca1c21
2021-03-09 11:32:53 -08:00
Edward Yang
0f81a69a96 Make meta a device (getting rid of empty_meta) (#53143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53143

Meta is now an honest to goodness device type, like cpu, so you can use
device='meta' to trigger allocation of meta tensors.  This is way better
than empty_meta since we now have a working API for most factory functions
(they don't necessarily work yet, though, because we need to register Meta
versions of those functions.)

Some subtleties:
- I decided to drop the concept of CPU versus CUDA meta tensors; meta
  tensors are device agnostic.  It's hard to say exactly what the
  correct level of abstraction here is, but in this particular case
  implementation considerations trump semantic considerations: it
  is way easier to have just a meta device, than to have a meta device
  AND a cpu device AND a cuda device.  This may limit the applicability
  of meta tensors for tracing models that do explicit cpu()/cuda()
  conversions (unless, perhaps, we make those operations no-ops on meta
  tensors).
- I noticed that the DeviceType uppercase strings are kind of weird.
  Are they really supposed to be all caps?  That's weird.
- I moved the Meta dispatch key to live with the rest of the "device"
  dispatch keys.
- I intentionally did NOT add a Backend for Meta.  For now, I'm going to
  hope meta tensors never exercise any of the Backend conversion code;
  even if it does, better to fix the code to just stop converting to and
  from Backend.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26763552

Pulled By: ezyang

fbshipit-source-id: 14633b6ca738e60b921db66a763155d01795480d
2021-03-03 11:24:13 -08:00
Bel H
30cb6ac53c Introduce mlc device (ML Compute device) to PyTorch's device list (#50634)
Summary:
Apple recently announced ML Compute, a new framework available in macOS Big Sur, which enables users to accelerate the training of neural networks on Mac hardware. This PR is the first in a series of PRs that will enable the integration with ML Compute. Most of the integration code will live in a separate subrepo named `mlc`.
The integration with `mlc` (ML Compute) will be very similar to that of xla. We rely on registering our ops through:

TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
 m.impl_UNBOXED(<op_schema_name>, &customized_op_kernel)
 ...
}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50634

Reviewed By: malfet

Differential Revision: D26614213

Pulled By: smessmer

fbshipit-source-id: 3b492b346c61cc3950ac880ac01a82fbdddbc07b
2021-02-24 22:39:11 -08:00
Xu Zhao
cae4379826 Enable FLOPS Computation for Experimental Kineto Profiler (#51503)
Summary:
Add the FLOPS metric computation to the experimental Kineto profiler.
This includes saving the necessary extra arguments and computing FLOPS in the C++ code,
and extracting the FLOPS value from the Python frontend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51503

Test Plan:
Build PyTorch with USE_KINETO option, then run the unit test:

```
python test/test_profiler.py -k test_flops
```

Reviewed By: ilia-cher

Differential Revision: D26202711

Pulled By: xuzhao9

fbshipit-source-id: 7dab7c513f454355a220b72859edb3ccbddcb3ff
2021-02-03 12:15:23 -08:00
Taylor Robie
839c2f235f treat Parameter the same way as Tensor (#48963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48963

This PR makes the binding code treat `Parameter` the same way as `Tensor`, unlike all other `Tensor` subclasses. This does change the semantics of `THPVariable_CheckExact`, but it isn't used much and it seemed to make sense for the half dozen or so places that it is used.

Test Plan: Existing unit tests. Benchmarks are in #48966

Reviewed By: ezyang

Differential Revision: D25590733

Pulled By: robieta

fbshipit-source-id: 060ecaded27b26e4b756898eabb9a94966fc9840
2021-01-10 19:18:31 -08:00
Ilia Cherniavskii
749f8b7850 Remove flops warnings from the default profiler use case (#49896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49896

Add missing check for with_flops option set

Test Plan:
python test/test_profiler.py
CI

Reviewed By: xuzhao9, ngimel

Differential Revision: D25716930

Pulled By: ilia-cher

fbshipit-source-id: 0da0bbb6c1a52328f665237e503406f877b41449
2020-12-30 23:49:29 -08:00
albanD
c23808d8e8 Reland: Add base forward grad logic (#49734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49734

RFC: https://github.com/pytorch/rfcs/pull/11

This PR adds the basic logic to handle forward grad as dual Tensors.
It contains the following:
- Mechanism to save dual state on a Tensor and clear it up when the dual level ends
- C++ and python user facing API
- Updated view system that is able to track both forward and backward views

The current PR has the following limitations:
- Extensive tests are in the next PR in the stack as formulas are needed to write full tests.
- Only the manual formulas have been audited and no other formula is actually implemented here (they are in the next PR in the stack)
- Only level 0 is allowed for now. It was discussed and agreed that more levels are not needed for the first version of this PR.
- We can save one ViewInfo creation when both the forward and backward views have the same base. This can be done by adding a boolean flag to the DifferentiableViewMeta and extra logic in the `as_view` method. This is left out to keep this PR concise.
- We can skip tracking forward views if the base has a forward grad. This can be done by adding extra logic in the `as_view` method. This is left out to keep this PR concise.

Reading guide:
- Updated view handling in [gen_variable_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-f6553cec68caeaea36f6c8b14ff76a6d39dfd774e0ea9ef2f76e8d81fd9af5df), [VariableTypeUtils.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-ec71cfa45954dece1236c661d170e6341879c5be637f4abf52e826d61b40695a), [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285) (skip code below "[Forward Grad View]" for now), [variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-1604bcd0e4350ed99ec45e437cee7ac9ebe337392c9ea16a236247aeeb35b02bR266-R542) and [custom_function.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-dd85f452082b5bb6612bbc12adb496f8827defa228509f7b493de1d517522d5d). This introduces the new ViewInfo to hold view information shared by forward and backward. It also updates the differentiable view meta to use it, and updates the as_view function to handle both forward and backward views.
- New forward grad class that handles storing gradients and tracking at each level [forward_grad.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c6c5b9ab2d7e5dde4102495faa1b6bbbfc23aa3e47deb7359c0bfe1eb004c0cb), [forward_grad.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-de2ab54ade7312701850d71a119a4f4ee4b9fc5a9c42a467cdd4e73c033531dd) and [build_variables.bzl](https://github.com/pytorch/pytorch/pull/49097/files#diff-dfdfa2efb17beddfd9094524f95351fd197db6c8857e96b436fb599870359325). EDIT: These files also contain the new flag to globally disable forward AD that allows us to reduce performance issues while this is in development.
- Lowest level API and binding between Tensor and AutogradMeta in [TensorBody.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-7554853205392fa743357bf845ecc350a974ec049383248c12daaf2f4de04911), [TensorImpl.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-052bd9150ef8e09289ddf644b5a6830ede49207201cd41728f6d7cc6d9cead94), [TensorImpl.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-a15aae4cf23da44970db7cece62ff981265575c798c62f7b52d87c8809dfe2e1) and the rest of [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285R557-R677)
- API to access the forward primal that needs to be a differentiable function (and so in native_functions.yaml) [native_functions.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991) [NamedRegistrations.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-69bd3bea510c9b64e1633fa18c3ea63d4b8348dbad3a78ad9de844ab3e43dc1d), [VariableMethodsStub.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-23f5fcb737a2b289811fe0f4b65aef775e7c824b2e629ecd343df51405cd434f), [derivatives.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_python_functions.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_trace_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-54e0b976027bf8debefb959ff360b89ae93466970c843365b1b3a03806d868ce), [TraceTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-f34636741ad4a23d018e0c289bc750c3bad887b45660e1d6eaf440d234a78fbf) and [part of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R198-R243)
- c++ API [autograd.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-349028fbe8291a965a7a263c323b208fe071c35c66179ee997ef84fa81aa4b1e), [autograd.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-a3fe908d67dfec16a1fcde300de68b0701bf68b88db7451f29f2bee255cf30c9)
- python binding [init.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-c58a67c85191c22c9b3bb439117d8053edfd9dea839fa010cf967d404c3c630d)
- python API [forward_ad.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a4efad4ba18fffdfb264c21e5475997a24a743089a899f8ec1a5ff962c6738d9), [autograd/__init__.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-743abcafd32ad0e69f39ac5a91df4197b7e1921c135cacee7ef6dc829a8a7af8)
- c++ and python printing [Formatting.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-881dba501e71662e2e4818b4b016f739b344c8aed2f5edc6b871eda47a2aced0), [_tensor_str.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a7911f8d5e73adbff914d99fd7818ace2a7030b6a3748abe06ec6fc6e3df9cc3)
- Utility for formulas and updated manual functions to respect new view system as well as forward grad [FunctionsManual.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-6378bb6dc81a64dab676d61731341fa5d1088418f32a1473a33a0ccfc2357dc1), [FunctionsManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-4adbd88239afcd60e8198aab65d4f5e43b62314e34b80551e997a1ea503adea5) [rest of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R264-R433)
- Ensure SavedVariable saves forward grad properly [saved_variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c1b8039d776241abe177d5aa99b79dd9489a9b3e529da8ab24c2e386c1238ae2), [saved_variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-cc9fba479b5beae06b2eea2e390d17796e0341c5b037a20b5bcaccbb0c341030)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D25678797

Pulled By: albanD

fbshipit-source-id: 3d58550c11b5f58b9b73fd30596d042b857fb9dd
2020-12-22 12:11:27 -08:00
Walter Shen
f5178bf151 Revert D25607503: Add base forward grad logic
Test Plan: revert-hammer

Differential Revision:
D25607503 (fdf02eff3d)

Original commit changeset: f1396290de1d

fbshipit-source-id: 057206e28ff48ee288856adfe3ca577d4880789f
2020-12-21 19:56:28 -08:00
albanD
fdf02eff3d Add base forward grad logic (#49097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49097

RFC: https://github.com/pytorch/rfcs/pull/11

This PR adds the basic logic to handle forward grad as dual Tensors.
It contains the following:
- Mechanism to save dual state on a Tensor and clear it up when the dual level ends
- C++ and python user facing API
- Updated view system that is able to track both forward and backward views

The current PR has the following limitations:
- Extensive tests are in the next PR in the stack as formulas are needed to write full tests.
- Only the manual formulas have been audited and no other formula is actually implemented here (they are in the next PR in the stack)
- Only level 0 is allowed for now. It was discussed and agreed that more levels are not needed for the first version of this PR.
- We can save one ViewInfo creation when both the forward and backward views have the same base. This can be done by adding a boolean flag to the DifferentiableViewMeta and extra logic in the `as_view` method. This is left out to keep this PR concise.
- We can skip tracking forward views if the base has a forward grad. This can be done by adding extra logic in the `as_view` method. This is left out to keep this PR concise.

Reading guide:
- Updated view handling in [gen_variable_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-f6553cec68caeaea36f6c8b14ff76a6d39dfd774e0ea9ef2f76e8d81fd9af5df), [VariableTypeUtils.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-ec71cfa45954dece1236c661d170e6341879c5be637f4abf52e826d61b40695a), [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285) (skip code below "[Forward Grad View]" for now), [variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-1604bcd0e4350ed99ec45e437cee7ac9ebe337392c9ea16a236247aeeb35b02bR266-R542) and [custom_function.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-dd85f452082b5bb6612bbc12adb496f8827defa228509f7b493de1d517522d5d). This introduces the new ViewInfo to hold view information shared by forward and backward. It also updates the differentiable view meta to use it, and updates the as_view function to handle both forward and backward views.
- New forward grad class that handles storing gradients and tracking at each level [forward_grad.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c6c5b9ab2d7e5dde4102495faa1b6bbbfc23aa3e47deb7359c0bfe1eb004c0cb), [forward_grad.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-de2ab54ade7312701850d71a119a4f4ee4b9fc5a9c42a467cdd4e73c033531dd) and [build_variables.bzl](https://github.com/pytorch/pytorch/pull/49097/files#diff-dfdfa2efb17beddfd9094524f95351fd197db6c8857e96b436fb599870359325). EDIT: These files also contain the new flag to globally disable forward AD that allows us to reduce performance issues while this is in development.
- Lowest level API and binding between Tensor and AutogradMeta in [TensorBody.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-7554853205392fa743357bf845ecc350a974ec049383248c12daaf2f4de04911), [TensorImpl.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-052bd9150ef8e09289ddf644b5a6830ede49207201cd41728f6d7cc6d9cead94), [TensorImpl.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-a15aae4cf23da44970db7cece62ff981265575c798c62f7b52d87c8809dfe2e1) and the rest of [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285R557-R677)
- API to access the forward primal that needs to be a differentiable function (and so in native_functions.yaml) [native_functions.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991) [NamedRegistrations.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-69bd3bea510c9b64e1633fa18c3ea63d4b8348dbad3a78ad9de844ab3e43dc1d), [VariableMethodsStub.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-23f5fcb737a2b289811fe0f4b65aef775e7c824b2e629ecd343df51405cd434f), [derivatives.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_python_functions.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_trace_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-54e0b976027bf8debefb959ff360b89ae93466970c843365b1b3a03806d868ce), [TraceTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-f34636741ad4a23d018e0c289bc750c3bad887b45660e1d6eaf440d234a78fbf) and [part of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R198-R243)
- c++ API [autograd.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-349028fbe8291a965a7a263c323b208fe071c35c66179ee997ef84fa81aa4b1e), [autograd.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-a3fe908d67dfec16a1fcde300de68b0701bf68b88db7451f29f2bee255cf30c9)
- python binding [init.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-c58a67c85191c22c9b3bb439117d8053edfd9dea839fa010cf967d404c3c630d)
- python API [forward_ad.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a4efad4ba18fffdfb264c21e5475997a24a743089a899f8ec1a5ff962c6738d9), [autograd/__init__.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-743abcafd32ad0e69f39ac5a91df4197b7e1921c135cacee7ef6dc829a8a7af8)
- c++ and python printing [Formatting.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-881dba501e71662e2e4818b4b016f739b344c8aed2f5edc6b871eda47a2aced0), [_tensor_str.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a7911f8d5e73adbff914d99fd7818ace2a7030b6a3748abe06ec6fc6e3df9cc3)
- Utility for formulas and updated manual functions to respect new view system as well as forward grad [FunctionsManual.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-6378bb6dc81a64dab676d61731341fa5d1088418f32a1473a33a0ccfc2357dc1), [FunctionsManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-4adbd88239afcd60e8198aab65d4f5e43b62314e34b80551e997a1ea503adea5) [rest of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R264-R433)
- Ensure SavedVariable saves forward grad properly [saved_variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c1b8039d776241abe177d5aa99b79dd9489a9b3e529da8ab24c2e386c1238ae2), [saved_variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-cc9fba479b5beae06b2eea2e390d17796e0341c5b037a20b5bcaccbb0c341030)

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D25607503

Pulled By: albanD

fbshipit-source-id: f1396290de1d75760f3d380c43cdd56e86fa6099
2020-12-21 14:39:43 -08:00
Xu Zhao
573f4aa352 FLOPS Roofline Analysis Feature for PyTorch Profiler. (#46506)
Summary:
FLOPs Roofline Analysis Feature for PyTorch Profiler.

Currently, PyTorch Profiler lacks the ability to measure the FLOPs of operators, such as mm and conv.
FLOPs are helpful to estimate the computation complexity of the operators.
For now, we use input shapes to estimate the number of floating-point operations.
In the future, we may compute this information by tracking hardware counters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46506

Test Plan:
Run `python test/test_profiler_flops.py -k test_flops`. The test prints a profiler table with an "MFLOPS" column, like the following:
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                   Input Shapes        MFLOPS
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                aten::matmul         0.06%      57.653us        82.97%      79.310ms      79.310ms             1                 [[40, 33, 1, 243], [243, 243]]            --
                    aten::mm        82.84%      79.186ms        82.86%      79.204ms      79.204ms             1                      [[1320, 243], [243, 243]]       984.323
                aten::conv2d         0.04%      36.345us        16.06%      15.347ms      15.347ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [  44065010.318
           aten::convolution         0.02%      16.016us        16.02%      15.310ms      15.310ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
          aten::_convolution         0.07%      63.855us        16.00%      15.294ms      15.294ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
    aten::mkldnn_convolution        15.89%      15.188ms        15.93%      15.225ms      15.225ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
                  aten::relu         0.10%      98.223us         0.64%     612.157us     306.079us             2                             [[40, 33, 1, 243]]            --
             aten::threshold         0.49%     465.416us         0.54%     513.934us     256.967us             2                     [[40, 33, 1, 243], [], []]            --
                  aten::add_         0.29%     279.301us         0.29%     279.301us     279.301us             1                  [[40, 33, 1, 243], [243], []]            --
                 aten::empty         0.10%      99.113us         0.10%      99.113us      24.778us             4                       [[], [], [], [], [], []]            --
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
Self CPU time total: 95.584ms

.
----------------------------------------------------------------------
Ran 1 test in 0.176s

For now, we only provide FLOPs calculation for aten::conv2d and aten::mm operators.
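
The aten::mm row in the table above can be reproduced by hand. The sketch below is illustrative only and is not the profiler's code: it assumes each multiply-add counts as one operation and that the MFLOPS column is that count divided by the operator's self CPU time. The function names `mm_op_count` and `mflops_rate` are hypothetical.

```python
def mm_op_count(shape_a, shape_b):
    """Estimate operations for a 2-D matrix multiply from input shapes.

    Assumption: one multiply-add is counted as a single operation.
    """
    m, k = shape_a
    k2, n = shape_b
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    return m * k * n


def mflops_rate(op_count, seconds):
    """Convert an operation count and a duration into millions of ops per second."""
    return op_count / seconds / 1e6


# Shapes and self CPU time taken from the aten::mm row of the table above.
ops = mm_op_count([1320, 243], [243, 243])
rate = mflops_rate(ops, 79.186e-3)
print(f"{ops} ops, {rate:.2f} MFLOPS")
```

Under these assumptions the result lands within rounding of the 984.323 shown for aten::mm, which suggests the column is a rate rather than a raw operation count.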

Reviewed By: ezyang

Differential Revision: D25214452

Pulled By: xuzhao9

fbshipit-source-id: 0ae841bd8dbdeb032346dc3d9d38e19875aa1da3
2020-12-17 21:19:25 -08:00
Scott Wolchok
22c6dafd33 [PyTorch] Use plain old function pointer for RecordFunctionCallback (reapply) (#49408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118665808

Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629

Reviewed By: malfet

Differential Revision: D25563207

fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
2020-12-15 19:16:01 -08:00
Mike Ruberry
25bc906281 Revert D25135415: [PyTorch] Use plain old function pointer for RecordFunctionCallback
Test Plan: revert-hammer

Differential Revision:
D25135415 (7e23ee1598)

Original commit changeset: 5e92dc79da64

fbshipit-source-id: 45b1634a100084c84dca158a1f16ca760fef6988
2020-12-14 21:04:27 -08:00
Scott Wolchok
7e23ee1598 [PyTorch] Use plain old function pointer for RecordFunctionCallback (#48629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25135415

fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
2020-12-14 20:08:16 -08:00
Scott Wolchok
900aa4ee97 [PyTorch] remove convenience RecordFunctionCallback interface (#48620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620

In preparation for storing bare function pointer (8 bytes)
instead of std::function (32 bytes).
ghstack-source-id: 118568242

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25132183

fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
2020-12-14 20:03:15 -08:00
Ilia Cherniavskii
f7a8bf2855 Use libkineto in profiler (#46470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470

Adding ability to use Kineto (CUPTI) to profile CUDA kernels

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Reviewed By: Chillee

Differential Revision: D25142223

Pulled By: ilia-cher

fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
Pritam Damania
2b221a9599 Remove PyCFunction casts as much as possible. (#46227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46227

Follow up from https://github.com/pytorch/pytorch/issues/45419, in
this PR I've removed as many PyCFunction casts as I could from the codebase.

The only ones I didn't remove were the ones with `METH_VARARGS | METH_KEYWORDS`,
which take 3 parameters instead of 2 and so still had to be cast. Example:
`{"copy_", (PyCFunction)(void(*)(void))THPStorage_(copy_), METH_VARARGS | METH_KEYWORDS, nullptr},`
ghstack-source-id: 114632704

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D24269435

fbshipit-source-id: 025cfd43a9a2a3e59f6b2951c1a78749193d77cf
2020-10-20 15:01:51 -07:00
Ilia Cherniavskii
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Add a with_source parameter that enables tracking of source code location
(filename and line) in the profiler for eager, TorchScript and autograd
modes

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
Rohan Varma
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this one makes it safe to call `disableProfiler()` from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than when the blocking `processRPC` call returns, as was done previously). By the time that future completes, the async function has been kicked off and its own future has completed, so we are able to capture any RPCs the function called as well as the actual work done on the other node.

For example, if the following async function is run on a server over RPC:

```
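import time

import torch
import torch.distributed.rpc as rpc

def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

# The decorator tells RPC that this function returns a Future.
@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))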
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s           1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s           1.006s           1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us         22.695us         1                3
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```
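
The deferral described above (disable the profiler when the request's future completes, not when the blocking call returns) can be sketched with the standard library alone. This is a toy model, not the RPC agent's code: `ToyProfiler` and `process_request` are hypothetical names, and a plain `concurrent.futures.Future` stands in for the RPC future.

```python
from concurrent.futures import Future


class ToyProfiler:
    """Hypothetical stand-in for the server-side profiler (not a PyTorch API)."""

    def __init__(self):
        self.enabled = False
        self.events = []

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def record(self, name):
        # Events are only captured while the profiler is enabled.
        if self.enabled:
            self.events.append(name)


def process_request(profiler, start_async_fn):
    """Enable profiling, start the async work, and defer the disable call."""
    profiler.enable()
    profiler.record("request_received")
    fut = start_async_fn()
    # Disable only once the future completes, so work done after this
    # function returns is still captured.
    fut.add_done_callback(lambda _f: profiler.disable())
    return fut


profiler = ToyProfiler()
pending = Future()  # stands in for the future of the async function
fut = process_request(profiler, lambda: pending)

enabled_after_return = profiler.enabled  # still profiling
profiler.record("nested_work")           # captured: future not done yet
pending.set_result(42)                   # completion triggers disable()
enabled_after_completion = profiler.enabled
```

Here "nested_work" is recorded even though `process_request` has already returned, which mirrors why the nested RPC and the remote `aten::add` appear in the output above.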

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00