Commit Graph

97 Commits

Author SHA1 Message Date
Yuxin Wu
c8ed84ad06 Fix a static initialization order fiasco in c10d (#90149)
The `TORCH_LIBRARY_IMPL` registrations in `OpsImpl.cpp` need to happen after `ProcessGroup` is registered as a torch class, which happens in `Ops.cpp`. However, the order of the registrations is undefined between the two files.

If the registration in `OpsImpl.cpp` runs before the one in `Ops.cpp`, we get a crash at program launch similar to #83255. This happens in our internal build.

This PR moves the contents of `OpsImpl.cpp` to the end of `Ops.cpp`, because, according to the omniscient lord of ChatGPT:
<img width="600" alt="2022-12-04_19-25" src="https://user-images.githubusercontent.com/1381301/205542847-3535b319-3c2a-4e8e-bc11-27913f6afb39.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90149
Approved by: https://github.com/kwen2501, https://github.com/H-Huang, https://github.com/soumith
2022-12-12 08:21:54 +00:00
Han Qi (qihqi)
25eb7c3ae3 Clean up dependency for flatbuffer_loader (#86041)
Test Plan: waitforsandcastle

Differential Revision: D38445936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86041
Approved by: https://github.com/cccclai
2022-12-08 03:48:04 +00:00
Richard Zou
4b1053497c [vmap] Prepend "legacy" to files for old vmap implementation (#90324)
We have an older torch.vmap implementation. It is no longer supported.
It still needs to exist somewhere for the sake of BC with
torch.autograd.functional.

This PR makes it clear which files are meant for the old vmap
implementation. I've seen a couple of recent PRs adding support
to the old vmap implementation, so this will lessen the confusion.
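
For context, a minimal usage sketch of the `torch.autograd.functional` entry point that, per the note above, still relies on the legacy vmap for BC (the routing claim below is based on that note):

```python
import torch

def f(x):
    return x.sin().sum(dim=0)

x = torch.randn(5, 3)
# vectorize=True is the torch.autograd.functional path that goes through the
# legacy vmap implementation kept around for backward compatibility.
jac = torch.autograd.functional.jacobian(f, x, vectorize=True)
print(jac.shape)  # torch.Size([3, 5, 3])
```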

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90324
Approved by: https://github.com/samdow
2022-12-07 18:46:15 +00:00
Emilio Castillo
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [#82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, use CUDA managed memory for portions of the code that do not fit in GPU memory, write static memory allocators to reduce fragmentation while training models, and improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void* ptr;
   std::cout << "alloc " << size << std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout << "free " << std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
Chen Lai
bca75fd2d3 Move xnnpack target to fb code base (#88909)
1. Move the source file list to `build_variables.bzl`, as it's the source of truth for both the internal Buck build and the OSS build
2. Move target definitions to the `fb` internal folder
3. Some changes were triggered by auto-formatting.

Differential Revision: [D40906961](https://our.internmc.facebook.com/intern/diff/D40906961/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40906961/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88909
Approved by: https://github.com/mcr229
2022-11-13 12:04:35 +00:00
Antoni Viros i Martin
c77368d416 Implement a constructor for nested_tensor that is similar to torch.tensor() (#88213)
Summary: This diff merges both previous implementations of constructors for nested tensors, the one from lists of tensors and the one with arbitrary Python lists, and implements the result in PyTorch core, so no extensions are needed to construct NTs.
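
A hedged usage sketch of the unified constructor (shapes and values are illustrative):

```python
import torch

# From a list of tensors that are ragged along the first dimension.
nt_from_tensors = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])

# From plain Python lists, analogous to torch.tensor().
nt_from_lists = torch.nested.nested_tensor([[0.0, 1.0], [2.0, 3.0, 4.0]])

print(nt_from_tensors.is_nested, nt_from_lists.is_nested)  # True True
```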

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88213
Approved by: https://github.com/cpuhrsch
2022-11-08 00:03:18 +00:00
jjsjann123
7b419e8513 [NVFuser] Upstream push 1026 (#87779)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

* codegen improvement:
    i. allow non-root trivial reductions, allow empty/no-op fusion
    ii. fixes vectorization checks and size calculation
    iii. bank conflict handle improvement
    iv. enables transpose scheduler

* misc:
    i. CI tests failure fixes
    ii. cpp tests file clean up
    iii. trivial forwarding supports added in codegen runtime
    iv. added factory methods support in codegen

Commits in this PR from the devel branch:

```
7117a7e37ebec372d9e802fdfb8abb7786960f4a patching nvfuser conv cudnn test numerics mismatch (#2048)
65af1a4e7013f070df1ba33701f2d524de79d096 Inserting sync for redundant parallel types is already done at the (#2023)
6ac74d181689c8f135f60bfc1ec139d88941c98c Fix sync map (#2047)
f5bca333355e2c0033523f3402de5b8aac602c00 Bank conflict checker improvements (#2032)
d2ca7e3fd203537946be3f7b435303c60fa7f51e Minor update on cp.async code generation. (#1901)
d36cf61f5570c9c992a748126287c4e7432228e0 Test file cleanup (#2040)
0b8e83f49c2ea9f04a4aad5061c1e7f4268474c6 Allow non-root trivial reductions (#2037)
a2dfe40b27cd3f5c04207596f0a1818fbd5e5439 Fix vectorize size calculation (#2035)
e040676a317fe34ea5875276270c7be88f6eaa56 Use withPredicate to replace setPredicate to maintain Exprs immutable (#2025)
197221b847ad5eb347d7ec1cf2706733aacbf97c removing ci workflow (#2034)
40e2703d00795526e7855860aa00b9ab7160755f Reduction rand like patch (#2031)
bc772661cbdb3b711d8e9854ae9b8b7052e3e4a3 Add utility for checking bank conflict of shared memory (#2029)
ddd1cf7695f3fb172a0e4bcb8e4004573617a037 Add back FusionReductionWithTrivialReduction_CUDA (#2030)
fbd97e5ef15fa0f7573800e6fbb5743463fd9e57 Revert "Cleanup trivial reduction workarounds (#2006)" (#2024)
bca20c1dfb8aa8d881fc7973e7579ce82bc6a894 Cleanup trivial reduction workarounds (#2006)
e4b65850eee1d70084105bb6e1f290651adde23e Trivial forwarding (#1995)
1a0e355b5027ed0df501989194ee8f2be3fdd37a Fix contiguity analysis of predicates to match updated contiguity. (#1991)
a4effa6a5f7066647519dc56e854f4c8a2efd2a7 Enable output allocation cache (#2010)
35440b7953ed8da164a5fb28f87d7fd760ac5e00 Patching bn inference (#2016)
0f9f0b4060dc8ca18dc65779cfd7e0776b6b38e8 Add matmul benchmark (#2007)
45045cd05ea268f510587321dbcc8d7c2977cdab Enable tests previously disabled due to an aliasing bug (#2005)
967aa77d2c8e360c7c01587522eec1c1d377c87e Contiguous indexing for View operations (#1990)
a43cb20f48943595894e345865bc1eabf58a5b48 Make inlining even more modular (#2004)
dc458358c0ac91dfaf4e6655a9b3fc206fc0c897 Test util cleanup (#2003)
3ca21ebe4d213f0070ffdfa4ae5d7f6cb0b8e870 More strict validation (#2000)
a7a7d573310c4707a9f381831d3114210461af01 Fix build problem (#1999)
fc235b064e27921fa9d6dbb9dc7055e5bae1c222 Just fixes comments (#1998)
482386c0509fee6edb2964c5ae72074791f3e43a cleanup (#1997)
4cbe0db6558a82c3097d281eec9c85ad2ea0893a Improve divisible split detection (#1970)
42ccc52bdc18bab0330f4b93ed1399164e2980c9 Minor build fix. (#1996)
fcf8c091f72d46f3055975a35afd06263324ede6 Cleanup of lower_utils.cpp: Isolate out GpuLower usage (#1989)
15f2f6dba8cbf408ec93c344767c1862c30f7ecc Move ConcretizedBroadcastDomains to shared_ptr in GpuLower. (#1988)
8f1c7f52679a3ad6acfd419d28a2f4be4a7d89e2 Minor cleanup lower_unroll.cpp (#1994)
1d9858c80319ca7f0037db7de5f04e47f540d76c Minor cleanup (#1992)
f262d9cab59f41c669f53799c6d4a6b9fc4267eb Add support for uniform RNG (#1986)
eb1dad10c73f855eb1ecb20a8b1f7b6edb0c9ea3 Remove non-const functions, remove GpuLower instance on build, pass in ca_map. (#1987)
634820c5e3586c0fe44132c51179b3155be18072 Add support for some empty fusion (#1981)
eabe8d844ad765ee4973faa4821d451ef71b83c3 Segment self mapping fusions (#1954)
e96aacfd9cf9b3c6d08f120282762489bdf540c8 Enable Transpose operation (#1882)
425dce2777420248e9f08893765b5402644f4161 Add a null scheduler that helps segmenting away no-op schedules (#1835)
306d4a68f127dd1b854b749855e48ba23444ba60 Fix canScheduleCompileTime check of transpose scheduler (#1969)
b1bd32cc1b2ae7bbd44701477bddbcfa6642a9be Minor fix (#1967)
bd93578143c1763c1e00ba613a017f8130a6b989 Enable transpose scheduler (#1927)
b7a206e93b4ac823c791c87f12859cf7af264a4c Move scheduler vectorize utilities into their own file (#1959)
d9420e4ca090489bf210e68e9912bb059b895baf View scheduling (#1928)
c668e13aea0cf21d40f95b48e0163b812712cdf2 Upstream push ci fixes (#1965)
c40202bb40ce955955bb97b12762ef3b6b612997 Fix dump effective bandwidth (#1962)
93505bcbb90a7849bd67090fe5708d867e8909e4 WAR on index mapping when exact and permissive maps differ (#1960)
45e95fd1d3c773ee9b2a21d79624c279d269da9f Allow splitting inner-most ID to create virtual innermost ID in transpose scheduler (#1930)
a3ecb339442131f87842eb56955e4f17c544e99f Improve the comments at the beginning of index_compute.h (#1946)
f7bc3417cc2923a635042cc6cc361b2f344248d6 Remove unused variables (#1955)
df3393adbb5cb0309d091f358cfa98706bd4d313 Some cleanup (#1957)
7d1d7c8724ab5a226fad0f5a80feeac04975a496 TVDomainGuard factory (#1953)
357ba224c0fb41ed3e4e8594d95599c973f4a0ca Fill allocation with nan on tests (#1956)
8eafc54685d406f5ac527bcbacc475fda4492d7a Fix detection of unmappable root domains (#1952)
90a51f282601ba8ebd4c84b9334efd7762a234bc Some indexing cleanups, Add eye support (#1940)
ddc01e4e16428aec92f9c84d698f959b6436a971 Exclude unsupported data types (#1951)
992e17c0688fe690c51b50e81a75803621b7e6aa test the groups the same order as they are merged (#1949)
208262b75d1fed0597a0329d61d57bc8bcd7ff14 Move detection of self mapping IDs to IterDomainGraph from (#1941)
ac4de38c6ee53b366e85fdfe408c3642d32b57df Merge pull request #1945 from csarofeen/master_merge_0828
631094891a96f715d8c9925fb73d41013ca7f2e3 Add full, full_like, zeros, zeros_like, ones, ones_like (#1943)
aab10bce4541204c46b91ff0f0ed9878aec1bfc4 Merge remote-tracking branch 'upstream/viable/strict' into HEAD
4c254c063bb55887b45677e3812357556a7aa80d Fix arange when step is negative (#1942)
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D40869846](https://our.internmc.facebook.com/intern/diff/D40869846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87779
Approved by: https://github.com/davidberard98
2022-11-04 20:04:34 +00:00
Digant Desai
0fc7de3986 [profiler] Add Linux Perf support (#87866)
* Add support for using the Linux kernel perf subsystem via the profiler.
* For now, perf configurability is limited to just event names; threading etc. will come later.
* Since we want to support a variety of CPU types, the list of events (beyond the standard set) is also limited.
* Rather than failing with an unsupported-feature error on non-Linux platforms, it returns zeros for all event counts.
* For now, the maximum number of events is capped at 4, and time multiplexing is not allowed.
* The threadpool-recreate hack is restricted to mobile only; better support for threading in general still needs to be added.

Differential Revision: [D40238033](https://our.internmc.facebook.com/intern/diff/D40238033/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40238033/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87866
Approved by: https://github.com/SS-JIA
2022-11-02 13:42:24 +00:00
Christian Puhrsch
6fe41e76a9 Create separate files for NT Unary, Binary and Matmul ops (#88091)
Improves code organization and code share.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88091
Approved by: https://github.com/drisspg
2022-10-31 20:10:07 +00:00
Edward Z. Yang
d3c01c722d Fix pybind11 problems with c10::SymInt unregistered (#88011)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88011
Approved by: https://github.com/weiwangmeta, https://github.com/albanD
2022-10-29 07:55:45 +00:00
Taylor Robie
fb64f7b804 [Profiler][Trivial] Move ID assignment code to data_flow.cpp (#87670)
ID assignment has become a very complex facet of the profiler. The existing code has grown organically as I've discovered various refinements and has become very difficult to understand or reason about. (With more complexity coming in https://github.com/pytorch/pytorch/pull/87133)

I want to take a step back and add some structure and additional comments to the ID assignment algorithm. Before I do, however, it's time to move it out of `collection.cpp` to a dedicated data flow file.

Differential Revision: [D40666360](https://our.internmc.facebook.com/intern/diff/D40666360/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40666360/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87670
Approved by: https://github.com/slgong-fb
2022-10-28 18:40:18 +00:00
Edward Z. Yang
1ff52225f1 Unify SymIntNode and SymFloatNode into SymNode (#87817)
This refactor was prompted by challenges handling mixed int/float
operations in C++.  A previous version of this patch
added overloads for each permutation of int/float and was unwieldy
(https://github.com/pytorch/pytorch/pull/87722/).  This PR takes a different
approach.

The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode.  This is type erased; we
no longer know statically at C++ if we have an int/float and have to test
it with the is_int()/is_float() virtual methods.  This has a number of
knock on effects.

- We no longer have C++ classes to bind to Python.  Instead, we take an
  entirely new approach to our Python API, where we have a SymInt/SymFloat
  class defined entirely in Python, which hold a SymNode (which corresponds
  to the C++ SymNode).  However, SymNode is not pybind11-bound; instead,
  it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode
  when it goes into C++.  This implies a userland rename.

  In principle, it is also possible for the canonical implementation of SymNode
  to be written in C++, and then bound to Python with pybind11 (we have
  this code, although it is commented out.)  However, I did not implement
  this as we currently have no C++ implementations of SymNode.

  Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
  code needs to know how to find these classes.  Currently, this is done
  just by manually importing torch and getting the attributes.

- Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now
  takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
  __torch_dispatch__ works.

Some miscellaneous improvements:

- SymInt now has a constructor that takes SymNode.  Note that this
  constructor is ambiguous if you pass in a subclass of SymNode,
  so an explicit downcast is necessary.  This means toSymFloat/toSymInt
  are no more.  This is a mild optimization as it means rvalue reference
  works automatically.

- We uniformly use the caster for c10::SymInt/SymFloat, rather than
  going the long way via the SymIntNode/SymFloatNode.

- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
  functions, pretty sure this doesn't do anything.

- guard_int is now a free function, since to guard on an int you cannot
  assume the method exists.  A function can handle both int and SymInt
  inputs.

- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
  ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
  plain methods; this is to help avoid confusion between the two types.
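
A minimal, self-contained sketch of the wrapper pattern described above: a user-facing class holds a type-erased node and forwards magic methods to it. This is illustrative only; the class and method names below are simplified stand-ins, not the real torch.SymInt/SymNode implementation.

```python
class Node:
    """Type-erased symbolic value (a stand-in for SymNode)."""
    def __init__(self, value):
        self._value = value

    def is_int(self):
        return isinstance(self._value, int)

    def add(self, other):
        return Node(self._value + other._value)

    def guard_int(self):
        # Trivial here; in general this would record/check a guard before
        # returning a concrete value.
        assert self.is_int()
        return self._value


class SymIntLike:
    """User-facing wrapper; only this class gets magic methods."""
    def __init__(self, node):
        self.node = node

    def __add__(self, other):
        if not isinstance(other, SymIntLike):
            other = SymIntLike(Node(other))
        return SymIntLike(self.node.add(other.node))

    def __int__(self):
        return self.node.guard_int()


s = SymIntLike(Node(3)) + 4
print(int(s))  # 7
```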

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
2022-10-27 20:56:02 +00:00
Jiewen Tan
2205f56f46 [LTC] Remove lazy::View (#87822)
Summary:
This is the first part to remove the whole view and aliasing infrastructure in LTC, which is deprecated in favor of functionalization. It mainly removes things that use lazy::View.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87822
Approved by: https://github.com/JackCaoG, https://github.com/antoniojkim, https://github.com/wconstab
2022-10-27 20:39:30 +00:00
Antoni Viros i Martin
d94e33f041 Add support for .to() for NestedTensor backends (#87146)
Summary: This commit adds support for moving NestedTensors from CPU to GPU and back. The implementation requires implementing empty_like(), which is based on PR #83140.
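
A hedged usage sketch of what this enables (assuming a CUDA device is available):

```python
import torch

nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
if torch.cuda.is_available():
    nt_cuda = nt.to("cuda")     # CPU -> GPU
    nt_cpu = nt_cuda.to("cpu")  # GPU -> back to CPU
    print(nt_cuda.is_cuda, nt_cpu.is_cuda)  # True False
```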

Test Plan: Added a new unit test based on the unit test for the main .to() implementation. All unit tests must pass, as well as every sandcastle job.

Differential Revision: D40437585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87146
Approved by: https://github.com/drisspg
2022-10-20 03:46:50 +00:00
PyTorch MergeBot
8eb579e362 Revert "[Profiler] Move legacy profiler out of torch/csrc/autograd (#85512)"
This reverts commit 157a3d2a7c.

Reverted https://github.com/pytorch/pytorch/pull/85512 on behalf of https://github.com/DanilBaibak because deleted files caused the internal build to fail. Please re-submit via codev.
2022-10-14 14:56:59 +00:00
Taylor Robie
157a3d2a7c [Profiler] Move legacy profiler out of torch/csrc/autograd (#85512)
The legacy profiler is an eyesore in the autograd folder. At this point the implementation is almost completely decoupled from the rest of the profiler, and it is in maintenance mode pending deprecation.

As a result, I'm moving it to `torch/csrc/profiler/standalone`. Unfortunately, BC requires that the symbols remain in `torch::autograd::profiler`, so I've put some basic forwarding logic in `torch/csrc/autograd/profiler.h`.

One strange bit is that `profiler_legacy.h` forward declares `torch::autograd::Node`, but doesn't seem to do anything with it. I think we can delete it, but I want to test to make sure.

(Note: this should not land until https://github.com/pytorch/torchrec/pull/595 is landed.)

Differential Revision: [D39108648](https://our.internmc.facebook.com/intern/diff/D39108648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85512
Approved by: https://github.com/aaronenyeshi
2022-10-14 05:38:48 +00:00
Taylor Robie
35fb007749 [Profiler][Minor] Separate standalone profilers from the main PyTorch profiler. (#85511)
There are a number of instrumentation utils which have been added to the profiler toolkit. They are generally small and self-contained, often wrapping vendor APIs (NVTX, ITT).

They don't really interact with the much more expansive machinery of the PyTorch profiler beyond registration / unregistration, minor util sharing, and reusing the profiler base class. Just as in the case of stubs, it makes sense to group them in a dedicated subfolder.

Differential Revision: [D39108649](https://our.internmc.facebook.com/intern/diff/D39108649/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39108649/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85511
Approved by: https://github.com/albanD
2022-10-14 05:38:48 +00:00
Taylor Robie
b8f14b7877 [Profiler][Minor] Group and consolidate stub APIs (#85510)
There is a concept in profiler of a stub that wraps a profiling API. It was introduced for CUDA profiling before Kineto, and ITT has adopted it to call into VTune APIs. However for the most part we don't really interact with them when developing the PyTorch profiler.

Thus it makes sense to unify the fallback registration mechanism and create a subfolder to free up real estate in the top level `torch/csrc/profiler` directory.

Differential Revision: [D39108647](https://our.internmc.facebook.com/intern/diff/D39108647/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85510
Approved by: https://github.com/aaronenyeshi
2022-10-14 05:38:46 +00:00
Jason Ansel
f1fdb6efbd Manual changes for moving dynamo to core (#86621)
This is the subset of the changes in #86461 not auto-generated by `copy_to_core.sh`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86621
Approved by: https://github.com/albanD
2022-10-11 23:01:21 +00:00
Zachary DeVito
736adc0808 Memory snapshots from C++ (#86190)
Sometimes the driving process wants to save memory snapshots but isn't Python.
Add a simple API to turn it on without Python stack traces. It still
saves to the same format used by the visualization and summary scripts, using
the C++ Pickler.
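
For rough comparison, a related Python-side entry point for inspecting caching-allocator state (this is the pre-existing public API and may not share the exact snapshot format this PR's C++ path emits):

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    snapshot = torch.cuda.memory_snapshot()  # list of per-segment allocator state
    print(len(snapshot), "segments")
```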
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86190
Approved by: https://github.com/ezyang
2022-10-05 07:36:39 +00:00
Horace He
0e256c2550 removed compile cache and static argnums (#85783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85783
Approved by: https://github.com/wconstab
2022-09-28 08:33:59 +00:00
Howard Huang
ccac8d13d5 [3/N] [Dispatchable Collectives] Update broadcast_ with CPU and CUDA implementations (#83735)
### About this PR
* Update the broadcast op to dispatch to CPU and CUDA implementations. Right now both perform the same logic, so this is essentially a no-op.
* Add a test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated so that the process group selects its CPU and CUDA backend, respectively.
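
A conceptual sketch of the device-based dispatch described above, in plain Python rather than the real c10d dispatcher; the backend functions here are hypothetical placeholders:

```python
import torch

def broadcast_cpu(tensor, src):
    # hypothetical CPU backend implementation
    return tensor

def broadcast_cuda(tensor, src):
    # hypothetical CUDA backend implementation
    return tensor

_BACKENDS = {"cpu": broadcast_cpu, "cuda": broadcast_cuda}

def broadcast(tensor, src=0):
    # Route to a backend implementation based on the tensor's device type.
    impl = _BACKENDS.get(tensor.device.type)
    if impl is None:
        raise RuntimeError(f"no broadcast backend for device '{tensor.device.type}'")
    return impl(tensor, src)

broadcast(torch.ones(3), src=0)  # routes to the CPU implementation
```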

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83735
Approved by: https://github.com/kwen2501
2022-09-28 03:24:06 +00:00
cpuhrsch
6a04df3ac8 Get flash_attn to compile for CUDA 11.6 linux nightly build (#84941)
This PR only attempts to get this code to compile for all archs so that we can dispatch to it in https://github.com/pytorch/pytorch/pull/84653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84941
Approved by: https://github.com/drisspg, https://github.com/malfet
2022-09-26 20:49:19 +00:00
nikitaved
12ae3bea43 Faster mul(sparse, sparse) with broadcasting in dense dims. (#85336)
This is a combo PR of https://github.com/pytorch/pytorch/pull/84929 and ~https://github.com/pytorch/pytorch/pull/83428~.

Preliminary benchmarks (square matrices of shape (n, n)).

<details>

<summary>Script</summary>

```python
import torch
import math
from IPython import get_ipython
from itertools import product, repeat
import pickle
from torch.utils.benchmark import Timer, Compare

torch.manual_seed(13)

problem_dims = (
    # n > nnz
    (10000, 100),
    (100000, 1000),
    (1000000, 10000),
    # n < nnz
    (10, 100),
    (10, 1000),
    (10, 10000),
    (100, 1000),
    (100, 10000),
    (1000, 10000),
    (1000, 100000),
    (1000, 1000000),
    #(1000000, 1000000000),
)

name = "PR"
device = "cuda"
results = []

for n, nnz in problem_dims:
    def gen_tensor(coalesce=False):
        shape = (n, n)
        nrows, ncols = shape
        rowidx = torch.randint(low=0, high=nrows, size=(nnz,), device=device)
        colidx = torch.randint(low=0, high=ncols, size=(nnz,), device=device)
        itemidx = torch.vstack((rowidx, colidx))
        xvalues = torch.randn(nnz, device=device)
        itemidx = torch.hstack((itemidx, itemidx))
        xvalues = torch.hstack((xvalues, xvalues))
        res = torch.sparse_coo_tensor(itemidx, xvalues, size=shape)
        if coalesce:
            return res.coalesce()
        else:
            return res

    for x_coalesce, y_coalesce in product(*repeat((True, False), 2)):
        x = gen_tensor(x_coalesce)
        y = gen_tensor(y_coalesce)
        smtp = "x * y"
        timer = Timer(smtp,
                      globals=globals(),
                      label="coo.mul",
                      description=f"{name}: mul, device: {device}",
                      sub_label=f"n={n}, nnz={nnz}, coalesce=({x_coalesce, y_coalesce})",
                      num_threads=torch.get_num_threads())
        results.append(timer.blocked_autorange())

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

with open(f"{name}_{device}_mul.pickle", 'wb') as f:
    pickle.dump(results, f)

```

</details>

<details>

<summary>Gather results</summary>

```python
import pickle
from torch.utils.benchmark import Timer, Compare

files = [
        "PR",
        "master"
        ]

device = 'cuda'

timers = []
for name in files:
    with open("{}_{}_mul.pickle".format(name, device), 'rb') as f:
        timers += pickle.load(f)

compare = Compare(timers)
compare.trim_significant_figures()
compare.print()

```

</details>

<details>

<summary>CUDA</summary>

```
[------------------------------------------------- coo.mul -------------------------------------------------]
                                                       |  PR: mul, device: cuda  |  master: mul, device: cuda
24 threads: -------------------------------------------------------------------------------------------------
      n=10000, nnz=100, coalesce=((True, True))        |             95          |                91
      n=10000, nnz=100, coalesce=((True, False))       |             87          |               242
      n=10000, nnz=100, coalesce=((False, True))       |             87          |               226
      n=10000, nnz=100, coalesce=((False, False))      |            130          |               371
      n=100000, nnz=1000, coalesce=((True, True))      |            100          |               521
      n=100000, nnz=1000, coalesce=((True, False))     |             90          |               649
      n=100000, nnz=1000, coalesce=((False, True))     |            100          |               659
      n=100000, nnz=1000, coalesce=((False, False))    |            200          |               781
      n=1000000, nnz=10000, coalesce=((True, True))    |            100          |              4861
      n=1000000, nnz=10000, coalesce=((True, False))   |            100          |              5012
      n=1000000, nnz=10000, coalesce=((False, True))   |             98          |              5010
      n=1000000, nnz=10000, coalesce=((False, False))  |            384          |              5174
      n=10, nnz=100, coalesce=((True, True))           |            100          |                79
      n=10, nnz=100, coalesce=((True, False))          |            100          |               221
      n=10, nnz=100, coalesce=((False, True))          |            100          |               221
      n=10, nnz=100, coalesce=((False, False))         |            100          |               350
      n=10, nnz=1000, coalesce=((True, True))          |            100          |               100
      n=10, nnz=1000, coalesce=((True, False))         |            100          |               240
      n=10, nnz=1000, coalesce=((False, True))         |            100          |               254
      n=10, nnz=1000, coalesce=((False, False))        |            100          |               392
      n=10, nnz=10000, coalesce=((True, True))         |            100          |               110
      n=10, nnz=10000, coalesce=((True, False))        |            110          |               286
      n=10, nnz=10000, coalesce=((False, True))        |            110          |               286
      n=10, nnz=10000, coalesce=((False, False))       |            271          |               455
      n=100, nnz=1000, coalesce=((True, True))         |            110          |               851
      n=100, nnz=1000, coalesce=((True, False))        |            110          |              1000
      n=100, nnz=1000, coalesce=((False, True))        |            110          |               990
      n=100, nnz=1000, coalesce=((False, False))       |            140          |              1124
      n=100, nnz=10000, coalesce=((True, True))        |            110          |              5137
      n=100, nnz=10000, coalesce=((True, False))       |            110          |              5391
      n=100, nnz=10000, coalesce=((False, True))       |            100          |              5405
      n=100, nnz=10000, coalesce=((False, False))      |            249          |              5539
      n=1000, nnz=10000, coalesce=((True, True))       |            100          |              8598
      n=1000, nnz=10000, coalesce=((True, False))      |            100          |              8800
      n=1000, nnz=10000, coalesce=((False, True))      |            100          |              8782
      n=1000, nnz=10000, coalesce=((False, False))     |            255          |              8956
      n=1000, nnz=100000, coalesce=((True, True))      |            120          |             84500
      n=1000, nnz=100000, coalesce=((True, False))     |            200          |             88560
      n=1000, nnz=100000, coalesce=((False, True))     |            160          |             89000
      n=1000, nnz=100000, coalesce=((False, False))    |            373          |             89000
      n=1000, nnz=1000000, coalesce=((True, True))     |            312          |            606400
      n=1000, nnz=1000000, coalesce=((True, False))    |           1340          |            609200
      n=1000, nnz=1000000, coalesce=((False, True))    |           1340          |            609100
      n=1000, nnz=1000000, coalesce=((False, False))   |           4408          |            611400

Times are in microseconds (us).
```

</details>

<details>

<summary>CPU</summary>

```
[------------------------------------------------ coo.mul ------------------------------------------------]
                                                       |  PR: mul, device: cpu  |  master: mul, device: cpu
24 threads: -----------------------------------------------------------------------------------------------
      n=10000, nnz=100, coalesce=((True, True))        |              8         |                8
      n=10000, nnz=100, coalesce=((True, False))       |             32         |               34
      n=10000, nnz=100, coalesce=((False, True))       |             32         |               34
      n=10000, nnz=100, coalesce=((False, False))      |             41         |               56
      n=100000, nnz=1000, coalesce=((True, True))      |             24         |               24
      n=100000, nnz=1000, coalesce=((True, False))     |             90         |              100
      n=100000, nnz=1000, coalesce=((False, True))     |             87         |              100
      n=100000, nnz=1000, coalesce=((False, False))    |            231         |              255
      n=1000000, nnz=10000, coalesce=((True, True))    |            190         |              200
      n=1000000, nnz=10000, coalesce=((True, False))   |            908         |             2023
      n=1000000, nnz=10000, coalesce=((False, True))   |            800         |             2036
      n=1000000, nnz=10000, coalesce=((False, False))  |           3684         |             3989
      n=10, nnz=100, coalesce=((True, True))           |              8         |                7
      n=10, nnz=100, coalesce=((True, False))          |             34         |               30
      n=10, nnz=100, coalesce=((False, True))          |             33         |               30
      n=10, nnz=100, coalesce=((False, False))         |             44         |               50
      n=10, nnz=1000, coalesce=((True, True))          |              8         |                7
      n=10, nnz=1000, coalesce=((True, False))         |            100         |              100
      n=10, nnz=1000, coalesce=((False, True))         |            130         |              100
      n=10, nnz=1000, coalesce=((False, False))        |            746         |              210
      n=10, nnz=10000, coalesce=((True, True))         |              8         |                7
      n=10, nnz=10000, coalesce=((True, False))        |           1000         |             1500
      n=10, nnz=10000, coalesce=((False, True))        |           1000         |             1510
      n=10, nnz=10000, coalesce=((False, False))       |           3063         |             2457
      n=100, nnz=1000, coalesce=((True, True))         |             25         |               25
      n=100, nnz=1000, coalesce=((True, False))        |            180         |              130
      n=100, nnz=1000, coalesce=((False, True))        |            200         |              130
      n=100, nnz=1000, coalesce=((False, False))       |            271         |              255
      n=100, nnz=10000, coalesce=((True, True))        |            100         |              100
      n=100, nnz=10000, coalesce=((True, False))       |           2444         |             2290
      n=100, nnz=10000, coalesce=((False, True))       |           2455         |             2357
      n=100, nnz=10000, coalesce=((False, False))      |           5316         |             3783
      n=1000, nnz=10000, coalesce=((True, True))       |            204         |              211
      n=1000, nnz=10000, coalesce=((True, False))      |           2457         |             2480
      n=1000, nnz=10000, coalesce=((False, True))      |           2448         |             2539
      n=1000, nnz=10000, coalesce=((False, False))     |           3665         |             4801
      n=1000, nnz=100000, coalesce=((True, True))      |           2293         |             2374
      n=1000, nnz=100000, coalesce=((True, False))     |           9000         |            24620
      n=1000, nnz=100000, coalesce=((False, True))     |           8000         |            25080
      n=1000, nnz=100000, coalesce=((False, False))    |          26500         |            47650
      n=1000, nnz=1000000, coalesce=((True, True))     |          10000         |            13000
      n=1000, nnz=1000000, coalesce=((True, False))    |          80000         |           362200
      n=1000, nnz=1000000, coalesce=((False, True))    |          78050         |           392600
      n=1000, nnz=1000000, coalesce=((False, False))   |         312100         |           766900

Times are in microseconds (us).
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85336
Approved by: https://github.com/cpuhrsch
2022-09-23 23:31:19 +00:00
jjsjann123
0e582fbfcc [NVFuser] Upstream push 0907 (#84626)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

* codegen improvement:
    i. improved view support on pointwise and transpose scheduler
    ii. grouped grid welford added for better outer-norm grid persistence in normalization

* misc:
    i. new composite ops added: variance_mean, arange
    ii. fixes misaligned address for transpose scheduler
    iii. refactor on separation of compilation API from execution API to prepare us for async compilation
    iv. double type support on expression evaluator
    v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits in this PR from the devel branch:
```
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
b2fd01ea9346712c6d6f623ca6addbc4888d008e arange support (#1933)
56c00fd3922dad7dfc57351ad7d780f0f2f8e4ed Double support on all expression evaluators (#1937)
371f28223e57fe3f6b5e50a0a45177e6a5c0785c Improve trivial reduction merge support (#1931)
1d0c26790e5647920b40d419d26815bbe310b3a6 Test `rand` in a fusion with zero tensor input (#1932)
0dab160fb2177d178eef3148c6a529e0855009e9 Fix softmax bwd sizes. (#1890)
ef98f360f6d3e3e1cc662ecb65202d88150f128d Fix a bug (#1936)
63132a0c56508c550084b07fb76a3df865102d00 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c88d78078ee4d8b21c4fc51645b5710a282 Map IterationDomains through view operations. (#1919)
c0a187a7619d7cf9dc920294e15461791e8d6d4d do not use deprecated functions (#1935)
88de85e758c5e4afb7b6e746573c0d9a53b4cea7 Upstream cherry pick fixes 0811 (#1934)
b247dcf7c57dc6ac3f7a799b0a6beb7770536a74 Separate kernel compilation API from kernel execution API (#1914)
b34e3b93ee1a8030730c14af3995dd95665af07d Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6707f43bf760494c238a46386d69830822 Nullary RNGOp (#1892)
3c3c89e638f5172cafb0761f22bacd1fd695eec3 Misc fixes/tuning for transpose scheduler (#1912)
20cf109c8b44d48f61977e35bae94368985144ac Grouped grid welford (#1921)
6cf7eb024c9e53c358cbe56597e117bad56efefd Transpose scheduler small dim sizes better support (#1910)
9341ea9a5bf42f9b14ccad0c94edbc79fc5bb552 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f66deeea816bb943d802a97c1b7e4414ab Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80339e4f794767a53eb8fdd61e64cf404a2 Add variance_mean function using Welford (#1907)
98febf6aa3b8c6fe4fdfb2864cda9e5d30089262 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33a5591b534cf587d347af11e48ba7a15d4 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953351f9dabfd1f991d8431cecb6c2ce684d Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a72385f6a4b145680a9dcc52d7e8293763 dopt is only available since nvrtc 11.7 (#1915)
2ec8fc711eafc72451eebf0f5e2a98a38bf3f6ef Kill computeAtBetween (#1911)
d0d106a1d9af118d71673173674e875be35d259d Improve view support on pointwise and transpose scheduler (#1906)
e71e1ecefe67219846070590bbed54bbc7416b79 Fix name clash of RNG with shared memory (#1904)
3381793a253689abf224febc73fd3fe2a0dbc921 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84626
Approved by: https://github.com/malfet
2022-09-23 20:29:48 +00:00
Richard Zou
5e5c319549 Move functorch python bindings to torch/csrc (#85426)
This moves functorch's python bindings to torch/csrc/functorch/init.cpp.
Coming next is the torchdim move. I didn't do torchdim yet because
moving functorch's python bindings unblocks some other things that I
want to do first.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85426
Approved by: https://github.com/ezyang
2022-09-22 18:47:12 +00:00
Mikayla Gawarecki
77f1f98479 Re-introduce torch.Tensor.to_padded_tensor (#85293)
Differential Revision: [D39629004](https://our.internmc.facebook.com/intern/diff/D39629004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85293
Approved by: https://github.com/cpuhrsch
2022-09-21 18:45:56 +00:00
Richard Zou
d9024ea284 Setup torch/csrc/functorch/*; move CompileCache.{h, cpp} there (#85263)
The plan for functorch C++ is:
- all C++-only code goes into aten/functorch.
- any C++ code with a python dependency goes into torch/csrc/functorch. This will include the functorch Python bindings as well as all of torchdim.

Alternative:
- we could split it so that code goes into torch/csrc/functorch/nopython and torch/csrc/functorch/python instead of putting anything into ATen. This just feels like a matter of cosmetics.

This PR also does two more things:
- fix a windows lint error regarding PyLong_asLong
- clang-format the code (because the linter got triggered)

Test Plan:
- run tests
- check internal build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85263
Approved by: https://github.com/ezyang
2022-09-19 21:49:18 +00:00
Michael Voznesensky
cd7408e950 Add aten _assert_tensor_metadata op (#84617)
Example:
```
graph():
    %arg0 : [#users=3] = placeholder[target=arg0]
    %arg_guard_equality_check : [#users=1] = call_function[target=torch._tensor_equal](args = (%arg0, (1, 1, 2), (2, 2, 1), torch.float32), kwargs = {})
    %_assert_true : [#users=0] = call_function[target=torch._assert_true](args = (%arg_guard_equality_check, Guard evaluation failed equality check for arg0), kwargs = {})
    %add : [#users=1] = call_function[target=operator.add](args = (%arg0, 1), kwargs = {})
    return ([arg0, arg0], (add, add))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84617
Approved by: https://github.com/jansel
2022-09-19 20:48:09 +00:00
Kevin Stephano
b8418e02eb Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#85045)
This PR does the following:

- Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`.  The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`.  It replaces the TorchScript graph caching, which looked up a Fusion by a stringified and canonicalized representation of the TorchScript graph, with a prefix tree of statements in the `FusionDefinition`.  The `FusionInterface` is an object that represents a Fusion in Python.  It can also query the cache by id.
- Adds the ability to print out a mechanically derived definition, in Python, for the user to use when debugging.
- Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`.
- Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes.
- Adds a README file to explain how to use the Python Frontend

While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation.

An identical PR to #83267 to avoid tooling issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85045
Approved by: https://github.com/davidberard98
2022-09-17 10:52:54 +00:00
PyTorch MergeBot
81620c3360 Revert "Faster mul(sparse, sparse) with broadcasting in dense dims. (#83428)"
This reverts commit d49943bda8.

Reverted https://github.com/pytorch/pytorch/pull/83428 on behalf of https://github.com/osalpekar because the __restrict symbol is not supported by certain MSVC compilers, leading to an undefined symbol error at compilation time
2022-09-17 06:53:11 +00:00
nikitaved
d49943bda8 Faster mul(sparse, sparse) with broadcasting in dense dims. (#83428)
Preliminary benchmarks (square matrices of shape (n, n)).

<details>

<summary>Script</summary>

```python
import torch
import math
from IPython import get_ipython
from itertools import product, repeat
import pickle
from torch.utils.benchmark import Timer, Compare

torch.manual_seed(13)

# specifies (n, nnz)
problem_dims = (
    # n > nnz
    (10000, 100),
    (100000, 1000),
    (1000000, 10000),
    # n < nnz
    (10, 100),
    (10, 1000),
    (10, 10000),
    (100, 1000),
    (100, 10000),
    (1000, 10000),
    (1000, 100000),
    (1000, 1000000),
    #(1000000, 1000000000),
)

name = "PR"
device = "cuda"
results = []

for n, nnz in problem_dims:
    def gen_tensor(coalesce=False):
        shape = (n, n)
        nrows, ncols = shape
        rowidx = torch.randint(low=0, high=nrows, size=(nnz,), device=device)
        colidx = torch.randint(low=0, high=ncols, size=(nnz,), device=device)
        itemidx = torch.vstack((rowidx, colidx))
        xvalues = torch.randn(nnz, device=device)
        itemidx = torch.hstack((itemidx, itemidx))
        xvalues = torch.hstack((xvalues, xvalues))
        res = torch.sparse_coo_tensor(itemidx, xvalues, size=shape)
        if coalesce:
            return res.coalesce()
        else:
            return res

    for x_coalesce, y_coalesce in product(*repeat((True, False), 2)):
        x = gen_tensor(x_coalesce)
        y = gen_tensor(y_coalesce)
        smtp = "x * y"
        timer = Timer(smtp,
                      globals=globals(),
                      label="coo.mul",
                      description=f"{name}: mul, device: {device}",
                      sub_label=f"n={n}, nnz={nnz}, coalesce=({x_coalesce, y_coalesce})",
                      num_threads=torch.get_num_threads())
        results.append(timer.blocked_autorange())

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

with open(f"{name}_{device}_mul.pickle", 'wb') as f:
    pickle.dump(results, f)

```

</details>

<details>

<summary>Gather results</summary>

```python
import pickle
from torch.utils.benchmark import Timer, Compare

files = [
        "PR",
        "master"
        ]

device = 'cuda'

timers = []
for name in files:
    with open("{}_{}_mul.pickle".format(name, device), 'rb') as f:
        timers += pickle.load(f)

compare = Compare(timers)
compare.trim_significant_figures()
compare.print()

```

</details>

<details>

<summary>CUDA</summary>

```
[------------------------------------------------- coo.mul -------------------------------------------------]
                                                       |  PR: mul, device: cuda  |  master: mul, device: cuda
24 threads: -------------------------------------------------------------------------------------------------
      n=10000, nnz=100, coalesce=((True, True))        |             95          |                91
      n=10000, nnz=100, coalesce=((True, False))       |             87          |               242
      n=10000, nnz=100, coalesce=((False, True))       |             87          |               226
      n=10000, nnz=100, coalesce=((False, False))      |            130          |               371
      n=100000, nnz=1000, coalesce=((True, True))      |            100          |               521
      n=100000, nnz=1000, coalesce=((True, False))     |             90          |               649
      n=100000, nnz=1000, coalesce=((False, True))     |            100          |               659
      n=100000, nnz=1000, coalesce=((False, False))    |            200          |               781
      n=1000000, nnz=10000, coalesce=((True, True))    |            100          |              4861
      n=1000000, nnz=10000, coalesce=((True, False))   |            100          |              5012
      n=1000000, nnz=10000, coalesce=((False, True))   |             98          |              5010
      n=1000000, nnz=10000, coalesce=((False, False))  |            384          |              5174
      n=10, nnz=100, coalesce=((True, True))           |            100          |                79
      n=10, nnz=100, coalesce=((True, False))          |            100          |               221
      n=10, nnz=100, coalesce=((False, True))          |            100          |               221
      n=10, nnz=100, coalesce=((False, False))         |            100          |               350
      n=10, nnz=1000, coalesce=((True, True))          |            100          |               100
      n=10, nnz=1000, coalesce=((True, False))         |            100          |               240
      n=10, nnz=1000, coalesce=((False, True))         |            100          |               254
      n=10, nnz=1000, coalesce=((False, False))        |            100          |               392
      n=10, nnz=10000, coalesce=((True, True))         |            100          |               110
      n=10, nnz=10000, coalesce=((True, False))        |            110          |               286
      n=10, nnz=10000, coalesce=((False, True))        |            110          |               286
      n=10, nnz=10000, coalesce=((False, False))       |            271          |               455
      n=100, nnz=1000, coalesce=((True, True))         |            110          |               851
      n=100, nnz=1000, coalesce=((True, False))        |            110          |              1000
      n=100, nnz=1000, coalesce=((False, True))        |            110          |               990
      n=100, nnz=1000, coalesce=((False, False))       |            140          |              1124
      n=100, nnz=10000, coalesce=((True, True))        |            110          |              5137
      n=100, nnz=10000, coalesce=((True, False))       |            110          |              5391
      n=100, nnz=10000, coalesce=((False, True))       |            100          |              5405
      n=100, nnz=10000, coalesce=((False, False))      |            249          |              5539
      n=1000, nnz=10000, coalesce=((True, True))       |            100          |              8598
      n=1000, nnz=10000, coalesce=((True, False))      |            100          |              8800
      n=1000, nnz=10000, coalesce=((False, True))      |            100          |              8782
      n=1000, nnz=10000, coalesce=((False, False))     |            255          |              8956
      n=1000, nnz=100000, coalesce=((True, True))      |            120          |             84500
      n=1000, nnz=100000, coalesce=((True, False))     |            200          |             88560
      n=1000, nnz=100000, coalesce=((False, True))     |            160          |             89000
      n=1000, nnz=100000, coalesce=((False, False))    |            373          |             89000
      n=1000, nnz=1000000, coalesce=((True, True))     |            312          |            606400
      n=1000, nnz=1000000, coalesce=((True, False))    |           1340          |            609200
      n=1000, nnz=1000000, coalesce=((False, True))    |           1340          |            609100
      n=1000, nnz=1000000, coalesce=((False, False))   |           4408          |            611400

Times are in microseconds (us).
```

</details>

<details>

<summary>CPU</summary>

```
[------------------------------------------------ coo.mul ------------------------------------------------]
                                                       |  PR: mul, device: cpu  |  master: mul, device: cpu
24 threads: -----------------------------------------------------------------------------------------------
      n=10000, nnz=100, coalesce=((True, True))        |              8         |                8
      n=10000, nnz=100, coalesce=((True, False))       |             32         |               34
      n=10000, nnz=100, coalesce=((False, True))       |             32         |               34
      n=10000, nnz=100, coalesce=((False, False))      |             41         |               56
      n=100000, nnz=1000, coalesce=((True, True))      |             24         |               24
      n=100000, nnz=1000, coalesce=((True, False))     |             90         |              100
      n=100000, nnz=1000, coalesce=((False, True))     |             87         |              100
      n=100000, nnz=1000, coalesce=((False, False))    |            231         |              255
      n=1000000, nnz=10000, coalesce=((True, True))    |            190         |              200
      n=1000000, nnz=10000, coalesce=((True, False))   |            908         |             2023
      n=1000000, nnz=10000, coalesce=((False, True))   |            800         |             2036
      n=1000000, nnz=10000, coalesce=((False, False))  |           3684         |             3989
      n=10, nnz=100, coalesce=((True, True))           |              8         |                7
      n=10, nnz=100, coalesce=((True, False))          |             34         |               30
      n=10, nnz=100, coalesce=((False, True))          |             33         |               30
      n=10, nnz=100, coalesce=((False, False))         |             44         |               50
      n=10, nnz=1000, coalesce=((True, True))          |              8         |                7
      n=10, nnz=1000, coalesce=((True, False))         |            100         |              100
      n=10, nnz=1000, coalesce=((False, True))         |            130         |              100
      n=10, nnz=1000, coalesce=((False, False))        |            746         |              210
      n=10, nnz=10000, coalesce=((True, True))         |              8         |                7
      n=10, nnz=10000, coalesce=((True, False))        |           1000         |             1500
      n=10, nnz=10000, coalesce=((False, True))        |           1000         |             1510
      n=10, nnz=10000, coalesce=((False, False))       |           3063         |             2457
      n=100, nnz=1000, coalesce=((True, True))         |             25         |               25
      n=100, nnz=1000, coalesce=((True, False))        |            180         |              130
      n=100, nnz=1000, coalesce=((False, True))        |            200         |              130
      n=100, nnz=1000, coalesce=((False, False))       |            271         |              255
      n=100, nnz=10000, coalesce=((True, True))        |            100         |              100
      n=100, nnz=10000, coalesce=((True, False))       |           2444         |             2290
      n=100, nnz=10000, coalesce=((False, True))       |           2455         |             2357
      n=100, nnz=10000, coalesce=((False, False))      |           5316         |             3783
      n=1000, nnz=10000, coalesce=((True, True))       |            204         |              211
      n=1000, nnz=10000, coalesce=((True, False))      |           2457         |             2480
      n=1000, nnz=10000, coalesce=((False, True))      |           2448         |             2539
      n=1000, nnz=10000, coalesce=((False, False))     |           3665         |             4801
      n=1000, nnz=100000, coalesce=((True, True))      |           2293         |             2374
      n=1000, nnz=100000, coalesce=((True, False))     |           9000         |            24620
      n=1000, nnz=100000, coalesce=((False, True))     |           8000         |            25080
      n=1000, nnz=100000, coalesce=((False, False))    |          26500         |            47650
      n=1000, nnz=1000000, coalesce=((True, True))     |          10000         |            13000
      n=1000, nnz=1000000, coalesce=((True, False))    |          80000         |           362200
      n=1000, nnz=1000000, coalesce=((False, True))    |          78050         |           392600
      n=1000, nnz=1000000, coalesce=((False, False))   |         312100         |           766900

Times are in microseconds (us).
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83428
Approved by: https://github.com/cpuhrsch
2022-09-16 00:28:40 +00:00
soulitzer
7f88934a8f [reland 2] Call jit decomp in VariableType to improve forward AD coverage (#84976)
Reland of https://github.com/pytorch/pytorch/pull/84675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84976
Approved by: https://github.com/zou3519
2022-09-15 22:46:19 +00:00
PyTorch MergeBot
94b67f4cd8 Revert "Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#83267)"
This reverts commit ec916bf6af.

Reverted https://github.com/pytorch/pytorch/pull/83267 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-09-14 17:40:22 +00:00
Howard Huang
74ead61944 [2/N] [Dispatchable Collectives] Extract ProcessGroup::Work into a separate class and update references (#83680)
### Changes
- Move ProcessGroup::Work into its own class and update all the references to it / header includes.

#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. This change prevents a circular dependency, with ProcessGroup depending on Backend and Backend depending on ProcessGroup::Work.

Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
2022-09-14 13:05:58 +00:00
Kevin Stephano
ec916bf6af Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#83267)
This PR does the following:

- Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`.  The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`.  It replaces the TorchScript graph caching, which looked up a Fusion by a stringified and canonicalized representation of the TorchScript graph, with a prefix tree of statements in the `FusionDefinition` (see the sketch below).  The `FusionInterface` is an object that represents a Fusion in Python.  It can also query the cache by id.
- The ability to print out a mechanically derived definition, in python, for the user to use when debugging was added.
- Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`.
- Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes.
- Adds a README file to explain how to use the Python Frontend

While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation.
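
A minimal Python sketch of the prefix-tree lookup (the real `FusionCache` is a C++ singleton keyed by `RecordFunctor` objects rather than strings; this only illustrates the idea):

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # statement -> TrieNode
        self.fusion_id = None   # set once a complete definition ends here


class FusionCacheSketch:
    """Maps a sequence of definition statements to a cached fusion id."""

    def __init__(self):
        self.root = TrieNode()
        self.next_id = 0

    def lookup_or_create(self, statements):
        node = self.root
        for stmt in statements:
            node = node.children.setdefault(stmt, TrieNode())
        if node.fusion_id is None:   # cache miss: a new Fusion would be built here
            node.fusion_id = self.next_id
            self.next_id += 1
        return node.fusion_id


cache = FusionCacheSketch()
a = cache.lookup_or_create(["define_tensor(2)", "add(t0, t0)", "add_output(t1)"])
b = cache.lookup_or_create(["define_tensor(2)", "add(t0, t0)", "add_output(t1)"])
assert a == b  # identical definitions reuse the same cached fusion
```
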
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83267
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2022-09-13 23:28:39 +00:00
Mikayla Gawarecki
e217b30b0f Add torch.nested namespace (#84102)
First step towards #83775
- only `to_padded_tensor` is moved to the nested namespace for now
- following the schema used for `special`, `fft`, `linalg` and other namespaces, nested functions are registered in native_functions.yaml as `nested_{function_name}` and are bound to the desired Python name in
`torch/nested/__init__.py`, and the desired C++ name in `torch/csrc/api/include/torch/nested.h`.

~~**Question**: should we keep the documentation for `Tensor.to_padded_tensor` or can this be deleted since it is shared by `torch.nested.to_padded_tensor`?~~

[generated nested docs](https://docs-preview.pytorch.org/84102/nested.html?highlight=nested#module-torch.nested)
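
A quick usage sketch of the new namespace (assumes a build where the nested tensor constructor is also exposed as `torch.nested.nested_tensor`, as in current releases):

```python
import torch

# Build a nested tensor from variable-length rows, then pad it into a
# regular dense tensor via the new torch.nested namespace.
nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
padded = torch.nested.to_padded_tensor(nt, 0.0)
print(padded.shape)  # torch.Size([2, 4, 3])
```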

Differential Revision: [D39361148](https://our.internmc.facebook.com/intern/diff/D39361148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84102
Approved by: https://github.com/drisspg
2022-09-12 16:31:05 +00:00
Taylor Robie
328538700a [Profiler][Trivial] Move PythonTracerBase to torch/csrc/profiler/orchestration (#83895)
The ownership model between `RecordQueue` and `PythonTracer` is brittle; if a profiler is popped without a proper shutdown it can leave a dangling reference in `PythonTracer`, which will segfault when dereferenced. The next PR will address this; to start we simply move the code into `torch/csrc/profiler/orchestration` to limit the sloc delta when making actual changes.

Differential Revision: [D38933962](https://our.internmc.facebook.com/intern/diff/D38933962/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38933962/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83895
Approved by: https://github.com/slgong-fb
2022-09-09 19:04:08 +00:00
Dhruv Matani
18a31cc044 [Mobile] Fix The Build For Model Tracer (#84755)
Summary: Currently, the model tracer build is broken for two reasons:
1. A few source files are missing, resulting in missing link-time symbols
2. The `TRACING_BASED` flag isn't passed through correctly from the command line (where it is specified as an environment variable) to CMake

Both these issues were fixed.

Test Plan: Ran this command: `USE_CUDA=0 TRACING_BASED=1 python setup.py develop --cmake`

and saw that the tracer binary was built at `build/bin/model_tracer` - also ran it to ensure that it can generate a YAML file.

Differential Revision: [D39391270](https://our.internmc.facebook.com/intern/diff/D39391270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84755
Approved by: https://github.com/cccclai
2022-09-09 18:22:24 +00:00
Taylor Robie
bea0184033 Reland: [Profiler][Trivial] Create orchestration folder and move observer management there. (#83893)" (#84667)
Reland of https://github.com/pytorch/pytorch/pull/83893

Differential Revision: [D39282536](https://our.internmc.facebook.com/intern/diff/D39282536/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39282536/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84667
Approved by: https://github.com/slgong-fb
2022-09-08 17:09:19 +00:00
Driss Guessous
f803fa9fc9 [Nested Tensor] Add a NestedTensorUtils header and cpp file for organization (#84385)
# Summary
Trying to do some cleanup of the code structure for nested tensors. This introduces a utility header and cpp file that implement helper functions.

This is the initial PR in a larger cleanup. The next would be separating out all the native functions that create nested tensors into their own file, since they do not in fact do math on nested tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84385
Approved by: https://github.com/mikaylagawarecki
2022-09-02 16:31:55 +00:00
PyTorch MergeBot
8b578849b4 Revert "[Profiler][Trivial] Create orchestration folder and move observer management there. (#83893)"
This reverts commit 48a596ad3f.

Reverted https://github.com/pytorch/pytorch/pull/83893 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-09-01 18:34:58 +00:00
Howard Huang
693ed8b147 [1/N] [Dispatchable Collectives] Create Backend class (#83679)
### Changes:
- Create a new Backend class which contains collectives similar to those of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp.

#### Motivation
In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type.
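
A conceptual Python sketch of the target design (the real code is C++ in `torch/csrc/distributed/c10d`; class and method names here are illustrative only):

```python
import torch


class _FakeBackend:
    """Stand-in for ProcessGroupNCCL/Gloo/UCC in this sketch."""

    def __init__(self, name):
        self.name = name

    def allreduce(self, tensors):
        print(f"allreduce dispatched to {self.name}")


class ProcessGroupSketch:
    """Holds one backend per device type and routes each collective to it."""

    def __init__(self):
        self._backends = {}  # device type ("cpu", "cuda", ...) -> backend

    def register_backend(self, device_type, backend):
        self._backends[device_type] = backend

    def allreduce(self, tensors):
        return self._backends[tensors[0].device.type].allreduce(tensors)


pg = ProcessGroupSketch()
pg.register_backend("cpu", _FakeBackend("gloo"))
pg.allreduce([torch.ones(4)])  # prints: allreduce dispatched to gloo
```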

Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83679
Approved by: https://github.com/kwen2501
2022-09-01 01:51:20 +00:00
Jeff Daily
d09486ab23 [ROCm] enable nvfuser (#82498)
### Description
The nvfuser is enabled for ROCm.

### Testing
CI label ciflow/trunk covers the newly enabled ROCm functionality as well as any CUDA regressions caused by these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82498
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2022-08-30 21:50:39 +00:00
Taylor Robie
48a596ad3f [Profiler][Trivial] Create orchestration folder and move observer management there. (#83893)
Just a basic move. Later I'll add other subsystems. (Python, Kineto)

Differential Revision: [D38925895](https://our.internmc.facebook.com/intern/diff/D38925895/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38925895/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83893
Approved by: https://github.com/slgong-fb
2022-08-30 21:40:59 +00:00
Kimish Patel
cfd18e105f [Pytorch][Ondevice quantization] Add device side API to convert model (#83807)
Summary:
This diff adds a device-side API which will convert the model to its
quantized equivalent. The input model must have been prepared AOT for
quantization.

The API is implemented by (see the conceptual sketch after this list):
- Running reset observers
- Running the observe method
- Running the quantize method
- And replacing the method, e.g. forward, with its quantized equivalent.
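
A hypothetical sketch of that flow; the real device-side API operates on a mobile module in C++, and the method names below (`reset_observers`, `observe_forward`, `quantize_forward`, `quantized_forward`) are illustrative, not the exact API:

```python
def convert_on_device(module, sample_input):
    # The module is assumed to have been prepared AOT with these methods.
    module.reset_observers()                    # reset observer state
    module.observe_forward(sample_input)        # record activation statistics
    module.quantize_forward(sample_input)       # compute qparams, pack weights
    module.forward = module.quantized_forward   # swap in the quantized method
    return module
```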

Test Plan:
test/quantization/jit/test_ondevice_quantization.py

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D38889818](https://our.internmc.facebook.com/intern/diff/D38889818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83807
Approved by: https://github.com/iseeyuan
2022-08-29 17:57:38 +00:00
Kimish Patel
5c7e801c50 [pytorch][on device quant] Finalize method for ondevice quant (#83571)
Summary:
After inserting quant dequant nodes in the graph, we need
1. Insert packed param creation and quantized op
2. Create a packed_params attribute in the top module. For this we need the
graph to be inlined, except for calculate_qparams method calls. But those
can be inlined too, so perhaps we need to make sure no other CallMethod
nodes exist.
3. Insert SetAttr for the packed param
4. Insert GetAttr for the packed param
5. Use GetAttr output for quantized op where applicable, e.g.
linear_dynamic

The above is added to the quantize_<method-name> method created in the previous
step. Once the above steps are done, clone the method into
quantized_<method-name>.

Modify quantize_<method-name>:
1. Remove all outputs from the method.
2. Run dce
3. Remove all inputs from the method except self.

Modify quantized_<method-name>:
1. Remove all packed_param setAttr nodes.
2. Run dce.

This should result in the removal of all nodes that generate the packed param.
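
A rough Python illustration of what the two generated methods end up computing for a dynamic linear layer (the actual pass rewrites TorchScript graphs; this assumes a build with a quantized engine such as FBGEMM available):

```python
import torch

weight = torch.randn(8, 16)
bias = torch.zeros(8)

# quantize_<method-name>: create the packed param (stored via SetAttr).
qweight = torch.quantize_per_tensor(weight, scale=0.1, zero_point=0, dtype=torch.qint8)
packed = torch.ops.quantized.linear_prepack(qweight, bias)

# quantized_<method-name>: fetch the packed param (GetAttr) and call the
# dynamic quantized op instead of the float linear.
x = torch.randn(2, 16)
out = torch.ops.quantized.linear_dynamic(x, packed)
print(out.shape)  # torch.Size([2, 8])
```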

Test Plan: To be written

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D38771416](https://our.internmc.facebook.com/intern/diff/D38771416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83571
Approved by: https://github.com/jerryzh168
2022-08-29 17:53:11 +00:00
jjsjann123
b21a6ff639 [NVFuser] Upstream push 0811 (#83239)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. double support in expression evaluator
- bug fixes:
  1. dropout fix - rework RNG to support broadcasted dropout (Fixes #82784)
  2. expand fix - Patch expand+reduction, expand+view, rework view analysis and guard
- scheduler:
  1. manual transpose schedule example
  2. WIP transpose scheduler

Commits that's in this PR from the devel branch:

```
b7435afcd22c917713c2f41a7237bc26e1183f14 Transpose scheduler, step 1 (#1854)
8a45dbf72034684eb8e18b1835b533e90b68f184 Add an example on how to manually schedule transpose (#1889)
83dbf56a9554b2efbd5416461d938fff477b0b27 Patch dropout fix (#1898)
69d3519a532250719b1aa8341b50e067b181b42d Expand+Reduction, Expand+View support, rework View analysis and guards (#1883)
15091c488e96343bdc49e3990acbf238a3b3da51 Rework RNG to correctly support broadcasted dropout (#1888)
aafe2d048aaac596e503596a41303423619f3954 Make ExpressionEvaluator support Double (#1885)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38657074](https://our.internmc.facebook.com/intern/diff/D38657074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83239
Approved by: https://github.com/davidberard98
2022-08-25 02:23:22 +00:00
Hansong Zhang
6edcf8e18c Move nnapi code from ATen common code to specific library (#83748)
Summary: Currently we include nnapi code in all targets using ATen even if it's not used (in fact there is no usage and it is being deprecated). Move it to `nnapi_backend_lib` for now.

Test Plan: Sandcastle.

Differential Revision: D38761095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83748
Approved by: https://github.com/salilsdesai, https://github.com/SS-JIA
2022-08-24 02:17:52 +00:00
chenlai
25dd2a0422 Fix load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)
The `_load_extra_only_for_mobile` API doesn't handle the flatbuffers logic yet. Update the API accordingly.

Also found out that the mobile build in OSS doesn't build with flatbuffers. Filed task T129996445 to track this.

Differential Revision: [D38890847](https://our.internmc.facebook.com/intern/diff/D38890847/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38890847/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83855
Approved by: https://github.com/qihqi
2022-08-23 21:25:18 +00:00