Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18833
ghimport-source-id: 6f2be25fcc5e6be3ffe20582e604bd2c1fbab66b
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18833 [STACK] Cache device on TensorImpl; clean up TensorImpl constructors.**
* #18832 [STACK] Disallow changing the device of a tensor via set_.
* #18831 [STACK] Stop swapping in Storages of the wrong device for Tensors.
1) We cache the device on TensorImpl. This means we can access the device without a virtual function call, and it allows us to more easily extend TensorImpls (because they don't need to figure out how to store the Device for themselves).
2) Clean up TensorImpl APIs. We had a constructor that took a TensorTypeId and an allocator and would allocate a Storage based on the recognized types of TensorTypeIds. Instead, we just have two different constructors: one for types with a storage, one without.
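To make the intent concrete, here is a minimal sketch of the caching idea, using simplified stand-in types rather than the real TensorImpl (the class, member, and constructor names below are illustrative only):
```
#include <cassert>
#include <cstdint>

// Simplified stand-ins for c10::DeviceType / c10::Device.
enum class DeviceType : int8_t { CPU, CUDA };
struct Device {
  DeviceType type;
  int8_t index;
};

// Caching the device as a plain member makes device() a cheap,
// non-virtual accessor, and subclasses don't each have to invent
// their own way of storing it.
class SimpleTensorImpl {
 public:
  // Constructor for types that own a storage: the device comes from it.
  explicit SimpleTensorImpl(Device storage_device)
      : device_(storage_device), has_storage_(true) {}

  // Constructor for storage-less types: the caller states the device.
  SimpleTensorImpl(Device device, bool /*storage_less_tag*/)
      : device_(device), has_storage_(false) {}

  Device device() const { return device_; }  // no virtual dispatch needed
  bool has_storage() const { return has_storage_; }

 private:
  Device device_;
  bool has_storage_;
};

int main() {
  SimpleTensorImpl dense({DeviceType::CUDA, 0});
  SimpleTensorImpl opaque({DeviceType::CPU, -1}, /*storage_less_tag=*/true);
  assert(dense.device().type == DeviceType::CUDA);
  assert(!opaque.has_storage());
}
```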
Reviewed By: dzhulgakov
Differential Revision: D14766230
fbshipit-source-id: 745b8db84dcd6cb58f1a8675ad3ff8d033bc50df
Summary:
This defines a generic counters API that users can utilize to provide monitoring functionality in e.g. a production service. We expose both counters for runtime internals as well as a TorchScript API to create user-defined counters. Synopsis of the API:
- `torch/csrc/jit/script/logging.h` specifies the externally-facing API in C++
- `torch/jit/_logging.py` specifies the Python API
We use an interface, `LoggerBase`, to define the interactions between users and a logging backend. Implementing a subclass of `LoggerBase` allows the user to handle these events in a custom way, such as logging into a DB or calling into an infra-specific counters API.
From the frontend perspective, we can create log events in two ways:
1. We provide an `add_stat_value(name, val)` function. This calls into the Logger backend with a key/value pair. For example, we might call `add_stat_value('foo', 1)` to bump an event counter.
2. We provide a `time_point()` function to record a timestamp in nanoseconds. This can be used in conjunction with `add_stat_value` to record runtime wall clock durations.
Examples of frontend usage can be found in `test_jit.py TestLogging`.
We provide a trivial `LockingLogger` implementation as an example and for testing purposes. It is likely not ready for production usage. It demonstrates that a backend implementing the API can do things like specify aggregation types and report these aggregate stats via the `get_counters()` API.
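For flavor, a hedged sketch of what a LockingLogger-style backend boils down to: a mutex-guarded map of counters behind a LoggerBase-like interface. The class and method names below are illustrative stand-ins, not the exact signatures in `torch/csrc/jit/script/logging.h`.
```
#include <cstdint>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Illustrative interface; the real LoggerBase lives in
// torch/csrc/jit/script/logging.h and may differ in signatures.
struct LoggerBaseSketch {
  virtual ~LoggerBaseSketch() = default;
  virtual void addStatValue(const std::string& name, int64_t value) = 0;
};

// A trivial locking backend: every addStatValue bumps an aggregate
// under a mutex; getCounters() returns a snapshot.
class LockingLoggerSketch : public LoggerBaseSketch {
 public:
  void addStatValue(const std::string& name, int64_t value) override {
    std::lock_guard<std::mutex> guard(mutex_);
    counters_[name] += value;  // SUM aggregation; a real backend could also do AVG, etc.
  }
  std::map<std::string, int64_t> getCounters() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return counters_;
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::string, int64_t> counters_;
};

int main() {
  LockingLoggerSketch logger;
  logger.addStatValue("foo", 1);  // analogous to add_stat_value('foo', 1)
  logger.addStatValue("foo", 1);
  for (const auto& kv : logger.getCounters()) {
    std::cout << kv.first << " = " << kv.second << "\n";
  }
}
```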
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18235
Differential Revision: D14545060
Pulled By: jamesr66a
fbshipit-source-id: 04099543a1898cfdd411511e46e03d5dce9b4881
Summary:
1. Move ATen threadpool & open registration mechanism to C10
2. Move the `global_work_queue` to use this open registration mechanism, to allow users to substitute in their own
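A minimal sketch of the open-registration pattern is below; the names are made up for illustration and do not reflect the actual C10 registration API.
```
#include <functional>
#include <iostream>
#include <memory>

// Minimal stand-in for a thread pool interface.
struct ThreadPoolBase {
  virtual ~ThreadPoolBase() = default;
  virtual void run(std::function<void()> fn) = 0;
};

// Default pool: runs tasks inline (a real one would use worker threads).
struct InlinePool : ThreadPoolBase {
  void run(std::function<void()> fn) override { fn(); }
};

// Open registration point: whoever sets the factory controls which
// pool backs the global work queue.
using PoolFactory = std::function<std::shared_ptr<ThreadPoolBase>()>;

PoolFactory& poolFactory() {
  static PoolFactory factory = [] { return std::make_shared<InlinePool>(); };
  return factory;
}

std::shared_ptr<ThreadPoolBase> globalWorkQueue() {
  static std::shared_ptr<ThreadPoolBase> pool = poolFactory()();
  return pool;
}

int main() {
  // A user could assign poolFactory() = [] { return std::make_shared<MyPool>(); };
  // before first use to substitute their own implementation.
  globalWorkQueue()->run([] { std::cout << "task ran\n"; });
}
```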
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17788
Reviewed By: zdevito
Differential Revision: D14379707
Pulled By: jamesr66a
fbshipit-source-id: 949662d0024875abf09907d97db927f160c54d45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16751
This was made more complicated by the fact that ivalue::IntList
is a thing. So I had to fix all of the sites where we were referring
to IValue post facto.
The following codemods were run, in this order:
```
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in IntList IntArrayRef
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in IntArrayRef::create IntList::create
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in ivalue::IntArrayRef ivalue::IntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in Tag::IntArrayRef Tag::IntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in isIntArrayRef isIntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in toIntArrayRef toIntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in 'Shared<IntArrayRef>' 'Shared<IntList>'
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in 'intrusive_ptr<IntArrayRef>' 'intrusive_ptr<IntList>'
```
Some manual fixups were done afterwards; they can be reviewed separately
at https://github.com/pytorch/pytorch/pull/16752
Reviewed By: dzhulgakov
Differential Revision: D13954363
fbshipit-source-id: b5c40aacba042402155a2f5a229fa6db7992ac64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15855
This is preparation work for moving IValue to c10.
Reviewed By: ezyang
Differential Revision: D13605259
fbshipit-source-id: cc545f582ab8607bb02aaf71273cb2710200b295
Summary:
Respect the grad guard for torch.jit._fork and torch.jit._wait.
Verified that the test failed without the fix and passes with the fix.
Ideally I would like to enable and disable grad inside the forked function.
That doesn't seem to be supported at this moment. This code handles that
as well.
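The mechanism amounts to capturing the caller's grad mode at fork time and restoring it inside the worker. A hedged, self-contained sketch (std::async and a thread-local flag stand in for the JIT fork machinery and at::GradMode):
```
#include <future>
#include <iostream>

// Stand-in for the autograd grad-mode flag (thread-local, like at::GradMode).
thread_local bool grad_enabled = true;

// "Fork" a function while propagating the caller's grad mode into the
// worker thread, so the forked body respects the surrounding guard.
template <typename F>
std::future<void> fork_with_grad_mode(F fn) {
  bool caller_mode = grad_enabled;          // capture at fork time
  return std::async(std::launch::async, [caller_mode, fn] {
    grad_enabled = caller_mode;             // restore inside the worker
    fn();
  });
}

int main() {
  grad_enabled = false;                     // analogous to torch.no_grad()
  auto fut = fork_with_grad_mode([] {
    std::cout << "grad enabled in fork: " << grad_enabled << "\n";  // prints 0
  });
  fut.wait();                               // analogous to torch.jit._wait
}
```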
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16101
Differential Revision: D13708374
Pulled By: gqchen
fbshipit-source-id: 0533f080c4d0253fb4c61d2a0d3cc22de5721a09
Summary:
The PR clang-formats everything in `torch/csrc/jit/` and adds it to the pre-commit hook.
Here is a list of non-mechanical changes:
- I went over each file and fixed up whenever I could tell that clang-format was clobbering comment formatting.
- Made the macros in register_prim_ops a little more clang-format friendly by omitting trailing commas
- Refactored autodiff.cpp to use a helper class with explicit state rather than a bunch of capturing lambdas
- Small improvements to the precommit hook clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15524
Differential Revision: D13547989
Pulled By: suo
fbshipit-source-id: 3ff1541bb06433ccfe6de6e33f29227a2b5bb493
Summary:
Save error info in the future for the parent thread to pick up. Throw the error
when the thread is the root thread.
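Conceptually this is the standard exception_ptr pattern: the worker stores its exception in the future, and only the root thread rethrows. A hedged sketch with plain standard-library primitives (not the actual ivalue::Future code):
```
#include <exception>
#include <iostream>
#include <stdexcept>
#include <thread>

// Minimal future-like slot that can hold an error from a worker thread.
struct ErrorSlot {
  std::exception_ptr error;
  void run(void (*fn)()) {
    try {
      fn();
    } catch (...) {
      error = std::current_exception();  // save for the parent to pick up
    }
  }
  void rethrow_if_error() const {
    if (error) std::rethrow_exception(error);  // only the root thread throws
  }
};

int main() {
  ErrorSlot slot;
  std::thread worker([&] { slot.run([] { throw std::runtime_error("boom"); }); });
  worker.join();
  try {
    slot.rethrow_if_error();  // parent picks the error up here
  } catch (const std::runtime_error& e) {
    std::cout << "caught in root thread: " << e.what() << "\n";
  }
}
```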
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14523
Differential Revision: D13251756
Pulled By: highker
fbshipit-source-id: b40f9a45665e1a934743f131ec5e8bad5622ce67
Summary:
Removing the deprecated functions in `torch/csrc/variable_tensor_functions.h` (like `torch::CPU`) and corresponding implementations from `torch/csrc/torch.cpp` from master after the release.
ezyang gchanan soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15003
Differential Revision: D13418086
Pulled By: goldsborough
fbshipit-source-id: a0accdf6f7b0efa1ec07ac7b74b86ff2da37543f
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.
I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.
I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path

files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
    if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
        continue
    if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
        continue
    with open(fn, 'r') as f:
        c = f.read()

    def fmt(p):
        return "#include <{}>".format(p)

    def repl(m):
        p = m.group(1)
        if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
            return fmt(p)
        if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
            return fmt(p)
        for root in ["aten/src", "torch/lib", ""]:
            for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
                new_p = os.path.relpath(os.path.join(bad_root, p), root)
                if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
                    return fmt(new_p)
        print("ERROR: ", fn, p)
        return m.group(0)

    new_c = re.sub(r'#include "([^"]+)"', repl, c)
    if new_c != c:
        print(fn)
        with open(fn, 'w') as f:
            f.write(new_c)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849
Reviewed By: dzhulgakov
Differential Revision: D13363445
Pulled By: ezyang
fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
Summary:
(1) Move Caffe2 thread pool to aten
(2) Use the same thread pool definition for PyTorch interpreter
(3) Make ivalue::Future thread-safe
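For (3), the core of a thread-safe future is a mutex plus a condition variable guarding the completion state. A heavily simplified sketch (the real ivalue::Future also carries IValues, errors, and callbacks):
```
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Toy thread-safe future holding an int; every access to shared state
// is guarded by the mutex, so producer and consumer can race safely.
class ToyFuture {
 public:
  void markCompleted(int value) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      value_ = value;
      completed_ = true;
    }
    cv_.notify_all();
  }
  int wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return completed_; });
    return value_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool completed_ = false;
  int value_ = 0;
};

int main() {
  ToyFuture fut;
  std::thread producer([&] { fut.markCompleted(42); });
  std::cout << "value: " << fut.wait() << "\n";
  producer.join();
}
```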
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14114
Reviewed By: ilia-cher
Differential Revision: D13110451
Pulled By: highker
fbshipit-source-id: a83acb6a4bafb7f674e3fe3d58f7a74c68064fac
Summary:
InterpreterStateImpl can continue its lifecycle by incrementing the ref
count itself. This patch also removes the InterpreterState::clone()
interface, which conflicts with intrusive_ptr_target's prohibition on copying.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13784
Differential Revision: D13015451
Pulled By: highker
fbshipit-source-id: a05f1ea6549d52ec693ccffefaa4d520b2474b8c
Summary:
Upon calling wait(), save the forked thread and the current thread to a
task queue. An idling thread (which currently is single threaded) should
pick a ready task and run until there is nothing in the task queue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13212
Differential Revision: D12884522
Pulled By: highker
fbshipit-source-id: b3942a0ee63c148e05f5f41bdc73007fa3c3368e
Summary:
Enables almost all `modernize-*` checks in clang-tidy. This warns against things such as:
- Use of `const std::string&` instead of new-style `std::string` + move,
- Using old-style loops instead of range-for loops,
- Use of raw `new`
- Use of `push_back` instead of `emplace_back`
- Use of `virtual` together with `override` (`override` is sufficient)
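For illustration, a tiny example of the style these checks push toward (not taken from the PR itself):
```
#include <memory>
#include <string>
#include <vector>

struct Base { virtual ~Base() = default; virtual void f() {} };

// After the modernize-* checks, code like this is preferred:
struct Derived : Base {
  void f() override {}                   // `override` alone, no redundant `virtual`
};

int main() {
  std::vector<std::string> names;
  names.emplace_back("fuser");           // emplace_back instead of push_back(std::string("fuser"))

  auto d = std::make_unique<Derived>();  // no raw `new`

  for (const auto& n : names) {          // range-for instead of an index loop
    (void)n;
  }
  (void)d;
}
```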
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13196
Differential Revision: D12891837
Pulled By: goldsborough
fbshipit-source-id: 4d0f782a09eb391ee718d3d66f74c095ee121c09
Summary:
This PR principally redesigns the fuser's logical flow to be hierarchical, with device-independent logic directing (relatively little) device-specific logic. This design is based on reviews of XLA, TVM, internal design review at NVIDIA and discussions with fuser owners at Facebook. To further vet the design I have begun developing the next significant PR (extended fusion logic) on top of this architecture and it has made the work significantly easier. This PR also improves fuser modularity, which should make it easier for others to contribute to. Unfortunately, this PR is large and its nature has made breaking it into smaller pieces challenging. Future PRs should be smaller.
The fusion flow is now:
- Fusions are "registered" and "upfront compilation" occurs. The fusion specifications, which includes the graph, go into a thread-safe device-independent cache. Upfront compilation generates some information used later during shape inference.
- Fusions are run, which passes them to an executor that performs shape inference, requests an instantiated fusion from the specification's thread-safe store, and launches them. Launch logic eventually defers to device-specific logic.
- Fusions not previously instantiated are compiled. Compilation is device-specific and arg-specific. Compilation logic eventually defers to device-specific logic.
- If the fusion cannot be run because fusion on the requested device is disabled or shape inference fails, a fallback is invoked.
This flow can be thought of as PyTorch IR -> Device-Independent Fusion Logic -> Device-Specific Fusion Logic. The current upstream logic is, by contrast, PyTorch IR -> Device-Specific Logic -> Device-Independent Logic, which results in needless code duplication and lack of conceptual clarity. That was my mistake when splitting the fuser off from the rest of the jit and our reviews since then have been incredibly helpful in understanding why the approach in this PR is better.
This PR does not only move code around. It also fixes a couple of bugs and makes some logical/code changes.
Bug fixes:
- thread safety is improved, with caches now guarded against concurrent access
- the nvrtc version is now reviewed to determine the appropriate compute architecture to compile for, fixing a bug that would cause runtime errors if a user's nvrtc didn't support the compute architecture their gpu reported
- an issue with DeviceGuard not setting the device properly and failing silently is worked-around (ezyang mentioned he was reviewing the dynamic registration DeviceGuard uses, which may resolve the issue)
Code/Logical changes:
- "const" now appears many more places (note: I cast const away in operator.h because of some obscure build issues -- I think we should be able to fix this and will take a look while this goes through testing)
- The new flow allowed some redundant code to be removed (AnnotatedGraph is gone, for example, and the more straightforward flow eliminated duplication of effort elsewhere)
- Fallback logic is now also invoked if a fusion is requested on a device that cannot handle fusions
- Use of macros to determine which files are compiled is reduced (though they may come back if the Windows build is unhappy)
- There is no more "common" code or folder, the device-independent logic being at the forefront of the fuser replaces and improves upon the goal of sharing code
apaszke who I promised naming rights to
zdevito who correctly pointed out that the device-independent logic should be the bulk of what the fuser is doing
ngimel who contributed to the design of this architecture
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13108
Reviewed By: gchanan, fmassa
Differential Revision: D12850608
Pulled By: soumith
fbshipit-source-id: 24e2df6dfa97591ee36aeca8944519678c301fa3
Summary:
This is a first step towards adding exceptions. We need minimal support in order to begin converting the torch library to weak script mode (which is the main goal here).
Some limitations (that are documented in the tests & compiler):
1. Cannot assign exceptions to variables
2. Any name after raise is being treated as a valid Exception
3. No control flow analysis yet. In the example below, `a` will be undefined:
    if True:
        a = 1
    else:
        raise Exception("Hi")
    return a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12789
Differential Revision: D12848936
Pulled By: eellison
fbshipit-source-id: 1f60ceef2381040486123ec797e97d65b074862d
Summary:
There is still some work to be done:
- Move logging and unify AT_WARN with LOG(ERROR).
- A few header files are still being plumbed through, need cleaning.
- caffe2::EnforceNotMet aliasing is not done yet.
- need to unify the macros. See c10/util/Exception.h
This is mainly a codemod and does not cause functional changes. If you find your job failing and trace it back to this diff, it can usually be fixed by one of the following approaches:
(1) add //caffe2/c10:c10 to your dependency (or transitive dependency).
(2) change objects such as at::Error, at::Optional to the c10 namespace.
(3) change functions to the c10 namespace. Especially, caffe2::MakeString is not overridden by the unified c10::str function. Nothing else changes.
Please kindly consider not reverting this diff - it involves multiple rounds of rebasing and the fix is usually simple. Contact jiayq@ or AI Platform Dev for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12354
Reviewed By: orionr
Differential Revision: D10238910
Pulled By: Yangqing
fbshipit-source-id: 7794d5bf2797ab0ca6ebaccaa2f7ebbd50ff8f32
Summary:
This PR adds a bool type to `IValue` and puts it into place.
* changes conds for `prim::If` and `prim::Loop` to use `bool` type
* changes operators that take `bool`s to match their native ops
* fixes ambiguous `aten` ops `aten::std` and `aten::var`
* fixes tests in `test_jit.py TestJitGenerated`
```
'test_std_dim',
'test_std_dim_1d',
'test_std_dim_1d_neg0',
'test_std_dim_neg0',
'test_var_dim',
'test_var_dim_1d',
'test_var_dim_1d_neg0',
'test_var_dim_neg0'
```
* adds `prim::BoolToTensor` and `prim::TensorToBool`
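A hedged sketch of what adding a `bool` variant to a tagged value type looks like, greatly simplified relative to the real IValue (which also holds tensors, doubles, lists, and uses intrusive refcounting):
```
#include <cassert>
#include <cstdint>

// Greatly simplified tagged value with a Bool variant.
class MiniIValue {
 public:
  enum class Tag { Int, Bool };

  explicit MiniIValue(int64_t v) : tag_(Tag::Int) { payload_.i = v; }
  explicit MiniIValue(bool v) : tag_(Tag::Bool) { payload_.b = v; }

  bool isBool() const { return tag_ == Tag::Bool; }
  bool toBool() const {
    assert(isBool());
    return payload_.b;
  }
  bool isInt() const { return tag_ == Tag::Int; }
  int64_t toInt() const {
    assert(isInt());
    return payload_.i;
  }

 private:
  union Payload { int64_t i; bool b; } payload_;
  Tag tag_;
};

int main() {
  MiniIValue cond(true);          // e.g. the condition of a prim::If
  assert(cond.isBool() && cond.toBool());
  MiniIValue count(int64_t{3});   // e.g. a prim::Loop trip count
  assert(count.toInt() == 3);
}
```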
apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11834
Differential Revision: D9928570
Pulled By: driazati
fbshipit-source-id: 373c53df2f1a8ffa9e33d9a517002fbeef25f3eb
Summary:
This PR implements the design that we discussed. Changes:
- Added a World token IValue and type. The IValue is basically a dummy struct for now, in the future we may extend it (say, add thread-local state).
- Effectful ops explicitly declare they are mutable by having World tokens as inputs and outputs in their schema.
- Purely functional ops that use mutable values will get "fenced" and the world token will be threaded through the fences
- AnnotateEffects pass which wires up all the world tokens together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10700
Reviewed By: eellison
Differential Revision: D9547881
Pulled By: michaelsuo
fbshipit-source-id: ebbd786c31f15bf45e2ddb0c188438ff2f5f3c88
Summary:
This PR splits the CPU and CUDA fusion compilers, putting them into a new jit/fusers/ directory with jit/fusers/common for common components. In particular:
- A fusion interface is created that allows "fusion handles" to be requested
- The CPU and CUDA fusers implement this interface, with dispatch determined by device
- The fusion compilers, fusion function specializations and resource strings are split
- CPU-specific classes like TempFile and DynamicLibrary are in the CPU fuser
- Common classes like TensorDesc and the base fusion function class are in jit/fusers/common
- There is still some specialization in jit/fusers/common, but these specializations are small(-ish)
- Updates the build system to remove the dummy interface on Windows and minimize the use of macros
This structure should allow in-flight PRs to easily rebase while providing a clear interface to the fusers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10981
Reviewed By: soumith
Differential Revision: D9701999
Pulled By: apaszke
fbshipit-source-id: 3b6bec7b97e0444b2a93caa38d9b897f2e68c1b3
Summary:
**Review last commit only.** Stacked on top of #10949.
This commit fixes a number of issues connected to caching
differentiability status of graphs inside graph executors,
and changes the rules for optimization of differentiable subgraphs.
Previously every one of those was instantiated as a separate graph
executor, but now they are simply heavier-optimized graph regions,
and graph executors are only instantiated for their backward.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10977
Differential Revision: D9600626
Pulled By: apaszke
fbshipit-source-id: dad09a0f586e396afbd5406319c1cd54fbb8a3d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10824
API additions:
- Tensor(c10::intrusive_ptr<TensorImpl,UndefinedTensor>&&)
- Tensor(const c10::intrusive_ptr<TensorImpl,UndefinedTensor>&)
- Tensor::operator=(Tensor&&) && (for completeness sake)
- TensorBase::unsafeGetTensorImpl()
- TensorBase::unsafeReleaseTensorImpl()
- TensorBase::getIntrusivePtr()
- TensorImpl::type_id()
- Tensor::set_data()
- Tensor::is_same(Tensor)
- Tensor::use_count()
- Tensor::type_id()
- Tensor::scalar_type()
- WeakTensor::is_same(WeakTensor)
- intrusive_ptr::weak_use_count()
- weak_intrusive_ptr::weak_use_count()
- c10::raw::intrusive_ptr::{incref,decref,make_weak}
- c10::raw::weak_intrusive_ptr::{incref,decref,lock}
API changes:
- Tensor::pImpl is no longer public (and now named tensor_impl_)
- Most methods accessed this way are now accessible on Tensor;
maybe_zero_dim() and set_wrapped_number() are prominent exceptions
(they are now accessed through unsafeGetTensorImpl())
- Type is no longer friend of Tensor
- TensorBase::reset(TensorImpl*) is deleted
- TensorBase::reset(TensorImpl*, bool should_retain) is deleted
- TensorBase::swap(TensorBaseImpl&) is deleted; use std::swap instead
- TensorBase::get() is deleted; use unsafeGetTensorImpl() instead
- TensorBase::detach() is deleted; use unsafeReleaseTensorImpl() instead
- TensorBase::retain() is deleted; use _raw_incref() instead
- TensorBase::release() is deleted; use _raw_decref() instead
- WeakTensor lost most of its methods (it no longer inherits from
TensorBase)
- TensorImpl::storage() is now a const method
- Tensor(TensorBase) constructor removed, instead
we go through getIntrusivePtr(). I'm not sure about
this change; I happened to have accidentally removed the
TensorBase constructor and decided to fix call sites,
but I could go the other way.
- detail::set_data() is deleted; use Tensor::set_data() instead
- c10::raw_intrusive_ptr_target removed; use the functions in c10::raw instead.
(The reason for this change, is that it is invalid to cast an intrusive_ptr_target*
to a raw_intrusive_ptr_target* to take advantage of the methods. But there is
no reason the incref/decref methods shouldn't also work on intrusive_ptr_target;
it is primarily an API consideration. We can be more standards compliant by
keeping them as functions, which are universally applicable.)
- intrusive_ptr::reclaim() and weak_intrusive_ptr::reclaim() now work on
pointers of the NullType. (This counts as a bug fix, because the documentation
specified that pointers produced by release() are valid to reclaim(), and
a release() on a null intrusive_ptr produces the NullType::singleton())
Bug fixes:
- Dispatch code for mutable references incorrectly returned
a reference to a value argument (which would immediately
go out of scope). They now correctly return a tensor by
value.
- intrusive_ptr copy/move assignment did not work correctly when
an object was assigned to itself. We now check for this case and
no-op if so. (This bug manifested itself as a Tensor mysteriously
becoming an UndefinedTensor after lines of code like
'x = x.mul_(y)')
Other changes:
- The checked cast functions in Utils.h have now been
renamed and detemplatized into checked unwrap functions.
- Added type_id() and scalar_type() methods to Tensor
- pImpl is no longer public
- Documented what the && overloads are doing
- All occurrences of 'new TensorImpl' (and similar spellings, like 'new THTensor')
have been expunged. This is NO LONGER a valid way to create a new
tensor, and if you do this, upon your first incref, you will catch an ASSERT
failure saying that only tensors created by intrusive_ptr::release() are valid
to reclaim(). Use c10::make_intrusive instead in this situation.
- IValue is adjusted to use intrusive_ptr instead of Retainable, and all
other sub-classes of Retainable were modified to use intrusive_ptr.
When doing this, I had to make the constructors of sub-classes like
ConstantList public, so that c10::make_intrusive could invoke them. Fortunately,
if you incorrectly stack allocate a ConstantList, and then try to get an
intrusive_ptr to it, it will fail, as stack allocated ConstantLists have refcount 0.
- IValue very narrowly sidesteps the problem of handling NullType, as it
considers intrusive_ptr<TensorImpl> identical to intrusive_ptr<TensorImpl, UndefinedTensor>
which is not always true. This was always the case, but there's now a comment
explaining what's going on.
Some MSVC bugs were uncovered during the preparation of this patch.
They are documented as comments in the code.
Reviewed By: gchanan
Differential Revision: D9481140
fbshipit-source-id: 14a8ea0c231ed88b5715fb86d92730926f9f92fc
Summary:
Please review the expects carefully to make sure there are no regressions. I tried to go over them one by one when they changed, but it's sometimes easy to miss finer details.
Summary of changes:
- Renamed `TensorType` to `CompleteTensorType`. Added a new `TensorType` which records only the scalar type, number of dimensions, and device of a value. The argument behind the rename is to encourage people to use `CompleteTensorType` less, as most passes will only have limited information available. To make the transition easier, `complete_type->cast<TensorType>()` works, which makes our passes work with both kinds of specialization if they don't need the extra detail.
- Renamed `ArgumentSpec` to `CompleteArgumentSpec`. Added a new `ArgumentSpec`, which matches argument only at the level of the new `TensorType`.
- Shape analysis can process graphs with both `CompleteTensorType` and `TensorType`.
- Fuser was a part that heavily relied on full shape information being available. Now, we simply try to fuse the largest possible graphs, and have to do run-time checks to make sure they match the code we generate. If they don't, we fall back to regular interpretation. The shape checks are implemented using an optimized method exploiting algebraic properties of shapes with broadcasting, and the relations of broadcasting with pointwise ops. A full written proof of correctness of the shape checking algorithm is included in a comment in `graph_fuser.cpp`.
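For reference, the plain broadcasting rule that the run-time checks must respect is sketched below; this is not the optimized algorithm proved correct in `graph_fuser.cpp`, just the textbook rule.
```
#include <algorithm>
#include <cassert>
#include <vector>

// Plain NumPy/PyTorch broadcasting: align shapes from the right; each pair
// of sizes must be equal or one of them must be 1. Returns false on mismatch.
bool broadcast_shapes(const std::vector<int64_t>& a,
                      const std::vector<int64_t>& b,
                      std::vector<int64_t>& out) {
  size_t ndim = std::max(a.size(), b.size());
  out.assign(ndim, 1);
  for (size_t i = 0; i < ndim; ++i) {
    int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
    int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
    if (da != db && da != 1 && db != 1) return false;
    out[ndim - 1 - i] = std::max(da, db);
  }
  return true;
}

int main() {
  std::vector<int64_t> out;
  assert(broadcast_shapes({8, 1, 6}, {7, 1}, out));   // -> {8, 7, 6}
  assert(out == (std::vector<int64_t>{8, 7, 6}));
  assert(!broadcast_shapes({3, 4}, {5}, out));        // 4 vs 5: not broadcastable
}
```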
zdevito ezyang mruberry ngimel csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10844
Differential Revision: D9498705
Pulled By: apaszke
fbshipit-source-id: 0c53c2fcebd871cc2a29c260f8d012276479cc61
Summary:
```
Use intrusive_ptr in Storage; replace unique_ptr<Storage> with Storage
This patch does two major changes:
- It replaces the use of Retainable in Storage with a new implementation
based on intrusive_ptr. This will be necessary because Caffe2 will
be using this class to implement intrusive_ptrs, and we need to
line these up for the merge. One good thing about the new implementation is
that the default copy/move constructors/assignment operators and destructor
work automatically, instead of needing to be hardcoded into Storage/Tensor.
- It replaces all places where we returned std::unique_ptr<Storage> with
Storage, collapsing an unnecessary double indirection that is no longer
necessary now that we have correctly working copy/move constructors.
I didn't initially want to do step (2), but it was very important to
eliminate all bare uses of new Storage and new StorageImpl, and making
the API change was the most straightforward way to do this.
HOW TO FIX YOUR CODE IN THE NEW API
- You no longer need to dereference the result of tensor.storage() to pass
it to set. So, instead of:
x.set_(*y.storage());
just write:
x.set_(y.storage());
- If you were accessing methods on StorageImpl via the pImpl() method, you
must use the dot operator to run pImpl(). Even better: just drop pImpl;
we now have method forwarding. So, instead of:
storage->pImpl()->data();
just do:
storage->data();
// storage.pImpl()->data() works too but is not as recommended
- storage->getDevice() is no more; instead use storage->device().index()
MISC CODE UPDATES
- retain, release, weak_retain, weak_release and weak_lock are now
reimplemented using the "blessed API", and renamed to make it
clearer that their use is discouraged.
- nvcc OS X and general OS X portability improvements to intrusive_ptr
- A new comment in intrusive_ptr describing how stack allocated
intrusive_ptr_targets work differently than heap allocated ones
from c10::make_intrusive
CAVEAT EMPTOR
- THStorage_weakRetain used to work on strong pointers, but it NO LONGER
works with intrusive_ptr. You must reclaim the strong pointer into a
real strong pointer, construct a weak pointer from it, and then release
the strong and weak pointers. See StorageSharing.cpp for an example.
```
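As an illustration of the release()/reclaim() round trip mentioned above, here is a hedged sketch against the c10::intrusive_ptr API in this tree (Payload is a made-up example type):
```
#include <c10/util/intrusive_ptr.h>
#include <iostream>

struct Payload : c10::intrusive_ptr_target {
  int value;
  explicit Payload(int v) : value(v) {}
};

int main() {
  // Heap-allocate through make_intrusive; refcount starts at 1.
  auto p = c10::make_intrusive<Payload>(7);

  // release() hands back a raw pointer that still owns one reference;
  // this is the form that is safe to pass through a C-style boundary.
  Payload* raw = p.release();

  // reclaim() re-wraps the raw pointer without bumping the refcount,
  // restoring ownership to an intrusive_ptr.
  auto q = c10::intrusive_ptr<Payload>::reclaim(raw);
  std::cout << q->value << " use_count=" << q.use_count() << "\n";
}
```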
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10488
Reviewed By: gchanan
Differential Revision: D9306134
Pulled By: ezyang
fbshipit-source-id: 02d58ef62dab8e4da6131e1a24834a65c21048e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10278
Translation to Backend happens immediately before we go into the
Type universe; otherwise we use TensorTypeId.
I allocated TensorTypeId corresponding exactly to existing ATen
Backend. Only CPUTensorId and CUDATensorId are relevant in the
Caffe2 universe.
Reviewed By: gchanan
Differential Revision: D9184060
fbshipit-source-id: 9d3989c26f70b90f1bbf98b2a96c57e2b0a46597
Summary:
The basic game plan is to stop accessing the type_ field directly,
and instead use the stored backend_, scalar_type_ and
is_variable_ to look up the appropriate Type from Context.
Storage of backend_ and scalar_type_ are new.
At some future point in time, I'd like to look at this code
carefully to see if I can get everything in this codepath inlining.
I didn't do it in this patch because there are circular include
problems making things difficult.
Some other details:
- Added Device::backend() which does what it says on the tin
- SparseTensorImpl is temporarily hard-coded to root in at::Context
for the appropriate context. If/when we put this in shared code,
we'll have to break this dep too, but for now it should be OK.
- There's a stupid problem with globalContext() deadlocking if
you didn't actually initialize it before loading libtorch.so
(which is bringing along the variable hooks). I fixed this by
reordering the static initializers. Fixes #9784
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10210
Differential Revision: D9150697
Pulled By: ezyang
fbshipit-source-id: 89e2006c88688bcfab0dcee82dc369127c198c35
Summary:
* Changes `insertConstant(g, val)` to `g.insertConstant(val)`.
* Moves SourceRange to its own file to enable it.
* Cleans up dead attribute code in schema matching and graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10177
Differential Revision: D9137789
Pulled By: zdevito
fbshipit-source-id: 8a73cfb01a576f02e7e4dce019be9c0a0002989d
Summary:
…e_/is_variable_
The basic game plan is to stop accessing the type_ field directly,
and instead use the stored backend_, scalar_type_ and
is_variable_ to look up the appropriate Type from Context.
Storage of backend_ and scalar_type_ are new.
At some future point in time, I'd like to look at this code
carefully to see if I can get everything in this codepath inlining.
I didn't do it in this patch because there are circular include
problems making things difficult.
Some other details:
- Added Device::backend() which does what it says on the tin
- SparseTensorImpl is temporarily hard-coded to root in at::Context
for the appropriate context. If/when we put this in shared code,
we'll have to break this dep too, but for now it should be OK.
- There's a stupid problem with globalContext() deadlocking if
you didn't actually initialize it before loading libtorch.so
(which is bringing along the variable hooks). I didn't fix
it in this PR; it's tracked in #9784
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9787
Reviewed By: cpuhrsch
Differential Revision: D8980971
Pulled By: ezyang
fbshipit-source-id: 2b4d867abfdc3999a836a220c638c109053145a8
Summary:
More clang tidy cleanups in `torch/csrc`. This time:
1. `hicpp-use-equals-default` recommends `= default` instead of `{}` for constructors/destructors. This is better practice because it expresses the intent better (https://stackoverflow.com/questions/6502828/what-does-default-mean-after-a-class-function-declaration)
2. `readability-inconsistent-declaration-parameter-name` enforces that parameter names in the declaration match parameter names in the definition. This is just generally useful and can prevent confusion and bugs.
Also updated my script a little bit.
apaszke ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9737
Differential Revision: D9069069
Pulled By: goldsborough
fbshipit-source-id: f7b3f3a4eb4c9fadc30425a153566d3b613a41ae
Summary:
Based on top of #9763 (first 3 commits belong to that PR). The first commits from this PR are "Stop using attributes ..."
I tried to separate the changes into fairly meaningful commits. I can't split them up into smaller PRs, because everything starts working and all tests pass only after the whole sequence, but hopefully this will make reviewing somewhat easier.
Known issues/regressions/future tasks:
- `aten::lerp` and `aten::clamp` are no longer fusable
- `CreateAutodiffSubgraphs` needs a rewrite
- It is much more strict now, and will miss a lot of opportunities, especially when viewing ops are involved. Our previous approach was "ignore the assumption on shape availability in gradient formulas to determine differentiability, and hope that shape prop will be robust enough to actually deliver them before we differentiate", which obviously doesn't scale well to more complex cases. We should either work on reducing the size dependency of grad formulas (feasible e.g. for `view`/`reshape`, unfeasible for `squeeze`/`unsqueeze`), or make `CreateAutodiffSubgraphs` integrate some kind of "I could integrate this node into an AD subgraph, but will I be able to infer the shape of its input" reasoning (kind of like a limited shape prop, that doesn't infer anything, and only tells if it *could* infer something).
- It sometimes creates constant-only (or constants + one node) graphs, which is useless
- Broken `aten::add` in auto-batching, because it gained a non-tensor input. I changed the test for pointwise operations to use `aten::mul` instead, but I needed to disable the LSTM cell test. I'm not sure how scalar constants should be implemented in this case, because I don't fully understand our format. cc: ChunliF
- Graph import does some hacks to recover type of constants. This code should be removed once we'll gain the ability to export the IR along with value types.
- There's still a fair amount of dead code that can be removed. I didn't want to make this diff any bigger, and removing it is an easy task.
- Graph fuser could be improved to use signature matching (possibly using `OperatorSet`) instead of basing on node kinds.
- Manual constant propagation for the `ListConstruct` node in `torch/onnx/utils.py` should be replaced with a proper constant propagation pass (or we should ensure that the one we have handles at least this case before we remove this code).
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9807
Reviewed By: ezyang
Differential Revision: D9004285
Pulled By: apaszke
fbshipit-source-id: fe88026a765f6b687354add034c86402362508b7
Summary:
This is blocking the IR operator unification, because I need to be able to pass scalars to backward functions.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9763
Reviewed By: zou3519
Differential Revision: D8978457
Pulled By: apaszke
fbshipit-source-id: 570b4c3409322459cb0f2592069730a7d586ab20
Summary:
I split it into two parts, _local_scalar and _local_scalar_dense (unchecked)
so I could reuse the sparse logic in both paths.
_local_scalar became a method on Tensor to work around a circular
include problem.
This is a resurrected copy of #9652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9762
Differential Revision: D8972348
Pulled By: ezyang
fbshipit-source-id: 2232dbfc8e1286b8a4a1c67d285c13a7771aad4c
Summary:
I got some tensor->variable conversion exceptions from `torch/csrc/autograd/variable.h`, which used the `TORCH_ASSERTM` macros instead of `AT_CHECK`, so they didn't have backtraces. This was such a substantial loss for debugability that I decided to update the whole codebase to use the backtrace-enabled ATen macros instead of `TORCH_ASSERT` and `JIT_ASSERT`, the latter having been an alias of the former.
ezyang apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9575
Differential Revision: D8924566
Pulled By: goldsborough
fbshipit-source-id: 7a4013b13eec9dbf024cef94cf49fca72f61d441
Summary:
This can hardly be called an improvement (we now print
CPUFloatType instead of CPUFloatTensor) but it was the
simplest way I could think of devirtualizing this function in
the short term. Probably need some sort of native function
that gives string information about a tensor.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Approved in #9710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9758
Differential Revision: D8966935
Pulled By: ezyang
fbshipit-source-id: a4641affe0a6153f90cdd9f4f2a1100e46d1a2db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9718
This patch switches the interpreter to use IValue's primitive numbers rather than tensors for computing on integers and floats. In addition to preparing the interpreter for first-class support of other types, this cleans up the handling of primitive numbers, making it possible to just use the normal operator overloading dispatch to find the right implementation for numbers. As a result of this change, a lot of other functionality needed to be updated since it was the first time we use non-tensors in a lot of places in the code base.
Notes:
* Fixes code_template.py so that multi-line strings are indented correctly when used on a standalone line
* Cast operators (`int(x)`) are now functional. Some tests have additional conversions to integers because
we no longer allow implicit tensor -> integer conversions following the same convention as in python
* prim::ListConstruct/createList has been added to the interpreter for creating lists and this has
replaced aten::stack for integers lists
* gen_jit_dispatch.py has been refactored so that non-tensor types use operators on IValues to extract
the primitives
* IValue gains a .to<T> method that is the equivalent of tensor_as but for IValue instead of at::Tensor
* `constant_as<T>` is switched over to using IValues's `.to<T>` method, to make conversion from constant->IValue->C++ type
more consistent. This functionality combined with `toIValue(Value*)` replaces the `tensor_as` and `as_tensor` family of functions.
* conditional expressions (if, loop) and operators related to them are now computed on integers rather than tensors
* IValue gains constructors for constructing from at::Scalar and converting to it. However, IValue itself will always store
the scalars as a double or int64.
* To align with python 3 syntax, TK_INT, TK_FLOAT, and TK_BOOL have been removed from the parser, and int/float/bool are just treated as special identifiers in the compiler,
along with print. These are represented as special sugared values with a `call` method implemented. For int/float/bool this implements casting behavior.
* Dropped shared_from_this from Type/Module. They were not needed and they made debugging harder because they internally throw/catch exceptions.
* Shape propagation has been updated to support running nodes that include floating point primitive types, this required some refactoring of internal functions.
* TensorToNum and NumToTensor have actual implementations as operators now
* register_prim_ops now contains implementations of math operators for float/int primitive types, and for mixed (prim <+> tensor) versions. This removes the need for special handling in compiler.cpp
* Primitive math is now entirely handled by letting the compiler choose the right overloads. This removes tons of special casing in the compiler.
* incorporates eellison's change to allow casting from return values. Due to the addition of primitive support, the code needed slight modifications, so I just pre-merged it here.
* stack.h gains generic vararg versions of push/pop that know how to convert to/from C++ types:
```
at::Tensor a;
at::Scalar b;
pop(stack, a, b);
at::Tensor c = a + b;
push(stack, c);
```
apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9584
Reviewed By: apaszke
Differential Revision: D8910546
Pulled By: zdevito
fbshipit-source-id: 0f3e60d4d22217f196a8f606549430e43b7e7e30
Summary:
IValue is short for interpreter value. It is used frequently so a short name is important.
This will allow us to implement more non-tensor types in an efficient way and remove
many hacks from the compiler.
This PR is limited. It only introduces IValue and changes interpreter to use it.
Follow up PRs will:
* Change the way aten_ops consume non-tensor types so that integer lists
are no longer represented as Tensors.
* Introduce TensorList as a fundamental type and remove all vararg handling in gen_jit_dispatch
* Change the compiler to implement math on primitive numbers rather than converting to tensors.
jamesr66a apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9368
Reviewed By: ezyang
Differential Revision: D8817598
Pulled By: zdevito
fbshipit-source-id: 29dce80611ce5f6384234de9d12a67861d2b112f
Summary:
This is a series of two commits that should probably be read separately. They are stacked on top of #9018 since the second commit requires it for correctness.
Commit 1
=======
This commit is the first in a series that will clean up how we handle declaring operators and intrinsics in the JIT to make it more modular and readable. This introduces readable declarations that can be used to register operators and switches gen_jit_dispatch to generate this schema. A follow up PR will remove the dispatch keys like "add-3" and resolve ops directly based on the registered schema, further simplifying the generation process.
* Switches schema over to parsed declarations, in the future this will allow something like:
```
registry.register_intrinsic("foo(Tensor a, Tensor b) -> Tensor", [](Stack& stack) {
...
})
```
This will allow the scalable registration of intrinsics for lists, tuples, and other ops, as long as meta-data for these ops (e.g. derivatives and size propagation routines).
The declarations resemble those used by PythonArgParser but have been significantly cleaned up to minimize the number of types that can appear in the declaration. We should strive to get the other parts of PyTorch switched over to this restricted declaration set when possible, but it is too much to do in a single PR. My hope is that eventually we will use a very similar language to describe declarations in C10, and this can serve as a guide for that.
Parsing is done using the script lexer, so it is very robust to whitespace and extensible for future types.
This removes the other way we encoded schema, and makes it easier to see what schema are registered.
Current generated declarations: https://gist.github.com/zdevito/a96a17766fb3a098d69a91ee00abaaf6
* Switches how we handle attempting to use an integer in the place of a fixed-sized int list, such as in conv (e.g. 'int[3] stride=1'). Now that we can statically distinguish between int and Tensor, we handle the expansion as an implicit conversion in the compiler. This allows us to simplify the interpreter since it no longer needs to handle the conversion itself.
* Schema declarations have been changed so that they match the type system in the IR exactly. In particular, attribute_info which was used by liftConstantAttributes has been dropped and constant attributes are lifted purely based on the type of the input. Type conversions in compiler have been simplified due to this change.
* Error highlighting in ErrorReport now only reports at most 20 lines of code, to make reading where an error occurred easier.
Commit 2
=======
This commit unifies aten_dispatch and aten_schema into a single Operator object that both contains schema and implementation information. In the future we can use this object to also contain functionality like shape prop and autodiff needed by all operators. Operators are registered globally, and dispatch logic uses the schema information to figure out which variant to use. Descriptor keys, a frequent source of inscrutable debug errors, have been removed.
* Introduce Operator, to replace TensorOp. Unlike TensorOp, we use Operator for all op implementations, including primitives that may occur in the graphs. The only exceptions are ops that are only known to the interpreter like jumps, and GraphExecutors where we need to record additional debug info.
* Adds a global registry for Operator implementations. aten_dispatch.cpp turns into register_aten_ops.cpp, which registers all the Operators for aten with the operator registry. register_prim_ops.cpp now contains the implementations for primitive operators that used to be in the interpreter. This means that it is now safe to use `getOperation(node)` to lookup the true interpreter function for the node, which will simplify const-propagation passes.
* Remove addInterpreterOpHandler in favor of global operator registry.
* Instead of descriptors, we match Node arguments directly against FunctionSchema describing expected inputs in `matchSchema`. `matchSchema` knows how parse both attributes and positional inputs from a node and match it to the appropriate registered operator. Debug error messages when we try to run an invalid operator are significantly improved: they now automatically display the schema for the op with the same name that are registered.
* Merge aten_schema into register_aten_ops. Each Operator takes a string schema which is parsed to determine when to dispatch to that op.
* Cleans up gen_jit_dispatch.py now that we do not need to write out descriptors. In particular, skip_scalar_overloads can be removed since Richard's code sorts declarations to put Tensor, Tensor declarations first.
* remove matchSchemaAndLiftConstantAttributes and use emitBuiltinCall instead to remove code duplication
* refactor stack manipulation functions into a separate header file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8885
Reviewed By: jamesr66a
Differential Revision: D8751048
Pulled By: zdevito
fbshipit-source-id: 312aabfbf88307c5f6ab947b6caf691468b94557