Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54896
This should help performance. (For example, it reduces the total
time spent in a C++ benchmark that just adds two tensors in place by
about 10%.)
ghstack-source-id: 125659451
Reviewed By: bhosmer
Differential Revision: D27404164
fbshipit-source-id: e1dce8c02100ee4ce22510298c7e0d0f192be201
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53388
Most of this method did not depend on the template parameter. No need to include it in the .h file or duplicate it in the generated code.
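For illustration, a minimal sketch of the hoisting pattern under hypothetical names (the real method lives in the PyTorch codegen; only the shape of the refactor is shown):

```cpp
#include <typeinfo>

// Only the thin, type-dependent shim stays templated in the header;
// the bulk of the work moves to a non-template function defined once
// in a .cpp file.
class Registry {
 public:
  template <typename T>
  void registerThing() {
    // The template parameter is only needed to obtain the type id;
    // everything else is type-independent.
    registerThingImpl(typeid(T));
  }

 private:
  // Defined in the .cpp file: compiled once, not instantiated per type,
  // and not duplicated in generated code.
  void registerThingImpl(const std::type_info& type);
};
```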
ghstack-source-id: 123211590
Test Plan: Existing CI should cover this
Reviewed By: smessmer
Differential Revision: D26851985
fbshipit-source-id: 115e00fa3fde547c4c0009f2679d4b1e9bdda5df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52344
This line is a bug-prone use of std::move combined with a reference to the moved-from parameter in the same series of function call arguments. This is normally a problem because the order of evaluation is undefined -- if the move happens before the call to `storage.device()`, we may have problems. It is not a problem here because we are merely forwarding from one `Storage&&` parameter to another.
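A hedged illustration of the pattern being flagged, with hypothetical stand-ins for the real `c10::Storage`/`c10::Device` types:

```cpp
#include <utility>

struct Device {};
struct Storage {
  Device device() const { return Device{}; }
};

void consume(Storage&& storage, Device device) {}

void forward(Storage&& storage) {
  // Flagged pattern: argument evaluation order is unspecified, so
  // storage.device() may run before or after std::move(storage). If the
  // move actually transferred ownership first, device() would read a
  // moved-from object. It happens to be safe here: std::move on a
  // Storage&& parameter merely casts for reference binding and never
  // runs a move constructor, so storage stays valid.
  consume(std::move(storage), storage.device());
}

void forward_unambiguous(Storage&& storage) {
  // The defensive rewrite: read device() before anything can move.
  Device d = storage.device();
  consume(std::move(storage), d);
}
```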
ghstack-source-id: 121837267
Test Plan: See no clang-tidy/HowToEven warning on the diff, I hope
Reviewed By: bhosmer
Differential Revision: D26436550
fbshipit-source-id: da85d79be854ff42c5a0cab9649ba82295816eca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51049
This diff makes it OK to query `has_storage()` on all TensorImpls. I added debug assertions that `storage_` is indeed never set on the TensorImpls that don't support storage, which is required for this to be correct.
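A minimal sketch of the invariant, with hypothetical types (the real logic lives in `c10::TensorImpl` and its storage-less subclasses):

```cpp
#include <cassert>

struct Storage {
  explicit operator bool() const { return owned; }
  bool owned = false;
};

struct TensorImpl {
  Storage storage_;
  bool supports_storage = true;  // stand-in for the subclass property

  bool has_storage() const {
    // Debug-only check backing the new contract: storage_ must never be
    // populated on TensorImpls that do not support storage, so simply
    // reporting whether storage_ is set is always correct.
    assert(supports_storage || !storage_);
    return static_cast<bool>(storage_);
  }
};
```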
ghstack-source-id: 120714380
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D26008498
fbshipit-source-id: b3f55f0b57b04636d13b09aa55bb720c6529542c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50290
This was reverted because it landed after D24772023 (b73c018598), which
changed the implementation of `dim()`; it was not rebased on top of that
change and thus broke the build.
ghstack-source-id: 119608505
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D25852810
fbshipit-source-id: 9735a095d539a3a6dc530b7b3bb758d4872d05a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50176
UndefinedTensorImpl was the only type that overrode this, and IIUC we don't need to do it.
ghstack-source-id: 119609531
Test Plan: CI, internal benchmarks
Reviewed By: ezyang
Differential Revision: D25817370
fbshipit-source-id: 985a99dcea2e0daee3ca3fc315445b978f3bf680
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47507
This introduces a new SizesAndStrides class as a helper for
TensorImpl, in preparation for changing its representation.
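A hedged sketch of the kind of interface such a helper exposes (simplified, not the exact class):

```cpp
#include <cstdint>
#include <vector>

// One object owning both sizes and strides, so TensorImpl code goes
// through a single interface and the underlying representation can be
// changed later without touching callers.
class SizesAndStrides {
 public:
  size_t size() const { return sizes_.size(); }  // number of dimensions

  int64_t size_at(size_t i) const { return sizes_[i]; }
  int64_t stride_at(size_t i) const { return strides_[i]; }

  void set_size_at(size_t i, int64_t v) { sizes_[i] = v; }
  void set_stride_at(size_t i, int64_t v) { strides_[i] = v; }

  void resize(size_t ndim) {
    sizes_.resize(ndim);
    strides_.resize(ndim);
  }

 private:
  // Deliberately private: follow-up work can swap std::vector for a
  // denser representation behind the same accessors.
  std::vector<int64_t> sizes_;
  std::vector<int64_t> strides_;
};
```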
ghstack-source-id: 119313559
Test Plan:
Added new automated tests as well.
Run framework overhead benchmarks. Results seem to be neutral-ish.
Reviewed By: ezyang
Differential Revision: D24762557
fbshipit-source-id: 6cc0ede52d0a126549fb51eecef92af41c3e1a98
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49770
The performance cost of making this commonly-called method virtual doesn't seem worth the benefit of having uses of undefined tensors crash a bit earlier (they'll still fail to dispatch).
ghstack-source-id: 119528065
Test Plan: framework overhead benchmarks
Reviewed By: ezyang
Differential Revision: D25687465
fbshipit-source-id: 89aabce165a594be401979c04236114a6f527b59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48877
Setting `Storage` in the TensorImpl ctor only to set it again in
`copy_tensor_metadata` wastes one refcount bump.
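A hypothetical sketch of the waste being removed; `Storage` stands in for the refcounted handle, where every copy costs an atomic refcount bump:

```cpp
#include <utility>

struct Storage { /* intrusive refcounted handle */ };

struct Impl {
  Storage storage_;
  explicit Impl(Storage s) : storage_(std::move(s)) {}
  void set_storage(Storage s) { storage_ = std::move(s); }
};

void before(Storage src) {
  Impl impl(src);         // copy #1: set storage in the ctor...
  impl.set_storage(src);  // ...only to overwrite it immediately; the
                          // first refcount bump bought nothing
}

void after(Storage src) {
  Impl impl(std::move(src));  // set the final storage exactly once
}
```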
ghstack-source-id: 117937872
Test Plan:
Internal benchmark. Compared results with perf; saw a 0.15% reduction
in the share of total time spent in
`TensorImpl::shallow_copy_and_detach`.
Reviewed By: bhosmer
Differential Revision: D25353529
fbshipit-source-id: e85d3a139ccd44cbd059c14edb19b22b962881a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48681
This should reduce reference counting traffic when creating views.
The code duplication here is unfortunate and I'm open to suggestions on how to reduce it. It's especially regrettable that we create a footgun for subclasses of TensorImpl: they can accidentally override only one of the two overloads and get confusing behavior.
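A hedged sketch of that footgun under hypothetical names; the pattern is a const-lvalue overload paired with an rvalue overload so callers that own their argument can hand it over without a refcount bump:

```cpp
struct VersionCounter { /* refcounted handle */ };

struct Base {
  virtual ~Base() = default;
  // Copy path and steal path; the rvalue overload avoids a refcount bump.
  virtual void detach_copy(const VersionCounter& v) { /* copies v */ }
  virtual void detach_copy(VersionCounter&& v) { /* steals v */ }
};

struct Sub : Base {
  // Footgun: only the const& overload is overridden here. A caller
  // holding a Base& and passing an rvalue virtual-dispatches to
  // Base::detach_copy(VersionCounter&&), silently skipping Sub's logic.
  void detach_copy(const VersionCounter& v) override { /* subclass logic */ }
  using Base::detach_copy;  // without this, the rvalue overload is also
                            // hidden from direct calls on a Sub
};
```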
ghstack-source-id: 117896685
Test Plan: internal benchmarks
Reviewed By: ezyang
Differential Revision: D25259741
fbshipit-source-id: 55f99b16b50f9791fdab85cbc81d7cd14e31c4cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48680
It seems a bit long to put into the header (and is virtual anyway).
ghstack-source-id: 117894350
Test Plan: CI
Reviewed By: bhosmer
Differential Revision: D25259848
fbshipit-source-id: e3eed1f2483fc3c1ff51459159bf3bfed9d6f363
Summary:
This PR moves `DispatchKey::Autograd` to an alias dispatch key mapping to `AutogradCPU, AutogradCUDA, AutogradXLA, AutogradOther, AutogradPrivate*` keys.
A few things are handled in this PR:
- Update alias dispatch key mapping and precompute dispatchTable logic
- Move `Autograd` key from `always_included` set to TensorImpl constructor.
- Update `dummyTensor` constructor to take `requires_grad` as optional argument so that it's closer to the real application in op_registration_test.
- Use the `BackendSelect` key for backend selection both before and after the autograd layer (a one-line change in the backend_select codegen).
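For illustration, a minimal sketch of what an alias key means for dispatch-table precomputation (hypothetical helper; the real mapping lives in the c10 dispatcher):

```cpp
#include <unordered_set>

// `Autograd` is no longer a runtime key: it never appears in a
// TensorImpl's key set. Registrations against it are expanded to the
// concrete per-backend autograd keys when the dispatch table is built.
enum class DispatchKey {
  AutogradCPU, AutogradCUDA, AutogradXLA, AutogradOther,
  Autograd,  // alias key
};

std::unordered_set<DispatchKey> expand_alias(DispatchKey k) {
  if (k == DispatchKey::Autograd) {
    return {DispatchKey::AutogradCPU, DispatchKey::AutogradCUDA,
            DispatchKey::AutogradXLA, DispatchKey::AutogradOther};
  }
  return {k};  // runtime keys map to themselves
}
```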
A few planned followups ordered by priority:
- [cleanup] Update `test_dispatch.py` to include testing `Autograd`.
- [cleanup] Add Math alias key and move catchAll to Math. (to remove 2.2 in `computeDispatchTableEntryWithDebug`)
- [new feature] Add support for Math in native_functions.yaml
- [cleanup] Add iterator like functionality to DispatchKeySet
- [cleanup/large] Only add Autograd backend keys when tensor requires grad. (cc: ljk53 ?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43070
Reviewed By: ezyang
Differential Revision: D23281535
Pulled By: ailzhang
fbshipit-source-id: 9ad00b17142e9b83304f63cf599f785500f28f71
Summary:
Update the API for accessing grad in C++ to avoid unexpected thread-safety issues.
In particular, with the current API, a check like `t.grad().defined()` is not thread safe.
- This introduces `t.mutable_grad()` that should be used when getting a mutable version of the saved gradient. This function is **not** thread safe.
- The `Tensor& grad()` API is now removed. We could not do a deprecation cycle because most of our call sites use non-const Tensors, which pick the non-const overload; that would make most calls hit the warning, which would be too verbose for users.
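A minimal sketch of the resulting accessor split (hypothetical stand-in types; the real methods live on `at::Tensor`):

```cpp
struct GradHolder { /* the stored gradient */ };

struct TensorSketch {
  GradHolder grad_;

  // Read path: const reference only, so `t.grad().defined()`-style
  // checks cannot mutate the gradient they are inspecting.
  const GradHolder& grad() const { return grad_; }

  // Write path: opt-in by name, documented as not thread safe.
  GradHolder& mutable_grad() { return grad_; }
};
```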
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40887
Reviewed By: ezyang
Differential Revision: D22343932
Pulled By: albanD
fbshipit-source-id: d5eb909bb743bc20caaf2098196e18ca4110c5d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32734
VariableTensorId is the only key with this treatment today,
but BackendSelect and CompoundOp are coming soon.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D19628091
Pulled By: ezyang
fbshipit-source-id: 250753f90528fa282af7a18d8d2f7736382754bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30874
These have all been disabled at this point, so there is no difference in the generated code.
Test Plan: Imported from OSS
Differential Revision: D18855990
Pulled By: gchanan
fbshipit-source-id: 03796b2978e23ef9060063f33241a1cbb39f1cf3
Summary:
This improved a multi-d microbenchmark by ~100 ns; `empty_tensor_restride` used to be 13% of iteration time and is now about 5%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30452
Test Plan: Covered by existing tests
Differential Revision: D18704233
Pulled By: ngimel
fbshipit-source-id: be527f09183bc31e9d1f63fd49bfbe0998fe167f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28620
All Tensors are Variables now; they just happen to have `requires_grad=False`. Tensors ALWAYS have `VariableTensorId` in their type set.
When constructing this patch, I had to make decisions about what I would fix in this patch, and what I would leave for follow up PRs. Here is the cleanup that happens in this patch:
- The `is_variable` property is removed from TensorOptions. I removed this immediately because, unlike Tensor::is_variable, TensorOptions::is_variable doesn't respect our VariableTensorId thread-local state. This means that there were a bunch of places where TensorOptions::is_variable was false, which is obviously bogus in a world where tensor and variable are merged. Instead of keeping the method as a function that always returns true, I just opted to remove it entirely (it's not public API). All places we set `is_variable` are deleted.
- Knock-on effect: there is no longer a separate DeprecatedTypeProperties for the variable and non-variable versions of a type.
- Knock-on effect: instead of asserting on TensorOptions::is_variable, we just test `at::impl::variable_is_excluded()`.
- There is now only one copy of the cuDNN RNN dropout cache, not two (I'm not sure why we had two to begin with).
Some cleanup that doesn't happen in this patch:
- Eliminating unnecessary uses of `make_variable`
- Eliminating `Tensor::is_variable`
The most subtle part of this patch is retaining tracing behavior: the fact that everything is a Variable means that more code gets routed to VariableType than before; this can change traces. I identified two places where we didn't appropriately turn off VariableType, mostly factory functions:
- `torch.tensor` must turn off VariableType before invoking `at::empty` to construct the tensor, as it subsequently does direct data access
- `tensor_slow` (invoked when you pass a Python scalar to a tensor argument) must turn off VariableType before calling `scalar_to_tensor` so the scalar gets traced as constant, rather than as a call to `scalar_to_tensor`.
Honestly, these are all giant hacks, and should be replaced with a more specialized guard that just toggles tracing.
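A minimal sketch of the guard pattern being described, assuming a simple thread-local flag (hypothetical names; the real guards live in at/c10):

```cpp
thread_local bool variable_type_excluded = false;

// RAII object that flips the "exclude VariableType" flag for the scope
// of a factory call, restoring the previous value on exit.
class ExcludeVariableTypeGuard {
 public:
  ExcludeVariableTypeGuard() : prev_(variable_type_excluded) {
    variable_type_excluded = true;  // dispatch skips VariableType now
  }
  ~ExcludeVariableTypeGuard() {
    variable_type_excluded = prev_;
  }

 private:
  bool prev_;
};

// Usage shape inside e.g. torch.tensor's implementation:
//   ExcludeVariableTypeGuard guard;
//   // ...allocate with at::empty and do direct data access here...
```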
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: dreiss
Differential Revision: D18171156
Pulled By: ezyang
fbshipit-source-id: 5b6a045beba37492647e350190f495114e86504d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28610
The basic idea is that, in some cases where we previously stored a pointer to a full AutogradMeta object, we instead store a nullptr. We let a nullptr represent a default-constructed AutogradMeta object, and simply populate it with a real AutogradMeta if there is ever a situation where we need to modify it.
The primary technical contrivance in this diff is that I have to use AutogradMetaFactory to lazily initialize the AutogradMeta, as it is not available in the dynamic library that TensorImpl is in. (I spent a while trying to put them in the same compilation unit, but gave up in the end as it pushed us over the Windows linking binary size limit. Eep.)
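A hedged sketch of the lazy scheme under hypothetical names; nullptr stands in for a default-constructed AutogradMeta, and a factory materializes a real one only on first mutation:

```cpp
#include <memory>

struct AutogradMetaInterface {
  virtual ~AutogradMetaInterface() = default;
};

struct AutogradMetaFactory {
  virtual ~AutogradMetaFactory() = default;
  virtual std::unique_ptr<AutogradMetaInterface> make() const = 0;
};

AutogradMetaFactory*& autograd_meta_factory() {
  static AutogradMetaFactory* f = nullptr;  // set by the autograd library
  return f;
}

struct TensorImplSketch {
  std::unique_ptr<AutogradMetaInterface> autograd_meta_;  // often null

  AutogradMetaInterface* materialize_autograd_meta() {
    if (!autograd_meta_) {
      // First write: replace the implicit default with a real object,
      // built through the factory because the concrete type lives in a
      // different dynamic library than TensorImpl.
      autograd_meta_ = autograd_meta_factory()->make();
    }
    return autograd_meta_.get();
  }
};
```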
Some other notes:
- `set_autograd_meta` now unconditionally turns a tensor into a variable. I audited all call sites and observed there are no occurrences where nullptr is passed (after this patch, there are now!)
- `copy_tensor_metadata` is updated to unconditionally preserve the VariableTensorId-ness of the destination tensor. I think this is the more correct semantics; we can't do the old semantics anymore.
- There's a bunch of places in the API where we return const references to objects. This is pretty weird to me, but I didn't feel like cleaning it up. But sometimes I don't conveniently have something that's the right lifetime, so I introduced a number of singletons to handle this correctly.
You might wonder why I'm doing the optimization before the variable-tensor dynamic merge. The reason is simple: this change is semantics preserving, while variable-tensor dynamic merge is not. So it is easier to get right, and prevents us from regressing performance if we do it the other way.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18171162
Pulled By: ezyang
fbshipit-source-id: 580df729e4d04881b2b9caa0f0c00785b3afbb92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28609
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18171159
Pulled By: ezyang
fbshipit-source-id: 509061ca56186c7762da9634abecbafad0277d94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28593
When I turn on Variable everywhere, I will need to be able to construct
AutogradMetas from TensorImpl. But I cannot call the constructor directly
as it lives in another dynamic library. So I need another virtual factory interface
to be able to do this.
I also adjust the AutogradMeta constructor so that the TensorImpl argument is
optional. This argument is only needed if `requires_grad == True`, as we use it
to test whether the variable is valid (only floating-point tensors can have `requires_grad` set to true).
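A minimal sketch of the registration half under hypothetical names: the autograd library, which can see the concrete type, installs a factory at load time, while TensorImpl's library only ever sees the abstract interface:

```cpp
#include <memory>

struct AutogradMetaInterface { virtual ~AutogradMetaInterface() = default; };

struct AutogradMetaFactory {
  virtual ~AutogradMetaFactory() = default;
  virtual std::unique_ptr<AutogradMetaInterface> make() const = 0;
};

// Lives alongside TensorImpl; only the abstract types are visible here.
static AutogradMetaFactory* g_factory = nullptr;
void SetAutogradMetaFactory(AutogradMetaFactory* f) { g_factory = f; }

// --- in the autograd dynamic library ---
struct AutogradMeta : AutogradMetaInterface { /* grad, hooks, ... */ };

struct ConcreteFactory : AutogradMetaFactory {
  std::unique_ptr<AutogradMetaInterface> make() const override {
    return std::make_unique<AutogradMeta>();
  }
};

// Static registration: runs when the autograd library is loaded.
static ConcreteFactory factory_instance;
static const bool registered =
    (SetAutogradMetaFactory(&factory_instance), true);
```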
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18171161
Pulled By: ezyang
fbshipit-source-id: 3f2e86720899b3bda36ddd90244c2624645cc519
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28592
These aren't perf critical, and putting them in a cpp file makes it easier to
work on them.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18171158
Pulled By: ezyang
fbshipit-source-id: 4aad434ad4aecba7ed46761f676df6bbec37733e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26719
This PR adds a pair of tests for fallback boxed dispatch, exercising two different ways you might use it: (1) to implement a "wrapper" tensor type (e.g., LazyTensor, NestedTensor), and (2) to implement a toggleable "mode" (e.g., Profiling, Tracing). Both are the most trivial possible implementations of their type: they "wrap" a real tensor and simply forward along to the real implementation. This PR also adds the necessary feature support for toggleable modes, which is in the original generic dispatch abstraction design but was not previously implemented. I had not originally intended to add this, but it turns out writing a new "mode" is a lot simpler than writing a "wrapper" type, so I ended up writing the mode version first.
General structure of the PR:
* Add two new testing tensor type ids, `TESTING_ONLY_GenericWrapperTensorId` and `TESTING_ONLY_GenericModeTensorId`, which our tests use. They might find other use in other tests if necessary.
* Add support for toggling the availability of `TESTING_ONLY_GenericModeTensorId`. This introduces a new thread-local variable, accessible via `tls_local_tensor_type_set()`, which is taken into account during dispatch.
* The mode fallback is very simple: it increments a counter and then passes on the call to the underlying kernel by invoking the JIT.
* The wrapper fallback is more complex: it parses the arguments, unwrapping any wrapped tensor arguments, then invokes the JIT, and then rewraps the outputs.
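A minimal sketch of the toggleable-mode mechanism, assuming a simple bitset layout (hypothetical names; the real code uses `TensorTypeSet` and `tls_local_tensor_type_set()`):

```cpp
#include <cstdint>

enum class TensorTypeId : uint8_t { CPU = 0, TESTING_ONLY_GenericMode = 1 };

thread_local uint64_t tls_included_type_set = 0;

void include_mode(TensorTypeId id) {
  tls_included_type_set |= (1ull << static_cast<uint8_t>(id));
}
void exclude_mode(TensorTypeId id) {
  tls_included_type_set &= ~(1ull << static_cast<uint8_t>(id));
}

uint64_t effective_type_set(uint64_t tensor_type_set) {
  // The TLS set participates in dispatch exactly as if its keys were
  // carried on the tensor itself; the highest-priority key wins, so an
  // enabled mode intercepts every call before the backend kernel.
  return tensor_type_set | tls_included_type_set;
}
```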
The examples here are somewhat simplistic; there are a number of engineering improvements that could be applied. We could save these for later (landing this patch to get immediate testing), or incorporate them into this patch:
* `getOperator` is horrible. Bram Wasti and I discussed a plan for how to make this easier, by simply refactoring the JIT interface.
* `GenericWrapperTensorImpl` doesn't populate all of its fields accurately. Most notably, size is not set up correctly.
* `generic_wrapper_fallback` should handle tensor lists in arguments and returns properly.
One pitfall: fallback dispatch only works with non-c10 code. That's why I test using `batch_norm`.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D17549624
Test Plan: Imported from OSS
Pulled By: ezyang
fbshipit-source-id: 57dbdd8d6812a66082aa6db2934c8edcda340ea6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27106
Adds memory_format option to the `clone` operator.
Introduces new `clone` behavior when used with `input_t.clone(memory_format=torch.preserve_format)`:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If not (1) and the tensor is stored in the channels last format, the output tensor will also have channels last format.
3) The output tensor will be contiguous in all other cases.
---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own distinct memory location, with no two elements sharing memory.
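A usage sketch of the preserve-format behavior from the C++ side, exercising case (2) above:

```cpp
#include <ATen/ATen.h>

void clone_preserve_example() {
  at::Tensor nchw = at::rand({2, 3, 4, 5});
  // Make a channels-last (NHWC-strided) source tensor.
  at::Tensor nhwc = nchw.contiguous(at::MemoryFormat::ChannelsLast);

  // The source is channels last, so the preserved clone comes back
  // channels last as well.
  at::Tensor c = nhwc.clone(at::MemoryFormat::Preserve);
  TORCH_INTERNAL_ASSERT(c.is_contiguous(at::MemoryFormat::ChannelsLast));
}
```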
Test Plan: Imported from OSS
Differential Revision: D17699357
Pulled By: VitalyFedyunin
fbshipit-source-id: 5ae1537c2aca1abf0bf1eec4416846129c156f66