Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294
While tracking down a recent memory corruption bug, we found that
cuda-memcheck wasn't finding the bad accesses; ngimel pointed out that
this is because we use a caching allocator, so many "out of bounds" accesses
land in a valid slab.
This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc. This way, cuda-memcheck will actually work.
Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.
Specifically, I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826
And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```
Reviewed By: ngimel
Differential Revision: D23964734
Pulled By: bertmaher
fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44071
Previously, tracing re-gathered ScalarType, Layout, Device, bool into a TensorOptions object and called `tracer::addInput()` on the gathered TensorOptions argument. `tracer::addInput()` then scattered them again and added the individual scattered arguments to the traced graph. This PR avoids the extraneous gathering and re-scattering step and calls `tracer::addInput()` on the individual arguments directly, avoiding the perf hit of the unnecessary gathering step.
This applies to both c10-full and non-c10-full ops. In the case of c10-full ops, the tracing kernel takes scattered arguments and we can pass them directly to `tracer::addInput()`. In the case of non-c10-full ops, the kernel takes a `TensorOptions` argument, but we still call `tracer::addInput()` on the scattered arguments.
ghstack-source-id: 112825793
Test Plan:
waitforsandcastle
vs master: https://www.internalfb.com/intern/fblearner/details/216129483/
vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170069/
Reviewed By: ezyang
Differential Revision: D23486638
fbshipit-source-id: e0b53e6673cef8d7f94158e718301eee261e5d22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062
Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step: calling into a BackendSelect kernel required taking the individual scattered arguments and packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.
Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789
Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/
vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/
Reviewed By: ezyang
Differential Revision: D23484192
fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653
Per an offline discussion with ilia-cher, this changes the profiler so that the `disableProfiler()` event consolidation logic can be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387, where we defer profiling event collection until executing an async callback that can run on a different thread, to support RPC async function profiling.
This is done by introducing 2 flags, `cleanupTLSState` and `consolidate`, which control whether we should clean up thread-local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatibility is ensured since both options are true by default.
Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620
Reviewed By: mrshenli
Differential Revision: D23638499
fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44702
Original commit changeset: c6bd6d277aca
This diff caused the Windows build to fail due to a compiler bug in VS2019 (lambda capture of a constant int value). This back-out works around the issue by explicitly capturing the const int value.
Test Plan: Tested and previously landed.
Reviewed By: mruberry
Differential Revision: D23703215
fbshipit-source-id: f9ef23be97540bc9cf78a855295fb8c69f360459
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44252
Add tracing to the DPP client. Because DPP requests are async, we need to be able to start a trace event in one thread and potentially end it in a different thread. RecordFunction and LibgpumonObserver previously assumed each trace event starts and finishes in the same thread, so they used a thread-local context to track enter and exit callbacks. Async events break this assumption. This change attaches the event context to the RecordFunction object so we do not need to use a thread-local context.
Test Plan:
Tested with the DPP perf test and was able to collect a trace.
{F307824044}
Reviewed By: ilia-cher
Differential Revision: D23323486
fbshipit-source-id: 4b6ca6c0e32028fb38a476cd1f44c17a001fc03b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44066
Add an STL input iterator to DispatchKeySet:
* The iterator iterates from the first non-undefined DispatchKey
up to NumDispatchKeys.
* The iterator is invalidated once the underlying DispatchKeySet is invalidated.
Note: see http://www.cplusplus.com/reference/iterator/ for a comparison of
the different iterator categories. A usage sketch follows below.
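To illustrate the idea (not the actual c10::DispatchKeySet code), here is a minimal, self-contained sketch of an STL-style input iterator over a bitset-backed key set; `MiniKeySet` and the integer keys are stand-ins for `DispatchKeySet` and `DispatchKey`:
```
#include <cstdint>
#include <iostream>
#include <iterator>

// MiniKeySet is a stand-in for DispatchKeySet: one bit per key, keys are small ints.
struct MiniKeySet {
  uint64_t repr = 0;

  // An STL input iterator that visits only the keys whose bits are set.
  struct iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    uint64_t repr;
    int key;

    iterator(uint64_t r, int k) : repr(r), key(k) { skip_unset(); }

    int operator*() const { return key; }
    iterator& operator++() { ++key; skip_unset(); return *this; }
    bool operator==(const iterator& other) const { return key == other.key; }
    bool operator!=(const iterator& other) const { return !(*this == other); }

   private:
    // Advance to the next set bit; 64 acts as the end() sentinel.
    void skip_unset() {
      while (key < 64 && !(repr & (uint64_t(1) << key))) ++key;
    }
  };

  iterator begin() const { return iterator(repr, 0); }
  iterator end() const { return iterator(repr, 64); }
  void add(int key) { repr |= uint64_t(1) << key; }
};

int main() {
  MiniKeySet ks;
  ks.add(3);   // imagine "CPU"
  ks.add(10);  // imagine "AutogradCPU"
  for (int k : ks) std::cout << "key " << k << "\n";  // prints 3, then 10
}
```
The real iterator additionally starts past the Undefined key and stops at NumDispatchKeys, as noted above.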
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23611405
Pulled By: linux-jedi
fbshipit-source-id: 131b287d60226a1d67a6ee0f88571f8c4d29f9c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44440
`aten-op.cc` takes a long time to compile due to the large generated constructor. For each case, the `std::function` constructor and the initialization functions are inlined, producing a huge amount of intermediate code that takes a long time to optimize, given that many compiler optimization passes are superlinear in the function size.
This diff moves each case to a separate function, so that each one is cheap to optimize, and the constructor is just a large jump table, which is easy to optimize.
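As a rough, self-contained illustration of the pattern (all names here are hypothetical; the real code is the generated `aten-op.cc`), moving each case body into its own small function leaves the constructor as a simple table fill that the compiler can optimize cheaply:
```
#include <functional>
#include <string>
#include <unordered_map>

// Illustrative operator implementations (the real code wraps ATen operators).
int add_impl(int a, int b) { return a + b; }
int mul_impl(int a, int b) { return a * b; }

// Each former switch case now lives in its own small factory function, so the
// compiler optimizes many tiny functions instead of one enormous constructor.
std::function<int(int, int)> make_add() { return add_impl; }
std::function<int(int, int)> make_mul() { return mul_impl; }

struct OpTable {
  std::unordered_map<std::string, std::function<int(int, int)>> ops;
  // The constructor itself is now just a cheap-to-optimize jump table.
  OpTable() {
    ops["add"] = make_add();
    ops["mul"] = make_mul();
  }
};

int main() {
  OpTable table;
  return table.ops.at("add")(2, 3) == 5 ? 0 : 1;
}
```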
Reviewed By: dzhulgakov
Differential Revision: D23593741
fbshipit-source-id: 1ce7a31cda10d9b0c9d799716ea312a291dc0d36
Summary:
`is_complex_t` is a bad name. For example, in std there is `std::is_same` but no `std::is_same_t`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39906
Reviewed By: mrshenli
Differential Revision: D22665013
Pulled By: anjali411
fbshipit-source-id: 4b71745f5e2ea2d8cf5845d95ada4556c87e040d
Summary:
This PR moves `DispatchKey::Autograd` to an alias dispatch key mapping to `AutogradCPU, AutogradCUDA, AutogradXLA, AutogradOther, AutogradPrivate*` keys.
A few things are handled in this PR:
- Update alias dispatch key mapping and precompute dispatchTable logic
- Move the `Autograd` key from the `always_included` set to the TensorImpl constructor.
- Update the `dummyTensor` constructor to take `requires_grad` as an optional argument so that it's closer to the real usage in op_registration_test.
- Use the `BackendSelect` key for backend select both before and after the autograd layer. (a one-liner in the backend_select codegen)
A few planned followups ordered by priority:
- [cleanup] Update `test_dispatch.py` to include testing `Autograd`.
- [cleanup] Add Math alias key and move catchAll to Math. (to remove 2.2 in `computeDispatchTableEntryWithDebug`)
- [new feature] Add support for Math in native_functions.yaml
- [cleanup] Add iterator like functionality to DispatchKeySet
- [cleanup/large] Only add Autograd backend keys when tensor requires grad. (cc: ljk53 ?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43070
Reviewed By: ezyang
Differential Revision: D23281535
Pulled By: ailzhang
fbshipit-source-id: 9ad00b17142e9b83304f63cf599f785500f28f71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43719
Accidentally this slipped through: the with-guard did not update the current
context.
Test Plan: cpu_caching_allocator_test
Reviewed By: linbinyu
Differential Revision: D23374453
fbshipit-source-id: 1d3ef21cc390d0a8bde98fb1b5c2175b40ab571b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564
Static dispatch was originally introduced for mobile selective build.
Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23324452
Pulled By: ljk53
fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42006
This PR introduces a simple CPU caching allocator. This is specifically
intended for mobile use cases and for inference. Nothing specific to the
implementation prevents it from being used elsewhere; however, its
simplicity may not make it suitable everywhere.
It simply tracks allocations by size and relies on deterministic, repeatable
behavior where allocations of the same sizes are made on every inference.
Thus, when the pointer from the first allocation is returned to the allocator,
instead of freeing it back to the system, the allocator caches it for
subsequent use.
Memory is freed automatically at the end of the process, or it can be
explicitly freed.
This is enabled at the moment in DefaultMobileCPUAllocator only.
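A minimal, standalone sketch of the caching scheme described above, assuming size-keyed free lists; `SimpleCachingAllocator` is an illustrative stand-in, not the actual DefaultMobileCPUAllocator implementation:
```
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Freed blocks are kept in per-size free lists and handed back for later
// allocations of the same size, instead of going back to the system.
class SimpleCachingAllocator {
 public:
  void* allocate(std::size_t size) {
    auto& bucket = free_blocks_[size];
    if (!bucket.empty()) {
      void* ptr = bucket.back();    // reuse a previously freed block
      bucket.pop_back();
      allocated_sizes_[ptr] = size;
      return ptr;
    }
    void* ptr = std::malloc(size);  // first time we see this size: go to the system
    allocated_sizes_[ptr] = size;
    return ptr;
  }

  void release(void* ptr) {
    // Instead of returning the block to the system, cache it for reuse.
    std::size_t size = allocated_sizes_.at(ptr);
    allocated_sizes_.erase(ptr);
    free_blocks_[size].push_back(ptr);
  }

  // Release everything back to the system (e.g. at process exit or explicitly).
  void free_cached() {
    for (auto& kv : free_blocks_)
      for (void* ptr : kv.second) std::free(ptr);
    free_blocks_.clear();
  }

 private:
  std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
  std::unordered_map<void*, std::size_t> allocated_sizes_;
};

int main() {
  SimpleCachingAllocator alloc;
  void* a = alloc.allocate(1024);
  alloc.release(a);
  void* b = alloc.allocate(1024);  // same size on the next "inference": cached block reused
  alloc.release(b);
  alloc.free_cached();
}
```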
Test Plan:
android test: cpu_caching_allocator_test
Imported from OSS
Reviewed By: dreiss
Differential Revision: D22726976
fbshipit-source-id: 9a38b1ce34059d5653040a1c3d035bfc97609e6c
Summary:
They were likely copied from some macro definition, but they do not
belong to macro definitions here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43318
Reviewed By: pbelevich
Differential Revision: D23241526
Pulled By: mrshenli
fbshipit-source-id: e0b5eddfde2c882bb67f56d84ee79281cc5fc941
Summary:
Add a ComplexHalf case to toValueType, which fixes the logic for how view_as_real and view_as_complex slice a complex tensor into a floating point one, as it is used to generate tensors of random complex values, see:
018b4d7abb/aten/src/ATen/native/DistributionTemplates.h (L200)
Also add the ability to convert a Python complex object to `c10::complex<at::Half>`.
Add `torch.half` and `torch.complex32` to the list of `test_randn` dtypes.
Fixes https://github.com/pytorch/pytorch/issues/43143
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43279
Reviewed By: mrshenli
Differential Revision: D23230296
Pulled By: malfet
fbshipit-source-id: b4bb66c4c81dd867e72ab7c4563d73f6a4d80a44
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42612
Add a new Quantizer that supports an input zero point (bias) that can be float.
The quantization equation in this case is
Xq = (Xf - bias) * inv_scale, where bias is the float zero_point value.
We start with a per-row implementation and can extend it to per-tensor in the future, if necessary.
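A small standalone sketch of the per-row equation above, with an added round-and-clamp to the quantized type; the function names are illustrative and not the actual Quantizer API:
```
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Quantize one row with a float zero point (bias): Xq = round((Xf - bias) * inv_scale).
std::vector<uint8_t> quantize_row(const std::vector<float>& row, float bias, float inv_scale) {
  std::vector<uint8_t> out;
  out.reserve(row.size());
  for (float x : row) {
    float q = std::nearbyint((x - bias) * inv_scale);
    q = std::fmin(std::fmax(q, 0.0f), 255.0f);  // clamp to the quantized range
    out.push_back(static_cast<uint8_t>(q));
  }
  return out;
}

// Dequantize back: Xf = Xq / inv_scale + bias.
std::vector<float> dequantize_row(const std::vector<uint8_t>& row, float bias, float inv_scale) {
  std::vector<float> out;
  out.reserve(row.size());
  for (uint8_t q : row) out.push_back(static_cast<float>(q) / inv_scale + bias);
  return out;
}

int main() {
  std::vector<float> row = {0.1f, 0.5f, 1.0f};
  auto q = quantize_row(row, /*bias=*/0.05f, /*inv_scale=*/100.0f);
  auto d = dequantize_row(q, 0.05f, 100.0f);
  for (float x : d) std::cout << x << " ";  // approximately the original values
}
```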
Test Plan:
python test/test_quantization.py TestQuantizedTensor
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22960142
fbshipit-source-id: ca9ab6c5b45115d3dcb1c4358897093594313706
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42617
While we figure out the random plan, I want to initially disable
support for random operations. This is because there is an ambiguity in
what randomness means. For example,
```
tensor = torch.zeros(B0, 1)
vmap(lambda t: t.normal_())(tensor)
```
in the above example, should tensor[0] and tensor[1] be equal (i.e.,
use the same random seed), or should they be different?
The mechanism for disabling random support is as follows:
- We add a new dispatch key called VmapMode
- Whenever we're inside vmap, we enable VmapMode for all tensors.
This is done via at::VmapMode::increment_nesting and
at::VmapMode::decrement_nesting.
- DispatchKey::VmapMode's fallback kernel is the fallthrough kernel.
- We register kernels that raise errors for all random functions on
DispatchKey::VmapMode (see the sketch below). This way, whenever someone calls a random
function on any tensor (not just BatchedTensors) inside of a vmap block,
an error gets thrown.
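A rough sketch of what registering such error kernels can look like, assuming the `TORCH_LIBRARY_IMPL` registration macro and a boxed kernel signature of `(const c10::OperatorHandle&, torch::jit::Stack*)`; the kernel name and the specific ops shown are illustrative, not the literal code in this PR:
```
#include <torch/library.h>

// A boxed kernel that refuses to run: registered for random ops on DispatchKey::VmapMode.
void unsupportedRandomOp(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  (void)stack;  // unused: we error out before touching the arguments
  TORCH_CHECK(false, "vmap: calling random operation '", op.schema().name(),
              "' inside of vmap is not yet supported");
}

TORCH_LIBRARY_IMPL(aten, VmapMode, m) {
  // Everything falls through by default, so non-random ops behave as usual...
  m.fallback(torch::CppFunction::makeFallthrough());
  // ...but random ops hit the error kernel, regardless of whether the tensor is batched.
  m.impl("normal_", torch::CppFunction::makeFromBoxedFunction<&unsupportedRandomOp>());
  m.impl("uniform_", torch::CppFunction::makeFromBoxedFunction<&unsupportedRandomOp>());
}
```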
Test Plan: - pytest test/test_vmap.py -v -k "Operators"
Reviewed By: ezyang
Differential Revision: D22954840
Pulled By: zou3519
fbshipit-source-id: cb8d71062d4087e10cbf408f74b1a9dff81a226d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42619
Added missing entries to `DispatchKey::toString()` and reordered to match declaration order in `DispatchKey.h`
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D22963407
Pulled By: bhosmer
fbshipit-source-id: 34a012135599f497c308ba90ea6e8117e85c74ac
Summary:
This function was always expected to return a `size_t` value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42454
Reviewed By: ezyang
Differential Revision: D22993168
Pulled By: ailzhang
fbshipit-source-id: 044df8ce17983f04681bda8c30cd742920ef7b1e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42694
The old implementation allowed calling the SmallVector constructor and operator= for any type without restrictions,
but then failed with a compiler error when the type wasn't a collection.
Instead, we should only enable them if Container satisfies a container concept, and simply not match the constructor otherwise.
This fixes an issue kimishpatel was running into.
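A standalone sketch of the constraint, using a hypothetical `MiniVector` in place of `c10::SmallVector`: the converting constructor only participates in overload resolution when the argument actually has `begin()`/`end()`:
```
#include <list>
#include <type_traits>
#include <vector>

// Detect whether T looks like a container (has begin()/end()).
template <class T, class = void>
struct is_container : std::false_type {};
template <class T>
struct is_container<T,
    std::void_t<decltype(std::declval<T>().begin()), decltype(std::declval<T>().end())>>
    : std::true_type {};

// Illustrative stand-in for SmallVector: the container constructor only matches
// when Container really is a container, so non-container arguments simply don't
// match instead of blowing up with an error deep inside the constructor body.
template <class T>
class MiniVector {
 public:
  MiniVector() = default;
  template <class Container,
            std::enable_if_t<is_container<Container>::value, int> = 0>
  explicit MiniVector(const Container& c) : data_(c.begin(), c.end()) {}

 private:
  std::vector<T> data_;
};

int main() {
  std::list<int> l = {1, 2, 3};
  MiniVector<int> a(l);  // OK: std::list is a container
  (void)a;
  // MiniVector<int> b(42);  // would not compile: the constructor doesn't match for int
  static_assert(!std::is_constructible<MiniVector<int>, int>::value,
                "non-containers don't match the constructor");
}
```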
ghstack-source-id: 109370513
Test Plan: unit tests
Reviewed By: kimishpatel, ezyang
Differential Revision: D22983020
fbshipit-source-id: c31264f5c393762d822f3d64dd2a8e3279d8da44
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42570
ProfiledType doesn't do anything and is not currently used, so remove it.
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22938664
Pulled By: ilia-cher
fbshipit-source-id: 037c512938028f44258b702bbcde3f8c144f4aa0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249
The main change is to bring Caffe2's superior error messages for CUDA initialization into c10 and use them in all code paths.
Basic logic:
| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning | throw exception with ASAN message |
Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.
Other clean up changes:
* always cache device_count() in a static variable
* move all ASAN macros into c10
Test Plan:
Hard to unit test because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):
```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```
Reviewed By: ngimel
Differential Revision: D22824329
fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41610
Previously, operators that have a `Tensor?` (i.e. optional tensor) in their schema implemented it using `Tensor` in C++ and filled in an undefined tensor for the None case.
The c10 operator library, however, expects `Tensor?` to be represented as `optional<Tensor>`, so those operators couldn't be c10-full yet and still had to use codegenerated unboxing instead of templated unboxing.
This PR changes that. It extends the `hacky_wrapper_for_legacy_signatures` to not only take care of TensorOptions, but now also map between signatures taking `Tensor` and `optional<Tensor>`.
For this, it requires an additional template parameter, the expected signature, and it uses that to go argument-by-argument and unwrap any optionals it finds.
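A minimal standalone sketch of the unwrapping idea (mapping `optional<Tensor>` in the exposed signature onto the legacy undefined-tensor convention); the `Tensor` struct and function names below are stand-ins, not the real `hacky_wrapper_for_legacy_signatures` code:
```
#include <optional>

// Hypothetical stand-in for at::Tensor: an "undefined" instance represents None.
struct Tensor {
  bool defined = false;
};

// Legacy kernel: expects an undefined Tensor for the None case.
Tensor legacy_kernel(Tensor self, Tensor weight /* may be undefined */) {
  return weight.defined ? weight : self;
}

// Unwrap an optional argument into the legacy representation; pass others through.
Tensor unwrap(std::optional<Tensor> arg) {
  if (arg.has_value()) return *arg;
  return Tensor{};  // nullopt -> undefined Tensor
}
template <class T>
T unwrap(T arg) { return arg; }

// Wrapper exposing the c10-full signature (optional<Tensor>) on top of the legacy kernel.
Tensor wrapped_kernel(Tensor self, std::optional<Tensor> weight) {
  return legacy_kernel(unwrap(self), unwrap(weight));
}

int main() {
  Tensor t;
  t.defined = true;
  wrapped_kernel(t, std::nullopt);  // None becomes an undefined Tensor internally
}
```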
ghstack-source-id: 108873701
Test Plan: waitforsandcastle
Reviewed By: bhosmer
Differential Revision: D22607879
fbshipit-source-id: 57b2fb01a294b804f82cd55cd70f0ef4a478e14f
Summary:
* Make c10::cuda functions regular non-inlined functions
* Add driver_version() and device_synchronize() functions
With this change I no longer see direct calls to the CUDA API when looking at Modules.cpp.obj.
FYI malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42251
Reviewed By: malfet
Differential Revision: D22826505
Pulled By: ziab
fbshipit-source-id: 8dc2f3e209d3710e2ce78411982a10e8c727573c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38999
Adds boxing for inplace and outplace kernels, itemizes
remaining unsupported cases, and fails compilation when
new unsupported types are introduced in op signatures.
Test Plan: Imported from OSS
Differential Revision: D21718547
Pulled By: bhosmer
fbshipit-source-id: 03295128b21d1843e86789fb474f38411b26a8b6
Summary:
Update the API for accessing grad in C++ to avoid unexpected thread safety issues.
In particular, with the current API, a check like `t.grad().defined()` is not thread safe.
- This introduces `t.mutable_grad()`, which should be used when getting a mutable version of the saved gradient. This function is **not** thread safe (see the usage sketch below).
- The `Tensor& grad()` API is now removed. We could not do a deprecation cycle, as most of our call sites use non-const Tensors that would pick the non-const overload; this would lead to most calls hitting the warning, which would be too verbose for users.
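A small usage sketch of the resulting C++ API, assuming `grad()` now returns a const reference and `mutable_grad()` is the mutable accessor described above; illustrative rather than exhaustive:
```
#include <torch/torch.h>
#include <iostream>

int main() {
  auto t = torch::ones({2, 2}, torch::requires_grad());
  auto loss = (t * t).sum();
  loss.backward();

  // Read-only access: grad() returns a const reference, safe for checks like defined().
  if (t.grad().defined()) {
    std::cout << t.grad() << std::endl;
  }

  // Mutable access: use mutable_grad() when you intend to modify the saved gradient.
  // This accessor is NOT thread safe.
  t.mutable_grad() = torch::zeros({2, 2});
}
```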
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40887
Reviewed By: ezyang
Differential Revision: D22343932
Pulled By: albanD
fbshipit-source-id: d5eb909bb743bc20caaf2098196e18ca4110c5d2
Summary:
Declaring GLOG_ constants in the google namespace causes a conflict in C++ projects that use GLOG and link with LibPyTorch compiled without GLOG.
For example, see https://github.com/facebookresearch/ReAgent/issues/288
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41504
Reviewed By: kaiwenw
Differential Revision: D22564308
Pulled By: malfet
fbshipit-source-id: 2167bd2c6124bd14a67cc0a1360521d3c375e3c2