Summary: This adds a new MTIA DeviceType, which is associated with the MTIA DispatchKey and will be used for Meta's in-house training and inference accelerators.
Test Plan: All CI should pass.
Differential Revision: D42526044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92232
Approved by: https://github.com/ezyang
This PR adds some improvements for external backends that are building C++ backends out of tree using the `PrivateUse1` dispatch key.
The docs and linked examples go over the API in more detail, but you should be able to use it like this:
```
# This should probably be in the __init__.py file of an external backend's python package
torch.register_privateuse1_backend("foo")
# And it will allow the user to do this:
a = torch.ones(2, device="foo")
```
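On the C++ side, the backend registers its kernels under the `PrivateUse1` dispatch key. A minimal sketch, where `foo_add` is a hypothetical kernel (a real backend would launch device code instead of this stub):
```
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical out-of-tree kernel; illustrative stub only.
at::Tensor foo_add(const at::Tensor& self, const at::Tensor& other, const at::Scalar& alpha) {
  TORCH_CHECK(false, "foo_add: not implemented in this sketch");
  return self; // unreachable
}

TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  // Tensors on device "foo" now dispatch aten::add.Tensor here.
  m.impl("add.Tensor", &foo_add);
}
```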
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86992
Approved by: https://github.com/albanD
Rather than hardcoding the list of device names, generate it from the list of known device types.
Performance is not important on this error codepath, as it is not evaluated during normal execution.
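A minimal sketch of the idea (the device list here is illustrative; the real code derives the names from c10's DeviceType table):
```
#include <sstream>
#include <string>
#include <vector>

// Illustrative stand-in for the table of known device type names.
const std::vector<std::string>& knownDeviceNames() {
  static const std::vector<std::string> names = {"cpu", "cuda", "xla"};
  return names;
}

// Build the "Expected one of ..." message from the table instead of a
// hardcoded string literal. Cost does not matter on this error path.
std::string expectedDeviceTypesMessage() {
  std::ostringstream oss;
  oss << "Expected one of ";
  const auto& names = knownDeviceNames();
  for (size_t i = 0; i < names.size(); ++i) {
    if (i != 0) {
      oss << ", ";
    }
    oss << names[i];
  }
  oss << " device type at start of device string";
  return oss.str();
}
```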
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85982
Approved by: https://github.com/kit1980
Summary: The new PrivateUse1 DeviceType is associated with the PrivateUse1 DispatchKey, which can be used for non-public devices without introducing a new device type. Note that the stringified name of the PrivateUse1 device is "privateuseone".
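For example, a minimal C++ sketch of the stringified name:
```
#include <c10/core/Device.h>
#include <iostream>

int main() {
  c10::Device d(c10::DeviceType::PrivateUse1);
  std::cout << d.str() << std::endl; // prints "privateuseone"
}
```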
Test Plan: All CI should pass.
Differential Revision: D35859437
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77208
Approved by: https://github.com/bdhirsh
Summary:
This PR implements the necessary hooks/stubs/enums/etc for complete ONNX Runtime (ORT) Eager Mode integration. The actual extension will live out of tree at https://github.com/pytorch/ort.
We have been [working on this at Microsoft](https://github.com/microsoft/onnxruntime-pytorch/tree/eager-ort/torch_onnxruntime) for the last few months, and are finally ready to contribute the PyTorch core changes upstream (nothing major or exciting, just the usual boilerplate for adding new backends).
The ORT backend will allow us to ferry [almost] all torch ops into granular ONNX kernels that ORT will eagerly execute against any devices it supports (therefore, we only need a single ORT backend from a PyTorch perspective).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58248
Reviewed By: astaff
Differential Revision: D30344992
Pulled By: albanD
fbshipit-source-id: 69082b32121246340d686e16653626114b7714b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63464
This was previously committed as D30281388 (4d6f98ecad), but was reverted due to t98478641. jnkwok1 confirmed that this change was not the root cause, so trying to land it again.
Currently, `std::regex` is used for parsing device strings. This is undesirable for a few reasons.
1. Increases binary size
2. Slows down model loading
3. Potentially uses more memory at runtime
4. Marginally increases the build time of code that uses std::regex versus code that does not
This change avoids the use of `std::regex` for parsing the device string, since it isn't needed.
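A simplified sketch of regex-free parsing (this is not the exact code in c10's Device.cpp, just the shape of the technique):
```
#include <stdexcept>
#include <string>

struct ParsedDevice {
  std::string type;
  int index = -1; // -1 means no index was given
};

// Split "type" or "type:index" by hand instead of with std::regex.
ParsedDevice parseDeviceString(const std::string& s) {
  ParsedDevice out;
  const auto colon = s.find(':');
  out.type = s.substr(0, colon);
  if (colon != std::string::npos) {
    const std::string idx = s.substr(colon + 1);
    if (idx.empty() || idx.find_first_not_of("0123456789") != std::string::npos) {
      throw std::runtime_error("Invalid device index in device string: " + s);
    }
    out.index = std::stoi(idx);
  }
  return out;
}
```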
ghstack-source-id: 136081898
Test Plan:
### AI Bench Runs
**Before this change:**
1. Model Load time: [252ms](https://www.internalfb.com/intern/aibench/details/332471502816548)
2. Model unload time: 3.5ms
**After this change:**
1. Model Load time: [240ms](https://www.internalfb.com/intern/aibench/details/652195589031318), an approximately 5% reduction for the current model. I suspect the percentage will be larger for smaller models, since this is a fixed-cost reduction.
2. Model unload time: 3.3ms (probably too small to be meaningfully impactful to an end user).
### BSB Results
```
D30281388 (4d6f98ecad)-V1 (https://www.internalfb.com/intern/diff/D30281388/?dest_number=135713848)
messenger-pika-optimized-device: Succeeded
Change in Download Size for arm64 + 3x assets variation: -7.1 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -17.6 KiB
Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:551399955987465@base/bsb:551399955987465@diff/
```
Reviewed By: raziel, pavithranrao
Differential Revision: D30388269
fbshipit-source-id: 10942e7aa56f9ea47aa479a8f50187f2ce2899bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63204
Currently, `std::regex` is used for parsing device strings. This is undesirable for a few reasons.
1. Increases binary size
2. Slows down model loading
3. Potentially uses more memory at runtime
4. Marginally increases the build time of code that uses std::regex versus code that does not
This change avoids the use of `std::regex` for parsing the device string, since it isn't needed.
ghstack-source-id: 136006963
Test Plan:
### AI Bench Runs
**Before this change:**
1. Model Load time: [252ms](https://www.internalfb.com/intern/aibench/details/332471502816548)
2. Model unload time: 3.5ms
**After this change:**
1. Model Load time: [240ms](https://www.internalfb.com/intern/aibench/details/652195589031318), an approximately 5% reduction for the current model. I suspect the percentage will be larger for smaller models, since this is a fixed-cost reduction.
2. Model unload time: 3.3ms (probably too small to be meaningfully impactful to an end user).
### BSB Results
```
D30281388-V1 (https://www.internalfb.com/intern/diff/D30281388/?dest_number=135713848)
messenger-pika-optimized-device: Succeeded
Change in Download Size for arm64 + 3x assets variation: -7.1 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -17.6 KiB
Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:551399955987465@base/bsb:551399955987465@diff/
```
Reviewed By: raziel
Differential Revision: D30281388
fbshipit-source-id: 4d998e9f313e6366d9d89a6a73cd090ddfb059fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830
Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.
Test Plan: CI
Reviewed By: zertosh
Differential Revision: D27979080
fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53143
Meta is now an honest-to-goodness device type, like cpu, so you can use
device='meta' to trigger allocation of meta tensors. This is way better
than empty_meta, since we now have a working API for most factory functions
(they don't necessarily work yet, though, because we still need to register
Meta versions of those functions).
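A quick libtorch sketch of the user-visible API (assuming a build where the relevant factory functions have Meta support registered):
```
#include <torch/torch.h>
#include <iostream>

int main() {
  // A meta tensor carries shape/dtype metadata but no backing storage.
  auto t = torch::empty({2, 3}, torch::TensorOptions().device(c10::DeviceType::Meta));
  std::cout << t.sizes() << " on " << t.device() << std::endl; // [2, 3] on meta
}
```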
Some subtleties:
- I decided to drop the concept of CPU versus CUDA meta tensors; meta
tensors are device agnostic. It's hard to say exactly what the
correct level of abstraction here is, but in this particular case
implementation considerations trump semantic considerations: it
is way easier to have just a meta device, than to have a meta device
AND a cpu device AND a cuda device. This may limit the applicability
of meta tensors for tracing models that do explicit cpu()/cuda()
conversions (unless, perhaps, we make those operations no-ops on meta
tensors).
- I noticed that the DeviceType uppercase strings are kind of weird.
Are they really supposed to be all caps? That's weird.
- I moved the Meta dispatch key to live with the rest of the "device"
dispatch keys.
- I intentionally did NOT add a Backend for Meta. For now, I'm going to
hope meta tensors never exercise any of the Backend conversion code;
even if it does, better to fix the code to just stop converting to and
from Backend.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: samestep
Differential Revision: D26763552
Pulled By: ezyang
fbshipit-source-id: 14633b6ca738e60b921db66a763155d01795480d
Summary:
Apple recently announced ML Compute, a new framework available in macOS Big Sur that enables users to accelerate the training of neural networks on Mac hardware. This PR is the first in a series of PRs that will enable the integration with ML Compute. Most of the integration code will live in a separate subrepo named `mlc`.
The integration with `mlc` (ML Compute) will be very similar to that of xla. We rely on registering our ops through:
```
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl_UNBOXED(<op_schema_name>, &customized_op_kernel);
  ...
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50634
Reviewed By: malfet
Differential Revision: D26614213
Pulled By: smessmer
fbshipit-source-id: 3b492b346c61cc3950ac880ac01a82fbdddbc07b
Summary:
Add a new device type, 'XPU' ('xpu' in lower case), to PyTorch. Changes are needed in code related to the device model and kernel dispatch, e.g. DeviceType, Backend, and DispatchKey.
https://github.com/pytorch/pytorch/issues/48246
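A minimal sketch of what the new type enables at the device-string level (assuming a build with XPU registered):
```
#include <c10/core/Device.h>
#include <cassert>

int main() {
  // "xpu" now parses like any other known device string.
  c10::Device d("xpu:0");
  assert(d.type() == c10::DeviceType::XPU);
  assert(d.index() == 0);
}
```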
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49786
Reviewed By: mrshenli
Differential Revision: D25893962
Pulled By: ezyang
fbshipit-source-id: 7ff0a316ee34cf0ed6fc7ead08ecdeb7df4b0052
Summary:
ezyang,
I have added the changes to DispatchKey, DeviceType, Backend to support the out-of-tree FPGA.
cc. tataetae
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38938
Differential Revision: D21748955
Pulled By: ezyang
fbshipit-source-id: fe76d9730818205961430d2a0e00727b5c547b32
Summary:
This PR makes the expected torch device string error message include `xla` as an acceptable torch device prefix string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36446
Test Plan:
No logic changed; made sure `xla` is acceptable in `torch.device`:
```
import torch
device = torch.device("xla")
```
```
device = torch.device("unrecognized")
RuntimeError: Expected one of cpu, cuda, mkldnn, opengl, opencl, ideep, hip, msnpu, xla device type at start of device string: unrecognized
```
Differential Revision: D20993449
Pulled By: dahsh
fbshipit-source-id: 83afe4f913a650a655bfda9c2a64bf9e5aa27e16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29351
When torch::save()ing a smallish tensor, we still spend ~5% of the time
in std::stringstream constructors.
This removes the last couple of cases. The benchmark shows a ~5% improvement:
TorchSaveSmallTensor Pre: 13.12us
TorchSaveSmallTensor Post: 12.48us
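The general technique, sketched with a hypothetical helper (not the actual torch::save code):
```
#include <sstream>
#include <string>

// Before: constructs a std::stringstream on every call.
std::string keySlow(int i) {
  std::ostringstream oss;
  oss << "tensor/" << i;
  return oss.str();
}

// After: plain string concatenation, no stream machinery.
std::string keyFast(int i) {
  return "tensor/" + std::to_string(i);
}
```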
ghstack-source-id: 93517928
Test Plan:
buck build mode/opt experimental/jeremyl/c2:
buck-out/opt/gen/experimental/jeremyl/c2/SerializationBench --bm_regex=TorchSaveSmallTensor
Differential Revision: D18365066
fbshipit-source-id: a3284bec004751cedae1cdadf27f969422faff8e
Summary:
This PR also moves Device::validate into the header file, which makes
statements like `Device d = kCPU` effectively free.
Device includes the device's index, so TensorIterator::compute_types
now implicitly checks that all CUDA inputs are on the same GPU.
Previously, this was done ad-hoc in places like TensorIterator::binary_op.
Note that zero-dim Tensors (scalars) are NOT required to be on the
same device as other inputs, because they behave almost like Python numbers.
TensorIterator handles copying zero-dim Tensors to the common device.
Prior to this PR, TensorIterator would copy zero-dim Tensors between CPU
and GPU, but not between different GPUs (because Backend didn't encode
the GPU index). This removes that restriction.
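A libtorch sketch of the resulting behavior (hypothetical snippet; requires two CUDA devices):
```
#include <torch/torch.h>

int main() {
  auto a = torch::randn({3}, torch::Device(torch::kCUDA, 0));
  auto s = torch::randn({}, torch::Device(torch::kCUDA, 1));
  // Zero-dim tensors behave like Python numbers: s is copied to a's device.
  auto ok = a + s;
  // By contrast, two non-scalar tensors on different GPUs would throw.
}
```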
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20690
Differential Revision: D15414826
Pulled By: colesbury
fbshipit-source-id: 1d0ad1f7d663252af36dd4590bcda418c2f7a09f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316
This starts cleaning up the files in c10 according to the module structure we decided on.
Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h
Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp
i-am-not-moving-c2-to-c10
Reviewed By: dzhulgakov
Differential Revision: D13498493
fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63