Fixes #156052 and #156444.
This PR sets up the PrivateUse1 key in Python so it can be used as a Python backend for PyTorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a Tensor subclass with that device to hold arbitrary Python data as "device data" and use `torch.library` to register ops that take that Tensor.
Changes done in this PR:
1. Register a vanilla device guard: I extended NoOpDeviceGuard to allow a device index of 0 and to not raise errors when event-related functions are accessed. Without these changes, calling backward would produce errors. (The CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.)
2. Allow a Tensor subclass without `__torch_dispatch__` when the device is not CUDA or CPU. The comment on the check suggests it was there to avoid segfaults when calling into ops that expect a storage; here we have a different device, so those ops will not be called.
3. Add a Python function that invokes the other incantations needed to set up the PrivateUse1 backend.
This took inspiration from https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
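A minimal sketch of the resulting workflow. Only `setup_privateuseone_for_python_backend('npy')` is named in this description; the subclass and op implementation below are hypothetical illustrations:
```python
import torch

# Helper added by this PR (name taken from the description above):
# registers PrivateUse1 as a Python-backed device named "npy".
setup_privateuseone_for_python_backend("npy")

class NpyTensor(torch.Tensor):
    # Hypothetical subclass whose "device data" is an arbitrary Python
    # payload (e.g. a numpy array) instead of a real storage. Per change 2
    # above, no __torch_dispatch__ is required for a non-CPU/CUDA device.
    pass

# Route aten ops dispatched to the PrivateUse1 key into Python.
lib = torch.library.Library("aten", "IMPL")

def npy_add(x, y):
    # Hypothetical: operate on the Python payloads held by the subclass.
    raise NotImplementedError

lib.impl("add.Tensor", npy_add, "PrivateUse1")
```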
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
Reland of #160532
Summary:
To support exporting a CUDA model on a CPU-only machine under fake tensor mode, users commonly need to move sample inputs to the CUDA device with a `.to("cuda:0")` or `.to("cuda")` call. This diff supports that.
I expect the following pattern to work:
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163187
Approved by: https://github.com/angelayi
Reland of #160532
Summary:
To support exporting a CUDA model on a CPU-only machine under fake tensor mode,
users commonly need to move sample inputs to the CUDA device with a `.to("cuda:0")` or `.to("cuda")` call.
This diff supports that.
I expect the following pattern to work:
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
Summary:
To support exporting a CUDA model on a CPU-only machine under fake tensor mode,
users commonly need to move sample inputs to the CUDA device with a `.to("cuda:0")` or `.to("cuda")` call.
This diff supports that.
I expect the following pattern to work:
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Test Plan:
CI
Differential Revision: D80181887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160532
Approved by: https://github.com/henryoier, https://github.com/ezyang
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543
# Solution
We should raise a Python exception instead of aborting.
# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
what(): device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
# Motivation
According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design.
I personally prefer the **Simple Version** APIs that no longer accept the device type as an input argument. It means we will leverage `getAccelerator` to fetch the current accelerator, and it is flexible enough to expand these APIs to handle scenarios with multiple types of accelerators. The design does **NOT** break the previous design philosophies.
I also believe that the namespace torch.accelerator is better. It lets users know that the APIs they are calling run on an accelerator rather than the CPU. This is important. Meanwhile, we can follow a simple API design principle:
1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter.
2. Device-specific APIs should be placed under device-specific submodules.
3. APIs required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter.
Also, I list the pros and cons of **Simple Version** here:
Pros:
- `torch.accelerator.foo` will have the same input arguments as `torch.xxx.foo`, bringing a better user experience;
- it is more concise and facilitates writing device-agnostic code.
Cons:
- no obvious drawbacks.
# Additional Context
I list the new APIs here:
```python
torch.accelerator.is_available() -> bool:
torch.accelerator.current_accelerator() -> torch.device:
torch.accelerator.device_count() -> int:
torch.accelerator.current_device_idx() -> int:
torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None:
torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream:
torch.accelerator.set_stream(stream: torch.Stream) -> None:
torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None:
```
According to the discussion with Alban, we decided to change the API name `set_device` to `set_device_idx` and `current_device` to `current_device_idx` to be more explicit, and will submit another PR to support device and stream context managers.
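A short usage sketch of the proposed APIs (illustrative only, using the names as listed in this PR):
```python
import torch

if torch.accelerator.is_available():
    acc = torch.accelerator.current_accelerator()  # e.g. device(type='cuda')
    torch.accelerator.set_device_idx(0)
    x = torch.ones(4, device=acc)
    stream = torch.accelerator.current_stream()    # stream on the current device
    torch.accelerator.set_stream(stream)
    torch.accelerator.synchronize()                # device-agnostic sync
```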
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204
Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD
This diff intends to build device-generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
stream_id: _int # Stream id
device_index: _int
device_type: _int
device: _device # The device of the stream
@overload
def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
@overload
def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
def wait_event(self, event: Event) -> None: ...
def wait_stream(self, other: Stream) -> None: ...
def record_event(self, event: Optional[Event] = None) -> Event: ...
def query(self) -> _bool: ...
def synchronize(self) -> None: ...
def __hash__(self) -> _int: ...
def __repr__(self) -> str: ...
def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces there for future adaptation of torch.cuda.Stream.
- Currently only `enable_timing` is supported, since it is the flag most commonly used in other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- An `elapsedTime` API is added to c10::Event.
```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):
device: _device # The device of the Event
event_id: _int # The raw event created by device backend
def __new__(self,
device: Optional[DeviceLikeType] = None,
enable_timing: _bool = False,
blocking: _bool = False,
interprocess: _bool = False) -> Event: ...
@classmethod
def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
def record(self, stream: Optional[Stream] = None) -> None: ...
def wait(self, stream: Optional[Stream] = None) -> None: ...
def query(self) -> _bool: ...
def elapsed_time(self, other: Event) -> _float: ...
def synchronize(self) -> None: ...
def ipc_handle(self) -> bytes: ...
def __repr__(self) -> str: ...
```
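A device-generic usage sketch of these objects (illustrative only; assumes an accelerator backend, e.g. CUDA, that implements the underlying hooks):
```python
import torch

s = torch.Stream(device="cuda")        # generic stream on an accelerator
start = torch.Event(enable_timing=True)
end = torch.Event(enable_timing=True)

start.record(s)
# ... enqueue work on s ...
end.record(s)
end.synchronize()                      # block until the event has completed
print(start.elapsed_time(end))         # elapsed time in milliseconds
```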
-----------
c10::Event provides new APIs to:
- calculate **elapsedTime**;
- get the raw event id;
- synchronize the event.
```
double elapsedTime(const Event& event) const {
return impl_.elapsedTime(event.impl_);
}
void* eventId() const {
return impl_.eventId();
}
void synchronize() const {
return impl_.synchronize();
}
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.
Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD, https://github.com/jeffdaily
This diff intends to build device-generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
stream_id: _int # Stream id
device_index: _int
device_type: _int
device: _device # The device of the stream
@overload
def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
@overload
def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
def query(self) -> _bool: ...
def synchronize(self) -> None: ...
def wait_event(self, event: Event) -> None: ...
def wait_stream(self, other: Stream) -> None: ...
def record_event(self, event: Optional[Event] = None) -> Event: ...
def __hash__(self) -> _int: ...
def __repr__(self) -> str: ...
def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces there for future adaptation of torch.cuda.Stream.
- Currently only `enable_timing` is supported, since it is the flag most commonly used in other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- An `elapsedTime` API is added to c10::Event.
```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):
device: _device # The device of the Event
event_id: _int # The raw event created by device backend
def __new__(self,
device: Optional[DeviceLikeType] = None,
enable_timing: _bool = False,
blocking: _bool = False,
interprocess: _bool = False) -> Event: ...
@classmethod
def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
def record(self, stream: Optional[Stream] = None) -> None: ...
def wait(self, stream: Optional[Stream] = None) -> None: ...
def query(self) -> _bool: ...
def elapsed_time(self, other: Event) -> _float: ...
def synchronize(self) -> None: ...
def ipc_handle(self) -> bytes: ...
def __repr__(self) -> str: ...
```
-----------
c10::Event provides new APIs to:
- calculate **elapsedTime**;
- get the raw event id;
- synchronize the event.
```
double elapsedTime(const Event& event) const {
return impl_.elapsedTime(event.impl_);
}
void* eventId() const {
return impl_.eventId();
}
void synchronize() const {
return impl_.synchronize();
}
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.
Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
Fixes the string_view errors and relands the work. The previous changes in torch/csrc/utils/invalid_arguments.cpp were too aggressive and not tested thoroughly; they are discarded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110518
Approved by: https://github.com/ezyang
Applies some more fixes to headers that may have been missed before for performance optimization. cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @EikanWang @ezyang since this is more in the series of the clang-tidy fixups.
This PR fixes 3 main issues:
1. Use emplacement more in headers.
2. Avoid unnecessary copies and use const refs when possible.
3. Default any special functions when possible to make them potentially trivial and more readable.
4. There is also one change in this PR that tries to prevent unnecessary math promotion; the rest of those changes are in another PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91445
Approved by: https://github.com/ezyang
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59560
`at::cuda::CUDAStream` has the `query` and `synchronize` methods, but `c10::Stream` does not, and I couldn't find any generic way to accomplish this. Hence I added helpers to do this to the DeviceGuardImpl interface, and then defined these methods on `c10::Stream`. (I had to do it out-of-line to circumvent a circular dependency).
ghstack-source-id: 130932249
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D28931377
fbshipit-source-id: cd0c19cf021e305d0c0cf9af364afb445d010248
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57345
Already back in https://github.com/pytorch/pytorch/pull/57046 we realized that calling this method `getStreamFromPool` could cause issues because that name gets HIPified and thus in some callsites we'd end up calling a method that doesn't exist. In the end we got away with it because the places where we were calling that method weren't HIPified. However in the next PR we'll use this method inside RPC, and that will start causing problems, hence here I rename it to something that should not cause conflicts. This is a private API (since it's inside `impl`) thus there's no backwards compatibility concerns.
ghstack-source-id: 127916484
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28114923
fbshipit-source-id: e027ad08a8e02090c08c6407c2db5a7fde104812
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830
Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.
Test Plan: CI
Reviewed By: zertosh
Differential Revision: D27979080
fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57047
We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to record a DataPtr onto a stream with the caching allocator.
ghstack-source-id: 127713135
(Note: this ignores all push blocking failures!)
Test Plan: Used later in this stack
Reviewed By: ezyang
Differential Revision: D28029161
fbshipit-source-id: ff337ab8ccc98437b5594b2f263476baa1ae93e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57046
We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to get a stream from the global ATen pool.
ghstack-source-id: 127713137
(Note: this ignores all push blocking failures!)
Test Plan: Used later in this stack
Reviewed By: ezyang
Differential Revision: D28029159
fbshipit-source-id: 5055d84c1f3c2a4d86442f3149455c5ebd976dea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53142
It turns out that to make Meta a device, I need to substantively reuse
the CPUGuardImpl implementation. It's pretty parametrizable, so this
just moves it over to DeviceGuardImplInterface templated over
the DeviceType.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: anjali411, samestep
Differential Revision: D26763553
Pulled By: ezyang
fbshipit-source-id: 464fb3e3a72ba7c55a12adffe01c18171ce3e857
Summary:
All pretty minor. I avoided renaming `class DestructableMock` to `class DestructibleMock` and similar such symbol renames (in this PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49815
Reviewed By: VitalyFedyunin
Differential Revision: D25734507
Pulled By: mruberry
fbshipit-source-id: bbe8874a99d047e9d9814bf92ea8c036a5c6a3fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023
DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430
Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect
Reviewed By: dzhulgakov
Differential Revision: D24605460
fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45485
Essentially this is the problem reported by ezyang: https://fb.workplace.com/groups/llvm.gcc/permalink/4053565044692080. There are two proposed fixes:
* https://github.com/pytorch/pytorch/pull/44883: this doesn't work because it fails some static assert at runtime
```
caffe2/c10/core/TensorOptions.h:553:1: error: static_assert failed due to requirement 'sizeof(c10::TensorOptions) <= sizeof(long) * 2' "TensorOptions must fit in 128-bits"
static_assert( sizeof(TensorOptions) <= sizeof(int64_t) * 2,
^
```
* https://github.com/pytorch/pytorch/pull/44885: to be tested
This diff is a temp hack to work around the problem. Without this patch:
```
volatile size_t device_type = static_cast<size_t>(type);
auto p = device_guard_impl_registry[device_type].load();
C10_LOG_FIRST_N(WARNING, 10) << "XDW-fail: " << cntr << ", Device type: " << type << ", type cast: " << device_type << ", guard: " << p;
// output
XDW-fail: 1129, Device type: cuda, type cast: 65537, guard: 0
```
Another workaround is D23788441, which changes -O3 to -O2. So this seems to be a miscompilation for nvcc or the host compiler.
Reviewed By: ezyang
Differential Revision: D23972356
fbshipit-source-id: ab91fbbfccb6389052de216f95cf9a8265445aea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29409
Fixes #27875
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18396828
Pulled By: ezyang
fbshipit-source-id: 3f53cbbe620cd3445852273be90ff5744aa7a8cb
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.
Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.
For example, a model with forward declared like (original example courtesy of ngimel):
```
def forward(self,x):
x0 = x.clone()
torch._C._cuda_setStream(self.stream1._cdata)
y0 = self.fc1(x0)
self.event1.record(stream = torch.cuda.current_stream())
torch._C._cuda_setStream(self.stream2._cdata)
y1 = self.fc2(x)
self.event2.record(stream = torch.cuda.current_stream())
self.stream2.wait_event(self.event1)
return y0 + y1
```
currently runs backward on a single stream. With this change, the kernels go on the streams they were assigned in forward, and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.
The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use cuda events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often excessive cudaSetDevice() is not being called when inputs are added to a buffer.
In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same and the expanded tracing might allow further optimizations in the autograd engine. (apaszke, for example, now after initial enumeration we know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.)
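For illustration, here is roughly how the example above would look with the public stream API (a sketch under the semantics this PR introduces, not code from the PR itself):
```python
import torch

class TwoStreamNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(64, 64).cuda()
        self.fc2 = torch.nn.Linear(64, 64).cuda()
        self.stream1 = torch.cuda.Stream()
        self.stream2 = torch.cuda.Stream()

    def forward(self, x):
        cur = torch.cuda.current_stream()
        # The side streams must wait until x is ready on the current stream.
        self.stream1.wait_stream(cur)
        self.stream2.wait_stream(cur)
        with torch.cuda.stream(self.stream1):
            y0 = self.fc1(x.clone())
        with torch.cuda.stream(self.stream2):
            y1 = self.fc2(x)
        # The joining stream waits for both branches before the add;
        # with this change, backward respects the same stream assignments.
        cur.wait_stream(self.stream1)
        cur.wait_stream(self.stream2)
        return y0 + y1
```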
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354
Test Plan: Two tests were added specifically for this behavior.
Differential Revision: D17275980
Pulled By: mruberry
fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18445
ghimport-source-id: 30d018737bf6989bc68b7e3676f44e0ca6141fde
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18242 Test running a CUDA build on CPU machine.
* **#18445 Unify cudaGetDeviceCount implementations.**
I went about doing this by searching for calls to cudaGetDeviceCount,
and then methodically replacing them with references to c10::cuda::device_count()
or at::cuda::device_count().
There is a point to doing this: the various implementations wildly differed
in their handling of what to do when cudaGetDeviceCount returns an error.
The final standardized behavior is that **all errors are swallowed** and
we return device count of zero. This indirectly fixes running CUDA builds
on CPU, which was broken in #17847.
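In user-facing terms, the standardized behavior means a sketch like this is safe on a CPU-only machine:
```python
import torch

# All cudaGetDeviceCount errors are swallowed, so on a CPU-only machine
# (or with a broken driver) this reports zero devices instead of raising.
n = torch.cuda.device_count()
device = torch.device("cuda", 0) if n > 0 else torch.device("cpu")
```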
I added 'noexcept' to the 'deviceCount' virtual method on DeviceGuardImpl.
This is a BC-breaking change for anyone inheriting from DeviceGuardImpl
but all you need to do is put 'noexcept' on your method and it is backwards
compatible with older libtorch.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14612189
fbshipit-source-id: 3c8d186e3dd623c0e27625212c7ce30f75d943cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316
This starts cleaning up the files in c10 according to the module structure we decided on.
Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h
Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp
i-am-not-moving-c2-to-c10
Reviewed By: dzhulgakov
Differential Revision: D13498493
fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63