Commit Graph

47 Commits

Author SHA1 Message Date
Yuanhao Ji
86fbbe44cc Improve error message for CUDAGuardImpl, MPSGuardImpl, XPUGuardImpl (#149838)
Fixes #149822

With this change, you will get:

```
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/home/jyh/workspace/pytorch/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch. CUDAGuardImpl initialized with non-CUDA DeviceType: cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149838
Approved by: https://github.com/Skylion007, https://github.com/guangyey
2025-03-25 07:29:53 +00:00
Yu, Guangye
07fa6e2c8b Fix torch.accelerator api abort when passing invalid device (#143550)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543

# Solution
We should raise a Python exception instead of aborting...

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
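Device-agnostic callers can now guard against out-of-range indices with an ordinary try/except. A minimal sketch, assuming at least one accelerator backend is available:
```python
import torch

if torch.accelerator.is_available():
    n = torch.accelerator.device_count()
    try:
        # device_count() is one past the largest valid index: out of range
        torch.accelerator.current_stream(n)
    except RuntimeError as e:
        print(e)  # a catchable Python exception instead of a core dump
```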

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
2024-12-23 03:44:22 +00:00
Yu, Guangye
40c098f731 Introduce a device-agnostic runtime API design (#132204)
# Motivation
According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design.
I personally prefer the **Simple Version** APIs, which no longer accept the device type as an input argument: we leverage `getAccelerator` to fetch the current accelerator, and the APIs remain flexible enough to extend to scenarios with multiple accelerator types. The design does **NOT** break the previous design philosophies.
I also believe the namespace `torch.accelerator` is better: it lets users know that the APIs they call run on an accelerator rather than the CPU. This is important. Meanwhile, we can follow a simple set of API design principles:
1. Device-agnostic APIs should be placed under the `torch.accelerator` namespace and should not accept a `device_type` optional parameter.
2. Device-specific APIs should be placed under device-specific submodules.
3. APIs required by both CPU and accelerators should be placed under the `torch` namespace and accept a `device_type` optional parameter.

Also, I list the pros and cons of the **Simple Version** here:
Pros:
- `torch.accelerator.foo` takes the same input arguments as `torch.xxx.foo`, giving a better user experience;
- more concise, making it easier for developers to write device-agnostic code.

Cons:
- no obvious drawbacks.

# Additional Context
I list the new APIs here:
```python
torch.accelerator.is_available() -> bool:
torch.accelerator.current_accelerator() -> torch.device:
torch.accelerator.device_count() -> int:
torch.accelerator.current_device_idx() -> int:
torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None:
torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream:
torch.accelerator.set_stream(stream: torch.Stream) -> None:
torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None:
```
Following a discussion with Alban, we decided to rename `set_device` to `set_device_idx` and `current_device` to `current_device_idx` to be more explicit. A follow-up PR will add device and stream context managers.
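A minimal sketch of device-agnostic code written against these APIs (using the names listed above, which may be renamed in later revisions):
```python
import torch

if torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()   # e.g. cuda, xpu, mps
    print(f"{torch.accelerator.device_count()} {device.type} device(s)")
    torch.accelerator.set_device_idx(0)
    stream = torch.accelerator.current_stream()        # stream of device 0
    # ... queue work on the current accelerator ...
    torch.accelerator.synchronize()                    # wait for queued work
```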

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204
Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD
2024-10-27 10:37:09 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
PyTorch MergeBot
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
Richard Barnes
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
Yu, Guangye
31372fa842 Support generic stream/event on CUDA/HIP backend (#125757)
# Motivation
Following [#123611](https://github.com/pytorch/pytorch/pull/123611), we support the generic stream/event API on the CUDA backend.

# Additional Context
New methods/attributes on `torch.Event` for CUDA:
- torch.Event.event_id
- torch.Event.elapsed_time
- torch.Event.synchronize

New methods on `c10::Event` for the CUDA backend:
- c10::Event::eventId
- c10::Event::elapsedTime
- c10::Event::synchronize
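A minimal timing sketch with the new `torch.Event` methods, assuming a CUDA build with at least one device:
```python
import torch

start = torch.Event(device="cuda", enable_timing=True)
end = torch.Event(device="cuda", enable_timing=True)

a = torch.randn(1024, 1024, device="cuda")
start.record()                   # records on the current stream
b = a @ a
end.record()
end.synchronize()                # block until the event completes
print(start.elapsed_time(end))   # elapsed milliseconds between the events
```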

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125757
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/EikanWang
2024-05-10 13:34:09 +00:00
egienvalue
408aa0182c Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device-generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces in place for future adaptation of torch.cuda.Stream.
- Currently only `enable_timing` is supported, since it is the most commonly used flag in other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- An `elapsedTime` API is added to `c10::Event`.

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

`c10::Event` provides new APIs to:
- calculate **elapsedTime**;
- get the raw event id;
- synchronize the event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.
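Until such tests exist, a small sketch of the intended Python-side Stream/Event interplay (assuming a CUDA device is present):
```python
import torch

s1 = torch.Stream(device="cuda")
s2 = torch.Stream(device="cuda")

ev = s1.record_event()   # record an Event on s1 after its queued work
s2.wait_event(ev)        # work queued on s2 afterwards waits for that event
s2.synchronize()         # host-side wait until s2 has drained
print(s1.query())        # True once all work submitted to s1 is done
```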

Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-04-24 20:51:17 +00:00
PyTorch MergeBot
0feab7d6c3 Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)"
This reverts commit cb17721899.

Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
egienvalue
cb17721899 Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device-generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces in place for future adaptation of torch.cuda.Stream.
- Currently only `enable_timing` is supported, since it is the most commonly used flag in other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- An `elapsedTime` API is added to `c10::Event`.

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

`c10::Event` provides new APIs to:
- calculate **elapsedTime**;
- get the raw event id;
- synchronize the event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
2024-04-18 17:35:09 +00:00
Yu, Guangye
eb7adc3ae0 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor GPU trace to be device-agnostic. GPU trace is mainly used in runtime components, including Device, Stream, Event, Guard, and Allocator; it should be device-agnostic so it can be shared across device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py` so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-30 13:04:38 +00:00
PyTorch MergeBot
968c4c4154 Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 74deacbf31.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:17 +00:00
Yu, Guangye
74deacbf31 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor GPU trace to be device-agnostic. GPU trace is mainly used in runtime components, including Device, Stream, Event, Guard, and Allocator; it should be device-agnostic so it can be shared across device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py` so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-21 01:52:58 +00:00
PyTorch MergeBot
f9ed1c432d Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 0ff1109e26.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/jeanschmidt due to Reverting to see if rocm trunk errors are related ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2007519408))
2024-03-19 15:40:26 +00:00
Yu, Guangye
0ff1109e26 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor GPU trace to be device-agnostic. GPU trace is mainly used in runtime components, including Device, Stream, Event, Guard, and Allocator; it should be device-agnostic so it can be shared across device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py` so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-19 06:02:28 +00:00
chentianyi16
0e68eb1505 Add privateuseone flags for c10::EventFlag (#121118)
Fixes #117341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121118
Approved by: https://github.com/albanD
2024-03-14 20:07:12 +00:00
cyy
560c92c324 [DeviceIndex] Use DeviceIndex instead of int in CUDA wrappers (#119142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119142
Approved by: https://github.com/ezyang
2024-02-08 23:00:56 +00:00
cyy
4a019047ad Enable nested namespace check in clang-tidy (#118506)
It is time to enable nested namespaces in the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118506
Approved by: https://github.com/albanD
2024-01-31 00:32:35 +00:00
cyy
b72ddbab60 [Clang-tidy header][15/N] Enable clang-tidy on headers in c10/cuda and c10/mobile (#116602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116602
Approved by: https://github.com/ezyang
2024-01-18 08:15:50 +00:00
Aidyn-A
69eef5a4be [CUDA12] set_device change (#94864)
This PR adds a workaround for the CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb), which now always creates a primary context on the target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1` (because a tensor is created on it) and on device `cuda:0` (because the destructor of the CUDA device guard calls `cudaSetDevice(0)`).
After this PR, the CUDA device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-10 17:31:12 +00:00
PyTorch MergeBot
45a2f6b70f Revert "Reduce includes of CUDACachingAllocator.h (#97072)"
This reverts commit 1bcb880894.

Reverted https://github.com/pytorch/pytorch/pull/97072 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2023-04-07 06:15:11 +00:00
Zachary DeVito
1bcb880894 Reduce includes of CUDACachingAllocator.h (#97072)
On my machine the number of files that include this header goes from > 200 to ~80, making rebuilds faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97072
Approved by: https://github.com/wanchaol
2023-04-06 17:22:35 +00:00
PyTorch MergeBot
279ca5f9db Revert "[CUDA12] set_device change (#94864)"
This reverts commit c18be2b2ec.

Reverted https://github.com/pytorch/pytorch/pull/94864 on behalf of https://github.com/ezyang due to avoid affecting cuda 11
2023-04-05 14:53:00 +00:00
Aidyn-A
c18be2b2ec [CUDA12] set_device change (#94864)
This PR adds a workaround for the CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb), which now always creates a primary context on the target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1` (because a tensor is created on it) and on device `cuda:0` (because the destructor of the CUDA device guard calls `cudaSetDevice(0)`).
After this PR, the CUDA device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-05 14:34:00 +00:00
cyy
f172feae0d More tidy fixes (#93069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93069
Approved by: https://github.com/Skylion007
2023-01-27 06:40:50 +00:00
Edward Z. Yang
f6ce2a442e Refactor PyInterpreter to use normal vtables (#84388)
I realized that we can deal with the dead vtable problem by...
introducing another indirection!  The resulting code is worse
(you have to do one more dereference to get to the vtable), but
the reduction in boilerplate is, IMO, worth it.

I did this refactor because I'm about to add a lot more methods
to PyInterpreter to handle expunging SymInt from TensorImpl.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84388
Approved by: https://github.com/albanD
2022-09-02 00:06:43 +00:00
Mateusz Sypniewski
916def84d4 CUDA trace Python hooks (#82824)
### Description
This adds Python hooks into PyTorch that allow the user to register their own callbacks for events such as tensor allocation, stream allocation, event record / wait etc.
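A sketch of how such a hook might be registered; the import path and helper name below are assumptions inferred from this description, not a verified API:
```python
import torch
# NOTE: hypothetical module path and function name, for illustration only
from torch.utils import _cuda_trace

def on_event_record(event_id: int) -> None:
    print(f"CUDA event {event_id:#x} was recorded")

_cuda_trace.register_callback_for_cuda_event_record(on_event_record)
```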
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82824
Approved by: https://github.com/lw, https://github.com/ezyang, https://github.com/malfet
2022-08-11 10:21:40 +00:00
Richard Barnes
2793cf85ec Check all CUDA API calls for errors in caffe2/c10/ (#74918)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74918

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D35194795

fbshipit-source-id: 8490e5497c37bab0055925ed520c2fd0c37a554c
(cherry picked from commit 52697ab670e2f53c580cfd4ca82c5468ed3bb06c)
2022-03-30 17:13:02 +00:00
Jeff Daily
b7391f44df cast return of cudaGetLastError() to void when discarding (#62518)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62511.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62518

Reviewed By: walterddr, janeyx99

Differential Revision: D30029858

Pulled By: malfet

fbshipit-source-id: d47ce4e507ac800b4e5a5e0a8d9a6fabdfd28e6d
2021-08-03 11:17:22 -07:00
Jeff Daily
15210f3b82 ignore and clear not ready errors (#61554)
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/18584. This PR covers the remaining places where an event or stream query might result in not-ready errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61554

Reviewed By: mrshenli

Differential Revision: D29763973

Pulled By: ezyang

fbshipit-source-id: 41d988d1826b2309cc6b01a81144094b353abdf9
2021-07-19 16:03:04 -07:00
Luca Wehrstedt
e7cccc23b9 Add query and synchronize to c10::Stream (#59560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59560

`at::cuda::CUDAStream` has the `query` and `synchronize` methods, but `c10::Stream` does not, and I couldn't find any generic way to accomplish this. Hence I added helpers to do this to the DeviceGuardImpl interface, and then defined these methods on `c10::Stream`. (I had to do it out-of-line to circumvent a circular dependency).
ghstack-source-id: 130932249

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D28931377

fbshipit-source-id: cd0c19cf021e305d0c0cf9af364afb445d010248
2021-06-10 01:42:40 -07:00
Luca Wehrstedt
0c3e79b5b9 Rename DeviceGuardImplInteface's getStreamFromPool method (#57345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57345

Already back in https://github.com/pytorch/pytorch/pull/57046 we realized that calling this method `getStreamFromPool` could cause issues because that name gets HIPified and thus in some callsites we'd end up calling a method that doesn't exist. In the end we got away with it because the places where we were calling that method weren't HIPified. However in the next PR we'll use this method inside RPC, and that will start causing problems, hence here I rename it to something that should not cause conflicts. This is a private API (since it's inside `impl`) thus there's no backwards compatibility concerns.
ghstack-source-id: 127916484

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28114923

fbshipit-source-id: e027ad08a8e02090c08c6407c2db5a7fde104812
2021-05-01 16:12:53 -07:00
Scott Wolchok
44cc873fba [PyTorch] Autoformat c10 (#56830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830

Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D27979080

fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
2021-04-30 21:23:28 -07:00
Luca Wehrstedt
ea64c90ecc Add recordDataPtrOnStream to DeviceGuardImplInterface (#57047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57047

We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to record a DataPtr onto a stream with the caching allocator.
ghstack-source-id: 127713135

(Note: this ignores all push blocking failures!)

Test Plan: Used later in this stack

Reviewed By: ezyang

Differential Revision: D28029161

fbshipit-source-id: ff337ab8ccc98437b5594b2f263476baa1ae93e7
2021-04-29 09:31:43 -07:00
Luca Wehrstedt
6fdf092cad Add getStreamFromPool to DeviceGuardImplInterface (#57046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57046

We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to get a stream from the global ATen pool.
ghstack-source-id: 127713137

(Note: this ignores all push blocking failures!)

Test Plan: Used later in this stack

Reviewed By: ezyang

Differential Revision: D28029159

fbshipit-source-id: 5055d84c1f3c2a4d86442f3149455c5ebd976dea
2021-04-29 09:30:41 -07:00
cyy
d8730194e7 use device methods (#52899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52899

Reviewed By: zou3519

Differential Revision: D26752203

Pulled By: albanD

fbshipit-source-id: eaef89377999b20655fe85d5a38ca7a2c5882de7
2021-03-02 20:14:23 -08:00
Xiang Gao
b1f08e7426 Call uncheckedSetDevice in ~InlineDeviceGuard only when device indices are different (#35438)
Summary:
Setting the device can be expensive, especially when a debugger is present. We should check that the devices are different before we set.

cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35438

Differential Revision: D20664084

Pulled By: ngimel

fbshipit-source-id: 2440b4c9d96c41b4a19d5b1e8e1756fa40f090f0
2020-03-30 13:13:17 -07:00
Mike Ruberry
87a2c92615 Updates autograd engine to respect streams set in forward (#8354)
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.

Currently models that use streams explicitly in forward have to do a lot of extra work to make backward respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.

For example, a model with forward declared like (original example courtesy of ngimel):

```
def forward(self, x):
    x0 = x.clone()
    torch._C._cuda_setStream(self.stream1._cdata)
    y0 = self.fc1(x0)
    self.event1.record(stream=torch.cuda.current_stream())

    torch._C._cuda_setStream(self.stream2._cdata)
    y1 = self.fc2(x)
    self.event2.record(stream=torch.cuda.current_stream())
    self.stream2.wait_event(self.event1)
    return y0 + y1
```

currently runs backward on a single stream. With this change the kernels go on the streams they were assigned in forward, and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.

The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use CUDA events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often unnecessary cudaSetDevice() call is avoided when inputs are added to a buffer.

In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same, and the expanded tracing might allow further optimizations in the autograd engine. (apaszke: for example, after initial enumeration we now know the number of devices a graph task will use, which might help provide a sense of the "level of parallelism" we should expect.)
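Using today's public stream API, the same pattern looks roughly like this (a sketch, assuming a CUDA device; backward replays the stream assignments recorded in forward):
```python
import torch

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
x = torch.randn(64, 64, device="cuda", requires_grad=True)

with torch.cuda.stream(s1):
    y0 = x.mm(x)    # runs (and backwards) on s1
with torch.cuda.stream(s2):
    y1 = x + x      # runs (and backwards) on s2

# join both streams back into the default stream before combining results
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
(y0 + y1).sum().backward()
```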
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354

Test Plan: Two tests were added specifically for this behavior.

Differential Revision: D17275980

Pulled By: mruberry

fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
2019-09-10 23:46:51 -07:00
Mike Ruberry
a024e1e091 Creates Torch-friendly Event class and adds Stream tracking to autograd (#25130)
Summary:
Resubmission of https://github.com/pytorch/pytorch/issues/23424 because the previous PR was borked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25130

Test Plan: Two tests were added to cuda_stream_test for this functionality.

Differential Revision: D17145538

Pulled By: mruberry

fbshipit-source-id: 2546c5907c038412e03aa0d3328a972b0164c455
2019-09-01 12:37:52 -07:00
Edward Yang
529bb859b2 Revert D17052534: [pytorch][PR] Creates Torch-friendly Event class and adds Stream tracking to autograd
Test Plan: revert-hammer

Differential Revision:
D17052534

Original commit changeset: d91b308ad0f7

fbshipit-source-id: dacc7e70a835a8fa6ae71246999b4eff3383f3f3
2019-08-28 08:24:43 -07:00
Mike Ruberry
433fe47d95 Creates Torch-friendly Event class and adds Stream tracking to autograd (#25130)
Summary:
Resubmission of https://github.com/pytorch/pytorch/issues/23424 because the previous PR was borked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25130

Differential Revision: D17052534

Pulled By: mruberry

fbshipit-source-id: d91b308ad0f730646bb7b3492a601cd9b05c72d8
2019-08-26 15:19:06 -07:00
Edward Yang
515238e0a5 Unify cudaGetDeviceCount implementations. (#18445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18445
ghimport-source-id: 30d018737bf6989bc68b7e3676f44e0ca6141fde

Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18242 Test running a CUDA build on CPU machine.
* **#18445 Unify cudaGetDeviceCount implementations.**

I went about doing this by searching for calls to cudaGetDeviceCount,
and then methodically replacing them with references to c10::cuda::device_count()
or at::cuda::device_count().

There is a point to doing this: the various implementations wildly differed
in their handling of what to do when cudaGetDeviceCount returns an error.
The final standardized behavior is that **all errors are swallowed** and
we return device count of zero.  This indirectly fixes running CUDA builds
on CPU, which was broken in #17847.
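The observable effect from Python on a CPU-only machine running a CUDA build (a sketch of the standardized behavior):
```python
import torch

# any error from cudaGetDeviceCount is swallowed and reported as zero devices
print(torch.cuda.device_count())  # 0 when no usable GPU is present
print(torch.cuda.is_available())  # False, instead of raising
```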

I added 'noexcept' to the 'deviceCount' virtual method on DeviceGuardImpl.
This is a BC-breaking change for anyone inheriting from DeviceGuardImpl
but all you need to do is put 'noexcept' on your method and it is backwards
compatible with older libtorch.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14612189

fbshipit-source-id: 3c8d186e3dd623c0e27625212c7ce30f75d943cb
2019-03-26 09:50:14 -07:00
Xiaodong Wang
af0c79eed4 Catch cudaError_t return val (nodiscard in rocm) (#16399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16399

Catching cudaError_t return values in a few places, because it's nodiscard in ROCm; unless we add -Wno-unused-result, it ends up as a compilation error.

Also, in the c10/cuda test, check whether a host has a GPU or not. We were silently throwing out the error before (so not really testing the CUDA API).

Reviewed By: bddppq

Differential Revision: D13828281

fbshipit-source-id: 587d1cc31c20b836ce9594e3c18f067d322b2934
2019-02-11 13:18:36 -08:00
Sebastian Messmer
d408324350 Move files to/from c10/core and c10/util (#15316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316

This starts cleaning up the files in c10 according to the module structure we decided on.

Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h

Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp

i-am-not-moving-c2-to-c10

Reviewed By: dzhulgakov

Differential Revision: D13498493

fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
2019-01-10 16:22:22 -08:00
Igor Fedan
62151aa259 Added deviceCount() virtual method to DeviceGuardImplInterface (#15574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15574

Added a deviceCount() virtual method to DeviceGuardImplInterface, along with corresponding implementations for CPUGuardImpl, CUDAGuardImpl, FakeGuardImpl, VirtualGuardImpl, and HIPGuardImplMasqueradingAsCUDA.

Reviewed By: soumith

Differential Revision: D13554609

fbshipit-source-id: 913bf2aad44a0a356efe54505ee4abaf6c4622db
2018-12-27 15:36:32 -08:00
Edward Yang
2d485ffb17 Move CUDAGuard, CUDAStream and CUDAGuardImpl to c10/cuda (#14248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248

This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.

Reviewed By: bddppq

Differential Revision: D13145293

fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
2018-12-12 11:24:26 -08:00