Commit Graph

256 Commits

Author SHA1 Message Date
Mike Ruberry
1003ccfa15 Creates CUDAContext (#9435)
Summary:
ezyang noticed that the CUDAStream files lived under ATen/ despite being CUDA-specific, and suggested porting them to ATen/cuda and exposing them with a new CUDAContext. This PR does that. It also:

- Moves ATen's CUDA-specific exceptions for ATen/cudnn to ATen/cuda for consistency
- Moves getDeviceProperties() and getCurrentCUDASparseHandle() to CUDAContext from CUDAHooks

The separation between CUDAContext and CUDAHooks is straightforward: files that are built only when CUDA is enabled should rely on CUDAContext, while CUDAHooks provides runtime dispatch for files that can also be included in CPU-only builds. A comment in CUDAContext.h explains this pattern. Acquiring device properties and CUDA-specific handles, for example, only happens in builds with CUDA, so I moved those functions from CUDAHooks to CUDAContext.
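
As a rough illustration of the split (toy stand-ins, not ATen's actual API): CUDA-only translation units call a concrete context directly, while code that must also compile in CPU-only builds goes through a virtual hooks interface whose default implementation is the CPU fallback.

    // Toy sketch of the CUDAContext/CUDAHooks pattern; all names here
    // are illustrative, not ATen's real API.
    #include <iostream>
    #include <memory>

    // "Hooks": an interface that CPU-only code can hold and call; the
    // CUDA build registers a real implementation at startup.
    struct CUDAHooksInterface {
      virtual ~CUDAHooksInterface() = default;
      virtual bool hasCUDA() const { return false; }   // CPU-only fallback
      virtual int deviceCount() const { return 0; }    // CPU-only fallback
    };

    // "Context": functions that exist only in the CUDA build, so callers
    // in CUDA-only files link against them with no runtime dispatch.
    namespace cuda_context {
    inline int deviceCount() { return 1; /* would call cudaGetDeviceCount */ }
    }

    struct CUDAHooks : CUDAHooksInterface {
      bool hasCUDA() const override { return true; }
      int deviceCount() const override { return cuda_context::deviceCount(); }
    };

    int main() {
      std::unique_ptr<CUDAHooksInterface> hooks(new CUDAHooks());
      if (hooks->hasCUDA()) {
        std::cout << "devices: " << hooks->deviceCount() << "\n";
      }
    }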

This PR will conflict with #9277 and I will merge with master after #9277 goes in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9435

Reviewed By: soumith

Differential Revision: D8917236

Pulled By: ezyang

fbshipit-source-id: 219718864234fdd21a2baff1dd3932ff289b5751
2018-07-20 12:56:15 -07:00
Edward Yang
cffca2926b Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator (#9358)
Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to implement the following two
  functions (see the sketch after this list):

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null).
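
A minimal sketch of the two-function interface from the Thrust bullet above, wrapped around a stand-in core allocator (plain malloc here; a real adapter would delegate to the CUDA device allocator):

    // Toy Thrust-facing adapter: allocate/deallocate by char*.
    #include <cstddef>
    #include <cstdlib>
    #include <new>

    struct ThrustAllocatorAdapter {
      char* allocate(std::size_t n) {
        void* p = std::malloc(n);        // stand-in for the core allocator
        if (p == nullptr) throw std::bad_alloc();
        return static_cast<char*>(p);
      }
      void deallocate(char* p, std::size_t /*n*/) {
        std::free(p);
      }
    };

    int main() {
      ThrustAllocatorAdapter a;
      char* p = a.allocate(64);
      a.deallocate(p, 64);
    }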

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by and
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts: an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data (see the sketch after this list).

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that at all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisedPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.
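
To make the allocator/smart-pointer split concrete, here is a toy sketch of the shape a SupervisedPtr-like type might take (the real implementation differs; all names below are illustrative):

    // Toy SupervisedPtr shape: a data pointer plus an owning "supervisor"
    // holding whatever out-of-band state is needed to free the data
    // (e.g. the struct managing a numpy/dlpack buffer). For a plain
    // allocation the supervisor can be the data pointer itself, so the
    // common case needs no extra heap allocation.
    #include <cstdlib>
    #include <memory>

    using DeleterFn = void (*)(void*);

    struct SupervisedPtrSketch {
      std::unique_ptr<void, DeleterFn> supervisor_;
      void* data_;
      SupervisedPtrSketch(void* data, void* ctx, DeleterFn d)
          : supervisor_(ctx, d), data_(data) {}
      void* get() const { return data_; }
    };

    static void callFree(void* p) { std::free(p); }

    int main() {
      void* raw = std::malloc(16);
      // Data and supervisor coincide for a plain malloc'd buffer; raw is
      // freed when the supervisor is destroyed. Note the std::unique_ptr
      // caveat above: a null supervisor means the deleter never runs.
      SupervisedPtrSketch p(raw, raw, &callFree);
    }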

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way.

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer afterwards (always false!).  Both this and
  the placement-new bug are sketched below.
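
Both unique_ptr pitfalls reproduce in a few lines (an illustrative sketch, not the actual patch):

    #include <cstdio>
    #include <memory>
    #include <new>
    #include <utility>

    struct Holder { std::unique_ptr<int> p; };

    int main() {
      // Pitfall 1: operator= on uninitialized memory runs the destructor
      // of the garbage "old" value, so the raw storage must be
      // placement-new'd first.
      alignas(Holder) unsigned char buf[sizeof(Holder)];
      Holder* h = new (buf) Holder{};   // placement-new, not plain assignment
      h->p = std::unique_ptr<int>(new int(42));
      h->~Holder();

      // Pitfall 2: a moved-from unique_ptr is null, so a boolean test on
      // it afterwards is always false.
      std::unique_ptr<int> a(new int(1));
      std::unique_ptr<int> b = std::move(a);
      if (a) std::printf("unreachable: a was moved from\n");
      (void)b;
    }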

Pull Request resolved: https://github.com/pytorch/pytorch/pull/9358

Reviewed By: SsnL

Differential Revision: D8811822

Pulled By: ezyang

fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
2018-07-15 15:11:18 -07:00
Soumith Chintala
dc186cc9fe
Remove NO_* and WITH_* across codebase, except in setup.py (#8555)
* remove legacy options from CMakeLists

* codemod WITH_ to USE_ for WITH_CUDA, WITH_CUDNN, WITH_DISTRIBUTED, WITH_DISTRIBUTED_MW, WITH_GLOO_IBVERBS, WITH_NCCL, WITH_ROCM, WITH_NUMPY

* cover SYSTEM_NCCL, MKLDNN, NNPACK, C10D, NINJA

* removed NO_* variables and hotpatch them only in setup.py

* fix lint
2018-06-15 12:29:48 -04:00
Zachary DeVito
d985cf46f1
Add workaround to fix include warnings in Python 2 builds. (#6716) 2018-04-24 12:30:19 -07:00
Tongzhou Wang
4563e190c4 Use THC cached CUDA device property when get_device_name and get_device_capability (#6027)
Getting the CUDA device property struct with cudaGetDeviceProperties is expensive. THC caches CUDA device properties; the cache is exposed through THCState_getDeviceProperties, then through at::globalContext().getDeviceProperties(device), and finally through torch.cuda.get_device_properties. This PR changes the two methods that previously called cudaGetDeviceProperties to use torch.cuda.get_device_properties directly in Python.
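
A toy sketch of the caching pattern described above (not THC's actual code; per-device locking is omitted for brevity):

    #include <cuda_runtime.h>
    #include <mutex>
    #include <vector>

    // Query cudaGetDeviceProperties once per device and serve later
    // calls from the cache.
    cudaDeviceProp* getCachedDeviceProperties(int device) {
      static std::once_flag init;
      static std::vector<cudaDeviceProp> props;
      static std::vector<bool> filled;
      std::call_once(init, [] {
        int n = 0;
        cudaGetDeviceCount(&n);
        props.resize(n);
        filled.assign(n, false);
      });
      if (!filled[device]) {           // first touch pays the expensive call
        cudaGetDeviceProperties(&props[device], device);
        filled[device] = true;
      }
      return &props[device];
    }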

Also fixes an ATen compile error when it can't find CUDA.

Fixes #4908. Using the script from that issue, we get roughly 18x speed-up.

[ssnl@ ~] python dev.py  # master
0.2826697587966919
0.00034999847412109375
0.0003493785858154297
0.000356292724609375
0.00036025047302246094
0.0003629922866821289
0.00036084651947021484
0.00035686492919921874
0.00036056041717529296
0.0003606319427490234
[ssnl@ ~] python dev.py  # this PR
0.27275662422180175
2.1147727966308594e-05
1.9598007202148438e-05
1.94549560546875e-05
1.9359588623046876e-05
1.938343048095703e-05
2.0074844360351563e-05
1.952648162841797e-05
1.9311904907226562e-05
1.938343048095703e-05
2018-03-30 16:39:22 -04:00
Sam Gross
48a3349c29
Delete dead Tensor code paths (#5417)
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.

This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
2018-02-27 17:58:09 -05:00
Carl Lemaire
6b95ca4eda DataParallel: GPU imbalance warning (#5376) 2018-02-27 21:30:41 +01:00
Sam Gross
406c9f9c28
Remove two uses of the old Tensor class (#5413) 2018-02-26 15:00:51 -05:00
Sam Gross
30ec06c140
Merge Variable and Tensor classes (#5225)
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.

To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.

There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:

 https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
2018-02-23 18:03:31 -05:00
Adam Paszke
1061d7970d Move broadcast and broadcast_coalesced to C++ 2018-01-18 11:16:45 +01:00
Adam Paszke
de5f7b725e Base for pure C++ NCCL interface 2018-01-18 11:16:45 +01:00
Tongzhou Wang
5918243b0c Methods for checking CUDA memory usage (#4511)
* gpu mem allocated

* add test

* addressed some of @apaszke 's comments

* cache stats

* add more comments about test
2018-01-09 11:47:48 -05:00
peterjc123
77ea2f26d8 Add build support for Python 2.7 using MSVC (#4226) 2017-12-20 15:07:25 +01:00
Sam Gross
bcfe259f83
Add streams and comms as optional arguments (#3968)
Adds streams and comms as optional arguments to the NCCL calls in
torch.cuda.nccl. Also exposes ncclUniqueId and ncclCommInitRank for
multi-process mode.

Moves Py_RETURN_NONE statements to after the GIL is re-acquired.
2017-12-04 13:51:22 -05:00
Sam Gross
4bce69be22
Implement Variable.storage() (#3765)
This still uses THPStorage, but avoids touching THPTensor
2017-11-20 14:18:07 -05:00
Soumith Chintala
50009144c0
add warnings if device capability is less than ideal (#3601) 2017-11-09 11:48:59 -05:00
peterjc123
aa911939a3 Improve Windows Compatibility (for csrc/scripts) (#2941) 2017-11-08 19:51:35 +01:00
SsnL
bb1b826cdc Exposing emptyCache from allocator (#3518)
* Add empty_cache binding

* cuda.empty_cache document

* update docs
2017-11-07 17:00:38 -05:00
Adam Paszke
cc3058bdac Fix macOS build (with CUDA) (#3071) 2017-10-11 19:04:15 +02:00
Soumith Chintala
e9dccb3156 implement all_reduce, broadcast, all_gather, reduce_scatter 2017-10-09 22:24:18 -04:00
Soumith Chintala
4d62933529 add initial NCCL C bindings 2017-10-09 22:24:18 -04:00
Soumith Chintala
b3bc5fe302 refactor THCP method defs into cuda/Module.cpp 2017-09-30 13:14:35 -07:00
Justin Johnson
94b5990201 Add torch.cuda.get_device_name function (#2540) 2017-08-26 15:06:37 -04:00
Adam Paszke
8ab3d214d5 Fixes for DistributedDataParallel (#2168) 2017-07-21 16:00:46 -04:00
Edward Z. Yang
72e9e7abf7 Warning squash.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-21 14:13:11 -04:00
Zach DeVito
9d8cff9bc1 initialize aten and pytorch to share the same THCState 2017-07-11 10:35:03 -04:00
Adam Paszke
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00
Trevor Killeen
05bc877a05 make THPPointer have explicit constructors (#1636) 2017-05-25 15:35:54 -04:00
Sam Gross
4c1cdb6148 Refactor Python string utility function 2017-04-28 21:25:26 +02:00
Sam Gross
aab30d4ea2 Fix errors when no CUDA devices are available (#1334)
Fixes #1267

This fixes a number of issues when PyTorch was compiled with CUDA
support but run on a machine without any GPUs. Now, we treat all errors
from cudaGetDeviceCount() as if the machine has no devices.
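
The fallback amounts to something like this sketch (illustrative, not the actual patch):

    #include <cuda_runtime.h>

    // Treat any error from cudaGetDeviceCount() as "no devices", e.g. a
    // CUDA build running on a machine without a GPU or driver.
    int safeDeviceCount() {
      int count = 0;
      if (cudaGetDeviceCount(&count) != cudaSuccess) {
        cudaGetLastError();            // clear the sticky error state
        return 0;
      }
      return count;
    }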
2017-04-23 14:45:27 +02:00
Sergey Zagoruyko
8dc5d2a22e export current_blas_handle 2017-03-23 23:32:45 +01:00
Sam Gross
b9379cfab7 Use cuDNN and NCCL symbols from _C library (#1017)
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
2017-03-16 16:10:17 -04:00
soumith
7ad948ffa9 fix tests to not sys.exit(), also fix fatal error on THC initialization 2017-03-01 17:37:04 -05:00
Sam Gross
fc6fcf23f7 Lock the cudaFree mutex. (#880)
Prevents NCCL calls from overlapping with cudaFree() which can lead to
deadlocks.
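
A sketch of the idea (names illustrative): one process-wide mutex is taken around cudaFree() and around NCCL launches, so the two can never interleave.

    #include <cuda_runtime.h>
    #include <mutex>

    std::mutex& cudaFreeMutex() {
      static std::mutex m;
      return m;
    }

    void guardedFree(void* ptr) {
      std::lock_guard<std::mutex> lock(cudaFreeMutex());
      cudaFree(ptr);
    }

    // NCCL call sites would hold the same mutex for the duration of the
    // collective, e.g. around ncclAllReduce(...).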
2017-03-01 11:29:25 -05:00
Adam Paszke
19a65d2bea Expose stateless methods for torch.cuda.HalfTensor 2017-02-26 20:02:42 +01:00
Sam Gross
bd5303010d Refactor autograd package to separate Python dependencies. (#662)
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This let's us implement functions in C++. In the future, we
can also multithread engine and release the GIL for most of the
non-Python backwards.
2017-02-13 16:00:16 -08:00
Zeming Lin
59d66e6963 Sparse Library (#333) 2017-01-05 00:43:41 +01:00
Sam Gross
20fffc8bb7 Fix torch.is_tensor for half tensors (#322)
Fixes #311
2016-12-19 15:27:47 +01:00
Sam Gross
1af9a9637f Refactor copy and release GIL during copy (#286) 2016-12-11 21:54:58 +01:00
Sam Gross
0d7d29fa57 Enable caching allocator for CUDA pinned memory (#275)
Also add binding for CUDA "sleep" kernel
2016-12-02 01:33:56 -05:00
Adam Paszke
ebc70f7919 Look for libcudart in default CUDA installation paths (#195) 2016-11-02 19:36:10 -04:00
Sam Gross
ad5fdef6ac Make every user-visible Tensor have a Storage (#179) 2016-10-31 12:12:22 -04:00
Sam Gross
79ead42ade Add CUDA Stream and Event API (#133) 2016-10-18 12:15:57 -04:00
Sam Gross
8d39fb4094 Use new THC API for device allocator 2016-10-17 09:35:41 -07:00
Sam Gross
ee14cf9438 Add support for pinned memory: (#127)
torch.Storage/Tensor.pin_memory()
torch.Storage/Tensor.is_pinned()
2016-10-15 18:38:26 -04:00
Sam Gross
c20828478e Update Module.cpp for THC changes 2016-09-30 11:13:14 -07:00
Adam Paszke
3f7ab95890 Finish implementation of prng related functions 2016-09-29 11:33:25 -07:00
Sam Gross
4e9f0a8255 Use CUDA caching allocator 2016-09-26 13:12:39 -07:00
Adam Paszke
06ab3f962f Refactor _C extension to export some utilities 2016-09-21 08:36:54 -07:00
Adam Paszke
3ea1da3b2c Minor fix in CUDA module 2016-09-14 11:09:03 -04:00
soumith
1f2695e875 adding cuda driver check functions for runtime checking 2016-09-13 10:34:13 -07:00
Adam Paszke
8d933cbfc4 Fixes for OS X 2016-08-22 22:45:35 -04:00
Adam Paszke
12bed8dc0d Add CUDA device selection 2016-08-12 07:46:46 -07:00
Adam Paszke
92e983a489 Fixes for Linux and new cutorch 2016-08-02 09:20:18 -07:00
Adam Paszke
c574295012 Various fixes 2016-07-19 10:45:59 -04:00
Adam Paszke
3a44259b32 Add support for CUDA 2016-07-19 10:45:59 -04:00