pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Jithun Nair	dc1f9eee53	Avoid printing erroneous warning about "MIOpen not found" for ROCm builds (#33837 ) Summary: Older versions of MIOpen (<=2.2) don't have the `miopenGetVersion` api, but MIOpen is always a part of the ROCm builds, so do NOT set `lib` to None for ROCm builds. `__cudnn_version` will be `None` for older versions of MIOpen. Setting `lib` to `None` ends up printing the following erroneous warning when running unit tests: ``` /root/.local/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py:120: UserWarning: cuDNN/MIOpen library not found. Check your LD_LIBRARY_PATH }.get(sys.platform, 'LD_LIBRARY_PATH'))) ``` Eg.: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/18387/consoleFull Pull Request resolved: https://github.com/pytorch/pytorch/pull/33837 Differential Revision: D20369285 Pulled By: xw285cornell fbshipit-source-id: e82e6f8f5bccb486213cf868f40aece41ce11f98	2020-04-17 20:31:01 -07:00
Nikita Shulga	2458f6c63e	Move all nccl from torch_python to torch_cuda (#36193 ) Summary: Because `torch_python` is supposed to be thin wrapper around `torch` In this PR, all invocation of functions from nccl library are moved from python_nccl.cpp (which is part of torch_python) to nccl.cpp (which is part of torch_cuda) Pull Request resolved: https://github.com/pytorch/pytorch/pull/36193 Test Plan: CI Differential Revision: D20930047 Pulled By: malfet fbshipit-source-id: 7f278610077df6ac5dc3471c1a1b5d51e653ef9c	2020-04-08 18:01:47 -07:00
Pavel Belevich	3328a2f903	Rename CPUGenerator to CPUGeneratorImpl and CUDAGenerator to CUDAGeneratorImpl (#36026 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36026 Differential Revision: D20856458 Pulled By: pbelevich fbshipit-source-id: 6d105593dca67640d508a4aebf7edf028d52af32	2020-04-07 08:05:23 -07:00
Peter Bell	5fc5cf6571	Stop using ctypes to interface with CUDA libraries. (#33678 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/33016, Continuation of https://github.com/pytorch/pytorch/issues/31160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33678 Differential Revision: D20249187 Pulled By: ezyang fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed	2020-03-11 07:22:46 -07:00
Emilio Castillo	31cc311143	Expose `CUDACachingAllocator` `raw_alloc` and `raw_delete` to python (#33860 ) Summary: This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls). Instead of having two separate and conflicting memory pools. With this PR, CuPy can directly alloc memory from the PyTorch allocator by means of this proposal https://github.com/cupy/cupy/pull/3126 We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860 Differential Revision: D20212788 Pulled By: ngimel fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c	2020-03-03 17:50:11 -08:00
Edward Yang	1111a6b810	Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#30274 ) Summary: Reland of https://github.com/pytorch/pytorch/pull/29095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/30274 Differential Revision: D18762293 Pulled By: ezyang fbshipit-source-id: d3d50c2dd12bcb678ab25fa708eb6587cc4b66f9	2019-12-02 12:19:58 -08:00
Mike Ruberry	eff4c4d7c1	Revert D18301806: Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL Test Plan: revert-hammer Differential Revision: D18301806 Original commit changeset: 03da6a26c41e fbshipit-source-id: c1324ee8d154e7e16f5dd4f1cf3625aaa566cd39	2019-11-21 14:50:07 -08:00
Alan Du	f4b9690f2d	Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#29095 ) Summary: Given that pybind11 implements these gil functions, I don't think it makes sense for Pytorch to have its own bespoke versions. Fixes https://github.com/pytorch/pytorch/issues/29065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/29095 Differential Revision: D18301806 Pulled By: ezyang fbshipit-source-id: 03da6a26c41ee65aaadf7b67b9f0b14d2def2a5a	2019-11-21 13:44:40 -08:00
Peter Bell	bb119d957e	Move torch.cuda's atfork handler into C++ (#29101 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/23401 We cannot rely on `multiprocessing.util.register_after_fork` since it is only called for processes created by the `multiprocessing` module and not `os.fork()`. Moving to `pthread_atfork` does always get called. However, I don't think it's safe to call python functions inside of the `atfork` handler so the python code has to be a bit more careful when checking `_initialized`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/29101 Differential Revision: D18355451 Pulled By: ezyang fbshipit-source-id: 4d4253a3669796212c099dad4e5bdfdb0df40469	2019-11-11 07:34:27 -08:00
Xiang Gao	02921e7985	Use cuDNN's handle pool mechanism to manage cublas handles (#29233 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/6962 The PR implements the handle pool mechanism for cublas as suggested by mcarilli in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872. ~~I didn't add any unit test here yet because as mcarilli mentioned:~~ > ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~ ~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~ cc: colesbury Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233 Differential Revision: D18372007 Pulled By: ezyang fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d	2019-11-07 12:50:18 -08:00
Gao, Xiang	2d2fe14a60	Install CUDA for clang-tidy (#27967 ) Summary: fixes: https://github.com/pytorch/pytorch/issues/28009 clang-tidy is reporting `'cuda_runtime_api.h' file not found` when a PR modifying some file including this header. Installation script take from official site: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork Pull Request resolved: https://github.com/pytorch/pytorch/pull/27967 Differential Revision: D17952383 Pulled By: ezyang fbshipit-source-id: 85807d93bd46eb902a84b2126784349ce3a01cfa	2019-10-16 10:02:19 -07:00
Jerry Ma	1610ea8ef8	Comprehensive-ish instrumentation for CUDA memory allocator (#27361 ) Summary: Adds comprehensive memory instrumentation to the CUDA caching memory allocator. # Counters Added comprehensive instrumentation for the following stats: - Allocation requests (`allocation`) - Allocated memory (`allocated_bytes`) - Reserved segments from cudaMalloc (`segment`) - Reserved memory (`reserved_bytes`) - Active memory blocks (`active`) - Active memory (`active_bytes`) - Inactive, non-releasable blocks (`inactive_split`) - Inactive, non-releasable memory (`inactive_split_bytes`) - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`) - Number of OOMs (`num_ooms`) Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator. # Snapshots Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state. # Implementation: major changes - Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary. - Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments. - Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq # Implementation: minor changes - Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`. - Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module. - Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`. - `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent. - `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`. - Style (add access modifiers in the allocator class, random nit fixes, etc.) # Testing - Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`. - Ran on various basic workflows (toy example, CIFAR) # Performance Running the following speed benchmark: https://pastebin.com/UNndQg50 - Before this PR: 45.98 microseconds per tensor creation - After this PR: 46.65 microseconds per tensor creation Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361 Differential Revision: D17758747 Pulled By: jma127 fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6	2019-10-08 15:42:48 -07:00
Ralf Gommers	1b4951d3a5	Fix remaining invalid function cast warnings that show up with GCC 8/9 (#26104 ) Summary: Follow-up to gh-25483, more of the same fixes for warnings like: ``` ../torch/csrc/autograd/python_variable.cpp:503:31: warning: cast between incompatible function types from ‘PyObject* ()(THPVariable)’ {aka ‘_object* ()(THPVariable)’} to ‘getter’ {aka ‘_object* ()(_object, void*)’} [-Wcast-function-type] 503 \| {"_backward_hooks", (getter)THPVariable_get_backwards_hooks, (setter)THPVariable_set_backwards_hooks, nullptr, nullptr}, \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` This takes the build log output for a full rebuild with GCC 9.1 from ~10,000 to ~7,000 lines. `clang-tidy` is going to complain, no way around that - see discussion at the end of gh-25483. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26104 Differential Revision: D17396831 Pulled By: ezyang fbshipit-source-id: d71696bfe4dbe25519e4bcb7753151c118bd39f7	2019-09-17 07:43:37 -07:00
Guanheng Zhang	b22adeb007	Fix error message for a wrong fork CUDA (#23322 ) Summary: Re-land https://github.com/pytorch/pytorch/pull/23030 Pull Request resolved: https://github.com/pytorch/pytorch/pull/23322 Differential Revision: D16469442 Pulled By: zhangguanheng66 fbshipit-source-id: 70b63ab6265efa3f289111ef0ce46bb3c0d353bc	2019-07-25 12:58:14 -07:00
Edward Yang	1f608d09cf	Revert D16440000: [pytorch][PR] Re-land "Fix error message for a wrong fork CUDA" Differential Revision: D16440000 Original commit changeset: e05683275522 fbshipit-source-id: b688f24c1e6d3d8f63c2d415262a3f0ab1b85914	2019-07-24 12:05:36 -07:00
Guanheng Zhang	aa660b8eb7	Re-land "Fix error message for a wrong fork CUDA" (#23209 ) Summary: Re-land https://github.com/pytorch/pytorch/pull/23030 Pull Request resolved: https://github.com/pytorch/pytorch/pull/23209 Differential Revision: D16440000 Pulled By: zhangguanheng66 fbshipit-source-id: e05683275522835a33d5a7e6d76b7e94774e4d98	2019-07-24 07:01:04 -07:00
Jesse Hellemn	06d11f0434	Revert D16368004: [pytorch][PR] Fix error message for a wrong fork CUDA Differential Revision: D16368004 Original commit changeset: 44b6977790ce fbshipit-source-id: c81a232bd52219e56a19c64650c4b6dedeb167cb	2019-07-22 18:46:48 -07:00
Guanheng Zhang	a6e45a69a8	Fix error message for a wrong fork CUDA (#23030 ) Summary: Fix https://github.com/pytorch/pytorch/issues/17357 Unblock 1.2 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/23030 Differential Revision: D16368004 Pulled By: zhangguanheng66 fbshipit-source-id: 44b6977790ce768efa4777bae41d4b26dae5f288	2019-07-22 15:04:32 -07:00
SsnL	8482efb203	pin_memory malloc now uses existing context if available. (#22229 ) Summary: This is achieved by using `cuDevicePrimaryCtxGetState` as a way to check whether a primary context exists on a device. It is not too slow, from this benchmark of a single call to it on CUDA 10.1, Titan Xp, driver 415.27: ``` --------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------- BM_cuDevicePrimaryCtxGetState 301 ns 301 ns 2319746 ``` Commits: 1. Add `CUDAHooks::getDeviceWithPrimaryContext` which returns a device index with primary context (if exists). Link `c10/cuda` against `libcuda` for device API calls. 2. Use `getDeviceWithPrimaryContext` to check primary context in `pin_memory`. Fix `OptionalDeviceGuard` doc. 3. Refactor `test_cuda_primary_ctx.py` to support multiple tests. Add test for this in that file. Fixes https://github.com/pytorch/pytorch/issues/21081. Pull Request resolved: https://github.com/pytorch/pytorch/pull/22229 Differential Revision: D16170194 Pulled By: zou3519 fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d	2019-07-16 10:18:30 -07:00
Iurii Zdebskyi	3a8d7463bd	Enabled BFloat16 storage (#21523 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21523 ghimport-source-id: 698b3cbd6b21c09b9ff8bf8011980df8e35c33b0 Test Plan: Imported from OSS Differential Revision: D15819368 Pulled By: izdeby fbshipit-source-id: f6b3bba7b3ca8ee677bd80a231dbb3920c07d61c	2019-07-09 21:51:06 -07:00
Syed Tousif Ahmed	effcc398c4	Refactor Random Number Generators in ATen (#21555 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21555 ghimport-source-id: dd900a8c3e1ef9ef1e011b8bb5476626d18cc462 Test Plan: Imported from OSS Differential Revision: D15875780 Pulled By: ezyang fbshipit-source-id: 6e04e90af62ab9c9593d74f344a3a084aaaf6f43	2019-06-19 13:54:09 -07:00
Will Feng	8cde4c4d22	Remove Variable::Impl and DifferentiableViewImpl (#17072 ) Summary: As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR: 1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class 2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()` 3. Remove `Variable.data()` API 3. Add `Variable.variable_data()` that matches `tensor.data` in Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history. After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't. Note that this PR is BC-breaking in the following use cases: Use Case 1: Previously, `x.data = y` works even if `x` and `y` are of different TensorImpl type (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl type, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl type. Use Case 2: If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. This is better illustrated with the following example: ```python params = torch.tensor([1.5, 1.5]).requires_grad_() with torch.no_grad(): # Change gradient to a sparse tensor params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.])) grad_saved = params.grad params.backward(torch.tensor([1.5, 1.5])) assert id(grad_saved) == id(params.grad) # This will fail after this PR ``` The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072 Differential Revision: D14075257 Pulled By: yf225 fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957	2019-05-23 21:09:04 -07:00
Roy Li	53bb739b67	Remove uses of TypeID (#19452 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19452 ghimport-source-id: 816ae7fe1a18d76f064d5796dec44dca6a138a21 Differential Revision: D15009920 Pulled By: li-roy fbshipit-source-id: 722f05a927528148555561da62839f84dba645c6	2019-04-19 12:07:35 -07:00
Pieter Noordhuis	563de88aa5	Revert D14909203: Remove usages of TypeID Differential Revision: D14909203 Original commit changeset: d716179c484a fbshipit-source-id: 992ff1fcd6d35d3f2ae768c7e164b7a0ba871914	2019-04-18 17:47:39 -07:00
Roy Li	01d7d3de46	Remove usages of TypeID (#19183 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19183 ghimport-source-id: 9af190b072523459fa61e5e79419b88ac8586a4d Differential Revision: D14909203 Pulled By: li-roy fbshipit-source-id: d716179c484aebfe3ec30087c5ecd4a11848ffc3	2019-04-17 23:55:47 -07:00
Soumith Chintala	b5d8844bbe	push magma init into lazyInitCUDA (#18527 ) Summary: Tries to fix C++ API's usage of MAGMA-based functions. Attempts to Fix https://github.com/pytorch/pytorch/issues/18074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/18527 Differential Revision: D14691694 Pulled By: soumith fbshipit-source-id: dd04e74418e486d73ea4a92193ddf79352ed71ba	2019-04-03 12:47:34 -07:00
Edward Yang	515238e0a5	Unify cudaGetDeviceCount implementations. (#18445 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18445 ghimport-source-id: 30d018737bf6989bc68b7e3676f44e0ca6141fde Stack from [ghstack](https://github.com/ezyang/ghstack): * #18242 Test running a CUDA build on CPU machine. * #18445 Unify cudaGetDeviceCount implementations. I went about doing this by searching for calls to cudaGetDeviceCount, and then methodically replacing them with references to c10::cuda::device_count() or at::cuda::device_count(). There is a point to doing this: the various implementations wildly differed in their handling of what to do when cudaGetDeviceCount returns an error. The final standardized behavior is that all errors are swallowed and we return device count of zero. This indirectly fixes running CUDA builds on CPU, which was broken in #17847. I added 'noexcept' to the 'deviceCount' virtual method on DeviceGuardImpl. This is a BC-breaking change for anyone inheriting from DeviceGuardImpl but all you need to do is put 'noexcept' on your method and it is backwards compatible with older libtorch. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D14612189 fbshipit-source-id: 3c8d186e3dd623c0e27625212c7ce30f75d943cb	2019-03-26 09:50:14 -07:00
Vitaly Fedyunin	5653a914f7	Implement reference counting for shared IPC CUDA tensors (#16854 ) Summary: This is to fix #16141 and similar issues. The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage. ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854 Differential Revision: D13994490 Pulled By: VitalyFedyunin fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1	2019-03-25 10:24:38 -07:00
Roy Li	7aae51cded	Replace tensor.type().scalarType() calls with tensor.scalar_type() Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17515 Reviewed By: ezyang Differential Revision: D14233250 fbshipit-source-id: 6c7af8d2291c0c2b148001b30cf03834f34366c0	2019-03-08 14:08:18 -08:00
Will Feng	393c97fda7	Fix variable checking in THCPModule_setRNGState (#17474 ) Summary: See https://github.com/pytorch/pytorch/pull/16325/files#r259576901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/17474 Differential Revision: D14209549 Pulled By: yf225 fbshipit-source-id: 2ae091955ae17f5d1540f7d465739c4809c327f8	2019-02-25 11:05:51 -08:00
Iurii Zdebskyi	444039c47b	Bool tensor. Part 0: Boolean storage implementation (#16810 ) Summary: This is the first commit from a series of planned changes in order to add boolean tensors to PyTorch. The whole plan looks like this: 0. Storage Implementation (this change) 1. Tensor Creation. 2. Tensor Conversions. 3. Tensor Indexing. 4. Tensor Operations. 5. Back compatibility related changes. This feature was requested by the community: https://github.com/pytorch/pytorch/issues/4764 https://github.com/pytorch/pytorch/issues/4219 https://github.com/pytorch/pytorch/issues/4288 Change: Added boolean type to the Storage class for CPU and CUDA backends. Tested via: 1. unit tests 2. running this: -> import torch -> torch.BoolStorage <class 'torch.BoolStorage'> -> torch.cuda.BoolStorage <class 'torch.cuda.BoolStorage'> Pull Request resolved: https://github.com/pytorch/pytorch/pull/16810 Reviewed By: gchanan Differential Revision: D14087246 Pulled By: izdeby fbshipit-source-id: 042642ced1cb0fd1bb6bff05f9ca871a5c54ee5e	2019-02-19 08:22:13 -08:00
Will Feng	202eaa4ef4	Use non-Variable type for callsites that check type equality (#16325 ) Summary: When Variable and Tensor are merged, the dynamic type of the tensors passed to certain functions will become variables, and expecting `type()` on those variables to still return non-Variable types will cause type mismatch error. One way to fix this problem is to use the thread-local guard `at::AutoNonVariableTypeMode` to force `type()` to return non-Variable type, but ideally we want to limit the use of `at::AutoNonVariableTypeMode` to be only in VariableType.cpp. Another way to fix the problem is to use `at::globalContext().getNonVariableType()` instead to get the non-Variable type of the tensor, which is what this PR is trying to achieve. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16325 Differential Revision: D14012022 Pulled By: yf225 fbshipit-source-id: 77ef1d2a02f78bff0063bdd72596e34046f1e00d	2019-02-10 09:47:50 -08:00
Edward Yang	e936a69085	Move THCCachingAllocator to c10_cuda. (#16119 ) Summary: Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119 Reviewed By: smessmer Differential Revision: D13718768 fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656	2019-01-24 12:06:56 -08:00
Edward Yang	24b50f1411	Remove unnecessary includes and headers from THCCachingAllocator, move to at::cuda:: namespace (#16117 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16117 This means I can move it to c10_cuda with minimal fuss. Reviewed By: smessmer Differential Revision: D13717836 fbshipit-source-id: a94c7dc649af64542480fc1c226b289588886c00	2019-01-24 12:06:54 -08:00
Shen Li	2235fb256e	Add default_stream() and enhance current_stream() (#16200 ) Summary: Closes #16156 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16200 Differential Revision: D13747455 Pulled By: mrshenli fbshipit-source-id: 00c0d5f341c3ac7a757bdb4631a17e11fbc6d3ec	2019-01-22 14:35:19 -08:00
Shen Li	292edfb087	Change current device in stream context manager if necessary (#16128 ) Summary: Fixes #16019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16128 Differential Revision: D13721850 Pulled By: mrshenli fbshipit-source-id: 422c6c0b97c1cd46e127e265b532cb8c74a3aac5	2019-01-18 12:39:51 -08:00
Edward Yang	411173757e	Rename away uses of THAllocator and THCDeviceAllocator (#16061 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16061 I discovered I needed to delete these names in preparation of moving THCCachingAllocator to c10_cuda; might as well also fix all the other sites too. Reviewed By: dzhulgakov Differential Revision: D13686869 fbshipit-source-id: e8cc55d39ac4bfd3e3a22c761f89a7a111ce5f5e	2019-01-16 05:36:47 -08:00
SsnL	300dcc3b96	Add cuda.reset_max_memory_* (#15985 ) Summary: Addresses #15968 Pull Request resolved: https://github.com/pytorch/pytorch/pull/15985 Differential Revision: D13649916 Pulled By: soumith fbshipit-source-id: a207aea5709a79dba7a6fc541d0a70103f49efff	2019-01-14 07:31:51 -08:00
Edward Yang	517c7c9861	Canonicalize all includes in PyTorch. (#14849 ) Summary: Anywhere we used #include "foo.h", we now say #include <foo.h> Paths are adjusted to be rooted out of aten/src, torch/lib, or the root level directory. I modified CMakeLists.txt by hand to remove TH and THC from the include paths. I used the following script to do the canonicalization: ``` import subprocess import re import os.path files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n') for fn in files: if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']): continue if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]): continue with open(fn, 'r') as f: c = f.read() def fmt(p): return "#include <{}>".format(p) def repl(m): p = m.group(1) if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]: return fmt(p) if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]): return fmt(p) for root in ["aten/src", "torch/lib", ""]: for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]: new_p = os.path.relpath(os.path.join(bad_root, p), root) if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))): return fmt(new_p) print("ERROR: ", fn, p) return m.group(0) new_c = re.sub(r'#include "([^"]+)"', repl, c) if new_c != c: print(fn) with open(fn, 'w') as f: f.write(new_c) ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849 Reviewed By: dzhulgakov Differential Revision: D13363445 Pulled By: ezyang fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68	2018-12-08 19:38:30 -08:00
Peter Goldsborough	d6c53328f9	Large scale fix of python-related files in torch/csrc/ Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14515 Differential Revision: D13247966 Pulled By: goldsborough fbshipit-source-id: 7a127c508fc576a7a92626dd6b729f660162d628	2018-12-07 13:04:46 -08:00
Edward Yang	c5cc1e3ab2	Delete legacy THCStream (long live THCStream). (#14246 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14246 This commit systematically eliminates THCStream entirely from THC, replacing it with at::cuda::CUDAStream. In places where the previous pointer type showed up in a public API signature, those functions are now only available to C++ clients. (It would not be too difficult to make a C-compatible version of CUDAStream, as it's really just a simple struct, but we leave this for future work.) All functions in THC that referred to THCStream were expunged in favor of their modern counterparts. One annoyance was that I didn't feel like redoing how the torch.cuda.Stream binding code worked, but I really wanted to get rid of the stored THCStream* pointer. So I repurposed the bit-packing code I implemented for Stream hashing, and used that to (reversibly) store streams in a uint64_t cdata field. A perhaps more future proof solution would be to get rid of cdata entirely, and store the device and stream ID directly. Billing of changes: - All CUDAStream_ pointer API functions are now hidden and anonymously namespaced (instead of being in the impl namespace). All use sites rewritten to use the modern C++ API. Since CUDAStreamInternals is no longer part of the public API, the CUDAStreamInternals constructor and internals() method have been removed, and replaced with anonymous functions in the C++ file. - device_index() returns DeviceIndex rather than int64_t now - Stream and CUDAStream now have pack/unpack methods. (CUDAStream checks that the unpacked bit-pattern is for a CUDA device.) - THCStream.h header is removed entirely - Most THCStream handling functions in THC API are removed Reviewed By: gchanan Differential Revision: D13121531 fbshipit-source-id: 48873262cc0a37c3eec75a7ba1c93c800da40222	2018-11-27 08:32:09 -08:00
Edward Yang	c836a04dc8	Delete a bunch of uses of getType in favor of TensorOptions. Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11087 Reviewed By: cpuhrsch Differential Revision: D9581560 fbshipit-source-id: ebe3c4c0956da8a7215ada287bf6526dbcb2b07d	2018-08-30 20:11:24 -07:00
Peter Goldsborough	7ddc6f84c4	NULL -> nullptr (#11047 ) Summary: How did we get so many uses of `NULL` again? ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/11047 Differential Revision: D9566799 Pulled By: goldsborough fbshipit-source-id: 83469f352ac69aa65bdaf1a1a21f922d892e0db3	2018-08-30 16:25:42 -07:00
Tongzhou Wang	23af7deea7	Add has_lapack flag (#11024 ) Summary: Currently our `skipIfLapack` has uses a try-catch block and regex match the error message. It is highly unreliable. This PR adds `hasLAPACK` and `hasMAGMA` on ATen context, and expose the flags to python. Also fixes refcounting bug with `PyModule_AddObject`. The method steals reference, but we didn't `Py_INCREF` in some places before calling it with `Py_True` or `Py_False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11024 Differential Revision: D9564898 Pulled By: SsnL fbshipit-source-id: f46862ec3558d7e0058ef48991cd9c720cb317e2	2018-08-29 22:41:16 -07:00
Gregory Chanan	00f2731112	Merge THTensor into TensorImpl Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10479 Differential Revision: D9315800 Pulled By: gchanan fbshipit-source-id: b13ef0de3342600b02b54e0700eb02021a9d1a9e	2018-08-16 08:10:06 -07:00
Syed Tousif Ahmed	5adcac3dce	Cuda half macros cleanup (#10147 ) Summary: This PR removes couple of macros throughout TH* as part of the re-factoring effort for ATen. Removing these macros should avoid confusion among developers who are trying to move things from TH* to ATen. This PR is part of the THCNumerics deprecation that I have been working on following up on mruberry's https://github.com/pytorch/pytorch/pull/9318. I am separating these two commits to see if removal of these macros doesn't upset the pytorch public CI, as well as internal builds. - Commit `1248de7baf` removes the code paths guarded by `CUDA_HALF_INSTRUCTIONS` macro. Since the macro was removed in commit `2f186df52d`, `ifdef CUDA_HALF_INSTRUCTIONS` would return false and hence the code path that is kept after this change is for the false case of `ifdef CUDA_HALF_INSTRUCTIONS` - Commit `520c99b057` removes the code paths guarded by `CUDA_HALF_TENSOR` macro. Since Pytorch now provides support for only CUDA 8.0 and above, `CUDA_HALF_TENSOR` is always true since CUDA 8.0 satisfies `CUDA_HAS_FP16` and hence, the code path that is kept after this change is for the true case of `ifdef CUDA_HALF_TENSOR`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/10147 Differential Revision: D9345940 Pulled By: soumith fbshipit-source-id: c9392261dd432d304f1cdaf961760cbd164a59d0	2018-08-15 13:25:42 -07:00
Mike Ruberry	1003ccfa15	Creates CUDAContext (#9435 ) Summary: ezyang noticed that the CUDAStream files lived under ATen/ despite being CUDA-specific, and suggested porting them to ATen/cuda and exposing them with a new CUDAContext. This PR does that. It also: - Moves ATen's CUDA-specific exceptions for ATen/cudnn to ATen/cuda for consistency - Moves getDeviceProperties() and getCurrentCUDASparseHandle() to CUDAContext from CUDAHooks The separation between CUDAContext and CUDAHooks is straightforward. Files that are in CUDA-only builds should rely on CUDAContext, while CUDAHooks is for runtime dispatch in files that can be included in CPU-only builds. A comment in CUDAContext.h explains this pattern. Acquiring device properties and CUDA-specific handles is something only done in builds with CUDA, for example, so I moved them from CUDAHooks to CUDAContext. This PR will conflict with #9277 and I will merge with master after #9277 goes in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/9435 Reviewed By: soumith Differential Revision: D8917236 Pulled By: ezyang fbshipit-source-id: 219718864234fdd21a2baff1dd3932ff289b5751	2018-07-20 12:56:15 -07:00
Edward Yang	cffca2926b	Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator (#9358 ) Summary: See Note [Supervisor deleter] for how SupervisedPtr works. This design is not the obvious one, but there were a lot of constraints feeding into it: - It must support the reallocation usage-pattern, where, given an existing Storage, we allocate a new region of memory, copy the existing data to it, and then deallocate the old region of memory. - Creation of a deleter for memory MUST avoid dynamic allocations in the common case. We've done some benchmarking in Caffe2 where dynamic allocation for deleters is ruinously expensive, and it's really hard to avoid these performance tarpits in very general function wrappers like std::function or folly::Function (while benchmarking this, we discovered that folly::Function's move constructor was way more expensive than it should be). - We need to be able to deallocate data that comes from external sources, e.g., dlpack and numpy tensors. Most notably, you often cannot deallocate these with merely the void* data pointer; you need some extra, out-of-band information (e.g., the managing struct) to deallocate it. Sometimes, you may even want to resize data living in an external source! - The "core" allocators need to support being wrapped in a Thrust allocator, so you need to be implement the following two functions: char* allocate(size_t); void deallocate(char, size_t); - We need to support tensors which contain non-POD, non-trivially copyable data; specifically tensors of std::string. This is an upcoming requirement from Caffe2. It's dirty AF, but it's really useful. - It should use C++ standard library types like std::unique_ptr (which is hugely problematic because std::unique_ptr doesn't call the deleter when the pointer is null.) Here is the billing of changes: - Built-in support for realloc() has been DROPPED ENTIRELY. Instead, you're expected to allocate and then copy from the old memory to the new memory if you want to do a reallocation. This is what you'd generally have expected to occur; and axing realloc() from the design lets us avoid some tricky correctness issues with std::realloc(), namely the fact that we must refuse the realloc if the type of the elements are not trivially copyeable. If it really matters, we can add this back, but there really needs to be a good explanation WHY you need fast resizing reallocations (by in large, people don't resize their storages, and it should be acceptable to have a performance degradation when they do). - TH_STORAGE_FREEMEM is no more; instead, if you want a storage which doesn't free its result, you just give it an empty deleter. - What we used to call an "allocator" (really, a combined object for allocating/deleting) has been split into two concepts, an allocator, and a smart pointer (SupervisedPtr) which knows how to delete data. - Unlike previously, where THAllocator/THCDeviceAllocator could have a per-tensor context storing extra information (e.g., a pointer to the metadata you need to actually free the tensor), there is no context in the allocator or the deleter of the smart pointer; instead, the smart pointer directly holds an owning reference to the metadata necessary to free the data. This metadata is freshly manufactured* upon every allocation, which permits us to resize tensors even in the absence of built-in support for realloc(). - By default, allocators don't support "raw" allocations and deallocations with raw pointers. This is because some allocations may return a different context every time, in which case you need to reconstruct the context at delete time (because all you got was a void, not a unique_ptr that carries the deleter). - The diff between at::Allocator and THCDeviceAllocator is a bit larger: - It used to return a cudaError_t. Now, allocators are expected to check the error status immediately and throw an exception if there was an error. It turns out that this is what was immediately done after all occurrences of allocate/release, so it wasn't a big deal (although some subsidiary interfaces had to themselves be converted to not return cudaError_t). There is one notable exception to this, and it is how we handle CUDA OOM: if this occurs, we attempt to return unused memory to the system and try again. This is now handled by a catch-all try-catch block. The cost of catching the exception is probably the least of your worries if you're about to OOM. - It used to take the CUDA stream to perform the allocation on as an argument. However, it turned out that all call sites, this stream was the stream for the current device. So we can push this into the allocator (and the choice, in the future, could be made explicitly by twiddling thread local state.) - It held two extra methods, emptyCache and cacheInfo, specifically for interacting with some state in THCCachingAllocator. But this "generality" was a lie, since THCCachingAllocator was the only allocator that actually implemented these methods, and there is actually a bunch of code in THC which assumes that it is the caching allocator that is the underlying allocator for CUDA allocations. So I folded these two methods into this interface as THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo. - It held its context directly inside the THCDeviceAllocator struct. This context has been moved out into whatever is holding the at::Allocator. - The APIs for getting at allocators/deleters is now a little different. - Previously there were a bunch of static variables you could get the address of (e.g., &THDefaultAllocator); now there is a function getTHDefaultAllocator(). - Some "allocators" didn't actually know how to allocate (e.g., the IPC "allocator"). These have been deleted; instead, you can wrap the produced pointers into SupervisedPtr using an appropriate makeSupervisedPtr() static method. - Storage sharing was a lot of work to wrangle, but I think I've tamed the beast. - THMapAllocator and its "subclasses" have been refactored to be proper, honest to goodness C++ classes. I used the enum argument trick to get "named" constructors. We use inheritance to add refcounting and management (in libshm). What we previously called the "Context" class (Context has been dropped from the name) is now the supervisor for the data. - Sometimes, we need to pull out the file descriptor from a tensor. Previously, it was pulled out of the allocator context. Now, we pull it out of the supervisor of the SupervisorPtr, using the static method fromSupervisedPtr(), which uses the deleter as the typeid, and refines the type if it matches. - I renamed the std::function deleter into InefficientStdFunctionSupervisor, to emphasize the fact that it does a dynamic allocation to save the std::function deleter. TODO: - Windows libshm is in shambles and needs to be fixed. Perhaps for the future: - newFromFd is now unconditionally calling cudaPointerGetAttributes even though this is unnecessary, because we know what the device is from higher up in the callstack. We can fix this by making newWithDataAndAllocator also take an explicit device argument. - Consider statically distinguishing between allocators that support raw_allocate/raw_deallocate, and those which don't. The Thrust constraint applies only to the CUDA device allocator; you never need to allocate CPU memory this way - Really want to get rid of storage views. Ugh. Nontrivial bugs I noticed when preparing this patch: - I forgot to placement-new unique pointers and attempted to assign them directly on uninitialized memory; very bad! Sam Gross has encouraged me to replace this with a proper constructor but I keep putting it off, because once everything goes in StorageImpl there really will be a proper constructor. - I rewrote a number of APIs to use newWithDataAndAllocator instead of newWithAllocator, calling the allocator at the call site (because they required "allocation context" which we no longer give to "allocators"). When I did this, I forgot to insert the multiplication with sizeof(real) to scale from numels to number of bytes. - The implementation of swap on storages was missing it for scalarType and backend. It was benign (because the only case we call swap is when these are the same), but I fixed it anyway. - I accidentally returned a nullptr unique_ptr with no deleter, even though there was a legitimate one. This matters, because some code still shoves its hands in the deleter context to get extra metadata about the function. - I used std::move() on a unique_ptr, and then did a boolean test on the pointer aftewards (always false!) Pull Request resolved: https://github.com/pytorch/pytorch/pull/9358 Reviewed By: SsnL Differential Revision: D8811822 Pulled By: ezyang fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75	2018-07-15 15:11:18 -07:00
Soumith Chintala	dc186cc9fe	Remove NO_* and WITH_* across codebase, except in setup.py (#8555 ) * remove legacy options from CMakeLists * codemod WITH_ to USE_ for WITH_CUDA, WITH_CUDNN, WITH_DISTRIBUTED, WITH_DISTRIBUTED_MW, WITH_GLOO_IBVERBS, WITH_NCCL, WITH_ROCM, WITH_NUMPY * cover SYSTEM_NCCL, MKLDNN, NNPACK, C10D, NINJA * removed NO_* variables and hotpatch them only in setup.py * fix lint	2018-06-15 12:29:48 -04:00
Zachary DeVito	d985cf46f1	Add workaround to fix include warnings in Python 2 builds. (#6716 )	2018-04-24 12:30:19 -07:00
Tongzhou Wang	4563e190c4	Use THC cached CUDA device property when get_device_name and get_device_capability (#6027 ) Getting CUDA device property struct with cudaGetDeviceProperties is expensive. THC caches CUDA device property, which is available via THCState_getDeviceProperties, which is available via at::globalContext().getDeviceProperties(device), which is available via torch.cuda.get_device_properties. This PR changes the two methods that previously calls cudaGetDeviceProperties to directly using torch.cuda.get_device_properties in Python. Also fixes ATen compile error when it can't find CUDA. Fixes #4908. Using the script from that issue, we get roughly 18x speed-up. [ssnl@ ~] python dev.py # master 0.2826697587966919 0.00034999847412109375 0.0003493785858154297 0.000356292724609375 0.00036025047302246094 0.0003629922866821289 0.00036084651947021484 0.00035686492919921874 0.00036056041717529296 0.0003606319427490234 [ssnl@ ~] python dev.py # this PR 0.27275662422180175 2.1147727966308594e-05 1.9598007202148438e-05 1.94549560546875e-05 1.9359588623046876e-05 1.938343048095703e-05 2.0074844360351563e-05 1.952648162841797e-05 1.9311904907226562e-05 1.938343048095703e-05	2018-03-30 16:39:22 -04:00
Sam Gross	48a3349c29	Delete dead Tensor code paths (#5417 ) This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp. This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.	2018-02-27 17:58:09 -05:00
Carl Lemaire	6b95ca4eda	DataParallel: GPU imbalance warning (#5376 )	2018-02-27 21:30:41 +01:00
Sam Gross	406c9f9c28	Remove two uses of the old Tensor class (#5413 )	2018-02-26 15:00:51 -05:00
Sam Gross	30ec06c140	Merge Variable and Tensor classes (#5225 ) This replaces the torch.Tensor constructors with factories that produce Variables. Similarly, functions on the torch module (e.g. torch.randn) now return Variables. To keep the PR to a reasonable size, I've left most of the unused tensor code. Subsequent PRs will remove the dead code, clean-up calls to torch.autograd.Variable, and rename Variable to Tensor everywhere. There are some breaking changes because Variable and Tensors had slightly different semantics. There's a list of those changes here: https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge	2018-02-23 18:03:31 -05:00
Adam Paszke	1061d7970d	Move broadcast and broadcast_coalesced to C++	2018-01-18 11:16:45 +01:00
Adam Paszke	de5f7b725e	Base for pure C++ NCCL interface	2018-01-18 11:16:45 +01:00
Tongzhou Wang	5918243b0c	Methods for checking CUDA memory usage (#4511 ) * gpu mem allocated * add test * addressed some of @apaszke 's comments * cache stats * add more comments about test	2018-01-09 11:47:48 -05:00
peterjc123	77ea2f26d8	Add build support for Python 2.7 using MSVC (#4226 )	2017-12-20 15:07:25 +01:00
Sam Gross	bcfe259f83	Add streams and comms as optional arguments (#3968 ) Adds streams and comms as optional arguments to the NCCL calls in torch.cuda.nccl. Also exposes ncclUniqueId and ncclCommInitRank for multi-process mode. Moves Py_RETURN_NONE statements after the GIL is re-acquired.	2017-12-04 13:51:22 -05:00
Sam Gross	4bce69be22	Implement Variable.storage() (#3765 ) This still uses THPStorage, but avoids touching THPTensor	2017-11-20 14:18:07 -05:00
Soumith Chintala	50009144c0	add warnings if device capability is less than ideal (#3601 )	2017-11-09 11:48:59 -05:00
peterjc123	aa911939a3	Improve Windows Compatibility (for csrc/scripts) (#2941 )	2017-11-08 19:51:35 +01:00
SsnL	bb1b826cdc	Exposing emptyCache from allocator (#3518 ) * Add empty_cache binding * cuda.empty_cache document * update docs	2017-11-07 17:00:38 -05:00
Adam Paszke	cc3058bdac	Fix macOS build (with CUDA) (#3071 )	2017-10-11 19:04:15 +02:00
Soumith Chintala	e9dccb3156	implement all_reduce, broadcast, all_gather, reduce_scatter	2017-10-09 22:24:18 -04:00
Soumith Chintala	4d62933529	add initial NCCL C bindings	2017-10-09 22:24:18 -04:00
Soumith Chintala	b3bc5fe302	refactor THCP method defs into cuda/Module.cpp	2017-09-30 13:14:35 -07:00
Justin Johnson	94b5990201	Add torch.cuda.get_device_name function (#2540 )	2017-08-26 15:06:37 -04:00
Adam Paszke	8ab3d214d5	Fixes for DistributedDataParallel (#2168 )	2017-07-21 16:00:46 -04:00
Edward Z. Yang	72e9e7abf7	Warning squash. Signed-off-by: Edward Z. Yang <ezyang@fb.com>	2017-07-21 14:13:11 -04:00
Zach DeVito	9d8cff9bc1	initialize aten and pytorch to share the same THCState	2017-07-11 10:35:03 -04:00
Adam Paszke	12813b88f6	Add DistributedDataParallel	2017-06-12 22:00:22 -04:00
Trevor Killeen	05bc877a05	make THPPointer have explicit constructors (#1636 )	2017-05-25 15:35:54 -04:00
Sam Gross	4c1cdb6148	Refactor Python string utility function	2017-04-28 21:25:26 +02:00
Sam Gross	aab30d4ea2	Fix errors when no CUDA devices are available (#1334 ) Fixes #1267 This fixes a number of issues when PyTorch was compiled with CUDA support but run on a machine without any GPUs. Now, we treat all errors from cudaGetDeviceCount() as if the machine has no devices.	2017-04-23 14:45:27 +02:00
Sergey Zagoruyko	8dc5d2a22e	export current_blas_handle	2017-03-23 23:32:45 +01:00
Sam Gross	b9379cfab7	Use cuDNN and NCCL symbols from _C library (#1017 ) This ensures that we use the same library at the C++ level and with Python ctypes. It moves the searching for the correct library from run-time to compile-time.	2017-03-16 16:10:17 -04:00
soumith	7ad948ffa9	fix tests to not sys.exit(), also fix fatal error on THC initialization	2017-03-01 17:37:04 -05:00
Sam Gross	fc6fcf23f7	Lock the cudaFree mutex. (#880 ) Prevents NCCL calls from overlapping with cudaFree() which can lead to deadlocks.	2017-03-01 11:29:25 -05:00
Adam Paszke	19a65d2bea	Expose stateless methods for torch.cuda.HalfTensor	2017-02-26 20:02:42 +01:00
Sam Gross	bd5303010d	Refactor autograd package to separate Python dependencies. (#662 ) The core autograd Variable, Function, and Engine no longer depend on the Python API. This let's us implement functions in C++. In the future, we can also multithread engine and release the GIL for most of the non-Python backwards.	2017-02-13 16:00:16 -08:00
Zeming Lin	59d66e6963	Sparse Library (#333 )	2017-01-05 00:43:41 +01:00
Sam Gross	20fffc8bb7	Fix torch.is_tensor for half tensors (#322 ) Fixes #311	2016-12-19 15:27:47 +01:00
Sam Gross	1af9a9637f	Refactor copy and release GIL during copy (#286 )	2016-12-11 21:54:58 +01:00
Sam Gross	0d7d29fa57	Enable caching allocator for CUDA pinned memory (#275 ) Also add binding for CUDA "sleep" kernel	2016-12-02 01:33:56 -05:00
Adam Paszke	ebc70f7919	Look for libcudart in default CUDA installation paths (#195 )	2016-11-02 19:36:10 -04:00
Sam Gross	ad5fdef6ac	Make every user-visible Tensor have a Storage (#179 )	2016-10-31 12:12:22 -04:00
Sam Gross	79ead42ade	Add CUDA Stream and Event API (#133 )	2016-10-18 12:15:57 -04:00
Sam Gross	8d39fb4094	Use new THC API for device allocator	2016-10-17 09:35:41 -07:00
Sam Gross	ee14cf9438	Add support for pinned memory: (#127 ) torch.Storage/Tensor.pin_memory() torch.Storage/Tensor.is_pinned()	2016-10-15 18:38:26 -04:00
Sam Gross	c20828478e	Update Module.cpp for THC changes	2016-09-30 11:13:14 -07:00
Adam Paszke	3f7ab95890	Finish implementation of prng related functions	2016-09-29 11:33:25 -07:00
Sam Gross	4e9f0a8255	Use CUDA caching allocator	2016-09-26 13:12:39 -07:00
Adam Paszke	06ab3f962f	Refactor _C extension to export some utilities	2016-09-21 08:36:54 -07:00
Adam Paszke	3ea1da3b2c	Minor fix in CUDA module	2016-09-14 11:09:03 -04:00
soumith	1f2695e875	adding cuda driver check functions for runtime checking	2016-09-13 10:34:13 -07:00
Adam Paszke	8d933cbfc4	Fixes for OS X	2016-08-22 22:45:35 -04:00
Adam Paszke	12bed8dc0d	Add CUDA device selection	2016-08-12 07:46:46 -07:00
Adam Paszke	92e983a489	Fixes for Linux and new cutorch	2016-08-02 09:20:18 -07:00
Adam Paszke	c574295012	Various fixes	2016-07-19 10:45:59 -04:00
Adam Paszke	3a44259b32	Add support for CUDA	2016-07-19 10:45:59 -04:00

1 2 3 4 5

202 Commits