pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Maggie Moss	f414aa8e0d	Add pyrefly suppressions (3/n) (#164588 ) Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: uncomment lines in the pyrefly.toml file step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/bb31574ac8a59893c9cf52189e67bb2d after: 0 errors (1,970 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164588 Approved by: https://github.com/oulgen	2025-10-03 22:03:03 +00:00
Yuanyuan Chen	e30f01b5b5	[1/N] Simplify "in" operation for containers of a single item (#164224 ) These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164224 Approved by: https://github.com/rec, https://github.com/Skylion007	2025-09-30 19:59:43 +00:00
atalman	9d0d98acfe	Use cuda nvrtc so file based on cuda version used by torch (#163642 ) Fixes https://github.com/pytorch/pytorch/issues/162367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163642 Approved by: https://github.com/msaroufim	2025-09-24 14:23:39 +00:00
Mark Saroufim	a89d5e97ec	compile_kernel remove header_code arg (#163165 ) We previously asked users to separate these because we didn't have any way of adding extern C declarations. That is no longer necessary, so we don't need this confusing flag anymore. This is BC breaking, but that is fine for this API since it doesn't have major users yet. Please just put all your code in `kernel_source` moving forward. ## BC note The header_code parameter has been removed from torch.cuda._compile_kernel. Previously, users could pass separate header code that would be prepended to the kernel source. Now, header code must be included directly in the kernel_source parameter. Note this only affects torch.cuda._compile_kernel, which is a private API. Example: Before ```python kernel = _compile_kernel( kernel_source="__global__ void my_kernel() { ... }", kernel_name="my_kernel", header_code="#define SCALE 2.0f\n__device__ float scale(float x) { return x * SCALE; }" ) ``` After ```python kernel_source = """ #define SCALE 2.0f __device__ float scale(float x) { return x * SCALE; } __global__ void my_kernel() { ... } """ kernel = _compile_kernel(kernel_source, "my_kernel") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163165 Approved by: https://github.com/janeyx99, https://github.com/albanD	2025-09-17 19:47:32 +00:00
Mark Saroufim	090e6838a0	compile_kernel enable pch (#162972 ) Enables automatic precompiled headers (PCH) per https://docs.nvidia.com/cuda/nvrtc/index.html#example-automatic-pch-cuda-12-8 On average I'm seeing large speedups in compilation time with PCH, but the maximum compilation time with PCH is worse, which is why I can't enable it by default. `load_inline()` also supports precompiled headers and does not enable them by default. ``` Without PCH: 270.58 ms average With PCH: 115.27 ms average ``` ``` Without PCH: Max: 337.99 ms With PCH: Max: 383.82 ms ``` ```python source) [marksaroufim@devgpu005]~/pytorch% python simple_pch_benchmark.py ============================================================ Simple PCH Compilation Benchmark ============================================================ Device: NVIDIA B200 Iterations: 100 Testing WITHOUT PCH: ------------------------------ Compiling kernel 100 times WITHOUT PCH... Completed 10/100 compilations Completed 20/100 compilations Completed 30/100 compilations Completed 40/100 compilations Completed 50/100 compilations Completed 60/100 compilations Completed 70/100 compilations Completed 80/100 compilations Completed 90/100 compilations Completed 100/100 compilations Average: 270.58 ms (±6.99 ms) Min: 264.09 ms Max: 337.99 ms Testing WITH PCH: ------------------------------ Compiling kernel 100 times WITH PCH...
Completed 10/100 compilations Completed 20/100 compilations Completed 30/100 compilations Completed 40/100 compilations Completed 50/100 compilations Completed 60/100 compilations Completed 70/100 compilations Completed 80/100 compilations Completed 90/100 compilations Completed 100/100 compilations Average: 115.27 ms (±27.32 ms) Min: 110.65 ms Max: 383.82 ms ``` ## Benchmarking script ```python #!/usr/bin/env python3 import argparse import os import sys import time from statistics import mean, stdev import torch from torch.cuda._utils import _nvrtc_compile def benchmark_compilation(use_pch, iterations=100): """Compile the same kernel many times with or without PCH.""" # CUB kernel that benefits from PCH kernel_source = """ #include <cub/block/block_reduce.cuh> #include <cub/block/block_scan.cuh> #include <cub/warp/warp_reduce.cuh> extern "C" __global__ void test_kernel(const float* input, float* output, int n) { using BlockReduce = cub::BlockReduce<float, 256>; using BlockScan = cub::BlockScan<float, 256>; using WarpReduce = cub::WarpReduce<float>; __shared__ union { typename BlockReduce::TempStorage reduce; typename BlockScan::TempStorage scan; typename WarpReduce::TempStorage warp[8]; } temp_storage; int idx = blockIdx.x * blockDim.x + threadIdx.x; float val = (idx < n) ? input[idx] : 0.0f; float sum = BlockReduce(temp_storage.reduce).Sum(val); __syncthreads(); float scan_result; BlockScan(temp_storage.scan).ExclusiveSum(val, scan_result); __syncthreads(); int warp_id = threadIdx.x / 32; float warp_sum = WarpReduce(temp_storage.warp[warp_id]).Sum(val); if (threadIdx.x == 0) { output[blockIdx.x] = sum + scan_result + warp_sum; } } """ device = torch.cuda.current_device() major, minor = torch.cuda.get_device_capability(device) compute_capability = f"{major}{minor}" compile_times = [] print( f"Compiling kernel {iterations} times {'WITH' if use_pch else 'WITHOUT'} PCH..." 
) for i in range(iterations): # Use unique kernel name to avoid caching between iterations kernel_name = f"test_kernel_{i}" unique_source = kernel_source.replace("test_kernel", kernel_name) start = time.perf_counter() ptx, mangled_name = _nvrtc_compile( unique_source, kernel_name, compute_capability, header_code="", nvcc_options=["-std=c++17"], auto_pch=use_pch, ) elapsed = time.perf_counter() - start compile_times.append(elapsed * 1000) # Convert to ms # Progress indicator if (i + 1) % 10 == 0: print(f" Completed {i + 1}/{iterations} compilations") return compile_times def main(): parser = argparse.ArgumentParser(description="Simple PCH Compilation Benchmark") parser.add_argument("--pch", action="store_true", help="Test with PCH only") parser.add_argument("--no-pch", action="store_true", help="Test without PCH only") parser.add_argument( "--iterations", type=int, default=100, help="Number of compilations" ) args = parser.parse_args() print("=" * 60) print("Simple PCH Compilation Benchmark") print("=" * 60) print(f"Device: {torch.cuda.get_device_name()}") print(f"Iterations: {args.iterations}") print() # Determine what to test test_both = not args.pch and not args.no_pch results = {} # Test without PCH if args.no_pch or test_both: print("Testing WITHOUT PCH:") print("-" * 30) times_no_pch = benchmark_compilation(use_pch=False, iterations=args.iterations) if times_no_pch: avg_no_pch = mean(times_no_pch) std_no_pch = stdev(times_no_pch) if len(times_no_pch) > 1 else 0 print(f"Average: {avg_no_pch:.2f} ms (±{std_no_pch:.2f} ms)") print(f"Min: {min(times_no_pch):.2f} ms") print(f"Max: {max(times_no_pch):.2f} ms") results["no_pch"] = avg_no_pch print() # Test with PCH if args.pch or test_both: print("Testing WITH PCH:") print("-" * 30) times_with_pch = benchmark_compilation( use_pch=True, iterations=args.iterations ) if times_with_pch: avg_with_pch = mean(times_with_pch) std_with_pch = stdev(times_with_pch) if len(times_with_pch) > 1 else 0 print(f"Average: 
{avg_with_pch:.2f} ms (±{std_with_pch:.2f} ms)") print(f"Min: {min(times_with_pch):.2f} ms") print(f"Max: {max(times_with_pch):.2f} ms") results["pch"] = avg_with_pch print() if __name__ == "__main__": main() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162972 Approved by: https://github.com/albanD, https://github.com/janeyx99	2025-09-15 22:55:39 +00:00
Thien Tran	84186c39ed	[NVRTC] Enable compiling templated kernels (#162875 ) Per NVRTC doc - https://docs.nvidia.com/cuda/nvrtc/index.html#accessing-lowered-names, we can compile a templated kernel (e.g. `kernel<float>`) with the following steps NVRTC side - (new) `nvrtcAddNameExpression` -> C++ template e.g. `f<float>` - `nvrtcCompileProgram` - (new) `nvrtcGetLoweredName` -> get mangled name. need to do a copy since later this string is freed after NVRTC program is destroyed - `nvrtcDestroyProgram` CUDA side - use mangled name instead of normal name -> profit - `extern "C"` is not even needed Pull Request resolved: https://github.com/pytorch/pytorch/pull/162875 Approved by: https://github.com/msaroufim	2025-09-14 06:17:36 +00:00
Aaryaman Vasishta	4a757e1e17	[ROCm] Support torch.cuda._compile_kernel (#162510 ) Supports `torch.cuda._compile_kernel` on ROCm. Related to https://github.com/pytorch/pytorch/pull/151484 Tested on Windows with gfx1201. Testing on Linux pending. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162510 Approved by: https://github.com/mycpuorg, https://github.com/msaroufim	2025-09-12 00:18:47 +00:00
Mark Saroufim	7345454e2e	compile_kernel: Handle python floats as c double (#162626 ) This was an open todo in the code and probably a footgun in waiting Pull Request resolved: https://github.com/pytorch/pytorch/pull/162626 Approved by: https://github.com/malfet	2025-09-11 06:03:25 +00:00
Mark Saroufim	12e993f533	compile_kernel large shared memory fix (#162647 ) Alternate solution to https://github.com/pytorch/pytorch/pull/162328 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162647 Approved by: https://github.com/eqy	2025-09-11 05:52:46 +00:00
Mark Saroufim	4fd2a2b273	Add cuda headers automatically for compile_kernel (#162634 ) Issue was pointed out before by @ngimel and more recently by https://gau-nernst.github.io/nvrtc-matmul/#missing-cuda-and-c-headers- by @gau-nernst Benefit is now we can add `#include <cuda_fp16.h>` without crapping out Pull Request resolved: https://github.com/pytorch/pytorch/pull/162634 Approved by: https://github.com/ngimel	2025-09-11 00:20:33 +00:00
Mark Saroufim	4e8dd11be1	simplify nvrtc discovery logic in compile_kernel (#156674 ) Followup from https://github.com/pytorch/pytorch/pull/156332 Tested a bunch while I was working on https://github.com/pytorch/pytorch/pull/156380 Works just fine on dev gpus Pull Request resolved: https://github.com/pytorch/pytorch/pull/156674 Approved by: https://github.com/malfet	2025-06-24 08:55:40 +00:00
Daniel Galvez	4c0aa37dda	Support stream capture of event record and wait nodes in cuda graphs (#155372 ) These are created by the user passing cudaEventRecordExternal and cudaEventWaitExternal to cudaEventRecordWithFlags() and cudaStreamWaitEvent() respectively. We do this by allowing the user to specify external=True when constructing a torch.cuda.Event(). If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs have a different meaning described here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events In short, they will be used to express fork and join operations in the graph if external=False. External events can be used for expressing a fine-grained dependency on the outcome of some nodes in a cuda graph (rather than all nodes). They can also be used for timing parts of a cuda graph's execution, rather than timing the entire graph's execution. Finishes #146145 I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372 Approved by: https://github.com/ngimel	2025-06-17 21:44:51 +00:00
Mark Saroufim	5b368fa0b7	Add torch.cuda._compile_kernel() (#151484 ) Followup work on top of https://github.com/pytorch/pytorch/pull/149480 Wrapper on top of nvrtc inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing in the popcorn leaderboard but would also be useful to integrate into KernelBench. This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal. For now we are planning on landing this as a private function because we expect to iterate on both the user-facing API and the internal implementation. We will open a separate issue to discuss the path towards making this work public and to give a broader overview of the state of custom cuda kernel authoring in PyTorch. Future work, as a prereq to making the work public * divup primitive * support multiple kernels * Expose _get_nvrtc_version from native code * interop with torch.compile * AMD support Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484 Approved by: https://github.com/malfet	2025-04-24 07:14:31 +00:00
Yu, Guangye	46e3f670b4	refactor code to share across different devices (#120602 ) # Motivation Refactor utils code to make it possible to share across CUDA, XPU, and other backends. # Solution Move `_dummy_type` and `_LazySeedTracker` to torch._utils. # Additional Context When upstreaming, refactor these code changes by isolating them into an additional PR to minimize their impact on the CUDA code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120602 Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang	2024-02-28 09:42:58 +00:00
zabboud	01478f1afa	Fix pydocstyle errors listed in issue 112589 (#113227 ) Fixes #112589 Fixed errors relating to pydocstyle in the following files. The remaining errors are related to docstrings at the module level and at methods within each module (see details below) pydocstyle torch/cuda/_utils.py --count before: 3 after: 0 pydocstyle torch/cuda/jiterator.py --count before: 3 after: 1 remaining errors: ``` torch/cuda/jiterator.py:1 at module level: D100: Missing docstring in public module ``` pydocstyle torch/cuda/graphs.py --count before: 25 after: 7 remaining errors: ``` torch/cuda/graphs.py:1 at module level: D100: Missing docstring in public module torch/cuda/graphs.py:54 in public method `__new__`: D102: Missing docstring in public method torch/cuda/graphs.py:108 in public method `debug_dump`: D205: 1 blank line required between summary line and description (found 0) torch/cuda/graphs.py:108 in public method `debug_dump`: D400: First line should end with a period (not ':') torch/cuda/graphs.py:150 in public method `__init__`: D107: Missing docstring in __init__ torch/cuda/graphs.py:172 in public method `__enter__`: D105: Missing docstring in magic method torch/cuda/graphs.py:186 in public method `__exit__`: D105: Missing docstring in magic method ``` pydocstyle torch/cuda/_sanitizer.py --count before: 35 after: 31 remaining errors: ``` torch/cuda/_sanitizer.py:43 in public class `AccessType`: D101: Missing docstring in public class torch/cuda/_sanitizer.py:47 in public method `__str__`: D105: Missing docstring in magic method torch/cuda/_sanitizer.py:84 in public method `__init__`: D107: Missing docstring in __init__ torch/cuda/_sanitizer.py:96 in public method `__str__`: D105: Missing docstring in magic method torch/cuda/_sanitizer.py:139 in public method `__init__`: D107: Missing docstring in __init__ torch/cuda/_sanitizer.py:142 in public method `__str__`: D105: Missing docstring in magic method torch/cuda/_sanitizer.py:218 in public class 
`StreamSynchronizations`: D101: Missing docstring in public class torch/cuda/_sanitizer.py:219 in public method `__init__`: D107: Missing docstring in __init__ torch/cuda/_sanitizer.py:256 in public method `create_stream`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:268 in public method `create_event`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:272 in public method `delete_event`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:276 in public method `update_seq_num`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:280 in public method `record_state`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:291 in public method `stream_wait_for_event`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:298 in public method `all_streams_wait_for_event`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:307 in public method `all_streams_wait_for_stream`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:316 in public method `sync_all_streams`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:323 in public method `is_ordered_after`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:339 in public method `__init__`: D107: Missing docstring in __init__ torch/cuda/_sanitizer.py:460 in public function `zip_by_key`: D103: Missing docstring in public function torch/cuda/_sanitizer.py:466 in public function `zip_arguments`: D103: Missing docstring in public function torch/cuda/_sanitizer.py:478 in public class `ArgumentHandler`: D101: Missing docstring in public class torch/cuda/_sanitizer.py:479 in public method `__init__`: D107: Missing docstring in __init__ torch/cuda/_sanitizer.py:505 in public method `parse_inputs`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:520 in public method `parse_outputs`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:527 in public class 
`CUDASanitizerDispatchMode`: D101: Missing docstring in public class torch/cuda/_sanitizer.py:528 in public method `__init__`: D107: Missing docstring in __init__ torch/cuda/_sanitizer.py:562 in public method `__torch_dispatch__`: D105: Missing docstring in magic method torch/cuda/_sanitizer.py:597 in public method `__init__`: D107: Missing docstring in __init__ torch/cuda/_sanitizer.py:601 in public method `enable`: D102: Missing docstring in public method torch/cuda/_sanitizer.py:605 in public method `__del__`: D105: Missing docstring in magic method ``` pydocstyle torch/storage.py --count before: 90 after: 37 remaining errors: ``` torch/storage.py:1 at module level: D100: Missing docstring in public module torch/storage.py:310 in public class `UntypedStorage`: D101: Missing docstring in public class torch/storage.py:311 in public method `__getitem__`: D105: Missing docstring in magic method torch/storage.py:317 in public method `is_cuda`: D102: Missing docstring in public method torch/storage.py:321 in public method `is_hpu`: D102: Missing docstring in public method torch/storage.py:325 in public method `share_memory_`: D102: Missing docstring in public method torch/storage.py:444 in public class `TypedStorage`: D101: Missing docstring in public class torch/storage.py:453 in public method `fill_`: D102: Missing docstring in public method torch/storage.py:458 in public method `__new__`: D102: Missing docstring in public method torch/storage.py:530 in public method `__init__`: D107: Missing docstring in __init__ torch/storage.py:599 in public method `is_cuda`: D102: Missing docstring in public method torch/storage.py:604 in public method `is_hpu`: D102: Missing docstring in public method torch/storage.py:624 in public method `__len__`: D105: Missing docstring in magic method torch/storage.py:653 in public method `__setitem__`: D105: Missing docstring in magic method torch/storage.py:681 in public method `__getitem__`: D105: Missing docstring in magic method 
torch/storage.py:715 in public method `copy_`: D102: Missing docstring in public method torch/storage.py:723 in public method `nbytes`: D102: Missing docstring in public method torch/storage.py:731 in public method `type`: D102: Missing docstring in public method torch/storage.py:744 in public method `cuda`: D102: Missing docstring in public method torch/storage.py:751 in public method `hpu`: D102: Missing docstring in public method torch/storage.py:758 in public method `element_size`: D102: Missing docstring in public method torch/storage.py:766 in public method `get_device`: D102: Missing docstring in public method torch/storage.py:770 in public method `__str__`: D105: Missing docstring in magic method torch/storage.py:781 in public method `__repr__`: D105: Missing docstring in magic method torch/storage.py:785 in public method `__iter__`: D105: Missing docstring in magic method torch/storage.py:789 in public method `__copy__`: D105: Missing docstring in magic method torch/storage.py:793 in public method `__deepcopy__`: D105: Missing docstring in magic method torch/storage.py:801 in public method `__sizeof__`: D105: Missing docstring in magic method torch/storage.py:877 in public method `device`: D102: Missing docstring in public method torch/storage.py:881 in public method `size`: D102: Missing docstring in public method torch/storage.py:891 in public method `pickle_storage_type`: D102: Missing docstring in public method torch/storage.py:902 in public method `__reduce__`: D105: Missing docstring in magic method torch/storage.py:907 in public method `data_ptr`: D102: Missing docstring in public method torch/storage.py:915 in public method `resize_`: D102: Missing docstring in public method torch/storage.py:931 in public method `from_buffer`: D102: Missing docstring in public method torch/storage.py:1032 in public method `from_file`: D402: First line should not be the function's "signature" torch/storage.py:1075 in public method `is_shared`: D102: Missing 
docstring in public method ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/113227 Approved by: https://github.com/kit1980	2023-11-13 22:05:45 +00:00
Edward Z. Yang	3bf922a6ce	Apply UFMT to low traffic torch modules (#106249 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/106249 Approved by: https://github.com/Skylion007	2023-07-29 23:37:30 +00:00
Justin Chu	79c5e33349	[BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ (#105436 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436 Approved by: https://github.com/malfet, https://github.com/albanD	2023-07-21 07:38:46 +00:00
Peter Bell	eece6da162	[inductor] Reduce device context manager overhead (#91045 ) This adds `torch.cuda._DeviceGuard` which is a stripped down version of `torch.cuda.device` with lower overhead. To do this, it only accepts `int` as the device so we don't need to call `_get_device_index` and is implemented with a new C++ helper `torch._C._cuda_exchangeDevice` that allows `_DeviceGuard.__enter__` to be just a single function call. On my machine, I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark: ```python def set_device(): with torch.cuda.device(0): pass %timeit set_device() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045 Approved by: https://github.com/ngimel, https://github.com/anijain2305	2023-01-12 16:51:59 +00:00
albanD	8713119c89	Stream actually overrides __new__ so we need to patch it as well (#89592 ) Avoids ``` $ python foo.py Traceback (most recent call last): File "foo.py", line 3, in <module> a = torch.cuda.Stream() File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__ return super(Stream, cls).__new__(cls, priority=priority, kwargs) TypeError: object.__new__() takes exactly one argument (the type to instantiate) ``` And now gets ``` $ python foo.py Traceback (most recent call last): File "foo.py", line 3, in <module> a = torch.cuda.Stream() File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__ return super(Stream, cls).__new__(cls, priority=priority, kwargs) File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/_utils.py", line 44, in err_fn raise RuntimeError( RuntimeError: Tried to instantiate dummy base class Stream ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89592 Approved by: https://github.com/soumith	2022-11-29 21:43:23 +00:00
Nikitha Malgi	197f9f0826	Merge CUDA Streams and Events (#53902 ) Summary: ----------- - Updates current_stream and default stream API's to take `optional[device]` argument - Adds parsing logic to replace `torch.cuda.Stream` and `torch.cuda.Event` -> `torch.classes.cuda.Stream` and `torch.classes.cuda.Event` for JIT - Merges StreamContext manager for both Eager and JIT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53902 Test Plan: ------ Run JIT tests: python test/test_jit.py -v TestCUDA Run eager tests: python test/test_cuda.py -v TestCuda Reviewed By: glaringlee Differential Revision: D27494627 Pulled By: nikithamalgifb fbshipit-source-id: b30b0570e38a33fb335c83762eb06ffd46a44b5c	2021-04-05 08:19:55 -07:00
Jianyu Huang	7fc03dd7c9	Back out "[pytorch][PR] Merge CUDA Streams and Events" (#54996 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54996 Original commit changeset: 45d9fee9a582 Test Plan: CI Reviewed By: jspark1105 Differential Revision: D27444718 fbshipit-source-id: deb627230817923eaf84ade50ecb14bfbce4e779	2021-03-31 10:21:35 -07:00
Nikitha Malgi	416ba5c48f	Merge CUDA Streams and Events (#53902 ) Summary: ----------- - Updates current_stream and default stream API's to take `optional[device]` argument - Adds parsing logic to replace `torch.cuda.Stream` and `torch.cuda.Event` -> `torch.classes.cuda.Stream` and `torch.classes.cuda.Event` for JIT - Merges StreamContext manager for both Eager and JIT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53902 Test Plan: ------ Run JIT tests: python test/test_jit.py -v TestCUDA Run eager tests: python test/test_cuda.py -v TestCuda Reviewed By: SplitInfinity Differential Revision: D27285996 Pulled By: nikithamalgifb fbshipit-source-id: 45d9fee9a582b5f4c82330f5f99eb88584804270	2021-03-26 14:19:39 -07:00
Nikita Shulga	43f0ccd1ec	torch.cuda.memory_allocated to return `{}` if not initialized (#51179 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49952 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51179 Reviewed By: ngimel Differential Revision: D26094932 Pulled By: malfet fbshipit-source-id: 0ec28ef9b0604245753d3f2b0e3536286700668d	2021-01-28 20:38:17 -08:00
Guilherme Leobas	4f9d0757f3	Add type information to torch.cuda (#47134 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/47133 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47134 Reviewed By: smessmer Differential Revision: D24955031 Pulled By: ezyang fbshipit-source-id: 87f4623643715baa6ac0627383f009956f80cd46	2020-11-13 21:34:35 -08:00
chengjun	8d570bc708	Decouple DataParallel/DistributedDataParallel from CUDA (#38454 ) Summary: Decouple DataParallel/DistributedDataParallel from CUDA to support more device types. - Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common devices support. `torch.cuda.comm` is kept as is for backward compatibility - Provide common APIs to arbitrary device types without changing existing CUDA APIs in torch.cuda space. - Replace the torch.cuda calls in DataParallel/DistributedDataParallel with the new APIs. Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160) Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454 Differential Revision: D22051557 Pulled By: mrshenli fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8	2020-07-07 12:48:16 -07:00
SsnL	de7ac60cf4	Add out= variants for cuda.comm.broadcast/gather/scatter (#39681 ) Summary: Partially fixes https://github.com/pytorch/pytorch/issues/38911 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39681 Differential Revision: D22161342 Pulled By: mrshenli fbshipit-source-id: 60295077159b02087823e93bb6ebac9d70adea0a	2020-06-24 12:58:19 -07:00
Nikita Shulga	5766da503b	Device name should be a string, not bytes (#40322 ) Summary: I.e. do not accept `bytes` as possible type of `device` argument in `torch.cuda._get_device_index` Pull Request resolved: https://github.com/pytorch/pytorch/pull/40322 Differential Revision: D22176885 Pulled By: malfet fbshipit-source-id: 2f3a46174161f1cdcf6a6ad94a31e54b18ad6186	2020-06-22 19:27:25 -07:00
Nikita Shulga	8b5732e8ad	Move `torch.cuda` annotations inline (#40075 ) Summary: Also enable `torch.cuda` typechecking Pull Request resolved: https://github.com/pytorch/pytorch/pull/40075 Differential Revision: D22121275 Pulled By: malfet fbshipit-source-id: dbecef09911334e8f3d87f5ecab66349da9f2325	2020-06-18 15:52:29 -07:00
Nikita Shulga	76fbfba644	Move _dummy_type to _utils.py (#40177 ) Summary: Use it from both __init__ and streams to define dummy types when CUDA is missing Fix accidental reference of global `storage_name` from `_dummy_type` Add type annotations Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177 Differential Revision: D22106922 Pulled By: malfet fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e	2020-06-17 22:50:02 -07:00
Edward Yang	da2004e132	Upgrade lint. (#39483 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483 I fixed all of the new errors that occurred because of the upgrade. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D21884575 Pulled By: ezyang fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685	2020-06-04 12:56:43 -07:00
Derek Kim	fbdafb006e	Fix trivial typos in torch.cuda._utils (#16026 ) Summary: Trivial typo fixings. Maybe the indefinite article "an" is needed before each "specified index" but I'm not perfectly sure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16026 Differential Revision: D13709499 Pulled By: ezyang fbshipit-source-id: 698b000bb8aa063afd81db6e67046456a439b2ce	2019-01-17 10:40:43 -08:00
SsnL	fab8085111	_get_device_index supports parsing device strings Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14929 Reviewed By: weiyangfb Differential Revision: D13394498 Pulled By: soumith fbshipit-source-id: 948c6118abdf6c1e1a8a17709333954cafb2345e	2018-12-09 21:12:46 -08:00
Tongzhou Wang	8e33451e2e	Make torch.cuda.* take device objects; Update distributed docs (#10833 ) Summary: Commits: 1. Make `torch.cuda.*` take device objects 2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group` Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833 Differential Revision: D9514241 Pulled By: SsnL fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e	2018-08-27 15:24:42 -07:00
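
Several of the commits above (#40177, #89592, #120602) concern `_dummy_type`, the helper that defines placeholder classes such as `Stream` when CUDA is missing. The idea can be sketched roughly as below. This is a minimal illustration written for this page, not the code in torch/_utils.py: the inner helper name `get_err_fn` and the overall structure are assumptions, and only the error message is taken from the traceback quoted in #89592.

```python
def _dummy_type(name: str) -> type:
    """Build a placeholder class that refuses to be instantiated (illustrative sketch)."""

    def get_err_fn(is_init: bool):  # hypothetical helper name
        def err_fn(obj, *args, **kwargs):
            # For __init__, obj is an instance; for __new__, obj is the class itself.
            class_name = obj.__class__.__name__ if is_init else obj.__name__
            raise RuntimeError(f"Tried to instantiate dummy base class {class_name}")

        return err_fn

    # Patch both __init__ and __new__, since a subclass may override either one
    # and still reach the base class through super().
    return type(
        name,
        (object,),
        {"__init__": get_err_fn(True), "__new__": get_err_fn(False)},
    )


Stream = _dummy_type("Stream")

try:
    Stream()
    message = None
except RuntimeError as exc:
    message = str(exc)
```

Patching `__new__` as well as `__init__` is the point of #89592: without it, a subclass that forwards keyword arguments to `super().__new__` fails with an unrelated `TypeError` from `object.__new__` instead of the intended `RuntimeError`.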

33 Commits