External events are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent(), respectively.
We expose this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning, described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, with external=False they are used to express fork and join
operations in the graph.
External events can be used to express a fine-grained dependency on the
outcome of some nodes in a CUDA graph (rather than all nodes). They can
also be used to time parts of a CUDA graph's execution, rather than
timing the entire graph's execution.
Finishes #146145
I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
# Motivation
We propose to support the Python `with` statement on `torch.Stream`. This benefits all accelerators when writing device-agnostic code. Device-specific streams are also supported because they generally derive from `torch.Stream`.
With this PR, we can write code like this:
```python
s1 = torch.Stream()
# Set s1 as the current stream
torch.accelerator.set_stream(s1)
with torch.Stream() as s2:
    # Inside the with statement, s2 is the current stream
    assert torch.accelerator.current_stream() == s2
# After exiting, the current stream is restored to s1
assert torch.accelerator.current_stream() == s1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138
Approved by: https://github.com/albanD
Fixes #112591
Fixed pydocstyle errors in the following files. The remaining errors relate to docstrings at the module level and on methods within each module (see details below).
`pydocstyle torch/cuda/_memory_viz.py --count`
before: 7
after: 4
**remaining errors:**
```
torch/cuda/_memory_viz.py:77 in public function `format_flamegraph`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:121 in public function `segments`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:128 in public function `memory`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:135 in public function `compare`:
D103: Missing docstring in public function
```
`pydocstyle torch/cuda/streams.py --count`
before: 29
after: 8
**remaining errors:**
```
torch/cuda/streams.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/streams.py:31 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:105 in public method `__eq__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:110 in public method `__hash__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:113 in public method `__repr__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:135 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:163 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:237 in public method `__repr__`:
D105: Missing docstring in magic method
```
`pydocstyle torch/cuda/__init__.py --count`
before: 100
after: 46
**remaining errors:**
```
torch/cuda/__init__.py:251 in public class `DeferredCudaCallError`:
D101: Missing docstring in public class
torch/cuda/__init__.py:327 in public function `cudart`:
D103: Missing docstring in public function
torch/cuda/__init__.py:332 in public class `cudaStatus`:
D101: Missing docstring in public class
torch/cuda/__init__.py:337 in public class `CudaError`:
D101: Missing docstring in public class
torch/cuda/__init__.py:338 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:343 in public function `check_error`:
D103: Missing docstring in public function
torch/cuda/__init__.py:369 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:373 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:376 in public method `__exit__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:391 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:473 in public class `StreamContext`:
D204: 1 blank line required after class docstring (found 0)
torch/cuda/__init__.py:485 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:499 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:514 in public method `__exit__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:541 in public function `set_stream`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:838 in public function `current_blas_handle`:
D400: First line should end with a period (not 'e')
torch/cuda/__init__.py:894 in public function `memory_usage`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:894 in public function `memory_usage`:
D400: First line should end with a period (not ')')
torch/cuda/__init__.py:913 in public function `utilization`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:913 in public function `utilization`:
D400: First line should end with a period (not 'r')
torch/cuda/__init__.py:949 in public function `power_draw`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:949 in public function `power_draw`:
D400: First line should end with a period (not ')')
torch/cuda/__init__.py:1089 in public class `ByteStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1091 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1100 in public class `DoubleStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1102 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1111 in public class `FloatStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1113 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1122 in public class `HalfStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1124 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1133 in public class `LongStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1135 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1144 in public class `IntStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1146 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1155 in public class `ShortStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1157 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1166 in public class `CharStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1168 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1177 in public class `BoolStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1179 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1188 in public class `BFloat16Storage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1190 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1199 in public class `ComplexDoubleStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1201 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1210 in public class `ComplexFloatStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1212 in public method `dtype`:
D102: Missing docstring in public method
```
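For reference, a D103 error (the most common kind in the listings above) is resolved by adding a short summary docstring; a hypothetical example, not taken from these files:
```python
import torch

# Before: pydocstyle flags D103 (missing docstring in public function).
def peak_usage(device):
    return torch.cuda.max_memory_allocated(device)

# After: a one-line summary docstring resolves the error.
def peak_usage(device):
    """Return the peak memory allocated on ``device`` in bytes."""
    return torch.cuda.max_memory_allocated(device)
```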
@mikaylagawarecki @albanD @svekars @jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113233
Approved by: https://github.com/malfet
This PR implements two things:
1. Support for device-agnostic stream and runtime APIs captured by dynamo.
2. Support for stream methods (including events) captured by dynamo.
Here are the details for the first point.
Previously, the stream captured in dynamo was tightly bound to CUDA. Here we implement a global singleton container named `StreamMethodContainer` for different backends to register their associated stream methods with dynamo. When the backend's package is imported, the stream operations can be registered directly by calling
```python
device_stream_method = {'current_stream': method_1,
                        'create_stream_context': method_2,
                        'set_stream': method_3,
                        'set_stream_by_id': method_4}
torch._dynamo.stream.register_stream_method(device_name, device_stream_method)
```
Stream methods need to be passed to this API according to the precise semantics represented by the keys of `device_stream_method`. After registration, these methods can be used by dynamo to capture the stream operations in users' scripts, for example getting the current stream or setting a specific stream. Additionally, the wrapped stream variable and the stream context variable are now device-agnostic; their proxy functions are assigned from the associated methods in the container.
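As a hypothetical sketch of what a backend's registration might look like (the entry point is the one shown above; the mapped callables and the `_set_stream_by_id` shim are assumptions about what a CUDA-like backend would plug in):
```python
import torch

def _set_stream_by_id(stream_id, device_index, device_type):
    # Hypothetical shim: rebuild the backend stream from its id and make it
    # current. The exact signature dynamo expects may differ.
    raise NotImplementedError

device_stream_method = {
    "current_stream": torch.cuda.current_stream,   # query the current stream
    "create_stream_context": torch.cuda.stream,    # context manager selecting a stream
    "set_stream": torch.cuda.set_stream,           # make a given stream current
    "set_stream_by_id": _set_stream_by_id,
}
torch._dynamo.stream.register_stream_method("cuda", device_stream_method)
```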

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108312
Approved by: https://github.com/jansel, https://github.com/jgong5
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT, or user-created streams, possibly with high priority), and 5 bits for the index. This allows us to expose more stream priorities to users (I'm currently setting 4, but that's easy to change now). Note that we are pre-creating all 32 streams in the pool for each allowed priority; I don't know whether that's a problem in practice. Currently CUDA 11.8 on A100 GPUs allows 6 different stream priorities; the number may differ across cards and CUDA versions.
Previous call sites that explicitly requested a high-priority stream (`isHighPriority=true`) now get the highest-priority stream.
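For illustration, a rough Python sketch of the encoding described above (the exact bit positions are an assumption based on this description: low bit = external flag, then 4 bits of IdType, then 5 bits of index):
```python
def pack_stream_id(index: int, id_type: int, is_external: bool) -> int:
    # Assumed layout: [ ... zeros ... | 5-bit index | 4-bit IdType | 1-bit external flag ]
    assert 0 <= index < 32 and 0 <= id_type < 16
    return (index << 5) | (id_type << 1) | int(is_external)

def unpack_stream_id(stream_id: int):
    is_external = bool(stream_id & 0x1)
    id_type = (stream_id >> 1) & 0xF
    index = (stream_id >> 5) & 0x1F
    return index, id_type, is_external
```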
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
#75854
A naive attempt at working around the limitations of using a single 64-bit integer to pack `stream_id`, `device_index`, and `device_type`.
Still needs sanity checks, testing, and minimization of BC-breaking changes.
Currently a Holder for the `StreamData3` struct is used for `IValue` compatibility. While this seems to work for `ivalue.h` and `ivalue_inl.h`, it doesn't seem to work naively for the JIT CUDA stream wrapper (something about ambiguous calls when an `intrusive_ptr` to `c10::ivalue::StreamData3Holder` is used as the return type for `pack()`). It turns out that the methods required to access the fields for rematerializing a CUDA stream are basically already present anyway, so `pack()` is simply removed in the wrapper for now and the methods to access the required fields are called directly.
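For context, the struct simply carries the three values explicitly rather than bit-packing them; a rough Python analogue (field names from the description above, everything else illustrative):
```python
from dataclasses import dataclass

@dataclass
class StreamData3:
    # The three fields that previously had to be squeezed into one 64-bit int.
    stream_id: int
    device_index: int
    device_type: int
```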
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81596
Approved by: https://github.com/ezyang
Summary:
Previous is https://github.com/pytorch/pytorch/issues/57781
We now add two CUDA bindings to avoid using ctypes, fixing a Windows issue.
However, we use ctypes to allocate the stream and create its pointer
(we can do this with a 0-dim tensor too if that feels better).
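For reference, the ctypes allocation mentioned above looks roughly like this (a sketch; the library name and error handling are simplified, and a Linux CUDA runtime is assumed):
```python
import ctypes

# Load the CUDA runtime and create a raw cudaStream_t we can hand to PyTorch.
libcudart = ctypes.CDLL("libcudart.so")  # assumed library name on Linux
stream = ctypes.c_void_p()
err = libcudart.cudaStreamCreate(ctypes.byref(stream))
assert err == 0, f"cudaStreamCreate failed with error {err}"
stream_ptr = stream.value  # integer address of the externally allocated stream
```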
CC. ezyang rgommers ngimel mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59527
Reviewed By: albanD
Differential Revision: D29053062
Pulled By: ezyang
fbshipit-source-id: 661e7e58de98b1bdb7a0871808cd41d91fe8f13f
Summary:
This is required in https://github.com/pytorch/pytorch/pull/57110#issuecomment-828357947
We need to provide a means to synchronize on externally allocated streams for DLPack support in the Python array API.
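A minimal sketch of wrapping and synchronizing on an externally allocated stream, assuming the `torch.cuda.ExternalStream` wrapper (the stand-in pointer below is only there to keep the snippet self-contained; in real use it would come from another library):
```python
import torch

# Stand-in for a pointer produced outside PyTorch: the raw cudaStream_t
# address of an existing stream.
stream_ptr = torch.cuda.Stream().cuda_stream

ext = torch.cuda.ExternalStream(stream_ptr)
with torch.cuda.stream(ext):
    y = torch.ones(1024, device="cuda") * 2  # work enqueued on the external stream
ext.synchronize()  # block until the externally allocated stream finishes
```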
cc mruberry rgommers leofang asi1024 kmaehashi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57781
Reviewed By: mrshenli
Differential Revision: D28326365
Pulled By: ezyang
fbshipit-source-id: b67858c8033949951b49a3d319f649884dfd0a91
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56811
Moves the Sphinx docs `*.rst` files from the torchelastic repository to torch. Note: this only moves the rst files; the next step is to link them to the main pytorch `index.rst` and write a new `examples.rst`.
Reviewed By: H-Huang
Differential Revision: D27974751
fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/51436.
Apparently some non-public Windows builds run CUDA tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54038
Reviewed By: mruberry
Differential Revision: D27068649
Pulled By: ngimel
fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207
Summary:
Implements https://github.com/pytorch/pytorch/issues/51075#issuecomment-768884685 and additions discussed offline with ezyang and ngimel. (Calling it "simple" is charitable, but it's not too bad.)
[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)
The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.
Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.
Graph bindings in Python are almost unchanged from https://github.com/pytorch/pytorch/pull/48875:
```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()
# pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()
# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```
Test plan (other suggestions appreciated):
- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](https://github.com/pytorch/pytorch/issues/51075).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51436
Reviewed By: mruberry
Differential Revision: D26993790
Pulled By: ngimel
fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
Summary:
I ran `make linkcheck` using `sphinx.builders.linkcheck` on the documentation and noticed a few links weren't using HTTPS so I quickly updated them all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40878
Differential Revision: D22404647
Pulled By: ngimel
fbshipit-source-id: 9c9756db59197304023fddc28f252314f6cf4af3
Summary:
Use it from both __init__ and streams to define dummy types when CUDA is missing
Fix accidental reference of global `storage_name` from `_dummy_type`
Add type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177
Differential Revision: D22106922
Pulled By: malfet
fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850
Many of these are real problems in the documentation (i.e., link or
bullet point doesn't display correctly).
Test Plan: - built and viewed the documentation for each change locally.
Differential Revision: D17908123
Pulled By: zou3519
fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27782
Warnings show up when running `make html` to build documentation. All of
the warnings are very reasonable and point to bugs in our docs. This PR
attempts to fix most of those warnings.
In the future we will add something to the CI that asserts that there
are no warnings in our docs.
Test Plan: - build and view changes locally
Differential Revision: D17887067
Pulled By: zou3519
fbshipit-source-id: 6bf4d08764759133b20983d6cd7f5d27e5ee3166
Summary:
1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind the Python Event class to the C++ implementation.
2. Moved all CUDA runtime invocations from `torch/cuda/streams.py` to C++.
3. Added tests to cover the Stream and Event APIs (see the usage sketch below). ~(event IPC handle tests are introduced in #15974)~
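For context, the kind of Stream/Event interaction exercised by those tests looks roughly like this (a generic usage sketch, not copied from the test suite):
```python
import torch

s = torch.cuda.Stream()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.cuda.stream(s):
    start.record()                      # enqueue the start event on stream s
    x = torch.randn(1024, device="cuda").sum()
    end.record()                        # enqueue the end event on stream s

end.synchronize()                       # wait for the enqueued work to finish
print(start.elapsed_time(end))          # elapsed time in milliseconds
print(s.query())                        # True once all enqueued work is done
```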
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937
Differential Revision: D13649001
Pulled By: mrshenli
fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240
Summary:
See #15682
Pushing up this small PR to check if I am doing the right thing. If correct, more will follow for other Stream APIs. Questions will be added inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15737
Differential Revision: D13581400
Pulled By: mrshenli
fbshipit-source-id: 24afed7847b89b62f0692c79a101ec7ff9d9ee4d
Summary:
see #15682
This is a quick fix implementing the simpler solution suggested by colesbury. As the benchmark results show, it slows down `Stream.query()` by ~20%. I would be happy to pursue a more complex solution by implementing this in C++/ATen, but I would still vote for merging this quick fix first just to get rid of the bug sooner.
~Test TBA~ Added
FYI jeffreyksmithjr
now
```python
In [1]: def f():
   ...:     d0 = torch.device('cuda:0')
   ...:     d1 = torch.device('cuda:1')
   ...:     with torch.cuda.device(d0):
   ...:         s0 = torch.cuda.current_stream()
   ...:     with torch.cuda.device(d1):
   ...:         s1 = torch.cuda.current_stream()
   ...:     s0.query()
   ...:     s1.query()
In [4]: %timeit f()
38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit f()
37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
before
```python
In [4]: %timeit f()
28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit f()
35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15689
Differential Revision: D13571697
Pulled By: mrshenli
fbshipit-source-id: 4fe697f91248c6419136d37bb5b7147e612e2f4c
Summary: Record unit of time for torch.cuda.Event's elapsed_time
Differential Revision: D13467646
Pulled By: zou3519
fbshipit-source-id: 4f1f4ef5fa4bc5a1b4775dfcec6ab155e5bf8d6e
Here's the command I used to invoke autopep8 (in parallel!):
git ls-files | grep '\.py$' | xargs -n1 -P`nproc` autopep8 -i
Several rules are ignored in setup.cfg. The goal is to let autopep8
handle everything which it can handle safely, and to disable any rules
which are tricky or controversial to address. We may want to come back
and re-enable some of these rules later, but I'm trying to make this
patch as safe as possible.
Also configures flake8 to match pep8's behavior.
Also configures TravisCI to check the whole project for lint.