The argument order for the legacy path got swapped in a recent patch.
Because there is still a blog post documenting the legacy interface,
people are hitting this pathway.
This patch fixes #108208.
I will also update the blog post so that people are more likely to use
the newer `_record_memory_history` API.
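For reference, a minimal sketch of the newer-style usage (a hedged example; exact parameter names and defaults of this private API may vary between versions):
~~~~
import torch

# Start recording allocation/free events with stack traces (newer API).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the workload to be inspected ...
x = torch.randn(1024, 1024, device="cuda")

# Dump a pickled snapshot that the visualization tools can load.
torch.cuda.memory._dump_snapshot("snapshot.pickle")

# Stop recording.
torch.cuda.memory._record_memory_history(enabled=None)
~~~~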
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108260
Approved by: https://github.com/awgu
> capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
actions in the current thread, and "relaxed" will not error on these actions.
Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.
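For illustration, a minimal sketch of passing this through the graph-capture context manager (assuming `capture_error_mode` is exposed on `torch.cuda.graph` as described above):
~~~~
import torch

g = torch.cuda.CUDAGraph()
x = torch.zeros(8, device="cuda")

# "thread_local" only errors on unsafe actions (e.g. cudaMalloc) issued from the
# capturing thread, so unrelated CUDA work in other threads does not abort capture.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    y = x * 2

g.replay()  # re-runs the captured kernels
~~~~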
Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
Previously, when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for frees, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live, so this PR adds that behavior. If
performance ends up being a concern, the old behavior is available by passing
"alloc" to the context argument rather than "all".
Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful for
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.
This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously, the memory history tracked the real size of allocations before rounding.
Since history was added, 'requested_size' has been added directly to the block, which records the same information,
so this patch also removes that redundancy.
None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.
This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.
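A rough sketch of walking the simplified format, assuming the layout implied above (segments contain blocks, and each allocated block carries 'frames' and 'requested_size' directly):
~~~~
import torch

torch.cuda.memory._record_memory_history()  # needed for blocks to carry stack frames
x = torch.randn(1024, 1024, device="cuda")

snapshot = torch.cuda.memory._snapshot()
for seg in snapshot["segments"]:
    for block in seg["blocks"]:
        if block["state"] == "active_allocated":
            # 'frames' now lives directly on the block instead of inside a history list.
            frames = block.get("frames", [])
            top = frames[0]["name"] if frames else None
            print(block["size"], block["requested_size"], top)
~~~~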
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
which were reverted due to a conflict with the internal source repo.
Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None, but the type is not annotated as Optional)
Plus a few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add an assert in `torch/optim/optimizer.py` that an Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Unrelated: to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add a hack to `.ci/docker/install_conda.sh` to squash the older libstdc++ from the conda environment in favor of the one from the OS
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where that is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
It turns out that jsdelivr, which is used to access the MemoryViz.js
source from generated files, doesn't work unless a version is specified.
This could not be tested until the PR actually landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103741
Approved by: https://github.com/aaronenyeshi
This replaces the individual visualization routines in _memory_viz.py with
a single javascript application.
The javascript application can load pickled snapshot dumps directly using
drag/drop, requesting them via fetch, or by embedding them in a webpage.
The _memory_viz.py commands use the embedding approach.
We can also host MemoryViz.js on a webpage to use the drag/drop approach, e.g.
https://zdevito.github.io/assets/viz/
(eventually this should be hosted with the pytorch docs).
All views/multiple cuda devices are supported on one page.
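As a hedged example of the embedding approach (the `trace_plot` entry point in `torch.cuda._memory_viz` is my understanding of the current helper; the exact name may differ):
~~~~
import torch
from torch.cuda import _memory_viz

torch.cuda.memory._record_memory_history()
x = torch.randn(1024, 1024, device="cuda")
del x

snapshot = torch.cuda.memory._snapshot()

# Embed the snapshot together with MemoryViz.js into a standalone HTML page.
html = _memory_viz.trace_plot(snapshot)
with open("trace.html", "w") as f:
    f.write(html)
~~~~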
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103565
Approved by: https://github.com/eellison, https://github.com/albanD
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>
This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.
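A minimal sketch of the kind of check such a test performs (not the actual test code):
~~~~
import subprocess
import sys

# Importing torch in a fresh interpreter should not pull in triton as a side effect.
out = subprocess.run(
    [sys.executable, "-c", "import sys, torch; print('triton' in sys.modules)"],
    capture_output=True, text=True, check=True,
)
assert out.stdout.strip() == "False"
~~~~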
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet, https://github.com/cpuhrsch
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT, or user-created streams possibly with high priority), and 5 bits for index. This allows us to expose more stream priorities to users (I'm currently setting 4, but that's easy to change now). Note that we pre-create all 32 streams in the pool for each allowed priority; I don't know if that's a problem in practice. Currently CUDA 11.8 / A100 GPUs allow 6 different stream priorities; the number may differ for different cards and CUDA versions.
Previous callsites explicitly requesting a high priority stream (`isHighPriority=true`) now get the highest priority stream.
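For illustration, a hedged sketch of requesting a higher-priority stream through the public API (more negative values mean higher priority; the usable range depends on the device and this encoding):
~~~~
import torch

default = torch.cuda.Stream()          # priority 0
high = torch.cuda.Stream(priority=-1)  # a higher-priority stream from the pre-created pool

with torch.cuda.stream(high):
    y = torch.randn(1024, device="cuda") * 2
~~~~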
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
Why?
* To reduce the latency of the hot path in https://github.com/pytorch/pytorch/pull/97377
Concern: I had to add `set_offset` to all instances of `GeneratorImpl`. I don't know if there is a better way.
~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

# Output:
# tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
#           0,   0], dtype=torch.uint8)
# tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
#           0,   0], dtype=torch.uint8)
~~~~
Reland of https://github.com/pytorch/pytorch/pull/98965
(cherry picked from commit 8214fe07e8)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99565
Approved by: https://github.com/anijain2305
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.
However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If it is too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.
This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.
Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.
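A hedged sketch of turning this on via the allocator config, assuming the `expandable_segments` knob in `PYTORCH_CUDA_ALLOC_CONF` is the intended switch:
~~~~
import os

# Must be set before the CUDA caching allocator is initialized
# (i.e. before the first CUDA allocation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.randn(4096, 4096, device="cuda")  # allocations can now grow the last segment
~~~~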
See inline comments for information about the implementation and its limitations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
When there are > 15000 polygons, trace_plot starts to get really slow.
So we order the allocations and put the smallest allocations beyond the 15000
limit into a single summarized polygon.
A slider allows this limit to be adjusted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98865
Approved by: https://github.com/yf225
Previously we only plotted memory if it was allocated or freed while
trace recording was active. This change also adds any pre-existing blocks
to the visualization. This helps because it is common to enable trace recording
later and not realize that a lot of memory was allocated beforehand, which
previously would not appear in the trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97590
Approved by: https://github.com/eellison