pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Shivam Raikundalia	a11538aa46	[GPU Snapshot] Add Clear History Flag (#149352 ) Summary: Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This is because they usually attach an OOM logger without knowing, and when they go to collect the actual snapshot, it adds all the OOM logger contents. Since OOM and regular snapshot use the same backend, we currently don't have the infra in place to split these snapshots. As a solution we add a flag to the snapshot frontend to clear out the history when starting the auto-trace record memory history. A more thorough solution would be to have a user pass in a handle and to have snapshots per handle to separate the events. However, this would likely be complicated and more work than it is worth as we would have to change the callbacks in the caching allocator and pass these objects between python and cpp. Test Plan: See diff below Differential Revision: D71159720 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352 Approved by: https://github.com/eqy, https://github.com/aaronenyeshi	2025-03-19 21:44:20 +00:00
cyy	29f52e3972	[2/N] Remove unnecessary once flag usage (#145057 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057 Approved by: https://github.com/albanD	2025-01-23 09:48:46 +00:00
Edward Yang	b14269dcfb	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) (#138155 ) Summary: - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Original pull request: https://github.com/pytorch/pytorch/pull/136519 Test Plan: contbuild & OSS CI, see `4a8e49389c` Reviewed By: malfet Differential Revision: D64471142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155 Approved by: https://github.com/malfet, https://github.com/bobrenjc93	2024-10-17 20:58:56 +00:00
PyTorch MergeBot	d4d687ffb2	Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519 )" This reverts commit `4a8e49389c`. Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))	2024-10-15 17:19:16 +00:00
FFFrog	4a8e49389c	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) ---- - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey	2024-10-13 12:38:02 +00:00
PyTorch MergeBot	079f909263	Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519 )" This reverts commit `be0b75256a`. Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))	2024-10-10 18:32:17 +00:00
FFFrog	be0b75256a	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey	2024-10-09 02:13:36 +00:00
Aaron Enye Shi	7172c732d9	[Memory Snapshot] Skip C++ warmup unwind() call if context is not set (#133038 ) Summary: Should skip C++ warmup `unwind::unwind();` if there is no context set. This call is sometimes causing hanging issues since C++ stack collection is not robust. Test Plan: CI Differential Revision: D60965985 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/133038 Approved by: https://github.com/eqy	2024-08-13 17:25:24 +00:00
Aaron Enye Shi	fddb1bcdea	[CCA][Memory Snapshot] Move user_defined annotations to Native Caching Allocator (#130964 ) Summary: Instead of embedding the user_defined TraceEntry inside of device_traces, which causes issues when some threads may not have the proper device id set, save them into an external_annotations field by using a RingBuffer<AnnotationEntry> called annotation_buffer owned by the NativeCachingAllocator. Test Plan: CI, resnet run, and FBR model. Differential Revision: D59703213 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964 Approved by: https://github.com/zdevito	2024-07-25 14:06:52 +00:00
Aaron Enye Shi	6c4efd4e95	[Memory Snapshot][BE] Clean up record function callback scope (#130265 ) Summary: We can directly set the scope to at::RecordScope::USER_SCOPE for the at::RecordFunctionCallback object, rather than performing a check inside of the callback. Test Plan: Ran locally, works fine. https://www.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-aaronshi-20240704-1709-7a80b83b/0/rank-0_itrn-1503.Jul_04_17_24_02.3577.snapshot.pickle Differential Revision: D59477046 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130265 Approved by: https://github.com/davidberard98	2024-07-09 05:23:48 +00:00
Aaron Enye Shi	f42d5b6dca	[Memory Snapshot] Make recordAnnotations callback initialize lazily (#129242 ) Summary: Make the recordAnnotations' Record function callback lazily initialize when record memory history starts. This will help reduce the impact on Time To First Batch metric. Test Plan: CI and ran locally. Differential Revision: D58875576 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242 Approved by: https://github.com/zdevito	2024-06-22 04:05:55 +00:00
Aaron Enye Shi	b5d541609d	[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#129072 ) Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations. Test Plan: CI Pulled By: aaronenyeshi Differential Revision: D55941362 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072 Approved by: https://github.com/zdevito	2024-06-19 18:05:41 +00:00
PyTorch MergeBot	718bb9016f	Revert "[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179 )" This reverts commit `187aeaeabf`. Reverted https://github.com/pytorch/pytorch/pull/124179 on behalf of https://github.com/clee2000 due to test_tensorexpr.py::TestTensorExprFuser::test_simple_add is causing a segfault https://github.com/pytorch/pytorch/actions/runs/9097383783/job/25007155440 `187aeaeabf`, test was skipped due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/124179#issuecomment-2112948246))	2024-05-15 16:11:47 +00:00
Aaron Enye Shi	187aeaeabf	[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179 ) Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations. Test Plan: CI New Snapshot Generated: devvm2184.cco0.facebook.com.Apr_19_13_27_14.3072800.snapshot.pickle Snippet of Snapshot device_traces show `ProfilerStep#0`, and `## forward ##` annotations: ``` [[{'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168556, 'frames': [{'name': 'START', 'filename': 'ProfilerStep#0', 'line': 0}]}, {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168738, 'frames': [{'name': 'END', 'filename': 'ProfilerStep#0', 'line': 0}]}, {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168865, 'frames': [{'name': 'START', 'filename': 'ProfilerStep#1', 'line': 0}]}, {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168920, 'frames': [{'name': 'START', 'filename': '## forward ##', 'line': 0}]}, {'action': 'alloc', 'addr': 140166073581568, 'size': 3211264, 'stream': 0, 'time_us': 1713558427172978, 'frames': [{'name': '_conv_forward', 'filename': '/mnt/xarfuse/uid-416185/235d4caf-seed-nspid4026531836_cgpid32884718-ns-4026531840/torch/nn/modules/conv ``` Differential Revision: D55941362 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/124179 Approved by: https://github.com/zdevito	2024-05-15 14:19:40 +00:00
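The `user_defined` entries in the snippet above can be paired up to recover how long each annotation lasted. The following is a minimal sketch, assuming each annotation appears as a `START` entry followed later by an `END` entry with the same `filename`. `annotation_durations` is a hypothetical helper and not part of PyTorch; the sample entries are copied from the snippet above and trimmed to the keys the helper reads.

```python
def annotation_durations(trace):
    """Pair START/END user_defined entries; return (name, duration_us) tuples."""
    open_starts = {}  # annotation name -> time_us of its unmatched START
    durations = []
    for entry in trace:
        if entry["action"] != "user_defined":
            continue  # skip alloc/free and other allocator events
        frame = entry["frames"][0]
        label, name = frame["name"], frame["filename"]
        if label == "START":
            open_starts[name] = entry["time_us"]
        elif label == "END" and name in open_starts:
            durations.append((name, entry["time_us"] - open_starts.pop(name)))
    return durations


# Entries copied from the commit message above (trimmed); not new measurements.
trace = [
    {"action": "user_defined", "time_us": 1713558427168556,
     "frames": [{"name": "START", "filename": "ProfilerStep#0", "line": 0}]},
    {"action": "user_defined", "time_us": 1713558427168738,
     "frames": [{"name": "END", "filename": "ProfilerStep#0", "line": 0}]},
    {"action": "user_defined", "time_us": 1713558427168865,
     "frames": [{"name": "START", "filename": "ProfilerStep#1", "line": 0}]},
]

print(annotation_durations(trace))  # [('ProfilerStep#0', 182)]
```

`ProfilerStep#1` has no matching `END` in the truncated snippet, so the sketch leaves it unreported.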
Richard Barnes	ed327876f5	[codemod] `c10::optional` -> `std::optional` (#126135 ) Generated by running the following from PyTorch root: ``` find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/' ``` `c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi	2024-05-14 19:35:51 +00:00
cyy	6b0f61891f	[Clang-tidy header][25/N] Fix clang-tidy warnings and enable clang-tidy on c10/cuda/*.{cpp,h} (#121952 ) This PR enables clang-tidy to code in c10/cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121952 Approved by: https://github.com/Skylion007	2024-03-16 00:09:54 +00:00
cyy	97918e8c37	[Clang-tidy header][18/N] Enable clang-tidy on headers in torch/csrc/cuda (#118504 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118504 Approved by: https://github.com/albanD	2024-02-23 16:47:33 +00:00
Aaron Enye Shi	7973ac586d	[Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404 ) Summary: Include the CUDAAllocatorConfig at the time of snapshot into the snapshot file. These include adding variables: ``` double garbage_collection_threshold; size_t max_split_size; size_t pinned_num_register_threads; bool expandable_segments; bool release_lock_on_cudamalloc; bool pinned_use_cuda_host_register; std::string last_allocator_settings; std::vector<size_t> roundup_power2_divisions; ``` Test Plan: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ` produces ``` {'PYTORCH_CUDA_ALLOC_CONF': 'expandable_segments:True', 'max_split_size': -1, 'garbage_collection_threshold': 0.0, 'expandable_segments': True, 'pinned_num_register_threads': 1, 'release_lock_on_cudamalloc': False, 'pinned_use_cuda_host_register': False, 'roundup_power2_divisions': {'1': 0, '2': 0, '4': 0, '8': 0, '16': 0, '32': 0, '64': 0, '128': 0, '256': 0, '512': 0, '1024': 0, '2048': 0, '4096': 0, '8192': 0, '16384': 0, '32768': 0}} ``` `PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"` produces ``` {'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]', 'max_split_size': 2097152000, 'garbage_collection_threshold': 0.0, 'expandable_segments': False, 'pinned_num_register_threads': 1, 'release_lock_on_cudamalloc': False, 'pinned_use_cuda_host_register': False, 'roundup_power2_divisions': {'1': 1, '2': 1, '4': 1, '8': 1, '16': 1, '32': 1, '64': 1, '128': 1, '256': 1, '512': 2, '1024': 8, '2048': 8, '4096': 8, '8192': 8, '16384': 8, '32768': 8} } ``` Differential Revision: D53536199 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/119404 Approved by: https://github.com/zdevito	2024-02-17 01:16:37 +00:00
cyy	6da0e7f84b	[Clang-tidy header][17/N] Apply clang-tidy on headers in torch/csrc/cuda (#117829 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117829 Approved by: https://github.com/albanD	2024-01-26 13:33:24 +00:00
Aaron Enye Shi	3b80577212	[Memory Snapshot] Add timestamps to memory events collected in snapshots (#112266 ) Summary: Use the same clock as the profiler to collect the timestamps on when memory events occurred. Save these to the snapshot dicts as well, so that they can be saved with the raw memory events. Test Plan: CI Observed that trace_entry will now have time_us field, and it is ascending. For example: ``` trace entry: {'action': 'free_requested', 'addr': 140366476918784, 'size': 8192, 'stream': 0, 'time_us': 1698326576864190} trace entry: {'action': 'free_completed', 'addr': 140366476918784, 'size': 8192, 'stream': 0, 'time_us': 1698326576864190} trace entry: {'action': 'free_requested', 'addr': 140366476936192, 'size': 8192, 'stream': 0, 'time_us': 1698326576864194} trace entry: {'action': 'free_completed', 'addr': 140366476936192, 'size': 8192, 'stream': 0, 'time_us': 1698326576864194} trace entry: {'action': 'free_requested', 'addr': 140366641430528, 'size': 8192000, 'stream': 0, 'time_us': 1698326576864205} trace entry: {'action': 'free_completed', 'addr': 140366641430528, 'size': 8192000, 'stream': 0, 'time_us': 1698326576864205} trace entry: {'action': 'free_requested', 'addr': 140366403571712, 'size': 4000, 'stream': 0, 'time_us': 1698326576864209} trace entry: {'action': 'free_completed', 'addr': 140366403571712, 'size': 4000, 'stream': 0, 'time_us': 1698326576864209} ``` Differential Revision: D50602011 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/112266 Approved by: https://github.com/zdevito	2023-11-14 18:48:59 +00:00
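The two observations in the test plan above (that `time_us` is present and ascending) can be checked mechanically. This is a minimal sketch in plain Python operating on trace-entry dicts shaped like the ones printed above. `is_ascending` and `free_latencies` are hypothetical helper names, not PyTorch APIs, and the sample entries are copied from the commit message.

```python
def is_ascending(entries):
    """Return True if time_us never decreases from one entry to the next."""
    times = [e["time_us"] for e in entries]
    return all(a <= b for a, b in zip(times, times[1:]))


def free_latencies(entries):
    """Map addr -> microseconds between free_requested and free_completed."""
    requested = {}  # addr -> time_us of the pending free_requested
    latencies = {}
    for e in entries:
        if e["action"] == "free_requested":
            requested[e["addr"]] = e["time_us"]
        elif e["action"] == "free_completed" and e["addr"] in requested:
            latencies[e["addr"]] = e["time_us"] - requested.pop(e["addr"])
    return latencies


# First four entries from the commit message above.
entries = [
    {"action": "free_requested", "addr": 140366476918784, "size": 8192,
     "stream": 0, "time_us": 1698326576864190},
    {"action": "free_completed", "addr": 140366476918784, "size": 8192,
     "stream": 0, "time_us": 1698326576864190},
    {"action": "free_requested", "addr": 140366476936192, "size": 8192,
     "stream": 0, "time_us": 1698326576864194},
    {"action": "free_completed", "addr": 140366476936192, "size": 8192,
     "stream": 0, "time_us": 1698326576864194},
]

print(is_ascending(entries))
print(free_latencies(entries))
```

In this sample each `free_completed` carries the same timestamp as its `free_requested`, so every latency is zero.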
Zachary DeVito	cc54448a07	[memory snapshot] add 'address' key to block (#107171 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107171 Approved by: https://github.com/ngimel	2023-08-23 18:57:24 +00:00
Zachary DeVito	80988b6277	Introduce memory stacks for free (#106758 ) Previously when we recorded a free action in a memory trace, we would provide the stack for when the block was allocated. This is faster because we do not have to record stacks for free, which would otherwise double the number of stacks collected. However, sometimes knowing the location of a free is useful for figuring out why a tensor was live. So this PR adds this behavior. If performance ends up being a concern the old behavior is possible by passing "alloc" to the context argument rather than "all". Also refactors some of glue logic to be consistent across C++ and Python and routes the Python API through the C++ version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758 Approved by: https://github.com/albanD	2023-08-14 20:38:15 +00:00
Nikita Shulga	dfd441a12c	[BE] Use nested namespaces in `torch/csrc/cuda` (#106928 ) <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at 6b1dde1</samp> > _`namespace` syntax_ > _Simplified with C++17_ > _Code is more readable_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/106928 Approved by: https://github.com/huydhn, https://github.com/izaitsevfb	2023-08-10 03:56:09 +00:00
Zachary DeVito	3e5a52cedd	[memory snapshot] track context for segments (#106113 ) We want to display the stack for the original cudaMalloc that created a segment. Previously we could only report the last time the segment memory was used, or the record of the segment_alloc could appear in the list of allocator actions. This PR ensures that, regardless of whether we still have the segment_alloc action, the context for a segment is still available. The visualizer is updated to be able to incorporate this information. This PR adds a new field to Block. However the previous stacked cleanup PR removed a field of the same size, making the change to Block size-neutral. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113 Approved by: https://github.com/aaronenyeshi	2023-07-28 06:45:48 +00:00
Zachary DeVito	45b564766d	[memory snapshots] removed chained history (#106079 ) For free blocks of memory in the allocator, we previously kept a linked list of the stack frames of previous allocations that lived there. This was only ever used in one flamegraph visualization and never proved useful at understanding what was going on. When memory history tracing was added, it became redundant, since we can see the history of the free space from recording the previous actions anyway. This patch removes this functionality and simplifies the snapshot format: allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history. Previously the memory history tracked the real size of allocations before rounding. Since history was added, 'requested_size' has been added directly to the block which records the same information, so this patch also removes that redundancy. None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter this part of the format. This patch also updates our visualization tools to work with the simplified format. Visualization tools keep support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079 Approved by: https://github.com/eellison	2023-07-28 06:45:48 +00:00
Zachary DeVito	7ff1f3f3f6	Revert "Revert "Expandable blocks in allocator (#96995 )"" (#99275 ) This reverts commit `851e89c8e8`. Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275 Approved by: https://github.com/eellison	2023-04-17 23:46:08 +00:00
PyTorch MergeBot	851e89c8e8	Revert "Expandable blocks in allocator (#96995 )" This reverts commit `6a50b83b73`. Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests	2023-04-16 19:23:37 +00:00
Zachary DeVito	6a50b83b73	Expandable blocks in allocator (#96995 ) Common advice we give for handling memory fragmentation issues is to allocate a big block upfront to reserve memory which will get split up later. For programs with changing tensor sizes this can be especially helpful to avoid OOMs that happen the first time we see a new largest input and would otherwise have to allocate new segments. However the issue with allocating a block upfront is that it is nearly impossible to correctly estimate the size of that block. If too small, space in the block will run out and the allocator will allocate separate blocks anyway. Too large, and other non-PyTorch libraries might stop working because they cannot allocate any memory. This patch provides the same benefits as pre-allocating a block but without having to choose its size upfront. Using the cuMemMap-style APIs, it adds the ability to expand the last block in a segment when more memory is needed. Compared to universally using cudaMallocAsync to avoid fragmentation, this patch can fix this common fragmentation issue while preserving most of the existing allocator behavior. This behavior can be enabled and disabled dynamically. This should allow users to, for instance, allocate long-lived parameters and state in individual buffers, and put temporary state into the large expandable blocks, further reducing fragmentation. See inline comments for information about the implementation and its limitations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995 Approved by: https://github.com/eellison	2023-04-14 09:49:11 +00:00
Zachary DeVito	759e527ea1	Use internal symbolizer for FBCODE (#97172 ) Summary: addr2line is slow on fbcode binaries, so use the internal symbolizer pathway. Test Plan: sandcastle Differential Revision: D44227690 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97172 Approved by: https://github.com/eellison	2023-03-27 19:24:12 +00:00
Zachary DeVito	e74f70d212	Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 )"" (#96878 ) This reverts commit `e1ea584b1c`. Adds __has_include check to fix fbcode build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878 Approved by: https://github.com/ezyang	2023-03-16 04:12:54 +00:00
PyTorch MergeBot	e1ea584b1c	Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 )" This reverts commit `4e1060c609`. Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds	2023-03-15 13:28:41 +00:00
Zachary DeVito	4e1060c609	[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 ) This refactors the stack trace facility specific to memory profiling in python+cuda to make a generic facility to generate combined stack traces. The generic facility (combined_traceback.h) does not require python to be around to work, but will return python stacks if it is present. This facility is then used to add support for stack trace gathering in memory profiling that happens directly from C++. It is also used to expose a python API for gathering and symbolizing combined stacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541 Approved by: https://github.com/ezyang	2023-03-14 18:26:05 +00:00
Elias Ellison	d798de2b05	Checkpoint CUDA Allocator Private Pool State (#94653 ) Copying note from cuda caching allocator: ``` * Note [Checkpointing PrivatePoolState] * * Refer above to Note [Interaction with CUDA graph capture]. Allocations made * during graph capture are made from a separate private pool. During graph * capture allocations behave as usual. During graph replay the allocator * state does not change even as new tensors are created. The private pool * will not free its blocks to the main caching allocator until cuda graph use * is finished to prevent an allocation from eager clobbering the memory from * a live but unaccounted for tensor that was created during replay. * * `make_graphed_callables`, a series of separate callables chained in * successive cuda graphs, can share a memory pool because after a cuda graph * recording the allocations in the shared private pool exactly reflect the * tensors that are allocated. * * We would like to extend callable chaining to support a graphed callable * tree. In this scenario, we have a tree of callable chains which will be * captured with cuda graphs. In the diagram below, we have a tree with four * callables, A, B, C, and D. Suppose we have captured, and subsequently * replayed, A, B, and C. Then on a new invocation, we replay A and B, but * would now like to record D. At this point the private pool will not reflect * any of the live tensors created during graph replay. Allocations made * during a new recording with the pool could overwrite those live tensors. * * In order to record a new graph capture after replaying prior callables in * the tree, we need the allocator to reflect the state of the live tensors. * We checkpoint the state of the private after each recording, and then * reapply it when we are starting a new recording chain. Additionally, we * must free the allocations for any tensors that died between the end of our * previous graph replaying and our new recording (TODO). All of the allocated * segments that existed in the checkpointed state must still exist in the * pool. There may also exist new segments, which we will free (TODO : link * note [live tensors between iterations] when it exists). * * * ---------------> A ---------------> B ---------------> C * | * | * | * | * ---------------> D ``` A few TODOs: - need to add logic for freeing tensors that have died between a last replay and current new recording - Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors) The two scenarios above have not been exercised in the tests yet. Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653 Approved by: https://github.com/zdevito	2023-03-14 00:47:30 +00:00
c-odrin	54b7c7d5e9	Added requested_bytes to CUDA Caching Allocator Stats (#88575 ) Summary: The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce. We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag: - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead Test Plan: Added test case in caffe2/test/test_cuda.py Differential Revision: D40810674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575 Approved by: https://github.com/zdevito	2023-02-09 21:37:25 +00:00
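The comparison this commit suggests (requested bytes versus allocated bytes) reduces to a single ratio. The following is a minimal sketch, assuming the stats arrive as a flat dict with keys in the form `requested_bytes.all.current` and `allocated_bytes.all.current`. The `.all.` pool segment is an assumption about the key layout, `rounding_overhead` is a hypothetical helper, and the byte counts are made-up illustration values rather than measurements.

```python
def rounding_overhead(stats):
    """Fraction of extra bytes allocated beyond what client code requested."""
    requested = stats["requested_bytes.all.current"]
    allocated = stats["allocated_bytes.all.current"]
    if requested == 0:
        return 0.0  # nothing requested, so there is no overhead to report
    return (allocated - requested) / requested


# Hypothetical values: 1000 bytes requested, rounded up to 1024 by the allocator.
stats = {
    "requested_bytes.all.current": 1000,
    "allocated_bytes.all.current": 1024,
}

print(f"{rounding_overhead(stats):.1%}")  # 2.4%
```

A persistently large ratio would suggest that the `roundup_power2_divisions` setting rounds more aggressively than the workload benefits from.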
Zachary DeVito	91b1bae1df	Caching allocator tracing (#86241 ) We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions that the allocator took between snapshots. This PR adds a simple fixed-sized buffer that records the major actions that the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time. We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm. As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught). This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241 Approved by: https://github.com/ngimel	2022-10-07 23:19:54 +00:00
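The "simple fixed-sized buffer" idea can be illustrated in a few lines. The sketch below is a pure-Python analogy, not the C++ implementation from this PR: `TraceBuffer` and its methods are hypothetical names, and the behavior of dropping the oldest entry once the buffer is full is an assumption about how a fixed-size trace buffer would typically work.

```python
from collections import deque


class TraceBuffer:
    """Fixed-size record of allocator actions; oldest entries drop off first."""

    def __init__(self, max_entries):
        # deque with maxlen discards from the left once capacity is reached
        self._entries = deque(maxlen=max_entries)

    def record(self, action, addr, size):
        self._entries.append({"action": action, "addr": addr, "size": size})

    def snapshot(self):
        # Return a copy so later records do not mutate an earlier snapshot
        return list(self._entries)


buf = TraceBuffer(max_entries=2)
buf.record("SEGMENT_ALLOC", addr=0x1000, size=2097152)
buf.record("ALLOC", addr=0x1000, size=512)
buf.record("FREE", addr=0x1000, size=512)

print([e["action"] for e in buf.snapshot()])  # ['ALLOC', 'FREE']
```

With a capacity of two, the earliest `SEGMENT_ALLOC` record is evicted, which is why the commit message recommends a big enough trace buffer when correlating periodic snapshots.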
Zachary DeVito	736adc0808	Memory snapshots from C++ (#86190 ) Sometimes the driving process wants to save memory snapshots but isn't Python. Add a simple API to turn it on without python stack traces. It still saves to the same format for the visualization and summary scripts, using the C++ Pickler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86190 Approved by: https://github.com/ezyang	2022-10-05 07:36:39 +00:00

36 Commits