Commit Graph

9 Commits

Author SHA1 Message Date
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Yanli Zhao
077e135ed6 add number of cuda retries into tracker (#92557)
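
The title is the whole description here. For context, a minimal sketch of where a CUDA retry count can be read from; how the PR actually wires this into the tracker may differ:

```python
import torch

def cuda_alloc_retries() -> int:
    """Number of times the CUDA caching allocator had to flush its cache and retry."""
    if not torch.cuda.is_available():
        return 0
    # "num_alloc_retries" is one of the counters reported by torch.cuda.memory_stats().
    return torch.cuda.memory_stats().get("num_alloc_retries", 0)
```
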
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92557
Approved by: https://github.com/fegin, https://github.com/mrshenli
2023-01-25 14:44:34 +00:00
Yanli Zhao
743c385543 refactor show_traces in memory_tracker (#90145)
refactor show_traces in memory_tracker so that it plots multiple figures, and so that it can load serialized stats and then plot figures from them
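
A rough sketch of such a plotting path, assuming the stats are pickled per memory category; the function name, file format, and figure layout below are assumptions, not the PR's actual code:

```python
import pickle
import matplotlib.pyplot as plt

def plot_memory_traces(stats_path: str) -> None:
    """Load serialized per-operator memory stats and draw one figure per category."""
    with open(stats_path, "rb") as f:
        stats = pickle.load(f)  # assumed layout: {"allocated": [...], "active": [...], "reserved": [...]}
    for name, trace in stats.items():
        plt.figure()  # a separate figure per memory category
        plt.plot(trace)
        plt.title(f"{name} memory per operator")
        plt.xlabel("operator index")
        plt.ylabel("memory (MB)")
    plt.show()
```
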
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90145
Approved by: https://github.com/rohan-varma
2023-01-03 22:10:15 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Yanli Zhao
2bac4d1fae [reland] add save and load stats in memory_tracker (#90510)
reland of https://github.com/pytorch/pytorch/pull/90144; this PR removes the temporary path "memory.trace" from the unit test
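
For illustration only, a throwaway temporary directory is the usual way to keep a trace file from dirtying the working copy; the real test and the tracker's method names may differ:

```python
import os
import tempfile

# Sketch: write the tracker's stats under a per-test temporary directory
# instead of a fixed "memory.trace" path, so nothing is left behind in the repo.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "memory.trace")
    # tracker.save_stats(path)  # assumed MemoryTracker method
    # tracker.load(path)        # assumed MemoryTracker method
```
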
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90510
Approved by: https://github.com/rohan-varma
2022-12-10 01:39:22 +00:00
PyTorch MergeBot
5f3ca208c5 Revert "add save and load stats in memory_tracker (#90144)"
This reverts commit 1f137c1e2f.

Reverted https://github.com/pytorch/pytorch/pull/90144 on behalf of https://github.com/ezyang due to dirty git working copy broke master
2022-12-08 05:16:56 +00:00
Yanli Zhao
1f137c1e2f add save and load stats in memory_tracker (#90144)
add save and load of stats in memory_tracker, so that users can plot the traces somewhere else rather than only inside the trainer
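
A minimal usage sketch of the save-then-plot-elsewhere flow; the module path and method names (`start_monitor`, `stop`, `save_stats`, `load`, `show_traces`) are assumptions inferred from this description and may not match the merged API exactly:

```python
import torch
import torch.nn as nn
from torch.distributed._tools.memory_tracker import MemoryTracker  # assumed module path

model = nn.Linear(128, 128).cuda()
batch = torch.randn(32, 128, device="cuda")

# Inside the trainer: trace one iteration, then persist the stats to disk.
tracker = MemoryTracker()
tracker.start_monitor(model)
model(batch).sum().backward()
tracker.stop()
tracker.save_stats("memory_stats.pickle")

# Elsewhere (e.g. a notebook): reload the stats and plot the traces.
viewer = MemoryTracker()
viewer.load("memory_stats.pickle")
viewer.show_traces()
```
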
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90144
Approved by: https://github.com/rohan-varma
2022-12-08 00:17:21 +00:00
Yanli Zhao
a0c7b88861 remove backward hook in memory_tracker (#90143)
remove the backward hook in memory_tracker, as it does not work well with jagged tensors in some cases; it is OK to remove this hook for now, since it does not really track any stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90143
Approved by: https://github.com/rohan-varma
2022-12-06 05:39:59 +00:00
Yanli Zhao
91899a9ebd add memory_tracker tool to help profiling memory usages (#88825)
Adding a memory_tracker API to show operator-level memory traces for allocated_memory, active_memory, and reserved memory stats; it also gives a summary of the top 20 operators that generate memory.
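
A hedged end-to-end sketch of how such a tracker could be driven (assumes a CUDA device and torchvision; the module path and method names are inferred from this description rather than verified against the merged code):

```python
import torch
import torchvision
from torch.distributed._tools.memory_tracker import MemoryTracker  # assumed module path

model = torchvision.models.resnet34().cuda()
inp = torch.randn(8, 3, 224, 224, device="cuda")

tracker = MemoryTracker()
tracker.start_monitor(model)   # assumed: installs the dispatch mode and module hooks
model(inp).sum().backward()
tracker.stop()
tracker.summary(top=20)        # assumed: prints the top-20 memory-producing operators
tracker.show_traces()          # assumed: plots allocated/active/reserved memory traces
```
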

The implementation mainly uses TorchDispatchMode and module hooks to collect traces and add markers.
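
For context on that mechanism, a simplified TorchDispatchMode that records CUDA memory after every dispatched operator (a sketch of the idea, not the tracker's actual implementation; assumes a CUDA device):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class MemoryDispatchMode(TorchDispatchMode):
    """Record allocated/reserved CUDA memory (in MB) after each dispatched op."""

    def __init__(self):
        super().__init__()
        self.traces = []  # (op name, allocated MB, reserved MB)

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        self.traces.append((
            str(func),
            torch.cuda.memory_allocated() / 2**20,
            torch.cuda.memory_reserved() / 2**20,
        ))
        return out

# Run a forward/backward pass under the mode, then inspect the recorded traces.
with MemoryDispatchMode() as mode:
    x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
    (x @ x).sum().backward()
print(mode.traces[:5])
```
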

Follow-up PRs will:
1. allow tracing more than 1 iteration
2. dump JSON data for visualization
3. add a unit test for DDP training
4. add a unit test for FSDP training
5. add a unit test for activation checkpointing + DDP/FSDP training
6. add traces for activation memory and the top operators that generate activation memory
7. print summaries for more breakdowns such as model size, optimizer states, etc.
8. add traces for temporary memory, or memory consumed by CUDA streams or the NCCL library, if possible
9. connect the tool with OOM memory debugging
10. add a dynamic programming (DP) algorithm to find the best activation checkpointing locations based on the operator-level activation memory traces
11. add the same traces & DP algorithm for module-level memory stats, since FSDP wrapping depends on module-level memory; users who are not the model authors need module-level memory traces if they have to apply activation checkpointing at the module level

======================================================

Current test result for memory_tracker_example.py in a notebook:

Top 20 ops that generate memory are:
bn1.forward.cudnn_batch_norm.default_0: 98.0009765625MB
maxpool.forward.max_pool2d_with_indices.default_0: 74.5MB
layer1.0.conv1.backward.max_pool2d_with_indices_backward.default_0: 49.0MB
layer1.0.bn1.forward.cudnn_batch_norm.default_1: 24.5009765625MB
layer1.0.bn2.forward.cudnn_batch_norm.default_2: 24.5009765625MB
layer1.1.bn1.forward.cudnn_batch_norm.default_3: 24.5009765625MB
layer1.1.bn2.forward.cudnn_batch_norm.default_4: 24.5009765625MB
layer1.2.bn1.forward.cudnn_batch_norm.default_5: 24.5009765625MB
layer1.2.bn2.forward.cudnn_batch_norm.default_6: 24.5009765625MB
layer1.0.conv1.forward.convolution.default_1: 24.5MB
layer1.0.conv2.forward.convolution.default_2: 24.5MB
layer1.1.conv1.forward.convolution.default_3: 24.5MB
layer1.1.conv2.forward.convolution.default_4: 24.5MB
layer1.2.conv1.forward.convolution.default_5: 24.5MB
layer1.2.conv2.forward.convolution.default_6: 24.5MB
maxpool.backward.threshold_backward.default_32: 23.5MB
layer2.0.downsample.backward.convolution_backward.default_26: 12.2802734375MB
layer2.0.bn1.forward.cudnn_batch_norm.default_7: 12.2509765625MB
layer2.0.bn2.forward.cudnn_batch_norm.default_8: 12.2509765625MB
layer2.0.downsample.1.forward.cudnn_batch_norm.default_9: 12.2509765625MB

Screenshot of the plotted memory traces (Screen Shot 2022-11-10 at 10 03 06 AM): https://user-images.githubusercontent.com/48731194/201172577-ddfb769c-fb0f-4962-80df-92456b77903e.png

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88825
Approved by: https://github.com/awgu
2022-11-29 06:42:57 +00:00