Commit Graph

13 Commits

Author SHA1 Message Date
Amin Alam
1266be21f4 deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix (#136141)
Fix to #136140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141
Approved by: https://github.com/kwen2501
2024-09-24 07:26:10 +00:00
Xuehai Pan
8a67daf283 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-29 09:23:35 +00:00
PyTorch MergeBot
a32ce5ce34 Revert "[BE][Easy] enable postponed annotations in tools (#129375)"
This reverts commit 59eb2897f1.

Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:25 +00:00
Xuehai Pan
59eb2897f1 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-28 15:37:54 +00:00
Xiaodong Wang
06934518a2 [AMD] Fix deprecated amdsmi api (#126962)
Summary: https://github.com/pytorch/pytorch/pull/119182 uses an API that has already been deprecated by c551c3caed. So fixing this in a backward compatible way

Differential Revision: D57711088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126962
Approved by: https://github.com/eqy, https://github.com/izaitsevfb
2024-05-26 20:11:23 +00:00
Jack Taylor
d30cdc4321 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-21 01:59:26 +00:00
PyTorch MergeBot
0d4fdb0bb7 Revert "[ROCm] amdsmi library integration (#119182)"
This reverts commit 85447c41e3.

Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit 85447c41e3 ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))
2024-05-09 21:18:21 +00:00
Jack Taylor
85447c41e3 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-09 18:21:38 +00:00
BowenBao
60a68477a6 Bump black version to 23.1.0 (#96578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578
Approved by: https://github.com/ezyang
2023-03-15 06:27:59 +00:00
Jeff Daily
f11dc26ed5 [ROCm] tools/stats/monitor.py support (#91732)
Initial support for rocm-smi monitoring of GPU utilization.  Works around difficulties of using the rocm-smi python bindings without having an explicit package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91732
Approved by: https://github.com/huydhn, https://github.com/pruthvistony
2023-01-05 18:34:11 +00:00
Huy Do
7c6fe21a38 Fix monitoring script for macos (#88159)
The monitoring script is currently failing with AccessDenied when trying to access uss memory on mac because [psutil.memory_full_info](https://psutil.readthedocs.io/en/latest/index.html?highlight=memory_full_info) requires higher user privileges

Example failures:
* https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-12_9208104847.zip
* https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-m1-12_9207913759.zip

I could also make this script run with sudo, effectively granting this permission. But I'm not entirely sure that we need uss memory for mac, so gracefully handling the error looks nicer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88159
Approved by: https://github.com/clee2000
2022-11-01 05:58:44 +00:00
Huy Do
795906f207 Add total GPU memory utilization (#86250)
Although we already have per process GPU memory usage, I'm curious to see what is the number for `gpu_utilization.memory` per https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html.  Also fixing a tiny typo issue that has been bugging me for a while `total_gpu_utilizaiton`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86250
Approved by: https://github.com/ZainRizvi
2022-10-06 18:53:59 +00:00
Catherine Lee
6f2a88dd50 script to monitor memory + cpu utilization (#82006)
Add a python script that runs in the background during test jobs to log cpu + gpu memory usage and cpu utilization of python tests (really any python process) to a file and upload the file as an artifact.

I plan on using the the gpu memory usage stats to better understand how to parallelize them, but it is easy to add on other stats if people want them.

In the future, we want to add the ability to track network usage to see if we can decrease it.  GPU utilization will also likely need to be improved.

Click the hud link to see uploaded usage log artifacts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82006
Approved by: https://github.com/huydhn
2022-07-25 16:53:31 +00:00