Yang Wang
a88d7d4268
[util] fetch logical count cpu ( #147413 )
...
Match the vCPU count that AWS reports for the instance:
after: 96, before: 48
Instance Ref: https://instances.vantage.sh/aws/ec2/g4dn.metal
before: https://hud.pytorch.org/utilization/13377376406/37360984234/1
after: https://hud.pytorch.org/utilization/13401543806/37435031356/1
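The jump from 48 to 96 is consistent with switching from a physical-core count to a logical-CPU (hardware thread) count. A minimal sketch of that distinction, assuming the script reads the count through `psutil`; the helper name `cpu_counts` is hypothetical, and the `os.cpu_count()` fallback exists only so the snippet runs where `psutil` is not installed:

```python
import os


def cpu_counts():
    """Return (logical, physical) CPU counts; physical may be None."""
    try:
        import psutil  # third-party; assumed to be what the monitor uses
    except ImportError:
        # os.cpu_count() reports logical CPUs only, so physical stays unknown.
        return os.cpu_count(), None
    # logical=True counts hardware threads, logical=False counts physical cores.
    return psutil.cpu_count(logical=True), psutil.cpu_count(logical=False)


logical, physical = cpu_counts()
print(f"logical={logical} physical={physical}")
```

On a machine with hyper-threading enabled, the logical count is typically twice the physical count, which is the number cloud providers list as vCPUs.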
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147413
Approved by: https://github.com/clee2000
2025-02-19 23:44:54 +00:00
Yang Wang
b0553cee6b
[Utilization] post-test-process workflow ( #145310 )
...
# Overview
Add a reusable workflow that triggers the post-test process right after each test job completes.
Related PRs that set up the runner permissions:
Add m fleet instances: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595/files
Add to lix fleet: https://github.com/pytorch/ci-infra/pull/322/files
The debug flag is currently turned on for testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145310
Approved by: https://github.com/huydhn
2025-02-13 18:51:19 +00:00
Yang Wang
fd73ae2068
[Utilization] Convert timestamp to str for datetime64 ( #145985 )
...
Convert every float timestamp to an int timestamp in the data pipeline for the datetime64 DB type.
Float values do not work when inserting into ClickHouse using JSONExtract.
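A minimal sketch of the conversion described above. The helper name and the record layout are hypothetical, and truncating with `int()` is an assumption: the commit does not say whether the pipeline truncates or rounds, and the title mentions a string conversion that is not covered here.

```python
def to_int_timestamp(ts: float) -> int:
    """Truncate a float epoch timestamp (seconds) to whole seconds."""
    return int(ts)


# Hypothetical record shape, used only for illustration.
record = {"level": "record", "timestamp": 1738616718.75}
record["timestamp"] = to_int_timestamp(record["timestamp"])
print(record)
```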
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145985
Approved by: https://github.com/huydhn
2025-02-03 21:05:18 +00:00
Yang Wang
a9ed7bd78e
[utilization] pipeline to create clean db records ( #145327 )
...
upload_utilization_script generates DB-ready records and uploads them to S3:
- generate two files, metadata and timeseries, in the ossci-utilization buckets
- convert log records to the DB format
- add a unit test job for tools/stats/
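The split into a metadata file and a timeseries file could look roughly like the sketch below. It assumes the raw log is one JSON object per line and that a `level` field distinguishes the metadata line from the data records; the function name and all field names are hypothetical, not taken from the script.

```python
import json


def split_log(lines):
    """Split raw JSON log lines into one metadata dict and a list of time-series rows."""
    metadata = None
    timeseries = []
    for line in lines:
        entry = json.loads(line)
        if entry.get("level") == "metadata":  # hypothetical discriminator field
            metadata = entry
        else:
            timeseries.append(entry)
    return metadata, timeseries


raw = [
    '{"level": "metadata", "workflow_id": 1, "job_id": 2}',
    '{"level": "record", "timestamp": 1738616718, "cpu_avg": 12.5}',
    '{"level": "record", "timestamp": 1738616723, "cpu_avg": 40.0}',
]
metadata, timeseries = split_log(raw)
print(metadata, len(timeseries))
```

Each of the two results would then be serialized and uploaded as its own object.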
Related PRs:
setup composite action for data pipeline: https://github.com/pytorch/pytorch/pull/145310
add permission for composite action to access S3 bucket: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595
add insert logic in s3 replicator: https://github.com/pytorch/test-infra/pull/6217
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145327
Approved by: https://github.com/huydhn
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-01-29 23:48:50 +00:00
Yang Wang
6d4f5f7688
[Utilization][Usage Log] Add data model for record ( #145114 )
...
Add a data model for consistency and to accommodate future data model changes.
The data model will be used by the post-test-process pipeline.
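As an illustration of what such a data model buys, here is a minimal sketch using `dataclasses`. The class and field names are hypothetical and are not the ones introduced by the PR.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class UtilizationStats:
    """Aggregated values for one metric over one output interval."""
    avg: Optional[float] = None
    max: Optional[float] = None


@dataclass
class UtilizationRecord:
    """One logged data point; every producer and consumer shares this shape."""
    timestamp: int
    cpu: UtilizationStats
    memory: UtilizationStats


rec = UtilizationRecord(
    timestamp=1737659081,
    cpu=UtilizationStats(avg=10.0, max=55.5),
    memory=UtilizationStats(avg=30.0, max=31.0),
)
print(asdict(rec))  # nested dataclasses are converted to nested dicts
```

Keeping the shape in one typed definition means the logging script and the downstream pipeline cannot silently disagree on field names.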
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114
Approved by: https://github.com/huydhn
2025-01-23 19:04:41 +00:00
Yang Wang
fea9d18d5a
[Utilization Log] Concurrently collect aggregate data during the output interval ( #143235 )
...
# overview
Add a worker that collects metrics at short intervals:
1. Worker: collects usage metrics every 500 ms by default (configurable).
2. Calculates the avg and max as one data point every 5 seconds by default.
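The two steps above can be sketched as a background sampling thread plus a reduction step. This is an illustration under assumptions, not the script's implementation: `SamplingWorker`, `aggregate`, and `sample_fn` are hypothetical names.

```python
import threading


class SamplingWorker:
    """Collect samples on a short interval in a background thread."""

    def __init__(self, sample_fn, interval_s=0.5):
        self._sample_fn = sample_fn
        self._interval_s = interval_s
        self._samples = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            value = self._sample_fn()
            with self._lock:
                self._samples.append(value)
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self._interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def drain(self):
        """Return the samples collected so far and reset the buffer."""
        with self._lock:
            samples, self._samples = self._samples, []
        return samples


def aggregate(samples):
    """Reduce one output interval's samples to a single avg/max data point."""
    if not samples:
        return {"avg": None, "max": None}
    return {"avg": sum(samples) / len(samples), "max": max(samples)}


print(aggregate([10.0, 20.0, 60.0]))  # {'avg': 30.0, 'max': 60.0}
```

The main loop would call `drain()` once per output interval (5 seconds by default) and log `aggregate(...)` of the result, so short spikes still show up in the max even though only one record is written.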
# Other
Clean up the log format to keep only what is needed; currently we do not need to track GPU processes etc., or all pids from psutil.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143235
Approved by: https://github.com/huydhn
2025-01-16 23:52:43 +00:00
Tom Ritchford
498a7808ff
Fix unused Python variables outside torch/ and test/ ( #136359 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
Yang Wang
b7a45dbae3
Add monitor script ( #141438 )
...
# Overview
Add monitor script to collect system-level utilization data during CI tests.
Currently all monitoring scripts are disabled.
# Details
- Add flag to customize the time intervals for logging
- Enable multiple GPU utilization logging
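A configurable logging interval is usually exposed as a command-line flag. The sketch below uses `argparse`; the flag name `--log-interval` and its default are assumptions for illustration and may not match the script's real flags.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="System-level utilization monitor")
    # Hypothetical flag name and default; check the script for the real ones.
    parser.add_argument(
        "--log-interval",
        type=float,
        default=5.0,
        help="seconds between log records",
    )
    return parser.parse_args(argv)


args = parse_args(["--log-interval", "2"])
print(args.log_interval)
```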
# Next step
Enable the monitor script in non-perf-test workflows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141438
Approved by: https://github.com/huydhn
2024-11-29 04:14:31 +00:00
Amin Alam
1266be21f4
deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix ( #136141 )
...
Fixes #136140
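For context, `datetime.utcnow()` is deprecated as of Python 3.12 and returns a naive datetime. The usual replacement is a timezone-aware call, sketched here with a hypothetical helper name `utc_now`:

```python
from datetime import datetime, timezone


def utc_now():
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


now = utc_now()
print(now.isoformat())
```

The aware value carries a `+00:00` offset in its ISO string, so code that compares it with naive datetimes needs to be updated as well.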
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141
Approved by: https://github.com/kwen2501
2024-09-24 07:26:10 +00:00
Xuehai Pan
8a67daf283
[BE][Easy] enable postponed annotations in tools ( #129375 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-29 09:23:35 +00:00
PyTorch MergeBot
a32ce5ce34
Revert "[BE][Easy] enable postponed annotations in tools ( #129375 )"
...
This reverts commit 59eb2897f1 .
Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374 , please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541 ))
2024-06-29 00:44:25 +00:00
Xuehai Pan
59eb2897f1
[BE][Easy] enable postponed annotations in tools ( #129375 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-28 15:37:54 +00:00
Xiaodong Wang
06934518a2
[AMD] Fix deprecated amdsmi api ( #126962 )
...
Summary: https://github.com/pytorch/pytorch/pull/119182 uses an API that has already been deprecated by c551c3caed. This change fixes it in a backward-compatible way.
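One common way to stay backward compatible across a renamed library call is to probe for the newer name and fall back to the older one. The sketch below shows only the pattern: the commit message does not name the affected function, so `new_call` and `old_call` are placeholders on a stand-in object, not real amdsmi functions.

```python
from types import SimpleNamespace


def call_compat(lib, handle):
    """Use the newer entry point if the library has it, else the deprecated one."""
    new_fn = getattr(lib, "new_call", None)  # placeholder name
    if new_fn is not None:
        return new_fn(handle)
    return lib.old_call(handle)  # placeholder name


# Stand-ins for an older and a newer release of the same library.
old_lib = SimpleNamespace(old_call=lambda h: ("old", h))
new_lib = SimpleNamespace(
    old_call=lambda h: ("old", h),
    new_call=lambda h: ("new", h),
)
print(call_compat(old_lib, 0), call_compat(new_lib, 0))
```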
Differential Revision: D57711088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126962
Approved by: https://github.com/eqy , https://github.com/izaitsevfb
2024-05-26 20:11:23 +00:00
Jack Taylor
d30cdc4321
[ROCm] amdsmi library integration ( #119182 )
...
Adds monitoring support for ROCm using amdsmi in place of pynvml.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily , https://github.com/malfet , https://github.com/xw285cornell
2024-05-21 01:59:26 +00:00
PyTorch MergeBot
0d4fdb0bb7
Revert "[ROCm] amdsmi library integration ( #119182 )"
...
This reverts commit 85447c41e3 .
Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit 85447c41e3 ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197 ))
2024-05-09 21:18:21 +00:00
Jack Taylor
85447c41e3
[ROCm] amdsmi library integration ( #119182 )
...
Adds monitoring support for ROCm using amdsmi in place of pynvml.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily , https://github.com/malfet , https://github.com/xw285cornell
2024-05-09 18:21:38 +00:00
BowenBao
60a68477a6
Bump black version to 23.1.0 ( #96578 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578
Approved by: https://github.com/ezyang
2023-03-15 06:27:59 +00:00
Jeff Daily
f11dc26ed5
[ROCm] tools/stats/monitor.py support ( #91732 )
...
Initial support for rocm-smi monitoring of GPU utilization. Works around difficulties of using the rocm-smi python bindings without having an explicit package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91732
Approved by: https://github.com/huydhn , https://github.com/pruthvistony
2023-01-05 18:34:11 +00:00
Huy Do
7c6fe21a38
Fix monitoring script for macos ( #88159 )
...
The monitoring script is currently failing with AccessDenied when trying to access uss memory on mac because [psutil.memory_full_info](https://psutil.readthedocs.io/en/latest/index.html?highlight=memory_full_info ) requires higher user privileges
Example failures:
* https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-12_9208104847.zip
* https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-m1-12_9207913759.zip
I could also make this script run with sudo, effectively granting this permission. But I'm not entirely sure that we need uss memory for mac, so gracefully handling the error looks nicer
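Graceful handling here can be as simple as catching the exception and leaving the field out. A minimal sketch, assuming the script reads memory through a `psutil.Process`-like object; the function name and the output keys are hypothetical, and whether the real fix omits the field or records a placeholder is not stated in this message. The `ImportError` fallback exists only so the snippet runs without `psutil`.

```python
try:
    import psutil

    AccessDenied = psutil.AccessDenied
except ImportError:  # keep the sketch runnable without psutil
    psutil = None
    AccessDenied = PermissionError


def memory_stats(proc):
    """Return RSS always, and USS only when the OS allows reading it."""
    stats = {"rss_memory": proc.memory_info().rss}
    try:
        # memory_full_info() may need elevated privileges, e.g. on macOS.
        stats["uss_memory"] = proc.memory_full_info().uss
    except AccessDenied:
        pass  # skip USS instead of failing the whole monitoring run
    return stats


if psutil is not None:
    print(memory_stats(psutil.Process()))
```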
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88159
Approved by: https://github.com/clee2000
2022-11-01 05:58:44 +00:00
Huy Do
795906f207
Add total GPU memory utilization ( #86250 )
...
Although we already have per process GPU memory usage, I'm curious to see what is the number for `gpu_utilization.memory` per https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html . Also fixing a tiny typo issue that has been bugging me for a while `total_gpu_utilizaiton`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86250
Approved by: https://github.com/ZainRizvi
2022-10-06 18:53:59 +00:00
Catherine Lee
6f2a88dd50
script to monitor memory + cpu utilization ( #82006 )
...
Add a python script that runs in the background during test jobs to log cpu + gpu memory usage and cpu utilization of python tests (really any python process) to a file and upload the file as an artifact.
I plan on using the GPU memory usage stats to better understand how to parallelize the tests, but it is easy to add other stats if people want them.
In the future, we want to add the ability to track network usage to see if we can decrease it. GPU utilization will also likely need to be improved.
Click the hud link to see uploaded usage log artifacts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82006
Approved by: https://github.com/huydhn
2022-07-25 16:53:31 +00:00