Runtime Estimator for estimating GPU compute time (#134243)
This PR adds a basic Runtime Estimator for single-device models.
It estimates GPU runtime in milliseconds using various estimation methods under ``FakeTensorMode``.
It provides a ``TorchDispatchMode``-based context manager that estimates the eager runtime of PyTorch functions. It supports two estimation modes: benchmarking (`operator-level-benchmark`) and roofline cost modeling (`operator-level-cost-model`).
For modules executed under this context manager, it aggregates the forward and backward operation runtimes and records their execution order.
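
For intuition, the roofline mode bounds an operator's runtime by whichever of compute throughput or memory bandwidth is the bottleneck. Below is a minimal sketch of that idea (not ``RuntimeEstimator``'s actual implementation; the peak-throughput constants are placeholder values for an A100-class GPU):

```python
# Minimal roofline sketch -- NOT RuntimeEstimator's implementation.
# Peak numbers are placeholder values for an A100-class GPU.
PEAK_FLOPS_PER_S = 312e12   # assumed peak bf16 tensor-core throughput
PEAK_BYTES_PER_S = 2.0e12   # assumed peak HBM bandwidth

def roofline_time_ms(flops: float, bytes_moved: float) -> float:
    # An op cannot finish faster than its compute or its memory traffic
    # allows, so the estimate is the larger of the two times.
    return max(flops / PEAK_FLOPS_PER_S, bytes_moved / PEAK_BYTES_PER_S) * 1e3

# Example: a 4096 x 4096 x 4096 bf16 matmul.
flops = 2 * 4096**3                # one multiply + one add per MAC
bytes_moved = 3 * 4096 * 4096 * 2  # read A and B, write C, 2 bytes/element
print(f"roofline estimate: {roofline_time_ms(flops, bytes_moved):.3f} ms")
```

The full example below exercises both estimation modes on a small Transformer and then measures the actual runtime for comparison: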

```python
import torch
from torch import nn, optim
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    def _train_step(
        model: nn.Module,
        optimizer: optim.Optimizer,
        inp: torch.Tensor,
    ):
        out = model(inp)
        loss = out.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 32, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    runtime_estimator = RuntimeEstimator()

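    # Everything below runs on fake tensors: shapes and dtypes are tracked,
    # but no CUDA kernels are launched and no real memory is allocated.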
    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
        inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
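        # Mode 1: estimate each operator's runtime by benchmarking it.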
        with runtime_estimator("operator-level-benchmark"):
            _train_step(model, optimizer, inp)
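        # Mode 2: estimate with a roofline-style analytical cost model.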
        with runtime_estimator("operator-level-cost-model"):
            _train_step(model, optimizer, inp)

    # Actual model runtime
    with torch.device(dev):
        model = Transformer(model_args)
    optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
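    # Measure the real runtime with CUDA events, averaging over post-warmup iterations.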
    warmup_iters, actual_iters = 2, 5
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup_iters):
        _train_step(model, optimizer, inp)
    start_event.record()
    for _ in range(actual_iters):
        _train_step(model, optimizer, inp)
    end_event.record()
    torch.cuda.synchronize()
    measured_time = start_event.elapsed_time(end_event) / actual_iters
    print(f"Actual total_time: {measured_time:.3f} ms")
```

![Runtime estimation results](https://github.com/user-attachments/assets/04d243c9-21a6-4389-8c20-80958980788c)

@weifengpy @xuanzhang816 @gnadathur

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134243
Approved by: https://github.com/weifengpy