## Motivation

Many PRs that optimize samplers (e.g. https://github.com/pytorch/pytorch/pull/147706 and https://github.com/pytorch/pytorch/pull/137423) rely on an ad hoc script for benchmarking, and the script and its outputs are copied into each PR. We want to begin centralizing benchmarks for `torch.utils.data` components.

## What?

* This PR adds a new `data` sub-folder under `benchmarks`, intended to hold benchmarking scripts for `torch.utils.data` components such as the dataloader and samplers.
* Specifically, this PR includes a simple script to time samplers. This script is often copy-pasted into PRs that optimize samplers; keeping it in a central location should prevent that and allow a common standard.

## Output

```
Benchmark Results:
+--------------+-------------+----------------+-----------+-----------+
|   Batch Size | Drop Last   |   Original (s) |   New (s) | Speedup   |
+==============+=============+================+===========+===========+
|            4 | True        |          0.004 |    0.0088 | -119.62%  |
+--------------+-------------+----------------+-----------+-----------+
|            4 | False       |         0.0083 |     0.009 | -9.23%    |
+--------------+-------------+----------------+-----------+-----------+
|            8 | True        |          0.003 |    0.0074 | -147.64%  |
+--------------+-------------+----------------+-----------+-----------+
|            8 | False       |         0.0054 |    0.0075 | -38.72%   |
+--------------+-------------+----------------+-----------+-----------+
|           64 | True        |         0.0021 |    0.0056 | -161.92%  |
+--------------+-------------+----------------+-----------+-----------+
|           64 | False       |         0.0029 |    0.0055 | -92.50%   |
+--------------+-------------+----------------+-----------+-----------+
|          640 | True        |          0.002 |    0.0055 | -168.75%  |
+--------------+-------------+----------------+-----------+-----------+
|          640 | False       |         0.0024 |    0.0062 | -161.35%  |
+--------------+-------------+----------------+-----------+-----------+
|         6400 | True        |         0.0021 |    0.0055 | -160.13%  |
+--------------+-------------+----------------+-----------+-----------+
|         6400 | False       |         0.0021 |    0.0068 | -215.46%  |
+--------------+-------------+----------------+-----------+-----------+
|        64000 | True        |         0.0042 |    0.0065 | -55.29%   |
+--------------+-------------+----------------+-----------+-----------+
|        64000 | False       |         0.0029 |    0.0077 | -169.56%  |
+--------------+-------------+----------------+-----------+-----------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156974
Approved by: https://github.com/ramanishsingh
63 lines
1.9 KiB
Markdown
# PyTorch Data Benchmarks
This directory contains benchmarks for the `torch.utils.data` module components, focusing on the performance of samplers.
## Dependencies
The benchmarks require the following dependencies:
```
numpy
tabulate
```
You can install them using pip:
```bash
pip install numpy tabulate
```
## Running the benchmarks
To run the `BatchSampler` benchmark:
```bash
python samplers_benchmark.py
```
## Sampler Benchmark
As an example, the `samplers_benchmark.py` script benchmarks the performance of PyTorch's `BatchSampler` against an alternative implementation. It tests the following parameters:
- Batch sizes: 4, 8, 64, 640, 6400, 64000
- Drop last options: True, False
- Each configuration is run 10 times and averaged
- Results include speedup percentage calculations
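The per-configuration timing loop can be sketched as follows (a minimal sketch; `time_sampler` and its arguments are illustrative names, not the script's actual API):

```python
import time

def time_sampler(make_sampler, num_runs=10):
    """Average wall-clock seconds to fully iterate a freshly built sampler."""
    total = 0.0
    for _ in range(num_runs):
        sampler = make_sampler()  # rebuild so each run iterates from scratch
        start = time.perf_counter()
        for _batch in sampler:
            pass  # consuming the iterator is the work being measured
        total += time.perf_counter() - start
    return total / num_runs

# With torch installed, a BatchSampler configuration could be timed like:
# from torch.utils.data import BatchSampler, SequentialSampler
# avg = time_sampler(lambda: BatchSampler(SequentialSampler(range(100000)), 64, True))
```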
### Output
The benchmark outputs a table with the following columns:
- Batch Size
- Drop Last
- Original (s): Time taken by the original implementation
- New (s): Time taken by the alternative implementation
- Speedup: Percentage improvement of the new implementation over the original
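The Speedup column follows the usual relative-improvement convention, so a negative value means the new implementation is slower. A one-line sketch of the calculation (the function name is illustrative):

```python
def speedup_pct(original_s: float, new_s: float) -> float:
    """Percentage improvement of new over original; negative means a slowdown."""
    return (original_s - new_s) / original_s * 100

# e.g. going from 0.1234 s to 0.1000 s is an 18.96% speedup
print(round(speedup_pct(0.1234, 0.1000), 2))
```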
Example output:
```
+------------+-----------+---------------+----------+---------+
| Batch Size | Drop Last | Original (s) | New (s) | Speedup |
+============+===========+===============+==========+=========+
| 4 | True | 0.1234 | 0.1000 | 18.96% |
+------------+-----------+---------------+----------+---------+
| 4 | False | 0.1345 | 0.1100 | 18.22% |
+------------+-----------+---------------+----------+---------+
...
```
### Extending the Benchmark
To benchmark a different implementation:
Locally:

1. Modify the `NewBatchSampler` class in `samplers_benchmark.py` with your implementation, and similarly replace `BatchSampler` with the corresponding PyTorch implementation you are comparing against.
   * Be sure to include all inputs, such as `replacement` for `RandomSampler` and its variants.
2. Run the benchmark to compare its performance against the original.
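As an illustration of step 1, a replacement class only needs to accept the same `(sampler, batch_size, drop_last)` arguments and yield lists of indices. The sketch below uses `itertools.islice` and does not depend on torch; it mimics `BatchSampler`'s interface but is not claimed to be faster:

```python
from itertools import islice

class NewBatchSampler:
    """Drop-in stand-in for torch.utils.data.BatchSampler built on islice."""

    def __init__(self, sampler, batch_size, drop_last):
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        it = iter(self.sampler)
        while True:
            batch = list(islice(it, self.batch_size))
            if not batch:
                return  # sampler exhausted
            if len(batch) < self.batch_size and self.drop_last:
                return  # discard the short trailing batch
            yield batch

batches = list(NewBatchSampler(range(10), batch_size=4, drop_last=False))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```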