pytorch/torch/testing/_internal
Aaron Orenstein 524fe784ec BundledAutotuneCache (take 2) (#137902)
Summary:
Add a cache that combines the individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a bundle cache hit we copy the individual results into the local caches so they can be retrieved later.
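For illustration, the bundling idea can be sketched roughly as below. The class and method names here are hypothetical stand-ins, not the actual inductor classes: many per-kernel autotune results are written under one bundle key, and a bundle hit fans the entries back out into the local per-kernel cache.

```python
class LocalAutotuneCache:
    """Stand-in for the per-kernel local autotune cache."""
    def __init__(self):
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = value

    def get(self, key):
        return self._entries.get(key)


class BundledAutotuneCache:
    """Stores many autotune results under a single bundle key."""
    def __init__(self, local_cache):
        self._bundles = {}
        self._local = local_cache

    def put_bundle(self, bundle_key, entries):
        # One remote write for the whole compilation instead of one
        # write per kernel.
        self._bundles[bundle_key] = dict(entries)

    def get_bundle(self, bundle_key):
        bundle = self._bundles.get(bundle_key)
        if bundle is not None:
            # On a hit, copy the individual results back into the local
            # cache so later per-kernel lookups still succeed.
            for key, value in bundle.items():
                self._local.put(key, value)
        return bundle
```

In this sketch a warm run does one `get_bundle` lookup and then serves every per-kernel query locally, which is the mechanism behind the hit/miss counts shown in the metrics below.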

Attempt 2 of #134959 (D60677499).

Various configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version
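As one way to toggle the feature (a config fragment, assuming the env var above follows the usual inductor convention of a truthy value enabling it):

```shell
# Enable the bundled autotune remote cache for this run
export TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE=1
```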

Test Plan:
unit tests

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- on a cold run we got 0 hits and 40 misses; on a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable
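The 125ms figure above is just the averaged run-time delta spread over the 40 cache entries:

```python
# 8-run averages from the measurements above, in seconds
no_bundle = 14 * 60 + 11   # 14m11s without the bundled cache
bundled = 14 * 60 + 6      # 14m6s with the bundled cache
entries = 40               # cache entries per warm run

saved_per_entry_ms = (no_bundle - bundled) / entries * 1000
print(saved_per_entry_ms)  # 125.0
```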

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<<
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D64336043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902
Approved by: https://github.com/oulgen
2024-10-15 18:39:47 +00:00
codegen
data
distributed Fixed error string assertion in test_invalid_devices (#137772) 2024-10-13 18:10:07 +00:00
generated
opinfo [CPU] Expand torch.special.i1 to Half and BF16 (#137899) 2024-10-15 17:00:58 +00:00
optests [aotd] Fix rrelu compilation (#136008) 2024-09-25 11:26:19 +00:00
test_module
__init__.py
autocast_test_lists.py Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936) 2024-09-18 16:31:27 +00:00
autograd_function_db.py
check_kernel_launches.py
common_cuda.py BundledAutotuneCache (take 2) (#137902) 2024-10-15 18:39:47 +00:00
common_device_type.py [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947) 2024-10-12 13:21:20 +00:00
common_dist_composable.py
common_distributed.py Revert "[Distributed] Fix extra context on device 0 (#135273)" 2024-10-10 23:47:25 +00:00
common_dtype.py [redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341) 2024-10-07 00:58:51 +00:00
common_fsdp.py Generalization of FSDP common for non-cuda execution (#133209) 2024-09-27 00:38:10 +00:00
common_jit.py
common_methods_invocations.py [CPU] Expand torch.special.i1 to Half and BF16 (#137899) 2024-10-15 17:00:58 +00:00
common_mkldnn.py
common_modules.py Revert "Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)" 2024-09-13 18:06:56 +00:00
common_nn.py
common_optimizers.py Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107) 2024-10-14 19:24:44 +00:00
common_pruning.py
common_quantization.py Change to export_for_training in XNNPACK tests (#137238) 2024-10-03 21:28:05 +00:00
common_quantized.py
common_subclass.py Fix wrapper subclass serialization with custom sizes / strides (#137030) 2024-10-02 18:55:03 +00:00
common_utils.py Unify cpp_extension build directory removal (#136059) 2024-10-03 06:22:11 +00:00
composite_compliance.py Ensure noncontiguous tensor creation tests offsetting (#136396) 2024-10-02 00:40:43 +00:00
custom_op_db.py
custom_tensor.py
dist_utils.py
dynamo_test_failures.py
hop_db.py [FlexAttention] Add Better error message for cpu tensors (#136673) 2024-09-26 16:40:21 +00:00
hypothesis_utils.py
inductor_utils.py [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947) 2024-10-12 13:21:20 +00:00
jit_metaprogramming_utils.py
jit_utils.py
logging_tensor.py
logging_utils.py
quantization_torch_package_models.py
static_module.py
torchbind_impls.py
triton_utils.py
two_tensor.py