ollama/server
Jesse Gross f66216e399 ggml: Support heterogeneous KV cache layer sizes in memory estimation
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
2025-03-26 13:16:03 -07:00
..
internal server/internal/client/ollama: fix file descriptor management in Pull (#9931) 2025-03-21 16:16:38 -07:00
testdata/tools all: fix typos in documentation, code, and comments (#7021) 2024-12-10 12:58:06 -08:00
auth.go fix nil deref in auth.go 2024-07-26 14:14:48 -07:00
create_test.go server: validate local path on safetensor create (#9379) 2025-02-28 16:10:43 -08:00
create.go server: validate local path on safetensor create (#9379) 2025-02-28 16:10:43 -08:00
download.go server: increase timeout in stall detection from 5s to 30s (#8831) 2025-02-05 10:00:26 -08:00
fixblobs_test.go server: replace blob prefix separator from ':' to '-' (#3146) 2024-03-14 20:18:06 -07:00
fixblobs.go server: replace blob prefix separator from ':' to '-' (#3146) 2024-03-14 20:18:06 -07:00
images.go next ollama runner (#7913) 2025-02-13 16:31:21 -08:00
layer.go One corrupt manifest should not wedge model operations (#7515) 2024-11-05 14:21:45 -08:00
manifest_test.go One corrupt manifest should not wedge model operations (#7515) 2024-11-05 14:21:45 -08:00
manifest.go One corrupt manifest should not wedge model operations (#7515) 2024-11-05 14:21:45 -08:00
model_test.go Update the /api/create endpoint to use JSON (#7935) 2024-12-31 18:02:30 -08:00
model.go templates: add autotemplate for gemma3 (#9880) 2025-03-20 00:15:30 -07:00
modelpath_test.go server: more support for mixed-case model names (#8017) 2024-12-11 15:29:59 -08:00
modelpath.go server: more support for mixed-case model names (#8017) 2024-12-11 15:29:59 -08:00
prompt_test.go prompt: Don't trim whitespace from prompts 2024-12-09 11:02:55 -08:00
prompt.go gemma3: Allow multiple image in a single input 2025-03-14 15:38:54 -07:00
routes_create_test.go next ollama runner (#7913) 2025-02-13 16:31:21 -08:00
routes_delete_test.go Update the /api/create endpoint to use JSON (#7935) 2024-12-31 18:02:30 -08:00
routes_generate_test.go next ollama runner (#7913) 2025-02-13 16:31:21 -08:00
routes_list_test.go Update the /api/create endpoint to use JSON (#7935) 2024-12-31 18:02:30 -08:00
routes_test.go server/internal/client/ollama: hold DiskCache on Registry (#9463) 2025-03-02 20:55:44 -08:00
routes.go add verbose mode to the show command (#9640) 2025-03-13 14:24:27 -07:00
sched_test.go next ollama runner (#7913) 2025-02-13 16:31:21 -08:00
sched.go ggml: Support heterogeneous KV cache layer sizes in memory estimation 2025-03-26 13:16:03 -07:00
sparse_common.go Don't hard fail on sparse setup error 2024-08-09 12:16:19 -07:00
sparse_windows.go Don't hard fail on sparse setup error 2024-08-09 12:16:19 -07:00
upload.go server: always print upload/download part info (#8832) 2025-02-04 19:30:49 -08:00