ollama

mirror of https://github.com/zebrajr/ollama.git synced 2025-12-06 00:19:51 +01:00

Author	SHA1	Message	Date
youzichuan	bb71654ebe	chore: fix some inconsistent function name in comment Signed-off-by: youzichuan <youzichuan6@outlook.com>	2025-08-13 09:50:27 -07:00
Michael Yang	d0cf6c8281	fix(openai): handle reasoning_effort (#11868 )	2025-08-12 11:02:01 -07:00
Devon Rifkin	dbfd7bd027	Merge pull request #11861 from ollama/drifkin/fix-parsing-error server: fix error when parsing bad harmony tool calls	2025-08-11 14:59:57 -07:00
Devon Rifkin	ee04dbba51	server: fix error when parsing bad harmony tool calls Thanks @moll for reporting! Fixes: #11781	2025-08-11 14:09:13 -07:00
Daniel Andersen	ea7657b54a	sched: Add support for grouping GPUs (#10678 ) This patch modifies Ollama to allow grouping GPUs to memory-fit to the requested model, instead of the former algorithm of using one GPU distributing over all available GPUs. Benefits: - Lower amount of (PCIe-)bus communication between GPUs - especially when they are not very high speed - Allowing unallocated GPUs to get into power-saving mode. - Significantly reduce VRAM allocation when using more than 2 GPUs in a system - Due to the reduced memory allocation, you can run more models simultaneously.	2025-08-11 13:59:38 -07:00
Jesse Gross	f2e9c9aff5	server: Reduce gpt-oss context length for small VRAM GPUs gpt-oss works best with a context length of at least 8k. However, for GPUs with limited amount of VRAM, there is a significant performance hit to this increased context. In these cases, we switch to the Ollama default of 4k	2025-08-07 14:23:55 -07:00
Devon Rifkin	30f8a68c4c	tools: support anyOf types afaik gpt-oss is the first model that meaningfully transforms tool function definitions in its template. We found that relatively common definitions that include `anyOf` were not working because the template was assuming that types were always defined via a `type` field. anyOf allows for fully recursive types, so I exposed a `toTypeScriptType()` function to handle this recursive logic in go and keep the templates cleaner. The gpt-oss templates will need to be updated to use this. We should keep building out our function definition support to more fully support the parts of json schema that make sense for this use case, but in the meantime this will unblock some users (e.g., zed's ollama integration w/ gpt-oss). Probably the most urgent is proper array support	2025-08-05 16:46:24 -07:00
Michael Yang	fa7776fd24	gpt-oss (#11672 ) * bf16 * tests * gpt-oss * enable gptoss for engine * rough estimate * convert to mxfp4 * handle safetensors U8 * clamp glu/linear * update tokenizer * MXFP4 support This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal. * Unit tests for MXFP4 support This exercises various operations and shapes on both CPU and GPU (if detected on the system) * cuda graph * unit test adjustments * cuda: optimize memory access Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4 * mac: fix crash on old macos versions cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to conditionally avoid registering the backend. * server: Minimum context length for gptoss This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset. * ggml: Multiply by numParallel for gptoss sliding window When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account. * gpt-oss integration includes harmony parser and thinking levels, etc. * fix sync * fix tests * fix lint --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com> Co-authored-by: Jesse Gross <jesse@ollama.com> Co-authored-by: Devon Rifkin <drifkin@drifkin.net>	2025-08-05 12:21:16 -07:00
minxinyi	1e6eab5c33	server: use slices.Equal to simplify code (#11502 )	2025-07-23 14:25:39 -07:00
Patrick Devine	3bac5cba60	Fix GetModelInfo (#11496 ) --------- Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2025-07-22 13:40:47 -07:00
Daniel Hiltgen	20c3266e94	Reduce default parallelism to 1 (#11330 ) The current scheduler algorithm of picking the parallelism based on available VRAM complicates the upcoming dynamic layer memory allocation algorithm. This changes the default to 1, with the intent going forward that parallelism is explicit and will no longer be dynamically determined. Removal of the dynamic logic will come in a follow up.	2025-07-08 12:08:37 -07:00
Daniel Hiltgen	34088dbcfb	API/CLI context enhancements (#11331 ) * API: expose context size of loaded models * CLI: add context UX This adds a column in the ps output to show the models context size.	2025-07-08 11:59:06 -07:00
Michael Yang	d0b32def60	skip quantizing per_layer_token_embd (#11207 ) this tensor isn't compatible with cuda when quantized to q4_K so skip it	2025-06-26 21:49:35 -07:00
Devon Rifkin	b2b270ad5d	Merge branch 'main' into drifkin/array-head-count-simple	2025-06-23 10:37:31 -07:00
Michael Yang	0a066cfd91	Reapply "feat: incremental gguf parser (#10822 )" (#11114 ) (#11119 ) * Reapply "feat: incremental gguf parser (#10822)" (#11114) This reverts commit `a6e64fbdf2`. * fix older ggufs	2025-06-20 11:11:40 -07:00
Jeffrey Morgan	a6e64fbdf2	Revert "feat: incremental gguf parser (#10822 )" (#11114 ) This reverts commit `6b04cad7e8`.	2025-06-18 05:42:44 -07:00
曹家巧	60cfa2a203	cache: fix comment function name in cache.go (#11110 )	2025-06-18 05:21:45 -07:00
Jeffrey Morgan	9f8a18ec05	tools: loosen tool parsing to allow for more formats (#11030 )	2025-06-12 14:18:54 -07:00
Michael Yang	6b04cad7e8	feat: incremental gguf parser (#10822 ) * incremental gguf parser * gguf: update test to not rely on gguf on disc * re-use existing create gguf * read capabilities from gguf kv * kv exists * update tests * s/doneFunc/successFunc/g * new buffered reader --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2025-06-12 11:04:11 -07:00
Jeffrey Morgan	09d308d6b6	Revert "server: add model capabilities to the list endpoint (#10174 )" (#11004 ) This reverts commit `0943001193`.	2025-06-06 23:29:14 -04:00
Devon Rifkin	a3b6886b7d	move thinking logic into its own package (#10990 ) move thinking logic into its own package	2025-06-06 12:02:20 -07:00
Devon Rifkin	0683efa637	export ThinkingParser	2025-06-05 10:22:32 -07:00
JasonHonKL	0943001193	server: add model capabilities to the list endpoint (#10174 )	2025-06-04 11:39:48 -07:00
Devon Rifkin	5f57b0ef42	add thinking support to the api and cli (#10584 ) - Both `/api/generate` and `/api/chat` now accept a `"think"` option that allows specifying whether thinking mode should be on or not - Templates get passed this new option so, e.g., qwen3's template can put `/think` or `/no_think` in the system prompt depending on the value of the setting - Models' thinking support is inferred by inspecting model templates. The prefix and suffix the parser uses to identify thinking support is also automatically inferred from templates - Thinking control & parsing is opt-in via the API to prevent breaking existing API consumers. If the `"think"` option is not specified, the behavior is unchanged from previous versions of ollama - Add parsing for thinking blocks in both streaming/non-streaming mode in both `/generate` and `/chat` - Update the CLI to make use of these changes. Users can pass `--think` or `--think=false` to control thinking, or during an interactive session they can use the commands `/set think` or `/set nothink` - A `--hidethinking` option has also been added to the CLI. This makes it easy to use thinking in scripting scenarios like `ollama run qwen3 --think --hidethinking "my question here"` where you just want to see the answer but still want the benefits of thinking models	2025-05-28 19:38:52 -07:00
Kyle Steere	9239a254e0	server: abort download on empty digest Signed-off-by: Kyle Steere <kyle.steere@chainguard.dev>	2025-05-27 11:28:48 -07:00
frob	eda472df1b	server: add hint to the error message when model path access fails (#10843 )	2025-05-24 13:17:04 -07:00
Parth Sareen	e8b981fa5d	tools: refactor tool call parsing and enable streaming (#10415 )	2025-05-23 14:19:31 -07:00
Daniel Hiltgen	d950ff12c0	sched: fix runner leak during reloading unload (#10819 ) When the same model is being reloaded rapidly with client connections being canceled before the model finishes loading, the queued unload event could cause a leak of runners by deleting a different runner from the loaded list.	2025-05-22 14:31:36 -07:00
Bruce MacDonald	fbe6ae285a	server: improve tensor quantization fallback logic (#10806 ) Fall back to alternative quantization types when a tensor's dimensions aren't divisible by the block size required for the original desired quantization type. If retried quantization types fail, the system ultimately falls back to F16 (half-precision floating point) which has a block size of 1 and can handle any tensor dimension.	2025-05-22 10:48:08 -07:00
Michael Yang	61aeaf7e81	remove support for multiple ggufs in a single file (#10722 ) * remove support for multiple ggufs in a single file this was an attempt to make it easier to import multimodal models into ollama. this was rarely used and error prone so remove it * fix: create fused model from blob	2025-05-21 13:55:31 -07:00
Daniel Hiltgen	1a0cfd080a	avoid kv truncation during create (#10761 )	2025-05-19 13:54:54 -07:00
Jesse Gross	94ab428e3f	ggml: Separate tensor load from backend creation Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them to be two steps: - Create backend, including enumerating tensors and memory allocation - Loading tensor data This allows more flexibility in managing model loading.	2025-05-19 09:54:22 -07:00
Daniel Hiltgen	ff80718e9c	fix crash in old clients with quantization progress (#10710 ) Older clients assumed the digest was at least 19 characters long so increase the size of the dummy digest to avoid array out of bounds crashes.	2025-05-14 14:54:18 -07:00
Michael Yang	23125648b8	chore: update mllama to use ollama engine (#10637 )	2025-05-13 17:36:02 -07:00
Jeffrey Morgan	c7f4ae7b9c	server: add webp image input support (#10653 )	2025-05-12 20:41:42 -07:00
Daniel Hiltgen	9d6df90805	Follow up to #10363 (#10647 ) The quantization PR didn't block all unsupported file types, which this PR fixes. It also updates the API docs to reflect the now reduced set of supported types.	2025-05-12 15:23:31 -07:00
Bruce MacDonald	ad035ad595	convert: quantize from safetensors needs kv (#10675 ) When creating a quantized model from safetensors we need the array KV values to be loaded. Changing this value to -1 loads the KV values on the returned layer to be used and saved during quantization.	2025-05-12 12:04:20 -07:00
Michael Yang	f95a1f2bef	feat: add trace log level (#10650 ) reduce prompt log to trace level	2025-05-12 11:43:00 -07:00
Michael Yang	0d6e35d3c6	fix: stream accumulator exits early (#10593 ) the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")` which isn't strictly correct since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.	2025-05-08 13:17:30 -07:00
Devon Rifkin	20c5fd39c8	Merge branch 'main' into drifkin/array-head-count-simple	2025-05-08 11:46:52 -07:00
Michael Yang	6e9a7a2568	lint: enable usetesting, disable tenv (#10594 )	2025-05-08 11:42:14 -07:00
Daniel Hiltgen	5e380c3b42	sched: fix race leading to orphaned runners (#10599 ) If a model is loading, and the request context is canceled during the load by a client closing the connection, and another request is inbound for the same model with a different configuration (context size, etc.) thus requiring a reload, two unload events can be in flight. The first shuts down the original model load, but the second one caused the loss of the new reloading runner reference, thus triggering the leak. The primary fix is detecting the duplicate unload and ignoring the second instance. The load routine is also hardened to ensure we detect clobbering an already present runner and unload it with a warning.	2025-05-07 09:38:17 -07:00
Jeffrey Morgan	392de84031	api: remove unused RetrieveModelResponse type (#10603 )	2025-05-06 23:08:03 -07:00
Devon Rifkin	4090aca97b	server: send 405 instead of 404 for unallowed methods (#10275 ) Fixes: #5483	2025-05-06 14:45:37 -07:00
Michael Yang	92ce438de0	server: remove internal cmd (#10595 )	2025-05-06 13:05:01 -07:00
Daniel Hiltgen	424810450f	Move quantization to new backend (#10363 ) * Move quantization logic to GGML via new backend This moves the model aware logic to Go code and calls GGMLs quantization code for model creation. * Remove "add model quantizations" This is no longer needed now that quantization is implemented in Go+GGML code directly.	2025-05-06 11:20:48 -07:00
Jeffrey Morgan	1703d1472e	server: fix panic when runner.Options is nil (#10566 )	2025-05-05 09:01:33 -07:00
Daniel Hiltgen	76ea735aaf	sched: logging improvements (#10550 ) This enhances our logging in the scheduler. The initial "waiting for server" log no longer claims an initial error state (now "not responding" which better reflects the actual state). Runners now have slog wiring to report more details about the runner, including PID.	2025-05-03 12:01:56 -07:00
frob	e6d2d04121	image: add vision capability for projector-based models (#10509 ) Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2025-05-01 16:50:20 -07:00
Devon Rifkin	ad3c7c9bda	strip out thinking tags in message history for qwen3 & r1 (#10490 ) * strip out thinking tags in message history for qwen3 & r1 This is in advance of "proper" support where we'll make reasoning configurable and we'll parse out thinking/reasoning tags and provide them to the caller. These models expect there to be no thinking tags in the message history, so this should improve quality * parse model names instead of hacky prefix check	2025-04-30 13:57:45 -07:00
Daniel Hiltgen	415c8fcc3d	Fix "Stopping..." scheduler hang (#10487 ) * Adjust initial scheduler refCount Ensure we only set the refCount on success * sched: fix lock order inversion deadlock Under certain race conditions, there was a scenario where the scheduler would get into a deadlock while trying to update free space information while a model was trying to unload.	2025-04-30 11:26:52 -07:00
Devon Rifkin	fe5b9bb21b	lower default num parallel to 2 this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k	2025-04-29 02:04:14 -07:00
Devon Rifkin	dd93e1af85	Revert "increase default context length to 4096 (#10364 )" This reverts commit `424f648632`.	2025-04-28 16:54:11 -07:00
Devon Rifkin	d2ee599dcf	load arrays with up to 1024 elements when estimating This mirrors the old behavior before #10382	2025-04-27 13:45:13 -07:00
Michael Yang	340448d2d1	explicitly decode maxarraysize 1024	2025-04-25 16:59:01 -07:00
Michael Yang	214a7678ea	fix superfluous call to WriteHeader the first call to http.ResponseWriter.Write implicitly calls WriteHeader with http.StatusOK if it hasn't already been called. once WriteHeader has been called, subsequent calls have no effect. Write is called when JSON encoding progressUpdateJSON{}. calls to http.ResponseWriter.WriteHeader after the first encode are useless and produce a warning: http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)	2025-04-25 16:58:49 -07:00
Devon Rifkin	424f648632	increase default context length to 4096 (#10364 ) * increase default context length to 4096 We lower the default numParallel from 4 to 2 and use these "savings" to double the default context length from 2048 to 4096. We're memory neutral in cases when we previously would've used numParallel == 4, but we add the following mitigation to handle some cases where we would have previously fallen back to 1x2048 due to low VRAM: we decide between 2048 and 4096 using a runtime check, choosing 2048 if we're on a one GPU system with total VRAM of <= 4 GB. We purposefully don't check the available VRAM because we don't want the context window size to change unexpectedly based on the available VRAM. We plan on making the default even larger, but this is a relatively low-risk change we can make to quickly double it. * fix tests add an explicit context length so they don't get truncated. The code that converts -1 from being a signal for doing a runtime check isn't running as part of these tests. * tweak small gpu message * clarify context length default also make it actually show up in `ollama serve --help`	2025-04-22 16:33:24 -07:00
Michael Yang	88738b357b	create tempdir in models directory the models directory should have plenty of storage and also ensure there's no cross-device copy	2025-04-18 18:13:05 -07:00
Blake Mizerany	4e535e6188	server/internal/registry: make pull send errors with Error field (#10326 ) Previously, the pull handler would send an error message in the Status field, this prevented the client from using the message as a signal to stop. In the case of the "run" command, it would follow the pull with a "show" which would print a nearly identical "not found" message for unresolved models. Fixes #10307	2025-04-18 18:12:28 -07:00
Blake Mizerany	1d99451ad7	server/internal/client/ollama: handle some network errors gracefully (#10317 )	2025-04-17 12:43:09 -07:00
Blake Mizerany	369de832cd	server/internal/registry: remove superfluous progress bar flush (#10303 ) This removes the extra flushProgress() at the end of handlePull. It is unnecessary because final progress updates are flushed in all cases of the main select loop.	2025-04-16 14:43:07 -07:00
Blake Mizerany	3457a315b2	server/internal/client/ollama: cleanup use of multiple counters (#10304 ) The completed and received counters must work in tandem and the code should better reflect that. Previously, the act of updating them was 2-3 lines of code duplicated in multiple places. This consolidates them into a single update closure for easy reading and maintenance. This also simplifies error handling in places where we can use a return parameter and defer to handle the error case for updates. Also, remove the old Layer field from the trackingReader struct.	2025-04-16 14:33:40 -07:00
Daniel Hiltgen	56dc316a57	Give tests more time to run (#10306 ) Fix flake failures on windows	2025-04-16 13:37:00 -07:00
Blake Mizerany	1e7f62cb42	cmd: add retry/backoff (#10069 ) This commit adds retry/backoff to the registry client for pull requests. Also, revert progress indication to match original client's until we can "get it right." Also, make WithTrace wrap existing traces instead of clobbering them. This allows clients to compose traces.	2025-04-15 23:24:44 -07:00
Devon Rifkin	97fe45e36d	server: add `OpenAI-Beta` header to CORS safelist alphabetized the compat list and then added a single header fixes: #9801	2025-04-14 15:36:10 -07:00
Tom Sheffler	ef65174df2	types: include the 'items' and '$defs' fields to properly handle "array" types (#10091 ) --------- Co-authored-by: Parth Sareen <parth.sareen@ollama.com>	2025-04-09 17:45:49 -07:00
Ire Gaddr	42ecb9f138	fix(scheduler): make model unload order deterministic (#10185 )	2025-04-09 16:01:02 -07:00
Parth Sareen	6747099d71	types: add any type and validation for ToolFunction enum (#10166 )	2025-04-08 15:05:38 -07:00
Alex Rozgo	2f723ac2d6	types: allow tool function parameters with a single type or an array of types (#9434 )	2025-04-07 14:27:01 -07:00
Bruce MacDonald	e53b3cbd0c	llm: set done reason at server level (#9830 ) No functional change. Many different done reasons can be set at the runner level, so rather than obscuring them we should return them to the server process and let it choose what to do with the done reason. This separates the API concerns from the runner.	2025-04-03 10:19:24 -07:00
Bruce MacDonald	9876c9faa4	chore(all): replace instances of interface with any (#10067 ) Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.	2025-04-02 09:44:27 -07:00
Bruce MacDonald	e172f095ba	api: return model capabilities from the show endpoint (#10066 ) With support for multimodal models becoming more varied and common it is important for clients to be able to easily see what capabilities a model has. Returning these from the show endpoint will allow clients to easily see what a model can do.	2025-04-01 15:21:46 -07:00
Blake Mizerany	ef27d52e79	server/internal/client/ollama: cache completed chunks (#9933 ) This change adds tracking of download chunks during the pull process so that subsequent pulls can skip downloading already completed chunks. This works across restarts of ollama. Currently, download state will be lost if a prune is triggered during a pull (e.g. restart or remove). This issue should be addressed in a follow-up PR.	2025-03-30 23:54:54 -07:00
CYJiang	0bd0454ea7	server: organize error types (#9465 ) Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2025-03-28 11:50:22 -07:00
Jesse Gross	f66216e399	ggml: Support heterogeneous KV cache layer sizes in memory estimation Gemma3 uses sliding windows for its context on 5/6 layers, significantly reducing memory usage but leading to uneven usage across layers, which makes allocation to the correct GPU difficult. We currently estimate very conservatively by assuming all layers are consistent at the max size. Llama3.2-vision is also inconsistent between self attention and cross attention layers - at the moment, we calculate the correct total size and then average this across layers. In some cases, this may lead to crashes if a large layer is placed on a GPU sized by the average. This allows memory estimation to calculate per-layer KV cache size and take this into account when placing layers onto GPUs. We already do this for weights that vary per-tensor, so this is a logical extension. Fixes #9730 Fixes #9890	2025-03-26 13:16:03 -07:00
Blake Mizerany	ce929984a3	server/internal/client/ollama: fix file descriptor management in Pull (#9931 ) Close chunked writers as soon as downloads complete, rather than deferring closure until Pull exits. This prevents exhausting file descriptors when pulling many layers. Instead of unbounded defers, use a WaitGroup and background goroutine to close each chunked writer as soon as its downloads finish. Also rename 'total' to 'received' for clarity.	2025-03-21 16:16:38 -07:00
Blake Mizerany	c794fef2f2	server/internal/client/ollama: persist through chunk download errors (#9923 )	2025-03-21 13:03:43 -07:00
Patrick Devine	f8c3dbe5b5	templates: add autotemplate for gemma3 (#9880 ) This change allows the gemma3 template to be autodetected during `ollama create`.	2025-03-20 00:15:30 -07:00
Blake Mizerany	2ddacd7516	server/internal/client/ollama: confirm all chunksums were received (#9893 ) If the chunksums response is missing a chunk, the client should fail the download. This changes the client to check that all bytes are accounted for in the chunksums response. It is possible there are overlaps or gaps in the chunksums response and so the size is not the only thing left to check, but this provides enough coverage for now. We may want to check that chunks are contiguous later.	2025-03-19 14:59:57 -07:00
Blake Mizerany	8294676150	server/internal/client/ollama: set User-Agent for registry client (#9775 ) This sets the agent header in DefaultRegistry to include the version of the client, OS, and architecture in the previous format, with a minor twist. Note: The version is obtained from the build info, instead of the version in version.Version, which should no longer be necessary, but we can remove in a future commit. Using the build info is more accurate and also provides extra build information if the build is not tagged, and if it is "dirty". Previously, the version was just "0.0.0" with no other helpful information. The ollama.com registry and others handle this swimmingly.	2025-03-14 18:33:07 -07:00
Jesse Gross	7bf793a600	gemma3: Allow multiple images in a single input Previously processing multiple images in a batch would trigger segfaults so sending images together was disabled as a way to mitigate this. The trigger was processing one image on the CPU and one on the GPU. This can no longer happen: - The vision encoder is now on the GPU so both images would be processed on the GPU. - We require images to be fully contained in a batch and each image including its special tokens is over half the batch size. As a result, we will never get two images in the same batch. Fixes #9731	2025-03-14 15:38:54 -07:00
Blake Mizerany	4e320b8b90	server/internal/chunks: remove chunks package (#9755 )	2025-03-14 08:57:59 -07:00
Blake Mizerany	eb2b22b042	server/internal/client: use chunksums for concurrent blob verification (#9746 ) Replace large-chunk blob downloads with parallel small-chunk verification to solve timeout and performance issues. Registry users experienced progressively slowing download speeds as large-chunk transfers aged, often timing out completely. The previous approach downloaded blobs in a few large chunks but required a separate, single-threaded pass to read the entire blob back from disk for verification after download completion. This change uses the new chunksums API to fetch many smaller chunk+digest pairs, allowing concurrent downloads and immediate verification as each chunk arrives. Chunks are written directly to their final positions, eliminating the entire separate verification pass. The result is more reliable downloads that maintain speed throughout the transfer process and significantly faster overall completion, especially over unstable connections or with large blobs.	2025-03-13 22:18:29 -07:00
Patrick Devine	4bed739259	add verbose mode to the show command (#9640 ) Add metadata and tensor information to the show command to be able to see more information about a model. This outputs the same data as shown on the model details page on ollama.com	2025-03-13 14:24:27 -07:00
Michael Yang	ec46f3286c	engine: error on embeddings; not currently implemented	2025-03-13 11:40:55 -07:00
jmorganca	65b0f329d1	Revert "Allow models to force a new batch" This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.	2025-03-11 14:49:20 -07:00
Jesse Gross	06007c0a18	Allow models to force a new batch This is useful for a few things: - Work around bugs, such as having 2 images in one batch - Keep the image in a single batch for fully connected attention - Improve performance by not evaluating embeddings multiple times	2025-03-11 14:49:20 -07:00
Jesse Gross	475005504e	Restrict Gemma to a single image per request	2025-03-11 14:49:20 -07:00
Blake Mizerany	e2252d0fc6	server/internal/registry: take over pulls from server package (#9485 ) This commit replaces the old pull implementation in the server package with the new, faster, more robust pull implementation in the registry package. The new endpoint, and now the remove endpoint too, are behind the feature gate "client2" enabled only by setting the OLLAMA_EXPERIMENT environment variable to include "client2". Currently, the progress indication is wired to perform the same as the previous implementation to avoid making changes to the CLI, and because the status reports happen at the start of the download, and the end of the write to disk, the progress indication is not as smooth as it could be. This is a known issue and will be addressed in a future change. This implementation may be ~0.5-1.0% slower in rare cases, depending on network and disk speed, but is generally MUCH faster and more robust than its predecessor in all other cases.	2025-03-05 14:48:18 -08:00
Daniel Hiltgen	1fdb351c37	New engine: vision models and auto-fallback (#9113 ) * Include unified vision layers in memory prediction For newer vision models with a single gguf, include the projection estimates. * Adjust CLI to handle both styles of vision model metadata * Wire up new tokenizers for new engine If we're loading the new engine, utilize the new model text processor instead of calling into cgo wrappers for llama.cpp. This also cleans up some tech debt from the older tokenization flow for the C++ server which was no longer used. This also adjusts the grammar handling logic to pass through to the new engine instead of utilizing the cgo schema to grammar call. * Lay foundation for auto selection of new engine	2025-03-04 09:03:46 -08:00
Blake Mizerany	7a01ad7614	server/internal/registry: reintroduce pruning on model deletion (#9489 ) This reintroduces aggressive pruning on model deletion as a temporary measure until a more controlled garbage collection (GC) mechanism is implemented. Issues with the current approach: 1. Users may accidentally delete a model (`ollama rm llama3.3` instead of `ollama rm llama3.2`), requiring a full re-download unless another model references the same blobs. 2. Users may assume a deleted model is still referenced elsewhere, but due to prior updates or deletions, the references no longer exist, leading to unnecessary re-downloads. Soon, we should implement a structured GC mechanism to retain unreferenced blobs for a configurable period before removal, which will run on "ollama rm" and other commands we deem appropriate. Users that want to immediately remove unreferenced blobs can use a new prune command that will allow them to specify the age and class of blobs to remove. Example usage: # Run basic blob GC $ ollama prune # Remove unreferenced blobs older than 7 days $ ollama prune --age 7d # Remove all blobs, referenced or not, older than 7 days (and their manifests?) $ ollama prune --age 7d --all # Remove all unreferenced blobs immediately $ ollama prune --age 0 --all # Remove all blobs $ ollama prune --age 0 --all This should provide a safer and more predictable cleanup process.	2025-03-03 19:11:16 -08:00
Blake Mizerany	55ab9f371a	server/.../backoff,syncs: don't break builds without synctest (#9484 ) Previously, developers without the synctest experiment enabled would see build failures when running tests in some server/internal/internal packages using the synctest package. This change makes the transition to use of the package less painful but guards the use of the synctest package with build tags. synctest is enabled in CI. If a new change will break a synctest package, it will break in CI, even if it does not break locally. The developer docs have been updated to help with any confusion about why package tests pass locally but fail in CI.	2025-03-03 16:45:40 -08:00
Blake Mizerany	3519dd1c6e	server/internal/client/ollama: hold DiskCache on Registry (#9463 ) Previously, using a Registry required a DiskCache to be passed in for use in various methods. This was a bit cumbersome, as the DiskCache is required for most operations, and the DefaultCache is used in most of those cases. This change makes the DiskCache an optional field on the Registry struct. This also changes DefaultCache to initialize on first use. This is to not burden clients with the cost of creating a new cache per use, or having to hold onto a cache for the lifetime of the Registry. Also, slip in some minor docs updates for Trace.	2025-03-02 20:55:44 -08:00
Blake Mizerany	ee048b76d4	server/internal/client/ollama: handle extended names in client/ollama (#9454 ) The extended name format is a superset of the name format that only the client needs to know about, not the server or other dependents of the name package, so move the split logic into the client package. Also, take advantage of knowing about the extended name format to allow the client to use the extended name format when unlinking to verify they are unlinking the manifest with the content they intend.	2025-03-02 13:30:41 -08:00
Blake Mizerany	cda6f5c66c	server/internal/internal/names: validate names (#9400 ) This commit is a step towards a goal to make names less ceremonial outside of the registry client. Clients of the registry package can treat names as opaque strings, and the registry package will handle parsing, validating, and normalizing names. Ideally we end up with the names package tucked away in an internal package for good. We'll see how things go. Also, this package name is not permanent. This is another step in the on-going process of refactoring the server code, and at some point it will most likely be renamed/moved.	2025-03-01 13:15:14 -08:00
Bruce MacDonald	bebb6823c0	server: validate local path on safetensor create (#9379 ) More validation during the safetensor creation process. Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths Add comprehensive test coverage for various paths No functionality changes for valid inputs - existing workflows remain unaffected Leverages Go 1.24's new os.Root functionality for secure containment	2025-02-28 16:10:43 -08:00
Blake Mizerany	eed11ded30	server/.../safetensors: fix offsets and include all model parts (#9427 ) Also, require the -as flag to be set when importing a model. This prevents the confusing error message "invalid name". Also, allow short names to be used when importing a model and auto-complete the name with the default mask.	2025-02-28 13:08:10 -08:00
Blake Mizerany	41dc280491	server/internal/registry: implement CloseNotify and Flush (for now) (#9402 ) This fixes panics introduced in `2412adf42b` when Gin ungracefully assumes that the http.ResponseWriter implements http.CloseNotifier and http.Flusher, which our new statusCodeRecorder does not. This is a temporary fix until we can pour the rest of the Gin out.	2025-02-27 14:00:37 -08:00
Blake Mizerany	2412adf42b	server/internal: replace model delete API with new registry handler. (#9347 ) This commit introduces a new API implementation for handling interactions with the registry and the local model cache. The new API is located in server/internal/registry. The package name is "registry" and should be considered temporary; it is hidden and not bleeding outside of the server package. As the commits roll in, we'll start consuming more of the API and then let reverse osmosis take effect, at which point it will surface closer to the root level packages as much as needed.	2025-02-27 12:04:53 -08:00
Blake Mizerany	348b3e0983	server/internal: copy bmizerany/ollama-go to internal package (#9294 ) This commit copies (without history) the bmizerany/ollama-go repository with the intention of integrating it into the ollama as a replacement for the pushing, and pulling of models, and management of the cache they are pushed and pulled from. New homes for these packages will be determined as they are integrated and we have a better understanding of proper package boundaries.	2025-02-24 22:39:44 -08:00

925 Commits