ollama

mirror of https://github.com/zebrajr/ollama.git synced 2025-12-06 00:19:51 +01:00

Author	SHA1	Message	Date
Tobias Heinze	6fc9d22707	server: fix blob download when receiving a 200 response (#6656 )	2024-09-05 10:48:26 -07:00
Michael Yang	9468c6824a	Merge pull request #6534 from ollama/mxyng/messages update templates to use messages	2024-08-30 09:39:59 -07:00
Michael Yang	47c2b947a9	Merge pull request #6546 from ollama/mxyng/fix-test fix(test): do not clobber models directory	2024-08-28 15:37:47 -07:00
Michael Yang	e4d0a9c325	fix(test): do not clobber models directory	2024-08-28 14:07:48 -07:00
Michael Yang	d9d50c43cc	validate model path	2024-08-28 09:32:57 -07:00
Michael Yang	413ae39f3c	update templates to use messages	2024-08-27 15:44:04 -07:00
Jeffrey Morgan	47fa0839b9	server: clean up route names for consistency (#6524 )	2024-08-26 19:36:11 -07:00
Patrick Devine	0c819e167b	convert safetensor adapters into GGUF (#6327 )	2024-08-23 11:29:56 -07:00
Daniel Hiltgen	90ca84172c	Fix embeddings memory corruption (#6467 ) * Fix embeddings memory corruption The patch was leading to a buffer overrun corruption. Once removed though, parallelism in server.cpp led to hitting an assert due to slot/seq IDs being >= token count. To work around this, only use slot 0 for embeddings. * Fix embed integration test assumption The token eval count has changed with recent llama.cpp bumps (0.3.5+)	2024-08-22 14:51:42 -07:00
Michael Yang	77903ab8b4	llama3.1	2024-08-21 11:49:31 -07:00
Michael Yang	4ecc70d3b4	Merge pull request #6386 from zwwhdls/fix-new-layer fix: chmod new layer to 0o644 when creating it	2024-08-21 10:58:45 -07:00
Daniel Hiltgen	88e7705079	Merge pull request #6402 from rick-github/numParallel Override numParallel in pickBestPartialFitByLibrary() only if unset.	2024-08-19 11:07:22 -07:00
Jeffrey Morgan	9fddef3731	server: limit upload parts to 16 (#6411 )	2024-08-19 09:20:52 -07:00
Richard Lyons	885cf45087	Fix white space.	2024-08-18 03:07:16 +02:00
Richard Lyons	9352eeb752	Reset NumCtx.	2024-08-18 02:55:01 +02:00
Richard Lyons	0ad0e738cd	Override numParallel only if unset.	2024-08-18 01:43:26 +02:00
zwwhdls	bdc4308afb	fix: chmod new layer to 0o644 when creating it Signed-off-by: zwwhdls <zww@hdls.me>	2024-08-16 11:43:19 +08:00
Michael Yang	3a75e74e34	only skip invalid json manifests	2024-08-15 10:29:14 -07:00
Michael Yang	237dccba1e	skip invalid manifest files	2024-08-14 16:55:45 -07:00
Michael Yang	b3f75fc812	fix noprune	2024-08-14 15:48:51 -07:00
Blake Mizerany	8e1050f366	server: reduce max connections used in download (#6347 ) The previous value of 64 was WAY too high and unnecessary. It reached diminishing returns and blew past it. This is a more reasonable number for _most_ normal cases. For users on cloud servers with excellent network quality, this will keep screaming for them, without hitting our CDN limits. For users with relatively poor network quality, this will keep them from saturating their network and causing other issues.	2024-08-13 16:47:35 -07:00
Michael Yang	2697d7f5aa	lint - fixes printf: non-constant format string in call to fmt.Printf - fixes SA1032: arguments have the wrong order - disables testifylint	2024-08-13 14:36:33 -07:00
royjhan	8b00a415ab	Load Embedding Model on Empty Input (#6325 ) * load on empty input * no load on invalid input	2024-08-13 10:19:56 -07:00
Josh	980dd15f81	cmd: speed up gguf creates (#6324 )	2024-08-12 11:46:09 -07:00
Josh	1dc3ef3aa9	Revert "server: speed up single gguf creates (#5898 )" (#6323 ) This reverts commit `8aac22438e`.	2024-08-12 09:57:51 -07:00
Josh	8aac22438e	server: speed up single gguf creates (#5898 )	2024-08-12 09:28:55 -07:00
Jeffrey Morgan	15c2d8fe14	server: parallelize embeddings in API web handler instead of in subprocess runner (#6220 ) For simplicity, perform parallelization of embedding requests in the API handler instead of offloading this to the subprocess runner. This keeps the scheduling story simpler as it builds on existing parallel requests, similar to existing text completion functionality.	2024-08-11 11:57:10 -07:00
Jesse Gross	9b53e39d8e	Merge pull request #6258 from coolljt0725/fix_typo server/download.go: Fix a typo in log	2024-08-09 17:19:48 -07:00
Daniel Hiltgen	2fa1db4345	Don't hard fail on sparse setup error It seems this can fail in some cases, but proceed with the download anyway.	2024-08-09 12:16:19 -07:00
Jitang Lei	7b61eba471	server/download.go: Fix a typo in log Signed-off-by: Jitang Lei <leijitang@outlook.com>	2024-08-08 20:28:01 +08:00
Jesse Gross	7edaf6e7e8	manifest: Store layers inside manifests consistently as values. Commit `1829fb61` ("manifest: Fix crash on startup when trying to clean up unused files (#5840)") changed the config layer stored in manifests from a pointer to a value. This was done in order to avoid potential nil pointer dereferences after it is deserialized from JSON in the event that the field is missing. This changes the Layers slice to also be stored by value. This enables consistency in handling across the two objects.	2024-08-07 17:03:06 -07:00
Jesse Gross	97ec8cfd4e	image: Clarify argument to WriteManifest is config When creating a model the config layer is appended to the list of layers and then the last layer is used as the config when writing the manifest. This change directly uses the config layer to write the manifest. There is no behavior change but it is less error prone.	2024-08-07 16:58:42 -07:00
Jesse Gross	1829fb61bd	manifest: Fix crash on startup when trying to clean up unused files (#5840 ) Currently if the config field is missing in the manifest file (or corrupted), Ollama will crash when it tries to read it. This can happen at startup or when pulling new models. This data is mostly just used for showing model information so we can be tolerant of it not being present - it is not required to run the models. Besides avoiding crashing, this also gives us the ability to restructure the config in the future by pulling it into the main manifest file.	2024-08-07 10:30:44 -07:00
Jesse Gross	685a53534b	manifest: Don't prune layers if we can't open a manifest file If there is an error when opening a manifest file (corrupted, permission denied, etc.) then the referenced layers will not be included in the list of active layers. This causes them to be deleted when pruning happens at startup or a model is pulled. In such a situation, we should prefer to preserve data in the hopes that it can be recovered rather than being aggressive about deletion.	2024-08-06 23:11:19 -07:00
Daniel Hiltgen	fc85f50a2b	Ensure sparse files on windows during download The file.Truncate call on windows will write the whole file unless you set the sparse flag, leading to heavy I/O at the beginning of download. This should improve our I/O behavior on windows and put less stress on the users disk.	2024-08-06 10:58:08 -07:00
Michael Yang	a091fadfda	use testing tempdirs	2024-08-02 16:04:06 -07:00
Michael Yang	b732beba6a	lint	2024-08-01 17:06:06 -07:00
Michael Yang	ff7c9060ec	Merge pull request #6115 from slouffka/fix-context Fix context in /api/generate grows too much (#5980).	2024-08-01 15:13:59 -07:00
Michael Yang	0ff42e84b0	Merge pull request #4756 from ollama/mxyng/convert2 refactor convert	2024-08-01 14:16:30 -07:00
Vyacheslav Moskalev	8a9f946ca7	Refactor and format code.	2024-08-02 03:50:05 +07:00
Vyacheslav Moskalev	3b5210548e	Refactor code. Remove extra variable.	2024-08-01 19:56:15 +07:00
Vyacheslav Moskalev	b0c216584c	Better types and naming closer to style.	2024-08-01 19:43:44 +07:00
Vyacheslav Moskalev	49a5483139	Change the order of context and prompt.	2024-08-01 19:25:56 +07:00
Vyacheslav Moskalev	6bc5c13758	Fix extra context concatenation in generate handler (#5980 ).	2024-08-01 15:45:58 +07:00
Michael Yang	d87b4a488e	fix modelfile message quotes	2024-07-31 16:52:09 -07:00
Blake Mizerany	dc77bbcfa4	server: fix json marshalling of downloadBlobPart (#6108 )	2024-07-31 16:01:24 -07:00
Michael Yang	eafc607abb	convert: only extract large files	2024-07-31 15:58:55 -07:00
Michael Yang	df993fa37b	comments	2024-07-31 15:58:55 -07:00
Michael Yang	5e9db9fb0b	refactor convert	2024-07-31 15:58:33 -07:00
Michael Yang	c4c84b7a0d	Merge pull request #5196 from ollama/mxyng/messages-2 include modelfile messages	2024-07-31 10:18:17 -07:00
Michael Yang	5c1912769e	Merge pull request #5473 from ollama/mxyng/environ fix: environ lookup	2024-07-31 10:18:05 -07:00
royjhan	1b44d873e7	Add Metrics to `api\embed` response (#5709 ) * add prompt tokens to embed response * rm slog * metrics * types * prompt n * clean up * reset submodule * update tests * test name * list metrics	2024-07-30 13:12:21 -07:00
Daniel Hiltgen	345420998e	Prevent partial loading on mixed GPU brands In multi-brand GPU setups, if we couldn't fully load the model we would fall through the scheduler and mistakenly try to load across a mix of brands. This makes sure we find the set of GPU(s) that best fit for the partial load.	2024-07-30 11:00:55 -07:00
Michael Yang	079b2c3b03	Merge pull request #5999 from ollama/mxyng/fix-push fix nil deref in auth.go	2024-07-26 14:28:34 -07:00
Blake Mizerany	750c1c55f7	server: fix race conditions during download (#5994 ) This fixes various data races scattered throughout the download/pull client where the client was accessing the download state concurrently. This commit is mostly a hot-fix and will be replaced by a new client one day soon. Also, remove the unnecessary opts argument from downloadChunk.	2024-07-26 14:24:24 -07:00
Michael Yang	a622c47bd3	fix nil deref in auth.go	2024-07-26 14:14:48 -07:00
Michael Yang	ec4c35fe99	Merge pull request #5512 from ollama/mxyng/detect-stop autodetect stop parameters from template	2024-07-26 13:48:23 -07:00
Michael Yang	15af558423	include modelfile messages	2024-07-26 11:40:11 -07:00
Blake Mizerany	c8af3c2d96	server: reuse original download URL for images (#5962 ) This changes the registry client to reuse the original download URL it gets on the first redirect response for all subsequent requests, preventing thundering herd issues when hot new LLMs are released.	2024-07-25 15:58:30 -07:00
Josh	db0968f30c	fix dupe err message (#5857 )	2024-07-22 15:48:15 -07:00
Michael Yang	85d9d73a72	comments	2024-07-22 11:49:03 -07:00
Michael Yang	1954ec5917	uint64	2024-07-22 11:49:02 -07:00
Michael Yang	0f1910129f	int	2024-07-22 11:30:07 -07:00
Michael Yang	8570c1c0ef	keepalive	2024-07-22 11:27:22 -07:00
Michael Yang	55cd3ddcca	bool	2024-07-22 11:27:21 -07:00
Michael Yang	66fe77f084	models	2024-07-22 11:26:12 -07:00
Michael Yang	d1a5227cad	origins	2024-07-22 11:25:30 -07:00
Michael Yang	35b89b2eab	rfc: dynamic environ lookup	2024-07-22 11:25:30 -07:00
Jeffrey Morgan	b3e5491e41	server: collect nested tool call objects when parsing (#5824 )	2024-07-22 12:38:03 -04:00
Jeffrey Morgan	80ee9b5e47	Remove out of space test temporarily (#5825 )	2024-07-21 00:22:11 -04:00
Daniel Hiltgen	06e5d74e34	Merge pull request #5506 from dhiltgen/sched_tests Refine scheduler unit tests for reliability	2024-07-20 15:48:39 -07:00
Jeffrey Morgan	69a2d4ccff	Fix generate test flakiness (#5804 )	2024-07-19 19:11:25 -07:00
Josh	e8b954c646	server: validate template (#5734 ) add template validation to modelfile	2024-07-19 15:24:29 -07:00
Michael Yang	43606d6d6a	fix parsing tool calls	2024-07-18 12:08:11 -07:00
Jeffrey Morgan	70b1010fa5	server: check for empty tools array too (#5779 )	2024-07-18 11:44:57 -07:00
Jeffrey Morgan	319fb1ce03	server: only parse tool calls if tools are provided (#5771 ) * server: only parse tool calls if tools are provided * still set `resp.Message.Content`	2024-07-18 08:50:23 -07:00
Michael Yang	b255445557	marshal json automatically for some template values (#5758 )	2024-07-17 15:35:11 -07:00
Michael Yang	5fd6988126	parse tool call as individual objects	2024-07-17 11:19:04 -07:00
Michael Yang	c279f96371	remove ToolCall from GenerateResponse	2024-07-16 15:22:49 -07:00
Michael Yang	499e87c9ba	Merge pull request #5730 from ollama/mxyng/cleanup remove unneeded tool calls	2024-07-16 14:42:13 -07:00
Michael Yang	d290e87513	add suffix support to generate endpoint this change is triggered by the presence of "suffix", particularly useful for code completion tasks	2024-07-16 14:31:35 -07:00
Michael Yang	5a83f79afd	remove unneeded tool calls	2024-07-16 13:48:45 -07:00
royjhan	987dbab0b0	OpenAI: /v1/embeddings compatibility (#5285 ) * OpenAI v1 models * Empty List Testing * Add back envconfig * v1/models docs * Remove Docs * OpenAI batch embed compatibility * merge conflicts * integrate with api/embed * ep * merge conflicts * request tests * rm resp test * merge conflict * merge conflict * test fixes * test fn renaming * input validation for empty string --------- Co-authored-by: jmorganca <jmorganca@gmail.com>	2024-07-16 13:36:08 -07:00
Michael Yang	a8388beb94	Merge pull request #5726 from ollama/mxyng/tools-templates fix unmarshal type errors	2024-07-16 12:12:10 -07:00
Michael Yang	5afbb60fc4	fix unmarshal type errors	2024-07-16 11:39:34 -07:00
Jeffrey Morgan	4cb5d7decc	server: omit model system prompt if empty (#5717 )	2024-07-16 11:09:00 -07:00
Michael Yang	4a565cbf94	add chat and generate tests with mock runner	2024-07-16 09:39:31 -07:00
Michael Yang	64039df6d7	Merge pull request #5284 from ollama/mxyng/tools tools	2024-07-15 18:03:37 -07:00
Jeffrey Morgan	7ac6d462ec	server: return empty slice on empty `/api/embed` request (#5713 ) * server: return empty slice on empty `/api/embed` request * fix tests	2024-07-15 17:39:44 -07:00
Michael Yang	ef5136a745	tools test	2024-07-15 17:18:21 -07:00
Michael Yang	d02bbebb11	tools	2024-07-15 15:26:16 -07:00
royjhan	b9f5e16c80	Introduce `/api/embed` endpoint supporting batch embedding (#5127 ) * Initial Batch Embedding * Revert "Initial Batch Embedding" This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29. * Initial Draft * mock up notes * api/embed draft * add server function * check normalization * clean up * normalization * playing around with truncate stuff * Truncation * Truncation * move normalization to go * Integration Test Template * Truncation Integration Tests * Clean up * use float32 * move normalize * move normalize test * refactoring * integration float32 * input handling and handler testing * Refactoring of legacy and new * clear comments * merge conflicts * touches * embedding type 64 * merge conflicts * fix hanging on single string * refactoring * test values * set context length * clean up * testing clean up * testing clean up * remove function closure * Revert "remove function closure" This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787. * remove function closure * remove redundant error check * clean up * more clean up * clean up	2024-07-15 12:14:24 -07:00
Patrick Devine	057d31861e	remove template (#5655 )	2024-07-13 20:56:24 -07:00
jmorganca	f7ee012300	server: prepend system message in chat handler	2024-07-13 15:08:00 -07:00
Jeffrey Morgan	1ed0aa8fea	server: fix `context`, `load_duration` and `total_duration` fields (#5676 ) * server: fix `context`, `load_duration` and `total_duration` fields * Update server/routes.go	2024-07-13 09:25:31 -07:00
Michael Yang	22c5451fc2	fix system prompt (#5662 ) * fix system prompt * execute template when hitting previous roles * fix tests --------- Co-authored-by: jmorganca <jmorganca@gmail.com>	2024-07-12 21:04:44 -07:00
Michael Yang	ebc529cbb3	autodetect stop parameters from template	2024-07-12 16:01:23 -07:00
Michael Yang	57ec6901eb	revert embedded templates to use prompt/response This reverts commit `19753c18c0`. for compat. messages will be added at a later date	2024-07-11 14:49:35 -07:00
Jeffrey Morgan	791650ddef	sched: only error when over-allocating system memory (#5626 )	2024-07-11 00:53:12 -07:00
Michael Yang	41be28096a	add system prompt to first legacy template	2024-07-10 17:03:08 -07:00
Daniel Hiltgen	f4408219e9	Refine scheduler unit tests for reliability This breaks up some of the test scenarios to create a more reliable set of tests, as well as adding a little more coverage.	2024-07-09 16:00:08 -07:00
Michael Yang	6bbbc50f10	Merge pull request #5440 from ollama/mxyng/messages-templates update named templates	2024-07-09 09:36:32 -07:00
Michael Yang	9bbddc37a7	Merge pull request #5126 from ollama/mxyng/messages update message processing	2024-07-09 09:20:44 -07:00
Jeffrey Morgan	e4ff73297d	server: fix model reloads when setting `OLLAMA_NUM_PARALLEL` (#5560 ) * server: fix unneeded model reloads when setting `OLLAMA_NUM_PARALLEL` * remove whitespace change * undo some changes	2024-07-08 22:32:15 -07:00
Jeffrey Morgan	0ee87615c7	sched: don't error if paging to disk on Windows and macOS (#5523 )	2024-07-06 22:01:52 -04:00
Michael Yang	fb6cbc02fb	update named templates	2024-07-05 16:29:32 -07:00
Michael Yang	ac7a842e55	fix model reloading ensure runtime model changes (template, system prompt, messages, options) are captured on model updates without needing to reload the server	2024-07-05 13:17:25 -07:00
Michael Yang	2c3fe1fd97	comments	2024-07-05 13:17:24 -07:00
Michael Yang	269ed6e6a2	update message processing	2024-07-05 13:16:58 -07:00
Daniel Hiltgen	af28b94533	Merge pull request #5469 from dhiltgen/prevent_system_oom Prevent loading models larger than total memory	2024-07-05 08:22:20 -07:00
Anatoli Babenia	0d16eb310e	fix: use `envconfig.ModelsDir` directly (#4821 ) * Co-authored-by: Anatoli Babenia <anatoli@rainforce.org> Co-authored-by: Maas Lalani <maas@lalani.dev>	2024-07-03 15:36:11 -07:00
Daniel Hiltgen	955f2a4e03	Only set default keep_alive on initial model load This change fixes the handling of keep_alive so that if client request omits the setting, we only set this on initial load. Once the model is loaded, if new requests leave this unset, we'll keep whatever keep_alive was there.	2024-07-03 15:29:56 -07:00
Daniel Hiltgen	3c75113e37	Prevent loading models larger than total memory Users may not realize the shiny new model they're trying to load fits on their disk, but can't load into system+GPU memory. Today we crash, but with this fix, we'll give them a better error message before even trying to load it.	2024-07-03 14:47:42 -07:00
Michael Yang	65a5040e09	fix generate template	2024-07-02 16:42:17 -07:00
royjhan	d626b99b54	OpenAI: v1/completions compatibility (#5209 ) * OpenAI v1 models * Refactor Writers * Add Test Co-Authored-By: Attila Kerekes * Credit Co-Author Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com> * Empty List Testing * Use Namespace for Ownedby * Update Test * Add back envconfig * v1/models docs * Use ModelName Parser * Test Names * Remove Docs * Clean Up * Test name Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * Add Middleware for Chat and List * Completions Endpoint * Testing Cleanup * Test with Fatal * Add functionality to chat test * Rename function * float types * type cleanup * cleaning * more cleaning * Extra test cases * merge conflicts * merge conflicts * merge conflicts * merge conflicts * cleaning * cleaning --------- Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com> Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>	2024-07-02 16:01:45 -07:00
Michael Yang	dddb58a38b	Merge pull request #5051 from ollama/mxyng/capabilities add model capabilities	2024-07-02 14:26:07 -07:00
Michael Yang	400056e154	Merge pull request #5420 from ollama/mxyng/insecure-path err on insecure path	2024-07-02 14:03:23 -07:00
royjhan	996bb1b85e	OpenAI: /v1/models and /v1/models/{model} compatibility (#5007 ) * OpenAI v1 models * Refactor Writers * Add Test Co-Authored-By: Attila Kerekes * Credit Co-Author Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com> * Empty List Testing * Use Namespace for Ownedby * Update Test * Add back envconfig * v1/models docs * Use ModelName Parser * Test Names * Remove Docs * Clean Up * Test name Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * Add Middleware for Chat and List * Testing Cleanup * Test with Fatal * Add functionality to chat test * OpenAI: /v1/models/{model} compatibility (#5028) * Retrieve Model * OpenAI Delete Model * Retrieve Middleware * Remove Delete from Branch * Update Test * Middleware Test File * Function name * Cleanup * Test Update * Test Update --------- Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com> Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>	2024-07-02 11:50:56 -07:00
Michael Yang	88bcd79bb9	err on insecure path	2024-07-01 15:55:59 -07:00
Michael Yang	da8e2a0447	use kvs to detect embedding models	2024-07-01 10:47:43 -07:00
Michael Yang	a30915bde1	add capabilities	2024-07-01 10:47:43 -07:00
Michael Yang	58e3fff311	rename templates to template	2024-07-01 10:40:54 -07:00
Michael Yang	3f0b309ad4	remove ManifestV2	2024-07-01 10:40:54 -07:00
Daniel Hiltgen	cff3f44f4a	Fix case for NumCtx	2024-07-01 09:43:59 -07:00
Daniel Hiltgen	3518aaef33	Merge pull request #4218 from dhiltgen/auto_parallel Enable concurrency by default	2024-07-01 08:32:29 -07:00
Michael Yang	123a722a6f	zip: prevent extracting files into parent dirs (#5314 )	2024-06-26 21:38:21 -07:00
Blake Mizerany	cb42e607c5	llm: speed up gguf decoding by a lot (#5246 ) Previously, some costly things were causing the loading of GGUF files and their metadata and tensor information to be VERY slow: * Too many allocations when decoding strings * Hitting disk for each read of each key and value, resulting in a not-okay amount of syscalls/disk I/O. The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro m3. This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null, and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.	2024-06-24 21:47:52 -07:00
Daniel Hiltgen	642cee1342	Sort the ps output Provide consistent ordering for the ps command - longest duration listed first	2024-06-21 15:59:41 -07:00
Daniel Hiltgen	9929751cc8	Disable concurrency for AMD + Windows Until ROCm v6.2 ships, we won't be able to get accurate free memory reporting on windows, which makes automatic concurrency too risky. Users can still opt-in but will need to pay attention to model sizes otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs have accurate VRAM reporting wired up now, so we can turn on concurrency by default.	2024-06-21 15:45:05 -07:00
Daniel Hiltgen	17b7186cd7	Enable concurrency by default This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these by the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small VRAM GPUs so this change also refines the algorithm so that when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.	2024-06-21 15:45:05 -07:00
Michael Yang	e835ef1836	fix: quantization with template	2024-06-21 13:39:25 -07:00
royjhan	fedf71635e	Extend api/show and ollama show to return more model info (#4881 ) * API Show Extended * Initial Draft of Information Co-Authored-By: Patrick Devine <pdevine@sonic.net> * Clean Up * Descriptive arg error messages and other fixes * Second Draft of Show with Projectors Included * Remove Chat Template * Touches * Prevent wrapping from files * Verbose functionality * Docs * Address Feedback * Lint * Resolve Conflicts * Function Name * Tests for api/show model info * Show Test File * Add Projector Test * Clean routes * Projector Check * Move Show Test * Touches * Doc update --------- Co-authored-by: Patrick Devine <pdevine@sonic.net>	2024-06-19 14:19:02 -07:00
royjhan	89c79bec8c	Add ModifiedAt Field to /api/show (#5033 ) * Add Mod Time to Show * Error Handling	2024-06-15 20:53:56 -07:00
Daniel Hiltgen	45cacbaf05	Merge pull request #4517 from dhiltgen/gpu_incremental Enhanced GPU discovery and multi-gpu support with concurrency	2024-06-14 15:35:00 -07:00
Daniel Hiltgen	6f351bf586	review comments and coverage	2024-06-14 14:55:50 -07:00
Daniel Hiltgen	ff4f0cbd1d	Prevent multiple concurrent loads on the same gpus While models are loading, the VRAM metrics are dynamic, so try to load on a GPU that doesn't have a model actively loading, or wait to avoid races that lead to OOMs	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	fc37c192ae	Refine CPU load behavior with system memory visibility	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	434dfe30c5	Reintroduce nvidia nvml library for windows This library will give us the most reliable free VRAM reporting on windows to enable concurrent model scheduling.	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	48702dd149	Harden unload for empty runners	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	5e8ff556cb	Support forced spreading for multi GPU Our default behavior today is to try to fit into a single GPU if possible. Some users would prefer the old behavior of always spreading across multiple GPUs even if the model can fit into one. This exposes that tunable behavior.	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	6fd04ca922	Improve multi-gpu handling at the limit Still not complete, needs some refinement to our prediction to understand the discrete GPUs available space so we can see how many layers fit in each one since we can't split one layer across multiple GPUs we can't treat free space as one logical block	2024-06-14 14:51:40 -07:00
Jeffrey Morgan	dd7c9ebeaf	server: longer timeout in `TestRequests` (#5046 )	2024-06-14 09:48:25 -07:00
Patrick Devine	94618b2365	add OLLAMA_MODELS to envconfig (#5029 )	2024-06-13 12:52:03 -07:00
Jeffrey Morgan	1fd236d177	server: remove jwt decoding error (#5027 )	2024-06-13 11:21:15 -07:00
Michael Yang	c16f8af911	fix: multiple templates when creating from model multiple templates may appear in a model if a model is created from another model that 1) has an autodetected template and 2) defines a custom template	2024-06-12 13:35:49 -07:00
Michael Yang	515f497e6d	fix: skip removing layers that no longer exist	2024-06-10 11:32:19 -07:00
Michael Yang	b27268aaef	add test	2024-06-10 11:32:15 -07:00
Michael Yang	030e765e76	fix create model when template detection errors	2024-06-07 10:51:35 -07:00
Michael Yang	9b6c2e6eb6	detect chat template from KV	2024-06-06 16:03:47 -07:00
royjhan	1a29e9a879	API app/browser access (#4879 ) * API app/browser access * Add tauri (resolves #2291, #4791, #3799, #4388)	2024-06-06 15:19:03 -07:00
royjhan	4bf1da4944	Separate ListResponse and ModelResponse for api/tags vs api/ps (#4842 ) * Remove false time fields * Struct Separation for List and Process * Remove Marshaler	2024-06-06 10:11:45 -07:00
Blake Mizerany	de5beb06b3	server: skip blob verification for already verified blobs	2024-06-05 16:39:11 -07:00
Michael Yang	d61ef8b954	update create handler to use model.Name	2024-06-04 13:28:25 -07:00
Michael Yang	6297f85606	gofmt, goimports	2024-06-04 13:20:24 -07:00
Michael Yang	8ce4032e72	more lint	2024-06-04 11:13:30 -07:00
Michael Yang	e40145a39d	lint	2024-06-04 11:13:30 -07:00
Michael Yang	c895a7d13f	some gocritic	2024-06-04 11:13:30 -07:00
Michael Yang	8ffb51749f	nolintlint	2024-06-04 11:13:30 -07:00
Michael Yang	04f3c12bb7	replace x/exp/slices with slices	2024-06-04 11:13:30 -07:00
Michael Yang	96bc232b43	Merge pull request #4413 from ollama/mxyng/name-check check if name exists before create/pull/copy	2024-05-29 12:06:58 -07:00
Michael Yang	bca7b12284	Merge pull request #3718 from ollama/mxyng/modelname-3 update delete handler to use model.Name	2024-05-29 12:02:07 -07:00
Michael Yang	6adca97f37	Merge pull request #4619 from noxer/patch-1 Fix download retry issue	2024-05-24 17:21:57 -07:00
Patrick Devine	4cc3be3035	Move envconfig and consolidate env vars (#4608 )	2024-05-24 14:57:15 -07:00
Tim Scheuermann	db2ffa79f1	Fix download retry issue	2024-05-24 20:30:42 +02:00
Jeffrey Morgan	38255d2af1	Use flash attention flag for now (#4580 ) * put flash attention behind flag for now * add test * remove print * up timeout for scheduler tests	2024-05-22 21:52:09 -07:00
Sang Park	4434d7f447	Correct typo in error message (#4535 ) The spelling of the term "request" has been corrected, which was previously mistakenly written as "requeset" in the error log message.	2024-05-21 13:39:01 -07:00
Michael Yang	807d092761	fix quantize file types	2024-05-20 15:22:11 -07:00
Michael Yang	f36f1d6be9	tidy intermediate blobs	2024-05-20 15:15:06 -07:00
Michael Yang	3520c0e4d5	cache and reuse intermediate blobs particularly useful for zipfiles and f16s	2024-05-20 13:25:10 -07:00
Patrick Devine	ccdf0b2a44	Move the parser back + handle utf16 files (#4533 )	2024-05-20 11:26:45 -07:00
Daniel Hiltgen	02b31c9dc8	Don't return error on signal exit	2024-05-16 16:25:38 -07:00
Michael Yang	84ed77cbd8	Merge pull request #4436 from ollama/mxyng/done-part return on part done	2024-05-15 17:16:24 -07:00
Patrick Devine	d1692fd3e0	fix the cpu estimatedTotal memory + get the expiry time for loading models (#4461 )	2024-05-15 15:43:16 -07:00
Patrick Devine	f2cf97d6f1	fix typo in modelfile generation (#4439 )	2024-05-14 15:34:29 -07:00
Michael Yang	85a57006d1	check if name exists before create/pull/copy	2024-05-14 14:58:58 -07:00
Michael Yang	c5e892cb3e	update tests	2024-05-14 14:56:31 -07:00
Michael Yang	81fb06f530	more resilient Manifests	2024-05-14 14:08:24 -07:00
Michael Yang	a385382ff5	filepath.Join	2024-05-14 14:08:24 -07:00
Michael Yang	b8772a353f	remove DeleteModel	2024-05-14 14:08:24 -07:00
Michael Yang	c2714fcbfd	routes: use Manifests for ListHandler	2024-05-14 14:08:24 -07:00
Michael Yang	a2fc933fed	update delete handler to use model.Name	2024-05-14 14:08:24 -07:00
Michael Yang	ac145f75ca	return on part done	2024-05-14 13:04:30 -07:00
Ryo Machida	798b107f19	Fixed the API endpoint /api/tags when the model list is empty. (#4424 ) * Fixed the API endpoint /api/tags to return {models: []} instead of {models: null} when the model list is empty. * Update server/routes.go --------- Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>	2024-05-14 11:18:10 -07:00
Daniel Hiltgen	ec231a7923	Remove VRAM convergence check for windows The APIs we query are optimistic on free space, and windows pages VRAM, so we don't have to wait to see reported usage recover on unload	2024-05-14 09:53:46 -07:00
Patrick Devine	7ca71a6b0f	don't abort when an invalid model name is used in /save (#4416 )	2024-05-13 18:48:28 -07:00
Patrick Devine	6845988807	Ollama `ps` command for showing currently loaded models (#4327 )	2024-05-13 17:17:36 -07:00
jmorganca	4ec7445a6f	Revert "use post token" This reverts commit `0fec3525ad`.	2024-05-11 22:19:14 -07:00
Michael Yang	0fec3525ad	use post token	2024-05-11 19:13:16 -07:00
Daniel Hiltgen	824ee5446f	Fix envconfig unit test	2024-05-10 16:49:48 -07:00
Daniel Hiltgen	4142c3ef7c	Always use the sorted list of GPUs Make sure the first GPU has the most free space	2024-05-10 13:53:21 -07:00
Jeffrey Morgan	6602e793c0	Use `--quantize` flag and `quantize` api parameter (#4321 ) * rename `--quantization` to `--quantize` * backwards * Update api/types.go Co-authored-by: Michael Yang <mxyng@pm.me> --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2024-05-10 13:06:13 -07:00
Jeffrey Morgan	bb6fd02298	Don't clamp ctx size in `PredictServerFit` (#4317 ) * dont clamp ctx size in `PredictServerFit` * minimum 4 context * remove context warning	2024-05-10 10:17:12 -07:00
Michael Yang	e03637176d	fix(routes): skip bad manifests	2024-05-10 08:46:11 -07:00
Jeffrey Morgan	302d7fdbf3	prune partial downloads (#4272 )	2024-05-09 16:35:20 -07:00
Daniel Hiltgen	3ae2f441e0	Fix race in shutdown logic Ensure the runners are terminated	2024-05-09 15:54:02 -07:00
Daniel Hiltgen	354ad9254e	Wait for GPU free memory reporting to converge The GPU drivers take a while to update their free memory reporting, so we need to wait until the values converge with what we're expecting before proceeding to start another runner in order to get an accurate picture.	2024-05-09 14:56:01 -07:00
Daniel Hiltgen	8727a9c140	Record more GPU information This cleans up the logging for GPU discovery a bit, and can serve as a foundation to report GPU information in a future UX.	2024-05-09 14:18:14 -07:00
Bruce MacDonald	cfa84b8470	add done_reason to the api (#4235 )	2024-05-09 13:30:14 -07:00
Michael Yang	a7ee84fc31	routes: skip invalid filepaths	2024-05-09 11:23:22 -07:00
Jeffrey Morgan	d5eec16d23	use model defaults for `num_gqa`, `rope_frequency_base` and `rope_frequency_scale` (#1983 )	2024-05-09 09:06:13 -07:00
Bruce MacDonald	cef45feaa4	Add preflight OPTIONS handling and update CORS config (#4086 ) * Add preflight OPTIONS handling and update CORS config - Implement early return with HTTP 204 (No Content) for OPTIONS requests in allowedHostsMiddleware to optimize preflight handling. - Extend CORS configuration to explicitly allow 'Authorization' headers and 'OPTIONS' method when OLLAMA_ORIGINS environment variable is set. * allow auth, content-type, and user-agent headers * Update routes.go	2024-05-08 13:14:00 -07:00
Michael Yang	b25976aeb8	routes: fix show llava models	2024-05-08 12:43:36 -07:00
Michael Yang	88cf154483	Merge pull request #4244 from ollama/mxyng/skip-if-same skip if same quantization	2024-05-07 19:03:37 -07:00
Bruce MacDonald	8cbd3e7510	skip hidden files in list models handler (#4247 )	2024-05-07 19:01:45 -07:00
Michael Yang	eeb695261f	skip if same quantization	2024-05-07 17:44:19 -07:00
Bruce MacDonald	dc9b1111e0	fix invalid destination error message	2024-05-07 17:35:52 -07:00
Michael Yang	ffbd3d173f	Merge pull request #3715 from ollama/mxyng/modelname-2 update list handler to use model.Name	2024-05-07 15:21:39 -07:00
Michael Yang	1e0a669f75	Merge pull request #3682 from ollama/mxyng/quantize-all-the-things quantize any fp16/fp32 model	2024-05-07 15:20:49 -07:00
Michael Yang	548a7df014	update list handler to use model.Name	2024-05-07 09:38:45 -07:00
Jeffrey Morgan	39d9d22ca3	close server on receiving signal (#4213 )	2024-05-06 16:01:37 -07:00
Michael Yang	b2f00aa977	close zip files	2024-05-06 15:27:19 -07:00
Michael Yang	f5e8b207fb	s/DisplayLongest/String/	2024-05-06 15:24:01 -07:00
Michael Yang	d245460362	only quantize language models	2024-05-06 15:24:01 -07:00
Michael Yang	4d0d0fa383	no iterator	2024-05-06 15:24:01 -07:00
Michael Yang	7ffe45734d	rebase	2024-05-06 15:24:01 -07:00
Michael Yang	01811c176a	comments	2024-05-06 15:24:01 -07:00
Michael Yang	a7248f6ea8	update tests	2024-05-06 15:24:01 -07:00
Michael Yang	9685c34509	quantize any fp16/fp32 model - FROM /path/to/{safetensors,pytorch} - FROM /path/to/fp{16,32}.bin - FROM model:fp{16,32}	2024-05-06 15:24:01 -07:00
Daniel Hiltgen	0963c65027	Merge pull request #4208 from dhiltgen/fix_sched_test Fix stale test logic	2024-05-06 14:23:12 -07:00
Jeffrey Morgan	c9f98622b1	Skip scheduling cancelled requests, always reload unloaded runners (#4189 )	2024-05-06 14:22:24 -07:00
Daniel Hiltgen	0a954e5066	Fix stale test logic The model processing was recently changed to be deferred but this test scenario hadn't been adjusted for that change in behavior.	2024-05-06 14:15:37 -07:00
Jeffrey Morgan	dfa2f32ca0	unload in critical section (#4187 )	2024-05-05 17:18:27 -07:00
Daniel Hiltgen	f56aa20014	Centralize server config handling This moves all the env var reading into one central module and logs the loaded config once at startup which should help in troubleshooting user server logs	2024-05-05 16:49:50 -07:00
Jeffrey Morgan	942c979232	allocate a large enough kv cache for all parallel requests (#4162 )	2024-05-05 15:59:32 -07:00
Patrick Devine	2a21363bb7	validate the format of the digest when getting the model path (#4175 )	2024-05-05 11:46:12 -07:00
Daniel Hiltgen	20f6c06569	Make maximum pending request configurable This also bumps up the default to be 50 queued requests instead of 10.	2024-05-04 21:00:52 -07:00
Michael Yang	b7a87a22b6	Merge pull request #4059 from ollama/mxyng/parser-2 rename parser to model/file	2024-05-03 13:01:22 -07:00
Daniel Hiltgen	9a32c514cb	Soften timeouts on sched unit tests This gives us more headroom on the scheduler tests to tamp down some flakes.	2024-05-03 09:08:33 -07:00
Michael Yang	e9ae607ece	Merge pull request #3892 from ollama/mxyng/parser refactor modelfile parser	2024-05-02 17:04:47 -07:00
Michael Yang	5b806d8d24	Merge pull request #4089 from ollama/mxyng/target-invalid server: destination invalid	2024-05-01 12:46:35 -07:00
Michael Yang	45b6a12e45	server: target invalid	2024-05-01 12:40:45 -07:00
Mark Ward	63c763685f	log when the waiting for the process to stop to help debug when other tasks execute during this wait. expire timer clear the timer reference because it will not be reused. close will clean up expireTimer if calling code has not already done this.	2024-05-01 18:51:10 +00:00
Mark Ward	f4a73d57a4	fix runner expire during active use. Clearing the expire timer as it is used. Allowing the finish to assign an expire timer so that the runner will expire after no use.	2024-05-01 18:51:10 +00:00
Michael Yang	119589fcb3	rename parser to model/file	2024-05-01 09:53:50 -07:00
Michael Yang	9cf0f2e973	use parser.Format instead of templating modelfile	2024-05-01 09:52:54 -07:00
Michael Yang	c0a00f68ae	refactor modelfile parser	2024-05-01 09:52:54 -07:00
Bruce MacDonald	0a7fdbe533	prompt to display and add local ollama keys to account (#3717 ) - return descriptive error messages when unauthorized to create blob or push a model - display the local public key associated with the request that was denied	2024-04-30 11:02:08 -07:00
Jeffrey Morgan	586672f490	fix copying model to itself (#4019 )	2024-04-28 23:47:49 -04:00
Daniel Hiltgen	d6e3b64582	Fix concurrency for CPU mode Prior refactoring passes accidentally removed the logic to bypass VRAM checks for CPU loads. This adds that back, along with test coverage. This also fixes loaded map access in the unit test to be behind the mutex which was likely the cause of various flakes in the tests.	2024-04-28 13:42:39 -07:00
Jeffrey Morgan	bb31def011	return code `499` when user cancels request while a model is loading (#3955 )	2024-04-26 17:38:29 -04:00
Blake Mizerany	37f9c8ad99	types/model: overhaul Name and Digest types (#3924 )	2024-04-26 13:08:32 -07:00
Daniel Hiltgen	9b5a3c5991	Merge pull request #3914 from dhiltgen/mac_perf Improve mac parallel performance	2024-04-25 16:28:31 -07:00
Jeffrey Morgan	00b0699c75	Reload model if `num_gpu` changes (#3920 ) * reload model if `num_gpu` changes * dont reload on -1 * fix tests	2024-04-25 19:02:40 -04:00
Daniel Hiltgen	b123be5b71	Adjust context size for parallelism	2024-04-25 13:58:54 -07:00
Daniel Hiltgen	f503a848c2	Merge pull request #3895 from brycereitano/shiftloading Move ggml loading to when attempting to fit	2024-04-25 09:24:08 -07:00
Bryce Reitano	36a6daccab	Restructure loading conditional chain	2024-04-24 17:37:03 -06:00
Bryce Reitano	ceb0e26e5e	Provide variable ggml for TestLoad	2024-04-24 17:19:55 -06:00
Bryce Reitano	284e02bed0	Move ggml loading to when we attempt fitting	2024-04-24 17:17:24 -06:00
Michael Yang	592dae31c8	update copy to use model.Name	2024-04-24 15:54:54 -07:00
Daniel Hiltgen	d8851cb7a0	Harden sched TestLoad Give the go routine a moment to deliver the expired event	2024-04-23 16:14:47 -07:00

925 Commits