mirror of https://github.com/zebrajr/ollama.git synced 2025-12-06 00:19:51 +01:00

History

Jesse Gross a8d9c2648e llamarunner: Record the time for all batches during prompt processing Currently, we only record the time for the last batch when processing the prompt. This results in unrealistically high numbers for the old llama runner. Before: total duration: 31.273112939s load duration: 4.97054657s prompt eval count: 32768 token(s) prompt eval duration: 235.137439ms prompt eval rate: 139356.80 tokens/s eval count: 1873 token(s) eval duration: 18.173182374s eval rate: 103.06 tokens/s After: total duration: 30.024798033s load duration: 4.758588663s prompt eval count: 32768 token(s) prompt eval duration: 7.779621548s prompt eval rate: 4212.03 tokens/s eval count: 1769 token(s) eval duration: 17.148014223s eval rate: 103.16 tokens/s		2025-10-22 13:52:58 -07:00
..
common	chore: fix some inconsistent function name in comment	2025-08-13 09:50:27 -07:00
llamarunner	llamarunner: Record the time for all batches during prompt processing	2025-10-22 13:52:58 -07:00
ollamarunner	runner: always truncate embeddings requests (#12714 )	2025-10-20 16:47:05 -07:00
README.md	Runner for Ollama engine	2025-02-13 17:09:26 -08:00
runner.go	Runner for Ollama engine	2025-02-13 17:09:26 -08:00

`runner`

Note: this is a work in progress

A minimial runner for loading a model and running inference via a http web server.

./runner -model <model binary>

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding