docs: temporarily restore api.md and cleanup docs paths (#12818)
Parent: a342160803
Commit: 93e45f0f0d

docs/api.md: 1868 lines changed (file diff suppressed because it is too large)
The visible hunks follow. First, a front-matter title tweak on the API introduction page:

```diff
@@ -1,5 +1,5 @@
 ---
-title: "Introduction"
+title: Introduction
 ---
 
 Ollama's API allows you to run and interact with models programmatically.
```
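The restored `docs/api.md` documents this HTTP API. For reference, a minimal call against a locally running server looks like the following (this uses Ollama's standard `/api/generate` endpoint; the model name is illustrative):

```bash
# Ask a locally running Ollama server for a completion.
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'
```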
The next hunk (`@@ -1,71 +0,0 @@`) deletes an entire 71-line benchmark doc. Its content, reproduced for reference:

---
title: Benchmark
---

Go benchmark tests measure the end-to-end performance of a running Ollama server. Run these tests to evaluate model inference performance on your hardware and to measure the impact of code changes.
## When to use

Run these benchmarks when:

- Making changes to the model inference engine
- Modifying model loading/unloading logic
- Changing prompt processing or token generation code
- Implementing a new model architecture
- Testing performance across different hardware setups
## Prerequisites

- Ollama server running locally with `ollama serve` on `127.0.0.1:11434`
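Before benchmarking, it can help to confirm the server is reachable; the server's root endpoint answers with a plain status string. A quick sketch, assuming the default host and port:

```bash
# Start the server if it is not already running, then probe it.
ollama serve &
curl http://127.0.0.1:11434   # expected reply: "Ollama is running"
```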
## Usage and Examples

<Note>
All commands must be run from the root directory of the Ollama project.
</Note>

Basic syntax:

```bash
go test -bench=. ./benchmark/... -m $MODEL_NAME
```
Required flags:

- `-bench=.`: Run all benchmarks
- `-m`: Model name to benchmark

Optional flags:

- `-count N`: Number of times to run the benchmark (useful for statistical analysis)
- `-timeout T`: Maximum time for the benchmark to run (e.g. "10m" for 10 minutes)
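Note that `-bench` takes a regular expression, so `-bench=.` simply matches every benchmark; a subset can be selected by name. A sketch (the name fragment here is hypothetical; actual benchmark names live under `./benchmark/...`):

```bash
# Run only benchmarks whose names match "Warm" (hypothetical name fragment).
go test -bench='Warm' ./benchmark/... -m llama3.3
```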
Common usage patterns:

Single benchmark run with a model specified:

```bash
go test -bench=. ./benchmark/... -m llama3.3
```
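The optional flags combine with this pattern for statistical work; a sketch of a repeated run (model name and flag values illustrative):

```bash
# Five repeated runs with a 30-minute ceiling; the saved output can be
# compared across code changes with benchstat (golang.org/x/perf).
go test -bench=. ./benchmark/... -m llama3.3 -count 5 -timeout 30m | tee bench.txt
```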
## Output metrics

The benchmark reports several key metrics:

- `gen_tok/s`: Generated tokens per second
- `prompt_tok/s`: Prompt processing tokens per second
- `ttft_ms`: Time to first token in milliseconds
- `load_ms`: Model load time in milliseconds
- `gen_tokens`: Total tokens generated
- `prompt_tokens`: Total prompt tokens processed

Each benchmark runs two scenarios:

- Cold start: Model is loaded from disk for each test
- Warm start: Model is pre-loaded in memory

Three prompt lengths are tested for each scenario:

- Short prompt (100 tokens)
- Medium prompt (500 tokens)
- Long prompt (1000 tokens)
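These metric names are custom units attached to Go benchmark results. The following is not Ollama's implementation, only a minimal sketch of how such units are typically emitted via the standard `(*testing.B).ReportMetric` mechanism; the token counts are stand-ins:

```go
package benchmark_test

import (
	"testing"
	"time"
)

// BenchmarkGenerateSketch shows the reporting mechanism only; a real
// benchmark would run model inference where the comment indicates.
func BenchmarkGenerateSketch(b *testing.B) {
	totalTokens := 0
	start := time.Now()
	for i := 0; i < b.N; i++ {
		// ... issue a generation request to the running server here ...
		totalTokens += 256 // stand-in for tokens produced per iteration
	}
	elapsed := time.Since(start)
	if elapsed <= 0 {
		elapsed = time.Nanosecond // guard the division in this stub
	}
	// Custom units appear as extra columns in `go test -bench` output.
	b.ReportMetric(float64(totalTokens)/elapsed.Seconds(), "gen_tok/s")
	b.ReportMetric(float64(totalTokens)/float64(b.N), "gen_tokens")
}
```

Each `ReportMetric` unit shows up as an additional column next to the default `ns/op` in the benchmark output.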
Back in the docs site configuration, the `/openai` redirect now points at the renamed page:

```diff
@@ -58,7 +58,7 @@
   "redirects": [
     {
       "source": "/openai",
-      "destination": "/api/openai"
+      "destination": "/api/openai-compatibility"
     },
     {
       "source": "/api/openai",
```
And in the navigation, the API Reference group's `/api` entry becomes `/api/index`:

```diff
@@ -130,7 +130,7 @@
     {
       "group": "API Reference",
       "pages": [
-        "/api",
+        "/api/index",
         "/api/authentication",
         "/api/streaming",
         "/api/usage",
```