docs: temporarily restore api.md and cleanup docs paths (#12818)

Jeffrey Morgan, 2025-10-28 23:25:48 -07:00 (committed by GitHub)
parent a342160803
commit 93e45f0f0d
4 changed files with 1848 additions and 97 deletions

File diff suppressed because it is too large.


```diff
@@ -1,5 +1,5 @@
 ---
-title: "Introduction"
+title: Introduction
 ---
 
 Ollama's API allows you to run and interact with models programatically.
```


@@ -1,71 +0,0 @@
---
title: Benchmark
---
Go benchmark tests that measure end-to-end performance of a running Ollama server. Run these tests to evaluate model inference performance on your hardware and measure the impact of code changes.
## When to use
Run these benchmarks when:
- Making changes to the model inference engine
- Modifying model loading/unloading logic
- Changing prompt processing or token generation code
- Implementing a new model architecture
- Testing performance across different hardware setups
## Prerequisites
- Ollama server running locally with `ollama serve` on `127.0.0.1:11434`
## Usage and Examples
<Note>
All commands must be run from the root directory of the Ollama project.
</Note>
Basic syntax:
```bash
go test -bench=. ./benchmark/... -m $MODEL_NAME
```
Required flags:
- `-bench=.`: Run all benchmarks
- `-m`: Model name to benchmark
Optional flags:
- `-count N`: Number of times to run the benchmark (useful for statistical analysis)
- `-timeout T`: Maximum time for the benchmark to run (e.g. "10m" for 10 minutes)
Common usage patterns:
Single benchmark run with a model specified:
```bash
go test -bench=. ./benchmark/... -m llama3.3
```
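The optional flags compose with the required ones. A minimal sketch combining them for repeated runs (the model name and flag values here are illustrative, not prescribed by the doc):
```bash
# Run the full suite five times against llama3.3, allowing up to
# 30 minutes total, so per-metric variance can be inspected across runs
go test -bench=. ./benchmark/... -m llama3.3 -count 5 -timeout 30m
```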
## Output metrics
The benchmark reports several key metrics:
- `gen_tok/s`: Generated tokens per second
- `prompt_tok/s`: Prompt processing tokens per second
- `ttft_ms`: Time to first token in milliseconds
- `load_ms`: Model load time in milliseconds
- `gen_tokens`: Total tokens generated
- `prompt_tokens`: Total prompt tokens processed
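When benchmarks are run with `-count`, these metrics can be compared between two code revisions. A sketch of one such workflow, assuming Go's `benchstat` tool (`golang.org/x/perf/cmd/benchstat`, not part of this repository) is installed and using illustrative file names:
```bash
# Capture repeated runs before and after a change, then compare
go test -bench=. ./benchmark/... -m llama3.3 -count 5 | tee before.txt
# ...apply the change and restart the server...
go test -bench=. ./benchmark/... -m llama3.3 -count 5 | tee after.txt
benchstat before.txt after.txt
```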
Each benchmark runs two scenarios:
- Cold start: Model is loaded from disk for each test
- Warm start: Model is pre-loaded in memory
Three prompt lengths are tested for each scenario:
- Short prompt (100 tokens)
- Medium prompt (500 tokens)
- Long prompt (1000 tokens)


```diff
@@ -58,7 +58,7 @@
   "redirects": [
     {
       "source": "/openai",
-      "destination": "/api/openai"
+      "destination": "/api/openai-compatibility"
     },
     {
       "source": "/api/openai",
@@ -130,7 +130,7 @@
     {
       "group": "API Reference",
       "pages": [
-        "/api",
+        "/api/index",
         "/api/authentication",
         "/api/streaming",
         "/api/usage",
```