Elegant local language model inference with ollama
We investigate the internals of ollama, a widely-adored local LLM inference management platform built atop ggml.
Introduction
Getting started with ollama is simple: clone the repository, follow the development build instructions, and run the following two commands to spin up a server and set up inference on Qwen’s latest qwq reasoning model:
go run . serve & # Start the ollama server in the background
go run . run qwq # For example, run Qwen `qwq` (~20GB)
Upon doing so, you’ll notice that the server prints a number of interesting logs (most notably, plenty of output from the ggml library, an early indicator that ggml is doing the heavy lifting for our inference calls); we’ll come back to these later. The client displays a UI likely familiar to those who have worked with Docker before, indicating that Ollama stores its models in layers:
pulling manifest
pulling c62ccde5630c... 100% ▕████████████████████████▏ 19 GB
pulling 41190096a061... 100% ▕████████████████████████▏ 1.2 KB
pulling d18a5cc71b84... 100% ▕████████████████████████▏ 11 KB
pulling 4afe5edfdb51... 100% ▕████████████████████████▏ 77 B
pulling 6a8faa2fb8b0... 100% ▕████████████████████████▏ 488 B
verifying sha256 digest
writing manifest
success
After the model is fetched, we can run inference in a straightforward manner (assuming we have enough RAM):
>>> How many times does the letter 'r' appear in the word "strawberry"?
<think>
...
… and we’re off, no extra work needed: that’s a pretty neat user experience.
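For readers who prefer to poke at the HTTP API directly, the same generation can be driven with a few lines of Go against the server we just started. This is a minimal sketch, assuming the default listen address of localhost:11434; the /api/generate endpoint streams back newline-delimited JSON chunks.
package main

import (
    "bufio"
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    body, _ := json.Marshal(map[string]any{
        "model":  "qwq",
        "prompt": `How many times does the letter 'r' appear in the word "strawberry"?`,
    })
    resp, err := http.Post("http://localhost:11434/api/generate",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The server streams newline-delimited JSON objects; print each chunk's text.
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk struct {
            Response string `json:"response"`
            Done     bool   `json:"done"`
        }
        if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
            panic(err)
        }
        fmt.Print(chunk.Response)
        if chunk.Done {
            break
        }
    }
}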
In this post, we’ll investigate some of the internals of Ollama, from the model registry to the inference forward pass and key-value cache. At a high level, the project is written in Go, and aims to provide a proper API, command-line interface, and model registry layer atop the ggml on-device model inference library. While the majority of this logic is orthogonal to the actual forward-pass implementation in ggml (if you’re curious about that, see my earlier post on ggml internals), it’s still instructive to walk through how the rest of the implementation provides a neat, condensed, and usable LLM serving workflow.
The Ollama Model Registry
We’ll begin with the Ollama Modelfile and registry, the first key feature the library adds atop a typical language model inference library. Here, Ollama takes heavy inspiration from Docker, in both its Modelfile definition and its registry implementation.
The Modelfile
A sample Ollama Modelfile can be written as follows:
FROM llama3.2
PARAMETER temperature 1
PARAMETER num_ctx 4096
SYSTEM You are Mario from super mario bros, acting as an assistant.
In analogy to a typical Dockerfile, the base model (defined in the FROM instruction) plays the role of a base image, and subsequent commands that augment the model in different ways (e.g. parameters used for inference, the system prompt, adapters) are added in separate layers. This allows for re-use of common components across models; for example, multiple Modelfiles can be constructed from the llama3.2 base, and the base model will only be downloaded once (in an identical fashion to the role of the Docker layer cache).
For full documentation on the parameters that can be included in a Modelfile, visit the Ollama documentation on the subject. The parser itself is a single Go file, if you’re curious about its mechanics.
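To make the instruction/argument structure concrete, here is a minimal, self-contained sketch of a Modelfile-style parser in Go. This is not Ollama’s actual parser (which also handles quoting, multi-line strings, and comments more robustly); it simply illustrates the shape of the data the real parser produces.
package main

import (
    "bufio"
    "fmt"
    "strings"
)

// command is a single parsed Modelfile instruction, e.g. {"FROM", "llama3.2"}.
type command struct {
    Name string
    Args string
}

// parseModelfile splits a Modelfile into instruction/argument pairs.
func parseModelfile(text string) []command {
    var cmds []command
    sc := bufio.NewScanner(strings.NewReader(text))
    for sc.Scan() {
        line := strings.TrimSpace(sc.Text())
        if line == "" || strings.HasPrefix(line, "#") {
            continue // skip blank lines and comments
        }
        name, args, _ := strings.Cut(line, " ")
        cmds = append(cmds, command{Name: strings.ToUpper(name), Args: args})
    }
    return cmds
}

func main() {
    modelfile := `FROM llama3.2
PARAMETER temperature 1
PARAMETER num_ctx 4096
SYSTEM You are Mario from super mario bros, acting as an assistant.`
    for _, c := range parseModelfile(modelfile) {
        fmt.Printf("%-9s %s\n", c.Name, c.Args)
    }
}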
Creating model images from Modelfiles
A natural question is how the “layers” of a Modelfile are physically represented. The answer is in two parts: plain-text layers are stored as JSON (e.g. parameters, messages, etc.) or text (e.g. LICENSE), and model data is stored in GGUF or safetensors format.
The model creation handler (the handler for /api/create) sheds more light on how exactly these layers are interpreted by Ollama. A creation request sent by the client to the server is typed as
// CreateRequest is the request passed to [Client.Create]. It's a
// parsed representation of the user-provided Modelfile.
type CreateRequest struct {
    Model      string            `json:"model"`
    Stream     *bool             `json:"stream,omitempty"`
    Quantize   string            `json:"quantize,omitempty"`
    From       string            `json:"from,omitempty"`
    Files      map[string]string `json:"files,omitempty"`
    Adapters   map[string]string `json:"adapters,omitempty"`
    Template   string            `json:"template,omitempty"`
    License    any               `json:"license,omitempty"`
    System     string            `json:"system,omitempty"`
    Parameters map[string]any    `json:"parameters,omitempty"`
    Messages   []Message         `json:"messages,omitempty"`
}
This is as we’d expect: the model name and relevant options/parameters are passed as structured objects from the client (which parses the raw Modelfile) to the server, and the server is expected to pull the relevant binary blobs from the registry and handle the actual model creation. The creation request handler is located here.
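To make this concrete, here is a hedged sketch of what a create looks like at the HTTP level, assuming the default localhost:11434 address and a model derived from llama3.2 (mirroring the Mario Modelfile above). The field names follow the CreateRequest struct; the server resolves from, pulls any missing blobs, and streams back status updates as newline-delimited JSON.
package main

import (
    "bytes"
    "encoding/json"
    "io"
    "net/http"
    "os"
)

func main() {
    payload, _ := json.Marshal(map[string]any{
        "model":      "mario",    // name of the new model image
        "from":       "llama3.2", // FROM llama3.2
        "system":     "You are Mario from super mario bros, acting as an assistant.",
        "parameters": map[string]any{"temperature": 1, "num_ctx": 4096},
    })
    resp, err := http.Post("http://localhost:11434/api/create",
        "application/json", bytes.NewReader(payload))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    io.Copy(os.Stdout, resp.Body) // status updates, one JSON object per line
}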
If a FROM statement is part of the request (the model is being fetched from a name, e.g. on the Ollama registry), the model and manifest are pulled from the registry and written to local disk, in a cache directory. All models on the Ollama registry are stored in GGUF format, which we’ll focus on here; it’s also possible to import from safetensors, with details here for the interested reader. A useful depiction of the GGUF file format is below; the documentation in ggml does an excellent job describing its semantics.
[Figure: layout of the GGUF file format]
After pulling the model’s GGUF binary blobs, the blobs are parsed (see the decode call here, which is initially parsed here and parsed in detail here) and used to construct an array of base layers, typed as an array of pointers to layerGGML (which mirrors the metadata stored on disk). Adapter layers are considered separately from base layers, but are typed the same way and merged with the base layers.
type layerGGML struct {
    Layer      // map[string]*Tensor, where Tensor holds the GGML tensor metadata
    *ggml.GGML // a pointer to the full model metadata, which points to all tensors
}
Note that no model tensor data is loaded into CPU DRAM at this point in time! All in-memory objects represent metadata corresponding to the GGUF file: the actual tensor data will be read and moved to device when the model is loaded at inference time.
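To see why this is cheap, consider how little needs to be read up front: GGUF begins with a small fixed header (magic, version, and the tensor/metadata counts), followed by metadata key/value pairs and tensor descriptors, with the tensor data itself sitting at the end of the file, untouched. The sketch below (not Ollama’s decoder) reads just that header, assuming a GGUF v2+ file with little-endian encoding.
package main

import (
    "encoding/binary"
    "fmt"
    "os"
)

func main() {
    f, err := os.Open(os.Args[1]) // path to a .gguf blob
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var hdr struct {
        Magic      [4]byte // "GGUF"
        Version    uint32  // 3 for current files
        NumTensors uint64  // tensor-info records that follow the KV section
        NumKV      uint64  // metadata key/value pairs
    }
    if err := binary.Read(f, binary.LittleEndian, &hdr); err != nil {
        panic(err)
    }
    fmt.Printf("magic=%s version=%d tensors=%d metadata kv=%d\n",
        string(hdr.Magic[:]), hdr.Version, hdr.NumTensors, hdr.NumKV)
}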
Finally, a full model is created from the []*layerGGML metadata and the template, system prompt, and hyperparameters. Interestingly, layer quantization also happens at this stage (and is used to construct new layers, which replace the old ones). A final manifest for the requested Modelfile is written with this information (under a directory keyed by the hash of the Modelfile, i.e. the Modelfile digest); the manifest is later loaded when a model is fetched for inference, to avoid parsing the GGUF multiple times.
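As an illustration of what ends up on disk, the sketch below reads one of these manifests back. The default cache location (~/.ollama/models) and the exact JSON field names are assumptions here, based on the Docker/OCI manifest layout that Ollama mirrors; treat it as illustrative rather than a stable schema.
package main

import (
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
)

type manifest struct {
    SchemaVersion int `json:"schemaVersion"`
    Layers        []struct {
        MediaType string `json:"mediaType"` // e.g. an "application/vnd.ollama.image.*" type
        Digest    string `json:"digest"`    // sha256 of the blob (GGUF data, params, template, ...)
        Size      int64  `json:"size"`
    } `json:"layers"`
}

func main() {
    home, _ := os.UserHomeDir()
    // Assumed path layout: manifests/<registry>/<namespace>/<model>/<tag>
    path := filepath.Join(home, ".ollama", "models", "manifests",
        "registry.ollama.ai", "library", "qwq", "latest")
    raw, err := os.ReadFile(path)
    if err != nil {
        panic(err)
    }
    var m manifest
    if err := json.Unmarshal(raw, &m); err != nil {
        panic(err)
    }
    for _, l := range m.Layers {
        fmt.Printf("%-45s %-12d %s\n", l.MediaType, l.Size, l.Digest)
    }
}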
We now have a rough sense of what a Modelfile really is: a declarative representation of a model and its prompt, represented as a collection of layers stored either as text (for the prompt, hyperparameters, and other metadata) or in GGUF format (for model tensor data). Ollama parses Modelfiles to construct manifests that point to GGUF model tensor data files pulled to local disk, which are ultimately used for inference when the generation APIs are called.
Support for a model registry
Modelfiles thus perfectly define the concept of a “model image”, which can be
used analogously to Docker images (e.g. to share, remix, and derive from).
Any parsed Modelfile is associated with (a) a Manifest and (b) a GGUF-backed
tensor data representation, which can then be fetched and used in a FROM
statement by a second Modelfile. One can also imagine tagging these “model
image”s just as Docker images are tagged, so users can pull model files at
different versions.
The Ollama service provides the following APIs to facilitate this basic idea of a model registry:
// Create models
r.POST("/api/create", s.CreateHandler)
r.POST("/api/blobs/:digest", s.CreateBlobHandler)
r.HEAD("/api/blobs/:digest", s.HeadBlobHandler)
r.POST("/api/copy", s.CopyHandler)
// Read, update models in registry
r.POST("/api/pull", s.PullHandler)
r.POST("/api/push", s.PushHandler)
r.HEAD("/api/tags", s.ListHandler)
r.GET("/api/tags", s.ListHandler)
r.POST("/api/show", s.ShowHandler)
r.DELETE("/api/delete", s.DeleteHandler)
With time, I’m sure more Docker-like features will continue to be added.
The Ollama Architecture
With knowledge of how Ollama interprets Modelfiles, we’re ready to understand its client/server architecture and implementation of model inference. We’ll cover the design of four key components: client, server, scheduler, and model runner.
Client
While the Ollama client-side API is used by multiple frontend interfaces, we’ll focus on the command-line client implementation here. Client state is kept very light: a base URL and an *http.Client suffice:
type Client struct {
    base *url.URL
    http *http.Client
}
Two core methods are implemented on the Client type:
// Batch:
func (c *Client) do(
    ctx context.Context, method, path string, reqData, respData any) error

// Streaming:
func (c *Client) stream(
    ctx context.Context, method, path string, data any, fn func([]byte) error) error
These methods are similar in their implementation, with one notable difference: while do reads the entire response body and returns data to the user, stream creates a new buffer and scans chunks of the response until completion.
// Note: error handling in both examples has been elided.

// do(...):
respObj, err := c.http.Do(request)
defer respObj.Body.Close()
respBody, err := io.ReadAll(respObj.Body)

// stream(...):
scanner := bufio.NewScanner(response.Body)
scanBuf := make([]byte, 0, maxBufferSize)
scanner.Buffer(scanBuf, maxBufferSize)
for scanner.Scan() {
    var errorResponse struct {
        Error string `json:"error,omitempty"`
    }
    // (decoding of the chunk into errorResponse, to surface server-side
    // errors, has been elided)
    bts := scanner.Bytes()
    if err := fn(bts); err != nil {
        return err
    }
}
Individual method handlers are implemented in batch or streaming mode depending on the contract with the server-side implementation. For more information, see the implementation and types.
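As a concrete example, a streaming method such as Generate is a thin wrapper around stream: each received chunk is one JSON-encoded response object, which is decoded and handed to a user-supplied callback. The snippet below is a simplified paraphrase (GenerateRequest, GenerateResponse, and GenerateResponseFunc are types from the api package), not the verbatim implementation.
// Roughly how a streaming client method is built atop stream() (simplified):
func (c *Client) Generate(ctx context.Context, req *GenerateRequest, fn GenerateResponseFunc) error {
    return c.stream(ctx, http.MethodPost, "/api/generate", req, func(bts []byte) error {
        // Each chunk is one JSON-encoded GenerateResponse; decode it and
        // hand it to the caller's callback.
        var resp GenerateResponse
        if err := json.Unmarshal(bts, &resp); err != nil {
            return err
        }
        return fn(resp)
    })
}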
Server and Scheduler
The server performs routing, scheduling, and handoff to the Ollama runner for model inference. Alongside the model registry APIs listed above, the server implements the following inference APIs, along with some OpenAI API compatibility functionality.
// Inference
r.POST("/api/generate", s.GenerateHandler)
r.POST("/api/chat", s.ChatHandler)
r.POST("/api/embed", s.EmbedHandler)
r.POST("/api/embeddings", s.EmbeddingsHandler)
The server is created here (from the command line here). Its state is also kept very light:
type Server struct {
    addr  net.Addr
    sched *Scheduler
}
Alongside route generation, the brunt of the server’s work is offloaded to a scheduler, which processes requests to load/unload models and run inference. The scheduler maintains state for all queued requests and for the runners of loaded models (which are created by newServerFn).
type Scheduler struct {
    pendingReqCh  chan *LlmRequest
    finishedReqCh chan *LlmRequest
    expiredCh     chan *runnerRef
    unloadedCh    chan interface{}

    loaded   map[string]*runnerRef
    loadedMu sync.Mutex

    loadFn func(
        req *LlmRequest, f *ggml.GGML, gpus discover.GpuInfoList,
        numParallel int)
    newServerFn func(
        gpus discover.GpuInfoList, model string, f *ggml.GGML,
        adapters []string, projectors []string, opts api.Options,
        numParallel int) (llm.LlamaServer, error)
    getGpuFn func() discover.GpuInfoList
    getCpuFn func() discover.GpuInfoList

    reschedDelay time.Duration
}
Scheduler logic can be somewhat involved; to break it down, let’s walk through an invocation of /api/generate. When the server is first started (before any API calls are serviced), it initializes and starts the scheduler, which runs two goroutines that live for the lifetime of the server and process queued/completed requests on the scheduler’s channels.
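The overall shape of those two goroutines is a standard channel-driven worker loop. The sketch below is a heavily stripped-down illustration of that shape, not Ollama’s actual scheduler logic (which also handles GPU fitting, eviction, and reference counting); the helper methods loadOrReuseRunner, releaseRunner, and unloadRunner are hypothetical.
// A stripped-down sketch of the scheduler's long-lived goroutines:
func (s *Scheduler) Run(ctx context.Context) {
    // Goroutine 1: admit pending requests, loading (or reusing) a runner.
    go func() {
        for {
            select {
            case <-ctx.Done():
                return
            case req := <-s.pendingReqCh:
                runner := s.loadOrReuseRunner(req) // hypothetical helper
                req.successCh <- runner
            }
        }
    }()
    // Goroutine 2: bookkeeping for completed requests and expired runners.
    go func() {
        for {
            select {
            case <-ctx.Done():
                return
            case req := <-s.finishedReqCh:
                s.releaseRunner(req) // hypothetical helper
            case runner := <-s.expiredCh:
                s.unloadRunner(runner) // hypothetical helper
            }
        }
    }()
}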
When an API request is received, the HTTP routing layer first calls the server handler here, with a request that includes the model name, prompt, and other optional data. The handler fetches the model Manifest and loads model metadata; it also constructs the final prompt from request metadata.
After validating inputs, capabilities, and the prompt, the server schedules a Runner by submitting a request to the scheduler:
req := &LlmRequest{
    ctx:             c,
    model:           model,
    opts:            opts,
    sessionDuration: sessionDuration,
    successCh:       make(chan *runnerRef),
    errCh:           make(chan error, 1),
}

select {
case s.pendingReqCh <- req:
default:
    req.errCh <- ErrMaxQueue
}

// The server selects against the first of
// these two channels to receive a response:
return req.successCh, req.errCh
The server blocks on a successful response or error from the scheduler: as is natural in Go, doing so does not prevent other goroutines from proceeding (e.g. accepting new server requests, or performing model inference).
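A hedged sketch of that server-side wait (simplified; the real handler also honors cancellation of the incoming request's context):
select {
case runner := <-req.successCh:
    // proceed to issue a Completion request against this runner
    _ = runner
case err := <-req.errCh:
    // surface the scheduling error (e.g. ErrMaxQueue) to the API caller
    _ = err
case <-ctx.Done():
    // the client went away; abandon the request
}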
When the scheduler receives a request on its pending channel, the Go runtime marks the goroutine that processes pending requests as runnable, and (when assigned to a CPU core) it executes logic to identify whether any existing model runners need to be expired, and assigns resources (e.g. GPUs). This logic is encapsulated here. After resources are assigned, the scheduler calls its newServerFn on the model, which is implemented here. This method is responsible for launching a runner, which owns the actual model inference execution and output generation process. When the server receives a handle to a runner, it makes a completion request and streams the token-by-token response from the runner.
// Heavily elided; note that c is the client-side request context.
r, m, opts, err := s.scheduleRunner(...) // Parameters are unimportant.
if err := r.Completion(c.Request.Context(), llm.CompletionRequest{
    Prompt:  prompt,
    Images:  images,
    Format:  req.Format,
    Options: opts,
}, func(cr llm.CompletionResponse) {
    res := api.GenerateResponse{
        Model:      req.Model,
        CreatedAt:  time.Now().UTC(),
        Response:   cr.Content,
        Done:       cr.Done,
        DoneReason: cr.DoneReason,
        Metrics: api.Metrics{
            PromptEvalCount:    cr.PromptEvalCount,
            PromptEvalDuration: cr.PromptEvalDuration,
            EvalCount:          cr.EvalCount,
            EvalDuration:       cr.EvalDuration,
        },
    }
    // ... res is streamed back to the client (elided)
}); err != nil {
    // ... error handling elided
}
Runner
We’ll last discuss the implementation of the runner, a short-lived server that
communicates with the main Ollama server to run inference and stream responses
back to the user. If you watch ps | grep ollama
while an inference call is
running, you’ll see such a process appear:
90217 ttys000 0:01.86 <ollama_path>/ollama runner --model <model_path> <args>
Note that one runner server is constructed per model; the information the Ollama server stores for each runner is shown below.
type runnerRef struct {
    refMu    sync.Mutex
    refCount uint // prevent unloading if > 0

    llama   llm.LlamaServer
    loading bool                 // True only during initial load, then false forever
    gpus    discover.GpuInfoList // Recorded at time of provisioning

    estimatedVRAM  uint64
    estimatedTotal uint64

    sessionDuration time.Duration
    expireTimer     *time.Timer
    expiresAt       time.Time

    model       *Model
    modelPath   string
    numParallel int
    *api.Options
}
The llama field is an object that defines a client-side interface to the runner server. It is currently only implemented by the llmServer type.
type LlamaServer interface {
    Ping(ctx context.Context) error
    WaitUntilRunning(ctx context.Context) error
    Completion(ctx context.Context, req CompletionRequest, fn func(CompletionResponse)) error
    Embedding(ctx context.Context, input string) ([]float32, error)
    Tokenize(ctx context.Context, content string) ([]int, error)
    Detokenize(ctx context.Context, tokens []int) (string, error)
    Close() error
    EstimatedVRAM() uint64 // Total VRAM across all GPUs
    EstimatedTotal() uint64
    EstimatedVRAMByGPU(gpuID string) uint64
}
The main method that manages the creation of servers is NewLlamaServer (here). It defines two modes: the original engine (which uses llama.cpp Cgo bindings to load models) and a new engine (which uses ggml Cgo bindings and loads models atop it directly). In both modes, the runner is executed as a standalone binary, managed by the runner package within Ollama.
In the original engine (within the llamarunner subdirectory), the GGUF model is loaded by the llama.cpp model here, which makes the call
m := Model{c: C.llama_model_load_from_file(C.CString(modelPath), cparams)}
In contrast, the new engine (within the ollamarunner subdirectory) loads the GGUF model directly in ggml here; it does so by calling NewBackend here, which calls New within ggml here to set up the device buffers and tensor memory allocations from the GGUF file, and ultimately to concurrently read data from the GGUF into CPU memory (and optionally onto the device).
After the model is fully loaded (in either mode), the runner responds to Ollama server calls (defined by the LlamaServer interface), performing forward passes on the embedded model and streaming responses to the Ollama server. When the runner needs to be offloaded (as dictated by a TTL and the scheduler), the runner server process is killed and the model memory is freed.
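The TTL mechanism itself is easy to picture from the runnerRef fields above (sessionDuration, expireTimer, refCount). The sketch below is a hedged illustration of that pattern rather than the exact Ollama code; touchRunner is a hypothetical helper name.
// Every completed request "touches" the runner, resetting its idle timer;
// when the timer fires, the runner is handed to the scheduler's expired
// channel (the real code also checks refCount before expiring).
func (s *Scheduler) touchRunner(r *runnerRef) {
    r.refMu.Lock()
    defer r.refMu.Unlock()
    if r.expireTimer == nil {
        r.expireTimer = time.AfterFunc(r.sessionDuration, func() {
            s.expiredCh <- r // picked up by the scheduler, which kills the runner process
        })
        return
    }
    r.expireTimer.Reset(r.sessionDuration)
}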