Elegant local language model inference with ollama

We investigate the internals of ollama, a widely-adored local LLM inference management platform built atop ggml.

Introduction

Getting started with ollama is simple: clone the repository, follow the development build instructions, and run the following two commands to spin up a server and set up inference on Qwen’s latest qwq reasoning model:

go run . serve &  # Start the ollama server in the background
go run . run qwq  # For example, run Qwen `qwq` (~20GB)

Upon doing so, you’ll note that the server prints a number of interesting logs (most notably from the ggml library, an early indicator that ggml is doing the heavy lifting for our inference calls); we’ll come back to these later. The client displays a UI likely familiar to those who have worked with Docker before, indicating that Ollama stores its models in layers:

pulling manifest
pulling c62ccde5630c... 100% ▕████████████████████████▏  19 GB
pulling 41190096a061... 100% ▕████████████████████████▏ 1.2 KB
pulling d18a5cc71b84... 100% ▕████████████████████████▏  11 KB
pulling 4afe5edfdb51... 100% ▕████████████████████████▏   77 B
pulling 6a8faa2fb8b0... 100% ▕████████████████████████▏  488 B
verifying sha256 digest
writing manifest
success

After the model is fetched, we can run inference in a straightforward manner (assuming we have enough RAM):

>>> How many times does the letter 'r' appear in the word "strawberry"?
<think>
...

… and we’re off, no extra work needed: that’s a pretty neat user experience.

In this post, we’ll investigate some of the internals of Ollama, from the model registry to the inference forward pass and key-value cache. At a high level, the project is written in Go, and aims to provide a proper API, command line interface, and model registry layer atop the ggml on-device model inference library. While the majority of this logic is orthogonal to the actual forward pass implementation in ggml (if you’re curious about that, see my earlier post on ggml internals), it’s still instructive to walk through the rest of the implementation to see how it provides a neat, condensed, and usable LLM serving workflow.

The Ollama Model Registry

We’ll begin with the Ollama Modelfile and registry, the first key feature the library adds atop a typical language model inference library. Here, Ollama takes heavy inspiration from Docker, in both its Modelfile definition and registry implementation.

The Modelfile

A sample Ollama Modelfile can be written as follows:

FROM llama3.2
PARAMETER temperature 1
PARAMETER num_ctx 4096
SYSTEM You are Mario from super mario bros, acting as an assistant.

In analogy to a typical Dockerfile, the base model (defined in the FROM instruction) plays the role of a base image, and subsequent commands that augment the model in different ways (e.g. parameters used for inference, the system prompt, adapters) are added in separate layers. This allows for re-use of common components across models; for example, multiple Modelfiles can be constructed from the llama3.2 base, and the base model will only be downloaded once (in an identical fashion to the role of the Docker layer cache).
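
To see the layering in action, you can build and run a derived model from this Modelfile with the ollama CLI (or go run . in a development checkout); the model name mario and the file path are just examples:

ollama create mario -f ./Modelfile  # Build the layered model image from the Modelfile
ollama run mario                    # Chat with the derived model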

For full documentation on the parameters that can be included in a Modelfile, visit the Ollama documentation on the subject. The parser itself is a single Go file, if you’re curious about its mechanics.

Creating model images from Modelfiles

A natural question is how the “layers” of a Modelfile are physically represented. The answer is in two parts: plain-text layers are stored as JSON (e.g. parameters, messages, etc.) or text (e.g. LICENSE), and model data is stored in GGUF or safetensors format.
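
On disk, each of these layers is a content-addressed blob, and manifests reference blobs by digest. On macOS/Linux the default layout looks roughly like the following (digests abbreviated as in the pull output above; exact paths may vary by version and platform):

~/.ollama/models/
├── blobs/
│   ├── sha256-c62ccde5630c...   # GGUF model tensor data (the ~19 GB layer)
│   └── sha256-d18a5cc71b84...   # small JSON/text layers (template, parameters, license, ...)
└── manifests/
    └── registry.ollama.ai/library/qwq/latest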

The Modelfile model creation handler implementation (the handler for /api/create) sheds some more light on how exactly these layers are interpreted by Ollama. A creation request sent by the client to the server is typed as

// CreateRequest is the request passed to [Client.Create]. It's a
// parsed representation of the user-provided Modelfile.
type CreateRequest struct {
  Model    string `json:"model"`
  Stream   *bool  `json:"stream,omitempty"`
  Quantize string `json:"quantize,omitempty"`

  From       string            `json:"from,omitempty"`
  Files      map[string]string `json:"files,omitempty"`
  Adapters   map[string]string `json:"adapters,omitempty"`
  Template   string            `json:"template,omitempty"`
  License    any               `json:"license,omitempty"`
  System     string            `json:"system,omitempty"`
  Parameters map[string]any    `json:"parameters,omitempty"`
  Messages   []Message         `json:"messages,omitempty"`
}

which is as we expect: the model name and relevant options/parameters are passed as structured objects from the client (which parses the raw Modelfile) to the server, and the server is expected to pull the relevant binary blobs from the registry and handle actual model creation.
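
As an illustration, the Mario Modelfile from earlier corresponds roughly to a request like the one below (field names follow the JSON tags above; the exact payload the CLI sends should be treated as approximate):

curl http://localhost:11434/api/create -d '{
  "model": "mario",
  "from": "llama3.2",
  "system": "You are Mario from super mario bros, acting as an assistant.",
  "parameters": {"temperature": 1, "num_ctx": 4096}
}'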

The creation request handler is located here. If a FROM statement is part of the request (the model is being fetched from a name, e.g. on the Ollama registry), the model and manifest are pulled from the registry and written to local disk, in a cache directory. All models on the Ollama registry are stored in GGUF format, which we’ll focus on here; it’s also possible to import from safetensors, with details here for the interested reader. A useful depiction of the GGUF file format is below; the documentation in ggml does an excellent job describing its semantics.

Figure 1. The GGUF file format specification, summarized. Lifted from the GGUF documentation.

After the model’s GGUF binary blobs are pulled, they are parsed (see the decode call here, with the initial pass here and the detailed pass here) and used to construct an array of base layers, typed as a slice of pointers to layerGGML (which mirrors the metadata stored on disk). Adapter layers are handled separately from base layers, but are typed the same way and merged with them.

type layerGGML struct {
  Layer        // map[string]*Tensor, where Tensor holds the GGML tensor metadata
  *ggml.GGML   // a pointer to the full model metadata, which points to all tensors
}

Note that no model tensor data is loaded into CPU DRAM at this point in time! All in-memory objects represent metadata corresponding to the GGUF file: the actual tensor data will be read and moved to device when the model is loaded at inference time.

Finally, a full model is created from the []*layerGGML metadata and the template, system prompt, and hyperparameters. Interestingly, layer quantization also happens at this stage (and is used to construct new layers, which replace the old ones). A final manifest for the requested Modelfile is written with this information (under a directory keyed by the hash of the Modelfile, i.e. the Modelfile digest); the manifest is later loaded when a model is fetched for inference, to avoid parsing the GGUF multiple times.

We now have a rough sense of what a Modelfile really is—it’s a declarative representation of a model and its prompt, represented as a collection of layers either stored as text (for prompt, hyperparameter, and other metadata) or in GGUF format (for model tensor data). Ollama parses Modelfiles to construct Manifests that point to GGUF model tensor data files pulled to local disk, which are ultimately used for inference when generation APIs are called.

Support for a model registry

Modelfiles thus naturally define the concept of a “model image”, which can be used analogously to Docker images (e.g. to share, remix, and derive from). Any parsed Modelfile is associated with (a) a Manifest and (b) a GGUF-backed tensor data representation, which can then be fetched and used in a FROM statement by a second Modelfile. One can also imagine tagging these “model images” just as Docker images are tagged, so users can pull model files at different versions.

The Ollama service provides the following APIs to facilitate this basic idea of a model registry:

// Create models
r.POST("/api/create", s.CreateHandler)
r.POST("/api/blobs/:digest", s.CreateBlobHandler)
r.HEAD("/api/blobs/:digest", s.HeadBlobHandler)
r.POST("/api/copy", s.CopyHandler)

// Read, update models in registry
r.POST("/api/pull", s.PullHandler)
r.POST("/api/push", s.PushHandler)
r.HEAD("/api/tags", s.ListHandler)
r.GET("/api/tags", s.ListHandler)
r.POST("/api/show", s.ShowHandler)
r.DELETE("/api/delete", s.DeleteHandler)
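
These endpoints back the CLI commands (ollama pull, ollama push, ollama list, and so on), and can also be hit directly against a local server on the default port; for example (request shapes simplified):

curl http://localhost:11434/api/pull -d '{"model": "qwq"}'  # Pull a model from the registry
curl http://localhost:11434/api/tags                        # List locally available models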

With time, I’m sure more Docker-like features will continue to be added.

The Ollama Architecture

With knowledge of how Ollama interprets Modelfiles, we’re ready to understand its client/server architecture and implementation of model inference. We’ll cover the design of four key components: client, server, scheduler, and model runner.

Client

While the Ollama client-side API is used by multiple frontend interfaces, we’ll focus on the command-line client implementation here. Client state is kept very light: a base URL and *http.Client suffice:

type Client struct {
  base *url.URL
  http *http.Client
}

Two core methods are implemented on the Client type:

// Batch:
func (c *Client) do(
  ctx context.Context, method, path string, reqData, respData any) error

// Streaming:
func (c *Client) stream(
  ctx context.Context, method, path string, data any, fn func([]byte) error) error

These methods are similar in their implementation, with one notable difference: while do reads the entire response body and returns data to the user, stream creates a new buffer and scans chunks of the response until completion.

// Note: error handling in both examples has been elided.

// do(...):
respObj, err := c.http.Do(request)
defer respObj.Body.Close()
respBody, err := io.ReadAll(respObj.Body)

// stream(...):
scanner := bufio.NewScanner(response.Body)
scanBuf := make([]byte, 0, maxBufferSize)
scanner.Buffer(scanBuf, maxBufferSize)
for scanner.Scan() {
  bts := scanner.Bytes()
  // (In the full implementation, each chunk is first checked for an
  // embedded error payload before being handed to the callback fn.)
  if err := fn(bts); err != nil {
    return err
  }
}

Individual method handlers are implemented in batch or streaming mode depending on the contract with the server-side implementation. For more information, see the implementation and types.
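
As a simplified sketch, a streaming handler like Generate amounts to a thin wrapper over stream that unmarshals each chunk into a typed response before invoking the caller’s callback (the shape below is illustrative, not a verbatim excerpt):

// Illustrative sketch of a streaming method handler built atop stream(...).
func (c *Client) Generate(ctx context.Context, req *GenerateRequest, fn func(GenerateResponse) error) error {
  return c.stream(ctx, http.MethodPost, "/api/generate", req, func(bts []byte) error {
    var resp GenerateResponse
    if err := json.Unmarshal(bts, &resp); err != nil {
      return err
    }
    return fn(resp) // one callback invocation per streamed chunk
  })
}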

Server and Scheduler

The server performs routing, scheduling, and handoff to the Ollama runner for model inference. Alongside the model registry APIs listed above, the server implements the following inference APIs, along with some OpenAI API compatibility functionality.

// Inference
r.POST("/api/generate", s.GenerateHandler)
r.POST("/api/chat", s.ChatHandler)
r.POST("/api/embed", s.EmbedHandler)
r.POST("/api/embeddings", s.EmbeddingsHandler)

The server is created here (from the command-line entry point here). Its state is also kept very light:

type Server struct {
  addr  net.Addr
  sched *Scheduler
}

Alongside route generation, the brunt of the server’s work is offloaded to a scheduler, which processes requests to load/unload models and run inference. The scheduler maintains the state of all queued requests and of the runners for loaded models (which are created by newServerFn).

type Scheduler struct {
  pendingReqCh  chan *LlmRequest
  finishedReqCh chan *LlmRequest
  expiredCh     chan *runnerRef
  unloadedCh    chan interface{}

  loaded   map[string]*runnerRef
  loadedMu sync.Mutex

  loadFn       func(
                  req *LlmRequest, f *ggml.GGML, gpus discover.GpuInfoList,
                  numParallel int)
  newServerFn  func(
                  gpus discover.GpuInfoList, model string, f *ggml.GGML,
                  adapters []string, projectors []string, opts api.Options,
                  numParallel int) (llm.LlamaServer, error)
  getGpuFn     func() discover.GpuInfoList
  getCpuFn     func() discover.GpuInfoList
  reschedDelay time.Duration
}

Scheduler logic can be somewhat involved; to break it down, let’s walk through an invocation of /api/generate. When the server is first started (before any API calls are serviced), it initializes and starts the scheduler, which runs two goroutines that live for the lifetime of the server and process queued/completed requests on the scheduler’s channels.
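
In code, the start-up is roughly the following (method names paraphrased from the scheduler implementation):

// Approximate shape of scheduler start-up: two long-lived goroutines
// service the scheduler's channels for the lifetime of the server.
func (s *Scheduler) Run(ctx context.Context) {
  go s.processPending(ctx)   // dequeue requests, assign GPUs, spawn runners
  go s.processCompleted(ctx) // handle finished/expired requests, unload idle runners
}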

When an API request is received, the HTTP routing layer first calls the server handler here, with a request that includes the model name, prompt, and other optional data. The handler fetches the model Manifest and loads model metadata; it also constructs the final prompt from request metadata.

After validating inputs, capabilities, and the prompt, the server schedules a Runner by submitting a request to the scheduler:

req := &LlmRequest{
  ctx:             c,
  model:           model,
  opts:            opts,
  sessionDuration: sessionDuration,
  successCh:       make(chan *runnerRef),
  errCh:           make(chan error, 1),
}

select {
case s.pendingReqCh <- req:
default:
  req.errCh <- ErrMaxQueue
}

// The server selects against the first of
// these two channels to receive a response:
return req.successCh, req.errCh

The server blocks on a successful response or error from the scheduler: as is native in Go, doing so does not prevent other goroutines from proceeding (e.g. to accept new server requests, or perform model inference).
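
Concretely, this wait is a select over the two channels returned above; a simplified sketch is below, where serveWithRunner and reportError are hypothetical stand-ins for the handler’s subsequent logic (the real handler also honors request-context cancellation):

// Sketch: block until the scheduler either provisions a runner or reports an error.
select {
case runner := <-req.successCh:
  // A runner handle is ready; the handler proceeds to call Completion on it.
  serveWithRunner(runner) // hypothetical stand-in for the completion flow below
case err := <-req.errCh:
  // Scheduling failed (e.g. ErrMaxQueue); the handler surfaces this to the client.
  reportError(err) // hypothetical stand-in
}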

When a request arrives on the pending channel, the Go runtime marks the goroutine that processes pending requests as runnable; once scheduled onto a CPU, it determines whether any existing model runners need to be expired, and assigns resources (e.g. GPUs). This logic is encapsulated here. After resources are assigned, the scheduler calls its newServerFn on the model, which is implemented here. This method is responsible for launching a runner, which owns the actual model inference and output generation. When the server receives a handle to a runner, it makes a completion request and streams the token-by-token response from the runner.

// Heavily elided; note that c is the client-side request
r, m, opts, err := s.scheduleRunner(...)  // Parameters are unimportant.
if err := r.Completion(c.Request.Context(), llm.CompletionRequest{
  Prompt:  prompt,
  Images:  images,
  Format:  req.Format,
  Options: opts,
}, func(cr llm.CompletionResponse) {
  res := api.GenerateResponse{
    Model:      req.Model,
    CreatedAt:  time.Now().UTC(),
    Response:   cr.Content,
    Done:       cr.Done,
    DoneReason: cr.DoneReason,
    Metrics: api.Metrics{
      PromptEvalCount:    cr.PromptEvalCount,
      PromptEvalDuration: cr.PromptEvalDuration,
      EvalCount:          cr.EvalCount,
      EvalDuration:       cr.EvalDuration,
    },
  }
  // res is then streamed back to the client (elided).
}); err != nil {
  // Handle errors from the Completion call (elided).
}

Runner

Finally, we’ll discuss the implementation of the runner, a short-lived server that communicates with the main Ollama server to run inference and stream responses back to the user. If you watch ps | grep ollama while an inference call is running, you’ll see such a process appear:

90217 ttys000    0:01.86 <ollama_path>/ollama runner --model <model_path> <args>

Note that one runner server is constructed per model; the information the Ollama server stores for each runner is shown below.

type runnerRef struct {
  refMu sync.Mutex
  refCount uint // prevent unloading if > 0

  llama          llm.LlamaServer      
  loading        bool                 // True only during initial load, then false forever
  gpus           discover.GpuInfoList // Recorded at time of provisioning
  estimatedVRAM  uint64
  estimatedTotal uint64

  sessionDuration time.Duration
  expireTimer     *time.Timer
  expiresAt       time.Time

  model       *Model
  modelPath   string
  numParallel int
  *api.Options
}

The llama field is a client-side handle to the runner server, typed as the LlamaServer interface below; it is currently only implemented by the llmServer type.

type LlamaServer interface {
  Ping(ctx context.Context) error
  WaitUntilRunning(ctx context.Context) error
  Completion(ctx context.Context, req CompletionRequest, fn func(CompletionResponse)) error
  Embedding(ctx context.Context, input string) ([]float32, error)
  Tokenize(ctx context.Context, content string) ([]int, error)
  Detokenize(ctx context.Context, tokens []int) (string, error)
  Close() error
  EstimatedVRAM() uint64 // Total VRAM across all GPUs
  EstimatedTotal() uint64
  EstimatedVRAMByGPU(gpuID string) uint64
}
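
Because the server holds only this interface, every interaction with a runner (completion, embedding, tokenization) goes through these methods; for instance, a tokenization round-trip via the runnerRef shown above is simply (error handling elided):

// All communication with the runner process goes through the LlamaServer interface.
tokens, _ := runner.llama.Tokenize(ctx, `How many times does the letter 'r' appear in "strawberry"?`)
text, _ := runner.llama.Detokenize(ctx, tokens)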

The main method that manages the creation of servers is NewLlamaServer (here). It defines two modes: the original engine (which uses llama.cpp Cgo bindings to load models) and a new engine (which uses ggml Cgo bindings and loads models atop it directly). In both modes, the runner is executed as a standalone binary, managed by the runner package within Ollama.
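
At its core, launching the runner amounts to spawning the binary as a child process; the sketch below is hypothetical beyond the runner --model invocation visible in the ps output earlier (variable names and any additional flags are assumptions):

// Hypothetical sketch: spawn the standalone runner binary as a child process.
// Only the "runner --model <path>" shape is taken from the ps output above.
cmd := exec.Command(ollamaBinary, "runner", "--model", modelPath) // plus engine/GPU flags
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
err := cmd.Start() // the server then waits for the runner to report it is ready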

In the original engine (within the llamarunner subdirectory), the GGUF model is loaded through the llama.cpp bindings here, which make the call

m := Model{c: C.llama_model_load_from_file(C.CString(modelPath), cparams)}

In contrast, the new engine (within the ollamarunner subdirectory) loads the GGUF model directly with ggml here; it does so by calling NewBackend here, which calls New within ggml here to set up the device buffers and tensor memory allocations described by the GGUF file, and ultimately to concurrently read tensor data from the GGUF into CPU memory (and optionally onto the device).

After the model is fully loaded (in either mode), the runner responds to Ollama server calls (defined by the LlamaServer interface), performing forward passes on the embedded model and streaming responses to the Ollama server. When the runner needs to be offloaded (dictated by a TTL and the scheduler), the runner server process is killed, and model memory is freed.
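
The TTL mechanics follow from the runnerRef and Scheduler fields shown earlier; a simplified sketch of the expiry path (not a verbatim excerpt) might look like:

// Simplified sketch: after a request completes, the runner's keep-alive timer is
// (re)armed; when it fires, the runner is handed to the scheduler's expiredCh,
// whose handler kills the runner process and frees model memory.
runner.expireTimer = time.AfterFunc(runner.sessionDuration, func() {
  s.expiredCh <- runner
})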