Understanding ggml, from the ground up

On-device, low-latency language model inference has become increasingly important in the design of production systems. Here, we dive deep into one leading framework for performant inference.

Introduction

ggml is a tensor library for machine learning. Like other popular tensor libraries (e.g. PyTorch, TensorFlow), it implements an n-dimensional tensor[1], a set of tensor operations that chain tensors together in a computational graph, and reverse-mode automatic differentiation. It distinguishes itself in its design for low-latency, low-overhead inference compute: there are no memory allocations during a model forward or backward pass (akin to TensorFlow’s static computational graph model[2]), deep care is given to variable alignment and low-level systems optimizations, and a wide variety of execution backends are supported (most notably Apple Metal).

It’s often the case that the best way to understand the workings of a system is to fully grok its internals. In this post, we’ll do just that, starting with a simple example. This post is based on the ggml source tree at commit 9a4ac; to follow along, check out the repository at that commit.

Precursor: CMake

The ggml project is built with CMake, a popular build automation tool for cross-platform build script generation[3]. That is, we can define our build requirements in a CMakeLists.txt file, and cmake will auto-generate make-compatible build artifacts that can be used to build and run binaries.

To start, let’s create our project with the following directory structure:

learn_ggml
├── CMakeLists.txt    # Our CMake build definition
├── build             # An empty directory, where CMake will write build files to
├── ggml              # The ggml repository at commit 9a4ac
└── main.cpp          # Our starter program, copied from `simple-backend.cpp`

We can write our build definition in CMakeLists.txt as follows (see comments for details):

cmake_minimum_required(VERSION 3.12)
project("learn_ggml")

# See https://joshtronic.com/2024/01/14/cmake-compile-commands-json/ for more
# information on this flag:
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

# Output binaries to `{pwd}/bin`:
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)

# `ggml` is a subdirectory/sub-project with its own CMakeLists.txt:
add_subdirectory(ggml)

# `main.cpp` is our executable:
add_executable(learn_ggml main.cpp)

# Enable DEBUG mode:
set(GGML_DEBUG ON)
add_definitions(-DGGML_DEBUG)

# Link `learn_ggml` (our executable) against `ggml`, and use
# C++20:
target_include_directories(learn_ggml PUBLIC .)
target_compile_features(learn_ggml PUBLIC cxx_std_20)
target_link_libraries(learn_ggml PRIVATE ggml)

All that’s left is to initialize CMake, and build:

# Initialize with source directory `.` and build directory `build`
cmake -B build -S .
cd build; cmake --build . -j8

Voilà! All done. We can now run ./build/bin/learn_ggml, which constructs a ggml model, performs a basic tensor operation (the matrix multiplication we study below), and returns the result to the user. We’ll dive deeper into the individual components of main.cpp in the subsections below.

Part 1: Initialization and Concepts

Recall that we’ll be following along with the “simple backend” example in the ggml repository, located at examples/simple/simple-backend.cpp. The file implements a very simple computation, computing $Y = AB^T$ for known matrices $A$ and $B$, but spans over 200 lines of code due to the level of detail at which components must be specified.

In main, we begin by defining our inputs matrix_A and matrix_B:

const int rows_A = 4, cols_A = 2;

float matrix_A[rows_A * cols_A] = {
  2, 8,
  5, 1,
  4, 2,
  8, 6
};

const int rows_B = 3, cols_B = 2;
/* Transpose([
  10, 9, 5,
  5, 9, 4
]) 2 rows, 3 cols */
float matrix_B[rows_B * cols_B] = {
  10, 5,
  9, 9,
  5, 4
};

We specify our input data as one-dimensional float arrays, without associated dimensions: we will supply dimensions later, when we construct ggml_tensor objects from the raw data. Note that we transpose B ahead of time to match what the subsequent matrix multiplication routine expects; this requires extra care when laying out the data. All ggml objects are allocated on the CPU first, and subsequently moved to an alternative backend if one is specified.
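
As a quick sanity check on this layout (a hypothetical snippet, not part of the example; assumes <cassert> is included), element (row i, col j) of a row-major flattened matrix lives at flat index i * cols + j:

assert(matrix_A[1 * cols_A + 0] == 5.0f);  // A[1][0]
assert(matrix_A[3 * cols_A + 1] == 6.0f);  // A[3][1]
assert(matrix_B[1 * cols_B + 1] == 9.0f);  // (B^T)[1][1], i.e. B[1][1]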

We next define a “model”, a struct specified as

struct simple_model {
  struct ggml_tensor * a;
  struct ggml_tensor * b;

  // the backend to perform the computation (e.g., CPU, CUDA, METAL)
  ggml_backend_t backend = NULL;

  // the backend buffer to store the data of tensors a and b
  ggml_backend_buffer_t buffer;

  // the context to define the tensor information (dimensions, size, memory address)
  struct ggml_context * ctx;
};

The model stores pointers to objects that ggml will manage for our simple model (a single matrix multiplication); it’s created to group them for convenience. In more advanced settings, one can imagine a model consisting of many layers, each storing their own set of tensors (with their own buffers), but all sharing a single context.

The model groups two distinct types of information in one struct:

Logical Information: the Tensor and Context

We’ll start with the logical layout of objects (tensors) and the computational graph composed of their operations.

The tensor object: ggml_tensor

A tensor struct defines properties of a tensor, which can either be a leaf (no dependencies; constructed from input data) or a node (dependencies stored in src; constructed from previously constructed ggml_tensor objects) in a computational graph. Here’s the struct in its entirety:

struct ggml_tensor {
  // Data type (e.g. FP32, FP16)
  enum ggml_type type;

  // Metadata for tensors stored in non-CPU backends:
  struct ggml_backend_buffer * buffer;

  int64_t ne[GGML_MAX_DIMS]; // number of elements
  size_t  nb[GGML_MAX_DIMS]; // stride in bytes

  // If a tensor corresponds to a leaf in the computational
  // graph, op is GGML_OP_NONE. Otherwise, op is the
  // operation that produced this node:
  enum ggml_op op;

  // op params - allocated as int32_t for alignment
  int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];

  int32_t flags;

  // Source tensors (the inputs from which this
  // tensor is computed):
  struct ggml_tensor * src[GGML_MAX_SRC];

  // For tensors that are views of other tensors, the
  // source tensor and offset into that source tensor:
  struct ggml_tensor * view_src;
  size_t               view_offs;

  // A pointer to the tensor's data. For tensors allocated
  // through a non-CPU backend, this typically points into that
  // backend's buffer (e.g. a CUDA device pointer).
  void * data;

  // The name of the tensor in the computational graph,
  // used for printing and searching:
  char name[GGML_MAX_NAME];

  void * extra; // extra things e.g. for ggml-cuda.cu

  // Alignment:
  char padding[8];
};

Importantly, ggml tensors support full-precision types, half-precision types, and many quantized types for low-memory, low-latency inference (see the ggml_type enum for more information). Like NumPy and other similar libraries, tensors are not assumed to be contiguous; instead, shape/stride metadata is stored alongside them, which is useful for operations like transposition and permutation. This is also a place where ggml zigs while other libraries zag: tensors are stored with their lowest dimension first, while other libraries store the lowest dimension last. Concretely, this means that a 2x3x4 tensor will have ne = {4, 3, 2}.
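
To make the reversed dimension order concrete, here is a small sketch (not from the example; it assumes ggml.h and <cstdio> are included, and uses a CPU-only context) that creates a 2x3x4 fp32 tensor and inspects its ne/nb metadata:

// A "2x3x4" tensor is created by passing the fastest-varying dimension
// first, i.e. (4, 3, 2):
struct ggml_init_params params = {
  /*.mem_size   =*/ 16 * 1024 * 1024,
  /*.mem_buffer =*/ NULL,
  /*.no_alloc   =*/ false,
};
struct ggml_context * ctx = ggml_init(params);
struct ggml_tensor * t = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 4, 3, 2);

// For a contiguous fp32 tensor, ne = {4, 3, 2, 1} and nb = {4, 16, 48, 96}:
// nb[0] = sizeof(float), and nb[i] = nb[i-1] * ne[i-1].
printf("ne = {%lld, %lld, %lld, %lld}\n",
       (long long) t->ne[0], (long long) t->ne[1],
       (long long) t->ne[2], (long long) t->ne[3]);
printf("nb = {%zu, %zu, %zu, %zu}\n",
       t->nb[0], t->nb[1], t->nb[2], t->nb[3]);

ggml_free(ctx);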

We’ll get to initializing our tensors from the raw (flattened) data in matrix_A and matrix_B soon. For now, let’s turn to the next important logical concept: the ggml context.

The memory manager: ggml_context

The ggml context manages the memory of all tensors, operations, and other metadata necessary to execute the ggml computational graph. In particular, it defines a linked list of ggml_object elements (similar to simple memory allocators), and tracks its own memory buffer, total size, and number of objects.

struct ggml_context {
    size_t mem_size;
    void * mem_buffer;
    bool   mem_buffer_owned;
    bool   no_alloc;

    int    n_objects;

    struct ggml_object * objects_begin;
    struct ggml_object * objects_end;
};

Each ggml_object is defined as follows:

struct ggml_object {
    size_t offs;
    size_t size;
    struct ggml_object * next;
    // one of TENSOR, GRAPH, or WORK_BUFFER
    enum ggml_object_type type;
    char padding[4];
};

and manages size bytes of memory within ggml_context’s memory pool, namely mem_buffer. We can understand buffer management better by inspecting the new object creation routine, which will resurface when we consider tensor creation. The method is as follows:

static struct ggml_object * ggml_new_object(
  struct ggml_context * ctx, enum ggml_object_type type, size_t size) {

  /* Insert objects at the end of the context's memory pool: find the
     offset in the pool that corresponds to this, and align. */
  struct ggml_object * obj_cur = ctx->objects_end;
  const size_t cur_offs = obj_cur == NULL ? 0 : obj_cur->offs;
  const size_t cur_size = obj_cur == NULL ? 0 : obj_cur->size;
  const size_t cur_end  = cur_offs + cur_size;

  // align to GGML_MEM_ALIGN
  size_t size_needed = GGML_PAD(size, GGML_MEM_ALIGN);

  // Error handling elided...

  /* Construct the new ggml object within the context's memory buffer, and
     associate it with the physical object it is responsible for (via offset and
     size). The object it is responsible for will be placed in the buffer right
     after itself (hence why .offs = cur_end + GGML_OBJECT_SIZE) */
  char * const mem_buffer = ctx->mem_buffer;
  struct ggml_object * const obj_new = (struct ggml_object *)(mem_buffer + cur_end);
  *obj_new = (struct ggml_object) {
    .offs = cur_end + GGML_OBJECT_SIZE,
    .size = size_needed,
    .next = NULL,
    .type = type,
  };

  /* Update the linked list of ggml objects */
  if (obj_cur != NULL) {
      obj_cur->next = obj_new;
  } else {
      // this is the first object in this context
      ctx->objects_begin = obj_new;
  }
  ctx->objects_end = obj_new;
  return obj_new;
}

We now have a grasp on two core ggml concepts: ggml_tensor structs describe the data, type, and dimensions of the objects that participate in the computational graph, and each tensor’s allocation is tracked by a ggml_object in the linked list owned by the ggml_context, whose memory pool holds both the metadata and, when no_alloc is false, the tensor data itself.
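
As a minimal CPU-only sketch of this (assuming ggml.h and <cstdio> are included), we can watch the context’s pool grow as a tensor is created:

// Every ggml_new_tensor_* call appends a ggml_object (plus tensor metadata,
// plus the data itself when no_alloc is false) to the context's mem_buffer;
// no per-tensor malloc happens.
struct ggml_init_params params = {
  /*.mem_size   =*/ 1 * 1024 * 1024,
  /*.mem_buffer =*/ NULL,   // let ggml allocate the pool
  /*.no_alloc   =*/ false,  // place tensor data inside the pool as well
};
struct ggml_context * ctx = ggml_init(params);

size_t before = ggml_used_mem(ctx);
ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
size_t after = ggml_used_mem(ctx);

// Roughly ggml_tensor_overhead() (object header + tensor struct) plus
// 8 * sizeof(float), padded to GGML_MEM_ALIGN:
printf("bytes consumed by one tensor: %zu\n", after - before);

ggml_free(ctx);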

But where is the allocation actually performed, and who owns the memory of the ggml_context’s memory buffer? This question leads us to the next set of concepts related to the physical layout of objects across device memory, whether that be CPU, GPU, or something else.

Physical Information: Devices and the Compute Backend

An important consideration in ggml’s design is the support of various backends; as a result, a unified interface for defining and registering devices as well as accessing their memory banks is paramount. The interfaces discussed below are abstracted such that this is possible in a hardware-agnostic manner.

The compute backend interface: ggml_backend

simple_model defines a member backend of type ggml_backend_t, which is a pointer to a ggml_backend object. Backends are the core interface that ggml uses to expose compute devices and their associated memory (e.g. CPU DRAM, CUDA HBM, etc.). Every tensor is allocated on a backend; the default backend is NULL, which allocates on the CPU.
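
Before looking at the struct itself, here is a sketch of the backend-selection pattern the simple example uses (the exact guard macro, GGML_USE_CUDA here, is an assumption): try to initialize an accelerator backend, and fall back to the CPU backend if that fails.

ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
// Try CUDA device 0 first:
backend = ggml_backend_cuda_init(0);
#endif
if (backend == NULL) {
  // Fall back to the always-available CPU backend:
  backend = ggml_backend_cpu_init();
}
fprintf(stderr, "using backend: %s\n", ggml_backend_name(backend));
// ... later, release it with ggml_backend_free(backend).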

A ggml backend is defined as follows:

struct ggml_backend {
  // Globally unique backend ID:
  ggml_guid_t guid;
  
  // Interface to set/get/copy tensors, record events, etc.
  // on this backend:
  struct ggml_backend_i iface;
  
  // The backend device (e.g. CPU, CUDA0), which contains:
  //  - its own device interface (to get device properties, memory,
  //    supported operations, etc.)
  //  - a backend registry, which contains its own registry interface
  //    that supports methods to enumerate available devices
  ggml_backend_dev_t device;

  // Other auxiliary items, per-backend:
  void * context;
};

The operations a backend can perform are grouped under its interface, which is specified as

// Some elements elided:
struct ggml_backend_i {
  const char * (*get_name)(ggml_backend_t backend);
  void (*free)(ggml_backend_t backend);

  // Asynchronous tensor data access. Parameters elided, roughly take the
  // form backend, tensor, data, offset, size:
  void (*set_tensor_async)(...);
  void (*get_tensor_async)(...);
  bool (*cpy_tensor_async)(...);

  // Complete all pending operations (required if the backend supports async operations):
  void (*synchronize)(ggml_backend_t backend);
};

An implementation: CUDA. While this provides us with high-level intuition about the specifications of a backend, it’s often helpful to look at a practical example. Let’s take the CUDA backend, which is initialized per device[4]. It’s constructed as follows:

ggml_backend_t ggml_backend_cuda_init(int device) {
  // Error handling elided...
  ggml_backend_cuda_context * ctx = new ggml_backend_cuda_context(device);
  ggml_backend_t cuda_backend = new ggml_backend {
    /* .guid      = */ ggml_backend_cuda_guid(),
    /* .interface = */ ggml_backend_cuda_interface,
    /* .device    = */ ggml_backend_reg_dev_get(ggml_backend_cuda_reg(), device),
    /* .context   = */ ctx,
  };
  return cuda_backend;
}

The GUID is a 16-byte globally unique ID defined per backend. The backend interface is defined as

// NULL elements elided:
static const ggml_backend_i ggml_backend_cuda_interface = {
  /* .get_name                = */ ggml_backend_cuda_get_name,
  /* .free                    = */ ggml_backend_cuda_free,
  /* .set_tensor_async        = */ ggml_backend_cuda_set_tensor_async,
  /* .get_tensor_async        = */ ggml_backend_cuda_get_tensor_async,
  /* .cpy_tensor_async        = */ ggml_backend_cuda_cpy_tensor_async,
  /* .synchronize             = */ ggml_backend_cuda_synchronize,
  /* .graph_compute           = */ ggml_backend_cuda_graph_compute,
  /* .event_record            = */ ggml_backend_cuda_event_record,
  /* .event_wait              = */ ggml_backend_cuda_event_wait,
};

The implementation of set_tensor_async does exactly what we would expect: it copies the tensor data from host (CPU) to device (GPU), leveraging the current context’s CUDA stream.

static void ggml_backend_cuda_set_tensor_async(
  ggml_backend_t backend, ggml_tensor * tensor, const void * data,
  size_t offset, size_t size) {
    ggml_backend_cuda_context * cuda_ctx = (
      (ggml_backend_cuda_context *)backend->context);
    // Error handling elided...
    CUDA_CHECK(cudaMemcpyAsync((char *)tensor->data + offset, data,
               size, cudaMemcpyHostToDevice, cuda_ctx->stream()));
}

Let’s next turn our attention to the .device property, which defines the physical device (think compute accelerator) that a backend is constructed on.

The device interface: ggml_backend_device

In ggml_backend, device is a pointer to a ggml_backend_device struct, which has the following definition:

struct ggml_backend_device {
  struct ggml_backend_device_i iface;
  ggml_backend_reg_t reg;
  void * context;
};

In keeping with the same semantics as the backend, a device also has an interface, which defines the common operations a device must support. These operations are as follows:

struct ggml_backend_device_i {
  // Some elements elided for simplicity...

  // device name: short identifier, such as "CPU" or "CUDA0"
  const char * (*get_name)(ggml_backend_dev_t dev);

  // device memory in bytes
  void (*get_memory)(ggml_backend_dev_t dev, size_t * free, size_t * total);

  // device type
  enum ggml_backend_dev_type (*get_type)(ggml_backend_dev_t dev);

  // device properties
  void (*get_props)(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props);

  // backend (stream) initialization
  ggml_backend_t (*init_backend)(ggml_backend_dev_t dev, const char * params);

  // check if the backend can compute an operation
  bool (*supports_op)(ggml_backend_dev_t dev, const struct ggml_tensor * op);
};

So far, pretty straightforward: devices (CPU, CUDA0, CUDA1, etc.) all expose a unified way to report their available memory and properties, construct backends (streams), and perform other device-level operations.

A device is also associated with a registry of all available devices of its type. The registry is a simple struct, which supports fetching the name of the device type and enumerating devices as exposed via its interface.

struct ggml_backend_reg {
  int api_version; // initialize to GGML_BACKEND_API_VERSION
  struct ggml_backend_reg_i iface;
  void * context;
};

struct ggml_backend_reg_i {
  const char * (*get_name)(ggml_backend_reg_t reg);
  size_t             (*get_device_count)(ggml_backend_reg_t reg);
  ggml_backend_dev_t (*get_device)(ggml_backend_reg_t reg, size_t index);
  // One method elided for simplicity...
};
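
To see the registry in action, here is a hedged sketch that enumerates every registered device through the public helpers in ggml-backend.h (ggml_backend_dev_count, ggml_backend_dev_get, ggml_backend_dev_name, ggml_backend_dev_memory; these names are assumed to be present at this commit):

for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
  ggml_backend_dev_t dev = ggml_backend_dev_get(i);

  size_t free_mem = 0, total_mem = 0;
  ggml_backend_dev_memory(dev, &free_mem, &total_mem);

  // e.g. "device 0: CUDA0 (22000 / 24576 MiB free)"
  printf("device %zu: %s (%zu / %zu MiB free)\n",
         i, ggml_backend_dev_name(dev),
         free_mem / (1024 * 1024), total_mem / (1024 * 1024));
}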

CUDA, continued. Recall that the initialization of the CUDA backend is done as follows:

  /* .device    = */ ggml_backend_reg_dev_get(ggml_backend_cuda_reg(), device),

In particular, the CUDA registry is fetched (which contains information about all available CUDA devices on the node), and the device is fetched from this registry via the registry interface’s get_device method:

ggml_backend_dev_t ggml_backend_reg_dev_get(ggml_backend_reg_t reg, size_t index) {
  return reg->iface.get_device(reg, index);
}

This returns a device, registered under the CUDA registry, which is then stored in the ggml_backend object held by simple_model.

Phew, that was a lot. To recap, a ggml_backend object defines a single backend device, and an interface to allocate and copy tensors, record events, etc. on this device. Devices themselves provide an interface to expose their (hardware) properties, and are grouped under registry entries.

Putting it All Together

Our example declares a simple_model object and populates its members as follows:

  simple_model model;
  load_model(model, matrix_A, matrix_B, rows_A, cols_A, rows_B, cols_B);

The call to load_model (with some incidental operations elided) performs the following set of operations:

// Initialize the backend using CUDA device 0:
model.backend = ggml_backend_cuda_init(0);
int num_tensors = 2;
struct ggml_init_params params {
    /*.mem_size   =*/ ggml_tensor_overhead() * num_tensors,
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,  // since we are using buffers
};

/* Initializes ggml, which (among other things) initializes a fp32->fp16
   conversion table, sets up logging, and constructs the `ggml_context`
   object. Note that mem_buffer is NULL above, which results in the
   context being allocated on the default buffer (CPU), but no_alloc
   is true, which results in the context not allocating objects
   within its buffer (they are next allocated on GPU directly). */
model.ctx = ggml_init(params);

/* Create new tensors within the ggml context: internally, this calls
   ggml_new_object with the appropriate object type and size: */
model.a = ggml_new_tensor_2d(model.ctx, GGML_TYPE_F32, cols_A, rows_A);
model.b = ggml_new_tensor_2d(model.ctx, GGML_TYPE_F32, cols_B, rows_B);

/* Allocate the tensor data directly on the model backend (GPU 0).
   Tensor allocation on a backend is relatively involved, ultimately
   calling the ggml_backend's init_tensor interface. In CUDA, this
   performs a cudaMemset operation. */
model.buffer = ggml_backend_alloc_ctx_tensors(model.ctx, model.backend);
/* Copy the raw input data (the a and b arguments to load_model, i.e.
   matrix_A and matrix_B) into the backend tensors: */
ggml_backend_tensor_set(model.a, a, 0, ggml_nbytes(model.a));
ggml_backend_tensor_set(model.b, b, 0, ggml_nbytes(model.b));

Importantly, load_model constructs the ggml_context object on the default device (CPU) and constructs a ggml_object for each tensor, but allocates the tensor data in the model buffer (GPU 0).

Part 2: Constructing a Computational Graph

So far, we’ve defined our input data, initialized a compute backend (CUDA device 0), created a ggml context to hold tensor metadata, and allocated and populated our input tensors in the backend buffer.

Now that initialization of our input data and compute backend is complete, we can construct a computational graph (type ggml_cgraph), which defines the set of operations we wish to perform on our input tensors. The graph is constructed statically: it records the operations up front and pre-allocates memory for the intermediate tensors produced during the computation, so that no memory is allocated or deallocated on the fly during the forward pass.

Setting Up

Our example begins by allocating memory for the computational graph as follows:

ggml_gallocr_t allocr = (
  ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend)));

allocr is a pointer to a ggml_gallocr object; we’ll make headway into understanding this struct as we parse the methods used to build a graph. For now, it’s worth noting that ggml_gallocr_new allocates the struct and its members on the (CPU) heap (e.g. with calloc).

After initializing allocr, the example constructs a ggml_cgraph twice: once to produce a memory estimate for the graph allocator, and once to actually allocate the current graph. This optimization is useful for models that support dynamic batch sizes, where the goal is to reserve space for the worst-case graph once (e.g. at the maximum batch size) and avoid re-allocations when forwarding with different input shapes; a sketch of that pattern follows. In our case (a static matmul of two fixed-dimension inputs), it’s unnecessary.
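
For completeness, here is a hedged sketch of that worst-case pattern (using ggml_gallocr_reserve and ggml_gallocr_get_buffer_size; build_graph stands in for a graph built with the maximum expected shapes):

// Pass 1: size the allocator's backend buffer against the largest graph we
// ever expect to run (no tensor data is touched yet):
struct ggml_cgraph * worst_case_graph = build_graph(model);
ggml_gallocr_reserve(allocr, worst_case_graph);
size_t mem_size = ggml_gallocr_get_buffer_size(allocr, 0);
fprintf(stderr, "compute buffer size: %.2f MiB\n", mem_size / 1024.0 / 1024.0);

// Pass 2 (per forward call): build the actual graph and allocate it out of
// the reserved buffer with ggml_gallocr_alloc_graph; no new backend
// allocations occur at that point.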

We can thus consolidate computational graph (ggml_cgraph) construction into the following two calls:

struct ggml_cgraph * gf = build_graph(model);
ggml_gallocr_alloc_graph(allocr, gf);

Put simply, we first build a computational graph on top of the input tensors in our model, and we next allocate the tensors in this graph for use in our model forward pass; doing so results in no dynamic memory allocations at runtime, one of the library’s core promises.

To build our computational graph, build_graph performs the following operations:

/* Allocate buf_size bytes on CPU DRAM for the graph */
static size_t buf_size = ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
static std::vector<uint8_t> buf(buf_size);

struct ggml_init_params params0 = {
    /*.mem_size   =*/ buf_size,
    /*.mem_buffer =*/ buf.data(),

    // the tensors will be allocated later, in
    // device memory, by ggml_allocr_alloc_graph()
    /*.no_alloc   =*/ true,
};

/* Create a temporary ggml context to build the graph,
   construct the graph by performing operations on the
   input tensors, and recursively visit the parents of
   the final output tensor (result) to build gf */
struct ggml_context * ctx0 = ggml_init(params0);
struct ggml_cgraph  * gf = ggml_new_graph(ctx0);

// result = a*b^T
struct ggml_tensor * result = ggml_mul_mat(ctx0, model.a, model.b);

// build operations nodes
ggml_build_forward_expand(gf, result);

/* Cleanup */
ggml_free(ctx0);
return gf;

We first allocate enough memory to hold GGML_DEFAULT_GRAPH_SIZE tensors on the CPU; this buffer is static within the build_graph method, so it is allocated once and persists for the lifetime of the program. We next construct a temporary context (ctx0) and build a new graph atop it; note that while the context itself will be freed at the end of build_graph, the ggml_cgraph object will persist, since it is allocated inside buf.

Building the computational graph: ggml_cgraph

Let’s now step into ggml_new_graph(ctx0), to see what makes up a ggml_cgraph. For reference, the ggml_cgraph struct has the following definition:

struct ggml_cgraph {
  int size;    // maximum number of nodes/leafs/grads/grad_accs
  int n_nodes; // number of nodes currently in use
  int n_leafs; // number of leafs currently in use

  // tensors with data that can change if the graph is evaluated
  struct ggml_tensor ** nodes;
  // the outputs of these tensors are the gradients of the nodes
  struct ggml_tensor ** grads;
  // accumulators for node gradients
  struct ggml_tensor ** grad_accs;
  // tensors with constant data
  struct ggml_tensor ** leafs;

  struct ggml_hash_set visited_hash_set;
  enum ggml_cgraph_eval_order order;
};

Constructing a new graph follows multiple steps. We begin by computing the number of bytes that the graph object requires, and allocating a new object within ctx0 on buf of this size.

// Fetch the size of the graph; note here that size is GGML_DEFAULT_GRAPH_SIZE,
// or 2048:
const size_t obj_size = ggml_graph_nbytes(size, grads);
struct ggml_object * obj = ggml_new_object(ctx, GGML_OBJECT_TYPE_GRAPH, obj_size);

// Recall, obj->offs is the offset of the object from the start of the memory buffer:
struct ggml_cgraph * cgraph = (struct ggml_cgraph *)((char *) ctx->mem_buffer + obj->offs);

// The size of the hash table is doubled since it needs to hold both nodes and leafs
size_t hash_size = ggml_hash_size(size * 2);

We next construct a hash table of size first_prime_larger_than(size * 2); this table will be used to store a visited set of tensors when we recursively construct our computational graph (tree) atop tensor operations. ggml uses a simple hash table implementation with open addressing and linear probing; a prime size minimizes the number of collisions produced by the hash function (reproduced below, without the final % size).

// hash function for ggml_tensor
static inline size_t ggml_hash(const struct ggml_tensor * p) {
  // the last 4 bits are always zero due to alignment
  return (size_t)(uintptr_t)p >> 4;
}

We subsequently initialize the members of ggml_cgraph while preserving pointer alignment, and clear the hash table and gradients if present. This completes initialization of the cgraph object; our next step is to add nodes and leafs to the graph corresponding to operations that we perform on our input tensors. We do so with the following operations:

/* Construct result tensor; store tensor metadata (the ggml_object)
   in ctx0, but tensor data will not be allocated yet. */
struct ggml_tensor * result = ggml_mul_mat(ctx0, model.a, model.b);

/* Recursively post-order traverse the graph (starting from `result`),
   updating node names and keeping track of nodes and leaves. */ 
ggml_build_forward_expand(gf, result);

Note that ggml_mul_mat does not eagerly perform matrix multiplication; instead, it simply constructs a new result tensor object and stores information about the operation that computed it. In our case, this object is constructed in ctx0; its ggml_object will be constructed within the previously statically allocated buf (on CPU), but its data will not yet be allocated.

struct ggml_tensor * ggml_mul_mat(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
  GGML_ASSERT(ggml_can_mul_mat(a, b));
  GGML_ASSERT(!ggml_is_transposed(a));

  const int64_t ne[4] = { a->ne[1], b->ne[1], b->ne[2], b->ne[3] };
  struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);

  result->op     = GGML_OP_MUL_MAT;
  result->src[0] = a;
  result->src[1] = b;

  return result;
}

And that’s that! We’ve now fully initialized our computational graph, with leaves (constant or input tensors) and nodes (tensors that are a result of operations) properly named. Nodes are constructed with their operations and inputs stored as references, so the forward pass has all the information it needs to compute them from inputs.
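
If you want to see what was recorded, a small sketch (assuming the graph-inspection helpers ggml_graph_n_nodes, ggml_graph_node, and ggml_op_name) walks the finished graph; in our example it contains a single node, the GGML_OP_MUL_MAT result, whose src pointers reference the two input leaves:

for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
  struct ggml_tensor * node = ggml_graph_node(gf, i);
  printf("node %d: %-16s op=%s ne=[%lld, %lld]\n",
         i, node->name, ggml_op_name(node->op),
         (long long) node->ne[0], (long long) node->ne[1]);
}
// ggml_graph_print(gf) dumps similar (and more detailed) information.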

Allocating intermediate tensors: ggml_gallocr

With a better understanding of ggml_cgraph, we can now turn our attention to the previously mentioned graph allocator, ggml_gallocr, which allocates on our device all of the tensor data that has so far been left unallocated (recall that no_alloc was set to true in ctx0). The allocator operates on the model’s backend buffer type (in our case, GPU 0).

The ggml_gallocr struct is defined as

struct ggml_gallocr {
  ggml_backend_buffer_type_t * bufts; // [n_buffers]
  ggml_backend_buffer_t * buffers; // [n_buffers]
  struct ggml_dyn_tallocr ** buf_tallocs; // [n_buffers]
  int n_buffers;

  struct ggml_hash_set hash_set;
  struct hash_node * hash_values; // [hash_set.size]

  struct node_alloc * node_allocs; // [n_nodes]
  int n_nodes;

  struct leaf_alloc * leaf_allocs; // [n_leafs]
  int n_leafs;
};

which should intuitively make some sense: the allocator’s job is to allocate nodes and leaves into the model’s buffer, and to do so it must keep track of its visited tensors and the allocation metadata of all allocated nodes and leaves (to handle things like permutations and views). The complexity here is mostly within ggml_gallocr_allocate_node, for the interested reader.

We’ve now defined our computational graph and allocated memory in our backend buffer for all intermediate tensors produced as a result of performing the operations in our graph. All that’s left to do is to run the model forward pass, executing the ops that produce nodes from leaves, and read out the data in the result tensor.

Part 3: Forwarding the Computational Graph

Executing a forward pass boils down to the following two calls:

// The actual forward pass:
ggml_backend_graph_compute(model.backend, gf);

// In our case, the output tensor is the last one in the graph:
return ggml_graph_node(gf, -1);

The call to ggml_backend_graph_compute dispatches to the appropriate backend interface via its asynchronous variant, and then synchronizes on the backend:

enum ggml_status ggml_backend_graph_compute_async(
  ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    return backend->iface.graph_compute(backend, cgraph);
}

which forwards the computational graph over the specified backend (defined by a backend-dependent implementation) and stores computed results in their associated pre-allocated tensors.

CUDA, continued. In the CUDA backend, graph_compute is a relatively involved method that optionally utilizes CUDA graphs to bundle the operations comprising the forward pass into one single unit. Without CUDA graphs, the backend iterates over all nodes in the graph:

// Massive oversimplification...
for (int i = 0; i < cgraph->n_nodes; i++) {
  ggml_tensor * node = cgraph->nodes[i];
  ggml_cuda_compute_forward(*cuda_ctx, node); 
}

and computes the forward pass in a large switch statement, calling into respective kernels:

static bool ggml_cuda_compute_forward(
  ggml_backend_cuda_context & ctx, struct ggml_tensor * dst) {
    switch (dst->op) {
      case GGML_OP_ARGMAX:
        ggml_cuda_argmax(ctx, dst);
        break;
      case GGML_OP_COUNT_EQUAL:
        ggml_cuda_count_equal(ctx, dst);
        break;
      case GGML_OP_REPEAT:
        ggml_cuda_op_repeat(ctx, dst);
        break;

      /* ... and so on, you get the picture */
    }
}

where all operations write into the dst tensor’s memory. Of course, each operation is heavily optimized in ggml, and the kernels are pretty cool to examine in depth; we’ll leave that for a separate blog post.
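
One last practical detail: after graph_compute returns, the result data still lives in backend (GPU) memory. The simple example copies it back to the host with ggml_backend_tensor_get; here is a sketch of that final step (assuming <vector> and <cstdio> are included):

// The output tensor is the last node in the graph:
struct ggml_tensor * result = ggml_graph_node(gf, -1);

// Bring the data back into host memory:
std::vector<float> out_data(ggml_nelements(result));
ggml_backend_tensor_get(result, out_data.data(), 0, ggml_nbytes(result));

// Print the result; ne[0] elements vary fastest (one row per ne[1] index):
for (int64_t i = 0; i < result->ne[1]; i++) {
  for (int64_t j = 0; j < result->ne[0]; j++) {
    printf("%.2f ", out_data[i * result->ne[0] + j]);
  }
  printf("\n");
}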

Wrapping Up

This (rather long) post covers many of the core principles underlying the ggml library, from devices to backends, computational graphs, tensors, and the model forward pass. While the explicit management of devices, memory buffers, allocation/deallocation, and static graph construction can be tough to wrap one’s head around (and is likely not necessary for many exploratory machine learning applications), it provides large efficiency gains, reduces forward pass overhead, and is an excellent pedagogical tool.

Notes

  1. In ggml, $n \le 4$ (GGML_MAX_DIMS), as transformers do not need more than a 4-d tensor. 

  2. See this StackOverflow answer for more details. 

  3. At the risk of massive oversimplification, it’s a Makefile generator. 

  4. A CUDA device corresponds to one GPU.