# ZML Concepts

## Model lifecycle

ZML is an inference stack that helps run Machine Learning (ML) models,
particularly Neural Networks (NN).

The lifecycle of a model consists of the following steps:

1. Open the model file and read the shapes of the weights, but leave the
   weights on disk.

2. Using the loaded shapes and optional metadata, instantiate a model struct
   with `Tensor`s, representing the shape and layout of each layer of the NN.

3. Compile the model struct and its `forward` function into an
   accelerator-specific executable. The `forward` function describes the
   mathematical operations corresponding to the model inference.

4. Load the model weights from disk into accelerator memory.

5. Bind the model weights to the executable.

6. Load some user inputs, and copy them to the accelerator.

7. Call the executable on the user inputs.

8. Fetch the returned model output from the accelerator into host memory, and
   finally present it to the user.

9. When all user inputs have been processed, free the executable resources and
   the associated weights.


**Some details:**

Note that the compilation and weight loading steps are both bottlenecks for
your model startup time, but they can be done in parallel. **ZML provides
asynchronous primitives** to make that easy.
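
As a rough illustration of overlapping the two startup steps, here is a minimal sketch. It uses `std.Thread` from the Zig standard library as a stand-in and does not use ZML's own asynchronous primitives. `compileModel` and `loadWeights` are hypothetical placeholders, not ZML functions.

```zig
const std = @import("std");

// Hypothetical placeholder for compilation: writes a fake executable id.
fn compileModel(exe_id: *u32) void {
    exe_id.* = 42;
}

// Hypothetical placeholder for weight loading: writes a fake byte count.
fn loadWeights(loaded_bytes: *usize) void {
    loaded_bytes.* = 1024;
}

pub fn main() !void {
    var exe_id: u32 = 0;
    var loaded_bytes: usize = 0;

    // Start loading weights on a second thread...
    const loader = try std.Thread.spawn(.{}, loadWeights, .{&loaded_bytes});
    // ...while the current thread compiles the model.
    compileModel(&exe_id);
    // Wait for the weights before binding them to the executable.
    loader.join();

    std.debug.print("exe={d} weights={d} bytes\n", .{ exe_id, loaded_bytes });
}
```

The point is only the ordering: both steps start before either is awaited, so startup time approaches the slower of the two rather than their sum.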

The **compilation can be cached** across runs, and if you're always using the
same model architecture with the same shapes, it's possible to bypass it
entirely.

The accelerator is typically a GPU, but can be another chip, or even the CPU
itself, churning vector instructions.


## Tensor Bros.

In ZML, we leverage Zig's static type system to differentiate between a few
concepts. Hence we not only have a `Tensor` to work with, like other ML
frameworks, but also `Buffer`, `HostBuffer`, and `Shape`.

Let's explain all that.

* `Shape`: _describes_ a multi-dimensional array.
  - `Shape.init(.{16}, .f32)` represents a vector of 16 floats of 32 bits
    precision.
  - `Shape.init(.{512, 1024}, .f16)` represents a matrix of `512*1024` floats
    of 16 bits precision, i.e. a `[512][1024]f16` array.

  A `Shape` is only **metadata**, it doesn't point to or own any memory. The
  `Shape` struct can also represent a regular number, aka a scalar:
  `Shape.init(.{}, .i32)` represents a 32-bit signed integer.

* `HostBuffer`: _is_ a multi-dimensional array, whose memory is allocated **on
  the CPU**.
  - points to the slice of memory containing the array
  - typically owns the underlying memory, but has a flag to remember when it
    doesn't.

* `Buffer`: _is_ a multi-dimensional array, whose memory is allocated **on an
  accelerator**.
  - contains a handle that the ZML runtime can use to convert it into a
    physical address, but there is no guarantee this address is visible from
    the CPU.
  - can be created by loading weights from disk directly to the device via
    `zml.aio.loadBuffers`
  - can be created by calling `HostBuffer.toDevice(accelerator)`.

* `Tensor`: is a mathematical object representing an intermediary result of a
  computation.
  - is basically a `Shape` with an attached MLIR value representing the
    mathematical operation that produced this `Tensor`.
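
To make the "a `Shape` is only metadata" point from the list above concrete, here is a minimal sketch in plain Zig. `ToyShape` and `DType` are hypothetical stand-ins written for this illustration; they are not `zml.Shape` and do not reproduce its API. The sketch only shows that element counts and byte sizes can be derived from dimensions and a data type without touching any memory.

```zig
const std = @import("std");

// Hypothetical stand-in for illustration; this is not ZML's data type enum.
const DType = enum {
    f16,
    f32,
    i32,

    // Size of one element in bytes.
    fn sizeOf(self: DType) usize {
        return switch (self) {
            .f16 => 2,
            .f32, .i32 => 4,
        };
    }
};

// Hypothetical stand-in for illustration; this is not zml.Shape.
const ToyShape = struct {
    dims: []const usize,
    dtype: DType,

    // Number of elements: the product of all dimensions (1 for a scalar).
    fn count(self: ToyShape) usize {
        var n: usize = 1;
        for (self.dims) |d| n *= d;
        return n;
    }

    // Bytes that a buffer of this shape would need.
    fn byteSize(self: ToyShape) usize {
        return self.count() * self.dtype.sizeOf();
    }
};

pub fn main() void {
    const vector = ToyShape{ .dims = &.{16}, .dtype = .f32 };
    const matrix = ToyShape{ .dims = &.{ 512, 1024 }, .dtype = .f16 };
    const scalar = ToyShape{ .dims = &.{}, .dtype = .i32 };

    std.debug.print("vector: {d} bytes\n", .{vector.byteSize()});
    std.debug.print("matrix: {d} bytes\n", .{matrix.byteSize()});
    std.debug.print("scalar: {d} bytes\n", .{scalar.byteSize()});
}
```

With these toy definitions, the 16-element `f32` vector needs 64 bytes, the `512*1024` `f16` matrix needs 1048576 bytes, and the scalar needs 4 bytes. None of the three values owns or points to that memory.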


## The model struct

The model struct is the Zig code that describes your Neural Network (NN).
Let's look at the following model architecture:



This is how we can describe it in a Zig struct:

```zig
const Model = struct {
    input_layer: zml.Tensor,
    output_layer: zml.Tensor,

    pub fn forward(self: Model, input: zml.Tensor) zml.Tensor {
        const hidden = self.input_layer.matmul(input);
        const output = self.output_layer.matmul(hidden);
        return output;
    }
};
```

NNs are generally seen as a composition of smaller NNs, which are split into
layers. ZML makes it easy to mirror this structure in your code.

```zig
const Model = struct {
    input_layer: MyOtherLayer,
    output_layer: MyLastLayer,

    pub fn forward(self: Model, input: zml.Tensor) zml.Tensor {
        const hidden = self.input_layer.forward(input);
        const output = self.output_layer.forward(hidden);
        return output;
    }
};
```

The `zml.nn` module provides a number of well-known layers to more easily
bootstrap models.

Since the `Model` struct contains `Tensor`s, it is only ever useful during the
compilation stage, but not during inference. If we want to represent the model
with actual `Buffer`s, we can use `zml.Bufferized(Model)`, which is a mirror
struct of `Model` but with a `Buffer` replacing every `Tensor`.
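
The idea of a mirror struct can be sketched in plain Zig. The example below writes the mirror by hand and checks at compile time that it has the same field names as the model. `ToyTensor`, `ToyBuffer`, `ModelBuffers`, and `mirrors` are hypothetical names used only for this illustration; ZML derives the mirror type for you, and this sketch does not show how it does so.

```zig
const std = @import("std");

// Toy stand-ins, hypothetical and much simpler than the ZML types.
const ToyTensor = struct { dims: [2]usize };
const ToyBuffer = struct { bytes: []const u8 };

// Compile-time description of the model: fields are tensors.
const Model = struct {
    input_layer: ToyTensor,
    output_layer: ToyTensor,
};

// Hand-written mirror: same field names, buffers instead of tensors.
const ModelBuffers = struct {
    input_layer: ToyBuffer,
    output_layer: ToyBuffer,
};

// Returns true when every field of `A` has a same-named field in `B`.
fn mirrors(comptime A: type, comptime B: type) bool {
    inline for (std.meta.fields(A)) |field| {
        if (!@hasField(B, field.name)) return false;
    }
    return true;
}

pub fn main() void {
    // Evaluated at compile time: no model data exists at this point.
    const ok = comptime mirrors(Model, ModelBuffers);
    std.debug.print("ModelBuffers mirrors Model: {}\n", .{ok});
}
```

Keeping the two structs field-for-field aligned is what lets weights loaded by name land in the right slot of the executable.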

## Strong type checking

Let's look at the model lifecycle again, but this time annotated with the
corresponding types.

1. Open the model file and read the shapes of the weights -> `zml.HostBuffer`
   (using memory mapping, no actual copies happen yet)

2. Instantiate a model struct -> `Model` struct (with `zml.Tensor` inside)

3. Compile the model struct and its `forward` function into an executable.
   `forward` is a `Tensor -> Tensor` function, the executable is a
   `zml.FnExe(Model.forward)`

4. Load the model weights from disk into accelerator memory ->
   `zml.Bufferized(Model)` struct (with `zml.Buffer` inside)

5. Bind the model weights to the executable -> `zml.ModuleExe(Model.forward)`

6. Load some user inputs (custom struct), encode them into arrays of numbers
   (`zml.HostBuffer`), and copy them to the accelerator (`zml.Buffer`).

7. Call the executable on the user inputs. `module.call` accepts `zml.Buffer`
   arguments and returns `zml.Buffer`

8. Return the model output (`zml.Buffer`) to the host (`zml.HostBuffer`),
   decode it (custom struct) and finally return it to the user.
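
The benefit of giving each stage its own type can be shown with a small self-contained sketch. `HostBuf`, `DeviceBuf`, and `Exe` below are toy types invented for this illustration, not ZML types, and the "device" here is ordinary memory. The sketch assumes a Zig version with multi-object `for` loops (0.11 or later).

```zig
const std = @import("std");

// Toy host-side array (hypothetical, not zml.HostBuffer).
const HostBuf = struct {
    data: [4]f32,

    // Pretend to copy the data to an accelerator.
    fn toDevice(self: HostBuf) DeviceBuf {
        return .{ .data = self.data };
    }
};

// Toy device-side array (hypothetical, not zml.Buffer).
const DeviceBuf = struct {
    data: [4]f32,

    // Pretend to copy the data back to the host.
    fn toHost(self: DeviceBuf) HostBuf {
        return .{ .data = self.data };
    }
};

// Toy executable with bound weights; it only accepts and returns DeviceBuf.
const Exe = struct {
    weights: DeviceBuf,

    // Element-wise multiply of the input by the bound weights.
    fn call(self: Exe, input: DeviceBuf) DeviceBuf {
        var out = input;
        for (&out.data, self.weights.data) |*o, w| o.* *= w;
        return out;
    }
};

pub fn main() void {
    const weights = (HostBuf{ .data = .{ 1, 2, 3, 4 } }).toDevice();
    const exe = Exe{ .weights = weights };

    const input = HostBuf{ .data = .{ 10, 10, 10, 10 } };
    const output = exe.call(input.toDevice()).toHost();
    // `exe.call(input)` would be rejected by the compiler:
    // the parameter type is DeviceBuf, and `input` is a HostBuf.

    std.debug.print("{any}\n", .{output.data});
}
```

Because host and device data are distinct types, forgetting a transfer becomes a compile error rather than a runtime crash or a silent read of the wrong memory.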