
# Writing your first model
**In this short guide, we will do the following:**
- clone ZML to work directly within the prepared example folder
- add Zig code to implement our model
- add some Bazel to integrate our code with ZML
- no weights files or other external data are required for this example

We're doing this exercise in the `examples` folder because it's especially
prepared for new ZML projects: it contains everything needed for ZML
development, from `bazel` configs to `vscode` settings and `neovim` LSP
support. The `examples` folder serves as a cookiecutter ZML project template,
with a few example models already added.
**Note:** _The `examples` folder is self-contained. You **can** make a copy of
it to a location outside of the ZML repository. Simply remove all examples you
don't need and use it as a template for your own projects._

So, let's get started, shall we?

**If you haven't done so already, please [install bazel](../tutorials/getting_started.md)**.

Check out the ZML repository. In the `examples` directory, create a new folder
for your project. Let's call it `simple_layer`.
```
git clone https://github.com/zml/zml.git
cd zml/examples
mkdir -p simple_layer
```
... and add a file `main.zig` to it, along with a bazel build file:
```
touch simple_layer/main.zig
touch simple_layer/BUILD.bazel
```
By the way, you can access the complete source code of this walkthrough here:
- [main.zig](https://github.com/zml/zml/tree/master/examples/simple_layer/main.zig)
- [BUILD.bazel](https://github.com/zml/zml/tree/master/examples/simple_layer/BUILD.bazel)
## The high-level overview
Before firing up our editor, let's quickly talk about a few basic ZML
fundamentals.
In ZML, we describe a _Module_, which represents our AI model, as a Zig
`struct`. That struct can contain Tensor fields that are used for computation,
e.g. weights and biases. In the _forward_ function of a Module, we describe the
computation by calling tensor operations like _mul_, _add_, _dotGeneral_,
_conv2D_, etc., or even nested Modules.
ZML creates an MLIR representation of the computation when we compile the
Module. For compilation, only the _Shapes_ of all tensors must be known. No
actual tensor data is needed at this step. This is important for large models:
we can compile them while the actual weight data is being fetched from disk.
To accomplish this, ZML uses a _BufferStore_. The _BufferStore_ knows how to
load only the shapes at first, and how to load the actual tensor data later. In
our example, we will fake the _BufferStore_ a bit: we won't load from disk;
we'll use float arrays instead.
After compilation is done (and the _BufferStore_ has finished loading weights),
we can send the weights from the _BufferStore_ to our computation device. That
produces an _executable_ module which we can call with different _inputs_.
In our example, we then copy the result from the computation device to CPU
memory and print it.
**So the steps for us are:**
- describe the computation as ZML _Module_, using tensor operations
- create a _BufferStore_ that provides _Shapes_ and data of weights and bias
(ca. 5 lines of code).
- compile the _Module_ **asynchronously**
- make the compiled _Module_ send the weights (and bias) to the computation
device utilizing the _BufferStore_, producing an _executable_ module
- prepare an input tensor and call the _executable_ module
- get the result back to CPU memory and print it

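As a quick orientation, the steps above map onto the API calls used in the rest of this guide (comments only, not standalone code):

```zig
// How the steps map to the calls introduced below (orientation sketch):
// 1. describe the computation:  Layer.forward, built from tensor ops (mul, add)
// 2. provide shapes and data:   zml.aio.BufferStore (weight + bias entries)
// 3. compile asynchronously:    async.async(zml.compileModel, .{ ... })
// 4. weights -> device:         zml.aio.loadBuffers, then compiled.prepare(...)
// 5. run the module:            executable.call(.{input_buffer})
// 6. result -> CPU and print:   result.toHostAlloc(arena), cpu_result.items(f16)
```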
If you like to read more about the underlying concepts of the above, please see
[ZML Concepts](../learn/concepts.md).
## The code
Let's start by writing some Zig code, importing ZML and often-used modules:
```zig
const std = @import("std");
const zml = @import("zml");
const async = @import("async");
```
You will probably use the above lines in all your ZML projects. Also, note that
**ZML is async** and comes with its own async runtime, thanks to
[zigcoro](https://github.com/rsepassi/zigcoro).
### Defining our Model
We will start with a very simple "Model". One that resembles a "multiply and
add" operation.
```zig
/// Model definition
const Layer = struct {
bias: ?zml.Tensor = null,
weight: zml.Tensor,
pub fn forward(self: Layer, x: zml.Tensor) zml.Tensor {
var y = self.weight.mul(x);
if (self.bias) |bias| {
y = y.add(bias);
}
return y;
}
};
```
As you can see, in ZML, AI models are just structs with a `forward` function!
There are a few more things to observe:
- forward functions typically take Tensors as inputs and return Tensors.
- more advanced use cases pass in / return structs or tuples, such as
`struct { Tensor, Tensor }` for a tuple of two tensors. You can see such
use cases, for example, in the
[Llama Model](https://github.com/zml/zml/tree/master/examples/llama)
- in the model, tensors may be optional, as is the case with `bias`.
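To illustrate the tuple-returning case mentioned above, here is a hedged sketch of a hypothetical two-output module (`TwoHeads`, `weight_a`, and `weight_b` are made-up names; the pattern follows the `Layer` struct above):

```zig
// Hypothetical module whose forward returns a tuple of two Tensors (sketch).
const TwoHeads = struct {
    weight_a: zml.Tensor,
    weight_b: zml.Tensor,

    // Both results are computed in one forward pass and returned together
    // as an anonymous two-field tuple.
    pub fn forward(self: TwoHeads, x: zml.Tensor) struct { zml.Tensor, zml.Tensor } {
        return .{ self.weight_a.mul(x), self.weight_b.mul(x) };
    }
};
```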
### Adding a main() function
ZML code is async. Hence, we need to provide an async main function. It works
like this:
```zig
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
try async.AsyncThread.main(gpa.allocator(), asyncMain);
}
pub fn asyncMain() !void {
// ...
```
The above `main()` function only creates an allocator and an async main thread
that executes our `asyncMain()` function (called with no arguments).
So, let's start with the async main function:
```zig
pub fn asyncMain() !void {
// Short lived allocations
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// Arena allocator for BufferStore etc.
var arena_state = std.heap.ArenaAllocator.init(allocator);
defer arena_state.deinit();
const arena = arena_state.allocator();
// Create ZML context
var context = try zml.Context.init();
defer context.deinit();
const platform = context.autoPlatform(.{});
...
}
```
This is boilerplate code that provides a general-purpose allocator and, for
convenience, an arena allocator that we will use later. The advantage of arena
allocators is that you don't need to deallocate individual allocations; you
simply call `.deinit()` to deinitialize the entire arena instead!
We also initialize the ZML context `context` and get our CPU `platform`
automatically.
### The BufferStore
Next, we need to set up the concrete weight and bias tensors for our model.
Typically, we would load them from disk. But since our example works without
stored weights, we are going to create a BufferStore manually, containing
_HostBuffers_ (buffers on the CPU) for both the `weight` and the `bias` tensor.
A BufferStore essentially contains a dictionary whose string keys match the
names of the struct fields of our `Layer` struct. So, let's create this
dictionary:
```zig
// Our weights and bias to use
var weights = [3]f16{ 2.0, 2.0, 2.0 };
var bias = [3]f16{ 1.0, 2.0, 3.0 };
const input_shape = zml.Shape.init(.{3}, .f16);
// We manually produce a BufferStore. You would not normally do that.
// A BufferStore is usually created by loading model data from a file.
var buffers: zml.aio.BufferStore.Buffers = .{};
try buffers.put(arena, "weight", zml.HostBuffer.fromArray(&weights));
try buffers.put(arena, "bias", zml.HostBuffer.fromArray(&bias));
// the actual BufferStore
const bs: zml.aio.BufferStore = .{
.arena = arena_state,
.buffers = buffers,
};
```
Our weights are `{2.0, 2.0, 2.0}`, and our bias is just `{1.0, 2.0, 3.0}`. The
shape of the weight and bias tensors is `{3}`, and because of that, the **shape
of the input tensor** is also going to be `{3}`!
Note that `zml.Shape` always takes the data type associated with the tensor. In
our example, that is `f16`, expressed as the enum value `.f16`.
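The same `zml.Shape.init` pattern extends to higher-rank tensors: the first argument is a tuple of dimensions, the second the dtype. As a hedged illustration (only the `{3}` `.f16` shape is actually used in this tutorial; the rank-2 example is hypothetical):

```zig
// Illustrative shapes following the pattern above:
const vector_shape = zml.Shape.init(.{3}, .f16); // rank 1: 3 elements of f16
const matrix_shape = zml.Shape.init(.{ 2, 3 }, .f32); // rank 2: a 2x3 f32 matrix (hypothetical)
```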
### Compiling our Module for the accelerator
We're only going to use the CPU for our simple model, but we still need to
compile the `forward()` function. This compilation is usually done
asynchronously, which means we can continue doing other things while the module
is compiling:
```zig
// A clone of our model, consisting of shapes. We only need shapes for compiling.
// We use the BufferStore to infer the shapes.
const model_shapes = try zml.aio.populateModel(Layer, allocator, bs);
// Start compiling. This uses the inferred shapes from the BufferStore.
// The shape of the input tensor, we have to pass in manually.
var compilation = try async.async(
zml.compileModel,
.{ allocator, Layer.forward, model_shapes, .{input_shape}, platform },
);
// Produce a bufferized weights struct from the fake BufferStore.
// This is like the inferred shapes, but with actual values.
// We will need to send those to the computation device later.
var model_weights = try zml.aio.loadBuffers(Layer, .{}, bs, arena, platform);
defer zml.aio.unloadBuffers(&model_weights); // for good practice
// Wait for compilation to finish
const compiled = try compilation.await();
```
Compilation happens in the background via the `async` function. We call
`async` with the `zml.compileModel` function and its arguments passed
separately. The arguments are the allocator, the `Layer.forward` function to
compile, the shapes inferred from the BufferStore, the shape of the input
tensor(s), and the platform to compile for (we used the auto platform).
### Creating the Executable Model
Now that we have compiled the module utilizing the shapes, we turn it into an
executable.
```zig
// pass the model weights to the compiled module to create an executable module
// all required memory has been allocated in `compile`.
var executable = compiled.prepare(model_weights);
defer executable.deinit();
```
### Calling / running the Model
The executable can now be invoked with an input of our choice.
To create the `input`, we directly use `zml.Buffer` by calling
`zml.Buffer.fromArray()`. It's important to note that `Buffer`s reside in
_accelerator_ (or _device_) memory, which is precisely where the input needs to
be for the executable to process it on the device.
For clarity, let's recap the distinction: `HostBuffer`s are located in standard
_host_ memory, which is accessible by the CPU. When we initialized the weights,
we used `HostBuffers` to set up the `BufferStore`. This is because the
`BufferStore` typically loads weights from disk into `HostBuffer`s, and then
converts them into `Buffer`s when we call `loadBuffers()`.
However, for inputs, we bypass the `BufferStore` and create `Buffer`s directly
in device memory.
```zig
// prepare an input buffer
// Here, we use zml.HostBuffer.fromSlice to show how you would create a
// HostBuffer with a specific shape from an array.
// This is useful when, e.g., you have a [4]f16 array but need a .{2, 2} input
// shape.
var input = [3]f16{ 5.0, 5.0, 5.0 };
var input_buffer = try zml.Buffer.from(
platform,
zml.HostBuffer.fromSlice(input_shape, &input),
);
defer input_buffer.deinit();
// call our executable module
var result: zml.Buffer = executable.call(.{input_buffer});
defer result.deinit();
// fetch the result buffer to CPU memory
const cpu_result = try result.toHostAlloc(arena);
std.debug.print(
"\n\nThe result of {d} * {d} + {d} = {d}\n",
.{ &weights, &input, &bias, cpu_result.items(f16) },
);
```
Note that the result of a computation usually resides in the memory of the
computation device, so with `.toHostAlloc()` we bring it back to CPU memory in
the form of a `HostBuffer`. After that, we can print it.
In order to print it, we need to tell the host buffer how to interpret the
memory. We do that by calling `.items(f16)`, making it cast the memory to `f16`
items.
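As a sanity check, the expected result can be computed on the host in plain Zig, with no ZML involved; element-wise, it is `weight * input + bias`:

```zig
const std = @import("std");

// Host-side check of the element-wise computation weight * input + bias.
pub fn main() void {
    const w = [3]f16{ 2.0, 2.0, 2.0 };
    const x = [3]f16{ 5.0, 5.0, 5.0 };
    const b = [3]f16{ 1.0, 2.0, 3.0 };
    var y: [3]f16 = undefined;
    for (&y, w, x, b) |*yi, wi, xi, bi| {
        yi.* = wi * xi + bi;
    }
    std.debug.print("{d}\n", .{y}); // { 11, 12, 13 }, matching the program output below
}
```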
And that's it! Now, let's have a look at building and actually running this
example!
## Building it
As mentioned already, ZML uses Bazel; so to build our model, we just need to
create a simple `BUILD.bazel` file, next to the `main.zig` file, like this:
```python
load("@rules_zig//zig:defs.bzl", "zig_binary")
zig_binary(
name = "simple_layer",
main = "main.zig",
deps = [
"@zml//async",
"@zml//zml",
],
)
```
To produce an executable, we import `zig_binary` from the Zig rules and pass
it a name and the Zig file we just wrote. The dependencies in `deps` are
what's needed for a basic ZML executable and correlate with our imports at the
top of the Zig file:
```zig
const zml = @import("zml");
const async = @import("async");
```
## Running it
With everything in place now, running the model is easy:
```
# run the release version (--config=release)
bazel run --config=release //simple_layer
# compile and run the debug version
bazel run //simple_layer
```
And voila! Here's the output:
```
bazel run --config=release //simple_layer
INFO: Analyzed target //simple_layer:simple_layer (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //simple_layer:simple_layer up-to-date:
bazel-bin/simple_layer/simple_layer
INFO: Elapsed time: 0.120s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/simple_layer/simple_layer
info(pjrt): Loaded library: libpjrt_cpu.dylib
info(zml_module): Compiling main.Layer.forward with { Shape({3}, dtype=.f16) }
The result of { 2, 2, 2 } * { 5, 5, 5 } + { 1, 2, 3 } = { 11, 12, 13 }
```
---
You can access the complete source code of this walkthrough here:
- [main.zig](https://github.com/zml/zml/tree/master/examples/simple_layer/main.zig)
- [BUILD.bazel](https://github.com/zml/zml/tree/master/examples/simple_layer/BUILD.bazel)
## The complete example
```zig
const std = @import("std");
const zml = @import("zml");
const async = @import("async");
/// Model definition
const Layer = struct {
bias: ?zml.Tensor = null,
weight: zml.Tensor,
pub fn forward(self: Layer, x: zml.Tensor) zml.Tensor {
var y = self.weight.mul(x);
if (self.bias) |bias| {
y = y.add(bias);
}
return y;
}
};
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
try async.AsyncThread.main(gpa.allocator(), asyncMain);
}
pub fn asyncMain() !void {
// Short lived allocations
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// Arena allocator for BufferStore etc.
var arena_state = std.heap.ArenaAllocator.init(allocator);
defer arena_state.deinit();
const arena = arena_state.allocator();
// Create ZML context
var context = try zml.Context.init();
defer context.deinit();
const platform = context.autoPlatform(.{});
// Our weights and bias to use
var weights = [3]f16{ 2.0, 2.0, 2.0 };
var bias = [3]f16{ 1.0, 2.0, 3.0 };
const input_shape = zml.Shape.init(.{3}, .f16);
// We manually produce a BufferStore. You would not normally do that.
// A BufferStore is usually created by loading model data from a file.
var buffers: zml.aio.BufferStore.Buffers = .{};
try buffers.put(arena, "weight", zml.HostBuffer.fromArray(&weights));
try buffers.put(arena, "bias", zml.HostBuffer.fromArray(&bias));
// the actual BufferStore
const bs: zml.aio.BufferStore = .{
.arena = arena_state,
.buffers = buffers,
};
// A clone of our model, consisting of shapes. We only need shapes for
// compiling. We use the BufferStore to infer the shapes.
const model_shapes = try zml.aio.populateModel(Layer, allocator, bs);
// Start compiling. This uses the inferred shapes from the BufferStore.
// The shape of the input tensor, we have to pass in manually.
var compilation = try async.async(zml.compileModel, .{ allocator, Layer.forward, model_shapes, .{input_shape}, platform });
// Produce a bufferized weights struct from the fake BufferStore.
// This is like the inferred shapes, but with actual values.
// We will need to send those to the computation device later.
var model_weights = try zml.aio.loadBuffers(Layer, .{}, bs, arena, platform);
defer zml.aio.unloadBuffers(&model_weights); // for good practice
// Wait for compilation to finish
const compiled = try compilation.await();
// pass the model weights to the compiled module to create an executable
// module
var executable = compiled.prepare(model_weights);
defer executable.deinit();
// prepare an input buffer
// Here, we use zml.HostBuffer.fromSlice to show how you would create a
// HostBuffer with a specific shape from an array.
// This is useful when, e.g., you have a [4]f16 array but need a .{2, 2}
// input shape.
var input = [3]f16{ 5.0, 5.0, 5.0 };
var input_buffer = try zml.Buffer.from(
platform,
zml.HostBuffer.fromSlice(input_shape, &input),
);
defer input_buffer.deinit();
// call our executable module
var result: zml.Buffer = executable.call(.{input_buffer});
defer result.deinit();
// fetch the result to CPU memory
const cpu_result = try result.toHostAlloc(arena);
std.debug.print(
"\n\nThe result of {d} * {d} + {d} = {d}\n",
.{ &weights, &input, &bias, cpu_result.items(f16) },
);
}
```
## Where to go from here
- [Add some weights files to your model](../howtos/add_weights.md)
- [Run the model on GPU](../tutorials/getting_started.md)
- [Deploy the model on a server](../howtos/deploy_on_server.md)
- [Dockerize this model](../howtos/dockerize_models.md)
- [Learn more about ZML concepts](../learn/concepts.md)
- [Find out how to best port PyTorch models](../howtos/howto_torch2zml.md)