# Deploying Models on a Server

Checking out your project's repository and compiling it on every remote GPU/TPU
machine is inconvenient. Instead, you will more likely want to cross-compile
right from your development machine, **for every** supported target
architecture and accelerator.

See [Getting Started with ZML](../tutorials/getting_started.md) if you need more
information on how to compile a model.

**Here's a quick recap:**

You can compile models for accelerator runtimes by appending one or more of the
following arguments to the command line when compiling or running a model:

- NVIDIA CUDA: `--@zml//runtimes:cuda=true`
- AMD ROCm: `--@zml//runtimes:rocm=true`
- Google TPU: `--@zml//runtimes:tpu=true`
- AWS Trainium/Inferentia 2: `--@zml//runtimes:neuron=true`
- **AVOID CPU:** `--@zml//runtimes:cpu=false`

So, to run the OpenLLaMA example model **on your development machine** housing
an NVIDIA GPU, run the following:

```
cd examples
bazel run --config=release //llama:OpenLLaMA-3B --@zml//runtimes:cuda=true
```
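
If you switch between accelerators often, a tiny shell helper can assemble the
runtime flags for you. The following is a minimal sketch and not part of ZML:
the function name `zml_runtime_flags` is made up for illustration, and it
assumes you always want to pass `cpu=false` alongside the accelerator you
enable.

```bash
# Hypothetical helper (not part of ZML): print the runtime flags for one
# accelerator, using the flag names from the list above.
zml_runtime_flags() {
  case "$1" in
    cuda|rocm|tpu|neuron)
      # Enable the requested accelerator runtime and avoid the CPU runtime.
      printf '%s\n' "--@zml//runtimes:$1=true" "--@zml//runtimes:cpu=false"
      ;;
    *)
      # Anything else is not one of the runtimes listed above.
      echo "unknown runtime: $1" >&2
      return 1
      ;;
  esac
}
```

You could then append the result to a command, for example
`bazel run --config=release //llama:OpenLLaMA-3B $(zml_runtime_flags cuda)`.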

## Cross-Compiling and creating a TAR for your server

Currently, ZML lets you cross-compile to one of the following target
architectures:

- Linux X86_64: `--platforms=@zml//platforms:linux_amd64`
- Linux ARM64: `--platforms=@zml//platforms:linux_arm64`
- macOS ARM64: `--platforms=@zml//platforms:macos_arm64`

As an example, here is how you build the OpenLLaMA model for CUDA on Linux
X86_64:

```
cd examples
bazel build --config=release //llama:OpenLLaMA-3B \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:cpu=false \
    --platforms=@zml//platforms:linux_amd64
```
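
To cover several target architectures in one go, you could loop over the
platform labels listed above. The sketch below only prints the commands (a dry
run), so you can inspect them before running anything. `print_build_commands`
is a hypothetical helper name, and you would still need to add the runtime
flags that match each target's hardware.

```bash
# Hypothetical dry-run helper: print one build command per target platform.
# The platform labels are the ones listed above; nothing is executed.
print_build_commands() {
  target="$1"   # a Bazel label, e.g. //llama:OpenLLaMA-3B
  for platform in linux_amd64 linux_arm64 macos_arm64; do
    echo "bazel build --config=release $target --platforms=@zml//platforms:$platform"
  done
}

print_build_commands //llama:OpenLLaMA-3B
```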

### Creating the TAR

When cross-compiling, it is convenient to produce a compressed TAR file that
you can copy to the target host, so you can unpack it there and run the model.

Let's use MNIST as an example.

If not already present, add an "archive" target to the model's `BUILD.bazel`,
like this:

```python
load("@aspect_bazel_lib//lib:tar.bzl", "mtree_spec", "tar")

# Manifest, required for building the tar archive
mtree_spec(
    name = "mtree",
    srcs = [":mnist"],
)

# Create a tar archive from the above manifest
tar(
    name = "archive",
    srcs = [":mnist"],
    args = [
        "--options",
        "zstd:compression-level=9",
    ],
    compress = "zstd",
    mtree = ":mtree",
)
```

... and then build the TAR archive:

```
# cd examples
bazel build --config=release //mnist:archive \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:cpu=false \
    --platforms=@zml//platforms:linux_amd64
```

Note the `//mnist:archive` notation: it builds the `archive` target rather than
the `mnist` binary itself.

The resulting tar file will be in `bazel-bin/mnist/archive.tar.zst`.

### Run it on the server

You can copy the TAR archive onto your Linux X86_64 NVIDIA GPU server, then
untar and run it:

```bash
# on your machine
scp bazel-bin/mnist/archive.tar.zst destination-server:
ssh destination-server # to enter the server

# ... on the server
tar xvf archive.tar.zst
./mnist \
    'mnist.runfiles/_main~_repo_rules~com_github_ggerganov_ggml_mnist/file/mnist.pt' \
    'mnist.runfiles/_main~_repo_rules~com_github_ggerganov_ggml_mnist_data/file/mnist.ylc'
```

The easiest way to figure out the command-line arguments of an example model is
to consult the model's `BUILD.bazel` and check its `args` section. It will
reference, for example, weights files that are defined either in the same
`BUILD.bazel` file or in a `weights.bzl` file.

You can also consult the console output when running your model locally:

```bash
bazel run //mnist

INFO: Analyzed target //mnist:mnist (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //mnist:mnist up-to-date:
  bazel-bin/mnist/mnist
INFO: Elapsed time: 0.302s, Critical Path: 0.00s
INFO: 3 processes: 3 internal.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/mnist/mnist ../_main~_repo_rules~com_github_ggerganov_ggml_mnist/file/mnist.pt ../_main~_repo_rules~com_github_ggerganov_ggml_mnist_data/file/mnist.ylc
# ...
```

The last `INFO` line shows the full command line. On the server, you just need
to replace `../` with the runfiles directory of your TAR (`mnist.runfiles/` in
the MNIST example).
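
That path rewrite can also be scripted. The following is a minimal sketch in
plain shell: `to_server_path` is a hypothetical helper, not something ZML
ships, and it assumes every argument that needs rewriting starts with `../`, as
in the console output above.

```bash
# Hypothetical helper: turn an argument printed by a local `bazel run` into
# the matching path inside the unpacked runfiles directory on the server.
to_server_path() {
  runfiles_dir="$1"   # e.g. mnist.runfiles
  arg="$2"            # one argument from the "Running command line" output
  case "$arg" in
    # Replace the leading "../" with the runfiles directory.
    ../*) printf '%s\n' "$runfiles_dir/${arg#../}" ;;
    # Leave every other argument untouched.
    *)    printf '%s\n' "$arg" ;;
  esac
}

to_server_path mnist.runfiles '../_main~_repo_rules~com_github_ggerganov_ggml_mnist/file/mnist.pt'
# prints: mnist.runfiles/_main~_repo_rules~com_github_ggerganov_ggml_mnist/file/mnist.pt
```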