Update docs (deploy_on_server, dockerize_models, getting_started) and example Bazel files to include AWS Neuron/Trainium/Inferentia deployment guidance.

This commit is contained in:
Foke Singh 2023-08-21 09:15:48 +00:00
parent 7d24329d0a
commit af0630616c
6 changed files with 482 additions and 421 deletions

View File

@@ -17,6 +17,7 @@ following arguments to the command line when compiling / running a model:
- NVIDIA CUDA: `--@zml//runtimes:cuda=true`
- AMD ROCm: `--@zml//runtimes:rocm=true`
- Google TPU: `--@zml//runtimes:tpu=true`
- AWS Trainium/Inferentia 2: `--@zml//runtimes:neuron=true`
- **AVOID CPU:** `--@zml//runtimes:cpu=false`
So, to run the OpenLLama model from above **on your development machine**
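As a sketch of how these flags combine on the command line (the target label is illustrative, not taken from this page):

```
# Hypothetical invocation: compile for AWS Trainium/Inferentia 2 only,
# skipping the CPU backend; the target label //llama:OpenLLaMA-3B is illustrative
bazel run -c opt //llama:OpenLLaMA-3B \
  --@zml//runtimes:neuron=true \
  --@zml//runtimes:cpu=false
```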

View File

@@ -7,11 +7,11 @@ just have to append a few lines to your model's `BUILD.bazel`. Here is how it's
done.
**Note:** This walkthrough will work with your installed container runtime, no
matter if it's **Docker or e.g. Podman.** Also, we'll create images in the
[OCI](https://github.com/opencontainers/image-spec) open image format.
Let's try containerizing our [first model](../tutorials/write_first_model.md), as it
doesn't need any additional weights files. We'll see [down below](#adding-weights-and-data)
how to add those. We'll also see how to add GPU/TPU support for our container
there.
@@ -57,7 +57,7 @@ zig_cc_binary(
### 1. The Manifest
To get started, let's make bazel generate a manifest that will be used when
creating the TAR archive.
```python
# Manifest created from the simple_layer binary and friends
@@ -118,7 +118,7 @@ See how we use string interpolation to fill in the folder name for the
container's entrypoint?
Next, we use a transition rule to force the container to be built for
Linux X86_64:
```python
@@ -150,10 +150,10 @@ INFO: Build completed successfully, 1 total action
### 4. The Load
While inspecting the image is surely interesting, we usually want to load the
image so we can run it.
There is a bazel rule for that: `oci_load`. When we append the following lines
to `BUILD.bazel`:
```python
@@ -218,7 +218,7 @@ how to build Docker images that also contain data files.
You can `bazel run -c opt //mnist:push -- --repository
index.docker.io/my_org/zml_mnist` in the `./examples` folder if you want to try
it out.
**Note: Please add one or more of the following parameters to specify all the
platforms your containerized model should support.**
@@ -226,6 +226,7 @@ platforms your containerized model should support.**
- NVIDIA CUDA: `--@zml//runtimes:cuda=true`
- AMD ROCm: `--@zml//runtimes:rocm=true`
- Google TPU: `--@zml//runtimes:tpu=true`
- AWS Trainium/Inferentia 2: `--@zml//runtimes:neuron=true`
- **AVOID CPU:** `--@zml//runtimes:cpu=false`
**Example:**
@@ -337,7 +338,7 @@ oci_image(
    name = "image_",
    base = "@distroless_cc_debian12",
    # the entrypoint comes from the expand_template rule `entrypoint` above
    entrypoint = ":entrypoint",
    tars = [":archive"],
)
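For reference, the `push` target invoked earlier can be declared with the `oci_push` rule from `rules_oci`. This is a minimal sketch, assuming the `:image` target above; the repository value mirrors the example command and is not taken from this diff:

```python
load("@rules_oci//oci:defs.bzl", "oci_push")

# Hypothetical push target; repository is illustrative
oci_push(
    name = "push",
    image = ":image",
    repository = "index.docker.io/my_org/zml_mnist",
)
```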

View File

@@ -115,6 +115,7 @@ following arguments to the command line when compiling or running a model:
- NVIDIA CUDA: `--@zml//runtimes:cuda=true`
- AMD ROCm: `--@zml//runtimes:rocm=true`
- Google TPU: `--@zml//runtimes:tpu=true`
- AWS Trainium/Inferentia 2: `--@zml//runtimes:neuron=true`
- **AVOID CPU:** `--@zml//runtimes:cpu=false`
The latter, avoiding compilation for CPU, cuts down compilation time.
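To avoid retyping these flags, they can be grouped into named configs in a `.bazelrc`; this is a sketch assuming only the flags listed above (the config names are illustrative):

```
# Hypothetical .bazelrc entries; select one with e.g. `bazel run --config=neuron //...`
build:cuda   --@zml//runtimes:cuda=true   --@zml//runtimes:cpu=false
build:rocm   --@zml//runtimes:rocm=true   --@zml//runtimes:cpu=false
build:tpu    --@zml//runtimes:tpu=true    --@zml//runtimes:cpu=false
build:neuron --@zml//runtimes:neuron=true --@zml//runtimes:cpu=false
```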

View File

@@ -4,9 +4,9 @@ bazel_dep(name = "bazel_skylib", version = "1.7.1")
bazel_dep(name = "rules_zig", version = "20240913.0-1957d05")
bazel_dep(name = "platforms", version = "0.0.10")
bazel_dep(name = "zml", version = "0.1.0")
bazel_dep(name = "aspect_bazel_lib", version = "2.8.1.1")
bazel_dep(name = "aspect_bazel_lib", version = "2.9.3")
bazel_dep(name = "rules_oci", version = "2.0.0")
oci = use_extension("@rules_oci//oci:extensions.bzl", "oci")
oci.pull(
    name = "distroless_cc_debian12",
@@ -17,6 +17,15 @@ oci.pull(
    ],
)
use_repo(oci, "distroless_cc_debian12", "distroless_cc_debian12_linux_amd64")
oci.pull(
    name = "distroless_cc_debian12_debug",
    digest = "sha256:ae6f470336acbf2aeffea3db70ec0e74d69bee7270cdb5fa2f28fe840fad57fe",
    image = "gcr.io/distroless/cc-debian12",
    platforms = [
        "linux/amd64",
    ],
)
use_repo(oci, "distroless_cc_debian12_debug", "distroless_cc_debian12_debug_linux_amd64")
# Mnist weights
http_file = use_repo_rule("@bazel_tools//tools/build_defs/repo:http.bzl", "http_file")
@@ -37,7 +46,6 @@ http_file(
# Llama weights
huggingface = use_extension("@zml//bazel:huggingface.bzl", "huggingface")
huggingface.model(
    name = "Karpathy-TinyLlama-Stories",
    build_file_content = """\
@@ -101,7 +109,6 @@ filegroup(
    model = "meta-llama/Meta-Llama-3.1-8B-Instruct",
)
use_repo(huggingface, "Meta-Llama-3.1-8B-Instruct")
huggingface.model(
    name = "Meta-Llama-3.1-70B-Instruct",
    build_file_content = """\
@@ -125,7 +132,6 @@ filegroup(
    model = "meta-llama/Meta-Llama-3.1-70B-Instruct",
)
use_repo(huggingface, "Meta-Llama-3.1-70B-Instruct")
huggingface.model(
    name = "TinyLlama-1.1B-Chat-v1.0",
    build_file_content = """\
@@ -149,7 +155,6 @@ filegroup(
    model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
use_repo(huggingface, "TinyLlama-1.1B-Chat-v1.0")
huggingface.model(
    name = "OpenLM-Research-OpenLLaMA-3B",
    build_file_content = """\

File diff suppressed because it is too large

View File

@@ -150,9 +150,12 @@ tar(
oci_image(
    name = "image_",
    base = "@distroless_cc_debian12",
    base = "@distroless_cc_debian12_debug",
    entrypoint = ["./{}/llama".format(package_name())],
    tars = [":archive"],
    tars = [
        "@zml//runtimes:layers",
        ":archive",
    ],
)
platform_transition_filegroup(