Update docs: add deployment guide, Hugging Face token instructions, getting‑started tutorial, and include a Bazel lock example.

Foke Singh 2025-06-05 13:18:14 +00:00
parent f9280b1069
commit 7fb02e1888
4 changed files with 379 additions and 30 deletions

View File

@@ -20,12 +20,12 @@ following arguments to the command line when compiling / running a model:
 - AWS Trainium/Inferentia 2: `--@zml//runtimes:neuron=true`
 - **AVOID CPU:** `--@zml//runtimes:cpu=false`
 
-So, to run the OpenLLama model from above **on your development machine**
+So, to run the Llama model from above **on your development machine**
 housing an NVIDIA GPU, run the following:
 
 ```
 cd examples
-bazel run --config=release //llama:OpenLLaMA-3B --@zml//runtimes:cuda=true
+bazel run --config=release //llama --@zml//runtimes:cuda=true -- --hf-model-path=$HOME/Llama-3.2-1B-Instruct
 ```
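
The updated invocation composes with the other runtime flags listed in this hunk. A minimal sketch for an AWS Trainium/Inferentia 2 host, assuming the same `//llama` target and a model already downloaded to `$HOME/Llama-3.2-1B-Instruct`:

```
cd examples
# Sketch: select the Neuron runtime and disable the CPU fallback,
# using the flags documented above; target and model path are assumed unchanged.
bazel run --config=release //llama \
  --@zml//runtimes:neuron=true \
  --@zml//runtimes:cpu=false \
  -- --hf-model-path=$HOME/Llama-3.2-1B-Instruct
```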
@@ -38,14 +38,14 @@ architectures:
 - Linux ARM64: `--platforms=@zml//platforms:linux_arm64`
 - MacOS ARM64: `--platforms=@zml//platforms:macos_arm64`
 
-As an example, here is how you build above OpenLLama for CUDA on Linux X86_64:
+As an example, here is how you build the above Llama for CUDA on Linux X86_64:
 
 ```
 cd examples
-bazel build --config=release //llama:OpenLLaMA-3B \
+bazel build --config=release //llama \
   --@zml//runtimes:cuda=true \
   --@zml//runtimes:cpu=false \
   --platforms=@zml//platforms:linux_amd64
 ```
 
 ### Creating the TAR
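
A note on the build command in the hunk above: the platform flags listed there swap in the same way. A minimal sketch of cross-compiling for the other listed platform, assuming `//llama` builds for macOS ARM64:

```
cd examples
# Sketch: same build as above with the macOS ARM64 platform flag from the list;
# runtime flags omitted since CUDA does not apply on this platform.
bazel build --config=release //llama \
  --platforms=@zml//platforms:macos_arm64
```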

View File

@@ -1,24 +1,21 @@
-# Huggingface Token Authentication
+# Running Gated Huggingface Models with Token Authentication
 
-Some models have restrictions and may require some sort of approval or
-agreement process, which, by consequence, **requires token-authentication with
-Huggingface**.
+Some models have restrictions and may require some sort of approval or agreement
+process, which, by consequence, **requires token-authentication with Huggingface**.
 
-Here is how you can generate a **"read-only public repositories"** access token
-to log into your account on Huggingface, directly from `bazel`, in order to
-download models.
+The easiest way might be to use the `huggingface-cli login` command.
+
+Alternatively, here is how you can generate a **"read-only public repositories"**
+access token to log into your account on Huggingface, directly from `bazel`, in order to download models.
 
 * log in at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
 * click on "Create new token"
-* give the token a name, eg `zml_public_repos`
-* under _Repositories_, grant the following permission: "Read access to
-  contents of all public gated repos you can access".
-* at the bottom, click on "Create token".
+* give the token a name, eg `zml_public_repos`,
+* under _Repositories_, grant the following permission: "Read access to contents of all public gated repos you can access".
+* at the bottom click on "Create token".
 * copy the token by clicking `Copy`. **You won't be able to see it again.**
 * the token looks something like `hf_abCdEfGhijKlM`.
-* store the token on your machine (replace the placeholder with your actual
-  token):
+* store the token on your machine (replace the placeholder with your actual token):
 
 You can use the `HUGGINGFACE_TOKEN` environment variable to store the token or use
 its standard location:
@@ -26,7 +23,7 @@ its standard location:
 mkdir -p $HOME/.cache/huggingface/; echo <hf_my_token> > "$HOME/.cache/huggingface/token"
 ```
 
-Now you're ready to download a gated model like `Meta-Llama-3-8b`!
+Now you're ready to download a gated model like `Meta-Llama-3.2-1b`!
 
 **Example:**
@@ -34,8 +31,7 @@ Now you're ready to download a gated model like `Meta-Llama-3-8b`!
 # requires token in $HOME/.cache/huggingface/token, as created by the
 # `huggingface-cli login` command, or the `HUGGINGFACE_TOKEN` environment variable.
 cd examples
-bazel run --config=release //llama:Meta-Llama-3-8b
-bazel run --config=release //llama:Meta-Llama-3-8b -- --promt="Once upon a time,"
+bazel run @zml//tools:hf -- download meta-llama/Llama-3.2-1B-Instruct --local-dir $HOME/Llama-3.2-1B-Instruct --exclude='*.pth'
+bazel run --config=release //llama -- --hf-model-path=$HOME/Llama-3.2-1B-Instruct --prompt="What is the capital of France?"
 ```
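
The comments in the hunk above reference the `HUGGINGFACE_TOKEN` environment variable. A minimal sketch of setting it for the current shell, reusing the placeholder token from earlier, plus an optional check that the login works:

```
# Sketch: hf_abCdEfGhijKlM is the placeholder token from above, not a real token.
export HUGGINGFACE_TOKEN=hf_abCdEfGhijKlM
# Optional: verify the token resolves to your Huggingface account.
huggingface-cli whoami
```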

View File

@@ -75,8 +75,9 @@ Once you've been granted access, you're ready to download a gated model like
 # requires token in $HOME/.cache/huggingface/token, as created by the
 # `huggingface-cli login` command, or the `HUGGINGFACE_TOKEN` environment variable.
 cd examples
-bazel run --config=release //llama:Llama-3.1-8B-Instruct
-bazel run --config=release //llama:Llama-3.1-8B-Instruct -- --prompt="What is the capital of France?"
+bazel run @zml//tools:hf -- download meta-llama/Llama-3.1-8B-Instruct --local-dir $HOME/Llama-3.1-8B-Instruct --exclude='*.pth'
+bazel run --config=release //llama -- --hf-model-path=$HOME/Llama-3.1-8B-Instruct
+bazel run --config=release //llama -- --hf-model-path=$HOME/Llama-3.1-8B-Instruct --prompt="What is the capital of France?"
 ```
 
 You can also try `Llama-3.1-70B-Instruct` if you have enough memory.
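
By analogy with the hunk above, a hedged sketch for the 70B variant, assuming the repo id `meta-llama/Llama-3.1-70B-Instruct` follows the same naming pattern:

```
cd examples
# Sketch: same download/run pattern as above, swapping in the 70B repo id.
bazel run @zml//tools:hf -- download meta-llama/Llama-3.1-70B-Instruct --local-dir $HOME/Llama-3.1-70B-Instruct --exclude='*.pth'
bazel run --config=release //llama -- --hf-model-path=$HOME/Llama-3.1-70B-Instruct
```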
@@ -88,8 +89,9 @@ Like the 8B model above, this model also requires approval. See
 ```
 cd examples
-bazel run --config=release //llama:Llama-3.2-1B-Instruct
-bazel run --config=release //llama:Llama-3.2-1B-Instruct -- --prompt="What is the capital of France?"
+bazel run @zml//tools:hf -- download meta-llama/Llama-3.2-1B-Instruct --local-dir $HOME/Llama-3.2-1B-Instruct --exclude='*.pth'
+bazel run --config=release //llama -- --hf-model-path=$HOME/Llama-3.2-1B-Instruct
+bazel run --config=release //llama -- --hf-model-path=$HOME/Llama-3.2-1B-Instruct --prompt="What is the capital of France?"
 ```
 
 For a larger 3.2 model, you can also try `Llama-3.2-3B-Instruct`.

File diff suppressed because one or more lines are too long