Commit Graph

51 Commits

Author SHA1 Message Date
1427286716 runtimes/neuron: fix neuron runtime
This PR fixes the neuron runtime with the following:

Proxy the PJRT Api method to enforce the client struct sizes since the
neuron PJRT plugin doesn't use `>=` but `==` to assert them, breaking
PJRT compatibility guarantees.
Fixes https://github.com/aws-neuron/aws-neuron-sdk/issues/1095

Reimplement `libneuronxla` in Zig to control neuronx-cc sandboxing and
invocation.

Implement a python bootstrapper in Zig to create a full blown
`neuronx-cc` executable, avoiding the infamous chicken and egg problem
of python executables boostrapping when sandboxed (due to fixed path
shebangs).

---------

Co-authored-by: Corentin Kerisit <corentin.kerisit@gmail.com>
2025-07-15 15:26:03 +00:00
e1ee340306 runtimes/cuda: implement zmlxcuda in Zig 2025-07-08 09:25:25 +00:00
c488b634fc runtimes/rocm: implement zmlxrocm in Zig
Also, sandbox `amdgpu.ids` and restore safetensors json parsing.
2025-07-07 16:48:07 +00:00
cf00506dbb Switch workspace build rules from zig_cc_binary to zig_binary, removing the hack and using the C linker directly. 2025-07-03 15:10:36 +00:00
e789e26008 Remove examples workspace and clean up related Bazel BUILD/MODULE files and Zig build scripts. 2025-06-19 09:30:29 +00:00
1a2b862ec2 Add sandbox neuron dependencies: define a trampoline PJRT, create an empty repository for distroless deps, and update Bazel build files and Zig/C sources accordingly. 2025-05-19 17:35:33 +00:00
55c5b540f8 Add XLA 20250718.0‑6319f0d with ROCm 6.4.1 support, update Bazel module files and runtime configs, and apply migration, FFI‑handler and header‑cleanup patches. 2025-05-12 12:10:27 +00:00
ed5ae31338 runtimes/rocm: fetch libdrm from amdgpu repository and add amdgpu.ids layer 2025-04-30 15:53:51 +00:00
e7323be10b runtimes/rocm: switch to in-process LLD, removing the need for sandboxed lld. 2025-04-23 11:43:18 +00:00
7d9fdf94e7 runtimes/rocm: sandbox ROCm dependencies and ensure they load on the main thread due to TLS usage in static C++ destructors. 2025-04-14 16:38:15 +00:00
eba0e72532 runtimes/tpu: sandbox TPU PJRT plugin; no external dependencies. 2025-04-10 14:47:16 +00:00
78d7b672e7 runtimes/cpu: sandbox CPU PJRT plugin, simplifying as there are no additional NEEDED dependencies. 2025-04-03 11:57:46 +00:00
2d321d232d runtimes/cuda: sandbox CUDA dependencies by removing them from the leaf binary, sandboxing the dependency graph, marking dlopen direct dependencies as NEEDED, setting RPATH to the sandbox, loading the PJRT plugin from the sandbox, and enabling weak CUDA symbols without direct linking. 2025-03-26 11:18:29 +00:00
f27a524f31 Update rules_zig: add zig_srcs target, fix source handling bug, clean up BUILD files, adjust async/coro.zig tests, and disable nemo and yaml model loaders. 2025-03-13 12:27:21 +00:00
9488672d4b workspace: bump xla to version 20250710.0-22ea002
Also:
- Bump XLA deps : `com_github_grpc_grpc` and `com_google_protobuf`
- Inject `rules_ml_toolchain`
- Fix `zig_proto_library` rule
2025-03-04 17:12:34 +00:00
fa0ed045ef runtimes/cuda: downgrade cuda and cudnn
This commit reverts part of https://github.com/zml/zml/pull/238/files
This is required because XLA has a strong dependencies on CUDA 12.8 and
upgrading to 12.9 is impossible due to
https://github.com/NVIDIA/cccl/issues/4967
2025-02-28 17:36:12 +00:00
1cafcc3c60 Workspace: bump XLA to newer version. 2025-02-05 17:35:27 +00:00
9ef838be25 Update neuron runtime BUILD.bazel to use Bazel manual tag and S3 cache integration. 2025-02-03 14:03:33 +00:00
95453c7242 Update XLA dependency to version 20250527.0‑cb67f2f and refresh related Bazel BUILD, MODULE, overlay and patch files. 2024-11-22 16:50:20 +00:00
d8a83830e8 runtimes: switch to Cloudflare Debian snapshots for more reliable dependency pinning. 2024-11-15 09:40:58 +00:00
ea3ce685a9 runtimes/neuron: bump runtime version and expose nrt.h header to Zig. 2024-11-14 13:37:47 +00:00
47a4eda5f6 runtimes/cuda: expose cuda.h in the C namespace for CUDA runtimes, enabling custom calls to CUDA functions. 2024-11-01 13:27:24 +00:00
4a0b1cce50 Update Bazel workspace and XLA overlay (MODULE.bazel, BUILD files, patches) to prevent dual LLVM builds and apply migration/bump patches. 2024-09-27 14:00:44 +00:00
63ef78efcc zml: add support for NVTX tracing 2024-08-21 14:41:40 +00:00
ca4e061ad5 Add Bazel build configurations for macOS x86_64 CPU runtime and ZLS third‑party integration. 2024-07-25 15:58:14 +00:00
efcf955a4e workspace, third_party/rules_zig: adjust ZLS to require --version as the first parameter and add missing keys to the BuildConfig object for code completion 2024-07-10 15:20:12 +00:00
967eeb928f Update Bazel workspace and runtime configs: rework sandboxing, bump PJRT to 7.0.0, and upgrade CUDA (12.8), cuDNN (9.8), and ROCm (6.3.4). 2024-06-25 11:00:29 +00:00
3aac788544 Update Bazel build configurations (zig.bzl, BUILD files) for MLIR, PJRT, Neuron, ROCm, tokenizer, and tools, fixing broken dependencies. 2024-05-20 11:28:25 +00:00
f5ab6ff2c6 Update XLA to version 20250204.0-6789523 and adjust Bazel module and runtime files for Bazel 8 compatibility. 2024-05-03 15:57:56 +00:00
5a2171793d workspace: MODULE.bazel cleanup
Title says it all !
2024-04-22 09:27:44 +00:00
980f1b17fb Ensure all runtime plugins have correct SONAME values, fixing issues with prebuilt PJRT plugins. 2024-03-11 10:15:22 +00:00
8a25b1eb74 Revert CUDA PJRT plugin version to 0.4.38 to address performance regression on XLA master. 2024-03-05 17:04:42 +00:00
169a24307c Migrate workspace and XLA module definitions to Bazel 8, updating MODULE.bazel files, BUILD rules, and related migration patches. 2024-02-12 12:43:23 +00:00
7e6103d876 Upgrade XLA to version 20250122.0-cc075be, switch to nvptx compiler and nvlink with nvjitlink support, add warning for CUDA path in LD_LIBRARY_PATH, and revert the previous CUDA sandbox fix. 2024-02-06 09:31:48 +00:00
edc2ac26f8 Adjust ROCm runtime sandboxing to hook only the PJRT plugin and make hipblastlt bytecodes optional. 2024-01-26 13:02:23 +00:00
a7b7ae0180 Fix async hangs by reworking the libxev epoll backend and using callBlocking for PJRT plugin loading, improving performance across async and runtime modules. 2024-01-16 14:13:45 +00:00
434cee3a6c Fix CUDA and ROCm sandbox discovery, update epoll libxev patch to prevent high CPU usage, enable XLA GPU latency‑hiding scheduler, and upgrade cuDNN to 9.6.0. 2024-01-15 09:41:42 +00:00
145e60b4dd workspace: Update LLVM, XLA, StableHLO, and PJRT plugins to latest versions. 2023-12-13 10:10:32 +00:00
37725cdaa6 Update PJRT, runtime, and ZML modules to use per‑target output folders and expose profiler.dumpDataAsJson for JSON profiling output. 2023-12-04 10:38:10 +00:00
455bb3877f runtimes/cuda: obtain NCCL from the pip package, matching XLA behavior. 2023-09-20 17:41:44 +00:00
0d5389ceda Update CUDA runtime sandboxing and dynamic symbol renaming, switch to pre‑built jax‑cuda‑pjrt plugin, and bump CUDA to 12.6.2 and cuDNN to 9.5.1. 2023-09-14 13:28:25 +00:00
7d24329d0a Add Bazel build rules and runtime implementation for AWS Neuron/Trainium/Inferentia support. 2023-08-18 17:11:27 +00:00
01eff33fa0 Update workspace dependencies to newer LLVM, XLA, StableHLO, and PJRT versions and expose new pjrt plugin attribute APIs and stablehlo version APIs in build and runtime configurations. 2023-08-07 12:28:36 +00:00
54e7eb30b4 Introduce a thin abstraction layer between ZML and PJRT to manage plugin loading decisions, enable compile‑time detection of linked runtimes, and handle cases such as libtpu blocking metadata access. 2023-05-15 09:36:41 +00:00
cfe38f27ca Switch ROCm dlopen handling to patchelf's rename_dynamic_symbols for more robust dynamic symbol import. 2023-05-03 17:33:46 +00:00
833ff5f28d Upgrade PJRT CUDA Plugin to version 0.2.3, adding NCCL support for correct sharding. 2023-04-12 15:47:06 +00:00
70d40208a2 runtimes/cuda: Fix version variable definitions in the build script to enable successful CUDA builds. 2023-03-09 11:31:02 +00:00
0c126c2e12 runtimes/cuda: Upgrade CUDA to 12.6.2 and cuDNN to 9.4.0. 2023-03-03 15:17:26 +00:00
f595d22134 runtimes/rocm: Upgrade ROCm to version 6.2.2. 2023-03-01 13:15:50 +00:00
0606ea1d7c Update Bazel workspace and runtime BUILD files to newer XLA, StableHLO, and LLVM versions, enabling batching‑dims support for the gather operator. 2023-02-01 15:58:30 +00:00