| title | zk0 Node Operators Guide: Join Decentralized Robotics AI Training |
|---|---|
| description | Complete guide for node operators to contribute to zk0 federated learning network with private SO-100 datasets, using zk0bot CLI for SmolVLA training in humanoid robotics. |
Welcome to the zk0 Node Operators Guide! This document provides everything you need to know to participate in the zk0 federated learning network as a node operator.
zk0 is a federated learning platform for robotics AI, enabling privacy-preserving training of SmolVLA models across distributed clients using real-world SO-100/SO-101 datasets. Node operators contribute their private robotics datasets while maintaining full data privacy.
To join the zk0 network:
- Review Requirements: Ensure you have:
  - A private robotics dataset (SO-100/SO-101 compatible)
  - GPU-enabled machine (recommended for training)
  - Stable internet connection
  - Basic familiarity with Conda and tmux
- Submit Application: Create a new issue using our Node Operator Application Template
- Wait for Approval: Our team will review your application and contact you via Discord
Once approved, install the zk0bot CLI tool:
# One-line installer
curl -fsSL https://raw.githubusercontent.com/ivelin/zk0/dev/get-zk0bot.sh | bash
This will:
- Clone or update the zk0 repo at ~/zk0 (dev branch)
- Create the zk0 conda environment (Python 3.10)
- Install flwr[superexec], lerobot[smolvla], and torch (cu121 build), then run pip install -e .
- Optionally log in to Hugging Face
Set up required environment variables in .env (auto-sourced by zk0bot.sh):
# .env example (create in ~/zk0)
HF_TOKEN=your_huggingface_token_here
WANDB_API_KEY=your_wandb_key_here # optional, server-side only
Note: zk0bot.sh automatically sources .env after conda activation, propagating HF_TOKEN/WANDB_API_KEY to tmux Flower subprocesses (SuperLink/SuperNode). No manual export needed.
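Before starting a client, it can help to confirm that .env really defines the token. The following is a minimal sketch and not part of zk0bot: `env_has_key` is a hypothetical helper, and it assumes the plain KEY=value layout shown in the .env example.

```shell
# Hypothetical helper (not part of zk0bot): succeed if KEY has a
# non-empty value in a KEY=value style env file.
env_has_key() {
  local file="$1" key="$2"
  # A missing file counts as "key not defined".
  [ -f "$file" ] || return 1
  # Match "KEY=" (optionally preceded by "export ") followed by at least
  # one character that is neither whitespace nor '#'.
  grep -Eq "^[[:space:]]*(export[[:space:]]+)?${key}=[^[:space:]#]" "$file"
}
```

For example, `env_has_key ~/zk0/.env HF_TOKEN || echo "HF_TOKEN is missing from .env"` prints a warning when the token line is absent or empty. The check only looks at the file; it does not verify that the token is valid.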
Server Machine:
curl -fsSL https://raw.githubusercontent.com/ivelin/zk0/main/website/get-zk0bot.sh | bash
cd ~/zk0
zk0bot server start # Auto-activates zk0 env; SuperLink ready
Client Machines (same LAN, add to ~/.bashrc: export ZK0_SERVER_IP=server_ip):
curl -fsSL https://raw.githubusercontent.com/ivelin/zk0/main/website/get-zk0bot.sh | bash
cd ~/zk0
zk0bot client start shaunkirby/record-test # Auto-activates zk0 env; or your private dataset
zk0bot client start ethanCSL/direction_test
On Server (submit run):
zk0bot run --rounds 20 --stream # Full FL session, stateless; auto-zk0 env
Remote Clients: Set ZK0_SERVER_IP=public_server_ip (insecure=true for dev; TLS for prod).
Note: WandB logging is handled server-side only. Client training does not require WandB credentials.
Light Test Production Run (Recommended):
- Server: zk0bot server start
- Clients: zk0bot client start shaunkirby/record-test
- Submit run: zk0bot run --rounds 3 --stream
Examples:
zk0bot client start shaunkirby/record-test
zk0bot client start ethanCSL/direction_test
Production Run (Stateless):
# Standard (pyproject.toml defaults) - runs all server rounds
zk0bot client start yourusername/your-private-dataset
zk0bot client start local:/path/to/your/dataset
Your client will:
- Connect to zk0 server (auto-starts at min_fit_clients=2)
- Train locally for all server rounds (stateless, no persistence)
- Send only model updates (no raw data leaves machine)
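The client commands above take either a Hugging Face repo (owner/name) or a local:/path argument, and the environment variable reference in this guide also mentions an hf: prefix. As a rough sketch of how those forms could be told apart before launching a client, the hypothetical `dataset_kind` helper below classifies a URI. It is an illustration under those assumptions; zk0bot's own parsing may differ.

```shell
# Hypothetical helper (not part of zk0bot): print "local", "hf", or
# "invalid" for a dataset URI, returning non-zero for invalid input.
dataset_kind() {
  local uri="$1" repo
  case "$uri" in
    local:/*) echo "local"; return 0 ;;   # local:/absolute/path
    local:*)  echo "invalid"; return 1 ;; # this sketch requires an absolute path
  esac
  # Drop an optional "hf:" prefix, then expect exactly "owner/name".
  repo="${uri#hf:}"
  case "$repo" in
    */*/* | /* | */ | *" "*) echo "invalid"; return 1 ;;
    ?*/?*) echo "hf"; return 0 ;;
    *)     echo "invalid"; return 1 ;;
  esac
}
```

With this sketch, `dataset_kind shaunkirby/record-test` prints `hf` and `dataset_kind local:/path/to/your/dataset` prints `local`, while a bare name with no owner is rejected.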
If you're running a zk0 server:
# Start the server
zk0bot server start
# Check status
zk0bot status
# View logs
zk0bot server log
# Stop the server
zk0bot server stop
zk0bot status
# Server logs
zk0bot server log
# Client logs
zk0bot client log
- tmux not found: sudo apt install tmux (Linux) or brew install tmux (macOS)
- Conda zk0 not active: conda activate zk0
- Dataset not found: Verify dataset path/URL and credentials
- Connection failed: Check internet connection and server availability
- Installer fails: Check GitHub status, ensure curl available, or git clone https://github.com/ivelin/zk0.git ~/zk0; cd ~/zk0; ./get-zk0bot.sh
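Several of the problems listed above come down to a missing command-line tool. A small pre-flight check can surface them in one pass. This is a sketch only: `missing_tools` is a hypothetical helper, and the list of tools to check is an assumption based on the commands this guide mentions.

```shell
# Hypothetical helper (not part of zk0bot): print each named command
# that cannot be found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v exits non-zero when the command is not available.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools curl git tmux conda` prints nothing when all four are installed; any name it prints needs installing before the installer or zk0bot will work.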
- SO-100: Real-world robotics manipulation tasks
- SO-101: Extended robotics tasks
- Custom: LeRobot-compatible datasets
- Clear, well-annotated episodes
- Consistent task definitions
- No overlap with existing network datasets
- Minimum 100 episodes recommended
- All training happens locally
- Only model updates (parameters) are shared
- Raw data never leaves your environment
- Dataset metadata is anonymized
- Always-On Operation: Server runs continuously via SuperExec-Server.
- Automatic Start: Sessions start at min_fit_clients=2.
- Idle Handling: The server idles while fewer than the minimum number of clients (min_fit_clients) are connected.
- Full Participation: Clients run ALL server rounds (num_server_rounds).
- Manual Stop: Use zk0bot client stop to disconnect.
- Clean Restarts: No state; always fresh.
- Stateless SuperExec: Clean restarts, no persistence.
- Insecure Mode: Used for development; enable TLS for production.
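The automatic-start rule described above (a session starts once min_fit_clients clients are connected, and the server idles otherwise) can be written as a one-line comparison. The `session_state` function is a hypothetical illustration of that rule, not zk0 code; the default of 2 is taken from the min_fit_clients=2 value quoted in this guide.

```shell
# Hypothetical illustration of the start/idle rule (not zk0 code).
# Usage: session_state CONNECTED_CLIENTS [MIN_FIT_CLIENTS]
session_state() {
  local connected="$1" min_fit_clients="${2:-2}"
  if [ "$connected" -ge "$min_fit_clients" ]; then
    echo "ready"   # enough clients: a session can start
  else
    echo "idle"    # below the threshold: the server waits
  fi
}
```

For instance, `session_state 1` prints `idle` and `session_state 2` prints `ready`.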
Note over SuperLink,SuperNode2: Persistent Infrastructure (started first)
Admin->>+SuperLink: Start SuperLink\n(zk0bot.sh server start)
Note right of SuperLink: Listens on gRPC Fleet API\n(ports 9091-9093)
Admin->>+SuperNode1: Start SuperNode 1\n(zk0bot.sh client start <dataset-uri1>)
Note right of SuperNode1: e.g., dataset-uri1 = "shaunkirby/record-test"\nor "local:/data/client1_episodes"
SuperNode1->>+SuperLink: Register via gRPC\n(Fleet API handshake)
Note right of SuperNode1: Passes node-config\n(dataset-uri=uri1 → unique/private dataset)
Admin->>+SuperNode2: Start SuperNode 2\n(zk0bot.sh client start <dataset-uri2>)
Note right of SuperNode2: e.g., dataset-uri2 = "ethanCSL/direction_test"\nor private HF repo / local path
SuperNode2->>+SuperLink: Register via gRPC\n(Fleet API handshake)
Note right of SuperNode2: Passes node-config\n(dataset-uri=uri2 → unique/private dataset)
Note over SuperLink,SuperNode2: SuperNodes now visible/registered in SuperLink logs
Admin->>+SuperLink: Submit Run\n(zk0bot.sh run → flwr run)
Note right of Admin: Uploads Flower App Bundle (FAB)\ncontaining ServerApp + ClientApp code
SuperLink->>ServerApp: Spawn SuperExec process\nfor ServerApp execution
Note right of ServerApp: ServerApp starts (strategy, rounds, etc.)
SuperLink->>SuperNode1: Instruct to execute ClientApp\n(via registered Fleet API, sends FAB + config)
SuperNode1->>ClientApp1: Spawn SuperExec process\nfor ClientApp
Note over ClientApp1: ClientApp loads private/unique dataset\n(from injected node-config dataset-uri=uri1)\ne.g., HF dataset download or local path
SuperLink->>SuperNode2: Instruct to execute ClientApp\n(via registered Fleet API, sends FAB + config)
SuperNode2->>ClientApp2: Spawn SuperExec process\nfor ClientApp
Note over ClientApp2: ClientApp loads private/unique dataset\n(from injected node-config dataset-uri=uri2)\ne.g., different HF repo or local episodes
Note over ServerApp,ClientApp2: Federation begins (gRPC message passing via SuperLink/SuperNodes)
loop For each federation round (e.g., Fit)
ServerApp->>SuperLink: Send FitIns (parameters)\nto selected SuperNodes
SuperLink->>SuperNode1: Forward FitIns (gRPC)
SuperNode1->>ClientApp1: Forward to local SuperExec
ClientApp1->>ClientApp1: Local training\non private unique dataset (from uri1)
ClientApp1->>SuperNode1: Return FitRes (updated parameters)
SuperNode1->>SuperLink: Forward FitRes
SuperLink->>ServerApp: Deliver FitRes
SuperLink->>SuperNode2: Forward FitIns (gRPC)
SuperNode2->>ClientApp2: Forward to local SuperExec
ClientApp2->>ClientApp2: Local training\non private unique dataset (from uri2)
ClientApp2->>SuperNode2: Return FitRes
SuperNode2->>SuperLink: Forward FitRes
SuperLink->>ServerApp: Deliver FitRes
ServerApp->>ServerApp: Aggregate updates\n(e.g., FedAvg)
end
Note over ServerApp,ClientApp2: Similar flow for Evaluate rounds\n(Server sends EvaluateIns, clients evaluate locally on their private datasets from uri1/uri2)
Note over SuperLink, SuperNode2: Run completes → SuperExec processes terminate. SuperLink & SuperNodes remain running for next Run
The core architecture uses Flower's Deployment Engine. Each client keeps its data private by naming its own dataset through the positional <dataset-uri> argument of zk0bot.sh client start <dataset-uri>.
- zk0bot.sh client start <dataset-uri>
  - Examples:
    - ./zk0bot.sh client start shaunkirby/record-test
    - ./zk0bot.sh client start ethanCSL/direction_test
  - Private: ./zk0bot.sh client start yourusername/private-so100
  - Local: ./zk0bot.sh client start local:/home/user/robot_episodes
- Passes --node-config '{"dataset-uri": "<uri>"}' to flower-supernode.
- Start SuperLink (zk0bot server start).
- SuperNodes register with SuperLink, injecting unique dataset-uri in node-config.
- zk0bot run submits FAB to SuperLink.
- SuperLink spawns ServerApp; instructs SuperNodes to spawn ClientApps with pre-registered config.
- Each ClientApp loads exclusive dataset from its node_config["dataset-uri"] (HF download/local).
- gRPC via SuperLink/SuperNodes: parameters/metrics only.
- Local training/eval on private datasets.
- No raw data leaves client.
Join our Discord community for support and updates: zk0 Discord
- Report issues: GitHub Issues
- Documentation: zk0 Docs
- Email: operators@zk0.bot
- Discord: zk0-team
- OS: Linux, macOS, Windows (WSL2)
- RAM: 16GB minimum, 32GB recommended
- GPU: NVIDIA GPU with CUDA support (optional but recommended)
- Storage: 50GB free space for datasets and models
- Network: Stable broadband connection
- Communications use insecure mode for development (no TLS encryption)
- Data remains on your local machine
- Secure parameter validation and hashing
- No external access to your datasets
- Note: TLS can be enabled for production deployments
- Training time: 10-30 minutes per round
- Network transfer: Minimal (only model updates)
- GPU utilization: Automatic detection and optimization
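The per-round figure above gives a rough way to budget a full run. The sketch below multiplies the 10-30 minute range by the number of rounds; `estimate_run_minutes` is a hypothetical helper, and it assumes rounds run back to back and ignores evaluation and network overhead, so treat the result as a ballpark only.

```shell
# Hypothetical helper (not part of zk0bot): print a "min-max" estimate in
# minutes for a run, using the 10-30 minutes per round quoted in this guide.
estimate_run_minutes() {
  local rounds="$1"
  echo "$((rounds * 10))-$((rounds * 30))"
}
```

For the 20-round example run, `estimate_run_minutes 20` prints `200-600`, roughly 3.3 to 10 hours; the 3-round light test comes out at `30-90`.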
zk0bot uses native Flower CLI + tmux for persistence. For custom setups:
- Edit zk0bot.sh ports/node-config.
- Env vars: DATASET_URI, HF_TOKEN.
- DATASET_URI: Dataset location (hf:repo/name or local:/path)
- HF_TOKEN: Hugging Face API token
- WANDB_API_KEY: Weights & Biases API key
- ZK0_SERVER_URL: Custom server URL (default: auto-discovery)
We welcome contributions to improve the zk0 platform:
- Fork the repository
- Create a feature branch
- Submit a pull request
- Join our Discord for discussion
zk0 is open-source software licensed under the Apache 2.0 License.
Last updated: 2025-12-17