Skip to content

docs: add hf rl environment datasets rfc#795

Open
burtenshaw wants to merge 1 commit into
huggingface:mainfrom
burtenshaw:codex/reopen-hf-rl-env-datasets
Open

docs: add hf rl environment datasets rfc#795
burtenshaw wants to merge 1 commit into
huggingface:mainfrom
burtenshaw:codex/reopen-hf-rl-env-datasets

Conversation

@burtenshaw

@burtenshaw burtenshaw commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

This PR adds RFC 006 for Hugging Face RL environment datasets, documenting dataset-root environment.yaml declarations, AutoEnv handling for hf://datasets references, and dataset-bound environment behavior. It also updates the RFC index so the proposal is discoverable.\n\nValidation: git diff --check and git diff --cached --check.

@bot-ci-comment

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Darktex Darktex left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is an automated review by Claude Code, not a human review.


Alignment Review Report

Automated Checks

  • Lint: PASS - docs-only PR, no Python source modified
  • Debug code: CLEAN - no debug artifacts in changed files

Tier 1: Fixes Required

  • RFC number collisionrfcs/006-hf-rl-environment-datasets.md (header, line 7) and rfcs/README.md

    RFC 006 is already reserved and cross-referenced in three existing documents for a different topic ("production performance simulation / training-production delta"):

    • rfcs/000-project-phases.md:56 — "Enabling production performance simulation to minimize training-production delta (RFC 006)."
    • rfcs/001-abstractions.md:170 — "See RFC 006 for how we simulate production performance characteristics…"
    • rfcs/003-mcp-support.md:17 — "…inject performance metadata (see RFC 006)…"

    This RFC must be renumbered. The next unambiguously free slot appears to be 008 (RFC 007 is referenced in RFC 000 and RFC 003 for MCP protocol interception). Use 008-hf-rl-environment-datasets.md.

  • Author handle inconsistencyrfcs/006-hf-rl-environment-datasets.md:12

    Header reads **Authors**: @ben. All existing RFCs use full GitHub handles (e.g., @Darktex, @pankit-eng, @jspisak). The PR author is @burtenshaw; the author field should match.


Tier 2: Alignment Discussion

ALIGNMENT FLAG: Dataset-bound step() carries implicit row-cursor side effect

  • Principle at stake: Gymnasium API signatures are an invariant — step(action) -> Observation must not change without a major version bump (INVARIANTS.md §API Invariants #1). The "one env = one trajectory" principle (PRINCIPLES.md) also relies on a clean step loop.
  • The concern: The RFC proposes that every call to step() silently advances a dataset row cursor (row 0 = step 0, row 1 = step 1, etc.). This is a new implicit side effect on step() that is not visible in the method signature and diverges from standard Gymnasium semantics where the environment owns its own state transitions completely. It is also under-specified: does a no-op or penalised step advance the cursor? What happens at end-of-dataset? These gaps risk producing environment implementations that violate the stable step contract.
  • Suggested reviewer: @Darktex

ALIGNMENT FLAG: AutoEnv.from_env(hf://...) caller persona is unspecified

  • Principle at stake: "Agents cannot reset" / dual API boundary (INVARIANTS.md §Security Invariants #1, §Architectural Invariants #1). The existing AutoEnv.from_env() returns an infrastructure/orchestration-layer EnvClient; agents interact via MCP only.
  • The concern: The RFC presents AutoEnv.from_env("hf://datasets/...") in code examples with direct env.reset() / env.step() calls, but does not explicitly state whether the returned object is an infrastructure client (correct) or intended for use by an agent (which would violate the invariant). Given that dataset discovery is a highly user-facing convenience feature, there is a real risk that downstream authors treat the dataset-bound client as agent-accessible. The RFC should explicitly state that the returned client is orchestration-layer-only, consistent with all other AutoEnv clients.
  • Suggested reviewer: @Darktex

ALIGNMENT FLAG: Verifiers adapter: field allows arbitrary remote class references

  • Principle at stake: "Avoid arbitrary remote Python execution by default" (RFC Goal #6, repeated in Key Design Decisions).
  • The concern: The terminal-bench example YAML includes adapter: verifiers.v1.packages.tasksets.HarborTaskset under the verifiers: framework block. While the RFC says "environment.yaml does not execute arbitrary adapter code" for the OpenEnv block, it does not enforce this constraint on other framework blocks. A consumer loading the same environment.yaml and honoring the Verifiers block would execute an arbitrary remote class. The RFC should either disclaim responsibility for third-party framework block execution, or add an explicit note that the adapter: pattern in the Verifiers example is a Verifiers-owned concern and OpenEnv's parser must never execute non-openenv: block content.
  • Suggested reviewer: @Darktex

Summary

  • 2 mechanical issues to fix (RFC number collision is blocking — it creates factual errors in the existing RFC corpus; author handle is minor)
  • 3 alignment points for human review

The RFC number collision is the critical blocker: merging this as-is would permanently conflict with the already-documented RFC 006 slot across three existing files. The document itself is well-structured and addresses a real gap in the ecosystem; the content concerns above are refinements rather than fundamental objections to the approach.


Automated review by Claude Code | Learn more

@Darktex Darktex left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is an automated review by Claude Code, not a human review.


Alignment Review: RFC 006 — HF RL Environment Datasets

Automated Checks

  • Lint: N/A — only Markdown files changed; lint hook targets Python in src/, tests/, envs/ only
  • Debug code: CLEAN — no Python source files introduced

Tier 1: Fixes Required

RFC Numbering Collision (blocking)

  • rfcs/006-hf-rl-environment-datasets.md header and filename — RFC ID 006 is already claimed by the simultaneously-open PR #794 ([RFC 006] add external environment import rfc, file rfcs/006-external-environment-imports.md). Both PRs are by the same author (burtenshaw) and are open at the same time. They will produce a filesystem-level conflict on rfcs/README.md and a semantic collision on the RFC 006 slot. This PR must be renumbered (to 007 or the next available slot), or PR #794 must land and close first. Both the filename and the **RFC ID**: 006 header field must be updated consistently.

  • rfcs/README.md new entry — The table entry added under ### Environment Datasets must match whatever renumbered RFC ID is chosen. Note that PR #794 also modifies rfcs/README.md, so these two PRs will conflict on merge regardless of ordering.

Author Handle (doc correctness)

  • rfcs/006-hf-rl-environment-datasets.md line 7 — **Authors**: @ben is not a valid GitHub handle. The RFC template requires [@username] format. The PR author field is burtenshaw; the header should read **Authors**: @burtenshaw.

Missing Template Sub-structure (doc correctness)

  • rfcs/006-hf-rl-environment-datasets.md "Key Design Decisions" section — The rfcs/README.md template requires each design decision entry to have Chosen Approach, Rationale, and Trade-offs sub-fields. The current section is a flat bullet list. Please expand each bullet into the structured format so reviewers can evaluate trade-offs explicitly.

Inconsistent spec_version in Examples

  • rfcs/006-hf-rl-environment-datasets.md "Optional OpenEnv Runtime Default" example — The openenv.yaml snippet uses spec_version: 1 (a plain integer), while all environment.yaml examples use spec_version: hf-rl-env-0.1 (a semver-style string). If this snippet shows the existing openenv.yaml manifest format rather than the new environment.yaml schema, add a clarifying comment. If it is meant to show the new schema, correct the value.

Tier 2: Alignment Discussion

ALIGNMENT FLAG: Dataset row cursor advances on step(), conflicting with the Gymnasium API contract and "one env = one trajectory" principle

  • Principle at stake: Gymnasium API signature invariant (step(action) -> Observation advances simulation by one timestep, not to a new task), and "One env = one trajectory" (PRINCIPLES.md)
  • The concern: The RFC proposes that each env.step(action) automatically advances the dataset row cursor to the next task row. Overloading step to also silently switch tasks means a multi-turn episode spanning N actions would consume N dataset rows regardless of episode boundaries. The RFC does not explain how the row cursor interacts with multi-step tasks, nor reconcile cursor advancement with reset() episode boundaries. This needs to be specified explicitly and reconciled with the Gym invariant.
  • Suggested reviewer: @Darktex

ALIGNMENT FLAG: env.dataset, env.current_row, and env.dataset_index exposed on the client object, creating a path for agents to observe task metadata including answers and verifier data

  • Principle at stake: Agent isolation and "Rewards inside environment" (INVARIANTS.md Security Invariants §1, Architectural Invariants §3)
  • The concern: The RFC attaches env.dataset (the full split) and env.current_row (the raw row, which may contain answer, hidden_tests, or verifier metadata per the canonical row schema) directly to the environment client. The RFC notes authors "must ensure" such data is not leaked, but this is a documentation warning only — there is no framework-level guard. The RFC should specify a framework-level mechanism (e.g. stripping sensitive fields before attaching, or keeping raw row data server-side only) rather than relying solely on environment-author discipline.
  • Suggested reviewer: @Darktex

ALIGNMENT FLAG: verifiers framework block in the environment.yaml example includes an adapter field containing a Python class import path

  • Principle at stake: "Avoid arbitrary remote Python execution by default" (RFC's own Goal #6) and container isolation (INVARIANTS.md Security Invariants §2)
  • The concern: The Terminal-Bench 2.0 example shows adapter: verifiers.v1.packages.tasksets.HarborTaskset in the verifiers framework block. The RFC states environment.yaml does not execute arbitrary adapter code and that OpenEnv only reads frameworks.openenv. However, it does not explicitly state that OpenEnv ignores all non-openenv framework keys entirely and never imports or evaluates their values. The RFC should explicitly specify that OpenEnv treats all non-openenv framework blocks as opaque, uninterpreted data.
  • Suggested reviewer: @Darktex

Summary

  • 4 mechanical issues to fix (RFC numbering collision with PR #794 is blocking; author handle; template structure; spec_version inconsistency)
  • 3 alignment discussion points for human review (Gym API / step-cursor semantics; client-side raw row exposure; verifiers adapter field scope)

The RFC numbering conflict is the hard blocker: this PR cannot merge as-is without a guaranteed merge ordering with PR #794 or an explicit renumbering. Since both PRs are by the same author, please coordinate which takes 006 and which takes 007 before either is approved.


Automated review by Claude Code | Learn more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants