docs: add hf rl environment datasets rfc#795
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review Report
Automated Checks
- Lint: PASS - docs-only PR, no Python source modified
- Debug code: CLEAN - no debug artifacts in changed files
Tier 1: Fixes Required
-
RFC number collision —
rfcs/006-hf-rl-environment-datasets.md(header, line 7) andrfcs/README.mdRFC 006 is already reserved and cross-referenced in three existing documents for a different topic ("production performance simulation / training-production delta"):
rfcs/000-project-phases.md:56— "Enabling production performance simulation to minimize training-production delta (RFC 006)."rfcs/001-abstractions.md:170— "See RFC 006 for how we simulate production performance characteristics…"rfcs/003-mcp-support.md:17— "…inject performance metadata (see RFC 006)…"
This RFC must be renumbered. The next unambiguously free slot appears to be 008 (RFC 007 is referenced in RFC 000 and RFC 003 for MCP protocol interception). Use
008-hf-rl-environment-datasets.md. -
Author handle inconsistency —
rfcs/006-hf-rl-environment-datasets.md:12Header reads
**Authors**: @ben. All existing RFCs use full GitHub handles (e.g.,@Darktex,@pankit-eng,@jspisak). The PR author is@burtenshaw; the author field should match.
Tier 2: Alignment Discussion
ALIGNMENT FLAG: Dataset-bound step() carries implicit row-cursor side effect
- Principle at stake: Gymnasium API signatures are an invariant —
step(action) -> Observationmust not change without a major version bump (INVARIANTS.md §API Invariants #1). The "one env = one trajectory" principle (PRINCIPLES.md) also relies on a cleansteploop. - The concern: The RFC proposes that every call to
step()silently advances a dataset row cursor (row 0 = step 0, row 1 = step 1, etc.). This is a new implicit side effect onstep()that is not visible in the method signature and diverges from standard Gymnasium semantics where the environment owns its own state transitions completely. It is also under-specified: does a no-op or penalised step advance the cursor? What happens at end-of-dataset? These gaps risk producing environment implementations that violate the stablestepcontract. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: AutoEnv.from_env(hf://...) caller persona is unspecified
- Principle at stake: "Agents cannot reset" / dual API boundary (INVARIANTS.md §Security Invariants #1, §Architectural Invariants #1). The existing
AutoEnv.from_env()returns an infrastructure/orchestration-layerEnvClient; agents interact via MCP only. - The concern: The RFC presents
AutoEnv.from_env("hf://datasets/...")in code examples with directenv.reset()/env.step()calls, but does not explicitly state whether the returned object is an infrastructure client (correct) or intended for use by an agent (which would violate the invariant). Given that dataset discovery is a highly user-facing convenience feature, there is a real risk that downstream authors treat the dataset-bound client as agent-accessible. The RFC should explicitly state that the returned client is orchestration-layer-only, consistent with all otherAutoEnvclients. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: Verifiers adapter: field allows arbitrary remote class references
- Principle at stake: "Avoid arbitrary remote Python execution by default" (RFC Goal #6, repeated in Key Design Decisions).
- The concern: The terminal-bench example YAML includes
adapter: verifiers.v1.packages.tasksets.HarborTasksetunder theverifiers:framework block. While the RFC says "environment.yaml does not execute arbitrary adapter code" for the OpenEnv block, it does not enforce this constraint on other framework blocks. A consumer loading the sameenvironment.yamland honoring the Verifiers block would execute an arbitrary remote class. The RFC should either disclaim responsibility for third-party framework block execution, or add an explicit note that theadapter:pattern in the Verifiers example is a Verifiers-owned concern and OpenEnv's parser must never execute non-openenv:block content. - Suggested reviewer: @Darktex
Summary
- 2 mechanical issues to fix (RFC number collision is blocking — it creates factual errors in the existing RFC corpus; author handle is minor)
- 3 alignment points for human review
The RFC number collision is the critical blocker: merging this as-is would permanently conflict with the already-documented RFC 006 slot across three existing files. The document itself is well-structured and addresses a real gap in the ecosystem; the content concerns above are refinements rather than fundamental objections to the approach.
Automated review by Claude Code | Learn more
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review: RFC 006 — HF RL Environment Datasets
Automated Checks
- Lint: N/A — only Markdown files changed; lint hook targets Python in
src/,tests/,envs/only - Debug code: CLEAN — no Python source files introduced
Tier 1: Fixes Required
RFC Numbering Collision (blocking)
-
rfcs/006-hf-rl-environment-datasets.mdheader and filename — RFC ID 006 is already claimed by the simultaneously-open PR #794 ([RFC 006] add external environment import rfc, filerfcs/006-external-environment-imports.md). Both PRs are by the same author (burtenshaw) and are open at the same time. They will produce a filesystem-level conflict onrfcs/README.mdand a semantic collision on the RFC 006 slot. This PR must be renumbered (to 007 or the next available slot), or PR #794 must land and close first. Both the filename and the**RFC ID**: 006header field must be updated consistently. -
rfcs/README.mdnew entry — The table entry added under### Environment Datasetsmust match whatever renumbered RFC ID is chosen. Note that PR #794 also modifiesrfcs/README.md, so these two PRs will conflict on merge regardless of ordering.
Author Handle (doc correctness)
-
rfcs/006-hf-rl-environment-datasets.mdline 7 —**Authors**: @benis not a valid GitHub handle. The RFC template requires[@username]format. The PR author field isburtenshaw; the header should read**Authors**: @burtenshaw.
Missing Template Sub-structure (doc correctness)
-
rfcs/006-hf-rl-environment-datasets.md"Key Design Decisions" section — Therfcs/README.mdtemplate requires each design decision entry to have Chosen Approach, Rationale, and Trade-offs sub-fields. The current section is a flat bullet list. Please expand each bullet into the structured format so reviewers can evaluate trade-offs explicitly.
Inconsistent spec_version in Examples
-
rfcs/006-hf-rl-environment-datasets.md"Optional OpenEnv Runtime Default" example — Theopenenv.yamlsnippet usesspec_version: 1(a plain integer), while allenvironment.yamlexamples usespec_version: hf-rl-env-0.1(a semver-style string). If this snippet shows the existingopenenv.yamlmanifest format rather than the newenvironment.yamlschema, add a clarifying comment. If it is meant to show the new schema, correct the value.
Tier 2: Alignment Discussion
ALIGNMENT FLAG: Dataset row cursor advances on step(), conflicting with the Gymnasium API contract and "one env = one trajectory" principle
- Principle at stake: Gymnasium API signature invariant (
step(action) -> Observationadvances simulation by one timestep, not to a new task), and "One env = one trajectory" (PRINCIPLES.md) - The concern: The RFC proposes that each
env.step(action)automatically advances the dataset row cursor to the next task row. Overloadingstepto also silently switch tasks means a multi-turn episode spanning N actions would consume N dataset rows regardless of episode boundaries. The RFC does not explain how the row cursor interacts with multi-step tasks, nor reconcile cursor advancement withreset()episode boundaries. This needs to be specified explicitly and reconciled with the Gym invariant. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: env.dataset, env.current_row, and env.dataset_index exposed on the client object, creating a path for agents to observe task metadata including answers and verifier data
- Principle at stake: Agent isolation and "Rewards inside environment" (INVARIANTS.md Security Invariants §1, Architectural Invariants §3)
- The concern: The RFC attaches
env.dataset(the full split) andenv.current_row(the raw row, which may containanswer,hidden_tests, or verifier metadata per the canonical row schema) directly to the environment client. The RFC notes authors "must ensure" such data is not leaked, but this is a documentation warning only — there is no framework-level guard. The RFC should specify a framework-level mechanism (e.g. stripping sensitive fields before attaching, or keeping raw row data server-side only) rather than relying solely on environment-author discipline. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: verifiers framework block in the environment.yaml example includes an adapter field containing a Python class import path
- Principle at stake: "Avoid arbitrary remote Python execution by default" (RFC's own Goal #6) and container isolation (INVARIANTS.md Security Invariants §2)
- The concern: The Terminal-Bench 2.0 example shows
adapter: verifiers.v1.packages.tasksets.HarborTasksetin theverifiersframework block. The RFC statesenvironment.yamldoes not execute arbitrary adapter code and that OpenEnv only readsframeworks.openenv. However, it does not explicitly state that OpenEnv ignores all non-openenvframework keys entirely and never imports or evaluates their values. The RFC should explicitly specify that OpenEnv treats all non-openenvframework blocks as opaque, uninterpreted data. - Suggested reviewer: @Darktex
Summary
- 4 mechanical issues to fix (RFC numbering collision with PR #794 is blocking; author handle; template structure; spec_version inconsistency)
- 3 alignment discussion points for human review (Gym API / step-cursor semantics; client-side raw row exposure; verifiers adapter field scope)
The RFC numbering conflict is the hard blocker: this PR cannot merge as-is without a guaranteed merge ordering with PR #794 or an explicit renumbering. Since both PRs are by the same author, please coordinate which takes 006 and which takes 007 before either is approved.
Automated review by Claude Code | Learn more
This PR adds RFC 006 for Hugging Face RL environment datasets, documenting dataset-root environment.yaml declarations, AutoEnv handling for hf://datasets references, and dataset-bound environment behavior. It also updates the RFC index so the proposal is discoverable.\n\nValidation: git diff --check and git diff --cached --check.