feat(memory): add TAU-2 trajectory-view treatment #2017

Open: huangruiteng wants to merge 15 commits into volcengine:main from huangruiteng:feat/tau2-trajectory-memory

@huangruiteng (Contributor) commented May 13, 2026

Status

Follow-up to #2003. This PR is the minimal OpenViking-side step for the TAU-2 fine-grained trajectory-memory line: turn raw trajectory extraction into a reusable procedure-like trajectory view, then expose an explicit TAU-2 trajectory retrieval route for first-user and pre-write evaluation.

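The headline read is that the trajectory-view route is directionally useful, especially when memory is injected at the pre-write decision node: the pre-write/scope route lands slightly above the no-memory baseline on domain average (+0.00313), while the first-user-only route lands slightly below it (-0.00316). The strongest current OV-native read is still split by domain: PR-B is strongest on retail, while the category rerank in #2044 is strongest on airline.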

| Route | Retail | Airline | Domain avg | Delta vs no-memory | Read |
|---|---|---|---|---|---|
| No-memory baseline (0515) | 0.84688 | 0.75000 | 0.79844 | - | Current OV-native paired baseline, 16/16 cells complete. |
| PR-B no-first-user + pre-write/scope | 0.85938 | 0.74375 | 0.80157 | +0.00313 | Retail improves by +0.01250; airline is roughly flat/slightly below baseline. |
| PR-C category exact-pair (#2044) | 0.83437 | 0.81250 | 0.82344 | +0.02500 | Airline improves by +0.06250; category rerank is tracked in #2044. |
| Domain-best read | 0.85938 | 0.81250 | 0.83594 | +0.03750 | Retail takes PR-B, airline takes PR-C. This is a design read, not one merged strategy. |

This PR intentionally stays small. It does not include category rerank or the full Harness diagnostic stack; those are tracked separately in #2044.

What This Changes

  • Refines trajectory extraction into a compact procedure-like trajectory view:
    • trigger
    • preconditions
    • procedure
    • anti-patterns
    • applicability boundary
    • result
    • evidence
  • Adds benchmark/tau2/config/trajectory.yaml with explicit trajectory treatments:
    • memory_v2_trajectory_view: first-user retrieval from memories/trajectories
    • memory_v2_trajectory_prewrite: first-user + pre-write retrieval from memories/trajectories
  • Adds search_memory_type so TAU-2 configs can select experiences or trajectories without changing the adapter path.
  • Adds fixed-first-user / reusable-corpus / retrieval-budget / eval-concurrency plumbing so no-memory, trajectory first-user, trajectory pre-write, and scoped trajectory variants can be compared under the same runner.
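As a rough illustration of the two pieces above, the sketch below models the seven trajectory-view sections as a dataclass and shows one way a `search_memory_type` value could be validated and mapped to a corpus root. The class name `TrajectoryView`, the field types, and the helper `resolve_search_root` are hypothetical and are not taken from the PR diff; only the section names, the `experiences` / `trajectories` values, and the `memories/...` roots come from the description.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema: the PR lists these seven sections, but the concrete
# types and class name here are assumptions made for illustration only.
@dataclass
class TrajectoryView:
    trigger: str
    preconditions: List[str] = field(default_factory=list)
    procedure: List[str] = field(default_factory=list)
    anti_patterns: List[str] = field(default_factory=list)
    applicability_boundary: str = ""
    result: str = ""
    evidence: List[str] = field(default_factory=list)


# The two memory types named in the PR description.
ALLOWED_MEMORY_TYPES = ("experiences", "trajectories")


def resolve_search_root(search_memory_type: str) -> str:
    """Validate the configured memory type and return its corpus root.

    Hypothetical helper: the real adapter may validate and resolve
    the value differently.
    """
    if search_memory_type not in ALLOWED_MEMORY_TYPES:
        raise ValueError(
            f"search_memory_type must be one of {ALLOWED_MEMORY_TYPES}, "
            f"got {search_memory_type!r}"
        )
    return f"memories/{search_memory_type}"


if __name__ == "__main__":
    view = TrajectoryView(
        trigger="user asks to modify an existing order",
        procedure=["look up order", "confirm change with user", "apply change"],
    )
    print(view.trigger, "->", resolve_search_root("trajectories"))
```

Rejecting unknown values up front means a typo in a treatment config fails loudly rather than silently retrieving from an empty corpus.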

Evidence Details

The latest read uses reasoning-high, fixed-first-user, retail + airline held-out test, and 8 repeats per domain. Domain avg is the simple mean of retail and airline; task-weighted totals remain available in the run scoreboards.
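The arithmetic behind the "Domain avg", "Delta vs no-memory", and "Domain-best read" columns can be reproduced with a short sketch. The per-domain scores are copied from the route table in this PR; the dictionary keys and function names are hypothetical labels chosen for illustration. Values recomputed from the raw per-domain scores may differ from the table in the fifth decimal place, since the table appears to round the averages before taking deltas.

```python
from typing import Dict, Tuple

# (retail, airline) scores copied from the route table in this PR.
# The dictionary keys are illustrative labels, not identifiers from the repo.
ROUTES: Dict[str, Tuple[float, float]] = {
    "no_memory": (0.84688, 0.75000),
    "pr_b_prewrite_scope": (0.85938, 0.74375),
    "pr_c_category": (0.83437, 0.81250),
}


def domain_avg(retail: float, airline: float) -> float:
    """Simple (unweighted) mean of the two domain scores."""
    return (retail + airline) / 2.0


def delta_vs_baseline(route: str, baseline: str = "no_memory") -> float:
    """Domain-average difference between a route and the baseline."""
    return domain_avg(*ROUTES[route]) - domain_avg(*ROUTES[baseline])


def domain_best(routes: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Pick the best retail and best airline score independently.

    This mirrors the "design read" row: it is not a single runnable route.
    """
    best_retail = max(retail for retail, _ in routes.values())
    best_airline = max(airline for _, airline in routes.values())
    return best_retail, best_airline


if __name__ == "__main__":
    for name in ROUTES:
        print(f"{name}: avg={domain_avg(*ROUTES[name]):.5f} "
              f"delta={delta_vs_baseline(name):+.5f}")
    print(f"domain-best avg={domain_avg(*domain_best(ROUTES)):.5f}")
```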

Core Routes

| Route | Retail | Airline | Domain avg | Delta vs no-memory | Read |
|---|---|---|---|---|---|
| No-memory baseline (0515) | 0.84688 | 0.75000 | 0.79844 | - | Current OV-native paired baseline, 16/16 cells complete. |
| Main-branch Memory V2 first-user | 0.75000 | 0.77500 | 0.76250 | -0.03594 | Fresh Memory V2 reference route from the validation doc. |
| PR-B trajectory first-user | 0.82182 | 0.76875 | 0.79528 | -0.00316 | Direct trajectory-view first-user route. |
| PR-B no-first-user + pre-write/scope | 0.85938 | 0.74375 | 0.80157 | +0.00313 | Retail is the clearest PR-B signal (+0.01250); airline is roughly flat/slightly below baseline. |
| PR-C category exact-pair | 0.83437 | 0.81250 | 0.82344 | +0.02500 | Category rerank signal, mainly from airline (+0.06250); separate PR. |
| Domain-best read | 0.85938 | 0.81250 | 0.83594 | +0.03750 | Retail takes PR-B pre-write/scope, airline takes PR-C category. This is a design read, not one merged strategy. |

Ablations

| Ablation | Retail | Airline | Task-weighted total | Read |
|---|---|---|---|---|
| Old trajectory prompt -> memories/experiences top2 | 0.83125 | 0.66875 | 0.77708 | 16/16 valid cells. Does not beat no-memory; especially weak on airline. |
| PR-B trajectory-view corpus -> memories/experiences top2 | 0.87188 | 0.71250 | 0.81875 | Prompt/corpus quality helps retail strongly, but airline still blocks a global claim. |
| PR-B trajectory pre-write + scope | 0.84062 | 0.74375 | 0.80833 | Complete route-level run; runnable and useful, but not a standalone uplift claim. |
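The task-weighted totals in the ablation table differ from a simple domain mean. The sketch below shows how such a total could be computed. The 2:1 retail-to-airline weighting is an assumption inferred from the published numbers (it reproduces all three totals to within rounding), not something the PR states; the actual task counts live in the run scoreboards. The function name is hypothetical.

```python
def task_weighted_total(
    retail: float,
    airline: float,
    retail_tasks: int,
    airline_tasks: int,
) -> float:
    """Weight each domain score by its number of tasks."""
    total_tasks = retail_tasks + airline_tasks
    return (retail * retail_tasks + airline * airline_tasks) / total_tasks


if __name__ == "__main__":
    # ASSUMPTION: retail carries twice the task weight of airline.
    # This ratio is inferred from the table, not stated in the PR.
    rows = {
        "old prompt -> experiences top2": (0.83125, 0.66875),
        "trajectory-view corpus -> experiences top2": (0.87188, 0.71250),
        "trajectory pre-write + scope": (0.84062, 0.74375),
    }
    for name, (retail, airline) in rows.items():
        print(f"{name}: {task_weighted_total(retail, airline, 2, 1):.5f}")
```

Because retail dominates under this weighting, a retail gain lifts the task-weighted total more than the same gain on airline would, which is worth keeping in mind when comparing it with the simple domain average used in the core route table.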

Interpretation:

  • PR-B is a solid foundation PR: the trajectory view is cleaner than the old prompt route, the TAU-2 route is runnable, and pre-write/scope exposes the right decision node for later strategy work.
  • PR-B alone should not be presented as the final uplift claim. Its strongest standalone signal in the full 8-repeat run is limited to retail; the best global read currently requires category selection or a domain-specific combination.
  • The next step is to converge PR-B's trajectory-view route and PR-C's category selection into one OV-native treatment.

Validation

  • git diff --check
  • uvx ruff format --check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py
  • uvx ruff check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py
  • python3 -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py

Earlier route smoke tests also confirmed:

  • memory_v2_trajectory_view: non-empty trajectory corpus probe and first-user trace injected from memories/trajectories.
  • memory_v2_trajectory_prewrite: non-empty trajectory corpus probe and both first-user / before_write_tool_call traces injected from memories/trajectories.

Thanks @yangxinxin-7 for the TAU-2 benchmark scaffold in #2003; this PR builds on that workflow rather than adding a separate eval path.

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

#2003 - Fully compliant

Compliant requirements:

  • Add TAU-2 trajectory config
  • Add search_memory_type parameter and validation
  • Refine trajectory extraction instruction and template
  • Update prompts to use neutral wording
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@huangruiteng force-pushed the feat/tau2-trajectory-memory branch from 9cfe362 to c2228a2 on May 13, 2026 16:12
@huangruiteng force-pushed the feat/tau2-trajectory-memory branch from c2228a2 to 2b767f2 on May 13, 2026 17:03