feat(memory): add TAU-2 trajectory-view treatment by huangruiteng · Pull Request #2017 · volcengine/OpenViking

huangruiteng · 2026-05-13T09:56:49Z

Status

Follow-up to #2003. This PR is the minimal OpenViking-side step for the TAU-2 fine-grained trajectory-memory line: turn raw trajectory extraction into a reusable procedure-like trajectory view, then expose an explicit TAU-2 trajectory retrieval route for first-user and pre-write evaluation.

The headline read is that the trajectory-view route is directionally useful, especially when memory is injected at the pre-write decision node. The strongest current OV-native read is still split by domain: PR-B is strongest on retail, while category rerank in #2044 is strongest on airline.

Route	Retail	Airline	Domain avg	Delta vs no-memory	Read
No-memory baseline (0515)	0.84688	0.75000	0.79844	-	Current OV-native paired baseline, 16/16 cells complete.
PR-B no-first-user + pre-write/scope	0.85938	0.74375	0.80157	+0.00313	Retail improves by +0.01250; airline is roughly flat/slightly below baseline.
PR-C category exact-pair (#2044)	0.83437	0.81250	0.82344	+0.02500	Airline improves by +0.06250; category rerank is tracked in #2044.
Domain-best read	0.85938	0.81250	0.83594	+0.03750	Retail takes PR-B, airline takes PR-C. This is a design read, not one merged strategy.

This PR intentionally stays small. It does not include category rerank or the full Harness diagnostic stack; those are tracked separately in #2044.

What This Changes

Refines trajectory extraction into a compact procedure-like trajectory view:
- trigger
- preconditions
- procedure
- anti-patterns
- applicability boundary
- result
- evidence
Adds benchmark/tau2/config/trajectory.yaml with explicit trajectory treatments:
- memory_v2_trajectory_view: first-user retrieval from memories/trajectories
- memory_v2_trajectory_prewrite: first-user + pre-write retrieval from memories/trajectories
Adds search_memory_type so TAU-2 configs can select experiences or trajectories without changing the adapter path.
Adds fixed-first-user / reusable-corpus / retrieval-budget / eval-concurrency plumbing so no-memory, trajectory first-user, trajectory pre-write, and scoped trajectory variants can be compared under the same runner.
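The seven-field trajectory view listed above can be pictured as a small record type. The following is a minimal sketch, not this PR's implementation: the class name `TrajectoryView`, the `render` method, and the sample retail values are hypothetical and only illustrate the field layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrajectoryView:
    """Hypothetical record mirroring the seven trajectory-view fields."""

    trigger: str
    preconditions: List[str] = field(default_factory=list)
    procedure: List[str] = field(default_factory=list)
    anti_patterns: List[str] = field(default_factory=list)
    applicability_boundary: str = ""
    result: str = ""
    evidence: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the view as compact text; empty fields are skipped."""
        lines = [f"Trigger: {self.trigger}"]
        if self.preconditions:
            lines.append("Preconditions: " + "; ".join(self.preconditions))
        if self.procedure:
            lines.append("Procedure:")
            lines += [f"  {i}. {step}" for i, step in enumerate(self.procedure, 1)]
        if self.anti_patterns:
            lines.append("Anti-patterns: " + "; ".join(self.anti_patterns))
        if self.applicability_boundary:
            lines.append(f"Applies when: {self.applicability_boundary}")
        if self.result:
            lines.append(f"Result: {self.result}")
        if self.evidence:
            lines.append("Evidence: " + "; ".join(self.evidence))
        return "\n".join(lines)


# Invented sample values, for illustration only.
view = TrajectoryView(
    trigger="Customer asks to exchange a delivered item",
    preconditions=["Order status is delivered"],
    procedure=[
        "Confirm the order and item",
        "Get explicit user confirmation",
        "Call the exchange tool once",
    ],
    anti_patterns=["Calling the write tool before the user confirms"],
    applicability_boundary="Retail exchanges only",
    result="Exchange submitted",
    evidence=["Tool call succeeded after confirmation"],
)
print(view.render())
```

Keeping the procedure as an ordered list and the anti-patterns as a separate field is one way to keep the injected text short while still separating "what to do" from "what to avoid".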

Evidence Details

The latest read uses reasoning-high, fixed-first-user, retail + airline held-out test, and 8 repeats per domain. Domain avg is the simple mean of retail and airline; task-weighted totals remain available in the run scoreboards.
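The domain average and the delta column follow directly from the per-domain scores. A minimal sketch of that arithmetic, using the no-memory baseline and the main-branch Memory V2 row from the Core Routes table; the function names are illustrative and not part of the benchmark scripts.

```python
def domain_avg(retail: float, airline: float) -> float:
    """Simple (unweighted) mean of the two domain scores."""
    return (retail + airline) / 2


def delta_vs_baseline(route_avg: float, baseline_avg: float) -> float:
    """Signed difference between a route's domain average and the baseline's."""
    return route_avg - baseline_avg


baseline = domain_avg(0.84688, 0.75000)  # no-memory baseline (0515)
main_v2 = domain_avg(0.75000, 0.77500)   # main-branch Memory V2 first-user

print(f"baseline avg: {baseline:.5f}")
print(f"main V2 avg:  {main_v2:.5f}")
print(f"delta:        {delta_vs_baseline(main_v2, baseline):+.5f}")
```

Rows whose mean lands on a rounding boundary at five decimals (for example 0.801565 for PR-B no-first-user + pre-write/scope) may differ in the last digit depending on the unrounded per-domain inputs.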

Core Routes

Route	Retail	Airline	Domain avg	Delta vs no-memory	Read
No-memory baseline (0515)	0.84688	0.75000	0.79844	-	Current OV-native paired baseline, 16/16 cells complete.
Main-branch Memory V2 first-user	0.75000	0.77500	0.76250	-0.03594	Fresh Memory V2 reference route from the validation doc.
PR-B trajectory first-user	0.82182	0.76875	0.79528	-0.00316	Direct trajectory-view first-user route.
PR-B no-first-user + pre-write/scope	0.85938	0.74375	0.80157	+0.00313	Retail is the clearest PR-B signal (+0.01250); airline is roughly flat/slightly below baseline.
PR-C category exact-pair	0.83437	0.81250	0.82344	+0.02500	Category rerank signal, mainly from airline (+0.06250); separate PR.
Domain-best read	0.85938	0.81250	0.83594	+0.03750	Retail takes PR-B pre-write/scope, airline takes PR-C category. This is a design read, not one merged strategy.
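The domain-best row is a per-domain maximum over the routes above, not the result of a single run. A minimal sketch of that selection using the scores from this table; the dictionary layout and the helper name `domain_best` are illustrative assumptions.

```python
from typing import Dict, Tuple

# Per-domain scores copied from the Core Routes table.
ROUTES: Dict[str, Dict[str, float]] = {
    "No-memory baseline (0515)": {"retail": 0.84688, "airline": 0.75000},
    "Main-branch Memory V2 first-user": {"retail": 0.75000, "airline": 0.77500},
    "PR-B trajectory first-user": {"retail": 0.82182, "airline": 0.76875},
    "PR-B no-first-user + pre-write/scope": {"retail": 0.85938, "airline": 0.74375},
    "PR-C category exact-pair": {"retail": 0.83437, "airline": 0.81250},
}


def domain_best(routes: Dict[str, Dict[str, float]]) -> Dict[str, Tuple[str, float]]:
    """For each domain, return the (route name, score) with the highest score."""
    domains = ("retail", "airline")
    return {
        d: max(((name, scores[d]) for name, scores in routes.items()), key=lambda p: p[1])
        for d in domains
    }


best = domain_best(ROUTES)
best_avg = (best["retail"][1] + best["airline"][1]) / 2
base = ROUTES["No-memory baseline (0515)"]
base_avg = (base["retail"] + base["airline"]) / 2

for domain, (name, score) in best.items():
    print(f"{domain}: {name} ({score:.5f})")
print(f"domain-best avg: {best_avg:.5f}, delta: {best_avg - base_avg:+.5f}")
```

Because the maximum is taken after the fact on the same held-out runs, this number is an optimistic upper read for a combined strategy, which is why the table labels it a design read.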

Ablations

Ablation	Retail	Airline	Task-weighted total	Read
Old trajectory prompt -> `memories/experiences` top2	0.83125	0.66875	0.77708	16/16 valid cells. Does not beat no-memory; especially weak on airline.
PR-B trajectory-view corpus -> `memories/experiences` top2	0.87188	0.71250	0.81875	Prompt/corpus quality helps retail strongly, but airline still blocks a global claim.
PR-B trajectory pre-write + scope	0.84062	0.74375	0.80833	Complete route-level run; runnable and useful, but not a standalone uplift claim.
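The ablation table reports task-weighted totals rather than the simple domain mean. The three rows are numerically consistent with retail carrying twice the weight of airline. That 2:1 ratio is inferred from the reported figures and is not stated in this PR; the actual per-domain task counts are in the run scoreboards. A minimal sketch under that assumption, with an illustrative function name:

```python
def task_weighted_total(
    retail: float, airline: float, retail_tasks: int, airline_tasks: int
) -> float:
    """Weight each domain score by its number of tasks."""
    total_tasks = retail_tasks + airline_tasks
    return (retail * retail_tasks + airline * airline_tasks) / total_tasks


# (retail, airline) scores copied from the Ablations table.
ABLATIONS = {
    "old prompt -> experiences top2": (0.83125, 0.66875),
    "trajectory-view corpus -> experiences top2": (0.87188, 0.71250),
    "trajectory pre-write + scope": (0.84062, 0.74375),
}

# ASSUMPTION: a 2:1 retail:airline task ratio, inferred from the reported totals.
RETAIL_WEIGHT, AIRLINE_WEIGHT = 2, 1

totals = {
    name: task_weighted_total(r, a, RETAIL_WEIGHT, AIRLINE_WEIGHT)
    for name, (r, a) in ABLATIONS.items()
}
for name, value in totals.items():
    print(f"{name}: {value:.5f}")
```

This also explains why the ablation totals are not directly comparable to the Domain avg column in the Core Routes table, which weights both domains equally.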

Interpretation:

PR-B is a solid foundation PR: the trajectory view is cleaner than the old prompt route, the TAU-2 route is runnable, and pre-write/scope exposes the right decision node for later strategy work.
PR-B alone should not be presented as the final uplift claim. Its strongest standalone full8 (8 repeats per domain) signal is retail-local; the best global read currently requires category selection or a domain-specific combination.
The next step is to converge PR-B's trajectory-view route and PR-C's category selection into one OV-native treatment.

Validation

git diff --check
uvx ruff format --check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py
uvx ruff check benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py
python3 -m py_compile benchmark/tau2/scripts/run_eval.py benchmark/tau2/scripts/run_memory_v2_eval.py benchmark/tau2/scripts/tau2_common.py openviking/session/memory/agent_trajectory_context_provider.py

Earlier route smokes also confirmed:

memory_v2_trajectory_view: non-empty trajectory corpus probe and first-user trace injected from memories/trajectories.
memory_v2_trajectory_prewrite: non-empty trajectory corpus probe and both first-user / before_write_tool_call traces injected from memories/trajectories.
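The two treatments differ only in where retrieved memory is injected, while `search_memory_type` selects which memory directory is searched. A minimal sketch of that relationship follows. It is not the adapter's actual code: the `Treatment` class, the `first_user` node spelling, and the validation logic are assumptions, while the treatment names, `before_write_tool_call`, and the `memories/...` directories come from this PR.

```python
from dataclasses import dataclass
from typing import Tuple

# The PR text says configs can select experiences or trajectories.
ALLOWED_MEMORY_TYPES = ("experiences", "trajectories")


@dataclass(frozen=True)
class Treatment:
    """Hypothetical description of one TAU-2 memory treatment."""

    name: str
    search_memory_type: str
    inject_at: Tuple[str, ...]

    def __post_init__(self) -> None:
        # Reject unknown memory types early instead of searching an empty path.
        if self.search_memory_type not in ALLOWED_MEMORY_TYPES:
            raise ValueError(
                f"search_memory_type must be one of {ALLOWED_MEMORY_TYPES}, "
                f"got {self.search_memory_type!r}"
            )

    @property
    def search_root(self) -> str:
        """Directory the retrieval step searches."""
        return f"memories/{self.search_memory_type}"

    def injects_at(self, node: str) -> bool:
        """True if this treatment injects memory at the given decision node."""
        return node in self.inject_at


TREATMENTS = {
    t.name: t
    for t in (
        Treatment("memory_v2_trajectory_view", "trajectories", ("first_user",)),
        Treatment(
            "memory_v2_trajectory_prewrite",
            "trajectories",
            ("first_user", "before_write_tool_call"),
        ),
    )
}

prewrite = TREATMENTS["memory_v2_trajectory_prewrite"]
print(prewrite.search_root, prewrite.inject_at)
```

Modeling the injection points as data rather than separate code paths mirrors the stated goal of switching memory type without changing the adapter path.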

Thanks @yangxinxin-7 for the TAU-2 benchmark scaffold in #2003; this PR builds on that workflow rather than adding a separate eval path.

github-actions · 2026-05-13T09:58:00Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅
2003 - Fully compliant
Compliant requirements:
- Add TAU-2 trajectory config
- Add search_memory_type parameter and validation
- Refine trajectory extraction instruction and template
- Update prompts to use neutral wording
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-05-13T09:59:42Z

PR Code Suggestions ✨

No code suggestions found for the PR.

feat(benchmark): add TAU-2 trajectory memory treatment

d1caa77

github-project-automation Bot added this to OpenViking project May 13, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 13, 2026

huangruiteng added 5 commits May 13, 2026 18:31

style(benchmark): format tau2 trajectory scripts

a68d5e7

Merge remote-tracking branch 'origin/main' into feat/tau2-trajectory-…

0700781

…memory

refine trajectory memory view prompt

fddd7ba

feat(benchmark): prepare tau2 memory corpora before eval

0496000

fix(benchmark): tighten trajectory evidence prompt

08d33a9

huangruiteng force-pushed the feat/tau2-trajectory-memory branch from 9cfe362 to c2228a2 May 13, 2026 16:12

fix(benchmark): guard tau2 infrastructure failures

2b767f2

huangruiteng force-pushed the feat/tau2-trajectory-memory branch from c2228a2 to 2b767f2 May 13, 2026 17:03

huangruiteng added 7 commits May 14, 2026 01:54

fix(benchmark): resolve tau2 runner paths

63a1004

fix(memory): add trajectory evidence examples

fb62c46

fix(benchmark): run no-memory tau2 eval in process

e30b79b

bench(tau2): align retrieval budget and fixed first user

662cf0b

bench(tau2): reuse memory corpora across eval runs

f85d60b

bench(tau2): add scoped trajectory eval concurrency

d833980

style(benchmark): format tau2 eval runner

74c18db

huangruiteng mentioned this pull request May 14, 2026

bench(tau2): add category-aware memory rerank treatment #2044

Closed

style(benchmark): satisfy tau2 eval lint

8cf7737

MaojiaSheng approved these changes May 15, 2026

huangruiteng mentioned this pull request May 15, 2026

bench(tau2): add category-aware trajectory memory rerank #2079

Draft

4 tasks

huangruiteng wants to merge 15 commits into volcengine:main from huangruiteng:feat/tau2-trajectory-memory
