Major API overhaul, unified schema, and ecosystem cleanup #60

Merged
pdiakumis merged 147 commits into main from dev
Jun 22, 2026

pdiakumis (Collaborator) commented May 16, 2026

Summary

Complete overhaul of the Config, Tool, and Workflow APIs, built around a
unified schema.yaml format that replaces the two-file raw.yaml + tidy.yaml
split. This PR also adds a new assertion and utility layer, an expanded CLI,
a CI/CD migration to reusable workflows, and substantially improved test
coverage.

Breaking changes

1. Unified schema.yaml replaces raw.yaml + tidy.yaml

This is the largest structural change. Each tool previously required two
separate YAML files in inst/config/tools/<tool>/:

  • raw.yaml - file patterns, ftypes, raw column names, and per-version schemas
  • tidy.yaml - tidy column names, types, descriptions, also per-version

These are replaced by a single schema.yaml with a flat tables: map. Each
column entry now carries its raw name, tidy name, type, description, and
a versions list in one place:

# Before (two separate files)
# raw.yaml
raw:
  table1:
    pattern: '\.tool1\.table1\.tsv$'
    ftype: 'tsv'
    schema:
      v1.2.3:
        - field: 'SampleID'
          type: 'char'
      latest:
        - field: 'SampleID'
          type: 'char'
        - field: 'metricY'
          type: 'float'

# tidy.yaml
tidy:
  table1:
    schema:
      latest:
        tbl1:
          - field: 'sample_id'
            type: 'char'

# After (single schema.yaml)
tables:
  table1:
    description: 'Table1 for tool1.'
    pattern: '\.tool1\.table1\.tsv$'
    ftype: 'txt'
    columns:
      - raw: 'SampleID'
        tidy: 'sample_id'
        type: 'char'
        description: 'sample ID'
        versions: ['v1.2.3', 'latest']
      - raw: 'metricY'
        tidy: 'metric_y'
        type: 'float'
        description: 'metric Y'
        versions: ['latest']

Child packages (tidywigits, tidydragen) will need their per-tool configs
migrated to this format.
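
As a rough illustration of what the migration involves, the per-version field lists from an old raw.yaml can be collapsed into one entry per column that carries a versions vector. The following is a minimal base-R sketch and not nemo code: collapse_versions(), raw_schema, and tidy_names are hypothetical names, and the inputs are written out as R lists rather than read from YAML.

```r
# Hypothetical per-version raw schema, mirroring the old raw.yaml layout.
raw_schema <- list(
  "v1.2.3" = list(list(field = "SampleID", type = "char")),
  "latest" = list(
    list(field = "SampleID", type = "char"),
    list(field = "metricY", type = "float")
  )
)

# Hypothetical raw -> tidy name lookup (previously spread across tidy.yaml).
tidy_names <- c(SampleID = "sample_id", metricY = "metric_y")

# Collapse per-version field lists into one entry per column,
# accumulating every version in which the column appears.
collapse_versions <- function(raw_schema, tidy_names) {
  columns <- list()
  for (version in names(raw_schema)) {
    for (col in raw_schema[[version]]) {
      key <- col$field
      if (is.null(columns[[key]])) {
        columns[[key]] <- list(
          raw = key,
          tidy = unname(tidy_names[[key]]),
          type = col$type,
          versions = character(0)
        )
      }
      columns[[key]]$versions <- c(columns[[key]]$versions, version)
    }
  }
  unname(columns)
}

columns <- collapse_versions(raw_schema, tidy_names)
str(columns)
```

With these inputs, the SampleID entry ends up with both versions and the metricY entry with latest only, which matches the shape of the columns: list in the schema.yaml example above. Descriptions would still need to be filled in by hand.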

2. Config API renames and encapsulation

All schema/config accessor methods were renamed for consistency. The config
field (the raw parsed list) and the raw_schemas_all and tidy_schemas_all
fields are removed from the public interface; schemas are now computed
internally and served through methods.

| Before | After |
|---|---|
| conf$get_raw_patterns() | conf$get_patterns() |
| conf$get_raw_versions() | (removed; derived from column versions:) |
| conf$get_raw_descriptions() | conf$get_descriptions() |
| conf$get_raw_schemas_all() | conf$get_schemas_raw() |
| conf$get_tidy_schemas_all() | conf$get_schemas_tidy() |
| conf$get_raw_schema(tbl, v = ...) | conf$get_schema_raw(tbl, version = ...) |
| conf$get_tidy_schema(tbl, v = ...) | conf$get_schema_tidy(tbl, version = ...) |
| conf$are_raw_schemas_valid() | (removed; validation now in initialize()) |
| conf$config (field) | (private) |
| conf$raw_schemas_all (field) | (private) |
| conf$tidy_schemas_all (field) | (private) |

New: conf$get_col_map(tbl), conf$get_pattern(tbl), conf$get_ftype(tbl),
conf$get_ftypes(), conf$get_description(tbl), conf$get_tables().

Config also gains pkg as a public field (was only a constructor argument
before), and is now cloneable = FALSE.
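
The idea that versions and per-version schemas are now derived from each column's versions list can be pictured with a small base-R sketch. This is only an illustration under assumptions, not the Config implementation: available_versions() and schema_for_version() are hypothetical helpers, and columns is a hand-written stand-in for one table's parsed columns: list.

```r
# Hand-written stand-in for the parsed `columns:` entries of one table.
columns <- list(
  list(raw = "SampleID", tidy = "sample_id", type = "char",
       versions = c("v1.2.3", "latest")),
  list(raw = "metricY", tidy = "metric_y", type = "float",
       versions = "latest")
)

# All versions mentioned by any column (hypothetical helper).
available_versions <- function(columns) {
  unique(unlist(lapply(columns, function(col) col$versions)))
}

# Columns that belong to a given version, as a data frame (hypothetical helper).
schema_for_version <- function(columns, version) {
  keep <- Filter(function(col) version %in% col$versions, columns)
  data.frame(
    raw = vapply(keep, function(col) col$raw, character(1)),
    tidy = vapply(keep, function(col) col$tidy, character(1)),
    type = vapply(keep, function(col) col$type, character(1)),
    stringsAsFactors = FALSE
  )
}

available_versions(columns)
schema_for_version(columns, "v1.2.3")
schema_for_version(columns, "latest")
```

Keeping the version membership on the column means there is a single source of truth per column, at the cost of reconstructing each version's schema on demand.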

3. Tool API renames and encapsulation

| Before | After |
|---|---|
| tool$files (field) | tool$list_files() (accessor) |
| tool$tbls (field) | tool$get_tbls() (accessor) |
| tool$raw_schemas_all (field) | tool$config$get_schemas_raw() |
| tool$tidy_schemas_all (field) | tool$config$get_schemas_tidy() |
| tool$get_raw_schema (delegate field) | tool$config$get_schema_raw() |
| tool$get_tidy_schema (delegate field) | tool$config$get_schema_tidy() |
| tool$nemofy(diro, ...) | tool$run(output_dir, ...) |
| diro parameter | output_dir |
| out_dir parameter | output_dir |
| input_pfix column | input_prefix column |
| pfix_include parameter | prefix_include parameter |
| group column in list_files() | prefix_suffix |
| enframe_data() | nemo_enframe() |

files, tbls, and files_tbl are moved to private. Tool is now
cloneable = FALSE.

4. Workflow API renames and encapsulation

| Before | After |
|---|---|
| wf$tools (field) | wf$get_tools() (accessor) |
| wf$files_tbl (field) | (private) |
| wf$nemofy(diro, ...) | wf$run(output_dir, ...) |
| wf$list_files(type = ...) | wf$list_files() (type arg removed) |
| wf$get_metadata(..., pkgs = c("nemo")) | pkgs defaults to NULL; resolved from self$metapkg |

Workflow now validates path existence at construction time, gains a metapkg
argument (defaults to "nemo") for metadata version reporting, and is
cloneable = FALSE.

print() now shows files_total, files_matched, tidied, and written
(formatted as a knitr table, consistent with Tool and Config).

filter_files() now validates include/exclude values against known
tool_parser names (e.g. "tool1_table1") and correctly dispatches per-tool
when include/exclude doesn't match any parser in a given tool.
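
The validation step described here can be sketched in base R as a set difference against the known parser names. This is a hedged illustration only: check_parsers() and known_parsers are hypothetical and do not reproduce the actual helpers in R/assert.R.

```r
# Hypothetical set of known tool_parser names.
known_parsers <- c("tool1_table1", "tool1_table2")

# Fail early if any requested parser name is not known (hypothetical helper).
check_parsers <- function(x, known) {
  if (is.null(x)) {
    return(invisible(NULL))
  }
  unknown <- setdiff(x, known)
  if (length(unknown) > 0) {
    stop("Unknown parser(s): ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  invisible(x)
}

# A known name passes silently.
check_parsers("tool1_table1", known_parsers)

# An unknown name raises an error; capture the message for inspection.
res <- tryCatch(
  check_parsers("tool1_tableX", known_parsers),
  error = function(e) conditionMessage(e)
)
print(res)
```

Validating up front means a typo in include/exclude surfaces as an error rather than as a silently empty result.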

New methods: get_schemas_raw() and get_schemas_tidy() aggregate raw/tidy
schemas across all tools, adding a tool column for identification.

5. CLI renames

| Before | After |
|---|---|
| --out_dir | --output_dir |
| --pfix_include | --prefix_include |
| input_pfix output column | input_prefix |

--prefix_include is now opt-in (no longer adds input_prefix to output by
default).

6. nemo_metadata() signature change

input_dir parameter renamed to input_dirs (always a character vector).
Return value changed from a named list() (with jsonlite::unbox wrappers) to
a single-row tibble with list-columns (input_dirs, pkg_versions,
files).

7. RPostgres dropped as hard dependency

RPostgres is removed from DESCRIPTION. The DB writer now accepts a
caller-supplied dbdrv argument (any DBI-compatible driver), so callers bring
their own driver.


New features

New R modules

| File | Purpose |
|---|---|
| R/assert.R | nemo_stop(), nemo_assert_scalar_chr(), nemo_assert_chr(), nemo_assert_not_null(), nemo_assert_out_fmt(), and internal helpers assert_files_tbl(), assert_include_exclude(), check_unknown_parsers() |
| R/config_prep.R | Schema scaffolding helpers: config_prep_raw_schema(), config_prep_raw(), config_prep_multi(), config_prep_write(); bootstrap a schema.yaml from example raw files |
| R/gha.R | nemo_gha_mermaid(): generates a Mermaid flowchart of the full CI/CD pipeline by combining local and remote GHA YAML |
| R/uml.R | nemo_uml(): generates a PlantUML SVG from R6 class names using the R6toPlant package |
| R/schema_vis.R | nemo_schema_reactable(), nemo_schemavis_data(), reactable_schema(): interactive reactable schema explorer (was inst/scripts/vignettes/schemas.R) |

File type changes (ftype)

The schema.yaml format consolidates and renames ftype values. The base Tool
class dispatches on ftype in parse_by_ftype(); the full ftype set is now:

| ftype | Format | Status |
|---|---|---|
| txt | Tab-delimited, with header | Renamed from tsv |
| txt-keyvalue | No header, 2-col key=value, pivoted wide | Renamed from txt-nohead |
| txt-nohead | No header, positional columns named X1..XN | New (different semantics from old txt-nohead) |
| csv | Comma-delimited, with header | New |
| csv-nohead-long | No header, long format, pivoted wide | New; no default parser in Tool, requires subclass parse_{table}() override |

Child packages using ftype: 'tsv' in their old raw.yaml must change to ftype: 'txt' in
schema.yaml. Old ftype: 'txt-nohead' (key-value) must become ftype: 'txt-keyvalue'.
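
The two renames above can be expressed as a small lookup when migrating many configs. The sketch below is an assumption-laden convenience, not part of nemo: migrate_ftype() is a hypothetical helper, and it treats every old txt-nohead as key-value, so any table that really needs the new positional txt-nohead semantics must be reviewed by hand.

```r
# Map old ftype values to their new names; unknown values pass through unchanged.
# Hypothetical helper. Assumes every old 'txt-nohead' was a key-value file.
migrate_ftype <- function(old) {
  renames <- c("tsv" = "txt", "txt-nohead" = "txt-keyvalue")
  ifelse(old %in% names(renames), unname(renames[old]), old)
}

migrate_ftype(c("tsv", "txt-nohead", "csv"))
```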

New schema example tables

inst/config/tools/tool1/schema.yaml now covers 6 tables that together
exercise every supported ftype, providing reference data for parsing tests:

  • table1: txt (3 versions: v1.2.3, v4.5.6, latest)
  • table2: txt (2 versions: v1.0.0, latest)
  • table3: txt-keyvalue
  • table4: txt-nohead
  • table5: csv-nohead-long
  • table6: csv

Corresponding example data added under inst/extdata/tool1/.

metadata.parquet written per run

Both Tool$write() and Workflow$write() now accept a write_metadata = TRUE
boolean. When true, a metadata.parquet file is written to the output directory
alongside the tidy tables. The metadata tibble records input_id, output_id,
input_dirs, output_dir, pkg_versions, and a file manifest.

Tool$list_files() ~20x speedup

Switched from running a regex search through fs::dir_info for each lookup to
map + grepl over a pre-built flat file tibble. The files_tbl is computed once
at construction and reused across all lookups.
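
The general shape of this approach, listing files once and then matching each table pattern against the cached listing, can be sketched in base R. This is an illustration under assumptions rather than the Tool$list_files() implementation: the file paths and the table2 pattern are made up, and a plain character vector stands in for the files tibble.

```r
# Stand-in for a file listing computed once (e.g. at construction time).
files <- c(
  "out/sampleA.tool1.table1.tsv",
  "out/sampleA.tool1.table2.tsv",
  "out/readme.md"
)

# Named patterns, one per table (table2 pattern is hypothetical).
patterns <- c(
  table1 = "\\.tool1\\.table1\\.tsv$",
  table2 = "\\.tool1\\.table2\\.tsv$"
)

# Match every pattern against the cached listing; no further disk access needed.
matches <- lapply(patterns, function(p) files[grepl(p, basename(files))])
str(matches)
```

Because the directory is scanned only once, the cost of each additional pattern is a vectorised grepl over an in-memory vector.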

Config scaffold helpers (config_prep_*)

A new family of functions for bootstrapping a schema.yaml from example files:

path <- system.file("extdata/tool1/latest/sampleA.tool1.table1.tsv", package = "nemo")
config_prep_raw_schema(path, delim = "\t")
# returns a tibble: raw, tidy, type, description, versions

config_prep_multi(x)  # x is a tibble with name/descr/pat/type/path cols
config_prep_write(config, out = "schema.yaml")

--output_id / --ulid CLI flags

The tidy subcommand gains:

  • --output_id VALUE — tags output files with a fixed run identifier
  • --ulid — auto-generates a ULID as the output identifier (mutually exclusive
    with --output_id)

--max on list subcommand

cli_nemo_list() now accepts a max parameter to cap the number of rows shown.


Infrastructure and CI/CD

deploy.yaml refactored to reusable workflows

The 136-line monolithic deploy.yaml is replaced by a 42-line orchestrator that
calls reusable workflows from tidywf/actions:

jobs:
  version:
    uses: tidywf/actions/.github/workflows/version.yaml@main
  condarise:
    uses: tidywf/actions/.github/workflows/condarise.yaml@main
  tag:
    uses: tidywf/actions/.github/workflows/tag.yaml@main
  pkgdownise:
    uses: tidywf/actions/.github/workflows/pkgdownise.yaml@main

The workflow now also triggers on dev pushes (not just main).

claude.yml — restrict @claude to repo members

The Claude GHA now gates all @claude triggers on
author_association IN ["OWNER", "COLLABORATOR", "MEMBER"] to prevent abuse
from external commenters.

dependabot.yml added

Automated dependency update PRs enabled for GitHub Actions.

bump.yaml — use tidywf/actions repo

Points to tidywf/actions/.github/workflows/bump.yaml@main instead of the old
tidywf/.github monorepo reference.

conda: aarch64 lock file added

deploy/conda/env/lock/conda-linux-aarch64.lock is now tracked alongside the
existing linux-64 lock file.

deploy/conda/env/yaml/bump.yaml added

A dedicated conda env for the bumpversion workflow.


Testing

Manual test files for all R6 classes

New standalone test-<ClassName>.R files with proper test_that blocks:

| File | Coverage |
|---|---|
| tests/testthat/test-Config.R | Construction, get_* methods, error paths |
| tests/testthat/test-Tool.R | Full lifecycle: construct, filter, tidy, write, run |
| tests/testthat/test-Tool1.R | All 6 tables, parse/tidy correctness, version detection |
| tests/testthat/test-Workflow.R | Construction, filter_files dispatch, tidy, write, run |
| tests/testthat/test-Workflow1.R | Smoke test for the example workflow |

Roxytest files expanded

New auto-generated test files from @testexamples blocks:

  • test-roxytest-testexamples-assert.R
  • test-roxytest-testexamples-cli_list.R
  • test-roxytest-testexamples-cli_tidy.R
  • test-roxytest-testexamples-config_prep.R
  • test-roxytest-testexamples-gha.R
  • test-roxytest-testexamples-schema_vis.R

Removed (R6 classes now tested manually):

  • test-roxytest-testexamples-Tool1.R
  • test-roxytest-testexamples-Workflow.R
  • test-roxytest-testexamples-Workflow1.R

Documentation and vignettes

New vignettes

| File | Content |
|---|---|
| vignettes/cicd.qmd | CI/CD pipeline diagram (Mermaid flowchart via nemo_gha_mermaid()) |
| vignettes/new-tool.qmd | How to author a new Tool subclass |
| vignettes/schema_table.qmd | Interactive schema.yaml explorer via nemo_schema_reactable() |

Removed vignettes

  • vignettes/contribute.qmd (replaced by new-tool.qmd)

inst/doc-templates/ reorganised

The inst/documentation/ directory is renamed to inst/doc-templates/.
Parameterised installation template fragments (conda, docker, pixi, R) are added
for child packages to include/reuse.

pkgdown

  • _pkgdown.yml updated with new vignettes and function reference groupings
  • pkgdown/extra.scss added for custom styling
  • LLM-docs disabled in pkgdown config

CLAUDE.md

.claude/CLAUDE.md added with full nemo repo documentation for in-session
context (repo layout, reference implementations, testing conventions, CLI docs,
logging, key API table, dev commands).


Dependency changes (DESCRIPTION)

| Change | Detail |
|---|---|
| Removed assertthat | Replaced by assert.R wrappers using rlang |
| Removed jsonlite | No longer needed (metadata is now a tibble) |
| Removed RPostgres | Now caller-supplied; dropped hard dep |
| Removed quarto VignetteBuilder | Switched to knitr |
| Added stringr (Imports) | String utilities |
| Added knitr (Imports) | Table rendering in print() methods |
| Added here (Suggests) | Dev convenience |
| Added htmltools, reactable (Suggests) | Schema visualisation |
| Added R6toPlant (Suggests, GitLab remote) | UML diagram generation |
| Added withr (Suggests) | Test environment management |

File inventory

146 files changed, 6032 insertions(+), 2902 deletions(-)

Key additions:

  • R/assert.R (98 lines) — new assertion layer
  • R/config_prep.R (181 lines) — schema scaffolding helpers
  • R/gha.R (153 lines) — GHA Mermaid diagram generator
  • R/uml.R (70 lines) — PlantUML integration
  • R/schema_vis.R (promoted from inst/scripts/)
  • inst/config/tools/tool1/schema.yaml (171 lines) — unified schema
  • tests/testthat/test-Tool.R (233 lines)
  • tests/testthat/test-Tool1.R (138 lines)
  • tests/testthat/test-Config.R (38 lines)
  • tests/testthat/test-Workflow.R (89 lines)
  • deploy/conda/env/lock/conda-linux-aarch64.lock (214 lines)

Key deletions:

  • inst/config/tools/tool1/raw.yaml — replaced by schema.yaml
  • inst/config/tools/tool1/tidy.yaml — replaced by schema.yaml
  • inst/scripts/file_to_yaml.R — superseded by config_prep_* helpers
  • inst/scripts/uml.R — superseded by R/uml.R
  • man/valid_out_fmt.Rd — function renamed to nemo_assert_out_fmt
  • vignettes/contribute.qmd — replaced by new-tool.qmd

Checklist for child packages after merge

  • Migrate all per-tool raw.yaml + tidy.yaml to unified schema.yaml
  • In migrated schema.yaml: rename ftype: 'tsv' → ftype: 'txt'; rename ftype: 'txt-nohead' (if key-value) → ftype: 'txt-keyvalue'
  • Update all $nemofy(diro = ...) calls → $run(output_dir = ...)
  • Update $tools field access → $get_tools()
  • Update $files field access → $list_files()
  • Update $tbls field access → $get_tbls()
  • Update conf$get_raw_* / conf$get_tidy_* method calls to new names
  • Update --out_dir / --pfix_include CLI flags if wrapping nemo.R
  • Update nemo_metadata(..., input_dir = ...) → nemo_metadata(..., input_dirs = ...)
  • Supply dbdrv explicitly when using the DB writer (no longer defaults to
    RPostgres)

pdiakumis and others added 30 commits April 22, 2026 00:12
* linkml: add tool1 schema

* linkml: add schema utils

* linkml: add schema vignette + schema_to_mermaid.R

* linkml: reorder tool1 schema

* add schema_versions for mermaid diagrams

* pkgdown fixes

* gha: refactor deploy workflow for dev and main branches

* pkgdown: use auto development mode

* Bump version: 0.0.3 => 0.0.3.9000

* r-ulid: grab from umccr conda channel

* makefile: add bump rule

* makefile: add bump rule

* Bump version: 0.0.3.9000 => 0.0.3.9001

* rattler-build upload anaconda: use channel, not label

* Bump version: 0.0.3.9001 => 0.0.3.9002

* gha conda: drop umccr prefix to find dev label

* Bump version: 0.0.3.9002 => 0.0.3.9003

* gha: use ssh-key for bot committing to protected branch

* Bump version: 0.0.3.9003 => 0.0.3.9004

* gha conda pkgdown: drop umccr prefix to find dev label

* Bump version: 0.0.3.9004 => 0.0.3.9005

* gha conda pkgdown: specify dev label

* Bump version: 0.0.3.9005 => 0.0.3.9006

* [bot] Updating conda-lock files (v0.0.3.9006)

* precommit: add air formatter

* add CLAUDE.md

* claude: add new nemotool skill

* "Claude PR Assistant workflow"

* "Claude Code Review workflow"

* Change GitHub + Anaconda orgs (#39)

* change gh org

* change anaconda org

* change anaconda org

* GitHub Actions: use GitHub app for branch protection override (#40)

* gha: use gh app for branch protection override

* gha: use app email

* gha: use same wf for dev + main (#41)

* GitHub Actions: use reusable workflows for conda + pkgdown (#42)

* gha: fix permissions (#43)

* Add GHA-based version bumping workflow (#44)

* Bump version: 0.0.3.9006 => 0.0.3.9007

* [bot] Updating conda-lock files (v0.0.3.9007)

* precommit update

* remove LinkML schema system (to be redesigned in separate PR)

* gha: remove auto claude code review workflow

* gha: restrict claude workflow to repo owners/collaborators/members

---------

Co-authored-by: GitHub Actions <actions@github.com>
Co-authored-by: tidywf-ci-bot[bot] <3171681+tidywf-ci-bot[bot]@users.noreply.github.com>
@pdiakumis pdiakumis changed the title Refactor into unified schema, refactor Tool, CLI + CI/CD, new vis Major API overhaul, unified schema, and ecosystem cleanup Jun 22, 2026
@pdiakumis pdiakumis merged commit a4d5519 into main Jun 22, 2026
4 checks passed
@pdiakumis pdiakumis deleted the dev branch June 22, 2026 12:17

Development

Successfully merging this pull request may close these issues.

Make input_id and output_id optional

2 participants