sync: catch up clone detection with pyscn (initial catch-up 1/3)#50

Merged
DaisukeYoda merged 4 commits into main from sync/catchup-clone on Jun 11, 2026

Conversation

@DaisukeYoda DaisukeYoda commented Jun 11, 2026

Member

Summary

First catch-up PR (1/3) following the catch-up mode in SYNC.md: it brings the clone detection area in line with pyscn origin/main (73acc60). The cumulative diff since the f0457d7 baseline (v1.4.1) was compared against jscan's current state, and all language-independent changes were ported.

Ported changes

APTED correctness fixes (apted.go — adopted pyscn's version almost wholesale)

  • Sort key roots in ascending order (leaves before parents); the previous descending order computed key roots in an invalid sequence
  • Fix the forest-distance subtree cost to fd[lml_x][lml_y] + td[x+1][y+1] (the previous branching plus td overwrite was incorrect)
  • Normalize similarity by max(size1, size2) instead of size1 + size2 (removes systematic overestimation)
  • ComputeDistanceAndSimilarity computes the distance once for both values
  • Large-tree approximation: bounded exact distance for same-shape trees (20,000 alignment-cell cap) and label/shape profile distances, replacing the old depth*2 + size*0.5 heuristic
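
To illustrate the normalization change, here is a minimal sketch. It assumes similarity has the form 1 - distance/normalizer, clamped at 0; the function names `similarityBySum` and `similarityByMax` are hypothetical and are not taken from apted.go.

```go
package main

import "fmt"

// similarityBySum mirrors the old normalization (assumed form):
// the edit distance is divided by size1 + size2.
func similarityBySum(distance float64, size1, size2 int) float64 {
	total := size1 + size2
	if total == 0 {
		return 1.0 // two empty trees are treated as identical
	}
	sim := 1.0 - distance/float64(total)
	if sim < 0 {
		return 0
	}
	return sim
}

// similarityByMax mirrors the new normalization (assumed form):
// the edit distance is divided by max(size1, size2).
func similarityByMax(distance float64, size1, size2 int) float64 {
	maxSize := size1
	if size2 > maxSize {
		maxSize = size2
	}
	if maxSize == 0 {
		return 1.0
	}
	sim := 1.0 - distance/float64(maxSize)
	if sim < 0 {
		return 0
	}
	return sim
}

func main() {
	// Two 10-node trees at edit distance 5 (half of one tree must change).
	fmt.Println(similarityBySum(5, 10, 10)) // 0.75
	fmt.Println(similarityByMax(5, 10, 10)) // 0.5
}
```

For the same distance, the sum-based score is never lower than the max-based one, which is one way to read the "systematic overestimation" noted above and why the thresholds were recalibrated together with this change.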

Classification quality (clone_detector.go, new textual_similarity.go / syntactic_similarity.go)

  • Jaccard pre-filter on cached fragment features (< 0.10 skips APTED entirely)
  • Type-1 gated on exact textual match (normalization: JS/TS comment removal + whitespace folding — pyscn's Python comment handling adapted for // and /* */)
  • Type-2 gated on normalized-AST-hash Jaccard similarity
  • Pairs without a textual match have their similarity capped below the Type-1 threshold
  • Thresholds recalibrated to 0.85 / 0.75 / 0.70 / 0.65 (pyscn 05601e5; calibrated together with max-based normalization)
  • Type-3 disabled by default (DefaultEnabledCloneTypes) plus service-level type/similarity filtering
  • Identifier/literal labels excluded from clone features (ast_features.go, JS adaptation of pyscn's Name/Constant exclusion)
  • MinLines/MinNodes defaults raised to 10/20 (pyscn's operational defaults)
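
The Jaccard pre-filter can be sketched as follows. This is an illustration under assumptions: features are modeled as a plain string set, and `jaccard`, `shouldRunAPTED`, and `prefilterThreshold` are hypothetical names rather than the identifiers used in clone_detector.go. Only the 0.10 cutoff comes from the description above.

```go
package main

import "fmt"

// prefilterThreshold is the cutoff from the PR description:
// pairs scoring below it skip APTED entirely.
const prefilterThreshold = 0.10

// jaccard returns |A ∩ B| / |A ∪ B| for two feature sets.
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1.0 // assumption: two empty sets count as identical
	}
	inter := 0
	for f := range a {
		if _, ok := b[f]; ok {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	return float64(inter) / float64(union)
}

// shouldRunAPTED reports whether a pair is similar enough to justify
// the expensive tree edit distance computation.
func shouldRunAPTED(a, b map[string]struct{}) bool {
	return jaccard(a, b) >= prefilterThreshold
}

// set is a small helper for building feature sets in the example.
func set(items ...string) map[string]struct{} {
	s := make(map[string]struct{}, len(items))
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

func main() {
	a := set("IfStatement", "CallExpression", "ReturnStatement")
	b := set("CallExpression", "ReturnStatement", "ForStatement")
	c := set("ClassDeclaration", "MethodDefinition")

	fmt.Println(jaccard(a, b), shouldRunAPTED(a, b)) // 0.5 true
	fmt.Println(jaccard(a, c), shouldRunAPTED(a, c)) // 0 false
}
```

Because the features are cached per fragment, a check like this costs a set intersection per pair, which is cheap compared with running APTED.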

Pairing and grouping (grouping_strategy.go and friends)

  • Overlapping same-file fragment pairs rejected (isOverlappingLocation)
  • Strict-subset group members removed after grouping (ported group_dedup: dedupeStrictSubsetGroupMembers etc.)
  • Complete linkage: the Bron-Kerbosch implementation was replaced with pyscn's exact agglomerative complete-linkage clustering (clusterer + best-neighbor heap)
  • majorityCloneType: skip unknown types, deterministic tie-breaking, fall back to Type-4 (never report unknown as Type-1)
  • averageGroupSimilarity counts only pairs present in the similarity map (missing pairs are no longer treated as 0)
  • Default grouping: connected components at the Type-4 threshold
  • jscan-specific bug fix: each pair used to create fresh domain.Clone objects with unique IDs, so union-find never recognized shared members and groups never merged. Clones are now shared per fragment (equivalent to pyscn's fragment identity)
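
The union-find bug described in the last item can be reproduced with a small sketch. Everything here is hypothetical (the `unionFind` type, the `pair` struct, and the ID scheme); it only demonstrates why fresh IDs per pair prevent merging, not how jscan's grouping code is written.

```go
package main

import "fmt"

// unionFind is a minimal disjoint-set structure keyed by string IDs.
type unionFind struct{ parent map[string]string }

func newUnionFind() *unionFind { return &unionFind{parent: map[string]string{}} }

func (u *unionFind) find(x string) string {
	if _, ok := u.parent[x]; !ok {
		u.parent[x] = x
	}
	for u.parent[x] != x {
		u.parent[x] = u.parent[u.parent[x]] // path halving
		x = u.parent[x]
	}
	return x
}

func (u *unionFind) union(a, b string) {
	ra, rb := u.find(a), u.find(b)
	if ra != rb {
		u.parent[ra] = rb
	}
}

// groupCount returns the number of distinct sets.
func (u *unionFind) groupCount() int {
	keys := make([]string, 0, len(u.parent))
	for k := range u.parent {
		keys = append(keys, k)
	}
	roots := map[string]struct{}{}
	for _, k := range keys {
		roots[u.find(k)] = struct{}{}
	}
	return len(roots)
}

// pair holds two fragment keys (for example "file.js:10-30").
type pair struct{ left, right string }

// groupsSharedIDs keys union-find by fragment identity (the fixed behavior).
func groupsSharedIDs(pairs []pair) int {
	uf := newUnionFind()
	for _, p := range pairs {
		uf.union(p.left, p.right)
	}
	return uf.groupCount()
}

// groupsFreshIDs gives every pair two brand-new IDs (the old behavior),
// so a fragment shared between pairs is never recognized as the same member.
func groupsFreshIDs(pairs []pair) int {
	uf := newUnionFind()
	next := 0
	for range pairs {
		a := fmt.Sprintf("clone-%d", next)
		b := fmt.Sprintf("clone-%d", next+1)
		next += 2
		uf.union(a, b)
	}
	return uf.groupCount()
}

func main() {
	// Fragment B appears in both pairs, so A, B, and C belong together.
	pairs := []pair{{"A", "B"}, {"B", "C"}}
	fmt.Println(groupsSharedIDs(pairs)) // 1
	fmt.Println(groupsFreshIDs(pairs))  // 2
}
```

With fresh IDs the number of sets always equals the number of pairs, which matches the symptom reported here: pairs were found but groups never merged.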

Infrastructure (parser/ast.go, domain/clone.go, service/clone_service.go)

  • parser.OrderedChildren: visits every structural child exactly once in a stable order (jscan port of pyscn's same-named API). APTED trees now include Test/Left/Right/Source/TypeAnnotation, etc.
  • ShouldUseLSHWithPairEstimate: LSH also auto-enables when the estimated pair count exceeds 10,000; LSH signatures reuse cached features
  • Statistics gain total_fragments / nodes_analyzed
  • LSH auto threshold 200 → 500
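
A minimal sketch of the LSH auto-enable rule, assuming the two conditions are combined with a logical OR and that the fragment threshold is inclusive. The function `shouldUseLSH` and the constant names are hypothetical; the real `ShouldUseLSHWithPairEstimate` signature and the way the pair count is estimated are not shown in this PR, so the estimate is passed in as a parameter.

```go
package main

import "fmt"

const (
	lshFragmentThreshold = 500    // fragment-count threshold (raised from 200)
	lshPairThreshold     = 10_000 // estimated pair count above which LSH also enables
)

// shouldUseLSH reports whether LSH candidate generation should be enabled.
// Assumption: either condition alone is sufficient.
func shouldUseLSH(fragmentCount, estimatedPairs int) bool {
	return fragmentCount >= lshFragmentThreshold || estimatedPairs > lshPairThreshold
}

func main() {
	fmt.Println(shouldUseLSH(120, 4_000))  // false: small input, few pairs
	fmt.Println(shouldUseLSH(300, 25_000)) // true: pair estimate exceeds 10,000
	fmt.Println(shouldUseLSH(600, 2_000))  // true: fragment count reaches 500
}
```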

Skipped changes (recorded under "保留中の変更" ("Pending changes") in SYNC.md)

  • Docstring skipping (SkipDocstrings and friends) — Python-specific
  • Boilerplate cost model (dataclass/Pydantic) — Python framework-specific
  • Multi-dimensional classifier, DFA, and Type-4 CFG gating (87babcc, 9efebd4) — depend on the unported semantic analysis
  • LSH int IDs and WithMaxCandidates — require the lsh_index.go API change (belongs to the "misc" catch-up area)
  • MergeConfig — config-loader layer differs in jscan
  • star_medoid graph optimization — performance-only and structurally different from jscan's implementation (the averageGroupSimilarity semantic change was ported)

Verification

  • make test, go vet, and golangci-lint (v1.64.8, matching CI) are all green. Tests were updated for the behavior changes, and pyscn's new tests were ported (large-tree label-distance preservation, same-shape distance matching exact APTED, expression-field conversion)
  • Golden comparison (analyze --json before vs. after):
    • latest-version / junk: still zero clones (with the new APTED, junk would detect one pair, but the new MinNodes=20 default excludes those fragments — consistent with pyscn defaults)
    • got (mid-size): old 6 pairs / 0 groups → new 380 pairs / 67 groups (Type-1 22 / Type-2 245 / Type-4 113). Spot-checking samples: Type-1 hits are literal duplicates and Type-2 hits are genuine identifier-renamed clones (mostly test code) that the old pipeline missed due to the APTED bugs (descending key roots, td overwrite) and the inflated similarity normalization. Runtime went from 14.6s to 53.5s because exact APTED now actually runs (with the Jaccard pre-filter and LSH auto-enable already included)

🤖 Generated with Claude Code

Daisuke Yoda and others added 4 commits June 11, 2026 15:36
Initial catch-up of the clone detection area against pyscn origin/main
(73acc60), following SYNC.md. Ports language-independent algorithm fixes
and improvements accumulated since the f0457d7 baseline:

APTED correctness:
- Sort key roots ascending (leaves before parents)
- Fix forest-distance subtree cost to use precomputed td[x+1][y+1]
- Normalize similarity by max(size1, size2) instead of size1+size2
- ComputeDistanceAndSimilarity single-pass API

Large-tree bounded approximation:
- Same-shape bounded distance with sibling realignment (capped cells)
- Label/shape profile distance replaces naive depth/size heuristic

Classification quality:
- Jaccard pre-filter (0.10) on cached fragment features before APTED
- Type-1 gated on exact textual match (textual_similarity.go, JS comments)
- Type-2 gated on normalized-AST Jaccard (syntactic_similarity.go)
- Non-textual similarity capped below Type-1 threshold
- Thresholds recalibrated to 0.85/0.75/0.70/0.65, Type-3 off by default
- Identifier/literal labels excluded from clone features
- MinLines/MinNodes defaults raised to 10/20 (pyscn operational defaults)

Pairing/grouping:
- Overlapping same-file fragments rejected (isOverlappingLocation)
- Strict-subset group members deduped post-grouping (group_dedup)
- Complete-linkage rewritten as exact agglomerative clusterer + heap
- majorityCloneType: skip unknown, deterministic ties, Type-4 fallback
- averageGroupSimilarity counts only known pairs
- Default grouping: connected components at Type-4 threshold
- One domain.Clone per fragment so union-find recognizes shared members

Infrastructure:
- parser.OrderedChildren: every structural child exactly once, stable
  order (APTED trees now include Test/Left/Right/Source/etc.)
- LSH auto-enable by estimated pair count; signatures reuse cached
  features; statistics gain total_fragments / nodes_analyzed

Python-specific changes (docstring skipping, boilerplate cost model,
multi-dimensional classifier/DFA, Type-4 CFG gating) are recorded as
skipped or pending in SYNC.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
golangci-lint flagged computeApproximateDistance as unused; pyscn keeps
it exercised via apted_validation_test.go, so port the equivalent nil
handling test instead of deleting the function.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Migrate .golangci.yml to the v2 config format (gofmt moves to the
formatters section; the v1 enable list matches v2 defaults). Exclude
ST1*/QF1* staticcheck checks to keep the v1 lint scope — these were
the separate stylecheck/quickfix linters in v1 and were not enabled;
this also matches pyscn's v2 config. Bump golangci-lint-action v6 -> v9.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@DaisukeYoda DaisukeYoda self-assigned this Jun 11, 2026
@DaisukeYoda DaisukeYoda merged commit a9a4ff7 into main Jun 11, 2026
3 checks passed