sync: catch up clone detection with pyscn (initial catch-up 1/3) #50
Merged
Initial catch-up of the clone detection area against pyscn origin/main (73acc60), following SYNC.md. Ports language-independent algorithm fixes and improvements accumulated since the f0457d7 baseline:

APTED correctness:
- Sort key roots ascending (leaves before parents)
- Fix forest-distance subtree cost to use precomputed td[x+1][y+1]
- Normalize similarity by max(size1, size2) instead of size1+size2
- ComputeDistanceAndSimilarity single-pass API

Large-tree bounded approximation:
- Same-shape bounded distance with sibling realignment (capped cells)
- Label/shape profile distance replaces naive depth/size heuristic

Classification quality:
- Jaccard pre-filter (0.10) on cached fragment features before APTED
- Type-1 gated on exact textual match (textual_similarity.go, JS comments)
- Type-2 gated on normalized-AST Jaccard (syntactic_similarity.go)
- Non-textual similarity capped below Type-1 threshold
- Thresholds recalibrated to 0.85/0.75/0.70/0.65, Type-3 off by default
- Identifier/literal labels excluded from clone features
- MinLines/MinNodes defaults raised to 10/20 (pyscn operational defaults)

Pairing/grouping:
- Overlapping same-file fragments rejected (isOverlappingLocation)
- Strict-subset group members deduped post-grouping (group_dedup)
- Complete-linkage rewritten as exact agglomerative clusterer + heap
- majorityCloneType: skip unknown, deterministic ties, Type-4 fallback
- averageGroupSimilarity counts only known pairs
- Default grouping: connected components at Type-4 threshold
- One domain.Clone per fragment so union-find recognizes shared members

Infrastructure:
- parser.OrderedChildren: every structural child exactly once, stable order (APTED trees now include Test/Left/Right/Source/etc.)
- LSH auto-enable by estimated pair count; signatures reuse cached features; statistics gain total_fragments / nodes_analyzed

Python-specific changes (docstring skipping, boilerplate cost model, multi-dimensional classifier/DFA, Type-4 CFG gating) are recorded as skipped or pending in SYNC.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
golangci-lint flagged computeApproximateDistance as unused; pyscn keeps it exercised via apted_validation_test.go, so port the equivalent nil handling test instead of deleting the function. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Migrate .golangci.yml to the v2 config format (gofmt moves to the formatters section; the v1 enable list matches v2 defaults). Exclude ST1*/QF1* staticcheck checks to keep the v1 lint scope — these were the separate stylecheck/quickfix linters in v1 and were not enabled; this also matches pyscn's v2 config. Bump golangci-lint-action v6 -> v9. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
First catch-up PR (1/3) following the catch-up mode in SYNC.md: it brings the clone detection area in line with pyscn origin/main (73acc60). The cumulative diff since the f0457d7 baseline (v1.4.1) was compared against jscan's current state, and all language-independent changes were ported.

Ported changes
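APTED correctness fixes (apted.go, adopted pyscn's version almost wholesale)
- Key roots are sorted ascending (leaves before parents)
- Forest-distance subtree cost uses fd[lml_x][lml_y] + td[x+1][y+1] (the previous branching plus td overwrite was incorrect)
- Similarity is normalized by max(size1, size2) instead of size1 + size2 (removes systematic overestimation)
- ComputeDistanceAndSimilarity computes the distance once for both values
- Large trees: same-shape bounded distance with sibling realignment (capped cells)
- Label/shape profile distance replaces the naive depth*2 + size*0.5 heuristic

Classification quality (clone_detector.go, new textual_similarity.go / syntactic_similarity.go)
- Jaccard pre-filter (0.10) on cached fragment features before APTED
- Type-1 is gated on exact textual match (textual_similarity.go; JS comments // and /* */)
- Type-2 is gated on normalized-AST Jaccard (syntactic_similarity.go)
- Non-textual similarity is capped below the Type-1 threshold
- Thresholds recalibrated to 0.85/0.75/0.70/0.65 (05601e5; calibrated together with max-based normalization)
- Type-3 is off by default (DefaultEnabledCloneTypes), plus service-level type/similarity filtering
- Identifier/literal labels are excluded from clone features (ast_features.go, JS adaptation of pyscn's Name/Constant exclusion)
- MinLines/MinNodes defaults raised to 10/20 (pyscn operational defaults)

Pairing and grouping (grouping_strategy.go and friends)
- Overlapping same-file fragments are rejected (isOverlappingLocation)
- Strict-subset group members are deduped post-grouping (group_dedup: dedupeStrictSubsetGroupMembers etc.)
- Complete-linkage rewritten as an exact agglomerative clusterer + heap
- majorityCloneType: skip unknown types, deterministic tie-breaking, fall back to Type-4 (never report unknown as Type-1)
- averageGroupSimilarity counts only pairs present in the similarity map (missing pairs are no longer treated as 0)
- Default grouping: connected components at the Type-4 threshold
- One domain.Clone per fragment: the same fragment previously appeared as separate domain.Clone objects with unique IDs, so union-find never recognized shared members and groups never merged. Clones are now shared per fragment (equivalent to pyscn's fragment identity)

Infrastructure (parser/ast.go, domain/clone.go, service/clone_service.go)
- parser.OrderedChildren: visits every structural child exactly once in a stable order (jscan port of pyscn's same-named API). APTED trees now include Test/Left/Right/Source/TypeAnnotation, etc.
- ShouldUseLSHWithPairEstimate: LSH also auto-enables when the estimated pair count exceeds 10,000; LSH signatures reuse cached features
- Statistics gain total_fragments / nodes_analyzed

Skipped changes (recorded under "保留中の変更" ("Pending changes") in SYNC.md)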
- Docstring skipping (SkipDocstrings and friends): Python-specific
- [...] (87babcc, 9efebd4): depend on the unported semantic analysis
- [...] WithMaxCandidates: require the lsh_index.go API change (belongs to the "misc" catch-up area)
- MergeConfig: config-loader layer differs in jscan
- [...] (averageGroupSimilarity semantic change was ported)

Verification
- make test, go vet, and golangci-lint (v1.64.8, matching CI) are all green. Tests were updated for the behavior changes, and pyscn's new tests were ported (large-tree label-distance preservation, same-shape distance matching exact APTED, expression-field conversion)
- Comparison (analyze --json before vs. after):
  - latest-version / junk: still zero clones (with the new APTED, junk would detect one pair, but the new MinNodes=20 default excludes those fragments; consistent with pyscn defaults)
  - got (mid-size): old 6 pairs / 0 groups → new 380 pairs / 67 groups (Type-1 22 / Type-2 245 / Type-4 113). Spot-checking samples: Type-1 hits are literal duplicates and Type-2 hits are genuine identifier-renamed clones (mostly test code) that the old pipeline missed due to the APTED bugs (descending key roots, td overwrite) and the inflated similarity normalization. Runtime went from 14.6s to 53.5s because exact APTED now actually runs (with the Jaccard pre-filter and LSH auto-enable already included)

🤖 Generated with Claude Code