API Reference

TokenBoundaryDetector

Main class for sentence boundary detection.

Constructor

TokenBoundaryDetector(
    language: str = 'en',
    min_sentence_length: int = 1,
    abbreviations: Set[str] = None,
    aggressive_abbreviations: bool = False,
    merge_short_sentences: bool = False,
    include_rules: List[int] = None,
    exclude_rules: List[int] = None,
    debug: bool = False
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| language | str | 'en' | Language code for rules |
| min_sentence_length | int | 1 | Minimum sentence length |
| abbreviations | Set[str] | None | Custom abbreviations |
| aggressive_abbreviations | bool | False | Stricter abbreviation handling |
| merge_short_sentences | bool | False | Merge short sentences |
| include_rules | List[int] | None | Use specific rules |
| exclude_rules | List[int] | None | Exclude specific rules |
| debug | bool | False | Enable debug mode |

Methods

split()

detector.split(text: str, return_spans: bool = False, return_metadata: bool = False)

Parameters:

  • text (str): Input text to split
  • return_spans (bool): If True, returns (start, end) tuples
  • return_metadata (bool): If True, returns dict with confidence scores

Returns:

  • List[str]: List of sentences (default)
  • List[tuple]: (start, end) character positions if return_spans=True
  • List[dict]: Sentences with metadata if return_metadata=True
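To illustrate how (start, end) character positions relate to the sentence strings, here is a minimal, self-contained sketch. It does not use PySET: `naive_spans` is a hypothetical stand-in splitter, and the end-exclusive span convention (so that `text[start:end]` yields the sentence) is an assumption rather than documented behavior.

```python
import re
from typing import List, Tuple


def naive_spans(text: str) -> List[Tuple[int, int]]:
    """Hypothetical stand-in: a sentence is a run of text ending in . ! or ?"""
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]+", text):
        start, end = match.span()
        # Skip leading whitespace so the span starts at the first real character.
        while start < end and text[start].isspace():
            start += 1
        if start < end:
            spans.append((start, end))
    return spans


text = "Hello world. How are you? Fine!"
spans = naive_spans(text)

# Assumed convention: spans are end-exclusive, so slicing recovers each sentence.
sentences = [text[start:end] for start, end in spans]

print(spans)      # [(0, 12), (13, 25), (26, 31)]
print(sentences)  # ['Hello world.', 'How are you?', 'Fine!']
```

Spans are useful when the original text must be annotated or highlighted in place, since they preserve offsets that a list of strings discards.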

set_abbreviations()

detector.set_abbreviations(abbreviations: Set[str])

Sets the custom abbreviations whose trailing periods should NOT be treated as sentence boundaries.
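The effect of an abbreviation set can be sketched without PySET. The function below is a hypothetical stand-in, not the library's algorithm; it assumes abbreviations are stored lowercase and without the trailing period, which the reference does not specify.

```python
from typing import List, Set


def split_with_abbreviations(text: str, abbreviations: Set[str]) -> List[str]:
    """Hypothetical stand-in: end a sentence at a token ending in . ! or ?
    unless that token (minus trailing periods) is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?"
        # Assumption: abbreviations are lowercase and have no trailing period.
        is_abbreviation = token.rstrip(".").lower() in abbreviations
        if ends_sentence and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that has no terminal punctuation.
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Smith arrived at noon. He met Prof. Jones."

without_abbreviations = split_with_abbreviations(text, set())
with_abbreviations = split_with_abbreviations(text, {"dr", "prof"})

print(without_abbreviations)
# ['Dr.', 'Smith arrived at noon.', 'He met Prof.', 'Jones.']
print(with_abbreviations)
# ['Dr. Smith arrived at noon.', 'He met Prof. Jones.']
```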

get_abbreviations()

abbrev = detector.get_abbreviations()

Returns the current abbreviation set.

explain()

explanations = detector.explain(text: str)

Returns a detailed explanation of which rules were applied to the text and why. The detector must be constructed with debug=True.

Rules

Built-in Rules

PySET includes 85 rules across 8 categories:

  1. Standard Terminals - Common sentence endings (. ! ?)
  2. Ellipsis - ... and variants
  3. Quotation Marks - Smart quotes handling
  4. Brackets - Parentheticals, brackets
  5. Numbers - Abbreviations with periods
  6. Context - Word-based context rules
  7. Special Cases - URLs, emails, decimals
  8. Advanced - Complex legal text

Custom Rules

from pyset.rules import Rule, RuleContext

class MyCustomRule(Rule):
    priority = 90
    
    def evaluate(self, context: RuleContext) -> float:
        # Your logic here
        if context.prev_word() == "hello":
            return 1.0  # BOUNDARY
        return 0.0  # NOT_BOUNDARY

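# Register an instance of the custom rule with the detector.
# include_rules is documented as List[int], so the rule class is not passed there.
detector = TokenBoundaryDetector(
    custom_rules=[MyCustomRule()]
)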
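To try the rule's logic in isolation, without installing PySET, the sketch below substitutes hypothetical stand-ins for `Rule` and `RuleContext`. The stub base class and `FakeContext` are assumptions made only for illustration; the real classes in `pyset.rules` may differ.

```python
class Rule:
    """Hypothetical stand-in for pyset.rules.Rule."""

    priority = 0

    def evaluate(self, context) -> float:
        raise NotImplementedError


class FakeContext:
    """Hypothetical stand-in exposing only prev_word(), enough for this rule."""

    def __init__(self, previous_word: str):
        self._previous_word = previous_word

    def prev_word(self, n: int = 1) -> str:
        # This stub ignores n and always returns the single stored word.
        return self._previous_word


class MyCustomRule(Rule):
    priority = 90

    def evaluate(self, context) -> float:
        if context.prev_word() == "hello":
            return 1.0  # BOUNDARY
        return 0.0  # NOT_BOUNDARY


rule = MyCustomRule()
score_after_hello = rule.evaluate(FakeContext("hello"))
score_after_other = rule.evaluate(FakeContext("goodbye"))

print(score_after_hello)  # 1.0
print(score_after_other)  # 0.0
```

Evaluating a rule against a hand-built context like this is a convenient way to unit-test custom rules before wiring them into a detector.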

Context API

RuleContext provides access to position information:

context.char()              # Current character
context.prev_char(n)        # Nth previous character
context.next_char(n)        # Nth next character
context.prev_word(n)        # Nth previous word
context.next_word(n)        # Nth next word
context.position()          # Current position
context.text_length()       # Total text length
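One way to picture these accessors is the sketch below, a hypothetical stand-in that mirrors the method names listed above. Its semantics are assumptions: n defaults to 1 and must be at least 1, words are runs of `\w` characters, and an out-of-range lookup returns an empty string. The real RuleContext may behave differently.

```python
import re


class SketchContext:
    """Hypothetical stand-in mirroring the RuleContext accessor names."""

    def __init__(self, text: str, position: int):
        self._text = text
        self._position = position

    def char(self) -> str:
        return self._text[self._position]

    def prev_char(self, n: int = 1) -> str:
        index = self._position - n
        return self._text[index] if index >= 0 else ""

    def next_char(self, n: int = 1) -> str:
        index = self._position + n
        return self._text[index] if index < len(self._text) else ""

    def prev_word(self, n: int = 1) -> str:
        # Words that end before the current position, nearest last.
        words = re.findall(r"\w+", self._text[: self._position])
        return words[-n] if n <= len(words) else ""

    def next_word(self, n: int = 1) -> str:
        # Words that start after the current position, nearest first.
        words = re.findall(r"\w+", self._text[self._position + 1 :])
        return words[n - 1] if n <= len(words) else ""

    def position(self) -> int:
        return self._position

    def text_length(self) -> int:
        return len(self._text)


text = "Say hello. Then leave."
context = SketchContext(text, text.index("."))  # positioned on the first period

print(context.char())         # .
print(context.prev_char())    # o
print(context.next_char(2))   # T
print(context.prev_word())    # hello
print(context.prev_word(2))   # Say
print(context.next_word())    # Then
print(context.position())     # 9
print(context.text_length())  # 22
```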