Main class for sentence boundary detection.

    TokenBoundaryDetector(
        language: str = 'en',
        min_sentence_length: int = 1,
        abbreviations: Set[str] = None,
        aggressive_abbreviations: bool = False,
        merge_short_sentences: bool = False,
        include_rules: List[int] = None,
        exclude_rules: List[int] = None,
        debug: bool = False
    )

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | `str` | `'en'` | Language code for rules |
| `min_sentence_length` | `int` | `1` | Minimum sentence length |
| `abbreviations` | `Set[str]` | `None` | Custom abbreviations |
| `aggressive_abbreviations` | `bool` | `False` | Stricter abbreviation handling |
| `merge_short_sentences` | `bool` | `False` | Merge short sentences |
| `include_rules` | `List[int]` | `None` | Use specific rules |
| `exclude_rules` | `List[int]` | `None` | Exclude specific rules |
| `debug` | `bool` | `False` | Enable debug mode |

    detector.split(text: str, return_spans: bool = False, return_metadata: bool = False)

Parameters:
- text (str): Input text to split
- return_spans (bool): If True, returns (start, end) tuples
- return_metadata (bool): If True, returns dict with confidence scores

Returns:
- List[str]: List of sentences (default)
- List[tuple]: (start, end) character positions if return_spans=True
- List[dict]: Sentences with metadata if return_metadata=True
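
To make the three return shapes concrete, here is a minimal stand-in that mimics the documented flags with a naive regex. It is a sketch, not PySET's implementation: `split_sketch`, the dict keys `text` and `confidence`, the fixed confidence of `1.0`, and the choice to let `return_metadata` take precedence when both flags are set are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple, Union

# Stand-in boundary pattern (NOT PySET's rules): text up to a run of . ! or ?
_SENTENCE = re.compile(r"[^.!?]+[.!?]+")

Result = Union[List[str], List[Tuple[int, int]], List[Dict[str, object]]]


def split_sketch(text: str, return_spans: bool = False,
                 return_metadata: bool = False) -> Result:
    sentences: List[str] = []
    spans: List[Tuple[int, int]] = []
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        sentence = raw.strip()
        # Skip the leading whitespace so the span covers only the sentence.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        sentences.append(sentence)
        spans.append((start, start + len(sentence)))
    if return_metadata:
        # Placeholder confidence; a real detector would compute a score.
        return [{"text": s, "confidence": 1.0} for s in sentences]
    if return_spans:
        return spans
    return sentences


text = "Hello world. How are you? Fine!"
print(split_sketch(text))
print(split_sketch(text, return_spans=True))
print(split_sketch(text, return_metadata=True))
```

Each span indexes back into the original string, so `text[start:end]` recovers the sentence without copying it up front.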

    detector.set_abbreviations(abbreviations: Set[str])

Set custom abbreviations whose trailing period should NOT be treated as a sentence boundary.

    abbrev = detector.get_abbreviations()

Get the current abbreviation set.
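
The effect of an abbreviation set can be shown with a small token-based stand-in. This is a sketch under stated assumptions, not PySET's algorithm: `split_with_abbreviations` is hypothetical, and it assumes abbreviations are stored lowercase without the trailing period (for example `"dr"`), which the README does not specify.

```python
from typing import List, Set


def split_with_abbreviations(text: str, abbreviations: Set[str]) -> List[str]:
    """Split after tokens ending in . ! or ?, unless the token is a known abbreviation."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        ends_with_terminal = token.endswith((".", "!", "?"))
        # Assumed storage format: lowercase, no trailing period.
        is_abbreviation = token.rstrip(".!?").lower() in abbreviations
        if ends_with_terminal and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no terminal punctuation.
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Smith arrived. He was late."
print(split_with_abbreviations(text, set()))
print(split_with_abbreviations(text, {"dr"}))
```

Without the abbreviation the period after "Dr" is taken as a boundary; with it, the first sentence stays intact.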

    explanations = detector.explain(text: str)

Returns a detailed explanation of which rules were applied and why (requires debug=True).

PySET includes 85 rules across 8 categories:
- Standard Terminals - Common sentence endings (. ! ?)
- Ellipsis - ... and variants
- Quotation Marks - Smart quotes handling
- Brackets - Parentheticals, brackets
- Numbers - Abbreviations with periods
- Context - Word-based context rules
- Special Cases - URLs, emails, decimals
- Advanced - Complex legal text

    from pyset.rules import Rule, RuleContext

    class MyCustomRule(Rule):
        priority = 90

        def evaluate(self, context: RuleContext) -> float:
            # Your logic here
            if context.prev_word() == "hello":
                return 1.0  # BOUNDARY
            return 0.0  # NOT_BOUNDARY

    # include_rules takes integer rule IDs (List[int]), so the custom rule
    # instance is passed through custom_rules only.
    detector = TokenBoundaryDetector(
        custom_rules=[MyCustomRule()]
    )

RuleContext provides access to position information:

    context.char()          # Current character
    context.prev_char(n)    # Nth previous character
    context.next_char(n)    # Nth next character
    context.prev_word(n)    # Nth previous word
    context.next_word(n)    # Nth next word
    context.position()      # Current position
    context.text_length()   # Total text length
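
The accessors above can be pictured with a small stand-in class. This is a hypothetical sketch, not PySET's `RuleContext`: it assumes `n` defaults to 1 and counts from 1, that words are runs of `\w` characters, and that out-of-range lookups return an empty string. None of these details are stated in the README.

```python
import re


class SketchContext:
    """Hypothetical stand-in for RuleContext, for illustration only."""

    def __init__(self, text: str, position: int) -> None:
        self._text = text
        self._pos = position

    def char(self) -> str:
        return self._text[self._pos]

    def prev_char(self, n: int = 1) -> str:
        i = self._pos - n
        return self._text[i] if i >= 0 else ""

    def next_char(self, n: int = 1) -> str:
        i = self._pos + n
        return self._text[i] if i < len(self._text) else ""

    def prev_word(self, n: int = 1) -> str:
        # Words that end before the current position, nearest last.
        words = re.findall(r"\w+", self._text[: self._pos])
        return words[-n] if n <= len(words) else ""

    def next_word(self, n: int = 1) -> str:
        # Words that start after the current position, nearest first.
        words = re.findall(r"\w+", self._text[self._pos + 1:])
        return words[n - 1] if n <= len(words) else ""

    def position(self) -> int:
        return self._pos

    def text_length(self) -> int:
        return len(self._text)


def evaluate(context: SketchContext) -> float:
    """Same logic as MyCustomRule.evaluate: boundary only after 'hello'."""
    return 1.0 if context.prev_word() == "hello" else 0.0


text = "Say hello. World waits."
ctx = SketchContext(text, text.index("."))
print(ctx.char(), ctx.prev_word(), ctx.next_word(), evaluate(ctx))
```

Evaluating the same rule at the final period yields 0.0 under these assumptions, because the previous word there is "waits".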