API Reference

TokenBoundaryDetector

Main class for sentence boundary detection.

Constructor

TokenBoundaryDetector(
    language: str = 'en',
    min_sentence_length: int = 1,
    abbreviations: Set[str] = None,
    aggressive_abbreviations: bool = False,
    merge_short_sentences: bool = False,
    include_rules: List[int] = None,
    exclude_rules: List[int] = None,
    debug: bool = False
)

Parameters

Parameter	Type	Default	Description
`language`	str	`'en'`	Language code for rules
`min_sentence_length`	int	`1`	Minimum sentence length
`abbreviations`	Set[str]	`None`	Custom abbreviations
`aggressive_abbreviations`	bool	`False`	Stricter abbreviation handling
`merge_short_sentences`	bool	`False`	Merge short sentences
`include_rules`	List[int]	`None`	Use specific rules
`exclude_rules`	List[int]	`None`	Exclude specific rules
`debug`	bool	`False`	Enable debug mode

Methods

split()

detector.split(text: str, return_spans: bool = False, return_metadata: bool = False)

Parameters:

text (str): Input text to split
return_spans (bool): If True, returns (start, end) tuples
return_metadata (bool): If True, returns dict with confidence scores

Returns:

List[str]: List of sentences (default)
List[tuple]: (start, end) character positions if return_spans=True
List[dict]: Sentences with metadata if return_metadata=True
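
The three return shapes are easiest to compare side by side. The sketch below is a stand-in, not PySET's implementation: `naive_split` is a hypothetical helper that only splits on `.`, `!` and `?` followed by whitespace. The metadata keys (`text`, `start`, `end`, `confidence`) and the precedence when both flags are set are assumptions, since this reference does not specify them.

```python
import re
from typing import List, Union

# Hypothetical stand-in for detector.split(); NOT the PySET algorithm.
_TERMINAL = re.compile(r"[.!?]+(?=\s|$)")


def naive_split(text: str, return_spans: bool = False,
                return_metadata: bool = False) -> List[Union[str, tuple, dict]]:
    spans = []
    start = 0
    for match in _TERMINAL.finditer(text):
        end = match.end()
        spans.append((start, end))
        # Skip the whitespace that separates this sentence from the next one.
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        spans.append((start, len(text)))  # trailing text without a terminal

    if return_spans:
        # Assumption: spans take precedence if both flags are True.
        return spans
    if return_metadata:
        # Assumption: key names and the constant confidence are placeholders.
        return [{"text": text[s:e], "start": s, "end": e, "confidence": 1.0}
                for s, e in spans]
    return [text[s:e] for s, e in spans]


text = "Hello world. How are you? Fine!"
sentences = naive_split(text)
spans = naive_split(text, return_spans=True)
```

In this sketch each span satisfies `text[start:end] == sentence`, which is the usual reason to ask for spans: the original text can be annotated or highlighted without re-searching for each sentence.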

set_abbreviations()

detector.set_abbreviations(abbreviations: Set[str])

Set custom abbreviations that should NOT be treated as sentence boundaries.
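
To see why an abbreviation set matters, consider a period that follows "Dr" or "e.g". The sketch below is a simplified illustration, not the library's logic: `split_with_abbreviations` is a hypothetical helper, and matching abbreviations case-insensitively with the trailing period ignored is an assumption about the expected format.

```python
import re
from typing import List, Set


def split_with_abbreviations(text: str, abbreviations: Set[str]) -> List[str]:
    """Split on '.', '!' or '?' unless the period ends a known abbreviation."""
    # Assumption: entries may be given with or without the trailing period.
    known = {a.lower().rstrip(".") for a in abbreviations}
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        # Word immediately before the terminal, e.g. "Dr" in "Dr."
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        if match.group() == "." and last_word.lower() in known:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Dr. Smith arrived. He was late."
without = split_with_abbreviations(text, set())
with_abbrev = split_with_abbreviations(text, {"Dr"})
```

Without the abbreviation the text breaks after "Dr."; with it, "Dr. Smith arrived." stays in one sentence.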

get_abbreviations()

abbrev = detector.get_abbreviations()

Returns the current abbreviation set.

explain()

explanations = detector.explain(text: str)

Returns a detailed explanation of which rules were applied and why. Requires the detector to be constructed with debug=True.

Rules

Built-in Rules

PySET includes 85 rules across 8 categories:

Standard Terminals - Common sentence endings (. ! ?)
Ellipsis - ... and variants
Quotation Marks - Smart quotes handling
Brackets - Parentheticals, brackets
Numbers - Abbreviations with periods
Context - Word-based context rules
Special Cases - URLs, emails, decimals
Advanced - Complex legal text

Custom Rules

from pyset.rules import Rule, RuleContext

class MyCustomRule(Rule):
    priority = 90

    def evaluate(self, context: RuleContext) -> float:
        # Your logic here
        if context.prev_word() == "hello":
            return 1.0  # BOUNDARY
        return 0.0  # NOT_BOUNDARY

# Pass rule instances via custom_rules. include_rules is documented
# in the constructor as List[int], so the rule class is not passed there.
detector = TokenBoundaryDetector(
    custom_rules=[MyCustomRule()]
)
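
The evaluate() logic of a rule like MyCustomRule can be checked on its own before it is registered with a detector. The sketch below does this without importing the library: `StubContext` and `HelloRule` are hypothetical stand-ins that mirror only the `prev_word()` accessor and the 1.0 / 0.0 return convention shown in the example.

```python
from dataclasses import dataclass


@dataclass
class StubContext:
    """Hypothetical stand-in exposing only the accessor the rule uses."""
    previous_word: str

    def prev_word(self, n: int = 1) -> str:
        return self.previous_word


class HelloRule:
    """Same logic as MyCustomRule, without the pyset.rules.Rule base class."""
    priority = 90

    def evaluate(self, context) -> float:
        if context.prev_word() == "hello":
            return 1.0  # BOUNDARY
        return 0.0  # NOT_BOUNDARY


rule = HelloRule()
score_match = rule.evaluate(StubContext("hello"))
score_other = rule.evaluate(StubContext("world"))
score_capitalized = rule.evaluate(StubContext("Hello"))
```

The comparison is an exact string match, so "Hello" with a capital H scores 0.0 here. A rule meant to ignore case would need to lowercase the word before comparing.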

Context API

RuleContext provides access to position information:

context.char()              # Current character
context.prev_char(n)        # Nth previous character
context.next_char(n)        # Nth next character
context.prev_word(n)        # Nth previous word
context.next_word(n)        # Nth next word
context.position()          # Current position
context.text_length()       # Total text length
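
One way to picture these accessors is as a cursor over the input string. The class below is a hypothetical model, not the library's RuleContext. It makes several assumptions: `n` defaults to 1 and counts from the nearest neighbour, out-of-range lookups return an empty string, and words are runs of `\w` characters.

```python
import re


class SketchContext:
    """Hypothetical model of RuleContext; indexing conventions are assumptions."""

    def __init__(self, text: str, position: int):
        self._text = text
        self._pos = position

    def char(self) -> str:
        return self._text[self._pos]

    def prev_char(self, n: int = 1) -> str:
        i = self._pos - n
        return self._text[i] if i >= 0 else ""

    def next_char(self, n: int = 1) -> str:
        i = self._pos + n
        return self._text[i] if i < len(self._text) else ""

    def prev_word(self, n: int = 1) -> str:
        # Words that end before the current position, nearest one last.
        words = re.findall(r"\w+", self._text[:self._pos])
        return words[-n] if n <= len(words) else ""

    def next_word(self, n: int = 1) -> str:
        # Words that start after the current position, nearest one first.
        words = re.findall(r"\w+", self._text[self._pos + 1:])
        return words[n - 1] if n <= len(words) else ""

    def position(self) -> int:
        return self._pos

    def text_length(self) -> int:
        return len(self._text)


text = "Say hello. Then stop."
ctx = SketchContext(text, text.index("."))  # cursor on the first period
```

With the cursor on the first period, `ctx.prev_word()` is "hello" and `ctx.next_word()` is "Then", which is the situation the MyCustomRule example tests for.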
