- Unbreak building of the grmtools book with the latest mdbook.
- Change test-only code that is incompatible with an upcoming rustc change.
- Lex and Yacc-like inputs can now optionally take a `%grmtools` directive which allows customisation of how grmtools treats the file. This allows users to keep together the necessary grmtools settings, rather than having to remember what is set in, for example, a `build.rs` file. See Lex Extensions and Yacc Extensions in the grmtools book for more details. `nimbleparse` is also able to use `%grmtools` directives.

  Note that setting options via the command-line / build script overrides `%grmtools` directives.
- Many error messages have been improved, ranging from incorrect grammars to reporting the lookahead value with reduce/reduce conflicts.
- lrpar uses bincode directly for generated tables, making the serde dependency optional.
- `parse_map` has been added as a more generic version of `parse_generictree`. The latter is marked as deprecated.
- `CTTokenMapBuilder` has been added as a more flexible alternative to `ct_token_map`. The latter is marked as deprecated.
- lrlex has a new flag `allow_wholeline_comments` which allows `// ...` comments to be added to lex files. This defaults to off, because it is not uncommon for Lex rules themselves to use `//`.
- Support for use under WASM has been improved.
- `LexErrorKind` is no longer `Eq`/`PartialEq`. This allows specific errors with regexes to be reported, instead of the generic error reported previously.
- Many `struct`s and `enum`s have been marked non-exhaustive. This means that external users cannot directly construct instances of such types. In general, these are parts of the API that users would have expected to receive from grmtools, not construct themselves.
- `allow_missing_tokens_in_parser` is now treated as a warning.
- `RegexOptions` in lrlex has been renamed `LexFlags`. This struct was, and remains, mostly `doc(hidden)` but it cannot be fully hidden from users.
- Code generation now uses the quote crate and is formatted using prettyplease. This makes dealing with the generated code more pleasant.
- The signature for `ct_token_map` has been generalized to use the `Borrow` trait instead of a reference.
- The lifetime of the `lrpar_config` callback has been relaxed (previously it was the onerous `'static`).
- Add option for more complete POSIX lex compatible regex escapes. For example, `\b` is the backspace character in POSIX Lex, but a word boundary assertion in the Rust regex crate that lrlex uses. This defaults to `false` for backwards compatibility.
- Respect the timeout in all stages of error recovery. Previously the timeout only applied to the first of (several!) stages of error recovery, which could lead to a comically long time spent in the latter stages.
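
One way to picture this fix is a single deadline that is computed once and then checked before, and passed into, every stage. The following is only a sketch of that idea: `run_stages` and the stage closures are hypothetical names, not lrpar's actual recovery code.

```rust
use std::time::{Duration, Instant};

/// Runs each recovery stage in turn, stopping as soon as the shared
/// deadline has passed. Returns how many stages ran to completion.
/// `stages` is a hypothetical stand-in for the real recovery phases.
fn run_stages(timeout: Duration, stages: &[&dyn Fn(Instant) -> bool]) -> usize {
    // One deadline for the whole of error recovery, not one per stage.
    let deadline = Instant::now() + timeout;
    let mut completed = 0;
    for stage in stages {
        // Do not even start a stage once the deadline has passed.
        if Instant::now() >= deadline {
            break;
        }
        // Each stage receives the deadline so that it can also give up
        // part-way through; it returns false if it ran out of time.
        if !stage(deadline) {
            break;
        }
        completed += 1;
    }
    completed
}
```

With a zero timeout no stage runs at all, and a stage that reports running out of time prevents later stages from starting.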
- Add accessor functions for overly `pub` fields in `lrlex::Rule`. Accessing the fields directly now causes a deprecation warning.
- New `-d` option for `nimbleparse` outputs the stategraph.
- `%parse-param` can now use types that implement `Clone` (i.e. relaxing the previous stringent requirement that types were `Copy`).
- Document start states in the grmtools book.
- Allow `lrlex` and `nimbleparse` to read from stdin if the path is `-`.
- Catch incorrectly terminated productions. Previously this led to a confusing situation where productions could be merged together.
- Give better warnings if lexer/grammar can't be read at build time.
- Allow `%prec` to define a new token in the grammar. Previously if `%prec 'x'` was the first mention of "x" then an error would be raised.
- Tell the user where an incomplete action in a grammar started, not finished (since it always "finishes" at the end of the file).
- Improve error messages for conflicts and the like, giving the span of input related to the error.
- Change generated code to avoid errors about `unsafe` action code in input grammars.
- Hide unstable parts of the API behind an `_unstable_api` feature that is off by default.
- lrlex now explicitly raises an error when a rule in an input file has leading space. There is a small chance of this breaking existing input files, but it brings lrlex into line with POSIX lex where leading space indicates verbatim code (a concept for which lrlex currently has no support), making porting errors less likely.
- Report errors on `%epp` declarations in terms of the input file (rather than pretending they're all at line 1, column 1).
- Reorganise internal testing framework.
- Add CTLexerBuilder options for configuring regex behavior.
- Support `%empty` in productions. This Bison-ism can be used as a signal to readers that a production really is meant to be empty.
- Allow rules to be repeated, with each being treated as a separate production(s). In other words this grammar:

  ```
  A: 'x';
  A: 'y';
  ```

  is now equivalent to:

  ```
  A: 'x' | 'y';
  ```
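
The equivalence can be thought of as grouping productions by rule name. The sketch below illustrates that grouping only; `merge_rules` and its string-based representation are invented for this example and are not how cfgrammar stores grammars.

```rust
use std::collections::BTreeMap;

/// Groups `(rule name, production)` pairs by rule name, so that a rule
/// that is written several times ends up with all of its productions,
/// in the order in which they appeared.
fn merge_rules<'a>(defs: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut rules: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (name, production) in defs {
        rules.entry(*name).or_default().push(*production);
    }
    rules
}
```

Feeding in `A: 'x'` and `A: 'y'` as two separate definitions yields a single rule `A` with two productions, which is what `A: 'x' | 'y'` describes.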
This release contains a number of new features and breaking changes. Most of the breaking changes are in advanced/niche parts of the API, which few users will notice. However, four breaking changes might affect a more substantial subset of users: those are highlighted in the "breaking changes (major)" section below.
- Improved error messages, with various parts of grammar and lexer files now carrying around `Span` information to help pinpoint errors.
- A single error can be related to multiple `Span`s. For example, if you duplicate a rule name in a grammar, all duplicates are reported in a single error.
- The new `cfgrammar::NewlineCache` struct makes it easier to store the minimal information needed to convert byte offsets in an input into logical line numbers.
- lrlex now supports start states.
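
To illustrate what "the minimal information needed to convert byte offsets into line numbers" can look like, here is a small sketch that stores only the byte offset at which each line starts and binary-searches it. `LineIndex` is a hypothetical type written for this example; it is not the `cfgrammar::NewlineCache` API, and the 1-based line numbering is an assumption of the sketch.

```rust
/// A hypothetical line index: the byte offset at which each line starts.
struct LineIndex {
    line_starts: Vec<usize>,
}

impl LineIndex {
    /// Scans the input once, recording where every line begins.
    fn new(input: &str) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in input.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        LineIndex { line_starts }
    }

    /// Returns the 1-based number of the line containing `byte_off`.
    fn line_of(&self, byte_off: usize) -> usize {
        // `partition_point` binary-searches for the first line start that
        // is greater than `byte_off`; the count of starts before it is
        // the line number, so each lookup is O(log n) in the line count.
        self.line_starts.partition_point(|&start| start <= byte_off)
    }
}
```

Only one `usize` per line is kept, and the input itself does not need to be retained in order to answer line-number queries.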
- Unused tokens / rules in a grammar are now detected and, by default, reported as errors. The `%expect-unused` directive can suppress such errors on a per-token/rule basis.
- Start states mean that lrlex now interprets `<` in the regular expression differently than before: to restore the previous behaviour, escape `<` with `\`. For example, the lrlex rule `< "<"` now appears to lrlex as an incomplete start state: replacing it with `\< "<"` fixes the problem.
- grmtools now bundles many of its type parameters into a `LexerTypes` trait, to avoid forcing users to endlessly repeat multiple arguments. If you specified a custom `StorageT` with lrlex/lrpar then you will need to change idioms such as `<DefaultLexeme<StorageT>, StorageT>` to `<DefaultLexerTypes<StorageT>>`. For example, you might need to change:

  ```rust
  use lrlex::{DefaultLexeme, LRNonStreamingLexer};
  ...
  lexer: &LRNonStreamingLexer<DefaultLexeme<StorageT>, StorageT>,
  ```

  to:

  ```rust
  use lrlex::{DefaultLexerTypes, LRNonStreamingLexer};
  ...
  lexer: &LRNonStreamingLexer<DefaultLexerTypes<StorageT>>,
  ```
- Unused tokens / rules in a grammar are now detected and, by default, reported as errors. For example, the common "trick" to turn lexing errors into parsing errors suggests adding the following to a grammar:

  ```
  Unmatched -> (): "UNMATCHED" { } ;
  ```

  Both the `Unmatched` rule and the `UNMATCHED` token will be reported as unused. You can tell grmtools that you expect this to happen with the `%expect-unused` directive:

  ```
  %expect-unused Unmatched "UNMATCHED"
  ```
- `StorageT` is now used to represent parser states (whereas before it was hard-coded to a `u16`). If you used a custom `StorageT`, it may no longer be big enough: if this happens, an error will be reported while the grammar is being built. You will then need to increase the size of your `StorageT` (e.g. you might need to change `StorageT` from `u8` to `u16`).
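
The "not big enough" condition amounts to asking whether the largest ID that must be stored fits in the chosen integer type. A minimal sketch of that check, using the standard `TryFrom` conversions (the helper name `fits_in` is made up here, and this is not the check grmtools itself performs):

```rust
use std::convert::TryFrom;

/// Returns true if `largest_id` (for example, the highest parser state
/// number) can be represented by the storage type `T`.
fn fits_in<T: TryFrom<usize>>(largest_id: usize) -> bool {
    T::try_from(largest_id).is_ok()
}
```

For example, `fits_in::<u8>(256)` is false while `fits_in::<u16>(256)` is true, which is the situation in which moving `StorageT` from `u8` to `u16` helps.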
- Serde support for lrpar now requires enabling the `serde` feature.
- `cfgrammar::yacc::grammar::YaccGrammar::token_span` now returns `Option<Span>` rather than `Option<&Span>`.
- `cfgrammar::yacc::ast::{Production, Symbol}` no longer derive `Eq`, `Hash`, and `PartialEq`. Since both now carry a `Span`, it's easy to confuse "two {productions, symbols} have the same name" with "at the same place in the input file."
- `cfgrammar::yacc::ast::add_rule`'s signature has changed from:

  ```rust
  pub fn add_rule(&mut self, name: String, actiont: Option<String>) {
  ```

  to:

  ```rust
  pub fn add_rule(&mut self, (name, name_span): (String, Span), actiont: Option<String>) {
  ```
- `GrammarValidationError` and `YaccParserError` have been combined into a struct `YaccGrammarError` (which replaces the previous enum of that name). The new `YaccGrammarError` has a private enum, so will mean fewer semver-breaking changes.
- `LexBuildResult` returns on failure `Err(Vec<LexBuildError>)` rather than `Err(LexBuildError)`.
- The `lrlex::lexemes` module has been renamed to `lrlex::defaults` to better describe what it is providing.
- `Span` has moved from `lrpar` to `cfgrammar`. The import is still available via `lrpar` but is deprecated (though due to rust-lang/rust#30827 this unfortunately does not show as a formal deprecation warning).
- `cfgrammar::yacc::grammar::YaccGrammar::rule_name` has been renamed to `rule_name_str`. The old name is still available but is deprecated.
- Move to Rust 2021 edition.
- Various minor clean-ups.
- Explicitly error if the user tries to generate two or more lexers or parsers with the same output file name. Previously the final lexer/parser created "won the race", leading to a confusing situation where seemingly correct code would not compile. Users can explicitly set an output path via `output_path` that allows multiple lexers/parsers to be generated with unique names.
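
Conceptually, this check needs nothing more than a set of the output paths claimed so far. The sketch below shows the idea with a hypothetical `OutputRegistry` type; the names and the error message are invented for illustration and are not taken from grmtools.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Hypothetical registry of the output files claimed so far in one build.
#[derive(Default)]
struct OutputRegistry {
    claimed: HashSet<PathBuf>,
}

impl OutputRegistry {
    /// Records `path`, or returns an error if an earlier lexer/parser
    /// has already claimed the same output file.
    fn claim(&mut self, path: &Path) -> Result<(), String> {
        // `insert` returns false if the path was already present.
        if self.claimed.insert(path.to_path_buf()) {
            Ok(())
        } else {
            Err(format!(
                "output file {} is generated more than once",
                path.display()
            ))
        }
    }
}
```

Failing at the second claim, rather than silently overwriting the first file, is what turns the old "last writer wins" race into an explicit error.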
An overview of the changes in this version:

- `Lexeme` is now a trait not a struct, increasing flexibility, but requiring some changes in user code.
- The build API has slightly changed, requiring some changes in user code.
- `%parse-param` is now supported.
- `lrlex` provides a new API to make it easy to use simple hand-written lexers instead of its default lexer.

`lrpar` now defines a `Lexeme` trait not a `Lexeme` struct: this allows the parser to abstract away from the particular data-layout of a lexeme (allowing, for example, a lexer to attach extra data to a lexeme that can be accessed by parser actions) but does add an extra type parameter `LexemeT` to several interfaces. Conventionally the `LexemeT` type parameter precedes the `StorageT` type parameter in the list of type parameters.

`lrlex` defaults to using its new `DefaultLexeme` struct, which provides a generic lexeme struct similar to that previously provided by `lrlex` (though note that you can use `lrlex` with a lexeme struct of your own choosing).

The precise effects of these changes will depend on how you use grmtools' libraries but in general:

- You will need to change your lexeme imports from:

  ```rust
  use lrpar::Lexeme;
  ```

  to:

  ```rust
  use lrlex::DefaultLexeme;
  use lrpar::Lexeme;
  ```
- Most references to `Lexeme` will need to refer to `DefaultLexeme`.
- Any references to `LRNonStreamingLexer` will need to change from:

  ```rust
  LRNonStreamingLexer<Lexeme<u32>>
  ```

  to:

  ```rust
  LRNonStreamingLexerDef<DefaultLexeme<u32>, u32>
  ```

  where `u32` is the `StorageT` of your choice.

One of the additional benefits to this change is that it allows lrpar and
other lexers (e.g. lrlex) to be clearly separated: lrpar now only defines
traits which lexers have to conform to.
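
The benefit of a trait over a struct can be sketched as follows. The `Lexeme` trait here is a heavily simplified stand-in written for this example (the real lrpar trait has a different, larger set of methods), and `MyLexeme` and `end_offset` are hypothetical.

```rust
/// A simplified, hypothetical lexeme trait: generic code only needs the
/// token ID and the position of the lexeme in the input.
trait Lexeme {
    fn tok_id(&self) -> u32;
    fn start(&self) -> usize;
    fn len(&self) -> usize;
}

/// A user-defined lexeme that carries extra data (here, whether the
/// lexeme was preceded by a newline).
struct MyLexeme {
    tok_id: u32,
    start: usize,
    len: usize,
    after_newline: bool,
}

impl Lexeme for MyLexeme {
    fn tok_id(&self) -> u32 {
        self.tok_id
    }
    fn start(&self) -> usize {
        self.start
    }
    fn len(&self) -> usize {
        self.len
    }
}

/// Written against the trait, so it works with any lexeme layout.
fn end_offset<L: Lexeme>(lexeme: &L) -> usize {
    lexeme.start() + lexeme.len()
}
```

Code that only needs positions works with any implementation, while code that knows the concrete type can still read the extra `after_newline` field.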

Several of the functions / structs surrounding the compile-time construction of grammars have changed: more details are given below, but in most cases a `build.rs` that looks as follows:

```rust
use cfgrammar::yacc::YaccKind;
use lrlex::LexerBuilder;
use lrpar::CTParserBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let lex_rule_ids_map = CTParserBuilder::<u8>::new_with_storaget()
        .yacckind(YaccKind::Grmtools)
        .process_file_in_src("calc.y")?;
    LexerBuilder::new()
        .rule_ids_map(lex_rule_ids_map)
        .process_file_in_src("calc.l")?;
    Ok(())
}
```

can be changed to:

```rust
use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    CTLexerBuilder::new()
        .lrpar_config(|ctp| {
            ctp.yacckind(YaccKind::Grmtools)
                .grammar_in_src_dir("calc.y")
                .unwrap()
        })
        .lexer_in_src_dir("calc.l")?
        .build()?;
    Ok(())
}
```

In more detail:

- `lrlex`'s `LexerBuilder` has been renamed to `CTLexerBuilder` for symmetry with `lrpar`.
- `CTLexerBuilder` now provides the `lrpar_config` convenience function which removes some of the grottiness involved in tying together an `lrlex` lexer and `lrpar` parser. `lrpar_config` is passed a `CTParserBuilder` instance to which normal `lrpar` compile-time options can be applied.
- `CTParserBuilder::process_file_in_src` is deprecated in favour of `CTParserBuilder::grammar_in_src_dir` and `CTParserBuilder::build`. The latter method consumes the `CTParserBuilder` returning a `CTParser` which exposes a `token_map` method whose result can be passed to lexers such as `lrlex`.

  The less commonly used `process_file` function is similarly deprecated in favour of the `grammar_path`, `output_path`, and `build` functions.
- `LexerBuilder::process_file_in_src` is deprecated in favour of `LexerBuilder::lexer_in_src_dir` and `LexerBuilder::build`.

  The less commonly used `process_file` function is similarly deprecated in favour of the `lexer_path`, `output_path`, and `build` functions.
- `CTLexerBuilder`'s and `CTParserBuilder`'s `build` methods both consume the builder, producing a `CTLexer` and `CTParser` respectively, which can be queried for additional information.
- The unstable `CTParserBuilder::conflicts` method has moved to `CTParser`. This interface remains unstable and may change without notice.
- Yacc grammars now support the `%parse-param <var>: <type>` declaration. The variable `<var>` is then visible in all action code. Note that `<type>` must implement the `Copy` trait. The generated `parse` function then takes two parameters `(lexer: &..., <var>: <type>)`.
- `lrlex` now exposes a `ct_token_map` function which creates a module with a parser's token IDs, and allows users to call `LRNonStreamingLexer::new` directly. This makes creating simple hand-written lexers much easier (see the new `calc_manual_lex` example to see this in action).
- Optimise `NonStreamingLexer::span_lines_str` from O(n) to O(log n).
- Deprecate `GrammarAST::add_programs` in favour of `GrammarAST::set_programs`.
- Add support for Yacc's `%expect` and Bison's `%expect-rr` declarations. These allow grammar authors to specify how many shift/reduce and reduce/reduce conflicts they expect in their grammar, and to error if either quantity is different, which is a more fine-grained check than the `error_on_conflicts` boolean.
- Generate code with "correct" camel case names to avoid Clippy warnings in code that uses grmtools' output.
- A number of previously public items (functions, structs, struct attributes) have been made private. Most of these were either not externally visible, or only visible if accessed in undocumented ways, but some were incorrectly public.
- Optimise `NonStreamingLexer::line_col` from O(n) to O(log n) (where n is the number of lines in the file).
- Document more clearly the constraints on what `Span`s one can safely ask a lexer to operate on. Note that it is always safe to use a `Span` generated directly by a lexer: the constraints relate to what happens if a user derives a `Span` themselves.
- Suppress Clippy warnings about `unnecessary_wraps`, including in code generated from grammar files.
- Export the `Visibility` enum from `lrlex`.
- Ensure that lrpar rebuilds a grammar if its `visibility` is changed.
- Fix a handful of Clippy warnings.
- The `MF` and `Panic` recoverers (deprecated, and undocumented, since 0.4.3) have been removed. Please change to `RecoveryKind::CPCTPlus` (or, if you don't want error recovery, `RecoveryKind::None`).
- The stategraph is no longer stored in the generated grammar, leading to useful savings in the generated binary size.
- The modules generated for compile-time parsing by lrlex and lrpar have private visibility by default. Changing this previously required a manual alias. The `visibility` function in lrlex and lrpar's compile-time builders allows a different visibility to be set (e.g. `visibility(Visibility::Public)`). Rust has a number of visibility settings and the `Visibility` enums in lrlex and lrpar reflect this.
- `lrlex` now uses a `LexerDef` trait which all lexer definitions must `impl`. This means that if you want to call methods on a concrete lexer definition, you will almost certainly need to import `lrlex::LexerDef`. This opens the possibility that lrlex can seamlessly produce lexers other than `LRNonStreamingLexerDef`s in the future.
- `lrlex::NonStreamingLexerDef` has been renamed to `lrlex::LRNonStreamingLexerDef`; use of the former is deprecated.
- The `lrlex::build_lex` function has been deprecated in favour of `LRNonStreamingLexerDef::from_str`.
- The statetable and other elements were previously included in the user binary with `include_bytes!`, but this could cause problems with relative path names. We now include the statetable and other elements in generated source code to avoid this issue.
- The `Lexer` trait has been broken into two: `Lexer` and `NonStreamingLexer`. The former trait is now only capable of producing `Lexeme`s: the latter is capable of producing substrings of the input and calculating line/column information. This split allows the flexibility to introduce streaming lexers in the future (which will not be able to produce substrings of the input in the same way as a `NonStreamingLexer`).

  Most users will need to replace references to the `Lexer` trait in their code with `NonStreamingLexer`.
- `NonStreamingLexer` takes a lifetime `'input` which allows the input to last longer than the `NonStreamingLexer` itself. `Lexer::span_str` and `Lexer::span_lines_str` had the following definitions:

  ```rust
  fn span_str(&self, span: Span) -> &str;
  fn span_lines_str(&self, span: Span) -> &str;
  ```

  As part of `NonStreamingLexer` their definitions are now:

  ```rust
  fn span_str(&self, span: Span) -> &'input str;
  fn span_lines_str(&self, span: Span) -> &'input str;
  ```

  This change allows users to throw away the `Lexer` but still keep around structures (e.g. ASTs) which reference the user's input.

  rustc infers the `'input` lifetime in some situations but not others, so if you get an error:

  ```
  error[E0106]: missing lifetime specifier
  ```

  then it is likely that you need to change a type from `NonStreamingLexer` to `NonStreamingLexer<'input>`.
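
The effect of returning `&'input str` rather than a `&str` tied to `&self` can be seen in a small sketch. `Span` and `SimpleLexer` below are simplified, hypothetical types written for this example, not the grmtools ones.

```rust
/// Hypothetical span type: a pair of byte offsets into the input.
#[derive(Clone, Copy)]
struct Span {
    start: usize,
    end: usize,
}

/// Hypothetical lexer that merely borrows the user's input.
struct SimpleLexer<'input> {
    input: &'input str,
}

impl<'input> SimpleLexer<'input> {
    /// The returned slice borrows from the input, not from `self`.
    fn span_str(&self, span: Span) -> &'input str {
        &self.input[span.start..span.end]
    }
}

/// Builds a "tree" (here just a list of names) that outlives the lexer.
fn names<'input>(input: &'input str) -> Vec<&'input str> {
    let lexer = SimpleLexer { input };
    let spans = [Span { start: 0, end: 3 }, Span { start: 4, end: 7 }];
    let out: Vec<&'input str> = spans.iter().map(|s| lexer.span_str(*s)).collect();
    // `lexer` is dropped when this function returns, but the strings in
    // `out` remain valid because they borrow from `input` directly.
    out
}
```

If `span_str` instead returned a `&str` with the lifetime of `&self`, `names` would not compile, because the returned strings would borrow from the local `lexer`.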
- Fix two Clippy warnings and suppress two others.
- Prefer "unmatched" rather than "unknown" when using the "turn lexing errors into parsing errors" trick.
- Deprecate `Lexeme::len`, `Lexeme::start`, and `Lexeme::end`. Each is now replaced by `Lexeme::span().len()` etc. An appropriate warning is generated if the deprecated methods are used.
- Avoid use of the unit return type in action code causing Clippy warnings.
- Document the "turn lexing errors into parsing errors" technique and extend `lrpar/examples/calc_ast` to use it.
- Introduce the concept of a `Span` which records what portion of the user's input something (e.g. a lexeme or production) references. Users can turn a `Span` into a string through the `Lexer::span_str` function. This has several API changes:

  - `lrpar` now exports a `Span` type.
  - `Lexeme`s now have a `fn span(&self) -> Span` function which returns the `Lexeme`'s `Span`.
  - `Lexer::span_str` replaces the `Lexer::lexeme_str` function. Roughly speaking this:

    ```rust
    let s = lexer.lexeme_str(&lexeme);
    ```

    becomes:

    ```rust
    let s = lexer.span_str(lexeme.span());
    ```
  - `Lexer::line_col` now takes a `Span` rather than a `usize` and, since a `Span` can be over multiple lines, returns `((start line, start column), (end line, end column))`.
  - `Lexer::surrounding_line_str` is removed in favour of `span_lines_str` which takes a `Span` and returns a (possibly multi-line) `&str` of the lines containing that `Span`.
  - The `$span` special variable now returns a `Span` rather than `(usize, usize)`.

  In practice, this means that in many cases where you previously had to use `Lexeme<StorageT>`, you can now use `Span` instead. This has two advantages. First, it simplifies your code. Second, it enables better error reporting, as you can now point the user to a span of text, rather than a single point. See the (new) AST evaluator section of the grmtools book for an example of how code using `Span` looks.
- The `$span` special variable is now enabled at all times and no longer needs to be turned on with `CTBuilder::span_var`. This function has thus been removed.
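
As an illustration of why a span supports better error reporting than a single point, the sketch below underlines a whole range of a line. The `Span` type and the `underline` helper are hypothetical and written only for this example; for simplicity it assumes single-line ASCII input, so that byte offsets and columns coincide.

```rust
/// Hypothetical span: `start` (inclusive) and `end` (exclusive) byte offsets.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

impl Span {
    fn new(start: usize, end: usize) -> Self {
        assert!(start <= end);
        Span { start, end }
    }

    fn len(&self) -> usize {
        self.end - self.start
    }
}

/// Renders `line` followed by a second line with `^` markers under the
/// bytes covered by `span`. An empty span still gets one marker.
fn underline(line: &str, span: Span) -> String {
    format!(
        "{}\n{}{}",
        line,
        " ".repeat(span.start),
        "^".repeat(span.len().max(1))
    )
}
```

For instance, `underline("2 + * 3", Span::new(2, 5))` marks all three characters of `+ *`, which a single offset could not express.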
- If called as a binary, lrlex now exits with a return code of 1 if it could not lex input. This matches the behaviour of lrpar.
- Module names in generated code can now be optionally configured with `mod_name`. The names default to the same naming scheme as before.
- Fully qualify more names in generated code.
- `lrlex_mod` and `lrpar_mod` now take strings that match the paths of `process_file_in_src`. In other words what was:

  ```rust
  ...
  .process_file_in_src("a/b/grm.y");
  ...
  lrpar_mod!(grm_y);
  ```

  is now:

  ```rust
  ...
  .process_file_in_src("a/b/grm.y");
  ...
  lrpar_mod!("a/b/grm.y");
  ```

  and similarly for `lrlex_mod`. This is hopefully easier to remember and also allows projects to have multiple grammar files with the same name.
- The `Lexer` API no longer requires mutability. What was:

  ```rust
  trait Lexer {
      fn next(&mut self) -> Option<Result<Lexeme<StorageT>, LexError>>;
      fn all_lexemes(&mut self) -> Result<Vec<Lexeme<StorageT>>, LexError> { ... }
      ...
  }
  ```

  has now been replaced by an iterator over lexemes:

  ```rust
  trait Lexer {
      fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Result<Lexeme<StorageT>, LexError>> + 'a>;
      ...
  }
  ```

  This enables more ergonomic use of the new zero-copy feature, but does require changing structs which implement this trait. `lrlex` has been adjusted appropriately.

  In practice, the only impact that most users will notice is that the following idiom:

  ```rust
  let (res, errs) = grm_y::parse(&mut lexer);
  ```

  will produce a warning that the `mut` part of `&mut` is no longer needed.
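
For anyone implementing the trait themselves, the shape of the change can be sketched with a toy lexer. Everything below is simplified and hypothetical: `Lexeme`, `LexError` and `WordLexer` are stand-ins invented for this example (the real types are generic over `StorageT` and carry more information).

```rust
/// Simplified stand-ins for the real lexeme and error types.
#[derive(Debug, PartialEq)]
struct Lexeme {
    start: usize,
    len: usize,
}

#[derive(Debug, PartialEq)]
struct LexError {
    off: usize,
}

trait Lexer {
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Result<Lexeme, LexError>> + 'a>;
}

/// A toy lexer: each run of non-space characters is a lexeme, and any
/// word containing `!` is reported as an error.
struct WordLexer {
    input: String,
}

impl Lexer for WordLexer {
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Result<Lexeme, LexError>> + 'a> {
        // The lexing position lives in the iterator, not in the lexer,
        // which is why `&self` (rather than `&mut self`) is enough.
        let mut off = 0;
        Box::new(self.input.split(' ').filter_map(move |w| {
            let start = off;
            off += w.len() + 1;
            if w.is_empty() {
                None
            } else if w.contains('!') {
                Some(Err(LexError { off: start }))
            } else {
                Some(Ok(Lexeme { start, len: w.len() }))
            }
        }))
    }
}
```

Because all mutable state is held by the returned iterator, the same lexer can be iterated several times through a shared reference.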
- Add support for zero-copying user input when parsing. A special lifetime `'input` is now available in action code and allows users to extract parts of the input without calling `to_owned()` (or equivalent). For example:

  ```
  Name -> &'input str: 'ID' { $lexer.lexeme_str(&$1.map_err(|_| ())?) } ;
  ```

  See `lrpar/examples/calc_ast/src/calc.y` for a more detailed example.
- Generated code now uses fully qualified names so that name clashes between user action code and that created by grmtools are less likely.
- Action types can now be fully qualified. In other words this:

  ```
  R -> A::B: ... ;
  ```

  means that the rule `R` now has an action type `A::B`.
- Deprecate the MF recoverer: CPCT+ is now the default and MF is now undocumented. For most people, CPCT+ is good enough, and it's quite a bit easier to understand. In the longer term, MF will probably disappear entirely.
- License as dual Apache-2.0/MIT (instead of a more complex, and little understood, triple license of Apache-2.0/MIT/UPL-1.0).
- Action code uses `$` as a way of denoting special variables. For example, the pseudo-variable `$2` is replaced with a "real" Rust variable by grmtools. However, this means that `$2` cannot appear in, say, a string without being replaced. This release uses `$$` as an escaping mechanism, so that one can write code such as `"$$1"` in action code; this is rewritten to `"$1"` by grmtools.
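
The escaping rule can be sketched as a small rewriting pass. This is an illustration of the rule described above, not grmtools' implementation: `rewrite_action` is a made-up function, and the `arg_N` names it emits are an assumption of the sketch (the variable names grmtools actually generates are not specified here).

```rust
/// Rewrites action code: `$$` becomes a literal `$`, and `$N` (N a
/// decimal number) becomes the hypothetical Rust variable `arg_N`.
fn rewrite_action(code: &str) -> String {
    let mut out = String::new();
    let mut chars = code.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '$' {
            out.push(c);
            continue;
        }
        match chars.peek().copied() {
            // `$$` is the escape: emit one `$` and consume the second.
            Some('$') => {
                chars.next();
                out.push('$');
            }
            // `$` followed by digits is a pseudo-variable.
            Some(d) if d.is_ascii_digit() => {
                out.push_str("arg_");
                while let Some(d) = chars.peek().copied().filter(|d| d.is_ascii_digit()) {
                    out.push(d);
                    chars.next();
                }
            }
            // Anything else is left alone in this sketch.
            _ => out.push('$'),
        }
    }
    out
}
```

The important property is that the character after an escaped `$$` is copied through as ordinary text, so `"$$1"` comes out as `"$1"` rather than being treated as a pseudo-variable.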
- Newer versions of rustc produce "deprecated" warnings when trait objects are used without the `dyn` keyword. This previously caused a large number of warnings in generated grammar code from `lrpar`. This release ensures that generated grammar code uses the `dyn` keyword when needed, removing such warnings.
- `Lexeme::empty()` has been renamed to `Lexeme::inserted()`. Although rare, there are grammars with empty lexemes that aren't the result of error recovery (e.g. DEDENT tokens in Python). The previous name was misleading in such cases.
- Lexeme insertion is no longer explicitly encoded in the API for lexemes end/length. Previously these functions returned `None` if a lexeme had been inserted by error recovery. This has proven to be more effort than it's worth with variants on the idiom `lexeme.end().unwrap_or_else(|| lexeme.start())` used extensively. These definitions have thus been simplified, changing from:

  ```rust
  pub fn end(&self) -> Option<usize>
  pub fn len(&self) -> Option<usize>
  ```

  to:

  ```rust
  pub fn end(&self) -> usize
  pub fn len(&self) -> usize
  ```
- A new pseudo-variable `$span` can be enabled within parser actions if `CTBuilder::span_var(true)` is called. This pseudo-variable has the type `(usize, usize)` where these represent (start, end) offsets in the input and allows users to determine how much input a rule has matched.
- Some dynamic assertions about the correct use of types have been converted to static assertions. In the unlikely event that you try to run grmtools on a platform with unexpected type sizes (which, in practice, probably only means 16 bit machines), this will lead to the problems being determined at compile-time rather than run-time.
- Document lrpar more thoroughly, in particular hiding the inner modules, whose location might change one day in the future: all useful structs (etc.) are explicitly exposed at the module level.
- Have the `process_file` functions in both `LexerBuilder` and `CTParserBuilder` place output into a named file (whereas previously `CTParserBuilder` expected a directory name).
- Rename `offset_line_col` to `line_col` and have the latter return character offsets (whereas before it returned byte offsets). This makes the resulting numbers reported to humans much less confusing when multi-byte UTF-8 characters are used.
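
The difference between byte and character offsets shows up as soon as a line contains a multi-byte character. The free function below is a sketch written for this example (it is not the lexer method itself, and the 1-based numbering is an assumption): it takes a byte offset and reports the column counted in characters.

```rust
/// Returns the 1-based (line, column) of `byte_off`, counting the column
/// in characters rather than bytes. `byte_off` must lie on a character
/// boundary, otherwise the slicing below panics.
fn line_col(input: &str, byte_off: usize) -> (usize, usize) {
    let before = &input[..byte_off];
    // Every newline before the offset starts a new line.
    let line = before.matches('\n').count() + 1;
    // The current line starts just after the last newline, if any.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    // Count characters, not bytes, between the line start and the offset.
    let col = before[line_start..].chars().count() + 1;
    (line, col)
}
```

In the input `é = 1`, the `=` sits at byte offset 3 because `é` occupies two bytes in UTF-8, but a human reader sees it as the third character on the line, which is what this function reports.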
- Add `surrounding_line_str` helper function to lexers. This is helpful when printing out error messages to users.
- Add a comment with rule names when generating grammars at compile-time. Thus if user action code contains an error, it's much easier to relate this to the appropriate point in the `.y` file.
- Documentation fixes.
- Previously users had to specify the `YaccKind` of a grammar and then the `ActionKind` of actions. This is unnecessarily fiddly, so remove `ActionKind` entirely and instead flesh out `YaccKind` to deal with the possible variants. For example `ActionKind::CustomAction` is now, in essence, `YaccKind::Original(YaccOriginalActionKind::UserAction)`. This is a breaking change but one that will make future evolution much easier.
- The `%type` directive in grammars exposed by `YaccKind::Original(YaccOriginalActionKind::UserAction)` has been renamed to `%actiontype` to make it clear what type is being referred to. In general, most people will want to move to the `YaccKind::Grmtools` variant (see below), which doesn't require the `%actiontype` directive.
- grmtools has moved to the 2018 edition of Rust and thus needs rustc-1.31 or later to compile.
- Add `YaccKind::Grmtools` variant, allowing grammar rules to have different action types. For most practical use cases, this is much better than using `%actiontype`.
- Add `%avoid_insert` directive to bias ranking of repair sequences and make it more likely that parsing can continue.
- Add `-q` switch to `nimbleparse` to suppress printing out the stategraph and conflicts (some grammars have conflicts by design, so being continually reminded of it isn't helpful).
- Fix problem where errors which lead to vast numbers (as in hundreds of thousands) of repair sequences being found could take minutes to sort and rank.
- Add `YaccKind::Original(YaccOriginalActionKind::NoAction)` variant to generate a parser which simply tells the user where errors were found (i.e. no actions are executed, and not even a parse tree is created).
- `lrlex` no longer tries to create Rust-level identifiers for tokens whose names can't be valid Rust identifiers (which led to compile-time syntax errors in the generated Rust code).
- Fix bug where `%epp` strings containing quote marks caused a code-generation failure in compile-time mode.
First stable release.