Is reusing rules always slower than redefining them with tokens in ANTLR4?

I am profiling my ANTLR4-generated parser in JavaScript. I have a few rules that match ID | STRING.

Lexer

ID
 : [a-zA-Z_] [a-zA-Z_0-9]*
 ;

STRING
 : '"' (~["\r\n] | '""')* ('"'|[\r\n])?
 ;

Parser

name: ID | STRING ;

rule1: some other rules;
rule2: different rules;

some: ID | STRING ;
different: ID | STRING ;

If I change some to "some: name;" and different to "different: name;", performance drops by about 30%: parsing the same input 100 times goes from 1.5s to about 2s.
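A comparison like this can be reproduced with a small timing harness. The sketch below is only an assumption about how such a measurement might look, not the code behind the numbers above: timeParse and parseOnce are hypothetical names, and parseOnce stands in for whatever call runs the generated lexer and parser on the input.

```javascript
// Times `iterations` runs of a parse callback and reports total and mean milliseconds.
// `parseOnce` is a hypothetical stand-in for the call into the generated parser.
function timeParse(parseOnce, input, iterations) {
  // Use the high-resolution clock when available, otherwise fall back to Date.now().
  const now = (typeof performance !== 'undefined' && typeof performance.now === 'function')
    ? () => performance.now()
    : () => Date.now();

  // Warm-up runs, so JIT compilation is less likely to dominate the measurement.
  for (let i = 0; i < 10; i++) parseOnce(input);

  const start = now();
  for (let i = 0; i < iterations; i++) parseOnce(input);
  const totalMs = now() - start;

  return { totalMs, meanMs: totalMs / iterations };
}

// Placeholder workload; replace it with a real parse call.
const fakeParse = (src) => src.split(/\s+/).length;
const result = timeParse(fakeParse, 'x = B."method. {a, b} 1"(1,2)', 100);
console.log(result.totalMs.toFixed(3) + ' ms total');
```

Running both grammar variants through the same harness, several times each, helps separate a real difference from run-to-run noise.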

In this case, name matches only a single token, so I would not expect the rule itself to add much overhead. There are 8 other places that use ID | STRING; the 30% figure is from after I replaced all of them with name.

The testing code is:

x = B."method. {a, b} 1"(1,2)

In the above code, the following will be matched by "ID | STRING":

  1. x
  2. B
  3. "method. {a, b} 1"
  4. 1
  5. 2

Is my assumption stated in the title correct?

2 thoughts on “Is reusing rules always slower than redefining them with tokens in ANTLR4?”

  1. 30% seems like a LOT (but that might be artificial in a very simple example)

    Using a recursive descent parser, it would make sense that there would be some overhead in calling the name rule rather than recognizing either of two tokens.

    I would think the overall impact would be negligible in a larger context, unless this is a VERY fundamental part of your grammar that is used a LOT.

    If you’re feeling performance pain around it, then "unrolling" it might make some sense. Of course, you’d lose the "name" context in your resulting parse tree. That may be a good or a bad thing depending upon how you want to handle things. (sometimes those extra parse tree nodes are just noise you have that can feel like an irritation, and other times, they are important pieces of information).
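    To make the trade-off concrete, here is a minimal hand-written sketch, not ANTLR's generated code. All function names are hypothetical, and tokens are assumed to be plain objects with a type and text. It shows the two things the delegating form adds in a recursive descent parser: one extra function call and one extra tree node per match.

```javascript
// Toy recursive-descent sketch (NOT ANTLR-generated code; all names are hypothetical).
// Tokens are assumed to be plain objects: { type: 'ID' | 'STRING', text: string }.

function isNameToken(tok) {
  return tok !== undefined && (tok.type === 'ID' || tok.type === 'STRING');
}

// Inlined form, like `some: ID | STRING ;`
// The token becomes a direct child of the `some` node.
function someInlined(tokens, pos) {
  const tok = tokens[pos];
  if (!isNameToken(tok)) throw new Error('expected ID or STRING at ' + pos);
  return { node: { rule: 'some', children: [tok] }, next: pos + 1 };
}

// Shared sub-rule, like `name: ID | STRING ;`
function name(tokens, pos) {
  const tok = tokens[pos];
  if (!isNameToken(tok)) throw new Error('expected ID or STRING at ' + pos);
  return { node: { rule: 'name', children: [tok] }, next: pos + 1 };
}

// Delegating form, like `some: name ;`
// Costs one more call and wraps the token in an extra `name` node.
function someViaName(tokens, pos) {
  const inner = name(tokens, pos);
  return { node: { rule: 'some', children: [inner.node] }, next: inner.next };
}

// Counts rule nodes only; token leaves have no `children` and count as 0.
function countNodes(node) {
  if (!node.children) return 0;
  return 1 + node.children.reduce((n, child) => n + countNodes(child), 0);
}

const tokens = [{ type: 'ID', text: 'x' }];
const flat = someInlined(tokens, 0).node;
const nested = someViaName(tokens, 0).node;
console.log(countNodes(flat), countNodes(nested)); // prints: 1 2
```

    In a real ANTLR parser the per-rule cost is likely higher than in this toy, since each rule invocation also sets up a context object and error-handling scaffolding, but the shape of the trade-off is the same: the unrolled form is cheaper, the shared rule keeps the "name" node in the tree.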
