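This article gives a brief introduction to antlr4, covers ambiguity in antlr4 grammars and how to resolve it, and describes the errors antlr4 may report along with ways to locate and fix them.

A brief introduction

ANTLR (ANother Tool for Language Recognition) is an open-source parser generator. ANTLR4 parsers use a new technique called adaptive LL(*), or ALL(*) (pronounced "all star"), which extends the LL(*) of ANTLR3.

The LL(*) grammars of earlier ANTLR versions did not support left recursion. This is a limitation of all LL parsers: a left-recursive call consumes no token, so an LL parser easily recurses until the stack overflows. ANTLR4 solves the problem for direct left recursion, but it still cannot handle indirect left recursion.¹

antlr4 is written in Java, so first make sure a Java environment is installed correctly. Download antlr-4.7.1-complete.jar from the official site or from GitHub, then set the environment variables as follows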

# ANTLR
ANTLRPATH=/home/jona/software/antlr4/antlr-4.7.1-complete.jar
export CLASSPATH=.:$ANTLRPATH:$CLASSPATH
alias antlr4='java -Xmx1000M -cp "/home/jona/software/antlr4/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool'
alias grun="java org.antlr.v4.gui.TestRig"

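With this in place the antlr4 tool is ready to use. antlr4's IDE is called antlrworks2; writing grammar rules in a graphical tool is more efficient.

Although antlr4 is written in Java, it can generate code for other target languages: cpp, c sharp, go, java, php, python and swift. They can be found in the antlr4/runtime directory of the source tree. antlr4 accepts context-free grammars and generates the corresponding parsing code from the grammar rules; developers then write their own logic on top of the generated code.

The antlr4 tool provides the following options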

 -o ___              specify output directory where all output is generated
 -lib ___            specify location of grammars, tokens files
 -atn                generate rule augmented transition network diagrams
 -encoding ___       specify grammar file encoding; e.g., euc-jp
 -message-format ___ specify output style for messages in antlr, gnu, vs2005
 -long-messages      show exception details when available for errors and warnings
 -listener           generate parse tree listener (default)
 -no-listener        don't generate parse tree listener
 -visitor            generate parse tree visitor
 -no-visitor         don't generate parse tree visitor (default)
 -package ___        specify a package/namespace for the generated code
 -depend             generate file dependencies
 -D<option>=value    set/override a grammar-level option
 -Werror             treat warnings as errors
 -XdbgST             launch StringTemplate visualizer on generated code
 -XdbgSTWait         wait for STViz to close before continuing
 -Xforce-atn         use the ATN simulator for all predictions
 -Xlog               dump lots of logging info to antlr-timestamp.log
 -Xexact-output-dir  all output goes into -o dir regardless of paths/package

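antlr4 provides two ways to walk the parse tree: the visitor pattern and the listener pattern. -visitor and -no-visitor turn visitor generation on and off, and -listener and -no-listener do the same for the listener. -long-messages shows detailed error and warning messages. -package sets the namespace of the generated code. See the official documentation for the other options. For example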

java -Xmx500M -cp /home/jona/software/antlr4/antlr-4.7.1-complete.jar org.antlr.v4.Tool -Dlanguage=Cpp -long-messages -listener -visitor -o generated/ KingbaseSqlLexer.g4 KingbaseSqlParser.g4

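Here a cpp parser is generated from the lexer grammar KingbaseSqlLexer.g4 and the parser grammar KingbaseSqlParser.g4. The source files are written to the generated directory, and both the visitor and the listener are enabled.

For details on how to use visitors and listeners, see [The Definitive ANTLR 4 Reference], which covers them very thoroughly. What I want to write about below are some problems I ran into in real work and would like to share.

Left recursion and indirect left recursion

antlr4 can handle left recursion, but not indirect left recursion; this was discussed in issue#417.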

expr
    : expr '*' expr
    | expr '+' expr
    | id
    ;

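The rule above is left-recursive: an expr can be built from other expr, and it can also be an id. The case below, however, is indirect left recursion, which antlr4 cannot handle yet; it reports the error The following sets of rules are mutually left-recursive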

expr
    : expr1 '*' expr1
    | expr1 '+' expr1
    | id
    ;

expr1
    : expr '==' expr  // indirect left-recursion to expr rule.
    | id
    ;

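expr is an expression built from expr1, and expr1 is in turn an expression built from expr. The two rules refer to each other and form mutual left recursion. This has to be removed by restructuring the grammar before antlr4 can generate the parser code correctly.

Here is a more obvious example of indirect left recursion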

table_ref
	: limit_clause
	| join_clause
	;
	
limit_clause
	: table_ref limit_clause_part
	;

join_clause
	: table_ref join_clause_part
	;

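The grammar can be restructured: limit_clause and join_clause have a lot in common, so the shared part is factored out and the differing parts become two branches, like this

table_ref
	: table_ref (limit_clause_part | join_clause_part)
	| table_ref_aux    // assumed base alternative: antlr4 requires at least one alternative that is not left-recursive
	;

This removes the indirect left recursion; what remains is direct left recursion, which antlr4 handles. antlr4 can also handle right recursion.

This approach is something I worked out at work and it is not exhaustive. If you run into similar problems, feel free to get in touch.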
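To make the handling of direct left recursion more concrete, here is a minimal C++ sketch (hand-written, not ANTLR-generated code) that parses the expr rule from the start of this section without recursing on the left: the base alternative is matched first, then operators are consumed in a loop. ANTLR 4 rewrites directly left-recursive rules into a comparable precedence-driven loop internally. The single-letter identifiers and the names precedence, parseExpr and parenthesize are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Binding strength of a binary operator; 0 means "not an operator".
// '*' is listed first in the grammar, so it binds tighter than '+'.
int precedence(char op) {
    if (op == '*') return 2;
    if (op == '+') return 1;
    return 0;
}

// Parses  expr : expr '*' expr | expr '+' expr | id  without left recursion.
// Identifiers are single characters and whitespace is not supported, to keep
// the sketch short. Returns a fully parenthesized copy of the input.
// minPrec must be at least 1, so non-operators (precedence 0) end the loop.
std::string parseExpr(const std::string& in, std::size_t& pos, int minPrec) {
    // Base case: the alternative that is not left-recursive (id).
    std::string left(1, in.at(pos++));
    // Loop instead of recursing on the left: keep consuming operators that
    // bind at least as tightly as the caller requires.
    while (pos < in.size() && precedence(in[pos]) >= minPrec) {
        char op = in[pos++];
        // Asking for precedence + 1 on the right makes the operator
        // left-associative.
        std::string right = parseExpr(in, pos, precedence(op) + 1);
        left = "(" + left + op + right + ")";
    }
    return left;
}

// Entry point: parenthesizes a whole expression.
std::string parenthesize(const std::string& in) {
    std::size_t pos = 0;
    return parseExpr(in, pos, 1);
}
```

For example, parenthesize("a+b*c") returns (a+(b*c)), because '*' binds tighter than '+', and parenthesize("a+b+c") returns ((a+b)+c), because equal operators group to the left.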

Ambiguity and two ways to remove it

Ambiguity caused by tokens (Lexer)

Suppose the keyword async is a token, and there is a statement like this

async var async = 42;

In this statement async is both a keyword and a variable name, which is ambiguous. antlr4 offers two ways to deal with this:

Add a semantic predicate to the grammar rule
```
async: {_input.LT(1).GetText() == "async"}? ID ; 
```
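The predicate checks the text of the next token: if it is async, the ID is treated as the keyword; otherwise it is an ordinary identifier. The drawback is that target-language code becomes coupled to the grammar, which makes the rules harder to maintain. One of the improvements of antlr4 over earlier versions is precisely that code and grammar are decoupled, so the two can be kept separate and are easier to maintain and read.
Add the token directly to the definition of id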
```
ASYNC: 'async';
...
id
: ID
...
| ASYNC;
```
Identifiers now include async, so the statement is parsed correctly.

Ambiguity in expressions (Parser)

Take the following grammar rules

stat: expr ';' // expression statement
    | ID '(' ')' ';' // function call statement;
    ;
expr: ID '(' ')'
    | INT
    ;

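When ID '(' ')' ';' appears, we cannot tell whether it is an expression statement or a function call statement, which makes the grammar ambiguous.

ANTLR4 cannot detect ambiguity while it generates the parser, but because ALL(*) is a dynamic algorithm, ambiguities can be identified during parsing when the ALL(*) prediction mode is used. Ambiguity can arise in the lexer or in the parser; the lexer case is the one from the previous subsection, and the parser case is the one discussed here. For some languages (C++, for example), however, certain ambiguities are an accepted part of the language and can be resolved by adding semantic predicates (semantic predicates code insertions to resolve), as in the following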

expr: { isfunc(ID) }? ID '(' expr ')' // func call with 1 arg
    | { istype(ID) }? ID '(' expr ')' // ctor-style type cast of expr
    | INT
    | void
    ;

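Checking whether ID names a function or a type decides whether the input is a function call or a constructor-style type cast.

C++ grammars used to have a problem with >>. It is the right-shift operator, but in a type such as std::vector<std::list<std::string>> the text also ends in >>, which creates an ambiguity. How is this resolved? The reference material says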

Sometimes the ambiguity can be fixed after a little reinvention of grammar. For example, there is a right shift bit operator RIGHT_SHIFT: '>>' in C#: two angle brackets can also be used to describe a generics class: List<List<int>>. If we define the >> as a token, the construction of two lists would never be parsed because the parser will assume that there is a >> operator instead of two closing brackets. To resolve this you only need to put the RIGHT_SHIFT token aside. At the same time, we can leave the LEFT_SHIFT: '<<' token as-is, because such a sequence of characters would not take place during the parsing of a valid code.

Common techniques for debugging rules

The kinds of errors in ANTLR4

Token recognition error (Lexer no viable alt). This is the only lexical error; it indicates that no rule exists to create a token from the lexeme at hand:

class # { int i; } (# is the lexeme in question)

Missing token. In this case, ANTLR inserts the missing token into the token stream, marks it as missing, and continues parsing as if the token existed:

class T { int f(x) { a = 3 4 5; } (the final } is the missing token)

Extraneous token. ANTLR marks a token as incorrect and continues parsing as if the token did not exist. In the following example the first ; is such a token:

class T ; { int i; }

Mismatched input. In this case "panic mode" is initiated: a set of input tokens is ignored, and the parser waits for a token from the synchronizing set. In the following example the 4th and 5th tokens are ignored and ; is the synchronizing token:

class T { int f(x) { a = 3 4 5; } }

No viable alternative input. This error describes all other possible parsing errors:

class T { int ; }

Errors can also be handled by hand, by adding error alternatives to a rule, as shown below
```
function_call
    : ID '(' expr ')'
    | ID '(' expr ')' ')' {notifyErrorListeners("Too many parentheses");}
    | ID '(' expr {notifyErrorListeners("Missing closing ')'");}
    ;
```

Adding a custom error listener in ANTLR4

ANTLR4 provides several default error mechanisms, ANTLRErrorListener and ANTLRErrorStrategy. We can implement our own error listener through inheritance

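// Requires <iostream> and "antlr4-runtime.h".
class ErrorVerboseListener : public antlr4::BaseErrorListener {
	public:
		// Called by the lexer or parser for every syntax error.
		void syntaxError(antlr4::Recognizer *recognizer, antlr4::Token *offendingSymbol, size_t line, size_t charPositionInLine, const std::string &msg, std::exception_ptr e) override {
			// offendingSymbol is null when the error comes from the lexer.
			std::string text = offendingSymbol ? offendingSymbol->getText() : "<none>";
			std::cerr << "line " << line << ":" << charPositionInLine << " at '" << text << "': " << msg << std::endl;
		}
};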

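Inherit from the base class and implement syntaxError, which is the error-handling function. Here line is the line on which the error occurred, charPositionInLine is the column, msg is the detailed error message, and offendingSymbol is the token at which the error occurred. This information helps to locate errors in the rules.

The error listener is used in cpp as follows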

// get a parser
ANTLRInputStream input(str);
XXXLexer lexer(&input);
CommonTokenStream tokens(&lexer);
XXXParser parser(&tokens);

// remove and add new error listeners
ErrorVerboseListener err_listener;
parser.removeErrorListeners();	// remove all error listeners
parser.addErrorListener(&err_listener);	// add

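Locating rule problems (debugging)

When one of the ANTLR4 errors above appears, the following methods help locate the problem.

Method 1

Use the error message, or the information provided by a custom error listener, to find the token or position where the error occurred, then print the whole parse tree. Once an error occurs the parser no longer builds the remaining input as intended, so the tree shows whether the alternatives chosen up to that point match what was expected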

line 1:24 extraneous input 'FROM' expecting {ABORT, ABS, ACCESS,

The parse tree looks like this. It is just one of my examples; the input is the SQL statement select name, phone from from student

(sql_script (unit_sql_statement (unit_statement (sql_statement (data_manipulation_language_statements (select_statement (subquery (subquery_basic_elements (query_block SELECT (selected_list (selected_list_element (column_name (identifier (id_expression (regular_id (non_reserved_keywords_pre12c NAME)))))) , (selected_list_element (column_name (identifier (id_expression (regular_id PHONE)))))) (from_clause FROM (table_ref_list (table_ref (table_ref_aux (table_ref_aux_internal FROM (dml_table_expression_clause (tableview_name (table_name (identifier (id_expression (regular_id STUDENT))))))))))) limit_clause))))))) ;) <EOF>)

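As you can see, the error message points to 1:24, that is line 1, column 24, where the token is from. When the parser reaches the second from, it takes a branch that is not the expected one.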
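A quick way to make such a position visible is to print the offending line with a caret under the reported column. The sketch below is a small stand-alone helper; the name pointAt is hypothetical and not part of the ANTLR runtime. It assumes, as in the message above, that lines are counted from 1 and columns from 0. Inside a custom error listener it could be fed the line and charPositionInLine arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Returns the requested source line followed by a second line that holds a
// caret under the reported column. `line` is 1-based, `column` is 0-based.
// Returns an empty string if the source has fewer than `line` lines.
std::string pointAt(const std::string& source, std::size_t line, std::size_t column) {
    std::istringstream in(source);
    std::string current;
    for (std::size_t i = 0; i < line; ++i) {
        if (!std::getline(in, current)) return "";  // line out of range
    }
    // `column` spaces push the caret under the offending character.
    return current + "\n" + std::string(column, ' ') + "^";
}
```

Calling pointAt("select name, phone from from student", 1, 24) puts the caret under the first letter of the second from.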

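Method 2

Look at the tokens produced by the lexer and check whether tokenization went wrong (sometimes the tokens themselves are wrong, which directly causes the parser to fail or to produce unexpected results)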

[@0,0:5='SELECT',<1487>,1:0]
[@1,6:6=' ',<2326>,channel=1,1:6]
[@2,7:10='NAME',<882>,1:7]
[@3,11:11=',',<2302>,1:11]
[@4,12:12=' ',<2326>,channel=1,1:12]
[@5,13:17='PHONE',<2325>,1:13]
[@6,18:18=' ',<2326>,channel=1,1:18]
[@7,19:22='FROM',<555>,1:19]
[@8,23:23=' ',<2326>,channel=1,1:23]
[@9,24:27='FROM',<555>,1:24]
[@10,28:28=' ',<2326>,channel=1,1:28]

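Let us look directly at the two from tokens (I convert all input characters to upper case so that matching is case-insensitive, which is why everything appears in upper case here)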

[@7,19:22='FROM',<555>,1:19]
[@8,23:23=' ',<2326>,channel=1,1:23]
[@9,24:27='FROM',<555>,1:24]

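@7 is the token index (counted from 0), 19:22 means the token spans characters 19 to 22, the text is FROM, the token type is 555, and 1:19 means line 1, position 19 of the input string.

The token type is the value assigned to each token in the XXXLexer.tokens file that antlr4 writes when it generates the parser. Both from tokens above have type 555, and in XXXLexer.tokens FROM is 555, so lexing is correct here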
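When the token dump is long, it can help to take the lines apart programmatically, for instance to list only the tokens of one type. The following is a minimal sketch that parses one dump line with std::regex. The struct TokenDump and the function parseTokenDump are hypothetical names, and the pattern assumes the numeric-type format shown above, with an optional channel= field.

```cpp
#include <cassert>
#include <regex>
#include <string>

// One line of the token dump, e.g. [@7,19:22='FROM',<555>,1:19]
struct TokenDump {
    int index = 0;     // position in the token stream, counted from 0
    int start = 0;     // offset of the first character
    int stop = 0;      // offset of the last character (inclusive)
    std::string text;  // matched text
    int type = 0;      // token type, as listed in the .tokens file
    int channel = 0;   // stays 0 (default channel) when "channel=" is absent
    int line = 0;      // line number, counted from 1
    int column = 0;    // position within the line, counted from 0
};

// Fills `out` and returns true if `s` has the expected dump format.
bool parseTokenDump(const std::string& s, TokenDump& out) {
    static const std::regex re(
        R"re(\[@(\d+),(\d+):(\d+)='(.*)',<(-?\d+)>,(?:channel=(\d+),)?(\d+):(\d+)\])re");
    std::smatch m;
    if (!std::regex_match(s, m, re)) return false;
    out.index = std::stoi(m[1].str());
    out.start = std::stoi(m[2].str());
    out.stop = std::stoi(m[3].str());
    out.text = m[4].str();
    out.type = std::stoi(m[5].str());
    // Group 6 only participates when the line contains "channel=N,".
    out.channel = m[6].matched ? std::stoi(m[6].str()) : 0;
    out.line = std::stoi(m[7].str());
    out.column = std::stoi(m[8].str());
    return true;
}
```

Comparing the type field of the parsed lines against XXXLexer.tokens is then a simple lookup instead of a manual scan.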

Optimizing with SLL and LL in the cpp target

Moreover, ANTLR 4 allows you to use your own error handling mechanism. This option may be used to increase the performance of the parser: first, code is parsed using a fast SLL algorithm, which, however, may parse the ambiguous code in an improper way. If this algorithm reveals at least a single error (this may be an error in the code or ambiguity), the code is parsed using the complete, but less rapid ALL-algorithm. Of course, an actual error (e.g., the missed semicolon) will always be parsed using LL, but the number of such files is less compared to ones without any errors.

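LR(*) and LL(*)

Mainstream parsers today fall into two camps, LR(*) and LL(*).

LR is a bottom-up parsing method. The L means the parser reads the input in one direction, from left to right, and the R means it produces a rightmost derivation (in reverse). Tools that generate LR parsers include YACC and Bison; what they generate is a variant of LR called LALR (look-ahead LR).

LL is a top-down parsing method. The first L means the parser reads the input from left to right, and the second L means it produces a leftmost derivation. ANTLR generates LL parsers.

How ALL(*) works

Since version 4.0 ANTLR generates ALL(*) parsers, where A stands for Adaptive. ALL(*) was developed jointly by Terence Parr, Sam Harwell and Kathleen Fisher and is a major improvement over traditional LL(*). ANTLR is currently the only tool that can generate ALL(*) parsers.

ALL(*) improves the lookahead algorithm of traditional LL(*). When it reaches a decision with several alternatives, it launches a subparser for each alternative, and each subparser has its own DFA (deterministic finite automaton). The subparsers explore all possible paths in pseudo-parallel; as soon as one of them completes a match, its path is chosen, the other subparsers are killed, and the decision is made. In other words, an ALL(*) parser may scan the input repeatedly at run time: it trades computing resources for stronger parsing power. In the worst case the algorithm has complexity O(n⁴), and it makes ANTLR smarter at resolving ambiguities and choosing between alternatives.

In cpp, choose between SLL and LL as shown below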

  // PredictionMode: LL, SLL
  // try with simpler and faster SLL first
  parser.getInterpreter<atn::ParserATNSimulator>()->setPredictionMode(
      atn::PredictionMode::SLL);
  parser.removeErrorListeners();

  // add error listener
  ErrorVerboseListener err_verbose;
  parser.addErrorListener(&err_verbose);
  parser.setErrorHandler(std::make_shared<BailErrorStrategy>());

  // BailErrorStrategy throws ParseCancellationException on the first syntax error
  try {
    std::cout << "Try with SLL(*)" << std::endl;
    _ParseString(parser, tokens);
  } catch (const ParseCancellationException &ex) {
    std::cout << "Syntax error, try with LL(*)" << std::endl;
    std::cout << ex.what() << std::endl;

    // rewind input stream
    tokens.reset();
    parser.reset();

    // back to default listener and strategy
    parser.addErrorListener(&ConsoleErrorListener::INSTANCE);
    parser.setErrorHandler(std::make_shared<DefaultErrorStrategy>());
    parser.getInterpreter<atn::ParserATNSimulator>()->setPredictionMode(
        atn::PredictionMode::LL);

    _ParseString(parser, tokens);
  }
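The same two-stage idea can be written as a small reusable helper that does not depend on the ANTLR runtime. This is only a sketch: FastPathFailed is a hypothetical stand-in for ParseCancellationException, and the three callables (fast for the SLL attempt, rewind for resetting the token stream and parser, full for the LL attempt) are assumed to be supplied by the caller.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the exception thrown when the fast pass bails out.
struct FastPathFailed : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs `fast` first. If it bails out, runs `rewind` and then `full`.
// Returns the name of the stage that produced the final result.
std::string parseTwoStage(const std::function<void()>& fast,
                          const std::function<void()>& rewind,
                          const std::function<void()>& full) {
    try {
        fast();    // cheap attempt, may throw on an error or ambiguity
        return "SLL";
    } catch (const FastPathFailed&) {
        rewind();  // restore the input before the second attempt
        full();    // slower attempt that reports real errors
        return "LL";
    }
}
```

With the code above, fast would set PredictionMode::SLL and call _ParseString, rewind would call tokens.reset() and parser.reset(), and full would switch to PredictionMode::LL and parse again. Only exceptions of the stand-in type trigger the fallback; anything else propagates to the caller.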

Reference

¹ Indirect left recursion is explained in detail later in this article.
