詞法分析器實現過程（java和c++實現）

詞法分析器是編譯原理的一個實驗，本文將會詳細給出實現的具體步驟，利用java進行示例講解，源碼（包含java和c++兩種實現方式）可在 https://download.csdn.net/download/qq_40121502/10926516 下載。

一、實驗目的

設計、編寫一個詞法分析程序，加深對詞法分析原理的理解。

二、實驗原理

詞法分析是從左向右一個字符、一個字符地讀入源程序，掃描每行源程序的符號，依據詞法規則，識別單詞。執行詞法分析的程序稱爲詞法分析器，將給定的程序通過詞法分析器，識別出一個個單詞符號。單詞符號常採用統一的二元式表示：（單詞種別碼，單詞符號自身的值），單詞種別碼是語法分析所需要的信息，而單詞符號自身屬性值是編譯其他階段需要的信息。

三、實驗內容

（1）程序利用C++或Java或其他編程語言，給出其詞法的產生式（詞法部分也可以用regular expression），包含數學運算表達式，賦值，函數調用，控制流語句（分支或循環），類型聲明等基本元素。
（2）詞法實驗部分要求寫出該語言的詞法分析程序。要求識別出詞法分析單元，給出詞法單元的分類，按照不同類別填入相應的表中，表中給出詞法單元在源程序中的位置。

四、實驗方法

（1）詞法的產生式（或regular expression）轉換成NFA
（2）NFA確定化爲DFA
（3）DFA最小化
（4）根據最小化後的DFA寫出詞法分析程序。
（5）使用選定的編程語言寫幾個測試用例文件作爲輸入，測試詞法分析程序是否正確。

五、實驗設計

（1）關鍵字：
“char”,“int”,“float”,“double”,“final”,“if”,“else”,“while”,“do”,“for”,“break”,“continue”,”void”,”return”
（2）運算符：=、==、>、<、>=、<=、+、-、*、/
（3）界限符：( ) [ ]{ } , : ;
（4）標識符（ID）：用字母、數字、下劃線的連接用來表示變量名、過程名等的單詞稱爲標識符（不能以數字開頭、不與關鍵字重名、不包含$、#等無法識別的字符）
（5）常量（NUM）：整數、小數、浮點數。
（6）詞法分析階段通常忽略空格、製表符和換行符等。
根據以上的分類與分析，設置該語言中的單詞符號及種別編碼如下表：

六、狀態圖分析

1.關鍵字

本次設計中關鍵字有17個，分別包括：
“char”,“short”,“int”,“long”,“float”,“double”,“final”,“static”,“if”,“else”,“while”,
“do”,“for”,“break”,“continue”,“void”, “return”。
關鍵字都是由小寫字母組成，在程序中，將17個關鍵字保存在一個String類型的數組中，然後做一次循環，將字符串逐個與17個關鍵字對比，相同則取出對應的種別碼，存入ArrayList中。
關鍵字的正規表達式爲：
key->letter( letter )*
letter->[a-z]
按照RE—>NFA—>DFA—>最小化DFA的順序，過程如圖所示：

2.運算符

本實驗設計了十個運算符，分別爲：=、==、>、<、>=、<=、+、-、、/
運算符表達式爲：
operation -> ( = | == | > | < | >= | <= | + | - | * | / )
按照RE—>NFA—>DFA—>最小化DFA的順序，過程如圖所示：

3.界限符

本實驗設計的界限符有：( ) [ ]{ } , : ;
界限符表達式爲：
delimiter -> ( ( | ) | [ | ] | { | } | , | : | ; )*
從RE—>NFA—>DFA—>最小化DFA的順序，過程如圖所示：

4.標識符

用字母、數字、下劃線的連接用來表示變量名、過程名等的單詞稱爲標識符（不能以數字開頭、不與關鍵字重名、不包含$、#等無法識別的字符）。
標識符的正規表達式爲：
ID ->( letter | _ ) ( letter | digit | _ )*
letter->[a-zA-Z]
digit->[0-9]
按照RE—>NFA—>DFA—>最小化DFA的順序，過程如圖所示：

5.常量

將整數，浮點數歸併後分爲無符號數和有符號數。
因爲有符號數，除了開始有符號外，之後的判斷與無符號數是一致的。所以在這裏只針對無符號數的正規表達式構造NFA和DFA。
無符號數正規表達式：
NUM -> digits op_fra op_exp
其中：
digits -> d(d)*
d -> [0-9]
op_fra-> .digits | ε
op_exp -> ( E (+|-| ε) digits) | ε
將上述表達式整合得無符號數NUM的正規表達式爲：
NUM->d(d)|d(d).d(d)|d(d)(E(+|-|ε)d(d))|d(d).d(d)E(+|-|ε)d(d)
d -> [0-9]
按照RE—>NFA—>DFA—>最小化DFA的順序，過程如圖所示：

七、數據結構

（1）使用ArrayList<String>來存放從文件內讀取進來的每個單詞（符號）。
（2）自定義類Pair，相當於鍵值對，用於存放單詞種別碼和對應的單詞，包含一個Integer類型的key和一個String類型的value。
（3）使用ArrayList<Pair>來存放詞法分析的結果。

八、實現代碼

1.定義關鍵字

首先定義了一系列關鍵字及其對應的種別碼，用於表示詞法分析的結果。

// 單詞種別碼, 1-17爲關鍵字種別碼
public static final int CHAR = 1;
public static final int SHORT = 2;
public static final int INT = 3;
public static final int LONG = 4;
public static final int FLOAT = 5;
public static final int DOUBLE = 6;
public static final int FINAL = 7;
public static final int STATIC = 8;
public static final int IF = 9;
public static final int ELSE = 10;
public static final int WHILE = 11;
public static final int DO = 12;
public static final int FOR = 13;
public static final int BREAK = 14;
public static final int CONTINUE = 15;
public static final int VOID = 16;
public static final int RETURN = 17;

// 20爲標識符種別碼
public static final int ID = 20;

// 30爲常量種別碼
public static final int NUM = 30;

// 31-40爲運算符種別碼
public static final int AS = 31; // =
public static final int EQ = 32; // ==
public static final int GT = 33; // >
public static final int LT = 34; // <
public static final int GE = 35; // >=
public static final int LE = 36; // <=
public static final int ADD = 37; // +
public static final int SUB = 38; // -
public static final int MUL = 39; // *
public static final int DIV = 40; // /

// 41-49爲界限符種別碼
public static final int LP = 41; // (
public static final int RP = 42; // )
public static final int LBT = 43; // [
public static final int RBT = 44; // ]
public static final int LBS = 45; // {
public static final int RBS = 46; // }
public static final int COM = 47; // ,
public static final int COL = 48; // :
public static final int SEM = 49; // ;

// -1爲無法識別的字符標誌碼
public static final int ERROR = -1;
public static int errorNum = 0; // 記錄詞法分析錯誤的個數

2.單詞分割

對於輸入的代碼段，進行單詞的分割，需要識別判斷出分割的位置，最後將拆分好的單詞序列放入ArrayList<String>中保存，用於後續的詞法分析。

測試文件讀入內存

final String blank = " ";
final String newLine = "\n";
Scanner scanner = new Scanner(System.in);
System.out.print("請輸入測試文件的路徑：");
String filepath = scanner.next();
StringBuffer sb = new StringBuffer();
String fileStr = "";
try {
    InputStream is = new FileInputStream(filepath);
    String line; // 用來保存每行讀取的內容
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    line = reader.readLine(); // 讀取第一行
    while (line != null) { // 如果 line 爲空說明讀完了
        sb.append(line); // 將讀到的內容添加到 buffer 中
        sb.append("\n"); // 添加換行符
        line = reader.readLine(); // 讀取下一行
    }
    reader.close();
    is.close();
    fileStr = sb.toString();
} catch (IOException e) {
    e.printStackTrace();
}

System.out.println("文件內容如下：\n" + fileStr);

單詞分割

採用循環讀取的方式將輸入的內容進行讀取分析
getFirstChar()：去除字符串開頭的空格和換行符，找到第一個有效字符的位置
getWord()：使用正則表達式進行匹配，獲取一個單詞
當文件讀取結束時將拋出IndexOutOfBoundsException異常，捕獲這個異常，在catch塊中進行後續的詞法分析。

int begin = 0, end = 0;   //分別表示單詞的第一個和最後一個字符位置
ArrayList<String> arrayList = new ArrayList<>();
try {
    do {
        begin = getFirstChar(fileStr, begin);
        String word = getWord(fileStr, begin);
        end = begin + word.length() - 1; // 單詞最後一個字符在原字符串的位置
        if (word.equals("")) {
            break;
        }
        if (!(word.equals(blank) || word.equals(newLine))) {
            arrayList.add(word);
        }
        begin = end + 1;
    } while (true);
} catch (IndexOutOfBoundsException e) { // 文件讀取結束
    // 詞法分析
}

/**
 * 去除字符串開頭的空格和換行符，找到第一個有效字符的位置
 * @param str 目標字符串
 * @param begin 開始位置
 * @return 第一個有效字符在字符串中的位置
 */
public static int getFirstChar(String str, int begin) {
    while (true) {
        if (str.charAt(begin) != ' ' && str.charAt(begin) != '\n' ) {
            return begin;
        }
        begin++;
    }
}

/**
 * 獲取一個單詞
 * @param str 目標字符串
 * @param begin 查找的開始位置
 * @return 單詞
 */
public static String getWord(String str, int begin) {
    String regEx = "\\s+|\\n|\\+|-|\\*|/|=|\\(|\\)|\\[|]|\\{|}|,|:|;"; // 正則
    Pattern p = Pattern.compile(regEx);
    Matcher m = p.matcher(str);
    int end;
    if (m.find(begin)) {
        end =  m.start();
    } else {
        return "";
    }
    if (begin == end) {
        return String.valueOf(str.charAt(begin));
    }
    return str.substring(begin, end);
}

3.詞法分析

將分割好的單詞和最開始定義的一系列關鍵字進行比對匹配，每個單詞的分析結果由Pair類進行保存，Pair類僅代表鍵值對的形式，鍵爲單詞種別碼，值爲該單詞。
Pair類定義如下：

public static class Pair {
    Integer key;
    String value;
    Pair(Integer key, String value) {
        this.key = key;
        this.value = value;
    }
}

詞法分析函數：根據單詞的長度進行分析比對

// 詞法分析函數
public static ArrayList<Pair> analyse(ArrayList<String> arrayList) {
    ArrayList<Pair> result = new ArrayList<>();
    for (int i = 0; i < arrayList.size(); i++) {
        if (arrayList.get(i).length() == 1) {
            if (arrayList.get(i).equals("=")) { // 運算符"="
                if (arrayList.get(i+1).equals("=")) { // 若後面跟的是"="，則是運算符"=="
                    result.add(new Pair(EQ, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否則是運算符"="
                    result.add(new Pair(AS, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals(">")) { // 運算符">"
                if (arrayList.get(i+1).equals("=")) { // 若後面跟的是"="，則是運算符">="
                    result.add(new Pair(GE, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否則是運算符">"
                    result.add(new Pair(GT, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals("<")) { // 運算符"<"
                if (arrayList.get(i+1).equals("=")) { // 若後面跟的是"="，則是運算符"<="
                    result.add(new Pair(LE, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否則是運算符"<"
                    result.add(new Pair(LT, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals("+")) { // 運算符"+"
                if ((arrayList.get(i-1).equals("=") || arrayList.get(i-1).equals("("))
                    && isNum(arrayList.get(i+1))) { // 判斷是否是有符號常量（正數）
                    result.add(new Pair(NUM, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否則是運算符"+"
                    result.add(new Pair(ADD, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals("-")) { // 運算符"-"
                if ((arrayList.get(i-1).equals("=") || arrayList.get(i-1).equals("("))
                        && isNum(arrayList.get(i+1))) { // 判斷是否是有符號常量（負數）
                    result.add(new Pair(NUM, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否則是運算符"-"
                    result.add(new Pair(SUB, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals("*")) { // 運算符"*"
                result.add(new Pair(MUL, arrayList.get(i)));
            } else if (arrayList.get(i).equals("/")) { // 運算符"/"
                result.add(new Pair(DIV, arrayList.get(i)));
            } else if (arrayList.get(i).equals("(")) { // 界限符"("
                result.add(new Pair(LP, arrayList.get(i)));
            } else if (arrayList.get(i).equals(")")) { // 界限符")"
                result.add(new Pair(RP, arrayList.get(i)));
            } else if (arrayList.get(i).equals("[")) { // 界限符"["
                result.add(new Pair(LBT, arrayList.get(i)));
            } else if (arrayList.get(i).equals("]")) { // 界限符"]"
                result.add(new Pair(RBT, arrayList.get(i)));
            } else if (arrayList.get(i).equals("{")) { // 界限符"{"
                result.add(new Pair(LBS, arrayList.get(i)));
            } else if (arrayList.get(i).equals("}")) { // 界限符"}"
                result.add(new Pair(RBS, arrayList.get(i)));
            } else if (arrayList.get(i).equals(",")) { // 界限符","
                result.add(new Pair(COM, arrayList.get(i)));
            } else if (arrayList.get(i).equals(":")) { // 界限符":"
                result.add(new Pair(COL, arrayList.get(i)));
            } else if (arrayList.get(i).equals(";")) { // 界限符";"
                result.add(new Pair(SEM, arrayList.get(i)));
            } else if (arrayList.get(i).charAt(0) >= '0' && arrayList.get(i).charAt(0) <= '9') { // 判斷是否是一位數字常量
                result.add(new Pair(NUM, arrayList.get(i)));
            } else if (isLetter(arrayList.get(i).charAt(0))) { // 判斷是否是一位字母標識符
                result.add(new Pair(ID, arrayList.get(i)));
            } else { // 否則是無法識別的字符
                result.add(new Pair(ERROR, arrayList.get(i)));
                errorNum++;
            }
        } else if ((arrayList.get(i).charAt(0) >= '0' && arrayList.get(i).charAt(0) <= '9')
                    || arrayList.get(i).charAt(0) == '.') { // 判斷是否是正確的常量
            if (!isNum(arrayList.get(i))) { // 不是常量，則是無法識別的字符
                result.add(new Pair(ERROR, arrayList.get(i)));
                errorNum++;
            } else if ((arrayList.get(i+1).charAt(0) == '+' || arrayList.get(i+1).charAt(0) == '-')
                    && isNum(arrayList.get(i+2))) { // 判斷是否是有符號的常量
                result.add(new Pair(NUM, arrayList.get(i) + arrayList.get(++i) + arrayList.get(++i)));
            } else { // 否則是無符號的常量
                result.add(new Pair(NUM, arrayList.get(i)));
            }
        } else if (isKeyID(arrayList.get(i)) != 0) { // 判斷是否爲關鍵字
            result.add(new Pair(isKeyID(arrayList.get(i)), arrayList.get(i)));
        } else if (isLetter(arrayList.get(i).charAt(0)) || arrayList.get(i).charAt(0) == '_') { // 判斷是否爲標識符（以字母或者下劃線開頭）
            result.add(new Pair(ID, arrayList.get(i)));
        } else { // 否則是無法識別的單詞
            result.add(new Pair(ERROR, arrayList.get(i)));
            errorNum++;
        }
    }
    return result;
}

catch塊內的處理部分：調用詞法分析函數，輸出結果到文件

ArrayList<Pair> result = analyse(arrayList);
StringBuffer outputBuffer = new StringBuffer("詞法分析結果如下：\n<單詞種別碼，單詞>          //所屬類別\n");
System.out.println("詞法分析結果如下：\n<單詞種別碼，單詞>          //所屬類別");
for (Pair pair : result) {
    outputBuffer.append("<   " + pair.key + "  ,  " + pair.value + "  >          ");
    System.out.print("<   " + pair.key + "  ,  " + pair.value + "  >          ");
    if (pair.key > 0 && pair.key < 20) {
        outputBuffer.append("//關鍵字\n");
        System.out.println("//關鍵字");
    } else if (pair.key == 20) {
        outputBuffer.append("//標識符\n");
        System.out.println("//標識符");
    } else if (pair.key == 30) {
        outputBuffer.append("//常量\n");
        System.out.println("//常量");
    } else if (pair.key > 30 && pair.key <= 40) {
        outputBuffer.append("//運算符\n");
        System.out.println("//運算符");
    } else if (pair.key > 40 && pair.key < 50) {
        outputBuffer.append("//界限符\n");
        System.out.println("//界限符");
    } else if (pair.key == -1) {
        outputBuffer.append("//無法識別的符號\n");
        System.out.println("//無法識別的符號");
    }
}
outputBuffer.append("詞法分析結束！共" + errorNum + "個無法識別的符號\n");
System.out.println("詞法分析結束！共" + errorNum + "個無法識別的符號");

System.out.print("請輸入輸出文件的路徑：");
String outputPath = scanner.next();
// 將StringBuffer的內容輸出到文件
File outputFile = new File(outputPath);
if (outputFile.exists()) {
    outputFile.delete();
}
PrintWriter writer = null;
try {
    outputFile.createNewFile();
    writer = new PrintWriter(new FileOutputStream(outputFile));
    writer.write(outputBuffer.toString());
} catch (IOException e1) {
    e1.printStackTrace();
} finally {
    if (writer != null) {
        writer.close();
    }
}

九、執行結果

至此，詞法分析器編寫完成。如有問題，歡迎交流指正。

詞法分析器實現過程（java和c++實現）

一、實驗目的

二、實驗原理

三、實驗內容

四、實驗方法

五、實驗設計

六、狀態圖分析

1.關鍵字

2.運算符

3.界限符

4.標識符

5.常量

七、數據結構

八、實現代碼

1.定義關鍵字

2.單詞分割

測試文件讀入內存

單詞分割

3.詞法分析

九、執行結果

sm4加密工具類

Python裝飾器概念及實現

從maven私服(nexus)拉取jar文件，解析項目pom依賴信息

模擬devtools工具實現類的熱部署

【信息檢索】Java簡易搜索引擎原理及實現（一）建立倒排索引

Navicat遠程連接docker容器中的mysql 報錯1251 - Client does not support authentication protocol 解決辦法

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

詞法分析器實現過程（java和c++實現）

一、 實驗目的

二、 實驗原理

三、 實驗內容

四、 實驗方法

五、 實驗設計

六、狀態圖分析

1.關鍵字

2.運算符

3.界限符

4.標識符

5.常量

七、數據結構

八、實現代碼

1.定義關鍵字

2.單詞分割

測試文件讀入內存

單詞分割

3.詞法分析

九、執行結果

一、實驗目的

二、實驗原理

三、實驗內容

四、實驗方法

五、實驗設計