A Quick Introduction to lex and yacc

Note: this is an original article. Please credit the source when reposting: http://www.cnblogs.com/lucasysfeng/

Section 1. What are lex and yacc?

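lex stands for lexical analyzer, and yacc stands for yet another compiler compiler (a parser generator). On Linux, the implementations you will usually find are flex and bison. Plenty of articles introduce flex and bison, but most of them are hard for beginners to follow.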

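A simple example helps to understand lex and yacc. Linux has many system configuration files, and many Linux programs have configuration files of their own. How does a program read the information in such a file? First lex, the lexical analyzer, reads the file and splits it into tokens (the tokens discussed later can be thought of as keywords). The tokens are then handed to yacc, which checks whether their sequence fits a grammar rule and, if it does, runs the corresponding action.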

The example above parses a configuration file, but the same tools can parse other kinds of files or be used to build a compiler.
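To make the two stages concrete, here is a minimal hand-written sketch in C of the work that lex and yacc automate, for a single name=value configuration line. The token names and the functions next_token and parse_setting are hypothetical and chosen only for illustration; the code that lex and yacc actually generate is table-driven and looks quite different.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for a "name=value" config line. */
enum token { T_END = 0, T_NAME, T_EQ, T_VALUE, T_ERROR };

/* Lexer stage: return the next token starting at *p and advance *p past it. */
static enum token next_token(const char **p)
{
    const char *s = *p;

    while (*s == ' ' || *s == '\t')
        s++;                                  /* skip blanks */
    if (*s == '\0' || *s == '\n') {
        *p = s;
        return T_END;
    }
    if (*s == '=') {
        *p = s + 1;
        return T_EQ;
    }
    if (isalpha((unsigned char)*s)) {         /* a letter starts a NAME */
        while (isalnum((unsigned char)*s))
            s++;
        *p = s;
        return T_NAME;
    }
    if (isdigit((unsigned char)*s)) {         /* a digit starts a VALUE */
        while (isdigit((unsigned char)*s))
            s++;
        *p = s;
        return T_VALUE;
    }
    *p = s + 1;
    return T_ERROR;
}

/* Parser stage: accept exactly NAME EQ (NAME | VALUE) END.
 * Returns 1 if the line fits the rule, 0 otherwise. */
int parse_setting(const char *line)
{
    const char *p = line;
    enum token t;

    if (next_token(&p) != T_NAME)
        return 0;
    if (next_token(&p) != T_EQ)
        return 0;
    t = next_token(&p);
    if (t != T_NAME && t != T_VALUE)
        return 0;
    return next_token(&p) == T_END;
}
```

With this sketch, parse_setting("port = 8080") returns 1, while parse_setting("port 8080") returns 0 because the second token is not EQ.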

Section 2. A simple lex program

1. The code.

Here is a simple lex program. It reads several lines of text and prints the number of lines, words, and characters.

    /*******************************************
     * Name        : test.l
     * Date        : Mar. 11, 2014
     * Blog        : http://www.cnblogs.com/lucasysfeng/
     * Description : A simple lex example. Reads several lines of text
     *               and prints the number of lines, words, and characters.
     *******************************************/

    /* Part 1 */
    %{
    #include <stdio.h>
    #include <string.h>

    int chars = 0;
    int words = 0;
    int lines = 0;
    %}

    /* Part 2 */
    %%
    [a-zA-Z]+  { words++; chars += strlen(yytext); }
    \n         { chars++; lines++; }
    .          { chars++; }
    %%

    /* Part 3 */
    int main(int argc, char **argv)
    {
        yylex();
        printf("%8d%8d%8d\n", lines, words, chars);
        return 0;
    }

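In this program, yytext is a lex variable that holds the text matched by the current pattern. yylex() is the function that starts the scan; lex generates it automatically. lex variables and functions are covered later; the point here is only to get a feel for lex through a simple program.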

2. Build and run it as follows.

    # flex test.l
    # gcc lex.yy.c -lfl
    # ./a.out

Then type some text and press Ctrl+D to end the input. The program prints the number of lines, words, and characters.

The original post shows a screenshot of this session here.

3. Analysis of the lex program above.

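(1) %% divides the file into three parts: the first holds the C and lex global declarations, the second holds the rules, and the third holds C code.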

(2) C code in the first part must be enclosed in %{ and %}; C code in the third part needs no delimiters.

(3) In the rules part, [a-zA-Z]+, \n, and . are regular expressions, and the code inside {} is the action, written in C.
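For comparison, the three rules can be written out by hand in plain C. This is only a sketch of what the rules mean, not the code flex generates; the struct counts type and the count_text function are hypothetical names, and isalpha is assumed to run in the default C locale so that it matches the same characters as [a-zA-Z].

```c
#include <ctype.h>

/* Counters filled in by count_text(); the struct and names are illustrative. */
struct counts {
    int lines;
    int words;
    int chars;
};

/* Apply the three rules of test.l to a string held in memory. */
struct counts count_text(const char *s)
{
    struct counts c = {0, 0, 0};

    while (*s != '\0') {
        if (isalpha((unsigned char)*s)) {
            /* Rule [a-zA-Z]+ : the longest run of letters is one word. */
            c.words++;
            while (isalpha((unsigned char)*s)) {
                c.chars++;
                s++;
            }
        } else if (*s == '\n') {
            /* Rule \n : a newline is one character and ends a line. */
            c.chars++;
            c.lines++;
            s++;
        } else {
            /* Rule . : any other single character. */
            c.chars++;
            s++;
        }
    }
    return c;
}
```

For the input "hello world\n" this yields 1 line, 2 words, and 12 characters (5 + 1 + 5 + 1), which is what the lex program reports for the same text.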

4. Building without the -lfl option.

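The build above used gcc lex.yy.c -lfl. How can the program be built with plain gcc lex.yy.c? The answer is to define a yywrap function yourself (the reason is given in the discussion of lex libraries and functions): the fl library supplies a default yywrap, so once the program has its own, the library is no longer needed. The code is as follows.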

    /*******************************************
     * Name        : test.l
     * Date        : Mar. 11, 2014
     * Blog        : http://www.cnblogs.com/lucasysfeng/
     * Description : A simple lex example. Reads several lines of text
     *               and prints the number of lines, words, and characters.
     *               Adds a yywrap function.
     *******************************************/

    %{
    #include <stdio.h>
    #include <string.h>

    int chars = 0;
    int words = 0;
    int lines = 0;
    %}

    %%
    [a-zA-Z]+  { words++; chars += strlen(yytext); }
    \n         { chars++; lines++; }
    .          { chars++; }
    %%

    int main(int argc, char **argv)
    {
        yylex();
        printf("%8d%8d%8d\n", lines, words, chars);
        return 0;
    }

    /* Called at end of input; returning 1 tells the scanner to stop. */
    int yywrap()
    {
        return 1;
    }

Section 3. Going further with lex

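Modify the program from Section 2 so that the regular expressions are given names in the global declarations, which makes the rules easier to read.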

    /*******************************************
     * Name        : test.l
     * Date        : Mar. 11, 2014
     * Blog        : http://www.cnblogs.com/lucasysfeng/
     * Description : A simple lex example. Reads several lines of text
     *               and prints the number of lines, words, and characters.
     *               The regular expressions are named in the declarations.
     *******************************************/

    %{
    #include <stdio.h>
    #include <string.h>

    int chars = 0;
    int words = 0;
    int lines = 0;
    %}

    mywords [a-zA-Z]+
    mylines \n
    mychars .

    %%
    {mywords}  { words++; chars += strlen(yytext); }
    {mylines}  { chars++; lines++; }
    {mychars}  { chars++; }
    %%

    int main(int argc, char **argv)
    {
        yylex();
        printf("%8d%8d%8d\n", lines, words, chars);
        return 0;
    }

Build and run it the same way as in Section 2.

Section 4. Going further still with lex: scanning in a loop

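The lex program below prints a special message whenever it scans + or -. When yylex() is called and the action of the matched rule contains a return, yylex returns with that value; when the action has no return, yylex keeps scanning without returning. The next call automatically resumes from where the previous scan stopped. At end of input yylex returns 0, which ends the while loop in main.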

    /*******************************************
     * Name        : test.l
     * Date        : Mar. 11, 2014
     * Blog        : http://www.cnblogs.com/lucasysfeng/
     * Description : Going further with lex: scanning in a loop.
     *******************************************/

    %{
    #include <stdio.h>

    enum yytokentype {
        ADD = 259,
        SUB = 260,
    };
    %}

    myadd   "+"
    mysub   "-"
    myother .

    %%
    {myadd}    { return ADD; }
    {mysub}    { return SUB; }
    {myother}  { printf("Mystery character\n"); }
    %%

    int main(int argc, char **argv)
    {
        int tok;

        /* yylex() returns 0 at end of input, which ends the loop. */
        while ((tok = yylex())) {
            if (tok == ADD || tok == SUB) {
                printf("meet + or -\n");
            } else {
                printf("this else statement will not be printed, "
                       "because if yylex returns, the return value must be ADD or SUB.\n");
            }
        }
        return 0;
    }

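Build and run it as in Section 2, with -lfl, since this program defines no yywrap. The original post shows a screenshot of the session here.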
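The return-and-resume behaviour of yylex described in this section can be imitated in plain C by keeping the scan position between calls. The sketch below is an assumption-laden illustration: struct scanner and scan_next are hypothetical names, the input is a string in memory rather than a file, and the mystery counter stands in for the printf in the lex action.

```c
#include <stddef.h>

enum { ADD = 259, SUB = 260 };

/* Hypothetical scanner state: the input, the position reached so far,
 * and a count of characters that were neither + nor -. */
struct scanner {
    const char *input;
    size_t pos;
    int mystery;
};

/* Like yylex(): return ADD or SUB when one is found and 0 at end of input.
 * Any other character is consumed and counted without returning. */
int scan_next(struct scanner *sc)
{
    for (;;) {
        char ch = sc->input[sc->pos];

        if (ch == '\0')
            return 0;           /* end of input */
        sc->pos++;              /* the next call resumes after this character */
        if (ch == '+')
            return ADD;
        if (ch == '-')
            return SUB;
        sc->mystery++;          /* no return here, so keep scanning */
    }
}
```

Calling scan_next repeatedly on the input "a+bc-" returns ADD, then SUB, then 0, and leaves mystery at 3. Each call picks up at the saved position, just as successive yylex() calls do.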

Section 5. yacc grammar

1. The rules part of a yacc grammar resembles BNF, so let's look at BNF (Backus-Naur Form) first.

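(1) Content inside <> is required;

(2) Content inside [] is optional;

(3) Content inside { } may repeat zero or more times;

(4) | means choose one of the items on either side of it, i.e. "OR";

(5) ::= means "is defined as";

(6) Content inside double quotes "" stands for those literal characters, while double_quote denotes the double-quote character itself.

(7) A BNF example. The following defines the for statement in Java: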

    FOR_STATEMENT ::=
        "for" "(" ( variable_declaration |
        ( expression ";" ) | ";" )
        [ expression ] ";"
        [ expression ]
        ")" statement
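One way to get used to this notation is to see how each construct maps onto ordinary control flow in a hand-written recognizer: [ ] becomes an if, { } becomes a loop, and a quoted literal becomes a character comparison. The small grammar in the comment below is made up for illustration (it is not from the article), and parse_number and parse_list are hypothetical names.

```c
#include <ctype.h>

/* Illustrative grammar:
 *   list   ::= "[" [ number { "," number } ] "]"
 *   number ::= digit { digit }
 */

/* number ::= digit { digit } : one required digit, then zero or more. */
static int parse_number(const char **p)
{
    if (!isdigit((unsigned char)**p))
        return 0;
    while (isdigit((unsigned char)**p))    /* { digit } becomes a loop */
        (*p)++;
    return 1;
}

/* Returns 1 if s is exactly one list, 0 otherwise. */
int parse_list(const char *s)
{
    if (*s++ != '[')                       /* literal "[" */
        return 0;
    if (*s != ']') {                       /* [ ... ] becomes an if */
        if (!parse_number(&s))
            return 0;
        while (*s == ',') {                /* { "," number } becomes a loop */
            s++;
            if (!parse_number(&s))
                return 0;
        }
    }
    if (*s++ != ']')                       /* literal "]" */
        return 0;
    return *s == '\0';
}
```

Under this sketch "[]" and "[1,22,333]" are accepted, while "[1,]" and "[1" are rejected. yacc spares you from writing such functions by generating a parser from the rules directly.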

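2. yacc grammar.

A yacc rule has the general form

    result : components { action } ;

Note: components are the terminal and nonterminal symbols grouped together by the rule; they are followed by the action to execute, enclosed in {}.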

3. A grammar example.

    param : NAME EQ NAME {
                printf("\tName:%s\tValue(name):%s\n", $1, $3); }
          | NAME EQ VALUE {
                printf("\tName:%s\tValue(value):%s\n", $1, $3); }
          ;

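The tokens are declared in the first part of the yacc file; the lex file scans the input and returns these tokens. yacc matches the components on the right-hand side of the colon in each rule and, when the token sequence fits a rule, runs the corresponding action. Inside an action, $1 and $3 refer to the values of the first and third components.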

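Analysis: | separates alternatives, meaning either side may be chosen. Consider this rule:

    simple_sentence: subject verb object
                   | subject verb object prep_phrase ;

Read by itself, the second line has nothing to the left of |, which would suggest "empty, or subject verb object prep_phrase". But that line continues the rule begun on the line above, so the alternative to the left of | is subject verb object. The rule as a whole therefore matches either subject verb object or subject verb object prep_phrase.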

Section 6. Using lex and yacc together

1. The lex program.

It returns the corresponding token when it matches a, b, c, or not.

    /*******************************************
     * Name        : test.l
     * Date        : Mar. 11, 2014
     * Blog        : http://www.cnblogs.com/lucasysfeng/
     * Description : Using lex and yacc together.
     *******************************************/

    %{
    #include "test.tab.h"   /* token definitions generated from test.y */
    #include <stdio.h>
    #include <stdlib.h>
    %}

    %%
    a    { return A_STATE; }
    b    { return B_STATE; }
    c    { return C_STATE; }
    not  { return NOT; }
    %%

2. The yacc program.

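It prints 1 once A_STATE B_STATE has been scanned, prints 2 once A_STATE B_STATE c_state_not_token has been scanned, and prints 3 when NOT is scanned. Because the first action sits in the middle of the rule, the input a b c prints 1 and then 2.

Here A_STATE, B_STATE, and NOT are tokens, and c_state_not_token is a nonterminal symbol, defined here as

    c_state_not_token : C_STATE {}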

    /*******************************************
     * Name        : test.y
     * Date        : Mar. 11, 2014
     * Blog        : http://www.cnblogs.com/lucasysfeng/
     * Description : Using lex and yacc together.
     *******************************************/

    %{
    #include <stdio.h>
    #include <stdlib.h>

    int yylex(void);               /* provided by the lex program */
    void yyerror(const char *s);
    %}

    %token A_STATE B_STATE C_STATE NOT

    %%
    program :
        A_STATE B_STATE { printf("1"); }
        c_state_not_token { printf("2"); }
        | NOT { printf("3"); }
        ;

    c_state_not_token : C_STATE {}
        ;
    %%

    void yyerror(const char *s)
    {
        fprintf(stderr, "error: %s\n", s);
    }

    int main()
    {
        yyparse();
        return 0;
    }

3. Build and run.

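On Linux, lex and yacc are provided by flex and bison. Running bison -d test.y generates test.tab.c along with the test.tab.h header that test.l includes; flex test.l generates lex.yy.c; the two C files are then compiled together with gcc test.tab.c lex.yy.c -lfl (the -lfl is needed because this lex program defines no yywrap).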

Section 7. Going further with lex and yacc together

1. We want to use lex and yacc together to parse the following file.

We analyze the text file test.txt, whose contents are:

ZhangSan=23
LiSi=34
WangWu=43

After scanning test.txt, we want this output:

ZhangSan is 23 years old!!!

LiSi is 34 years old!!!

WangWu is 43 years old!!!

2. Use lex to scan test.txt and return tokens.

    /*******************************************
     * Name        : test.l
     * Date        : Mar. 11, 2014
     * Blog        : http://www.cnblogs.com/lucasysfeng/
     * Description : Going further with lex and yacc together.
     *******************************************/

    %{
    /* yylval must have the same type here as in test.y, so define
       YYSTYPE before including test.tab.h; otherwise it defaults to int. */
    #define YYSTYPE char *
    #include "test.tab.h"
    #include <stdio.h>
    #include <string.h>
    %}

    char [A-Za-z]
    num  [0-9]
    eq   [=]
    name {char}+
    age  {num}+

    %%
    {name}  { yylval = strdup(yytext); return NAME; }
    {eq}    { return EQ; }
    {age}   { yylval = strdup(yytext); return AGE; }
    %%

    int yywrap()
    {
        return 1;
    }
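The actions above store strdup(yytext) in yylval rather than yytext itself. The reason is that yytext points into the scanner's own buffer, whose contents are only valid until the next match, so the parser needs its own copy. The following C sketch imitates that situation without flex: match_buf stands in for yytext, fake_match stands in for a call to yylex, and copy_text is a portable stand-in for strdup. All of these names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for yytext: one buffer that every match overwrites. */
static char match_buf[32];

/* Pretend the scanner matched `text`: overwrite the shared buffer. */
static void fake_match(const char *text)
{
    strncpy(match_buf, text, sizeof match_buf - 1);
    match_buf[sizeof match_buf - 1] = '\0';
}

/* Portable equivalent of strdup(): a heap copy the caller must free(). */
static char *copy_text(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);

    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Returns 1 if the copied value survives the next match while the
 * aliased pointer now sees the new text. */
int copy_survives_next_match(void)
{
    const char *alias;
    char *copy;
    int ok;

    fake_match("ZhangSan");
    alias = match_buf;              /* like yylval = yytext (wrong) */
    copy = copy_text(match_buf);    /* like yylval = strdup(yytext) */
    if (copy == NULL)
        return 0;

    fake_match("23");               /* the next token overwrites the buffer */

    ok = strcmp(copy, "ZhangSan") == 0 && strcmp(alias, "23") == 0;
    free(copy);
    return ok;
}
```

In this sketch the copy still reads "ZhangSan" after the second match, while the aliased pointer reads "23". Note that the lex program above never frees the strings it duplicates; for a short-lived example that is harmless, but a longer-running parser would free them in the yacc action once they have been used.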

3. yacc takes the tokens returned by lex and checks whether they fit the grammar; if they do, it runs the corresponding action.

    /*******************************************
     * Name        : test.y
     * Date        : Mar. 11, 2014
     * Blog        : http://www.cnblogs.com/lucasysfeng/
     * Description : Going further with lex and yacc together.
     *******************************************/

    %{
    #include <stdio.h>
    #include <stdlib.h>
    typedef char* string;
    #define YYSTYPE string      /* semantic values are C strings */

    int yylex(void);            /* provided by the lex program */
    int yyerror(char *msg);
    %}

    %token NAME EQ AGE

    %%
    file : record file
         | record
         ;

    record : NAME EQ AGE {
                 printf("%s is %s years old!!!\n", $1, $3); }
           ;
    %%

    int main()
    {
        extern FILE* yyin;

        if (!(yyin = fopen("test.txt", "r"))) {
            perror("cannot open parsefile:");
            return -1;
        }
        yyparse();
        fclose(yyin);
        return 0;
    }

    int yyerror(char *msg)
    {
        printf("Error encountered: %s\n", msg);
        return 0;
    }

4. Build and run.

Appendix: lex variables and functions.
