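Chapter 12: Regular Expressions and File Formatting

Basic Regular Expressions

How the Locale Affects Regular Expressions

The ordering of characters can differ from one locale to another.

LANG=C: 0 1 2 ... A B C ... a b c ...

LANG=zh_CN: 0 1 2 ... a A b B ...

As a result, a range such as [A-Z] can match different characters depending on the locale.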
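A minimal sketch of the effect, assuming a POSIX shell and grep. The sample words and the function name count_upper_lines are made up for illustration. Setting LC_ALL=C for a single command makes [A-Z] follow the plain ordering shown for LANG=C, so only uppercase letters match:

```shell
# Hypothetical helper: count the input lines that contain an uppercase ASCII letter.
# LC_ALL=C applies to this one grep call only.
count_upper_lines() {
    LC_ALL=C grep -c '[A-Z]'
}

# Two of the four sample lines contain an uppercase letter, so this prints 2.
printf 'apple\nBanana\ncherry\nDATE\n' | count_upper_lines
```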

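Class	Meaning
[:alnum:]	Letters and digits: 0-9, A-Z, a-z
[:alpha:]	Letters, both uppercase and lowercase
[:blank:]	The space and Tab characters
[:cntrl:]	Control characters such as CR, LF, Tab, Del
[:digit:]	Digits
[:graph:]	All visible characters, i.e. everything printable except space and Tab
[:lower:]	Lowercase letters
[:print:]	Any character that can be printed
[:punct:]	Punctuation characters such as " ' ? ; : # $
[:upper:]	Uppercase letters
[:space:]	Any character that produces whitespace
[:xdigit:]	Hexadecimal digits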
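The classes are written inside an extra pair of brackets when used in a pattern. The sketch below assumes a POSIX grep; the four sample lines and the helper name count_matching are invented for illustration.

```shell
# Hypothetical helper: count how many of four fixed sample lines match a pattern.
count_matching() {
    printf 'abc\n123\nA1!\n   \n' | grep -c "$1"
}

count_matching '[[:digit:]]'     # lines containing a digit: 2
count_matching '[[:upper:]]'     # lines containing an uppercase letter: 1
count_matching '[[:punct:]]'     # lines containing punctuation: 1
count_matching '^[[:blank:]]*$'  # lines made only of spaces or tabs: 1
```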

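Some Advanced grep Options

Besides the basic usage covered in the previous chapter, grep has a few more advanced options.

grep [-A n] [-B n] [--color=auto] 'search string' filename

Options:

-A: takes a number n ("after"); besides the matching line, the n lines after it are also printed

-B: takes a number n ("before"); besides the matching line, the n lines before it are also printed

--color=auto: highlights the matched text in color

//-n shows line numbers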
[root@localhost 桌面]# dmesg | grep -n --color=auto 'eth'
1730:[   10.210383] e1000 0000:02:01.0 eth0: (PCI:66MHz:32-bit) 00:0c:29:7f:dd:91
1731:[   10.210404] e1000 0000:02:01.0 eth0: Intel(R) PRO/1000 Network Connection

Note: when grep finds the string, it always prints the whole matching line.
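A small sketch of the -A and -B options, assuming a grep that supports them (GNU grep does). The five-line input and the function name show_context are made up for illustration.

```shell
# Hypothetical input of five lines; print the matching line plus one line
# of context before it (-B 1) and one line after it (-A 1).
show_context() {
    printf 'one\ntwo\nthree\nfour\nfive\n' | grep -A 1 -B 1 'three'
}

show_context
```

Only the lines two, three and four are printed; one and five fall outside the requested context.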

Basic Regular Expression Exercises

The practice text is shown below.

[root@localhost 桌面]# cat regular_express.txt
"Open Source" is a good mechanism to develop programs.
apple is my favorite food.
Football game is not use feet only.
this dress doesn't fit me.
However, this dress is about $ 3183 dollars.
GNU is free air not free beer.
Her hair is very beauty.
I can't finish the test.
Oh! The soup taste good.
motorcycle is cheap than car.
This window is clear.
the symbol '*' is represented as start.
Oh!    My god!
The gd software is a library for drafting programs.
You are the best is mean you are the no. 1.
The world <Happy> is the same with "glad".
I like dog.
google is the best tools for search keyword.
goooooogle yes!
go! go! Let's go.
# I am VBird

[root@localhost 桌面]#

Example 1: finding a specific string

//find lines that contain the
[root@localhost 桌面]# grep -n 'the' regular_express.txt
8:I can't finish the test.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.

//find lines that do not contain the
[root@localhost 桌面]# grep -vn 'the' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
5:However, this dress is about $ 3183 dollars.
6:GNU is free air not free beer.
7:Her hair is very beauty.
9:Oh! The soup taste good.
10:motorcycle is cheap than car.
11:This window is clear.
13:Oh!    My god!
14:The gd software is a library for drafting programs.
17:I like dog.
19:goooooogle yes!
20:go! go! Let's go.
21:# I am VBird
22:
[root@localhost 桌面]#

Example 2: using brackets [] to match a set of characters

//find the string tast or test
[root@localhost 桌面]# grep -n 't[ae]st' regular_express.txt
8:I can't finish the test.
9:Oh! The soup taste good.

//find oo that is not preceded by g
[root@localhost 桌面]# grep -n '[^g]oo' regular_express.txt
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!

//find digits
[root@localhost 桌面]# grep -n '[0-9]' regular_express.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.

//find oo that is not preceded by a lowercase letter
[root@localhost 桌面]# grep -n '[^[:lower:]]oo' regular_express.txt
3:Football game is not use feet only.
[root@localhost 桌面]#

Example 3: line-start and line-end anchors ^ and $

//lines that start with the
[root@localhost 桌面]# grep -n '^the' regular_express.txt
12:the symbol '*' is represented as start.

//lines that start with a lowercase letter
[root@localhost 桌面]# grep -n '^[a-z]' regular_express.txt
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

//lines that end with a period (the dot must be escaped)
[root@localhost 桌面]# grep -n '\.$' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
11:This window is clear.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
17:I like dog.
18:google is the best tools for search keyword.
20:go! go! Let's go.

//find empty lines
[root@localhost 桌面]# grep -n '^$' regular_express.txt
22:
[root@localhost 桌面]#

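Example 4: any character (.) and repetition (*)

. (dot): matches exactly one arbitrary character

*: matches the preceding character repeated zero or more times

//find strings that start with g, end with d, and have two characters in between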
[root@localhost 桌面]# grep -n 'g..d' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
9:Oh! The soup taste good.
16:The world <Happy> is the same with "glad".

//find two o's followed by zero or more o's, i.e. at least two consecutive o's
[root@localhost 桌面]# grep -n 'ooo*' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
[root@localhost 桌面]#

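Example 5: limiting the number of repetitions with {}

In basic regular expressions the braces must be escaped as \{ and \}

//find o repeated twice
[root@localhost 桌面]# grep -n 'o\{2\}' regular_express.txt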
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

//find o repeated 2 to 5 times
[root@localhost 桌面]# grep -n 'o\{2,5\}' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

//find g, then o repeated two or more times, then g
[root@localhost 桌面]# grep -n 'go\{2,\}g' regular_express.txt
18:google is the best tools for search keyword.
19:goooooogle yes!
[root@localhost 桌面]#
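For comparison, a hedged sketch of the same interval written both ways. It assumes a grep that accepts -E for extended regular expressions, where the braces are written without backslashes. The sample words and the function names are invented for illustration.

```shell
# Hypothetical sample words, one per line.
words() { printf 'gogle\ngoogle\ngoooooogle\n'; }

# Basic regular expression: the braces are escaped.
bre_match() { words | grep 'go\{2,\}gle'; }

# Extended regular expression (-E): the braces are written bare.
ere_match() { words | grep -E 'go{2,}gle'; }

bre_match
ere_match
```

Both calls print google and goooooogle; gogle has only one o and is skipped.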

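Basic Regular Expression Characters

The five examples in the previous section can be summarized as the following basic regular expression characters:

RE character	Meaning
^word	The string being searched for is at the start of the line
word$	The string being searched for is at the end of the line
.	Exactly one arbitrary character
\	Escape character
*	The preceding character repeated zero or more times
[list]	One character out of the listed set
[n1-n2]	One character out of the given range
[^list]	One character that is NOT in the listed set or range
\{n,m\}	The preceding character repeated n to m times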

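The sed Tool

sed is itself a pipeline command: it can analyze data from standard input, and it can also replace, delete, insert, and select specific lines of that data.

sed [-nefri] [action]

Options:

-n: quiet mode. By default every line read from STDIN is printed to the screen; with -n, only the lines that sed has specially processed are printed

-e: give the sed action directly on the command line

-f: read the sed actions from a file; -f filename runs the sed actions stored in filename

-r: the sed action uses extended regular expressions (the default is basic regular expressions)

-i: modify the content of the file that is read, instead of printing to the screen

Action format: [n1[,n2]] function

n1 and n2 are optional; they usually give the line numbers the action applies to.

function is one of the following:

a: append; a can be followed by text, which appears on a new line (after the current line)

c: change; c can be followed by text, which replaces lines n1 through n2

d: delete

i: insert; i can be followed by text, which appears on a new line (before the current line)

p: print

s: substitute; usually combined with regular expressions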

//original text
[root@localhost 桌面]# cat -n test.txt
     1    this a test text!
     2    i like linux !
     3    today is monday!
     4    my name is fw.
     5    

//delete lines 2-3
[root@localhost 桌面]# cat -n test.txt | sed '2,3d'
     1    this a test text!
     4    my name is fw.
     5    

//delete line 3 and everything after it
[root@localhost 桌面]# cat -n test.txt | sed '3,$d'
     1    this a test text!
     2    i like linux !

//append (after the line)
[root@localhost 桌面]# cat -n test.txt | sed '2a this line is new'
     1    this a test text!
     2    i like linux !
this line is new
     3    today is monday!
     4    my name is fw.
     5    

//insert (before the line)
[root@localhost 桌面]# cat -n test.txt | sed '2i this line is new'
     1    this a test text!
this line is new
     2    i like linux !
     3    today is monday!
     4    my name is fw.
     5    

//replace the line
[root@localhost 桌面]# cat -n test.txt | sed '2c this line is new'
     1    this a test text!
this line is new
     3    today is monday!
     4    my name is fw.
     5    

//print lines 2-4
[root@localhost 桌面]# cat -n test.txt | sed -n '2,4p'
     2    i like linux !
     3    today is monday!
     4    my name is fw.

Search and replace: sed 's/string to replace/new string/g'

The search string can be a regular expression.

//show the original text
[root@localhost 桌面]# cat -n test.txt
     1    this a test text!
     2    i like linux !
     3    today is monday!
     4    my name is fw.
     5    

//replace this with that
[root@localhost 桌面]# cat -n test.txt | sed 's/this/that/g'
     1    that a test text!
     2    i like linux !
     3    today is monday!
     4    my name is fw.
     5    

//replace the ! at the end of the line with a period .
[root@localhost 桌面]# cat -n test.txt | sed 's/!$/./g'
     1    this a test text.
     2    i like linux .
     3    today is monday.
     4    my name is fw.
     5    

//delete the leading this (the pattern ^.*this also removes the line-number prefix added by cat -n)
[root@localhost 桌面]# cat -n test.txt | sed 's/^.*this//g'
 a test text!
     2    i like linux !
     3    today is monday!
     4    my name is fw.
     5    
[root@localhost 桌面]#
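Several actions can be combined in one call by repeating -e. The following is only a sketch under the assumption of a POSIX sed; the sample lines mirror the test text used here and the function names are hypothetical.

```shell
# Hypothetical sample text, similar to the test.txt lines used in this section.
sample() { printf 'this a test text!\ni like linux !\ntoday is monday!\n'; }

# Two actions in one pass: turn "this" into "that",
# then turn a "!" at the end of the line into ".".
fix_text() {
    sample | sed -e 's/this/that/g' -e 's/!$/./'
}

fix_text
```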

Modifying the file content directly:

the -i option

//show the original file
[root@localhost 桌面]# cat test.txt
this a test text!
i like linux !
today is monday!
my name is fw.

//replace this with that and write the result back to the original file
[root@localhost 桌面]# sed -i 's/this/that/g' test.txt

//show the file again
[root@localhost 桌面]# cat test.txt
that a test text!
i like linux !
today is monday!
my name is fw.
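Because -i overwrites the file, it can be safer to keep a copy. The sketch below assumes a sed that accepts a backup suffix attached to -i (GNU sed does; other implementations may differ) and works on a throwaway file created with mktemp, so the paths are illustrative only.

```shell
# Work in a temporary directory so no real file is touched.
workdir=$(mktemp -d)
printf 'this a test text!\n' > "$workdir/test.txt"

# -i.bak edits test.txt in place and keeps the original as test.txt.bak.
sed -i.bak 's/this/that/g' "$workdir/test.txt"

cat "$workdir/test.txt"      # the edited file
cat "$workdir/test.txt.bak"  # the untouched backup
```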

Extended Regular Expressions

This part is skipped for now.

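File Formatting and Related Processing

Formatted printing: printf

printf 'format' content

Parameters: the special escape sequences that can be used in the format:

\a: alert (bell) sound

\b: backspace

\f: form feed (clears the screen)

\n: newline

\r: carriage return (the Enter key)

\t: horizontal Tab

\v: vertical Tab

\xNN: NN is a two-digit hexadecimal number; it is converted to the character with that code

Common format specifiers from the C language:

%ns: n is a number and s stands for string, i.e. a string n characters wide

%ni: n is a number and i stands for integer, i.e. an integer n digits wide

%N.nf: N and n are both numbers and f stands for float; N is the total width and n the number of digits after the decimal point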

//show the original text
[root@localhost 桌面]# cat test.txt
Name    Chinese    English    Math    Average
Tom    80    60    92    77.33
Sherry    75    55    80    70.00
John    60    90    70    73.33


[root@localhost 桌面]# printf '%s\t %s\t %s\t %s\t %s\t \n' $(cat test.txt)
Name     Chinese     English     Math     Average     
Tom     80     60     92     77.33     
Sherry     75     55     80     70.00     
John     60     90     70     73.33     

[root@localhost 桌面]# printf '%10s %5i %5i %5i %8.3f \n' $(cat test.txt)
bash: printf: Chinese: invalid number
bash: printf: English: invalid number
bash: printf: Math: invalid number
bash: printf: Average: invalid number
      Name     0     0     0    0.000 
       Tom    80    60    92   77.330 
    Sherry    75    55    80   70.000 
      John    60    90    70   73.330 
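The "invalid number" messages appear because the header words are fed to the numeric formats. One way around this, sketched here under the assumption of a POSIX shell, is to print the first line with %s only and the data rows with the numeric formats. The function names and the shortened score table are hypothetical.

```shell
# Hypothetical score table with the same layout as test.txt above.
scores() {
    printf 'Name Chinese English Math Average\n'
    printf 'Tom 80 60 92 77.33\n'
    printf 'John 60 90 70 73.33\n'
}

# Read the table line by line: the header (first line) uses string formats
# only, every later line uses integer and float formats.
format_scores() {
    first=1
    while read -r name c e m avg; do
        if [ "$first" -eq 1 ]; then
            printf '%10s %7s %7s %5s %8s\n' "$name" "$c" "$e" "$m" "$avg"
            first=0
        else
            printf '%10s %7d %7d %5d %8.3f\n' "$name" "$c" "$e" "$m" "$avg"
        fi
    done
}

scores | format_scores
```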

//print the character whose hexadecimal code is 45
[root@localhost 桌面]# printf '\x45\n'
E
[root@localhost 桌面]#

awk: A Handy Data-Processing Tool

awk 'condition1{action1} condition2{action2} ...' filename

[root@localhost 桌面]# last -n 5
root     pts/0        :0               Mon Jul 18 14:19   still logged in   
root     :0           :0               Mon Jul 18 14:10   still logged in   
(unknown :0           :0               Mon Jul 18 14:08 - 14:10  (00:01)    
reboot   system boot  3.10.0-327.el7.x Mon Jul 18 14:08 - 16:00  (01:52)    
root     pts/0        :0               Sun Jul 17 15:44 - crash  (22:23)    

wtmp begins Mon Apr 25 13:36:45 2016

[root@localhost 桌面]# last -n 5 | awk '{print $1 "\t" $4}'
root    Mon
root    Mon
(unknown    Mon
reboot    3.10.0-327.el7.x
root    Sun
    
wtmp    Apr
[root@localhost 桌面]#

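awk splits each line on spaces or tabs and assigns the pieces, in order, to the variables $1, $2, ...

awk built-in variables

NF: the total number of fields on the current line

NR: the number of the line awk is currently processing

FS: the current field separator; the default is whitespace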

[root@localhost 桌面]# last -n 5 | awk '{print $1 "\t lines:" NR "\t columns:" NF}'
root     lines:1     columns:10
root     lines:2     columns:10
(unknown     lines:3     columns:10
reboot     lines:4     columns:11
root     lines:5     columns:10
     lines:6     columns:0
wtmp     lines:7     columns:7
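The same variables can be tried on a tiny input. This sketch assumes a POSIX awk; the three input lines and the function name fields_demo are made up for illustration. Runs of spaces and a tab all count as one separator, and the empty line has zero fields.

```shell
# Hypothetical input: line 1 mixes spaces and a tab, line 2 is empty.
fields_demo() {
    printf 'alpha  beta\tgamma\n\nx y\n' |
        awk '{print NR ": " NF " fields, first=" $1}'
}

fields_demo
```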

awk comparison operators

>: greater than

<: less than

>=: greater than or equal to

<=: less than or equal to

==: equal to

!=: not equal to

[root@localhost 桌面]# cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin

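//use ":" as the field separator; the first line is not handled correctly, because FS changes only after that line has already been read and split
[root@localhost 桌面]# cat /etc/passwd | 
> awk '{FS=":"} $3<10 {print $1 "\t" $3}'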
root:x:0:0:root:/root:/bin/bash    
bin    1
daemon    2
adm    3
lp    4
sync    5
shutdown    6
halt    7
mail    8

//use BEGIN to set the variable in advance, so the first line is handled correctly too
[root@localhost 桌面]# cat /etc/passwd | 
> awk 'BEGIN {FS=":"} $3<10 {print $1 "\t" $3}'
root    0
bin    1
daemon    2
adm    3
lp    4
sync    5
shutdown    6
halt    7
mail    8
[root@localhost 桌面]#
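The separator can also be given with the -F option, which takes effect before the first line is read, much like BEGIN {FS=":"}. A minimal sketch, assuming a POSIX awk; the shortened passwd-style records and the function names are hypothetical.

```shell
# Hypothetical passwd-style records: name:password:UID:GID
records() { printf 'root:x:0:0\nbin:x:1:1\ngames:x:12:100\n'; }

# -F: sets the field separator up front, so the first line is split correctly.
small_uids() {
    records | awk -F: '$3 < 10 {print $1 "\t" $3}'
}

small_uids
```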

awk Calculations

//show the original text
[root@localhost 桌面]# cat pay.txt
Name    1st    2nd    3th
Tom    2300    3200    1200
Sherry    3400    1200    7400
 
//in awk, variables are used directly without a $; when an action inside {} contains several statements, separate them with ";"
[root@localhost 桌面]# cat pay.txt | 
> awk 'NR==1{printf "%10s %10s %10s %10s %10s \n",$1,$2,$3,$4,"Total"}
> NR>=2{total=$2+$3+$4;printf "%10s %10d %10d %10d %10.2f \n",$1,$2,$3,$4,total}'
      Name        1st        2nd        3th      Total 
       Tom       2300       3200       1200    6700.00 
    Sherry       3400       1200       7400   12000.00
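As a further illustration, a running sum can be kept across lines and printed once at the end. This sketch assumes a POSIX awk and uses an END block, which runs after the last line and is not covered in the text here. The function names and the label "All" are hypothetical.

```shell
# Hypothetical copy of the pay.txt data used above.
pay() { printf 'Name 1st 2nd 3th\nTom 2300 3200 1200\nSherry 3400 1200 7400\n'; }

# Skip the header (NR>=2), add each row's three payments to a running sum,
# and print the grand total after the last line has been processed.
grand_total() {
    pay | awk 'NR>=2 {grand += $2 + $3 + $4}
               END   {printf "%10s %10.2f\n", "All", grand}'
}

grand_total
```

With the rows above, the grand total is 6700 + 12000 = 18700.00.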

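File Comparison Tools

diff

Used to compare similar files.

diff [-bBi] fileA fileB

Options:

-b: ignore differences in the amount of whitespace within a line

-B: ignore differences in blank lines

-i: ignore differences in case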

[root@localhost 桌面]# vim fileA
[root@localhost 桌面]# cp fileA fileB
[root@localhost 桌面]# vim fileB
[root@localhost 桌面]# cat fileA
this is fileA


[root@localhost 桌面]# cat fileB
this is fileB

ok
[root@localhost 桌面]# diff fileA fileB
1,2c1
< this is fileA
< 
---
> this is fileB
3a3
> ok
[root@localhost 桌面]#
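A short sketch of the -b and -i options, assuming a diff that supports both (GNU diff does). The two throwaway files are created with mktemp and their contents are invented for illustration. diff exits with status 0 when it finds no difference and 1 when it finds one.

```shell
# Two hypothetical files that differ only in spacing and letter case.
workdir=$(mktemp -d)
printf 'this  is   a file\n' > "$workdir/a"
printf 'This is a FILE\n'    > "$workdir/b"

# Plain diff reports a difference (exit status 1).
plain=0
diff "$workdir/a" "$workdir/b" > /dev/null || plain=$?

# With -b (amount of whitespace) and -i (case) ignored, the files compare equal.
relaxed=0
diff -b -i "$workdir/a" "$workdir/b" > /dev/null || relaxed=$?

echo "plain=$plain relaxed=$relaxed"
```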

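patch

This command is closely tied to diff. Suppose fileA and fileB are two versions of a file and you want to use fileB to update fileA: first compare the two files with diff and save the differences as a patch file, then use the patch file to update the old file.

patch -pN < patchFile       <== update

patch -R -pN < patchFile    <== revert

Options:

-p: the N that follows is the number of leading directory levels to strip from the file names

-R: revert the patch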

[root@localhost 桌面]# cat fileA
this is fileA


[root@localhost 桌面]# cat fileB
this is fileB

ok

//create the patch file
[root@localhost 桌面]# diff -Naur fileA fileB > file.patch
[root@localhost 桌面]# cat file.patch
--- fileA    2016-07-18 16:36:24.371373349 +0800
+++ fileB    2016-07-18 16:37:31.523401652 +0800
@@ -1,3 +1,3 @@
-this is fileA
-
+this is fileB
 
+ok

//use the patch file to update the old file; the files are in the current directory, so N is 0
[root@localhost 桌面]# patch -p0 < file.patch
patching file fileA
[root@localhost 桌面]# cat fileA
this is fileB

ok
[root@localhost 桌面]#
