awk

AWK: named after its authors Aho, Weinberger, and Kernighan

GNU awk => gawk

# ll `which awk`
/usr/bin/awk -> gawk

# man awk
pattern scanning and processing language.
It also serves as a report generator that produces formatted text output;

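Basic usage

awk [option] 'program' file

program: PATTERN{ACTION STATEMENTS}
  Statements are separated by semicolons;

  Output statements: print, printf

Options:
-F, --field-separator: sets the input field separator (the value of the FS predefined variable). If it is not given, fields are separated by whitespace;
-v var=val: defines a variable;

The awk program is normally enclosed in single quotes so that the shell leaves $1 and similar tokens alone; a single-quoted program cannot itself contain a single quote;

Each line of a file is called a record (Records);
each column of a record is called a field (Fields);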

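awk workflow

Step 1: run the statements in BEGIN{};
Step 2: read the input line by line and process each line;
Step 3: after all input has been processed, run the statements in END{};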
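
The three steps can be seen in a minimal sketch. The three input lines are made up for illustration and are supplied with printf, so no file is needed.

```shell
# BEGIN runs before any input, the main rule runs once per line, END runs last.
# The input "a", "b", "c" is hypothetical sample data.
out=$(printf 'a\nb\nc\n' | awk 'BEGIN{print "start"} {print NR": "$0} END{print "end, lines read: "NR}')
printf '%s\n' "$out"
# start
# 1: a
# 2: b
# 3: c
# end, lines read: 3
```

In the END rule NR still holds the number of the last record read, which is why it can serve as a line count.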

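The print command

print item1,item2,...

(1) Items are separated by commas; in the output they are joined by OFS, a space by default;
(2) An item can be a string, a number, a field, a variable, or an awk expression;
(3) If no item is given, the whole line is printed, the same as print $0;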

# tail -5 /etc/fstab | awk '{print $2,$4}'
# tail -5 /etc/fstab | awk '{print $2$4}'        #without a comma the fields are concatenated;
# tail -5 /etc/fstab | awk '{print $2 $4}'
    swapdefaults
    /media/cdromdefaults
    /homedefaults,usrquota,grpquota
    /mnt/lvmdefaults
    /mnt/btreedefaults
# tail -5 /etc/fstab | awk '{print "hello",$2,$4,6}'        #the number 6 is printed as text, but it is still a numeric value in arithmetic;
    hello swap defaults 6
    hello /media/cdrom defaults 6
    hello /home defaults,usrquota,grpquota 6
    hello /mnt/lvm defaults 6
    hello /mnt/btree defaults 6
# tail -5 /etc/fstab | awk '{print "hello:$1"}'        #inside the double quotes $1 is not expanded; it is printed as a literal string;
    hello:$1
    hello:$1
    hello:$1
    hello:$1
    hello:$1
# tail -5 /etc/fstab | awk '{print "hello:"$1}'
    hello:UUID=4b27a61a-4111-4d30-96ac-93cff82b227e
    hello:/dev/sr0
    hello:/dev/sda6
    hello:UUID="6b77b0f3-5a0e-4b28-924c-139f6334da5b"
    hello:UUID=3a3edcd8-a24f-414a-ace6-9952a3ca4891
# tail -5 /etc/fstab | awk '{print}'                    #prints the whole line;
    UUID=4b27a61a-4111-4d30-96ac-93cff82b227e swap swap defaults  0 0
    /dev/sr0    /media/cdrom    iso9660        defaults    0 0
    /dev/sda6    /home        ext4        defaults,usrquota,grpquota    0 0
    UUID="6b77b0f3-5a0e-4b28-924c-139f6334da5b" /mnt/lvm ext4 defaults 0 0
    UUID=3a3edcd8-a24f-414a-ace6-9952a3ca4891 /mnt/btree btrfs defaults 0 0
# tail -5 /etc/fstab | awk '{print ""}'            #prints one empty line per record, because no field of the input is referenced;

Variables

Built-in Variables

FS: The input field separator, a space by default. Setting it has the same effect as the "-F" option;
  # awk -v FS=':' '{print $1}' /etc/passwd |head -3 
    root
    bin
    daemon
  # awk -v FS=: '{print $1}' /etc/passwd        #the quotes around the value of FS can be omitted;
  # awk -F: '{print $1}' /etc/passwd
    
OFS: The output field separator, a space by default.
  # awk -v FS=: -v OFS=: '{print $1,$3,$7}' /etc/passwd | head -5 
    root:0:/bin/bash
    bin:1:/sbin/nologin
    daemon:2:/sbin/nologin
    adm:3:/sbin/nologin
    lp:4:/sbin/nologin
    
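RS: The input record separator, by default a newline.
  # awk -v RS=' ' '{print}' /etc/passwd        #a space becomes the record separator, so a new record starts at every space; the original newlines stay inside the records and still break lines when printed;
  
  # vim file2
    a:b c:d
    x:y:z
  # awk -v RS=':' '{print $1}' file2
    a
    b
    d
    y
    z
  # awk -v RS=':' '{print $2}' file2        #the record "d<newline>x" is split at the newline, because with the default FS a newline counts as whitespace; records without a second field print an empty line (left out of the listing below);
    c
    x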
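
To make the record boundaries visible, the sketch below feeds the same two lines as file2 through printf and prints the record number with the first two fields in brackets. The bracket layout is only an illustration aid.

```shell
# Same content as file2 above, supplied on stdin. RS=":" splits the input into five records.
out=$(printf 'a:b c:d\nx:y:z\n' | awk -v RS=':' '{print NR": ["$1"] ["$2"]"}')
printf '%s\n' "$out"
# 1: [a] []
# 2: [b] [c]
# 3: [d] [x]
# 4: [y] []
# 5: [z] []
```

Record 3 contains the original newline, which the default FS treats as whitespace, so d and x end up as two fields of the same record.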

ORS: The output record separator, by default a newline.
  # awk -v RS=' ' -v ORS='#' '{print}' /etc/passwd    | tail -5    #every record is terminated by # on output, so each space appears as #; the original newlines are part of the records and still break lines;
    radvd:x:75:75:radvd#user:/:/sbin/nologin
    sssd:x:983:978:User#for#sssd:/:/sbin/nologin
    gdm:x:42:42::/var/lib/gdm:/sbin/nologin
    gnome-initial-setup:x:982:977::/run/gnome-initial-setup/:/sbin/nologin
    named:x:25:25:Named:/var/named:/sbin/nologin

NF: The number of fields in the current input record.
  # awk '{print NF}' /etc/fstab        #prints the number of fields of each line;
  # awk '{print $NF}' /etc/fstab        #$NF is the last field;
  
NR: The total number of input records seen so far.
  # awk '{print NR}' /etc/fstab        #prints the line number of each line;
  # awk '{print NR}' /etc/fstab /etc/issue        #with several files the numbering continues across the files;
  
FNR: The input record number in the current input file; the numbering restarts for each file.
  # awk '{print FNR}' /etc/fstab /etc/issue    #each file is numbered separately;
    
FILENAME: The name of the current input file.
  # awk '{print FILENAME}' /etc/fstab /etc/issue    #prints the name of the current file once for every line read;
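
The difference between NR and FNR is easiest to see side by side. This is a small sketch that assumes two throwaway files, one.txt and two.txt, created in a temporary directory; the names and contents are hypothetical.

```shell
# Create two small sample files (hypothetical names and contents).
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/one.txt"
printf 'c\nd\ne\n' > "$tmp/two.txt"

# NR keeps counting across files, FNR restarts at 1 for each file.
out=$(cd "$tmp" && awk '{print FILENAME, NR, FNR}' one.txt two.txt)
printf '%s\n' "$out"
# one.txt 1 1
# one.txt 2 2
# two.txt 3 1
# two.txt 4 2
# two.txt 5 3

rm -rf "$tmp"
```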
    
ARGC: The number of command line arguments; does not include options to gawk, or the program source. The command name itself (ARGV[0]) is counted;
  # awk '{print ARGC}' /etc/fstab /etc/issue          #prints 3 once for every input line;
  # awk 'BEGIN{print ARGC}' /etc/fstab /etc/issue     #prints 3 once;
    
ARGV: Array of command line arguments. The array is indexed from 0 to ARGC-1.
  # awk 'BEGIN{print ARGV[0]}' /etc/fstab /etc/issue    #the command name, for example awk;
  # awk 'BEGIN{print ARGV[1]}' /etc/fstab /etc/issue    #/etc/fstab
  # awk 'BEGIN{print ARGV[2]}' /etc/fstab /etc/issue    #/etc/issue
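
A short sketch that walks the argument array. The arguments first and second are placeholder names; because the program has only a BEGIN rule, awk never tries to open them.

```shell
# Print ARGC, then every argument after the command name.
# "first" and "second" are hypothetical arguments, not real files.
out=$(awk 'BEGIN{print ARGC; for (i = 1; i < ARGC; i++) print i, ARGV[i]}' first second)
printf '%s\n' "$out"
# 3
# 1 first
# 2 second
```

ARGV[0] is left out of the loop because its value depends on how the interpreter was invoked.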

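User-defined variables

Variable names are case-sensitive;

(1) -v var=value;
  # awk -v test='hello awk' '{print test}' /etc/fstab          #prints the value once for every line of the file;
  # awk -v test='hello awk' 'BEGIN{print test}' /etc/fstab     #prints the value once;
    
(2) Defined inside the program;
  # awk 'BEGIN{test="hello gawk"; print test}'            #a program with only a BEGIN rule does not read any input;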

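The printf command

Formatted output:
  format and print data.
  
  # yum provides printf
  # rpm -ql coreutils
  
  printf FORMAT,item1,item2,...
  
  (1) FORMAT is required;
  (2) printf does not add a newline automatically; write \n explicitly wherever one is needed;
  (3) FORMAT needs one format specifier for each of the items that follow;
  (4) printf is not a pipeline command: it does not read its data from standard input;
    
Format specifiers:
  %c: a single character; a numeric argument is treated as a character code;
  %d, %i: decimal integer;
  %e, %E: number in scientific notation;
  %f: floating-point number;
  %g, %G: number in scientific or floating-point notation, whichever is shorter;
  %s: string;
  %u: unsigned integer;
  %%: a literal %;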
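
The specifiers can be tried without any input file. The values below are arbitrary sample values chosen for illustration.

```shell
# One printf per group of specifiers; all arguments are made-up sample values.
out=$(awk 'BEGIN{
    printf "%c|%d|%s|%%\n", 65, 42.9, "text"       # 65 is printed as the character A, 42.9 is truncated to 42
    printf "%f|%e|%g\n", 3.14159, 31415.9, 0.0001  # fixed, scientific, and shortest notation
}')
printf '%s\n' "$out"
# A|42|text|%
# 3.141590|3.141590e+04|0.0001
```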
    
  # awk -F: '{printf "%s\n",$1}' /etc/passwd        #the format string must be enclosed in quotes;
  # awk -F: '{printf "username: %s\n",$1}' /etc/passwd
  # awk -F: '{printf "username: %s, uid: %d\n",$1,$3}' /etc/passwd        #with several items, $1 fills the first specifier and $3 the second;
      
Modifiers:
  #[.#]: the first number sets the display width; the second sets the precision after the decimal point;
      %3.1f

  -: left-align; the default is right-align;
    # awk -F: '{printf "username: %-20s  uid: %d\n",$1,$3}' /etc/passwd    #prints $1 left-aligned in a field 20 characters wide;
        username: root                  uid: 0
        username: bin                   uid: 1
        username: daemon                uid: 2
    # awk -F: '{printf "username: %20s  uid: %d\n",$1,$3}' /etc/passwd    
        username:                 root  uid: 0
        username:                  bin  uid: 1
        username:               daemon  uid: 2

  +: show the sign of a number;
    %+d
    # awk -F: '{printf "%-20s | %+10d\n",$1,$3}' /etc/passwd

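The modifiers can be compared in a single line. This is a minimal sketch with arbitrary sample values; the square brackets are only there to make the padding visible.

```shell
# %-6s left-aligns, %6s right-aligns, %+d forces a sign, %6.2f sets width 6 and 2 decimals.
out=$(awk 'BEGIN{printf "[%-6s][%6s][%+d][%6.2f]\n", "ab", "ab", 7, 3.14159}')
printf '%s\n' "$out"
# [ab    ][    ab][+7][  3.14]
```
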
Operators

Arithmetic operators:
  x+y, x-y, x*y, x/y, x^y, x%y

Assignment operators:
  =, +=, -=, *=, /=, %=, ^=
  ++, --
    
Comparison operators:
  >, <, >=, <=, ==, !=

Pattern matching operators:
  ~: matches;
  !~: does not match;
  Regular expression match, negated match.
  
  # awk '$0 ~ /root/' /etc/passwd | wc -l 
  # awk '$0 !~ /root/' /etc/passwd | wc -l
    
Logical operators:
  &&
  ||
  !

  pattern && pattern
  pattern || pattern
  ! pattern
  
  # awk -F: '{if($3>=0 && $3<=1000) {print $1,$3}}' /etc/passwd
  # awk -F: '{if($3==0 || $3<=1000) {print $1,$3}}' /etc/passwd
  # awk -F: '{if(!($3>=500)) {print $1,$3}}' /etc/passwd
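
The three logical operators can be checked against a few passwd-style lines. The four accounts below are invented sample data, not taken from a real system.

```shell
# Hypothetical passwd-style records: name:password:uid:gid
data='root:x:0:0
daemon:x:2:2
alice:x:1000:1000
bob:x:1001:1001'

out=$(printf '%s\n' "$data" | awk -F: '{
    if ($3 >= 1 && $3 <= 999)  print $1, "and"   # both conditions must hold
    if ($3 == 0 || $3 >= 1001) print $1, "or"    # either condition is enough
    if (!($3 >= 1000))         print $1, "not"   # negates the condition
}')
printf '%s\n' "$out"
# root or
# root not
# daemon and
# daemon not
# bob or
```

Note that there is no semicolon between if (...) and the statement it controls; a semicolon there would end the if with an empty body, and the following statement would run for every line.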
      
Function calls:
  function_name(argu1,argu2,...)
    
Conditional expression:
  ?:
  The C conditional expression.  This has the form expr1 ? expr2 : expr3.  
  If expr1 is true, the value of the expression is expr2, otherwise it is expr3. 
  pattern ? pattern : pattern
  selector?if-true-expression:if-false-expression        #if the selector is true the result is the if-true expression, otherwise the if-false expression;
  
  # awk -F: '{usertype = ($3>=1000 ? "common user" : "sysadmin or sysuser"); printf "%15s: %-s\n",$1,usertype}' /etc/passwd

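A self-contained sketch of the conditional expression, using two invented passwd-style lines so that the result does not depend on the local /etc/passwd.

```shell
# Hypothetical input: name:password:uid. The ternary picks one of two strings.
out=$(printf 'root:x:0\nalice:x:1000\n' | awk -F: '{type = ($3 >= 1000 ? "regular" : "system"); printf "%-6s %s\n", $1, type}')
printf '%s\n' "$out"
# root   system
# alice  regular
```

Assigning the result of the whole expression, rather than placing an assignment in each branch, keeps the statement readable and avoids precedence surprises.
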
PATTERN

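Similar to address ranges in sed;

(1) empty: the empty pattern; every line is processed;

(2) /pattern/: only lines that match the pattern are processed; the pattern is a regular expression written between two slashes, /regular expression/;
    
    # awk '/^UUID/{print $1}' /etc/fstab        #prints the first field of the lines that start with UUID;
    # awk '!/^UUID/{print $1}' /etc/fstab       #negated: prints the first field of the lines that do not start with UUID;
    
(3) relational expression: evaluates to true or false; a line is processed only when the result is true, otherwise it is skipped;
    true: a non-zero number or a non-empty string;
    
    # awk -F: '$3>=1000 {print $1,$3}' /etc/passwd              #lines whose uid is at least 1000;
    # awk -F: '$NF=="/bin/bash" {print $1,$NF}' /etc/passwd     #lines whose last field is /bin/bash; note ==, a single = would assign to $NF;
    # awk -F: '$NF~/bash$/ {print $1,$NF}' /etc/passwd          #lines whose last field ends with bash;
        
(4) line ranges;
    /PAT1/,/PAT2/: from a line matching PAT1 through the next line matching PAT2, inclusive; the range starts again whenever PAT1 matches later in the input;

    # awk -F: '/^root/,/^mysql/ {print $1}' /etc/passwd
    # awk -F: '(NR>=2 && NR<=10) {print $1}' /etc/passwd        #a range cannot be given directly as line numbers; use a condition on NR instead;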
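
The sketch below shows that a range pattern can match more than once and that a line-number range is written as a condition on NR. The nine input lines are made up for illustration.

```shell
# Hypothetical input with two start/stop blocks.
input='x
start
one
stop
y
start
two
stop
z'

# Range pattern: every block from "start" through "stop" is selected.
out=$(printf '%s\n' "$input" | awk '/^start$/,/^stop$/ {print NR": "$0}')
printf '%s\n' "$out"
# 2: start
# 3: one
# 4: stop
# 6: start
# 7: two
# 8: stop

# Line-number range expressed with NR.
out2=$(printf '%s\n' "$input" | awk 'NR>=2 && NR<=3')
printf '%s\n' "$out2"
# start
# one
```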
        
(5) BEGIN/END patterns;
    BEGIN{}: runs once, before any input is processed;
    END{}: runs once, after all input has been processed;
    
    # awk -F: 'BEGIN{sum=0} {sum+=$3} END{print sum}' /etc/passwd         #sum of the uids of all users on the system;
    # awk -F: 'BEGIN{sum=0} {sum+=$3} END{print sum/NR}' /etc/passwd      #average uid;

    # awk -F: 'BEGIN{print "    username        uid    \n---------------------------"}'        #prints a table header;
            username        uid    
        ---------------------------
        
    # awk -F: 'BEGIN{print "    username        uid    \n---------------------------"}; END{print "===========================\n          END"}' /etc/passwd        #prints a table header and footer;
            username        uid    
        ---------------------------
        ===========================
                      END

Common Actions

1. Expressions;
2. Control Statements;
3. Compound Statements;
    {statements}: several statements that belong together must be grouped in braces; 
      Action statements are enclosed in braces,{ and }.
4. Input Statements;
5. Output Statements;
    print,printf,next,system(cmd-line),fflush([file])...
    
    fflush([file]): flushes buffered output; printf output is buffered by default, although this usually goes unnoticed;
      Flush any buffers associated with the open output file or pipe file.
      If file is the null string, then flush all open output files and pipes.
    
    system(cmd-line): runs a system command; Execute the command cmd-line, and return the exit status.
      
    # awk 'BEGIN{system("hostname")}'        #the command must be enclosed in double quotes;
    # echo $?

Control Statements

if (condition) statement [ else statement ]
while (condition) statement
do statement while (condition) #runs the loop body once before the condition is tested;
for (expr1; expr2; expr3) statement
break
continue
delete array[index] #deletes one element of the array;
delete array #deletes the whole array;
exit
{ statements }
switch (expression) {
case value|regex : statement
...
[ default: statement ]
}

if-else

if (condition) {statement} [ else {statement} ]

# awk -F: '{if ($3>=1000) print $1,$3}' /etc/passwd
# awk -F: '{if ($3>=1000) {printf "common user: %s\n",$1} else {printf "root or sysuser: %s\n",$1}}' /etc/passwd
# awk -F: '{if ($NF=="/bin/bash") print $1}' /etc/passwd        #lines whose last field is /bin/bash;
# awk '{if (NF>5) print $0}' /etc/passwd                        #lines with more than 5 fields;
# df -h | awk -F% '/^\/dev/ {print $1}' | awk '{if ($NF>=10) print $1,$NF}'        #devices whose usage is at least 10%;

while

while (condition) statement
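Used to apply the same or similar processing to several fields of one line;
also used to process the elements of an array one by one;

# awk '/^[[:space:]]*linux16/ {i=1; while (i<=NF) {print $i,length($i); i++}}' /etc/grub2.cfg    #selects the lines that start with linux16 (after optional whitespace) and prints the length of every field;
# awk '/^[[:space:]]*linux16/ {i=1; while (i<=NF) {if (length($i)>=7) {print $i,length($i)}; i++}}' /etc/grub2.cfg    #further restricts the output to fields that are at least 7 characters long;

do-while

do statement while (condition)
The loop body runs at least once;

# awk 'BEGIN{sum=0;i=0; do {sum+=i;i++;} while (i<=100); print sum}'        #sum of the numbers 1 to 100;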
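
The "at least once" behaviour can be checked by giving both loop forms a condition that is false from the start. This is a minimal sketch; the counters n and m are names chosen only for this example.

```shell
out=$(awk 'BEGIN{
    i = 10; do { n++ } while (i < 5)   # condition is false, but the body has already run once
    j = 10; while (j < 5) { m++ }      # condition is tested first, so the body never runs
    print "do-while iterations:", n
    print "while iterations:", m + 0   # m was never set; m + 0 prints it as the number 0
}')
printf '%s\n' "$out"
# do-while iterations: 1
# while iterations: 0
```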

for

for (expr1; expr2; expr3) statement
for (variable assignment; condition; iteration process) statement

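# awk '/^[[:space:]]*linux16/ {for (i=1;i<=NF;i++) print $i,length($i)}' /etc/grub2.cfg        #prints every field of the lines that start with linux16, together with its length;

Special form:
  iterates over the indexes of an array;
  for (var in array) statement

Performance of awk compared with a shell loop: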

# time (awk 'BEGIN{sum=0;for(i=0;i<=1000000;i++){sum+=i;};print sum}')
    real    0m0.134s
    user    0m0.129s
    sys        0m0.003s
# time (sum=0;for i in $(seq 1000000);do let sum+=$i;done;echo $sum)
    real    0m10.438s
    user    0m10.021s
    sys        0m0.263s
    
In this test awk is far faster than the shell loop (about 0.13s against about 10.4s);

switch

switch (expression) {
  case value|regex : statement
  ...
  [ default: statement ]
  }

switch (expression) {case value|regex : statement; case value|regex : statement; ...[ default: statement ]}
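
A short sketch of a complete switch statement. switch is a gawk extension, so the example calls gawk explicitly and is skipped when gawk is not installed; the three input words are invented sample data.

```shell
if command -v gawk >/dev/null 2>&1; then
    # A case label is either a constant or a regular expression; break prevents fall-through.
    out=$(printf 'start\n42\nxyz\n' | gawk '{
        switch ($1) {
        case "start":
            print $1, "-> keyword"
            break
        case /^[0-9]+$/:
            print $1, "-> number"
            break
        default:
            print $1, "-> other"
        }
    }')
    printf '%s\n' "$out"
else
    echo "gawk not found; switch is not available in other awk implementations"
fi
```

As in C, execution falls through into the next case when break is left out.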

next

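Ends the processing of the current line early and moves straight on to the next line; similar to continue in a loop;

# awk -F: '{if ($3%2!=0) next; print $1,$3}' /etc/passwd        #prints the lines whose uid is even;
# awk -F: '{if ($3%2==0) print $1,$3}' /etc/passwd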
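
next also skips every rule that follows, not only the rest of the current action. A minimal sketch with the numbers 1 to 4 as made-up input:

```shell
# The first rule discards odd numbers; the second rule is reached only for the remaining lines.
out=$(printf '1\n2\n3\n4\n' | awk '$1 % 2 != 0 {next} {print "even:", $1}')
printf '%s\n' "$out"
# even: 2
# even: 4
```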

array

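Associative arrays: array[index-expression]

index-expression:
(1) Any string can be used as an index; string literals must be enclosed in double quotes;
(2) If an element does not exist yet, referencing it makes awk create it with the empty string as its value, which counts as 0 in arithmetic;

# awk 'BEGIN{weekdays["mon"]="Monday"; weekdays["tue"]="Tuesday"; print weekdays["tue"]}'

To test whether an element exists in an array, use the form "index in array";

To visit the index of every element, use a for statement:
for (var in array) statement

# awk 'BEGIN{weekdays["mon"]="Monday"; weekdays["tue"]="Tuesday"; for (i in weekdays) {print i,weekdays[i]}}'

Note: var takes each index of the array in turn; it holds the index, not the value of the element;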
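
The difference between testing with in and referencing an element can be sketched as follows. The array name w and the counter n are chosen only for this example.

```shell
out=$(awk 'BEGIN{
    w["mon"] = "Monday"

    # Membership test: does not create the element.
    if ("tue" in w) { print "tue exists" } else { print "tue missing" }
    n = 0; for (k in w) n++
    print "elements after in-test:", n

    # Plain reference: creates w["tue"] with an empty value.
    if (w["tue"] == "") { print "tue is empty" }
    n = 0; for (k in w) n++
    print "elements after reference:", n
}')
printf '%s\n' "$out"
# tue missing
# elements after in-test: 1
# tue is empty
# elements after reference: 2
```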

# netstat -tan 
# netstat -tan | awk '/^tcp\>/{print}'        #\> matches at the end of a word (a GNU extension), so only lines whose first word is exactly tcp are selected;
    tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN
    tcp        0      0 192.168.122.1:53        0.0.0.0:*               LISTEN
    tcp        0      0 192.168.135.129:53      0.0.0.0:*               LISTEN
    tcp        0      0 127.0.0.1:53            0.0.0.0:*               LISTEN
    tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
    tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
    tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
    tcp        0      0 127.0.0.1:953           0.0.0.0:*               LISTEN
    tcp        0     52 192.168.135.129:22      192.168.135.1:59129     ESTABLISHED
# netstat -tan | awk '/^tcp\>/{state[$NF]++} END{for (i in state) {print i,state[i]}}'
    LISTEN 8
    ESTABLISHED 1
# ab -c 100 -n 1000 http://172.16.100.9/index.html
    ab - Apache HTTP server benchmarking tool.
      -c concurrency: Number of multiple requests to perform at a time. Default is one request at a time.
      -n requests: Number of requests to perform for the benchmarking session. 
                   The default is to just perform a single request which usually leads to non-representative benchmarking results.
        
    # state[$NF] does not exist beforehand; the first reference creates it with an empty value, which counts as 0;
        
# awk '{ip[$1]++}; END{for (i in ip) {print i,ip[i]}}' /var/log/httpd/access_log

1. Count how many times each filesystem type appears in /etc/fstab;

# awk '/^UUID/{fs[$3]++}; END{for (i in fs) {print i,fs[i]}}' /etc/fstab

2. Count how many times each word appears in a given file;

# awk '{for (i=1;i<=NF;i++) {count[$i]++}}; END{for (i in count) {print i,count[i]}}' /etc/fstab
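
The same counting idiom on a small, made-up input. The order in which for (var in array) visits the indexes is not defined, so the sketch pipes the result through sort to get a stable listing.

```shell
# Hypothetical input: five words on two lines.
out=$(printf 'ext4 xfs ext4\nswap ext4\n' | awk '{for (i = 1; i <= NF; i++) count[$i]++} END{for (w in count) print w, count[w]}' | sort)
printf '%s\n' "$out"
# ext4 3
# swap 1
# xfs 1
```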

Functions

Built-in Functions

Numeric Functions

rand(): Return a random number N, between 0 and 1, such that 0 ≤ N < 1.
  # awk 'BEGIN{print rand()}'            #without srand() the generator always starts from the same seed, so every run prints the same value;
srand([expr]): sets the seed of the random number generator. If no expr is provided, use the time of day.    
  # awk 'BEGIN{srand(); for (i=1;i<=10;i++) print int(rand()*100)}'
    
int(expr): Truncate to integer.

String Functions

length([s]): Return the length of the string [s], or the length of $0 if [s] is not supplied.
# awk '{print length()}' /etc/passwd        #with no argument the length of $0, the whole line, is returned;
    
sub(r,s[,t]): searches t for the pattern r and replaces the first match with s; if t is omitted, $0 is used. Just like gsub(),but replace only the first matching substring.
gsub(r,s[,t]): global replacement; replaces every match;
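
A minimal sketch comparing the two functions on the same made-up string. The variables s, t, and n are names chosen for this example; both functions modify their target in place and return the number of replacements.

```shell
# sub() replaces only the first "-", gsub() replaces all of them and reports how many.
out=$(printf 'a-b-c\n' | awk '{s = $0; t = $0; sub(/-/, ":", s); n = gsub(/-/, ":", t); print s; print t, n}')
printf '%s\n' "$out"
# a:b-c
# a:b:c 2
```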

split(s,a[,r]): splits the string s on the separator r and stores the pieces in the array a; If r is omitted, FS is used instead.
  # netstat -tan | awk '/^tcp\>/{split($5,ip,":"); print ip[1]}'
  # netstat -tan | awk '/^tcp\>/{split($5,ip,":"); count[ip[1]]++}; END{for (i in count) {print i,count[i]}}'
      
    Note: the indexes of the array filled by split start at 1;
      
  # awk '{split($0,t); n=asort(t); for(i=0;++i<=n;) $i=t[i]; print $0}' /etc/passwd    #sorts the fields of each line and prints the rebuilt line; asort() is a gawk extension;

  substr(s,i[,n]): returns n characters of the string s, starting at (and including) the i-th character; if n is omitted, everything up to the end of s is returned;
    # awk '{print substr($1,3)}' /etc/passwd        #the first field from its third character to the end;
    # awk -F: '{print substr($1,3)}' /etc/passwd
    
    Input data (file1.txt):
      F115!16201!1174113017250745 10.86.96.41 211.140.16.1 200703180718
      F125!16202!1174113327151715 10.86.96.42 211.140.16.2 200703180728
      F235!16203!1174113737250745 10.86.96.43 211.140.16.3 200703180738
      F245!16204!1174113847250745 10.86.96.44 211.140.16.4 200703180748
      F355!16205!1174115827252725 10.86.96.45 211.140.16.5 200703180758

    Extract the mobile phone numbers from the file:
      # awk -F'[ !]' '{print substr($3,6)}' file1.txt        #a bracket expression specifies several separator characters at once;
      13017250745
      13327151715
      13737250745
      13847250745
      15827252725

Time Functions

strftime([format [,timestamp]]): Format timestamp according to the specification in format.
    
systime(): Return the current time of day as the number of seconds since the Epoch (1970-01-01 00:00:00 UTC);

# awk 'BEGIN{print systime()}'        #same as `date +%s`
1468038105

# ping 116.113.108.196 | awk '{now=strftime("%Y-%m-%d %H:%M:%S",systime()); printf "%s : %s\n",now,$0; fflush();}' 

++i increments i first and then uses its value;
i++ uses the value of i first and then increments i;
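
The two forms can be compared directly. The variables a and b are names chosen only for this sketch.

```shell
# a receives the old value of i (5); b receives the value after incrementing again (7).
out=$(awk 'BEGIN{i = 5; a = i++; b = ++i; print a, b, i}')
printf '%s\n' "$out"
# 5 7 7
```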
