Advanced Bash-Scripting Guide (27): Text Processing Commands (Part 3)

Success comes from persistence; failure comes from stopping.

Commands affecting text and text files

tr

Character translation filter.

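Use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to keep the shell from expanding them.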

Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes every uppercase letter in filename to an asterisk (writing to stdout). On some systems this may not work, but tr A-Z '[**]' will.
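As a quick check of the three forms above, here is a minimal sketch. It assumes GNU tr (from coreutils), which pads a short second set by repeating its last character; other implementations may behave differently. The sample string and variable names are made up for illustration.

```shell
#!/bin/bash
# Assumes GNU tr. The input string is a hypothetical sample.
sample="Hello World"

quoted=$(echo "$sample" | tr 'A-Z' '*')     # quoted asterisk
escaped=$(echo "$sample" | tr A-Z \*)       # backslash-escaped asterisk
repeat=$(echo "$sample" | tr A-Z '[**]')    # [c*] construct: repeat "*" to fill the set

echo "$quoted"
echo "$escaped"
echo "$repeat"
```

Under GNU tr all three variables end up holding the same string, with every uppercase letter replaced by an asterisk.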

The -d option deletes a range of characters.

root@ubuntu:~/resource/shell-study/0621-2013# echo "abcdef"
abcdef
root@ubuntu:~/resource/shell-study/0621-2013# echo "abcdef" | tr -d b-d
aef
root@ubuntu:~/resource/shell-study/0621-2013# gedit file
root@ubuntu:~/resource/shell-study/0621-2013# cat file 
hello123
456 do you like 789
no 0 is my love
root@ubuntu:~/resource/shell-study/0621-2013# tr -d 0-9 < file 
hello
 do you like 
no  is my love
root@ubuntu:~/resource/shell-study/0621-2013#

The --squeeze-repeats (or -s) option deletes all but the first character in a run of repeated characters. This option is very useful for removing excess whitespace.

root@ubuntu:~/resource/shell-study/0621-2013# echo "xxxxxx" | tr -s 'x'
x
root@ubuntu:~/resource/shell-study/0621-2013#
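The whitespace use case mentioned above can be sketched as follows. The input line is a made-up sample; the second command relies on tr translating first and then squeezing the characters of the second set.

```shell
#!/bin/bash
line="too    many     spaces   here"     # hypothetical input

# Collapse each run of spaces into a single space.
squeezed=$(echo "$line" | tr -s ' ')
echo "$squeezed"

# Translate spaces to newlines, then squeeze repeated newlines:
# this prints one word per line.
words=$(echo "$line" | tr -s ' ' '\n')
echo "$words"
```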

The -c ("complement") option inverts the character set to match. With this option, tr acts only on the characters that do not match the specified set.

root@ubuntu:~/resource/shell-study/0621-2013# echo "abcd2ef1" | tr -c b-d +
+bcd+++++root@ubuntu:~/resource/shell-study/0621-2013#

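Every character outside the given range is replaced with "+". That includes the trailing newline, which is why the next prompt appears on the same line as the output.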
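A common combination is -c together with -d, which deletes everything outside a set. The following is a small sketch of that idea, reusing the sample string from the transcript above; the variable names are only for illustration.

```shell
#!/bin/bash
input="abcd2ef1"

# Delete the complement of the digit class, i.e. keep only the digits.
# The trailing newline is not a digit, so it is deleted as well.
digits=$(echo "$input" | tr -cd '[:digit:]')
echo "$digits"

# To keep the newline, add it to the set being complemented.
echo "$input" | tr -cd '[:digit:]\n'
```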

tr also supports the POSIX character classes.

root@ubuntu:~/resource/shell-study/0621-2013# echo "abcd2ef1" | tr '[:alpha:]' -
----2--1
root@ubuntu:~/resource/shell-study/0621-2013#

An example: converting the entire contents of a file to uppercase

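#!/bin/bash
# Converts the entire contents of a file to uppercase.

E_BADARGS=65

if [ -z "$1" ]  # Check for a command-line argument.
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi

tr '[:lower:]' '[:upper:]' <"$1"
# tr a-z A-Z <"$1"
# The simpler range form; same effect for plain ASCII text.

exit 0

Result: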

root@ubuntu:~/resource/shell-study/0621-2013# chmod +x test1.sh 
root@ubuntu:~/resource/shell-study/0621-2013# cat file 
hello123
456 do you like 789
no 0 is my love
root@ubuntu:~/resource/shell-study/0621-2013# ./test1.sh file 
HELLO123
456 DO YOU LIKE 789
NO 0 IS MY LOVE
root@ubuntu:~/resource/shell-study/0621-2013#

A related example: converting every filename in the current directory to uppercase

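#!/bin/bash
#
#  Changes every filename in the current directory to uppercase.
#
#  Inspired by a script of John Dubois,
#+ which was translated into Bash by Chet Ramey,
#+ and then simplified by the author of this book.


for filename in *                # Traverse all files in the current directory.
do
   fname=`basename $filename`
   n=`echo $fname | tr a-z A-Z`  # Change the name to uppercase.
   if [ "$fname" != "$n" ]       # Rename only files that are not already uppercase.
   then
     mv $fname $n
   fi
done

exit 0

Result: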

root@ubuntu:~/resource/shell-study/0621-2013# chmod +x test2.sh 
root@ubuntu:~/resource/shell-study/0621-2013# ./test2.sh 
root@ubuntu:~/resource/shell-study/0621-2013# ls
file  test1.sh  test2.sh
root@ubuntu:~/resource/shell-study/0621-2013# ./test2.sh 
root@ubuntu:~/resource/shell-study/0621-2013# ls
FILE  TEST1.SH  TEST2.SH
root@ubuntu:~/resource/shell-study/0621-2013#

Now let's change the names back O(∩_∩)O~

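#!/bin/bash
#
#  Changes every filename in the current directory to lowercase.
#
#  Inspired by a script of John Dubois,
#+ which was translated into Bash by Chet Ramey,
#+ and then simplified by the author of this book.


for filename in *                # Traverse all files in the current directory.
do
   fname=`basename $filename`
   n=`echo $fname | tr A-Z a-z`  # Change the name to lowercase.
   if [ "$fname" != "$n" ]       # Rename only files that are not already lowercase.
   then
     mv $fname $n
   fi
done

exit 0

Result: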

root@ubuntu:~/resource/shell-study/0621-2013# ls
FILE  TEST1.SH  TEST2.SH
root@ubuntu:~/resource/shell-study/0621-2013# ./TEST2.SH 
root@ubuntu:~/resource/shell-study/0621-2013# ls
file  test1.sh  test2.sh
root@ubuntu:~/resource/shell-study/0621-2013#

An improved approach:

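#!/bin/bash

# The scripts above will not work on filenames containing blanks or newlines.
# Stephane Chazelas therefore suggests the following alternative:


for filename in *    # No need to use basename,
                     # since "*" never returns a name containing "/".
do n=`echo "$filename/" | tr '[:upper:]' '[:lower:]'`
#                             POSIX character set notation.
#                    The slash is added so that a trailing newline in the
#                    filename is not removed by command substitution.
   # Variable substitution:
   n=${n%/}          # Removes the trailing slash, added above, from the filename.
   [[ $filename == "$n" ]] || mv "$filename" "$n"
                     # Checks whether the filename is already lowercase.
done

exit 0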
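To see why the trailing slash matters, here is a minimal sketch that only manipulates a string and never touches the filesystem. The name ending in a newline is a hypothetical example, and `plain` and `guarded` are illustrative variable names.

```shell
#!/bin/bash
name=$'ABC\n'            # hypothetical filename ending in a newline (4 characters)

# Without the guard, command substitution strips the trailing newline.
plain=$(echo "$name" | tr '[:upper:]' '[:lower:]')

# With the guard, the slash protects the newline; it is removed afterwards.
guarded=$(echo "$name/" | tr '[:upper:]' '[:lower:]')
guarded=${guarded%/}

echo "without slash: ${#plain} characters"
echo "with slash:    ${#guarded} characters"
```

Only the guarded version keeps all four characters of the name, so only it would produce the correct target for mv.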

Next example: converting a DOS text file to UNIX format

#!/bin/bash
# DOS to UNIX text file conversion.

E_WRONGARGS=65

if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename-to-convert"
  exit $E_WRONGARGS
fi

NEWFILENAME=$1.unx

CR='\015'  # Carriage return.
           # 015 is the octal ASCII code for CR.
           # Lines in a DOS text file end in CR-LF.
           # Lines in a UNIX text file end in LF only.

tr -d "$CR" < "$1" > "$NEWFILENAME"
# Delete the CRs and write the result to the new file.

echo "Original DOS text file is \"$1\"."
echo "Converted UNIX text file is \"$NEWFILENAME\"."

exit 0

Result:

root@ubuntu:~/resource/shell-study/0621-2013# ls
file  test1.sh  test2.sh  test3.sh  test4.sh
root@ubuntu:~/resource/shell-study/0621-2013# ./test4.sh file 
Original DOS text file is "file".
Converted UNIX text file is "file.unx".
root@ubuntu:~/resource/shell-study/0621-2013# ls
file  file.unx  test1.sh  test2.sh  test3.sh  test4.sh
root@ubuntu:~/resource/shell-study/0621-2013# cat file.unx 
hello123
456 do you like 789
no 0 is my love
root@ubuntu:~/resource/shell-study/0621-2013#
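The sample file in the transcript above had no carriage returns to begin with, so the effect is hard to see. The sketch below builds a small CR-LF file and checks that the same tr -d '\015' step removes the CRs. The temporary files and their contents are made up for illustration.

```shell
#!/bin/bash
dosfile=$(mktemp)        # temporary files; the names are chosen by mktemp
unixfile=$(mktemp)

printf 'line one\r\nline two\r\n' > "$dosfile"   # two CR-LF terminated lines
tr -d '\015' < "$dosfile" > "$unixfile"          # same deletion as in the script

before=$(wc -c < "$dosfile")
after=$(wc -c < "$unixfile")
echo "bytes before: $before, after: $after"      # one byte fewer per line

od -c "$unixfile"        # the dump should contain \n but no \r
rm -f "$dosfile" "$unixfile"
```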

Next example: ultra-weak encryption

#!/bin/bash
# The classic rot13 algorithm,
# an "encryption" that might fool a 3-year-old.

# Usage: ./sh filename
# or     ./sh <filename
# or     ./sh and supply keyboard input (stdin)

cat "$@" | tr 'a-zA-Z' 'n-za-mN-ZA-M'   # "a" goes to "n", "b" to "o", etc.
#  The 'cat "$@"' construct
#+ permits getting input either from stdin or from files.

exit 0

Result:

root@ubuntu:~/resource/shell-study/0621-2013# cat file
hello123
456 do you like 789
no 0 is my love
root@ubuntu:~/resource/shell-study/0621-2013# ./test5.sh file > file-u
root@ubuntu:~/resource/shell-study/0621-2013# cat file-u 
uryyb123
456 qb lbh yvxr 789
ab 0 vf zl ybir
root@ubuntu:~/resource/shell-study/0621-2

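If we can encrypt, of course we also need to decrypt. Since rot13 is its own inverse, running the same script on the encrypted file restores the original: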

root@ubuntu:~/resource/shell-study/0621-2013# cat file-u
uryyb123
456 qb lbh yvxr 789
ab 0 vf zl ybir
root@ubuntu:~/resource/shell-study/0621-2013# ./test5.sh file-u > file-d
root@ubuntu:~/resource/shell-study/0621-2013# cat file-d
hello123
456 do you like 789
no 0 is my love
root@ubuntu:~/resource/shell-study/0621-2013#

Next, a fun example: generating "Crypto-Quote" puzzles (translator's note: a kind of word game)

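#!/bin/bash
# crypto-quote.sh: Encrypt quotes

#  Encrypts text using a monoalphabetic substitution cipher
#+ (each letter is always replaced by the same other letter).
#  The result is similar to the "Crypto Quote" puzzles.


key=ETAOINSHRDLUBCFGJMQPVWZYXK
# The "key" is nothing more than a scrambled alphabet.
# Changing the "key" changes the encryption.

# The 'cat "$@"' construct gets input either from stdin or from files.
# If using stdin, terminate input with Control-D.
# Otherwise, specify the filename on the command line.

cat "$@" | tr "a-z" "A-Z" | tr "A-Z" "$key"
#        |  to uppercase  |     encrypt
# Works on lowercase, uppercase, or mixed-case input.
# Non-alphabetic characters are passed through unchanged.


# Try this script with something like:
# "Nothing so needs reforming as other people's habits."
# --Mark Twain
#
# Output is:
# "CFPHRCS QF CIIOQ MINFMBRCS EQ FPHIM GIFGUI'Q HETRPQ."
# --BEML PZERC

# To decrypt:
# cat "$@" | tr "$key" "A-Z"


#  This simple-minded cipher can easily be broken by a 12-year-old
#+ using only pencil and paper.

exit 0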
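The encrypt and decrypt steps from the script can be checked with a short round trip. This is only a sketch: the function names `encrypt` and `decrypt` are hypothetical wrappers around the two tr pipelines shown in the script, and the key and quote are the ones it already uses.

```shell
#!/bin/bash
key=ETAOINSHRDLUBCFGJMQPVWZYXK       # same key as in the script

# Hypothetical helper names wrapping the two pipelines from the script.
encrypt() { tr "a-z" "A-Z" | tr "A-Z" "$key"; }
decrypt() { tr "$key" "A-Z"; }

plain="Nothing so needs reforming as other people's habits."
cipher=$(echo "$plain" | encrypt)
restored=$(echo "$cipher" | decrypt)

echo "$cipher"
echo "$restored"      # uppercase only: the original letter case is not recoverable
```

Because the text is uppercased before encryption, decryption returns the quote in capitals rather than in its original mixed case.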

This example is quite simple, so I won't explain it further.

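Historically, tr has had two major variants. The BSD version does not use brackets (tr a-z A-Z), but the SysV version does (tr '[a-z]' '[A-Z]'). The GNU version of tr resembles the BSD one. For scripts that must run under both variants, the quoted bracket form is the safer choice, since BSD-style and GNU tr simply map the brackets to themselves.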
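The two notations can be compared directly. The sketch below assumes GNU tr (or another BSD-style tr), where brackets inside a quoted set are ordinary characters; the sample string is made up and deliberately contains brackets.

```shell
#!/bin/bash
# Quoted bracket form (SysV style). On GNU tr the brackets are ordinary
# characters that are translated to themselves.
upper=$(echo "abc [xyz]" | tr '[a-z]' '[A-Z]')
echo "$upper"

# Bracket-free form (BSD/GNU style).
upper2=$(echo "abc [xyz]" | tr 'a-z' 'A-Z')
echo "$upper2"
```

Both commands give the same result here, and the literal brackets in the input pass through unchanged.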

fold

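Wraps lines of input to a specified width. A particularly useful option is -s, which breaks lines at spaces rather than in the middle of a word (translator's note: breaking at spaces only matters for languages that separate words with spaces; Chinese does not need it).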

root@ubuntu:~/resource/shell-study/0621-2013# cat file
hello123
456 do you like 789
no 0 is my love
root@ubuntu:~/resource/shell-study/0621-2013# fold -w 5 file
hello
123
456 d
o you
 like
 789
no 0 
is my
 love
root@ubuntu:~/resource/shell-study/0621-2013#
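The transcript above shows fold without -s, where words are cut wherever the width runs out. A small sketch of the difference the -s option makes, using a made-up sentence and a width of 12:

```shell
#!/bin/bash
text="the quick brown fox jumps"     # hypothetical input

echo "$text" | fold -w 12            # may split a word in the middle
echo "---"
echo "$text" | fold -s -w 12         # breaks at spaces where possible
```

With -s, each output line ends at a space, so the words stay intact at the cost of slightly shorter lines.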

fmt

A simple file formatter, typically used as a filter in a pipe to "wrap" long lines of text output.

An example: formatting a file listing

#!/bin/bash

WIDTH=40                    # 40 columns wide.

b=`ls ./`       # Get the file listing...

echo $b | fmt -w $WIDTH

# The following would also work, with the same effect.
#    echo $b | fold - -s -w $WIDTH

exit 0

Result:

root@ubuntu:~/resource/shell-study/0621-2013# ls ./
file    file-u    test1.sh  test3.sh  test5.sh  test7.sh
file-d  file.unx  test2.sh  test4.sh  test6.sh
root@ubuntu:~/resource/shell-study/0621-2013# ./test7.sh 
file file-d file-u file.unx test1.sh
test2.sh test3.sh test4.sh test5.sh
test6.sh test7.sh
root@ubuntu:~/resource/shell-study/0621-2013#

col

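This filter removes reverse line feeds from its input. It can also replace whitespace with equivalent tabs. The main use of col is filtering the output of certain text processing utilities, such as groff and tbl. (Translator's note: it is mainly used to convert man pages to plain text.)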

column

Column formatter. This filter transforms list-type text into a "pretty-printed" table by inserting tabs at the appropriate places.

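#!/bin/bash
# This is a slight modification of the example in the "column" man page.


(printf "PERMISSIONS LINKS OWNER GROUP SIZE MONTH DAY HH:MM PROG-NAME\n" \
; ls -l | sed 1d) | column -t

#  The "sed 1d" in the pipe deletes the first line of output,
#+ which would be "total        N",
#+ where "N" is the total number of disk blocks used by the listed files.

# The -t option to "column" pretty-prints a table.

exit 0

Result: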

root@ubuntu:~/resource/shell-study/0621-2013# ls -l
total 48
-rw-r--r-- 1 root root   45 2013-06-21 04:04 file
-rw-r--r-- 1 root root   45 2013-06-21 03:51 file-d
-rw-r--r-- 1 root root   45 2013-06-21 03:48 file-u
-rw-r--r-- 1 root root   45 2013-06-21 03:40 file.unx
-rwxr-xr-x 1 root root  257 2013-06-21 03:13 test1.sh
-rwxr-xr-x 1 root root  517 2013-06-21 03:34 test2.sh
-rwxr-xr-x 1 root root  742 2013-06-21 03:34 test3.sh
-rwxr-xr-x 1 root root  534 2013-06-21 03:40 test4.sh
-rwxr-xr-x 1 root root  312 2013-06-21 03:50 test5.sh
-rw-r--r-- 1 root root 1040 2013-06-21 03:54 test6.sh
-rwxr-xr-x 1 root root  220 2013-06-21 04:11 test7.sh
-rwxr-xr-x 1 root root  411 2013-06-21 04:16 test8.sh
root@ubuntu:~/resource/shell-study/0621-2013# ls -l | sed 1d
-rw-r--r-- 1 root root   45 2013-06-21 04:04 file
-rw-r--r-- 1 root root   45 2013-06-21 03:51 file-d
-rw-r--r-- 1 root root   45 2013-06-21 03:48 file-u
-rw-r--r-- 1 root root   45 2013-06-21 03:40 file.unx
-rwxr-xr-x 1 root root  257 2013-06-21 03:13 test1.sh
-rwxr-xr-x 1 root root  517 2013-06-21 03:34 test2.sh
-rwxr-xr-x 1 root root  742 2013-06-21 03:34 test3.sh
-rwxr-xr-x 1 root root  534 2013-06-21 03:40 test4.sh
-rwxr-xr-x 1 root root  312 2013-06-21 03:50 test5.sh
-rw-r--r-- 1 root root 1040 2013-06-21 03:54 test6.sh
-rwxr-xr-x 1 root root  220 2013-06-21 04:11 test7.sh
-rwxr-xr-x 1 root root  411 2013-06-21 04:16 test8.sh
root@ubuntu:~/resource/shell-study/0621-2013# printf "PERMISSIONS LINKS OWNER GROUP SIZE MONTH DAY HH:MM PROG-NAME\n";ls -l | sed 1d
PERMISSIONS LINKS OWNER GROUP SIZE MONTH DAY HH:MM PROG-NAME
-rw-r--r-- 1 root root   45 2013-06-21 04:04 file
-rw-r--r-- 1 root root   45 2013-06-21 03:51 file-d
-rw-r--r-- 1 root root   45 2013-06-21 03:48 file-u
-rw-r--r-- 1 root root   45 2013-06-21 03:40 file.unx
-rwxr-xr-x 1 root root  257 2013-06-21 03:13 test1.sh
-rwxr-xr-x 1 root root  517 2013-06-21 03:34 test2.sh
-rwxr-xr-x 1 root root  742 2013-06-21 03:34 test3.sh
-rwxr-xr-x 1 root root  534 2013-06-21 03:40 test4.sh
-rwxr-xr-x 1 root root  312 2013-06-21 03:50 test5.sh
-rw-r--r-- 1 root root 1040 2013-06-21 03:54 test6.sh
-rwxr-xr-x 1 root root  220 2013-06-21 04:11 test7.sh
-rwxr-xr-x 1 root root  411 2013-06-21 04:16 test8.sh
root@ubuntu:~/resource/shell-study/0621-2013# ./test8.sh 
PERMISSIONS  LINKS  OWNER  GROUP  SIZE  MONTH       DAY    HH:MM     PROG-NAME
-rw-r--r--   1      root   root   45    2013-06-21  04:04  file
-rw-r--r--   1      root   root   45    2013-06-21  03:51  file-d
-rw-r--r--   1      root   root   45    2013-06-21  03:48  file-u
-rw-r--r--   1      root   root   45    2013-06-21  03:40  file.unx
-rwxr-xr-x   1      root   root   257   2013-06-21  03:13  test1.sh
-rwxr-xr-x   1      root   root   517   2013-06-21  03:34  test2.sh
-rwxr-xr-x   1      root   root   742   2013-06-21  03:34  test3.sh
-rwxr-xr-x   1      root   root   534   2013-06-21  03:40  test4.sh
-rwxr-xr-x   1      root   root   312   2013-06-21  03:50  test5.sh
-rw-r--r--   1      root   root   1040  2013-06-21  03:54  test6.sh
-rwxr-xr-x   1      root   root   220   2013-06-21  04:11  test7.sh
-rwxr-xr-x   1      root   root   411   2013-06-21  04:16  test8.sh
root@ubuntu:~/resource/shell-study/0621-2013#

colrm

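Column removal filter. This removes columns (characters) from a file and writes the result, lacking the specified range of columns, to stdout.

colrm 2 4 <filename removes the second through fourth characters from each line of filename.

If the file contains tabs or nonprintable characters, this may cause unpredictable behavior. In such cases, consider using expand and unexpand in a pipe preceding colrm.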

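nl

Line numbering filter.

nl filename lists filename to stdout, but inserts consecutive numbers at the beginning of each non-blank line. If there is no filename parameter, it operates on stdin.

The output of nl is very similar to that of cat -n; however, by default nl does not number blank lines.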

#!/bin/bash
# This script echoes itself three times, with its lines numbered.

# 'nl' sees this line as number 3, since it does not count blank lines.
# 'cat -n' sees this line as number 5.

nl `basename $0`

echo; echo  # Now, let's try it with 'cat -n'

cat -n `basename $0`
# The difference is that 'cat -n' numbers the blank lines too.
# Note that 'nl -ba' will do the same.
echo; echo
nl -ba `basename $0`

exit 0

Result:

root@ubuntu:~/resource/shell-study/0621-2013# ./test9.sh 
     1	#!/bin/bash
     2	# This script echoes itself three times, with its lines numbered.
       
     3	# 'nl' sees this line as number 3, since it does not count blank lines.
     4	# 'cat -n' sees this line as number 5.
       
     5	nl `basename $0`
       
     6	echo; echo  # Now, let's try it with 'cat -n'
       
     7	cat -n `basename $0`
     8	# The difference is that 'cat -n' numbers the blank lines too.
     9	# Note that 'nl -ba' will do the same.
    10	echo; echo
    11	nl -ba `basename $0`
       
    12	exit 0


     1	#!/bin/bash
     2	# This script echoes itself three times, with its lines numbered.
     3	
     4	# 'nl' sees this line as number 3, since it does not count blank lines.
     5	# 'cat -n' sees this line as number 5.
     6	
     7	nl `basename $0`
     8	
     9	echo; echo  # Now, let's try it with 'cat -n'
    10	
    11	cat -n `basename $0`
    12	# The difference is that 'cat -n' numbers the blank lines too.
    13	# Note that 'nl -ba' will do the same.
    14	echo; echo
    15	nl -ba `basename $0`
    16	
    17	exit 0


     1	#!/bin/bash
     2	# This script echoes itself three times, with its lines numbered.
     3	
     4	# 'nl' sees this line as number 3, since it does not count blank lines.
     5	# 'cat -n' sees this line as number 5.
     6	
     7	nl `basename $0`
     8	
     9	echo; echo  # Now, let's try it with 'cat -n'
    10	
    11	cat -n `basename $0`
    12	# The difference is that 'cat -n' numbers the blank lines too.
    13	# Note that 'nl -ba' will do the same.
    14	echo; echo
    15	nl -ba `basename $0`
    16	
    17	exit 0
root@ubuntu:~/resource/shell-study/0621-2013#

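pr

Print formatting filter. This paginates files (or the stdout of another command, via a pipe) into sections suitable for hard-copy printing or viewing on screen. Various options permit row and column manipulation, joining lines, setting margins, numbering lines, adding page headers, and merging files, among other things.

The pr command combines much of the functionality of nl, paste, fold, column, and expand.

pr -o 5 --width=65 fileZZZ | more gives a nicely paginated listing of fileZZZ on screen, with the left margin set to 5 and the total width set to 65.

A particularly useful option is -d, which forces double-spacing (the same effect as sed G).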
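A minimal sketch of the double-spacing behavior, using three made-up input lines. It adds the -t option, which is not mentioned above, to suppress the page header and trailer so that only the text itself is shown.

```shell
#!/bin/bash
# -d double-spaces the output; -t omits the page header and trailer.
doubled=$(printf 'first\nsecond\nthird\n' | pr -d -t)
echo "$doubled"

echo "---"

# For comparison, the sed equivalent.
printf 'first\nsecond\nthird\n' | sed G
```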

gettext

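The GNU gettext package is a set of utilities for localizing and translating the text output of programs into other languages. While originally intended for C programs, it now supports quite a number of programming and scripting languages.

To see how the gettext program is used in shell scripts, see the info page.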

msgfmt

A program for generating binary message catalogs. It is used mainly for localization.

iconv

A utility for converting files to a different encoding (character set). Its chief use is localization.

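# Convert a string from UTF-8 to UTF-16 and append it to BookList.
function write_utf8_string {
    STRING=$1
    BOOKLIST=$2
    # cut -b 3- drops the first two bytes of the iconv output
    # (typically the byte-order mark); tr -d '\n' removes newlines.
    echo -n "$STRING" | iconv -f UTF8 -t UTF16 | \
    cut -b 3- | tr -d '\n' >> "$BOOKLIST"
}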
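To get a feel for what the conversion does to the bytes, here is a small sketch using a made-up ASCII string. It uses UTF-16LE, which has a fixed byte order and therefore no byte-order mark; the final command shows plain UTF-16, whose leading bytes depend on the iconv implementation (glibc iconv usually emits a 2-byte mark first, which is presumably what the cut -b 3- in the function above is meant to drop).

```shell
#!/bin/bash
text="abc"                                   # hypothetical sample string

# UTF-16LE has a fixed byte order, so no byte-order mark is emitted.
utf16_bytes=$(printf '%s' "$text" | iconv -f UTF-8 -t UTF-16LE | wc -c)
echo "UTF-16LE size: $utf16_bytes bytes"     # two bytes per ASCII character

# Round trip back to UTF-8 to confirm that nothing was lost.
roundtrip=$(printf '%s' "$text" | iconv -f UTF-8 -t UTF-16LE | iconv -f UTF-16LE -t UTF-8)
echo "$roundtrip"

# Plain UTF-16: inspect the raw bytes, including any byte-order mark.
printf '%s' "$text" | iconv -f UTF-8 -t UTF-16 | od -An -tx1
```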

recode

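Consider this a fancier version of iconv, above. This very versatile utility for converting a file to a different encoding is not part of the standard Linux installation.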

TeX, gs

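TeX and Postscript are text markup languages used for preparing copy for printing or formatted video display.

TeX is Donald Knuth's elaborate typesetting system. It is often convenient to write a shell script that encapsulates all the options and arguments passed to one of these markup languages.

Ghostscript (gs) is a GPL-ed Postscript interpreter.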

enscript

A utility for converting a plain text file to PostScript.

For example, enscript filename.txt -p filename.ps produces the PostScript output file filename.ps.

groff, tbl, eqn

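Yet another text markup and display formatting language is groff. This is the enhanced GNU version of the traditional UNIX roff/troff display and typesetting package. Man pages use groff.

The tbl table processing utility is considered part of groff, as its function is to convert table markup into groff commands.

The eqn equation processing utility is likewise part of groff, and its function is to convert equation markup into groff commands.

An example: viewing a formatted man page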

#!/bin/bash
# Formats the source of a man page for viewing.

#  This script is useful when writing or checking man page source.
#  It lets you look at the intermediate results
#+ while you work.

E_WRONGARGS=65

if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename"
  exit $E_WRONGARGS
fi

# ---------------------------
groff -Tascii -man "$1" | less
# From the man page for groff.
# ---------------------------

#  If the man page includes tables and/or equations,
#+ then the above code will choke.
#  The following line can handle such cases.
#
#   gtbl < "$1" | geqn -Tlatin1 | groff -Tlatin1 -mtty-char -man
#
#   Thanks, S.C.

exit 0

A bit abstract, isn't it? To be honest, I don't fully understand it either O(∩_∩)O~, so I'll just leave it here.

lex, yacc

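The lex lexical analyzer produces programs for pattern matching. It has been replaced by flex on Linux systems.

The yacc utility creates a parser based on a set of grammar specifications. It has been replaced by bison on Linux systems.

To be continued...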