Text analysis commands: wc, sort, uniq

wc

Reports counts for a file: lines, words, characters, and bytes.

[06:49:57 root@C8-3-55 ~]#wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

Using wc to gather statistics on the passwd file

[06:50:33 root@C8-3-55 ~]#wc /etc/passwd
 139  180 6296 /etc/passwd

[06:58:49 root@C8-3-55 ~]#cat -n /etc/passwd | tail -n 1 ## passwd has 139 lines
   139  user100:x:8990:8990::/home/user100:/bin/bash
[06:53:25 root@C8-3-55 ~]#ll /etc/passwd ## passwd is 6296 bytes
-rw-r--r--. 1 root root 6296 Mar  6 08:09 /etc/passwd

[06:51:19 root@C8-3-55 ~]#wc -l /etc/passwd
139 /etc/passwd
[06:51:25 root@C8-3-55 ~]#cat /etc/passwd | wc -l
139
[06:52:36 root@C8-3-55 ~]#wc -L /etc/passwd ## display width of the longest line
99 /etc/passwd

The three numbers 139, 180, and 6296 are the line count (139 lines), the word count (180 words), and the byte count (6296 bytes).
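
To see how each count is defined, here is a minimal sketch on a small temporary file. The file contents are made up for illustration and are not taken from /etc/passwd. It also shows that wc -l counts newline characters, so a last line without a trailing newline is not counted.

```shell
#!/bin/sh
# Hypothetical sample data: two lines, three words.
tmp=$(mktemp)
printf 'alpha beta\ngamma\n' > "$tmp"

# Redirecting the file to stdin keeps the file name out of the output.
lines=$(wc -l < "$tmp")   # number of newline characters
words=$(wc -w < "$tmp")   # whitespace-delimited words
bytes=$(wc -c < "$tmp")   # bytes
echo "lines=$lines words=$words bytes=$bytes"

# wc -l counts newlines, so an unterminated last line is not counted.
unterminated=$(printf 'one\ntwo' | wc -l)
echo "unterminated=$unterminated"

rm -f "$tmp"
```

The sample holds 2 lines, 3 words, and 17 bytes, while the unterminated input reports only 1 line even though it contains two pieces of text.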

sort

Sorts the lines of a file, either by the whole line or by a chosen field (column).

[06:59:16 root@C8-3-55 ~]#sort --help
Usage: sort [OPTION]... [FILE]...
  or:  sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.

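With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
Ordering options:

  -b, --ignore-leading-blanks  ignore leading blanks
  -d, --dictionary-order      consider only blanks and alphanumeric characters
  -f, --ignore-case           fold lower case to upper case characters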
  -g, --general-numeric-sort  compare according to general numerical value
  -i, --ignore-nonprinting    consider only printable characters
  -M, --month-sort            compare (unknown) < 'JAN' < ... < 'DEC'
  -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)
  -n, --numeric-sort          compare according to string numerical value
  -R, --random-sort           shuffle, but group identical keys.  See shuf(1)
      --random-source=FILE    get random bytes from FILE
  -r, --reverse               reverse the result of comparisons
      --sort=WORD             sort according to WORD:
                                general-numeric -g, human-numeric -h, month -M,
                                numeric -n, random -R, version -V
  -V, --version-sort          natural sort of (version) numbers within text

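Other options:

      --batch-size=NMERGE   merge at most NMERGE inputs at once;
                            for more use temp files
  -c, --check, --check=diagnose-first  check for sorted input; do not sort
  -C, --check=quiet, --check=silent  like -c, but do not report first bad line
      --compress-program=PROG  compress temporaries with PROG;
                              decompress them with PROG -d
      --debug               annotate the part of the line used to sort,
                              and warn about questionable usage to stderr
      --files0-from=F       read input from the files specified by
                            NUL-terminated names in file F;
                            If F is - then read names from standard input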
  -k, --key=KEYDEF          sort via a key; KEYDEF gives location and type
  -m, --merge               merge already sorted files; do not sort
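  -o, --output=FILE         write result to FILE instead of standard output
  -s, --stable              stabilize sort by disabling last-resort comparison
  -S, --buffer-size=SIZE    use SIZE for main memory buffer
  -t, --field-separator=SEP  use SEP instead of non-blank to blank transition
  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp;
                              multiple options specify multiple directories
      --parallel=N          change the number of sorts run concurrently to N
  -u, --unique              with -c, check for strict ordering;
                              without -c, output only the first of an equal run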
  -z, --zero-terminated     line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end.  If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace.  OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key.  If no key is given, use
the entire line as the key.  Use --debug to diagnose incorrect key usage.

SIZE may be followed by the following multiplicative suffixes:
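% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.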

By default, sort compares whole lines character by character, in the collation order of the current locale.

[07:06:18 root@C8-3-55 ~]#sort /etc/passwd | head -n 10
adm:x:3:4:adm:/var/adm:/sbin/nologin
apache:x:48:48:Apache:/usr/share/httpd:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin
chrony:x:991:987::/var/lib/chrony:/sbin/nologin
clevis:x:994:990:Clevis Decryption Framework unprivileged user:/var/cache/clevis:/sbin/nologin
cockpit-ws:x:993:989:User for cockpit-ws:/nonexisting:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin

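Sorting by a specified field

sort can pick out a field directly and sort on it.

-t: splits each line into fields, using : as the separator
-k3 sorts on the key that starts at the third field (the key runs to the end of the line; -k3,3 limits it to the third field only)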

[07:06:26 root@C8-3-55 ~]#sort -t: -k3 /etc/passwd | head -n 10
root:x:0:0:root:/root:/bin/bash
python:x:1000:1000::/home/python:/bin/bash
sun3:x:1002:1002::/home/sun3:/bin/bash
sun4:x:1003:1003::/home/sun4:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
rtkit:x:172:172:RealtimeKit:/proc:/sbin/nologin
systemd-resolve:x:193:193:systemd Resolver:/:/sbin/nologin

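The output shows that the third field was not sorted by numeric value; it was compared character by character.

1000 (first character 1, second character 0) therefore sorts ahead of 11 (first character 1, second character 1).

To sort by numeric value, add the -n option.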

[07:09:03 root@C8-3-55 ~]#sort -t: -k3 -n /etc/passwd | head -n 10
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
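
The difference between the two comparisons can be reproduced on a few records. The following is a minimal sketch with made-up data in the name:x:UID layout. It uses -k3,3 so that the key is the third field alone, and LC_ALL=C so that the character comparison does not depend on the locale.

```shell
#!/bin/sh
# Made-up records in the name:x:UID layout of /etc/passwd.
data='web:x:1000
op:x:11
bin:x:2'

# Character comparison: "1000" sorts before "11" because '0' < '1'.
lexical=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k3,3 | cut -d: -f3 | paste -sd' ' -)

# Numeric comparison, with the key limited to field 3 only (-k3,3n).
numeric=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k3,3n | cut -d: -f3 | paste -sd' ' -)

echo "lexical: $lexical"   # lexical: 1000 11 2
echo "numeric: $numeric"   # numeric: 2 11 1000
```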

To sort in descending order, also add the -r option.


[07:13:03 root@C8-3-55 ~]#sort -t: -k3 -n -r /etc/passwd | head -n 10
nobody:x:65534:65534:Kernel Overflow User:/:/sbin/nologin
user100:x:8990:8990::/home/user100:/bin/bash
user99:x:8989:8989::/home/user99:/bin/bash
user98:x:8988:8988::/home/user98:/bin/bash
user97:x:8987:8987::/home/user97:/bin/bash
user96:x:8986:8986::/home/user96:/bin/bash
user95:x:8985:8985::/home/user95:/bin/bash
user94:x:8984:8984::/home/user94:/bin/bash
user93:x:8983:8983::/home/user93:/bin/bash
user92:x:8982:8982::/home/user92:/bin/bash

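Usually the columns of interest are extracted first and then sorted

First use cut to extract the 1st and 3rd fields (-f 1,3) of /etc/passwd, with : as the delimiter (-d:).
Then use sort on the 2nd field (-k 2) of that output, again split on : (-t:), ordered numerically and in reverse (-nr).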

[07:16:50 root@C8-3-55 ~]#cut -d: -f 1,3 /etc/passwd | sort -t: -k 2 -nr | head -n 10
nobody:65534
user100:8990
user99:8989
user98:8988
user97:8987
user96:8986
user95:8985
user94:8984
user93:8983
user92:8982

Sorting by disk usage percentage

[07:29:07 root@C8-3-55 ~]#df | tail -n +2 | tr -s ' ' '%'  | cut -d % -f 1,5 |sort -t % -k 2 -nr
/dev/mapper/cl-root%20
/dev/sda1%16
tmpfs%2
tmpfs%0
tmpfs%0
tmpfs%0
devtmpfs%0
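
To make each stage of that pipeline easier to follow, here is a sketch that runs it on a canned sample instead of the live system. The df output below is hypothetical; only its column layout matters.

```shell
#!/bin/sh
# Hypothetical df output, used so that the result is reproducible.
sample='Filesystem          1K-blocks    Used Available Use% Mounted on
devtmpfs               910220       0    910220   0% /dev
/dev/mapper/cl-root  17811456 3562292  14249164  20% /
/dev/sda1             1038336  166134    872202  16% /boot'

ranked=$(printf '%s\n' "$sample" |
  tail -n +2 |          # drop the header line
  tr -s ' ' '%' |       # turn each run of spaces into a single %
  cut -d % -f 1,5 |     # keep the device name and the usage figure
  sort -t % -k 2 -nr)   # sort by usage, largest first

printf '%s\n' "$ranked"
```

Because tr also squeezes the % that follows the usage figure, the number ends up as a clean fifth field. On systems with a recent GNU coreutils, df --output=source,pcent should give the two columns directly, which avoids the reshaping step.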

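Random ordering with the -R option

Put a list of names into text, one per line, and shuffle it with sort -R to produce a starting order or a prize draw list.

Randomizing the starting order of participants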

[07:42:19 root@C8-3-55 ~]#echo name{A..F} | tr -s ' ' '\n' | sort -R | cat -n
     1  nameB
     2  nameD
     3  nameA
     4  nameE
     5  nameC
     6  nameF
[07:43:04 root@C8-3-55 ~]#echo name{A..F} | tr -s ' ' '\n' | sort -R | cat -n
     1  nameB
     2  nameF
     3  nameE
     4  nameC
     5  nameD
     6  nameA
[07:43:06 root@C8-3-55 ~]#echo name{A..F} | tr -s ' ' '\n' | sort -R | cat -n
     1  nameC
     2  nameB
     3  nameA
     4  nameF
     5  nameD
     6  nameE

Drawing a random winner from 100 participants

[07:43:08 root@C8-3-55 ~]#seq 100 | sort -R | tail -n 1
51
[07:45:39 root@C8-3-55 ~]#seq 100 | sort -R | tail -n 1
23
[07:45:41 root@C8-3-55 ~]#seq 100 | sort -R | tail -n 1
76
[07:45:42 root@C8-3-55 ~]#seq 100 | sort -R | tail -n 1
96
[07:45:43 root@C8-3-55 ~]#seq 100 | sort -R | tail -n 1
73
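
One detail from the help text is worth checking: sort -R shuffles but groups identical keys, so it is not a plain shuffle when the input contains duplicates. The sketch below demonstrates this, and uses shuf from GNU coreutils (assumed to be installed) as an alternative way to draw a single winner.

```shell
#!/bin/sh
# sort -R orders lines by a random hash of the key, so equal lines stay adjacent.
# After sort -R, uniq therefore collapses the input to one line per distinct value.
groups=$(printf '%s\n' a b a b a b | sort -R | uniq | wc -l)
echo "distinct groups after sort -R: $groups"

# shuf shuffles lines independently of their content; -n 1 keeps a single line.
winner=$(seq 100 | shuf -n 1)
echo "winner: $winner"
```

For a list of distinct names or numbers, as in the examples above, both approaches produce a random order; the difference only appears when lines repeat.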

uniq

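uniq does not sort. It only collapses repeated lines that are adjacent; duplicates that are not next to each other are left alone.

With -c it also counts how many times each line occurs.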

[09:06:22 root@C8-3-55 ~]#uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines, one for each group
  -D                    print all duplicate lines
      --all-repeated[=METHOD]  like -D, but allow separating groups
                                 with an empty line;
                                 METHOD={none(default),prepend,separate}
  -f, --skip-fields=N   avoid comparing the first N fields
      --group[=METHOD]  show all items, separating groups with an empty line;
                          METHOD={separate(default),prepend,append,both}
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated     line delimiter is NUL, not newline
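  -w, --check-chars=N   compare no more than N characters in lines
      --help     display this help and exit
      --version  output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters.  Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.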
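
The adjacency rule is easy to verify. This is a minimal sketch on made-up input in which the two identical lines are separated by a different one.

```shell
#!/bin/sh
# Made-up input in which the two "a" lines are not adjacent.
input='a
b
a'

# No adjacent duplicates, so uniq alone keeps every line.
unsorted=$(printf '%s\n' "$input" | uniq | wc -l)
# Sorting first makes the duplicates adjacent, so uniq can merge them.
sorted=$(printf '%s\n' "$input" | sort | uniq | wc -l)
# uniq -c prefixes each line with its count; awk reorders that to value=count.
counted=$(printf '%s\n' "$input" | sort | uniq -c | awk '{print $2 "=" $1}' | paste -sd' ' -)

echo "uniq alone: $unsorted lines"
echo "sort | uniq: $sorted lines"
echo "counts: $counted"   # counts: a=2 b=1
```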

Example: from httpd's access_log, list the 3 IP addresses with the most requests

[09:53:25 root@C8-3-55 ~]#cut -d " " -f 1 access_log | sort | uniq -c | sort -nr | head -3
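
Since no output is shown for this command, here is a sketch that runs the same pipeline on a tiny log. The log lines and addresses are hypothetical; the only assumption about the format is that the client IP is the first space-separated field.

```shell
#!/bin/sh
# Hypothetical access_log lines; only the first field (the client IP) matters here.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [06/Mar/2021:09:00:01 +0800] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [06/Mar/2021:09:00:02 +0800] "GET /a HTTP/1.1" 200 128
10.0.0.1 - - [06/Mar/2021:09:00:03 +0800] "GET /b HTTP/1.1" 404 64
10.0.0.3 - - [06/Mar/2021:09:00:04 +0800] "GET / HTTP/1.1" 200 512
10.0.0.1 - - [06/Mar/2021:09:00:05 +0800] "GET /c HTTP/1.1" 200 256
10.0.0.2 - - [06/Mar/2021:09:00:06 +0800] "GET / HTTP/1.1" 200 512
EOF

# Extract the IPs, make identical ones adjacent, count each group, rank by count.
top=$(cut -d ' ' -f 1 "$log" | sort | uniq -c | sort -nr | head -n 3)
printf '%s\n' "$top"

# The most frequent address, without the count column.
busiest=$(printf '%s\n' "$top" | head -n 1 | awk '{print $2}')
echo "busiest: $busiest"
rm -f "$log"
```

The first sort is what makes uniq -c count correctly; the second sort -nr ranks the counted lines from most to least frequent.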

Example: from the system access log ss.log, list the 3 IPs with the most connections

tr -s ' ' : < ss.log | cut -d : -f 6 | tail -n +2 | sort | uniq -c | sort -nr | head -3

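-u prints only the lines that are not repeated; -d prints one copy of each repeated line.

Because uniq only compares adjacent lines, sort the input first.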

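Example: show the lines that appear in only one of the two files a.txt and b.txt (uniq -u), and the lines the two files have in common (uniq -d). This assumes no line is repeated within a single file.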

[10:01:31 root@C8-3-55 ~]#cat a.txt b.txt | sort | uniq -u

[10:05:29 root@C8-3-55 ~]#cat a.txt b.txt | sort | uniq -d
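
The sketch below tries this on two small files with made-up contents. It deduplicates each file first, so a line repeated inside one file is not mistaken for a shared line, and it cross-checks the result with comm -12, which prints the lines common to two sorted inputs.

```shell
#!/bin/sh
# Made-up file contents standing in for a.txt and b.txt.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/a.txt"
printf '%s\n' banana cherry date  > "$dir/b.txt"

# Deduplicate each file first, so a line repeated inside one file
# is not mistaken for a line shared by both.
sort -u "$dir/a.txt" > "$dir/a.sorted"
sort -u "$dir/b.txt" > "$dir/b.sorted"

# uniq -d keeps lines seen twice (in both files); uniq -u keeps lines seen once.
common=$(sort "$dir/a.sorted" "$dir/b.sorted" | uniq -d | paste -sd' ' -)
only_one=$(sort "$dir/a.sorted" "$dir/b.sorted" | uniq -u | paste -sd' ' -)
# comm -12 prints only the lines present in both sorted inputs.
common_comm=$(comm -12 "$dir/a.sorted" "$dir/b.sorted" | paste -sd' ' -)

echo "in both files: $common"        # in both files: banana cherry
echo "in one file only: $only_one"   # in one file only: apple date
rm -rf "$dir"
```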
