Word Tokenization

Well, after listening to the class, it's necessary to take some notes.

In this class, the professor showed us how to count the words in various corpora using Linux programs, and pointed out that different languages raise different tokenization problems. For example, word segmentation is the crucial step for Chinese, and maximum matching (sketched below) is a relatively good algorithm for it, though it works poorly for English: Chinese words are short and fairly uniform in length, so greedy matching rarely goes wrong, whereas English words vary too much in length.
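
The lecture doesn't spell the algorithm out, but the idea of forward maximum matching is simple: walk through the sentence left to right, and at each position greedily take the longest substring that appears in a word dictionary. Here is a minimal sketch in bash, assuming bash 4+ in a UTF-8 locale; the dictionary and sentence are toy examples of my own, not from the class:

    # forward maximum matching: illustrative sketch, not the lecture's code
    declare -A dict=([今天]=1 [天气]=1 [不错]=1)
    maxlen=2                      # length of the longest dictionary word
    sentence="今天天气不错"

    i=0
    n=${#sentence}                # character count (UTF-8 locale assumed)
    while (( i < n )); do
      matched=0
      # try the longest candidate first, then shrink until a dictionary hit
      for (( len = maxlen; len >= 1; len-- )); do
        word=${sentence:i:len}
        if [[ -n ${dict[$word]} ]]; then
          printf '%s ' "$word"
          (( i += len )); matched=1
          break
        fi
      done
      # unknown character: emit it on its own and move forward
      if (( matched == 0 )); then
        printf '%s ' "${sentence:i:1}"
        (( i += 1 ))
      fi
    done
    printf '\n'                   # prints: 今天 天气 不错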

Some Linux programs

less a.txt : means display a.txt one screen at a time.

tr -sc 'A-Za-z' '\n' < a.txt | less : means replace every character that is not a letter (not just periods and commas) with a newline and display the result; -c complements the letter set and -s squeezes runs of newlines, so each word ends up on its own line.
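
For example, on a hypothetical one-line input:

    $ echo 'Hello, world. Hello again!' | tr -sc 'A-Za-z' '\n'
    Hello
    world
    Hello
    again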

If you want to sort the words alphabetically, just add "| sort".

If you want to count how many times each unique word occurs, add "| uniq -c"; the input must already be sorted, since uniq only merges adjacent duplicate lines.

If you want to rank the words by frequency, add "| sort -n -r": -n sorts numerically on the counts, and -r reverses the order so the most frequent words come first. The full pipeline is shown below.
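
Chained together, these steps give the complete word-frequency pipeline from the lecture; each output line is a count followed by a word, ordered from most to least frequent:

    tr -sc 'A-Za-z' '\n' < a.txt | sort | uniq -c | sort -n -r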

Original article: https://www.cnblogs.com/chuanlong/p/2991846.html