自然语言14.1_python实现PorterStemmer算法

 python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

机器学习,统计项目合作QQ:231469242

#https://tartarus.org/martin/PorterStemmer/python.txt

#!/usr/bin/env python

"""Porter Stemming Algorithm
This is the Porter stemming algorithm, ported to Python from the
version coded up in ANSI C by the author. It may be be regarded
as canonical, in that it follows the algorithm presented in

Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
no. 3, pp 130-137,

only differing from it at the points maked --DEPARTURE-- below.

See also http://www.tartarus.org/~martin/PorterStemmer

The algorithm as described in the paper could be exactly replicated
by adjusting the points of DEPARTURE, but this is barely necessary,
because (a) the points of DEPARTURE are definitely improvements, and
(b) no encoding of the Porter stemmer I have seen is anything like
as exact as this version, even with the points of DEPARTURE!

Vivake Gupta (v@nano.com)

Release 1: January 2001

Further adjustments by Santiago Bruno (bananabruno@gmail.com)
to allow word input not restricted to one word per line, leading
to:

release 2: July 2008
"""

import sys

class PorterStemmer:

    def __init__(self):
        """The main part of the stemming algorithm starts here.
        b is a buffer holding a word to be stemmed. The letters are in b[k0],
        b[k0+1] ... ending at b[k]. In fact k0 = 0 in this demo program. k is
        readjusted downwards as the stemming progresses. Zero termination is
        not in fact used in the algorithm.

        Note that only lower case sequences are stemmed. Forcing to lower case
        should be done before stem(...) is called.
        """

        self.b = ""  # buffer for word to be stemmed
        self.k = 0
        self.k0 = 0
        self.j = 0   # j is a general offset into the string

    def cons(self, i):
        """cons(i) is TRUE <=> b[i] is a consonant."""
        if self.b[i] == 'a' or self.b[i] == 'e' or self.b[i] == 'i' or self.b[i] == 'o' or self.b[i] == 'u':
            return 0
        if self.b[i] == 'y':
            if i == self.k0:
                return 1
            else:
                return (not self.cons(i - 1))
        return 1

    def m(self):
        """m() measures the number of consonant sequences between k0 and j.
        if c is a consonant sequence and v a vowel sequence, and <..>
        indicates arbitrary presence,

           <c><v>       gives 0
           <c>vc<v>     gives 1
           <c>vcvc<v>   gives 2
           <c>vcvcvc<v> gives 3
           ....
        """
        n = 0
        i = self.k0
        while 1:
            if i > self.j:
                return n
            if not self.cons(i):
                break
            i = i + 1
        i = i + 1
        while 1:
            while 1:
                if i > self.j:
                    return n
                if self.cons(i):
                    break
                i = i + 1
            i = i + 1
            n = n + 1
            while 1:
                if i > self.j:
                    return n
                if not self.cons(i):
                    break
                i = i + 1
            i = i + 1

    def vowelinstem(self):
        """vowelinstem() is TRUE <=> k0,...j contains a vowel"""
        for i in range(self.k0, self.j + 1):
            if not self.cons(i):
                return 1
        return 0

    def doublec(self, j):
        """doublec(j) is TRUE <=> j,(j-1) contain a double consonant."""
        if j < (self.k0 + 1):
            return 0
        if (self.b[j] != self.b[j-1]):
            return 0
        return self.cons(j)

    def cvc(self, i):
        """cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant
        and also if the second c is not w,x or y. this is used when trying to
        restore an e at the end of a short  e.g.

           cav(e), lov(e), hop(e), crim(e), but
           snow, box, tray.
        """
        if i < (self.k0 + 2) or not self.cons(i) or self.cons(i-1) or not self.cons(i-2):
            return 0
        ch = self.b[i]
        if ch == 'w' or ch == 'x' or ch == 'y':
            return 0
        return 1

    def ends(self, s):
        """ends(s) is TRUE <=> k0,...k ends with the string s."""
        length = len(s)
        if s[length - 1] != self.b[self.k]: # tiny speed-up
            return 0
        if length > (self.k - self.k0 + 1):
            return 0
        if self.b[self.k-length+1:self.k+1] != s:
            return 0
        self.j = self.k - length
        return 1

    def setto(self, s):
        """setto(s) sets (j+1),...k to the characters in the string s, readjusting k."""
        length = len(s)
        self.b = self.b[:self.j+1] + s + self.b[self.j+length+1:]
        self.k = self.j + length

    def r(self, s):
        """r(s) is used further down."""
        if self.m() > 0:
            self.setto(s)

    def step1ab(self):
        """step1ab() gets rid of plurals and -ed or -ing. e.g.

           caresses  ->  caress
           ponies    ->  poni
           ties      ->  ti
           caress    ->  caress
           cats      ->  cat

           feed      ->  feed
           agreed    ->  agree
           disabled  ->  disable

           matting   ->  mat
           mating    ->  mate
           meeting   ->  meet
           milling   ->  mill
           messing   ->  mess

           meetings  ->  meet
        """
        if self.b[self.k] == 's':
            if self.ends("sses"):
                self.k = self.k - 2
            elif self.ends("ies"):
                self.setto("i")
            elif self.b[self.k - 1] != 's':
                self.k = self.k - 1
        if self.ends("eed"):
            if self.m() > 0:
                self.k = self.k - 1
        elif (self.ends("ed") or self.ends("ing")) and self.vowelinstem():
            self.k = self.j
            if self.ends("at"):   self.setto("ate")
            elif self.ends("bl"): self.setto("ble")
            elif self.ends("iz"): self.setto("ize")
            elif self.doublec(self.k):
                self.k = self.k - 1
                ch = self.b[self.k]
                if ch == 'l' or ch == 's' or ch == 'z':
                    self.k = self.k + 1
            elif (self.m() == 1 and self.cvc(self.k)):
                self.setto("e")

    def step1c(self):
        """step1c() turns terminal y to i when there is another vowel in the stem."""
        if (self.ends("y") and self.vowelinstem()):
            self.b = self.b[:self.k] + 'i' + self.b[self.k+1:]

    def step2(self):
        """step2() maps double suffices to single ones.
        so -ization ( = -ize plus -ation) maps to -ize etc. note that the
        string before the suffix must give m() > 0.
        """
        if self.b[self.k - 1] == 'a':
            if self.ends("ational"):   self.r("ate")
            elif self.ends("tional"):  self.r("tion")
        elif self.b[self.k - 1] == 'c':
            if self.ends("enci"):      self.r("ence")
            elif self.ends("anci"):    self.r("ance")
        elif self.b[self.k - 1] == 'e':
            if self.ends("izer"):      self.r("ize")
        elif self.b[self.k - 1] == 'l':
            if self.ends("bli"):       self.r("ble") # --DEPARTURE--
            # To match the published algorithm, replace this phrase with
            #   if self.ends("abli"):      self.r("able")
            elif self.ends("alli"):    self.r("al")
            elif self.ends("entli"):   self.r("ent")
            elif self.ends("eli"):     self.r("e")
            elif self.ends("ousli"):   self.r("ous")
        elif self.b[self.k - 1] == 'o':
            if self.ends("ization"):   self.r("ize")
            elif self.ends("ation"):   self.r("ate")
            elif self.ends("ator"):    self.r("ate")
        elif self.b[self.k - 1] == 's':
            if self.ends("alism"):     self.r("al")
            elif self.ends("iveness"): self.r("ive")
            elif self.ends("fulness"): self.r("ful")
            elif self.ends("ousness"): self.r("ous")
        elif self.b[self.k - 1] == 't':
            if self.ends("aliti"):     self.r("al")
            elif self.ends("iviti"):   self.r("ive")
            elif self.ends("biliti"):  self.r("ble")
        elif self.b[self.k - 1] == 'g': # --DEPARTURE--
            if self.ends("logi"):      self.r("log")
        # To match the published algorithm, delete this phrase

    def step3(self):
        """step3() dels with -ic-, -full, -ness etc. similar strategy to step2."""
        if self.b[self.k] == 'e':
            if self.ends("icate"):     self.r("ic")
            elif self.ends("ative"):   self.r("")
            elif self.ends("alize"):   self.r("al")
        elif self.b[self.k] == 'i':
            if self.ends("iciti"):     self.r("ic")
        elif self.b[self.k] == 'l':
            if self.ends("ical"):      self.r("ic")
            elif self.ends("ful"):     self.r("")
        elif self.b[self.k] == 's':
            if self.ends("ness"):      self.r("")

    def step4(self):
        """step4() takes off -ant, -ence etc., in context <c>vcvc<v>."""
        if self.b[self.k - 1] == 'a':
            if self.ends("al"): pass
            else: return
        elif self.b[self.k - 1] == 'c':
            if self.ends("ance"): pass
            elif self.ends("ence"): pass
            else: return
        elif self.b[self.k - 1] == 'e':
            if self.ends("er"): pass
            else: return
        elif self.b[self.k - 1] == 'i':
            if self.ends("ic"): pass
            else: return
        elif self.b[self.k - 1] == 'l':
            if self.ends("able"): pass
            elif self.ends("ible"): pass
            else: return
        elif self.b[self.k - 1] == 'n':
            if self.ends("ant"): pass
            elif self.ends("ement"): pass
            elif self.ends("ment"): pass
            elif self.ends("ent"): pass
            else: return
        elif self.b[self.k - 1] == 'o':
            if self.ends("ion") and (self.b[self.j] == 's' or self.b[self.j] == 't'): pass
            elif self.ends("ou"): pass
            # takes care of -ous
            else: return
        elif self.b[self.k - 1] == 's':
            if self.ends("ism"): pass
            else: return
        elif self.b[self.k - 1] == 't':
            if self.ends("ate"): pass
            elif self.ends("iti"): pass
            else: return
        elif self.b[self.k - 1] == 'u':
            if self.ends("ous"): pass
            else: return
        elif self.b[self.k - 1] == 'v':
            if self.ends("ive"): pass
            else: return
        elif self.b[self.k - 1] == 'z':
            if self.ends("ize"): pass
            else: return
        else:
            return
        if self.m() > 1:
            self.k = self.j

    def step5(self):
        """step5() removes a final -e if m() > 1, and changes -ll to -l if
        m() > 1.
        """
        self.j = self.k
        if self.b[self.k] == 'e':
            a = self.m()
            if a > 1 or (a == 1 and not self.cvc(self.k-1)):
                self.k = self.k - 1
        if self.b[self.k] == 'l' and self.doublec(self.k) and self.m() > 1:
            self.k = self.k -1

    def stem(self, p, i, j):
        """In stem(p,i,j), p is a char pointer, and the string to be stemmed
        is from p[i] to p[j] inclusive. Typically i is zero and j is the
        offset to the last character of a string, (p[j+1] == ''). The
        stemmer adjusts the characters p[i] ... p[j] and returns the new
        end-point of the string, k. Stemming never increases word length, so
        i <= k <= j. To turn the stemmer into a module, declare 'stem' as
        extern, and delete the remainder of this file.
        """
        # copy the parameters into statics
        self.b = p
        self.k = j
        self.k0 = i
        if self.k <= self.k0 + 1:
            return self.b # --DEPARTURE--

        # With this line, strings of length 1 or 2 don't go through the
        # stemming process, although no mention is made of this in the
        # published algorithm. Remove the line to match the published
        # algorithm.

        self.step1ab()
        self.step1c()
        self.step2()
        self.step3()
        self.step4()
        self.step5()
        return self.b[self.k0:self.k+1]


if __name__ == '__main__':
    p = PorterStemmer()
    if len(sys.argv) > 1:
        for f in sys.argv[1:]:
            infile = open(f, 'r')
            while 1:
                output = ''
                word = ''
                line = infile.readline()
                if line == '':
                    break
                for c in line:
                    if c.isalpha():
                        word += c.lower()
                    else:
                        if word:
                            output += p.stem(word, 0,len(word)-1)
                            word = ''
                        output += c.lower()
                print output,
            infile.close()

http://www.cnblogs.com/xiaoxiangfeizi/archive/2011/12/30/2307810.html

 http://qinxuye.me/article/porter-stemmer/

在英语中,一个单词常常是另一个单词的“变种”,如:happy=>happiness,这里happy叫做happiness的词干 (stem)。在信息检索系统中,我们常常做的一件事,就是在Term规范化过程中,提取词干(stemming),即除去英文单词分词变换形式的结尾。

应用最为广泛的、中等复杂程度的、基于后缀剥离的词干提取算法是波特词干算法,也叫波特词干器(Porter Stemmer)。详见官方网站。比较热门的检索系统包括Lucene、Whoosh等中的词干过滤器就是采用的波特词干算法。

简单说一下历史:

马丁.波特博士(Dr. Martin Porter)于1979年,在英国剑桥大学,计算机实验室,发明了波特词干算法。
波特词干算法当时是作为一个大型IR项目的一部分被提出的。它的原始论文为:
C.J. van Rijsbergen, S.E. Robertson and M.F. Porter, 1980. New models in probabilistic information retrieval. London: British Library. (British Library Research and Development Report, no. 5587).
最初的波特词干提取算法是使用BCPL语言编写的。作者在其网站上公布了各种语言的实现版本,其中C语言的版本是作者编写的最权威的版本。
波特词干器适用于涉及到提取词干的IR研究工作,其实验结果是可重复的,言外之意是说,波特词干器的输出结果是确定性的,不是随机的。(还有基于随机的高级词干提取算法,虽然会更准确,但同时也更加复杂)。

词干提取算法无法达到100%的准确程度,因为语言单词本身的变化存在着许多例外的情况,无法概括到一般的规则中。使用词干提取算法能够帮助提高IR的性能。

波特词干算法的官方网站上,有各个语言的实现版本(其实都是C标准的各个翻译形式)。各位要应用到实际生产中可以直接下载对应的版本。本文将会分析Java语言的源码。在今后的文章中,再介绍使用Python特性优化过的版本。(Python原版几乎就是C语言版本的翻译,这也就意味着不能充分利用Python的语言特性。)

在实际处理中,需要分六步走。首先,我们先定义一个Stemmer类。

1
2
3
4
5
6
7
8
9
10
11
12
13
class Stemmer
private char[] b;
   private int i,     /* b中的元素位置(偏移量) */
               i_end, /* 要抽取词干单词的结束位置 */
               j, k;
   private static final int INC = 50;
                     /* 随着b的大小增加数组要增长的长度(防止溢出) */
   public Stemmer()
   {  b = new char[INC];
      i = 0;
      i_end = 0;
   }
}

这里,b是一个数组,用来存待词干提取的单词(以char的形式)。这里的变量k会随着词干抽取而变化。

接着,我们要添加单词来进行处理:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
/**
* 增加一个字符到要存放待处理的单词的数组。添加完字符时,
* 可以调用stem(void)方法来进行抽取词干的工作。
*/
public void add(char ch)
if (i == b.length)
   char[] new_b = new char[i+INC];
      for (int c = 0; c < i; c++) new_b[c] = b[c];
      b = new_b;
   }
   b[i++] = ch;
}
 
/** 增加wLen长度的字符数组到存放待处理的单词的数组b。
*/
public void add(char[] w, int wLen)
if (i+wLen >= b.length)
   char[] new_b = new char[i+wLen+INC];
      for (int c = 0; c < i; c++) new_b[c] = b[c];
      b = new_b;
   }
   for (int c = 0; c < wLen; c++) b[i++] = w[c];
}

大家可能会觉得这么处理字符串太麻烦了吧,要明白,整个代码是从C移植过来的。

接下来,是一系列工具函数。首先先介绍一下它们:

  • cons(i):参数i:int型;返回值bool型。当i为辅音时,返回真;否则为假。
  • m():返回值:int型。表示单词b介于0和j之间辅音序列的个度。现假设c代表辅音序列,而v代表元音序列。<..>表示任意存在。于是有如下定义;
    • <c><v>          结果为 0
    • <c>vc<v>       结果为 1
    • <c>vcvc<v>    结果为 2
    • <c>vcvcvc<v> 结果为 3
    • ....
  • vowelinstem():返回值:bool型。从名字就可以看得出来,表示单词b介于0到i之间是否存在元音。
  • doublec(j):参数j:int型;返回值bool型。这个函数用来表示在j和j-1位置上的两个字符是否是相同的辅音。
  • cvc(i):参数i:int型;返回值bool型。对于i,i-1,i-2位置上的字符,它们是“辅音-元 音-辅音”的形式,并且对于第二个辅音,它不能为w、x、y中的一个。这个函数用来处理以e结尾的短单词。比如说 cav(e),lov(e),hop(e),crim(e)。但是像snow,box,tray就辅符合条件。
  • ends(s):参数:String;返回值:bool型。顾名思义,判断b是否以s结尾。
  • setto(s):参数:String;void类型。把b在(j+1)...k位置上的字符设为s,同时,调整k的大小。
  • r(s):参数:String;void类型。在m()>0的情况下,调用setto(s)。

简单贴出来这些工具函数的代码。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
// cons(i) 为真 <=> b[i] 是一个辅音
private final boolean cons(int i)
switch (b[i])
   case 'a': case 'e': case 'i': case 'o': case 'u': return false; //aeiou
      case 'y': return (i==0) ? true : !cons(i-1);
                //y开头,为辅;否则看i-1位,如果i-1位为辅,y为元,反之亦然。
      default: return true;
   }
}
 
// m() 用来计算在0和j之间辅音序列的个数。 见上面的说明。 */
private final int m()
int n = 0; //辅音序列的个数,初始化
   int i = 0; //偏移量
   while(true)
   if (i > j) return n; //如果超出最大偏移量,直接返回n
      if (! cons(i)) break; //如果是元音,中断
      i++; //辅音移一位,直到元音的位置
   }
   i++; //移完辅音,从元音的第一个字符开始
   while(true)//循环计算vc的个数
   while(true) //循环判断v
      if (i > j) return n;
         if (cons(i)) break; //出现辅音则终止循环
            i++;
      }
      i++;
      n++;
      while(true) //循环判断c
      if (i > j) return n;
         if (! cons(i)) break;
         i++;
      }
      i++;
    }
}
 
// vowelinstem() 为真 <=> 0,...j 包含一个元音
private final boolean vowelinstem()
int i; for (i = 0; i <= j; i++) if (! cons(i)) return true;
   return false;
}
 
// doublec(j) 为真 <=> j,(j-1) 包含两个一样的辅音
private final boolean doublec(int j)
if (j < 1) return false;
   if (b[j] != b[j-1]) return false;
   return cons(j);
}
 
/* cvc(i) is 为真 <=> i-2,i-1,i 有形式: 辅音 - 元音 - 辅音
   并且第二个c不是 w,x 或者 y. 这个用来处理以e结尾的短单词。 e.g.
 
   cav(e), lov(e), hop(e), crim(e), 但不是
   snow, box, tray.
 
*/
private final boolean cvc(int i)
if (i < 2 || !cons(i) || cons(i-1) || !cons(i-2)) return false;
   int ch = b[i];
         if (ch == 'w' || ch == 'x' || ch == 'y') return false;
   }
      return true;
}
 
private final boolean ends(String s)
int l = s.length();
   int o = k-l+1;
   if (o < 0) return false;
   for (int i = 0; i < l; i++) if (b[o+i] != s.charAt(i)) return false;
   j = k-l;
   return true;
}
 
// setto(s) 设置 (j+1),...k 到s字符串上的字符, 并且调整k值
private final void setto(String s)
int l = s.length();
   int o = j+1;
   for (int i = 0; i < l; i++) b[o+i] = s.charAt(i);
   k = j+l;
}
 
private final void r(String s) { if (m() > 0) setto(s); }

接下来,就是分六步来进行处理的过程。

第一步,处理复数,以及ed和ing结束的单词。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
/* step1() 处理复数,ed或者ing结束的单词。比如:
 
      caresses  ->  caress
      ponies    ->  poni
      ties      ->  ti
      caress    ->  caress
      cats      ->  cat
 
      feed      ->  feed
      agreed    ->  agree
      disabled  ->  disable
 
      matting   ->  mat
      mating    ->  mate
      meeting   ->  meet
      milling   ->  mill
      messing   ->  mess
 
      meetings  ->  meet
*/
 
private final void step1()
if (b[k] == 's')
   if (ends("sses")) k -= 2; //以“sses结尾”
      else if (ends("ies")) setto("i"); //以ies结尾,置为i
      else if (b[k-1] != 's') k--; //两个s结尾不处理
   }
   if (ends("eed")) { if (m() > 0) k--; } //以“eed”结尾,当m>0时,左移一位
   else if ((ends("ed") || ends("ing")) && vowelinstem())
   {  k = j;
      if (ends("at")) setto("ate"); else
      if (ends("bl")) setto("ble"); else
      if (ends("iz")) setto("ize"); else
      if (doublec(k))//如果有两个相同辅音
      {  k--;
         int ch = b[k];
            if (ch == 'l' || ch == 's' || ch == 'z') k++;
         }
      }
      else if (m() == 1 && cvc(k)) setto("e");
  }
}

第二步,如果单词中包含元音,并且以y结尾,将y改为i。代码很简单:

1
private final void step2() { if (ends("y") && vowelinstem()) b[k] = 'i'; }

第三步,将双后缀的单词映射为单后缀。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
/* step3() 将双后缀的单词映射为单后缀。 所以 -ization ( = -ize 加上
   -ation) 被映射到 -ize 等等。 注意在去除后缀之前必须确保
   m() > 0. */
private final void step3() { if (k == 0) returnswitch (b[k-1])
{
    case 'a': if (ends("ational")) { r("ate"); break; }
              if (ends("tional")) { r("tion"); break; }
              break;
    case 'c': if (ends("enci")) { r("ence"); break; }
              if (ends("anci")) { r("ance"); break; }
              break;
    case 'e': if (ends("izer")) { r("ize"); break; }
              break;
    case 'l': if (ends("bli")) { r("ble"); break; }
              if (ends("alli")) { r("al"); break; }
              if (ends("entli")) { r("ent"); break; }
              if (ends("eli")) { r("e"); break; }
              if (ends("ousli")) { r("ous"); break; }
              break;
    case 'o': if (ends("ization")) { r("ize"); break; }
              if (ends("ation")) { r("ate"); break; }
              if (ends("ator")) { r("ate"); break; }
              break;
    case 's': if (ends("alism")) { r("al"); break; }
              if (ends("iveness")) { r("ive"); break; }
              if (ends("fulness")) { r("ful"); break; }
              if (ends("ousness")) { r("ous"); break; }
              break;
    case 't': if (ends("aliti")) { r("al"); break; }
              if (ends("iviti")) { r("ive"); break; }
              if (ends("biliti")) { r("ble"); break; }
              break;
    case 'g': if (ends("logi")) { r("log"); break; }
} }

第四步,处理-ic-,-full,-ness等等后缀。和步骤3有着类似的处理。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
private final void step4() { switch (b[k])
{
    case 'e': if (ends("icate")) { r("ic"); break; }
              if (ends("ative")) { r(""); break; }
              if (ends("alize")) { r("al"); break; }
              break;
    case 'i': if (ends("iciti")) { r("ic"); break; }
              break;
    case 'l': if (ends("ical")) { r("ic"); break; }
              if (ends("ful")) { r(""); break; }
              break;
    case 's': if (ends("ness")) { r(""); break; }
              break;
} }

第五步,在<c>vcvc<v>情形下,去除-ant,-ence等后缀。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
private final void step5()
{   if (k == 0) returnswitch (b[k-1])
    case 'a': if (ends("al")) break; return;
       case 'c': if (ends("ance")) break;
                 if (ends("ence")) break; return;
       case 'e': if (ends("er")) break; return;
       case 'i': if (ends("ic")) break; return;
       case 'l': if (ends("able")) break;
                 if (ends("ible")) break; return;
       case 'n': if (ends("ant")) break;
                 if (ends("ement")) break;
                 if (ends("ment")) break;
                 /* element etc. not stripped before the m */
                 if (ends("ent")) break; return;
       case 'o': if (ends("ion") && j >= 0 && (b[j] == 's' || b[j] == 't')) break;
                                 /* j >= 0 fixes Bug 2 */
                 if (ends("ou")) break; return;
                 /* takes care of -ous */
       case 's': if (ends("ism")) break; return;
       case 't': if (ends("ate")) break;
                 if (ends("iti")) break; return;
       case 'u': if (ends("ous")) break; return;
       case 'v': if (ends("ive")) break; return;
       case 'z': if (ends("ize")) break; return;
       default: return;
    }
    if (m() > 1) k = j;
}

第六步,也就是最后一步,在m()>1的情况下,移除末尾的“e”。

1
2
3
4
5
6
7
8
private final void step6()
{  j = k;
   if (b[k] == 'e')
   int a = m();
      if (a > 1 || a == 1 && !cvc(k-1)) k--;
   }
   if (b[k] == 'l' && doublec(k) && m() > 1) k--;
}

在了解了步骤之后,我们写一个stem()方法,来完成得到词干的工作。

1
2
3
4
5
6
7
8
9
/** 通过调用add()方法来讲单词放入词干器数组b中
  * 可以通过下面的方法得到结果:
  * getResultLength()/getResultBuffer() or toString().
  */
public void stem()
{  k = i - 1;
   if (k > 1) { step1(); step2(); step3(); step4(); step5(); step6(); }
   i_end = k+1; i = 0;
}

最后要提醒的就是,传入的单词必须是小写。关于Porter Stemmer的实现,就看到这里。如果是Java代码这么写,无可厚非(实际上也不是很美观)。对于Python来说,如果写成这样,实在是让人难以接受。以后的文章,将会实现符合Python习惯的写法。

需要测试数据这里是样本文件。而相应的输出文件在这里。更多内容请参考官方网站。

另外,波特词干算法有第二个版本,它的处理结果要比文中所介绍的算法准确度高,但是,相应地也就更复杂,消耗的时间也就更多。本文就不作解释,详细参考官方网站The Porter2 stemming algorithm

 

https://study.163.com/provider/400000000398149/index.htm?share=2&shareId=400000000398149(博主视频教学主页)

 

 

 
 
原文地址:https://www.cnblogs.com/webRobot/p/6082282.html