Natural Language Processing with Python: Study Notes (33): 4.5 Doing More with Functions

4.5 Doing More with Functions

This section discusses more advanced features, which you may prefer to skip on the first time through this chapter.

Functions as Arguments

So far the arguments we have passed into functions have been simple objects like strings, or structured objects like lists. Python also lets us pass a function as an argument to another function. Now we can abstract out the operation, and apply a different operation on the same data. As the following examples show, we can pass the built-in function len() or a user-defined function last_letter() as arguments to another function:

>>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
...         'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
>>> def extract_property(prop):
...     return [prop(word) for word in sent]
...
>>> extract_property(len)
[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]
>>> def last_letter(word):
...     return word[-1]
...
>>> extract_property(last_letter)
['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']

The objects len and last_letter can be passed around like lists and dictionaries. Notice that parentheses are only used after a function name if we are invoking the function; when we are simply treating the function as an object, they are omitted.
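
Because functions are ordinary objects, they can also be stored in data structures. The following is a minimal sketch, assuming Python 3. The dictionary name properties and the two-argument variant of extract_property() (which takes the word list explicitly instead of reading the global sent) are illustrative choices, not part of the original text.

```python
sent = ['Take', 'care', 'of', 'the', 'sense']

def last_letter(word):
    return word[-1]

# Illustrative variant: the word list is passed in rather than read from a global.
def extract_property(prop, words):
    return [prop(word) for word in words]

# Functions stored as dictionary values; note there are no parentheses after the names.
properties = {'length': len, 'last letter': last_letter}

# Apply every stored function to the same data.
results = {name: extract_property(prop, sent) for name, prop in properties.items()}
print(results['length'])
print(results['last letter'])
```

Writing len() or last_letter() with parentheses inside the dictionary would call the function immediately instead of storing it.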

Python provides us with one more way to define functions as arguments to other functions, so-called lambda expressions. Suppose there was no need to use the above last_letter() function in multiple places, and thus no need to give it a name. We can equivalently write the following:

>>> extract_property(lambda w: w[-1])
['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']

Our next example illustrates passing a function to the sorted() function. When we call the latter with a single argument (the list to be sorted), it uses the built-in comparison function cmp(). However, we can supply our own sort function, e.g. to sort by decreasing length.

>>> sorted(sent)
[',', '.', 'Take', 'and', 'care', 'care', 'of', 'of', 'sense', 'sounds',
 'take', 'the', 'the', 'themselves', 'will']
>>> sorted(sent, cmp)
[',', '.', 'Take', 'and', 'care', 'care', 'of', 'of', 'sense', 'sounds',
 'take', 'the', 'the', 'themselves', 'will']
>>> sorted(sent, lambda x, y: cmp(len(y), len(x)))  # the sign of the value returned by cmp (-1, 0, or 1) decides the order
['themselves', 'sounds', 'sense', 'Take', 'care', 'will', 'take', 'care',
 'the', 'and', 'the', 'of', 'of', ',', '.']
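
The cmp() built-in and the positional comparison argument of sorted() exist only in Python 2. As a minimal sketch, assuming Python 3, the same decreasing-length sort can be written with the key parameter, or by wrapping an old-style comparison function with functools.cmp_to_key. The helper name by_decreasing_length is made up for this illustration.

```python
from functools import cmp_to_key

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

# Usual Python 3 style: sort on a key function, longest words first.
by_key = sorted(sent, key=len, reverse=True)

# Hypothetical helper: an old-style comparison function that returns a
# negative, zero, or positive number, like cmp(len(y), len(x)) in Python 2.
def by_decreasing_length(x, y):
    return len(y) - len(x)

by_cmp = sorted(sent, key=cmp_to_key(by_decreasing_length))

print(by_key)
print(by_key == by_cmp)
```

Both calls are stable sorts, so words of equal length keep their original relative order.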

Accumulative Functions

These functions start by initializing some storage, and iterate over input to build it up, before returning some final object (a large structure or aggregated result). A standard way to do this is to initialize an empty list, accumulate the material, then return the list, as shown in function search1() in Example 4.6.

import nltk

def search1(substring, words):
    result = []
    for word in words:
        if substring in word:
            result.append(word)
    return result

def search2(substring, words):
    for word in words:
        if substring in word:
            yield word

print "search1:"
for item in search1('zz', nltk.corpus.brown.words()):
    print item
print "search2:"
for item in search2('zz', nltk.corpus.brown.words()):
    print item

Example 4.6 (code_search_examples.py): Accumulating Output into a List

The function search2() is a generator. The first time this function is called, it gets as far as the yield statement and pauses. The calling program gets the first word and does any necessary processing. Once the calling program is ready for another word, execution of the function is continued from where it stopped, until the next time it encounters a yield statement. This approach is typically more efficient, as the function only generates the data as it is required by the calling program, and does not need to allocate additional memory to store the output (cf. our discussion of generator expressions above).
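
The pause-and-resume behaviour described above can be observed directly. This is a minimal sketch, assuming Python 3; the short word list is a hypothetical stand-in for the Brown Corpus so that the example runs without NLTK data.

```python
from itertools import islice

def search2(substring, words):
    # Same generator as in Example 4.6.
    for word in words:
        if substring in word:
            yield word

# Hypothetical stand-in for a large corpus; any iterable of strings works.
words = ['jazz', 'puzzle', 'buffalo', 'dizzy', 'fish', 'pizza', 'fuzzy']

matches = search2('zz', words)        # nothing has been searched yet
first = next(matches)                 # runs only as far as the first yield
next_two = list(islice(matches, 2))   # resumes and takes two more matches
remaining = list(matches)             # drains whatever is left

print(first)
print(next_two)
print(remaining)
```

At no point does the generator hold a full list of results; each match is produced only when the caller asks for it.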

Here's a more sophisticated example of a generator which produces all permutations of a list of words. In order to force the permutations() function to generate all its output, we wrap it with a call to list().

>>> def permutations(seq):
...     if len(seq) <= 1:
...         yield seq
...     else:
...         for perm in permutations(seq[1:]):
...             for i in range(len(perm)+1):
...                 yield perm[:i] + seq[0:1] + perm[i:]
...
>>> list(permutations(['police', 'fish', 'buffalo']))
[['police', 'fish', 'buffalo'], ['fish', 'police', 'buffalo'],
 ['fish', 'buffalo', 'police'], ['police', 'buffalo', 'fish'],
 ['buffalo', 'police', 'fish'], ['buffalo', 'fish', 'police']]

Note

The permutations function uses a technique called recursion, discussed below in Section 4.7. The ability to generate permutations of a set of words is useful for creating data to test a grammar (Chapter 8).
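
For comparison, the standard library offers itertools.permutations, which is also lazy. The following is a minimal sketch, assuming Python 3. It yields tuples rather than lists, and in a different order, so the two results are compared as sets of tuples. The alias std_permutations is chosen here only to avoid a name clash with the function from the text.

```python
from itertools import permutations as std_permutations

def permutations(seq):
    # The recursive generator from the text.
    if len(seq) <= 1:
        yield seq
    else:
        for perm in permutations(seq[1:]):
            for i in range(len(perm) + 1):
                yield perm[:i] + seq[0:1] + perm[i:]

words = ['police', 'fish', 'buffalo']
ours = list(permutations(words))
theirs = list(std_permutations(words))

# Same six orderings once our lists are converted to tuples.
same = {tuple(p) for p in ours} == set(theirs)
print(len(ours), len(theirs))
print(same)
```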

Higher-Order Functions

Python provides some higher-order functions that are standard features of functional programming languages such as Haskell. We illustrate them here, alongside the equivalent expression using list comprehensions.

Let's start by defining a function is_content_word() which checks whether a word is from the open class of content words. We use this function as the first parameter of filter(), which applies the function to each item in the sequence contained in its second parameter, and only retains the items for which the function returns True.

>>> def is_content_word(word):
...     return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']
...
>>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
...         'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
>>> filter(is_content_word, sent)
['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']
>>> [w for w in sent if is_content_word(w)]
['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']

Another higher-order function is map(), which applies a function to every item in a sequence. It is a general version of the extract_property() function we saw in Section 4.5. Here is a simple way to find the average length of a sentence in the news section of the Brown Corpus, followed by an equivalent version using a list comprehension:

>>> from __future__ import division  # Python 2: make / return a float
>>> lengths = map(len, nltk.corpus.brown.sents(categories='news'))
>>> sum(lengths) / len(lengths)
21.7508111616
>>> lengths = [len(w) for w in nltk.corpus.brown.sents(categories='news')]
>>> sum(lengths) / len(lengths)
21.7508111616

In the above examples we specified a user-defined function is_content_word() and a built-in function len(). We can also provide a lambda expression. Here's a pair of equivalent examples which count the number of vowels in each word.

>>> map(lambda w: len(filter(lambda c: c.lower() in "aeiou", w)), sent)
[2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]
>>> [len([c for c in w if c.lower() in "aeiou"]) for w in sent]
[2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]

The solutions based on list comprehensions are usually more readable than the solutions based on higher-order functions, and we have favored the former approach throughout this book.
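
The transcripts above show Python 2 behaviour, where map() and filter() return lists. In Python 3 they return lazy iterators, so displaying the results needs a list() wrapper, and len(filter(...)) no longer works. A minimal sketch of the same two computations, assuming Python 3:

```python
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def is_content_word(word):
    return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']

# filter() is lazy in Python 3; list() forces it to produce all items.
content = list(filter(is_content_word, sent))

# An iterator has no len(), so count the vowels with sum() instead.
vowel_counts = list(map(lambda w: sum(1 for c in w if c.lower() in "aeiou"), sent))

print(content)
print(vowel_counts)
```

The list comprehension versions shown earlier run unchanged in both Python 2 and Python 3, which is one more practical argument in their favor.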

Named Arguments

When there are a lot of parameters it is easy to get confused about the correct order. Instead we can refer to parameters by name, and even assign them a default value just in case one was not provided by the calling program. Now the parameters can be specified in any order, and can be omitted.

>>> def repeat(msg='<empty>', num=1):
...     return msg * num
...
>>> repeat(num=3)
'<empty><empty><empty>'
>>> repeat(msg='Alice')
'Alice'
>>> repeat(num=5, msg='Alice')
'AliceAliceAliceAliceAlice'

These are called keyword arguments. If we mix these two kinds of parameters, then we must ensure that the unnamed parameters precede the named ones. It has to be this way, since unnamed parameters are defined by position. We can define a function that takes an arbitrary number of unnamed and named parameters, and access them via an "in-place list" of arguments *args and an "in-place dictionary" of keyword arguments **kwargs. (Dictionaries will be presented in Section 5.3.)

>>> def generic(*args, **kwargs):
...     print args
...     print kwargs
...
>>> generic(1, "African swallow", monty="python")
(1, 'African swallow')
{'monty': 'python'}

When *args appears as a function parameter, it actually corresponds to all the unnamed parameters of the function. Here's another illustration of this aspect of Python syntax, for the zip() function which operates on a variable number of arguments. We'll use the variable name *song to demonstrate that there's nothing special about the name *args.

>>> song = [['four', 'calling', 'birds'],
...         ['three', 'French', 'hens'],
...         ['two', 'turtle', 'doves']]
>>> zip(song[0], song[1], song[2])
[('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]
>>> zip(*song)
[('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]

It should be clear from the above example that typing *song is just a convenient shorthand, and equivalent to typing out song[0], song[1], song[2].
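
The same shorthand works for keyword arguments: ** in a call unpacks a dictionary into named arguments. The following is a minimal sketch, assuming Python 3, where zip() returns an iterator and is therefore wrapped in list(). The function repeat_word() is a hypothetical helper, similar in spirit to repeat() above.

```python
# Hypothetical helper, similar in spirit to repeat() above.
def repeat_word(msg, num=1):
    return msg * num

args = ('Alice',)     # positional arguments held in a tuple
kwargs = {'num': 2}   # keyword arguments held in a dictionary

# * unpacks the tuple into positional arguments, ** unpacks the dict into keywords.
unpacked = repeat_word(*args, **kwargs)
explicit = repeat_word('Alice', num=2)
print(unpacked == explicit)

song = [['four', 'calling', 'birds'],
        ['three', 'French', 'hens'],
        ['two', 'turtle', 'doves']]

# zip(*song) transposes rows into columns; doing it twice restores the rows.
columns = list(zip(*song))
rows = [list(row) for row in zip(*columns)]
print(columns[0])
print(rows == song)
```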

Here's another example of the use of keyword arguments in a function definition, along with three equivalent ways to call the function:

>>> def freq_words(file, min=1, num=10):
...     text = open(file).read()
...     tokens = nltk.word_tokenize(text)
...     freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)
...     return freqdist.keys()[:num]
...
>>> fw = freq_words('ch01.rst', 4, 10)
>>> fw = freq_words('ch01.rst', min=4, num=10)
>>> fw = freq_words('ch01.rst', num=10, min=4)

A side effect of having named arguments is that they permit optionality. Thus we can leave out any arguments where we are happy with the default value: freq_words('ch01.rst', min=4), freq_words('ch01.rst', 4). Another common use of optional arguments is to permit a flag. Here's a revised version of the same function that reports its progress if a verbose flag is set:

>>> def freq_words(file, min=1, num=10, verbose=False):
...     freqdist = nltk.FreqDist()
...     if verbose: print "Opening", file
...     text = open(file).read()
...     if verbose: print "Read in %d characters" % len(text)  # length of the text, not of the file name
...     for word in nltk.word_tokenize(text):
...         if len(word) >= min:
...             freqdist.inc(word)
...             if verbose and freqdist.N() % 100 == 0: print "."
...     if verbose: print
...     return freqdist.keys()[:num]

Caution!

Take care not to use a mutable object as the default value of a parameter. A series of calls to the function will use the same object, sometimes with bizarre results, as we will see in the discussion of debugging below.
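
The reason is that a default value is evaluated once, when the function is defined, not each time it is called. The following minimal sketch, assuming Python 3, shows the pitfall and the common workaround of using None as a sentinel. The function names append_bad and append_good are invented for this illustration.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that omits bucket.
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):
    # None is immutable; a fresh list is created on each call that needs one.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first = append_bad('a')
second = append_bad('b')
print(second)            # the 'a' from the first call is still there
print(first is second)   # both names refer to the same shared list

good_first = append_good('a')
good_second = append_good('b')
print(good_first, good_second)
```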