Natural Language Processing with Python

Steven Bird, Ewan Klein, and Edward Loper

Table of Contents

Preface

1. Language Processing and Python

1.1 Computing with Language: Texts and Words
1.2 A Closer Look at Python: Texts as Lists of Words
1.3 Computing with Language: Simple Statistics
1.4 Back to Python: Making Decisions and Taking Control
1.5 Automatic Natural Language Understanding
1.6 Summary
1.7 Further Reading
1.8 Exercises

2. Accessing Text Corpora and Lexical Resources

2.1 Accessing Text Corpora
2.2 Conditional Frequency Distributions
2.3 More Python: Reusing Code
2.4 Lexical Resources
2.5 WordNet
2.6 Summary
2.7 Further Reading
2.8 Exercises

3. Processing Raw Text

3.1 Accessing Text from the Web and from Disk
3.2 Strings: Text Processing at the Lowest Level
3.3 Text Processing with Unicode
3.4 Regular Expressions for Detecting Word Patterns
3.5 Useful Applications of Regular Expressions
3.6 Normalizing Text
3.7 Regular Expressions for Tokenizing Text
3.8 Segmentation
3.9 Formatting: From Lists to Strings
3.10 Summary
3.11 Further Reading
3.12 Exercises

4. Writing Structured Programs

4.1 Back to the Basics
4.2 Sequences
4.3 Questions of Style
4.4 Functions: The Foundation of Structured Programming
4.5 Doing More with Functions
4.6 Program Development
4.7 Algorithm Design
4.8 A Sample of Python Libraries
4.9 Summary
4.10 Further Reading
4.11 Exercises

5. Categorizing and Tagging Words

5.1 Using a Tagger
5.2 Tagged Corpora
5.3 Mapping Words to Properties Using Python Dictionaries
5.4 Automatic Tagging
5.5 N-Gram Tagging
5.6 Transformation-Based Tagging
5.7 How to Determine the Category of a Word
5.8 Summary
5.9 Further Reading
5.10 Exercises

6. Learning to Classify Text

6.1 Supervised Classification
6.2 Further Examples of Supervised Classification
6.3 Evaluation
6.4 Decision Trees
6.5 Naive Bayes Classifiers
6.6 Maximum Entropy Classifiers
6.7 Modeling Linguistic Patterns
6.8 Summary
6.9 Further Reading
6.10 Exercises

7. Extracting Information from Text

7.1 Information Extraction
7.2 Chunking
7.3 Developing and Evaluating Chunkers
7.4 Recursion in Linguistic Structure
7.5 Named Entity Recognition
7.6 Relation Extraction
7.7 Summary
7.8 Further Reading
7.9 Exercises

8. Analyzing Sentence Structure

8.1 Some Grammatical Dilemmas
8.2 What’s the Use of Syntax?
8.3 Context-Free Grammar
8.4 Parsing with Context-Free Grammar
8.5 Dependencies and Dependency Grammar
8.6 Grammar Development
8.7 Summary
8.8 Further Reading
8.9 Exercises

9. Building Feature-Based Grammars

9.1 Grammatical Features
9.2 Processing Feature Structures
9.3 Extending a Feature-Based Grammar
9.4 Summary
9.5 Further Reading
9.6 Exercises

10. Analyzing the Meaning of Sentences

10.1 Natural Language Understanding
10.2 Propositional Logic
10.3 First-Order Logic
10.4 The Semantics of English Sentences
10.5 Discourse Semantics
10.6 Summary
10.7 Further Reading
10.8 Exercises

11. Managing Linguistic Data

11.1 Corpus Structure: A Case Study
11.2 The Life Cycle of a Corpus
11.3 Acquiring Data
11.4 Working with XML
11.5 Working with Toolbox Data
11.6 Describing Language Resources Using OLAC Metadata
11.7 Summary
11.8 Further Reading
11.9 Exercises

Afterword: The Language Challenge
Bibliography
NLTK Index
General Index
