Text Processing 1

Problem

Lab 6: String Applications

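[Objectives]

Become familiar with the common string methods; master string indexing and slicing; become proficient with regular expressions.

[Requirements]

Before the lab, review the relevant material, draft the steps you plan to follow, and make sure the objectives and requirements are clear. During the lab, follow the instructor's directions, observe the laboratory rules, and take care of the equipment. After the lab, write up the lab report (including all source code), summarize what you learned, and analyze any problems that came up.

[Lab hours and type]

2 class hours; design-oriented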

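[Tasks]

1. Suppose an English passage contains words in which a letter i in the middle of the word was mistyped as I. Write a program to correct it.
2. Cipher translation. To keep a message secret, it is often sent in cipher rather than plain text: each character is converted to another character according to a rule agreed in advance, and the recipient applies the inverse rule to recover the original. For example, map "A" -> "F", "B" -> "G", "C" -> "H", that is, replace each letter with the fifth letter after it (wrapping around at the end of the alphabet). Under this rule, "He is Beijing." becomes "Mj nx Gjnonsl.".
3. Find the first occurrence of each character in a string. Given an arbitrary string, build a new string that keeps only one copy of each repeated character, with the characters in the order of their first occurrence in the original. For example, abcdaaabe becomes abcde.
4. An English text contains a word that is repeated twice in a row. Write a program that detects the repeated word and keeps only one copy. For example, given "This is is a desk.", the program prints "This is a desk.".
5. Write a program that reads a passage of English from the user and prints every word in it that is exactly 3 letters long.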

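Text Processing

Strings

In Python, a string is an immutable ordered sequence¹. String literals are delimited by single quotes (the most common choice, perhaps because they are the easiest to type), double quotes, triple single quotes, or triple double quotes, and one kind of delimiter can be nested inside another. The following are all valid Python strings: 'abc', '123', '中国', "Python", '''Tom said,"Let's go"'''.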
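The short sketch below illustrates the points made above: the different delimiters, position-based indexing and slicing, and immutability. The variable names are chosen only for illustration and do not come from the lab text.

```python
# Four ways to delimit a string literal; each produces an ordinary str object.
single = 'abc'
double = "Python"
triple_single = '''Tom said,"Let's go"'''
triple_double = """She replied, 'OK'"""

# Strings are ordered: indexing and slicing work by position.
first_char = double[0]       # 'P'
last_three = double[-3:]     # 'hon'

# Strings are immutable: item assignment raises TypeError.
try:
    double[0] = 'J'
    mutated = True
except TypeError:
    mutated = False

# "Changing" a string really means building a new one.
renamed = 'J' + double[1:]   # 'Jython'
print(first_char, last_three, mutated, renamed)
```

Because item assignment is rejected, every solution in this lab builds a new string rather than editing the original.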

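Code Implementation

Problem 1

# Suppose an English passage contains words in which a letter i in the middle was mistyped as I; write a program to correct it.
text='LIli is my good friend!'
result=''
for index,ch in enumerate(text):
    # An 'I' is inside a word only if it has a letter on both sides. The range check
    # keeps index-1 from wrapping around to the end of the string and index+1 from
    # running past the end.
    if ch=='I' and 0<index<len(text)-1 and text[index-1].isalpha() and text[index+1].isalpha():
        ch='i'
    result+=ch
print(result)

Output:
Lili is my good friend!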

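Code analysis
- A string is an immutable ordered sequence, so characters cannot be replaced in place; the corrected text is therefore built up in a new string variable, result.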
- enumerate(iterable, start=0) ²
  - Return an enumerate object. iterable must be a sequence, an iterator, or some other object which supports iteration. The next() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over iterable.
  - example
```
>>> seasons = ['Spring', 'Summer', 'Fall', 'Winter']
>>> list(enumerate(seasons))
[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
>>> list(enumerate(seasons, start=1))
[(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]
```
- str.isalpha()
  - Return True if all characters in the string are alphabetic and there is at least one character, False otherwise. Alphabetic characters are those defined as "Letter" in the Unicode character database, so non-ASCII letters such as '中' also count.
  - example
```
>>> 'ABCabc'.isalpha()
True
>>> 'ABCabc1'.isalpha()
False
```
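- Python supports joining strings with +, but each + builds a new string and copies the data accumulated so far, so it is inefficient for concatenating a large number of strings.
  - The code below shows the speed difference between the + operator and the string method join(): it concatenates 10,000 strings with join() and with +, repeats each run 1,000 times, and prints the total time each approach took.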
```
import timeit
strlist=['This is a long string that will not keep in memory.' for n in range(10000)]
def use_join():
    return ''.join(strlist)
def use_plus():
    result=''
    for strtemp in strlist:
        result=result+strtemp
    return result
if __name__== '__main__':
    times=1000
    jointimer=timeit.Timer('use_join()','from __main__ import use_join')
    print('time for join:',jointimer.timeit(number=times))
    plustimer=timeit.Timer('use_plus()','from __main__ import use_plus')
    print('time for plus:',plustimer.timeit(number=times))
```
- Output (in seconds; the exact values depend on the machine)
  - time for join: 0.4300719999999999
  - time for plus: 2.4646118
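Since the lab objectives also mention regular expressions, Problem 1 can be sketched with re.sub() as an alternative to the character loop. This is only one possible approach, not the textbook's solution; the function name fix_inner_capital_i and the sample sentence are made up for illustration, and the pattern assumes that only ASCII letters count as word characters.

```python
import re

def fix_inner_capital_i(text):
    """Replace a capital I that has a letter on both sides with a lowercase i."""
    # (?<=[A-Za-z]) : the character before I must be a letter (lookbehind)
    # (?=[A-Za-z])  : the character after I must be a letter (lookahead)
    # Lookarounds do not consume characters, so the neighbours are left untouched.
    return re.sub(r'(?<=[A-Za-z])I(?=[A-Za-z])', 'i', text)

print(fix_inner_capital_i('LIli Is here, I thInk.'))
```

Because the lookarounds fail at the start and end of the string, no explicit index check is needed, and an I at the start of a word or standing alone is left unchanged.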

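Problem 2

# Cipher translation. To keep a message secret, it is sent in cipher rather than plain text:
# each character is converted to another according to a rule agreed in advance,
# and the recipient applies the inverse rule to recover the original.
# For example, map "A"->"F", "B"->"G", "C"->"H",
# i.e. replace each letter with the fifth letter after it.
# For example, "He is Beijing." becomes "Mj nx Gjnonsl."
text='He is Beijing.'
result=''
for ch in text:
    if 'a'<=ch<='z':
        if ch>'u':
            # v..z wrap around to a..e; str has no arithmetic, so go through ord()/chr()
            ch=chr(ord('a')+ord(ch)-ord('v'))
        else:
            ch=chr(ord(ch)+5)
    elif 'A'<=ch<='Z':
        if ch>'U':
            # V..Z wrap around to A..E
            ch=chr(ord('A')+ord(ch)-ord('V'))
        else:
            ch=chr(ord(ch)+5)
    # anything that is not an ASCII letter is copied unchanged
    result+=ch
print(result)

Output:
Mj nx Gjnonsl.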

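Code analysis:
- Conceptually, adding to or subtracting from a character means adding to or subtracting from its ASCII code. Python does not allow arithmetic between a string and a number (or between two strings), so the character is first converted to its numeric code with ord(), the integer arithmetic is done, and the result is converted back to a character with chr().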
- ord(c)
  - Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
- chr(i)
  - Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().
  - The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16). ValueError will be raised if i is outside that range.
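- Note: in this loop each converted character is appended to result with the + operator. join() takes an iterable of strings, so it would only apply if the characters were first collected in a list.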
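The same shift can also be expressed with the built-in str.maketrans() and str.translate(), which avoids the explicit ord()/chr() arithmetic and makes decoding a matter of shifting back by the same amount. The sketch below is an alternative under the assumption that only ASCII letters are shifted; make_shift_table is a hypothetical helper name, not part of the original solution.

```python
import string

def make_shift_table(shift):
    """Build a translation table that rotates ASCII letters by `shift` places."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = shift % 26  # the modulo also makes negative shifts work
    return str.maketrans(lower + upper,
                         lower[k:] + lower[:k] + upper[k:] + upper[:k])

plain = 'He is Beijing.'
cipher = plain.translate(make_shift_table(5))     # sender: shift forward by 5
decoded = cipher.translate(make_shift_table(-5))  # receiver: shift back by 5
print(cipher)
print(decoded)
```

Characters that do not appear in the table (spaces, punctuation, digits) pass through translate() unchanged, which matches the behaviour of the loop version.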

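Problem 3

# Find the first occurrence of each character. Given an arbitrary string,
# build a new string that keeps only one copy of each repeated character,
# with the characters in the order of their first occurrence in the original.
# For example, abcdaaabe becomes abcde.
text='abcdaaabe'
result=''
for ch in text:
    if ch not in result:
        result+=ch
print(result)

Output:
abcde

Code analysis:
- not in: the membership operators in and not in test whether one string occurs inside another; here ch not in result is True only the first time a character is encountered.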
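A more compact variant relies on the fact that dictionaries preserve insertion order (guaranteed since Python 3.7). This is an alternative sketch, not the solution from the lab; the name unique_chars is chosen for illustration.

```python
text = 'abcdaaabe'

# dict.fromkeys() keeps one key per distinct character, in first-seen order;
# joining the keys rebuilds the de-duplicated string.
unique_chars = ''.join(dict.fromkeys(text))
print(unique_chars)
```

Unlike the loop version, which rescans result for every character, the dictionary lookup takes roughly constant time per character, which matters only for long inputs.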

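Problem 4

# An English text contains a word that is repeated twice in a row; write a program that detects the repeated word and keeps only one copy.
# For example, given "This is is a desk.", the program prints "This is a desk.".
text='This is is a desk.'
result=''
words=text.split(' ')
for i in range(0,len(words)-1):
    # keep a word only if the word after it is different
    if words[i]!=words[i+1]:
        result+=words[i]+' '
# the last word has no successor, so it is always kept
result+=words[len(words)-1]
print(result)

Output:
This is a desk.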

Code analysis:
- str.split(sep=None, maxsplit=-1)
  - Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or is -1, then there is no limit on the number of splits (all possible splits are made).
  - If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']). The sep argument may consist of multiple characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a specified separator returns [''].
  - If sep is not specified or is None, runs of consecutive whitespace are regarded as a single separator, and the result contains no empty strings at the start or end.
  - example
```
>>> '1,2,3'.split(',')
['1', '2', '3']
>>> '1,2,3'.split(',', maxsplit=1)
['1', '2,3']
>>> '1,2,,3,'.split(',')
['1', '2', '', '3', '']
```
- str.strip([chars])
  - Return a copy of the string with the leading and trailing characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
  - example
```
>>> '   spacious   '.strip()
'spacious'
>>> 'www.example.com'.strip('cmowz.')
'example'
```
- len(s)
  - Return the length (the number of items) of an object. The argument may be a sequence (such as a string, bytes, tuple, list, or range) or a collection (such as a dictionary, set, or frozen set).
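Problem 4 can also be approached with a regular expression that uses a backreference. The sketch below is one possible formulation rather than the lab's reference answer; drop_repeated_words is a hypothetical name, and the pattern assumes that words are runs of \w characters separated by whitespace.

```python
import re

def drop_repeated_words(text):
    """Collapse a word that is immediately repeated into a single copy."""
    # \b(\w+)     : a whole word, captured as group 1
    # (\s+\1\b)+  : one or more further copies of that same word,
    #               each preceded by whitespace and ending at a word boundary
    return re.sub(r'\b(\w+)(\s+\1\b)+', r'\1', text)

print(drop_repeated_words('This is is a desk.'))
```

The word boundaries keep the pattern from treating the "is" inside "This" or at the start of "island" as a repetition, and the trailing + also handles a word repeated more than twice.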

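Problem 5

# Write a program that reads a passage of English from the user and prints every word that is exactly 3 letters long.
text=input('Enter some English text: ')
result=[]
ss=''
for ch in text:
    # replace every non-letter with a space so punctuation does not stick to words
    if not ch.isalpha():
        ch=' '
    ss+=ch
text=ss.split()
for s in text:
    if len(s)==3:
        result.append(s)
print(result)

Output:
Enter some English text: Tao tao is my wife.
['Tao', 'tao']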

Code analysis:
- list.append(x)
  - Add an item to the end of the list. Equivalent to a[len(a):] = [x].
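For comparison, the three-letter words can be extracted in one step with re.findall(). This is a sketch under the assumption that a word is a maximal run of ASCII letters (the loop version uses isalpha(), which also accepts non-ASCII letters); three_letter_words is a name made up for this example.

```python
import re

def three_letter_words(text):
    """Return every run of exactly three ASCII letters in `text`."""
    # (?<![A-Za-z]) : no letter immediately before the match
    # [A-Za-z]{3}   : exactly three letters
    # (?![A-Za-z])  : no letter immediately after the match
    return re.findall(r'(?<![A-Za-z])[A-Za-z]{3}(?![A-Za-z])', text)

print(three_letter_words('Tao tao is my wife.'))
```

The negative lookarounds are used instead of \b so that digits and underscores act as separators, which mirrors how the loop version replaces every non-letter with a space.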

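References

[1] Dong Fuguo. Python程序设计基础 (Fundamentals of Python Programming), 2nd ed. Beijing: Tsinghua University Press, 2018 (reprinted June 2020).

[2] Python 3.7.3 documentation (the copy bundled with IDLE for Python 3.7.3).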
