[December DW Check-in] joyful-pandas

Task08: Text Data

Original link

joyful-pandas, Chapter 8: Text Data https://datawhalechina.github.io/joyful-pandas/build/html/目录/ch8.html

import pandas as pd
import numpy as np
import datetime as datetime

file_prefix = 'E:\\PycharmProjects\\DatawhaleChina\\joyful-pandas\\data\\'
print(file_prefix)

E:\PycharmProjects\DatawhaleChina\joyful-pandas\data\

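Testing some of the code examples:

3.4. Replacement

str.replace and replace are not the same function: str.replace substitutes substrings inside each element, while replace substitutes whole values. Use the former for string replacement.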

s = pd.Series(['a_1_b','c_?'])
# Replace every digit or '?' with 'new'
s.str.replace(r'\d|\?', 'new', regex=True)

0    a_new_b
1      c_new
dtype: object
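To make the difference between the two functions concrete, here is a minimal sketch. The third element, a bare '?', is an assumption added for illustration and is not part of the original example.

```python
import pandas as pd

s = pd.Series(['a_1_b', 'c_?', '?'])

# Series.replace compares whole elements by default, so only the
# element that is exactly '?' is changed.
whole = s.replace('?', 'new')

# Series.str.replace works on substrings inside each element.
# regex=False treats '?' as a literal character.
partial = s.str.replace('?', 'new', regex=False)

print(whole.tolist())    # ['a_1_b', 'c_?', 'new']
print(partial.tolist())  # ['a_1_b', 'c_new', 'new']
```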

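When different parts need to be replaced differently, subgroups can be used, and a custom replacement function can be passed in to handle each part separately. Note that group(k) is the k-th matched subgroup (the content between a pair of parentheses):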

s = pd.Series(['上海市黄浦区方浜中路249号',
              '上海市宝山区密山路5号',
               '北京市昌平区北农路2号'])

pat = r'(\w+市)(\w+区)(\w+路)(\d+号)'

city = {'上海市': 'Shanghai', '北京市': 'Beijing'}
district = {'昌平区': 'CP District',
             '黄浦区': 'HP District',
             '宝山区': 'BS District'}
road = {'方浜中路': 'Mid Fangbin Road',
          '密山路': 'Mishan Road',
         '北农路': 'Beinong Road'}

# Translate each captured group (the content in parentheses) to English and return the assembled string
def my_func(m):
    print(m.group())  # the full matched text
    str_city = city[m.group(1)]
    str_district = district[m.group(2)]
    str_road = road[m.group(3)]
    str_no = 'No. ' + m.group(4)[:-1]  # drop the trailing "号"
    return ''.join([str_city, str_district, str_road, str_no])

s.str.replace(pat, my_func, regex=True)

上海市黄浦区方浜中路249号
上海市宝山区密山路5号
北京市昌平区北农路2号

0    ShanghaiHP DistrictMid Fangbin RoadNo. 249
1           ShanghaiBS DistrictMishan RoadNo. 5
2           BeijingCP DistrictBeinong RoadNo. 2
dtype: object
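The replacement function above raises a KeyError when a city is missing from the dictionary. One possible variation, sketched here under assumptions, reads the subgroups by name and falls back to the original text. The pattern, the function name translate_city and the '广州市天河区' row are all hypothetical.

```python
import pandas as pd

city = {'上海市': 'Shanghai', '北京市': 'Beijing'}

# Named subgroups: 'city' captures up to and including 市, 'rest' the remainder.
pat = r'(?P<city>\w+市)(?P<rest>.*)'

def translate_city(m):
    # m is a re.Match object, so named groups can be read by name.
    name = m.group('city')
    # dict.get falls back to the original text for unknown cities.
    return city.get(name, name) + ' ' + m.group('rest')

s = pd.Series(['上海市黄浦区', '广州市天河区'])
result = s.str.replace(pat, translate_city, regex=True)
print(result.tolist())  # ['Shanghai 黄浦区', '广州市 天河区']
```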

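3.5. Extraction

Extraction can be viewed either as a matching operation that returns the matched elements themselves (rather than Booleans or the index positions of the elements), or as a special kind of splitting. The earlier str.split example removes the delimiters, which is not what the user wants there; in that case str.extract can be used instead: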

# Extract without naming the columns
s = pd.Series(['上海市黄浦区方浜中路249号',
              '上海市宝山区密山路5号',
               '北京市昌平区北农路2号'])
pattern = r'(\w+市)(\w+区)(\w+路)(\w+号)'
s.str.extract(pattern)

	0	1	2	3
0	上海市	黄浦区	方浜中路	249号
1	上海市	宝山区	密山路	5号
2	北京市	昌平区	北农路	2号

By naming the subgroups with ?P<name>, the columns of the resulting DataFrame can be named directly:

# Extract with named columns
pattern = r'(?P<市名>\w+市)(?P<区名>\w+区)(?P<路名>\w+路)(?P<号名>\w+号)'
s.str.extract(pattern)

	市名	区名	路名	号名
0	上海市	黄浦区	方浜中路	249号
1	上海市	宝山区	密山路	5号
2	北京市	昌平区	北农路	2号

Unlike str.extract, which matches only once, str.extractall returns every match of the pattern; when there are multiple results, they are stored with a MultiIndex:

s = pd.Series(['A135T15,A26S','B674S2,B25T6'], index = ['my_A','my_B'])

pattern = r'[A|B](\d*)[T|S](\d*)'
s.str.extractall(pattern)

		0	1
	match
my_A	0	135	15
my_A	1	26	NaN
my_B	0	674	2
my_B	1	25	6
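The contrast between the two methods can be sketched on the same Series. The simplified pattern below is an assumption for illustration: it keeps only the first number and writes the character classes as [AB] and [TS], since a | inside square brackets is matched literally rather than acting as alternation.

```python
import pandas as pd

s = pd.Series(['A135T15,A26S', 'B674S2,B25T6'], index=['my_A', 'my_B'])
pattern = r'[AB](\d+)[TS]'

first = s.str.extract(pattern)     # one row per element (first match only)
every = s.str.extractall(pattern)  # one row per match, with a 'match' index level

print(first[0].tolist())             # ['135', '674']
print(every[0].tolist())             # ['135', '26', '674', '25']
print(every.xs('my_A')[0].tolist())  # ['135', '26']
```

Selecting with xs on the outer index level gives all matches that belong to one original element.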

5. Exercises

Ex1: Housing information dataset

A housing information dataset is given as follows:

df = pd.read_excel(file_prefix + 'house_info.xls', usecols=['floor','year','area','price'])
df.head(3)

	floor	year	area	price
0	高层（共6层）	1986年建	58.23㎡	155万
1	中层（共20层）	2020年建	88㎡	155万
2	低层（共28层）	2010年建	89.33㎡	365万

Convert the year column to integer years.

# Strip the trailing "年建", then convert to a nullable integer dtype
df['year'] = pd.to_numeric(df['year'].str[:-2]).astype('Int64')

df['year'].head()

0    1986
1    2020
2    2010
3    2014
4    2015
Name: year, dtype: Int64

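Replace the floor column with two columns, Level and Highest, holding the floor category as a string (高层 high, 中层 middle, 低层 low) and the highest floor number as an integer, respectively.

# Example value: 高层（共6层）
pattern = r'(?P<Level>\w?层)（共(?P<Highest>\w+层)）'
df['floor'].str.extract(pattern)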

	Level	Highest
0	高层	6层
1	中层	20层
2	低层	28层
3	低层	20层
4	高层	1层
...	...	...
31563	中层	39层
31564	高层	54层
31565	高层	16层
31566	高层	62层
31567	低层	22层

31568 rows × 2 columns
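The extraction above leaves Highest as a string such as 6层 and does not yet replace the floor column. One way the remaining steps might look is sketched below on the three sample rows shown earlier rather than the full file. Moving 层 outside the subgroup and using pd.to_numeric with the Int64 dtype are choices made for this sketch.

```python
import pandas as pd

# Three sample rows copied from df.head(3) above.
df = pd.DataFrame({'floor': ['高层（共6层）', '中层（共20层）', '低层（共28层）'],
                   'price': ['155万', '155万', '365万']})

# Capture only the digits for Highest; 层 stays outside the subgroup.
pattern = r'(?P<Level>\w层)（共(?P<Highest>\d+)层）'
new_cols = df['floor'].str.extract(pattern)

# Nullable integers tolerate rows that fail to match (NaN).
new_cols['Highest'] = pd.to_numeric(new_cols['Highest']).astype('Int64')

# Drop the old column and attach the two new ones.
df = pd.concat([df.drop(columns='floor'), new_cols], axis=1)
print(df)
```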

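Compute the average price per square meter, avg_price, and store it in the table in the format ***元/平米 (yuan per square meter), where *** is an integer.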

df_price = pd.to_numeric(df['price'].str[:-1])
df_price

0          155.0
1          155.0
2          365.0
3          308.0
4          117.0
          ...   
31563    10000.0
31564     2600.0
31565     2500.0
31566     3500.0
31567     2300.0
Name: price, Length: 31568, dtype: float64

df_area = pd.to_numeric(df['area'].str[:-1])
df_area

0         58.23
1         88.00
2         89.33
3         82.00
4         98.00
          ...  
31563    391.13
31564    283.00
31565    245.00
31566    284.00
31567    224.00
Name: area, Length: 31568, dtype: float64

avg_price = df_price * 10000 / df_area
# Keep only the integer part (the `***`), then convert back to a string
avg_price = avg_price.astype('string').str.split('.', n=1).apply(lambda x: int(x[0]))
avg_price = avg_price.astype('string') + '元/平米'
# Store the result in the table, as the task asks
df['avg_price'] = avg_price
avg_price

0         26618元/平米
1         17613元/平米
2         40859元/平米
3         37560元/平米
4         11938元/平米
            ...    
31563    255669元/平米
31564     91872元/平米
31565    102040元/平米
31566    123239元/平米
31567    102678元/平米
Length: 31568, dtype: string
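A shorter route to the same integer part, sketched on the three sample rows shown earlier, is to cast the quotient directly with astype(int). This assumes there are no missing values, since astype(int) would raise on NaN.

```python
import pandas as pd

# Three sample rows copied from df.head(3) above.
df = pd.DataFrame({'area': ['58.23㎡', '88㎡', '89.33㎡'],
                   'price': ['155万', '155万', '365万']})

price = pd.to_numeric(df['price'].str[:-1]) * 10000  # 万 means 10,000 yuan
area = pd.to_numeric(df['area'].str[:-1])            # strip the trailing ㎡

# astype(int) truncates toward zero, i.e. keeps the integer part
# of these positive values.
df['avg_price'] = (price / area).astype(int).astype(str) + '元/平米'
print(df['avg_price'].tolist())
# ['26618元/平米', '17613元/平米', '40859元/平米']
```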

Ex2: Game of Thrones script dataset

A Game of Thrones script dataset is given as follows:

df = pd.read_csv(file_prefix + 'script.csv')
df.head(3)

	Release Date	Season	Episode	Episode Title	Name	Sentence
0	2011-04-17	Season 1	Episode 1	Winter is Coming	waymar royce	What do you expect? They're savages. One lot s...
1	2011-04-17	Season 1	Episode 1	Winter is Coming	will	I've never seen wildlings do a thing like this...
2	2011-04-17	Season 1	Episode 1	Winter is Coming	waymar royce	How close did you get?

Count the number of lines of dialogue in each Episode.

df.rename(columns={'Episode ': 'Episode', ' Season': 'Season'}, inplace=True)
df.columns

Index(['Release Date', 'Season', 'Episode', 'Episode Title', 'Name',
       'Sentence', 'word_count'],
      dtype='object')

df.groupby(['Season', 'Episode'])[['Sentence']].count()

		Sentence
Season	Episode
Season 1	Episode 1	327
	Episode 10	266
	Episode 2	283
	Episode 3	353
	Episode 4	404
...	...	...
Season 8	Episode 2	405
	Episode 3	155
	Episode 4	51
	Episode 5	308
	Episode 6	240

73 rows × 1 columns

Using a space as the word separator, find the five people with the highest average number of words per line.

df['word_count'] = df['Sentence'].str.split(' ').str.len()
df['word_count']

0        25
1        21
2         5
3         5
4         7
         ..
23906    12
23907     7
23908    11
23909     5
23910    25
Name: word_count, Length: 23911, dtype: int64

df.groupby('Name')[['word_count']].mean().sort_values(by='word_count', ascending=False)[:5]

	word_count
Name
male singer	109.000000
slave owner	77.000000
manderly	62.000000
lollys stokeworth	62.000000
dothraki matron	56.666667
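Because the task fixes a single space as the separator, consecutive spaces produce empty "words" that are still counted. The sketch below shows how this differs from splitting on any run of whitespace; the second sample string is made up for illustration.

```python
import pandas as pd

s = pd.Series(['How close did you get?', 'two  spaces'])

# Splitting on a literal space keeps the empty string between two spaces.
by_space = s.str.split(' ').str.len()

# Splitting with no pattern collapses any run of whitespace.
by_whitespace = s.str.split().str.len()

print(by_space.tolist())       # [5, 3]
print(by_whitespace.tolist())  # [5, 2]
```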

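If a person's line contains question marks, the next speaker is the answerer. If the previous speaker's line contains n question marks, the answerer is considered to have answered n questions. Find the five people who answered the most questions.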

single_ans_num = pd.Series(df.Sentence.values, index=df.Name.shift(-1))
single_ans_num

Name
will                What do you expect? They're savages. One lot s...
waymar royce        I've never seen wildlings do a thing like this...
will                                           How close did you get?
gared                                         Close as any man would.
royce                                We should head back to the wall.
                                          ...                        
bronn               I think we can all agree that ships take prece...
tyrion lannister        I think that's a very presumptuous statement.
man                 I once brought a jackass and a honeycomb into ...
all                                           The Queen in the North!
NaN                 The Queen in the North! The Queen in the North...
Length: 23911, dtype: object

single_ans_num.str.count(r'\?').groupby('Name').sum().sort_values(ascending=False).head()

Name
tyrion lannister    527
jon snow            374
jaime lannister     283
arya stark          265
cersei lannister    246
dtype: int64
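The shifting idea can be checked by hand on a toy script. This sketch uses an equivalent formulation that shifts the question counts down by one row instead of shifting the names up; the four lines of dialogue are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['a', 'b', 'a', 'c'],
    'Sentence': ['Who? Why?', 'No idea.', 'Really?', 'Yes.'],
})

# str.count takes a regular expression, so the question mark is escaped.
n_questions = df['Sentence'].str.count(r'\?')

# Shift down one row: each row now holds the number of questions
# asked in the previous line; the first row has no predecessor.
answered = n_questions.shift(1, fill_value=0)

result = answered.groupby(df['Name']).sum().sort_values(ascending=False)
print(result.to_dict())  # {'b': 2, 'c': 1, 'a': 0}
```

Here b answers the two questions asked by a, c answers one, and a answers none.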