Pandas CookBook -- 05布尔索引

简书大神SeanCheney的译作，我作了些格式调整和文章目录结构的变化，更适合自己阅读，以后翻阅是更加方便自己查找吧

import pandas as pd
import numpy as np

设定最大列数和最大行数

pd.set_option('max_columns',5 , 'max_rows', 5)

1 布尔值统计信息

movie = pd.read_csv('data/movie.csv', index_col='movie_title')

movie.head()

	color	director_name	...	aspect_ratio	movie_facebook_likes
movie_title
Avatar	Color	James Cameron	...	1.78	33000
Pirates of the Caribbean: At World's End	Color	Gore Verbinski	...	2.35	0
Spectre	Color	Sam Mendes	...	2.35	85000
The Dark Knight Rises	Color	Christopher Nolan	...	2.35	164000
Star Wars: Episode VII - The Force Awakens	NaN	Doug Walker	...	NaN	0

5 rows × 27 columns

1.1 基础方法

判断电影时长是否超过两小时

movie_2_hours = movie['duration'] > 120
movie_2_hours.head(10)

movie_title
Avatar                                      True
Pirates of the Caribbean: At World's End    True
                                            ... 
Avengers: Age of Ultron                     True
Harry Potter and the Half-Blood Prince      True
Name: duration, Length: 10, dtype: bool

有多少时长超过两小时的电影

movie_2_hours.sum()

超过两小时的电影的比例

movie_2_hours.mean()

0.2113506916192026

实际上，dureation这列是有缺失值的，要想获得真正的超过两小时的电影的比例，需要先删掉缺失值

movie['duration'].dropna().gt(120).mean()

0.21199755152009794

1.2 统计信息

用describe()输出一些该布尔Series信息

movie_2_hours.describe()

count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object

统计False和True值的比例

 movie_2_hours.value_counts(normalize=True)

False    0.788649
True     0.211351
Name: duration, dtype: float64

2 布尔索引

2.1 布尔条件

在Pandas中，位运算符（&, |, ~）的优先级高于比较运算符

2.1.1 创建多个布尔条件

criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == 'PG-13'
criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)

criteria3.head()

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
Name: title_year, dtype: bool

2.1.2 将这些布尔条件合并成一个

criteria_final = criteria1 & criteria2 & criteria3

criteria_final.head()

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

2.2 布尔过滤

创建第一个布尔条件

 crit_a1 = movie.imdb_score > 8
 crit_a2 = movie.content_rating == 'PG-13'
 crit_a3 = (movie.title_year < 2000) | (movie.title_year > 2009)
 final_crit_a = crit_a1 & crit_a2 & crit_a3

创建第二个布尔条件

crit_b1 = movie.imdb_score < 5
crit_b2 = movie.content_rating == 'R'
crit_b3 = (movie.title_year >= 2000) & (movie.title_year <= 2010)
final_crit_b = crit_b1 & crit_b2 & crit_b3

合并布尔条件

final_crit_all = final_crit_a | final_crit_b

final_crit_all.head()

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

过滤数据

movie[final_crit_all].head()

	color	director_name	...	aspect_ratio	movie_facebook_likes
movie_title
The Dark Knight Rises	Color	Christopher Nolan	...	2.35	164000
The Avengers	Color	Joss Whedon	...	1.85	123000
Captain America: Civil War	Color	Anthony Russo	...	2.35	72000
Guardians of the Galaxy	Color	James Gunn	...	2.35	96000
Interstellar	Color	Christopher Nolan	...	2.35	349000

5 rows × 27 columns

验证过滤

cols = ['imdb_score', 'content_rating', 'title_year']
movie_filtered = movie.loc[final_crit_all, cols]
movie_filtered.head(10)

	imdb_score	content_rating	title_year
movie_title
The Dark Knight Rises	8.5	PG-13	2012.0
The Avengers	8.1	PG-13	2012.0
...	...	...	...
Sex and the City 2	4.3	R	2010.0
Rollerball	3.0	R	2002.0

10 rows × 3 columns

2.3 与标签索引对比

college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')

2.3.1 单个标签

college2中STABBR作为行索引，用loc选取

college2.loc['TX'].head()

	INSTNM	CITY	...	MD_EARN_WNE_P10	GRAD_DEBT_MDN_SUPP
STABBR
TX	Abilene Christian University	Abilene	...	40200	25985
TX	Alvin Community College	Alvin	...	34500	6750
TX	Amarillo College	Amarillo	...	31700	10950
TX	Angelina College	Lufkin	...	26900	PrivacySuppressed
TX	Angelo State University	San Angelo	...	37700	21319.5

5 rows × 26 columns

college中，用布尔索引选取所有得克萨斯州的学校

college[college['STABBR'] == 'TX'].head()

	INSTNM	CITY	...	MD_EARN_WNE_P10	GRAD_DEBT_MDN_SUPP
3610	Abilene Christian University	Abilene	...	40200	25985
3611	Alvin Community College	Alvin	...	34500	6750
3612	Amarillo College	Amarillo	...	31700	10950
3613	Angelina College	Lufkin	...	26900	PrivacySuppressed
3614	Angelo State University	San Angelo	...	37700	21319.5

5 rows × 27 columns

比较二者的速度

法一

%timeit college[college['STABBR'] == 'TX']

937 µs ± 58.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

法二

%timeit college2.loc['TX']

520 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit college2 = college.set_index('STABBR')

2.11 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2.3.2 多个标签

布尔索引和标签选取多列

states =['TX', 'CA', 'NY']

college[college['STABBR'].isin(states)]

	INSTNM	CITY	...	MD_EARN_WNE_P10	GRAD_DEBT_MDN_SUPP
192	Academy of Art University	San Francisco	...	36000	35093
193	ITT Technical Institute-Rancho Cordova	Rancho Cordova	...	38800	25827.5
...	...	...	...	...	...
7533	Bay Area Medical Academy - San Jose Satellite ...	San Jose	...	NaN	PrivacySuppressed
7534	Excel Learning Center-San Antonio South	San Antonio	...	NaN	12125

1704 rows × 27 columns

college2.loc[states].head()

	INSTNM	CITY	...	MD_EARN_WNE_P10	GRAD_DEBT_MDN_SUPP
STABBR
TX	Abilene Christian University	Abilene	...	40200	25985
TX	Alvin Community College	Alvin	...	34500	6750
TX	Amarillo College	Amarillo	...	31700	10950
TX	Angelina College	Lufkin	...	26900	PrivacySuppressed
TX	Angelo State University	San Angelo	...	37700	21319.5

5 rows × 26 columns

3 查询方法

使用查询方法提高布尔索引的可读性

# 读取employee数据，确定选取的部门和列

employee = pd.read_csv('data/employee.csv')
depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)']
select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']

# 创建查询字符串，并执行query方法

qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"

emp_filtered = employee.query(qs)
emp_filtered[select_columns].head()

	UNIQUE_ID	DEPARTMENT	GENDER	BASE_SALARY
61	61	Houston Fire Department (HFD)	Female	96668.0
136	136	Houston Police Department-HPD	Female	81239.0
367	367	Houston Police Department-HPD	Female	86534.0
474	474	Houston Police Department-HPD	Female	91181.0
513	513	Houston Police Department-HPD	Female	81239.0

4 唯一和有序索引

4.1 单列索引

college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')

college2.index.is_monotonic

False

将college2排序，存储成另一个对象，查看其是否有序

college3 = college2.sort_index()
college3.index.is_monotonic

True

使用INSTNM作为行索引，检测行索引是否唯一

college_unique = college.set_index('INSTNM')

college_unique.index.is_unique

True

4.2 拼装索引

使用CITY和STABBR两列作为行索引，并进行排序

college.index = college['CITY'] + ', ' + college['STABBR']

college = college.sort_index()

college.head()

	INSTNM	CITY	...	MD_EARN_WNE_P10	GRAD_DEBT_MDN_SUPP
ARTESIA, CA	Angeles Institute	ARTESIA	...	NaN	16850
Aberdeen, SD	Presentation College	Aberdeen	...	35900	25000
Aberdeen, SD	Northern State University	Aberdeen	...	33600	24847
Aberdeen, WA	Grays Harbor College	Aberdeen	...	27000	11490
Abilene, TX	Hardin-Simmons University	Abilene	...	38700	25864

5 rows × 27 columns

college.index.is_unique

False

选取所有Miami, FL的大学

法一

college.loc['Miami, FL'].head()

	INSTNM	CITY	...	MD_EARN_WNE_P10	GRAD_DEBT_MDN_SUPP
Miami, FL	New Professions Technical Institute	Miami	...	18700	8682
Miami, FL	Management Resources College	Miami	...	PrivacySuppressed	12182
Miami, FL	Strayer University-Doral	Miami	...	49200	36173.5
Miami, FL	Keiser University- Miami	Miami	...	29700	26063
Miami, FL	George T Baker Aviation Technical College	Miami	...	38600	PrivacySuppressed

5 rows × 27 columns

法二

crit1 = college['CITY'] == 'Miami' 
crit2 = college['STABBR'] == 'FL'
college[crit1 & crit2]

	INSTNM	CITY	...	MD_EARN_WNE_P10	GRAD_DEBT_MDN_SUPP
Miami, FL	New Professions Technical Institute	Miami	...	18700	8682
Miami, FL	Management Resources College	Miami	...	PrivacySuppressed	12182
...	...	...	...	...	...
Miami, FL	Advanced Technical Centers	Miami	...	PrivacySuppressed	PrivacySuppressed
Miami, FL	Lindsey Hopkins Technical College	Miami	...	29800	PrivacySuppressed

50 rows × 27 columns

5 loc/iloc中使用布尔

movie = pd.read_csv('data/movie.csv', index_col='movie_title')

5.1 行

c1 = movie['content_rating'] == 'G'
c2 = movie['imdb_score'] < 4
criteria = c1 & c2

bool_movie = movie[criteria]
bool_movie

	color	director_name	...	aspect_ratio	movie_facebook_likes
movie_title
The True Story of Puss'N Boots	Color	Jérôme Deschamps	...	NaN	90
Doogal	Color	Dave Borthwick	...	1.85	346
...	...	...	...	...	...
Justin Bieber: Never Say Never	Color	Jon M. Chu	...	1.85	62000
Sunday School Musical	Color	Rachel Goldenberg	...	1.85	777

6 rows × 27 columns

loc使用bool

法一

movie_loc = movie.loc[criteria]

检查loc条件和布尔条件创建出来的两个DataFrame是否一样

movie_loc.equals(movie[criteria])

True

法二

movie_loc2 = movie.loc[criteria.values]

movie_loc2.equals(movie[criteria])

True

iloc使用bool

因为criteria是包含行索引的一个Series，必须要使用底层的ndarray，才能使用，iloc

movie_iloc = movie.iloc[criteria.values]

movie_iloc.equals(movie_loc)

True

5.2 列

布尔索引也可以用来选取列

criteria_col = movie.dtypes == np.int64
criteria_col.head()

color                      False
director_name              False
num_critic_for_reviews     False
duration                   False
director_facebook_likes    False
dtype: bool

movie.loc[:, criteria_col].head()

	num_voted_users	cast_total_facebook_likes	movie_facebook_likes
movie_title
Avatar	886204	4834	33000
Pirates of the Caribbean: At World's End	471220	48350	0
Spectre	275868	11700	85000
The Dark Knight Rises	1144337	106759	164000
Star Wars: Episode VII - The Force Awakens	8	143	0

movie.iloc[:, criteria_col.values].head()

	num_voted_users	cast_total_facebook_likes	movie_facebook_likes
movie_title
Avatar	886204	4834	33000
Pirates of the Caribbean: At World's End	471220	48350	0
Spectre	275868	11700	85000
The Dark Knight Rises	1144337	106759	164000
Star Wars: Episode VII - The Force Awakens	8	143	0

6 使用布尔值 - where/mask

mask() is the inverse boolean operation of where.

DataFrame.where(cond, other=nan, inplace=False **kwgs)
Parameters:

cond : boolean NDFrame, array-like, or callable
- Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
- cond是一个与df通型的dataframe,当dataframe与cond对应的位置是true是，保留原值。否则便为other对应的值
other : scalar, NDFrame, or callable
inplace : boolean, default False
- Whether to perform the operation in place on the data

6.1 Series使用where

movie = pd.read_csv('data/movie.csv', index_col='movie_title')
fb_likes = movie['actor_1_facebook_likes'].dropna()
fb_likes.head()

movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

使用describe获得对数据的认知

fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)

count      4909
mean       6494
          ...  
90%       18000
max      640000
Name: actor_1_facebook_likes, Length: 10, dtype: int64

检测小于20000个喜欢的的比例

criteria_high = fb_likes < 20000

criteria_high.mean().round(2)

0.91

where条件可以返回一个同样大小的Series，但是所有False会被替换成缺失值

fb_likes.where(criteria_high).head()

movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End          NaN
Spectre                                       11000.0
The Dark Knight Rises                             NaN
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

第二个参数other，可以让你控制替换值

fb_likes.where(criteria_high, other=20000).head()

movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

通过where条件，设定上下限的值

criteria_low = fb_likes > 300
fb_likes_cap = fb_likes.where(criteria_high, other=20000).where(criteria_low, 300)
fb_likes_cap.head()

movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      300.0
Name: actor_1_facebook_likes, dtype: float64

原始Series和修改过的Series的长度是一样的

len(fb_likes), len(fb_likes_cap)

(4909, 4909)

6.2 dataframe使用where

df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],'ids2': ['a', 'n', 'c', 'n']})

print(df)
print(df < 2)
df.where(df<2,1000)

   vals ids ids2
0     1   a    a
1     2   b    n
2     3   f    c
3     4   n    n
    vals   ids  ids2
0   True  True  True
1  False  True  True
2  False  True  True
3  False  True  True

	vals	ids	ids2
0	1	a	a
1	1000	b	n
2	1000	f	c
3	1000	n	n

下面的代码等价于 df.where(df < 0,1000).

print(df[df < 2])
df[df < 2].fillna(1000)

   vals ids ids2
0   1.0   a    a
1   NaN   b    n
2   NaN   f    c
3   NaN   n    n

	vals	ids	ids2
0	1.0	a	a
1	1000.0	b	n
2	1000.0	f	c
3	1000.0	n	n

天下风云出我辈，一入江湖岁月催