Pandas CookBook -- 04选取数据子集

选取数据子集

简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧

import pandas as pd
import numpy as np

设定最大列数和最大行数

pd.set_option('max_columns',5 , 'max_rows', 5)

1 选取Series数据

college = pd.read_csv('data/college.csv', index_col='INSTNM')
city = college['CITY']

1.1 iloc整数选取

标量

city.iloc[3]
'Huntsville'

多行 - 整数列表

city.iloc[[10,20,30]]
INSTNM
Birmingham Southern College                            Birmingham
George C Wallace State Community College-Hanceville    Hanceville
Judson College                                             Marion
Name: CITY, dtype: object

多行 - 整数切片

city.iloc[4:50:10]
INSTNM
Alabama State University              Montgomery
Enterprise State Community College    Enterprise
Heritage Christian University           Florence
Marion Military Institute                 Marion
Reid State Technical College           Evergreen
Name: CITY, dtype: object

1.2 loc标签选取

标量

city.loc['Heritage Christian University']
'Florence'

多行 - 标签列表

随机选择4个标签

labels = list(np.random.choice(city.index, 4))
labels
['University of St Thomas',
 'Paul Mitchell the School-Woodbridge',
 'San Francisco Conservatory of Music',
 'Trinity Bible College']
city.loc[labels]
INSTNM
University of St Thomas                   Saint Paul
Paul Mitchell the School-Woodbridge       Woodbridge
San Francisco Conservatory of Music    San Francisco
Trinity Bible College                      Ellendale
Name: CITY, dtype: object

多行 - 标签切片

 city.loc['Alabama State University':'Reid State Technical College':10]
INSTNM
Alabama State University              Montgomery
Enterprise State Community College    Enterprise
Heritage Christian University           Florence
Marion Military Institute                 Marion
Reid State Technical College           Evergreen
Name: CITY, dtype: object

也可以切片逆序选取

city.loc['Reid State Technical College':'Alabama State University':-10]
INSTNM
Reid State Technical College           Evergreen
Marion Military Institute                 Marion
Heritage Christian University           Florence
Enterprise State Community College    Enterprise
Alabama State University              Montgomery
Name: CITY, dtype: object

2 选取DataFrame数据

2.1 选取DataFrame的行

注意iloc是整数索引,loc是标签索引即可

2.2 同时选取DataFrame的行和列

2.2.1 选取前3行和前4列

college.iloc[:3, :4]
CITY STABBR HBCU MENONLY
INSTNM
Alabama A & M University Normal AL 1.0 0.0
University of Alabama at Birmingham Birmingham AL 0.0 0.0
Amridge University Montgomery AL 0.0 0.0
college.loc[:'Amridge University', :'MENONLY']
CITY STABBR HBCU MENONLY
INSTNM
Alabama A & M University Normal AL 1.0 0.0
University of Alabama at Birmingham Birmingham AL 0.0 0.0
Amridge University Montgomery AL 0.0 0.0

2.2.2 选取两列的所有的行

college.iloc[:, [4,6]].head()
WOMENONLY SATVRMID
INSTNM
Alabama A & M University 0.0 424.0
University of Alabama at Birmingham 0.0 570.0
Amridge University 0.0 NaN
University of Alabama in Huntsville 0.0 595.0
Alabama State University 0.0 425.0
college.loc[:, ['WOMENONLY', 'SATVRMID']]
WOMENONLY SATVRMID
INSTNM
Alabama A & M University 0.0 424.0
University of Alabama at Birmingham 0.0 570.0
... ... ...
Bay Area Medical Academy - San Jose Satellite Location NaN NaN
Excel Learning Center-San Antonio South NaN NaN

7535 rows × 2 columns

2.2.3 选取不连续的行和列

college.iloc[[100, 200], [7, 15]]
SATMTMID UGDS_NHPI
INSTNM
GateWay Community College NaN 0.0029
American Baptist Seminary of the West NaN NaN
rows = ['GateWay Community College', 'American Baptist Seminary of the West']
columns = ['SATMTMID', 'UGDS_NHPI']
college.loc[rows, columns]
SATMTMID UGDS_NHPI
INSTNM
GateWay Community College NaN 0.0029
American Baptist Seminary of the West NaN NaN

2.2.4 选取一个标量值

college.iloc[5, -4]
0.401
college.loc['The University of Alabama', 'PCTFLOAN']
0.401

2.2.5 行切片选取一列

college.iloc[90:80:-2, 5]
INSTNM
Empire Beauty School-Flagstaff     0
Charles of Italy Beauty College    0
Central Arizona College            0
University of Arizona              0
Arizona State University-Tempe     0
Name: RELAFFIL, dtype: int64
start = 'Empire Beauty School-Flagstaff'
stop = 'Arizona State University-Tempe'
college.loc[start:stop:-2, 'RELAFFIL']
INSTNM
Empire Beauty School-Flagstaff     0
Charles of Italy Beauty College    0
Central Arizona College            0
University of Arizona              0
Arizona State University-Tempe     0
Name: RELAFFIL, dtype: int64

3 用整数和标签的互换

用索引方法get_loc,找到指定列的整数位置

col_start = college.columns.get_loc('UGDS_WHITE')
col_end = college.columns.get_loc('UGDS_UNKN') + 1
col_start, col_end
(10, 19)
college.iloc[:5, col_start:col_end]
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.0333 0.9353 ... 0.0059 0.0138
University of Alabama at Birmingham 0.5922 0.2600 ... 0.0179 0.0100
Amridge University 0.2990 0.4192 ... 0.0000 0.2715
University of Alabama in Huntsville 0.6988 0.1255 ... 0.0332 0.0350
Alabama State University 0.0158 0.9208 ... 0.0243 0.0137

5 rows × 9 columns

获得整数行对应的标签名

row_end = college.index[4]
col_start,col_end = college.columns[10],college.columns[19]
col_start,col_end
('UGDS_WHITE', 'PPTUG_EF')
college.loc[:row_end, col_start:col_end]
UGDS_WHITE UGDS_BLACK ... UGDS_UNKN PPTUG_EF
INSTNM
Alabama A & M University 0.0333 0.9353 ... 0.0138 0.0656
University of Alabama at Birmingham 0.5922 0.2600 ... 0.0100 0.2607
Amridge University 0.2990 0.4192 ... 0.2715 0.4536
University of Alabama in Huntsville 0.6988 0.1255 ... 0.0350 0.2146
Alabama State University 0.0158 0.9208 ... 0.0137 0.0892

5 rows × 10 columns

4 切片操作

4.1 惰性行切片

惰性,我的理解就就是省去loc和iloc吧,但是惰性切片不能用于列,只能用于DataFrame的行和Series,也不能同时选取行和列。

Series选取数据

city = college['CITY']
city[10:20:2]
INSTNM
Birmingham Southern College              Birmingham
Concordia College Alabama                     Selma
Enterprise State Community College       Enterprise
Faulkner University                      Montgomery
New Beginning College of Cosmetology    Albertville
Name: CITY, dtype: object

dataframe行选取

college[10:20:2]
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Birmingham Southern College Birmingham AL ... 44200 27000
Concordia College Alabama Selma AL ... 19900 PrivacySuppressed
Enterprise State Community College Enterprise AL ... 24600 8273
Faulkner University Montgomery AL ... 37200 22000
New Beginning College of Cosmetology Albertville AL ... NaN 5500

5 rows × 26 columns

Series和DataFrame都可以用标签进行切片。

start = 'Mesa Community College'
stop = 'Spokane Community College'
college[start:stop:1500]
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Mesa Community College Mesa AZ ... 35200 8000
Hair Academy Inc-New Carrollton New Carrollton MD ... 15200 9666
National College of Natural Medicine Portland OR ... NaN PrivacySuppressed

3 rows × 26 columns

4.2 按照字母切片

需先对索引进行排序

college = college.sort_index()
college.head()
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
A & W Healthcare Educators New Orleans LA ... NaN 19022.5
A T Still University of Health Sciences Kirksville MO ... 219800 PrivacySuppressed
ABC Beauty Academy Garland TX ... NaN PrivacySuppressed
ABC Beauty College Inc Arkadelphia AR ... PrivacySuppressed 16500
AI Miami International University of Art and Design Miami FL ... 29900 31000

5 rows × 26 columns

选取字母顺序在‘Sp’和‘Su’之间的学校

college.loc['Sp':'Su']
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Spa Tech Institute-Ipswich Ipswich MA ... 21500 6333
Spa Tech Institute-Plymouth Plymouth MA ... 21500 6333
... ... ... ... ... ...
Styles and Profiles Beauty College Selmer TN ... PrivacySuppressed PrivacySuppressed
Styletrends Barber and Hairstyling Academy Rock Hill SC ... PrivacySuppressed 9495.5

201 rows × 26 columns

天下风云出我辈,一入江湖岁月催
原文地址:https://www.cnblogs.com/shiyushiyu/p/9742232.html