Pandas CookBook -- 02DataFrame基础操作

Pandas基础操作

简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧

import pandas as pd
import numpy as np

设定最大列数和最大行数

pd.set_option('max_columns',5 , 'max_rows', 5)

1 选取多个DataFrame列

1.1 用列表选取多个列

movie = pd.read_csv('data/movie.csv')
cols =['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
movie_actor_director = movie[cols]
movie_actor_director
actor_1_name actor_2_name actor_3_name director_name
0 CCH Pounder Joel David Moore Wes Studi James Cameron
1 Johnny Depp Orlando Bloom Jack Davenport Gore Verbinski
... ... ... ... ...
4914 Alan Ruck Daniel Henney Eliza Coupe Daniel Hsia
4915 John August Brian Herzlinger Jon Gunn Jon Gunn

4916 rows × 4 columns

1.2 使用select_dtypes选取类型

select_dtypes(include=None, exclude=None)

  • To select all numeric types, use np.number or 'number'
  • To select strings you must use the object dtype, but note that this will return all object dtype columns,See the numpy dtype hierarchy
  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
  • To select Pandas categorical dtypes, use 'category'
movie.shape
(4916, 28)

1.2.1 选取整数列

movie.select_dtypes(include=['int']).head()
num_voted_users cast_total_facebook_likes movie_facebook_likes
0 886204 4834 33000
1 471220 48350 0
2 275868 11700 85000
3 1144337 106759 164000
4 8 143 0

1.2.2 选取非整数列

movie.select_dtypes(exclude=['int']).head()
color director_name ... imdb_score aspect_ratio
0 Color James Cameron ... 7.9 1.78
1 Color Gore Verbinski ... 7.1 2.35
2 Color Sam Mendes ... 6.8 2.35
3 Color Christopher Nolan ... 8.5 2.35
4 NaN Doug Walker ... 7.1 NaN

5 rows × 25 columns

1.2.3 通过filter函数过滤选取多列

filter(items=None, like=None, regex=None, axis=None)

  • items : list-like
    • List of info axis to restrict to (must not all be present)
    • 传递个列名或行名列表
  • like : string
    • Keep info axis where “arg in col == True”
    • 类似Python里面字符串的find()函数,col.find(arg)
  • regex : string (regular expression)
    • Keep info axis with re.search(regex, col) == True

通过filter()函数过滤选取多列

movie.filter(like='facebook').head()
director_facebook_likes actor_3_facebook_likes ... actor_2_facebook_likes movie_facebook_likes
0 0.0 855.0 ... 936.0 33000
1 563.0 1000.0 ... 5000.0 0
2 0.0 161.0 ... 393.0 85000
3 22000.0 23000.0 ... 23000.0 164000
4 131.0 NaN ... 12.0 0

5 rows × 6 columns

通过正则表达式选取多列

movie.filter(regex='d').head()
actor_3_facebook_likes actor_2_name ... actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore ... Wes Studi 936.0
1 1000.0 Orlando Bloom ... Jack Davenport 5000.0
2 161.0 Rory Kinnear ... Stephanie Sigman 393.0
3 23000.0 Christian Bale ... Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker ... NaN 12.0

5 rows × 6 columns

filter()函数,传递列表到参数items,选取多列

movie.filter(items=['actor_1_name', 'actor_3_name']).head()
actor_1_name actor_3_name
0 CCH Pounder Wes Studi
1 Johnny Depp Jack Davenport
2 Christoph Waltz Stephanie Sigman
3 Tom Hardy Joseph Gordon-Levitt
4 Doug Walker NaN

2 DataFrame上操作

2.1 基本方法

数据的个数 数据集的维度 数据集的长度

 movie.shape,movie.size,movie.ndim
((4916, 28), 137648, 2)

各个列的非空值的个数

movie.count()
color                   4897
director_name           4814
                        ... 
aspect_ratio            4590
movie_facebook_likes    4916
Length: 28, dtype: int64

2.2 统计信息

movie.shape
(4916, 28)

2.2.1 最大 最小值

2.2.1.1 数值类型

# min max quantile
movie_min = movie.min()
movie_min.name = '最小值'
movie_min
num_critic_for_reviews    1.00
duration                  7.00
                          ... 
aspect_ratio              1.18
movie_facebook_likes      0.00
Name: 最小值, Length: 16, dtype: float64

计算是默认会跳过缺失值的,可设置skipna=False使其包含缺失,但这样不具有意义

movie.min(skipna=False)
num_critic_for_reviews    NaN
duration                  NaN
                         ... 
aspect_ratio              NaN
movie_facebook_likes      0.0
Length: 16, dtype: float64

2.2.1.2 字符串类型

当字符串类型的列包含缺失值时,聚合方法min、max、sum,不会返回任何值。

movie[['color', 'movie_title', 'color']].max()
Series([], dtype: float64)

要让pandas强行返回每列的值,必须填入缺失值。下面填入的是空字符串

movie[['color', 'movie_title', 'color']].fillna('').max()
color             Color
movie_title    Æon Flux
color             Color
dtype: object

2.2.2 统计信息

2.2.2.1 数值型

使用percentiles参数指定分位数

movie.describe(percentiles=[.01, .3, .99])
num_critic_for_reviews duration ... aspect_ratio movie_facebook_likes
count 4867.000000 4901.000000 ... 4590.000000 4916.000000
mean 137.988905 107.090798 ... 2.222349 7348.294142
... ... ... ... ... ...
99% 546.680000 189.000000 ... 4.000000 93850.000000
max 813.000000 511.000000 ... 16.000000 349000.000000

9 rows × 16 columns

2.2.2.2 字符串型

movie.select_dtypes(include='object').describe()
color director_name ... country content_rating
count 4897 4814 ... 4911 4616
unique 2 2397 ... 65 18
top Color Steven Spielberg ... USA R
freq 4693 26 ... 3710 2067

4 rows × 12 columns

2.3 方法的组合

使用isnull方法将每个值转变为布尔值

movie.isnull().head()
color director_name ... aspect_ratio movie_facebook_likes
0 False False ... False False
1 False False ... False False
2 False False ... False False
3 False False ... False False
4 True False ... True False

5 rows × 28 columns

sum统计布尔值,返回的是Series

movie.isnull().sum().head()
color                       19
director_name              102
num_critic_for_reviews      49
duration                    15
director_facebook_likes    102
dtype: int64

对这个Series再使用sum,返回整个DataFrame的缺失值的个数,返回值是个标量

movie.isnull().sum().sum()
2654

判断整个DataFrame有没有缺失值,方法是连着使用两个any

movie.isnull().any().any()
True

2.4 运算符

行索引名设为INSTNM,用UGDS_过滤出本科生的种族比例

college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
college_ugds_
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.0333 0.9353 ... 0.0059 0.0138
University of Alabama at Birmingham 0.5922 0.2600 ... 0.0179 0.0100
... ... ... ... ... ...
Bay Area Medical Academy - San Jose Satellite Location NaN NaN ... NaN NaN
Excel Learning Center-San Antonio South NaN NaN ... NaN NaN

7535 rows × 9 columns

college_ugds_的数值类型都是float,可以进行整数运算

college_ugds_.dtypes
UGDS_WHITE    float64
UGDS_BLACK    float64
               ...   
UGDS_NRA      float64
UGDS_UNKN     float64
Length: 9, dtype: object

2.4.1 加减乘除

college_ugds_.head() + .00501
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.03831 0.94031 ... 0.01091 0.01881
University of Alabama at Birmingham 0.59721 0.26501 ... 0.02291 0.01501
Amridge University 0.30401 0.42421 ... 0.00501 0.27651
University of Alabama in Huntsville 0.70381 0.13051 ... 0.03821 0.04001
Alabama State University 0.02081 0.92581 ... 0.02931 0.01871

5 rows × 9 columns

2.4.2 计算样例数据的百分比

2.4.2.1 方式一

college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
college_ugds_op_round.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.03 0.94 ... 0.01 0.01
University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
Amridge University 0.30 0.42 ... 0.00 0.27
University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
Alabama State University 0.02 0.92 ... 0.02 0.01

5 rows × 9 columns

2.4.2.2 方式二

college_ugds_round = (college_ugds_ + .00001).round(2)
college_ugds_round.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.03 0.94 ... 0.01 0.01
University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
Amridge University 0.30 0.42 ... 0.00 0.27
University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
Alabama State University 0.02 0.92 ... 0.02 0.01

5 rows × 9 columns

2.4.2.3 方式三

college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
college_ugds_op_round_methods.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.03 0.94 ... 0.01 0.01
University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
Amridge University 0.30 0.42 ... 0.00 0.27
University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
Alabama State University 0.02 0.92 ... 0.02 0.01

5 rows × 9 columns

3 比较缺失值

Pandas使用NumPy NaN(np.nan)对象表示缺失值。这是一个不等于自身的特殊对象:

np.nan == np.nan
False

所有和np.nan的比较都返回False,除了不等于:

5 > np.nan
False
5 != np.nan
True

无法通过直接比较比较,含有缺失值的df是否一致

movie_equal = movie == movie
movie_equal.all().all()
False
movie_equal.size - movie_equal.sum().sum()
2654
movie.isnull().sum().sum()
2654

比较两个DataFrame最直接的方法是使用equals()方法

from pandas.testing import assert_frame_equal
assert_frame_equal(movie, movie)
天下风云出我辈,一入江湖岁月催
原文地址:https://www.cnblogs.com/shiyushiyu/p/9734621.html