Pandas Cookbook -- 01 Pandas Basics

Pandas Basics

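This is based on a translation by SeanCheney on Jianshu. I adjusted the formatting and reorganized the section structure so that it suits my own reading and is easier to search when I come back to it later.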

import pandas as pd
import numpy as np

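Set the maximum number of columns and rows to display (the full option names are used because a short pattern such as 'max_columns' can match more than one option in newer pandas versions)

pd.set_option('display.max_columns', 5, 'display.max_rows', 10)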
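
Display options are global, so it can help to change them only temporarily. The following is a minimal sketch, assuming a recent pandas version, that reads an option back and overrides it inside a `with` block. The variable names are made up for illustration.

```python
import pandas as pd

# Set the display limits with their full option names.
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 10)

current = pd.get_option('display.max_rows')

# option_context restores the previous value when the block exits.
with pd.option_context('display.max_rows', 3):
    inside = pd.get_option('display.max_rows')

after = pd.get_option('display.max_rows')

# Put the options back to the pandas defaults.
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
```

Using `option_context` keeps a notebook from silently inheriting a display setting that was changed many cells earlier.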

1 Structure of a DataFrame

movie = pd.read_csv('data/movie.csv')
movie.shape
(4916, 28)

2 Accessing the components of a DataFrame

2.1 Retrieving the components and their types

columns = movie.columns
type(columns)
pandas.core.indexes.base.Index
columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')
index = movie.index
type(index)
pandas.core.indexes.range.RangeIndex
index
RangeIndex(start=0, stop=4916, step=1)
data = movie.values
type(data)
numpy.ndarray
data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
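
The same three components can be inspected without the movie file. This is a small sketch on a made-up three-row frame (the `toy` name and all values are hypothetical). It uses `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie.csv; the values are invented.
toy = pd.DataFrame({
    'director_name': ['A. Director', 'B. Director', None],
    'duration': [120.0, 95.0, np.nan],
    'movie_facebook_likes': [100, 0, 16],
})

columns = toy.columns     # column labels, an Index
index = toy.index         # row labels, a RangeIndex by default
data = toy.to_numpy()     # the values as a 2-D NumPy array

print(type(columns).__name__)
print(type(index).__name__)
print(data.shape, data.dtype)
```

Because the columns mix strings, floats, and integers, the combined array falls back to the generic `object` dtype, just as in the `movie.values` output above.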

2.2 Index types

Check whether RangeIndex is a subclass of Index

issubclass(pd.RangeIndex, pd.Index)
True

Access the index values. index.values is a NumPy array (not a list), so it can be indexed and sliced

index.values
array([   0,    1,    2, ..., 4913, 4914, 4915])
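
A short sketch of what indexing and slicing an index looks like in practice, using a ten-element `RangeIndex` built by hand rather than the one from the movie data. It also shows that an Index cannot be modified in place.

```python
import pandas as pd

idx = pd.RangeIndex(start=0, stop=10, step=1)

first = idx[0]            # a single label
middle = idx[2:5]         # slicing returns another Index
as_array = idx.values     # NumPy array of the labels
as_list = idx.tolist()    # plain Python list

# Index objects are immutable, so item assignment raises TypeError.
try:
    idx[0] = 99
except TypeError as exc:
    error_message = str(exc)
```

To change labels, build a new index (for example from a modified list) and assign it to the DataFrame, as section 6 does.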

3 Understanding data types

movie.dtypes
color                       object
director_name               object
num_critic_for_reviews     float64
duration                   float64
director_facebook_likes    float64
                            ...   
title_year                 float64
actor_2_facebook_likes     float64
imdb_score                 float64
aspect_ratio               float64
movie_facebook_likes         int64
Length: 28, dtype: object

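Count the columns of each data type (get_dtype_counts() was deprecated in pandas 0.25 and removed in 1.0; in newer versions movie.dtypes.value_counts() gives the same counts)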

movie.get_dtype_counts()
float64    13
int64       3
object     12
dtype: int64
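
On current pandas versions the per-type count can be obtained from `dtypes` itself. A minimal sketch on a made-up frame (the `toy` data is hypothetical); it also shows `select_dtypes`, which is a common next step once the types are known.

```python
import pandas as pd

# Hypothetical two-row frame with one text, two float, and one integer column.
toy = pd.DataFrame({
    'color': ['Color', 'Color'],
    'duration': [178.0, 169.0],
    'imdb_score': [7.9, 7.1],
    'movie_facebook_likes': [33000, 0],
})

# dtypes is a Series of dtype objects, so value_counts() tallies them.
dtype_counts = toy.dtypes.value_counts()

# Keep only the numeric columns.
numeric_only = toy.select_dtypes(include='number')

print(dtype_counts)
print(numeric_only.columns.tolist())
```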

4 Series structure

Select a single column as a Series

movie['director_name']
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

A column can also be selected as an attribute

movie.director_name
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object
type(movie['director_name'])
pandas.core.series.Series
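
Attribute access is convenient but does not work for every column name. The sketch below, on a hypothetical frame, shows two cases where only the bracket form reaches the column: a name containing a space, and a name that collides with an existing DataFrame method.

```python
import pandas as pd

toy = pd.DataFrame({
    'director_name': ['A', 'B'],
    'Director Name': ['C', 'D'],   # contains a space
    'count': [1, 2],               # same name as the DataFrame.count method
})

by_bracket = toy['director_name']
by_attribute = toy.director_name
same = by_bracket.equals(by_attribute)

spaced = toy['Director Name']      # not reachable as an attribute
count_column = toy['count']        # the column
count_method = toy.count           # the method, not the column
```

For this reason the bracket form is the safer default in scripts.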

4.1 Calling Series methods

Collect all distinct attributes and methods of Series

s_attr_methods = set(dir(pd.Series))
len(s_attr_methods)
464

Collect all distinct attributes and methods of DataFrame

df_attr_methods = set(dir(pd.DataFrame))
len(df_attr_methods)
460

How many attributes and methods the two sets have in common

len(s_attr_methods & df_attr_methods)
399
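
The exact counts above depend on the installed pandas version, but the set operations work the same everywhere. A small sketch that also extracts the names unique to each class:

```python
import pandas as pd

s_attr_methods = set(dir(pd.Series))
df_attr_methods = set(dir(pd.DataFrame))

shared = s_attr_methods & df_attr_methods        # in both
series_only = s_attr_methods - df_attr_methods   # only on Series
frame_only = df_attr_methods - s_attr_methods    # only on DataFrame

print(len(shared), len(series_only), len(frame_only))
print('unique' in series_only, 'merge' in frame_only)
```

The large overlap is why most of what is learned on a Series in the next sections carries over to DataFrames.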

4.2 Basic Series methods

Select the director_name and actor_1_facebook_likes columns as director and actor_1_fb_likes

director = movie['director_name']
actor_1_fb_likes = movie['actor_1_facebook_likes']

Look at the first rows of the Series

director.head()
0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object

Count how often each value occurs in the Series

director.value_counts()
Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Martin Scorsese     20
Spike Lee           16
                    ..
John Duigan          1
Ray Griggs           1
Lena Dunham          1
Dario Argento        1
Eric Mendelsohn      1
Name: director_name, Length: 2397, dtype: int64

Relative frequency of each value in the Series

director.value_counts(normalize=True)
Steven Spielberg    0.005401
Woody Allen         0.004570
Clint Eastwood      0.004155
Martin Scorsese     0.004155
Spike Lee           0.003324
                      ...   
John Duigan         0.000208
Ray Griggs          0.000208
Lena Dunham         0.000208
Dario Argento       0.000208
Eric Mendelsohn     0.000208
Name: director_name, Length: 2397, dtype: float64

Length-related attributes

len(director) 
4916
director.size 
4916
director.shape 
(4916,)

Number of non-missing values in director

director.count() 
4814

Number of missing values (section 4.4 shows a more direct way)

director.size - director.count()
102
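
The difference between `size` and `count()` is easy to see on a tiny Series. This sketch uses invented values; it also passes `dropna=False` to `value_counts`, which by default leaves missing values out of the tally.

```python
import numpy as np
import pandas as pd

director = pd.Series(['A', 'B', None, 'A', np.nan], name='director_name')

total = director.size              # every element, missing or not
non_missing = director.count()     # skips missing values
missing = director.isna().sum()    # missing values counted directly

# Include missing values as their own category in the frequency table.
freq_with_nan = director.value_counts(dropna=False)

print(total, non_missing, missing)
print(freq_with_nan)
```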

4.3 Series statistics

Minimum, maximum, mean, median, standard deviation, and sum

actor_1_fb_likes.min(), actor_1_fb_likes.max()
(0.0, 640000.0)
actor_1_fb_likes.mean(), actor_1_fb_likes.median()
(6494.488490527602, 982.0)
actor_1_fb_likes.std(), actor_1_fb_likes.sum()
(15106.986883848309, 31881444.0)

Summary statistics for numeric data

actor_1_fb_likes.describe()
count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

Summary statistics for string data

director.describe()
count                 4814
unique                2397
top       Steven Spielberg
freq                    26
Name: director_name, dtype: object

Arbitrary quantiles

actor_1_fb_likes.quantile(.2)
510.0
actor_1_fb_likes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
0.1      240.0
0.2      510.0
0.3      694.0
0.4      854.0
0.5      982.0
0.6     1000.0
0.7     8000.0
0.8    13000.0
0.9    18000.0
Name: actor_1_facebook_likes, dtype: float64
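
To see how `quantile` relates to the other summaries, here is a sketch on five evenly spaced, made-up values. The 0.5 quantile is the median, and `describe` accepts a `percentiles` argument when the default quartiles are not the ones of interest.

```python
import pandas as pd

likes = pd.Series([0, 100, 200, 300, 400], dtype='float64')

# The 0.5 quantile and the median are the same number.
median_via_quantile = likes.quantile(0.5)

# A list of quantiles returns a Series indexed by the requested fractions.
deciles = likes.quantile([0.1, 0.5, 0.9])

# Ask describe() for the 10th and 90th percentiles instead of the quartiles.
summary = likes.describe(percentiles=[0.1, 0.9])

print(median_via_quantile)
print(deciles)
print(summary)
```

By default `quantile` interpolates linearly between neighbouring values, so a quantile does not have to be one of the observed values.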

4.4 Handling missing values

Check whether there are any missing values

actor_1_fb_likes.hasnans
True

Number of missing values

actor_1_fb_likes.isnull().sum()
7

Select the missing values

actor_1_fb_likes[actor_1_fb_likes.isnull()]
4401   NaN
4418   NaN
4608   NaN
4721   NaN
4822   NaN
4823   NaN
4864   NaN
Name: actor_1_facebook_likes, dtype: float64

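Boolean mask of missing values (isnull() is True where a value is missing); notnull() below is its complement and marks the non-missing values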

actor_1_fb_likes.isnull()
0       False
1       False
2       False
3       False
4       False
        ...  
4911    False
4912    False
4913    False
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
bool_sig = actor_1_fb_likes.notnull()

Check whether every value in the mask is True

bool_sig.all()
False

Fill missing values

actor_1_fb_likes.count()
4909
actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
actor_1_fb_likes_filled.count()
4916

Drop missing values

actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
actor_1_fb_likes_dropped.size
4909
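
The missing-value steps above can be tried end to end on a four-element Series. This is a sketch with invented numbers; filling with the median is shown only as one possible alternative to filling with 0, not as a recommendation for the movie data.

```python
import numpy as np
import pandas as pd

likes = pd.Series([1000.0, np.nan, 500.0, np.nan], name='actor_1_facebook_likes')

has_missing = likes.hasnans          # True if any value is missing
n_missing = likes.isna().sum()       # how many are missing

filled_zero = likes.fillna(0)                  # replace with a constant
filled_median = likes.fillna(likes.median())   # replace with the median of the rest
dropped = likes.dropna()                       # remove the missing entries

# None of these calls changes the original Series.
still_missing = likes.isna().sum()
```

Note that `dropna` keeps the original index labels, so the result here is indexed by 0 and 2 rather than 0 and 1.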

4.5 Using operators on a Series

imdb_score = movie['imdb_score']

Arithmetic operators (addition, subtraction, multiplication, division)

imdb_score + 1
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

The same operation as a method call

imdb_score.add(1)        
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64
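
The operator and the method give the same result in the example above. One reason to reach for the method form is its extra parameters. The sketch below, on two made-up Series, uses `fill_value` so that a value missing on one side is treated as 0 instead of producing NaN.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

with_operator = a + b                  # NaN wherever either side is missing
with_method = a.add(b, fill_value=0)   # a missing side is treated as 0

# Comparison operators have method forms too; gt(2) is the same as a > 2.
above_two = a.gt(2)

print(with_operator.tolist())
print(with_method.tolist())
print(above_two.tolist())
```

Methods also chain more readably than operators, for example `a.add(b, fill_value=0).gt(2)`.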

4.6 Type conversion

imdb_score.dtype
dtype('float64')
imdb_score = imdb_score.astype(int)
imdb_score.dtype
dtype('int64')
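
Two details of float-to-integer conversion are worth knowing: `astype(int)` truncates rather than rounds, and it fails when the Series contains NaN. The sketch below uses invented scores; the nullable `'Int64'` dtype is shown as one way to keep a missing value, assuming a pandas version that supports it.

```python
import numpy as np
import pandas as pd

scores = pd.Series([7.9, 7.1, 6.8])

truncated = scores.astype(int)          # drops the fractional part
rounded = scores.round().astype(int)    # round first, then convert

with_gap = pd.Series([7.9, np.nan])
try:
    with_gap.astype(int)                # NaN cannot be stored in a plain int column
except ValueError as exc:
    error = exc

# The nullable integer dtype keeps the gap as <NA>.
nullable = with_gap.round().astype('Int64')

print(truncated.tolist(), rounded.tolist())
print(nullable)
```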

5 Making the DataFrame index meaningful

movie.shape
(4916, 28)
movie.tail()
      color     director_name  ...  aspect_ratio  movie_facebook_likes
4911  Color       Scott Smith  ...           NaN                    84
4912  Color               NaN  ...         16.00                 32000
4913  Color  Benjamin Roberds  ...           NaN                    16
4914  Color       Daniel Hsia  ...          2.35                   660
4915  Color          Jon Gunn  ...          1.85                   456

5 rows × 28 columns

5.1 Naming the row and column indexes

movie.index.name = 'row_index'
movie.columns.name = 'col_index'
movie.tail()
col_index  color     director_name  ...  aspect_ratio  movie_facebook_likes
row_index
4911       Color       Scott Smith  ...           NaN                    84
4912       Color               NaN  ...         16.00                 32000
4913       Color  Benjamin Roberds  ...           NaN                    16
4914       Color       Daniel Hsia  ...          2.35                   660
4915       Color          Jon Gunn  ...          1.85                   456

5 rows × 28 columns

5.2 Setting and resetting the index

Use one or more existing columns of the DataFrame as the index

movie2 = movie.set_index('movie_title')
movie2.tail()
col_index                color     director_name  ...  aspect_ratio  movie_facebook_likes
movie_title
Signed Sealed Delivered  Color       Scott Smith  ...           NaN                    84
The Following            Color               NaN  ...         16.00                 32000
A Plague So Pleasant     Color  Benjamin Roberds  ...           NaN                    16
Shanghai Calling         Color       Daniel Hsia  ...          2.35                   660
My Date with Drew        Color          Jon Gunn  ...          1.85                   456

5 rows × 27 columns

Another way: set the index while reading the file

movie = pd.read_csv('data/movie.csv', index_col='movie_title')

Restore the default integer index

movie2.reset_index().tail()
col_index              movie_title  color  ...  aspect_ratio  movie_facebook_likes
4911       Signed Sealed Delivered  Color  ...           NaN                    84
4912                 The Following  Color  ...         16.00                 32000
4913          A Plague So Pleasant  Color  ...           NaN                    16
4914              Shanghai Calling  Color  ...          2.35                   660
4915             My Date with Drew  Color  ...          1.85                   456

5 rows × 28 columns
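
The main payoff of a meaningful index is label-based lookup. A minimal sketch on a hypothetical three-row frame (titles and scores are invented) that sets the index, looks up one value with `.loc`, and then restores the integer index:

```python
import pandas as pd

toy = pd.DataFrame({
    'movie_title': ['Title A', 'Title B', 'Title C'],
    'imdb_score': [7.9, 7.1, 6.8],
})

by_title = toy.set_index('movie_title')          # returns a new DataFrame
score_b = by_title.loc['Title B', 'imdb_score']  # look up by row label

restored = by_title.reset_index()                # index becomes a column again
discarded = by_title.reset_index(drop=True)      # index is thrown away instead

print(score_b)
print(restored.equals(toy))
```

Both `set_index` and `reset_index` return new objects, which is why the text above assigns the result to `movie2` rather than expecting `movie` to change.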

6 Renaming row and column labels

Rename with rename()

idx_rename = {'Avatar': 'Ratava', 'Spectre': 'Ertceps'}
col_rename = {'director_name': 'Director Name', 'num_critic_for_reviews': 'Critical Reviews'}
movie.rename(index=idx_rename, columns=col_rename).head()
                                            color      Director Name  ...  aspect_ratio  movie_facebook_likes
movie_title
Ratava                                      Color      James Cameron  ...          1.78                 33000
Pirates of the Caribbean: At World's End    Color     Gore Verbinski  ...          2.35                     0
Ertceps                                     Color         Sam Mendes  ...          2.35                 85000
The Dark Knight Rises                       Color  Christopher Nolan  ...          2.35                164000
Star Wars: Episode VII - The Force Awakens    NaN        Doug Walker  ...           NaN                     0

5 rows × 27 columns

Rename by assigning modified lists of labels

index = movie.index
columns = movie.columns
index_list = index.tolist()
column_list = columns.tolist()
index_list[0] = 'Ratava'
index_list[2] = 'Ertceps'
column_list[1] = 'Director Name'
column_list[2] = 'Critical Reviews'
movie.index = index_list
movie.columns = column_list
movie.head()
                                            color      Director Name  ...  aspect_ratio  movie_facebook_likes
Ratava                                      Color      James Cameron  ...          1.78                 33000
Pirates of the Caribbean: At World's End    Color     Gore Verbinski  ...          2.35                     0
Ertceps                                     Color         Sam Mendes  ...          2.35                 85000
The Dark Knight Rises                       Color  Christopher Nolan  ...          2.35                164000
Star Wars: Episode VII - The Force Awakens    NaN        Doug Walker  ...           NaN                     0

5 rows × 27 columns
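
Two properties of `rename` that the example above relies on can be checked on a small frame: keys that do not match any label are ignored by default, and the original DataFrame is left unchanged. The sketch uses invented data and also passes a function instead of a dict.

```python
import pandas as pd

toy = pd.DataFrame(
    {'director_name': ['A', 'B'], 'imdb_score': [7.9, 7.1]},
    index=['Avatar', 'Spectre'],
)

renamed = toy.rename(
    index={'Avatar': 'Ratava', 'Missing Title': 'ignored'},  # second key matches nothing
    columns={'director_name': 'Director Name'},
)

# A function is applied to every label on that axis.
upper = toy.rename(columns=str.upper)

print(renamed.index.tolist(), renamed.columns.tolist())
print(toy.columns.tolist())
print(upper.columns.tolist())
```

The list-assignment approach, by contrast, modifies the DataFrame in place and requires the list to have exactly as many labels as the axis.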

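7 Creating and deleting columns

Add new columns with [column name]. Note that actor_director_facebook_likes, despite its name, sums only the actor_1 and actor_2 likes

movie = pd.read_csv('data/movie.csv')
movie['has_seen'] = 0
movie['actor_director_facebook_likes'] = movie['actor_1_facebook_likes'] + movie['actor_2_facebook_likes']
movie.shape, movie['actor_director_facebook_likes'].shape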
((4916, 30), (4916,))

Delete rows or columns (drop returns a new DataFrame and leaves movie unchanged)

movie.drop(['actor_director_facebook_likes', 'actor_1_facebook_likes'], axis=1)
      color      director_name  ...  movie_facebook_likes  has_seen
0     Color      James Cameron  ...                 33000         0
1     Color     Gore Verbinski  ...                     0         0
2     Color         Sam Mendes  ...                 85000         0
3     Color  Christopher Nolan  ...                164000         0
4       NaN        Doug Walker  ...                     0         0
...     ...                ...  ...                   ...       ...
4911  Color        Scott Smith  ...                    84         0
4912  Color                NaN  ...                 32000         0
4913  Color   Benjamin Roberds  ...                    16         0
4914  Color        Daniel Hsia  ...                   660         0
4915  Color           Jon Gunn  ...                   456         0

4916 rows × 28 columns

movie.drop([0,2])
      color      director_name  ...  has_seen  actor_director_facebook_likes
1     Color     Gore Verbinski  ...         0                        45000.0
3     Color  Christopher Nolan  ...         0                        50000.0
4       NaN        Doug Walker  ...         0                          143.0
5     Color     Andrew Stanton  ...         0                         1272.0
6     Color          Sam Raimi  ...         0                        35000.0
...     ...                ...  ...       ...                            ...
4911  Color        Scott Smith  ...         0                         1107.0
4912  Color                NaN  ...         0                         1434.0
4913  Color   Benjamin Roberds  ...         0                            0.0
4914  Color        Daniel Hsia  ...         0                         1665.0
4915  Color           Jon Gunn  ...         0                          109.0

4914 rows × 30 columns
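
A compact sketch of the same create-and-delete steps on a hypothetical two-row frame. It differs from the code above in two labelled ways: the sum fills missing values with 0 first, so one missing input does not make the total NaN, and `drop` is called with the `columns=` and `index=` keywords instead of `axis`. The column name `actor_total_facebook_likes` is made up for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'actor_1_facebook_likes': [1000.0, np.nan],
    'actor_2_facebook_likes': [500.0, 200.0],
})

# Assigning a scalar broadcasts it to every row.
toy['has_seen'] = 0

# Treat missing likes as 0 before adding (an assumption, not the source's choice).
toy['actor_total_facebook_likes'] = (
    toy['actor_1_facebook_likes'].fillna(0)
    + toy['actor_2_facebook_likes'].fillna(0)
)

# drop returns a new DataFrame; toy itself keeps all rows and columns.
without_flag = toy.drop(columns=['has_seen'])
without_first_row = toy.drop(index=[0])

print(toy.shape, without_flag.shape, without_first_row.shape)
```

To make a deletion stick, assign the result back, for example `toy = toy.drop(columns=['has_seen'])`.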

Original article: https://www.cnblogs.com/shiyushiyu/p/9712998.html