Pandas Quick Start

Learn the basics of pandas in ten minutes.
pandas is one of the most important tools for data analysis in Python.

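Basic usage

# pandas is conventionally imported as pd
import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('file.csv')

# Return the dimensions of the data as (rows, columns)
df.shape

# Show column dtypes, non-null counts, and memory usage
df.info()

# Show summary statistics for the numeric columns
df.describe()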
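
The calls above assume that a file named file.csv exists. Here is a minimal self-contained sketch that reads from an in-memory CSV string instead. The CSV content and the names csv_text, sample, and stats are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for file.csv
csv_text = """animal,age,visits
cat,2.5,1
snake,0.5,2
dog,,3
"""

# read_csv accepts any file-like object, so a StringIO buffer works too
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)        # (number of rows, number of columns)
sample.info()              # prints dtypes, non-null counts, memory usage
stats = sample.describe()  # summary statistics, numeric columns only
print(stats)
```

Note that the empty age field of the dog row is read as a missing value (NaN), which is why describe() counts only two ages.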

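Fancy indexing

Our main data structure is the DataFrame. A DataFrame has two parts. The first is the columns, which have names (labels). The second is the index, which we call rows here to avoid ambiguity. Rows usually have no names, but they can be given names as well.
For example: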

import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

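# Fix the column order explicitly so the result matches the listing below
df = pd.DataFrame(data, index=labels,
                  columns=['age', 'animal', 'priority', 'visits'])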

>>> df
   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2
d  NaN    dog      yes       3
e  5.0    dog       no       2
f  2.0    cat       no       3
g  4.5  snake       no       1
h  NaN    cat      yes       1
i  7.0    dog       no       2
j  3.0    dog       no       1
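
To see the two parts described above, you can inspect the columns and index attributes directly. This is a small sketch on a two-row frame; the names small and unlabeled are only for illustration.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {'age': [2.5, np.nan], 'animal': ['cat', 'dog']},
    index=['a', 'b'],
)

print(small.columns.tolist())  # the column labels
print(small.index.tolist())    # the row labels

# Without an explicit index, rows get default integer labels starting at 0
unlabeled = pd.DataFrame({'age': [2.5, np.nan]})
print(unlabeled.index.tolist())
```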

Plain indexing

Plain indexing uses the same square-bracket syntax as a Python list.
When the index object is a slice, it selects rows.

>>> df[1:3]
   age animal priority  visits
b  3.0    cat      yes       3
c  0.5  snake       no       2

When the index object is a list, it selects columns.

>>> df[['age', 'animal']]
   age animal
a  2.5    cat
b  3.0    cat
c  0.5  snake
d  NaN    dog
e  5.0    dog
f  2.0    cat
g  4.5  snake
h  NaN    cat
i  7.0    dog
j  3.0    dog

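The following is equivalent to df[['age', 'animal']], but selects by column position instead of column name. Since [] looks up the items of a list as column labels, the positions are first converted to labels through df.columns.

>>> df[df.columns[[0, 1]]]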
   age animal
a  2.5    cat
b  3.0    cat
c  0.5  snake
d  NaN    dog
e  5.0    dog
f  2.0    cat
g  4.5  snake
h  NaN    cat
i  7.0    dog
j  3.0    dog
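
The type of object inside the brackets also decides what comes back. A rough sketch of the three common cases, using a three-row frame made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [2.5, 3.0, 0.5], 'animal': ['cat', 'cat', 'snake']},
    index=['a', 'b', 'c'],
)

one_column = df['age']       # a single label returns a Series
one_col_frame = df[['age']]  # a one-element list returns a DataFrame
first_rows = df[0:2]         # a slice selects rows by position

print(type(one_column).__name__)
print(type(one_col_frame).__name__)
print(first_rows.index.tolist())
```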

Positional indexing

>>> df.iloc[0:2, 0:2]
   age animal
a  2.5    cat
b  3.0    cat
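
Besides slices, iloc also accepts negative positions, lists of positions, and plain scalars. A short sketch on a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [2.5, 3.0, 0.5], 'animal': ['cat', 'cat', 'snake']},
    index=['a', 'b', 'c'],
)

last_row = df.iloc[-1]         # negative positions count from the end
picked = df.iloc[[0, 2], [1]]  # lists of positions for rows and columns
cell = df.iloc[1, 0]           # two scalars return a single value

print(last_row['animal'])
print(picked)
print(cell)
```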

Label indexing

The main difference between loc and iloc is that loc selects by label, not by integer position.

>>> df.loc[['a', 'b'], ['animal', 'age']]
  animal  age
a    cat  2.5
b    cat  3.0
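
One detail that is easy to miss: a slice in iloc excludes its end position, like a Python list, while a label slice in loc includes its end label. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [2.5, 3.0, 0.5], 'animal': ['cat', 'cat', 'snake']},
    index=['a', 'b', 'c'],
)

by_position = df.iloc[0:2]  # positions 0 and 1; position 2 is excluded
by_label = df.loc['a':'b']  # labels 'a' through 'b'; 'b' is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```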

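Mixed indexing

This combines positional and label indexing. The ix indexer that used to do this was deprecated in pandas 0.20 and removed in pandas 1.0, so convert the positions to labels and use loc.

>>> df.loc[df.index[0:2], ['animal', 'age']]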
  animal  age
a    cat  2.5
b    cat  3.0
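
One way to write a mixed selection without a mixed indexer is to go in the other direction: translate the column labels into positions and index purely by position with iloc. A sketch, assuming a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [2.5, 3.0, 0.5],
     'animal': ['cat', 'cat', 'snake'],
     'visits': [1, 3, 2]},
    index=['a', 'b', 'c'],
)

# Convert column labels to integer positions, then index by position only
col_positions = df.columns.get_indexer(['animal', 'age'])
subset = df.iloc[0:2, col_positions]

print(subset)
```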

Conditional indexing

>>> df[(df['animal'] == 'cat') & (df['age'] < 3)]
   age animal priority  visits
a  2.5    cat      yes       1
f  2.0    cat       no       3
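
Conditions are boolean masks, so they can be combined with & (and), | (or), and ~ (not), as long as each comparison is wrapped in parentheses. A few more variations, sketched on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'age': [2.5, 0.5, np.nan, 2.0],
     'animal': ['cat', 'snake', 'dog', 'cat']},
    index=['a', 'b', 'c', 'd'],
)

# isin tests membership in a list of values
cat_or_snake = df[df['animal'].isin(['cat', 'snake'])]

# ~ negates a mask
not_cat = df[~(df['animal'] == 'cat')]

# | is element-wise OR; a comparison against NaN is always False
young_or_dog = df[(df['age'] < 1) | (df['animal'] == 'dog')]

print(cat_or_snake.index.tolist())
print(not_cat.index.tolist())
print(young_or_dog.index.tolist())
```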

Data cleaning

Find missing values.

>>> df[df['age'].isnull()]
   age animal priority  visits
d  NaN    dog      yes       3
h  NaN    cat      yes       1
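
Before deciding what to do with the gaps, it often helps to count them per column, and sometimes simply dropping the incomplete rows is enough. A sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'age': [2.5, np.nan, 5.0, np.nan],
     'animal': ['cat', 'dog', 'dog', 'cat']},
    index=['a', 'b', 'c', 'd'],
)

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# Keep only the rows where age is present; df itself is left unchanged
complete_rows = df.dropna(subset=['age'])

print(missing_per_column)
print(complete_rows)
```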

Fill missing values.

>>> df['age'] = df['age'].fillna(0)
>>> df
   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2
d  0.0    dog      yes       3
e  5.0    dog       no       2
f  2.0    cat       no       3
g  4.5  snake       no       1
h  0.0    cat      yes       1
i  7.0    dog       no       2
j  3.0    dog       no       1
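
Zero is not always a sensible replacement for a missing age. Two other options, sketched on a made-up frame: fill with the column mean, or pass a dict to DataFrame.fillna to choose a fill value per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'age': [2.0, np.nan, 4.0], 'animal': ['cat', 'dog', 'dog']},
    index=['a', 'b', 'c'],
)

# mean() skips NaN, so the gap is filled with the mean of 2.0 and 4.0
filled_mean = df['age'].fillna(df['age'].mean())

# A dict maps column names to fill values; a new DataFrame is returned
filled_frame = df.fillna({'age': 0})

print(filled_mean.tolist())
print(filled_frame)
```

Both calls return new objects, so df still contains the original NaN until you assign the result back.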

Replace string values with booleans.

>>> df['priority'] = df['priority'].map({'yes': True, 'no': False})
>>> df
   age animal priority  visits
a  2.5    cat     True       1
b  3.0    cat     True       3
c  0.5  snake    False       2
d  0.0    dog     True       3
e  5.0    dog    False       2
f  2.0    cat    False       3
g  4.5  snake    False       1
h  0.0    cat     True       1
i  7.0    dog    False       2
j  3.0    dog    False       1
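
One thing to watch with map and a dict: any value that is not a key in the dict becomes NaN. The sketch below uses a made-up Series with a hypothetical 'maybe' value to show this, next to a plain comparison that always yields booleans.

```python
import pandas as pd

# 'maybe' is a hypothetical unexpected value, not part of the data above
priority = pd.Series(['yes', 'no', 'maybe'], index=['a', 'b', 'c'])

# 'maybe' has no entry in the dict, so it is mapped to NaN
mapped = priority.map({'yes': True, 'no': False})

# A comparison is always boolean; anything other than 'yes' becomes False
compared = priority == 'yes'

print(mapped.isna().sum())
print(compared.tolist())
```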