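The pandas module

Introduction to pandas:

pandas is a powerful Python toolkit for data analysis, built on top of NumPy.

Main features of pandas:

1. Data structures with built-in index alignment: DataFrame and Series
2. Integrated time series functionality
3. A rich set of mathematical operations
4. Flexible handling of missing data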

Installation:

pip install pandas

Importing:

import pandas as pd

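Series --- one-dimensional data object

A Series is an object similar to a one-dimensional array, made up of a sequence of values and an associated sequence of data labels (the index).

Ways to create one: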

import pandas as pd
pd.Series([4,7,-5,3])
pd.Series([4,7,-5,3],index=['a','b','c','d'])
pd.Series({'a':1,'b':2})
pd.Series(0,index=['a','b','c','d'])

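Getting the array of values and the array of labels: the values attribute and the index attribute
A Series behaves like a cross between a list (array) and a dictionary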
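
The creation forms and the two attributes above can be sketched together as follows. This is a minimal illustration, not part of the original notes; the variable names (zeros, from_dict, labels, data) are made up for the example.

```python
import pandas as pd

# A scalar is broadcast to every label in the index.
zeros = pd.Series(0, index=["a", "b", "c", "d"])

# Dictionary keys become the labels.
from_dict = pd.Series({"a": 1, "b": 2})

# .index holds the labels; .to_numpy() (like .values) holds the data.
labels = list(from_dict.index)
data = from_dict.to_numpy().tolist()

# The dictionary-like membership test checks labels, not values.
has_a = "a" in from_dict
has_1 = 1 in from_dict  # 1 is a value here, not a label

print(zeros.tolist(), labels, data, has_a, has_1)
```

The membership test is the dictionary side of a Series, while .to_numpy() exposes the array side.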

Example code:

# Ways to create a Series
import pandas as pd
import numpy as np

pd.Series([2,3,4,5])  # create a Series from a list
"""
Output:
0    2
1    3
2    4
3    5
dtype: int64

# the left column is the index, the right column is the values
"""

pd.Series([2,3,4,5],index=["a","b","c","d"])  # specify the index
"""
Output:
a    2
b    3
c    4
d    5
dtype: int64
"""

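# A Series supports array features (positional access)
pd.Series(np.arange(5))  # create a Series from an array
"""
Output:
0    0
1    1
2    2
3    3
4    4
dtype: int32
# the integer dtype follows NumPy's platform default, so this may show int64 instead
"""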

sr = pd.Series([2,3,4,5],index=["a","b","c","d"])
sr
"""
a    2
b    3
c    4
d    5
dtype: int64
"""

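# Indexing:
sr.iloc[0]
# Output: 2  # sr has a label index, but values can still be fetched by position; older pandas also accepted plain sr[0], which is deprecated in newer releases

sr.iloc[[1,2,0]]  # sr.iloc[[position1, position2, ...]]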
"""
b    3
c    4
a    2
dtype: int64
"""

sr['d']
# Output: 5

# A Series can be combined with a scalar
sr+2
"""
a    4
b    5
c    6
d    7
dtype: int64
"""

# Two Series can also be combined with each other; values are matched by label (here the labels are identical)
sr + sr
"""
a     4
b     6
c     8
d    10
dtype: int64
"""

# Slicing
sr[0:2]  # positional slice: start included, end excluded
"""
a    2
b    3
dtype: int64
"""

# A Series also works with NumPy universal functions
np.abs(sr)
"""
a    2
b    3
c    4
d    5
dtype: int64
"""

# Boolean indexing (filtering) is supported
sr[sr>3]
"""
c    4
d    5
dtype: int64
"""

sr>3
"""
a    False
b    False
c     True
d     True
dtype: bool
"""

# A Series supports dictionary features (labels)
# Create a Series from a dictionary
sr = pd.Series({"a":1,"b":2})
sr
"""
a    1
b    2
dtype: int64
# the dictionary keys become the labels
"""


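sr["a"]
# Output: 1
sr.iloc[0]  # access by position; plain sr[0] is deprecated in newer pandas when the labels are not integers
# Output: 1

# Check whether a string is one of the labels of a Series
"a" in sr
# Output: True

for i in sr:
    print(i)
"""
Printed output:
1
2

# a for loop iterates over the values of a Series, not its labels; this differs from a dictionary
"""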

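# Get the values and the index of a Series separately
sr.index  # get the index
# Output: Index(['a', 'b'], dtype='object')  # an Index object; it behaves much like an array, but it is immutable
sr.index[0]
# Output: 'a'

sr.values  # get the values of the Series
# Output: array([1, 2], dtype=int64)

# Indexing by key
sr['a']
# Output: 1
sr[['a','b']] # fancy indexing works with labels too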
"""
a    1
b    2
dtype: int64
"""

sr = pd.Series([1,2,3,4,5,6],index=['a','b','c','d','e','f'])
sr
"""
a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64
"""

sr[['a','c']]
"""
a    1
c    3
dtype: int64
"""

sr['a':'c']  # slicing by label; both endpoints are included
"""
a    1
b    2
c    3
dtype: int64
"""
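
The difference between positional and label-based access shown above can be made explicit with iloc and loc. The following is a small sketch rather than part of the original example; the result names (first, picked, by_pos, by_label) are chosen only for illustration. Newer pandas releases warn when an integer in plain [] is used as a position on a Series with non-integer labels, so iloc is the safer spelling for positions.

```python
import pandas as pd

sr = pd.Series([1, 2, 3, 4, 5, 6], index=["a", "b", "c", "d", "e", "f"])

first = sr.iloc[0]            # element at position 0
picked = sr.iloc[[1, 2, 0]]   # positions 1, 2, 0, in that order
by_pos = sr.iloc[0:2]         # positional slice: end excluded
by_label = sr.loc["a":"c"]    # label slice: end included

print(first)
print(list(picked.index))
print(list(by_pos.index))
print(list(by_label.index))
```

Spelling out iloc or loc removes any doubt about whether a key is meant as a position or as a label.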

The integer index problem with Series:

pandas objects with an integer index are easy to get wrong, for example:

import pandas as pd
import numpy as np

sr = pd.Series(np.arange(10))
sr
"""
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
# the integer index above was generated automatically
"""

sr2 = sr[5:].copy()
sr2
"""
5    5
6    6
7    7
8    8
9    9
dtype: int32
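# the index above is still an integer index, but it no longer starts at 0
"""
sr2[5]  # here 5 is interpreted as a label, not a position
# Output: 5

# sr2[-1]  # raises KeyError; when the index is made of integers, a single integer in [] is always interpreted as a label

# Solution: loc and iloc
sr2.loc[5]  # loc interprets the content of [] as a label
# Output: 5
sr2.iloc[4] # iloc interprets the content of [] as a position
# Output: 9
sr2.iloc[0:3]
"""
5    5
6    6
7    7
dtype: int32
"""
# so with an integer index, always use loc and iloc to make the intent explicit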

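If the index is of integer type, looking up a value with an integer in [] is always label-based
Solution: the loc attribute (interprets the key as a label) and the iloc attribute (interprets the key as a position)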
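
The rule above can be checked with a short, self-contained sketch. It is only an illustration; the flag name raised and the result names are hypothetical helpers added for the example.

```python
import numpy as np
import pandas as pd

# Same setup as above: integer labels 5..9 after slicing.
sr2 = pd.Series(np.arange(10))[5:].copy()

by_label = sr2.loc[5]    # label 5
by_pos = sr2.iloc[-1]    # last position, whatever its label is

# Plain [] looks for the *label* -1, which does not exist here.
try:
    sr2[-1]
    raised = False
except KeyError:
    raised = True

print(by_label, by_pos, raised)
```

Negative positions therefore only work through iloc once the labels are integers.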

Series --- data alignment

When pandas operates on two Series objects, it first aligns them by index and then computes

Example code:

# Series -- data alignment
import pandas as pd

sr1 = pd.Series([12,23,34],index=["c","a","d"])
sr2 = pd.Series([11,20,10],index=["d","c","a"])
sr1 + sr2
"""
a    33    # 23+10
c    32    # 12+20
d    45    # 34+11
dtype: int64
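# the data is aligned by label
"""
# when pandas operates on two Series objects, it aligns them by index and then computes

# Note: a pandas index may contain duplicate labels, but it is best to avoid them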
pd.Series([1,2],index=["a","a"])  
"""
a    1
a    2
dtype: int64
"""

# When the two pandas objects have different lengths
sr3 = pd.Series([12,23,34],index=["c","a","d"])
sr4 = pd.Series([11,20,10,21],index=["d","c","a","b"])
sr3+sr4
"""
a    33.0
b     NaN
c    32.0
d    45.0
dtype: float64
# in pandas, NaN is treated as a missing value
"""

sr5 = pd.Series([12,23,34],index=["c","a","d"])
sr6 = pd.Series([11,20,10],index=["b","c","a"])
sr5+sr6
"""
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
"""
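# To get 11 at label "b" and 34 at label "d" in the result above, use the methods add, sub, mul, div (addition, subtraction, multiplication, division) with fill_value; e.g. sr5.add(sr6,fill_value=0)
sr5.add(sr6)
"""
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
# without fill_value, sr5.add(sr6) gives the same result as sr5+sr6
"""

sr5.add(sr6,fill_value=0)  # fill_value: if a label is present in one Series but missing from the other, the missing value is replaced by fill_value before the operation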
"""
a    33.0
b    11.0
c    32.0
d    34.0
dtype: float64
"""
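
The same fill_value idea applies to the other arithmetic methods mentioned above. The sketch below reuses the sr5 and sr6 data; the names diff and prod, and the choice of 1 as the fill value for multiplication, are assumptions made for this illustration.

```python
import pandas as pd

sr5 = pd.Series([12, 23, 34], index=["c", "a", "d"])
sr6 = pd.Series([11, 20, 10], index=["b", "c", "a"])

# Subtraction: a label missing on one side counts as 0.
diff = sr5.sub(sr6, fill_value=0)
# a: 23-10, b: 0-11, c: 12-20, d: 34-0

# Multiplication: 1 is the neutral fill value, so the lone value is kept.
prod = sr5.mul(sr6, fill_value=1)
# a: 23*10, b: 1*11, c: 12*20, d: 34*1

print(diff.to_dict())
print(prod.to_dict())
```

The fill value should be chosen to suit the operation: 0 leaves sums and differences unaffected, while 1 does the same for products.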