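Pandas: Data Wrangling

Common data-wrangling functions in the pandas package, roughly equivalent to data preprocessing (feature engineering not included).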

Dropping values
1. drop()
Handling missing values
1. isnull() & notnull()
2. dropna()
3. fillna()
Replacing values
1. replace()
2. get_dummies()
Handling duplicates
1. duplicated()
2. is_unique
3. unique()
4. drop_duplicates()
Sorting & ranking
1. sort_index()
2. rank()
Setting the index
1. reindex()
2. set_index()
3. reset_index()
4. stack() & unstack()
Renaming columns

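Dropping values

drop()
- Typically used to drop columns from a DataFrame. Selecting only the columns you want to keep gives the same result, so use whichever fits the situation. The benefit of discarding unneeded columns is lower memory use.

# Drop the columns named in labels, working along columns (axis=1)
df.drop(labels, axis=1)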
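As a minimal sketch of the point above (the sample frame and the names `dropped` and `kept` are made up for illustration), dropping columns and selecting the remaining ones give the same result, and `drop()` leaves the original frame untouched:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop() returns a new DataFrame; df itself is unchanged
dropped = df.drop(["B", "C"], axis=1)

# Selecting the columns to keep gives the same result
kept = df[["A"]]

print(dropped.equals(kept))
```

Memory is only freed once nothing refers to the original frame any more, for example after reassigning `df = df.drop(...)`.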

Handling missing values

isnull() & notnull(): test for missing values

df.isnull()
s.isnull()
s.isnull().value_counts()
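A small sketch of how these masks are typically used, assuming a made-up Series `s`: the boolean mask can be summed to count missing values, and `notnull()` can filter them out.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration
s = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = s.isnull()             # True where the value is missing
n_missing = int(mask.sum())   # True counts as 1, so this counts the NaNs
non_missing = s[s.notnull()]  # keep only the non-missing values

print(n_missing)
print(non_missing)
```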

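dropna(): drop missing values

# Defaults are axis=0 and how='any': work row by row and drop every row that contains at least one NaN
df.dropna()
# axis=1: drop every column that contains at least one NaN
df.dropna(axis=1)

# Drop a row only when all of its values are NaN
df.dropna(how='all')

# Keep rows that have at least 3 non-null values
df.dropna(thresh=3)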
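To make the difference between the options concrete, here is a sketch on a made-up frame (the data and variable names are assumptions, not from the notes):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 0 is complete, row 1 has one value,
# row 2 is entirely NaN, row 3 has two values
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [1.0, 2.0, np.nan, 4.0],
    "C": [1.0, np.nan, np.nan, np.nan],
})

any_rows = df.dropna()              # keeps only row 0
all_rows = df.dropna(how="all")     # drops only row 2
thresh_rows = df.dropna(thresh=2)   # keeps rows with at least 2 non-null values
any_cols = df.dropna(axis=1)        # every column has a NaN, so no columns remain

print(any_rows.index.tolist(), all_rows.index.tolist(), thresh_rows.index.tolist())
```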

fillna(): fill missing values
```
df.fillna(0)
```
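Besides a single constant, `fillna()` also accepts a per-column dict or a Series such as the column means. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

zero_filled = df.fillna(0)                   # same constant everywhere
per_column = df.fillna({"A": 0, "B": -1})    # a different value per column
mean_filled = df.fillna(df.mean())           # each column filled with its own mean

print(mean_filled)
```

Like most of the methods in these notes, `fillna()` returns a new object and leaves `df` unchanged.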

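Replacing values

replace()

# Replace every -999 in column A of df with a missing value (np is numpy: import numpy as np)
df["A"].replace(-999, np.nan)

# Replace both -999 and 1000 with missing values
obj.replace([-999, 1000], np.nan)

# Replace -999 with a missing value and 1000 with 0
obj.replace([-999, 1000], [np.nan, 0])

# Same as above, written as a dict, which is clearer
obj.replace({-999: np.nan, 1000: 0})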
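A runnable sketch of the list form and the dict form, assuming `obj` is a Series with sentinel values (the data is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical Series where -999 and 1000 are sentinel values
obj = pd.Series([1.0, -999.0, 2.0, 1000.0])

both_nan = obj.replace([-999, 1000], np.nan)     # both sentinels become NaN
mapped = obj.replace({-999: np.nan, 1000: 0})    # -999 -> NaN, 1000 -> 0

print(both_nan)
print(mapped)
```

`replace()` returns a new Series here; `obj` keeps its original values unless the result is assigned back.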

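Handling duplicates

duplicated()

# Returns a boolean Series: True marks a value (Series) or row (DataFrame) that repeats an earlier one
s.duplicated()
df.duplicated()

unique()

# Returns an array of the unique values
df["A"].unique()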
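A short sketch on a made-up Series showing what these calls return. It also uses `is_unique`, which is a property rather than a method, so it is read without parentheses:

```python
import pandas as pd

# Hypothetical sample data for illustration
s = pd.Series(["a", "b", "a", "c", "b"])

dup_mask = s.duplicated()   # True for the second "a" and the second "b"
uniques = s.unique()        # unique values in order of first appearance
all_unique = s.is_unique    # property: False because values repeat

print(dup_mask.tolist())
print(list(uniques))
print(all_unique)
```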

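drop_duplicates()

# Keep one row per unique value in column k1; the first occurrence is kept by default
df.drop_duplicates(["k1"])

# Keep one row per unique (k1, k2) combination; keep='last' keeps the last occurrence
# (older pandas versions spelled this take_last=True)
df.drop_duplicates(["k1", "k2"], keep="last")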
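The effect of the `keep` argument is easiest to see on a small frame. This is a sketch with made-up data; the column `v` is only there to tell the rows apart:

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 share the same (k1, k2) pair
df = pd.DataFrame({
    "k1": ["one", "one", "two", "two"],
    "k2": [1, 1, 2, 3],
    "v": [10, 20, 30, 40],
})

first = df.drop_duplicates(["k1"])                      # first row of each k1 value
last = df.drop_duplicates(["k1", "k2"], keep="last")    # last row of each (k1, k2) pair

print(first.index.tolist())
print(last.index.tolist())
```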

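Sorting & ranking

sort_index()

Sorting by index

# Default axis=0 sorts by the row labels; ascending=True sorts in ascending order
s.sort_index()
df.sort_index()

# Sort by the column labels; ascending=False sorts in descending order
df.sort_index(axis=1, ascending=False)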
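A minimal sketch of both directions, using a made-up frame whose row and column labels start out unsorted:

```python
import pandas as pd

# Hypothetical sample data with unsorted row and column labels
df = pd.DataFrame(
    {"b": [1, 2, 3], "c": [4, 5, 6], "a": [7, 8, 9]},
    index=["z", "x", "y"],
)

by_rows = df.sort_index()                              # rows ordered x, y, z
by_cols_desc = df.sort_index(axis=1, ascending=False)  # columns ordered c, b, a

print(by_rows)
print(by_cols_desc)
```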

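Sorting by value

# Sort a Series by value with sort_values() (called order() in old pandas versions); missing values are placed at the end by default
s = pd.Series([4, 6, np.nan, 2, np.nan])
s.sort_values()

# A DataFrame can be sorted by the values of one or more columns
df.sort_values(by="A")
df.sort_values(by=["A", "B"])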
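A runnable sketch of value sorting with current pandas, where `sort_values()` covers both the Series and the DataFrame case (the frame is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 6, np.nan, 2, np.nan])
sorted_s = s.sort_values()   # 2, 4, 6 first, then the NaNs

# Hypothetical sample data for illustration
df = pd.DataFrame({"A": [2, 1, 2, 1], "B": [1, 4, 3, 2]})
by_ab = df.sort_values(by=["A", "B"])   # sort by A, break ties with B

print(sorted_s)
print(by_ab)
```

Passing `na_position="first"` would put the missing values at the front instead.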

rank()
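The notes leave this entry empty. As a minimal sketch of the default behaviour (the data is made up): `rank()` assigns ranks starting at 1, and tied values share the average of their ranks unless another `method` is chosen.

```python
import pandas as pd

# Hypothetical sample data with a tie between the two 7s
s = pd.Series([7, 3, 7, 1])

avg_rank = s.rank()                  # ties get the average rank: 3.5 each
first_rank = s.rank(method="first")  # ties broken by order of appearance

print(avg_rank.tolist())
print(first_rank.tolist())
```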

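Setting the index

reindex()
- Conforms the data to a new index or new columns.
- Default: changes the index and returns a new DataFrame.

# Returns a new DataFrame aligned to the new index; rows whose labels are not listed are dropped
# Labels that did not exist in the original are added and filled with NaN
df2 = df1.reindex(['a','b','c','d','e'])

# fill_value supplies a default for labels that did not exist before, instead of NaN
df2 = df1.reindex(['a','b','c','d','e'], fill_value=0)

# reindex() has no inplace parameter; to update df1 itself, assign the result back
df1 = df1.reindex(['a','b','c','d','e'])

# reindex can change the columns as well as the index (rows)
states = ["Texas","Utah","California"]
df2 = df1.reindex(columns=states)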
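A sketch that shows what happens to missing and unlisted labels, assuming a small made-up `df1` (the column "Ohio" and the values are illustrative only):

```python
import pandas as pd

# Hypothetical sample data for illustration
df1 = pd.DataFrame({"Texas": [1, 2, 3], "Ohio": [4, 5, 6]}, index=["a", "c", "d"])

df2 = df1.reindex(["a", "b", "c"])                 # "b" is new (NaN), "d" is dropped
df3 = df1.reindex(["a", "b", "c"], fill_value=0)   # "b" is filled with 0 instead
df4 = df1.reindex(columns=["Texas", "Utah"])       # "Utah" is new (NaN), "Ohio" is dropped

print(df2)
print(df3)
print(df4)
```

In every case `df1` itself stays as it was, because `reindex()` always returns a new object.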

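set_index()
- Turns one or more columns of a DataFrame into the index.
- One way to build a hierarchical index.

# Use two of the columns, race and sex, as the index: race is level 1, sex is level 2
# inplace=True modifies the original dataset
adult.set_index(['race','sex'], inplace=True)

# By default, columns that become the index are removed from the DataFrame
# drop=False keeps them as columns as well
adult.set_index(['race','sex'], drop=False, inplace=True)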
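A minimal sketch of the two variants side by side. The tiny `adult` frame below is a stand-in with made-up rows, and the `age` column is an assumption added so that something remains after the index is set:

```python
import pandas as pd

# Hypothetical stand-in for the adult dataset
adult = pd.DataFrame({
    "race": ["White", "White", "Black"],
    "sex": ["Male", "Female", "Male"],
    "age": [39, 50, 38],
})

indexed = adult.set_index(["race", "sex"])           # race and sex leave the columns
kept = adult.set_index(["race", "sex"], drop=False)  # race and sex stay as columns too

print(list(indexed.index.names), list(indexed.columns))
print(list(kept.columns))
```

Without `inplace=True`, `set_index()` returns a new DataFrame, which makes it possible to compare both results against the same original.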

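reset_index()
- The inverse of the hierarchical index built with set_index().
- It removes the hierarchical index, turns the index levels back into columns, and restores the default integer index.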
```
adult.reset_index()
```
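A sketch of the round trip, again using a made-up stand-in for `adult` (the rows and the `age` column are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the adult dataset
adult = pd.DataFrame({
    "race": ["White", "White", "Black"],
    "sex": ["Male", "Female", "Male"],
    "age": [39, 50, 38],
})

indexed = adult.set_index(["race", "sex"])
restored = indexed.reset_index()   # index levels become columns again

print(list(restored.columns))
print(restored.index.tolist())
```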

Renaming columns

# Rename the column '库存数量' (stock quantity) to '12月20日库存数量' (stock quantity on Dec 20)
df.rename(columns={'库存数量': '12月20日库存数量'}, inplace=True)
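A minimal sketch of the non-inplace form, on a made-up frame (the `sku` column and the values are assumptions). Columns that are not named in the mapping are left alone:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"库存数量": [5, 8], "sku": ["x", "y"]})

# Without inplace=True, rename() returns a new DataFrame
renamed = df.rename(columns={"库存数量": "12月20日库存数量"})

print(list(renamed.columns))
print(list(df.columns))
```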