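Introduction

pandas is a powerful Python data-analysis library built on top of NumPy. Its arrival helped make Python one of the most widely used and capable environments for data analysis.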

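Main features of pandas

Data structures with automatic label alignment: DataFrame and Series
Integrated time-series functionality
A rich set of mathematical operations
Flexible handling of missing data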
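
The alignment feature in the list above can be illustrated with a small sketch. The two Series and their labels are made up for illustration; the point is only that arithmetic matches values by label rather than by position.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Addition aligns on labels, not on position. Labels that appear in
# only one operand ("x" and "w") produce NaN in the result.
total = a + b
print(total)
```

Only "y" and "z" exist on both sides, so only those two entries carry a numeric sum.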

Installation

pip install pandas

Import

import pandas as pd

Series

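A Series is a one-dimensional, array-like object made up of a sequence of values and an associated sequence of data labels (its index).

Creating a Series

Method 1: from a list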
s1 = pd.Series([1,2,3,4])
s1

Output
0    1
1    2
2    3
3    4
dtype: int64
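The output shows the index on the left and the values on the right. Because no index was specified, pandas automatically creates an integer index running from 0 to N-1, where N is the length of the data. Values can then be retrieved by index, just as with the arrays and lists covered earlier.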
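
A minimal sketch of retrieving values through the default integer index. The variable names are illustrative; `iloc` is used for the slice so that it is unambiguously positional.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

first = s[0]            # with the default index, label 0 is also position 0
middle = s.iloc[1:3]    # positional slice, end-exclusive like a list slice
labels = list(s.index)  # the automatically generated labels

print(first)
print(middle.tolist())
print(labels)
```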

Method 2: from a list with a custom index
s2 = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
s2

Output
a    1
b    2
c    3
d    4
dtype: int64
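Here index is a list of custom labels, in this case strings. Besides lookup by label, values can still be retrieved by integer position, preferably through iloc (for example s2.iloc[0]).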
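
A short sketch of both access styles on a Series with string labels. It uses `iloc` for positional access because, as far as I know, recent pandas releases warn when a plain integer such as `s[1]` is used as a position on a non-integer index.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

by_label = s["b"]             # lookup by custom label
by_position = s.iloc[1]       # lookup by integer position
label_slice = s.loc["b":"c"]  # label slices include both endpoints

print(by_label, by_position)
print(label_slice.tolist())
```

Note the difference from positional slicing: `s.loc["b":"c"]` returns two elements, whereas `s.iloc[1:2]` would return one.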

Method 3: from a dict
pd.Series({"a":1,"b":2})

Output
a    1
b    2
dtype: int64
The dict keys become the index.

Method 4: from a scalar
pd.Series(0,index=['a','b','c'])

Output
a    0
b    0
c    0
dtype: int64
# Creates a Series in which every value is 0

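Handling missing values

dropna()   # drop entries whose value is NaN
fillna()   # fill in missing data
isnull()   # return a boolean array: True where a value is missing
notnull()  # return a boolean array: False where a value is missing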
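
The four methods listed above can be tried on a tiny Series. This is a sketch with made-up values; none of the calls modifies the original Series because `inplace` is not used.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

missing_mask = s.isnull()   # True where the value is missing
present_mask = s.notnull()  # the exact opposite of isnull()
dropped = s.dropna()        # new Series without the missing entry
filled = s.fillna(0)        # new Series with the gap replaced by 0

print(missing_mask.tolist())
print(dropped.index.tolist())
print(filled.tolist())
```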

s1 = pd.Series({'sean':18, 'yang':20, 'bella':22, 'cloud':34})
s1

Output
sean     18
yang     20
bella    22
cloud    34
dtype: int64

s2 = pd.Series(s1, index=['sean', 'yang', 'rocky', 'cloud'])
s2

Output
sean     18.0
yang     20.0
rocky     NaN
cloud    34.0
dtype: float64

import numpy as np
type(np.nan)
# float, which is why s2 was upcast from int64 to float64
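
The dtype change can be checked directly. The sketch below rebuilds the two Series from the text and also spells the same alignment out with `reindex`; the `labels` and `reindexed` names are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

s1 = pd.Series({"sean": 18, "yang": 20, "bella": 22, "cloud": 34})
labels = ["sean", "yang", "rocky", "cloud"]

s2 = pd.Series(s1, index=labels)  # same construction as in the text
reindexed = s1.reindex(labels)    # explicit spelling of the same alignment

# "rocky" has no value in s1, so it becomes NaN. NaN is a float, so the
# whole Series is stored as float64 instead of int64.
print(s1.dtype, s2.dtype)
print(np.isnan(s2["rocky"]))
```

"bella" is dropped because it is not in the requested index, and "rocky" is added as a missing value.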

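Rows with missing data

Option 1: drop the row

s2.dropna()  # returns a new Series without the NaN row; s2 itself is unchanged

Option 2: fill the gap

s2.fillna(0, inplace=True)  # fillna replaces every NaN with the given value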

Output
sean     18.0
yang     20.0
rocky     0.0
cloud    34.0
dtype: float64
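
The difference between returning a new Series and modifying one in place is easy to miss, so here is a hedged sketch using the same data. The names `ages`, `dropped`, and `filled` are illustrative.

```python
import pandas as pd

ages = pd.Series({"sean": 18, "yang": 20, "bella": 22, "cloud": 34})
s2 = pd.Series(ages, index=["sean", "yang", "rocky", "cloud"])

dropped = s2.dropna()  # new Series; s2 still holds the NaN for "rocky"
filled = s2.fillna(0)  # new Series; s2 is still unchanged

still_missing = int(s2.isnull().sum())  # counted before the in-place call

s2.fillna(0, inplace=True)  # modifies s2 itself
print(s2)
```

Assigning the result to a new name is usually easier to follow than `inplace=True`, since the original data stays available for comparison.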

s2 = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
s2

Output
a    1
b    2
c    3
d    4
dtype: int64

s2.loc['a']  # loc selects by label
# 1

s2.iloc[1]  # iloc (integer location) selects by position
# 2
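
The distinction between loc and iloc matters most when the labels are themselves integers. A small sketch, with an index deliberately chosen so that labels and positions disagree:

```python
import pandas as pd

# Integer labels in a different order from the positions (illustrative).
s = pd.Series(["x", "y", "z"], index=[2, 0, 1])

by_label = s.loc[0]      # the entry whose label is 0
by_position = s.iloc[0]  # the first entry, whatever its label

print(by_label, by_position)
```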

DataFrame

df1 = pd.DataFrame({'one':[1,2,3,4], 'two':[5,6,7,8]})
df1

Output
	one	two
0	1	5
1	2	6
2	3	7
3	4	8

df1['one'][0]  # select the column first, then the row; df1.loc[0, 'one'] does the same in a single step
# 1
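
A sketch of the same lookup using the loc and iloc indexers, which take the row and column in one call. The DataFrame is the one from the text; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4], "two": [5, 6, 7, 8]})

cell = df.loc[0, "one"]   # row label 0, column "one"
cell_pos = df.iloc[0, 0]  # first row, first column by position
row = df.loc[1]           # a whole row, returned as a Series
col = df["two"]           # a whole column, returned as a Series

print(cell, cell_pos)
print(row.tolist())
print(col.tolist())
```

A single `loc` call is generally the safer choice when assigning values, because chained indexing may end up writing to a temporary copy.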

Reading Excel and CSV files
pd.read_csv('./douban_movie.csv')
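
Since douban_movie.csv is not included here, the following sketch reads CSV text from an in-memory buffer instead. The column names and rows are hypothetical and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "title,score\nMovie A,9.1\nMovie B,8.7\n"

# read_csv accepts a path or any file-like object; the first line
# is used as the header by default.
movies = pd.read_csv(io.StringIO(csv_text))

print(movies.shape)
print(list(movies.columns))
```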

import pandas as pd

res = pd.read_html('https://baike.baidu.com/item/NBA%E6%80%BB%E5%86%A0%E5%86%9B/2173192?fr=aladdin')  # requests the URL and parses every table on the page into a list of DataFrames

df = res[0]  # take the first table
df

df.columns = df.iloc[0]  # use the first row as the column names
df

df.drop([0], inplace=True)  # drop the first row, which now duplicates the header
df
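
The header clean-up can be reproduced offline on a tiny stand-in table. The rows below are invented for illustration and do not come from the scraped page; the sketch also resets the row numbering, which the original steps leave starting at 1.

```python
import pandas as pd

# Stand-in for a scraped table whose real header ended up in row 0.
raw = pd.DataFrame([
    ["year", "champion"],
    ["2015", "Team A"],
    ["2016", "Team B"],
])

raw.columns = raw.iloc[0]                         # promote row 0 to column names
clean = raw.drop(index=0).reset_index(drop=True)  # remove the duplicate row
clean = clean.rename_axis(columns=None)           # drop the leftover axis name

print(clean)
```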

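# Group the rows by champion ('冠军' is the champion column), the first step toward counting each team's titles
df.groupby('冠军').groups  # maps each champion to the row labels in its group, like the grouping step of SQL's GROUP BY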

Output
{'休斯顿火箭队': Int64Index([48, 49], dtype='int64'),
 '克里夫兰骑士队': Int64Index([70], dtype='int64'),
 '华盛顿子弹队': Int64Index([32], dtype='int64'),
 '圣安东尼奥马刺队': Int64Index([53, 57, 59, 61, 68], dtype='int64'),
		...
  '迈阿密热火队': Int64Index([60, 66, 67], dtype='int64'),
 '金州勇士队': Int64Index([29, 69, 71, 72], dtype='int64')}

df.groupby('冠军').size().sort_values(ascending=False)  # aggregate: number of rows per champion, largest first

Output
冠军
波士顿凯尔特人队     17
洛杉矶湖人队       11
芝加哥公牛队        6
圣安东尼奥马刺队      5
明尼阿波利斯湖人队     5
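
The grouping and counting steps can be tried without scraping anything. The sketch below uses a small illustrative table with an English column name; `value_counts` is shown as a shorter route that should give the same counts.

```python
import pandas as pd

# Small illustrative table, not the scraped data.
finals = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018],
    "champion": ["Warriors", "Cavaliers", "Warriors", "Warriors"],
})

grouped = finals.groupby("champion")

groups = grouped.groups                                # champion -> row labels
titles = grouped.size().sort_values(ascending=False)   # rows per champion
shortcut = finals["champion"].value_counts()           # same counts, one call

print(titles)
```

In SQL terms, `titles` corresponds roughly to SELECT champion, COUNT(*) ... GROUP BY champion ORDER BY COUNT(*) DESC.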