Case Study

These notes record the pandas APIs I learned from this case study and did not know before.

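1. Descriptive statistics

The usual call is df.describe().

Numeric and object-typed (categorical) variables can also be described separately.

Numeric variables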

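# Use describe() to inspect the distribution of selected variables
# Survived is a 0-1 variable, so its mean is the proportion of passengers who survived; a very handy trick
titanic_df[["Survived", "Age", "SibSp", "Parch"]].describe()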
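The point about the mean of a 0-1 variable can be checked on a tiny example. This is only a sketch: the `toy` frame below is made-up data, not the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive
toy = pd.DataFrame({"Survived": [1, 0, 0, 1, 0]})

# The mean of a 0-1 column is (number of ones) / (number of rows)
survival_rate = toy["Survived"].mean()
manual_rate = toy["Survived"].sum() / len(toy)

print(survival_rate, manual_rate)
```

Both numbers agree because summing a 0-1 column counts the ones.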

Categorical variables

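# Use include=[object] to describe the categorical (object-dtype) variables
# (the np.object alias was removed in NumPy 1.24; the builtin object works the same way)
# count: number of non-missing values
# unique: number of distinct values
# top: most frequent value
# freq: number of occurrences of the most frequent value
titanic_df.describe(include=[object])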
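To see what the two flavours of describe return, here is a minimal sketch on made-up data. The `toy` frame is hypothetical, and the `Sex` column is forced to object dtype on purpose, since string columns may get a dedicated string dtype in some pandas versions.

```python
import pandas as pd

# Hypothetical toy data; Sex is explicitly stored as object dtype
toy = pd.DataFrame({
    "Sex": pd.Series(["male", "female", "male", None], dtype=object),
    "Age": [22.0, 38.0, 26.0, 35.0],
})

numeric_summary = toy.describe()                 # numeric columns only (the default)
object_summary = toy.describe(include=[object])  # object-dtype columns only

print(numeric_summary)
print(object_summary)
```

The missing value in `Sex` is excluded from count, which is why count can be smaller than the number of rows.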

2. Filling missing values (in the age data)

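# Compute the median age of all passengers
age_median1 = titanic_df.Age.median()

# Fill the missing values with fillna and assign the result back to titanic_df;
# calling fillna(..., inplace=True) on titanic_df.Age may not update titanic_df in recent pandas versions
titanic_df["Age"] = titanic_df["Age"].fillna(age_median1)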
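The fill step can be tried in isolation. A minimal sketch, assuming a made-up `toy` frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({"Age": [22.0, np.nan, 30.0, 40.0, np.nan]})

# median() skips NaN, so this is the median of 22, 30 and 40
age_median = toy["Age"].median()

# Assign the filled column back instead of relying on inplace=True
toy["Age"] = toy["Age"].fillna(age_median)

print(toy)
```

Assigning the result back makes it explicit which object is modified.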

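3. Handling features across several dimensions (passenger class and survival probability): grouping or pivot tables

Two dimensions:

Survival probability for each passenger class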

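# Method 1: the classic group-aggregate-compute pattern (core content of lesson 6)
# Note: Survived is a 0-1 variable, so its mean is the survival proportion
# The surrounding parentheses let the method chain continue on the next line
(titanic_df[['Pclass', 'Survived']].groupby('Pclass').mean()
    .sort_values(by='Survived', ascending=False))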

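# Method 2: pivot_table gives the same result (new in this lesson)
# values: the column the aggregation is applied to; here we take its mean
# index: the variable(s) to group by
# aggfunc: the aggregation function to apply (the string 'mean' avoids the deprecation warning that np.mean triggers in recent pandas)
titanic_df.pivot_table(values='Survived', index='Pclass', aggfunc='mean')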

Sex and survival probability

# Method 1: groupby
(titanic_df[["Sex", "Survived"]].groupby('Sex').mean()
    .sort_values(by='Survived', ascending=False))

# Method 2: pivot_table
titanic_df.pivot_table(values='Survived', index='Sex', aggfunc='mean')

Three dimensions:

Survival probability by passenger class and sex combined

# Method 1: groupby
titanic_df[['Pclass', 'Sex', 'Survived']].groupby(['Pclass', 'Sex']).mean()

# Method 2: pivot_table
titanic_df.pivot_table(values='Survived', index=['Pclass', 'Sex'], aggfunc='mean')

# Method 3: pivot_table
# columns names a second categorical variable whose levels are laid out across the columns instead of the rows, hence the parameter name
titanic_df.pivot_table(values='Survived', index='Pclass', columns="Sex", aggfunc='mean')
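The three methods above produce the same numbers in different layouts. The sketch below, on a made-up `toy` frame, shows that unstacking the groupby result yields the same wide table as pivot_table with columns.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic records
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "female", "male"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Long layout: one row per (Pclass, Sex) pair
long_form = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# Wide layout from groupby: move the Sex level from the rows to the columns
wide_from_groupby = long_form.unstack("Sex")

# Wide layout directly from pivot_table
wide_from_pivot = toy.pivot_table(
    values="Survived", index="Pclass", columns="Sex", aggfunc="mean"
)

print(wide_from_groupby)
print(wide_from_pivot)
```

Which form to use is mostly a matter of taste: groupby composes well with further method chaining, while pivot_table states the row and column layout in a single call.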

Exercise:

Use groupby and pivot_table, respectively, to count the male and female passengers in each passenger class.

# 1. groupby
titanic_df.groupby(['Pclass', 'Sex']).agg({"Sex": "size"})   # size: number of rows per group
titanic_df.groupby(['Pclass', 'Sex']).agg({"Sex": "count"})  # count: number of non-missing values per group
titanic_df.groupby(['Pclass', 'Sex']).Sex.count()

# 2. pivot_table
titanic_df.columns  # list the columns that could be used as values
# titanic_df.pivot_table(values='Survived', index=['Pclass', 'Sex'], aggfunc='mean')
# aggfunc is applied to values; values can be any column that is not used in index
titanic_df.pivot_table(values='Name', index=['Pclass', 'Sex'], aggfunc="count")
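The difference between size and count only shows up when the counted column has missing values. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age has missing values, Name does not
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex":    ["male", "female", "male", "male", "female"],
    "Name":   ["A", "B", "C", "D", "E"],
    "Age":    [30.0, np.nan, 22.0, np.nan, 18.0],
})

grouped = toy.groupby(["Pclass", "Sex"])

n_rows = grouped.size()          # rows per group, missing values included
n_ages = grouped["Age"].count()  # non-missing Age values per group

# Counting a column without missing values reproduces the row counts
pivot_counts = toy.pivot_table(values="Name", index=["Pclass", "Sex"], aggfunc="count")

print(n_rows)
print(n_ages)
print(pivot_counts)
```

This is why a column such as Name, which has no missing entries, is a safe choice for values when the goal is to count passengers.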

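4. Discretizing a continuous variable

Discretizing continuous variables is a common technique in modeling.
Discretization splits the range of a variable into several sub-intervals and labels all observations that fall in the same interval with the same symbol.
Take age as an example: the minimum is 0.42 (an infant) and the maximum is 80. To produce a variable with five levels, we can use either cut or qcut.
cut splits the age range into 5 intervals of equal width, whereas qcut picks the interval boundaries so that each interval holds about the same number of observations (quintiles). The demonstration here uses cut.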

import pandas as pd

# Use the cut function
# Every interval has the same width, roughly 16 years
titanic_df['AgeBand'] = pd.cut(titanic_df['Age'], 5)
titanic_df.head()
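To make the contrast between cut and qcut concrete, here is a small sketch on a made-up series with one outlier. The `ages` values are hypothetical and chosen only to exaggerate the difference.

```python
import pandas as pd

# Hypothetical values with one large outlier
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

equal_width = pd.cut(ages, 5)   # 5 intervals of equal width between min and max
equal_count = pd.qcut(ages, 5)  # 5 intervals with boundaries at the quintiles

# Number of observations per interval, listed in interval order
width_counts = equal_width.value_counts().sort_index()
count_counts = equal_count.value_counts().sort_index()

print(width_counts)
print(count_counts)
```

With cut, the outlier stretches the range so that almost everything lands in the first interval; with qcut, the intervals have unequal widths but equal occupancy.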

Count the passengers in each age band

# Method 1: value_counts(); sort=False means the result is not sorted by count
titanic_df.AgeBand.value_counts(sort=False)

# Method 2: pivot_table
titanic_df.pivot_table(values='Survived', index='AgeBand', aggfunc='count')
titanic_df.pivot_table(values='Name', index='AgeBand', aggfunc='count')

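Exercise: taking sex, passenger class, and port of embarkation into account together, compute the survival probability and explore how the three relate to it in a single figure.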

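import seaborn as sns

# Method 1: the best option here, because it relates four dimensions (variables) in one figure
# factorplot was renamed to catplot in seaborn 0.9; kind="point" reproduces factorplot's default point plot
sns.catplot(x="Pclass", y="Survived", hue="Sex", col="Embarked", data=titanic_df, kind="point")
# Bar chart
sns.catplot(x="Pclass", y="Survived", hue="Sex", col="Embarked", data=titanic_df, kind="bar")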

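# Method 2: 'Embarked', 'Pclass', 'Sex', 'Survived'
# This approach can relate at most three variables at a time
# 1. Relationship among survival, passenger class, and sex
# (errorbar=None requires seaborn >= 0.12; older versions use ci=None)
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=titanic_df, errorbar=None)


# 2. Use FacetGrid to break the plot down by category
# The positional arguments map to x='Sex', y='Survived', hue='Pclass'; order fixes the order of the Sex levels on the x axis
(sns.FacetGrid(data=titanic_df, row='Embarked', aspect=1.5)
    .map(sns.pointplot, 'Sex', 'Survived', 'Pclass', order=['male', 'female'], palette='deep', errorbar=None))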

# Method 3
sns.pairplot(titanic_df.loc[:, ['Embarked', 'Pclass', 'Sex', 'Survived']], hue="Embarked")
# sns.pairplot(titanic_df.loc[:, ['Embarked', 'Pclass', 'Sex', 'Survived']], hue_order=["Embarked", "Sex"])

sns.pairplot(titanic_df[['Pclass', 'Sex', 'PassengerId', 'Survived', 'Embarked', 'AgeBand']], hue='AgeBand')
sns.pairplot(titanic_df, hue='AgeBand')
