Kaggle House Prices Prediction: Reference Material

Author's Kaggle profile: https://www.kaggle.com/pavansanagapati

Tutorial - Housing Prices Model Prediction

https://www.kaggle.com/pavansanagapati/tutorial-housing-prices-model-prediction

A Simple Tutorial on Exploratory Data Analysis

https://www.kaggle.com/pavansanagapati/a-simple-tutorial-on-exploratory-data-analysis/notebook

Simple Tutorial on How to Handle Missing Data

https://www.kaggle.com/pavansanagapati/simple-tutorial-on-how-to-handle-missing-data

Chinese version:
https://www.kaggle.com/marsggbo/kaggle

English version:
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

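  • Select the numeric variables
numeric_features = train.select_dtypes(include=[np.number])
numeric_features.columns
  • Select the categorical variables
# np.object was removed in NumPy 1.24; the builtin object selects the same columns
categorical_features = train.select_dtypes(include=[object])
categorical_features.columns
  • Introduces msno (the missingno package), a tool for visualizing how missing values are distributed.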
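
As a text-only complement to the msno plots, the count and share of missing values per column can be tabulated with plain pandas. The following is a minimal sketch: the helper name missing_summary and the toy DataFrame are made up for illustration and do not come from the tutorials, although the column names are borrowed from the House Prices data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and ratio of missing values per column, worst columns first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Hypothetical toy data standing in for the real train DataFrame.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, None, "Grvl", None],
    "SalePrice": [208500, 181500, 223500, 140000],
})
summary = missing_summary(toy)
print(summary)
```

If the missingno package is installed, msno.matrix(train) and msno.bar(train) should give the graphical view of the same information.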

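  • Compute the skewness and kurtosis of each feature: train.skew() and train.kurt() (on pandas 2.0 and later, pass numeric_only=True if the frame contains non-numeric columns)
    Background on skewness and kurtosis:
    1. Skewness describes which side the tail of the distribution extends toward;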
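
To make the sign convention concrete, here is a small sketch on made-up numbers (not taken from the House Prices data): a sample with a long right tail has positive skewness, and mirroring the sample flips the sign.

```python
import pandas as pd

# Made-up sample with a long right tail: most values are small, one is large.
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])
# Mirror image of the same sample, so the long tail is on the left.
left_tailed = -right_tailed

skew_right = right_tailed.skew()
skew_left = left_tailed.skew()
print(skew_right, skew_left)
```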

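  • Transform the target variable so that its distribution is closer to normal. The transform used is the logarithm, which is monotonic and therefore preserves the ordering of prices.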

target = np.log(train['SalePrice'])
target.skew()
plt.hist(target,color='blue')
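
The effect of the log transform on skewness can be checked on synthetic data. This sketch assumes log-normally distributed prices generated with an arbitrary seed and arbitrary parameters; it does not use the real SalePrice column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices": exponentiated normal draws.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))
log_prices = np.log(prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")

# The log is strictly increasing, so the ranking of prices is unchanged.
same_order = bool((prices.rank() == log_prices.rank()).all())
print("ranking preserved:", same_order)
```

Because the ranking is preserved, predictions made on the log scale can be mapped back to prices with np.exp without changing their order.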
  • Use the linear correlation matrix to screen for important features:
correlation = numeric_features.corr()
print(correlation['SalePrice'].sort_values(ascending=False), '\n')

Note: this finds the features that are linearly correlated with the target variable SalePrice.

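From the correlation matrix, take the 10 variables with the strongest positive linear correlation with the target variable. k is set to 11 because SalePrice itself, which has correlation 1 with itself, is included in the result: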

k = 11
cols = correlation.nlargest(k, 'SalePrice')['SalePrice'].index
print(cols)
cm = np.corrcoef(train[cols].values.T)
f, ax = plt.subplots(figsize=(14, 12))
sns.heatmap(cm, vmax=.8, linewidths=0.01, square=True, annot=True, cmap='viridis',
            linecolor="white", xticklabels=cols.values, annot_kws={'size': 12}, yticklabels=cols.values)

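Note: to avoid multicollinearity, when two independent variables are highly linearly correlated with each other, keep the one that has the higher correlation with the target variable and drop the other.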
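
The rule in the note above can be sketched as a small helper. The function name drop_collinear, the 0.8 threshold, and the synthetic data are assumptions for illustration; the source does not prescribe a specific cutoff.

```python
import numpy as np
import pandas as pd


def drop_collinear(df: pd.DataFrame, target: str, threshold: float = 0.8) -> list:
    """Return the features kept after dropping the weaker of each highly correlated pair."""
    corr = df.corr().abs()
    # Rank features by absolute correlation with the target, strongest first.
    ranked = corr[target].drop(target).sort_values(ascending=False).index
    kept = []
    for name in ranked:
        # Keep a feature only if it is not highly correlated with one already kept.
        if all(corr.loc[name, other] < threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical synthetic data: GarageCars and GarageArea carry nearly the same signal.
rng = np.random.default_rng(0)
n = 200
garage_area = rng.normal(500, 100, n)
garage_cars = garage_area / 250 + rng.normal(0, 0.05, n)
year_built = rng.normal(1970, 20, n)
price = 100 * garage_area + 500 * year_built + rng.normal(0, 5000, n)
toy = pd.DataFrame({
    "GarageArea": garage_area,
    "GarageCars": garage_cars,
    "YearBuilt": year_built,
    "SalePrice": price,
})
kept = drop_collinear(toy, "SalePrice")
print(kept)
```

Processing features in order of their correlation with the target means that, within each highly correlated pair, the feature reached first (the stronger one) is the one that survives.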

Draw a pairplot

sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
# seaborn renamed the size parameter to height in version 0.9
sns.pairplot(train[columns], height=2, kind='scatter', diag_kind='kde')
plt.show()
Original post: https://www.cnblogs.com/liweiwei1419/p/9203222.html