Coggle 30 Days of ML: Structured Competition: Tianchi Used-Car Price Prediction (Part 2)

Task 4: Encode the competition fields with feature engineering

Apply one-hot encoding to the categorical fields in the dataset whose value space has more than two levels

One-hot encode the categorical features

Train_data_onehot = pd.get_dummies(Train_data,columns = ['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage'])
Train_data_onehot
SaleID name regDate power kilometer regionCode seller offerType creatDate price ... fuelType_2.0 fuelType_3.0 fuelType_4.0 fuelType_5.0 fuelType_6.0 gearbox_0.0 gearbox_1.0 notRepairedDamage_- notRepairedDamage_0.0 notRepairedDamage_1.0
0 0 736 20040402 60 12.5 1046 0 0 20160404 1850 ... 0 0 0 0 0 1 0 0 1 0
1 1 2262 20030301 0 15.0 4366 0 0 20160309 3600 ... 0 0 0 0 0 1 0 1 0 0
2 2 14874 20040403 163 12.5 2806 0 0 20160402 6222 ... 0 0 0 0 0 1 0 0 1 0
3 3 71865 19960908 193 15.0 434 0 0 20160312 2400 ... 0 0 0 0 0 0 1 0 1 0
4 4 111080 20120103 68 5.0 6977 0 0 20160313 5200 ... 0 0 0 0 0 1 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
149995 149995 163978 20000607 163 15.0 4576 0 0 20160327 5900 ... 0 0 0 0 0 0 1 0 1 0
149996 149996 184535 20091102 125 10.0 2826 0 0 20160312 9500 ... 0 0 0 0 0 1 0 0 1 0
149997 149997 147587 20101003 90 6.0 3302 0 0 20160328 7500 ... 0 0 0 0 0 1 0 0 1 0
149998 149998 45907 20060312 156 15.0 1877 0 0 20160401 4999 ... 0 0 0 0 0 1 0 0 1 0
149999 149999 177672 19990204 193 12.5 235 0 0 20160305 4700 ... 0 0 0 0 0 0 1 0 1 0

150000 rows × 334 columns

Test_data_onehot = pd.get_dummies(Test_data,columns = ['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage'])
Test_data_onehot
SaleID name regDate power kilometer regionCode seller offerType creatDate v_0 ... fuelType_2.0 fuelType_3.0 fuelType_4.0 fuelType_5.0 fuelType_6.0 gearbox_0.0 gearbox_1.0 notRepairedDamage_- notRepairedDamage_0.0 notRepairedDamage_1.0
0 200000 133777 20000501 101 15.0 5019 0 0 20160308 42.142061 ... 0 0 0 0 0 1 0 0 1 0
1 200001 61206 19950211 73 6.0 1505 0 0 20160310 43.907034 ... 0 0 0 0 0 1 0 0 1 0
2 200002 67829 20090606 120 5.0 1776 0 0 20160309 45.389665 ... 0 0 0 0 0 1 0 1 0 0
3 200003 8892 20020601 58 15.0 26 0 0 20160314 42.788775 ... 0 0 0 0 0 1 0 0 1 0
4 200004 76998 20030301 116 15.0 738 0 0 20160306 43.670763 ... 0 0 0 0 0 1 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 249995 111443 20041005 150 15.0 5564 0 0 20160309 46.321013 ... 0 0 0 0 0 0 1 1 0 0
49996 249996 152834 20130409 179 4.0 5220 0 0 20160323 48.086547 ... 0 0 0 0 0 1 0 0 1 0
49997 249997 132531 20041211 147 12.5 3795 0 0 20160316 46.145279 ... 0 0 0 0 0 0 1 0 1 0
49998 249998 143405 20020702 176 15.0 61 0 0 20160327 45.507088 ... 0 0 0 0 0 0 1 0 1 0
49999 249999 78202 20090708 0 3.0 4158 0 0 20160401 44.289471 ... 0 0 0 0 0 1 0 0 1 0

50000 rows × 329 columns
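Note that the encoded training set has 334 columns while the test set has only 329, presumably because price exists only in the training data and a few category levels appear in only one of the two sets. Before modelling, the test frame can be aligned to the training columns; a minimal sketch (the reindex call and column handling are an assumption, not part of the original post):

# Align the test one-hot frame to the training columns (minus the target),
# filling dummy columns that never occur in the test set with 0.
feature_columns = Train_data_onehot.columns.drop('price')
Test_data_onehot = Test_data_onehot.reindex(columns=feature_columns, fill_value=0)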

Extract year, month, day, and similar information from the date features

Train_data_create = pd.to_datetime(Train_data['creatDate'],format='%Y%m%d', errors='coerce')
Test_data_create = pd.to_datetime(Test_data['creatDate'],format='%Y%m%d', errors='coerce')
Train_data_reg = pd.to_datetime(Train_data['regDate'],format='%Y%m%d', errors='coerce')
Test_data_reg = pd.to_datetime(Test_data['regDate'],format='%Y%m%d', errors='coerce')
Train_data_create
0        2016-04-04
1        2016-03-09
2        2016-04-02
3        2016-03-12
4        2016-03-13
            ...    
149995   2016-03-27
149996   2016-03-12
149997   2016-03-28
149998   2016-04-01
149999   2016-03-05
Name: creatDate, Length: 150000, dtype: datetime64[ns]
Train_data_reg
0        2004-04-02
1        2003-03-01
2        2004-04-03
3        1996-09-08
4        2012-01-03
            ...    
149995   2000-06-07
149996   2009-11-02
149997   2010-10-03
149998   2006-03-12
149999   1999-02-04
Name: regDate, Length: 150000, dtype: datetime64[ns]
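With the two columns parsed as datetimes, the year, month, and day components mentioned above can be pulled out through the .dt accessor. A minimal sketch (the new column names and the used_days age feature are illustrative additions, not the original code):

# Pull out date components from the parsed registration and listing dates
Train_data['reg_year'] = Train_data_reg.dt.year
Train_data['reg_month'] = Train_data_reg.dt.month
Train_data['reg_day'] = Train_data_reg.dt.day
Train_data['creat_year'] = Train_data_create.dt.year
Train_data['creat_month'] = Train_data_create.dt.month
Train_data['creat_day'] = Train_data_create.dt.day

# Days between registration and listing, i.e. the car's age when it was put on sale
Train_data['used_days'] = (Train_data_create - Train_data_reg).dt.days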

Task 5: Train and predict with a basic tree model from scikit-learn

Learn the five-fold cross-validation data split (KFold)

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 4]])
y = np.array([1, 2, 3, 4, 5])
kf = KFold(n_splits=5)

# Each split yields index arrays for the train and test folds
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test)
    print(y_train, y_test)
TRAIN: [1 2 3 4] TEST: [0]
[[3 4]
 [1 2]
 [3 4]
 [3 4]] [[1 2]]
[2 3 4 5] [1]
TRAIN: [0 2 3 4] TEST: [1]
[[1 2]
 [1 2]
 [3 4]
 [3 4]] [[3 4]]
[1 3 4 5] [2]
TRAIN: [0 1 3 4] TEST: [2]
[[1 2]
 [3 4]
 [3 4]
 [3 4]] [[1 2]]
[1 2 4 5] [3]
TRAIN: [0 1 2 4] TEST: [3]
[[1 2]
 [3 4]
 [1 2]
 [3 4]] [[3 4]]
[1 2 3 5] [4]
TRAIN: [0 1 2 3] TEST: [4]
[[1 2]
 [3 4]
 [1 2]
 [3 4]] [[3 4]]
[1 2 3 4] [5]

Sort the price label by value, split it into 10 equal bins, then use StratifiedKFold on the bins

# Sort price by value and cut it into 10 equal-sized bins of 15,000 values each
Y_data = Train_data['price']
Y_data = Y_data.sort_values()
Y_data_dict = {}
for i in range(10):
    Y_data_dict[i] = Y_data.iloc[i*15000:(i+1)*15000]
Y_data_list = []
Y_data_iloc = list(Y_data_dict.keys())
for i in Y_data_iloc:
    Y_data_list.append(list(Y_data_dict[i]))
Y_data_iloc
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
len(Y_data_list[0])
15000
# Relabel the 10 bins with alternating 0/1 class labels so StratifiedKFold has classes to stratify on
Y_data_iloc = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
import numpy as np 
from sklearn.model_selection import StratifiedKFold

skf=StratifiedKFold(n_splits = 5,shuffle=True,random_state=0)

Y_data_list = np.array(Y_data_list)
Y_data_iloc = np.array(Y_data_iloc)

for train_index,test_index in skf.split(Y_data_list,Y_data_iloc):
    print("TRAIN:",train_index,"TEST:",test_index) 
    X_train, X_test = Y_data_list[train_index], Y_data_list[test_index]
    y_train,y_test = Y_data_iloc[train_index],Y_data_iloc[test_index]
TRAIN: [0 3 4 5 6 7 8 9] TEST: [1 2]
TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5]
TRAIN: [1 2 4 5 6 7 8 9] TEST: [0 3]
TRAIN: [0 1 2 3 4 5 7 8] TEST: [6 9]
TRAIN: [0 1 2 3 4 5 6 9] TEST: [7 8]
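The toy example above stratifies ten pre-built blocks with hand-assigned 0/1 labels. A more direct way to stratify the actual 150,000 training rows by price magnitude, not used in the original post, is to bin price into 10 quantiles with pd.qcut and pass the bin ids to StratifiedKFold; a minimal sketch:

import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Integer bin ids 0-9, one per row, based on price quantiles
price_bins = pd.qcut(Train_data['price'], q=10, labels=False, duplicates='drop')

skf_price = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, val_index in skf_price.split(Train_data, price_bins):
    # Each fold's validation set now has roughly the same price distribution
    print(len(train_index), len(val_index))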

Learn to use the random forest model in scikit-learn

Reference blog post: https://www.cnblogs.com/banshaohuan/p/13308680.html
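For a quick feel of the API before applying it to the competition data, the regressor is just fit/predict on arrays; a self-contained toy sketch (the data and parameter values here are illustrative only):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: 100 samples, 5 features
X_toy = np.random.rand(100, 5)
y_toy = np.random.rand(100)

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X_toy, y_toy)
print(rf.predict(X_toy[:3]))        # predictions for the first three samples
print(rf.feature_importances_)      # relative importance of each feature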

Task 6: Submit the tree model's prediction file to Tianchi

Train and predict with a random forest combined with StratifiedKFold

For each fold, record the model's predictions on the validation set and the test set
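X_data and feature_cols come from the earlier part of this series and are not redefined here; judging from the 18 columns printed below, they are presumably built roughly as follows (a reconstruction, not the original code):

# Numeric feature columns used by the tree model (inferred from the output below)
feature_cols = ['gearbox', 'power', 'kilometer'] + ['v_' + str(i) for i in range(15)]
X_data = Train_data[feature_cols]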

X_data = X_data.fillna(-1)
X_data
gearbox power kilometer v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0.0 60 12.5 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 0.0 0 15.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 0.0 163 12.5 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 1.0 193 15.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 0.0 68 5.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
149995 1.0 163 15.0 45.316543 -3.139095 -1.269707 -0.736609 -1.505820 0.280264 0.000310 0.048441 0.071158 0.019174 1.988114 -2.983973 0.589167 -1.304370 -0.302592
149996 0.0 125 10.0 45.972058 -3.143764 -0.023523 -2.366699 0.698012 0.253217 0.000777 0.084079 0.099681 0.079371 1.839166 -2.774615 2.553994 0.924196 -0.272160
149997 0.0 90 6.0 44.733481 -3.105721 0.595454 -2.279091 1.423661 0.233353 0.000705 0.118872 0.100118 0.097914 2.439812 -1.630677 2.290197 1.891922 0.414931
149998 0.0 156 15.0 45.658634 -3.204785 -0.441680 -1.179812 0.620680 0.256369 0.000252 0.081479 0.083558 0.081498 2.075380 -2.633719 1.414937 0.431981 -1.659014
149999 1.0 193 12.5 45.536383 -3.200326 -1.612893 -0.067144 -1.396166 0.284475 0.000000 0.040072 0.062543 0.025819 1.978453 -3.179913 0.031724 -1.483350 -0.342674

150000 rows × 18 columns

Y_data = Train_data['price']
Y_data
0         1850
1         3600
2         6222
3         2400
4         5200
          ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64
# For each fold, record the model's predictions on the training and validation sets and compute the mean absolute error
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Define the random forest model
randomforest = RandomForestRegressor(n_estimators=50, random_state=0)
    
scores_train = []
scores_val = []
    
for train_ind,val_ind in skf.split(X_data,Y_data):
        
    train_x = X_data.iloc[train_ind].values
    train_y = Y_data.iloc[train_ind]
    val_x = X_data.iloc[val_ind].values
    val_y = Y_data.iloc[val_ind]
        
    randomforest.fit(train_x,train_y)
    pred_train_random = randomforest.predict(train_x)
    pred_val_random = randomforest.predict(val_x)
        
    score_train = mean_absolute_error(train_y,pred_train_random)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_val_random)
    scores_val.append(score)

print('Train mae:',np.mean(scores_train))
print('Val mae',np.mean(scores_val))
Train mae: 253.97298353737415
Val mae 667.4268940009031
X_Test_data = Test_data[feature_cols]
X_Test_data
gearbox power kilometer v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0.0 101 15.0 42.142061 -3.094739 -0.721300 1.466344 1.009846 0.236520 0.000241 0.105319 0.046233 0.094522 3.619512 -0.280607 -2.019761 0.978828 0.803322
1 0.0 73 6.0 43.907034 -3.244605 -0.766430 1.276718 -1.065338 0.261518 0.000000 0.120323 0.046784 0.035385 2.997376 -1.406705 -1.020884 -1.349990 -0.200542
2 0.0 120 5.0 45.389665 3.372384 -0.965565 -2.447316 0.624268 0.261691 0.090836 0.000000 0.079655 0.073586 -3.951084 -0.433467 0.918964 1.634604 1.027173
3 0.0 58 15.0 42.788775 4.035052 -0.217403 1.708806 1.119165 0.236050 0.101777 0.098950 0.026830 0.096614 -2.846788 2.800267 -2.524610 1.076819 0.461610
4 0.0 116 15.0 43.670763 -3.135382 -1.134107 0.470315 0.134032 0.257000 0.000000 0.066732 0.057771 0.068852 2.839010 -1.659801 -0.924142 0.199423 0.451014
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 1.0 150 15.0 46.321013 -3.304401 0.073363 -0.622359 -0.778349 0.263668 0.000292 0.141804 0.076393 0.039272 2.072901 -2.531869 1.716978 -1.063437 0.326587
49996 0.0 179 4.0 48.086547 -3.318641 0.965881 -2.672160 0.357440 0.255310 0.000991 0.155868 0.108425 0.067841 1.358504 -3.290295 4.269809 0.140524 0.556221
49997 1.0 147 12.5 46.145279 -3.305263 -0.015283 -0.288329 -0.687112 0.262933 0.000318 0.141872 0.071968 0.042966 2.165658 -2.417885 1.370612 -1.073133 0.270602
49998 1.0 176 15.0 45.507088 -3.197006 -1.141252 -0.434930 -1.845040 0.282106 0.000023 0.067483 0.067526 0.009006 2.030114 -2.939244 0.569078 -1.718245 0.316379
49999 0.0 0 3.0 44.289471 4.181452 0.547068 -0.775841 1.789601 0.231449 0.103947 0.096027 0.062328 0.110180 -3.689090 2.032376 0.109157 2.202828 0.847469

50000 rows × 18 columns

X_Test_data = X_Test_data.fillna(0)
X_Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   gearbox    50000 non-null  float64
 1   power      50000 non-null  int64  
 2   kilometer  50000 non-null  float64
 3   v_0        50000 non-null  float64
 4   v_1        50000 non-null  float64
 5   v_2        50000 non-null  float64
 6   v_3        50000 non-null  float64
 7   v_4        50000 non-null  float64
 8   v_5        50000 non-null  float64
 9   v_6        50000 non-null  float64
 10  v_7        50000 non-null  float64
 11  v_8        50000 non-null  float64
 12  v_9        50000 non-null  float64
 13  v_10       50000 non-null  float64
 14  v_11       50000 non-null  float64
 15  v_12       50000 non-null  float64
 16  v_13       50000 non-null  float64
 17  v_14       50000 non-null  float64
dtypes: float64(17), int64(1)
memory usage: 6.9 MB
from sklearn.model_selection import train_test_split
# Define a model-building function
def build_model_randomforest(x_train, y_train):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(x_train, y_train)
    return model

model_random_pre = build_model_randomforest(X_data,Y_data)
subpre = model_random_pre.predict(X_Test_data)
subpre
array([1227.94      , 1832.4       , 8610.005     , ..., 5474.99      ,
       5055.48      , 5637.44666667])

Average the test-set predictions across the folds and write them to a CSV to submit to Tianchi
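subpre above comes from a single model fit on all of the training data. To literally average per-fold test-set predictions as this step describes, the cross-validation loop can be extended to also predict on X_Test_data in each fold; a sketch of that variant (not the original author's code), whose result could be submitted in place of subpre:

# Accumulate each fold's test-set prediction and average over the folds
test_preds = np.zeros(len(X_Test_data))
n_folds = skf.get_n_splits()

for train_ind, val_ind in skf.split(X_data, Y_data):
    fold_model = RandomForestRegressor(n_estimators=50, random_state=0)
    fold_model.fit(X_data.iloc[train_ind].values, Y_data.iloc[train_ind])
    test_preds += fold_model.predict(X_Test_data.values) / n_folds

# test_preds now holds the fold-averaged prediction for each test row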

sub = pd.DataFrame()
sub['SaleID'] = Test_data.SaleID
sub['price'] = subpre
sub.to_csv('submit.csv',index = False)
sub.head()
SaleID price
0 200000 1227.940
1 200001 1832.400
2 200002 8610.005
3 200003 929.880
4 200004 2075.360

Submit the results to Tianchi

Original post: https://www.cnblogs.com/MurasameLory-chenyulong/p/15394549.html