Teacher Recruitment Exam Data Analysis

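Background: my girlfriend recently passed the teacher recruitment exam, so I got hold of the written-test and interview results and used them to analyze which candidates were admitted.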

Data source: teacher recruitment exam scores from xx city, xx province

1. Prepare the data:

# Import the required packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import svm
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import plotting
sns.set_style("whitegrid")
plt.style.use('seaborn')  # on Matplotlib 3.6+ this style is named 'seaborn-v0_8'

# Load the dataset
# (the path separators appear to have been lost here; point io at your own copy of the file)
io = r'G:PythonLearnirisdataDataCalculate.xlsx'
data = pd.read_excel(io, sheet_name='Sheet1')

Inspect the column types:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ranking_written   39 non-null     int64  
 1   written           39 non-null     float64
 2   ranking_audition  39 non-null     int64  
 3   audition          39 non-null     float64
 4   total             39 non-null     float64
 5   ranking_total     39 non-null     int64  
 6   complete          39 non-null     object

Print the data:

print(data)

ranking_written  written  ranking_audition  ...  total  ranking_total  complete
0                 1    84.75                 2  ...  87.30              1        ON
1                 2    78.70                 3  ...  84.40              2        ON
2                 7    75.15                 1  ...  83.58              3        ON
3                12    72.70                 4  ...  81.88              4        ON
4                 8    74.70                 8  ...  81.72              5        ON
5                 4    75.70                15  ...  81.52              6        ON
6                 3    76.15                21  ...  81.34              7        ON
7                13    72.05                 6  ...  81.26              8        ON
8                 6    75.20                19  ...  81.08              9        ON
9                11    73.95                16  ...  80.82             10        ON
10               15    70.70                 7  ...  80.60             11        ON
11               10    73.95                22  ...  80.46             12        ON
12               14    71.65                10  ...  80.38             13        ON
13                9    74.15                29  ...  79.82             14       OFF
14                5    75.55                33  ...  79.78             15       OFF
15               29    65.10                 5  ...  78.72             16       OFF
16               19    68.80                18  ...  78.64             17       OFF
17               21    67.05                11  ...  78.30             18       OFF
18               17    69.60                31  ...  77.76             19       OFF
19               25    65.70                13  ...  77.64             20       OFF
20               20    68.35                26  ...  77.62             21       OFF
21               22    66.50                20  ...  77.60             22       OFF
22               26    65.60                14  ...  77.60             23       OFF
23               30    65.10                12  ...  77.52             24       OFF
24               32    63.85                 9  ...  77.38             25       OFF
25               16    70.20                35  ...  77.16             26       OFF
26               24    65.75                23  ...  76.82             27       OFF
27               27    65.55                25  ...  76.62             28       OFF
28               31    64.95                24  ...  76.50             29       OFF
29               18    69.10                38  ...  76.12             30       OFF
30               28    65.45                32  ...  76.10             31       OFF
31               23    65.85                34  ...  75.78             32       OFF
32               38    59.30                17  ...  74.96             33       OFF
33               34    60.65                27  ...  74.54             34       OFF
34               36    60.00                28  ...  74.28             35       OFF
35               33    62.35                37  ...  73.78             36       OFF
36               39    59.25                30  ...  73.74             37       OFF
37               35    60.20                36  ...  73.16             38       OFF
38               37    59.90                39  ...  23.96             39       OFF

1. Explore the relationships in the data:

Use violinplot and pointplot to explore, through the distributions and the slopes, how the written test and the interview relate to admission.

# Set the color theme
antV = ['#1890FF', '#2FC25B', '#FACC14', '#223273', '#8543E0', '#13C2C2', '#3436c7', '#F04864']

# Draw the violinplot
# Relationship between each feature and admission
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.violinplot(x='complete', y='ranking_written', data=data, palette=antV, ax=axes[0, 0])
sns.violinplot(x='complete', y='written', data=data, palette=antV, ax=axes[0, 1])
sns.violinplot(x='complete', y='ranking_audition', data=data, palette=antV, ax=axes[1, 0])
sns.violinplot(x='complete', y='audition', data=data, palette=antV, ax=axes[1, 1])

# Draw the pointplot
# Relationship between each feature and admission
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.pointplot(x='complete', y='ranking_written', data=data, color=antV[0], ax=axes[0, 0])
sns.pointplot(x='complete', y='written', data=data, color=antV[0], ax=axes[0, 1])
sns.pointplot(x='complete', y='ranking_audition', data=data, color=antV[0], ax=axes[1, 0])
sns.pointplot(x='complete', y='audition', data=data, color=antV[0], ax=axes[1, 1])
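The point plots above compare group means, so the same comparison can be read as numbers with a groupby. The sketch below is only an illustration: it uses ten rows of `written` scores copied from the printed table (five ON, five OFF) rather than the full file, and the name `sample` is introduced here.

```python
import pandas as pd

# Ten rows copied from the printed table: indices 0-4 (ON) and 13-17 (OFF)
sample = pd.DataFrame({
    "written": [84.75, 78.70, 75.15, 72.70, 74.70,
                74.15, 75.55, 65.10, 68.80, 67.05],
    "complete": ["ON"] * 5 + ["OFF"] * 5,
})

# Mean and spread of the written score per outcome group
summary = sample.groupby("complete")["written"].agg(["mean", "std", "count"])
print(summary)
```

On the real data, the same call on `data` with all four feature columns would give the values behind each panel.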

Pair plot of the features against each other:

sns.pairplot(data=data, palette=antV, hue='complete')

Andrews curves are useful for sanity-checking the data: a row with anomalous values stands out from the other curves of its class.

plt.subplots(figsize=(10, 8))
plotting.andrews_curves(data, 'complete', colormap='cool')
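For reference, an Andrews curve turns each row into a function of t, so rows with similar values give similar curves. A sketch of the standard definition, where x_1, ..., x_d are the numeric columns of one row (here the six score and ranking columns):

```latex
f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \dots,
\qquad -\pi \le t \le \pi
```

Because the columns enter with fixed weights, features on a larger scale dominate the shape of the curve.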

Linear regression of the interview score on the written score, and of the interview ranking on the written ranking:

sns.lmplot(data=data, x='written', y='audition', palette=antV, hue='complete')

sns.lmplot(data=data, x='ranking_written', y='ranking_audition', palette=antV, hue='complete')
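`lmplot` draws one fitted line per `hue` group but does not report the coefficients. A minimal sketch of recovering them for the ON group with an ordinary least-squares fit, assuming the rankings of the 13 ON rows as they appear in the printed table:

```python
import numpy as np

# Rankings of the 13 'ON' rows, copied from the printed table
ranking_written = np.array([1, 2, 7, 12, 8, 4, 3, 13, 6, 11, 15, 10, 14])
ranking_audition = np.array([2, 3, 1, 4, 8, 15, 21, 6, 19, 16, 7, 22, 10])

# Degree-1 polynomial fit, i.e. a straight line through the points
slope, intercept = np.polyfit(ranking_written, ranking_audition, deg=1)
print(f"ranking_audition ~ {slope:.2f} * ranking_written + {intercept:.2f}")
```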

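Finally, a heatmap shows the correlation between the attributes; the sign of each value tells whether the correlation is positive or negative.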
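The post does not include the heatmap code, so the following is only a sketch of one way to draw it, not the author's original. It uses the first six rows of the printed table (the `audition` column is elided there, so it is left out), and the names `sample` and `corr` are introduced here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First six rows of the printed table; 'audition' is elided there and omitted
sample = pd.DataFrame({
    "ranking_written": [1, 2, 7, 12, 8, 4],
    "written": [84.75, 78.70, 75.15, 72.70, 74.70, 75.70],
    "ranking_audition": [2, 3, 1, 4, 8, 15],
    "total": [87.30, 84.40, 83.58, 81.88, 81.72, 81.52],
    "ranking_total": [1, 2, 3, 4, 5, 6],
})

# Pearson correlation between every pair of columns
corr = sample.corr()

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

On the full dataset, `data.corr(numeric_only=True)` skips the non-numeric `complete` column. Scores and their rankings should come out negatively correlated, since a higher score means a smaller rank number.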

2. Machine learning

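Use machine learning to predict admission from the written and interview scores, with the written and interview rankings as auxiliary features. The first model below also uses the total score and the total ranking.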

Before training, convert the admission label to 0/1 and split the dataset into a training set and a test set.

# Select the features and the labels
X = data[['ranking_written', 'written', 'ranking_audition', 'audition', 'total', 'ranking_total']]
Y = data['complete']

# Encode the labels
encoder = LabelEncoder()
y = encoder.fit_transform(Y)

print(y)

[1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]
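`LabelEncoder` assigns integers in sorted order of the class names, which is why OFF becomes 0 and ON becomes 1 above. A small sketch on a three-element toy list (the list `labels` is made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["ON", "ON", "OFF"]  # toy labels, not the real column

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

classes = encoder.classes_.tolist()                    # sorted class names
decoded = encoder.inverse_transform(encoded).tolist()  # back to the strings

print(classes, encoded.tolist(), decoded)
```

`inverse_transform` is handy later for turning model predictions back into ON/OFF.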

Split the dataset 7:3 into training data and test data:

# Split the rankings and scores of every stage, plus the final outcome, for training
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=101)
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

(27, 6) (27,) (12, 6) (12,)
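With only 13 ON rows out of 39, a purely random split can leave the test set with very few admitted candidates. One option, shown here as a sketch rather than what the post does, is to pass `stratify=y` so both parts keep the 1:2 ratio. The feature matrix below is a hypothetical stand-in (row indices), since only the labels matter for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels as printed above: 13 admitted (1) followed by 26 not admitted (0)
y = np.array([1] * 13 + [0] * 26)
# Hypothetical stand-in for the features: just the row index
X = np.arange(len(y)).reshape(-1, 1)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)
print(train_X.shape, test_X.shape)
print("admitted in train:", train_y.sum(), "admitted in test:", test_y.sum())
```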

Check the accuracy of the model on different feature sets:

# Generic fit / predict / score procedure
model = svm.SVC()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
# accuracy_score takes the true labels first, then the predictions
print('The accuracy of the SVM is: {0}'.format(metrics.accuracy_score(test_y, prediction)))

The accuracy of the SVM is: 1.0
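A perfect score is less surprising than it looks. In the printed table every row with `ranking_total` of 13 or less is ON and every other row is OFF, so any model that sees `ranking_total` has the answer among its features. A sketch of that check, with both arrays rebuilt from the printed table:

```python
import numpy as np

# ranking_total runs 1..39 in the printed table; the first 13 rows are ON
ranking_total = np.arange(1, 40)
complete = np.array(["ON"] * 13 + ["OFF"] * 26)

# A one-line rule with no learning involved
rule = np.where(ranking_total <= 13, "ON", "OFF")
agreement = (rule == complete).mean()
print(agreement)
```

The models trained only on written or only on interview features are therefore the more informative ones.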

# Written-test features vs. the final result
written = data[['ranking_written', 'written', 'complete']]
train_w, test_w = train_test_split(written, test_size=0.3, random_state=0)
train_x_w = train_w[['ranking_written', 'written']]
train_y_w = train_w.complete
test_x_w = test_w[['ranking_written', 'written']]
test_y_w = test_w.complete

model = svm.SVC()
model.fit(train_x_w, train_y_w)
prediction = model.predict(test_x_w)
print('The accuracy of the SVM using Written is: {0}'.format(metrics.accuracy_score(test_y_w, prediction)))

# Interview features vs. the final result
audition = data[['ranking_audition', 'audition', 'complete']]
train_a, test_a = train_test_split(audition, test_size=0.3, random_state=0)
train_x_a = train_a[['ranking_audition', 'audition']]
train_y_a = train_a.complete
test_x_a = test_a[['ranking_audition', 'audition']]
test_y_a = test_a.complete

model = svm.SVC()
model.fit(train_x_a, train_y_a)
prediction = model.predict(test_x_a)
print('The accuracy of the SVM using audition is: {0}'.format(metrics.accuracy_score(test_y_a, prediction)))

# Total-score features vs. the final result
total = data[['ranking_total', 'total', 'complete']]
train_t, test_t = train_test_split(total, test_size=0.3, random_state=0)
train_x_t = train_t[['ranking_total', 'total']]
train_y_t = train_t.complete
test_x_t = test_t[['ranking_total', 'total']]
test_y_t = test_t.complete

model = svm.SVC()
model.fit(train_x_t, train_y_t)
prediction = model.predict(test_x_t)
print('The accuracy of the SVM using total is: {0}'.format(metrics.accuracy_score(test_y_t, prediction)))

The accuracy of the SVM is: 1.0
The accuracy of the SVM using Written is: 0.9166666666666666
The accuracy of the SVM using audition is: 0.8333333333333334
The accuracy of the SVM using total is: 1.0
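Each accuracy above comes from a single test set of 12 rows, so one misclassified candidate moves the score by about 0.08. A possible way to get a steadier estimate is stratified k-fold cross-validation. The sketch below is an assumption-laden illustration, not part of the original analysis: it uses only the `written` column (typed in from the printed table), adds a `StandardScaler` step that the post does not use, and picks 5 folds arbitrarily.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 'written' scores of all 39 rows, copied from the printed table
written = np.array([
    84.75, 78.70, 75.15, 72.70, 74.70, 75.70, 76.15, 72.05, 75.20, 73.95,
    70.70, 73.95, 71.65, 74.15, 75.55, 65.10, 68.80, 67.05, 69.60, 65.70,
    68.35, 66.50, 65.60, 65.10, 63.85, 70.20, 65.75, 65.55, 64.95, 69.10,
    65.45, 65.85, 59.30, 60.65, 60.00, 62.35, 59.25, 60.20, 59.90,
])
y = np.array([1] * 13 + [0] * 26)  # first 13 rows are ON
X = written.reshape(-1, 1)

# Scaling is an added assumption; SVC is the same model family as in the post
model = make_pipeline(StandardScaler(), SVC())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

The spread across folds gives a rough sense of how much the single-split numbers could vary.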