Anomaly Detection - Isolation-based Anomaly Detection with Isolation Forest - 3 - Example

Reference: https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py

Code:

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Build the training data: 100 samples with 2 features each, drawn
# from a standard normal distribution and scaled by 0.3
X = 0.3 * rng.randn(100, 2)
# Shift the samples by +2 and -2 to get two clusters of 100 samples each,
# centred near 2 and -2; after concatenation the training data has shape (200, 2)
X_train = np.r_[X + 2, X - 2]  # np.r_ stacks along the first axis: column counts must match, rows are concatenated

# Generate some regular new observations from the same distribution
X = 0.3 * rng.randn(20, 2)
# After concatenation the test data has shape (40, 2)
X_test = np.r_[X + 2, X - 2]

# Generate abnormal observations from a uniform distribution over [-4, 4], shape (20, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# Build the forest with sub-sampling size 100
# Note: max_features defaults to the float 1.0, which means each tree is
# trained on all features (an int value would mean that many features);
# within a tree, every split still picks a single feature at random
# The 'behaviour' argument was deprecated in scikit-learn 0.22 and
# removed in 0.24 - drop it on newer versions
clf = IsolationForest(behaviour='new', max_samples=100,
                      random_state=rng, contamination='auto')

# Fit the forest: choose the split features and split values
clf.fit(X_train)

# Then use the fitted forest to predict
y_pred_train = clf.predict(X_train)
print(y_pred_train)
y_pred_test = clf.predict(X_test)
print(y_pred_test)
y_pred_outliers = clf.predict(X_outliers)
print(y_pred_outliers)

# Plot the decision function over a grid
# xx and yy each have shape (50, 50)
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
# Flatten xx and yy into one-dimensional vectors of shape (2500,),
# then pair them column-wise with np.c_ into (2500, 2) grid points
# and compute the anomaly score of each point:
# normal points get a positive score, anomalous points a negative one
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
# Contour plot of the grid's anomaly scores: the lighter the colour,
# the more likely a point there is normal; darker regions are anomalous
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

# Mark the training, test, and outlier points on the contour plot to
# check that they fall in the expected regions: the training and test
# points lie in the light areas, the outliers in the dark ones
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white',
                 s=20, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green',
                 s=20, edgecolor='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red',
                s=20, edgecolor='k')

plt.axis('tight')
# Axis limits and legend
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([b1, b2, c],
           ["training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()
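The shape comments in the script above can be verified with a few lines of plain NumPy (a standalone sketch, no model needed):

```python
import numpy as np

rng = np.random.RandomState(42)

# np.r_ stacks along the first axis: two (100, 2) blocks become (200, 2)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
assert X_train.shape == (200, 2)

# meshgrid over two 50-point axes yields two (50, 50) arrays
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
assert xx.shape == yy.shape == (50, 50)

# ravel flattens each to (2500,); np.c_ pairs them into (2500, 2) points
grid = np.c_[xx.ravel(), yy.ravel()]
assert grid.shape == (2500, 2)
```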

Output:

Automatically created module for IPython interactive environment
[ 1 -1  1 -1  1  1 -1 -1  1 -1 -1  1  1  1  1 -1  1 -1 -1  1  1  1 -1  1
 -1  1  1 -1  1  1  1 -1 -1  1  1 -1  1 -1  1 -1  1 -1  1  1  1  1  1 -1
  1  1  1  1  1 -1  1 -1 -1  1  1 -1  1 -1 -1  1  1 -1  1 -1  1 -1  1 -1
  1 -1  1  1  1  1 -1  1  1 -1 -1 -1  1  1  1  1  1 -1  1  1  1  1 -1  1
  1  1  1  1  1 -1  1 -1  1  1 -1 -1  1 -1 -1 -1  1  1  1 -1  1 -1 -1 -1
  1  1 -1  1 -1  1  1 -1  1  1  1 -1 -1  1  1 -1 -1 -1  1 -1  1 -1  1  1
  1  1  1 -1  1  1 -1  1  1 -1  1 -1 -1  1  1 -1  1 -1 -1 -1  1 -1  1 -1
  1 -1  1 -1  1 -1  1  1  1  1 -1 -1  1 -1 -1 -1  1 -1  1  1 -1 -1  1  1
  1  1 -1  1  1  1  1  1]
[ 1 -1 -1  1 -1 -1 -1  1  1  1 -1 -1  1  1  1  1  1 -1 -1  1  1 -1 -1  1
 -1  1 -1  1  1  1 -1 -1  1  1  1  1  1 -1 -1  1]
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
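The ±1 labels in the output follow a simple convention that can be checked directly against `decision_function`: `predict` returns -1 exactly where the score is negative. A minimal sketch of that check (the `behaviour` argument is omitted here, assuming a scikit-learn version where it no longer exists; the two probe points are arbitrary illustrations):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = np.r_[0.3 * rng.randn(100, 2) + 2, 0.3 * rng.randn(100, 2) - 2]

clf = IsolationForest(max_samples=100, random_state=rng,
                      contamination='auto').fit(X_train)

# predict labels inliers +1 and outliers -1
points = np.array([[2.0, 2.0], [4.0, -4.0]])
pred = clf.predict(points)
assert set(pred) <= {-1, 1}

# predict(x) == -1 exactly when decision_function(x) < 0
scores = clf.decision_function(points)
assert all((s < 0) == (p == -1) for s, p in zip(scores, pred))
```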

The resulting plot: (figure omitted)

If contamination is set to 0., i.e. we declare that the training data contains no anomalies, the output becomes (note: newer scikit-learn versions only accept 'auto' or a float in (0, 0.5] for contamination, so 0. is rejected there):

Automatically created module for IPython interactive environment
[[-0.48822863 -3.37234895]
 [-3.79719405  3.70118732]
 [ 2.68784096  1.56779365]
 [-0.72837644 -2.61364544]
 [-2.74850366 -1.99805681]
 [ 0.39381332  1.71676738]
 [ 1.28157901 -1.76052882]
 [ 3.63892225  1.90317533]
 [ 0.43483242  0.89376597]
 [-0.6431995  -2.01815208]
 [-1.15221857  2.06276888]
 [-3.88485209 -3.07141888]
 [-3.63197886 -3.67416958]
 [ 2.84368467  1.62926288]
 [-0.20660937 -3.21732671]
 [-0.067073   -0.21222583]
 [-2.61438504 -0.52918681]
 [-0.81196212  0.92680078]
 [ 1.08074921 -3.63756792]
 [-1.00309908  1.00687933]]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[ 1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1]
[-1 -1 -1 -1  1 -1  1  1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1]

We can see that it now also predicts some points close to the training clusters as normal data (several entries of the outlier prediction are 1).
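The effect of contamination can be sketched more directly: with a float value c, the decision threshold is placed so that roughly a fraction c of the training points is flagged (the values 0.05 and 0.25 below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = np.r_[0.3 * rng.randn(100, 2) + 2, 0.3 * rng.randn(100, 2) - 2]

# With contamination=c, the threshold is a percentile of the training
# scores, so about c * len(X_train) points get the label -1
n_flagged = []
for c in (0.05, 0.25):
    clf = IsolationForest(max_samples=100, contamination=c,
                          random_state=np.random.RandomState(0)).fit(X_train)
    n_flagged.append(int((clf.predict(X_train) == -1).sum()))

# A larger contamination flags more training points as anomalous
assert n_flagged[0] < n_flagged[1]
```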

If we additionally set max_features=2 when building the trees (note that the default is the float 1.0, which for this two-feature data already means the same thing - all features), the result is:

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[ 1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1]
[-1 -1 -1 -1  1 -1  1  1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1]

The test data are again almost all predicted as normal. In fact the output is identical to the previous run, which makes sense: with two features, max_features=2 is equivalent to the default of 1.0 (all features).

So configure the parameters according to your own data and needs.
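On the max_features point: since the data here has only two features, the int value 2 and the default float 1.0 should resolve to the same per-tree feature count, so as far as I can tell from the API the two settings give identical models under the same seed. A sketch of that check:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = np.r_[0.3 * rng.randn(100, 2) + 2, 0.3 * rng.randn(100, 2) - 2]

preds = []
for mf in (1.0, 2):  # float 1.0 = all features; int 2 = exactly two features
    clf = IsolationForest(max_samples=100, max_features=mf,
                          contamination='auto',
                          random_state=np.random.RandomState(0))
    preds.append(clf.fit(X_train).predict(X_train))

# Both settings resolve to the same number of features per tree, so with
# an identical seed the two fitted forests make identical predictions
assert np.array_equal(preds[0], preds[1])
```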

Original post: https://www.cnblogs.com/wanghui-garcia/p/11475713.html