How Tree-Based Models Handle Categorical Variables

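Categorical variables fall into two kinds: ordered variables and non-ordered variables. For ordered variables, applying sklearn.preprocessing.LabelEncoder directly is the best approach (note that LabelEncoder numbers the classes in sorted label order, so check that the resulting integers follow the intended order). For non-ordered variables, the main question is whether to use one-hot encoding. This Quora answer (author: Clem Wang) suggests some other approaches: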
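As a minimal sketch of integer-encoding an ordered variable, the snippet below runs LabelEncoder on a made-up `sizes` column. Because LabelEncoder assigns integers by sorted label order, the sketch also shows OrdinalEncoder with an explicit category list; that second encoder is an addition for illustration and is not mentioned in the original post.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up ordered variable: small < medium < large
sizes = np.array(["small", "large", "medium", "small"])

# LabelEncoder sorts the labels first, so the codes follow
# alphabetical order (large=0, medium=1, small=2), not the size order.
le = LabelEncoder()
le_codes = le.fit_transform(sizes)

# OrdinalEncoder accepts the intended order explicitly
# (small=0, medium=1, large=2) and expects a 2-D feature array.
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
oe_codes = oe.fit_transform(sizes.reshape(-1, 1)).ravel()

print(list(le.classes_), le_codes)
print(oe_codes)
```

If the alphabetical order of the labels happens to match the real order, both encoders give the same ranking; otherwise the explicit list is the safer choice.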

One can try a few other approaches:

  • look at how the response variable responds to the categorical values and try to group them.
  • Find another ML algorithm that works better with categorical features or with one-hot encoding and use that to train a submodel that just uses the categorical features. Then replace the categorical feature with a probability score. For instance, use a Logistic Regression on the hot-encoded values.
  • Try to combine the categorical feature with some other features.
  • Build N xgboost classifiers, one for each category.

This may require playing around with the data a bit. Plotting the data may help you see patterns that you didn't know that were there.
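The first two suggestions in the quoted answer can be sketched roughly as follows. The data, the category names and the per-category positive rates are all made up for illustration. Using out-of-fold predictions (via `cross_val_predict`) is an assumption added here to limit target leakage; the answer itself does not specify it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic categorical feature and binary response (hypothetical rates).
cities = rng.choice(["A", "B", "C", "D"], size=200)
rate = {"A": 0.1, "B": 0.3, "C": 0.6, "D": 0.9}
y = (rng.random(200) < np.array([rate[c] for c in cities])).astype(int)

# Suggestion 1: look at how the response varies by category;
# categories with similar means are candidates for grouping.
mean_by_category = {c: y[cities == c].mean() for c in np.unique(cities)}

# Suggestion 2: train a submodel on the one-hot encoded feature only,
# then replace the feature with the predicted probability.
submodel = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)
score = cross_val_predict(
    submodel, cities.reshape(-1, 1), y, cv=5, method="predict_proba"
)[:, 1]

print(mean_by_category)
print(score[:5])
```

The resulting `score` column is a single numeric feature that can be passed to the tree model in place of the original categorical column.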

This blog post gives an overall conclusion on using one-hot encoding in xgboost:

The conclusions can be summarized in roughly two points:

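  • For categorical variables whose categories are ordered, such as age, treating them as numeric variables works fine. For categorical variables without an order, one-hot encoding is recommended, but it increases memory usage and training time.
  • One-hot encoding is recommended when the number of categories is small (tqchen gives the range [10, 100]).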
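The two points above can be illustrated with a small pandas sketch. The helper `one_hot_small_columns` and the cutoff `MAX_ONE_HOT_LEVELS` are hypothetical names introduced here; the cutoff simply takes the upper end of the quoted [10, 100] range and is not a rule from xgboost itself.

```python
import pandas as pd

# Hypothetical cutoff: upper end of the [10, 100] range quoted above.
MAX_ONE_HOT_LEVELS = 100

def one_hot_small_columns(df, unordered_cols, max_levels=MAX_ONE_HOT_LEVELS):
    """One-hot encode only the unordered columns with few distinct values.

    Columns above the cutoff are left untouched for the caller to handle.
    """
    small = [c for c in unordered_cols if df[c].nunique() <= max_levels]
    return pd.get_dummies(df, columns=small)

df = pd.DataFrame({
    "age": [23, 35, 41, 35],  # ordered: kept as a numeric column
    "city": ["Beijing", "Shanghai", "Shenzhen", "Beijing"],  # unordered
})

encoded = one_hot_small_columns(df, unordered_cols=["city"])
print(encoded.columns.tolist())
print(encoded.shape)
```

Each distinct value of an unordered column becomes its own indicator column, which is where the extra memory and training time come from as the number of categories grows.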

Other related material

Comment: re sklearn -- integer encoding vs 1-hot

Original post: https://www.cnblogs.com/ZeroTensor/p/10097069.html