How Tree-based Models Handle Categorical Variables

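Categorical variables fall into two kinds: ordered (ordinal) variables and unordered (nominal) variables. For ordered variables, encoding them directly with sklearn.preprocessing.LabelEncoder is the best approach. For unordered variables, the main question is whether to use one-hot encoding. This Quora answer (author: Clem Wang) suggests some other approaches: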
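
To make the two encodings concrete, here is a minimal sketch in plain Python, with no sklearn dependency. The feature names (`sizes`, `colors`), the order in `SIZE_ORDER`, and both helper functions are hypothetical and exist only for this illustration.

```python
# Hypothetical data: "size" is ordered, "color" is not.
SIZE_ORDER = ["small", "medium", "large"]  # assumed order, chosen for illustration


def ordinal_encode(values, order):
    """Map each ordered category to its rank in `order`."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """Return (sorted category list, one 0/1 row per input value)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


sizes = ["medium", "small", "large", "small"]
colors = ["red", "green", "blue", "green"]

size_codes = ordinal_encode(sizes, SIZE_ORDER)
color_categories, color_rows = one_hot_encode(colors)

print(size_codes)        # [1, 0, 2, 0]
print(color_categories)  # ['blue', 'green', 'red']
print(color_rows)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

The ordinal encoding keeps a single column, while the one-hot encoding needs one column per distinct category. One caveat when using LabelEncoder for this: it assigns codes following the sorted order of the labels, so string labels such as these would be coded alphabetically (large, medium, small) rather than by size. An explicit mapping like `SIZE_ORDER` avoids that.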

One can try a few other approaches:

Look at how the response variable responds to the categorical values and try to group them.

Find another ML algorithm that works better with categorical features or with one-hot encoding and use that to train a submodel that just uses the categorical features. Then replace the categorical feature with a probability score. For instance, use a Logistic Regression on the one-hot-encoded values.

Try to combine the categorical feature with some other features.

Build N xgboost classifiers, one for each category.
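
The first, third, and fourth suggestions in the list above can be sketched in plain Python as follows. Everything here is an assumption made for illustration: the column names (`city`, `weekend`, `clicked`), the toy data, the equal-sized bucketing rule, and the helper functions. Training the per-category models themselves (for example with xgboost) is left to the caller.

```python
from collections import defaultdict


def group_by_response_rate(categories, labels, n_groups=2):
    """First suggestion: bucket categories by their mean response."""
    totals = defaultdict(lambda: [0, 0])  # category -> [sum of labels, count]
    for c, y in zip(categories, labels):
        totals[c][0] += y
        totals[c][1] += 1
    rates = {c: s / n for c, (s, n) in totals.items()}
    # Rank categories by response rate; ties are broken by category name.
    ranked = sorted(rates, key=lambda c: (rates[c], c))
    # Assign equal-sized buckets over the ranked categories.
    return {c: i * n_groups // len(ranked) for i, c in enumerate(ranked)}


def cross_features(categories, other):
    """Third suggestion: combine the categorical feature with another feature."""
    return [f"{c}|{o}" for c, o in zip(categories, other)]


def split_by_category(rows, categories):
    """Fourth suggestion: partition rows so one model can be trained per category."""
    parts = defaultdict(list)
    for row, c in zip(rows, categories):
        parts[c].append(row)
    return dict(parts)


# Hypothetical toy data: 4 cities, 3 rows each.
city = ["a"] * 3 + ["b"] * 3 + ["c"] * 3 + ["d"] * 3
weekend = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
clicked = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

groups = group_by_response_rate(city, clicked)
crossed = cross_features(city, weekend)
parts = split_by_category(list(zip(weekend, clicked)), city)

print(groups)       # {'d': 0, 'c': 0, 'b': 1, 'a': 1}
print(crossed[:2])  # ['a|0', 'a|1']
print(len(parts))   # 4
```

Grouping reduces the number of distinct values the tree has to split on, while the per-category split trades a single model for N smaller training sets, so it only makes sense when each category has enough rows.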

This may require playing around with the data a bit. Plotting the data may help you see patterns that you didn't know were there.
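
The second suggestion in the quoted answer (train a submodel on the categorical feature and replace the feature with its probability score) might look roughly like the sketch below. It is a hand-rolled logistic regression trained by batch gradient descent, used as a stand-in for a library implementation. It has no intercept and no regularization, and the data (`plan`, `churned`), the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_on_one_hot(categories, labels, lr=0.5, epochs=2000):
    """Fit logistic regression on a one-hot encoded feature by gradient descent.

    A one-hot input has exactly one active column per row, so the model
    reduces to one weight per category and the one-hot matrix never needs
    to be materialized.
    """
    levels = sorted(set(categories))
    weights = {c: 0.0 for c in levels}
    n = len(labels)
    for _ in range(epochs):
        grad = {c: 0.0 for c in levels}
        for c, y in zip(categories, labels):
            # Gradient of the mean log-loss with respect to the active weight.
            grad[c] += (sigmoid(weights[c]) - y) / n
        for c in levels:
            weights[c] -= lr * grad[c]
    return weights


def probability_scores(categories, weights):
    """Replace each categorical value with the submodel's predicted probability."""
    return [sigmoid(weights[c]) for c in categories]


# Hypothetical toy data: 3 plans, 4 rows each.
plan = ["basic"] * 4 + ["pro"] * 4 + ["team"] * 4
churned = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0]

weights = fit_logistic_on_one_hot(plan, churned)
scores = probability_scores(plan, weights)
print([round(s, 3) for s in scores[::4]])
```

The resulting `scores` column is a single numeric feature that the tree model can split on in place of the original categorical column. In a real pipeline the submodel would usually be fitted on data separate from the rows it scores, so that the label does not leak into the new feature.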

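This blog post gives an overall conclusion on using one-hot encoding in xgboost:

In summary, there are roughly two conclusions:

1. For categorical variables whose categories are ordered, such as age, treating them as numeric variables is fine. For categorical variables whose categories are not ordered, one-hot encoding is recommended. However, one-hot encoding increases both memory overhead and training time.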

2. When a categorical variable takes a small number of distinct values (tqchen gives the range [10, 100]), it is recommended to use


Other related materials