How does CatBoost encode categorical variables?

One of the key ingredients of CatBoost explained from the ground up

Adrien Biarnes
Towards Data Science
17 min read · Feb 10, 2021



CatBoost is a “relatively” new package developed by Yandex researchers. It is quite popular right now, especially in Kaggle competitions, where it often matches or outperforms the other gradient tree boosting libraries.

Among other ingredients, one of the very cool features of CatBoost is that it handles categorical variables out of the box (hence the name: CatBoost is short for Categorical Boosting).
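To make “out of the box” concrete, here is a minimal sketch of what that looks like with the Python package (the toy data and column names are mine): you pass raw string categories directly and tell CatBoost which columns are categorical via the cat_features argument, with no manual encoding step.

```python
# Minimal sketch: CatBoost consumes raw string categories directly.
# The toy data and column names are made up for illustration.
import pandas as pd
from catboost import CatBoostClassifier

X = pd.DataFrame({
    "color": ["red", "blue", "blue", "green", "red", "green"],  # categorical
    "size":  [1.0, 2.5, 0.7, 1.8, 2.2, 0.9],                    # numerical
})
y = [1, 0, 0, 1, 1, 0]

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(X, y, cat_features=["color"])  # no one-hot / label encoding needed
print(model.predict(X))
```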

When using an implementation, it is important to really understand how it works under the hood, and that is the goal of this article. We are going to take an in-depth look at the technique CatBoost uses, called Ordered Target Statistics. It is presumed that you have a good understanding of gradient tree boosting; if not, check out my article on the subject.
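As a preview of where we are headed, here is a small NumPy sketch of the core idea behind ordered target statistics: each row is encoded using a smoothed mean of the target computed only over the rows that precede it in a random permutation, with a prior p and a smoothing weight a. The function below is my own toy version; the real implementation uses several random permutations rather than a single one, among other refinements.

```python
# Toy sketch of ordered target statistics for one categorical column.
# Each row is encoded with a smoothed running mean of the target, computed
# only over the rows that precede it in a random permutation.
import numpy as np

def ordered_target_statistics(x, y, prior, a=1.0, seed=0):
    rng = np.random.default_rng(seed)
    encoded = np.empty(len(x), dtype=float)
    sums, counts = {}, {}              # running target sum / count per category
    for i in rng.permutation(len(x)):  # visit rows in permuted order
        s, n = sums.get(x[i], 0.0), counts.get(x[i], 0)
        encoded[i] = (s + a * prior) / (n + a)  # (sum + a*p) / (count + a)
        sums[x[i]] = s + y[i]          # reveal row i's target only afterwards
        counts[x[i]] = n + 1
    return encoded

x = np.array(["red", "blue", "blue", "green", "red", "green"])
y = np.array([1, 0, 0, 1, 1, 0])
print(ordered_target_statistics(x, y, prior=y.mean()))
```

The key design choice is that a row never sees its own label, which is what protects the statistic from the target leakage that plagues a greedy mean encoding computed over the whole dataset.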

Introduction

Gradient Tree Boosting is one of the best off-the-shelf families of algorithms for dealing with structured data. It originated from boosting, one of the most powerful learning ideas introduced in the last 25 years. AdaBoost, introduced by Yoav Freund and Robert Schapire in 1997, was the first instance of this method. Leo Breiman then realized that boosting could be reformulated as an optimization problem with a proper cost function, an insight that led Jerome Friedman to introduce gradient boosting in 2001. Since then, tree boosting methods have seen a lot of newcomers, namely, in order of appearance, XGBoost (eXtreme Gradient Boosting, 2014), LightGBM (2016), and CatBoost (2017).

Gradient boosting can also be used in combination with neural networks these days, either to integrate additional structured knowledge or simply to get a boost in performance. In this regard, a technique introduced by the authors of GrowNet, where the weak learners are replaced by shallow neural networks, seems very promising.

Categorical variable encoding

A word of caution though: I use the terms category, level, and factor interchangeably (in this article they mean the exact same thing).

As you probably already know, machine learning algorithms only handle numerical features. So…
