Handling Categorical Features in Machine Learning

Introduction: Datasets typically contain two types of variables: continuous (numerical) and categorical. Regression-based algorithms use both continuous and categorical features to build models, but most ML libraries cannot fit categorical variables into a regression equation in their raw form. Leaving those variables out of the model usually costs accuracy, so it is crucial to learn how to deal with them. Machine learning libraries handle categorical variables in various ways. The right approach to transforming them and using them efficiently in model training depends on several conditions, including the algorithm being used and the relationship between the response variable and the categorical variable(s). Here I take the opportunity to demonstrate the methods for handling categorical variables that are available in MLlib, the popular machine learning library in Spark.
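
To make concrete why raw categorical values cannot go straight into a regression equation, here is a minimal sketch in plain Python of two common transformations: mapping each category to an integer index, then expanding that index into a one-hot vector. This is not MLlib code; the helper names `fit_string_index` and `one_hot` and the sample `colors` column are hypothetical and chosen only for illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct category to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical categorical column, used only for illustration.
colors = ["red", "green", "red", "blue", "red", "green"]

mapping = fit_string_index(colors)
encoded = [one_hot(mapping[c], len(mapping)) for c in colors]

for color, vec in zip(colors, encoded):
    print(color, vec)
```

Each category becomes its own numeric column, so a regression model can learn a separate coefficient per category instead of treating the integer index as an ordered quantity.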