Are sklearn defaults wrong?

03 Sep, 2019

There was some uproar on Twitter recently about the default behavior of sklearn's LogisticRegression, which applies regularization unless you tell it otherwise.

If you read the post, you can see that the biggest problem with the choice is that, unless your data is normalized, you will train a model that probably underperforms: you are unnecessarily penalizing it, so it learns less than it could from the data.

The second problem with the default behavior of LogisticRegression is that the regularization constant is, in effect, a magic number (equal to 1.0). This hides the fact that the regularization constant should be tuned by hyperparameter search, not set in advance without knowing what the data and the problem look like.

Couldn’t you just normalize the data and do a grid search, then? You certainly could. The widespread problem in machine learning, however, is that people often blindly follow online tutorials that skip these details because they are hard(er). Understanding how grid search works is not difficult, but it is not trivial either. Understanding why regularization is necessary requires a good mental model of the feature space. Again, these are hardly intricate concepts. The post notes that the first Google hit for “logistic regression sklearn example” does not discuss these fundamental details.
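For reference, normalizing the data and searching over the regularization constant takes only a few lines. The sketch below is one possible setup, not a recommendation: the synthetic dataset, the choice of StandardScaler, the grid of C values, and the five-fold cross-validation are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (assumption: not from the post).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# Put one feature on a much larger scale, as often happens with real data.
X[:, 0] *= 1000.0

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values for C, the inverse regularization strength (arbitrary grid).
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["clf__C"]
print("best C:", best_C, "cross-validated accuracy:", search.best_score_)
```

Keeping the scaler inside the pipeline means the test folds never influence the scaling statistics, which is the main reason to prefer this over scaling the whole dataset up front.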

As an aside, this makes for a very simple yet powerful question when interviewing data scientists: *why should you normalize the data when using a regularization term?* It is trivial to answer for an experienced data scientist, and hard for anyone who is not.
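One way to sketch the answer is with plain arithmetic. An L2 penalty is a function of the coefficients, and the size of a coefficient depends on the units of its feature. The numbers below are hypothetical and stand in for a single feature expressed first in metres and then in kilometres.

```python
import math

# Hypothetical feature value and coefficient, with the feature in metres.
x_m = 1500.0
w_m = 0.002

# The same measurement in kilometres. To contribute the same amount to the
# linear predictor, the coefficient has to grow by the same factor of 1000.
x_km = x_m / 1000.0
w_km = w_m * 1000.0

contribution_m = w_m * x_m
contribution_km = w_km * x_km

# An L2 penalty is proportional to the squared coefficient.
penalty_m = w_m ** 2
penalty_km = w_km ** 2
ratio = penalty_km / penalty_m

print("same contribution:", math.isclose(contribution_m, contribution_km))
print("penalty ratio (km vs m):", ratio)
```

The model's prediction is identical in both cases, yet the penalty differs by a factor of about a million. Without normalization, the regularization term punishes features according to their units rather than their usefulness.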

This whole discussion illustrates why it is hard to justify our data science courses when most people think that you can find all the answers online. While that is true, understanding which answers are correct, and which are not, often takes an expert.

Want more controversial opinions every day in your Twitter client? I’m gglanzani there!
