Are sklearn defaults wrong?

Giovanni Lanzani
Published on 03 September 2019 in Build

There was an uproar on Twitter recently about the default behavior of sklearn's LogisticRegression, which applies L2 regularization unless you tell it otherwise.

If you read the post, you can see that the biggest problem with this choice is that, unless your data is normalized, you will train a model that probably underperforms: you are unnecessarily penalizing it, making it learn less than it could from the data.
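To make the point concrete, here is a minimal sketch on synthetic data (the data, the 1000x unit change, and the helper `probability_spread` are illustrative assumptions, not taken from the post). It fits LogisticRegression with its defaults (`penalty="l2"`, `C=1.0`) on a feature expressed in a large unit, and then fits the same estimator after standardizing that feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data: one informative feature and a noisy binary label.
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
y = (signal + 0.5 * rng.normal(size=1000) > 0).astype(int)

# The same feature expressed in a larger unit (think kilometres instead of
# metres): the information is identical, only the numbers are 1000x smaller.
X = (signal / 1000).reshape(-1, 1)

# Default settings (penalty="l2", C=1.0) applied to the raw feature.
raw = LogisticRegression().fit(X, y)

# Same estimator, but the feature is standardized first.
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)


def probability_spread(model):
    """Gap between the highest and lowest predicted positive-class probability."""
    p = model.predict_proba(X)[:, 1]
    return p.max() - p.min()


spread_raw = probability_spread(raw)
spread_scaled = probability_spread(scaled)
print(f"raw feature:          spread = {spread_raw:.4f}")
print(f"standardized feature: spread = {spread_scaled:.4f}")
```

On the raw feature the model would need a large coefficient to use the signal, and the L2 penalty discourages exactly that, so its predicted probabilities barely move away from the base rate. After standardization the same default penalty leaves the model free to separate the classes.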

The second problem with the default behavior of LogisticRegression is the choice of a regularization constant that is, in effect, a magic number (the parameter C, equal to 1.0). This hides the fact that the regularization constant should be tuned by hyperparameter search, not set in advance without knowing what the data and the problem look like.

You could just normalize the data and do a grid search then, couldn't you? We certainly could. The widespread problem in machine learning, however, is that people often blindly follow online tutorials that were written without attention to these details, because the details are hard(er). Understanding how grid search works is not difficult, but it is not trivial either. Understanding why regularization is necessary requires a good mental model of the feature space. Again, these are hardly intricate concepts. The post notes that the first Google hit for "logistic regression sklearn example" does not discuss these fundamental details.
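For reference, normalizing and grid searching can be sketched in a few lines. This is one possible setup, not a recipe from the post: the breast cancer dataset bundled with scikit-learn, the grid of C values, the 5-fold split, and `max_iter=1000` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset that ships with scikit-learn (no download needed).
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so each cross-validation fold is
# standardized using statistics from its own training part only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate values for C, the inverse of the regularization strength.
# make_pipeline names the step after the lowercased class name.
param_grid = {"logisticregression__C": np.logspace(-3, 3, 7)}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_C = search.best_params_["logisticregression__C"]
print(f"best C: {best_C:g}, cross-validated accuracy: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline, rather than scaling the whole dataset up front, avoids leaking information from the validation folds into the training folds.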

As an aside, this makes for a very simple yet powerful question when interviewing data scientists: *why should you normalize the data when using a regularization term?* The answer is trivial for an experienced data scientist and hard for anyone who is not.

This whole discussion is what makes it hard to justify our data science courses when most people think that you can find all the answers online. While that is true, understanding which answers are correct, and which are not, often takes an expert.

Want more controversial opinions every day in your Twitter client? I'm gglanzani there!
