The Secrets of Univariate Feature Selection

Univariate feature selection is a powerful technique to improve the performance of your models and to reduce their computational cost.

This technique uses statistical tests to assess the relationship between each input feature and the output variable, one feature at a time. Input features with a strong statistical relationship to the output are kept; the remaining features are excluded.
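
In scikit-learn, for example, this boils down to scoring every feature against the target and keeping only the top k. The snippet below is just a sketch: the Iris dataset, the ANOVA F-test, and k=2 are illustrative choices, not the notebook’s exact setup.

    # Univariate feature selection sketch (illustrative data and parameters).
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Score each feature against the target with an ANOVA F-test
    # and keep only the k highest-scoring features.
    selector = SelectKBest(score_func=f_classif, k=2)
    X_selected = selector.fit_transform(X, y)

    print(selector.scores_)   # one score per input feature
    print(X_selected.shape)   # (150, 2): only the two strongest features remain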

Dilbert on feature selection
Source: https://www.pinterest.pt/pin/569846159077398163/

In the following notebook, you’ll learn how to use univariate feature selection. Explore it, try your own variations, and get better at feature selection. You can access all the data and the GitHub version here.

Notebook on Univariate Feature Selection

4 thoughts on “The Secrets of Univariate Feature Selection”

  • Manuel

    Hi Pedro,
    I was trying to use your example, but the link doesn’t have the file there:

    https://github.com/pmarcelino/blog/data/titanic_modified.csv

    Keep posting these interesting articles 🙂
    May I include this code in a GitHub repo I’m putting together?
    I’ll note there that the code is yours; I don’t want to infringe your copyright in any way (send me whatever text you’d like and I’ll add it).

    Regards,
    Manuel Monteiro

    • m0rd3p

      Of course you can use it! I’ve just added an MIT License so the question doesn’t come up in the future 🙂

  • Aleksandra

    Pedro, I had never heard about stratified k-fold cross-validation. It seems really useful (I have always used the standard one), so I will try it for sure.

    If we want to keep the same proportion of class values, we could also use the stratify parameter in the train_test_split function. Is there any reason why you used it only during cross-validation?

    Thanks!

    • m0rd3p

      Yes, there’s also the stratify parameter in the train_test_split function. It should do the same, although I’ve never tested it.

      If I understood the question correctly, I used it only during cross_validation_scores for no special reason. I usually do the train_test_split the same way and, when I need to adjust it for classification purposes, I just apply StratifiedKFold (see the sketch below).

      As long as you are aware that in your classification problems you need to do something about imbalanced datasets, and you actually do something about it, you should be ok 🙂
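
      Roughly, the two options look something like this (a quick sketch with made-up data, not the notebook’s code):

          # Two ways to preserve class proportions (illustrative data, not the notebook's).
          from sklearn.datasets import make_classification
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

          X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

          # Option 1: keep class proportions in the hold-out split itself.
          X_train, X_test, y_train, y_test = train_test_split(
              X, y, test_size=0.2, stratify=y, random_state=0)

          # Option 2: keep class proportions inside each cross-validation fold.
          cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
          scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
          print(scores.mean())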
