‘Do not reinvent the wheel’. Every time I hear this adage, it reminds me of when I was working on my MSc thesis. Back then I was building finite element software and needed to implement some general functions for matrix manipulation. Faithful to a ‘do-not-copy’ Bushido code, I was reluctant to reuse Matlab code that was already available and would have solved my problem. It was a noble sentiment until it started affecting the progress of my work, since I was taking more time than I was supposed to. One day, my supervisor called me and asked: ‘Don’t you want to create Matlab from scratch as well?’. I said no and asked why he was asking. He replied: ‘Well, if you don’t like to use other people’s code, I thought you also didn’t like to use other people’s software’. I took the message and reused the code.
We feel guilty when we go for the lazy ‘copy-paste’ approach. However, sometimes it makes sense not to reinvent the wheel. Productivity is about inputs and outputs. If we put in less time and work and get more results out, we’re being productive. Of course, take what I’m saying with a pinch of salt. I’m not saying that we should be copy-paste bots. I’m just saying that everything depends on our goals. In the case of my MSc thesis, the value of my work was in how I assembled the pieces of the jigsaw, not in the implementation of standard matrix manipulation routines. As Newton said, ‘if I have seen further it is by standing on the shoulders of giants’. Leveraging existing knowledge isn’t lazy, it’s smart.
Modern wheels are made of open source libraries. These libraries are the result of millions of hours of collective know-how. They are tested, they are fast, and they are supervised by a community of active users. In that sense, open source libraries are like treasure maps that guide you to the treasure you’re looking for. I recommend that you collect these treasure maps. You want many of them, and diverse ones, because you never know what you will need.
Here you can find a curated list of Machine Learning libraries in Python that I have collected over the years. Use it and share it, so that fewer people reinvent the wheel.
- Computer Vision
- Natural Language Processing
- Automated Machine Learning
- Machine Learning Frameworks and Algorithms
- Time Series
- Statistical and Scientific Computing
- Data Cleaning and Preprocessing
- Data Visualization
- Neural Networks and Deep Learning
- Reinforcement Learning
- Code Testing
Computer Vision
- Detectron – Facebook AI Research’s software system that implements state-of-the-art object detection algorithms, including Mask R-CNN.
- Scikit-Image – A collection of algorithms for image processing.
- SimpleCV – SimpleCV is an open source framework for building computer vision applications.
- OpenCV – A library of Python bindings designed to solve computer vision problems.
Natural Language Processing
- Beautiful Soup – A Python library for pulling data out of HTML and XML files.
- NLTK – A leading platform for building Python programs to work with human language data.
- PyNLPl – Python library for Natural Language Processing that can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models.
- Pattern – Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
- FuzzyWuzzy – Fuzzy String Matching in Python.
Automated Machine Learning
- Auto-Sklearn – An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
- H2O AutoML – A unified interface to a variety of machine learning algorithms, making it easy for non-experts to experiment with machine learning and produce high-performing models.
- auto_ml – Automated machine learning for production and analytics.
- TPOT – A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
- Featuretools – A Python library for automated feature engineering.
Machine Learning Frameworks and Algorithms
- Scikit-Learn – Toolbox with solid implementations of a bunch of state-of-the-art machine learning algorithms.
- TensorFlow – An open source machine learning framework for everyone.
- XGBoost – A parallel tree boosting library (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.
- MLxtend – A library of extension and helper modules for Python’s data analysis and machine learning libraries.
- Vecstack – Python package for stacking.
- LightGBM – A fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
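To give a feel for how little code these frameworks require, here is a minimal sketch using Scikit-Learn. The dataset (the bundled iris data), the choice of a random forest, and all parameter values are my own illustrative assumptions, not a recommendation for any particular problem.

```python
# A minimal sketch: dataset, model and parameters are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to estimate generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit an off-the-shelf model instead of implementing one from scratch.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Mean accuracy on the held-out samples.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Most Scikit-Learn estimators share the same `fit` and `score` methods, so swapping in another classifier is usually a one-line change.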
Time Series
- Prophet – Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
- PyAF – An Open Source Python library for Automatic Time Series Forecasting built on top of popular pydata modules.
- tsfresh – A Python package that automatically calculates a large number of time series characteristics, the so-called features.
- tslearn – A machine learning toolkit dedicated to time-series data.
- traces – A Python library for unevenly-spaced time series analysis.
- PyFlux – An open source time series library for Python.
Statistical and Scientific Computing
- Theano – Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
- NumPy – Package for scientific computing with Python.
- SciPy – Python-based ecosystem of open-source software for mathematics, science, and engineering.
- Pandas – An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- Statsmodels – Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
- PyMC3 – Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) and variational inference (VI) algorithms.
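As a small illustration of the ‘standard matrix manipulation routines’ these libraries already provide, the sketch below solves a linear system with NumPy instead of a hand-written elimination routine. The matrix and right-hand side are made-up values chosen only for the example.

```python
# A minimal sketch: the matrix A and vector b are made-up example values.
import numpy as np

# A small symmetric system, loosely reminiscent of a stiffness matrix.
A = np.array([
    [4.0, -1.0, 0.0],
    [-1.0, 4.0, -1.0],
    [0.0, -1.0, 4.0],
])
b = np.array([2.0, 4.0, 10.0])

# Solve A x = b using NumPy's built-in solver.
x = np.linalg.solve(A, b)

# Check the solution by measuring the size of the residual A x - b.
residual = np.linalg.norm(A @ x - b)
print("x =", x)
print("residual =", residual)
```

For this particular system the solution is approximately `[1, 2, 3]`, which is easy to confirm by substituting it back into the three equations.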
Data Cleaning and Preprocessing
- Dedupe – A library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.
- Fancyimpute – A variety of matrix completion and imputation algorithms implemented in Python.
- Imbalanced-Learn – Python module to perform under-sampling and over-sampling with various techniques.
Data Visualization
- Matplotlib – Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
- Bokeh – An interactive visualization library that targets modern web browsers for presentation.
- Seaborn – A Python visualization library based on matplotlib that provides a high-level interface for drawing attractive statistical graphics.
- Plotly – Open source tools for composing, editing, and sharing interactive data visualization via the Web.
Neural Networks and Deep Learning
- Caffe2 – A deep learning framework that provides an easy and straightforward way for you to experiment with deep learning and leverage community contributions of new models and algorithms.
- Keras – A high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
- PyTorch – A deep learning framework for fast, flexible experimentation.
- Blocks – Framework that helps you build and manage neural network models using Theano.
- Lasagne – Lightweight library to build and train neural networks in Theano.
Reinforcement Learning
- OpenAI Gym – A toolkit for developing and comparing reinforcement learning algorithms.
Code Testing
- Engarde – A library for defensive data analysis.
- Pytest – The pytest framework makes it easy to write small tests, yet scales to support complex functional testing for applications and libraries.
- Hypothesis – A Python library for creating unit tests which are simpler to write and more powerful when run, finding edge cases in your code you wouldn’t have thought to look for.
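To show what a small Pytest test can look like, here is a minimal sketch. The file name `test_scaling.py` and the helper `min_max_scale` are hypothetical, invented only for this example. Pytest collects functions whose names start with `test_` and reports any failing `assert`.

```python
# test_scaling.py - a hypothetical test module; run it with `pytest test_scaling.py`.


def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: avoid dividing by zero and map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_output_stays_in_unit_interval():
    # The smallest value should map to 0.0 and the largest to 1.0.
    scaled = min_max_scale([3, 7, 11, 5])
    assert min(scaled) == 0.0
    assert max(scaled) == 1.0


def test_constant_input_does_not_divide_by_zero():
    # All-equal inputs are an easy edge case to forget.
    assert min_max_scale([2, 2, 2]) == [0.0, 0.0, 0.0]
```

In a real project the helper would live in its own module and the test file would import it. Keeping both in one file here just makes the sketch self-contained.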