5 open-source AutoML tools to kick-start your next machine learning project

In this article, we introduce AutoML: what it is and when to use it. You’ll also learn about some great open-source AutoML tools!

Published:
May 28, 2022

What is automatic machine learning (AutoML)?

Machine learning is about the ability of computers to learn without being explicitly programmed for a task. However, if you’ve ever set up a machine learning algorithm, you’ll know that a lot of setup and programming is still necessary. AutoML tries to automate the tedious tasks involved in machine learning to save time and money. It also allows beginners and experts alike to make effective use of machine learning in a short amount of time.

Let’s look at how you would use AutoML in a real-world data science project. With AutoML, you can quickly get a robust machine learning model to show to your team or to clients. That way, you can present first results fast and build on them afterwards. This is especially valuable when getting stakeholders involved and securing their approval is critical to the success of a project.

How AutoML works and when to use it

There are various approaches to automating machine learning. The simplest is to create a pipeline with different preprocessing steps, models and parameters, try them all out and use the combination with the best results, i.e. brute-force your way to the best model. However, this method is computationally intensive. Other approaches try to mitigate this problem by searching more selectively: some take a Bayesian route and use the results of earlier trials to decide what to try next, while others train multiple models at once, keep only the best ones after a few iterations and fully train only the most promising candidates. Another way to make automated machine learning more efficient is to start from models that achieved good results on previous, similar data, so that the right model is applied to the right type of data.
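
To make the brute-force approach concrete, here is a minimal sketch using scikit-learn’s `GridSearchCV`. The iris dataset, the two scalers, the two model families and the parameter values are our own illustrative assumptions, not a recommended search space.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)

# A two-step pipeline; the search space below swaps out both steps.
pipeline = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])

# Each dict is one family of candidates: preprocessing x model x parameters.
# The concrete values are illustrative assumptions.
search_space = [
    {
        "scale": [StandardScaler(), MinMaxScaler()],
        "model": [LogisticRegression(max_iter=1000)],
        "model__C": [0.1, 1.0, 10.0],
    },
    {
        "scale": ["passthrough"],  # tree ensembles do not need feature scaling
        "model": [RandomForestClassifier(random_state=0)],
        "model__n_estimators": [25, 100],
        "model__max_depth": [3, None],
    },
]

# Exhaustive search: every combination is trained and cross-validated.
search = GridSearchCV(pipeline, search_space, cv=5, scoring="accuracy")
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(f"Tried {n_candidates} pipelines, best CV accuracy: {search.best_score_:.3f}")
print(search.best_params_)
```

Even on this tiny search space, every one of the 10 combinations is trained five times (once per cross-validation fold), which shows why exhaustive search gets expensive quickly as the grid grows. Scikit-learn’s experimental `HalvingGridSearchCV` follows the more selective idea of dropping weak candidates early.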

While AutoML can be very useful, it’s not an all-encompassing solution for every ML and data science problem. Automated machine learning lets you tackle a problem rapidly and get a first prototype in a short amount of time. Using AutoML on basic problems that have been thoroughly explored before can also be a good idea. If you know the data you’re dealing with and you know that it won’t require further analysis or feature optimization, automated ML frees you to focus on the work you actually need to do, like scaling up and improving your data quality.

However, you shouldn’t rely on AutoML blindly, and it is not a replacement for doing data science. It’s a handy and valuable tool, and when used appropriately it can save both time and money. But it is still important to explore the data you use and to make sure it is of sufficient quality, because even the best models can’t compensate for seriously bad data.

5 awesome AutoML tools:

As you can see, AutoML can be a valuable tool for most data scientists. However, the variety of tasks in data science is huge, from processing text with NLP and crunching numbers in tabular data to working with images. There isn’t a one-size-fits-all AutoML tool, but we’ve found a handful of open-source tools that cover most data science tasks.

Auto-Sklearn

Sklearn (scikit-learn) is one of the most widely used machine learning libraries for Python, so building an AutoML solution on top of it makes a lot of sense. Auto-Sklearn keeps Sklearn’s easy-to-use interface and automatically tunes both classification and regression models.

Unfortunately, Auto-Sklearn only works on macOS and Linux and has no official Windows support. However, you can run Auto-Sklearn on Windows through WSL 2. Just like Sklearn, Auto-Sklearn doesn’t support GPU usage at the moment.[1]
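
Here is a minimal sketch of what an Auto-Sklearn run can look like. The helper name `run_auto_sklearn`, the two-minute budget and the way it is split per candidate are assumptions for illustration; the estimator class and its parameters come from Auto-Sklearn’s Sklearn-style API.

```python
def run_auto_sklearn(X_train, y_train, X_test, y_test, time_budget_s=120):
    """Search for a classifier with Auto-Sklearn; return (model, test accuracy).

    Hypothetical helper for illustration, not part of Auto-Sklearn itself.
    """
    # Imported lazily so this file still loads on systems where
    # auto-sklearn is not installed (for example native Windows).
    import autosklearn.classification
    from sklearn.metrics import accuracy_score

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=time_budget_s,  # total search budget in seconds
        per_run_time_limit=time_budget_s // 4,  # cap for a single candidate model
        seed=0,
    )
    automl.fit(X_train, y_train)  # same fit/predict interface as Sklearn
    accuracy = accuracy_score(y_test, automl.predict(X_test))
    return automl, accuracy
```

You would call it with any train/test split of a tabular dataset, for example `model, acc = run_auto_sklearn(X_train, y_train, X_test, y_test)`. Afterwards, `model.leaderboard()` lists the models that were evaluated during the search.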

Pros

  • Easy to use, especially if you are already familiar with Sklearn.
  • Applicable for many situations.

Cons

  • No official Windows support.
  • No GPU support.

You can check out Auto-Sklearn here.

TPOT

TPOT stands for Tree-Based Pipeline Optimization Tool. Just like Auto-Sklearn, TPOT is an open-source AutoML library for Python that uses models and data preprocessing capabilities from Sklearn. However, TPOT works a bit differently under the hood than Auto-Sklearn.

The “tree” in the name refers to how TPOT represents candidate pipelines: as trees of operators that it evolves with genetic programming.[4] Its search space includes tree-based algorithms for regression and classification, such as Random Forests, Decision Trees or XGBoost, which often provide very robust solutions. These models tend to cope especially well with noisy or unclean data.[3] So you might want to consider using TPOT if you know that your data is messy, for example because it has a lot of missing values.

Besides providing strong models, TPOT also builds a whole pipeline, including data cleaning, feature selection and parameter optimization.[4] Sadly, TPOT offers no general GPU support at the time of writing; only XGBoost models can make use of a GPU.
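
A minimal sketch of a TPOT run, assuming the classic TPOT API (the 0.11/0.12 releases). The helper name `run_tpot`, the export path and the small search budget are our own illustrative choices. TPOT searches with genetic programming, which is what `generations` and `population_size` control.

```python
def run_tpot(X_train, y_train, X_test, y_test, export_path="tpot_pipeline.py"):
    """Evolve a Sklearn pipeline with TPOT; return (fitted TPOT object, test score).

    Hypothetical helper for illustration, not part of TPOT itself.
    """
    # Imported lazily so this file still loads where TPOT is not installed.
    from tpot import TPOTClassifier

    tpot = TPOTClassifier(
        generations=5,       # number of evolutionary iterations
        population_size=20,  # candidate pipelines kept in each generation
        random_state=42,
        verbosity=2,         # print progress during the search
    )
    tpot.fit(X_train, y_train)
    score = tpot.score(X_test, y_test)

    # Write the best pipeline found as a standalone Sklearn script.
    tpot.export(export_path)
    return tpot, score
```

The exported file contains plain Sklearn code, so the winning pipeline can be inspected, edited and retrained without TPOT.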

Pros

  • Tree-based models deliver robust results on noisy or unclean data.
  • Built on top of Sklearn.

Cons

  • GPU support only when using XGBoost.

You can find more about TPOT here.

automl-docker by Kern AI

automl-docker is a free and open-source tool that we at Kern developed to make it easy to create natural language classifiers. Most AutoML tools are really good when it comes to tabular data or everyday machine learning tasks. However, there isn’t a huge variety of tools for NLP tasks. That’s where automl-docker comes into play!

The tool lets you load a dataset and takes over the whole text preprocessing and model training for you. We use state-of-the-art transformer models to preprocess the text data, which yields highly accurate results.

Pros

  • Easy-to-use CLI for creating NLP classifiers.
  • State-of-the-art transformer models.
  • Full GPU support.

Cons

  • Restricted to text data and NLP.

Check out our free tool here.

AutoKeras

AutoKeras is an AutoML library for deep learning. While libraries such as TPOT rely on classical machine learning models, AutoKeras builds neural networks. The main benefit is that you can tackle tasks like image or text classification. All you need to do is provide the data, and AutoKeras will search for a suitable model architecture and hyperparameters for you.

AutoKeras also offers full GPU support. So, if you have a graphics card at your disposal, you might be able to speed up the model building process even more!

AutoKeras is built on top of Keras, an open-source API for deep learning. Keras is part of TensorFlow, one of the major deep learning frameworks, developed by Google. Because Keras is so easy to handle, it’s a great resource for beginners and people who want to dig deeper into deep learning. With AutoKeras, creating solid deep neural networks becomes even easier and quicker.
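
A minimal sketch of an image classification search with AutoKeras. The helper name `run_autokeras_image` and the small trial and epoch budgets are assumptions for illustration; the inputs are expected to be NumPy arrays of images and labels.

```python
def run_autokeras_image(x_train, y_train, x_test, y_test, max_trials=3, epochs=5):
    """Search for an image classifier with AutoKeras.

    Hypothetical helper for illustration, not part of AutoKeras itself.
    Returns the best Keras model and the evaluation results on the test set.
    """
    # Imported lazily so this file still loads where AutoKeras
    # and TensorFlow are not installed.
    import autokeras as ak

    clf = ak.ImageClassifier(
        max_trials=max_trials,  # number of different architectures to try
        overwrite=True,         # start fresh instead of resuming an old search
    )
    clf.fit(x_train, y_train, epochs=epochs)

    # Loss and metric values of the best model on held-out data.
    results = clf.evaluate(x_test, y_test)

    # Export the winner as a regular Keras model for saving or deployment.
    model = clf.export_model()
    return model, results
```

For text data, `ak.TextClassifier` can be used in the same way. Each trial trains a full neural network, so keep `max_trials` and `epochs` small for a first experiment.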

Pros

  • Can handle image and text data.
  • Built on top of Keras and TensorFlow.
  • GPU support.

Cons

  • Computationally demanding compared to Auto-Sklearn or TPOT.

AutoKeras is available here.

H2O AutoML

This AutoML tool is a little different from the other tools mentioned above. The backend of H2O AutoML is built with Java, but it has a Python API as well. Another difference is that H2O AutoML is made by a company, whereas other AutoML tools are often built collaboratively by researchers and the data science community. Still, H2O AutoML is open source. It’s also a very popular tool, so we wanted to show it here.

H2O AutoML works by training various machine learning models and then combining them into an ensemble to achieve the best result. These models include tree-based algorithms like XGBoost as well as neural networks.[3]

One thing we really like about H2O AutoML is that it also offers a no-code web interface, which makes it even easier to create machine learning models!
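
If you prefer the Python API over the web interface, a run can be sketched as follows. The helper name `run_h2o_automl`, the 80/20 split and the budgets are assumptions for illustration; `csv_path` and `target` stand for your own file and label column. Note that `h2o.init()` starts a local H2O instance, which requires Java.

```python
def run_h2o_automl(csv_path, target, max_models=10, max_runtime_secs=300):
    """Run H2O AutoML on a CSV file for a classification task.

    Hypothetical helper for illustration, not part of H2O itself.
    Returns the best model, the leaderboard and the test performance.
    """
    # Imported lazily so this file still loads where h2o is not installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O instance

    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # categorical target -> classification
    train, test = frame.split_frame(ratios=[0.8], seed=1)
    features = [col for col in frame.columns if col != target]

    aml = H2OAutoML(
        max_models=max_models,              # stop after this many base models
        max_runtime_secs=max_runtime_secs,  # or after this much time
        seed=1,
    )
    aml.train(x=features, y=target, training_frame=train)

    # The leader is the top-ranked model on the leaderboard.
    performance = aml.leader.model_performance(test)
    return aml.leader, aml.leaderboard, performance
```

The leaderboard ranks all trained models, including the combined ensembles, so you can compare the winner against the individual models.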

Pros

  • Delivers solid results compared to other AutoML tools.
  • Open source.

Cons

  • Pricing of H2O is ambiguous; not all functions might be free to use.

H2O AutoML is available here.

Code to get started with these AutoML tools!

We’ve prepared some code on GitHub to give you a quickstart for Auto-Sklearn, TPOT and AutoKeras. For automl-docker you can find a dedicated tutorial on YouTube.

Sources

1 - Feurer, Matthias; Klein, Aaron; Eggensperger, Katharina; Springenberg, Jost Tobias; Blum, Manuel; Hutter, Frank (2019): Auto-sklearn: Efficient and Robust Automated Machine Learning. In: Frank Hutter, Lars Kotthoff and Joaquin Vanschoren (eds.): Automated Machine Learning. Cham: Springer International Publishing (The Springer Series on Challenges in Machine Learning), pp. 113–134.

2 - Friedrich, Tobias; Neumann, Frank; Sutton, Andrew M. (eds.) (2016): Proceedings of the Genetic and Evolutionary Computation Conference 2016. GECCO '16: Genetic and Evolutionary Computation Conference. Denver, Colorado, USA, July 20–24, 2016. New York, NY, USA: ACM.

3 - Halvari, Tuomas; Nurminen, Jukka K.; Mikkonen, Tommi (2020): Testing the Robustness of AutoML Systems. In: Electron. Proc. Theor. Comput. Sci. 319, pp. 103–116. DOI: 10.4204/EPTCS.319.8.

4 - Olson, Randal S.; Bartley, Nathan; Urbanowicz, Ryan J.; Moore, Jason H. (2016): Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In: Tobias Friedrich, Frank Neumann and Andrew M. Sutton (eds.): Proceedings of the Genetic and Evolutionary Computation Conference 2016. GECCO '16: Genetic and Evolutionary Computation Conference. Denver, Colorado, USA, July 20–24, 2016. New York, NY, USA: ACM, pp. 485–492.
