Getting started with Auto-ML docker

In this article, we walk you step by step through using this tool to create an awesome sentiment classifier for stock news.

Published:
June 22, 2022
May 28, 2022

The main goal is to build a tool that predicts whether a stock's news is positive, neutral, or negative! We will use the financial sentiment analysis dataset to implement the model, which you can get here or on our GitHub Repo in the sample_data folder. The dataset contains about 6,000 labeled news articles about various companies. Before we start building, let's take a quick look at the data:

As you can see, the data consists of a news headline about a specific company and a label for the sentiment of the headline. In this case, the sentiment is positive. With this data, we will build a classifier that takes unseen news headlines as input and tells us whether the headline is positive, neutral, or negative. That way, you can quickly analyze many news headlines to understand how the press feels about a company.

To follow this guide, we encourage you to use the same dataset as we did. However, you are free to use your own datasets, data from our amazing refinery, or other cool datasets from Kaggle! As long as they contain textual data, they’ll work fine with our tool!

Also, please consider joining our newsletter so you won’t miss upcoming news from Kern AI!

Getting started

The first step is all about getting the tool up and running. Before we start, make sure that you have Python and Docker Desktop installed to be able to follow along! If you have, head over to our GitHub Repo and clone it into an empty folder of your choice. After that, jump into the folder and install all dependencies. You can either use pip or Anaconda for that:

$ pip install -r requirements.txt

$ conda install --file requirements.txt

After all dependencies are installed, you have everything ready to build fantastic machine learning models quickly!

Loading the data

Now it is time to build a machine learning model! To do so, simply enter the following:

$ python3 ml/create_model.py

After that, the command line interface will start and guide you through the process. First, we need to tell the tool where the data is stored on our computer. You can right-click on your file, copy the path, and paste it into the tool! The financial sentiment analysis dataset is stored in the sample_data folder provided with the GitHub Repo in a file called stocknews_data.csv.

Please note that, at the moment, our tool only supports data in the .csv format. We are working on supporting more formats. Alternatively, if you feel comfortable using the Python library Pandas, you can go into the code and modify it so that it accepts other data formats.
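If your data comes in another format, one option that leaves the tool's code untouched is to convert it to .csv first. The following is a minimal sketch of that idea with Pandas. The names `READERS`, `load_table`, and `convert_to_csv` are our own illustrations and are not part of the repo.

```python
from pathlib import Path

import pandas as pd

# Map each file extension to the pandas reader that understands it.
# Excel and Parquet need extra packages (e.g. openpyxl, pyarrow) when called.
READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda path: pd.read_csv(path, sep="\t"),
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
}


def load_table(path):
    """Load a tabular file into a DataFrame, picking the reader by extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix or 'no extension'}")
    return READERS[suffix](path)


def convert_to_csv(source, target):
    """Read any supported file and write it back out as a .csv file."""
    frame = load_table(source)
    frame.to_csv(target, index=False)
    return frame
```

For example, `convert_to_csv("my_news.json", "my_news.csv")` would produce a file whose path you can then paste into the command line interface (both file names here are placeholders).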

After entering the path to the data, we need to tell the tool which columns to use for training the model. First, the data to train the model is loaded; in our case, this is the column that contains the 6,000 news headlines. Afterward, we need to load the labels for the data. The labels tell us whether a headline is good or bad. Bad headlines are labeled “negative”, and good headlines are labeled “positive”. The machine learning algorithm needs these labels to learn to classify the headlines correctly!
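Before you answer the prompts, it can help to check which columns your file actually has and how the labels are distributed. Here is a small sketch using only the Python standard library. The function name `summarize_csv` is our own, and the label column name is something you pass in yourself, since it depends on your file.

```python
import csv
from collections import Counter


def summarize_csv(path, label_column):
    """Return the column names of a .csv file and a count of each label."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        if label_column not in columns:
            raise KeyError(f"Column {label_column!r} not found in {columns}")
        # Count how often each label value appears in the chosen column.
        label_counts = Counter(row[label_column] for row in reader)
    return columns, label_counts
```

Calling `summarize_csv(path, label_column)` with your own file path and label column name returns the column names you can pick from and shows whether one label dominates the dataset.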

Preprocessing the data

Computers work with numbers. Because our dataset is made of words, we need to process the dataset to be usable by a computer. For that, we use a process called "embedding". If that sounds unfamiliar and complicated to you, don't worry! We've done all the work for you. 

Next, you choose the language and how fast you want the data to be processed. For English text, we recommend using the distilbert-base-uncased model to preprocess your data. The all-MiniLM-L6-v2 model works fine as well. It's faster than distilbert-base-uncased but a little less accurate. You can choose these models by simply entering their numbers in the command line. If your data is not in English, you can find many more models in different languages on Hugging Face. Load them into the tool by choosing the third option and pasting in the name of your model! Please be aware that, on rare occasions, some models might not work right away.

If you have a correctly set up GPU with CUDA cores available, it will be detected automatically and will speed up the preprocessing of the data dramatically. If you don’t have a GPU in your computer, this process might take a couple of minutes, but it will work fine. So, if you have a GPU with CUDA cores, we encourage you to check how to make them usable. You can learn more about this here and here.
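If you want to check for yourself whether a CUDA GPU is usable, here is a minimal sketch. It assumes that PyTorch is the library doing the computation, which is our assumption and not something this guide confirms. The function name `pick_device` is our own.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is the backend in use
    except ImportError:
        # Without PyTorch installed we cannot probe the GPU, so report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Preprocessing would run on: {pick_device()}")
```

If this prints "cpu" even though you have an NVIDIA GPU, the driver or the CUDA-enabled build of PyTorch is probably not set up yet.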

Building the model

For now, you’re all done! Now, lean back and watch our tool do all the work for you. It will automatically try out some parameters for the model to find the best configuration for the data. After that, the best configuration will be used to build a model! In the end, you will also get an overview of how well the model performs. 
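To give you an idea of what "trying out parameters" means, here is a generic sketch of an exhaustive grid search in plain Python. It is not the tool's actual implementation: the parameter names `C` and `max_iter` and the scoring function `toy_score` are made up for illustration, and a real scoring function would train a model and measure it on held-out data.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        # Keep the first combination that reaches the highest score.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Made-up stand-in for 'train a model and measure validation accuracy'."""
    penalty = 0.05 if params["max_iter"] < 200 else 0.0
    return 1.0 - 0.1 * abs(params["C"] - 1.0) - penalty


best_params, best_score = grid_search(
    {"C": [0.1, 1.0, 10.0], "max_iter": [100, 200]}, toy_score
)
print(best_params, best_score)
```

The number of combinations grows quickly with every parameter you add, which is why automated tools usually keep the grid small.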

Our model did quite well on the stock news data and reached an accuracy of 70%. This is not perfect, but considering that the dataset is relatively small, this result is alright! To improve the model, we can just label more data and add it to the training set. The perfect tool to do this is our open-source labeling application, which we’ll soon release. So make sure to subscribe to our newsletter!
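In case you are wondering how such a number comes about: accuracy is simply the share of predictions that match the true labels. Here is a small sketch with made-up labels (they are not taken from our dataset), where 7 out of 10 predictions are correct.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both lists must have the same length")
    if not true_labels:
        raise ValueError("Cannot compute accuracy of an empty list")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up example: the last three predictions are wrong.
truth = ["positive"] * 5 + ["negative"] * 5
preds = ["positive"] * 5 + ["negative"] * 2 + ["positive"] * 3
print(accuracy(truth, preds))
```

Keep in mind that accuracy alone can look better than it is when one label is much more common than the others.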

Containerization

We’ve now built a machine learning model that can classify stock news. How cool is that? Now we want to use the model. To do that, we can containerize it. If you are not familiar with containerization, that’s ok! To containerize a program means you put everything the program needs, such as files and dependencies, into a virtual container. This container can run everywhere because it has everything it needs to execute. 

To do this, you need to have Docker Desktop installed. 

If you have everything set up to containerize, just type:

$ bash start_container

Alternatively, if you have Windows, you might want to paste these steps:

$ docker build -t automl-container-backend .

$ docker run -d --rm --name automl-container -p 7531:7531 automl-container-backend

This will start the containerization process for you! It might take a couple of minutes until everything is built.
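If you'd like to know when the container is reachable, you can poll the mapped port. The sketch below uses only the Python standard library. The function names are our own, and an open port only tells you that something is listening, not that the model has finished loading.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, max_wait=120.0, interval=2.0):
    """Poll host:port until it opens or max_wait seconds have passed."""
    deadline = time.monotonic() + max_wait
    while True:
        if port_is_open(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the port mapping from the commands above, `wait_for_port("localhost", 7531)` returns True as soon as the container accepts connections, or False after two minutes.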

Using the model

After the container is built, it will automatically start up. We have built a simple user interface for you to use and get new predictions! Simply run the following command: 

$ streamlit run app/ui.py 

This will start up the frontend, which you can access via localhost:8501. You can also easily connect the frontend to a custom URL if you'd like. Let us know if you'd like to learn more about this!

Finally, let's test our model. Yahoo Finance is a great and reliable website to get news about stocks. We found an exciting headline about Bitcoin. Let's find out what our machine learning model has to say about it:

The model doesn’t seem very optimistic about this headline, and we would say that’s right. This headline is about the price of Bitcoin going down, which is bad. Good job, model! Let’s try out another headline about a different topic:

The following headline is positive, and our model correctly identifies that. Yay! If you look closely, you can see that the confidence is not very high. Our model has been correct both times, but to give rock-solid, confident predictions, it would probably need more data, a little bit of clever feature engineering, or both!

But this is not where the fun stops. You could take this container and use it for many more cool things. Why not use a web scraper and build your own news recommendation service? Or a dashboard in which you display the sentiments on different stocks? There are many possibilities. Let us know if you would be interested to see more tutorials of this kind!

We hope you found this small tutorial helpful. Should you have any questions or suggestions for us, don't hesitate to get in touch with us at info@kern.ai or leonard.puettmann@kern.ai. We would also be happy if you would give a star to our GitHub Repo!
