Reimagine AI development

Whether you are starting from scratch or improving existing use cases,
our next-gen data-centric development environment enhances your AI.

Programmatic labeling and data curation

refinery shines where other approaches fail.
We combine the best of different worlds to give you a new data-centric experience.
Weak supervision
Write labeling functions, build active learners and many more heuristics, and integrate them in seconds. For prototyping and continuous improvements.
Data navigation system
As we enrich your records with valuable metadata, you can prioritize and slice your data. This saves time and increases quality.
Data debugging
No manual labeling is 100% correct. We help you identify potential labeling errors to ensure the highest possible data quality for your models.
Multiuser capability
refinery is designed to enable collaborative work. Multiple engineers, multiple subject matter experts.
Seamless integration
Integrate refinery into your development workflow via its API or Python SDK. It works well with tools such as DVC for versioning.
Data privacy
We care about your data security.
We offer refinery on public cloud, on private cloud, and on-premises.

Data programming

With refinery, you can label large amounts of data within hours or days. Heuristics you can use to produce labels include:
  • Transformer-based active learning models, fully available in refinery
  • User-defined labeling functions
  • Zero-shot models
  • Third-party applications
  • Legacy systems
  • Crowdlabeled records
Each of them can be integrated within minutes. By analyzing their correlations, conflicts, and overlaps, refinery helps ensure high-quality training data.

Data formats

refinery uses JSON as its backbone data model. You can import anything that can be converted into a JSON-like file, e.g. CSV files or spreadsheets.
When you export the data you've built, you also receive valuable metadata. refinery comes with a Python SDK and an API, which you can use to export the data programmatically. Just run rsdk pull (short for refinery-sdk pull) in your command line.
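As a rough illustration of what "converted into a JSON-like file" can mean, the following sketch turns CSV text into a list of JSON records using only the Python standard library. The column names and sample rows are hypothetical, and this is not refinery's own import code; it only shows the general shape of the conversion.

```python
import csv
import io
import json


def csv_to_records(csv_text: str) -> list:
    """Turn CSV text into a list of JSON-serializable records (one dict per row)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Hypothetical sample data; the column names are made up for illustration.
sample_csv = (
    "headline,source\n"
    "Stocks rally on rate cut,newswire\n"
    "New phone released,blog\n"
)

records = csv_to_records(sample_csv)
json_payload = json.dumps(records, indent=2)
print(json_payload)
```

Each CSV row becomes one JSON object whose keys are the column headers, which is the kind of record-per-row structure a JSON-based data model can work with.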

Build with automl-docker

Use your labeled texts to automatically build a containerized AutoML web-service using our open-source CLI tool automl-docker.
  • Perfect as a baseline model
  • Lightweight and highly customizable
  • Low entry-barrier for junior data scientists
Build the data with refinery, create the container with automl-docker, and prototype the UI with Streamlit (available off the shelf) or other tools.

Frequently asked questions

What is weak supervision?

Weak supervision is a technique for combining different kinds of noisy and imperfect heuristics, such as labeling functions. It can be used not only to automate data labeling but also, more generally, to improve the quality of your existing labels.
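To make the idea concrete, here is a minimal sketch that combines the votes of several noisy heuristics with a plain majority vote. Majority voting is the simplest possible stand-in and is an assumption for illustration; the page does not describe the aggregation method refinery actually uses. The vote data is hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a heuristic that has no opinion on a record returns this


def majority_vote(votes):
    """Combine one record's heuristic votes into (label, agreement) or (None, 0.0)."""
    cast = [v for v in votes if v is not ABSTAIN]
    if not cast:
        return None, 0.0
    counts = Counter(cast).most_common()
    top_label, top_count = counts[0]
    # A tie between the two most frequent labels is treated as unresolved.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, 0.0
    return top_label, top_count / len(cast)


# Rows are records, columns are heuristics (hypothetical votes).
noisy_votes = [
    ["spam", "spam", ABSTAIN],
    ["ham", "spam", "ham"],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    ["spam", "ham", ABSTAIN],
]

weak_labels = [majority_vote(row) for row in noisy_votes]
for row, result in zip(noisy_votes, weak_labels):
    print(row, "->", result)
```

Records on which the heuristics abstain or tie stay unlabeled, so they can be handled manually or by adding further heuristics.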

What is active learning?

As the name suggests, in active learning, models are trained during the labeling process. This way, the model can continuously make predictions on the data, which helps both to auto-label records it is confident about and to identify critical records. The latter is used, for example, for query scheduling, which uses all available information to pick the next records to be labeled.
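The two uses mentioned above can be sketched as follows: confident predictions are auto-labeled, and the remaining records are queued with the least confident first. This least-confidence ordering is one common query strategy, chosen here as an assumption; the threshold of 0.9 and the prediction values are hypothetical.

```python
def split_by_confidence(predictions, threshold=0.9):
    """Split model predictions into auto-labeled records and a labeling queue.

    predictions maps record id -> {label: probability}.
    """
    auto_labeled = {}
    queue = []
    for record_id, probs in predictions.items():
        label, confidence = max(probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            auto_labeled[record_id] = label
        else:
            queue.append((confidence, record_id))
    # Least confident records first: these tend to be the most informative to label.
    queue.sort()
    return auto_labeled, [record_id for _, record_id in queue]


# Hypothetical class probabilities from a model trained on the labels so far.
predictions = {
    "r1": {"positive": 0.97, "negative": 0.03},
    "r2": {"positive": 0.55, "negative": 0.45},
    "r3": {"positive": 0.20, "negative": 0.80},
    "r4": {"positive": 0.08, "negative": 0.92},
}

auto_labeled, next_to_label = split_by_confidence(predictions)
print(auto_labeled)
print(next_to_label)
```

In a real loop, the model would be retrained after each batch of manual labels and the split recomputed.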

What is confident learning?

Real-world training data isn’t labeled 100% correctly. Even datasets like MNIST, a well-known toy dataset that helps new ML engineers enter the field, aren’t without errors. The field of confident learning aims to detect records that are either mislabeled or could be interpreted in multiple ways. With higher data quality, models can learn to make the right decisions in difficult cases.
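A heavily simplified sketch of the idea: compute, for each class, the average probability a model assigns to that class on records carrying that label, and flag a record when the model is at least that confident about a different class than the one it was given. This is an illustrative reduction of confident learning, not the method used by any particular tool, and the labels and probabilities are hypothetical.

```python
def class_thresholds(given_labels, probs):
    """Average predicted probability of each class over records carrying that label."""
    totals, counts = {}, {}
    for label, p in zip(given_labels, probs):
        totals[label] = totals.get(label, 0.0) + p[label]
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


def find_label_issues(given_labels, probs):
    """Return indices whose given label disagrees with a confidently predicted class."""
    thresholds = class_thresholds(given_labels, probs)
    issues = []
    for i, (label, p) in enumerate(zip(given_labels, probs)):
        confident = [c for c in thresholds if p[c] >= thresholds[c]]
        if not confident:
            continue  # the model is not confident about any class for this record
        likely = max(confident, key=lambda c: p[c])
        if likely != label:
            issues.append(i)
    return issues


# Hypothetical manual labels and out-of-sample model probabilities.
given_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
probs = [
    {"cat": 0.9, "dog": 0.1},
    {"cat": 0.8, "dog": 0.2},
    {"cat": 0.1, "dog": 0.9},  # labeled "cat", but the model strongly says "dog"
    {"cat": 0.2, "dog": 0.8},
    {"cat": 0.1, "dog": 0.9},
    {"cat": 0.3, "dog": 0.7},
]

issues = find_label_issues(given_labels, probs)
print(issues)
```

Flagged records are candidates for review rather than proven errors; an ambiguous record can be flagged just as easily as a mislabeled one.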

What exactly is a heuristic?

Heuristics are the ingredients for scaling your data labeling. They don't have to be 100% accurate; a heuristic can be, for example, a simple Python function that expresses some domain knowledge. When you add and run several of these heuristics, you create what is called a noisy label matrix, which is matched against the reference data that you labeled manually. This allows us to analyze correlations, conflicts, overlaps, the number of hits for a data set, and the accuracy of each heuristic.
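As a sketch of what this looks like in plain Python, the following defines three small labeling functions and applies them to a few records to build a noisy label matrix. The function names, label names, record layout, and the use of None for "no vote" are all assumptions made for illustration; they are not refinery's actual labeling function interface.

```python
ABSTAIN = None  # returned when a heuristic has nothing to say about a record


def lf_contains_refund(record):
    return "complaint" if "refund" in record["text"].lower() else ABSTAIN


def lf_contains_thanks(record):
    return "praise" if "thank" in record["text"].lower() else ABSTAIN


def lf_exclamation(record):
    return "complaint" if record["text"].count("!") >= 2 else ABSTAIN


heuristics = [lf_contains_refund, lf_contains_thanks, lf_exclamation]

# Hypothetical records.
records = [
    {"text": "I want a refund!! This is broken."},
    {"text": "Thank you, great support."},
    {"text": "Thanks, but I still need a refund."},
    {"text": "Where can I find the manual?"},
]

# Noisy label matrix: one row per record, one column per heuristic.
label_matrix = [[lf(r) for lf in heuristics] for r in records]

# Number of hits per heuristic, and records on which heuristics conflict.
hits = [
    sum(row[j] is not ABSTAIN for row in label_matrix)
    for j in range(len(heuristics))
]
conflicts = [
    i for i, row in enumerate(label_matrix)
    if len({v for v in row if v is not ABSTAIN}) > 1
]

print(label_matrix)
print(hits, conflicts)
```

The third record receives both a "complaint" and a "praise" vote, which is exactly the kind of conflict that the analysis described above surfaces.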

How do I know whether my heuristic is good?

A heuristic can be “good” with respect to both coverage and precision. For coverage there is essentially no limit; for precision we generally recommend a value above 70%, depending on how many heuristics you have. The more heuristics you have, the more overlaps and conflicts occur, and the better weak supervision can work.
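Both measures are easy to compute by hand. The sketch below assumes coverage means the share of records a heuristic votes on, and precision means the share of its votes that match the manual reference label (counted only where a reference label exists). These definitions and the sample data are assumptions for illustration.

```python
def coverage_and_precision(votes, reference):
    """Compute coverage and precision of one heuristic.

    votes: one vote per record (None = abstain).
    reference: manual labels per record (None = not yet labeled).
    """
    hits = [i for i, v in enumerate(votes) if v is not None]
    coverage = len(hits) / len(votes) if votes else 0.0
    checked = [i for i in hits if reference[i] is not None]
    if not checked:
        return coverage, None  # precision is unknown without reference labels
    correct = sum(votes[i] == reference[i] for i in checked)
    return coverage, correct / len(checked)


# Hypothetical votes of a single heuristic and the manual reference labels.
votes = ["spam", None, "spam", "spam", None, "spam", None, None]
reference = ["spam", "ham", "ham", "spam", None, "spam", "ham", None]

coverage, precision = coverage_and_precision(votes, reference)
print(f"coverage={coverage:.2f}, precision={precision:.2f}")
```

Here the heuristic votes on half of the records and is right on three of its four checked votes, so it would clear the recommended 70% precision.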

If you already automatically label data, why should I train a model at all?

Technically, you could use our program for inference. However, the best results are achieved when a supervised learning model is trained on the generated labels, because such models generalize better to unseen data. It’s simply a best practice.

Which data formats are supported?

We’ve structured our data formats around JSON, so you can upload most file types natively. This includes spreadsheets, text files, CSV data, generic JSON and many more.

I don’t know whether my data would work - who can I contact?

No worries, we’re always happy to help. Just send a message via the chat in the bottom right corner, and someone from our team will gladly help.

How fast will I get my results?

Heuristics typically run for a few seconds to a few minutes, depending on the size of your data. As we run functions in containerized environments and enrich text data using spaCy, it might take more time than running them on your local machine. The computation of weak supervision also takes a few seconds to a few minutes.

I have less than 1,000 records - do I need this?

Our system is designed for scalability, but you can definitely also see the benefits with small amounts of data. We provide an intuitive multi-task labeling interface, extensive data management capabilities, well-written documentation and world-class support.

I don’t want to label my data myself - can I outsource this with your tool?

We’ll gladly help you with the data labeling. Check out our pricing options, and reach out to us.

How can I reach support?

The easiest way is to use the chat in the bottom right corner of your browser. Someone from our team will contact you within minutes. Alternatively, you can send a message to Henrik, one of our co-founders, who will be in contact with you as soon as possible.

Are you offering consulting or workshops?

Yes, we offer consulting and workshops depending on the size of your project. This includes custom labeling solutions and workshops on labeling best practices, to ensure high data quality right from the beginning of your project.

Kern AI is highly secure, as we follow industry-leading best practices to keep all of your data secure.

How is my data encrypted?

All of your data is encrypted in transit using HTTPS in order to protect requests from eavesdropping and man-in-the-middle attacks. Additionally, your data is encrypted at rest using AES-256, securing it from unauthorized access.

How often are backups created?

We use a managed database for production, which automatically creates backups in the form of snapshots of the data every day.

Where are the data centers located?

Our application runs solely in three AWS availability zones (data centers) located in Frankfurt, Germany. AWS data centers maintain state-of-the-art physical security, including 24x7x365 surveillance, environmental protection, and extensive secure access policies.

On which OS is the application running?

Kern AI servers run on recent Linux OS releases with Long Term Support policies and are regularly updated. Our engineering team monitors uptime and is able to act quickly if errors occur.

How do you ensure operational security?

Only a small number of authorized employees can access user data. Kern AI employees may access users’ accounts only in exceptional cases, always with your prior permission and only for the purpose of resolving a specific issue.

We use specialized tools for storing and sharing passwords and other sensitive data, and we require our employees to use multi-factor authentication for all tools where possible.

Can we use Multi-Factor Authentication?

We allow your users to enable MFA for login to reduce friction and increase security. Additionally, we use a security stack that detects whether your password has been leaked in a recent data breach and validates that the passwords in use are secure.

Is the application available on private cloud or on-premises?

Our free version is available on public cloud only. For private cloud or an on-premises solution, please contact sales.

I have some further questions about your security - who can I contact?

For all further questions, please contact

Become a data pioneer now

We are building tools for the age of data-centric AI.
Let's build great use cases together.