What is Weak Supervision?
Put simply, Weak Supervision is the automated, intelligent integration of multiple information sources. These sources don’t have to be perfect; they can be rules of thumb. Examples include Python functions that describe textual patterns, Active Learning models, or external sources such as a third-party application or crowd labeling. From this information, cleansed, weakly supervised labels can be derived.
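To make the idea concrete, here is a minimal sketch in Python. The three functions and the spam/ham labels are hypothetical examples, and the simple majority vote is only a stand-in for the actual information integration, which is not described here.

```python
from collections import Counter

ABSTAIN = None  # returned by a source that has no opinion on a record

# Three hypothetical information sources for a spam/ham task.
def contains_link(text):
    return "spam" if "http://" in text or "https://" in text else ABSTAIN

def mentions_prize(text):
    return "spam" if "prize" in text.lower() else ABSTAIN

def is_greeting(text):
    return "ham" if text.lower().startswith(("hi", "hello")) else ABSTAIN

SOURCES = [contains_link, mentions_prize, is_greeting]

def weak_label(text):
    """Combine the votes of all sources by simple majority.

    Returns ABSTAIN when no source votes or when the top labels are tied.
    """
    votes = [v for v in (source(text) for source in SOURCES) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting sources with equal support
    return ranked[0][0]

print(weak_label("Win a prize at https://example.com"))  # spam
print(weak_label("Hello, see you tomorrow"))             # ham
```

A record on which the sources disagree evenly, or which no source covers, stays unlabeled in this sketch.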
What is Active Learning?
As the name suggests, in Active Learning, models are trained during the labeling process. This way, the model can continuously make predictions on the data, which helps both to auto-label records it is confident about and to identify critical records. The latter is used, for example, for query scheduling: all available information is used to pick the next records to be labeled.
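The two uses described above can be sketched as follows. This is an illustration only: the function name, the input layout (a dictionary of predicted label probabilities per record) and the 0.9 threshold are assumptions, not part of the product.

```python
def triage(predictions, auto_label_threshold=0.9):
    """Split records into auto-labeled ones and a labeling queue.

    predictions maps a record id to a dict of label -> predicted probability.
    Records whose top probability reaches the threshold are auto-labeled;
    the rest are queued, least confident first.
    """
    auto_labeled = {}
    queue = []
    for record_id, distribution in predictions.items():
        label, confidence = max(distribution.items(), key=lambda item: item[1])
        if confidence >= auto_label_threshold:
            auto_labeled[record_id] = label
        else:
            queue.append((confidence, record_id))
    queue.sort()  # lowest confidence first
    return auto_labeled, [record_id for _, record_id in queue]

predictions = {
    "a": {"positive": 0.97, "negative": 0.03},
    "b": {"positive": 0.55, "negative": 0.45},
    "c": {"positive": 0.20, "negative": 0.80},
}
auto_labeled, queue = triage(predictions)
print(auto_labeled)  # {'a': 'positive'}
print(queue)         # ['b', 'c']
```

Sorting by the top probability is one simple form of uncertainty sampling; other scheduling criteria are possible.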
What is Confident Learning?
Real-world training data is never labeled 100% correctly. Even datasets like MNIST, a well-known toy dataset that helps new ML engineers enter the field, contain errors. The field of Confident Learning aims to detect records that are either mislabeled or could be interpreted in multiple ways. With higher data quality, models can learn to make the right decisions in difficult cases.
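As a rough illustration, the following Python sketch flags suspicious records using per-class confidence thresholds. It is a simplified take on the idea, not the implementation used in the product; the function name and the input layout (class indices plus out-of-sample predicted probabilities) are assumptions.

```python
def find_label_issues(given_labels, pred_probs):
    """Return indices of records whose given label looks wrong.

    given_labels: class index per record.
    pred_probs: per record, a list with one predicted probability per class.
    """
    n_classes = len(pred_probs[0])

    # Threshold for class k: average predicted probability of k
    # over all records that carry the given label k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for y, p in zip(given_labels, pred_probs) if y == k]
        thresholds.append(sum(own) / len(own) if own else 1.0)

    issues = []
    for i, (y, p) in enumerate(zip(given_labels, pred_probs)):
        # Classes the model is confident about for this record.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: p[k])
            if likely != y:
                issues.append(i)
    return issues

labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
         [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
print(find_label_issues(labels, probs))  # [2]
```

Record 2 is labeled as class 0, but the model confidently predicts class 1, so it is flagged for review.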
What exactly is an information source?
Information sources are the ingredients for scaling your data labeling. You can think of them as labeling heuristics, for example simple Python functions that express some domain knowledge; they don’t have to be 100% accurate. When you add and run several of these sources, you create what is called a noisy label matrix, which is matched against the reference data that you labeled manually. This makes it possible to analyze correlations, conflicts, overlaps, the number of hits on a dataset, and the accuracy of each information source.
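The statistics mentioned above can be sketched in a few lines of Python. The matrix, the reference labels and the function name below are made-up examples; the definitions (a hit is a non-abstaining vote, an overlap is a hit on a record another source also labeled, a conflict is an overlap with a differing vote) are the usual ones and are assumed here.

```python
ABSTAIN = None

# Rows are records, columns are information sources (hypothetical votes).
label_matrix = [
    ["spam", "spam", ABSTAIN],
    ["spam", ABSTAIN, "ham"],
    [ABSTAIN, ABSTAIN, "ham"],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
# Manually labeled reference data for the same records.
reference = ["spam", "ham", "ham", "ham"]

def source_stats(matrix, gold, source):
    """Compute hits, coverage, overlaps, conflicts and precision for one column."""
    hits = overlaps = conflicts = correct = 0
    for row, truth in zip(matrix, gold):
        vote = row[source]
        if vote is ABSTAIN:
            continue
        hits += 1
        others = [v for j, v in enumerate(row) if j != source and v is not ABSTAIN]
        overlaps += bool(others)                     # another source voted too
        conflicts += any(v != vote for v in others)  # and at least one disagreed
        correct += vote == truth
    return {
        "hits": hits,
        "coverage": hits / len(matrix),
        "overlaps": overlaps,
        "conflicts": conflicts,
        "precision": correct / hits if hits else None,
    }

for index in range(3):
    print(index, source_stats(label_matrix, reference, index))
```

In this toy data, the first source hits two of four records (coverage 0.5) and is right on one of them (precision 0.5).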
How do I know whether my information source is good?
An information source can be “good” in terms of both coverage and precision. There is essentially no lower limit for coverage. For precision, we recommend a value above 70%, depending on how many information sources you have. In general, the more information sources you have, the more overlaps and conflicts occur, and the better the information integration works.
If you already automatically label data, why should I train a model at all?
Technically, you could use our program for inference. However, the best results are achieved when a Supervised Learning model is trained on the generated labels, because such models generalize better. It’s simply best practice.
Is your software limited to classifications?
No, you can do single-label and multi-label multiclass classification as well as named entity recognition. We’re currently working on further NLP labeling tasks, such as entity linking. If you need a custom labeling task, let us know.
Which data formats are supported?
We’ve structured our data formats around JSON, so you can upload most file types natively. This includes spreadsheets, text files, CSV data, generic JSON and many more.
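If you want to convert a CSV file to JSON yourself before uploading, a minimal sketch using only the Python standard library could look like this. The target layout (a list of flat objects, one per row) and the function name are assumptions for illustration, not the documented upload schema.

```python
import csv
import io
import json

def csv_to_json_records(csv_text):
    """Turn CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2)

print(csv_to_json_records("id,text\n1,hello\n2,world\n"))
```

Note that `csv.DictReader` reads every value as a string, so numeric columns need to be converted explicitly if required.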
I don’t know whether my data would work - who can I contact?
No worries, we’re always happy to help. Just send a message via the chat in the bottom right corner of your browser, and someone from our team will gladly help.
How fast will I get my results?
Information sources typically run for a few seconds to a few minutes, depending on the size of your data. Because we run functions in containerized environments and enrich text data using spaCy, this may take longer than running them on your local machine. Computing Weak Supervision likewise takes a few seconds to a few minutes.
I have less than 1,000 records - do I need this?
Our system is designed for scalability, but you can definitely benefit from it with small amounts of data, too. We provide an intuitive multi-task labeling interface, extensive data management capabilities, well-written documentation and world-class support.
I don’t want to label my data myself - can I outsource this with your tool?
We’ll gladly help you with the data labeling. To do so, please contact our support team using the chat in the bottom right corner of your browser.
How can I reach support?
The easiest way is to use the chat in the bottom right corner of your browser. Someone from our team will contact you within minutes. Alternatively, you can just send a message to email@example.com. Henrik, one of our co-founders, will be in contact with you as soon as possible.
Are you offering consulting or workshops?
Yes, we offer consulting and workshops, depending on the size of your project. These include custom labeling solutions and workshops on labeling best practices, to ensure high data quality right from the beginning of your project.
kern is highly secure, as we follow industry-leading best practices to keep all of your data secure.
How is my data encrypted?
All of your data is encrypted in transit using HTTPS to protect requests from eavesdropping and man-in-the-middle attacks. Additionally, your data is encrypted at rest using AES-256, protecting it from unauthorized access.
How often are backups created?
We use a managed database for production, which automatically creates daily backups in the form of snapshots of the data.
Where are the data centers located?
Our application runs solely in three AWS availability zones (data centers) located in Frankfurt, Germany. AWS data centers maintain state-of-the-art physical security, including 24x7x365 surveillance, environmental protection, and extensive secure access policies.
On which OS is the application running?
kern servers run on recent Linux OS releases with Long Term Support policies and are updated regularly. Our engineering team monitors uptime and can act quickly if errors occur.
How do you ensure operational security?
Only a small number of authorized employees can access user data. kern employees may access user accounts only in exceptional cases, always with your prior permission and only for the purpose of resolving a specific issue.
We use specialized tools for storing and sharing passwords and other sensitive data, and we require our employees to use multi-factor authentication (MFA) for all tools where possible.
Can we use Multi-Factor Authentication?
Your users can enable MFA for login to increase security with little added friction. Additionally, we use a security stack that detects whether your password has been leaked in a recent data breach and validates that the passwords in use are secure.
Is the application available on private cloud or on-premises?
Our free version is available on public cloud only. For private cloud or an on-premises solution, please contact sales.
I have some further questions about your security - who can I contact?
For all further questions, please contact firstname.lastname@example.org.