Advantages of Weak Supervision
In this blog post, we’ll go into detail on how Weak Supervision works. First, however, we want to tell you about its two main advantages:
- Correctly applied, it enables you to label 100,000+ records a day
- It gives you the ability to develop and debug your training data
Are you interested? If so, let’s dive into it!
If there’s a :-) in a sentence, its sentiment is positive - right?
How would you implement sentiment classification for "positive" and "negative" comments if there were no such thing as Supervised Learning? One valid option would be to collect a list of positive and negative tokens (e.g. "great", ":-)" or "terrible") and use them as lookup values for classification. Even though this would most likely be imperfect, as words are often context-sensitive and sentences can be negated, it would still be better than random guessing.
Such methods are called heuristics, and they are at the core of Weak Supervision - a technique to build and manage large-scale programmatic AI training labels. Let’s take a closer look at it:
Weak Supervision is all about incrementally gaining further information by creating so-called "noisy" (i.e. not always correct) labels through heuristics. In our scenario above, a Python keyword lookup function (= heuristic) could create noisy labels by processing an incoming record:
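A minimal sketch of such a keyword lookup heuristic could look like this. The token lists and the convention of returning `None` to abstain are illustrative assumptions, not a fixed API:

```python
# Illustrative token lists for a sentiment keyword lookup heuristic.
POSITIVE_TOKENS = {"great", "awesome", ":-)"}
NEGATIVE_TOKENS = {"terrible", "awful", ":-("}

ABSTAIN = None  # the heuristic makes no statement for this record


def keyword_lookup(record: str):
    """Return a noisy label for a record, or abstain if no keyword matches."""
    tokens = set(record.lower().split())
    if tokens & POSITIVE_TOKENS:
        return "positive"
    if tokens & NEGATIVE_TOKENS:
        return "negative"
    return ABSTAIN


keyword_lookup("what a great movie :-)")  # -> "positive"
keyword_lookup("the plot was okay")       # -> None (abstain)
```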
Typical heuristics are keyword/database lookups, pattern matchers, third-party applications, legacy systems and even Machine Learning algorithms. To be clear, they don’t necessarily have to create a noisy label for each record; it is perfectly fine if they only do so for a subset of the data.
If you collect enough noisy labels from multiple heuristics, you can craft a noisy label matrix - i.e. multiple noisy labels per record.
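Such a matrix can be built by simply running every heuristic over every record. The three toy heuristics and the sample records below are made up for illustration, with `None` standing for an abstaining heuristic:

```python
# Three toy heuristics; each returns a noisy label or None (abstain).
def contains_smiley(record):
    return "positive" if ":-)" in record else None

def contains_negation(record):
    return "negative" if "not" in record.lower().split() else None

def keyword_great(record):
    return "positive" if "great" in record.lower() else None

heuristics = [contains_smiley, contains_negation, keyword_great]

records = [
    "great camera, great price :-)",
    "this is not what I ordered",
    "arrived on time",
]

# One row per record, one column per heuristic.
label_matrix = [[h(r) for h in heuristics] for r in records]
# -> [["positive", None, "positive"],
#     [None, "negative", None],
#     [None, None, None]]
```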
What would be your initial thought if such a matrix of information existed for your records? Chances are high that if all non-abstaining noisy labels of a record agree, this noisy label is indeed the correct label.
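This "all non-abstaining votes agree" intuition can be sketched in a few lines. Each row is assumed to hold one vote per heuristic, with `None` meaning the heuristic abstained:

```python
def unanimous_label(row):
    """Return the label if all non-abstaining votes agree, else None."""
    votes = {v for v in row if v is not None}
    return votes.pop() if len(votes) == 1 else None


unanimous_label(["positive", None, "positive"])  # -> "positive"
unanimous_label(["positive", "negative", None])  # -> None (conflict)
```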
Now, Weak Supervision works quite like that - but in a much more versatile way. Various synthesis algorithms exist that squeeze every bit of information out of the noisy label matrix, both locally for one record and globally for the whole dataset, in order to create large-scale programmatic labels that you can actually use for Supervised Learning. A different approach weights each noisy label by the information quality of the heuristic that produced it.
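In simplified form, such a quality-weighted synthesis can be sketched as a weighted vote. The per-heuristic weights below are made-up stand-ins for estimated heuristic quality, and `None` again denotes an abstention:

```python
def weighted_label(row, weights):
    """Pick the label with the highest summed heuristic weight."""
    scores = {}
    for vote, weight in zip(row, weights):
        if vote is not None:
            scores[vote] = scores.get(vote, 0.0) + weight
    return max(scores, key=scores.get) if scores else None


weights = [0.9, 0.6, 0.7]  # assumed per-heuristic quality weights
weighted_label(["positive", "negative", None], weights)  # -> "positive"
```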
Coverage, Conflict and Overlaps
A first source of information to combine noisy labels is to analyze the coverage, conflict and overlap ratio of the heuristics that created them. Let’s define each of these terms to better understand them:
- Coverage: Simply put, the ratio (or count) of how often a heuristic makes an actual statement for a given record.
- Conflict: How often does a heuristic make a statement while at least one other heuristic states something else for the same record? For instance, if a record carries two noisy labels, one "positive" and one "negative", there's a conflict.
- Overlap: The counterpart to conflict; how often does a heuristic make the same statement as another heuristic for a given record?
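These three ratios can be computed directly from a noisy label matrix (rows = records, columns = heuristics, `None` = abstain). The matrix and the exact per-record counting rules below are illustrative assumptions:

```python
def heuristic_stats(label_matrix, col):
    """Coverage, overlap and conflict ratios for the heuristic in column col."""
    n = len(label_matrix)
    covered = overlaps = conflicts = 0
    for row in label_matrix:
        vote = row[col]
        if vote is None:
            continue  # abstention: no statement made for this record
        covered += 1
        others = [v for i, v in enumerate(row) if i != col and v is not None]
        if any(v == vote for v in others):
            overlaps += 1
        if any(v != vote for v in others):
            conflicts += 1
    return {"coverage": covered / n,
            "overlap": overlaps / n,
            "conflict": conflicts / n}


matrix = [
    ["positive", "positive", None],
    ["positive", "negative", None],
    [None, None, "negative"],
    [None, None, None],
]
heuristic_stats(matrix, 0)
# -> {"coverage": 0.5, "overlap": 0.25, "conflict": 0.25}
```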
Based on this information alone, a synthesis model can be crafted that learns to differentiate between certain heuristic combinations. This is incredibly helpful, as such an analysis can be done in a completely unsupervised manner. But you can gather even further information about your heuristics, making them a superpower for your labeling!
True Positives and False Positives
When building a classification model, you most likely analyze its precision, recall, accuracy and F1-score. For this, you use a confusion matrix stating true/false positives and true/false negatives. You can do exactly the same for heuristics, given that you have manually labeled some records as reference labels.
With this information, your heuristics can be weighted by their expertise; i.e. a heuristic that is 90% precise gets a higher vote than a heuristic that is only 70% precise.
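Estimating such a precision weight is straightforward once a handful of reference labels exists. The predictions and ground-truth labels below are made up for illustration; only records where the heuristic actually made a statement count toward its precision:

```python
def heuristic_precision(predictions, reference_labels):
    """Fraction of correct statements among all non-abstaining predictions."""
    pairs = [(p, y) for p, y in zip(predictions, reference_labels) if p is not None]
    if not pairs:
        return 0.0  # the heuristic never fired on the reference set
    return sum(p == y for p, y in pairs) / len(pairs)


preds = ["positive", "positive", None, "negative"]
truth = ["positive", "negative", "negative", "negative"]
heuristic_precision(preds, truth)  # -> 2/3, usable as a vote weight
```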
Let them vote!
Isn’t this quite similar to something we already know from classification? Think of a random forest or AdaBoost. Those are ensemble classification models, i.e. they combine various weak classifiers into one strong classifier.
With Weak Supervision, it is ultimately almost the same. You create multiple heuristics, explicit as code or implicit as ML models, and let them vote. Unlike with pure ML ensembling algorithms, you can easily take control over what the models learn by specifying each classification model as a heuristic. Combined with extensive monitoring, you can understand how to adapt such heuristics to improve the label quality - and ultimately your model’s performance.
But why should I then train an ML algorithm at all?
This is one of the most common questions when it comes to Weak Supervision. Let’s put it like this: you could use Weak Supervision to make actual predictions. However, you typically gain generalization when using the programmatic labels as training data for an actual Supervised Learning model. Weak Supervision therefore acts as the bridge between heuristics and Supervised Learning, enabling you to train well-functioning AI models.
Weak Supervision is an emerging technique in Supervised Learning that combines multiple heuristics programmatically to create large-scale AI training data. As a result, you can go from unlabeled raw data to massive training data within hours, taking your AI to the next level.