Building heuristics

If you want to automate parts of your data labeling, heuristics such as labeling functions come in handy. To create one, head over to the heuristics page and select "Labeling function" from the "New heuristic" button.

Writing your labeling function

You'll land on a heuristic page with a code editor. Here you can write Python functions that take a dictionary as input (we loop over all records of your project, so think of it as one specific record - just as in the record IDE) and return a label name.
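
For instance, a minimal labeling function for a classification task could look like the following sketch - the attribute name "headline" and the label "urgent" are placeholders for your own schema:

def contains_urgent(record):
    # record is one record of your project; "headline" is a placeholder attribute name
    # str(...) also covers the case where the attribute is a tokenized spaCy doc
    if "urgent" in str(record["headline"]).lower():
        return "urgent"  # must match a label name of the selected labeling task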

We run this code as containerized functions, which is why we need to prepare your execution environment. You can find the installed libraries in the requirements.txt of our execution environment repository.

As with any other heuristic, your function will automatically and continuously be evaluated against the data you label manually.

Lookup lists for distant supervision

You'll quickly see that many of the functions you want to write boil down to checking terms against a list. But hey, you most certainly don't want to start maintaining a long list inside your heuristic, right? That's why we've integrated automated lookup lists into our application.

As you manually label spans for your extraction tasks, we collect and store these values in a lookup list for the given label.

You can access them from the heuristics overview page by clicking on "Lookup lists", which takes you to an overview of all your lookup lists.

If you click on "Details", you'll see the respective list and its terms. You can of course also create them fully manually, and add terms as you like. This is also helpful if you have a long list of regular expressions you want to check for your heuristics. You can also see the python variable name of the lookup list, as in this example countries.

In your labeling function, you can then import it from the module knowledge, where we store your lookup lists. In this example, it would look as follows:
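
A minimal sketch, assuming a text attribute called "text" and a classification label called "mentions_country" (both placeholders for your own schema):

from knowledge import countries

def mentions_country(record):
    # countries is the lookup list maintained in the application
    for country in countries:
        if country.lower() in str(record["text"]).lower():
            return "mentions_country"  # placeholder label name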

Heuristics for extraction tasks

You might already wonder what labeling functions look like for extraction tasks, as labels are on token-level. Essentially, they differ in two characteristics:

  • you use yield instead of return, as there can be multiple instances of a label in one text (e.g. multiple people)
  • you specify not only the label name but also the start index and end index of the span.

An example that incorporates an existing knowledge base to find further examples of this label type looks as follows:
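
Here is a hedged sketch - it assumes that "text" is a tokenized attribute (i.e. a spaCy doc), that the countries lookup list from above exists, and that spans are yielded as token indices with an exclusive end:

from knowledge import countries

def find_countries(record):
    # "text" is assumed to be a tokenized attribute, i.e. a spaCy doc
    for token in record["text"]:
        if token.text in countries:
            # yield the label name plus the start and end index of the span
            yield "country", token.i, token.i + 1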

This is also where the tokenization via spaCy comes in handy. You can access spaCy attributes such as noun_chunks on your tokenized attributes, which in many cases contain exactly the spans you want to label. Our template functions repository contains some great examples of how to use this.
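
As a sketch, a function that proposes every noun chunk as a candidate span could look like this (again, the attribute name "text" and the label "candidate" are placeholders):

def noun_chunk_candidates(record):
    # "text" is assumed to be a tokenized attribute, i.e. a spaCy doc
    for chunk in record["text"].noun_chunks:
        # chunk.start and chunk.end are spaCy token indices
        yield "candidate", chunk.start, chunk.end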

Template functions

We realize that labeling functions can be a bit difficult to write at first. Because of that, we have a super simple GitHub repository in which we show some example usages. You can copy and paste them, and even use them fully outside of our application.

If you have further ideas for template functions, please feel free to add them as issues.

Active learning for classification

Just as you can write labeling functions for your labeling automation, you can also easily integrate active learners. To do so, head to the heuristics overview page and select "Active learning" from the "New heuristic" button.

Similar to the labeling function editor, a coding interface will appear with pre-entered data. Once you've made sure that the right labeling task is selected, you can pick an embedding from the purple badges right above the editor. If you click on one of them, its configuration is copied to your clipboard, so that you can paste it as the embedding_name value of the @params_fit decorator.
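
Putting these pieces together, a minimal active learner could look roughly like the following sketch. The LearningClassifier base class and the decorators come from the pre-entered data; the method name fit and the embedding name used here are placeholders:

from sklearn.linear_model import LogisticRegression

class MyActiveLearner(LearningClassifier):

    def __init__(self):
        self.model = LogisticRegression()

    @params_fit(
        embedding_name="your-copied-embedding-name"  # placeholder: paste the configuration copied from the purple badge
    )
    def fit(self, embeddings, labels):
        self.model.fit(embeddings, labels)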

You can use Scikit-Learn inside the editor as you like, e.g. to extend your model with grid search. self.model can be any model that implements the Scikit-Learn estimator interface, i.e. you can also write code like this:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

class ActiveDecisionTree(LearningClassifier):

    def __init__(self):
        # grid search over a small hyperparameter space; GridSearchCV itself
        # follows the Scikit-Learn estimator interface, so it works as self.model
        params = {
            "criterion": ["gini", "entropy"],
            "max_depth": [5, 10, None]
        }
        self.model = GridSearchCV(DecisionTreeClassifier(), params, cv=3)

# ... the decorated fit and prediction methods from the pre-entered data follow here

As with any other heuristic, your function will automatically and continuously be evaluated against the data you label manually.

Minimum confidence for finetuning

One way to improve the precision of your heuristics is to label more data (also, there typically is a steep learning curve in the beginning, so make sure to label at least some records). Another way is to increase the min_confidence threshold of the @params_inference decorator. Generally, precision beats recall in active learners for weak supervision, so it is perfectly fine to choose higher values for the minimum confidence.
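
In the pre-entered data, this threshold is attached to the prediction method - the part hidden behind the # ... in the example above. A hedged sketch of what raising it could look like (the method name and exact decorator arguments are assumptions here):

    @params_inference(
        min_confidence=0.8  # predictions below this confidence are not counted as heuristic hits
    )
    def predict_proba(self, embeddings):
        return self.model.predict_proba(embeddings)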

Active learning for extraction

We're using our own library sequencelearn to enable a Scikit-Learn-like API for programming span predictors, which you can also use outside of our application.

Other than importing a different library, the logic works analogously to active learning classifiers.

Zero-shot

Zero-shot classifiers are amazing. Deriving predictions without labeling any data is great, but they are even better suited as heuristics:

  • Zero-shot (and few-shot) learning quickly hits a performance plateau, such that more labeled data doesn't add value.
  • They are highly reliant on the prompt they've been engineered with (for more details, take a look at our blog; we explain how zero-shot works there in greater detail).
  • They are rather computationally expensive, so they are often too slow for inference.

Again, they are amazing heuristics. So let's build a zero-shot classifier! To do so, we head over to the heuristics page and select "Zero-shot" from the "New heuristic" button.

We now have to pick a target task, attribute, and configuration handle. We pull the zero-shot classifiers directly from 🤗 Hugging Face. You can either search for classifiers or pick one from our recommendations.
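
If you're curious what such a model does under the hood, here is a standalone sketch using the 🤗 transformers library outside of our application (facebook/bart-large-mnli and the example labels are arbitrary choices):

from transformers import pipeline

# download a zero-shot model from the Hugging Face hub and classify one text
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The new phone has an amazing battery life.",
    candidate_labels=["positive", "negative", "neutral"],
)
print(result["labels"][0], result["scores"][0])  # best label and its confidence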

Once you've selected a zero-shot model, you enter the details page. Unlike labeling functions or active learners, there is no editor to program in. Instead, you only pick which labels should be predicted.

Also, as already mentioned, zero-shot classifiers are rather slow, so it makes perfect sense to first play a bit with sample records to estimate the performance. You can enter an arbitrary text into the playground, or compute the predictions for 10 randomly selected records from your data.

If you're happy with the model, you can click on the purple "Run" button, which will compute the results on all your records.

As with any other heuristic, your function will automatically and continuously be evaluated against the data you label manually.

🚧 Zero-shot extractors are in active development

Zero-shot classifiers have only recently been integrated into our application, but we're already working on extractors and extensive prompt engineering, so stay tuned!

Crowd labeling as a heuristic

When you have some annotation budget available, you can set up a crowd labeling heuristic. Imagine this to be a "heuristic executed by money" ;-)

To execute such a heuristic, you need a user with the annotator role set up, and you need to specify a static data slice in the databrowser.

Evaluating heuristics

We constantly analyze how well your heuristics are doing, no matter what type they are. Once you execute a heuristic - and there is some manually labeled data we can use for evaluation - you will find a statistic like this at the bottom of your heuristics page:

It shows you the relevant values per label. They have the following meaning (see the small example calculation after this list):

  • est. precision = true positives / (true positives + false positives) for the reference data you labeled.
  • est. recall (only for extraction tasks) = true positives / (true positives + false negatives) for the reference data you labeled.
  • coverage: how many records does this heuristic generally hit?
  • hits (only for extraction tasks): how many spans are hit by this heuristic?
  • conflicts: on how many records (or spans) does this heuristic conflict with other heuristics? (E.g. heuristic A says record 1 is "positive", while heuristic B says it is "negative".)
  • overlaps: on how many records (or spans) does this heuristic overlap with other heuristics? (E.g. heuristic A says record 1 is "positive", and so does heuristic B.)
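
To make the relation to simple counts concrete, here is an illustrative calculation with made-up numbers (note that in the application, coverage and hits may be displayed as absolute counts rather than ratios):

# made-up counts on the manually labeled reference data
true_positives, false_positives, false_negatives = 40, 10, 25
records_hit, total_records = 300, 1000

est_precision = true_positives / (true_positives + false_positives)  # 0.8
est_recall = true_positives / (true_positives + false_negatives)     # ~0.62 (extraction tasks only)
coverage = records_hit / total_records                               # 0.3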

You can also find the precision and coverage for each heuristic on the heuristics overview page.