Gazetteers / Lookups
If you are working on a text analytics use case (such as sentiment analysis or named entity recognition), chances are high that you’ll use some kind of gazetteer. A gazetteer looks for occurrences of given words or phrases within a document. For instance, if you want to build a classifier that detects support tickets related to a bug, you could collect a list of words such as [‘error’, ‘exception’, ‘bug’, ...] that indicate that a text relates to a bug. A Python gazetteer is then simply a function that checks whether any of these words occurs in the text.
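A minimal sketch of such a gazetteer is shown below. The function name, the "text" key, the label "bug", and the convention of returning `None` to abstain are assumptions made for illustration, not a fixed API.

```python
import re
from typing import Optional

# Illustrative keyword list taken from the example above; extend as needed.
BUG_KEYWORDS = {"error", "exception", "bug"}


def lf_bug_gazetteer(record: dict) -> Optional[str]:
    """Return "bug" if any keyword occurs as a whole word, else abstain."""
    # Tokenize into lowercase words so that "debugging" does not match "bug".
    tokens = set(re.findall(r"[a-z]+", record["text"].lower()))
    if tokens & BUG_KEYWORDS:
        return "bug"
    return None  # assumption: None means "no label assigned"
```

Matching on whole words rather than substrings is a design choice here; a plain substring check would also fire on words like "debugging".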
Of course, you could also store these values in a database and have your labeling function look them up there; the logic of the labeling function stays the same.
What’s great about gazetteers is that you can easily extend them to work even in cumbersome scenarios. If you’re labeling documents that were parsed by an OCR algorithm, you’ll most likely have character-level errors in your texts (for instance, "m" is sometimes read as "rn", or "h" as "b"). We offer augmentations for your gazetteers on demand, so that they become robust to spelling mistakes, OCR issues, and more.
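To illustrate the idea of OCR-tolerant lookups, here is a rough sketch that expands a keyword into the variants an OCR engine might produce. It is not the augmentation offered in kern; the confusion table and the function names are hypothetical and only cover the two confusions mentioned above.

```python
from itertools import product

# Hypothetical confusion table: each character maps to the strings
# an OCR engine might produce for it (including the correct one).
OCR_CONFUSIONS = {"m": ["m", "rn"], "h": ["h", "b"]}


def ocr_variants(word: str) -> set:
    """Return all spellings of `word` under the confusion table."""
    options = [OCR_CONFUSIONS.get(ch, [ch]) for ch in word]
    return {"".join(combo) for combo in product(*options)}


def contains_variant(text: str, word: str) -> bool:
    """Check whether any OCR variant of `word` occurs in `text`."""
    lowered = text.lower()
    return any(variant in lowered for variant in ocr_variants(word.lower()))
```

Note that the number of variants grows exponentially with the number of confusable characters, so this brute-force expansion is only reasonable for short keywords.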
Imagine you’re building a text classifier that needs to differentiate between Clickbait headlines and regular headlines. You will most likely find a lot of gazetteers to help you label. Another tool that will make your life much easier is regular expressions, which find patterns within your data. For instance, suppose you believe that headlines starting with a two-digit number are much more likely to be Clickbait (“15 reasons why …”, “29 ways to ...”). The respective labeling function would search for headlines matched by "^[1-9][0-9] ", a simple regular expression that matches a number from 10 to 99 followed by a space at the start of the text.
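The regular expression above could be wrapped in a labeling function along the following lines. The "headline" key, the label "clickbait", and returning `None` to abstain are assumptions for this sketch.

```python
import re
from typing import Optional

# Two digits (10 to 99) followed by a space at the start of the headline.
TWO_DIGIT_START = re.compile(r"^[1-9][0-9] ")


def lf_starts_with_number(record: dict) -> Optional[str]:
    """Label headlines that start with a two-digit number as clickbait."""
    if TWO_DIGIT_START.search(record["headline"]):
        return "clickbait"
    return None  # abstain: the pattern says nothing about other headlines
```

Headlines starting with a single digit or a three-digit number are not matched by this pattern, so the function abstains on them.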
3rd party applications and Legacy Code
Another way to build a great labeling function is to use existing applications that do some logic for you. For instance, if you’re writing an urgency detector, you could use a 3rd party application to detect the sentiment of a text message. If the tone of a message is harsh, chances are higher that this message has some urgency.
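As a sketch of this pattern, the labeling function below delegates to a sentiment scorer that is passed in as a callable. The scorer shown here is a trivial stand-in, not a real 3rd party API; the score range, the threshold, the "text" key, and the label "urgent" are all assumptions.

```python
from typing import Callable, Optional


def make_urgency_lf(sentiment_score: Callable[[str], float],
                    threshold: float = -0.5):
    """Build a labeling function around any scorer returning values
    where lower means harsher (assumed range: -1.0 to 1.0)."""
    def lf_urgent_by_tone(record: dict) -> Optional[str]:
        score = sentiment_score(record["text"])
        return "urgent" if score <= threshold else None
    return lf_urgent_by_tone


def fake_sentiment_score(text: str) -> float:
    """Stand-in for a 3rd party sentiment service (hypothetical)."""
    harsh_words = {"unacceptable", "immediately", "furious"}
    words = set(text.lower().replace("!", " ").replace(".", " ").split())
    return -1.0 if words & harsh_words else 0.0


lf_urgent = make_urgency_lf(fake_sentiment_score)
```

Passing the scorer in as an argument keeps the labeling function independent of any particular vendor, so the stand-in can be swapped for a real client later.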
Alternatively, you can also always use legacy code of existing applications you’ve built. As long as you can transform the legacy code into labeling functions, you’re good to go.
Machine Learning models
In kern, we make great use of Active Learning-based implicit heuristics. These are fairly simple machine learning models that learn from the data you label by hand in order to classify new, as yet unseen data points for you. They are especially useful whenever you don’t know how best to define a labeling function for a certain class. Getting back to our example of Clickbait headlines, it might be easy to define labeling functions for Clickbait, but it is much harder to do so for regular headlines. For such cases, you can integrate Active Learning in kern with the click of a button.
For Named Entity Recognition, document-level functions are the hidden champions among labeling functions. They build on the idea of label consistency, i.e. that within a given document the label of an entity does not change, even if the spelling of the entity changes slightly. Imagine you want to build an NER model to detect politicians. If you write a gazetteer that detects certain names (e.g. `Angela Merkel`), a document-level function will make use of this gazetteer and detect additional occurrences of the entities (such as `Merkel`). This is extremely helpful, as in many scenarios entities are written out in full only once (e.g. last name and first name, company name and legal unit, …) and are afterwards only referred to by their shortened form.
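A simplified sketch of a document-level function follows. It assumes that the shortened form is the last word of the full name and that spans are reported as (start, end, label) tuples; both are illustrative choices, as are the function name and the label "politician".

```python
import re


def document_level_spans(text: str, full_names: list) -> list:
    """Return (start, end, label) spans for each full name found in the
    document and for later mentions of its last name alone."""
    spans = []
    for name in full_names:
        if name not in text:
            continue  # only propagate names the gazetteer found here
        last_name = name.split()[-1]
        # Try the full name first so it is not split into a partial match.
        pattern = re.compile(
            r"\b(?:" + re.escape(name) + "|" + re.escape(last_name) + r")\b"
        )
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), "politician"))
    return sorted(spans)
```

For example, in a text containing "Angela Merkel" once and "Merkel" later on, both mentions are returned, while a document that mentions only "Merkel" yields no spans because the gazetteer never matched the full name there.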
Last but not least, any function that takes a dictionary as input and returns a string can serve as a labeling function. For instance, you could define a Python function that processes the items "text" and "source" combined in a dictionary. This way, there are no limits to the indicators you can come up with for your labels.
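As an illustration, a function combining "text" and "source" might look like the sketch below. The source names, the punctuation rule, the label "regular", and returning `None` to abstain are assumptions invented for this example.

```python
from typing import Optional

# Hypothetical set of sources assumed to publish sober headlines.
PRESS_SOURCES = {"newswire", "press-agency"}


def lf_regular_by_source(record: dict) -> Optional[str]:
    """Label a headline as regular if it comes from a press source
    and does not end with an exclamation or question mark."""
    text = record["text"].rstrip()
    source = record["source"].lower()
    if source in PRESS_SOURCES and not text.endswith(("!", "?")):
        return "regular"
    return None  # abstain when the heuristic gives no indication
```

Since the function receives the whole record, any combination of fields can feed into the heuristic.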
As you can see, defining labeling functions is not hard at all. And you don't even have to worry whether they will be correct in each case - they are just meant to be heuristics, giving at least some indication towards a label. If you want to take a closer look at how this works in practice, feel free to reach out to us for a demo. We are more than happy to show you what we have built!