Data Processing in the ETL Pipeline
Cognition provides an ETL (Extract, Transform, Load) data processing pipeline for working with PDF files. It converts the content into Markdown format and chunks it into smaller parts so that a large language model (LLM) can analyze and use the information more easily. Chunking is an integral part of any Retrieval-Augmented Generation (RAG) pipeline, as it helps the system retrieve and generate relevant information effectively.
Extraction is the first step in the ETL pipeline and refers to extracting the textual content from the PDF. It can be performed by a vision-based large language model (LLM) or through selectable tools such as pdf2markdown and Azure Document Intelligence (Azure DI). For shorter PDFs without complex layouts such as tables or multi-column formats, pdf2markdown is a suitable option. Note that it only works with PDFs that contain selectable text and cannot process protected files. For more complex documents or those without selectable text, Azure DI and the vision LLMs are recommended.
The transformation step of the ETL pipeline is handled by an LLM, which is responsible for cleaning, structuring, and refining the extracted data. This LLM can be a model hosted on platforms like Azure, Azure Foundry, or OpenAI. During the transformation process, the LLM performs tasks such as removing unnecessary formatting, standardizing the structure, and converting complex elements like tables into a more usable format.
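Cognition performs this step internally; purely as an illustration, a transformation call to an LLM provider might look like the sketch below, where the client usage, model name, and prompt are assumptions rather than Cognition's actual implementation.

```python
# Minimal, illustrative sketch of an LLM-based transformation step.
# Assumes the `openai` Python package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

raw_extraction = "PageHeader   1.Intro duction ..."  # text as it came out of extraction

# Ask the model to clean and restructure the extracted text into Markdown.
response = client.chat.completions.create(
    model="gpt-4o",  # hypothetical model choice
    messages=[
        {
            "role": "system",
            "content": (
                "Clean the extracted PDF text: remove artifacts, standardize the "
                "structure as Markdown, and convert tables into Markdown tables."
            ),
        },
        {"role": "user", "content": raw_extraction},
    ],
)

print(response.choices[0].message.content)
```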
Creating a Dataset and Uploading Files
To begin processing, first create a dataset where you can upload your files. On the left-hand side of the Cognition home page, there is a sidebar. Click “ETL” and then “Create new dataset”. To find an existing dataset, click “See datasets”; the resulting table also shows each dataset's configuration.
You can name and describe the dataset however you like. Then select the language: currently, the available languages are German and English, and the English configuration can handle other languages as well.
Choose Azure (Azure Foundry is also available, but only for transformation) or OpenAI as the LLM provider and fill in the fields (API key, engine, etc.) with your credentials. You can use the same credentials for both the extraction and the transformation configuration. Depending on the LLM provider, different fields are required for extraction or transformation (see the sketch after this list):
- for OpenAI: API key and model,
- for Azure: API key, engine, Azure URL, and API version, and
- for Azure Foundry (transformation only): API key and Foundry URL.
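As an orientation, the provider-specific fields could be collected in a configuration like the sketch below; the key names simply mirror the list above and the values are placeholders, not Cognition's internal schema.

```python
# Illustrative provider configurations mirroring the fields listed above.
# Keys and values are placeholders, not Cognition's actual schema.
provider_configs = {
    "openai": {
        "api_key": "<OPENAI_API_KEY>",
        "model": "gpt-4o",
    },
    "azure": {
        "api_key": "<AZURE_API_KEY>",
        "engine": "<DEPLOYMENT_NAME>",
        "azure_url": "https://<resource>.openai.azure.com",
        "api_version": "<API_VERSION>",
    },
    "azure_foundry": {  # transformation only
        "api_key": "<FOUNDRY_API_KEY>",
        "foundry_url": "https://<resource>.services.ai.azure.com",
    },
}
```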
If you want to use Azure DI as an extraction method, activate it with the corresponding credentials. Azure DI requires an API key and the Azure DI URL.
When creating a new dataset, the prompt can either be customized or left at the default. Clicking the info-circle button opens a modal for customizing the vision prompt, which applies only to vision-capable models. Note: the button is disabled until a language and an extractor have been selected.
Example for English language:
Another option when creating a dataset is to activate the o-model series (e.g. o1 and o3).
When you have configured everything, click “Create”. You can now upload files from your device.
When uploading documents, choose the extraction method. The available methods are pdf2markdown, Azure DI, and GPT-4 Vision (Azure DI only appears if it has been configured for the dataset). You select one method per upload, and each upload can contain one or multiple files. You can also use different methods within the same dataset.
If a file fails in the pipeline, try deleting it and uploading it on its own. The problem might be that the file is too long; if it is longer than 100 pages, it is advisable to split it into shorter files.
Editing
After a file is uploaded, it goes through multiple processing stages; when the data extraction is complete, the computation state displays “finished”. More information about uploading data can be found in the Uploading data section.
You can click on “Show” to open the processed file. The content is now broken down into chunks. On the left-hand side, you can edit the document; the right-hand side shows a preview of your edits. Editing is done in Markdown.
Markdown is a way to write text that allows you to add simple formatting like bold, italics, lists, and headers. It uses easily accessible punctuation marks from the keyboard, so you can create formatted documents without needing complicated software.
Markdown is simple and clear, which makes it easier for LLMs to read and understand the text. Since it’s straightforward, LLMs can focus on the content and meaning of the text without getting confused by complicated formatting.
The chunks are separated by three dashes (---). While other Markdown symbols are optional, the dashes are mandatory as chunk dividers.
To turn a line into a heading, write # followed by a space. There are six sizes of headings in Markdown, and the heading size decreases with each additional # you add, meaning the smallest heading is written with ######.
To create a list, write - followed by a space. To add subitems, use double dashes with a space in between, like this: - -.
To make a word bold, write it between two **. To italicize a word, use only one *. To create an empty line, press Enter twice.
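To see how these pieces fit together, here is a small, made-up example of a chunked Markdown document and how a downstream step could split it at the three-dash dividers:

```python
# A made-up, minimal example of a chunked Markdown document.
document = """# User Guide

## Installation

To install the tool, run the installer and follow the prompts.

---

## Configuration

- Set the **language**
- Set the *output folder*

---

## Usage

Open the application and select a file to process.
"""

# Downstream steps can split the document into chunks at the "---" dividers.
chunks = [chunk.strip() for chunk in document.split("\n---\n") if chunk.strip()]
for i, chunk in enumerate(chunks, start=1):
    print(f"Fact #{i}:\n{chunk}\n")
```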
Before getting into details of editing, here is a quick checklist:
- Unwanted special characters
- Lost information
- Tables, columns, or other complex structures
- Chunk length
- No empty chunks or chunks without letters
Unwanted Special Characters
The first thing to check in a processed document is whether there are any unwanted special characters. Some images or symbols may be incorrectly processed and produce unwanted characters, such as “⍰”. These should be deleted, because they make it difficult for an LLM to interpret the text and also consume additional tokens, increasing costs. You can use the “Cleanse Text” option at the top left of the page; to use this function, you need to select the target area first.
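What “Cleanse Text” does exactly is handled by the tool; conceptually, the cleanup resembles the sketch below, where the set of characters to strip is an assumption for illustration.

```python
import re

# Characters that commonly appear when images or symbols are mis-processed.
# This set is illustrative; extend it with whatever artifacts you encounter.
UNWANTED = "\u2370\ufffd"  # ⍰ and the Unicode replacement character

def cleanse(text: str) -> str:
    # Drop the unwanted characters and collapse the whitespace they leave behind.
    cleaned = re.sub(f"[{re.escape(UNWANTED)}]", "", text)
    return re.sub(r"[ \t]{2,}", " ", cleaned)

print(cleanse("Revenue grew by ⍰ 12% in Q3 \ufffd compared to Q2."))
# -> "Revenue grew by 12% in Q3 compared to Q2."
```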
Lost Information
Another important check is to ensure that no information is lost. To do this quickly, review the start and end of each page. If those sections look correct, the rest of the page is usually fine. Additionally, check any text boxes written in a small font along the sides of the page.
If you suspect that several pieces of text were lost in the process, reconsider your extraction method and switch to one that works better for the document; switching to Azure DI usually gives better results.
Tables or Other Complex Structures
Tables should also be reviewed. While it is not essential for tables to look exactly like the original, the content should be clear, and rows should be processed correctly. The information will still be useful as plain lines. If you prefer tables, you can convert plain lines into Markdown tables. Select the lines you want to convert and click “Into Markdown-Table” at the top left of the page.
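For reference, a Markdown table is just pipe-separated rows plus a divider row under the header; the sketch below builds one from plain lines, using made-up sample data.

```python
# Build a Markdown table from plain rows (made-up sample data).
rows = [
    ["Year", "Revenue", "Employees"],
    ["2022", "1.2M", "14"],
    ["2023", "1.8M", "21"],
]

header, *body = rows
lines = [
    "| " + " | ".join(header) + " |",
    "| " + " | ".join("---" for _ in header) + " |",  # divider row required by Markdown
]
lines += ["| " + " | ".join(row) + " |" for row in body]

print("\n".join(lines))
# | Year | Revenue | Employees |
# | --- | --- | --- |
# | 2022 | 1.2M | 14 |
# | 2023 | 1.8M | 21 |
```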
Chunking
Finally, review the chunking. Ideally, chunks should correspond to individual sections of the document (i.e., text under headings or subheadings). Depending on section size, one section might be represented by several chunks. For simplicity, each paragraph can form a single chunk. The first chunk of each section should naturally include the heading and/or subheading. Headings or subheadings should also be included in other chunks when possible.
Additionally, tables, bullet points, and lists should each have their own chunk. If a paragraph, table, or list is too long, it can be divided into multiple chunks, each again including the heading or subheading.
How do we decide if a chunk is too long? On the right-hand side of the page, when fully scrolled up, you will see a bar graph comparing “facts”. Facts are numbered chunks: the first chunk is fact #1, and so on. If a fact is below the yellow line, its length is optimal. If it is below the red line, the length is still acceptable but should be shortened if possible. Any chunk above the red line must be shortened to fall at least below the red line, or ideally below the yellow line.
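The yellow and red thresholds are defined by the tool itself; purely as an illustration, a length check over chunks might look like the sketch below, where the character limits are invented placeholders, not the product's actual values.

```python
# Illustrative chunk-length check; the thresholds are invented placeholders,
# not the actual values behind the yellow and red lines in the UI.
YELLOW_LIMIT = 1000   # hypothetical "optimal" limit in characters
RED_LIMIT = 1500      # hypothetical "still acceptable" limit in characters

def classify_chunk(chunk: str) -> str:
    if len(chunk) <= YELLOW_LIMIT:
        return "optimal"
    if len(chunk) <= RED_LIMIT:
        return "acceptable, shorten if possible"
    return "too long, must be shortened"

chunks = ["A short chunk.", "A long chunk. " * 120]  # made-up examples
for i, chunk in enumerate(chunks, start=1):
    print(f"Fact #{i}: {classify_chunk(chunk)}")
```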
A chunk cannot be empty. A very short chunk containing, for example, only a sentence is acceptable. Chunks consisting solely of punctuation or special characters are invalid. They have to at least contain a letter.
If the document is long and you don’t have time for manual editing, prioritize checking chunking rules and removing unwanted special characters.
Interface Navigation
To jump to a chunk directly, click its corresponding fact: clicking “jump to fact” on the right-hand side scrolls the left-hand side to the respective chunk. To search for a section or word, you can use the usual Ctrl+F search function.
To save your changes, click “Save” at the top right of the page. When you have finished editing, click “Finish Review” next to the save button. You can still make changes after finishing the review, but if you continue working on a reviewed document, it is best to remove the review status so you can keep track of what is complete. The reviewed status is removed with the same button as “Finish Review”, located in the upper right corner.
To delete one or more files in a dataset, select the files and click “Delete all selected”.
You can filter the files you want to download: only finished files (those that did not fail in the pipeline), only reviewed files, or both. With these options, the files are downloaded as .csv files; if you use the “Download excel” option instead, they are downloaded as Excel files.
The Excel file has two columns: name and content. Each chunk is represented as a row under the content column, with the corresponding file name listed in the name column.
You can also remove duplicates while downloading.
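If you want to process the download programmatically, the Excel export described above can be read with pandas, for example; the sketch below assumes pandas and openpyxl are installed and uses an illustrative file name.

```python
# Read a downloaded Excel export; assumes pandas and openpyxl are installed.
import pandas as pd

df = pd.read_excel("dataset_export.xlsx")  # illustrative file name

# One row per chunk: "name" holds the source file, "content" holds the chunk text.
for file_name, group in df.groupby("name"):
    print(f"{file_name}: {len(group)} chunks")

# Optionally drop duplicate chunks, mirroring the "remove duplicates" download option.
df = df.drop_duplicates(subset=["name", "content"])
```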
Processing Data Locally
Local OCR models are generally not as effective and precise as advanced cloud-based OCR solutions, particularly in handling complex data extraction tasks. Therefore, it is strongly recommended to use the ETL pipeline in most cases. Local OCR models should only be used for highly sensitive data when no other option is suitable.
We recommend starting with Tesseract as your primary local OCR model, and then trying EasyOCR if you want to explore another option.
Tesseract
Tesseract supports multiple languages and is time-efficient. However, as mentioned, local OCR models are not as precise as cloud models like Azure DI. The checklist provided still applies with the exception of chunk length. This will be explained below.
To run Tesseract, use the following code:
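The sketch below assumes the pdf2image and pytesseract Python packages are installed, together with the Tesseract and Poppler system binaries; the file path and settings are illustrative.

```python
# Convert a PDF to page images and run Tesseract OCR on each page.
# Assumes pdf2image + pytesseract are installed, plus Tesseract and Poppler.
from pdf2image import convert_from_path
import pytesseract

PDF_PATH = "document.pdf"  # illustrative input file

# Tesseract works on images, so convert every PDF page to an image first.
pages = convert_from_path(PDF_PATH, dpi=300)

for page in pages:
    # "eng" is the Tesseract language code; change it for non-English documents.
    text = pytesseract.image_to_string(page, lang="eng")
    print(text)
    # Three dashes after every page, so each page becomes one chunk.
    print("---")
```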
Tesseract and other local OCR models generally work with image files (such as JPGs) rather than PDFs, so the code converts each PDF page to an image first.
For simplicity, chunking is less precise than in the ETL pipeline: three dashes are printed after every page, so each page is treated as one chunk.
If your document is not in English, change the Tesseract language code (the lang argument passed to pytesseract in the sketch above). You can find the language codes here: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html.
EasyOCR
If you want to try another option for your document, you can use EasyOCR. To run it, use the following code:
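The sketch below makes the same assumptions as before (pdf2image plus Poppler for the PDF-to-image conversion) and additionally requires the easyocr and numpy packages; the file path is illustrative.

```python
# Convert a PDF to page images and run EasyOCR on each page.
# Assumes pdf2image, easyocr, and numpy are installed, plus Poppler.
from pdf2image import convert_from_path
import easyocr
import numpy as np

PDF_PATH = "document.pdf"  # illustrative input file

# ["en"] is the language code list; change it for non-English documents.
reader = easyocr.Reader(["en"])

for page in convert_from_path(PDF_PATH, dpi=300):
    # EasyOCR expects a file path or a numpy array, so convert the PIL image.
    lines = reader.readtext(np.array(page), detail=0, paragraph=True)
    print("\n".join(lines))
    # Three dashes after every page, so each page becomes one chunk.
    print("---")
```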
Again, if your document is in another language, change the language code passed to easyocr.Reader in the sketch above. You can find the language codes here: https://jaided.ai/easyocr/ (scroll down to the “Overview” section).
To improve extraction further, we keep testing different methods. Most recently we tried GraphRAG, an approach that converts RAG documents into knowledge graphs. This approach is still in beta and under evaluation. As soon as we adopt another approach, a new part of the documentation will be published.