Uploading data
Many GenAI agents use Retrieval Augmented Generation (RAG) to obtain knowledge from documents and unstructured data. Since knowledge is often stored in PDFs, you can use our ETL pipeline to convert a PDF file into a more LLM-accessible Markdown format. The content is then processed into smaller, RAG-ready chunks and can be stored in Refinery, part of the Kern AI platform, through an optional export. This section is intended for the upload of data; more information about the underlying processes is described in the Data Processing Documentation Section.
When you upload your data, it runs through multiple stages:
- Uploading: the PDF file is uploaded to the application and stored
- Queue: the document is waiting to be processed because other documents are currently in the pipeline
- Extracting: text is pulled from the PDF, either via pdf2markdown, an LLM, or a service like Azure Document Intelligence
- Tokenizing: the document is tokenized by spaCy in preparation for semantic splitting
- Splitting: the raw Markdown text is cut into chunks that best represent logical blocks at a RAG-ready length (a sketch of this kind of splitting follows the list)
- Transforming: the chunks are cleaned and optimized using a language model, e.g. converting text into tables
- Finished: the process is complete; the data can be further refined manually and exported
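To make the tokenizing and splitting stages more tangible, here is a minimal sketch of sentence-aware chunking with spaCy. The blank pipeline, the character budget, and the split_into_chunks helper are illustrative assumptions, not the platform's actual splitting implementation.

```python
# Minimal sketch of sentence-aware chunking with spaCy (illustrative only).
# The blank pipeline, sentencizer, and character budget are assumptions,
# not the platform's actual splitting implementation.
import spacy

def split_into_chunks(markdown_text: str, max_chars: int = 1000) -> list[str]:
    nlp = spacy.blank("en")          # lightweight pipeline, no model download needed
    nlp.add_pipe("sentencizer")      # rule-based sentence boundary detection
    doc = nlp(markdown_text)

    chunks, current = [], ""
    for sent in doc.sents:
        # Start a new chunk once adding the sentence would exceed the budget.
        if current and len(current) + len(sent.text) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sent.text + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

sample = "## Warranty\nThe warranty covers two years. Repairs are free of charge. " * 30
for i, chunk in enumerate(split_into_chunks(sample, max_chars=400), start=1):
    print(i, len(chunk))
```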
You can inspect what the Markdown chunks look like and how they are currently stored. On the right-hand side, you can also see how long each chunked fact is, as well as a best-practice cutoff length.
If some text needs additional changes, the improvements can be made directly in the text editor or with one of the options ‘Into Markdown-Table’ (to convert the text into a Markdown table) or ‘Cleanse Text’ (to strip special characters or text that does not correspond to the content). Once the review is done, it can be marked as such with the ‘Finish review’ button. All files, only reviewed files, or each file separately can be downloaded as an Excel file. Additionally, the text can be split into further chunks by adding three dashes (- - -) to the text, as sketched below. More about lost information and cleaning data can be found in the Data Processing Documentation Section.
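The three-dash separator can be thought of as a simple chunk delimiter. Below is a hedged sketch of how exported text could be split on such a separator; the split_on_dashes helper and the exact separator handling are assumptions, not the application's internal logic.

```python
# Illustrative only: treat a line of three dashes as a manual chunk separator.
# The separator form and helper name are assumptions, not the app's internal logic.
def split_on_dashes(text: str) -> list[str]:
    parts = [part.strip() for part in text.split("\n---\n")]
    return [part for part in parts if part]

reviewed = "First fact about the product.\n---\nSecond fact, now its own chunk."
for i, chunk in enumerate(split_on_dashes(reviewed), start=1):
    print(i, chunk)
```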
ETL API Endpoint
Another option for running the ETL pipeline is through the API, which allows you to execute the pipeline and collect results without using the application's UI. The provided code snippet requires custom configurations, including the API key/token, file path, and extraction method. Tokens can be generated within the application, with customizable expiration options (1 month, 3 months, or never). It is important to securely store the token value, as it will not be visible in the table once created.
Tokens are assigned to either the ETL level (subject MARKDOWN_DATASET) or the project level (subject PROJECT), with each having a defined scope. ETL tokens are exclusively used within the ETL process, while project tokens are restricted to the project level.
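As a rough orientation, an API call to the ETL pipeline could look like the sketch below. The base URL, endpoint path, form field names, and extraction method value are placeholders; the code snippet generated within the application contains the exact parameters for your instance.

```python
# Hedged sketch of triggering the ETL pipeline via the API.
# The base URL, endpoint path, and form field names are assumptions;
# use the code snippet generated in the application for the exact values.
import requests

API_TOKEN = "YOUR_ETL_TOKEN"                   # generated in the application
BASE_URL = "https://your-instance.example.com" # placeholder instance URL
FILE_PATH = "contract.pdf"

with open(FILE_PATH, "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/etl/upload",          # hypothetical endpoint path
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": f},
        data={"extraction_method": "pdf2markdown"},
        timeout=60,
    )

response.raise_for_status()
print(response.json())                         # e.g. an ID to poll for the results
```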
File Caching
To enhance the performance and efficiency of file processing, the application introduces a caching mechanism based on SHA-256 hashing. By caching at every critical step, from the initial file upload through extraction to the final transformation, the system ensures that only new and unique processing operations are performed.
- File Upload and Hashing - When a file is uploaded, the application generates a unique identifier for its content by calculating a SHA-256 hash along with the file size. This combination serves as a digital fingerprint, uniquely representing the file's content (see the sketch after this list). The system then checks whether a file with the same identifier already exists within the organization’s data processing scope. If an identical file has been uploaded before, the system skips the upload process and simply reuses the existing file, avoiding redundant storage and reducing processing time.
- File Extraction - The next layer of caching involves file extractions. A file extraction refers to the specific combination of the uploaded file and the extractor used to process it (e.g., pdf2markdown). The system caches the result of each extraction, meaning that if the same file is uploaded again and processed with the same extractor, the system reuses the previously extracted content. This eliminates the need to reprocess the file, saving both time and computing resources.
- Transformation - Similar to extractions, transformations are also treated as a cacheable process. A transformation is defined as the combination of the extraction and any specific configuration applied for the transformation step. Once a transformation is completed for a particular extraction, the resulting transformed content is cached. This means that if you apply the same transformation to the same extracted content in the future, the system will reuse the previously transformed results. It bypasses the need to perform the transformation again, thus saving resources and time.
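The three cache layers can be summed up in a short sketch: the SHA-256 hash combined with the file size forms the upload fingerprint, and appending the extractor or the transformation configuration yields the keys for the later layers. The key format and helper names are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch of the three cache layers described above.
# Key formats and helper names are assumptions, not the actual implementation.
import hashlib

def file_fingerprint(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    return f"{digest}:{len(data)}"                 # hash + size = upload cache key

def extraction_key(fingerprint: str, extractor: str) -> str:
    return f"{fingerprint}:{extractor}"            # same file + same extractor => cache hit

def transformation_key(extraction: str, config: str) -> str:
    return f"{extraction}:{config}"                # same extraction + same config => cache hit

cache: dict[str, str] = {}
pdf_bytes = b"%PDF-1.7 ... example file content ..."   # stand-in for an uploaded PDF
fp = file_fingerprint(pdf_bytes)

key = extraction_key(fp, "pdf2markdown")
if key in cache:
    markdown = cache[key]                          # reuse the previous extraction
else:
    markdown = "# Extracted content ..."           # placeholder for the real extractor
    cache[key] = markdown

print(transformation_key(key, "into-markdown-table"))
```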