
RAG-based data extraction: Mining information from second-life battery documents

As Large Language Models (LLMs) continue to evolve, methods for minimizing hallucinations are being developed so that the models give more truthful answers. Retrieval-Augmented Generation (RAG) supplies the model with external data on which to base its answers. This project uses RAG in a data extraction pipeline tailored to second-life batteries. The prompts are pre-defined, so the user only provides the documents to be analyzed; this ensures that the answers are in the correct format for further data processing. To handle different document types, each document is first labeled, after which an extraction suited to that document type is applied. The best performance is achieved by grouping questions in a way that lets the model reason about which questions are relevant, so that no hallucinations occur. The model performs equally well whether there are two or three document types, and it is clear that a pipeline of this type is well suited to today's models. Further improvements can be achieved by using models with a larger context window and by first applying Optical Character Recognition (OCR) to read text from the documents.

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-533357
Date: January 2024
Creators: Edström, Jesper
Publisher: Uppsala universitet, Avdelningen för systemteknik
Source Sets: DiVA Archive at Uppsala University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
Relation: UPTEC F, 1401-5757 ; 24025
