1 |
Automating the extraction of Financial data
Rollino, Nicolas; Ali, Rakin. January 2022
It is hard for retail investors and data-providing companies to obtain financial data on European companies. Extracting this data is most likely done manually, which is time-consuming and would explain why European companies' data is supplied more slowly than that of American companies. This thesis investigates whether the process of extracting financial data on European companies can be automated by building two proof-of-concept systems. The first collects financial reports of European companies by using a web scraper to scrape the reports directly from the source. The second extracts financial data from the reports using Amazon Web Services (AWS), specifically the text extraction tool Textract. The report-collection system could not be automated and did not meet the expectations set by the company that commissioned the thesis. The data-extraction system was deemed promising, as all data points of interest could be extracted; however, because it relies on a system that supplies it with reports, it cannot be implemented. The work conducted shows that automating the process of extracting financial data from European companies is not (yet) possible: extracting the data from reports is feasible, but collecting the reports is a bottleneck that could not be automated. It would have been better to collect the financial reports manually in this thesis instead of using a web scraper. This bottleneck could be solved in future projects.
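As a rough illustration of the extraction step described in this abstract, the sketch below pulls text lines out of a Textract-style response and looks up lines that mention data points of interest. It is not the thesis's implementation. The `Blocks`, `BlockType`, `Text` and `Confidence` fields follow the shape of a Textract text-detection response, while the helper names, the keyword lookup and the sample values are assumptions made up for this example.

```python
def text_lines(response, min_confidence=0.0):
    """Return the text of every LINE block at or above min_confidence."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        lines.append(block.get("Text", ""))
    return lines


def find_data_points(lines, labels):
    """Map each label to the first line containing it (case-insensitive).

    Hypothetical helper: a plain keyword lookup, assumed for illustration.
    """
    found = {}
    for label in labels:
        for line in lines:
            if label.lower() in line.lower():
                found[label] = line
                break
    return found


# Hand-written sample in the shape of a Textract response (values invented).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Annual Report 2021", "Confidence": 99.5},
        {"BlockType": "LINE", "Text": "Revenue 1 250", "Confidence": 98.2},
        {"BlockType": "LINE", "Text": "Net income 310", "Confidence": 45.0},
        {"BlockType": "WORD", "Text": "Revenue", "Confidence": 98.9},
    ]
}

lines = text_lines(sample_response, min_confidence=90.0)
points = find_data_points(lines, ["Revenue", "Net income"])
```

In a real run the response would come from the Textract service rather than a hand-written dictionary, and a confidence threshold like the one above decides which lines still need manual review.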
|
2 |
Invoice Line Item Extraction using Machine Learning SaaS Models
Kadir, Avin. January 2022
Manual invoice processing is a time-consuming and error-prone task that has proven to be done more efficiently with automation software that minimizes the need for human input. Amazon Textract is a software-as-a-service offering from Amazon Web Services built for that purpose. It has been developed to extract document data from both general and financial documents, such as receipts and invoices, by using machine learning models. The service is available in multiple widely spoken languages, but not in Swedish at the time of writing this thesis. This thesis explores the potential and accuracy of Amazon Textract in extracting data from Swedish invoices by using the English setting. Specifically, it examines the accuracy of extracting line items as well as Swedish letters. In addition, the potential for correcting incorrectly extracted data is explored. This is done by testing defined categories on each invoice, comparing the Amazon Textract extractions with the correctly labeled data. These categories are emptiness (no data was extracted), equality, missing and added line items, and characters that are missing from or added to otherwise correct line item strings. The invoices themselves are divided into two categories, namely structured and semi-structured invoices. The tests are mainly conducted on the service's dedicated API method for data extraction from financial documents, but a comparison with the table extraction API method is also made to gain more insight into Amazon Textract's capability. The results suggest that Amazon Textract is quite inaccurate when extracting line item data from Swedish invoices. Therefore, manual post-processing of the data is generally needed to ensure its correctness. However, it showed better results in extracting data from structured invoices, where it scored 70% in equality and 100% in 2 out of 6 invoice layouts. The Swedish character accuracy was 66%.
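The comparison categories named in this abstract (emptiness, equality, missing and added line items, missing and added characters) could be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's evaluation code: the function names, the use of `difflib` for the character-level comparison, and the sample invoice lines are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def compare_line_items(extracted, labeled):
    """Classify one invoice's extracted line items against the labeled ones."""
    extracted_counts = Counter(extracted)
    labeled_counts = Counter(labeled)
    return {
        "empty": len(extracted) == 0,        # nothing was extracted
        "equal": extracted == labeled,       # same items in the same order
        "missing": sorted((labeled_counts - extracted_counts).elements()),
        "added": sorted((extracted_counts - labeled_counts).elements()),
    }


def char_diff(extracted, labeled):
    """Return (missing, added) characters of an extracted string.

    Characters present only in the labeled string count as missing;
    characters present only in the extracted string count as added.
    """
    missing, added = [], []
    matcher = SequenceMatcher(None, labeled, extracted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(labeled[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(extracted[j1:j2])
    return missing, added


# Invented sample data: the Swedish letter "ö" was extracted as "o".
labeled = ["Kaffe 2 st", "Bröd 1 st"]
extracted = ["Kaffe 2 st", "Brod 1 st"]

result = compare_line_items(extracted, labeled)
missing_chars, added_chars = char_diff(extracted[1], labeled[1])
```

Counting line items with `Counter` keeps duplicate rows distinct, so an invoice that lists the same item twice is not reported as fully extracted when only one copy comes back.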
|