Global ETD Search

1	The Extendable Guideline for Analysing Malicious PDF Documents
Sjöholm, Peter. January 2013
Today, the average computer user has almost certainly encountered the PDF format while handling electronic documents. Because of their widespread popularity and rich feature set, PDF documents are commonly used by attackers to infect systems with malware. This thesis presents The Extendable Guideline for Analysing Malicious PDF Documents. The work establishes the foundation of the guideline and populates it with part of the analysis process. The guideline builds on previously published material on the topic. It is a practical guideline, followed by means of a flowchart, that an analyst can use to determine whether a PDF document is malicious. It provides technical background information, suitable analysis techniques, and tools. The guideline structure was developed using sequential thinking in combination with the divide-and-conquer paradigm. The thesis also describes techniques commonly used by malicious PDF authors to infect systems, evade detection, and distribute their documents. JavaScript is a commonly exploited PDF feature. A wide range of other features are also targeted by malicious PDF authors, but they are encountered more rarely. Attackers often distribute malicious PDF documents as email attachments or by hosting them on a web server.
Keywords: PDF Documents, Portable Document Format, Malicious PDF, Malicious, Guideline, Analyse, Analysing, Analyze, Analyzing, Extendable, Flowchart
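
The abstract above notes that JavaScript is the feature most often abused in malicious PDF documents. As a rough illustration of a first triage step, the sketch below counts a few PDF name tokens in the raw bytes of a file. It is not the thesis's guideline or flowchart: the token list and the function name `count_pdf_names` are assumptions made for this example, and a plain byte scan misses names that are hex-escaped (for example `#4A`) or hidden inside compressed object streams.

```python
import re

# Name tokens that are often looked at first during triage.
# This particular selection is an illustrative assumption.
SUSPICIOUS_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/AA",
    b"/Launch",
    b"/EmbeddedFile",
)


def count_pdf_names(data: bytes) -> dict:
    """Count occurrences of each token in raw, undecoded PDF bytes."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # Reject matches that continue with a letter or digit,
        # so that /JS does not match a longer name such as /JSFoo.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts


# A tiny hand-written fragment, not a complete or valid PDF file.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)

if __name__ == "__main__":
    for name, count in count_pdf_names(sample).items():
        print(f"{name}: {count}")
```

A non-zero count is only a reason to look closer, since many benign documents also use these features.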
2	轉換年報資料以擷取企業評價模型之非財務性資料項 / A Transformation Approach to Extract Annual Report for Non-Financial Category in Business Valuation
吳思宏 (Wu, Szu-Hung). Unknown date
Following the recent wave of mergers and acquisitions, questions such as how much a business is worth and whether it still has prospects concern not only investors but also accountants and business valuators. In today's knowledge-based economy, businesses no longer create value mainly from fixed assets such as land, plants, and equipment, but from intangible assets such as services, brands, and patents, which raises the question of how a business's value should be estimated. These questions repeatedly show the importance of business valuation. Before a valuation is carried out, obtaining the data items required by the valuation model largely determines the quality of the final result. Valuation data items fall into two kinds: financial and non-financial. Financial data items are clearly defined, so they are easier to collect than non-financial ones. Past data collection methods are not adequate for collecting non-financial valuation data items, and most valuators currently process the data manually. This consumes considerable time and money, and manual entry carries a risk of input errors that greatly reduces the correctness of the data. This thesis therefore proposes an approach, based on data extraction technology, for automatically extracting non-financial business valuation data items from annual reports, with the aim of simplifying data collection and improving data correctness.
Keywords: Business valuation, Data extraction, Portable Document Format (PDF), Information Retrieval, Word Segmentation
3	En undersökning av metoder för automatiserad text och parameterextraktion från PDF-dokument med Natural Language Processing / An investigation of methods for automated text and parameter extraction from PDF documents using Natural Language Processing
Värling, Alexander; Hultgren, Emil. January 2024
In today's business environment, many organizations aim to automate the extraction of information from invoices in order to handle large volumes of invoices more efficiently. However, the varied structure of invoices poses challenges. The placement and format of information can differ significantly from one invoice to another, which complicates automated extraction and can reduce the accuracy and efficiency of the process. Handling this variation is therefore crucial for the successful implementation of automated invoice management systems. This work explores four text extraction methods that use optical character recognition, image processing, plain text extraction, and text processing, followed by a comparison between the natural language processing models GPT-3.5 (Generative Pre-trained Transformer) and GPT-4 for parameter extraction from invoices. The models were tested on their ability to extract eight specific fields from PDF documents, and their results were then compared. The results are reported as a Micro F1 score, a value between 0 and 1, where 1 represents perfect extraction. The method that used GPT-4 was the most successful, achieving a score of 0.98 and error-free extraction in six of the eight fields when tested on 19 PDF documents. GPT-3.5 came second: it showed promising results in four of the eight fields but performed less well in the remaining fields, resulting in a Micro F1 score of 0.71.
Because of the limited amount of data, GPT-3.5 could not reach its full potential, as fine-tuning and validation require larger datasets. Likewise, GPT-4 needs to be validated on a more comprehensive dataset before conclusions can be drawn about the models' actual performance. Further research is necessary to determine the capabilities of the GPT models with these improvements.
Keywords: portable document format, invoice, digitalization, IT solutions, optical character recognition, text extraction, natural language processing, generative pre-trained transformer
Subject: Software Engineering
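
The Micro F1 score mentioned in the abstract above pools true positives, false positives, and false negatives over all fields and documents before computing a single F1 value. The sketch below shows one way to compute it for field extraction. The matching rule (exact string equality, a wrong value counting as both a false positive and a false negative) and the field names are assumptions made for illustration; the abstract does not state how the thesis scored individual fields.

```python
def micro_f1(predicted, expected):
    """Micro-averaged F1 over all fields of all documents.

    predicted, expected: lists of dicts mapping field name -> value,
    one dict per document, in the same order.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predicted, expected):
        for field, gold_value in gold.items():
            pred_value = pred.get(field)
            if pred_value is None:
                fn += 1          # field was not extracted at all
            elif pred_value == gold_value:
                tp += 1          # exact match (assumed matching rule)
            else:
                fp += 1          # a wrong value was produced ...
                fn += 1          # ... and the right one was missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical invoices with two of the extracted fields each.
expected = [
    {"invoice_number": "A-17", "total": "120.00"},
    {"invoice_number": "B-02", "total": "89.50"},
]
predicted = [
    {"invoice_number": "A-17", "total": "120.00"},
    {"invoice_number": "B-02", "total": "98.50"},  # digits swapped
]

if __name__ == "__main__":
    print(micro_f1(predicted, expected))
```

Here three of four fields match and one is wrong, so the pooled counts are 3 true positives, 1 false positive, and 1 false negative, giving 6 / 8 = 0.75.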
4	PDF document search within a very large database
Wang, Lizhong. January 2017
A digital search engine, which takes a search request from a user and returns a matching result, is indispensable to people accustomed to using the Internet. At the same time, the PDF document format has been adopted by more and more people and is now widely used because of its convenience and effectiveness. As a result, the traditional library is already being replaced by the digital one. Taken together, these two factors create an urgent need for a document-based search engine that can query a digital document database using an input file. This thesis is a software development project that aims to design and implement a prototype of such a search engine and to propose potential optimization methods for Loredge. The research is divided into two parts: prototype development and optimization analysis. It involves an analytical study of sample documents provided by Loredge and a performance analysis from several perspectives. The prototype consists of reading, preprocessing, and similarity measurement. The reading stage loads a PDF file using the Java library Apache PDFBox. The preprocessing stage processes the loaded document and generates a document fingerprint. The similarity measurement is the final stage, which measures the similarity between the input fingerprint and the fingerprints of all documents in the database. The optimization analysis aims to balance response time, accuracy, and memory consumption. According to the performance analysis, the shorter the document fingerprint, the better the search program performs. In addition, a permanent feature database and a similarity-based filtering mechanism are proposed to optimize the program further.
The project lays a solid foundation for further study of document-based search engines by providing a feasible prototype and sufficient relevant experimental data. The study concludes that future work should focus mainly on improving the efficiency of database access, which involves labeling data entries and optimizing the search algorithm.
Keywords: Portable Document Format, Search, Document Identification, Cosine Similarity, Document Preprocessing, Document Search, Optimization Method, Performance Analysis, Classification, Regression, Loredge
Subject: Computer and Information Sciences
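
The entry above describes a pipeline of fingerprint generation followed by similarity measurement, and lists cosine similarity among its keywords. The sketch below illustrates that idea on plain text. It is not the thesis's implementation, which is written in Java and reads PDF files with Apache PDFBox: here the fingerprint is assumed to be the most frequent terms of a document with their counts, and the names `fingerprint`, `cosine_similarity`, and `rank` are hypothetical.

```python
import math
import re
from collections import Counter


def fingerprint(text, size=50):
    """Assumed fingerprint: the `size` most frequent terms and their counts."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(dict(Counter(tokens).most_common(size)))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_text, database):
    """Return (score, name) pairs for every stored fingerprint, best first."""
    query = fingerprint(query_text)
    scored = [(cosine_similarity(query, fp), name) for name, fp in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical database of precomputed fingerprints.
database = {
    "a.pdf": fingerprint("portable document format search engine"),
    "b.pdf": fingerprint("invoice total amount due"),
}

if __name__ == "__main__":
    for score, name in rank("document search engine prototype", database):
        print(f"{name}: {score:.2f}")
```

The `size` parameter caps the fingerprint length, which is the quantity the abstract reports as trading accuracy against response time and memory use. Storing the fingerprints once, as in `database`, corresponds loosely to the permanent feature database the abstract proposes.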
