51

Automatic Extraction of Financial Data in Credit Rating Analysis / Automatisk extraktion av finansiella data inom kreditvärderingsanalys

Minasyan, Robert, Erlandsson, Pim January 2023 (has links)
With the increasing use of big data and automation, financial data extraction is of growing importance in the financial industry. This thesis examines how an extraction system can be developed to extract data relevant to credit rating analysis. The system is designed to collect financial reports, extract the relevant information, and identify failed extractions. Prerequisites were identified through a qualitative literature study and through meetings with employees at a credit rating analysis company, aligning the system's functionality with the company's processes. The results show that automatic extraction can be implemented. The system was trained through a manual review process, which improved its performance; after training, it identified and extracted all target data correctly. In some reports, however, certain target data was missing, and the system treated these cases as failures. In summary, a system that extracts all existing target data was implemented.
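A minimal sketch of the kind of pipeline the abstract describes: collect a report, pull out target fields, and flag missing ones as failed extractions. The field names and the regular-expression matching are illustrative assumptions, not the thesis's actual implementation.

```python
import re

# Illustrative target fields for a credit rating analysis; the real system's
# fields and extraction rules are not specified in the abstract.
TARGET_PATTERNS = {
    "net_revenue": re.compile(r"net revenue[:\s]+([\d,\.]+)", re.I),
    "total_debt": re.compile(r"total debt[:\s]+([\d,\.]+)", re.I),
    "equity": re.compile(r"equity[:\s]+([\d,\.]+)", re.I),
}

def extract_targets(report_text: str) -> dict:
    """Extract target fields from one financial report.

    Fields that cannot be found are reported as failures, mirroring how the
    thesis treats missing target data.
    """
    extracted, failed = {}, []
    for field, pattern in TARGET_PATTERNS.items():
        match = pattern.search(report_text)
        if match:
            extracted[field] = match.group(1)
        else:
            failed.append(field)
    return {"extracted": extracted, "failed": failed}

if __name__ == "__main__":
    sample = "Net revenue: 1,200.5\nTotal debt: 800.0"
    print(extract_targets(sample))
    # -> equity is reported under "failed" because it is absent from the report
```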
52

Breaking down a videogame level's design: Deconstruction of the narrative in The Witcher III: The Wild Hunt

Toumpoulidis, Charalampos January 2023 (has links)
This master's thesis investigates the narrative components inherent in open-world games. It extracts the narrative part of a game, analyzes it, and compares it with other game design elements. The focus is on The Witcher III: The Wild Hunt; specific portions of the game were studied using AutoCAD and in-game playthroughs to extract data relating to narrative components such as items, characters, and locales. Data on buildings, cities, and objects relating to main quests, side quests, random interactions, cutscene storytelling, object interaction, and recurring characters were generated from the open-world game in AutoCAD. Objects were used to mark narrative components during gameplay, and these elements were exported to Excel and evaluated with Tableau. A comparative study between the narrative components and the game-level design was conducted to uncover patterns and trends in the open-world game. The study examines the importance of narrative components at various scales and their relationships with other game mechanics. According to the results, item interactions become more significant on the third floor and in larger cities, whereas cutscenes and narration are more common in big cities and on the first floor of buildings. The study also highlights the connection between main quests and side quests, indicating their strong relationship to the game's overarching story. Side missions, which frequently involve interacting with city objects, are increasingly important for encouraging player exploration. The study emphasizes the need for game designers to tailor their use of narrative components to the scale and context of each gaming setting, ultimately helping them create more immersive and engaging game worlds.
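As one way to picture the comparative analysis, the hedged sketch below counts tagged narrative components by city size and building floor. The column names and rows are invented for illustration; the thesis exported its tags to Excel and analyzed them in Tableau rather than with this code.

```python
import pandas as pd

# Hypothetical export of tagged narrative components (the thesis exported
# such tags from AutoCAD/gameplay into Excel and analyzed them in Tableau).
tags = pd.DataFrame(
    {
        "component": ["object_interaction", "cutscene", "side_quest", "cutscene"],
        "city_size": ["large", "large", "small", "large"],
        "floor": [3, 1, 1, 1],
    }
)

# Count how often each narrative component appears per city size and floor,
# mirroring the comparison of item interactions vs. cutscenes by scale.
summary = tags.groupby(["city_size", "floor", "component"]).size().unstack(fill_value=0)
print(summary)
```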
53

The One Spider To Rule Them All : Web Scraping Simplified: Improving Analyst Productivity and Reducing Development Time with A Generalized Spider / Spindeln som härskar över dom alla : Webbskrapning förenklat: förbättra analytikerproduktiviteten och minska utvecklingstiden med generaliserade spindlar

Johansson, Rikard January 2023 (has links)
This thesis addresses the development of a generalized spider for web scraping that can be applied to multiple sources, reducing the time and cost of creating and maintaining an individual spider for each website or URL. The project aims to improve analyst productivity, reduce development time for developers, and ensure high-quality, accurate data extraction. The work investigates web scraping techniques and develops a more efficient and scalable approach to report retrieval. The problem statement emphasizes the inefficiency of the current method, with one customized spider per source, and the need for a more streamlined approach to web scraping. The research question focuses on identifying patterns in the scraping process and the functions required for specific publication websites in order to create a more generalized web scraper; the objective is to reduce manual effort, improve scalability, and maintain high-quality data extraction. The problem is addressed with a quantitative approach: spiders are analyzed and implemented for each data source, which gives a comprehensive view of the potential scenarios and the knowledge needed to develop a general spider. The spiders are then grouped by similarity and, through simple logic, consolidated into a single general spider capable of handling all sources. To construct the general spider, a utility library is created with the essential tools for extracting relevant information such as title, description, date, and PDF links; the source-specific details are then moved into configuration files that drive the general spider. The findings demonstrate the successful integration of multiple sources and spiders into a unified general spider. Due to the project's limited time frame there is room for further improvement, such as better-structured configuration files, an expanded utility library, or the integration of AI capabilities to enhance the general spider's performance. Nevertheless, the current solution is suitable for automated article retrieval and ready to be used.
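A minimal sketch of the config-driven design the abstract outlines: a single general spider whose source-specific details live in configuration entries. The configuration keys, URL, and CSS selectors are assumptions for illustration; the thesis's actual utility library and configuration format are not reproduced here.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical per-source configuration; in the thesis, each source's details
# were moved into configuration files that drive one general spider.
SOURCE_CONFIGS = [
    {
        "name": "example-publisher",
        "url": "https://example.com/reports",
        "item": "article",          # CSS selector for one report entry
        "title": "h2",              # selector for the title within an entry
        "date": "time",             # selector for the publication date
        "pdf": "a[href$='.pdf']",   # selector for the PDF link
    },
]

def scrape(config: dict) -> list[dict]:
    """Run the general spider against one source described by its configuration."""
    html = requests.get(config["url"], timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for entry in soup.select(config["item"]):
        title = entry.select_one(config["title"])
        date = entry.select_one(config["date"])
        pdf = entry.select_one(config["pdf"])
        records.append(
            {
                "source": config["name"],
                "title": title.get_text(strip=True) if title else None,
                "date": date.get_text(strip=True) if date else None,
                "pdf": pdf["href"] if pdf else None,
            }
        )
    return records

if __name__ == "__main__":
    for cfg in SOURCE_CONFIGS:
        print(scrape(cfg))
```

Under this design, adding a new source means adding a configuration entry rather than writing a new spider, which is the productivity gain the thesis targets.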
54

Knowledge Base Augmentation from Spreadsheet Data : Combining layout inference with multimodal candidate classification

Heyder, Jakob Wendelin January 2020 (has links)
Spreadsheets constitute a valuable and notably large set of documents within many enterprise organizations and on the Web. Although spreadsheets are intuitive to use and equipped with powerful functionality, extracting and transforming their data remains a cumbersome and mostly manual task. The great flexibility they give the user results in data that is arbitrarily structured and hard to process for other applications. This thesis proposes a novel architecture that combines supervised layout inference with multimodal candidate classification to allow knowledge base augmentation from arbitrary spreadsheets. The design accounts for the need to repair misclassifications and allows verification and ranking of ambiguous candidates. The system is evaluated on two datasets, one with single-table spreadsheets and another with spreadsheets of arbitrary format. The evaluation shows that the proposed system achieves performance similar to state-of-the-art rule-based solutions on single-table spreadsheets. Additionally, its flexibility allows it to process arbitrary spreadsheet formats, including horizontally and vertically aligned tables, multiple worksheets, and contextualizing metadata, which was not possible with existing purely text-based or table-based solutions. The experiments demonstrate high effectiveness, with an F1 score of 95.71 on arbitrary spreadsheets that require the interpretation of surrounding metadata. The precision of the system can be increased further by applying candidate schema matching based on the semantic similarity of column headers.
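A toy sketch of the final step mentioned above, matching spreadsheet column headers to knowledge base properties by similarity. The thesis uses semantic similarity; this stand-in uses plain string similarity (difflib) so the example stays self-contained, and the property names are invented.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge base properties to match spreadsheet headers against.
KB_PROPERTIES = ["company name", "founding year", "headquarters", "revenue"]

def match_header(header: str, threshold: float = 0.6) -> str | None:
    """Return the best-matching KB property for a column header, or None.

    The thesis ranks candidates by semantic similarity of headers; this
    stand-in uses plain string similarity for illustration only.
    """
    best_property, best_score = None, 0.0
    for prop in KB_PROPERTIES:
        score = SequenceMatcher(None, header.lower(), prop).ratio()
        if score > best_score:
            best_property, best_score = prop, score
    return best_property if best_score >= threshold else None

if __name__ == "__main__":
    print(match_header("Company"))        # likely "company name"
    print(match_header("Yr. founded"))    # may fall below the threshold -> None
```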
55

轉換年報資料以擷取企業評價模型之非財務性資料項 / A Transformation Approach to Extract Annual Report for Non-Financial Category in Business Valuation

吳思宏, Wu, Szu-Hung Unknown Date (has links)
Because of the recent wave of mergers and acquisitions, more and more people are asking how much a business is worth and whether it still has prospects. These questions concern not only investors but also accountants and business valuators. We have also entered a knowledge-based economy: businesses no longer create value mainly from fixed assets such as land, factories, and equipment, but from services, brands, and patents, which makes it harder to estimate a company's real value. All of this underlines the importance of business valuation. Before a valuation can be calculated, the data items required by the valuation model must be collected, and the quality of that collection directly affects the result. Business valuation data items fall into two categories, financial and non-financial. Because financial data items are clearly defined, they are easier to collect than non-financial ones. Existing collection methods are not well suited to gathering non-financial items, and most valuators still process the data manually, which wastes time and money, introduces typing errors, and lowers correctness. This thesis therefore proposes an approach that automatically extracts non-financial business valuation data items from annual reports using data extraction techniques, in order to simplify the collection process and improve data correctness.
56

Extração automática de dados de páginas HTML utilizando alinhamento em dois níveis / Automatic extraction of data from HTML pages using two-level alignment

Pedralho, André de Souza 28 July 2011 (has links)
There is a huge amount of information on the World Wide Web in pages composed of similar objects; e-commerce websites and online catalogs are typical examples of such data repositories. Although this information usually occurs in semi-structured text, it is designed to be interpreted and used by humans, not processed by machines. The identification of these objects in Web pages is performed by external applications called extractors or wrappers. This work proposes and evaluates an automatic approach to generating wrappers capable of extracting and structuring data records and the values of their attributes. It uses a tree alignment algorithm to find, in the Web page, examples of the objects of interest. The method then generates regular expressions for extracting objects similar to the given examples using a multiple sequence alignment algorithm. In a final step, it decomposes the objects into text sequences using the regular expression together with common formats and delimiters, in order to identify the attribute values of the data records. Experiments on a collection of 128 Web pages from different domains demonstrate the feasibility of the extraction method. It is evaluated on the identification of blocks of HTML source code that contain data records, and on the extraction of the records and their attribute values, reaching a precision of 83% and a recall of 80% when extracting attribute values. These figures represent a gain of 43.37% in precision and 68.75% in recall compared to similar proposals.
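A toy illustration of the wrapper-generation idea: shared text across aligned example records becomes literal parts of a regular expression, while diverging text becomes capture groups. The thesis aligns HTML trees and multiple sequences; this sketch only aligns two invented plain-text records with difflib.

```python
import re
from difflib import SequenceMatcher

def build_wrapper(example_a: str, example_b: str) -> re.Pattern:
    """Derive a simple extraction regex from two aligned example records.

    Shared text becomes a literal; diverging text becomes a capture group.
    A toy stand-in for the thesis's multiple sequence alignment step.
    """
    parts = []
    for op, i1, i2, _j1, _j2 in SequenceMatcher(None, example_a, example_b).get_opcodes():
        if op == "equal":
            parts.append(re.escape(example_a[i1:i2]))
        elif not parts or parts[-1] != "(.+?)":
            parts.append("(.+?)")   # collapse adjacent differing regions
    return re.compile("".join(parts))

if __name__ == "__main__":
    wrapper = build_wrapper("Name: Alice, Age: 31", "Name: Bob, Age: 7")
    match = wrapper.fullmatch("Name: Carol, Age: 25")
    if match:
        print(match.groups())   # expected: ('Carol', '25')
```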
57

Portál pro agregaci dat z webových zdrojů / Portal for Aggregation of Data from Web Sources

Mikita, Tibor January 2019 (has links)
This thesis deals with data extraction and aggregation from heterogeneous web sources. The goal is to create a platform and a functional web application using appropriate technologies, with the main focus on the application's design and implementation. The application domain is accommodation, specifically apartment rentals. Data is extracted either through a portal's API or through a wrapper, and the obtained data is stored in a document database. The resulting system makes it possible to obtain rental ads from multiple web sources at the same time and to present them in a uniform way.
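A hedged sketch of the two ingestion paths the abstract mentions, a portal API versus a scraped wrapper, normalized into one document shape and stored in a document database (MongoDB is used here as one plausible choice; the endpoints, field names, and selectors are invented).

```python
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

def from_api(url: str) -> list[dict]:
    """Ingest listings from a portal that exposes a JSON API (hypothetical endpoint)."""
    listings = requests.get(url, timeout=30).json()
    return [
        {"title": item["title"], "price": item["price"], "source": url}
        for item in listings
    ]

def from_wrapper(url: str) -> list[dict]:
    """Ingest listings from a portal without an API by scraping it (hypothetical selectors)."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return [
        {
            "title": ad.select_one(".title").get_text(strip=True),
            "price": ad.select_one(".price").get_text(strip=True),
            "source": url,
        }
        for ad in soup.select(".ad")
    ]

if __name__ == "__main__":
    docs = from_api("https://portal-a.example/api/flats") + from_wrapper(
        "https://portal-b.example/flats"
    )
    # Store the aggregated, uniformly shaped documents in a document database.
    MongoClient("mongodb://localhost:27017")["aggregator"]["listings"].insert_many(docs)
```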
58

Vyhledávání objektů v obraze na základě předlohy / Image object detection using template

Novák, Pavel January 2014 (has links)
This thesis focuses on template-based object detection in images. The main contribution of this work is a new method for feature extraction from the Histogram of Oriented Gradients using a set of comparators. The thesis describes the image comparison and feature extraction methods used, with the main part devoted to the Histogram of Oriented Gradients method, from which the proposed approach is derived. A small training data set (100 samples) verified by cross-validation is used, followed by tests on real scenes. The achieved success rate under cross-validation is 98% for the SVM algorithm.
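A hedged sketch of the general pipeline the abstract describes, HOG descriptors fed to an SVM and evaluated by cross-validation, using scikit-image and scikit-learn on synthetic data. The comparator-based feature extraction that is the thesis's actual contribution is not reproduced; plain HOG vectors stand in for it.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the ~100-sample training set: 64x64 grayscale patches,
# label 1 for "object" patches, 0 for background.
images = rng.random((100, 64, 64))
labels = rng.integers(0, 2, size=100)

# One HOG descriptor per image; the thesis derives its features from HOG via a
# set of comparators, which this sketch does not attempt to reproduce.
features = np.array(
    [hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)) for img in images]
)

# Cross-validated SVM accuracy (the thesis reports 98% on its real data;
# random synthetic data will of course score near chance).
scores = cross_val_score(SVC(kernel="linear"), features, labels, cv=5)
print("cross-validated accuracy:", scores.mean())
```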
59

Metody dolování relevantních dat z prostředí webu s využitím sociálních sítí / Data Mining of Relevant Information from the WWW Using Social Networks

Smolík, Jakub January 2013 (has links)
This thesis focuses on the problem of finding relevant data on the internet. It presents a possible solution in the form of an application capable of automatically extracting and aggregating data from the web and presenting it, based on input keywords. For this purpose, the possibilities of automated extraction from three chosen data formats, commonly used for storing data on the internet, were studied and described. The thesis also examines ways of mining data from social networks. Finally, it presents the design, implementation, and testing of the resulting application, which can easily find, display, and give the user easy access to the searched information.
60

Evaluation of web scraping methods : Different automation approaches regarding web scraping using desktop tools / Utvärdering av webbskrapningsmetoder : Olika automatiserings metoder kring webbskrapning med hjälp av skrivbordsverktyg

Oucif, Kadday January 2016 (has links)
A lot of information can be found on, and extracted in different forms from, the semantic web through web scraping, and many techniques have emerged over time. This thesis evaluates different web scraping methods in order to develop an automated, reliable, easily implemented, and solid extraction process. A number of parameters are defined to evaluate and compare existing techniques. A matrix of desktop tools is examined and two are chosen for evaluation; the evaluation also covers learning to set up the scraping process with so-called agents. A number of links are scraped using the presented techniques, both with and without executing JavaScript on the web sources. Prototypes built with the chosen techniques are presented, with Content Grabber as the final solution. The result is a better understanding of the subject, along with a cost-effective extraction process consisting of different techniques and methods, where a good understanding of the web sources' structure facilitates the data collection. Finally, the result is discussed and presented with regard to the chosen parameters.
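A hedged sketch of the with/without JavaScript distinction the evaluation draws (the thesis itself used desktop tools such as Content Grabber, not this code): a static fetch sees only server-rendered HTML, while a headless browser such as Playwright executes scripts before links are extracted. The URL is a placeholder.

```python
import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

URL = "https://example.com"  # placeholder target

def links_without_js(url: str) -> list[str]:
    """Static fetch: only links present in the server-rendered HTML are seen."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

def links_with_js(url: str) -> list[str]:
    """Headless browser: JavaScript runs first, so script-generated links are also seen."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        hrefs = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        browser.close()
        return hrefs

if __name__ == "__main__":
    print(len(links_without_js(URL)), "links without JavaScript execution")
    print(len(links_with_js(URL)), "links with JavaScript execution")
```

Comparing the two link counts on the same page is one simple way to see whether a source needs JavaScript execution before scraping.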
