41

A Framework for Fashion Data Gathering, Hierarchical-Annotation and Analysis for Social Media and Online Shop : TOOLKIT FOR DETAILED STYLE ANNOTATIONS FOR ENHANCED FASHION RECOMMENDATION

Wara, Ummul January 2018 (has links)
Due to the transformation of recommendation systems from content-based to hybrid cross-domain-based, there is a need to prepare a social-network dataset that provides sufficient data as well as detail-level annotation from a predefined hierarchical vocabulary of clothing categories and attributes, while taking user interactions into account. However, existing fashion datasets lack either a hierarchical category-based representation or the user interactions of a social network. This thesis presents two datasets: one from the photo-sharing platform Instagram, which gathers fashionistas' images together with all available user interactions, and another from the online shop Zalando, with detailed information for every garment. We present the design of a customized crawler that enables the user to crawl data by category or attribute. Moreover, an efficient and collaborative web solution is designed and implemented to facilitate large-scale, hierarchical, category-based, detail-level annotation of the Instagram data. By considering all user interactions, the developed solution provides a detail-level annotation facility that reflects the user's preferences. The web solution is evaluated by the team as well as through the Amazon Mechanical Turk service. The annotated output from different users demonstrates the usability of the web solution in terms of availability and clarity. In addition to the data crawling and the annotation web solution, this project analyzes the distribution of the Instagram and Zalando data in terms of clothing category, subcategory and pattern to provide meaningful insight into the data. The research community will benefit from these datasets when working with a richly annotated dataset that represents a social network and includes detailed clothing information.
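To make the category- and attribute-based crawling concrete, the sketch below shows one way such a crawler could be organized around a predefined hierarchical vocabulary. It is a minimal sketch under stated assumptions: the endpoint URL, the JSON field names and the two-level vocabulary are illustrative, not the crawler described in the thesis.

```python
import requests

# Hypothetical two-level clothing vocabulary: category -> subcategories
VOCABULARY = {
    "top": ["t-shirt", "blouse", "hoodie"],
    "bottom": ["jeans", "skirt", "shorts"],
}

def crawl_category(base_url, category, max_pages=3):
    """Collect posts tagged with a category or any of its subcategories."""
    items = []
    for tag in [category] + VOCABULARY.get(category, []):
        for page in range(max_pages):
            resp = requests.get(base_url,
                                params={"tag": tag, "page": page}, timeout=10)
            resp.raise_for_status()
            for post in resp.json().get("posts", []):
                items.append({
                    "id": post.get("id"),
                    "image_url": post.get("image_url"),
                    "category": category,
                    "tag": tag,
                    # user interactions are kept so that annotation can
                    # reflect user preference, as described above
                    "likes": post.get("likes", 0),
                    "comments": post.get("comments", []),
                })
    return items
```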
42

Matching ESCF Prescribed Cyber Security Skills with the Swedish Job Market : Evaluating the Effectiveness of a Language Model

Ahmad, Al Ghaith, Abd ULRAHMAN, Ibrahim January 2023 (has links)
Background: As the demand for cybersecurity professionals continues to rise, it is crucial to identify the key skills necessary to thrive in this field. This research project sheds light on the cybersecurity skills landscape by analyzing the recommendations provided by the European Cybersecurity Skills Framework (ECSF), examining the most required skills in the Swedish job market, and investigating the common skills identified through the findings. The project utilizes the large language model ChatGPT to classify common cybersecurity skills and evaluates its accuracy compared to human classification. Objective: The primary objective of this research is to examine the alignment between the European Cybersecurity Skills Framework (ECSF) and the specific skill demands of the Swedish cybersecurity job market. This study aims to identify common skills and evaluate the effectiveness of a Language Model (ChatGPT) in categorizing jobs based on ECSF profiles. Additionally, it seeks to provide valuable insights for educational institutions and policymakers aiming to enhance workforce development in the cybersecurity sector. Methods: The research begins with a review of the European Cybersecurity Skills Framework (ECSF) to understand its recommendations and methodology for defining cybersecurity skills, and to delineate the cybersecurity profiles along with their corresponding key cybersecurity skills as outlined by the ECSF. Subsequently, a Python-based web crawler was implemented to gather data on cybersecurity job announcements from the Swedish Employment Agency's website. This data is analyzed to identify the cybersecurity skills most frequently required by employers in Sweden. The Language Model (ChatGPT) is utilized to classify these positions according to ECSF profiles. Concurrently, two human agents manually categorize the jobs to serve as a benchmark for evaluating the accuracy of the Language Model, allowing for a comprehensive assessment of its performance. Results: The study thoroughly reviews and cites the recommended skills outlined by the ECSF, offering a comprehensive European perspective on key cybersecurity skills (Tables 4 and 5). Additionally, it identifies the most in-demand skills in the Swedish job market, as illustrated in Figure 6. The research reveals the match between ECSF-prescribed skills in different profiles and those sought after in the Swedish cybersecurity market. The skills of the profiles 'Cybersecurity Implementer' and 'Cybersecurity Architect' emerge as particularly critical, representing over 58% of the market demand. The research further highlights shared skills across various profiles (Table 7). Conclusion: This study highlights the alignment between the European Cybersecurity Skills Framework (ECSF) recommendations and the evolving demands of the Swedish cybersecurity job market. Through a review of ECSF-prescribed skills and a thorough examination of the Swedish job landscape, this research identifies crucial areas of alignment. Significantly, the skills associated with the 'Cybersecurity Implementer' and 'Cybersecurity Architect' profiles emerge as central, collectively constituting over 58% of market demand. This emphasizes the urgent need for educational programs to adapt and harmonize with industry requisites. Moreover, the study advances our understanding of the Language Model's effectiveness in job categorization.
The findings hold significant implications for workforce development strategies and educational policies within the cybersecurity domain, underscoring the pivotal role of informed skills development in meeting the evolving needs of the cybersecurity workforce.
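A hedged sketch of the two pipeline stages described above, harvesting job advertisements and preparing a classification prompt over ECSF profiles, follows. The endpoint URL and response fields are placeholders, and only four of the twelve ECSF profiles are listed; the thesis's actual crawler and prompts are not reproduced here.

```python
import requests

# Four of the twelve ECSF role profiles, shortened for brevity
ECSF_PROFILES = [
    "Cybersecurity Implementer",
    "Cybersecurity Architect",
    "Cyber Incident Responder",
    "Cybersecurity Risk Manager",
]

def fetch_job_ads(query="cybersecurity", limit=50):
    """Fetch job ads from a placeholder job-search endpoint (assumption)."""
    resp = requests.get(
        "https://example-jobboard.se/api/search",  # placeholder URL
        params={"q": query, "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    # assumed response shape: {"hits": [{"headline": ..., "description": ...}]}
    return resp.json().get("hits", [])

def build_classification_prompt(ad_text):
    """Build a prompt asking the LLM to map one ad to one ECSF profile."""
    profiles = ", ".join(ECSF_PROFILES)
    return (
        "Classify the following job advertisement into exactly one of these "
        f"ECSF profiles: {profiles}.\n\n"
        f"Advertisement:\n{ad_text}\n\n"
        "Answer with the profile name only."
    )
```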
43

Jacking and Equalizing Cylinders for NASA Crawler Transporter

Rühlicke, Ingo 03 May 2016 (has links) (PDF)
For the transport of its spacecraft from the vehicle assembly building to the launch pads at Kennedy Space Center, Florida, the National Aeronautics and Space Administration (NASA) has been using two special crawler transporters since 1965. First developed for the Saturn V rocket, the crawler transporters have been sufficient for all subsequent generations of spacecraft. But for the new generation of Orion spacecraft, now under development, a 50% increase in the crawler transporter's load capacity was necessary. For this task, Hunger Hydraulik developed new jacking, equalizing and levelling (JEL) cylinders with sufficient load capacity, but also with new features to improve the availability, reliability and safety of the system. After design approval and manufacture, the cylinders were tested in a specially developed one-to-one-scale dynamic test rig, and after passing this test they had to prove their performance in the crawler transporter itself. This article describes the general application and introduces the technical requirements of the project as well as the realized solution.
44

Modélisation de parcours du Web et calcul de communautés par émergence / Modeling Web Crawls and Computing Communities by Emergence

Toufik, Bennouas 16 December 2005 (has links) (PDF)
The Web graph, and more precisely the crawl that produces it and the communities it contains, is the subject of this thesis, which is divided into two parts. The first part analyzes large interaction networks and introduces a new model of Web crawls. It begins by defining the common properties of interaction networks, then presents several random graph models that generate graphs similar to interaction networks. Finally, it proposes a new model of random crawls. The second part proposes two models for computing communities by emergence in the Web graph. After a review of the importance measures PageRank and HITS, the gravitational model is presented, in which the nodes of a network are mobile and interact with one another through the links between them. Communities emerge rapidly within a few iterations. The second model is an improvement of the first: each node of the network is given an objective, namely to reach its community.
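To make the gravitational model concrete, here is a minimal sketch under stated assumptions: each node steps toward the centroid of its linked neighbours, and a weak all-pairs repulsion (our addition, not part of the abstract) keeps a connected graph from collapsing onto a single point. Step sizes and iteration counts are illustrative.

```python
import numpy as np

def gravitational_communities(edges, n_nodes, steps=100, eta=0.2, mu=0.01, seed=0):
    """Move nodes under neighbour attraction plus weak global repulsion."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_nodes, 2))   # random start positions
    neighbours = [[] for _ in range(n_nodes)]
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    for _ in range(steps):
        new_pos = pos.copy()
        for i in range(n_nodes):
            if neighbours[i]:
                # attraction: step toward the centroid of linked nodes
                new_pos[i] += eta * (pos[neighbours[i]].mean(axis=0) - pos[i])
            # weak inverse-distance repulsion from all other nodes (assumed
            # here to prevent a connected graph from collapsing to a point)
            diff = pos[i] - pos
            dist2 = (diff ** 2).sum(axis=1) + 1e-9
            new_pos[i] += mu * (diff / dist2[:, None]).sum(axis=0)
        pos = new_pos
    return pos  # nearby points indicate emergent communities

# Two triangles joined by a single bridge edge condense into two clusters:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(gravitational_communities(edges, 6))
```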
45

Uma abordagem para captura automatizada de dados abertos governamentais / An Approach for Automated Capture of Open Government Data

Ferreira, Juliana Sabino 07 November 2017 (has links)
No funding was received. / Currently, open government data plays a fundamental role in public transparency and government accountability, besides being a legal obligation. However, most of this data is published in diverse, isolated and independent formats, which hinders its reuse by third-party systems that could build on the information made available in such portals. This work proposes an approach for capturing open government data in an automated way, allowing its reuse in other applications. To this end, a Web crawler was built to capture and store open government data (Dados Abertos Governamentais, DAG), together with the DAG Prefeituras API, which makes this data available in JSON format so that other developers can easily use it in their applications. We also performed an evaluation of the API with developers at different levels of experience.
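As an illustration of the serving side, a minimal sketch of a JSON endpoint over already-harvested records follows; Flask, the route shape and the record fields are assumptions for illustration and do not reproduce the DAG Prefeituras API.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the crawler's output store (a database in practice);
# the fields below are illustrative
RECORDS = [
    {"municipality": "Sorocaba", "dataset": "expenses", "year": 2017,
     "url": "https://example.gov.br/expenses-2017.csv"},
]

@app.route("/datasets/<municipality>")
def datasets(municipality):
    """Return every harvested record for one municipality as JSON."""
    hits = [r for r in RECORDS
            if r["municipality"].lower() == municipality.lower()]
    return jsonify(hits)

if __name__ == "__main__":
    app.run(port=5000)  # e.g. GET /datasets/Sorocaba
```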
46

Characterizing the Third-Party Authentication Landscape : A Longitudinal Study of how Identity Providers are Used in Modern Websites / Longitudinella mätningar av användandet av tredjepartsautentisering på moderna hemsidor

Josefsson Ågren, Fredrik, Järpehult, Oscar January 2021 (has links)
Third-party authentication services are becoming more common since they ease the login procedure by not forcing users to create a new login for every website that uses authentication. Even though this simplifies the login procedure, users still have to be conscious of what data is being shared between the identity provider (IDP) and the relying party (RP). This thesis presents a tool for collecting data about third-party authentication that outperforms previously made tools with regard to accuracy, precision and recall. The developed tool was used to collect information about third-party authentication on a set of websites. The collected data revealed that the third-party login services offered by Facebook and Google are most common and that Twitter's login service is significantly less common. Twitter's login service shares the most data about the users with the RPs and often gives the RPs permission to perform write actions on the user's Twitter account. In addition to our large-scale automatic data collection, three manual data collections were performed and compared to previously made manual data collections spanning a nine-year period. The longitudinal comparison showed that over the nine-year period the login services offered by Facebook and Google have been dominant. It is clear that less information about the users is being shared today compared to earlier years for Apple, Facebook and Google. The Twitter login service is the only IDP that has not changed its permission policies. This could be the reason why the usage of the Twitter login service on websites has decreased. The results presented in this thesis help provide a better understanding of what personal information is exchanged by IDPs, which can guide users to make well-educated decisions on the web.
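One simple heuristic such a measurement tool can apply is to scan a login page for references to known IDP OAuth authorization endpoints, as sketched below. The URL patterns are publicly documented endpoints; the thesis tool's actual detection logic is more elaborate than this single check.

```python
import re
import requests

# Publicly documented OAuth authorization endpoints of major IDPs
IDP_PATTERNS = {
    "Google":   r"accounts\.google\.com/o/oauth2",
    "Facebook": r"facebook\.com/(v[\d.]+/)?dialog/oauth",
    "Twitter":  r"twitter\.com/i/oauth2/authorize|api\.twitter\.com/oauth",
    "Apple":    r"appleid\.apple\.com/auth/authorize",
}

def detect_idps(url):
    """Return the IDPs whose OAuth endpoints appear in a page's HTML."""
    html = requests.get(url, timeout=10).text
    return [idp for idp, pattern in IDP_PATTERNS.items()
            if re.search(pattern, html)]

# Example: detect_idps("https://example.com/login") -> ["Google", ...]
```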
47

Služba pro ověření spolehlivosti a pečlivosti českých advokátů / A Service for Verification of Czech Attorneys

Jílek, Radim January 2017 (has links)
This thesis deals with the design and implementation of an Internet service that makes it possible to objectively assess and verify the reliability and diligence of Czech attorneys based on publicly available data from several courts. The aim of the thesis is to create this service and put it into operation. The results of the work are the programs that carry out the partial tasks needed to realize this intention.
49

Raupenfahrzeug-Dynamik / Tracked Vehicle Dynamics

Graneß, Henry 27 March 2018 (has links)
Tracked undercarriages follow the general principle that the hinged concatenation of chain links creates the vehicle's own running surface. This allows even heavy machines to be mobilized in rough, fragile terrain with large tractive forces. However, owing to the discretization of the track into links of finite length, the undercarriage inherently exhibits considerable ride unevenness. This gives rise to time-varying loads in the undercarriage, which limit the service life of the chain, the undercarriage drive and the vehicle's supporting structure, thus regularly forcing cost-intensive repairs. Taking up this problem, the thesis deals with the analysis and optimization of the driving-dynamics behavior of tracked vehicles. At the same time, methods are presented that permit computationally efficient simulation of tracked vehicles and drive systems.
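The ride unevenness caused by links of finite length is the classic chordal-action (polygon) effect. As a textbook-level illustration rather than a formula taken from the thesis: for a drive tumbler with z faces of circumradius R turning at constant angular speed omega, the track speed oscillates between

```latex
\[
  v_{\max} = \omega R, \qquad
  v_{\min} = \omega R \cos\frac{\pi}{z}, \qquad
  \frac{\Delta v}{v_{\max}} = 1 - \cos\frac{\pi}{z}
\]
```

A tumbler with z = 9 faces thus already produces a speed fluctuation of about 6 percent, which periodically loads the chain, the drive and the supporting structure.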
50

Generic Data Harvester

Asp, William, Valck, Johannes January 2022 (has links)
This report describes the process of developing a generic article scraper that extracts relevant information from an arbitrary web article. The extraction is implemented by searching and examining the HTML of the article, using Python and XPath. The data to be extracted are the title, summary, publishing date and body text of the article. As there is no standard way that websites, and in particular news articles, are built, the extraction needs to be adapted to the different structures and languages of articles. The resulting program should provide a proof-of-concept method of extracting the data, showing that future development is possible. The thesis host company, Acuminor, works with financial crime intelligence and collects information through articles and reports. To scale up the data collection and minimize the maintenance of the scraping programs, a general article scraper is needed. An open-source alternative called Newspaper exists, but since it is no longer maintained and is arguably not well designed, an internal implementation could benefit the company. The program consists of a main class that imports extractor classes exposing an API for extracting the data. Each extractor is decoupled from the rest in order to keep the program as modular as possible. The extraction of the title, summary and date is similar, with the extractors looking for specific HTML tags that contain some common attribute that most websites implement. The text extraction is implemented using a tree that is built from the existing text on the page; the tree is then searched for the node most likely to contain only the body text, using attributes such as the amount of text, depth and number of text nodes. The resulting program does not match the performance of Newspaper, but shows promising results on every part of the extraction. The text extraction is very slow and often takes too much text from the article, but it provides a good blueprint for further improvement at the company. Acuminor will be able to have an in-house article extractor that suits its wants and needs.
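As a condensed sketch of the tree-scoring idea for body-text extraction, the snippet below scores candidate container nodes by text amount, paragraph count and link density, using lxml and XPath as the report does. The weights and the score function are our assumptions, not the program built at Acuminor.

```python
from lxml import html

def extract_body_text(page_html):
    """Pick the container whose score suggests it holds only body text."""
    tree = html.fromstring(page_html)
    best_node, best_score = None, 0.0
    # candidate containers: block elements that usually hold article text
    for node in tree.xpath("//div | //article | //section"):
        paragraphs = node.xpath(".//p")
        text_len = sum(len(p.text_content()) for p in paragraphs)
        links = len(node.xpath(".//a"))
        # favour much text in many paragraphs, penalise link-heavy blocks
        score = text_len + 20 * len(paragraphs) - 25 * links
        if score > best_score:
            best_node, best_score = node, score
    if best_node is None:
        return ""
    return "\n".join(p.text_content().strip()
                     for p in best_node.xpath(".//p"))
```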
