11 |
Improving Retrieval Accuracy in Main Content Extraction from HTML Web Documents
Mohammadzadeh, Hadi, 27 November 2013
The rapid growth of text-based information on the World Wide Web, together with the various applications that make use of this data, motivates the need for efficient and effective methods to identify the “main content” and separate it from additional content items such as navigation menus, advertisements, design elements, or legal disclaimers.
Firstly, in this thesis, we study, develop, and evaluate R2L, DANA, DANAg, and AdDANAg, a family of novel algorithms for extracting the main content of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other three algorithms, is to exploit the particularities of Right-to-Left languages to obtain the main content of web pages. Because the English character set and the Right-to-Left character sets are encoded in different intervals of the Unicode character set, we can efficiently distinguish Right-to-Left characters from English ones in an HTML file. This enables the R2L approach to recognize areas of the HTML file with a high density of Right-to-Left characters and a low density of characters from the English character set. Having recognized these areas, R2L can extract only the Right-to-Left characters from them. The first extension of R2L, DANA, improves the effectiveness of the baseline algorithm by employing an HTML parser in a post-processing phase of R2L to extract the main content from areas with a high density of Right-to-Left characters. DANAg, the second extension of R2L, generalizes the idea of R2L to make it language independent. AdDANAg, the third extension of R2L, integrates a new preprocessing step to normalize hyperlink tags. The presented approaches are analyzed with respect to efficiency and effectiveness. We compare them to several established main content extraction algorithms and show that we extend the state of the art in terms of both efficiency and effectiveness.
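The character-interval idea described above can be illustrated with a minimal Python sketch. It is not the R2L implementation from the thesis: the choice of the Hebrew and Arabic Unicode blocks as the "Right-to-Left" intervals, the line-by-line granularity, the `min_ratio` threshold, and all function names are assumptions made for illustration only.

```python
# Illustrative sketch only; not the R2L algorithm as implemented in the thesis.

# Assumption: the Hebrew and Arabic Unicode blocks stand in for
# "Right-to-Left characters"; the thesis may use other intervals.
RTL_RANGES = (
    (0x0590, 0x05FF),  # Hebrew block
    (0x0600, 0x06FF),  # Arabic block (also covers Persian letters)
)


def is_rtl(ch):
    """True if the character falls into one of the assumed RTL intervals."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in RTL_RANGES)


def is_english(ch):
    """True for ASCII letters, i.e. the 'English character set'."""
    return ch.isascii() and ch.isalpha()


def line_densities(html):
    """Return one (rtl_count, english_count) pair per line of the HTML text."""
    return [
        (sum(is_rtl(c) for c in line), sum(is_english(c) for c in line))
        for line in html.splitlines()
    ]


def rtl_dense_lines(html, min_ratio=0.5):
    """Indices of lines where RTL characters dominate.

    min_ratio is a hypothetical threshold, not a value from the thesis.
    """
    selected = []
    for index, (rtl, english) in enumerate(line_densities(html)):
        total = rtl + english
        if total and rtl / total >= min_ratio:
            selected.append(index)
    return selected


def extract_rtl_text(html, min_ratio=0.5):
    """Keep only RTL characters (and whitespace) from the RTL-dense lines."""
    lines = html.splitlines()
    return [
        "".join(c for c in lines[i] if is_rtl(c) or c.isspace()).strip()
        for i in rtl_dense_lines(html, min_ratio)
    ]
```

Calling `extract_rtl_text` on the raw text of an HTML file returns the Right-to-Left text of every line that passes the density test. Markup and English-only navigation lines are dropped because their tags and attributes consist almost entirely of ASCII letters.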
Secondly, automatically extracting the headline of web articles has many applications. We develop and evaluate TitleFinder, a content-based and language-independent approach for unsupervised extraction of the headline of web articles. The proposed method achieves high performance in terms of effectiveness and efficiency and outperforms approaches that operate on structural and visual features.
|
12 |
Tvorba písma OpenType volně dostupnými softwarovými prostředky / Making OpenType fonts with free software
Bednár, Peter, January 2011
This thesis describes in detail typography and computer fonts in the OpenType format. It begins with the historical development of typefaces, with emphasis on the development of Roman and white letter and their characteristics. After presenting the basics of typography, the work concentrates on digital fonts, with emphasis on the possibilities of the OpenType format. Its characteristics and advantages over other formats are listed, and it is judged to be a format that is also suitable for font creation in education. Letterspacing and kerning are mentioned among the basic graphical adjustments in font creation. In the theoretical part of the thesis, they are examined in the available programs designed for creating OpenType fonts. Besides freely available tools, the survey also includes commercial ones, because the freely available applications lack more advanced instruments and functions. The evaluation found the most suitable programs for education to be the commercial Fontlab Fontographer, the free Type lite, and Fontforge, which is intended for open-source platforms. The practical part of the thesis focuses on two chosen programs for creating the main font characteristics. The goal was to determine whether identical results can be reached with both programs. Fontographer offers a wide palette of vector-graphics tools similar to those of Adobe Illustrator. Type lite offers fewer tools, which is nevertheless sufficient for elementary work and for becoming familiar with digital typeface creation. The main shortcoming of the freeware is the absence of kerning, spacing, and hinting functions. A comparison of the programs' capabilities shows that Windows-based freeware is functionally sufficient only for entry-level users. The best option among the freely available programs is Fontforge for Linux, which supports the typographic functions mentioned.
Fontographer is recommended for teaching the basic characteristics of the OpenType font format. Another goal of the thesis was to create a recommended procedure that students can follow to create the basic characteristics of an OpenType font; it is enclosed at the end of the thesis.
|