651 |
Information discovery from semi-structured record sets on the Web. January 2012 (has links)
万维网(World Wide Web ,简称Web) 从上世纪九十年代出现以来在深度和广度上都得到了巨大的发展，大量的Web应用前所未有地改变了人们的生活。Web的发展形成了一个庞大而有价值的信息资源，然而由于Web 内容异质性给自动信息抽取所造成的困难，这个信息源并没有被充分地利用。因此, Web信息抽取是Web信息应用过程中非常关键的一环。一般情况下，一个网页用来描述一个单独的对象或者一组相似的对象。例如，关于某款数码相机的网页描述了该相机的各方面特征，而一个院系的教授列表则描述了一组教授的基本信息。相应地, Web信息抽取可以分为两大类，即面向单个对象细节的信息抽取和面向一组对象记录的信息抽取。本文集中讨论后者，即从单一的网页中抽取一组半结构化的数据记录。 / 本文提出了两个框架来解决半结构化数据记录的抽取问题。首先介绍一个基于数据记录切分树的框架RST。该框架中提出了一个新的搜索结构即数据记录切分树。基于所设计的搜索策略，数据记录切分树可以有效地从网页中抽取数据记录。在数据记录切分树中，对应于可能的数据记录的DOM子树组是在搜索过程中动态生成的，这使得RST框架比已有的方法更具灵活性。比如在MDR和DEPTA 中, DOM子树组是根据预定义的方式静态生成的，未能考虑当前数据记录区域的特征。另外, RST框架中提出了一个基于"HTML Token" 单元的相似度计算方法。该方法可以综合MDR中基于字符串编辑距离的方法之优点和DEPTA 中基于树结构编辑距离的方法之优点。 / 很多解决数据记录抽取问题的已有方法(包括RST框架)都需要预定义若干硬性的条件，并且他们通过遍历DOM树结构来在一个网页中穷举搜索可能存在的数据记录区域。这些方法不能很好地处理大量的含有复杂数据记录结构的网页。因此，本文提出了第二个解决框架Skoga。Skoga框架由一个DOM结构知识驱动的模型和一个记录切分树模型组成。Skoga框架可以对DOM结构进行全局的分析，进而实现更加有效的、鲁棒的记录识别。DOM结构知识包含DOM背景知识和DOM统计知识。前者描述DOM结构中的一些逻辑关系，这些关系对DOM的逻辑结构进行限制。而后者描述一个DOM节点或者一组DOM节点的特点，由一组经过巧妙设计的特征(Feature) 来表示。特征的权重是由参数估计算法在一个开发数据集上学习得到的。基于面向结构化输出的支持向量机(Structured-output Support Vector Machine) 模型，本参数估计算法可以很好地处理DOM节点之间的依赖关系。另外，本文提出了一个基于分治策略的优化方法来搜索一个网页的最优化记录识别。 / 最后，本文提出了一个利用半结构化数据记录来进行维基百科类目(Wikipedia Category) 扩充的框架。该框架首先从某个维基百科类目中获取几个已有的实体(Entity) 作为种子，然后利用这些种子及其信息框(Infobox) 中的属性来从Web上发掘更多的同一类目的实体及其属性信息。该框架的一个特点是它利用半结构化的数据记录来进行新实体和属性的抽取，而这些半结构化的数据记录是通过自动的方法从Web上获取的。该框架提出了一个基于条件随机场(Conditional Random Fields) 的半监督学习模型来利用有限的标注样本进行目标信息抽取。这个半监督学习模型定义了一个记录相似关系图来指导学习过程，从而利用大量非标注样本来获得更好的信息抽取效果。 / The World Wide Web has been extensively developed since its first appearance two decades ago. Various applications on the Web have unprecedentedly changed people's lives. Although the explosive growth and spread of the Web have resulted in a huge information repository, it is still under-utilized due to the difficulty in automated information extraction (IE) caused by the heterogeneity of Web content. Thus, Web IE is an essential task in the utilization of Web information. Typically, a Web page may describe either a single object or a group of similar objects. For example, the description page of a digital camera describes different aspects of the camera. In contrast, the faculty list page of a department presents the information of a group of professors. Corresponding to the above two types, Web IE methods can be broadly categorized into two classes, namely, description-details-oriented extraction and object-records-oriented extraction. In this thesis, we focus on the latter task, namely semi-structured data record extraction from a single Web page. / In this thesis, we develop two frameworks to tackle the task of data record extraction. We first present a record segmentation search tree framework in which a new search structure, named Record Segmentation Tree (RST), is designed and several efficient search pruning strategies on the RST structure are proposed to identify the records in a given Web page. The subtree groups corresponding to possible data records are dynamically generated in the RST structure during the search process. Therefore, this framework is more flexible compared with existing methods such as MDR and DEPTA, which generate subtree groups in a static manner. Furthermore, instead of using string edit distance or tree edit distance, we propose a token-based edit distance which takes each DOM node as a basic unit in the cost calculation. / Many existing methods for data record extraction from Web pages, including the RST framework, contain pre-coded hard criteria and adopt an exhaustive search strategy for traversing the DOM tree.
They fail to handle many challenging pages containing complicated data records and record regions. In this thesis, we also present another framework, Skoga, which can perform robust detection of different kinds of data records and record regions. Skoga, composed of a DOM structure knowledge driven detection model and a record segmentation search tree model, can conduct a global analysis of the DOM structure to achieve effective detection. The DOM structure knowledge consists of background knowledge as well as statistical knowledge capturing different characteristics of data records and record regions as exhibited in the DOM structure. Specifically, the background knowledge encodes some logical relations governing certain structural constraints in the DOM structure. The statistical knowledge is represented by some carefully designed features that capture different characteristics of a single node or a node group in the DOM. The feature weights are determined using a development data set via a parameter estimation algorithm based on a structured-output Support Vector Machine model, which can tackle the inter-dependency among the labels on the nodes of the DOM structure. An optimization method based on the divide-and-conquer principle is developed, making use of the DOM structure knowledge to quantitatively infer the best record and region recognition. / Finally, we present a framework that can make use of the detected data records to automatically populate existing Wikipedia categories. This framework takes a few existing entities that are automatically collected from a particular Wikipedia category as seed input and explores their attribute infoboxes to obtain clues for the discovery of more entities for this category and the attribute content of the newly discovered entities. One characteristic of this framework is that it conducts discovery and extraction from semi-structured data record sets which are automatically collected from the Web. A semi-supervised learning model with Conditional Random Fields is developed to deal with the issues of extraction learning and the limited number of labeled examples derived from the seed entities. We make use of a proximate record graph to guide the semi-supervised learning process. The graph captures alignment similarity among data records. Then the semi-supervised learning process can leverage the benefit of the unlabeled data in the record set by controlling the label regularization under the guidance of the proximate record graph. / Bing, Lidong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 114-123). / Abstract also in Chinese.
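As an editorial illustration of the token-based similarity idea described in this abstract, the short Python sketch below encodes two candidate DOM subtrees as sequences of HTML-tag tokens and scores them with a normalized edit distance. It is only a minimal sketch of the general idea: the tokenizer, the unit costs and the normalization are assumptions made here, not the implementation used in the thesis.

    # Minimal sketch (not the thesis's implementation) of a token-based edit
    # distance between candidate record subtrees: each subtree is encoded as a
    # sequence of HTML start-tag tokens and compared at token level.
    from html.parser import HTMLParser

    class TagTokenizer(HTMLParser):
        """Collects start-tag names in document order as the token sequence."""
        def __init__(self):
            super().__init__()
            self.tokens = []
        def handle_starttag(self, tag, attrs):
            self.tokens.append(tag)

    def tokenize(fragment):
        parser = TagTokenizer()
        parser.feed(fragment)
        return parser.tokens

    def edit_distance(a, b):
        """Classic Levenshtein distance over token sequences."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[m][n]

    def similarity(frag1, frag2):
        """Normalized similarity in [0, 1] between two candidate record segments."""
        t1, t2 = tokenize(frag1), tokenize(frag2)
        if not t1 and not t2:
            return 1.0
        return 1.0 - edit_distance(t1, t2) / max(len(t1), len(t2))

    if __name__ == "__main__":
        rec1 = "<tr><td><a>Camera A</a></td><td>$199</td></tr>"
        rec2 = "<tr><td><a>Camera B</a></td><td><b>$249</b></td></tr>"
        # a high score suggests the two subtrees form a repeating record pattern
        print(similarity(rec1, rec2))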
/ Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Web Era and Web IE --- p.1 / Chapter 1.2 --- Semi-structured Record and Region Detection --- p.3 / Chapter 1.2.1 --- Problem Setting --- p.3 / Chapter 1.2.2 --- Observations and Challenges --- p.5 / Chapter 1.2.3 --- Our Proposed First Framework - Record Segmentation Tree --- p.9 / Chapter 1.2.4 --- Our Proposed Second Framework - DOM Structure Knowledge Oriented Global Analysis --- p.10 / Chapter 1.3 --- Entity Expansion and Attribute Acquisition with Semi-structured Data Records --- p.13 / Chapter 1.3.1 --- Problem Setting --- p.13 / Chapter 1.3.2 --- Our Proposed Framework - Semi-supervised CRF Regularized by Proximate Graph --- p.15 / Chapter 1.4 --- Outline of the Thesis --- p.17 / Chapter 2 --- Literature Survey --- p.19 / Chapter 2.1 --- Semi-structured Record Extraction --- p.19 / Chapter 2.2 --- Entity Expansion and Attribute Acquisition --- p.23 / Chapter 3 --- Record Segmentation Tree (RST) Framework --- p.27 / Chapter 3.1 --- Overview --- p.27 / Chapter 3.2 --- Record Segmentation Tree --- p.29 / Chapter 3.2.1 --- Basic Record Segmentation Tree --- p.29 / Chapter 3.2.2 --- Slimmed Segmentation Tree --- p.30 / Chapter 3.2.3 --- Utilize RST in Record Extraction --- p.31 / Chapter 3.3 --- Search Pruning Strategies --- p.33 / Chapter 3.3.1 --- Threshold-Based Top k Search --- p.33 / Chapter 3.3.2 --- Complexity Analysis --- p.35 / Chapter 3.3.3 --- Composite Node Pruning --- p.37 / Chapter 3.3.4 --- More Challenging Record Region Discussion --- p.37 / Chapter 3.4 --- Similarity Measure --- p.41 / Chapter 3.4.1 --- Encoding Subtree with Tokens --- p.42 / Chapter 3.4.2 --- Tandem Repeat Detection and Distance-based Measure --- p.42 / Chapter 4 --- DOM Structure Knowledge Oriented Global Analysis (Skoga) Framework --- p.45 / Chapter 4.1 --- Overview --- p.45 / Chapter 4.2 --- Design of DOM Structure Knowledge --- p.49 / Chapter 4.2.1 --- Background Knowledge --- p.49 / Chapter 4.2.2 --- Statistical Knowledge --- p.51 / Chapter 4.3 --- Finding Optimal Label Assignment --- p.54 / Chapter 4.3.1 --- Inference for Bottom Subtrees --- p.55 / Chapter 4.3.2 --- Recursive Inference for Higher Subtree --- p.57 / Chapter 4.3.3 --- Backtracking for the Optimal Label Assignment --- p.59 / Chapter 4.3.4 --- Second Optimal Label Assignment --- p.60 / Chapter 4.4 --- Statistical Knowledge Acquisition --- p.62 / Chapter 4.4.1 --- Finding Feature Weights via Structured Output SVM Learning --- p.62 / Chapter 4.4.2 --- Region-oriented Loss --- p.63 / Chapter 4.4.3 --- Cost Function Optimization --- p.65 / Chapter 4.5 --- Record Segmentation and Reassembling --- p.66 / Chapter 5 --- Experimental Results of Data Record Extraction --- p.68 / Chapter 5.1 --- Evaluation Data Set --- p.68 / Chapter 5.2 --- Experimental Setup --- p.70 / Chapter 5.3 --- Experimental Results on TBDW --- p.73 / Chapter 5.4 --- Experimental Results on Hybrid Data Set with Nested Region --- p.76 / Chapter 5.5 --- Experimental Results on Hybrid Data Set with Intertwined Region --- p.78 / Chapter 5.6 --- Empirical Case Studies --- p.79 / Chapter 5.6.1 --- Case Study One --- p.80 / Chapter 5.6.2 --- Case Study Two --- p.83 / Chapter 6 --- Semi-supervised CRF Regularized by Proximate Graph --- p.85 / Chapter 6.1 --- Overview --- p.85 / Chapter 6.2 --- Semi-structured Data Record Set Collection --- p.88 / Chapter 6.3 --- Semi-supervised Learning Model for Extraction --- p.89 / Chapter 6.3.1 --- Proximate Record Graph Construction --- p.91 / Chapter 6.3.2 --- Semi-Markov CRF and 
Features --- p.94 / Chapter 6.3.3 --- Posterior Regularization --- p.95 / Chapter 6.3.4 --- Inference with Regularized Posterior --- p.97 / Chapter 6.3.5 --- Semi-supervised Training --- p.97 / Chapter 6.3.6 --- Result Ranking --- p.98 / Chapter 6.4 --- Derived Training Example Generation --- p.99 / Chapter 6.5 --- Experiments --- p.100 / Chapter 6.5.1 --- Experiment Setting --- p.100 / Chapter 6.5.2 --- Entity Expansion --- p.103 / Chapter 6.5.3 --- Attribute Extraction --- p.107 / Chapter 7 --- Conclusions and Future Work --- p.110 / Chapter 7.1 --- Conclusions --- p.110 / Chapter 7.2 --- Future Work --- p.112 / Bibliography --- p.113
|
652 |
Using web resources for effective English-to-Chinese cross language information retrieval. / CUHK electronic theses & dissertations collection. January 2005 (has links)
A web-aided query translation expansion method for Cross-Language Information Retrieval (CLIR) is presented in this study. The method is applied to the English/Chinese language pair, in which queries are expressed in English and the documents returned are in Chinese. Among the three main categories of CLIR methods, namely machine translation (MT), dictionary translation using a machine-readable dictionary (MRD), and parallel corpora, our method is based on the second. The MRD-based method is easy to implement. However, it faces a resource limitation problem: the dictionary is often incomplete, leading to poor translations and hence undesirable results. By combining an MRD with the web-aided query translation expansion technique, good retrieval performance can be achieved. The performance gain is largely due to the successful extraction of relevant translation words for a query term from online texts. A new Chinese word discovery algorithm, which extracts words from continuous Chinese characters, was designed and used for this purpose. The extracted relevant words include not only the precise translation of a query term, but also words that are relevant to that term in the source language. / Jin Honglan. / "October 2005." / Adviser: Kam Fai Wong. / Source: Dissertation Abstracts International, Volume: 67-07, Section: B, page: 3899. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (p. 115-121). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract in English and Chinese. / School code: 1307.
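To make the pipeline described above concrete, here is a rough Python sketch of dictionary-based translation combined with web-aided expansion. The toy dictionary, the snippet input and the character n-gram frequency heuristic standing in for the Chinese word discovery algorithm are all assumptions for illustration; they do not reproduce the algorithm developed in the thesis.

    # Rough sketch of web-aided query translation expansion: MRD lookup plus
    # expansion terms mined from web text. "Word discovery" is approximated
    # here by frequent Chinese character n-grams (an assumption of this sketch).
    from collections import Counter

    def mrd_translate(term, mrd):
        """Baseline machine-readable-dictionary lookup; empty for OOV terms."""
        return mrd.get(term.lower(), [])

    def discover_words(snippets, min_len=2, max_len=4, min_count=2):
        """Very rough stand-in for Chinese word discovery: frequent character
        n-grams found in text retrieved from the Web."""
        counts = Counter()
        for text in snippets:
            s = ''.join(c for c in text if '\u4e00' <= c <= '\u9fff')
            for n in range(min_len, max_len + 1):
                for i in range(len(s) - n + 1):
                    counts[s[i:i + n]] += 1
        return [w for w, c in counts.most_common() if c >= min_count]

    def expand_query(term, mrd, snippets, k=5):
        """Combine dictionary translations with web-mined expansion terms."""
        return mrd_translate(term, mrd) + discover_words(snippets)[:k]

    if __name__ == "__main__":
        mrd = {"retrieval": ["检索"]}  # toy dictionary entry (illustrative)
        snippets = ["信息检索系统的查询扩展方法", "跨语言信息检索研究"]  # toy snippets
        print(expand_query("retrieval", mrd, snippets))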
|
653 |
Um estudo sobre agrupamento de documentos textuais em processamento de informações não estruturadas usando técnicas de "clustering" / A study about arrangement of textual documents applied to unstructured information processing using clustering techniques. Wives, Leandro Krug January 1999 (has links)
Atualmente, técnicas de recuperação e análise de informações, principalmente textuais, são de extrema importância. Após o grande BOOM da Internet, muitos problemas que já eram conhecidos em contextos fechados passaram a preocupar também toda a comunidade científica. No âmbito deste trabalho os problemas relacionados à sobrecarga de informações, que ocorre devido ao grande volume de dados a disposição de uma pessoa, são os mais importantes. Visando minimizar estes problemas, este trabalho apresenta um estudo sobre métodos de agrupamento de objetos textuais (documentos no formato ASCII), onde os objetos são organizados automaticamente em grupos de objetos similares, facilitando sua localização, manipulação e análise. Decorrente deste estudo, apresenta-se uma metodologia de aplicação do agrupamento descrevendo-se suas diversas etapas. Estas etapas foram desenvolvidas de maneira que após uma ter sido realizada ela não precisa ser refeita, permitindo que a etapa seguinte seja aplicada diversas vezes sobre os mesmos dados (com diferentes parâmetros) de forma independente. Além da metodologia, realiza-se um estudo comparativo entre alguns algoritmos de agrupamento, inclusive apresentando-se um novo algoritmo mais eficiente. Este fato é comprovado em experimentos realizados nos diversos estudos de caso propostos. Outras contribuições deste trabalho incluem a implementação de uma ferramenta de agrupamento de textos que utiliza a metodologia elaborada e os algoritmos estudados; além da utilização de uma fórmula não convencional de cálculo de similaridades entre objetos (de abordagem fuzzy), aplicada a informações textuais, obtendo resultados satisfatórios. / The Internet is the vital medium of today and, being a mass medium, it raises problems previously known only to specific fields of science. One of these problems, affecting many people, is information overload, caused by the excessive amount of information returned in response to a user's query. Because of this, advanced techniques for information retrieval and analysis are needed. This study contributes to these fields by presenting a methodology that helps users apply clustering to textual data. The technique investigated is capable of grouping documents on several subjects into clusters of documents on the same subject. The groups identified can be used to simplify the process of information analysis and retrieval. This study also presents a tool that was created using the methodology and the algorithms analyzed. The tool was implemented to facilitate the investigation and demonstration of the study. The results of applying a fuzzy formula to calculate the similarity among documents are also presented.
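The following Python sketch illustrates the general kind of similarity-threshold grouping studied in this dissertation. The min/max weight ratio used here as a fuzzy-style similarity and the single-pass grouping strategy are illustrative assumptions, not the dissertation's own formula or algorithms.

    # Minimal sketch of grouping textual documents by a similarity threshold.
    # The fuzzy-style similarity (sum of minimum over sum of maximum normalized
    # term weights) is only an illustrative stand-in for the real formula.
    from collections import Counter

    def term_weights(text):
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    def fuzzy_similarity(a, b):
        terms = set(a) | set(b)
        num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
        den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
        return num / den if den else 0.0

    def group_documents(docs, threshold=0.2):
        """Single-pass grouping: each document joins the most similar existing
        group (represented by its first member) or starts a new one."""
        groups = []  # lists of document indices
        reps = []    # representative weight vector of each group
        for i, doc in enumerate(docs):
            w = term_weights(doc)
            sims = [fuzzy_similarity(w, r) for r in reps]
            if sims and max(sims) >= threshold:
                groups[sims.index(max(sims))].append(i)
            else:
                groups.append([i])
                reps.append(w)
        return groups

    if __name__ == "__main__":
        docs = ["information retrieval on the web",
                "web search and information retrieval systems",
                "clustering algorithms for text documents"]
        print(group_documents(docs))  # expected grouping: [[0, 1], [2]]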
|
654 |
Automatic topic detection from news stories. January 2001 (has links)
Hui Kin. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. / Includes bibliographical references (leaves 115-120). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Topic Detection Problem --- p.2 / Chapter 1.1.1 --- What is a Topic? --- p.2 / Chapter 1.1.2 --- Topic Detection --- p.3 / Chapter 1.2 --- Our Contributions --- p.5 / Chapter 1.2.1 --- Thesis Organization --- p.6 / Chapter 2 --- Literature Review --- p.7 / Chapter 2.1 --- Dragon Systems --- p.7 / Chapter 2.2 --- University of Massachusetts (UMass) --- p.9 / Chapter 2.3 --- Carnegie Mellon University (CMU) --- p.10 / Chapter 2.4 --- BBN Technologies --- p.11 / Chapter 2.5 --- IBM T. J. Watson Research Center --- p.12 / Chapter 2.6 --- National Taiwan University (NTU) --- p.13 / Chapter 2.7 --- Drawbacks of Existing Approaches --- p.14 / Chapter 3 --- System Overview --- p.16 / Chapter 3.1 --- News Sources --- p.17 / Chapter 3.2 --- Story Preprocessing --- p.21 / Chapter 3.3 --- Named Entity Extraction --- p.22 / Chapter 3.4 --- Gross Translation --- p.22 / Chapter 3.5 --- Unsupervised Learning Module --- p.24 / Chapter 4 --- Term Extraction and Story Representation --- p.27 / Chapter 4.1 --- IBM Intelligent Miner For Text --- p.28 / Chapter 4.2 --- Transformation-based Error-driven Learning --- p.31 / Chapter 4.2.1 --- Learning Stage --- p.32 / Chapter 4.2.2 --- Design of New Tags --- p.33 / Chapter 4.2.3 --- Lexical Rules Learning --- p.35 / Chapter 4.2.4 --- Contextual Rules Learning --- p.39 / Chapter 4.3 --- Extracting Named Entities Using Learned Rules --- p.42 / Chapter 4.4 --- Story Representation --- p.46 / Chapter 4.4.1 --- Basic Representation --- p.46 / Chapter 4.4.2 --- Enhanced Representation --- p.47 / Chapter 5 --- Gross Translation --- p.52 / Chapter 5.1 --- Basic Translation --- p.52 / Chapter 5.2 --- Enhanced Translation --- p.60 / Chapter 5.2.1 --- Parallel Corpus Alignment Approach --- p.60 / Chapter 5.2.2 --- Enhanced Translation Approach --- p.62 / Chapter 6 --- Unsupervised Learning Module --- p.68 / Chapter 6.1 --- Overview of the Discovery Algorithm --- p.68 / Chapter 6.2 --- Topic Representation --- p.70 / Chapter 6.3 --- Similarity Calculation --- p.72 / Chapter 6.3.1 --- Similarity Score Calculation --- p.72 / Chapter 6.3.2 --- Time Adjustment Scheme --- p.74 / Chapter 6.3.3 --- Language Normalization Scheme --- p.75 / Chapter 6.4 --- Related Elements Combination --- p.78 / Chapter 7 --- Experimental Results and Analysis --- p.84 / Chapter 7.1 --- TDT corpora --- p.84 / Chapter 7.2 --- Evaluation Methodology --- p.85 / Chapter 7.3 --- Experimental Results on Various Parameter Settings --- p.88 / Chapter 7.4 --- Experimental Results on Various Named Entity Extraction Approaches --- p.89 / Chapter 7.5 --- Experimental Results on Various Story Representation Approaches --- p.100 / Chapter 7.6 --- Experimental Results on Various Translation Approaches --- p.104 / Chapter 7.7 --- Experimental Results on the Effect of Language Normalization Scheme on Detection Approaches --- p.106 / Chapter 7.8 --- TDT2000 Topic Detection Result --- p.110 / Chapter 8 --- Conclusions and Future Works --- p.112 / Chapter 8.1 --- Conclusions --- p.112 / Chapter 8.2 --- Future Work --- p.114 / Bibliography --- p.115 / Chapter A --- List of Topics annotated for TDT2 Corpus --- p.121 / Chapter B --- Significant Test Results --- p.124
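Based only on the components named in this outline (topic representation, similarity calculation and a time adjustment scheme), a heavily simplified single-pass topic detection loop might look like the Python sketch below. The story representation, decay constant and threshold are placeholder assumptions, not the formulas or values used in the thesis.

    # Simplified sketch of single-pass topic detection with a time adjustment:
    # each story is compared with existing topic centroids, older topics are
    # penalized, and stories below the threshold start a new topic.
    import math
    from collections import Counter

    def cosine(a, b):
        common = set(a) & set(b)
        num = sum(a[t] * b[t] for t in common)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def detect_topics(stories, threshold=0.25, decay_days=30.0):
        """stories: list of (day_number, text); returns topics as dicts with a
        term centroid, the last-seen day and member story indices."""
        topics = []
        for idx, (day, text) in enumerate(stories):
            vec = Counter(text.lower().split())
            best, best_score = None, 0.0
            for topic in topics:
                sim = cosine(vec, topic["centroid"])
                # time adjustment: older topics are penalized so recency matters
                sim *= math.exp(-(day - topic["last_day"]) / decay_days)
                if sim > best_score:
                    best, best_score = topic, sim
            if best is not None and best_score >= threshold:
                best["centroid"].update(vec)
                best["last_day"] = day
                best["members"].append(idx)
            else:
                topics.append({"centroid": Counter(vec), "last_day": day, "members": [idx]})
        return topics

    if __name__ == "__main__":
        stories = [(0, "earthquake hits city overnight"),
                   (1, "rescue teams respond to earthquake in city"),
                   (90, "national election results announced")]
        print([t["members"] for t in detect_topics(stories)])  # e.g. [[0, 1], [2]]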
|
655 |
Video text detection and extraction using temporal information. January 2003 (has links)
Luo Bo. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 55-60). / Abstracts in English and Chinese. / Abstract --- p.i / Acknowledgments --- p.vi / Table of Contents --- p.vii / List of Figures --- p.ix / List of Tables --- p.x / List of Abbreviations --- p.xi / Chapter Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Background --- p.1 / Chapter 1.2 --- Text in Videos --- p.1 / Chapter 1.3 --- Related Work --- p.4 / Chapter 1.3.1 --- Connected Component Based Methods --- p.4 / Chapter 1.3.2 --- Texture Classification Based Methods --- p.5 / Chapter 1.3.3 --- Edge Detection Based Methods --- p.5 / Chapter 1.3.4 --- Multi-frame Enhancement --- p.7 / Chapter 1.4 --- Our Contribution --- p.9 / Chapter Chapter 2 --- Caption Segmentation --- p.10 / Chapter 2.1 --- Temporal Feature Vectors --- p.10 / Chapter 2.2 --- Principal Component Analysis --- p.14 / Chapter 2.3 --- PCA of Temporal Feature Vectors --- p.16 / Chapter Chapter 3 --- Caption (Dis)Appearance Detection --- p.20 / Chapter 3.1 --- Abstract Image Sequence --- p.20 / Chapter 3.2 --- Abstract Image Refinement --- p.23 / Chapter 3.2.1 --- Refinement One --- p.23 / Chapter 3.2.2 --- Refinement Two --- p.24 / Chapter 3.2.3 --- Discussions --- p.24 / Chapter 3.3 --- Detection of Caption (Dis)Appearance --- p.26 / Chapter Chapter 4 --- System Overview --- p.31 / Chapter 4.1 --- System Implementation --- p.31 / Chapter 4.2 --- Computation of the System --- p.35 / Chapter Chapter 5 --- Experiment Results and Performance Analysis --- p.36 / Chapter 5.1 --- The Gaussian Classifier --- p.36 / Chapter 5.2 --- Training Samples --- p.37 / Chapter 5.3 --- Testing Data --- p.38 / Chapter 5.4 --- Caption (Dis)appearance Detection --- p.38 / Chapter 5.5 --- Caption Segmentation --- p.43 / Chapter 5.6 --- Text Line Extraction --- p.45 / Chapter 5.7 --- Caption Recognition --- p.50 / Chapter Chapter 6 --- Summary --- p.53 / Bibliography --- p.55
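As a rough illustration of the temporal feature vectors and PCA named in this outline, the Python (NumPy) sketch below builds one temporal vector per pixel position from a stack of frames, projects the vectors onto their principal components, and flags temporally stable pixels as caption candidates. The frame data, the variance criterion and the thresholds are toy assumptions, not the method actually developed in the thesis.

    # Illustrative sketch: per-pixel temporal feature vectors from a frame
    # stack, PCA projection of those vectors, and a toy stability criterion
    # for caption-candidate pixels (captions stay constant across frames).
    import numpy as np

    def temporal_feature_vectors(frames):
        """frames: array of shape (T, H, W); returns an (H*W, T) matrix with
        one temporal vector per pixel position."""
        t, h, w = frames.shape
        return frames.reshape(t, h * w).T

    def pca_project(x, n_components=2):
        """Project row vectors of x onto their top principal components."""
        x_centered = x - x.mean(axis=0)
        # SVD of the centered data gives the principal directions
        _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
        return x_centered @ vt[:n_components].T

    def stable_pixel_mask(frames, var_quantile=0.5):
        """Toy criterion: caption candidates are pixels whose temporal profile
        varies little across frames (low variance along time)."""
        vecs = temporal_feature_vectors(frames.astype(float))
        variances = vecs.var(axis=1)
        threshold = np.quantile(variances, var_quantile)
        _, h, w = frames.shape
        return (variances <= threshold).reshape(h, w)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        frames = rng.integers(0, 256, size=(10, 48, 64)).astype(float)  # noisy background
        frames[:, 40:46, 5:60] = 230.0                                  # a static caption strip
        vecs = temporal_feature_vectors(frames)
        print(pca_project(vecs).shape)             # (3072, 2): low-dimensional pixel descriptions
        print(stable_pixel_mask(frames)[42, 30])   # True: the caption pixel is temporally stable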
|
656 |
Generic signboard detection in image and video. January 2003 (has links)
by Shen Hua. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 67-71). / Abstracts in English and Chinese. / Abstract --- p.i / 摘要 --- p.iii / Acknowledgments --- p.v / Table of Contents --- p.vii / List of Figures --- p.ix / Chapter Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Object Detection --- p.2 / Chapter 1.2 --- Signboard Detection --- p.3 / Chapter Chapter 2 --- System Overview --- p.5 / Chapter 2.1 --- What is the problem? --- p.5 / Chapter 2.2 --- Review of previous work --- p.6 / Chapter 2.3 --- System Outline --- p.8 / Chapter Chapter 3 --- Preprocessing --- p.10 / Chapter 3.1 --- Edge Detection --- p.11 / Chapter 3.1.1 --- Gradient-Based Method --- p.11 / Chapter 3.1.2 --- Laplacian of Gaussian --- p.14 / Chapter 3.1.3 --- Canny edge detection --- p.15 / Chapter 3.2 --- Corner Detection --- p.18 / Chapter Chapter 4 --- Finding Candidate Lines --- p.22 / Chapter 4.1 --- Hough Transform --- p.22 / Chapter 4.1.1 --- What is Hough Transform --- p.22 / Chapter 4.1.2 --- Parameter Space --- p.22 / Chapter 4.1.3 --- Accumulator Array --- p.24 / Chapter 4.2 --- Gradient-based Hough Transform --- p.25 / Chapter 4.2.1 --- Direction of Gradient --- p.26 / Chapter 4.2.2 --- Accumulator Array --- p.28 / Chapter 4.2.3 --- Peaks in the accumulator array --- p.30 / Chapter 4.2.4 --- Performance of Gradient-based Hough Transform --- p.32 / Chapter Chapter 5 --- Signboards Locating --- p.35 / Chapter 5.1 --- Line Verification --- p.35 / Chapter 5.1.1 --- Line Segmentation --- p.35 / Chapter 5.1.2 --- Density Checking --- p.37 / Chapter 5.2 --- Finding Close Circuits --- p.40 / Chapter 5.3 --- Remove Redundant Segments --- p.47 / Chapter Chapter 6 --- Post processing --- p.54 / Chapter Chapter 7 --- Experiments and Conclusion --- p.59 / Chapter 7.1 --- Experimental Results --- p.59 / Chapter 7.2 --- Conclusion --- p.66 / Bibliography --- p.67
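The outline above names a gradient-based Hough transform. As an illustration of that general technique (not the thesis's implementation), the NumPy sketch below lets each strong edge pixel vote only near the line orientation implied by its own gradient direction; the Sobel kernels, magnitude threshold and voting window are illustrative assumptions.

    # Sketch of a gradient-based Hough transform for line detection: each edge
    # pixel votes in the (rho, theta) accumulator only in a small window around
    # the angle given by its gradient direction, instead of over all angles.
    import numpy as np

    def sobel_gradients(img):
        kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
        ky = kx.T
        pad = np.pad(img.astype(float), 1, mode="edge")
        gx = np.zeros(img.shape, dtype=float)
        gy = np.zeros(img.shape, dtype=float)
        h, w = img.shape
        for y in range(h):
            for x in range(w):
                win = pad[y:y + 3, x:x + 3]
                gx[y, x] = (win * kx).sum()
                gy[y, x] = (win * ky).sum()
        return gx, gy

    def gradient_hough(img, mag_thresh=100.0, n_theta=180):
        h, w = img.shape
        gx, gy = sobel_gradients(img)
        mag = np.hypot(gx, gy)
        diag = int(np.ceil(np.hypot(h, w)))
        acc = np.zeros((2 * diag, n_theta))
        thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta, endpoint=False)
        for y, x in zip(*np.nonzero(mag > mag_thresh)):
            theta = np.arctan2(gy[y, x], gx[y, x])  # gradient direction
            # vote only in a small window around the gradient-implied angle
            centre = np.argmin(np.abs(thetas - ((theta + np.pi / 2) % np.pi - np.pi / 2)))
            for ti in range(max(0, centre - 2), min(n_theta, centre + 3)):
                rho = int(round(x * np.cos(thetas[ti]) + y * np.sin(thetas[ti]))) + diag
                acc[rho, ti] += 1
        return acc, thetas

    if __name__ == "__main__":
        img = np.zeros((64, 64))
        img[:, 32] = 255.0  # a bright vertical line
        acc, thetas = gradient_hough(img)
        rho_i, theta_i = np.unravel_index(acc.argmax(), acc.shape)
        print(round(np.degrees(thetas[theta_i])))  # about 0 degrees: a vertical line (horizontal normal)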
|
657 |
Usabilidade na recuperação da informação : um enfoque no catálogo Athena / Banhos, Vângela Tatiana Madalena. January 2008 (has links)
Orientador: Edberto Ferneda / Banca: Silvana Aparecida Borsetti Gregório Vidotti / Banca: Guilherme Ataíde Dias / Resumo: A pesquisa realiza um estudo acerca de um catálogo específico, o Athena, considerado um sistema de recuperação de informação estruturado e organizado. Nesse ambiente se tem como objetivo avaliar um conjunto de diretrizes de usabilidade e aplicá-las em sistemas de recuperação de informação na Web, levando em consideração aspectos relativos não só à usabilidade de sua interface, mas também à sua eficiência no processo de recuperação. O estudo se caracteriza como exploratório e descritivo-analítico. Para tanto, procurou-se inicialmente revisar a literatura nacional e internacional sobre recuperação de informação e usabilidade, em várias fontes informacionais, impressas e eletrônicas, como embasamento teórico do trabalho. Em segunda etapa, foi realizada uma análise heurística, e por último foram realizados testes de usabilidade com usuários em que se aplicaram dois procedimentos: um questionário semi-estruturado e um instrumento de observação. Após as análises quantitativa e qualitativa dos dados, o teste com os usuários possibilitou verificar o modo como eles interagem com a interface do Catálogo Athena e as formas de busca que costumam realizar em outras ferramentas disponíveis na Web. Também foi possível validar alguns apontamentos feitos na análise heurística, pois a maioria dos participantes da pesquisa revelou não ter qualquer experiência na utilização do Catálogo Athena. Verifica-se, nesta pesquisa, a importância de se aplicar os testes com usuários em ambientes de recuperação de informação, considerando-os como parte fundamental no desenvolvimento de qualquer sistema. / Abstract: The research studies a specific catalog, Athena, regarded as a structured and organized information retrieval system. In this setting, the objective is to evaluate a set of usability guidelines and apply them to information retrieval systems on the Web, taking into account not only the usability of the interface but also its efficiency in the retrieval process. The study is exploratory and descriptive-analytical. The national and international literature on information retrieval and usability was first reviewed, across various printed and electronic information sources, as the theoretical grounding of the work. In a second stage, a heuristic analysis was carried out, and finally usability tests were conducted with users, applying two instruments: a semi-structured questionnaire and an observation protocol. After quantitative and qualitative analysis of the data, the user tests made it possible to verify how users interact with the Athena Catalog interface and the kinds of searches they usually perform in other tools available on the Web. It was also possible to validate some of the findings of the heuristic analysis, since most participants reported having no prior experience with the Athena Catalog. The research shows the importance of conducting user tests in information retrieval environments, considering them a fundamental part of the development of any system. / Mestre
|
658 |
Nouvelles méthodes pour la recherche sémantique et esthétique d'informations multimédia / Novel methods for semantic and aesthetic multimedia retrieval. Redi, Miriam 29 May 2013 (has links)
A l'ère d'Internet, la classification informatisée des images est d'une importance cruciale pour l'utilisation efficace de l'énorme quantité de données visuelles qui sont disponibles. Mais comment les ordinateurs peuvent-ils comprendre la signification d'une image? La Recherche d'Information Multimédia (RIM) est un domaine de recherche qui vise à construire des systèmes capables de reconnaître automatiquement le contenu d'une image. D'abord, des caractéristiques de bas niveau sont extraites et regroupées en signatures visuelles compactes. Ensuite, des techniques d'apprentissage automatique construisent des modèles qui font la distinction entre les différentes catégories d'images à partir de ces signatures. Ces modèles sont finalement utilisés pour reconnaître les propriétés d'une nouvelle image. Malgré les progrès dans le domaine, ces systèmes ont des performances en général limitées. Dans cette thèse, nous concevons un ensemble de contributions originales pour chaque étape de la chaîne RIM, en explorant des techniques provenant d'une variété de domaines qui ne sont pas traditionnellement liés avec le MMIR. Par exemple, nous empruntons la notion de saillance et l'utilisons pour construire des caractéristiques de bas niveau. Nous employons la théorie des Copulae étudiée en statistique économique, pour l'agrégation des caractéristiques. Nous réutilisons la notion de pertinence graduée, populaire dans le classement des pages Web, pour la récupération visuelle. Le manuscrit détaille nos solutions novatrices et montre leur efficacité pour la catégorisation d'image et de vidéo, et l'évaluation de l'esthétique. / In the Internet era, computerized classification and discovery of image properties (objects, scenes, emotions generated, aesthetic traits) is of crucial importance for the automatic retrieval of the huge amount of visual data surrounding us. But how can computers see the meaning of an image? Multimedia Information Retrieval (MMIR) is a research field that helps build intelligent systems which automatically recognize image content and its characteristics. In general, this is achieved by following a chain process: first, low-level features are extracted and pooled into compact image signatures. Then, machine learning techniques are used to build models able to distinguish between different image categories based on such signatures. Such a model is finally used to recognize the properties of a new image. Despite the advances in the field, human vision systems still substantially outperform their computer-based counterparts. In this thesis we therefore design a set of novel contributions for each step of the MMIR chain, aiming at improving the global recognition performance. In our work, we explore techniques from a variety of fields that are not traditionally related to Multimedia Retrieval, and embed them into effective MMIR frameworks. For example, we borrow the concept of image saliency from visual perception, and use it to build low-level features. We employ the Copula theory of economic statistics for feature aggregation. We re-use the notion of graded relevance, popular in web page ranking, for visual retrieval frameworks. We explain our novel solutions in detail and prove their effectiveness for image categorization, video retrieval and aesthetics assessment.
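As an illustration of the graded-relevance notion mentioned above, the short Python sketch below computes nDCG over a ranked list of graded labels, using the common exponential gain and logarithmic discount; the grading scheme and example labels are assumptions for illustration only.

    # Short sketch of graded-relevance evaluation (nDCG) for a ranked result list.
    import math

    def dcg(relevances):
        """Discounted cumulative gain for a ranked list of graded relevance labels."""
        return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(relevances))

    def ndcg(ranked_relevances, k=None):
        """Normalized DCG at cutoff k; relevance grades are non-negative integers."""
        ranked = ranked_relevances[:k] if k else ranked_relevances
        ideal = sorted(ranked_relevances, reverse=True)[:k] if k else sorted(ranked_relevances, reverse=True)
        ideal_dcg = dcg(ideal)
        return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

    if __name__ == "__main__":
        # graded labels (0 = irrelevant, 1 = partially relevant, 2 = highly relevant)
        # for the images returned by a retrieval system, in ranked order
        print(round(ndcg([2, 0, 1, 2, 0], k=5), 3))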
|
659 |
Recuperação de documentos e pessoas em ambientes empresariais através de árvores de decisão. / Documents and people retrieval in enterprises using decision tree. Fabrício Jailson Barth 29 May 2009 (has links)
Este trabalho avalia o desempenho do uso de árvores de decisão como função de ordenação para documentos e pessoas em ambientes empresariais. Para tanto, identificou-se atributos relevantes das entidades a serem recuperadas a partir da análise de: (i) dinâmica de produção e consumo de informações em um ambiente empresarial; (ii) algoritmos existentes na literatura para a recuperação de documentos e pessoas; e (iii) conceitos utilizados em funções de ordenação para domínios genéricos. Montou-se um ambiente de avaliação, utilizando a coleção de referência CERC, para avaliar a aplicabilidade do algoritmo C4.5 na obtenção de funções de ordenação para o domínio empresarial. O uso do algoritmo C4.5 para a construção de funções de ordenação mostrou-se parcialmente efetivo. Para a tarefa de recuperação de documentos não trouxe resultados bons. Porém, constatou-se que é possível controlar a forma de construção da função de ordenação a fim de otimizar a precisão nas primeiras posições do ranking ou otimizar a média das precisões (MAP). Para a tarefa de recuperação de pessoas o algoritmo C4.5 obteve uma árvore de decisão que consegue resultados melhores que todas as outras funções de ordenação avaliadas. O MAP obtido pela árvore de decisão foi 0,83, enquanto que a média do MAP das outras funções de ordenação foi de 0,74. Percebeu-se que a árvore de decisão utilizada para representar a função de ordenação contribui para a compreensão da composição dos diversos atributos utilizados na caracterização dos documentos e pessoas. A partir da análise da árvore de decisão utilizada como função de ordenação para pessoas foi possível entender que uma pessoa é considerada especialista em algum tópico se ela aparecer em muitos documentos, aparecer muitas vezes nos documentos e os documentos onde aparece têm uma relevância alta para a consulta. / This work evaluates the performance of decision trees used as ranking functions for documents and people in enterprises. Relevant attributes of the entities to be retrieved were identified from the analysis of: (i) the information production and consumption behavior in an enterprise; (ii) algorithms for document and people retrieval in the literature; and (iii) the concepts used in ranking functions for generic domains. An evaluation environment was set up, using the CERC reference collection, to assess the applicability of the C4.5 algorithm for obtaining ranking functions for the enterprise domain. The use of the C4.5 algorithm to build ranking functions proved to be partially effective. For document retrieval it did not achieve good results. However, it was found that it is possible to control how the ranking function is built in order to optimize either the precision at the first positions of the ranking or the mean average precision (MAP). For the people retrieval task, the C4.5 algorithm produced a decision tree that achieves better results than all other ranking functions assessed. The MAP obtained by the decision tree was 0.83, while the average MAP of the other ranking functions was 0.74. The decision tree used to represent the ranking function also contributes to understanding how the various attributes used to characterize documents and people combine. Through the analysis of the decision tree used as the ranking function for people, we could see that a person is considered an expert on a topic if he/she appears in many documents, appears many times in those documents, and the documents in which he/she appears are highly relevant to the query.
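As an illustration of using a learned decision tree as a ranking function, the Python sketch below (using scikit-learn rather than a C4.5 implementation) trains a small tree on made-up query-document features and ranks candidate documents by the tree's estimated probability of relevance. The features, training data and library choice are assumptions; they are not the attributes or setup used in this dissertation.

    # Minimal sketch: a decision tree as a ranking function; documents are
    # ordered by the tree's estimated probability of the "relevant" class.
    from sklearn.tree import DecisionTreeClassifier

    # hypothetical features per query-document pair:
    # [term-match score, document length (thousands of words), in-link count]
    X_train = [
        [0.90, 1.2, 30], [0.75, 0.8, 10], [0.40, 2.5, 2],
        [0.20, 0.5, 1],  [0.85, 1.0, 25], [0.10, 3.0, 0],
    ]
    y_train = [1, 1, 0, 0, 1, 0]  # 1 = relevant, 0 = not relevant

    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train, y_train)

    def rank(candidates):
        """candidates: list of (doc_id, feature_vector); returns doc_ids ordered
        by the tree's probability of relevance (highest first)."""
        scores = tree.predict_proba([f for _, f in candidates])[:, 1]
        return [doc for (doc, _), _ in sorted(zip(candidates, scores), key=lambda p: -p[1])]

    if __name__ == "__main__":
        docs = [("d1", [0.80, 1.1, 20]), ("d2", [0.15, 2.0, 1]), ("d3", [0.55, 0.9, 5])]
        print(rank(docs))  # documents most likely to be relevant come first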
|
660 |
Modelo social de relevância para opiniões. / S.O.R.M.: Social Opinion Relevance Model. Allan Diego Silva Lima 02 October 2014 (has links)
Esta tese apresenta um modelo de relevância de opinião genérico e independente de domínio para usuários de Redes Sociais. O Social Opinion Relevance Model (SORM) é capaz de estimar a relevância de uma opinião com base em doze parâmetros distintos. Comparado com outros modelos, a principal característica que distingue o SORM é a sua capacidade para fornecer resultados personalizados de relevância de uma opinião, de acordo com o perfil da pessoa para a qual ela está sendo estimada. Devido à falta de corpus de relevância de opiniões capazes de testar corretamente o SORM, fez-se necessária a criação de um novo corpus chamado Social Opinion Relevance Corpus (SORC). Usando o SORC, foram realizados experimentos no domínio de jogos eletrônicos que ilustram a importância da personalização da relevância para alcançar melhores resultados, baseados em métricas típicas de Recuperação de Informação. Também foi realizado um teste de significância estatística que reforça e confirma as vantagens que o SORM oferece. / This thesis presents a generic, domain-independent opinion relevance model for social network users. The Social Opinion Relevance Model (SORM) is able to estimate an opinion's relevance based on twelve different parameters. Compared to other models, SORM's main distinction is its ability to provide customized results, according to the profile of the user for whom the relevance is being estimated. Due to the lack of opinion relevance corpora able to properly test our model, we have created a new one called the Social Opinion Relevance Corpus (SORC). Using SORC, we carried out experiments in the electronic games domain that illustrate the importance of customizing opinion relevance in order to achieve better results, based on typical Information Retrieval metrics such as NDCG, QMeasure and MAP. We also performed a statistical significance test that reinforces and corroborates the advantages that SORM offers.
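As an illustration of one of the evaluation metrics cited above, the short Python sketch below computes mean average precision (MAP) over per-query ranked relevance judgments; the example judgments are made up, and the computation assumes every relevant item appears somewhere in the ranked list.

    # Short sketch of mean average precision (MAP) over ranked relevance judgments.
    def average_precision(ranked_relevant_flags):
        """ranked_relevant_flags: booleans, one per ranked opinion, True where
        the opinion was judged relevant (all relevant items assumed retrieved)."""
        hits, precisions = 0, []
        for rank, is_relevant in enumerate(ranked_relevant_flags, start=1):
            if is_relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / hits if hits else 0.0

    def mean_average_precision(rankings):
        """rankings: one list of relevance flags per query (or per user profile)."""
        return sum(average_precision(r) for r in rankings) / len(rankings)

    if __name__ == "__main__":
        rankings = [
            [True, False, True, True, False],   # ranking produced for user/query 1
            [False, True, False, False, True],  # ranking produced for user/query 2
        ]
        print(round(mean_average_precision(rankings), 3))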
|