Global ETD Search

1	Decentralized Web Search Haque, Md Rakibul 08 June 2012 Centrally controlled search engines will not be sufficient or reliable for indexing and searching the rapidly growing World Wide Web in the near future. A better solution is to enable the Web to index itself in a decentralized manner. Existing distributed approaches for ranking search results do not provide flexible searching, complete results, or highly accurate ranking. This thesis presents a decentralized Web search mechanism, named DEWS, which enables existing webservers to collaborate with each other to form a distributed index of the Web. DEWS ranks search results by query keyword relevance and by the relative importance of websites, in a distributed manner, by preserving a hyperlink overlay on top of a structured P2P overlay. It also supports approximate matching of query keywords using phonetic codes and n-grams along with list decoding of a linear covering code. DEWS supports incremental retrieval of search results in a decentralized manner, which reduces the network bandwidth required for query resolution. It uses an efficient routing mechanism that extends the Plexus routing protocol with a message aggregation technique. DEWS maintains replicas of indexes, which reduces routing hops and makes DEWS robust to webserver failures. The standard LETOR 3.0 dataset was used to validate the DEWS protocol. Simulation results show that the ranking accuracy of DEWS is close to the centralized case, while the network overhead for collaborative search and indexing is logarithmic in network size. The results also show that DEWS is resilient to changes in the available pool of indexing webservers and works efficiently even under heavy query load. Decentralized search engine P2P webserver ranking PageRank BM25 Computer Science
3	Investigating Search Algorithms for Shorter Documents : A study on how to search for titles / Undersökning av sökalgoritmer för kortare dokument : En studie i hur man söker på titlar Rostami, Lara January 2022 The objective of this thesis was to explore whether there are alternatives to the established search ranking algorithm Best Matching 25 (BM25) when searching shorter documents, in particular when searching for titles. Five search engines were compared to BM25, three of them being variants of the BM25 algorithm and the other two being based on a binary independence model that does not take term frequency or length normalisation into account. The evaluation data consisted of titles of Wikipedia articles from the fair ranking track of the main conference in the field, the Text REtrieval Conference (TREC), and user logs collected from user search queries at Spotify. None of the alternative models consistently outperformed the standard BM25 for queries q whose length in words ranges over 1 ≤ |q| ≤ 8. Yet for shorter queries, |q| ≤ 3, the binary independence model and BM25 adaptive term (BM25adpt) outperformed the standard BM25. Furthermore, a 1% increase in Mean Average Precision (MAP) score was obtained with a binary independence model and with BM25adpt compared to BM25 when sampling queries from the user log data. However, because of the bias in the evaluation data together with the small percentage increase in MAP score, it was concluded that the potential benefit of the methods explored in this thesis is not enough to justify switching from the BM25 algorithm when searching for titles. information retrieval search engine search algorithm BM25 short documents titles title search Computer Sciences
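The contrast drawn in this abstract, BM25 versus a model that ignores term frequency and length normalisation, can be sketched in a few lines. This is a minimal illustration of the textbook BM25 formula with a commonly used smoothed IDF, not the thesis's implementation; the function name `bm25_scores`, the default values `k1=1.2` and `b=0.75`, and the toy titles are assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; k1 and b defaults are assumptions.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls term-frequency saturation, b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

titles = [
    ["star", "wars"],
    ["star", "star", "trek", "the", "motion", "picture"],
    ["the", "wire"],
]
full = bm25_scores(["star"], titles)
# With k1 = 0 every matching term contributes only its IDF (a binary presence weight).
binary = bm25_scores(["star"], titles, k1=0.0, b=0.0)
```

Setting `k1 = 0` removes both term frequency and length normalisation from the score, which is one simple way to approximate the behaviour of a binary model; the binary independence models evaluated in the thesis may differ in their weighting.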
4	Smart Search Engine : A Design and Test of Intelligent Search of News with Classification Li, Chaoyang, Liu, Ke January 2021 Background: Google, Bing, and Baidu are the most commonly used search engines in the world, but they have some problems. For example, when searching for Jaguar, most of the search results are cars, not animals. This is the problem of polysemy. Search engines tend to return the most popular results rather than the most correct ones. Aim: We want to design and implement a search function and explore whether classifying news can improve precision when users search for news. Method: We collect data with a web crawler that crawls news articles from BBC News. We then use NLTK and an inverted index for data pre-processing, and BM25 for data processing. Results: Compared with the normal search function, our function has a lower recall rate and a higher precision. Conclusions: This search function can improve precision when people search for news. Implications: This search function can be used to search not only news but other content as well. It could be combined with machine learning that analyzes users' search habits in order to search and classify more accurately. Smart search precision recall rate NLTK inverted index BM25 Information Systems
5	Optimizing Search Engine Field Weights with Limited Data : Offline exploration of optimal field weight combinations through regression analysis / Optimering av sökmotorers fältvikter med begränsad data : Offline-utforskning av optimala fältviktskombinationer genom regressionsanalys Kader, Zino January 2023 Modern search engines, particularly those using the BM25 ranking algorithm, offer many tunable parameters for refining search results. Among these parameters, the weight of each searchable field plays a crucial role in improving search outcomes. Traditional methods of discovering optimal weight combinations, however, are often exploratory, take substantial time, and risk delivering substandard results during testing. This thesis proposes a streamlined solution: an ordinal-regression-based model designed to identify optimal weight combinations from minimal data, within an offline testing environment. The evaluation corpus is a comprehensive snapshot of a product search database from Tradera. The top 100 search queries and the corresponding search results pages on the Tradera platform were divided into a training set and an evaluation set. The model was trained iteratively on the training set and then tested on the evaluation set, with progressively increasing amounts of labeled data. This approach made it possible to examine how well the model derives high-performance weight combinations from limited data. The experiments confirmed that the proposed model generated promising weight combinations, even with restricted data, and generalized robustly to the evaluation dataset. In conclusion, this research substantiates the significant potential for improving search results by tuning searchable field weights with a regression-based model, even in data-scarce scenarios. Information retrieval Search engines BM25 (Best Match 25) Regression analysis Parameter estimation Learning to rank Computer Sciences Software Engineering Computer Engineering
6	Rocchio, Ide, Okapi och BIM : En komparativ studie av fyra metoder för relevance feedback / Rocchio, Ide, Okapi and BIM : A comparative study of four methods for relevance feedback Eriksen, Martin January 2008 This thesis compares four relevance feedback methods: the Rocchio and Ide dec-hi algorithms for the vector space model, and the binary independence model and Okapi BM25 within the probabilistic framework. The comparison is carried out in a custom-made information retrieval system using a collection of 131 896 LA-Times articles that is part of the TREC ad-hoc collection. The methods are compared on two grounds: using only the relevance information from the 20 highest ranked documents of an initial search, and using all available relevance information. Although a significant effect of the choice of method was found on the first ground, post-hoc analysis could not determine any statistically significant differences between the methods, with Rocchio, Ide dec-hi and Okapi BM25 performing equivalently. All methods except the binary independence model performed significantly better than using no relevance feedback. Although the binary independence model performed far worse on average than the other methods, it outperformed them on nearly 20% of the topics. Further analysis argued that this is due to the lack of query expansion in the binary independence model, which is advantageous for some topics but has a negative effect on retrieval efficiency in general. On the second ground, Okapi BM25 performed significantly better than the other methods, with the binary independence model once again the worst performer. It was argued that the other methods have problems scaling to large amounts of relevance information, whereas Okapi BM25 has no such issues. 
/ Thesis level: D relevance feedback information retrieval Rocchio Ide dec-hi Okapi BM25 vector space model query expansion classical probabilistic model Social Sciences
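The role query expansion plays in this comparison can be illustrated with the textbook Rocchio update for the vector space model. This is a sketch under stated assumptions, not the system built in the thesis: the function name `rocchio`, the sparse dictionary representation, and the weights `alpha=1.0`, `beta=0.75`, `gamma=0.15` are illustrative choices.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted, expanded query vector (term -> weight).

    `query` and each document are sparse vectors given as dicts.
    The alpha/beta/gamma defaults are assumed values for illustration only.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)

    def centroid_weight(term, docs):
        # Mean weight of `term` over a set of documents (0 if the set is empty).
        if not docs:
            return 0.0
        return sum(doc.get(term, 0.0) for doc in docs) / len(docs)

    new_query = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid_weight(term, relevant)
                  - gamma * centroid_weight(term, non_relevant))
        # Negative weights are conventionally dropped.
        if weight > 0:
            new_query[term] = weight
    return new_query

expanded = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    non_relevant=[{"jaguar": 1.0, "car": 1.0}],
)
```

Here the term `cat` enters the query because it occurs in a document judged relevant, while `car` is pushed below zero and dropped. A feedback method that only reweights the original query terms, without adding new ones, would leave the query at the single term `jaguar`.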
7	Relevance Analysis for Document Retrieval Labouve, Eric 01 March 2019 Document retrieval systems recover documents from a dataset and order them according to their perceived relevance to a user’s search query. This is a difficult task for machines because there is a semantic gap between the meaning of the terms in a user’s literal query and the user’s true intentions. Even with the ambiguity that arises from a lack of context, users still expect the set of documents returned by a search engine to be both highly relevant to their query and properly ordered. This thesis focuses on document retrieval systems that order documents from unstructured, textual corpora using text queries. The main goal of the study is to enhance the Okapi BM25 document retrieval model. The research hypothesizes that the structure of text inside documents and queries holds valuable semantic information that can be incorporated into the Okapi BM25 model to increase its performance. Modifications that account for a term’s part of speech, the proximity between a pair of related terms, the proximity of a term with respect to its location in a document, and query expansion are used to augment Okapi BM25. The study produced 87 modifications, all of which were validated using open source corpora. The top scoring modification from the validation phase was then tested on the Lisa corpus, where it performed 10.25% better than Okapi BM25 under mean average precision. Compared against two industry standard search engines, Lucene and Solr, the top scoring modification outperforms them by up to 21.78% and 23.01%, respectively. Semantic Analysis Document Retrieval Query Expansion Term Proximity Search Okapi BM25 Computer and Systems Architecture Data Storage Systems
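Several of the abstracts in this listing report results as mean average precision (MAP). The sketch below shows the standard definition for binary relevance judgments; the function names and the toy rankings are assumptions for illustration and are not taken from any of the theses.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list given the set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
map_score = mean_average_precision([
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
])
```

In the first ranking the relevant documents sit at ranks 1 and 3, giving precisions 1/1 and 2/3, so the average precision is their mean. A relative improvement such as the 10.25% quoted above is a ratio of two MAP scores computed this way over the same query set.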
8	Vícejazyčný systém pro odpovídání na otázky nad otevřenou doménou / Multilingual Open-Domain Question Answering Slávka, Michal January 2021 This thesis deals with automatic multilingual open-domain question answering and proposes approaches to this little-explored domain. Specifically, it examines whether (i) using translation from English is sufficient, (ii) multilingual systems can make use of translating the question into other languages, or (iii) it is more advantageous to use no translation at all. I compare an English system based on the T5 model, which relies on machine translation, with natively multilingual systems based on the multilingual MT5 model. The English system with machine translation slightly outperforms its monolingual counterparts on several tasks. Although this model was trained on a larger amount of data, the improvement is not sufficiently significant. This shows that natively multilingual systems are a promising approach for future research. I also present a method for retrieving documents in different languages using the BM25 algorithm and compare it with English retrieval. Using multilingual evidence appears to be beneficial and improves the performance of the systems.

Search results