471 |
Web Search Based on Hierarchical Heading-Block Structure Analysis / 階層的な見出しブロック構造の分析に基づくWeb検索
Manabe, Tomohiro 23 March 2016
The contents of Section 2.2 and Chapter 4 first appeared in the proceedings of the 12th International Conference on Web Information Systems and Technologies, 2016 (www.webist.org). The contents of Section 2.3 and Chapter 5 first appeared in DBSJ Journal, vol. 14, article no. 2, March 2016. The contents of Section 2.5 and Chapter 7 first appeared in the proceedings of the 11th Asia Information Retrieval Societies Conference, Lecture Notes in Computer Science, vol. 9460, pp. 188-200, 2015 (the final publication is available at link.springer.com). / Kyoto University / 0048 / New-system course doctorate / Doctor of Informatics / Kō No. 19854 / Jōhaku No. 605 / New system||Informatics||105 (University Library) / 32890 / Department of Social Informatics, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Keishi Tajima, Professor Katsumi Tanaka, Professor Masatoshi Yoshikawa / Conferred under Article 4, Paragraph 1 of the Degree Regulations / DFAM
|
472 |
Clustering in Swedish: The Impact of some Properties of the Swedish Language on Document Clustering and an Evaluation Method
Rosell, Magnus January 2005
Text clustering divides a set of texts into groups, so that texts within each group are similar in content. It may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on known ones. The contributions of this thesis are an investigation of text representation for Swedish and an evaluation method that uses two or more manual categorizations. Text clustering, at least as it is treated here, is performed using the vector space model, which is commonly used in information retrieval. This model represents texts by the words that appear in them and considers texts similar in content if they share many words. Languages differ in what is considered a word. We have investigated the impact of some of the characteristics of Swedish on text clustering. Since Swedish has more morphological variation than, for instance, English, we have used a stemmer to strip suffixes. This gives moderate improvements and reduces the number of words in the representation. Swedish has a rich production of solid compounds. Most of the constituents of these are used on their own as words and in several different compounds. In fact, Swedish solid compounds often correspond to phrases or open compounds in other languages. In the ordinary vector space model the constituents of compounds are not accounted for when calculating the similarity between texts. To use them we have employed a spell checking program to split compounds. The results clearly show that this is beneficial. The vector space model does not regard word order. We have tried to extend it with nominal phrases in different ways. None of our experiments have shown any improvement over using the ordinary model. Evaluation of text clustering results is very hard. What is a good partition of a text set is inherently subjective. Automatic evaluation methods are either internal or external. Internal quality measures use the representation in some manner. Therefore they are not suitable for comparisons of different representations. External quality measures compare a clustering with a (manual) categorization of the same text set. The theoretical best possible value for a measure is known, but it is not obvious what a good value is -- text sets differ in difficulty to cluster and categorizations are more or less adapted to a particular text set. We describe an evaluation method for cases where a text set has more than one categorization. In such cases the result of a clustering can be compared with the result for one of the categorizations, which we assume is a good partition. We also describe the kappa coefficient as a clustering quality measure in the same setting. / Textklustring delar upp en mängd texter i grupper, så att texterna inom dessa liknar varandra till innehåll. Man kan använda textklustring för att uppdaga strukturer och innehåll i okända textmängder och för att få nya perspektiv på redan kända. Bidragen i denna avhandling är en undersökning av textrepresentationer för svenska texter och en utvärderingsmetod som använder sig av två eller fler manuella kategoriseringar. Textklustring, åtminstone som det beskrivs här, utnyttjar sig av den vektorrumsmodell, som används allmänt inom området. I denna modell representeras texter med orden som förekommer i dem och texter som har många gemensamma ord betraktas som lika till innehåll. Vad som betraktas som ett ord skiljer sig mellan språk. Vi har undersökt inverkan av några av svenskans egenskaper på textklustring. Eftersom svenska har större morfologisk variation än till exempel engelska har vi tagit bort suffix med hjälp av en stemmer. Detta ger lite bättre resultat och minskar antalet ord i representationen. I svenska används och skapas hela tiden fasta sammansättningar. De flesta delar av sammansättningar används som ord på egen hand och i många olika sammansättningar. Fasta sammansättningar i svenska språket motsvarar ofta fraser och öppna sammansättningar i andra språk. Delarna i sammansättningar används inte vid likhetsberäkningen i vektorrumsmodellen. För att utnyttja dem har vi använt ett rättstavningsprogram för att dela upp sammansättningar. Resultaten visar tydligt att detta är fördelaktigt. I vektorrumsmodellen tas ingen hänsyn till ordens inbördes ordning. Vi har försökt utvidga modellen med nominalfraser på olika sätt. Inga av våra experiment visar på någon förbättring jämfört med den vanliga enkla modellen. Det är mycket svårt att utvärdera textklustringsresultat. Det ligger i sakens natur att vad som är en bra uppdelning av en mängd texter är subjektivt. Automatiska utvärderingsmetoder är antingen interna eller externa. Interna kvalitetsmått utnyttjar representationen på något sätt. Därför är de inte lämpliga att använda vid jämförelser av olika representationer. Externa kvalitetsmått jämför en klustring med en (manuell) kategorisering av samma mängd texter. Det teoretiskt bästa värdet för måtten är känt, men vad som är ett bra värde är inte uppenbart -- mängder av texter skiljer sig åt i svårighet att klustra och kategoriseringar är mer eller mindre lämpliga för en speciell mängd texter. Vi beskriver en utvärderingsmetod som kan användas då en mängd texter har mer än en kategorisering. I sådana fall kan resultatet för en klustring jämföras med resultatet för en av kategoriseringarna, som vi antar är en bra uppdelning. Vi beskriver också kappakoefficienten som ett kvalitetsmått för klustring under samma förutsättningar. / QC 20101220
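The effect of compound splitting on vector space similarity can be illustrated with a minimal sketch. It assumes a tiny hand-written table (COMPOUNDS, hypothetical) in place of the spell checking program, and it keeps each compound alongside its constituents, which is only one possible choice and not necessarily the one made in the thesis.

```python
import math
from collections import Counter

# Hypothetical lookup table standing in for the spell checker's compound analysis.
COMPOUNDS = {
    "textklustring": ["text", "klustring"],
    "textmängd": ["text", "mängd"],
}

def tokens(text, split_compounds=False):
    """Lower-case whitespace tokens; optionally add compound constituents."""
    out = []
    for word in text.lower().split():
        out.append(word)
        if split_compounds:
            out.extend(COMPOUNDS.get(word, []))
    return out

def cosine(a, b):
    """Cosine similarity between two token lists in a bag-of-words model."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

doc1 = "textklustring av nyheter"
doc2 = "klustring av text"

plain = cosine(tokens(doc1), tokens(doc2))
split = cosine(tokens(doc1, True), tokens(doc2, True))
print(plain, split)
```

Without splitting, the two texts share only the word "av"; once "textklustring" contributes "text" and "klustring", they share three terms and the similarity rises.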
|
473 |
A Content Analysis of Student Conduct Codes
Martin, Janice Earlene 28 April 2004
Scholars in the field of student judicial affairs have recommended that institutions remove all legal terminology and references from student conduct codes and create codes based on student development theory and practice (Dannells, 1997; Gehring, 2001; Stoner & Cerminara, 1990; Stoner, 2000). The purpose of this study was to analyze student conduct codes to determine the extent to which college and university administrators have adopted Stoner and Cerminara, Gehring, and Pavela's suggestions.
This study is a content analysis of student conduct codes. The data were collected by using a stratified randomly selected group of Carnegie classified institutions and examining the student conduct code for each institution from the respective institution's website. Descriptive statements were used to code and analyze the data. The study results show that only 20% of the institutions in the study had taken the advice of the judicial scholars and removed all legalistic language. Therefore, the majority of the institutions in this study, regardless of institutional type or size, need to reexamine and modify their student conduct codes. / Master of Arts
|
474 |
Information and Representation Tradeoffs in Document Classification
Jin, Timothy 23 May 2022
No description available.
|
475 |
On-Line Electronic Document Collaboration and Annotation
Harmon, Trev R. 11 November 2006
The Internet provides a powerful medium for communication and collaboration. The ability to connect and interact with web-based tools from anywhere in the world makes the Internet ideal for such tasks. However, the lack of native tools can be a hindrance when deploying collaborative initiatives, as many current projects require specialized software in order to operate. This thesis demonstrates that, with the comparatively recent advances in browser technology and Document Object Model (DOM) implementation, a web-based collaborative annotation system can be developed that a user can access through a standards-compliant web browser. Such a system, demonstrated to work on the commonly used web browsers constituting the vast majority of web traffic, was implemented using open-source tools and industry-recognized standards. Additionally, it accepts static copies of most standard document formats for both handwritten and typed annotations, while maintaining an archived copy of the original. The system developed for this thesis lends itself to use in a number of different process domains, as most collaborative annotation approaches can be described by a single process model. While a number of possible usage scenarios are discussed, this thesis approaches system usage only in an academic setting, focusing on applicability of the system to electronic grading and document exchange. From here, additional system usage can be easily extrapolated.
|
476 |
Bisecting Document Clustering Using Model-Based Methods
Davis, Aaron Samuel 09 December 2009
We all have access to large collections of digital text documents, which are useful only if we can make sense of them and distill important information from them. Good document clustering algorithms that organize such information automatically in meaningful ways can make a difference in how effective we are at using that information. In this paper we use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human-annotated document classes when compared to the existing base clustering algorithms, with marked improvement in two out of three data sets.
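The bisecting idea with EM on a mixture of multinomials can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's implementation: the function names (em_bisect, bisecting), the deterministic seeding, the smoothing constant alpha, and the rule of always splitting the largest cluster are all assumptions, and the Gibbs sampling variant and the final EM refinement step are not shown.

```python
import math
from collections import Counter

def em_bisect(docs, iters=30, alpha=0.1):
    """Split docs (word-count Counters) in two with EM on a two-component
    multinomial mixture; returns one 0/1 label per document."""
    vocab = sorted({w for d in docs for w in d})
    # Assumed deterministic seeding: component 0 starts from the first document,
    # component 1 from the document sharing the fewest words with it.
    far = min(range(len(docs)),
              key=lambda i: sum(min(c, docs[0][w]) for w, c in docs[i].items()))
    resp = [[0.5, 0.5] for _ in docs]
    resp[0] = [1.0, 0.0]
    resp[far] = [0.0, 1.0]
    for _ in range(iters):
        # M-step: smoothed mixing weights and per-component word probabilities.
        log_prior, log_theta = [], []
        for k in (0, 1):
            weight = sum(r[k] for r in resp)
            log_prior.append(math.log((weight + 1.0) / (len(docs) + 2.0)))
            counts = {w: alpha + sum(r[k] * d[w] for r, d in zip(resp, docs))
                      for w in vocab}
            total = sum(counts.values())
            log_theta.append({w: math.log(c / total) for w, c in counts.items()})
        # E-step: posterior responsibility of each component for each document.
        for i, d in enumerate(docs):
            lp = [log_prior[k] + sum(c * log_theta[k][w] for w, c in d.items())
                  for k in (0, 1)]
            top = max(lp)
            ex = [math.exp(v - top) for v in lp]
            resp[i] = [e / sum(ex) for e in ex]
    return [0 if r[0] >= r[1] else 1 for r in resp]

def bisecting(docs, k):
    """Repeatedly bisect the largest cluster (an assumed rule) until k clusters."""
    clusters = [list(range(len(docs)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        big = clusters.pop()
        labels = em_bisect([docs[i] for i in big]) if len(big) > 1 else [0]
        left = [i for i, lab in zip(big, labels) if lab == 0]
        right = [i for i, lab in zip(big, labels) if lab == 1]
        if not left or not right:      # split failed; stop early
            clusters.append(big)
            break
        clusters += [left, right]
    return clusters

texts = ["cat dog cat pet cat", "dog pet cat dog",
         "stock market stock trade", "market trade stock market",
         "goal match goal team", "team match goal team"]
docs = [Counter(t.split()) for t in texts]
clusters = bisecting(docs, 3)
print(clusters)
```

On this toy input, each bisection peels off a group of documents with a shared vocabulary, which mirrors the abstract's description of deriving more cohesive clusters from larger, more diverse ones.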
|
477 |
Status of Accountability in Online News Media: A Case Study of Nepal
Acharya, Bhanu Bhakta January 2014
Scholars contend that media accountability to the public and professional stakeholders has been improving in recent years because of the increased use of digital platforms. Since most studies related to online news media accountability have focused on developed countries, this research study examines the state of accountability in online news media in Nepal, where access to online media is very limited and audiences are barely aware of media's journalistic responsibilities. By employing a case study research method with three data sources, this research study assesses the state of online media accountability in Nepal, key challenges for ensuring accountability in journalism created using digital platforms, and the role of audiences in making online news media accountable. The study finds that Internet accessibility, media literacy, and availability of resources are the primary challenges to making media accountable in Nepal. The study concludes by offering recommendations for future research and practical applications.
|
478 |
A difference analysis method for detecting differences between similar documents / En differens-analysmetod för att upptäcka skillnader mellan liknande dokument
Serra, Andreas January 2017
Similarity analysis of documents is a well-studied field. With a focus instead on the opposite concept, how can we try to define and distinguish the differences within documents? This project tries to determine whether differences within documents can be detected as well as quantified based on their semantic qualities. We propose a method for quantifying differences by applying tf-idf-based models with analysis methods for lemmatization and synonym extraction, together with utility ranking algorithms. The method is implemented and tested. The results show that the method has potential but that further studies are required in order to fully evaluate to what extent it could be of practical use. Such a method could, however, bring significant benefits within several different fields in which automatic difference detection could replace error-prone manual labor in document management, and could serve other beneficial purposes such as providing automatically generated difference summaries. / Likhetsanalys mellan dokument är ett välutforskat område. Med fokus istället på motsatsen, hur kan vi försöka definiera och särskilja skillnaderna mellan dokument? Detta projekt försöker undersöka om skillnader mellan dokument kan detekteras samt kvantifieras baserat på deras semantiska kvalitéer. Vi föreslår en metod för kvantifiering av skillnader genom att applicera tf-idf-baserade modeller tillsammans med analysmetoder för lemmatisering och synonymextrahering, i kombination med utilitetsrankningsalgoritmer. Metoden implementeras och testas. Resultaten visar att metoden har potential men att det krävs ytterligare studier för att fullt ut avgöra till vilken grad den skulle kunna vara praktiskt användbar. En sådan metod skulle dock kunna erbjuda stora fördelar för ett flertal olika discipliner, där automatisk skillnadsdetektering skulle kunna ersätta felbenäget manuellt arbete gällande dokumentationshantering, samt också fylla andra förmånliga syften som t.ex. att kunna erbjuda automatgenererade skillnadssammanfattningar.
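One way to picture tf-idf-based difference quantification is to weight each term in two similar documents and rank terms by how far their weights diverge. The sketch below is only an assumption-laden illustration: the helper names (tfidf, rank_differences), the smoothed idf formula, and the absolute-difference ranking are choices made here, and the lemmatization, synonym extraction, and utility ranking steps named in the abstract are left out.

```python
import math
from collections import Counter

def tfidf(doc_tokens, corpus):
    """Term weights for one tokenized document against a reference corpus."""
    tf = Counter(doc_tokens)
    n = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)      # document frequency
        idf = math.log((1 + n) / (1 + df)) + 1.0      # smoothed idf (assumed form)
        scores[term] = count / len(doc_tokens) * idf
    return scores

def rank_differences(a, b, corpus):
    """Terms sorted by absolute tf-idf weight difference between a and b."""
    wa, wb = tfidf(a, corpus), tfidf(b, corpus)
    diff = {t: abs(wa.get(t, 0.0) - wb.get(t, 0.0)) for t in set(wa) | set(wb)}
    return sorted(diff.items(), key=lambda kv: (-kv[1], kv[0]))

a = "the contract ends in march".split()
b = "the contract ends in april".split()
corpus = [a, b, "the meeting is in march".split()]

ranked = rank_differences(a, b, corpus)
for term, score in ranked:
    print(term, round(score, 4))
```

Terms shared by both documents cancel out, so the ranking surfaces the words that distinguish the two versions, with rarer words weighted more heavily.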
|
479 |
Schemalysis: Visualization of a Sub-Schemas in Document NoSQL Databases
DePero, Andrew Joseph 14 December 2022
No description available.
|
480 |
Improving Document Clustering by Refining Overlapping Cluster Regions
Upadhye, Akshata Rajendra January 2022
No description available.
|