361 |
應用文字探勘技術於臺灣上市公司重大訊息對股價影響之研究 / A study on the impact of material information of publicly listed companies in Taiwan on their stock prices using a text mining approach. 吳漢瑞, Wu, Han Ruei. Unknown Date (has links)
The Taiwan stock market is shallow, so external news easily moves stock prices; it is also dominated by individual retail investors, whose investment decisions are influenced by outside information. The effect that material information disclosures have on a company's stock price therefore deserves further investigation.
This study takes material information announcements from the Market Observation Post System as its data source, collecting 1,382 announcements published between 2005 and 2009 by four listed companies: Uni-President, Chunghwa Telecom, EVA Airways, and Taiwan Business Bank. A kNN text mining algorithm groups the four companies' announcements, the price impact of each group's disclosures is analyzed, and the upward or downward price trend of each group is identified, so that future announcements can be analyzed as they are released to estimate their effect on the stock price and the two-day post-announcement return trend, providing a reference for selecting investment targets.
The results show significantly abnormal trading volumes from two days before to two days after an announcement, confirming that disclosures do affect the sampled companies' stocks. Different kinds of material information fall into different groups, each with its own price trend. On the test data, classification reached an overall average accuracy of 65%, and over 80% for the "up" category. For post-announcement cumulative returns, the average investment accuracy exceeded 60%.
Through systematic analysis and prediction, this study saves investors the time of searching for and interpreting material information and provides them with a point of reference. / In this study we used text mining to classify companies' material information announcements and to analyze how their disclosure affects the market, so that stock price movements can be predicted from the announcements and the outcome used as a reference for investment.
This study chose Taiwan's Market Observation Post System, the official disclosure channel, as its source of material information. We selected UNI-PRESIDENT ENTERPRISES CORP, Chunghwa Telecom Co., Ltd, EVA AIRWAYS CORPORATION and Taiwan Business Bank for their strong information-disclosure evaluations, and collected 1,382 material information announcements from 2005 to 2009. For classification we selected the kNN algorithm for its better performance.
We conducted three experiments. First, we found that trading volumes in the periods before and after an announcement differed significantly. Second, classification of the test data achieved over 60% overall accuracy. Finally, the return rate of the "up" group showed over 60% probability of rising, and that of the "down" group over 60% probability of falling.
In this study, we built a time-saving automatic system that groups material information announcements and identifies those that are valuable. Based on our results, we provide investors with a reference for their investment strategy, and we also suggest some directions for future research.
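The core of the described approach, representing each announcement as a TF-IDF vector and labeling a new one by a nearest-neighbour vote, can be sketched as follows. The toy English token lists and the "up"/"down" labels are invented for illustration; the thesis works on Chinese-language announcements with its own preprocessing and features.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a TF-IDF vector (dict of term -> weight) for each token list."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(train_vecs, train_labels, query_vec, k=3):
    """Label the query by majority vote among its k most similar training vectors."""
    ranked = sorted(((cosine(query_vec, v), lbl)
                     for v, lbl in zip(train_vecs, train_labels)), reverse=True)
    return Counter(lbl for _, lbl in ranked[:k]).most_common(1)[0][0]

# Invented token lists standing in for preprocessed announcements:
docs = ["earnings increase dividend announcement".split(),
        "profit increase revenue growth".split(),
        "loss decline lawsuit".split(),
        "loss litigation decline".split(),
        "revenue growth profit".split()]       # the last one is the query
labels = ["up", "up", "down", "down"]
vecs = tfidf_vectors(docs)                     # idf computed over all five docs
print(knn_predict(vecs[:4], labels, vecs[4], k=1))  # -> up
```

Computing the idf over the pooled training and query documents keeps the sketch short; a production system would fix the idf table on the training set.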
|
362 |
On text mining to identify gene networks with a special reference to cardiovascular disease / Identifiering av genetiska nätverk av betydelse för kärlförkalkning med hjälp av automatisk textsökning i Medline, en medicinsk litteraturdatabas. Strandberg, Per Erik. January 2005 (has links)
The rate at which articles get published grows exponentially, and the possibility to access texts in machine-readable formats is also increasing. The need for automated systems that gather relevant information from text, that is, text mining, is thus growing.

The goal of this thesis is to find a biologically relevant gene network for atherosclerosis, the main cause of cardiovascular disease, by inspecting gene co-occurrences in abstracts from PubMed. In addition, gene nets for yeast were generated to evaluate the validity of text mining as a method.

The nets found were validated in many ways; for example, they were found to have the well-known power-law link distribution. They were also compared to gene nets generated by other, often microbiological, methods from different sources. In addition to classic measures of similarity such as overlap, precision, recall and F-score, a new way to measure similarity between nets is proposed and used. The method uses an urn approximation and measures, in standard deviations, how far the observed overlap lies from what would be expected when comparing two unrelated nets. The validity of this approximation is supported both analytically and with simulations, for Erdős–Rényi nets as well as nets with a power-law link distribution. The new method explains how very poor overlap, precision, recall and F-score can still be very far from random, and also how much overlap one could expect at random. The cutoff was also investigated.

Results typically show only about 1% overlap, but at the remarkable distance of 100 standard deviations from what one could expect at random. Of particular interest is that one can expect an overlap of only 2 edges, with a variance of 2, when comparing two trees on the same set of nodes. The use of a cutoff of one for co-occurrence graphs is discussed and motivated, for example by the observation that it eliminates about 60-70% of the false positives but only 20-30% of the overlapping edges.
This thesis shows that text mining of PubMed can be used to generate a biologically relevant subnet of the human gene net. A reasonable extension of this work is to combine the nets with gene expression data to find a more reliable gene net.
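The two central ingredients, building a co-occurrence net from abstracts and measuring how far the overlap of two nets lies from random under an urn (hypergeometric) approximation, can be sketched as follows. The gene symbols and net sizes are illustrative, not taken from the thesis.

```python
import math
from itertools import combinations

def cooccurrence_edges(abstracts):
    """Collect gene pairs that co-occur in at least one abstract (cutoff of one)."""
    edges = set()
    for genes in abstracts:
        for a, b in combinations(sorted(set(genes)), 2):
            edges.add((a, b))
    return edges

def random_overlap_z(n_nodes, edges1, edges2):
    """Urn approximation: treat each net's edges as draws without replacement
    from the C(n,2) possible edges, and express the observed overlap as a
    distance in standard deviations from the hypergeometric expectation."""
    m = n_nodes * (n_nodes - 1) // 2                        # urn size
    e1, e2 = len(edges1), len(edges2)
    observed = len(edges1 & edges2)
    expected = e1 * e2 / m                                  # hypergeometric mean
    var = expected * (1 - e1 / m) * (m - e2) / (m - 1)      # hypergeometric variance
    return observed, expected, (observed - expected) / math.sqrt(var)

# Two tiny nets built from hypothetical per-abstract gene lists:
net_a = cooccurrence_edges([["TP53", "APOE", "LDLR"], ["APOE", "CETP"]])
net_b = cooccurrence_edges([["APOE", "LDLR", "CETP"]])
obs, exp, z = random_overlap_z(20, net_a, net_b)
print(obs, round(exp, 3), round(z, 1))
```

Even with a tiny overlap, the z-score can be large when the urn of possible edges is large, which is the effect the thesis reports (about 1% overlap, yet around 100 standard deviations from random).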
|
363 |
Matching Vehicle License Plate Numbers Using License Plate Recognition and Text Mining Techniques. Oliveira Neto, Francisco Moraes. 01 August 2010 (has links)
License plate recognition (LPR) technology has been widely applied in many different transportation applications such as enforcement, vehicle monitoring and access control. In most applications involving enforcement (e.g. cashless toll collection, congestion charging) and access control (e.g. car parking) a plate is recognized at one location (or checkpoint) and compared against a list of authorized vehicles. In this research I dealt with applications where a vehicle is detected at two locations and there is no list of reference for vehicle identification.
There has been very little effort in the past to exploit all the information generated by LPR systems. Nowadays, LPR machines can recognize most characters on vehicle plates even under the harshest practical conditions. Therefore, even though the equipment is not perfect at plate reading, it is still possible to judge with some confidence whether a pair of imperfect readings, in the form of character sequences (strings), most likely belongs to the same vehicle. The challenge is to design a matching procedure that decides whether or not they do.
In view of this problem, this research designed and assessed a matching procedure that takes advantage of a similarity measure between two strings called the edit distance (ED). The ED measures the minimum editing cost of converting one string into another. The study first assessed a simple dual-LPR setup using the traditional ED formulation with 0/1 cost assignments (0 if a pair of characters is the same, 1 otherwise). For this dual setup, the research further proposed a symbol-based weight function built on a probabilistic approach, taking as input the conditional probability matrix of character association. This new formulation outperformed the original ED formulation. Lastly, the research incorporated passage time information into the procedure, which improved matching performance considerably, yielding a high positive matching rate and a much lower (about 2%) false matching rate.
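A minimal sketch of the two formulations follows: the traditional 0/1-cost edit distance, and a symbol-weighted variant where substitutions between characters that LPR hardware commonly confuses cost less than 1. The confusion set and the 0.3 weight are hypothetical stand-ins for the thesis's weights derived from the conditional probability matrix of character association.

```python
def edit_distance(a, b, sub_cost=lambda x, y: 0 if x == y else 1):
    """Levenshtein distance with a pluggable substitution-cost function."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete everything
    for j in range(n + 1):
        d[0][j] = j                       # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                 # deletion
                          d[i][j - 1] + 1,                                 # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))  # substitution
    return d[m][n]

# Characters that plate readers often confuse (illustrative set):
CONFUSABLE = {frozenset("O0"), frozenset("B8"), frozenset("I1"), frozenset("S5")}

def lpr_sub_cost(x, y):
    """Cheaper substitutions for visually confusable character pairs."""
    if x == y:
        return 0.0
    return 0.3 if frozenset((x, y)) in CONFUSABLE else 1.0

print(edit_distance("ABC1230", "ABC123O"))                # -> 1 (plain ED)
print(edit_distance("ABC1230", "ABC123O", lpr_sub_cost))  # -> 0.3 (weighted)
```

Under the weighted cost, a pair of readings differing only in confusable characters scores much closer than a genuinely different plate, which is the intuition behind the symbol-based weight function.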
|
364 |
Concept Based Knowledge Discovery from Biomedical Literature. Radovanovic, Aleksandar. January 2009 (has links)
This thesis describes and introduces novel methods for knowledge discovery and presents a software system that extracts information from biomedical literature, reviews interesting connections between various biomedical concepts and, in so doing, generates new hypotheses. The experimental results obtained with the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described. This thesis shows how the technology presented can be integrated with the researchers' own knowledge, experimentation and observations for optimal progression of scientific research.
|
365 |
The development of a single nucleotide polymorphism database for forensic identification of specified physical traits. Alecia Geraldine Naidu. January 2009 (has links)
Many Single Nucleotide Polymorphisms (SNPs) found in coding or regulatory regions within the human genome lead to phenotypic differences that make prediction of physical appearance, based on genetic analysis, potentially useful in forensic investigations. Complex traits such as pigmentation can be predicted from the genome sequence, provided that genes with strong effects on the trait exist and are known. Phenotypic traits may also be associated with variations in gene expression due to the presence of SNPs in promoter regions. In this project, genes associated with physical traits of potential forensic relevance were collated from the literature using a text mining platform and hand curation. The SNPs associated with these genes were acquired from public SNP repositories such as the International HapMap project, dbSNP and Ensembl. Different population groups were characterized based on these SNPs, and the results and data were stored in a MySQL database. This database contains SNP genotyping data with respect to physical phenotypic differences of forensic interest. The potential forensic relevance of the SNP information contained in the database has been verified through in silico SNP analysis aimed at establishing possible relationships between SNP occurrence and phenotype. The software used for this analysis is MATCH™.
|
366 |
Development of a Hepatitis C Virus knowledgebase with computational prediction of functional hypothesis of therapeutic relevance. Kojo, Kwofie Samuel. January 2011 (has links)
To ameliorate Hepatitis C Virus (HCV) therapeutic and diagnostic challenges requires robust intervention strategies, including approaches that leverage the wealth of data published in the biomedical literature to gain a greater understanding of HCV pathobiological mechanisms. The multitude of metadata originating from HCV clinical trials, as well as from low- and high-throughput experiments embedded in text corpora, can be mined as data sources for the implementation of HCV-specific resources. HCV-customized resources may support the generation of worthy and testable hypotheses and reveal potential research clues to augment the pursuit of efficient diagnostic biomarkers and therapeutic targets. This thesis reports the development of two freely available HCV-specific web-based resources: (i) the Dragon Exploratory System on Hepatitis C Virus (DESHCV), accessible via http://apps.sanbi.ac.za/DESHCV/ or http://cbrc.kaust.edu.sa/deshcv/, and (ii) the Hepatitis C Virus Protein Interaction Database (HCVpro), accessible via http://apps.sanbi.ac.za/hcvpro/ or http://cbrc.kaust.edu.sa/hcvpro/. DESHCV is a text mining system implemented using named concept recognition and cooccurrence-based approaches to computationally analyze about 32,000 HCV-related abstracts obtained from PubMed. As part of DESHCV's development, the pre-constructed dictionaries of the Dragon Exploratory System (DES) were enriched with HCV biomedical concepts, including HCV proteins, name variants and symbols, to enable HCV-specific knowledge exploration. DESHCV queries consist of user-defined keywords, phrases and concepts. DESHCV is therefore an information extraction tool that enables users to computationally generate associations between concepts and supports the prediction of potential hypotheses with diagnostic and therapeutic relevance.
Additionally, users can retrieve a list of abstracts containing tagged concepts, which can be used to ease the herculean task of manual biocuration. DESHCV has been used to reproduce the previously reported thalidomide-chronic hepatitis C hypothesis and to model a potentially novel thalidomide-amantadine hypothesis. HCVpro is a relational knowledgebase dedicated to housing experimentally detected HCV-HCV and HCV-human protein interaction information obtained from other databases and curated from biomedical journal articles. Additionally, the database contains consolidated biological information consisting of hepatocellular carcinoma (HCC) related genes, comprehensive reviews on HCV biology and drug development, functional genomics and molecular biology data, and cross-referenced links to canonical pathways and other essential biomedical databases. Users can retrieve enriched information, including interaction metadata, from HCVpro by using protein identifiers, gene chromosomal locations, experiment types used in detecting the interactions, PubMed IDs of journal articles reporting the interactions, annotated protein interaction IDs from external databases, and via "string searches". The utility of HCVpro has been demonstrated by harnessing integrated data to suggest putative baseline clues that seem to support current diagnostic exploratory efforts directed towards vimentin. Furthermore, eight genes, comprising ACLY, AZGP1, DDX3X, FGG, H19, SIAH1, SERPING1 and THBS1, have been recommended for investigation to evaluate their diagnostic potential. The data archived in HCVpro can be utilized to support protein-protein interaction network-based candidate HCC gene prioritization for validation by experimental biologists.
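The named concept recognition and cooccurrence steps that DESHCV builds on can be sketched as follows, assuming a simple dictionary-lookup tagger. The toy dictionary and abstracts are illustrative; the real system's dictionaries and matching are far richer.

```python
from collections import Counter
from itertools import combinations

# Toy dictionary mapping surface variants to canonical concepts
# (the actual DES/DESHCV dictionaries are much larger):
DICTIONARY = {
    "ns5b": "NS5B", "ns5b polymerase": "NS5B",
    "thalidomide": "thalidomide",
    "amantadine": "amantadine",
    "chronic hepatitis c": "chronic hepatitis C",
}

def tag_concepts(abstract):
    """Return the canonical concepts whose variants appear in the text."""
    text = abstract.lower()
    return {concept for variant, concept in DICTIONARY.items() if variant in text}

def cooccurrence_counts(abstracts):
    """Count how many abstracts mention each concept pair together."""
    counts = Counter()
    for abstract in abstracts:
        for pair in combinations(sorted(tag_concepts(abstract)), 2):
            counts[pair] += 1
    return counts

abstracts = [
    "Thalidomide showed benefit in chronic hepatitis C patients.",
    "Combination of thalidomide and amantadine was explored.",
    "NS5B polymerase inhibitors block replication.",
]
counts = cooccurrence_counts(abstracts)
print(counts.most_common())
```

Ranking such cooccurrence counts (or association scores derived from them) is what lets a system surface candidate links such as the thalidomide-amantadine pairing mentioned above.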
|
367 |
Efficient Temporal Synopsis of Social Media Streams. Abouelnagah, Younes. January 2013 (has links)
Search and summarization of streaming social media, such as Twitter, requires the ongoing analysis of large volumes of data with dynamically changing characteristics. Tweets are short and repetitious -- lacking context and structure -- making it difficult to generate a coherent synopsis of events within a given time period. Although some established algorithms for frequent itemset analysis might provide an efficient foundation for synopsis generation, the unmodified application of standard methods produces a complex mass of rules, dominated by common language constructs and many trivial variations on topically related results. Moreover, these results are not necessarily specific to events within the time period of interest. To address these problems, we build upon the Linear time Closed itemset Mining (LCM) algorithm, which is particularly suited to the large and sparse vocabulary of tweets. LCM generates only closed itemsets, providing an immediate reduction in the number of trivial results. To reduce the impact of function words and common language constructs, we apply a filtering step that preserves these terms only when they may form part of a relevant collocation. To further reduce trivial results, we propose a novel strengthening of the closure condition of LCM to retain only those results that exceed a threshold of distinctiveness. Finally, we perform temporal ranking, based on information gain, to identify results that are particularly relevant to the time period of interest. We evaluate our work over a collection of tweets gathered in late 2012, exploring the efficiency and filtering characteristics of each processing step, both individually and collectively. Based on our experience, the resulting synopses from various time periods provide understandable and meaningful pictures of events within those periods, with potential application to tasks such as temporal summarization and query expansion for search.
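The closure condition behind this reduction can be illustrated with a brute-force miner over toy tweet term sets. LCM itself enumerates closed itemsets directly and efficiently; the exhaustive search below is only for clarity, to show which frequent itemsets survive the "no superset with equal support" test.

```python
from itertools import combinations

def closed_frequent_itemsets(transactions, min_support=2):
    """Enumerate frequent itemsets, then keep only the closed ones,
    i.e. those with no superset of identical support."""
    items = sorted({t for tx in transactions for t in tx})
    support = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            s = sum(1 for tx in transactions if set(cand) <= tx)
            if s >= min_support:
                support[frozenset(cand)] = s
                found = True
        if not found:
            break                      # no frequent itemset of size k, stop growing
    return {iset: s for iset, s in support.items()
            if not any(iset < other and s == support[other] for other in support)}

# Toy "tweets" as term sets:
tweets = [
    {"storm", "power", "outage"},
    {"storm", "power", "outage", "nyc"},
    {"storm", "nyc"},
    {"power", "outage"},
]
closed = closed_frequent_itemsets(tweets)
for iset, s in sorted(closed.items(), key=lambda kv: -kv[1]):
    print(sorted(iset), s)
```

Here nine frequent itemsets collapse to four closed ones; for example {power} and {outage} are absorbed by {power, outage}, which has the same support. The thesis's distinctiveness threshold strengthens exactly this kind of filtering.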
|
368 |
運用文字探勘技術建立MD&A之分類閱讀器 / Using text-mining technology in developing a classified reader for MD&A. 吳詩婷, Wu, Shih Ting. Unknown Date (has links)
Annual reports are rich in information, both financial and textual. Methods for analyzing financial information are mature, while textual information is constrained by its format and file type, reducing the efficiency with which investors can use or analyze it. Management's Discussion & Analysis of Financial Condition and Results of Operations (MD&A) is the medium through which management conveys its view of operating decisions to investors, and reading the MD&A gives investors additional information; prior research has confirmed the importance of its textual content. Because textual information lacks a common classification framework, investors must spend considerable time and cost analyzing it. This study randomly selected the 2012 annual reports of 40 listed US technology companies as sample data and, using text mining and TF-IDF, classified the textual content of the MD&A into the classification framework published by the EBRC for the MD&A, building a classified reader. Investors can use the sentences classified and compiled by the system to quickly obtain the textual information they need, helping them read this unstructured text efficiently, reducing data-collection time and increasing the usability of textual information. / Annual reports are rich in information, containing both financial and textual information. While approaches to analyzing financial information are well established, textual information is confined by its format or the file type in which it is stored, decreasing the efficiency of analyzing it. Management's Discussion & Analysis of Financial Condition and Results of Operations (MD&A) is the vehicle through which management shares its decision-making considerations with investors; by reading the MD&A, investors can obtain more information. Past research confirms that this textual information is important, but due to the lack of a common framework, investors spend more time and cost analyzing it. This research randomly selected 40 samples from publicly traded technology firms in the United States and, utilizing text-mining technology and TF-IDF, classified the textual information of the MD&A into the framework established by the EBRC, developing a classified reader for the MD&A. The reader helps investors read unstructured textual information efficiently and reduces information-gathering time, thereby enhancing the usability of textual information.
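A much-simplified sketch of the classification step is below: each sentence is assigned to the category whose keyword profile it best matches, with keywords that are rarer across categories weighted more heavily. The category names and keyword lists are hypothetical; the EBRC framework's actual headings and the thesis's full TF-IDF weighting are not reproduced here.

```python
import math
from collections import Counter

# Hypothetical category keyword profiles (not the real EBRC headings):
CATEGORIES = {
    "liquidity": ["cash", "liquidity", "credit", "financing", "debt"],
    "results_of_operations": ["revenue", "sales", "margin", "expenses", "income"],
    "risk_factors": ["risk", "uncertainty", "litigation", "competition"],
}

def classify_sentence(sentence, categories=CATEGORIES):
    """Assign a sentence to the category whose keyword profile it overlaps
    most, weighting each keyword by its inverse category frequency."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    n_cats = len(categories)
    df = Counter(w for kws in categories.values() for w in set(kws))
    best, best_score = None, 0.0
    for name, kws in categories.items():
        score = sum(math.log(1 + n_cats / df[w]) for w in words if w in kws)
        if score > best_score:
            best, best_score = name, score
    return best

print(classify_sentence("Revenue and gross margin increased due to higher sales."))
# -> results_of_operations
```

In a classified reader, running this over every MD&A sentence and grouping the results by category yields the compiled, per-topic view of the text that the abstract describes.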
|
369 |
Cluster-based Query Expansion Technique. Huang, Chun-Neng. 14 August 2003 (has links)
With advances in information and networking technologies, huge amounts of information, typically in the form of text documents, are available online. To facilitate efficient and effective access to documents relevant to users' information needs, information retrieval systems have come to play a more significant role than ever. One challenging issue in information retrieval is word mismatch, which refers to the phenomenon that the same concepts may be described by different words in user queries and/or documents. The word mismatch problem, if not appropriately addressed, would critically degrade the retrieval effectiveness of an information retrieval system.
In this thesis, we develop a cluster-based query expansion technique to solve the word mismatch problem. Using the traditional query expansion techniques (i.e., global analysis and local feedback) as performance benchmarks, the empirical results suggest that when a user query only consists of one query term, the global analysis technique is more effective. However, if a user query consists of two or more query terms, the cluster-based query expansion technique can provide a more accurate query result, especially within the first few top-ranked documents retrieved.
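One way to realize cluster-based expansion, sketched here under simple assumptions (documents as term sets, single-pass Jaccard clustering), is to expand the query with the top terms of the retrieved cluster closest to it. The thesis's actual clustering and term-weighting scheme may differ.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_docs(docs, threshold=0.25):
    """Single-pass clustering: add each document to the first cluster whose
    pooled term set it overlaps enough, else start a new cluster."""
    clusters = []
    for doc in docs:
        for cluster in clusters:
            if jaccard(doc, set.union(*cluster)) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

def expand_query(query, docs, n_terms=2):
    """Expand the query with the most frequent new terms of its best cluster."""
    clusters = cluster_docs(docs)
    best = max(clusters, key=lambda c: jaccard(query, set.union(*c)))
    freq = Counter(t for doc in best for t in doc)
    extra = [t for t, _ in freq.most_common() if t not in query][:n_terms]
    return query | set(extra)

# Toy retrieved documents; clustering separates the two senses of "python":
docs = [
    {"python", "snake", "venom"},
    {"snake", "reptile", "venom"},
    {"python", "programming", "language"},
    {"programming", "language", "code"},
]
print(sorted(expand_query({"python", "programming"}, docs)))
```

Because expansion terms come only from the cluster nearest the query, a multi-term query like {"python", "programming"} picks up terms from the programming cluster rather than the reptile one, which matches the thesis's finding that the technique helps most for queries of two or more terms.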
|
370 |
Μέτρα ομοιότητας στην τεχνική ομαδοποίησης (clustering): εφαρμογή στην ανάλυση κειμένων (text mining) / Similarity measures in clustering: an application in text mining. Παπαστεργίου, Θωμάς. 17 May 2007 (links)
Development of a dissimilarity measure between categorical data and its application to document clustering and to the text authorship attribution problem. / Development of a similarity measure for categorical data and the application of the measure in text clustering and in the authorship attribution problem.
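As an illustration of the kind of measure involved (not necessarily the thesis's exact formulation), a frequency-weighted variant of simple matching can treat agreement on a rare categorical value as stronger evidence of similarity than agreement on a common one:

```python
def weighted_dissimilarity(x, y, data):
    """Dissimilarity between two categorical records: a mismatching attribute
    contributes 1; a matching attribute contributes the relative frequency of
    the shared value in the data, so rare matches lower the dissimilarity more.
    This is a generic frequency-weighted refinement of simple matching."""
    n = len(data)
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if a != b:
            total += 1.0
        else:
            total += sum(1 for rec in data if rec[i] == a) / n
    return total / len(x)

# Toy records with two categorical attributes:
records = [("a", "x"), ("a", "y"), ("b", "x"), ("c", "y")]
print(weighted_dissimilarity(("a", "x"), ("a", "y"), records))  # -> 0.75
print(weighted_dissimilarity(("c", "y"), ("c", "y"), records))  # -> 0.375
```

Identical records sharing a rare value ("c" appears once in four records) score as less dissimilar than identical records sharing common values, which is the property that makes such measures useful for clustering documents represented by categorical features, as in authorship attribution.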
|