1 |
Data Selection using Topic Adaptation for Statistical Machine Translation
Matsushita, Hitokazu 01 November 2015
Statistical machine translation (SMT) requires large quantities of bitexts (i.e., bilingual parallel corpora) as training data to yield good-quality translations. While obtaining a large amount of training data is critical, the similarity between training and test data also has a significant impact on SMT performance. Many SMT studies define data similarity in terms of domain overlap, and domains are treated as synonymous with data sources. Consequently, the SMT community has focused on domain adaptation techniques that augment small (in-domain) datasets with large datasets from other sources (hence out-of-domain, per that definition). However, many training datasets consist of topically diverse data, and not all data contained in a single dataset are useful for translating a specific target task. In this study, we propose a new perspective on data quality and topical similarity to enhance SMT performance. Using our data adaptation approach, called topic adaptation, we select topically suitable training data corresponding to the test data in order to produce better translations. We propose three topic adaptation approaches for the SMT process and investigate their effectiveness in both idealized and realistic settings using large parallel corpora. We measure the performance of SMT systems trained on topically similar data using BLEU, a widely used objective SMT performance metric. We show that topic adaptation approaches outperform baseline systems by 0.3 to 3 BLEU points when data selection parameters are carefully determined.
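The abstract does not describe the three topic adaptation approaches, so the following is only an illustrative sketch of the general idea of topic-based data selection, not the author's method. It assumes each training document already has a topic-proportion vector (for example from a topic model); the document ids, the vectors, the cosine-similarity measure, and the 0.8 threshold are all hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def select_by_topic(train_docs, test_topics, threshold=0.8):
    """Return ids of training documents whose topic vector is close to the test set's.

    train_docs maps a document id to its (assumed, precomputed) topic proportions.
    The threshold is the kind of data selection parameter that would need tuning.
    """
    return [doc_id for doc_id, topics in train_docs.items()
            if cosine(topics, test_topics) >= threshold]

# Hypothetical topic proportions over three topics.
train_docs = {
    "news-001": [0.9, 0.1, 0.0],
    "legal-014": [0.1, 0.1, 0.8],
    "news-027": [0.7, 0.3, 0.0],
}
test_topics = [0.8, 0.2, 0.0]

selected = select_by_topic(train_docs, test_topics)
print(selected)
```

In this toy setup only the two news-like documents pass the threshold; an SMT system would then be trained on the selected subset and scored with BLEU against a baseline trained on all data.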
|
2 |
BEST SOURCE SELECTORS AND MEASURING THE IMPROVEMENTS
Gatton, Tim 10 1900
ITC/USA 2005 Conference Proceedings / The Forty-First Annual International Telemetering Conference and Technical Exhibition / October 24-27, 2005 / Riviera Hotel & Convention Center, Las Vegas, Nevada / After years of tracing the evolution and solutions to finding the best data, I learned that it isn’t best source selection that we all want. What we need is best data selection.
|
3 |
CSI in the Web 2.0 Age: Data Collection, Selection, and Investigation for Knowledge Discovery
Fu, Tianjun January 2011
The growing popularity of various Web 2.0 media has created massive amounts of user-generated content such as online reviews, blog articles, shared videos, forum threads, and wiki pages. Such content provides insights into web users' preferences and opinions, online communities, knowledge generation, etc., and presents opportunities for many knowledge discovery problems. However, several challenges need to be addressed: data collection procedures have to deal with the unique characteristics and structures of various Web 2.0 media; advanced data selection methods are required to identify data relevant to specific knowledge discovery problems; and interactions between Web 2.0 users, which are often embedded in user-generated content, require effective methods to identify, model, and analyze them. In this dissertation, I address the above challenges and aim at three types of knowledge discovery tasks: (data) collection, selection, and investigation. Organized in this "CSI" framework, five studies that explore and propose solutions to these tasks for particular Web 2.0 media are presented. In Chapter 2, I study focused and hidden Web crawlers and propose a novel crawling system for Dark Web forums by addressing several issues unique to hidden web data collection. In Chapter 3, I explore the use of both topical and sentiment information in web crawling. This information is also used to label nodes in web graphs that are employed by a graph-based tunneling mechanism to improve collection recall. Chapter 4 extends the work in Chapter 3 by exploring whether other graph comparison techniques can be used in tunneling for focused crawlers. A subtree-based tunneling method that can scale up to large graphs is proposed and evaluated. Chapter 5 examines the usefulness of user-generated content in online video classification. Three types of text features are extracted from the collected user-generated content and used by several feature-based classification techniques to demonstrate the effectiveness of the proposed text-based video classification framework. Chapter 6 presents an algorithm to identify forum user interactions and shows how they can be used for knowledge discovery. The algorithm uses a range of system and linguistic features and adopts several similarity-based methods to account for interactional idiosyncrasies.
|
4 |
Two Essays in Finance: “Selection Biases and Long-run Abnormal Returns” And “The Impact of Financialization on the Benefits of Incorporating Commodity Futures in Actively Managed Portfolios”
Adhikari, Ramesh 11 August 2015
This dissertation consists of two essays. The first essay investigates the implications of researchers' data requirements for the risk-adjusted returns of firms. Using monthly CRSP data from 1925 to 2013, I present evidence that firms which survive longer have higher average returns and a lower standard deviation of annualized returns than firms which do not. I further demonstrate that there is a positive relation between firms’ survival and average performance. To account for the positive correlation between survival and average performance, I model the relation between survival and pricing errors using a Farlie-Gumbel-Morgenstern joint distribution function and fit the resulting moment conditions to the data. The results show that even a low correlation between firm survival time and pricing errors can lead to a much higher correlation between survival time and average pricing errors. Failure to adjust for these data selection biases can result in over- or underestimates of abnormal returns by 5.73% in studies that require at least five years of returns data.
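For reference, the standard textbook form of the Farlie-Gumbel-Morgenstern (FGM) joint distribution is given below. The symbols are assumptions introduced here for illustration ($T$ for survival time, $\varepsilon$ for the pricing error, $F_T$ and $F_\varepsilon$ for their marginal distribution functions, $\theta$ for the dependence parameter); the abstract does not state the parameterization used in the essay.

```latex
H(t, e) \;=\; F_T(t)\,F_\varepsilon(e)\,
  \bigl[\,1 + \theta\,\bigl(1 - F_T(t)\bigr)\bigl(1 - F_\varepsilon(e)\bigr)\bigr],
\qquad -1 \le \theta \le 1 .
```

Setting $\theta = 0$ recovers independence. The family can only express weak dependence (Spearman's rank correlation equals $\theta/3$, so it cannot exceed $1/3$ in absolute value), which fits the low survival-to-pricing-error correlations the abstract describes.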
The second essay examines the diversification benefits of commodity futures portfolios in light of the rapid increase in investor participation in commodity futures markets since 2000. Many actively managed portfolios outperform traditional buy-and-hold portfolios over the sample period from January 1986 to October 2013. The evidence from a traditional intersection test and a stochastic-discount-factor-based spanning test indicates that financialization has reduced the segmentation of the commodity market from equity and bond markets and has increased the riskiness of investing in commodity futures markets. However, the diversifying property of commodity portfolios has not disappeared despite the increased correlation between commodity portfolio returns and equity index returns.
|
5 |
The Influence of Interactive Elements on Entertainment: An Interactive Infographics Experiment on University Students
Möcander, Bonnie, Shen, Nuoting January 2023
Objectives: Following previous research indicating the positive influence of entertainment on attitudes, this quantitative study investigates which interactive elements augment entertainment more effectively, and to what degree Gamification, Data Selection, Storytelling, and Motion augment entertainment through the beneficial components Emotional Arousal, Recovery and Regulation, and Aesthetic Appreciation (Dobni, 2007). Method: A pre-study survey was conducted to determine a neutral topic for the interactive infographics used in the main study. The main study consisted of experiments on university students. The participants were asked to engage with four interactive infographics, each focusing on one interactive element, while consecutively answering a questionnaire. The results of the survey were analyzed by descriptive analysis, while the questionnaire was analyzed by ANOVA, MANOVA, descriptive analysis, and Pearson’s r. Results: The pre-study survey concluded that Wild Animals was the most neutral topic. Statistically significant differences in shown entertainment were found between all interactive elements except Gamification and Storytelling (p = 0.696) and Data Selection and Motion (p = 0.971). Statistically significant differences in signs of Emotional Arousal and Aesthetic Appreciation were found between the interactive elements, but none in Recovery and Regulation. A significant (p < 0.001), strong positive correlation (r = 0.744) between shown and self-reported entertainment level was identified. Conclusion: Four out of five null hypotheses were rejected. The findings show that Gamification and Storytelling can be used to augment entertainment more effectively. All interactive elements augment entertainment mainly through Aesthetic Appreciation. Storytelling and Gamification secondarily use Emotional Arousal, while Data Selection and Motion use Recovery and Regulation. Practical implications include that teachers and creators can augment entertainment by choosing effective interactive elements when designing and selecting interactive infographics. Further research can use qualitative methods or investigate other interactive elements, populations, or device environments.
|
6 |
Archaeomagnetic field intensity evolution during the last two millennia / Evolução da intensidade do campo arqueomagnético durante os últimos dois milênios
Silva, Wilbor Poletti 14 September 2018
Temporal variations of Earth's magnetic field provide a wide range of geophysical information about the dynamics of different layers of the Earth. Since it is a planetary field, regional and global aspects can be explored, depending on the timescale of the variations. In this thesis, the geomagnetic field variations over the last two millennia were investigated. To that end, the methods used to recover ancient magnetic field intensity from archaeological material were improved, new data were acquired, and a critical assessment of the global archaeomagnetic database was performed. Two methodological advances are reported: (i) a correction, for the microwave method, of the cooling rate effect, which is associated with the difference between the cooling time during the manufacture of the material and that of the heating steps during the archaeointensity experiment; and (ii) a test for thermoremanent anisotropy correction based on the arithmetic mean of six samples positioned orthogonally to one another during the archaeointensity experiment. The temporal variation of the magnetic intensity for South America was investigated using nine new data points, three from ruins of the Guaraní Jesuit Missions and six from archaeological sites associated with jerky beef farms, all located in Rio Grande do Sul, Brazil, with ages covering the last 400 years. These data, combined with the regional archaeointensity database, demonstrate that the influence of significant non-dipole components in South America started at ~1800 CE. Finally, from a reassessment of the global archaeointensity database, a new interpretation of the evolution of the geomagnetic axial dipole is proposed, suggesting that this component has been decreasing steadily since ~700 CE owing to the breaking of the symmetry of the advective sources operating in the outer core.
|
8 |
Duomenų filtravimo ir atrankos sprendimų analizė / The analysis of data filtration and selection solutions
Vairaitė, Rūta 10 July 2008
When large amounts of data are stored, efficient processing becomes important, and users need ever higher database performance. This work addresses the problem of making databases run faster when their tables hold a very large number of records, focusing on database performance tuning (performance optimization). After analyzing existing database acceleration methods and the causes of reduced performance, a method is proposed that processes and filters data more quickly and returns query results to the user faster. The MS SQL Server database management system was chosen for the work. An experiment measured query speed, comparing the proposed method with the views (virtual tables) method; the results show that the proposed method has better query performance than views.
|
9 |
Automatic speech recognition for resource-scarce environments / N.T. Kleynhans.
Kleynhans, Neil Taylor January 2013
Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource-intensive activity and requires language resources in the form of text-annotated audio recordings and pronunciation dictionaries. Unfortunately, many languages found in the developing world fall into the resource-scarce category, and this resource scarcity severely inhibits the deployment of ASR systems in the developing world. In this thesis we present research into developing techniques and tools to (1) harvest audio data, (2) rapidly adapt ASR systems and (3) select “useful” training samples in order to assist with resource-scarce ASR system development.
We demonstrate an automatic audio harvesting approach which efficiently creates a speech recognition corpus by harvesting an easily available audio resource. We show that by starting with bootstrapped acoustic models, trained with language data obtained from a dialect, and then running through a few iterations of an alignment-filter-retrain phase, it is possible to create an accurate speech recognition corpus. As a demonstration we create a South African English speech recognition corpus by using our approach and harvesting an internet website which provides audio and approximate transcriptions. The acoustic models developed from harvested data are evaluated on independent corpora and show that the proposed harvesting approach provides a robust means to create ASR resources.
As there are many acoustic model adaptation techniques which can be implemented by an ASR system developer, selecting the best adaptation technique becomes a costly endeavour. We investigate how the performance of various adaptation techniques depends on the amount of adaptation data by systematically varying that amount and comparing the techniques. We establish a guideline which an ASR developer can use to choose the best adaptation technique given a size constraint on the adaptation data, for the scenario where adaptation between narrow- and wide-band corpora must be performed. In addition, we investigate the effectiveness of a novel channel normalisation technique and compare its performance with standard normalisation and adaptation techniques.
Lastly, we propose a new data selection framework which can be used to design a speech recognition corpus. We show that, for limited data sets, independent of language and bandwidth, the most effective strategy for data selection is frequency-matched selection, and that the widely used maximum entropy methods generally produced the least promising results. In our model, the frequency-matched selection method corresponds to a logarithmic relationship between accuracy and corpus size; we also investigated other model relationships, and found that a hyperbolic relationship (as suggested by simple asymptotic arguments in learning theory) may lead to somewhat better performance under certain conditions. / Thesis (PhD (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2013.
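The contrast between a logarithmic and a hyperbolic relationship between accuracy and corpus size can be illustrated with a small curve-fitting sketch. This is not the thesis's selection framework: the functional forms `a + b*log(n)` and `a + b/n`, the helper names, and the synthetic data points are all assumptions chosen to show how the two relationships could be compared on measured accuracies.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sum of squared errors)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def compare_curves(sizes, accuracies):
    """Fit accuracy = a + b*log(n) and accuracy = a + b/n; both are linear in a, b."""
    log_fit = fit_line([math.log(n) for n in sizes], accuracies)
    hyp_fit = fit_line([1.0 / n for n in sizes], accuracies)
    return {"logarithmic": log_fit, "hyperbolic": hyp_fit}

# Synthetic, purely illustrative data generated from a logarithmic curve.
sizes = [100, 200, 400, 800, 1600]
accuracies = [50.0 + 5.0 * math.log(n) for n in sizes]

fits = compare_curves(sizes, accuracies)
for name, (a, b, sse) in fits.items():
    print(f"{name}: a={a:.3f}, b={b:.3f}, SSE={sse:.6f}")
```

Because the synthetic points were generated from a logarithmic curve, the logarithmic fit has essentially zero residual here; on real accuracy measurements the model with the smaller sum of squared errors would be the better description of how accuracy grows with corpus size.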
|