1 |
Valence and concreteness effects in word-learning: Evidence from a language learning app / Wild, Heather / January 2023
One goal of applied linguistics is to help people learn languages better and faster. Second language (L2) learners need to acquire large vocabularies to approach native-like proficiency in their target language. A number of studies have explored the factors that facilitate and hinder word learning using highly controlled experiments; however, these lack ecological validity, and the findings may not generalize to real-world learning. The studies in this thesis respond to this gap in the literature. They leverage big data from a popular language learning app called Lingvist to explore how understudied semantic factors such as valence (positivity/negativity) and concreteness affect adult L2 word learning. Chapter 2 explores the shape of valence effects on learning, the interaction between the semantics of the target word and the linguistic context in which the word is learned, and how these effects unfold over multiple exposures to the target word. Users learn both positive and negative words better than neutral ones, and learning improves by 7% when target words appear in emotionally congruent contexts (i.e., positive words in positive sentences, negative words in negative sentences). These effects are strongest on the learner's second encounter with the word and diminish over subsequent encounters. Chapter 3 examines the interaction between target word valence and concreteness. Increased positivity increased accuracy for concrete words by up to 13% but had little impact on learning abstract words. On the theoretical front, the findings provide support for embodied cognition, the lexical quality hypothesis, and the multimodal induction hypothesis. On the applied front, they indicate that context valence can be manipulated to facilitate learning and can identify which words will be most difficult to learn. / Thesis / Master of Science (MSc) / Language learners need to know tens of thousands of words to communicate fluently in a language. These studies use data from a popular language learning app called Lingvist to understand how the emotionality of words, and of the sentences we see them in, impacts learning. Negative words (e.g., murder) and positive words (e.g., vacation) were learned better than neutral words. Positive words were learned better when they appeared in a positive sentence, and negative words were learned better in more negative sentences. The second study found that concrete words like brick or table are easier to learn when they are positive, but emotion has little impact on learning abstract words like hope. These findings help researchers understand how words are represented in the mind and point to ways to make language learning faster and easier.
|
2 |
Finding delta difference in large data sets / Arvidsson, Johan / January 2019
Finding out what differs between two versions of a file can be done with several different techniques and programs. These techniques and programs often focus on finding differences in text files, in documents, or in class files for programming. An example is the popular git tool, which focuses on displaying the differences between versions of files in a project. A common way to find these differences is to use the Longest Common Subsequence (LCS) algorithm, which finds the longest subsequence common to both files as a measure of their similarity. By excluding all similarities, the remaining text constitutes the differences between the files. LCS is often used to find the differences in acceptable time. When two lines in a file are compared to see if they differ from each other, hashing is used: the hash values of each corresponding line in both files are compared. Hashing a line gives the content on that line a practically unique value, so if even one character on a line differs between the versions, the hash values for those lines will differ as well. These techniques are very useful when comparing two versions of a file with text content. With data from a database, some, but not all, of these techniques remain useful. A big difference between data in a database and text in a file is that content is not just added and deleted but also updated. This thesis studies how to apply these techniques to finding differences between large datasets, rather than between documents and files, in a reasonable time. Three different methods are studied in theory, with results given as both time and space complexities. Finally, one of these methods is further studied through implementation and testing. Only one of the three is implemented because of time constraints; the chosen method offered easy maintainability, a straightforward implementation, and good execution time.
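As a rough illustration of the line-hashing and LCS combination the abstract describes (a minimal sketch in Python, not the thesis's actual implementation or its chosen database method):

```python
import hashlib

def line_hashes(lines):
    # Hash each line so comparisons operate on short fixed-size values
    return [hashlib.sha1(line.encode("utf-8")).hexdigest() for line in lines]

def lcs_table(a, b):
    # Standard dynamic-programming table: dp[i][j] = LCS length of a[:i] and b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp

def diff(old_lines, new_lines):
    # Lines that are not on the longest common subsequence are the differences
    a, b = line_hashes(old_lines), line_hashes(new_lines)
    dp = lcs_table(a, b)
    out, i, j = [], len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            i, j = i - 1, j - 1                            # common line: skip
        elif dp[i][j - 1] >= dp[i - 1][j]:
            out.append("+ " + new_lines[j - 1]); j -= 1    # added line
        else:
            out.append("- " + old_lines[i - 1]); i -= 1    # removed line
    out.extend("- " + l for l in reversed(old_lines[:i]))
    out.extend("+ " + l for l in reversed(new_lines[:j]))
    return out[::-1]

print(diff(["a", "b", "c"], ["a", "x", "c"]))  # ['- b', '+ x']
```

Note that database rows would additionally need update detection (a changed row hashes differently from both "added" and "deleted"), which is exactly the complication the thesis points out.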
|
3 |
Detecting complex genetic mutations in large human genome data / Alsulaiman, Thamer / 01 August 2019
All cellular forms of life contain deoxyribonucleic acid (DNA), a molecule that carries all the information necessary to perform both basic and complex cellular functions. DNA is replicated to form new tissue and organs and to pass genetic information to future generations. Ideally, replication yields an exact copy of the original DNA. While replication generally occurs without error, mistakes made during the process can introduce accidental changes, called mutations. Mutations range in magnitude, yet mutations of any magnitude can range in consequence from no effect on the organism to disease initiation (e.g., cancer) or even death.
In this thesis, we limit our focus to mutations in human DNA, and in particular to MMBIR mutations. Recent literature in human genomics has found microhomology-mediated break-induced replication (MMBIR) to be a common mechanism producing complex mutations in DNA. MMBIRFinder is a tool to detect MMBIR regions in yeast DNA. Although MMBIRFinder is successful on yeast DNA, it is not capable of detecting MMBIR mutations in human DNA. Among several reasons, one major cause of this deficiency is the amount of computation required to process the much larger human genome. Our contribution in this regard is twofold:
1) We utilize parallel computation to significantly reduce the processing time consumed by the original MMBIRFinder and address several performance-degrading issues inherent in the original design (a rough sketch of the general idea appears after this list);
2) We introduce a new heuristic to detect MMBIR mutations that were not detected by the original MMBIRFinder, even in the case of small genomes such as yeast.
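The abstract does not describe the parallelization scheme itself. As a heavily hedged sketch of the general pattern (chunked parallel scanning with overlap), the following Python outline is illustrative only; the chunk size, overlap, and detect_candidates stub are hypothetical and not taken from MMBIRFinder:

```python
from multiprocessing import Pool

WINDOW = 10_000   # hypothetical chunk size; a real tool would tune this
OVERLAP = 100     # hypothetical; should exceed the longest expected event

def detect_candidates(args):
    # Placeholder for the per-region detection work a tool like MMBIRFinder
    # performs; each worker scans one chunk independently of the others.
    sequence, start = args
    hits = []
    for i in range(len(sequence)):
        pass  # real detection logic (alignment, microhomology search) goes here
    return [start + i for i in hits]

def parallel_scan(genome, workers=8):
    # Split the genome into overlapping chunks so that no candidate spanning
    # a chunk boundary is missed, then fan the chunks out across processes.
    chunks = [(genome[s:s + WINDOW + OVERLAP], s)
              for s in range(0, len(genome), WINDOW)]
    with Pool(workers) as pool:
        results = pool.map(detect_candidates, chunks)
    return sorted(pos for chunk_hits in results for pos in chunk_hits)
```

Because chunks are independent, the scan time shrinks roughly with the number of workers, which is the kind of speedup the first contribution targets.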
|
4 |
Modeling Point Patterns, Measurement Error and Abundance for Exploring Species Distributions / Chakraborty, Avishek / January 2010
This dissertation focuses on solving some common problems associated with ecological field studies. At the core of the statistical methodology lies spatial modeling, which provides greater flexibility and improved predictive performance over existing algorithms. The applications involve prevalence datasets for hundreds of plants over a large area in the Cape Floristic Region (CFR) of South Africa.

In Chapter 2, we begin by modeling categorical abundance data with a multilevel spatial model using background information such as environmental and soil-type factors. The empirical pattern is formulated as a degraded version of the potential pattern, with the degradation effect accomplished in two stages: first we adjust for land use transformation, and then we adjust for measurement error, and hence misclassification error, to yield the observed abundance classifications. With data on a regular grid over the CFR, the analysis is done with a conditionally autoregressive prior on spatial random effects. With roughly 37,000 cells to work with, a novel parallelization algorithm is developed for updating the spatial parameters to efficiently estimate potential and transformed abundance surfaces over the entire region.

In Chapter 3, we focus on a different but increasingly common type of prevalence data in the so-called presence-only setting. We detail the limitations of a usual presence-absence analysis for these data and advocate modeling the data as a point pattern realization. The underlying intensity surface is modeled with a point-level spatial Gaussian process prior, after taking into account sampling bias and change in land-use pattern. The large size of the region necessitates a computational approximation with a bias-corrected predictive process. We compare our methodology against the most commonly used maximum entropy method to highlight the improvement in predictive performance.

In Chapter 4, we develop a novel hierarchical model for analyzing noisy point pattern datasets, which arise commonly in ecological surveys due to multiple sources of bias, as discussed in previous chapters. The noise leads to displacements of locations as well as potential loss of points inside a bounded domain. Depending on the assumption about the existence of locations outside the boundary, two different models, island and subregion, are specified. The methodology assumes informative knowledge of the scale of measurement error, either pre-specified or learned from a training sample. Its performance is tested against different scales of measurement error related to the data collection techniques in the CFR.

In Chapter 5, we suggest an alternative model for prevalence data, different from the one in Chapter 3, to avoid numerical approximation and the subsequent computational complexities for a large region. A mixture model, similar to the one in Chapter 4, is used, with potential dependence among the weights and locations of components. The covariates as well as a spatial process are used to model the dependence. A novel birth-death algorithm for the number of components in the mixture is under construction.

Lastly, in Chapter 6, we proceed to joint modeling of multiple-species datasets. The challenge is to infer inter-species competition with a large number of populations, possibly running into several hundred. Our contribution involves applying a hierarchical Dirichlet process to cluster the presence localities and subsequently developing measures of range overlap from posterior draws. This kind of simultaneous inference can potentially have implications for questions related to biodiversity and conservation studies. / Dissertation
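For reference, the conditionally autoregressive (CAR) prior mentioned in Chapter 2 is commonly specified through full conditionals of the following standard form; the dissertation's exact specification may differ:

$$\phi_i \mid \phi_{-i} \;\sim\; \mathcal{N}\!\left(\rho \,\frac{\sum_{j \sim i} \phi_j}{n_i},\; \frac{\tau^2}{n_i}\right),$$

where $j \sim i$ indexes the grid cells adjacent to cell $i$, $n_i$ is the number of neighbors of cell $i$, $\tau^2$ is a conditional variance parameter, and $\rho \in [0,1)$ controls the strength of spatial dependence ($\rho \to 1$ recovers the intrinsic CAR model). Each cell's random effect is thus shrunk toward the average of its neighbors, which is what makes the prior spatially smoothing.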
|
5 |
Statistical models and algorithms for large data with complex dependence structures / Li, Miaoqi / 02 June 2020
No description available.
|
6 |
Efficient Virtualization of Scientific Data / Narayanan, Sivaramakrishnan / 16 September 2008
No description available.
|
7 |
SHOCK & VIBRATION TESTING OF AN AIRBORNE INSTRUMENTATION DIGITAL RECORDER / Smedley, Mark; Simpson, Gary / October 2000
International Telemetering Conference Proceedings / October 23-26, 2000 / Town & Country Hotel and Conference Center, San Diego, California / Shock and vibration testing was performed on the Metrum-Datatape Inc. 32HE recorder to determine its viability as an airborne instrumentation recorder. A secondary goal of the testing was to characterize the recorder's operational shock and vibration envelope. Both flight testing and laboratory environmental testing of the recorder were performed to make these determinations. This paper covers the laboratory portion of the shock and vibration testing, addressing the test methodology and rationale, test set-up, results, challenges, and lessons learned.
|
8 |
Pixel Oriented Visualization in XmdvTool / Patro, Anilkumar G / 07 September 2004
"Many approaches to the visualization of multivariate data have been proposed to date. Pixel oriented techniques map each attribute value of the data to a single colored pixel, theoretically yielding the display of the maximum possible information at a time. A large number of pixel layout methods have been proposed, each of which enables users to perform their visual exploration tasks to varying degrees. Pixel oriented techniques typically maintain the global view of large amounts of data while still preserving the perception of small regions of interest, which makes them particularly interesting for visualizing very large multidimensional data sets. Pixel based methods also provide feedback on the given query by presenting not only the data items fulfilling the query but also the data that approximately fulfill the query. The goal of this thesis was to extend XmdvTool, a public domain multivariate data visualization package, to incorporate pixel based techniques and to explore their strengths and weaknesses. The main challenge here was to seamlessly apply the interaction and distortion techniques used in other visualization methods within XmdvTool to pixel based methods and investigate the capabilities made possible by fusing the various multivariate visualization techniques."
|
9 |
A Scalable Architecture for Simplifying Full-Range Scientific Data Analysis / Kendall, Wesley James / 01 December 2011
According to a recent exascale roadmap report, analysis will be the limiting factor in gaining insight from exascale data. Analysis problems that must operate on the full range of a dataset are among the most difficult. Some of the primary challenges in this regard come from disk access, data management, and the programmability of analysis tasks on exascale architectures. In this dissertation, I have provided an architectural approach that simplifies and scales data analysis on supercomputing architectures while masking parallel intricacies from the user. My architecture makes three primary general contributions: 1) a novel design pattern and implementation for reading multi-file and variable datasets, 2) the integration of querying and sorting as a way to simplify data-parallel analysis tasks, and 3) a new parallel programming model and system for efficiently scaling domain-traversal tasks.
The design of my architecture has enabled studies in several application areas that were not previously possible, including large-scale satellite data and ocean flow analysis. The major driving example is an internal-model variability assessment of flow behavior in the GEOS-5 atmospheric modeling dataset. This application issued over 40 million particle traces for model comparison (the largest parallel flow tracing experiment to date), and my system was able to scale execution up to 65,536 processes on an IBM BlueGene/P system.
|
10 |
Optimisation de requêtes sur des données massives dans un environnement distribué / Optimization of queries over large data in a distributed environment / Gillet, Noel / 10 March 2017
Distributed storage systems are widely used in the current context of big data. Besides managing the storage of data, these systems must answer an ever-growing volume of queries issued by distant clients performing data mining or visualization tasks. A major challenge in this context is to distribute queries efficiently among the nodes that compose these systems, so as to minimize query processing time (the maximum and average time per query, the total processing time over all queries, and so on). In this thesis we address the problem of query allocation in a distributed environment. We assume that data are replicated and that a query can be handled only by a node storing a copy of the data it concerns. First, near-optimal algorithmic solutions are proposed for the setting where communications between the nodes of the system are asynchronous; the case where some nodes may be faulty is also considered. Second, we study the impact of data replication on query processing. In particular, we propose an algorithm that adapts the replication of data to the demand for them; combined with our allocation algorithms, it guarantees a near-ideal distribution of queries for any query distribution. Finally, we examine the impact of replication when queries arrive at the system as a stream. We carry out an experimental evaluation on the distributed database Apache Cassandra; the experiments confirm the benefit of replication and of our allocation algorithms compared with the default allocation scheme in that system.
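As a rough sketch of the allocation problem itself (a simple greedy baseline, not the thesis's near-optimal algorithms), each incoming query can only go to a node holding a replica of the requested data, and a natural heuristic picks the least-loaded such node:

```python
def allocate(queries, replicas, cost=1):
    """Greedy baseline: send each query to the least-loaded qualifying replica.

    queries  -- list of data-item ids, in arrival order
    replicas -- dict mapping data-item id -> list of node ids holding a copy
    """
    load = {}          # node id -> accumulated work
    assignment = []
    for item in queries:
        candidates = replicas[item]                       # only nodes with the data
        node = min(candidates, key=lambda n: load.get(n, 0))
        load[node] = load.get(node, 0) + cost
        assignment.append((item, node))
    return assignment, load

# Example: item "a" replicated on nodes 0 and 1, item "b" only on node 0.
assignment, load = allocate(["a", "b", "a", "a"], {"a": [0, 1], "b": [0]})
print(assignment)  # [('a', 0), ('b', 0), ('a', 1), ('a', 1)]
```

The example also shows why replication placement matters: with "b" stored only on node 0, that node becomes a hotspot regardless of the allocation rule, which is precisely the interaction between replication and allocation that the thesis studies.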
|