191 |
Unsupervised learning of relation detection patterns. Gonzàlez Pellicer, Edgar, 01 June 2012 (has links)
Information extraction is the natural language processing area whose goal is to obtain structured data from the relevant
information contained in textual fragments.
Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge is a drawback for the portability of systems, as a change of language, domain or style demands costly human effort.
Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively
reducing the amount of involved human supervision. However, as the availability of large document collections increases,
completely unsupervised approaches become necessary in order to mine the knowledge contained in them.
The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to
further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation
detection. The achievement of this ultimate goal has required, first, considering the different strategies in which this
combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third,
devising pattern learning procedures which incorporated clustering information.
By the end of this thesis, we had been able to develop and implement an approach for learning relation detection patterns
which, using clustering techniques and minimal human supervision, is competitive and even outperforms other comparable
approaches in the state of the art.
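As a rough illustration only (this is not the procedure developed in the thesis), clustering can feed pattern acquisition for relation detection by grouping the lexical contexts that appear between entity pairs and reading candidate patterns off each cluster. The contexts, the cluster count, and all names below are hypothetical.

```python
# Sketch: cluster inter-entity contexts and derive candidate relation patterns.
# Illustrative approximation only, not the thesis's actual learning procedure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical training data: the token span between two entity mentions.
contexts = [
    "was born in", "is a native of", "grew up in",
    "works for", "is employed by", "joined",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(contexts)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For each cluster, keep the highest-weighted terms as a crude extraction pattern.
terms = np.array(vectorizer.get_feature_names_out())
for c in range(kmeans.n_clusters):
    top = terms[np.argsort(kmeans.cluster_centers_[c])[::-1][:3]]
    print(f"cluster {c}: candidate pattern terms = {list(top)}")
```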
|
192 |
Hydrological Modeling of the Upper South Saskatchewan River Basin: Multi-basin Calibration and Gauge De-clustering Analysis. Dunning, Cameron, January 2009 (has links)
This thesis presents a method for calibrating regional scale hydrologic models using the upper South Saskatchewan River watershed as a case study. Regional scale hydrologic models can be very difficult to calibrate due to the spatial diversity of their land types. To deal with this diversity, both a manual calibration method and a multi-basin automated calibration method were applied to a WATFLOOD hydrologic model of the watershed.
Manual calibration was used to determine the effect of each model parameter on modeling results. A parameter set that heavily influenced modeling results was selected. Each influential parameter was also assigned an initial value and a parameter range to be used during automated calibration. This manual calibration approach was found to be very effective for improving modeling results over the entire watershed.
Automated calibration was performed using a weighted multi-basin objective function based on the average streamflow from six sub-basins. The initial parameter set and ranges found during manual calibration were subjected to the optimization search algorithm DDS to automatically calibrate the model. Sub-basin results not involved in the objective function were considered for validation purposes. Automatic calibration was deemed successful in providing watershed-wide modeling improvements.
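To make the calibration setup concrete, the following sketch shows one way a weighted multi-basin objective and a DDS-style greedy search could be wired together; run_model, the equal sub-basin weights, and the parameter ranges are hypothetical stand-ins, not the actual WATFLOOD configuration used in the thesis.

```python
# Sketch of a weighted multi-basin objective with a DDS-style search.
# run_model and all numbers are placeholders for the real model and setup.
import numpy as np

rng = np.random.default_rng(42)

def run_model(params):
    """Placeholder for a hydrologic model run; returns simulated flows per sub-basin."""
    return {b: np.sin(np.arange(100) / 10.0) * params.sum() for b in range(6)}

observed = {b: np.sin(np.arange(100) / 10.0) * 2.5 for b in range(6)}
weights = {b: 1.0 / 6 for b in range(6)}            # equal weighting of six sub-basins

def objective(params):
    sim = run_model(params)
    # Weighted sum of sub-basin RMSEs (one possible multi-basin formulation).
    return sum(weights[b] * np.sqrt(np.mean((sim[b] - observed[b]) ** 2)) for b in sim)

lo, hi = np.full(5, 0.1), np.full(5, 5.0)           # ranges from manual calibration (assumed)
best = lo + rng.random(5) * (hi - lo)
best_f = objective(best)

max_iter = 200
for i in range(1, max_iter + 1):
    # DDS: perturb a randomly chosen, shrinking subset of parameters.
    p_select = 1.0 - np.log(i) / np.log(max_iter)
    mask = rng.random(5) < max(p_select, 1.0 / 5)
    if not mask.any():
        mask[rng.integers(5)] = True
    trial = best.copy()
    trial[mask] += 0.2 * (hi - lo)[mask] * rng.standard_normal(mask.sum())
    trial = np.clip(trial, lo, hi)
    f = objective(trial)
    if f <= best_f:                                 # greedy acceptance
        best, best_f = trial, f
```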
The calibrated model was then used as a basis for determining the effect of altering rain gauge density on model outputs for both a local (sub-basin) and global (watershed) scale. Four de-clustered precipitation data sets were used as input to the model and automated calibration was performed using the multi-basin objective function. It was found that more accurate results were obtained from models with higher rain gauge density. Adding a rain gauge did not necessarily improve modeled results over the entire watershed, but typically improved predictions in the sub-basin in which the gauge was located.
|
193 |
Clustering Lab value working with medical data. Davari, Mahtab, January 2007 (has links)
Data mining is a relatively new field of research whose objective is to acquire knowledge from large amounts of data. In medical and health care areas, due to regulations and to the availability of computers, a large amount of data is becoming available [27]. On the one hand, practitioners are expected to use all this data in their work; at the same time, such a large amount of data cannot be processed by humans in a short time to make diagnoses, prognoses and treatment schedules. A major objective of this thesis is to evaluate data mining tools in medical and health care applications in order to develop a tool that can help make rather accurate decisions. The goal is to find a pattern among patients who contracted pneumonia by clustering lab values that have been recorded every day. With this pattern we can generalize to patients who have not been diagnosed with the disease but whose lab values show the same trend as those of pneumonia patients. Ten tables were extracted from a large database of a hospital in Jena for this work. In the ICU (intensive care unit), the COPRA system, a patient management system, has been used. All tables and data are stored in a German-language database.
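As an illustrative sketch only, daily lab-value trajectories can be clustered to separate a pneumonia-like trend from stable controls; the synthetic values, the seven-day window, and the two-cluster choice below are assumptions, not the thesis's actual data or settings.

```python
# Sketch: cluster patients by the daily trend of a single lab value. All data
# here are synthetic; the thesis works on tables exported from the COPRA system.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
days = 7

# Each row is one patient's daily measurements of a single lab parameter.
rising = rng.normal(10, 1, (20, days)) + np.linspace(0, 8, days)   # pneumonia-like rise
stable = rng.normal(8, 1, (20, days))                              # stable controls
trajectories = np.vstack([rising, stable])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(trajectories)

for c in range(2):
    mean_curve = trajectories[labels == c].mean(axis=0)
    print(f"cluster {c}: mean daily values = {np.round(mean_curve, 1)}")
```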
|
195 |
Fuzzy logic-based digital soil mapping in the Laurel Creek Conservation Area, Waterloo, Ontario. Ren, Que, January 2012 (has links)
The aim of this thesis was to examine environmental covariate-related issues (resolution dependency, the contribution of vegetation covariates, and the use of LiDAR data) in the purposive sampling design for fuzzy logic-based digital soil mapping. In this design, fuzzy c-means (FCM) clustering of environmental covariates was employed to determine suitable sampling sites and to assist soil survey and inference. Two subsets of the Laurel Creek Conservation Area were examined to explore the resolution and vegetation issues, respectively. Both conventional and LiDAR-derived digital elevation models (DEMs) were used to derive terrain covariates, and a vegetation index calculated from remotely sensed data was employed as a vegetation covariate. A basic field survey was conducted in the study area. A validation experiment was performed in another area.
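A minimal sketch of the fuzzy c-means step, assuming two made-up terrain covariates; the covariate values, fuzzifier, and cluster count are illustrative, and the thesis applies FCM to full gridded covariate stacks rather than this toy sample.

```python
# Minimal fuzzy c-means sketch over two hypothetical covariates (e.g., slope
# and wetness index); the real study clusters DEM-derived and vegetation covariates.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])  # covariate values per cell

c, m, n_iter = 2, 2.0, 100                     # clusters, fuzzifier, iterations
U = rng.dirichlet(np.ones(c), size=len(X))     # fuzzy memberships, rows sum to 1

for _ in range(n_iter):
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    U = 1.0 / (d ** (2 / (m - 1)))
    U = U / U.sum(axis=1, keepdims=True)

# Cells with high maximum membership are natural candidates for purposive sampling.
candidates = np.argsort(U.max(axis=1))[::-1][:5]
print("candidate sampling cells:", candidates)
```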
The results show that the choice of the optimal number of clusters shifts as resolution is aggregated, which leads to variations in the optimal partition of the environmental covariate space and in the purposive sampling design. Combining vegetation covariates with terrain covariates produces different results from using terrain covariates alone. The level of resolution dependency and the influence of adding vegetation covariates vary with the DEM source. This study suggests that DEM resolution, vegetation, and DEM source are significant for the purposive sampling design in fuzzy logic-based digital soil mapping. The interpretation of fuzzy membership values at sampled sites also indicates associations between fuzzy clusters and soil series, which lends promise to the applicability of fuzzy logic-based digital soil mapping in areas where fieldwork and data are limited.
|
196 |
Determining and characterizing immunological self/non-self. Li, Ying, 15 February 2007 (has links)
The immune system has the ability to discriminate self from non-self proteins and to make appropriate immune responses to pathogens. A fundamental problem is to understand the genomic differences and similarities among the sets of self peptides and non-self peptides. The sequencing of human, mouse and numerous pathogen genomes and the cataloging of their respective proteomes allow host self and non-self peptides to be identified. T-cells make this determination at the peptide level based on peptides displayed by MHC molecules.

In this project, peptides of specific lengths (k-mers) are generated from each protein in the proteomes of various model organisms. The set of unique k-mers for each species is stored in a library and defines its "immunological self". Using the libraries, organisms can be compared to determine the levels of peptide overlap. The observed levels of overlap can also be compared with levels which would be expected "at random" and statistical conclusions drawn.

A problem with this procedure is that sequence information in public protein databases (Swiss-Prot, UniProt, PIR) often contains ambiguities. Three strategies for dealing with such ambiguities were explored in earlier work, and the strategy of removing ambiguous k-mers is used here.

Peptide fragments (k-mers) which elicit immune responses are often localized within the sequences of proteins from pathogens. These regions are known as "immunodominant" regions (i.e., hot spots) and are important in immunological work. After investigating the peptide universes and their overlaps, the question of whether known regions of immunological significance (e.g., epitopes) come from regions of low host-similarity is explored. The known epitope regions are compared with the regions of low host-similarity (i.e., non-overlaps) between the HIV-1 and human proteomes at the 7-mer level. Results show that the correlation between these two regions is not statistically significant. In addition, pairs involving humans and human viruses are explored. For these pairs, one graph for each k-mer level is generated showing the actual numbers of matches between organisms versus the expected numbers. From the graphs for the 5-mer and 6-mer levels, we can see that the number of overlapping occurrences increases as the size of the viral proteome increases.

A detailed investigation of the overlaps/non-overlaps between viral proteomes and the human proteome reveals that the distribution of the locations of these overlaps/non-overlaps may have "structure" (e.g., locality clustering). Thus, another question that is explored is whether the locality clustering is statistically significant. A chi-square analysis is used to analyze the locality clustering. Results show that the locality clusterings for HIV-1, HIV-2 and Influenza A virus at the 5-mer, 6-mer and 7-mer levels are statistically significant. For the self-similarity of the human protein Desmoglein 3 to the remaining human proteome, the locality clustering is not statistically significant at the 5-mer level, while it is at the 6-mer and 7-mer levels.
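A small sketch of the k-mer library idea, using made-up sequences; the real work operates on complete proteomes and also compares observed overlap against randomized expectations.

```python
# Sketch: build k-mer "self" libraries from toy proteomes and measure their overlap.
# The sequences are invented; the thesis uses full human and pathogen proteomes.
def kmer_library(proteins, k):
    """Unique k-mers across all proteins, skipping ambiguous residues (X, B, Z)."""
    ambiguous = set("XBZ")
    kmers = set()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if not ambiguous & set(window):
                kmers.add(window)
    return kmers

host = ["MKTAYIAKQR", "GAVLILSSPA"]          # hypothetical "self" proteome
virus = ["MKTAYLPQWE", "ILSSPAGHRR"]         # hypothetical viral proteome

k = 5
self_lib = kmer_library(host, k)
viral_lib = kmer_library(virus, k)
overlap = self_lib & viral_lib

print(f"{len(overlap)} of {len(viral_lib)} viral {k}-mers also occur in the host library")
```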
|
197 |
Hybrid Algorithms of Finding Features for Clustering Sequential Data. Chang, Hsi-mei, 08 July 2010 (has links)
Proteins are the structural components of living cells and tissues, and thus an important building block in all living organisms. Patterns in protein sequences are subsequences which appear frequently. Patterns often denote important functional regions in proteins and can be used to characterize a protein family or discover the function of proteins. Moreover, they provide valuable information about the evolution of species. Grouping protein sequences that share similar structure helps in identifying sequences with similar functionality. Many algorithms have been proposed for clustering proteins according to their similarity, i.e., sequential patterns in protein databases, for example, feature-based clustering algorithms of the global approach and the local approach. These use a sequential pattern mining algorithm to solve the no-gap-limit sequential pattern problem in a protein sequence database, and then find global features and local features separately for clustering. Feature-based clustering algorithms are entirely different approaches to protein clustering that do not require an all-against-all analysis and use a near-linear complexity K-means based clustering algorithm. Although feature-based clustering algorithms are scalable and lead to reasonably good clusters, they consume time by performing the global approach and the local approach separately. Therefore, in this thesis, we propose hybrid algorithms to find and mark features for feature-based clustering algorithms. We observe an interesting result from the relation between the local features and the closed frequent sequential patterns: some features in the closed frequent sequential patterns can be taken apart into several features in the local selected features, and the total support number of these features in the local selected features is equal to the support number of the corresponding feature in the closed frequent sequential patterns. There are two phases, find-feature and mark-feature, in the global approach and the local approach after mining sequential patterns. In our hybrid algorithm Method 1 (LocalG), we first find and mark the local features. Then, we find the global features. Finally, we mark the bit vectors of the global features efficiently from the bit vectors of the local features. In our hybrid algorithm Method 2 (CloseLG), we first find the closed frequent sequential patterns directly. Next, we find local candidate features efficiently from the closed frequent sequential patterns and then mark the local features. Finally, we find and mark the global features. From our performance study based on biological data and synthetic data, we show that our proposed hybrid algorithms are more efficient than the feature-based algorithm.
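As a simplified sketch of the mark-feature step, each selected pattern can be recorded as a bit vector over the sequence database and the resulting bit matrix fed to K-means; the sequences and patterns below are invented, and plain substring matching stands in for gapped sequential-pattern matching.

```python
# Sketch: for each selected pattern, build a bit vector over the sequence
# database recording which sequences contain it, then cluster the sequences in
# that feature space. Substring matching is a simplification of the gapped
# sequential patterns actually used.
import numpy as np
from sklearn.cluster import KMeans

sequences = ["ACDEFGH", "ACDKLMN", "QRSTUVW", "QRSTACD"]
features = ["ACD", "QRST"]                       # hypothetical selected patterns

# One bit per (sequence, feature): 1 if the pattern occurs in the sequence.
bits = np.array([[1 if f in s else 0 for f in features] for s in sequences])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(bits)
for s, l in zip(sequences, labels):
    print(s, "-> cluster", l)
```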
|
198 |
Target Tracking by Information Filtering in Cluster-based UWB Sensor Networks. Lee, Chih-ying, 19 August 2011 (has links)
We consider the topic of target tracking in this thesis. Target tracking is one of the applications of wireless sensor networks (WSNs). A clustering approach prolongs sensor lifetime and provides better data aggregation for WSNs. Most previous research assumed that cluster regions are disjoint, while other work assigned overlapping cluster regions and utilized them in applications such as inter-cluster routing and time synchronization. However, in overlapping clustering, processing of redundant sensing data may impair system performance. We present a regular distributed overlapping WSN in this thesis. The network is based on two kinds of sensors: (1) high-capability sensors, which are assigned as cluster heads (CHs) and are responsible for data processing and inter-cluster communication, and (2) normal sensors, which are far more numerous than the high-capability sensors and whose function is to provide data to the CHs. We define several operating modes for CHs and sensors, under which the WSN works more efficiently. Since a target may be located in an overlapping region, a redundant data processing problem exists. To solve this problem, we utilize Cholesky decomposition to decorrelate the measurement noise covariance matrices; the correlation is eliminated during this process. In addition, we modify the extended information filter (EIF) and adapt it to the decorrelated data. The CHs track the target, fuse the information from other CHs, and implement distributed positioning. The simulations are based on an ultra-wideband (UWB) environment, and we have verified that the proposed scheme works more efficiently under the settings of different modes. The performance with decorrelated measurements is better than that with correlated ones.
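A minimal sketch of the Cholesky decorrelation step, assuming a hypothetical measurement matrix H, noise covariance R, and measurement z; the thesis applies this inside a modified extended information filter for UWB ranging rather than the linear toy model shown here.

```python
# Sketch: decorrelate correlated measurement noise with a Cholesky factor
# before forming information-filter contributions. H, R, and z are assumed.
import numpy as np

H = np.array([[1.0, 0.0],
              [1.0, 0.5]])                  # measurement matrix (assumed)
R = np.array([[2.0, 0.8],
              [0.8, 1.0]])                  # correlated measurement noise covariance
z = np.array([3.1, 2.7])                    # raw measurement

L = np.linalg.cholesky(R)                   # R = L @ L.T
z_d = np.linalg.solve(L, z)                 # decorrelated measurement, noise cov -> I
H_d = np.linalg.solve(L, H)

# Information-form contributions with identity noise covariance:
info_matrix = H_d.T @ H_d                   # equals H.T @ inv(R) @ H
info_vector = H_d.T @ z_d                   # equals H.T @ inv(R) @ z
print(info_matrix, info_vector)
```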
|
199 |
Continuous Space Pattern Reduction Enhanced Metaheuristics for Clustering. Lin, Tzu-Yuan, 07 September 2012 (has links)
The pattern reduction (PR) algorithm we proposed previously, which works by eliminating patterns that are unlikely to change their membership during the convergence process, is one of the most efficient methods for reducing the computation time of clustering algorithms. However, it is limited to problems whose solutions can be binary or integer encoded, such as combinatorial optimization problems. As such, this study aims to develop a new pattern reduction algorithm, called pattern reduction over continuous space, to remove this limitation. Like the PR, the proposed algorithm consists of two operators: detection and compression. Unlike the PR, the detection operator is divided into two steps. The first step finds subsolutions that can be considered candidate subsolutions for compression. The second step ensures that the candidate subsolutions have reached their final state, so that any further computation on them is a waste and they can thus be compressed. To evaluate the performance of the proposed algorithm, we apply it to metaheuristics for clustering.
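As a rough sketch of the detection and compression idea inside a k-means-style loop (not the thesis's actual algorithm), points whose assignment has stayed unchanged for a few iterations can be frozen so their distances are no longer recomputed; the stability threshold and data below are assumptions.

```python
# Sketch: freeze points with stable cluster assignments to skip recomputing
# their distances, while they keep contributing to centroid updates.
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
w = np.ones(len(X))                    # per-point weights; a fuller version would merge
                                       # frozen points into weighted pseudo-points
active = np.ones(len(X), dtype=bool)   # points still re-examined every iteration
stable = np.zeros(len(X), dtype=int)
centers = X[rng.choice(len(X), 2, replace=False)]
assign = np.full(len(X), -1)

for it in range(20):
    d = np.linalg.norm(X[active][:, None, :] - centers[None, :, :], axis=2)
    new_assign = d.argmin(axis=1)
    idx = np.flatnonzero(active)
    stable[idx] = np.where(new_assign == assign[idx], stable[idx] + 1, 0)
    assign[idx] = new_assign
    # Detection: a point unchanged for 3 iterations is unlikely to move again.
    # Compression: stop recomputing its distances in later iterations.
    active[idx[stable[idx] >= 3]] = False
    # Weighted centroid update over all points (frozen ones keep contributing).
    for c in range(2):
        members = assign == c
        if members.any():
            centers[c] = np.average(X[members], axis=0, weights=w[members])

print("frozen points:", (~active).sum(), "of", len(X))
```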
|
200 |
Support for Location and Comprehension of User History in Collaborative Work. Kim, Do Hyoung, December 2011 (has links)
Users are being embraced as partners in developing computer services in many current computer-supported cooperative work systems. Many web-based applications, including collaborative authoring tools like wikis, place users into collaborations with unknown and distant partners. Individual participants in such environments need to identify and understand others' contributions for collaboration to succeed and be efficient. One approach to supporting such understanding is to record user activity for later access. Issues with this approach include difficulty in locating activity of interest in large tasks, and the history is often recorded at a system-activity level instead of at a human-activity level. To address these issues, this dissertation introduces CoActIVE, an application-independent history mechanism that clusters records of user activity and extracts keywords in an attempt to provide a human-level representation of history. CoActIVE is integrated into three different software applications to show its applicability and validity. Multiple visualization techniques based on this processing are compared in their ability to improve users' location and comprehension of the activity of others. The results show that filmstrip visualization and visual summarization of user activity provide significant improvement over traditional list-view interfaces.
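One simple way to move from system-level records toward a human-level history, shown purely as a sketch: group log entries separated by short time gaps into episodes and label each episode with its most frequent content words. The log entries, gap threshold, and stop list are hypothetical and not CoActIVE's actual mechanism.

```python
# Sketch: group low-level activity records into episodes by time gap and label
# each episode with simple keywords. All values here are made up.
from collections import Counter

log = [
    (0,  "edit section introduction wording"),
    (3,  "edit section introduction add citation"),
    (60, "upload figure results chart"),
    (62, "edit caption results chart"),
]
stop = {"edit", "add", "section"}
gap_threshold = 30                      # minutes of inactivity that close an episode

episodes, current = [], [log[0]]
for prev, rec in zip(log, log[1:]):
    if rec[0] - prev[0] > gap_threshold:
        episodes.append(current)
        current = []
    current.append(rec)
episodes.append(current)

for i, ep in enumerate(episodes):
    words = [w for _, text in ep for w in text.split() if w not in stop]
    keywords = [w for w, _ in Counter(words).most_common(3)]
    print(f"episode {i}: {len(ep)} actions, keywords = {keywords}")
```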
CoActIVE generates an interpretation of large-scale interaction history and provides the interpretation through a variety of visualizations that allow users to navigate the evolution of collaborative work. It supports branching history, with the understanding that asynchronous authoring and design tasks often involve the parallel development of alternatives. Additionally, CoActIVE has the potential to be integrated into a variety of applications with little adjustment for compatibility. In particular, the comparison of visualizations for locating and comprehending the work of others is unique.
|