About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Data Science techniques for predicting plant genes involved in secondary metabolites production

Muteba, Ben Ilunga January 2018 (has links)
Masters of Science / Plant genome analysis is currently experiencing a boost due to the reduced costs associated with next-generation sequencing technologies. Knowledge of the genetic background can be applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and biological engineering. In medicinal plants, secondary metabolites are of particular interest because they often represent the main active ingredients associated with health-promoting qualities. Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial agents, UV protectants, and insect or herbivore repellents. Genome mining tools developed to analyse genetic material have seldom addressed secondary metabolite genes and their biosynthesis pathways, and little research has studied key enzyme factors that can predict a class of secondary metabolite genes from the polyketide synthases. The objectives of this study were twofold. The primary aim was to identify the biological properties of secondary metabolite genes and to select a specific gene, naringenin-chalcone synthase, also known as chalcone synthase (CHS); the study hypothesised that data science approaches to mining biological data, particularly secondary metabolite genes, would reveal important aspects of secondary metabolites (SM). The secondary aim was to propose a proof of concept for classifying or predicting plant genes involved in polyphenol biosynthesis using data science techniques, applying machine learning algorithms together with mathematical and statistical approaches. Three specific challenges arose while analysing secondary metabolite datasets: 1) class imbalance, i.e., a lack of proportionality among protein sequence classes; 2) high dimensionality, i.e., the very large feature space that arises when analysing bioinformatics datasets; and 3) variable sequence length, i.e., protein sequences differ in length. Given these inherent issues, developing precise classification and statistical models is challenging, so effective mining of SM plant genes requires dedicated data science techniques that can collect, prepare and analyse SM genes.
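The three data challenges listed above lend themselves to standard remedies: variable-length protein sequences can be mapped to fixed-length k-mer composition vectors, and class imbalance can be countered with class weighting. The following is a minimal sketch of that idea; the sequences, labels, and model choice are hypothetical placeholders, not the thesis's actual pipeline.

```python
# Sketch: fixed-length k-mer composition features for variable-length protein
# sequences, plus class weighting to counter class imbalance.
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_composition(seq: str, k: int = 2) -> np.ndarray:
    """Return the normalised k-mer frequency vector of a protein sequence."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:                      # skip ambiguous residues
            counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total else counts

# Hypothetical CHS-like (1) vs. non-CHS (0) sequences; real data would come from a curated set.
sequences = ["MVSVSEIRKAQRAEGPATILAIGTANP", "MKLVLFACD",
             "MVTVEEVRKAQRAEGPATVLAIGTATP", "MHHKRS"]
labels = [1, 0, 1, 0]

X = np.vstack([kmer_composition(s) for s in sequences])
clf = RandomForestClassifier(class_weight="balanced", random_state=0)  # balanced weights offset imbalance
clf.fit(X, labels)
print(clf.predict(X))
```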
2

Automated Feature Engineering for Deep Neural Networks with Genetic Programming

Heaton, Jeff T. 01 January 2017 (has links)
Feature engineering is a process that augments the feature vector of a machine learning model with calculated values designed to enhance the accuracy of the model's predictions. Research has shown that the accuracy of models such as deep neural networks, support vector machines, and tree/forest-based algorithms sometimes benefits from feature engineering. These engineered features are usually created from expressions that combine one or more of the original features, and the exact structure of an engineered feature depends on the type of machine learning model in use. Previous research demonstrated that different model families benefit from different types of engineered features: random forests, gradient-boosting machines, and other tree-based models might not see the same accuracy gain from an engineered feature that neural networks, generalized linear models, and other dot-product-based models achieve on the same data set. This dissertation presents a genetic programming-based algorithm that automatically engineers features that increase the accuracy of deep neural networks for some data sets. For a genetic programming algorithm to be effective, it must prioritize the search space and efficiently evaluate what it finds. The algorithm presented here faces a potential search space composed of all possible mathematical combinations of the original feature vector, and five experiments were designed to guide the search process toward efficiently evolving good engineered features. The result of this dissertation is an automated feature engineering (AFE) algorithm that is computationally efficient, even though a neural network is used to evaluate each candidate feature. This approach gave the algorithm a greater opportunity to specifically target deep neural networks in its search for engineered features that improve accuracy. Finally, a sixth experiment empirically demonstrated the degree to which this algorithm improved the accuracy of neural networks on data sets augmented by the algorithm's engineered features.
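As a rough illustration of the kind of search described here (not the dissertation's algorithm), the sketch below evolves random expression trees over the original features and scores each candidate by the validation error of a small neural network trained with the candidate feature appended; the data, operators, and evolutionary loop are invented for the example.

```python
# Genetic-programming-style feature engineering: candidate features are random
# expression trees over the original inputs, scored by a small neural network.
import random
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] / (np.abs(X[:, 1]) + 0.1) + X[:, 2]              # hypothetical target

OPS = {"add": np.add, "sub": np.subtract, "mul": np.multiply,
       "div": lambda a, b: a / (np.abs(b) + 1e-3)}           # protected division

def random_expr(depth=2):
    """Random expression tree; leaves are original feature indices."""
    if depth == 0 or random.random() < 0.3:
        return ("x", random.randrange(X.shape[1]))
    return (random.choice(list(OPS)), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, data):
    if expr[0] == "x":
        return data[:, expr[1]]
    return OPS[expr[0]](evaluate(expr[1], data), evaluate(expr[2], data))

def fitness(expr):
    """Negative validation MSE of a small network using the engineered feature."""
    feats = np.column_stack([X, evaluate(expr, X)])
    Xtr, Xva, ytr, yva = train_test_split(feats, y, random_state=0)
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=300, random_state=0)
    net.fit(Xtr, ytr)
    return -np.mean((net.predict(Xva) - yva) ** 2)

random.seed(0)
population = [random_expr() for _ in range(12)]
for _ in range(4):                                           # a few generations
    population.sort(key=fitness, reverse=True)
    population = population[:6] + [random_expr() for _ in range(6)]  # keep elites, refill randomly
print("best engineered feature:", max(population, key=fitness))
```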
3

Improving Model Performance with Robust PCA

Bennett, Marissa A. 15 May 2020 (has links)
As machine learning becomes an increasingly relevant field incorporated into everyday life, so does the need for consistently high-performing models. Given these high expectations, along with potentially restrictive data sets, it is crucial to use machine learning techniques that increase the likelihood of success. Robust Principal Component Analysis (RPCA) not only extracts anomalous data but also finds correlations among the given features in a data set, and these correlations can themselves be used as features. Taking a novel approach to utilizing the output of RPCA, we examine how our method affects the performance of such models. We also take the efficiency of our approach into account and use projectors to make our method's run time 99.79% faster. We apply our method primarily to cyber security data sets, though we also investigate its effects on data sets from other fields (e.g., medical data).
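For reference, a generic Robust PCA decomposition via principal component pursuit can be sketched as below: the observed matrix is split into a low-rank part (whose structure can supply extra features) and a sparse part (which flags anomalies). This is the standard inexact-ALM-style iteration, not the thesis's projector-accelerated method, and the data are synthetic.

```python
# Robust PCA by principal component pursuit: M ≈ L + S with low-rank L and sparse S.
import numpy as np

def robust_pca(M, mu=None, lam=None, tol=1e-7, max_iter=500):
    m, n = M.shape
    mu = mu or (m * n) / (4 * np.abs(M).sum())
    lam = lam or 1 / np.sqrt(max(m, n))
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    shrink = lambda X, tau: np.sign(X) * np.maximum(np.abs(X) - tau, 0)
    for _ in range(max_iter):
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(sig, 1 / mu)) @ Vt        # singular value thresholding
        S = shrink(M - L + Y / mu, lam / mu)             # soft-threshold the residual
        resid = M - L - S
        Y += mu * resid
        if np.linalg.norm(resid) / (np.linalg.norm(M) + 1e-12) < tol:
            break
    return L, S

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20))
sparse = (rng.random((100, 20)) < 0.05) * rng.normal(scale=10, size=(100, 20))
L, S = robust_pca(low_rank + sparse)
print("recovered rank:", np.linalg.matrix_rank(L, tol=1e-3))   # low-rank part recovered; S holds anomalies
```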
4

Towards Learning Representations in Visual Computing Tasks

January 2017 (has links)
The performance of most visual computing tasks depends on the quality of the features extracted from the raw data. An insightful feature representation increases the performance of many learning algorithms by exposing the underlying explanatory factors of the output for unobserved input. A good representation should also handle anomalies in the data, such as missing samples and noisy input caused by undesired external factors of variation, and should reduce data redundancy. Over the years, many feature extraction processes have been invented to produce good representations of raw images and videos. These processes can be categorized into three groups. The first group contains processes that are hand-crafted for a specific task. Hand-engineering features requires domain expertise and manual labor, but the resulting feature extraction process is interpretable and explainable. The next group contains latent-feature extraction processes. While the original features lie in a high-dimensional space, the factors relevant to a task often lie on a lower-dimensional manifold. Latent-feature extraction employs hidden variables to expose underlying data properties that cannot be measured directly from the input, and it imposes a specific structure, such as sparsity or low rank, on the derived representation through sophisticated optimization techniques. The last category is that of deep features, obtained by passing raw input data with minimal pre-processing through a deep network whose parameters are computed by iteratively minimizing a task-based loss. In this dissertation, I present four pieces of work in which I create and learn suitable data representations. The first task employs hand-crafted features to perform clinically relevant retrieval of diabetic retinopathy images. The second task uses latent features to perform content-adaptive image enhancement. The third task ranks pairs of images by their aesthetic quality. The goal of the last task is to capture localized image artifacts in small datasets with patch-level labels. For the latter two tasks, I propose novel deep architectures and show significant improvement over previous state-of-the-art approaches. A suitable combination of feature representations, augmented with an appropriate learning approach, can increase performance for most visual computing tasks. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2017
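As a small illustration of the latent-feature category (not taken from the dissertation), the sketch below learns a sparse dictionary over handwritten-digit images and uses the sparse codes as the representation; the dataset and hyperparameters are arbitrary choices for the example.

```python
# Latent features via sparse dictionary learning: sparse codes expose hidden
# structure (sparsity) in a lower-dimensional representation of raw images.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning

digits = load_digits()
X = digits.data / 16.0                        # 8x8 images flattened to 64-dim vectors
dico = MiniBatchDictionaryLearning(n_components=32, alpha=1.0, random_state=0)
codes = dico.fit_transform(X)                 # sparse latent codes, one row per image
print("representation shape:", codes.shape)
print("avg. nonzeros per code:", np.mean((codes != 0).sum(axis=1)))
```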
5

Evolutionary dynamics, topological disease structures, and genetic machine learning

Gryder, Ryan Wayne 06 October 2021 (has links)
Topological evolution is a new dynamical systems model of biological evolution occurring within a genomic state space. It can be modeled equivalently as a stochastic dynamical system, a stochastic differential equation, or a partial differential equation drift-diffusion model. An application of this approach is a model of disease evolution that traces diseases in ways similar to standard functional traits (e.g., organ evolution). Genetically embedded diseases become evolving functional components of species-level genomes. The competition between species-level evolution (which tends to maintain diseases) and individual evolution (which acts to eliminate them) yields a novel structural topology for the stochastic dynamics involved. In particular, an unlimited set of dynamical time scales emerges as a means of timing different levels of evolution: from individual to group to species and larger units. These scales exhibit a dynamical tension between individual and group evolution, which are modeled on very different (fast and slow, respectively) time scales. This is analyzed in the context of a potentially major constraint on evolution: the species-level enforcement of lifespan via (topological) barriers to genomic longevity. This species-enforced behavior is analogous to certain types of evolutionary altruism, but it is denoted here as extreme altruism based on its potential shaping through mass extinctions. We give examples of biological mechanisms implementing some of the topological barriers discussed and provide mathematical models for them. This picture also introduces an explicit basis for lifespan-limiting evolutionary pressures: a species-level need to maintain flux in its genome via a paced turnover of its biomass, necessitated by the need for phenomic characteristics to keep pace with genomic changes through evolution. Put briefly, the phenome must keep up with the genome, which occurs with an optimized limited lifespan. An important consequence of this model is a new role for diseases in evolution. Rather than being accidental side-effects, they play a central functional role in shaping an optimal lifespan for a species, implemented through the topology of their embedding into the genome state space. This includes cancers, which are known to be embedded into the genome in complex and sometimes hair-triggered ways arising from DNA damage. Such cancers are also known to act in engineered and teleological ways that have been difficult to explain using currently popular theories of intra-organismic cancer evolution. This alternative inter-organismic picture presents cancer evolution as occurring over much longer (evolutionary) time scales, rather than the very short organic evolution that occurs in individual cancers, and this in turn may explain some evolved, intricate, and seemingly engineered properties of cancer. This dynamical evolutionary model is framed in a multiscale picture in which different time scales are almost independently active in the evolutionary process, acting on semi-independent parts of the genome. We additionally move from natural evolution to artificial implementations of evolutionary algorithms, studying genetic programming for the structured construction of machine learning features in a new structural risk minimization environment.
While genetic programming for feature engineering is not new, we propose a Lagrangian optimization criterion for defining new feature sets inspired by structural risk minimization in statistical learning. We bifurcate the optimization of this Lagrangian into two exhaustive categories, local and global search. The former is accomplished through local descent within given basins of attraction, while the latter is done through a combinatorial search for new basins via an evolutionary algorithm.
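A toy rendering of this two-level scheme, under invented data and an assumed form of the Lagrangian (empirical risk plus a λ-weighted capacity term), might look like the following: local search is plain descent within a basin, and global search evolves a population of starting points toward new basins.

```python
# Two-level optimization of a structural-risk-style Lagrangian: local descent
# within basins of attraction, plus an evolutionary search for new basins.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)
lam = 0.05                                        # Lagrange multiplier on the capacity term

def lagrangian(w):
    fit = np.mean((X @ w - y) ** 2)               # empirical risk
    complexity = np.sum(np.sqrt(w ** 2 + 1e-8))   # smooth L1 surrogate as capacity term
    return fit + lam * complexity

def local_descent(w0):
    """Local search: descend within the basin of attraction of w0."""
    return minimize(lagrangian, w0, method="L-BFGS-B")

# Global search: evolve a population of starting points to discover new basins.
population = [rng.normal(size=5) for _ in range(12)]
for _ in range(10):
    scored = sorted(population, key=lambda w: local_descent(w).fun)
    parents = scored[:4]
    population = parents + [p + rng.normal(scale=0.5, size=5) for p in parents for _ in range(2)]

best = min((local_descent(w) for w in population), key=lambda r: r.fun)
print("best objective:", best.fun, "weights:", np.round(best.x, 2))
```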
6

Context-Awareness for Adversarial and Defensive Machine Learning Methods in Cybersecurity

Quintal, Kyle 14 August 2020 (has links)
Machine learning has shown great promise when applied to large volumes of historical data and produces strong results when combined with contextual properties. In the world of the Internet of Things, the extraction of contextual information is becoming increasingly prominent with scientific advances. Combining such advancements with artificial intelligence is one of the themes of this thesis. In particular, there are two major areas of interest: context-aware attacker modelling and context-aware defensive methods. Both areas use authentication methods to either infiltrate or protect digital systems. After a brief introduction in chapter 1, chapter 2 discusses the contextual information currently extracted in cybersecurity studies and how machine learning accomplishes a variety of cybersecurity goals. Chapter 3 introduces an attacker injection model, championing the adversarial methods. Chapter 4 then extracts contextual data and provides an intelligent machine learning technique to mitigate anomalous behaviours. Chapter 5 explores the feasibility of adopting a similar defensive methodology in the cyber-physical domain, and future directions are presented in chapter 6. We begin this thesis by explaining the need for further improvements in cybersecurity using contextual information and discussing its feasibility, now that ubiquitous sensors exist in our everyday lives; these sensors often show a high correlation with user identity in surprising combinations. Our first contribution lies within the domain of Mobile CrowdSensing (MCS). Despite its benefits, MCS requires proper security solutions to prevent various attacks, notably injection attacks. Our smart injection model, SINAM, monitors data traffic in an online-learning manner, simulating an injection model that goes undetected 99% of the time. SINAM leverages contextual similarities within a given sensing campaign to mimic anomalous injections. On the flip side, we investigate how contextual features can be utilized to improve authentication methods in an enterprise context. Also motivated by the emergence of omnipresent mobile devices, we expand the spatio-temporal features of unfolding contexts by introducing three contextual metrics: document shareability, document valuation, and user cooperation. These metrics were vetted against modern machine learning techniques and achieved an average of 87% successful authentication attempts. Our third contribution aims to further improve such results by introducing a Smart Enterprise Access Control (SEAC) technique. Combining the new contextual metrics with SEAC achieved an authenticity precision of 99% and a recall of 97%. Finally, the last contribution is an introductory study on risk analysis and mitigation using context. Here, cyber-physical coupling metrics are created to extract a precise representation of unfolding contexts in the medical field. The presented consensus algorithm achieves initial system convenience and security ratings of 88% and 97% with these new metrics. Even as a feasibility study, physical context extraction shows good promise for improving cybersecurity decisions. In short, machine learning is a powerful tool when coupled with contextual data and is applicable across many industries. Our contributions show how the engineering of contextual features, together with adversarial and defensive methods, can produce applicable solutions in cybersecurity, despite minor shortcomings.
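To make the idea of context-aware authentication concrete, the sketch below feeds the three contextual metrics named above, as purely synthetic numeric features of an access attempt, to an off-the-shelf classifier and reports precision and recall; the data, model, and distributions are stand-ins, not SEAC or the thesis's datasets.

```python
# Context-aware authentication scoring: contextual metrics of an access attempt
# (document shareability, document valuation, user cooperation) as classifier inputs.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
legit = rng.normal(loc=[0.3, 0.7, 0.8], scale=0.15, size=(n, 3))       # typical user context
attack = rng.normal(loc=[0.8, 0.2, 0.3], scale=0.25, size=(n // 4, 3)) # anomalous context
X = np.vstack([legit, attack]).clip(0, 1)
y = np.concatenate([np.ones(n), np.zeros(n // 4)])                     # 1 = authentic attempt

Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
pred = clf.predict(Xte)
print("precision:", precision_score(yte, pred), "recall:", recall_score(yte, pred))
```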
7

Feature-based Time Series Analytics

Kegel, Lars 27 May 2020 (has links)
Time series analytics is a fundamental prerequisite for decision-making as well as automation, and it occurs in several applications such as energy load control, weather research, and consumer behavior analysis. It encompasses time series engineering, i.e., the representation of time series by their important characteristics, and data mining, i.e., the application of that representation to a specific task. Due to the exhaustive data gathering that results from the "Industry 4.0" vision and its shift towards automation and digitalization, time series analytics is undergoing a revolution. Big datasets with very long time series are being gathered, which is challenging for engineering techniques. Traditionally, one focus has been on raw-data-based or shape-based engineering. These techniques assess the time series' similarity in shape, which is only suitable for short time series. Another focus has been on model-based engineering, which assesses the time series' similarity in structure; this is suitable for long time series but requires larger models or time-consuming modeling. Feature-based engineering tackles these challenges by efficiently representing time series and comparing their similarity in structure. However, current feature-based techniques are unsatisfactory, as they are designed for specific data-mining tasks. In this work, we introduce a novel feature-based engineering technique. It efficiently provides a short representation of time series, focusing on their structural similarity. Based on a design rationale, we derive important time series characteristics, such as the long-term and cyclically repeated characteristics as well as distribution and correlation characteristics. Moreover, we define a feature-based distance measure for their comparison. Both the representation technique and the distance measure provide desirable properties regarding storage and runtime. Subsequently, we introduce techniques based on our feature-based engineering and apply them to important data-mining tasks such as time series generation, time series matching, time series classification, and time series clustering. First, our feature-based generation technique outperforms state-of-the-art techniques regarding the accuracy of the evolved datasets. Second, with our features, a matching method retrieves a match for a time series query much faster than with current representations. Third, our features provide discriminative characteristics that classify datasets as accurately as state-of-the-art techniques, but orders of magnitude faster. Finally, our features recommend an appropriate clustering of time series, which is crucial for subsequent data-mining tasks. All these techniques are assessed on datasets from the energy, weather, and economic domains and thus demonstrate their applicability to real-world use cases. The findings demonstrate the versatility of our feature-based engineering and suggest several courses of action for designing and improving analytical systems for the paradigm shift of Industry 4.0.
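A bare-bones version of such a feature-based representation might summarize each series by a handful of structural characteristics and compare series in that short feature space, as sketched below; the particular features and the synthetic load curves are illustrative choices, not the thesis's exact feature set.

```python
# Feature-based time series representation: summarize each series by structural
# characteristics (trend, seasonal strength, distribution, lag-1 autocorrelation)
# and measure similarity in that short feature space rather than on raw shapes.
import numpy as np

def ts_features(x, season=24):
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0]                          # long-term trend
    folded = x[: len(x) // season * season].reshape(-1, season)
    seasonal_strength = folded.mean(axis=0).std() / (x.std() + 1e-12)
    lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]                 # correlation characteristic
    return np.array([slope, seasonal_strength, x.mean(), x.std(), lag1])

def feature_distance(a, b):
    return np.linalg.norm(ts_features(a) - ts_features(b))  # compare structure, not shape

rng = np.random.default_rng(0)
t = np.arange(24 * 60)                                      # 60 days of hourly values
load_a = 10 + 0.01 * t + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(size=t.size)
load_b = 12 + 0.01 * t + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(size=t.size)
noise  = rng.normal(size=t.size)
print(feature_distance(load_a, load_b), "<", feature_distance(load_a, noise))
```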
8

Efficiency analysis of verbal radio communication in air combat simulation / Effektivitetsanalys av verbal radiokommunikation i luftstridssimulering

Lilja, Hanna January 2016 (has links)
Efficient communication is an essential part of cooperative work, and no less so for radio communication during air combat. With time a limited resource and the consequences of a misunderstanding potentially fatal, there is little room for negligence. This work is an exploratory study that combines data mining, machine learning, natural language processing and visual analytics to investigate the possibilities of using radio traffic data from air combat simulations for human performance evaluation. Both temporal and linguistic properties of the communication were analyzed, with several promising graphical results. Additionally, utterance classification was successfully attempted, with mean precision and recall both over 0.9. It is hoped that more complex and more fully automated data-based communication analysis can be built upon the results presented in this report.
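The utterance-classification step could, for example, be approximated with TF-IDF features and a linear classifier, as in the sketch below; the utterances and label set are invented placeholders, since the simulation's radio-traffic data and categories are not reproduced here.

```python
# Utterance classification sketch: TF-IDF features plus a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

utterances = [
    "two contacts bearing three one zero", "turn left heading two seven zero",
    "fox three on lead bandit", "picture clean", "request vector to tanker",
    "engaged defensive", "bogey dope", "committing on the southern group",
]
labels = ["report", "command", "engagement", "report",
          "request", "engagement", "request", "command"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(utterances, labels)
# Toy in-sample check; a real evaluation would use held-out utterances.
print(classification_report(labels, clf.predict(utterances), zero_division=0))
```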
9

LSTM Feature Engineering Through Time Series Similarity Embedding / Aspektkonstruktion för LSTM-nätverk genom inbäddning av tidsserielikheter

Bångerius, Sebastian January 2022 (has links)
Time series prediction has many applications. In cases with simultaneous series (like measurements of weather from multiple stations, or multiple stocks on the stock market) it is not unlikely that series from different measurement origins behave similarly or respond to the same contextual signals. Training input to a prediction model could be constructed from all simultaneous measurements to try to capture the relations between the measurement origins. A generalized approach is instead to train a prediction model on samples from any individual measurement origin. The total amount of data is the same in both cases, but the first option uses fewer samples of a larger width, while the second uses a higher number of smaller samples. The first, high-width option risks over-fitting as a result of fewer training samples per input variable. The second, general option has no way to learn relations between the measurement origins. Amending the general model with contextual information would allow a high samples-per-variable ratio to be kept without losing the ability to take the origin of the measurements into account. This thesis presents a vector embedding method for measurement origins in an environment with a shared response to contextual signals. The embeddings are based on multivariate time series from the origins. The embedding method is inspired by the co-occurrence matrices commonly used in Natural Language Processing. The similarity measures used between the series are Dynamic Time Warping (DTW), step-wise Euclidean distance, and Pearson correlation. The dimensionality of the resulting embeddings is reduced by Principal Component Analysis (PCA) to increase information density and effectively preserve variance in the similarity space. The created embedding system allows contextualization of samples, akin to the human intuition that comes from knowing where measurements were taken, such as knowing what sort of company a stock ticker represents or what environment a weather station is located in. In the embedded space, embeddings of series from fundamentally similar measurement origins are located close together, so that information regarding the behavior of one can be generalized to its neighbors. The embeddings produced in this work agree well with existing clustering methods on a weather dataset, and partially on a financial dataset, and they provide a performance improvement for an LSTM network acting on that financial dataset. The similarity embeddings also outperform an embedding layer trained together with the LSTM.
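The embedding construction can be sketched as follows: compute a pairwise similarity matrix between measurement origins (here from DTW distance and Pearson correlation of their series) and reduce it with PCA. The series are synthetic, and the way the two similarity views are combined is an arbitrary choice for the example, not the thesis's exact formulation.

```python
# Similarity embedding sketch: pairwise DTW + correlation similarities between
# measurement origins, reduced by PCA to a short embedding per origin.
import numpy as np
from sklearn.decomposition import PCA

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf); D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
t = np.linspace(0, 6 * np.pi, 120)
origins = [np.sin(t + rng.normal(scale=0.3)) + rng.normal(scale=0.2, size=t.size)
           for _ in range(6)] + \
          [np.cos(2 * t) + rng.normal(scale=0.2, size=t.size) for _ in range(6)]

k = len(origins)
sim = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        d = dtw(origins[i], origins[j])
        r = np.corrcoef(origins[i], origins[j])[0, 1]
        sim[i, j] = -d + r                            # combine the two similarity views

embeddings = PCA(n_components=2).fit_transform(sim)   # dense, low-dimensional embedding
print(np.round(embeddings, 2))                        # similar origins land close together
```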
10

Credit Modeling with Behavioral Data / Kreditmodellering med beteendedata

Zhou, Jingning January 2022 (has links)
In recent years, Buy Now Pay Later services have spread across the e-commerce industry, and credit modeling is inevitably of interest for the companies involved, which must predict the default rate of their customers. The traditional data used in such models are financial bureau data, i.e., credit records bought from external financial institutions. However, external bureau data are not guaranteed to be of high quality, are expensive, and in some markets a large share of the population lacks bank records. In terms of ethics, bureau data can lead to discrimination between traditional asset holders and the younger generation, and, for an international company, between developed and developing countries. Instead of comparing different classification methods, this paper investigates the feasibility and use of customers' click behavior (CB) data in credit modeling by carrying out feature engineering and comparative experiments. The study examines whether and how CB data can be used as a new data source and what its restrictions are. The results show that although the CB data does not enhance the performance of the traditional model, a CB-data model performs adequately on orders for which CB data is available but weakly on orders in general, owing to the hit rate of the CB data. CB data is predictive not only for orders placed in the shopping app but also for orders the same customer places through other sources, such as the website. Moreover, CB data performs better on specific customer segments, including new customers, shopping-app customers, and customers with high order amounts, and adding such segment indicators can improve the performance of the CB model. In addition, the best click-behavior feature set is selected using correlation analysis and recursive feature elimination.
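The final feature-selection step could be approximated as below: prune one feature from each highly correlated pair, then run recursive feature elimination with a regularized logistic model. The click-behavior feature names, the synthetic data, and the default-probability formula are all invented for illustration.

```python
# Feature selection for hypothetical click-behavior (CB) features:
# correlation-based pruning followed by recursive feature elimination (RFE).
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
cb = pd.DataFrame({
    "sessions_30d":     rng.poisson(5, n),
    "avg_session_secs": rng.gamma(2.0, 60.0, n),
    "items_viewed":     rng.poisson(12, n),
    "cart_adds":        rng.poisson(2, n),
    "night_share":      rng.beta(2, 5, n),
})
cb["clicks_total"] = cb["items_viewed"] * 3 + rng.poisson(1, n)      # nearly redundant feature
# Synthetic default labels from an assumed (invented) risk relationship.
default = (rng.random(n) < 1 / (1 + np.exp(1.5 - 0.2 * cb["cart_adds"] - 2 * cb["night_share"]))).astype(int)

# 1) drop one feature from each highly correlated pair
corr = cb.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
pruned = cb.drop(columns=to_drop)

# 2) recursive feature elimination with a regularized logistic model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(pruned, default)
print("dropped (correlation):", to_drop)
print("selected (RFE):", list(pruned.columns[rfe.support_]))
```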
