201

Computational Analysis of Metabolomic Toxicological Data Derived from NMR Spectroscopy

Kelly, Benjamin J. 26 May 2009 (has links)
No description available.
202

Sparse Multinomial Logistic Regression via Approximate Message Passing

Byrne, Evan Michael 14 October 2015 (has links)
No description available.
203

Multiple-Instance Feature Ranking

Latham, Andrew C. 26 January 2016 (has links)
No description available.
204

A Genetic Algorithm Approach to Feature Selection for Computer Aided Detection of Lung Nodules

Sprague, Matthew J. January 2016 (has links)
No description available.
205

Applied Machine Learning : A case study in machine learning in the paper industry / Tillämpad maskininlärning : En fallstudie om maskininlärning i pappersindustrin

Sjögren, Anton, Quan, Baiwei January 2022 (has links)
With the rapid advancement of hardware and software technologies, machine learning has been pushed to the forefront of business-value-generating technologies. More and more businesses are starting to invest in machine learning to keep up with those that have already benefited from it. A local paper-processing business is looking to improve the estimation of each order's runtime on its machines by leveraging machine learning. Traditionally, these predictions are made by experienced planners, but the actual runtimes do not always match the predictions. This thesis investigated whether a machine learning model could be built to produce better estimates for the local business. By following a well-defined machine learning workflow in combination with Microsoft's AutoML Model Builder and data-processing techniques, the results show that predictions made by the machine learning model outperform the human-made ones within an accepted margin.
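As a minimal sketch of the kind of order-runtime regression workflow this case study describes, the example below uses scikit-learn rather than Microsoft's AutoML Model Builder; the file name and columns ("order_length", "paper_grade", "planned_runtime", "actual_runtime") are hypothetical placeholders, not fields from the thesis.

```python
# Hedged sketch: predict order runtime and compare the model's error against
# the planners' recorded estimates. All column names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

orders = pd.read_csv("orders.csv")                       # hypothetical order history
X = orders[["order_length", "paper_grade"]]
y = orders["actual_runtime"]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["paper_grade"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("reg", GradientBoostingRegressor())])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_tr, y_tr)

# Mirror the thesis's model-versus-planner comparison on held-out orders.
model_mae = mean_absolute_error(y_te, model.predict(X_te))
planner_mae = mean_absolute_error(y_te, orders.loc[y_te.index, "planned_runtime"])
print(f"model MAE: {model_mae:.1f}, planner MAE: {planner_mae:.1f}")
```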
206

Graph theory applications in the energy sector : From the perspective of electric utility companies

Espinosa, Kristofer, Vu, Tam January 2020 (has links)
Graph theory is the mathematical study of objects and their pairwise relations, also known as nodes and edges. The birth of graph theory is often considered to have taken place in 1736, when Leonhard Euler tried to solve a problem involving the seven bridges of Königsberg in Prussia. In more recent times, graphs have caught the attention of companies from many industries due to their power for modelling and analysing large networks. This thesis investigates the use of graph theory in the energy sector for a utility company, in particular Fortum, whose activities include, but are not limited to, the production and distribution of electricity and heat. The output of the thesis is a broad overview of graph-theoretic concepts and their applications, as well as an evaluation of energy-related use cases in which some concepts are analysed in greater depth. The use case chosen within the scope of this thesis is feature selection for electricity price forecasting. Feature selection is a process for reducing the number of features, also known as input variables, typically carried out before a regression model is built to avoid overfitting and to increase model interpretability. Five graph-based feature selection methods with different points of view are studied. Experiments are conducted on realistic data sets with many features to verify the validity of the methods. One of the data sets is owned by Fortum and is used for forecasting the electricity price, among other important quantities. The obtained results look promising according to several evaluation metrics and can be used by Fortum as a support tool for developing prediction models. In general, a utility company can likely take advantage of graph theory in many ways and add value to its business through enriched mathematical knowledge.
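As a concrete illustration of the graph-based feature selection idea (this is a generic sketch, not one of the five methods evaluated in the thesis), the example below treats features as nodes of a redundancy graph, connects pairs whose absolute correlation exceeds a threshold, and greedily removes the most connected features. The function name and threshold are assumptions for illustration.

```python
# Hedged sketch of graph-based feature selection: build a correlation graph
# over features and prune the most redundant (highest-degree) nodes.
import numpy as np
import pandas as pd

def correlation_graph_selection(X: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    corr = X.corr().abs().to_numpy()
    np.fill_diagonal(corr, 0.0)
    adjacency = corr > threshold                # edge where two features are highly correlated
    keep = list(range(X.shape[1]))
    while True:
        degrees = adjacency[np.ix_(keep, keep)].sum(axis=1)
        if degrees.max() == 0:                  # no redundant pairs remain
            break
        keep.pop(int(degrees.argmax()))         # drop the most connected feature
    return [X.columns[i] for i in keep]

# Hypothetical usage on a table of candidate inputs for price forecasting:
# selected = correlation_graph_selection(features_df, threshold=0.85)
```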
207

LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA

Lou, Qiang January 2013 (has links)
Data sets with irrelevant and redundant features and a large fraction of missing values are common in real-life applications. Learning from such data usually requires preprocessing, such as selecting informative features and imputing missing values based on the observed data. These steps can provide more accurate and more efficient prediction as well as a better understanding of the data distribution. In my dissertation I describe my work on both of these aspects, as well as my follow-up work on feature selection in incomplete data sets without imputing missing values. In the last part of the dissertation, I present my current work on the more challenging situation where the high-dimensional data evolves over time. The first two parts of the dissertation cover methods that handle such data in a straightforward way: imputing missing values first and then applying a traditional feature selection method to select informative features. We proposed two novel methods, one for imputing missing values and the other for selecting informative features. The imputation method fills in missing attributes by exploiting temporal correlations of attributes, correlations among multiple attributes collected at the same time and place, and spatial correlations among attributes from multiple sources. The proposed feature selection method aims to find a minimum subset of the most informative variables for classification/regression by efficiently approximating the Markov blanket, a set of variables that can shield a given variable from the target. In the third part, I present how to perform feature selection in incomplete high-dimensional data without imputation, since imputation methods only work well when data is missing completely at random, when the fraction of missing values is small, or when there is prior knowledge about the data distribution. We define the objective function of the uncertainty-margin-based feature selection method to maximize each instance's uncertainty margin in its own relevant subspace, and the optimization takes into account the uncertainty of each instance due to its missing values. Experimental results on synthetic data and six benchmark data sets with few missing values (less than 25%) provide evidence that our method can select the same accurate features as alternative methods that apply an imputation method first. However, when there is a large fraction of missing values (more than 25%) in the data, our feature selection method outperforms the alternatives that impute missing values first. In the fourth part, I introduce a method for the more challenging situation where the high-dimensional data varies in time. The existing way to handle such data is to flatten the temporal data into a single static data matrix and then apply a traditional feature selection method. To preserve the dynamics in the time-series data, our method avoids flattening the data in advance. We propose a way to measure the distance between the multivariate temporal data of two instances and, based on this distance, define a new objective function built on the temporal margin of each data instance. A fixed-point gradient descent method is proposed to solve the formulated objective function and learn the optimal feature weights. Experimental results on real temporal microarray data provide evidence that the proposed method can identify more informative features than alternatives that flatten the temporal data in advance. / Computer and Information Science
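As a point of reference for the "impute first, then select" baseline the abstract contrasts against (this is a generic scikit-learn pipeline, not the dissertation's proposed methods), a minimal sketch might look like the following; the choice of mean imputation, mutual-information scoring, and k=20 are illustrative assumptions.

```python
# Hedged sketch of the baseline pipeline: impute missing values, then apply a
# traditional filter-style feature selection method.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

impute_then_select = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),           # fill in missing entries
    ("select", SelectKBest(mutual_info_classif, k=20)),   # keep 20 informative features
])

# Synthetic stand-in: 100 samples, 50 features, ~20% missing completely at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X[rng.random(X.shape) < 0.2] = np.nan
y = rng.integers(0, 2, size=100)

X_reduced = impute_then_select.fit_transform(X, y)
print(X_reduced.shape)  # (100, 20)
```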
208

Solar flare prediction using advanced feature extraction, machine learning and feature selection

Ahmed, Omar W., Qahwaji, Rami S.R., Colak, Tufan, Higgins, P.A., Gallagher, P.T., Bloomfield, D.S. 03 1900 (has links)
Novel machine-learning and feature-selection algorithms have been developed to study: (i) the flare prediction capability of magnetic feature (MF) properties generated by the recently developed Solar Monitor Active Region Tracker (SMART); (ii) the SMART MF properties that are most significantly related to flare occurrence. Spatio-temporal association algorithms are developed to associate MFs with flares from April 1996 to December 2010 in order to differentiate flaring and non-flaring MFs and enable the application of machine learning and feature selection algorithms. A machine-learning algorithm is applied to the associated datasets to determine the flare prediction capability of all 21 SMART MF properties. The prediction performance is assessed using standard forecast verification measures and compared with the prediction measures of one of the industry's standard machine-learning-based technologies for flare prediction, Automated Solar Activity Prediction (ASAP). The comparison shows that the combination of SMART MFs with machine learning has the potential to achieve more accurate flare prediction than ASAP. Feature selection algorithms are then applied to determine the MF properties that are most related to flare occurrence. It is found that a reduced set of six MF properties can achieve a similar degree of prediction accuracy as the full set of 21 SMART MF properties.
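For context on the standard forecast verification measures mentioned above, the sketch below computes two widely used skill scores (the true skill statistic and the Heidke skill score) from a flare/no-flare confusion matrix; the counts shown are invented for illustration and are not results from the paper.

```python
# Hedged sketch: common skill scores for binary flare forecasts.
def skill_scores(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    tss = tp / (tp + fn) - fp / (fp + tn)  # true skill statistic (Hanssen-Kuipers)
    hss = 2 * (tp * tn - fp * fn) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    )                                      # Heidke skill score
    return {"TSS": tss, "HSS": hss}

# Hypothetical counts from a flaring/non-flaring MF classification.
print(skill_scores(tp=120, fp=40, fn=30, tn=810))
```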
209

Flight Data Processing Techniques to Identify Unusual Events

Mugtussids, Iossif B. 26 June 2000 (has links)
Modern aircraft are capable of recording hundreds of parameters during flight. This fact not only facilitates the investigation of an accident or a serious incident, but also provides the opportunity to use the recorded data to predict future aircraft behavior. It is believed that, by analyzing the recorded data, one can identify precursors to hazardous behavior and develop procedures to mitigate the problems before they actually occur. Because of the enormous amount of data collected during each flight, it becomes necessary to identify the segments of data that contain useful information. The objective is to distinguish between typical data points, which are present in the majority of flights, and unusual data points, which are found in only a few flights. The distinction between typical and unusual data points is achieved by using classification procedures. In this dissertation, the application of classification procedures to flight data is investigated. It is proposed to use a Bayesian classifier that tries to identify the flight from which a particular data point came. If the flight from which the data point came is identified with a high level of confidence, then it can be concluded that the data point is unusual within the investigated flights. The Bayesian classifier uses the overall and conditional probability density functions together with a priori probabilities to make a decision. Estimating probability density functions is a difficult task in multiple dimensions. Because many of the recorded signals (features) are redundant, highly correlated, or very similar in every flight, feature selection techniques are applied to identify the signals that contain the most discriminatory power. In the limited amount of data available to this research, twenty-five features were identified as the set exhibiting the best discriminatory power. Additionally, the number of signals is reduced by applying feature generation techniques to similar signals. To make the approach applicable in practice when many flights are considered, an efficient and fast sequential data clustering algorithm is proposed. The order in which the samples are presented to the algorithm is fixed according to the probability density function value. Accuracy and reduction level are controlled using two scalar parameters: a distance threshold value and a maximum compactness factor. / Ph. D.
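As a toy illustration of the Bayesian decision rule described above (estimate a density per flight, weight by a prior, and assign a point to the flight with the highest posterior), the sketch below uses Gaussian kernel density estimates on synthetic two-dimensional data; the data, features, and priors are placeholders, not flight-recorder parameters.

```python
# Hedged sketch: Bayesian classification of a data point to its most likely flight.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Three "flights", each a (n_features, n_samples) block of recorded values.
flights = [rng.normal(loc=mu, scale=1.0, size=(2, 300)) for mu in (0.0, 1.5, 3.0)]

densities = [gaussian_kde(f) for f in flights]           # conditional p(x | flight)
priors = np.array([f.shape[1] for f in flights], dtype=float)
priors /= priors.sum()                                   # a priori p(flight)

def classify(x: np.ndarray) -> tuple[int, float]:
    """Return the most likely flight and its posterior probability."""
    likelihoods = np.array([d(x)[0] for d in densities])
    posterior = likelihoods * priors
    posterior /= posterior.sum()
    return int(posterior.argmax()), float(posterior.max())

flight, confidence = classify(np.array([3.1, 2.9]))
print(flight, round(confidence, 3))  # a confident assignment marks the point as distinctive
```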
210

Machine Learning Approaches for Modeling and Correction of Confounding Effects in Complex Biological Data

Wu, Chiung Ting 09 June 2021 (has links)
With the huge volume of biological data generated by new technologies and the boom in machine-learning-based analytical tools, we expect to advance life science and human health at an unprecedented pace. Unfortunately, there is a significant gap between the complex raw biological data from real life and the data required by mathematical and statistical tools. This gap arises from two fundamental and universal problems in biological data, both related to confounding effects. The first is the intrinsic complexity of the data: an observed sample may be a mixture of multiple underlying sources, while we may be interested in only one or some of those sources. The second comes from the acquisition process: different samples may be gathered at different times and/or from different locations, so each sample carries a specific distortion that must be carefully addressed. These confounding effects obscure the signals of interest in the acquired data. Specifically, this dissertation addresses the two major challenges in removing confounding effects: alignment and deconvolution. Liquid chromatography-mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention time (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods, such as those in the popular XCMS package, often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected by these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments. To utilize this information, we develop an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges to the warping functions of neighboring samples. We applied ncGTW to two large-scale metabolomics LC-MS datasets, where it identified many misaligned features and successfully realigned them; these features would otherwise be discarded or left uncorrected by existing methods. When the desired signal is buried in a mixture, deconvolution is needed to recover the pure sources. Many biological questions can be better addressed when the data is in the form of individual sources instead of mixtures. Though there are some promising supervised deconvolution methods, unsupervised deconvolution is still needed when there is no a priori information. Among current unsupervised methods, Convex Analysis of Mixtures (CAM) is the most theoretically solid and best-performing one. However, it has some major limitations. Most importantly, the overall time complexity can be very high, especially when analyzing a large dataset or a dataset with many sources. Also, since there are some stochastic and heuristic steps, the deconvolution result is not accurate enough. To address these problems, we redesigned the modules of CAM.
In the feature clustering step, we propose a clustering method, radius-fixed clustering, which not only controls the spatial size of each cluster but also identifies outliers at the same time. This avoids the disadvantages of K-means clustering, such as instability and the need to specify the number of clusters in advance. Moreover, when identifying the convex hull, we replace Quickhull with linear programming, which decreases the computation time significantly. To avoid the heuristic and approximate step in optimal simplex identification, we propose a greedy search strategy instead. The experimental results demonstrate a substantial improvement in computation time, and the accuracy of the deconvolution is also shown to be higher than that of the original CAM. / Doctor of Philosophy / Due to the complexity of biological data, there are two major pre-processing steps: alignment and deconvolution. The alignment step corrects time- and location-related data acquisition distortion by aligning the detected signals to a reference signal. Though many alignment methods have been proposed for biological data, most of them fail to carefully consider the relationships among samples. This structural information can aid alignment when the data is noisy and/or irregular. To utilize this information, we develop a new method, Neighbor-wise Compound-specific Graphical Time Warping (ncGTW), inspired by graph theory. This new alignment method not only utilizes the structural information but also provides a reference-free solution. We show that the performance of our new method is better than that of other methods on both simulations and real datasets. When the signal comes from a mixture, deconvolution is needed to recover the pure sources. Many biological questions can be better addressed when the data is in the form of single sources instead of mixtures. There is a classic unsupervised deconvolution method, Convex Analysis of Mixtures (CAM), but it has some limitations. For example, the time complexity of some steps is very high, so when facing a large dataset or a dataset with many sources, the computation time would be extremely long. Also, since there are some stochastic and heuristic steps, the deconvolution result may not be accurate enough. We improved CAM, and the experimental results show that both the speed and the accuracy of the deconvolution are significantly improved.
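To make the radius-fixed clustering idea concrete, here is a simplified, leader-style sketch: each cluster has a fixed radius, and points that end up in very small clusters are flagged as outliers. The assignment order, the min_size rule, and all parameter values are assumptions for illustration; the dissertation's actual algorithm may differ.

```python
# Hedged sketch of radius-fixed clustering: fixed-radius clusters plus
# simultaneous outlier detection, in the spirit described above.
import numpy as np

def radius_fixed_clustering(X: np.ndarray, radius: float, min_size: int = 3):
    centers: list[np.ndarray] = []
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        if centers:
            dists = np.linalg.norm(np.stack(centers) - x, axis=1)
            j = int(dists.argmin())
            if dists[j] <= radius:           # falls inside an existing cluster
                labels[i] = j
                continue
        centers.append(x)                    # otherwise start a new cluster here
        labels[i] = len(centers) - 1

    counts = np.bincount(labels, minlength=len(centers))
    outliers = np.isin(labels, np.where(counts < min_size)[0])
    return labels, outliers

# Synthetic usage: two dense groups and one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2)), [[10.0, 10.0]]])
labels, outliers = radius_fixed_clustering(X, radius=1.0)
print(labels.max() + 1, outliers.sum())      # clusters found, points flagged
```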
