211
Empirical Hierarchical Modeling and Predictive Inference for Big, Spatial, Discrete, and Continuous Data
Sengupta, Aritra, 17 December 2012
No description available.
212
Canonical Correlation and Clustering for High Dimensional Data
Ouyang, Qing, January 2019
Multi-view datasets arise naturally in statistical genetics when the genetic and trait profiles of an individual are portrayed by two feature vectors. A motivating problem concerning the Skin Intrinsic Fluorescence (SIF) study on the Diabetes Control and Complications Trial (DCCT) subjects is presented. A widely applied quantitative method for exploring the correlation structure between the two domains of a multi-view dataset is Canonical Correlation Analysis (CCA), which seeks the canonical loading vectors such that the transformed canonical covariates are maximally correlated. In the high-dimensional case, regularization of the dataset is required before CCA can be applied. Furthermore, the nature of genetic research suggests that sparse output is more desirable. In this thesis, two regularized CCA (rCCA) methods and a sparse CCA (sCCA) method are presented. When correlation sub-structure exists, a stand-alone CCA method will not perform well. To tackle this limitation, a mixture of local CCA models can be employed. In this thesis, I review a correlation clustering algorithm proposed by Fern, Brodley and Friedl (2005), which seeks to group subjects into clusters such that features are identically correlated within each cluster. An evaluation study is performed to assess the effectiveness of the CCA and correlation clustering algorithms using artificial multi-view datasets. Both sCCA and sCCA-based correlation clustering exhibited superior performance compared to rCCA and rCCA-based correlation clustering. The sCCA method and sCCA-based clustering are applied to the multi-view dataset consisting of PrediXcan-imputed gene expression and SIF measurements of DCCT subjects. The stand-alone sparse CCA method identified 193 of 11,538 genes as correlated with SIF#7. Further investigation of these 193 genes with simple linear regression and t-tests revealed that only two genes, ENSG00000100281.9 and ENSG00000112787.8, were significant in association with SIF#7. No plausible clustering scheme was detected by the sCCA-based correlation clustering method. / Thesis / Master of Science (MSc)
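To make the core CCA step above concrete, here is a minimal sketch using scikit-learn's classical (unregularized) CCA on synthetic two-view data. The data, dimensions, and noise levels are illustrative assumptions, not the DCCT/SIF data, and the rCCA/sCCA variants the thesis studies would replace this estimator in the high-dimensional setting.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Synthetic two-view data: a shared latent signal drives correlation
# between a "genetic" view X and a "trait" view Y for 200 subjects.
n = 200
latent = rng.normal(size=(n, 1))
X = np.hstack([latent + 0.5 * rng.normal(size=(n, 1)) for _ in range(10)])
Y = np.hstack([latent + 0.8 * rng.normal(size=(n, 1)) for _ in range(5)])

# Classical CCA: find loading vectors such that the projected canonical
# covariates are maximally correlated across the two views.
cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)

r1 = np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1]  # first canonical correlation
print(f"First canonical correlation: {r1:.3f}")
```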
213
Data-driven Infrastructure Inspection
Bianchi, Eric Loran, 18 January 2022
Bridge inspection and infrastructure inspection are critical steps in the lifecycle of the built environment. Emerging technologies and data are driving factors that are disrupting the traditional processes for conducting these inspections. Because inspections are mainly conducted visually by human inspectors, this work focuses on improving the visual inspection process with data-driven approaches. Data-driven approaches, however, require significant data, which were sparse in the existing literature. Therefore, this research first examined the present state of the existing data in the research domain. We reviewed hundreds of image-based visual inspection papers that used machine learning to augment the inspection process, and from this we compiled a comprehensive catalog of over forty available datasets in the literature and identified promising, emerging techniques and trends in the field. Based on the findings of our review, we contributed six significant datasets to target gaps in the field's data. The six datasets comprised structural material segmentation, corrosion condition state segmentation, crack detection, structural detail detection, and bearing condition state classification. The contributed datasets used novel annotation guidelines and benefited from a novel semi-automated annotation process for both object detection and pixel-level detection models. Using the data obtained from our collected sources, task-appropriate deep learning models were trained. From these datasets and models, we developed a change detection algorithm to monitor damage evolution between two inspection videos and trained a GAN-Inversion model that generated hyper-realistic synthetic bridge inspection image data and could forecast a future deterioration state of an existing bridge element. While the application of machine learning techniques in civil engineering is not yet widespread, this research makes an impactful contribution that demonstrates the advantages data-driven sciences can provide in inspecting structures, cataloging deterioration, and forecasting potential outcomes more economically and efficiently. / Doctor of Philosophy / Bridge inspection and infrastructure inspection are critical steps in the lifecycle of the built environment. Emerging technologies and data are driving factors that are disrupting the traditional processes for conducting these inspections. Because inspections are mainly conducted visually by human inspectors, this work focuses on improving the visual inspection process with data-driven approaches. Such approaches, however, require significant data, which were sparse in the existing literature. Therefore, this research first examined the present state of the existing data in the research domain. We reviewed hundreds of image-based visual inspection papers that used machine learning to augment the inspection process, and from this we compiled a comprehensive catalog of over forty available datasets in the literature and identified promising, emerging techniques and trends in the field. Based on the findings of our review, we contributed six significant datasets to target gaps in the field's data, covering structural material detection, corrosion condition state identification, crack detection, structural detail detection, and bearing condition state classification. The contributed datasets used novel labeling guidelines and benefited from a novel semi-automated labeling process for the artificial intelligence models. Using the data obtained from our collected sources, task-appropriate artificial intelligence models were trained. From these datasets and models, we developed a change detection algorithm to monitor damage evolution between two inspection videos and trained a generative model that produced hyper-realistic synthetic bridge inspection image data and could forecast a future deterioration state of an existing bridge element. While the application of machine learning techniques in civil engineering is not yet widespread, this research makes an impactful contribution that demonstrates the advantages data-driven sciences can provide in inspecting structures, cataloging deterioration, and forecasting potential outcomes more economically and efficiently.
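The thesis's change detection algorithm is not specified in the abstract, so the sketch below only illustrates the general idea of comparing two inspection videos: a crude per-frame-pair difference score with OpenCV. The video paths are placeholders, and a real pipeline would first register the frames and reason over segmented damage regions rather than raw pixels.

```python
import cv2
import numpy as np

# Placeholder paths: two registered inspection videos of the same element.
cap_a = cv2.VideoCapture("inspection_2020.mp4")
cap_b = cv2.VideoCapture("inspection_2022.mp4")

change_scores = []
while True:
    ok_a, frame_a = cap_a.read()
    ok_b, frame_b = cap_b.read()
    if not (ok_a and ok_b):
        break
    # Grayscale absolute difference as a crude per-frame change signal.
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    change_scores.append(float(np.mean(cv2.absdiff(gray_a, gray_b))))

cap_a.release()
cap_b.release()
if change_scores:
    print(f"Mean change score over {len(change_scores)} frame pairs: "
          f"{np.mean(change_scores):.3f}")
```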
214
New Theoretical Techniques For Analyzing And Mitigating Password Cracking Attacks
Liu, Peiyuan, 26 April 2024
<p dir="ltr">Brute force guessing attacks continue to pose a significant threat to user passwords. To protect user passwords against brute force attacks, many organizations impose restrictions aimed at forcing users to select stronger passwords. Organizations may also adopt stronger hashing functions in an effort to deter offline brute force guessing attacks. However, these defenses induce trade-offs between security, usability, and the resources an organization is willing to investigate to protect passwords. In order to make informed password policy decisions, it is crucial to understand the distribution over user passwords and how policy updates will impact this password distribution and/or the strategy of a brute force attacker.</p><p dir="ltr">This first part of this thesis focuses on developing rigorous statistical tools to analyze user password distributions and the behavior of brute force password attackers. In particular, we first develop several rigorous statistical techniques to upper and lower bound the guessing curve of an optimal attacker who knows the user password distribution and can order guesses accordingly. We apply these techniques to analyze eight password datasets and two PIN datasets. Our empirical analysis demonstrates that our statistical techniques can be used to evaluate password composition policies, compare the strength of different password distributions, quantify the impact of applying PIN blocklists, and help tune hash cost parameters. A real world attacker may not have perfect knowledge of the password distribution. Prior work introduced an efficient Monte Carlo technique to estimate the guessing number of a password under a particular password cracking model, i.e., the number of guesses an attacker would check before this particular password. This tool can also be used to generate password guessing curves, but there is no absolute guarantee that the guessing number and the resulting guessing curves are accurate. Thus, we propose a tool called Confident Monte Carlo that uses rigorous statistical techniques to upper and lower bound the guessing number of a particular password as well as the attacker's entire guessing curve. Our empirical analysis also demonstrate that this tool can be used to help inform password policy decisions, e.g., identifying and warning users with weaker passwords, or tuning hash cost parameters.</p><p dir="ltr">The second part of this thesis focuses on developing stronger password hashing algorithms to protect user passwords against offline brute force attacks. In particular, we establish that the memory hard function Scrypt, which has been widely deployed as password hash function, is maximally bandwidth hard. We also present new techniques to construct and analyze depth robust graph with improved concrete parameters. Depth robust graph play an essential rule in the design and analysis of memory hard functions.</p>
215
Voice for Decision Support in Healthcare Applied to Chronic Obstructive Pulmonary Disease Classification: A Machine Learning Approach
Idrisoglu, Alper, January 2024
Background: Advancements in machine learning (ML) techniques and voice technology offer the potential to harness voice as a new tool for developing decision-support tools in healthcare, for the benefit of both healthcare providers and patients. Motivated by technological breakthroughs and the increasing integration of Artificial Intelligence (AI) and ML in healthcare, numerous studies aim to investigate the diagnostic potential of ML algorithms in the context of voice-affecting disorders. This thesis focuses on respiratory diseases such as Chronic Obstructive Pulmonary Disease (COPD) and explores the potential of a decision-support tool that utilizes voice and ML. This exploration exemplifies the intricate relationship between voice and overall health through the lens of applied health technology (AHT). The interdisciplinary nature of this research underscores the need for accurate and efficient diagnostic tools.

Objective: The objectives of this licentiate thesis are twofold. First, a Systematic Literature Review (SLR) thoroughly investigates the current state of ML algorithms in detecting voice-affecting disorders, pinpointing existing gaps and suggesting directions for future research. Second, the study focuses on respiratory health, specifically COPD, employing ML techniques with a distinct emphasis on the vowel "A". The aim is to explore hidden information that could potentially be utilized for the binary classification of COPD vs. no COPD. The creation of a new Swedish COPD voice classification dataset is anticipated to enhance the experimental and exploratory dimensions of the research.

Methods: One commonly utilized method for obtaining a holistic view of a research field is to scan and analyze the literature. Paper I therefore followed the methodology of an SLR, in which existing journal publications were scanned and synthesized to create a holistic view of the ML techniques employed in experiments on voice-affecting disorders. Based on the results of the SLR, Paper II focused on data collection and experimentation for the binary classification of COPD, one of the gaps identified in the first study. Three distinct ML algorithms were investigated on the collected dataset of voice features, which consisted of recordings collected via a mobile application from participants aged 18 and above, and the most commonly used performance measures were computed for the best outcome.

Results: The findings of Paper I reveal the dominance of Support Vector Machine (SVM) classifiers in voice disorder research, with Parkinson's Disease and Alzheimer's Disease as the most studied disorders. Gaps in the research include underrepresented disorders, datasets limited in number of participants, and a lack of interest in longitudinal studies. Paper II demonstrates promising results in COPD classification using ML and a newly developed dataset, offering insights into potential decision-support tools for COPD diagnosis.

Conclusion: The studies covered in this thesis provide a comprehensive literature summary of ML techniques used to support decision-making on voice-affecting disorders for clinical outcomes. The findings contribute to understanding the diagnostic potential of applying ML to vocal features and highlight avenues for future research and technology development. Moreover, the experiment reveals the potential of employing voice as a digital biomarker for COPD diagnosis using ML.
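As a sketch of the general recipe such studies follow (summary acoustic features from a sustained vowel, then a binary classifier such as the SVMs Paper I found dominant), the following assumes librosa for MFCC extraction and substitutes synthetic features for real recordings so the demo is self-contained; the thesis's actual features, algorithms, and dataset may differ.

```python
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def vowel_features(wav_path):
    """Mean and std of 13 MFCCs over a sustained-vowel recording:
    one common voice feature set; the thesis's exact features may differ."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# For a self-contained demo, synthetic features stand in for real
# recordings (1 = COPD, 0 = no COPD); real use would call vowel_features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 26)),   # controls
               rng.normal(0.6, 1.0, size=(40, 26))])  # COPD, shifted mean
y = np.array([0] * 40 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print(f"Held-out accuracy: {clf.score(X_te, y_te):.2f}")
```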
216
Harnessing the Value of Open Data through Business Model Adaptation: A Multiple Case Study on Data-Intelligence Service-Providers
Thalin, Simon; Svennefalk, Marcus, January 2024
Purpose – The objective of this study is to explore how Data-Intelligence Service-Providers (DISPs) can adapt existing Business Model (BM) dimensions to leverage the potential value, and mitigate the emerging challenges, that Open Data (OD) introduces.

Method – Through a multiple case study, we qualitatively explore the BM practices DISPs employ when incorporating OD. Interviews were conducted in multiple phases, 25 in total, and the results were generated using a thematic analysis.

Findings – Through empirical investigation and analysis of DISPs' actions and strategies, the study uncovers how these firms navigate the challenges and opportunities presented by OD. By portraying the strategies across three BM dimensions (value creation, delivery, and capture), this study identifies six key practices that help DISPs differentiate themselves competitively in the OD environment: Use-Case Understanding and Data-Driven Service Innovation for value creation, Enhanced Data Delivery and Collaborative Data Optimization for value delivery, and Adjusted Revenue Model and Market Expansion for value capture.

Implications – In our contribution to the existing literature, we present empirical evidence spanning all dimensions of the BM, shedding light on the competitive advantages facilitated by OD. Additionally, by identifying key practices, this thesis uncovers several areas where understanding of OD's impact in a commercial context is lacking. Specifically, by focusing solely on the perspective of DISPs, we offer detailed insight into how these practices unfold in practice. Furthermore, the thesis presents a framework categorizing the practices by priority and ecosystem dependency. This framework delineates the practices that are fundamental when incorporating OD while also recognizing their intricate requirement of involving external parties, offering managers a visual overview of how to systematically adapt their BMs to incorporate OD into their services. In addition, we address common misconceptions about OD by offering a thorough theoretical foundation and defining OD clearly within a commercial context, making this complex topic more accessible and better understood.

Limitations and future research – As this study is limited to data-providers and DISPs, we advocate that future research explore end-user perspectives, which are crucial for a comprehensive understanding of users' needs and interactions with OD solutions and would help solidify the findings of this study. We also encourage future research to investigate misalignments between data-providers and DISPs (e.g., regulatory and technical matters), which currently lead to massive inefficiencies in data supply chains. Understanding these issues and implementing strategies to address them can optimize OD resource utilization, thereby facilitating greater innovative potential for the service-providers leveraging it.
217
[pt] DETECÇÃO DE CONTEÚDO SENSÍVEL EM VÍDEO COM APRENDIZADO PROFUNDO / [en] SENSITIVE CONTENT DETECTION IN VIDEO WITH DEEP LEARNING
Freitas, Pedro Vinicius Almeida de, 09 June 2022
Massive amounts of video are uploaded to video-hosting platforms every minute. This volume of data presents a challenge in controlling the type of content uploaded to these services, since the platforms are responsible for any sensitive media uploaded by their users. There has been an abundance of research on methods for the automatic detection of sensitive content. In this dissertation, we define sensitive content as sex, extreme physical violence, gore, or any scenes potentially disturbing to the viewer. We present a sensitive-video dataset for binary video classification (whether or not a video contains sensitive content), containing 127 thousand tagged videos, each with its extracted audio and visual embeddings. We also trained and evaluated four baseline models for the task of sensitive content detection in video. The best-performing model achieved a weighted F2-score of 99 percent on our test subset and 88.83 percent on the Pornography-2k dataset.
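The headline metric here, a weighted F2-score, can be computed with scikit-learn's fbeta_score; beta=2 weights recall twice as heavily as precision, which suits moderation settings where missing sensitive content is costlier than a false alarm. The labels below are random stand-ins, not the thesis's data.

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Toy ground truth and predictions for binary sensitive-content labels
# (1 = sensitive, 0 = safe); stand-ins for a real test split.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
y_pred = np.where(rng.random(1000) < 0.9, y_true, 1 - y_true)  # ~90% agree

# F2 emphasizes recall over precision (beta > 1).
f2 = fbeta_score(y_true, y_pred, beta=2, average="weighted")
print(f"Weighted F2-score: {f2:.4f}")
```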
218
Large Language Models as Advanced Data Preprocessors: Transforming Unstructured Text into Fine-Tuning Datasets
Vangeli, Marius, January 2024
The digital landscape increasingly generates vast amounts of unstructured textual data, valuable for analytics and various machine learning (ML) applications. These vast stores of data, often likened to digital gold, are nevertheless challenging to process and utilize. Traditional text processing methods, lacking the ability to generalize, typically struggle with unstructured and unlabeled data. For many complex data management workflows, the solution involves human intervention in the form of manual curation and labeling, a time-consuming process. Large Language Models (LLMs), AI models trained on vast amounts of text data, have remarkable Natural Language Processing (NLP) capabilities and offer a promising alternative. This thesis serves as an empirical case study of LLMs as advanced data preprocessing tools. It explores the effectiveness and limitations of using LLMs to automate and refine traditionally challenging data preprocessing tasks, highlighting a critical area of research in data management. An LLM-based preprocessing pipeline, designed to clean and prepare raw textual data for use in ML applications, is implemented and evaluated. The pipeline was applied to a corpus of unstructured text documents extracted from PDFs, with the aim of transforming them into a fine-tuning dataset for LLMs. Its efficacy was assessed by comparing the results against a manually curated benchmark dataset using two text similarity metrics: the Levenshtein distance and the ROUGE score. The findings indicate that although LLMs are not yet capable of fully replacing human curation in complex data management workflows, they substantially improve the efficiency and manageability of preprocessing unstructured textual data.
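Of the two similarity metrics used for evaluation, the Levenshtein distance is simple enough to sketch in full: the minimum number of single-character edits between the pipeline's output and the curated benchmark (ROUGE would additionally require a scoring package). The example strings are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) transforming a into b, via two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical comparison: LLM-cleaned text vs. manually curated benchmark.
llm_output = "The model was trained on curated text."
benchmark = "The model was trained on manually curated text."
d = levenshtein(llm_output, benchmark)
sim = 1 - d / max(len(llm_output), len(benchmark))
print(f"Edit distance: {d}, normalized similarity: {sim:.2f}")
```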
219
Optimization Approaches for the (r,Q) Inventory Policy
Moghtader, Omid, January 2024
This thesis presents a comprehensive investigation into the performance and generalizability of optimization approaches for the single-echelon (r, Q) inventory management policy under stochastic demand, specifically focusing on demand characterized by a Poisson distribution. The research integrates both classical optimization techniques and advanced metaheuristic methods, with a particular emphasis on Genetic Programming (GP), to assess the effectiveness of various heuristics. The study systematically compares the performance of these approaches in terms of both accuracy and computational efficiency using two well-known datasets. To rigorously evaluate the generalizability of the heuristics, an extensive random dataset of 10,000 instances, drawn from a vast population of approximately 24 billion instances, was generated and employed in this study.
Our findings reveal that the exact solution provided by the Federgruen-Zheng Algorithm consistently outperforms hybrid heuristics in terms of computational efficiency, confirming its reliability in smaller datasets where precise solutions are critical. Additionally, the extended Cooperative Coevolutionary Genetic Programming (eCCGP) heuristic proposed by Lopes et al. emerges as the most efficient in terms of runtime, achieving a remarkable balance between speed and accuracy, with an optimality error gap of only 1%. This performance makes the eCCGP heuristic particularly suitable for real-time inventory management systems, especially in scenarios involving large datasets where computational speed is paramount.
The implications of this study are significant for both theoretical research and practical applications, suggesting that while the exact method, i.e., the Federgruen-Zheng algorithm, is ideal for smaller datasets, the eCCGP heuristic provides a scalable and efficient alternative for larger, more complex datasets without substantial sacrifices in accuracy. These insights contribute to the ongoing development of more effective inventory management strategies in environments characterized by stochastic demand. / Thesis / Master of Science (MSc)
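For intuition about what is being optimized, here is a minimal simulation-based sketch of the (r, Q) policy under Poisson demand: order Q units whenever the inventory position falls to r or below, and grid-search candidate pairs by simulated average cost. The cost parameters are invented, and this brute-force search stands in for, rather than reproduces, the Federgruen-Zheng algorithm and the GP heuristics the thesis evaluates.

```python
import numpy as np

rng = np.random.default_rng(7)

def avg_cost(r, Q, lam=5, lead_time=2, h=1.0, p=9.0, K=32.0, periods=2000):
    """Simulated average per-period cost of an (r, Q) policy under
    Poisson(lam) demand: order Q units whenever the inventory position
    (on hand + on order) falls to r or below. h = holding cost,
    p = backorder penalty, K = fixed ordering cost (all invented)."""
    on_hand, pipeline, cost = r + Q, [], 0.0
    for t in range(periods):
        on_hand += sum(q for due, q in pipeline if due == t)  # receipts
        pipeline = [(due, q) for due, q in pipeline if due != t]
        on_hand -= rng.poisson(lam)          # demand; negative = backlog
        if on_hand + sum(q for _, q in pipeline) <= r:
            pipeline.append((t + lead_time, Q))               # reorder
            cost += K
        cost += h * max(on_hand, 0) + p * max(-on_hand, 0)
    return cost / periods

# Brute-force grid search over candidate (r, Q) pairs; illustrative only,
# and far less efficient than the exact algorithm or GP heuristics studied.
best = min(((r, Q) for r in range(4, 15) for Q in range(8, 25)),
           key=lambda rq: avg_cost(*rq))
print(f"Best (r, Q) on the grid: {best}")
```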
220
Price Prediction of Vinyl Records Using Machine Learning Algorithms
Johansson, David, January 2020
Machine learning algorithms have been used for price prediction in several application areas, including real estate, the stock market, tourist accommodation, electricity, art, cryptocurrencies, and fine wine. Common approaches in such studies are to evaluate the accuracy of predictions and to compare different algorithms, such as Linear Regression or Neural Networks. There is a thriving global second-hand market for vinyl records, but research on price prediction in this area is very limited. The purpose of this project was to build on existing knowledge of price prediction in general to evaluate some aspects of price prediction for vinyl records, including the achievable level of accuracy and the relative efficiency of different algorithms. A dataset of 37,000 samples of vinyl records was created with data from the Discogs website, and multiple machine learning algorithms were evaluated in a controlled experiment. Among the conclusions drawn from the results were that the Random Forest algorithm generally produced the strongest results, that results can vary substantially between different artists or genres, and that while a large share of the predictions had a good level of accuracy, a relatively small number of large errors had a considerable effect on the overall results.
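As an illustration of the experimental setup, the sketch below trains a Random Forest regressor on synthetic tabular features standing in for Discogs-derived attributes (the abstract does not list the actual features), with a few injected outliers to mimic the heavy-tailed errors the study reports.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)

# Synthetic stand-ins for record features: [release year, condition grade
# (0-5), copies owned, rarity score]. Real features are not specified in
# the abstract, so these are illustrative assumptions only.
n = 5000
X = np.column_stack([
    rng.integers(1960, 2020, n),   # release year
    rng.integers(0, 6, n),         # condition grade
    rng.integers(1, 500, n),       # copies owned
    rng.random(n),                 # rarity score
])
# Price loosely driven by condition and rarity, plus noise and a few
# large outliers to mimic heavy-tailed prediction errors.
price = 5 + 4 * X[:, 1] + 40 * X[:, 3] + rng.normal(0, 3, n)
price[rng.random(n) < 0.01] *= 10  # rare high-value records

X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
print(f"MAE: {mean_absolute_error(y_te, model.predict(X_te)):.2f}")
```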