Julock, Gregory Alan
01 January 2013
Buffer Overflows are a common type of network intrusion attack that continue to plague the networked community. Unfortunately, this type of attack is not well detected with current data mining algorithms. This research investigated the use of Random Forests, an ensemble technique that creates multiple decision trees, and then votes for the best tree. The research Investigated Random Forests' effectiveness in detecting buffer overflows compared to other data mining methods such as CART and Naïve Bayes. Random Forests was used for variable reduction, cost sensitive classification was applied, and each method's detection performance compared and reported along with the receive operator characteristics. The experiment was able to show that Random Forests outperformed CART and Naïve Bayes in classification performance. Using a technique to obtain Buffer Overflow most important variables, Random Forests was also able to improve upon its Buffer Overflow classification performance.
Anderson, Michael P.
Doctor of Philosophy / Department of Statistics / Suzanne Dubnicka / DNA barcodes are short strands of nucleotide bases taken from the cytochrome c oxidase subunit 1 (COI) of the mitochondrial DNA (mtDNA). A single barcode may have the form C C G G C A T A G T A G G C A C T G . . . and typically ranges in length from 255 to around 700 nucleotide bases. Unlike nuclear DNA (nDNA), mtDNA remains largely unchanged as it is passed from mother to offspring. It has been proposed that these barcodes may be used as a method of differentiating between biological species (Hebert, Ratnasingham, and deWaard 2003). While this proposal is sharply debated among some taxonomists (Will and Rubinoff 2004), it has gained momentum and attention from biologists. One issue at the heart of the controversy is the use of genetic distance measures as a tool for species differentiation. Current methods of species classification utilize these distance measures that are heavily dependent on both evolutionary model assumptions as well as a clearly defined "gap" between intra- and interspecies variation (Meyer and Paulay 2005). We point out the limitations of such distance measures and propose a character-based method of species classification which utilizes an application of Bayes' rule to overcome these deficiencies. The proposed method is shown to provide accurate species-level classification. The proposed methods also provide answers to important questions not addressable with current methods.
Borges, Hugo Lima
Orientadora: Ana Carolina Lorena / Dissertação (mestrado) - Universidade Federal do ABC. Programa de Pós-Graduação em Engenharia da Informação, 2009
Lee, Jun won
27 January 2011
Metalearning aims to obtain knowledge of the relationship between the mechanism of learning and the concrete contexts in which that mechanisms is applicable. As new mechanisms of learning are continually added to the pool of learning algorithms, the chances of encountering behavior similarity among algorithms are increased. Understanding the relationships among algorithms and the interactions between algorithms and tasks help to narrow down the space of algorithms to search for a given learning task. In addition, this process helps to disclose factors contributing to the similar behavior of different algorithms. We first study general characteristics of learning tasks and their correlation with the performance of algorithms, isolating two metafeatures whose values are fairly distinguishable between easy and hard tasks. We then devise a new metafeature that measures the difficulty of a learning task that is independent of the performance of learning algorithms on it. Building on these preliminary results, we then investigate more formally how we might measure the behavior of algorithms at a ner grained level than a simple dichotomy between easy and hard tasks. We prove that, among all many possible candidates, the Classifi er Output Difference (COD) measure is the only one possessing the properties of a metric necessary for further use in our proposed behavior-based clustering of learning algorithms. Finally, we cluster 21 algorithms based on COD and show the value of the clustering in 1) highlighting interesting behavior similarity among algorithms, which leads us to a thorough comparison of Naive Bayes and Radial Basis Function Network learning, and 2) designing more accurate algorithm selection models, by predicting clusters rather than individual algorithms.
Machine learning (ML) is an area of computer science that gives computers the ability to learn data patterns without prior programming for those patterns. Using neural networks in this area is based on simulating the biological functions of neurons in brains to learn patterns in data, giving computers a predictive ability to comprehend how data can be clustered. This research investigates the possibilities of using neural networks for classifying email, i.e. working as an email case manager. A Deep Neural Network (DNN) are multiple layers of neurons connected to each other by trainable weights. The main objective of this thesis was to evaluate how the three input arguments - data size, training time and neural network structure – affects the accuracy of Deep Neural Networks pattern recognition; also an evaluation of how the DNN performs compared to the statistical ML method, Naïve Bayes, in the form of prediction accuracy and complexity; and finally the viability of the resulting DNN as a case manager. Results show an improvement of accuracy on our networks with the increase of training time and data size respectively. By testing increasingly complex network structures (larger networks of neurons with more layers) it is observed that overfitting becomes a problem with increased training time, i.e. how accuracy decrease after a certain threshold of training time. Naïve Bayes classifiers performs worse than DNN in terms of accuracy, but better in reduced complexity; making NB viable on mobile platforms. We conclude that our developed prototype may work well in tangent with existing case management systems, tested by future research.
Modelos probabilísticos e não probabilísticos de classificação binária para pacientes com ou sem demência como auxílio na prática clínica em geriatria.Galdino, Maicon Vinícius. January 2020 (has links)
Orientador: Liciana Vaz de Arruda Silveira / Resumo: Os objetivos deste trabalho foram apresentar modelos de classificação (Regressão Logística, Naive Bayes, Árvores de Classificação, Random Forest, k-Vizinhos mais próximos e Redes Neurais Artificiais) e a comparação destes utilizando processos de reamostragem em um conjunto de dados da área de geriatria (diagnóstico de demência). Analisar as pressuposições de cada metodologia, vantagens, desvantagens e cenários em que cada metodologia pode ser melhor utilizada. A justificativa e relevância desse projeto se baseiam na importância e na utilidade do tema proposto, visto que a população idosa aumenta em todo o mundo (nos países desenvolvidos e nos em desenvolvimento como o Brasil), os modelos de classificação podem ser úteis aos profissionais médicos, em especial aos médicos generalistas, no diagnóstico de demências, pois em diversos momentos o diagnóstico não é simples. / Doutor
Exploration of infectious disease transmission dynamics using the relative probability of direct transmission between patientsLeavitt, Sarah Van Ness 06 October 2020 (has links)
The question “who infected whom” is a perennial one in the study of infectious disease dynamics. To understand characteristics of infectious diseases such as how many people will one case produce over the course of infection (the reproductive number), how much time between the infection of two connected cases (the generation interval), and what factors are associated with transmission, one must ascertain who infected whom. The current best practices for linking cases are contact investigations and pathogen whole genome sequencing (WGS). However, these data sources cannot perfectly link cases, are expensive to obtain, and are often not available for all cases in a study. This lack of discriminatory data limits the use of established methods in many existing infectious disease datasets. We developed a method to estimate the relative probability of direct transmission between any two infectious disease cases. We used a subset of cases that have pathogen WGS or contact investigation data to train a model and then used demographic, spatial, clinical, and temporal data to predict the relative transmission probabilities for all case-pairs using a simple machine learning algorithm called naive Bayes. We adapted existing methods to estimate the reproductive number and generation interval to use these probabilities. Finally, we explored the associations between various covariates and transmission and how they related to the associations between covariates and pathogen genetic relatedness. We applied these methods to a tuberculosis outbreak in Hamburg, Germany and to surveillance data in Massachusetts, USA. Through simulations we found that our estimated transmission probabilities accurately classified pairs as links and nonlinks and were able to accurately estimate the reproductive number and the generation interval. We also found that the association between covariates and genetic relatedness captures the direction but not absolute magnitude of the association between covariates and transmission, but the bias was improved by using effect estimates from the naive Bayes algorithm. The methods developed in this dissertation can be used to explore transmission dynamics and estimate infectious disease parameters in established datasets where this was not previously feasible because of a lack of highly discriminatory information, and therefore expand our understanding of many infectious diseases.
Historians, mostly engaging with written evidence, have argued that the Christianisation of the Roman Empire resulted in changes in both attitudes and behaviour towards children, resulting in a decrease in their maltreatment by society. I begin with a working hypothesis that this attitude-change was real and resulted in a reduction in the maltreatment of children; and that this reduction in maltreatment is evident in the literature. The approach to investigating this hypothesis belongs to the emerging field of digital humanities: by using programming techniques developed in the field of sentiment analysis, I create two sentiment-analysis like tools, one a lexicon-based approach, the other an application of a naive bayes machine learning approach. The latter is favoured as more accurate. The tool is used to automatically tag sentences, extracted from a corpus of texts written between 100 B.C and 600 A.D, that mention children, as to whether the sentences feature the maltreatment of children or not. The results are then quantitively analysed with reference to the year in which the text was written, with no statistically significant result found. However, the high accuracy of the tool in tagging sentences, at above 88%, suggests that similar tools may be able to play an important role, alongside traditional research techniques, in historical and social-science research in the future.
Isa, H., Trundle, Paul R., Neagu, Daniel
No / The growing incidents of counterfeiting and associated economic and health consequences necessitate the development of active surveillance systems capable of producing timely and reliable information for all stake holders in the anti-counterfeiting fight. User generated content from social media platforms can provide early clues about product allergies, adverse events and product counterfeiting. This paper reports a work in progress with contributions including: the development of a framework for gathering and analyzing the views and experiences of users of drug and cosmetic products using machine learning, text mining and sentiment analysis; the application of the proposed framework on Facebook comments and data from Twitter for brand analysis, and the description of how to develop a product safety lexicon and training data for modeling a machine learning classifier for drug and cosmetic product sentiment prediction. The initial brand and product comparison results signify the usefulness of text mining and sentiment analysis on social media data while the use of machine learning classifier for predicting the sentiment orientation provides a useful tool for users, product manufacturers, regulatory and enforcement agencies to monitor brand or product sentiment trends in order to act in the event of sudden or significant rise in negative sentiment.
Using sentiment analysis to craft a narrative of the COVID-19 pandemic from the perspective of social mediaRay, Taylor Breanna 06 August 2021 (has links)
Throughout the COVID-19 pandemic, people have turned to social media to share their experiences with the coronavirus and their feelings regarding subjects like social distancing, mask-wearing, COVID-19 vaccines, and other related topics. The publicly available nature of these social media posts provides researchers the chance to obtain a consensus on an array of issues, topics, people, and entities. For the COVID-19 pandemic, this is valuable information that can prepare communities and governing bodies for future epidemics or events of a similar magnitude. However, clearly defining such a consensus can be difficult, especially if researchers want to limit the amount of bias they introduce. The process of sentiment analysis helps to address this need by categorizing text sources into one of three distinct polarities. Namely, those polarities are often positive, neutral, and negative. While sentiment analysis can take form as a completely manual task, this becomes incredibly burdensome for projects that involve substantial amounts of data. This thesis attempts to overcome this challenge by programmatically classifying the sentiment of COVID-19 posts from 10 social media and web-based forums using a multinomial Naive Bayes classifier. The unique and contrasting qualities of the social networks being analyzed provide a robust take on the public's perception of the pandemic that has not yet been offered up to the present.
Page generated in 0.0613 seconds