  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Investigating Daily Fantasy Baseball: An Approach to Automated Lineup Generation

Smith, Ryan 01 June 2021 (has links) (PDF)
A recent trend among sports fans on both sides of the letterman jacket is Daily Fantasy Sports (DFS). The DFS industry has recently come under legal scrutiny, due to the view that daily sports performance is too random to be predicted skillfully, and that DFS therefore constitutes online gambling. This thesis shows that DFS, as it pertains to baseball, is significantly more predictable than random chance, and thus does not constitute gambling. We propose a system which generates daily lists of lineups for FanDuel Daily Fantasy Baseball contests. The system consists of two components: one for predicting scores for every player on a given day, and one for generating lists of the best combinations of players (lineups) using the predicted player scores. The player score prediction component makes use of deep neural network models, including a Long Short-Term Memory recurrent neural network, to model daily player performance over the 2016 and 2017 MLB seasons. Our results indicate this to be a useful prediction tool, even when not paired with the lineup generation component of our system. We build on previous work to develop two models for lineup generation, one completely novel, both dependent on a set of player predictions. Our evaluations show that these lineup generation models, paired with player predictions, perform significantly better than random, and our analysis offers insights into key aspects of the lineup generation process.
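For orientation, a minimal sketch of the kind of LSTM score-prediction component described above is given below; the window length, per-game feature count, and layer sizes are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of an LSTM regressor for daily fantasy point prediction.
# Shapes and hyperparameters are illustrative assumptions, not the thesis's settings.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

WINDOW, N_FEATURES = 10, 8          # last 10 games, 8 per-game stats (assumed)

def build_model():
    model = Sequential([
        LSTM(32, input_shape=(WINDOW, N_FEATURES)),  # summarize recent form
        Dense(16, activation="relu"),
        Dense(1),                                     # predicted fantasy points
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# X: rolling windows of recent games per player; y: next-day fantasy score
X = np.random.rand(256, WINDOW, N_FEATURES).astype("float32")
y = np.random.rand(256).astype("float32")
model = build_model()
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
preds = model.predict(X[:5], verbose=0)   # per-player estimates fed to lineup generation
```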
32

Big Data in Predictive Toxicology

Neagu, Daniel, Richarz, A-N. 15 January 2020 (has links)
The rate at which toxicological data are generated is continually increasing and the volume of data generated is growing dramatically. This is due in part to advances in software solutions and cheminformatics approaches which increase the availability of open data from chemical, biological, toxicological and high-throughput screening resources. However, the amplified pace and capacity of data generation achieved by these novel techniques present challenges for organising and analysing the data output. Big Data in Predictive Toxicology discusses these challenges as well as the opportunities offered by new techniques encountered in data science. It addresses the nature of toxicological big data, their storage, analysis and interpretation. It also details how these data can be applied in toxicity prediction, modelling and risk assessment.
33

Characterizing Dimensionality Reduction Algorithm Performance in terms of Data Set Aspects

Sulecki, Nathan 08 May 2017 (has links)
No description available.
34

Predicting Myocardial Infarction using Textual Prehospital Data and Machine Learning

Van der Haas, Yvette Jane January 2021 (has links)
A major healthcare problem is the overcrowding of hospitals and emergency departments, which leads to negative patient outcomes and increased costs. In a previous study, performed by Leiden University Medical Centre, a new and innovative prehospital triage method was developed in which two nurse paramedics could consult a cardiologist for patients with cardiac symptoms via a live connection on a digital triage platform. The developed triage method resulted in a recall of 0.995 and a specificity of 0.0113. This gave rise to the following research question: 'Is enough (good) information gathered on the prehospital scene for a machine learning model to be able to predict myocardial infarction?'. Testing different pre-processing steps, several features (both premade and self-made), multiple models (Support Vector Machine, K Nearest Neighbour, Logistic Regression and Random Forest), and various outcome settings and hyperparameters led to the final results: recall = 0.995 and specificity = 0.1101. These results were obtained with the features selected by a cardiologist and the Support Vector Machine model. The outcomes were checked with an additional explainability layer, Explain Like I'm Five, which illustrates that the resulting machine learning model is trained mostly on the right words and characters.
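As a rough illustration of this kind of setup (not the study's actual features, which were selected by a cardiologist), the sketch below trains a Support Vector Machine on TF-IDF features of free-text triage notes and reports recall and specificity; the notes and labels are invented placeholders.

```python
# Hedged sketch: SVM on TF-IDF features of prehospital free text, evaluated on
# recall and specificity. The toy notes/labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, confusion_matrix

notes = [
    "crushing chest pain radiating to left arm, sweating",
    "pressure on chest during exertion, short of breath",
    "chest tightness with nausea and pale skin",
    "sudden chest pain, ecg shows st elevation",
    "abdominal discomfort after a heavy meal",
    "anxiety with palpitations, normal ecg",
    "muscle pain after lifting boxes",
    "cough and mild fever for three days",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = myocardial infarction

X_train, X_test, y_train, y_test = train_test_split(
    notes, labels, test_size=0.25, stratify=labels, random_state=0)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    SVC(kernel="linear", class_weight="balanced"))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print("recall:", recall_score(y_test, y_pred))
print("specificity:", tn / (tn + fp))
```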
35

Informatic strategies for the discovery and characterization of peptidic natural products

Merwin, Nishanth 06 1900 (has links)
Microbial natural products have played a key role in the development of clinically relevant drugs. Despite significant interest, traditional strategies for their characterization have led to diminishing returns, leaving this field stagnant. Recently developed technologies such as low-cost, high-throughput genome sequencing and high-resolution mass spectrometry allow for a much richer experimental strategy, letting us gather data at an unprecedented scale. Naive efforts in analyzing genomic data have already revealed the wealth of natural products encoded within diverse bacterial phylogenies. Herein, I leverage these technologies through the development of specialized computational platforms, cognizant of existing natural products and their biosynthesis, in order to reinvigorate our drug discovery protocols. First, I present a strategy for the targeted isolation of novel and structurally divergent ribosomally synthesized and post-translationally modified peptides (RiPPs). Specifically, this software platform is able to directly compare genomically encoded RiPPs to previously characterized chemical scaffolds, allowing for the identification of bacterial strains producing these specialized, and previously unstudied, metabolites. Further, using metabolomics data, I have developed a strategy that facilitates direct identification and targeted isolation of these uncharacterized RiPPs. Through this set of tools, we were able to successfully isolate a structurally unique lasso peptide from a previously unexplored Streptomyces isolate. With the technological rise of genomic sequencing, it is now possible to survey polymicrobial environments in remarkable detail. Through the use of metagenomics, we can survey the presence and abundances of bacteria, and metatranscriptomics can further reveal the expression of their biosynthetic pathways. Here, I developed a platform which identifies microbial peptides exclusively found within the human microbiome and further characterizes their putative antimicrobial properties. Through this endeavour, we identified a bacterially encoded peptide that can effectively protect against pathogenic Clostridium difficile infections. With the wealth of publicly available multi-omics datasets, these works in conjunction demonstrate the potential of informatics strategies for the advancement of natural product discovery. / Thesis / Master of Science (MSc) / Biochemistry is the study of the diverse chemistry, and chemical interactions, upon which life is built. Some of these chemicals are not essential for maintaining basic metabolism, but are instead tailored for alternative functions best suited to their environment. Often, these molecules mediate biological warfare, allowing organisms to compete and establish dominance amongst their neighbours. Understanding this, several of these molecules have been exploited in our modern pharmaceutical regimen as effective antibiotics. Due to the ever-rising reality of antibiotic resistance, we are in dire need of novel antibiotics. With this goal, I have developed several software tools that can both identify these molecules encoded within bacterial genomes and predict their effects on neighbouring bacteria. Through these computational tools, I provide an updated strategy for the discovery and characterization of these biologically derived chemicals.
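The idea of comparing genomically encoded peptides against previously characterized scaffolds can be illustrated very loosely as below; this is not the thesis's platform, the sequences are invented, and real pipelines use proper alignment and modification-aware scoring rather than a generic string similarity.

```python
# Loose illustration only: ranking predicted RiPP core peptides by their best
# similarity to previously characterized scaffolds, so the most divergent
# (potentially novel) candidates surface first. Sequences here are made up.
from difflib import SequenceMatcher

known_scaffolds = {
    "lasso_peptide_A": "GGAGHVPEYFVGIGTPISFYG",
    "lanthipeptide_B": "ITSISLCTPGCKTGALMGCNMKTATCHCSIHVSK",
}
predicted_cores = {
    "candidate_1": "GGAGHVPEYFVGIGTPISAYG",
    "candidate_2": "MKKAVIVENKGCATCSIGAACLVDGPIPDFEIAGATGLFGLWG",
}

def best_identity(seq, references):
    """Highest pairwise similarity ratio of seq against any reference scaffold."""
    return max(SequenceMatcher(None, seq, ref).ratio() for ref in references.values())

# Sort candidates so the least scaffold-like (most structurally divergent) come first.
ranked = sorted(predicted_cores.items(),
                key=lambda kv: best_identity(kv[1], known_scaffolds))
for name, seq in ranked:
    print(f"{name}: best identity to known scaffolds = {best_identity(seq, known_scaffolds):.2f}")
```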
36

A Pedagogical Approach to Create and Assess Domain-Specific Data Science Learning Materials in the Biomedical Sciences

Chen, Daniel 01 February 2022 (has links)
This dissertation explores creating a set of domain-specific learning materials for the biomedical sciences to meet the educational gap in biomedical informatics, while also meeting the call for statisticians to advocate for process improvements in other disciplines. Data science educational materials are now plentiful enough to have become a commodity. This provides the opportunity to create domain-specific learning materials that better motivate learning through real-world examples while also capturing the intricacies of working with data in a specific domain. This dissertation shows how persona methodologies can be combined with a backwards design approach to creating domain-specific learning materials. The work is divided into three major steps: (1) create and validate a learner self-assessment survey that can identify learner personas by clustering; (2) combine the information from the persona methodology with a backwards design approach, using formative and summative assessments, to curate, plan, and assess domain-specific data science workshop materials for short-term and long-term efficacy; and (3) pilot and identify how to manage real-time feedback within a data coding teaching session to drive better learner motivation and engagement. The key findings from this dissertation suggest that using a structured framework to plan and curate learning materials is an effective way to identify key concepts in data science. However, just creating and teaching learning materials is not enough for long-term retention of knowledge. More effort toward long-term lesson maintenance and long-term strategies for practice will help retain the concepts learned from live instruction. Finally, it is essential that we are careful and purposeful in our content creation so as not to overwhelm learners, and that we integrate their needs into the materials as a primary focus. Overall, this contributes to the growing need for data science education in the biomedical sciences to train future clinicians to use and work with data and improve patient outcomes. / Doctor of Philosophy / Regardless of the field and domain you are in, we are all inundated with data. The more agency we can give individuals to work with data, the better equipped they will be to bring their own expertise to complex problems and work in multidisciplinary teams. There already exists a plethora of data science learning materials to help learners work with data; however, many are not domain-focused and can be overwhelming to new learners. By integrating domain specificity into data science education, we hypothesize that we can help learners learn and retain knowledge by keeping them more engaged and motivated. This dissertation focuses on the domain of the biomedical sciences and uses best practices to improve data science education and impact the field. Specifically, we explore how to address major gaps in data education in the biomedical field and create a set of domain-specific learning materials (e.g. workshops) for the biomedical sciences. We use best educational practices to curate these learning materials and assess how effective they are.
This assessment was performed in three major steps: (1) identify who the learners are and what they already know in the context of using a programming language to work with data; (2) plan and curate a learning path for the learners, assessing the materials created for short- and long-term effectiveness; and (3) pilot and identify how to manage real-time feedback within a data coding teaching session to drive better learner motivation and engagement. The key findings from this dissertation suggest that using a structured framework to plan and curate learning materials is an effective way to identify key concepts in data science. However, just creating the materials and teaching them is not enough for long-term retention of knowledge. More effort toward long-term lesson maintenance and long-term strategies for practice will help retain the concepts learned from live instruction. Finally, it is essential that we are careful and purposeful in our content creation so as not to overwhelm learners, and that we integrate their needs into the materials as a primary focus. Overall, this contributes to the growing need for data science education in the biomedical sciences to train future clinicians to use and work with data and improve patient outcomes.
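Step (1), identifying learner personas by clustering self-assessment responses, might look roughly like the sketch below; the survey items, rating scale, and number of personas are assumptions rather than the dissertation's actual instrument.

```python
# Minimal sketch of identifying learner personas by clustering self-assessment
# survey responses. The survey items, scale, and number of personas are assumptions.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# rows = learners, columns = Likert-style self-ratings (1-5), e.g.
# [spreadsheets, programming, statistics, data visualization]
responses = np.array([
    [5, 1, 2, 3],
    [4, 1, 1, 2],
    [2, 4, 4, 3],
    [1, 5, 4, 4],
    [3, 3, 3, 3],
    [5, 2, 1, 2],
])

scaled = StandardScaler().fit_transform(responses)
personas = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(personas)   # cluster label per learner, used to tailor workshop content
```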
37

Clustering Web Users by Mouse Movement to Detect Bots and Botnet Attacks

Morgan, Justin L 01 March 2021 (has links) (PDF)
Efficiently and accurately detecting the presence of web bots has proven to be a challenging problem for website administrators. As modern web bots grow more sophisticated, specifically in their ability to closely mimic human behavior, web bot detection schemes quickly become obsolete as they fail to maintain effectiveness. Though machine learning-based detection schemes have been a successful approach in recent implementations, web bots are able to apply similar machine learning tactics to mimic human users, thus bypassing such detection schemes. This work seeks to address the issue of machine learning-based bots bypassing machine learning-based detection schemes by introducing a novel unsupervised learning approach that clusters users based on behavioral biometrics. The idea is that, by differentiating users based on their behavior, for example how they use the mouse or type on the keyboard, information can be provided to website administrators so they can make more informed decisions on declaring whether a user is a human or a bot. This is similar to how modern websites require users to log in before browsing, which likewise gives administrators information for declaring whether a user is a human or a bot. An added benefit of this approach is that it is a human observational proof (HOP), meaning it does not inconvenience the user (user friction) with human interactive proofs (HIPs) such as CAPTCHAs, or with login requirements.
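A minimal sketch of the general idea, not the thesis's pipeline, is shown below: raw (timestamp, x, y) mouse events are summarized into per-session behavioral features and then clustered without labels; the feature set and the choice of k-means are assumptions.

```python
# Minimal sketch: summarize raw (t, x, y) mouse events into per-session features,
# then cluster sessions. Features and the use of k-means are illustrative
# assumptions, not the thesis's actual behavioral-biometric pipeline.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def session_features(events):
    """events: array of shape (n, 3) with columns (timestamp_s, x_px, y_px)."""
    t, xy = events[:, 0], events[:, 1:]
    steps = np.diff(xy, axis=0)
    dists = np.linalg.norm(steps, axis=1)
    dts = np.clip(np.diff(t), 1e-3, None)
    speeds = dists / dts
    jerkiness = np.abs(np.diff(speeds)).mean() if len(speeds) > 1 else 0.0
    return np.array([speeds.mean(), speeds.std(), dists.sum(), jerkiness])

rng = np.random.default_rng(0)
# Toy sessions: noisy, irregular "human-like" traces vs. perfectly regular "bot-like" traces.
human = [np.column_stack([np.cumsum(rng.uniform(0.01, 0.2, 50)),
                          np.cumsum(rng.normal(0, 5, 50)),
                          np.cumsum(rng.normal(0, 5, 50))]) for _ in range(20)]
bot = [np.column_stack([np.arange(50) * 0.02,
                        np.linspace(0, 500, 50),
                        np.linspace(0, 300, 50)]) for _ in range(20)]

X = StandardScaler().fit_transform([session_features(s) for s in human + bot])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # sessions grouped by movement behavior; an admin maps clusters to human/bot
```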
38

Statistical Modelling of Plug-In Hybrid Fuel Consumption : A study using data science methods on test fleet driving data / Statistisk Modellering av Bränsleförbrukning För Laddhybrider : En studie gjord med hjälp av data science metoder baserat på data från en test flotta

Matteusson, Theodor, Persson, Niclas January 2020 (has links)
The automotive industry is undertaking major technological steps in an effort to reduce emissions and fight climate change. To reduce the reliance on fossil fuels, a lot of research is invested in electric motors (EM) and their applications. One such application is plug-in hybrid electric vehicles (PHEV), in which internal combustion engines (ICE) and EM are used in combination and take turns propelling the vehicle based on driving conditions. The main optimization problem of a PHEV is to decide when to use which motor. If this optimization is done with respect to emissions, the entire electric charge should be used up before the end of the trip. But if the charge is used up too early, later driving segments for which the optimal choice would have been the EM will have to be driven using the ICE. To address this optimization problem, we studied fuel consumption under different driving conditions. These driving conditions are characterized by hundreds of sensors which continuously collect data about the state of the vehicle while driving. From these data we constructed 150-second segments, including e.g. vehicle speed, before new descriptive features were engineered for each segment, e.g. maximum vehicle speed. Using the characteristics of typical driving conditions specified by the Worldwide Harmonized Light Vehicles Test Cycle (WLTC), segments were labelled as highway or city road segments. To reduce the dimensionality without losing information, principal component analysis was conducted, and a Gaussian mixture model was used to uncover hidden structures in the data. Three machine learning regression models were trained and tested: a linear mixed model, a kernel ridge regression model with a linear kernel function, and a kernel ridge regression model with an RBF kernel function. By splitting the data into a training set and a test set, the models were evaluated on data they had not been trained on. The performance and explanation rate obtained for each model, measured by R2, Mean Absolute Error and Mean Squared Error, were compared to find the best model. The study shows that fuel consumption can be modelled from the sensor data of a PHEV test fleet, with 6 features contributing to an explanation ratio of 0.5 and thus having the highest impact on fuel consumption. One needs to keep in mind that the data were collected during the Covid-19 outbreak, when travel patterns were not considered normal. No regression model can explain the real world better than the underlying data does.
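A condensed sketch of the modelling pipeline described in this abstract (PCA, a Gaussian mixture model, and RBF kernel ridge regression evaluated on a held-out test set with R2, MAE and MSE) is given below; the synthetic data, component counts, and hyperparameters are placeholders, not the study's values.

```python
# Minimal sketch of the described pipeline: PCA for dimensionality reduction,
# a Gaussian mixture to inspect structure, and RBF kernel ridge regression
# evaluated on held-out data. Data and hyperparameters are synthetic placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))                                        # per-segment features
y = X[:, :6] @ rng.normal(size=6) + rng.normal(scale=0.5, size=500)   # fuel consumption proxy

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pca = PCA(n_components=10).fit(X_train)
Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

gmm = GaussianMixture(n_components=2, random_state=0).fit(Z_train)   # e.g. highway vs. city structure
print("segment cluster shares:", np.bincount(gmm.predict(Z_train)) / len(Z_train))

model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1).fit(Z_train, y_train)
pred = model.predict(Z_test)
print("R2:", r2_score(y_test, pred),
      "MAE:", mean_absolute_error(y_test, pred),
      "MSE:", mean_squared_error(y_test, pred))
```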
39

(Intelligentes) Text Mining in der Marktforschung

Stützer, Cathleen M., Wachenfeld-Schell, Alexandra, Oglesby, Stefan 24 November 2021 (has links)
The extraction of information from texts, especially from unstructured text data such as forums, review portals and open-ended responses, poses a particular challenge for market researchers today. On the one hand, new methodological know-how is needed to handle these complex data sets, both when collecting and when evaluating them. On the other hand, in the context of digital research into new customer insights, technical as well as organisational infrastructures have to be created so that, among other things, business models can be established in the workflows and work processes of companies, institutions and organisations. The contributions in this volume not only discuss a wide variety of methods and techniques for automatic text extraction, but also highlight both the relevance and the challenges for online market research associated with the use of such innovative approaches and techniques. Contents: C. M. Stützer, A. Wachenfeld-Schell & S. Oglesby: Digital Transformation of Market Research; A. Lang & M. Egger, Insius UG: How Market Researchers Can Benefit from Cooperative Natural Language Processing in Qualitative Content Analysis; M. Heurich & S. Štajner, Symanto Research: Through Technology to More Empathy in Customer Communication – How Text Analytics Can Help to Understand the Voice of the Digital Consumer; G. Heisenberg, TH Köln & T. Hees, Questback GmbH: Text Mining Methods for Analysing Open-Ended Responses in Online Surveys in Market and Media Research; T. Reuter, Cogia Intelligence GmbH: Automatic Semantic Analyses for Online Market Research; P. de Buren, Caplena GmbH: Skilfully Analysing Open-Ended Responses
40

Extending Synthetic Data and Data Masking Procedures using Information Theory

Tyler J Lewis 26 April 2023 (has links)
The two primary methodologies discussed in this thesis are the nonparametric entropy-based synthetic timeseries (NEST) and Directed Infusion of Data (DIOD) algorithms. The former presents a novel synthetic data algorithm that is shown to outperform similar state-of-the-art approaches, including generative networks, in terms of utility and data consistency. The majority of the data used are open-source and are cited where appropriate. DIOD presents a novel data masking paradigm that preserves the utility, privacy, and efficiency required by the current industrial paradigm, and offers a cheaper alternative to many state-of-the-art approaches. Data used include simulation data (source code cited), equations-based data, and open-source images (cited as needed).
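The abstract does not specify the NEST algorithm itself, so no attempt is made to reproduce it here; as a loosely related illustration of nonparametric synthetic time-series generation, the sketch below uses a simple moving-block bootstrap, which builds a synthetic series by resampling contiguous blocks of an observed one.

```python
# Generic illustration only: a moving-block bootstrap, one simple nonparametric way
# to generate synthetic time series that preserve short-range structure. This is NOT
# the NEST algorithm from the thesis, which is not specified in the abstract above.
import numpy as np

def block_bootstrap(series, block_len, length, rng=None):
    """Build a synthetic series by concatenating randomly chosen contiguous blocks."""
    rng = rng or np.random.default_rng()
    starts = rng.integers(0, len(series) - block_len + 1, size=length // block_len + 1)
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:length]

rng = np.random.default_rng(0)
original = np.sin(np.linspace(0, 20, 500)) + rng.normal(scale=0.1, size=500)
synthetic = block_bootstrap(original, block_len=25, length=500, rng=rng)
print(original.mean(), synthetic.mean())   # summary statistics should be broadly similar
```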
