Global ETD Search

711	Automatic Question Answering and Knowledge Discovery from Electronic Health Records Wang, Ping 25 August 2021 (has links) Electronic Health Records (EHR) data contain comprehensive longitudinal patient information, which is usually stored in databases in the form of either multi-relational structured tables or unstructured texts, e.g., clinical notes. EHR data provide a useful resource to assist doctors' decision making; however, they also present many unique challenges that limit the efficient use of the valuable information, such as large data volume, heterogeneous and dynamic information, medical term abbreviations, and noise caused by misspelled words. This dissertation focuses on the development and evaluation of advanced machine learning algorithms to address the following research questions: (1) how to seek answers from EHR for clinical activity related questions posed in human language without the assistance of database and natural language processing (NLP) domain experts, (2) how to discover underlying relationships of different events and entities in structured tabular EHRs, and (3) how to predict when a medical event will occur and estimate its probability based on previous medical information of patients. First, to automatically retrieve answers for natural language questions from the structured tables in EHR, we study the question-to-SQL generation task by generating the corresponding SQL query of the input question. We propose a translation-edit model driven by a language generation module and an editing module for the SQL query generation task. This model helps automatically translate clinical activity related questions to SQL queries, so that the doctors only need to provide their questions in natural language to get the answers they need. We also create a large-scale dataset for question answering on tabular EHR to simulate a more realistic setting. 
Our performance evaluation shows that the proposed model is effective in handling the unique challenges about clinical terminologies, such as abbreviations and misspelled words. Second, to automatically identify answers for natural language questions from unstructured clinical notes in EHR, we propose to achieve this goal by querying a knowledge base constructed based on fine-grained document-level expert annotations of clinical records for various NLP tasks. We first create a dataset for clinical knowledge base question answering with two sets: clinical knowledge base and question-answer pairs. An attention-based aspect-level reasoning model is developed and evaluated on the new dataset. Our experimental analysis shows that it is effective in identifying answers and also allows us to analyze the impact of different answer aspects in predicting correct answers. Third, we focus on discovering underlying relationships of different entities (e.g., patient, disease, medication, and treatment) in tabular EHR, which can be formulated as a link prediction problem in graph domain. We develop a self-supervised learning framework for better representation learning of entities across a large corpus and also consider local contextual information for the down-stream link prediction task. We demonstrate the effectiveness, interpretability, and scalability of the proposed model on the healthcare network built from tabular EHR. It is also successfully applied to solve link prediction problems in a variety of domains, such as e-commerce, social networks, and academic networks. Finally, to dynamically predict the occurrence of multiple correlated medical events, we formulate the problem as a temporal (multiple time-points) and multi-task learning problem using tensor representation. We propose an algorithm to jointly and dynamically predict several survival problems at each time point and optimize it with the Alternating Direction Methods of Multipliers (ADMM) algorithm. 
The model allows us to consider both the dependencies between different tasks and the correlations of each task at different time points. We evaluate the proposed model on two real-world applications and demonstrate its effectiveness and interpretability. / Doctor of Philosophy / Healthcare is an important part of our lives. Due to the recent advances of data collection and storing techniques, a large amount of medical information is generated and stored in Electronic Health Records (EHR). By comprehensively documenting the longitudinal medical history information about a large patient cohort, this EHR data forms a fundamental resource in assisting doctors' decision making including optimization of treatments for patients and selection of patients for clinical trials. However, EHR data also presents a number of unique challenges, such as (i) large-scale and dynamic data, (ii) heterogeneity of medical information, and (iii) medical term abbreviation. It is difficult for doctors to effectively utilize such complex data collected in a typical clinical practice. Therefore, it is imperative to develop advanced methods that are helpful for efficient use of EHR and further benefit doctors in their clinical decision making. This dissertation focuses on automatically retrieving useful medical information, analyzing complex relationships of medical entities, and detecting future medical outcomes from EHR data. In order to retrieve information from EHR efficiently, we develop deep learning based algorithms that can automatically answer various clinical questions on structured and unstructured EHR data. These algorithms can help us understand more about the challenges in retrieving information from different data types in EHR. We also build a clinical knowledge graph based on EHR and link the distributed medical information and further perform the link prediction task, which allows us to analyze the complex underlying relationships of various medical entities. 
In addition, we propose a temporal multi-task survival analysis method to dynamically predict multiple medical events at the same time and identify the most important factors leading to the future medical events. By handling these unique challenges in EHR and developing suitable approaches, we hope to improve the efficiency of information retrieval and predictive modeling in healthcare. Electronic Health Records Question Answering Knowledge Discovery Knowledge Graph Survival Analysis
712	Hit to Lead Stage Optimization of Orally Efficacious β-Carboline Antimalarials Mathew, Jopaul 24 January 2023 (has links) Malaria, a disease caused by the parasite Plasmodium, continues to be one of the deadliest diseases worldwide. The WHO reported over 627,000 deaths in 2020, and over 1 billion people are at risk of infection. Even though Artemisinin-based Combination Therapies (ACT) are the current standard of care for malaria, the emergence of drug resistance generates a constant need to develop and synthesize new drugs. Tetrahydro-β-carboline acid (THβC) 1-(2,4-dichlorophenyl)-2,3,4,9-tetrahydro-1H-pyrido[3,4-b]indol-2-ium-3-carboxylate (MMV008138) has promising antimalarial properties; it was discovered by screening the Malaria Box with the so-called IPP Rescue assay. This assay identified MMV008138 as an inhibitor of the MEP pathway, which produces essential isoprenoid precursors (IPP and DMAPP) in the malaria parasite P. falciparum (EC50 250 ± 70 nM, IPP rescue 100% @ 2.5 μM). Subsequent investigation revealed that (1R,3S)-configuration and 2',4'-dihalogen substitution were critical for the activity of this compound, and that substitution of the non-aromatic ring was not tolerated. To search for new antimalarial structures, our collaborator Dr. Max Totrov constructed a generalized 3D pharmacophore based on MMV008138 and 92 of its analogs and used it for a virtual ligand screen (VLS) of the 13K compound hit set from which MMV008138 had been selected. This exercise identified TCMDC-140230, a THβC, 1-(3,4-dichlorophenyl)-8-methyl-N-(2-(methylamino)ethyl)-2,3,4,9-tetrahydro-1H-pyrido[3,4-b]indole-3-carboxamide (undefined stereochemistry), reported to have nearly the same potency as MMV008138. Synthesis of the stereoisomers of compound TCMDC-140230 was accomplished via Pictet-Spengler reaction of (S)- and (R)-7-methyl tryptophan methyl ester and 3,4-dichlorobenzaldehyde. 
The individual stereoisomeric esters were converted to the corresponding amides, but none of the stereoisomers of TCMDC-140230 were potent antimalarials (IC50 = 1,300 – 3,700 nM). However, a significant amount of oxidized byproduct 1-(3,4-dichlorophenyl)-8-methyl-N-(2-(methylamino)ethyl)-9H-pyrido[3,4-b]indole-3-carboxamide (MMV1803522) was observed in the synthesis of (1S,3S)- and (1R,3R)-TCMDC-140230. This achiral β-carboline amide (PRC1584, IC50 = 108 ± 7 nM) proved more potent towards P. falciparum than MMV008138 and its toxicity was not reversed by co-application of IPP. Thus, the antimalarial target of MMV1803522 is distinct from that of MMV008138. Most importantly, MMV1803522 at 40 mg/kg/day (oral) cured P. berghei malaria infection in mice. The lead compound also was found to have a good safety profile. Medicines for Malaria Venture (MMV) has expressed interest in this compound, which is now also known as MMV1803522. The results from these biological assays gave the insight to develop new analogs that have better asexual blood stage inhibition potency. Extensive structure-activity relationship studies were conducted by synthesizing analogs of the compound MMV1803522. The studies were mainly focused on analyzing the effect of aliphatic substitutions, how well the potency can be improved with different D-ring substitutions, and amide substitutions. In addition to this structural optimization, several metabolism studies were also conducted on this new lead compound. The potency study results of C1 alkyl-substituted analogs of MMV1803522 showed that aromatic substitutions are required at C1 for maintaining good inhibition potency. The heteroaryl substituents at C1 were found to be slightly less potent than the lead compound MMV1803522. Synthesis of analogs lacking the C8 methyl group of the lead compound showed that an EC50 < 100 nM is possible with hydrogen at C8. Most noteworthy is 3,4,5-trichlorophenyl-bearing compound 3.20a, which had an EC50 of 54 ± 8 nM. 
This compound is twice as potent as MMV1803522. Equipotent analogs to MMV1803522 were also synthesized with different amide substituents. The metabolism studies showed low solubility for compounds having an EC50 less than or close to 100 nM. Unfortunately, the intrinsic clearance rate of several selected compounds was found to be higher than that of MMV1803522. These results left us with scope for the development of new analog compounds. The emerging structure-activity relationship within this scaffold is outlined, and work on the remaining challenge of improving potency below 100 nM without compromising moderate solubility and good metabolic stability is in progress. / Doctor of Philosophy / Malaria is a global health problem that causes significant sickness and death annually in the developing world. The emergence of resistant parasite strains of malaria massively challenges efforts to eliminate this threat. To control the spread of malaria, there is a continuous need for the development of new antimalarial drugs that ideally offer a single-dose cure and a new mechanism of action. One promising target is the methylerythritol phosphate (MEP) pathway, which produces IPP and DMAPP, important isoprenoid precursors required in living beings. A compound, MMV008138, was identified from a collection of compounds that exhibited antimalarial activity, the so-called "Malaria Box", and this compound was further analyzed in several biological assays. Unfortunately, MMV008138 was unsuccessful since it was found toxic in mice when ingested orally. The efforts to develop structurally similar analogs of MMV008138 resulted in the accidental discovery of a compound that inhibits the parasites' growth much better than the former compound. This compound has a similar molecular structure to MMV008138, and Medicines for Malaria Venture (MMV) has designated it as MMV1803522. 
The newly obtained compound and its analogs were investigated and found to have promising potency to inhibit the growth of the malarial parasite Plasmodium falciparum. Multiple biological assays showed that even though MMV1803522 is toxic to malarial parasites, it does not show toxicity to other cells. The studies in mice showed that it was not toxic orally. Also, it was found to be non-toxic towards several mammalian cell lines. The development of structurally similar analogs can help to improve the potency of the compound, yield a compound with better oral bioavailability, and improve oral efficacy. Analyzing these results will help to determine the mechanism of action of the compound. Drug discovery Plasmodium synthesis DMPK in vivo SAR metabolism in vitro β-carboline carboxamide
713	Distinguishing Dynamical Kinds: An Approach for Automating Scientific Discovery Shea-Blymyer, Colin 02 July 2019 (has links) The automation of scientific discovery has been an active research topic for many years. The promise of a formalized approach to developing and testing scientific hypotheses has attracted researchers from the sciences, machine learning, and philosophy alike. Leveraging the concept of dynamical symmetries, a new paradigm is proposed for the collection of scientific knowledge, and algorithms are presented for the development of EUGENE – an automated scientific discovery tool-set. These algorithms have direct applications in model validation, time series analysis, and system identification. Further, the EUGENE tool-set provides a novel metric of dynamical similarity that would allow a system to be clustered into its dynamical regimes. This dynamical distance is sensitive to the presence of chaos, effective order, and nonlinearity. I discuss the history and background of these algorithms, provide examples of their behavior, and present their use for exploring system dynamics. / Master of Science / Determining why a system exhibits a particular behavior can be a difficult task. Some turn to causal analysis to show which variables lead to which outcomes, but this can be time-consuming, requires precise knowledge of the system’s internals, and often abstracts poorly to salient behaviors. Others attempt to build models from the principles of the system, or try to learn models from observations of the system, but these models can miss important interactions between variables, and often have difficulty recreating high-level behaviors. To help scientists understand systems better, an algorithm has been developed that estimates how similar the causes of one system’s behaviors are to the causes of another. 
This similarity between two systems is called their “dynamical distance” from each other, and can be used to validate models, detect anomalies in a system, and explore how complex systems work. Data Analysis Dynamical Kinds Nonlinear Systems Chaos Automated Scientific Discovery Order Identification
714	A Landfill Reclamation Project: an Observatory that Observes the Self Knotts, Amy Margaret 19 January 2006 (has links) "Transparency- the ability to see into and understand the inner workings of a landscape- is an absolutely essential ingredient to sustainability" -Robert Thayer from "Gray World, Green Heart" Current landfilling practices that bury waste and debris below layers of earth and synthetic caps do not take into account the potential for reclamation of the site after the landfill debris has become stable. As development and consumerism increase, the need for land reclamation grows stronger, as the earth will succumb to an overabundance of human excess. Can a space be created that not only reclaims land, but also exposes what is hidden- in order to educate the public on the importance of recycling and sustainability? Is it possible to design a space that addresses the issues and culture of the past, present and future, particular to a geographic site? Can landscape architects use landscape as an educational medium for self-discovery? / Master of Landscape Architecture Landfill Reclamation Methane Production Self-discovery in the Landscape Waste Management Phytoremediation Fauquier County
715	Topic Model-based Mass Spectrometric Data Analysis in Cancer Biomarker Discovery Studies Wang, Minkun 14 June 2017 (has links) Identification of disease-related alterations in molecular and cellular mechanisms may reveal useful biomarkers for human diseases including cancers. High-throughput omic technologies for identifying and quantifying multi-level biological molecules (e.g., proteins, glycans, and metabolites) have facilitated the advances in biological research in recent years. Liquid (or gas) chromatography coupled with mass spectrometry (LC/GC-MS) has become an essential tool in such large-scale omic studies. Appropriate LC/GC-MS data preprocessing pipelines are needed to detect true differences between biological groups. Challenges exist in several aspects of MS data analysis. Specifically for biomarker discovery, one fundamental challenge in quantitation of biomolecules arises from the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based omic studies. Purification of mass spectrometric data is highly desired prior to subsequent differential analysis. In this research dissertation, we mainly aim to address the purification problem through probabilistic modeling. We propose an intensity-level purification model (IPM) to computationally purify LC/GC-MS based cancerous data in biomarker discovery studies. We further extend IPM to a scan-level purification model (SPM) by considering information from the extracted ion chromatogram (EIC, scan-level feature). Both IPM and SPM belong to the category of topic modeling approaches, which aim to identify the underlying "topics" (sources) and their mixture proportions in composing the heterogeneous data. Additionally, a denoise deconvolution model (DDM) is proposed to capture the noise signals in samples based on purified profiles. 
Variational expectation-maximization (VEM) and Markov chain Monte Carlo (MCMC) methods are used to draw inference on the latent variables and estimate the model parameters. Before we come to purification, other research topics related to mass spectrometric data analysis for cancer biomarker discovery are also investigated in this dissertation. Chapter 3 discusses the methods developed for the differential analysis of LC/GC-MS based omic data, specifically for the preprocessing of data from LC-MS profiled glycans. Chapter 4 presents the assumptions and inference details of IPM, SPM, and DDM. A latent Dirichlet allocation (LDA) core is used to model the heterogeneous cancerous data as mixtures of topics consisting of sample-specific pure cancerous source and non-cancerous contaminants. We evaluated the capability of the proposed models in capturing mixture proportions of contaminants and cancer profiles on LC-MS based serum and tissue proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis. Chapter 5 elaborates on these applications in cancer biomarker discovery, where typical single omic and integrative analysis of multi-omic studies are included. / Ph. D. topic model computational purification Bayesian inference biomarker discovery
716	Numerical Methods in Deep Learning and Computer Vision Song, Yue 23 April 2024 (has links) Numerical methods, the collective name for numerical analysis and optimization techniques, have been widely used in the field of computer vision and deep learning. In this thesis, we investigate the algorithms of some numerical methods and their relevant applications in deep learning. These studied numerical techniques mainly include differentiable matrix power functions, differentiable eigendecomposition (ED), feasible orthogonal matrix constraints in optimization and latent semantics discovery, and physics-informed techniques for solving partial differential equations in disentangled and equivariant representation learning. We first propose two numerical solvers for the faster computation of matrix square root and its inverse. The proposed algorithms are demonstrated to have considerable speedup in practical computer vision tasks. Then we turn to resolve the main issues when integrating differentiable ED into deep learning -- backpropagation instability, slow decomposition for batched matrices, and ill-conditioned input throughout the training. Some approximation techniques are first leveraged to closely approximate the backward gradients while avoiding gradient explosion, which resolves the issue of backpropagation instability. To improve the computational efficiency of ED, we propose an efficient ED solver dedicated to small and medium batched matrices that are frequently encountered as input in deep learning. Some orthogonality techniques are also proposed to improve input conditioning. All of these techniques combine to mitigate the difficulty of applying differentiable ED in deep learning. In the last part of the thesis, we rethink some key concepts in disentangled representation learning. 
We first investigate the relation between disentanglement and orthogonality -- the generative models are enforced with different proposed orthogonality to show that the disentanglement performance is indeed improved. We also challenge the linear assumption of the latent traversal paths and propose to model the traversal process as dynamic spatiotemporal flows on the potential landscapes. Finally, we build probabilistic generative models of sequences that allow for novel understandings of equivariance and disentanglement. We expect our investigation could pave the way for more in-depth and impactful research at the intersection of numerical methods and deep learning. Settore INF/01 - Informatica
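For readers unfamiliar with the matrix square root problem named in the abstract above, a minimal sketch of one classical baseline follows: the coupled Newton-Schulz iteration, which approximates both the square root and the inverse square root using only matrix multiplications. This is a standard textbook scheme and is not claimed to be either of the two solvers proposed in the thesis. The function names `matmul` and `newton_schulz_sqrt`, the iteration count, and the assumption of a symmetric positive-definite input are illustrative choices made here.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def newton_schulz_sqrt(a, iters=30):
    """Approximate sqrt(A) and A^(-1/2) for a symmetric positive-definite A.

    Assumption: A is SPD, so dividing by the Frobenius norm places its
    eigenvalues in (0, 1], where the iteration converges.
    """
    n = len(a)
    norm = sum(x * x for row in a for x in row) ** 0.5
    y = [[x / norm for x in row] for row in a]                     # Y_0 = A / ||A||_F
    z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # Z_0 = I
    for _ in range(iters):
        zy = matmul(z, y)
        # T = (3I - Z Y) / 2
        t = [[((3.0 if i == j else 0.0) - zy[i][j]) / 2.0 for j in range(n)]
             for i in range(n)]
        y = matmul(y, t)   # Y converges to sqrt(A / norm)
        z = matmul(t, z)   # Z converges to the inverse of that root
    scale = norm ** 0.5    # undo the normalization
    root = [[scale * x for x in row] for row in y]
    inv_root = [[x / scale for x in row] for row in z]
    return root, inv_root
```

Because the loop contains no eigendecomposition, each step is a plain matrix product, which is one reason iterations of this kind are attractive inside automatic-differentiation frameworks.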
717	Assessment of Penalized Regression for Genome-wide Association Studies Yi, Hui 27 August 2014 (has links) The data from genome-wide association studies (GWAS) in humans are still predominantly analyzed using single marker association methods. As an alternative to Single Marker Analysis (SMA), all or subsets of markers can be tested simultaneously. This approach requires a form of Penalized Regression (PR) as the number of SNPs is much larger than the sample size. Here we review PR methods in the context of GWAS, extend them to perform penalty parameter and SNP selection by False Discovery Rate (FDR) control, and assess their performance (including penalties incorporating linkage disequilibrium) in comparison with SMA. PR methods were compared with SMA on realistically simulated GWAS data consisting of genotype data from single and multiple chromosomes and a continuous phenotype and on real data. Based on our comparisons, our analytic FDR criterion may currently be the best approach to SNP selection using PR for GWAS. We found that PR with FDR control provides substantially more power than SMA with genome-wide type-I error control but somewhat less power than SMA with Benjamini-Hochberg FDR control. PR controlled the FDR conservatively while SMA-BH may not achieve FDR control in all situations. Differences among PR methods seem quite small when the focus is on variable selection with FDR control. Incorporating LD into PR by adapting penalties developed for covariates measured on graphs can improve power but also generate more false positives or wider regions for follow-up. We recommend using the Elastic Net with a mixing weight for the Lasso penalty near 0.5 as the best method. / Ph. D. Genome-wide Association Study penalized regression false discovery rate linkage disequilibrium
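The Benjamini-Hochberg FDR control that the abstract above uses as a comparison point (SMA-BH) can be illustrated with a short sketch of the standard step-up procedure. It is a generic illustration, not the analytic FDR criterion developed in the dissertation. The function name `benjamini_hochberg`, the default `alpha`, and the example p-values are assumptions made here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up procedure.

    Sort the m p-values, find the largest rank k with p_(k) <= alpha * k / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_max = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k_max])


# Hypothetical single-marker p-values for ten SNPs.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example))  # [0, 1]
```

The step-up rule rejects every hypothesis up to the largest passing rank, even if a smaller p-value missed its own threshold, which is what distinguishes it from a simple per-test cutoff.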
718	Computational modeling-based discovery of novel classes of anti-inflammatory drugs that target lanthionine synthetase C-like protein 2 Lu, Pinyi 15 December 2015 (has links) Lanthionine synthetase C-like protein 2 (LANCL2) is a member of the LANCL protein family, which is broadly expressed throughout the body. LANCL2 is the molecular target of abscisic acid (ABA), a compound with insulin-sensitizing and immune modulatory actions. LANCL2 is required for membrane binding and signaling of ABA in immune cells. Direct binding of ABA to LANCL2 was predicted in silico using molecular modeling approaches and validated experimentally using ligand-binding assays and kinetic surface plasmon resonance studies. The therapeutic potential of the LANCL2 pathway ranges from increasing cellular sensitivity to anticancer drugs and insulin-sensitizing effects to modulating immune and inflammatory responses in the context of immune-mediated and infectious diseases. A case for LANCL2-based drug discovery and development is also illustrated by the anti-inflammatory activity of novel LANCL2 ligands such as NSC61610 against inflammatory bowel disease in mice. This dissertation discusses the value of LANCL2 as a novel therapeutic target for the discovery and development of new classes of orally active drugs against chronic metabolic, immune-mediated and infectious diseases and as a validated target that can be used in precision medicine. Specifically, in Chapter 2 of the dissertation, we performed homology modeling to construct a three-dimensional structure of LANCL2 using the crystal structure of LANCL1 as a template. Our molecular docking studies predicted that ABA and other PPAR γ agonists share a binding site on the surface of LANCL2. In Chapter 3 of the dissertation, structure-based virtual screening was performed. Several potential ligands were identified using molecular docking. 
In order to validate the anti-inflammatory efficacy of the top ranked compound (NSC61610) in the NCI Diversity Set II, a series of in vitro and pre-clinical efficacy studies were performed using a mouse model of dextran sodium sulfate (DSS)-induced colitis. In Chapter 4 of the dissertation, we developed a novel integrated approach for creating a synthetic patient population and testing the efficacy of the novel pre-clinical stage LANCL2 therapeutic for Crohn's disease in large clinical cohorts in silico. Efficacy of treatments on Crohn's disease was evaluated by analyzing predicted changes of Crohn's disease activity index (CDAI) scores, and correlations with immunological variables were evaluated. The results from our placebo-controlled, randomized, Phase III in silico clinical trial at 6 weeks following the treatment show a positive correlation between the initial disease activity score and the drop in CDAI score. This observation highlights the need for precision medicine strategies for IBD. / Ph. D. LANCL2 Anti-inflammatory Drug Discovery Molecular Modeling Virtual Screening In Silico Clinical Trial
719	Bayesian Alignment Model for Analysis of LC-MS-based Omic Data Tsai, Tsung-Heng 22 May 2014 (has links) Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in various omic studies for biomarker discovery. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time alignment is one of the most important yet challenging preprocessing steps, in order to ensure that ion intensity measurements among multiple LC-MS runs are comparable. In this dissertation, we propose a Bayesian alignment model (BAM) for analysis of LC-MS data. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and provides estimates of the retention time variability along with uncertainty measures, enabling a natural framework to integrate information of various sources. From methodology development to practical application, we investigate the alignment problem through three research topics: 1) development of single-profile Bayesian alignment model, 2) development of multi-profile Bayesian alignment model, and 3) application to biomarker discovery research. Chapter 2 introduces the profile-based Bayesian alignment using a single chromatogram, e.g., base peak chromatogram from each LC-MS run. The single-profile alignment model improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler using a block Metropolis-Hastings algorithm, and 2) an adaptive mechanism for knot specification using stochastic search variable selection (SSVS). Chapter 3 extends the model to integrate complementary information that better captures the variability in chromatographic separation. We use Gaussian process regression on the internal standards to derive a prior distribution for the mapping functions. In addition, a clustering approach is proposed to identify multiple representative chromatograms for each LC-MS run. 
With the Gaussian process prior, these chromatograms are simultaneously considered in the profile-based alignment, which greatly improves the model estimation and facilitates the subsequent peak matching process. Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to biomarker discovery research. We integrate the proposed Bayesian alignment model into a rigorous preprocessing pipeline for LC-MS data analysis. Through the developed analysis pipeline, candidate biomarkers for hepatocellular carcinoma (HCC) are identified and confirmed on a complementary platform. / Ph. D. alignment Bayesian inference biomarker discovery Markov chain Monte Carlo (MCMC)
720	Augmenting Dynamic Query Expansion in Microblog Texts Khandpur, Rupinder P. 17 August 2018 (has links) Dynamic query expansion is a method of automatically identifying terms relevant to a target domain based on an incomplete query input. With the explosive growth of online media, such tools are essential for efficient search result refining to track emerging themes in noisy, unstructured text streams. They are crucial for large-scale predictive analytics and decision-making systems, which use open source indicators to find meaningful information rapidly and accurately. The problems of information overload and semantic mismatch are systemic during the Information Retrieval (IR) tasks undertaken by such systems. In this dissertation, we develop dynamic query expansion algorithms that can help improve the efficacy of such systems using only a small set of seed queries and that require no training or labeled samples. We primarily investigate four significant problems related to the retrieval and assessment of event-related information, viz. (1) How can we adapt the query expansion process to support rank-based analysis when tracking a fixed set of entities? A scalable framework is essential to allow relative assessment of emerging themes such as airport threats. (2) What visual knowledge discovery framework can incorporate users' feedback back into the search result refinement process? This is a crucial step to efficiently integrate real-time 'situational awareness' when monitoring specific themes using open source indicators. (3) How can we contextualize query expansions? We focus on capturing semantic relatedness between a query and reference text so that it can quickly adapt to different target domains. (4) How can we synchronously perform knowledge discovery and characterization (unstructured to structured) during the retrieval process? We mainly aim to model high-order, relational aspects of event-related information from microblog texts. / Ph. D. 
/ Analysis of real-time social media can provide critical insights into ongoing societal events, whose consequences and implications include monetary losses, threats to critical infrastructure and national security, disruptions to daily life, and a potential to cause loss of life and physical property. To develop adequate data-driven information systems, it is imperative to develop good ‘ground truth’, i.e., an authoritative record of events reported in the media cataloged alongside important dimensions. Availability of high-quality ground truth events can support various analytic efforts, e.g., identifying precursors of attacks, developing predictive indicators using surrogate data sources, and tracking the progression of events over space and time. Dynamic search result refinement is useful for expanding a general set of user queries into a more relevant collection. The challenges of information overload and misalignment of context between the user query and retrieved results can overwhelm both human and machine. In this dissertation, we focus our efforts on these specific challenges. With the ever-increasing volume of user-generated data, large-scale analysis is a tedious task. Our first focus is to develop a scalable model that dynamically tracks and ranks evolving topics as they appear in social media. Then, to simplify the cognitive tasks involving sense-making of evolving themes, we take a visual approach to retrieve situationally critical and emergent information effectively. This visual analytics approach learns from user’s interactions during the exploratory process and then generates a better representation of the data, thus improving the situational understanding and usability of underlying data models. Such features are crucial for big-data based decision support systems. 
To make the event-focused retrieval process more robust, we developed a context-rich procedure that adds new relevant key terms to the user’s original query by utilizing the linguistic structures in text. This context-awareness allows the algorithm to retrieve those relevant characteristics that can help users to gain adequate information from social media about real-world events. Online social commentary about events is very informal and can be incomplete. To get the complete picture and adequately describe these events, we develop an approach that models the underlying relatedness of information and iteratively extracts meaning and denotations from event-related texts. We learn how to express the high-order relationships between events and entities and group them to identify those attributes that best explain the events the user is trying to uncover. In all the augmentations we develop, our strategy is to allow only very minimal human supervision, using just a small set of seed event triggers and requiring no training or labeled samples. We show a comprehensive evaluation of these augmentations on real-world domains: threats on airports, cyber attacks, and protests. We also demonstrate their applicability to real-time analysis that provides vital event characteristics and contextually consistent information, which can be a beneficial aid for emergency responders. Dynamic Query Expansion Microblog Event Retrieval Social Media Analytics Visual Knowledge Discovery
