
Statistical Learning for Sequential Unstructured Data

Unstructured data, which cannot be organized into predefined structures, such as text, human behavior records, and system logs, is often presented in a sequential format with inherent dependencies. Probabilistic models are commonly used to capture these dependencies in the data generation process through latent parameters, and they extend naturally into hierarchical forms. However, these models rely on correctly specified assumptions about the sequential data generation process, which often limits their ability to learn at scale. The emergence of neural network tools has enabled scalable learning for high-dimensional sequential data. From an algorithmic perspective, effort is directed toward reducing dimensionality and representing unstructured data units as dense vectors in low-dimensional spaces learned from unlabeled data, a practice often referred to as numerical embedding. While these representations offer measures of similarity, automated generalization, and semantic understanding, they frequently lack the statistical foundations required for explicit inference. This dissertation develops statistical inference techniques tailored to the analysis of unstructured sequential data, with applications in transportation safety.

The first part of the dissertation presents a two-stage method. It adopts numerical embedding to map large-scale unannotated data into numerical vectors; a kernel test based on maximum mean discrepancy is then employed to detect abnormal segments within a given time period. Theoretical results show that learning from the numerical vectors is equivalent to learning directly from the raw data. A real-world example illustrates how mismatched driver visual behavior occurred during a lane change.

The second part of the dissertation introduces a two-sample test for comparing text generation similarity. The hypothesis tested is whether the probabilistic mapping measures that generate the textual data are identical for two groups of documents. The proposed test compares the likelihoods of text documents estimated through neural network-based language models under an autoregressive setup. The test statistic is derived from an estimation-and-inference framework that first approximates the data likelihood on an estimation set and then performs inference on the remaining data. The theoretical results show that the test statistic is asymptotically normal under mild conditions. In addition, a multiple data-splitting strategy combines p-values into a unified decision to enhance the test's power.

The third part of the dissertation develops a method to measure differences in text generation between a benchmark dataset and a comparison dataset, focusing on word-level generation variation. The method uses the sliced-Wasserstein distance to compute a contextual discrepancy score, and a resampling procedure establishes a threshold for screening the scores. Crash report narratives are analyzed to compare crashes involving vehicles equipped with level 2 advanced driver assistance systems with those involving human drivers.
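As an illustration of the first part, the following is a minimal sketch of a kernel two-sample test based on maximum mean discrepancy, applied to numerical embeddings. The Gaussian kernel, the fixed bandwidth, and the permutation calibration are assumptions of this sketch, not necessarily the dissertation's exact procedure.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dists = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of the squared maximum mean discrepancy between two samples."""
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = (gaussian_kernel(A, B, bandwidth) for A, B in [(X, X), (Y, Y), (X, Y)])
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2.0 * Kxy.mean())

def mmd_permutation_test(X, Y, bandwidth=1.0, n_perm=500, seed=0):
    """Permutation p-value for H0: the two embedding samples share one distribution."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, bandwidth)
    pooled, m = np.vstack([X, Y]), len(X)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        exceed += mmd2_unbiased(pooled[idx[:m]], pooled[idx[m:]], bandwidth) >= observed
    return observed, (1 + exceed) / (1 + n_perm)
```

A candidate segment's embedding vectors (X) can be compared against embeddings from a reference period (Y); a small p-value flags the segment as abnormal.

For the second part, the sketch below follows the estimation-and-inference idea with repeated data splits in simplified form. `fit_language_model` is a hypothetical placeholder standing in for the neural autoregressive language models, and the Welch t-statistic and Cauchy combination rule are illustrative choices rather than the dissertation's stated ones.

```python
import numpy as np
from scipy import stats

def one_split_pvalue(docs_a, docs_b, fit_language_model, rng):
    """One data split: fit on an estimation half, test on the inference half."""
    idx_a, idx_b = rng.permutation(len(docs_a)), rng.permutation(len(docs_b))
    a_est, a_inf = np.array_split(idx_a, 2)
    b_est, b_inf = np.array_split(idx_b, 2)

    # Estimation stage: fit a reference language model on held-out documents.
    model = fit_language_model([docs_a[i] for i in a_est] + [docs_b[i] for i in b_est])

    # Inference stage: score the remaining documents of each group.
    ll_a = np.array([model.log_likelihood(docs_a[i]) for i in a_inf])
    ll_b = np.array([model.log_likelihood(docs_b[i]) for i in b_inf])

    # Under H0 (identical generating measures) the two score samples share a
    # distribution, so a standardized mean difference is asymptotically normal.
    return stats.ttest_ind(ll_a, ll_b, equal_var=False).pvalue

def cauchy_combination(pvals):
    """Combine p-values from repeated splits into one decision (Cauchy combination rule)."""
    t = np.mean(np.tan((0.5 - np.asarray(pvals)) * np.pi))
    return 0.5 - np.arctan(t) / np.pi
```

For the third part, a sketch of a sliced-Wasserstein discrepancy score with a resampling threshold is given below. It assumes equal-sized sets of word-level contextual embeddings and a simple split-half resampling scheme within the benchmark set; both are assumptions of this sketch.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, seed=0):
    """Monte Carlo sliced 1-Wasserstein distance between equal-sized embedding sets."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_projections, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project onto random directions and compare the sorted one-dimensional samples.
    px = np.sort(X @ dirs.T, axis=0)
    py = np.sort(Y @ dirs.T, axis=0)
    return np.mean(np.abs(px - py))

def resampling_threshold(benchmark, score_fn, n_resamples=200, alpha=0.05, seed=0):
    """Null threshold: repeatedly split the benchmark embeddings in half and score the halves."""
    rng = np.random.default_rng(seed)
    n, half = len(benchmark), len(benchmark) // 2
    null_scores = []
    for _ in range(n_resamples):
        perm = rng.permutation(n)
        null_scores.append(score_fn(benchmark[perm[:half]], benchmark[perm[half:2 * half]]))
    return np.quantile(null_scores, 1 - alpha)
```

A word-level difference would be flagged only when the benchmark-versus-comparison score exceeds the resampling threshold.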
Doctor of Philosophy

Unstructured data, such as texts, human behavior records, and system logs, cannot be neatly organized. This type of data often appears in sequences with natural connections. Traditional methods use models to understand these connections, but those models depend on specific assumptions, which can limit their effectiveness. New tools built on neural networks have made it easier to work with large and complex data. These tools simplify the data by turning it into smaller, manageable pieces, a process known as numerical embedding. While this helps in understanding the data, it still requires a statistical foundation for the subsequent inferential analysis. This dissertation aims to develop statistical inference techniques for analyzing unstructured sequential data, focusing on transportation safety. The first part of the dissertation introduces a two-step method. First, it transforms large-scale unorganized data into numerical vectors. Then it uses a statistical test to detect unusual patterns over a period of time. For example, it can identify when a driver's visual behavior is not properly aligned with the attention demanded by driving during a lane change. The second part of the dissertation presents a method to compare the similarity of text generation. It tests whether the way texts are generated is the same for two groups of documents, using neural network-based models to estimate the likelihood of text documents. Theoretical results show that as more data are observed, the distribution of the test statistic gets closer to the desired distribution under certain conditions. Additionally, combining multiple data splits improves the test's power. The third part of the dissertation constructs a score to measure differences in text generation processes, focusing on word-level differences. This score is based on a specific distance measure. To check that a difference is not a false discovery, a screening threshold is established using a resampling technique; if the score exceeds the threshold, the difference is considered significant. An application of this method compares crash reports from vehicles with advanced driver assistance systems to those from human-driven vehicles.

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/120787
Date: 30 July 2024
Creators: Xu, Jingbin
Contributors: Statistics, Guo, Feng, Zhu, Hongxiao, Liu, Meimei, Kim, Inyoung
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Language: English
Detected Language: English
Type: Dissertation
Format: ETD, application/pdf, application/pdf
Rights: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International, http://creativecommons.org/licenses/by-nc-sa/4.0/
