About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations (NDLTD). Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
231

Mining Security Risks from Massive Datasets

Liu, Fang 09 August 2017 (has links)
Cyber security risk has been a problem ever since the advent of telecommunication and electronic computers. Over the past 30 years, researchers have developed various tools to protect the confidentiality, integrity, and availability of data and programs. However, new challenges are emerging as the amount of data grows rapidly in the big data era. On one hand, attacks are becoming stealthier by concealing their behaviors in massive datasets. On the other hand, it is becoming more and more difficult for existing tools to handle massive datasets with various data types. This thesis addresses these challenges and solves different security problems by mining security risks from massive datasets. The work spans three aspects: detecting security risks in the enterprise environment, prioritizing security risks of mobile apps, and measuring the impact of security risks between websites and mobile apps. First, the thesis presents a framework to detect data leakage in very large volumes of content. The framework can be deployed in the cloud for enterprises and preserves the privacy of sensitive data. Second, the thesis prioritizes the inter-app communication risks in large-scale Android apps by designing a new distributed inter-app communication linking algorithm and performing nearest-neighbor risk analysis. Third, the thesis measures the impact of deep link hijacking, one type of inter-app communication risk, on 1 million websites and 160 thousand mobile apps. The measurement reveals the failure of Google's attempts to improve the security of deep links. / Ph. D.
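To make the nearest-neighbor risk analysis mentioned above concrete, the sketch below scores an unlabeled app by the average risk of its closest labeled neighbors in a feature space. The feature construction, labels, and scoring rule are illustrative assumptions, not the thesis's actual algorithm.

```python
# Minimal sketch of nearest-neighbor risk scoring (illustrative only).
# Assumes each app is described by a numeric feature vector and that a
# subset of apps already carries known risk scores in [0, 1].
import numpy as np
from sklearn.neighbors import NearestNeighbors

def score_app_risk(features, known_risk, query, k=5):
    """Score a query app by averaging the risk of its k nearest labeled apps."""
    nn = NearestNeighbors(n_neighbors=k).fit(features)
    _, idx = nn.kneighbors(query.reshape(1, -1))
    return float(np.mean(known_risk[idx[0]]))

# Toy usage: 100 labeled apps with 8 hypothetical behavioral features.
rng = np.random.default_rng(0)
labeled_features = rng.random((100, 8))
labeled_risk = rng.random(100)
new_app = rng.random(8)
print(score_app_risk(labeled_features, labeled_risk, new_app))
```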
232

Sequential learning, large-scale calibration, and uncertainty quantification

Huang, Jiangeng 23 July 2019 (has links)
With remarkable advances in computing power, computer experiments continue to expand the boundaries and drive down the cost of various scientific discoveries. New challenges keep arising from designing, analyzing, modeling, calibrating, optimizing, and predicting in computer experiments. This dissertation consists of six chapters, exploring statistical methodologies in sequential learning, model calibration, and uncertainty quantification for heteroskedastic computer experiments and large-scale computer experiments. For heteroskedastic computer experiments, an optimal lookahead-based sequential learning strategy is presented, balancing replication and exploration to facilitate separating signal from input-dependent noise. Motivated by challenges in both data size and model fidelity arising from ever-larger modern computer experiments, this dissertation develops highly accurate and computationally efficient divide-and-conquer calibration methods based on on-site experimental design and surrogate modeling for large-scale computer models. The proposed methodology is applied to calibrate a real computer experiment from the gas and oil industry. This on-site surrogate calibration method is further extended to multiple-output calibration problems. / Doctor of Philosophy / With remarkable advances in computing power, complex physical systems today can be simulated comparatively cheaply and to high accuracy through computer experiments. Computer experiments continue to expand the boundaries and drive down the cost of various scientific investigations, including the biological, business, engineering, industrial, management, health-related, physical, and social sciences. This dissertation consists of six chapters, exploring statistical methodologies in sequential learning, model calibration, and uncertainty quantification for heteroskedastic computer experiments and large-scale computer experiments. For computer experiments with a changing signal-to-noise ratio, an optimal lookahead-based sequential learning strategy is presented, balancing replication and exploration to facilitate separating signal from a complex noise structure. In order to effectively extract key information from massive amounts of simulation output and make better predictions for the real world, highly accurate and computationally efficient divide-and-conquer calibration methods for large-scale computer models are developed in this dissertation, addressing challenges in both data size and model fidelity arising from ever-larger modern computer experiments. The proposed methodology is applied to calibrate a real computer experiment from the gas and oil industry. This large-scale calibration method is further extended to solve multiple-output calibration problems.
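As a rough illustration of the replication-versus-exploration tradeoff described above, the toy heuristic below replicates the noisiest existing design site unless the surrogate's predictive variance at a candidate input is even larger. The decision rule and interfaces are simplified assumptions, not the dissertation's lookahead criterion.

```python
# Toy replicate-vs-explore heuristic for a heteroskedastic simulator
# (a crude stand-in for the lookahead criterion described above; the
# thresholding rule here is an illustrative assumption, not the method
# from the dissertation).
import numpy as np

def next_design_point(X, y_by_site, candidate, predict_var):
    """Pick the next run: replicate the noisiest existing site or try `candidate`.

    X           : (n, d) array of existing design sites
    y_by_site   : list of arrays, outputs observed at each site (replicates)
    candidate   : (d,) proposed new input location
    predict_var : callable giving the surrogate's predictive variance at an input
    """
    # Empirical noise variance at each site (needs >= 2 replicates to estimate).
    noise_var = np.array([np.var(y, ddof=1) if len(y) > 1 else 0.0
                          for y in y_by_site])
    noisiest = int(np.argmax(noise_var))
    # Explore when the surrogate is more uncertain at the candidate than the
    # largest noise we could reduce by replicating; otherwise replicate.
    if predict_var(candidate) > noise_var[noisiest]:
        return ("explore", candidate)
    return ("replicate", X[noisiest])
```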
233

Parallel Mining and Analysis of Triangles and Communities in Big Networks

Arifuzzaman, S M. 19 August 2016 (has links)
A network (graph) is a powerful abstraction for interactions among entities in a system. Examples include various social, biological, collaboration, citation, and co-purchase networks. Real-world networks are often characterized by an abundance of triangles and the existence of well-structured communities. Thus, counting triangles and detecting communities in networks have become important algorithmic problems in network mining and analysis. In the era of big data, the network data emerging from numerous scientific disciplines are very large. Online social networks such as Twitter and Facebook have millions to billions of users. Such massive networks often do not fit in the main memory of a single machine, and existing sequential methods might take a prohibitively long runtime. This motivates the need for scalable parallel algorithms for mining and analysis. We design MPI-based distributed-memory parallel algorithms for counting triangles and detecting communities in big networks and present related analysis. The dissertation consists of four parts. In Part I, we devise parallel algorithms for counting and enumerating triangles. The first algorithm employs an overlapping partitioning scheme and novel load-balancing schemes, leading to a fast algorithm. We also design a space-efficient algorithm using non-overlapping partitioning and an efficient communication scheme. This space efficiency allows the algorithm to work on even larger networks. We then present our third parallel algorithm, based on dynamic load balancing. All these algorithms work on big networks, scale to a large number of processors, and demonstrate very good speedups. An important property of many real-world networks, closely related to triangles, is high transitivity: two nodes with common neighbors tend to become neighbors themselves. In Part II, we characterize networks by quantifying the number of common neighbors and demonstrate its relationship to the community structure of networks. In Part III, we design parallel algorithms for detecting communities in big networks. We propose efficient load balancing and communication approaches, which lead to fast and scalable algorithms. Finally, in Part IV, we present scalable parallel algorithms for a useful graph preprocessing problem: converting an edge list to an adjacency list. We present non-trivial parallelization with efficient HPC-based techniques, leading to fast and space-efficient algorithms. / Ph. D.
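A minimal sketch of distributed triangle counting with mpi4py is shown below: the node set is partitioned across ranks, each rank counts the triangles anchored at its nodes, and a reduction sums the partial counts. For simplicity the adjacency structure is replicated on every rank, unlike the overlapping and non-overlapping partitioning schemes developed in the dissertation.

```python
# Toy distributed triangle count with mpi4py; run under mpiexec with several ranks.
# The full adjacency is replicated on every rank for simplicity.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Small example graph as adjacency sets (undirected).
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
nodes = sorted(adj)

# Each rank handles a strided slice of nodes and counts triangles (u < v < w)
# whose smallest node u falls in its slice, so no triangle is counted twice.
local = 0
for u in nodes[rank::size]:
    for v in adj[u]:
        if v <= u:
            continue
        # Common neighbors w > v close the triangle u-v-w.
        local += sum(1 for w in adj[u] & adj[v] if w > v)

total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print("triangles:", total)  # 2 for this example graph
```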
234

Surveillance Technology and the Neoliberal State: Expanding the Power to Criminalize in a Data-Unlimited World

Hurley, Emily Elizabeth 23 June 2017 (has links)
For the past several decades, the neoliberal school of economics has dominated public policy, encouraging American politicians to reduce the size of the government. Despite this trend, the power of the state to surveille, criminalize, and detain has become more extensive, even as the state appears to be growing less powerful. By allowing information technology corporations such as Google to collect location data from users with or without their knowledge, the state can tap into a vast surveillance network at any time, retroactively surveilling and criminalizing at its discretion. Furthermore, neoliberal political theory has eroded the classical liberal conception of freedom so that these surveillance tactics do not appear to restrict individuals' freedom or privacy so long as they give their consent to be surveilled by a private corporation. Neoliberalism also encourages the proliferation of information technologies by making individuals responsible for their economic success and wellbeing in an increasingly competitive world, thus pushing more individuals to use information technologies to enter the gig economy. The individuating logic of neoliberalism, combined with the rapid economic potentialities of information technology, turns individuals into mere sources of human capital. Even though the American state's commitment to neoliberalism precludes it from covertly managing the labor economy, it can still manage a population through criminalization and incarceration. Access to users' data by way of information technology makes the process of criminalization more manageable and allows the state to more easily incarcerate indiscriminately. / Master of Arts
235

Human Mobility Perturbation and Resilience in Natural Disasters

Wang, Qi 30 April 2015 (has links)
Natural disasters exert a profound impact on the world population. In 2012, natural disasters affected 106 million people, forcing over 31.7 million people to leave their homes. Climate change has intensified natural disasters, resulting in more catastrophic events and making extreme weather more difficult to predict. Understanding and predicting human movements plays a critical role in disaster evacuation, response, and relief. Researchers have developed different methodologies and applied several models to study human mobility patterns, including random walks, Lévy flights, and Brownian walks. However, understanding of the extent to which these models apply to perturbed human mobility patterns during disasters, and of the associated implications for improving disaster evacuation, response, and relief efforts, is lacking. My PhD research aims to address this limitation in human mobility research and gain a ground-truth understanding of human mobility patterns under the influence of natural disasters. The research contains three interdependent projects. In the first project, I developed a novel data collecting system. The system can be used to collect large-scale human mobility data from large online social networking platforms. By analyzing both the general characteristics of the collected data and conducting a case study in NYC, I confirmed that the data collecting system is a viable means of collecting empirical data for human mobility research. My second project examined human mobility patterns in NYC under the influence of Hurricane Sandy. Using the data collecting system developed in the first project, I collected 12 days of human mobility data from NYC. The data set contains movements during and several days after the strike of Hurricane Sandy. The results showed that human mobility was strongly perturbed by Hurricane Sandy, but inherent resilience was also observed in human movements. In the third project, I extended my research to fifteen additional natural disasters from five categories. Using over 3.5 million data entries of human movement, I found that while human mobility still followed the Lévy flight model during these disaster events, extremely powerful natural disasters could break the correlation between human mobility in steady states and perturbation states and thus destroy the inherent resilience in human mobility. The overall findings have significant implications for improving the understanding and prediction of human mobility under the influence of natural disasters and extreme events. / Ph. D.
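A common check for Lévy-flight-like mobility of the kind analyzed above is whether displacement lengths follow a heavy-tailed power law. The sketch below computes great-circle displacements between consecutive check-ins and estimates the power-law exponent by maximum likelihood; it is a simplified illustration, not the study's actual analysis pipeline.

```python
# Minimal sketch: estimate the power-law exponent of displacement lengths,
# a common check for Levy-flight-like mobility (continuous MLE of
# Clauset, Shalizi & Newman). An illustration only, not the study's pipeline.
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between consecutive check-ins, in kilometers."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def powerlaw_alpha(displacements_km, x_min=1.0):
    """MLE exponent for P(x) ~ x^-alpha over displacements >= x_min."""
    x = np.asarray(displacements_km, dtype=float)
    x = x[x >= x_min]
    return 1.0 + len(x) / np.sum(np.log(x / x_min))
```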
236

On the Use of Grouped Covariate Regression in Oversaturated Models

Loftus, Stephen Christopher 11 December 2015 (has links)
As data collection techniques improve, the number of covariates often exceeds the number of observations. When this happens, regression models become oversaturated and, thus, inestimable. Many classical and Bayesian techniques have been designed to combat this difficulty, each with its own means of handling the oversaturation. However, these techniques can be tricky to implement well, difficult to interpret, and unstable. This work proposes a technique that takes advantage of the natural clustering of variables that can often be found in biological and ecological datasets known as omics datasets. Generally speaking, omics datasets attempt to classify host species structure or function by characterizing a group of biological molecules, such as genes (genomics), proteins (proteomics), and metabolites (metabolomics). By clustering the covariates and regressing on a single value for each cluster, the model becomes both estimable and stable. In addition, the technique can account for the variability within each cluster, allow for the inclusion of expert judgment, and provide a probability of inclusion for each cluster. / Ph. D.
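The core idea of clustering covariates and regressing on one value per cluster can be sketched as follows. Using k-means on the covariate columns and the cluster mean as the summary value are illustrative choices; the proposed method also accounts for within-cluster variability and cluster inclusion probabilities, which this sketch omits.

```python
# Minimal sketch of grouped-covariate regression for an oversaturated model
# (p >> n): cluster the columns of X, then regress y on one summary value
# per cluster.
import numpy as np
from sklearn.cluster import KMeans

def grouped_regression(X, y, n_clusters=5, seed=0):
    # Cluster covariates by their profiles across observations (columns of X).
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X.T)
    # Collapse each cluster to a single averaged covariate.
    Z = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(n_clusters)])
    # Ordinary least squares on the now well-posed n x (n_clusters + 1) design.
    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(y)), Z]), y, rcond=None)
    return labels, coef

# Toy usage: 30 samples, 500 covariates (e.g., metabolite abundances).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 500))
y = X[:, :100].mean(axis=1) + rng.normal(scale=0.1, size=30)
labels, coef = grouped_regression(X, y)
```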
237

Data-Driven Park Planning: Comparative Study of Survey with Social Media Data

Sim, Jisoo 05 May 2020 (has links)
The purpose of this study was (1) to identify visitors' behaviors in and perceptions of linear parks, (2) to identify social media users' behaviors in and perceptions of linear parks, and (3) to compare small data with big data. This chapter discusses the main findings and their implications for practitioners such as landscape architects and urban planners. It has three sections. The first addresses the main findings in the order of the research questions at the center of the study. The second describes implications and recommendations for practitioners. The final section discusses the limitations of the study and suggests directions for future work. This study compares two methods of data collection, focused on activities and benefits. The survey asked respondents to check all the activities they did in the park. Social media users' activities were detected by term frequency in social media data. Both results ordered the activities similarly. For example, social interaction and art viewing were most popular on the High Line, then the 606, then the High Bridge according to both methods. Both methods also reported that High Line visitors engaged in viewing from overlooks the most. As for benefits, according to both methods visitors to the 606 were more satisfied than High Line visitors with the parks' social and natural benefits. These results suggest social media analytics can replace surveys when the textual information is sufficient for analysis. Social media analytics also differ from surveys in the accuracy of results. For example, social media revealed that 606 users were interested in events and worried about housing prices and crime, but the pre-designed survey could not capture those facts. Social media analytics can also catch hidden and more general information: through cluster analysis, we found possible reasons for the High Line's success in the arts and in New York City itself. These results involve general information that would be hard to identify through a survey. On the other hand, surveys provide specific information and can describe visitors' demographics, motivations, travel information, and specific benefits. For example, 606 users tend to be young, high-income, well-educated, white, and female. These data cannot be collected through social media. / Doctor of Philosophy / Turning unused infrastructure into green infrastructure, such as linear parks, is not a new approach to managing brownfields. In the last few decades, changes in industrial structure and the development of transportation have had a profound effect on urban spatial structure. As the need for infrastructure that played an important role in the development of past industry has decreased, many industrial sites, power plants, and military bases have fallen into disuse. This study identifies new ways of collecting information about a new type of park, linear parks, using a new method, social media analytics. The results are then compared with survey results to establish the credibility of social media analytics. Lastly, shortcomings of social media analytics are identified. This study is meaningful in helping us understand the users of new types of parks and suggesting design and planning strategies. Regarding methodology, this study also evaluates the use of social media analytics and its advantages, disadvantages, and reliability.
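The term-frequency approach used above to detect activities in social media posts can be sketched simply: count how often activity-related keywords appear across posts and rank activities accordingly. The keyword lists below are hypothetical examples, not the study's coding scheme.

```python
# Minimal sketch of the term-frequency idea used to infer park activities
# from social media posts. The activity keyword lists are hypothetical.
from collections import Counter
import re

ACTIVITY_TERMS = {
    "art viewing": {"art", "mural", "sculpture", "exhibit"},
    "social interaction": {"friends", "picnic", "meetup", "date"},
    "viewing from overlooks": {"view", "overlook", "skyline", "sunset"},
}

def activity_frequencies(posts):
    """Rank activities by how often their keywords appear across posts."""
    words = Counter(w for p in posts for w in re.findall(r"[a-z']+", p.lower()))
    return {activity: sum(words[t] for t in terms)
            for activity, terms in ACTIVITY_TERMS.items()}

posts = ["Sunset views from the High Line never get old",
         "Picnic with friends by the new mural"]
print(activity_frequencies(posts))
```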
238

Algorithmic Distribution of Applied Learning on Big Data

Shukla, Manu 16 October 2020 (has links)
Machine learning and graph techniques are complex and challenging to distribute. Generally, they are distributed by modeling the problem in a similar way to single-node sequential techniques, except applied to smaller chunks of data and compute, with the results then combined. These techniques focus on stitching the results from smaller chunks as the best possible way to have the outcome as close to the sequential results on the entire data as possible. This approach is not feasible in numerous kernel, matrix, optimization, graph, and other techniques where the algorithm needs access to all the data during execution. In this work, we propose key-value pair based distribution techniques that are widely applicable to statistical machine learning techniques along with matrix, graph, and time series based algorithms. The crucial difference from previously proposed techniques is that all operations are modeled as key-value pair based fine- or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only in robust disk-based frameworks but also in in-memory systems without significant changes. Key-value pair based techniques also provide the ability to generate the same result as sequential techniques, with no edge or overlap effects to resolve in structures such as graphs or matrices. This thesis focuses on key-value pair based distribution of applied machine learning techniques on a variety of problems. In the first method, key-value pair distribution is used for storytelling at scale. Storytelling connects entities (people, organizations) using their observed relationships to establish meaningful storylines. When performed sequentially, these computations become a bottleneck because the massive number of entities makes space and time complexity untenable. We present DISCRN, or DIstributed Spatio-temporal ConceptseaRch based StorytelliNg, a distributed framework for performing spatio-temporal storytelling. The framework extracts entities from microblogs and event data, and links these entities using a novel ConceptSearch to derive storylines in a distributed fashion utilizing the key-value pair paradigm. Performing these operations at scale allows deeper and broader analysis of storylines. The novel parallelization techniques speed up the generation and filtering of storylines on massive datasets. Experiments with microblog posts such as Twitter data and GDELT (Global Database of Events, Language and Tone) events show the efficiency of the techniques in DISCRN. The second work determines brand perception directly from people's comments in social media. Current techniques for determining brand perception, such as surveys of handpicked users by mail, in person, by phone, or online, are time-consuming and increasingly inadequate. The proposed DERIV system distills storylines from open data representing direct consumer voice into a brand perception. The framework summarizes the perception of a brand in comparison to peer brands with in-memory key-value pair based distributed algorithms utilizing supervised machine learning techniques. Experiments performed with open data and models built with storylines of known peer brands show the technique to be highly scalable and accurate in capturing brand perception from vast amounts of social data, compared to sentiment analysis. The third work performs event categorization and prospect identification in social media. The problem is challenging due to the endless amount of information generated daily. In our work, we present DISTL, an event processing and prospect identifying platform. It accepts as input a set of storylines (a sequence of entities and their relationships) and processes them as follows: (1) uses different algorithms (LDA, SVM, information gain, rule sets) to identify themes from storylines; (2) identifies top locations and times in storylines and combines them with themes to generate events that are meaningful in a specific scenario for categorizing storylines; and (3) extracts top prospects as people and organizations from data elements contained in storylines. The output comprises sets of events in different categories and the storylines under them, along with the top prospects identified. DISTL utilizes in-memory key-value pair based distributed processing that scales to high data volumes and categorizes generated storylines in near real-time. The fourth work builds flight paths of drones in a distributed manner to survey a large area, taking images to determine the growth of vegetation over power lines while allowing for adjustment to the terrain and to the number of drones and their capabilities. Drones are increasingly being used to perform risky and labor-intensive aerial tasks cheaply and safely. To keep operating costs low and flights autonomous, their flight plans must be pre-built. In existing techniques, drone flight paths are not automatically pre-calculated based on drone capabilities and terrain information. We present details of an automated flight plan builder, DIMPL, that pre-builds flight plans for drones tasked with surveying a large area to take photographs of electric poles and identify ones with hazardous vegetation overgrowth. DIMPL employs a distributed in-memory key-value pair based paradigm to process subregions in parallel and build flight paths in a highly efficient manner. The fifth work highlights scaling graph operations, particularly pruning and joins. Linking topics to specific experts in technical documents and finding connections between experts are crucial for detecting the evolution of emerging topics and the relationships between their influencers in state-of-the-art research. Current techniques that make such connections are limited to similarity measures. Methods based on weights such as TF-IDF and frequency to identify important topics, and self-joins between topics and experts, are generally utilized to identify connections between experts. However, such approaches are inadequate for identifying emerging keywords and experts, since the most useful terms in technical documents tend to be infrequent and concentrated in just a few documents. This makes connecting experts through joins on large dense graphs challenging. We present DIGDUG, a framework that identifies emerging topics by applying graph operations to technical terms. The framework identifies connections between authors of patents and journal papers by performing joins on connected topics and topics associated with the authors at scale. The problem of scaling the graph operations for topics and experts is solved through dense graph pruning and graph joins categorized under their own scalable separable dense graph class based on key-value pair distribution. Comparing our graph join and pruning technique against multiple graph and join methods in MapReduce revealed a significant improvement in performance using our approach.
/ Doctor of Philosophy / Distribution of machine learning and graph algorithms is commonly performed by modeling the core algorithm in the same way as the sequential technique, except implemented on a distributed framework. This approach is satisfactory in very few cases, such as depth-first search and subgraph enumeration in graphs, k-nearest neighbors, and a few additional common methods. These techniques focus on stitching the results from smaller data or compute chunks as the best possible way to have the outcome as close to the sequential results on the entire data as possible. This approach is not feasible in numerous kernel, matrix, optimization, graph, and other techniques where the algorithm needs to perform exhaustive computations on all the data during execution. In this work, we propose key-value pair based distribution techniques that are exhaustive and widely applicable to statistical machine learning algorithms along with matrix, graph, and time series based operations. The crucial difference from previously proposed techniques is that all operations are modeled as key-value pair based fine- or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only in robust disk-based frameworks but also in in-memory systems without significant changes. Key-value pair based techniques also provide the ability to generate the same result as sequential techniques, with no edge or overlap effects to resolve in structures such as graphs or matrices.
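The key-value pair paradigm that runs through these five works can be illustrated with a single-process sketch: every operation is expressed as a map step that emits key-value pairs, a shuffle that groups values by key, and a reduce that combines them. The entity-pair co-occurrence task below is an illustrative stand-in, not the DISCRN or DERIV algorithms themselves.

```python
# Single-process illustration of the key-value pair paradigm: map -> shuffle
# (group by key) -> reduce, the same shape whether it runs on one machine
# or on a distributed framework.
from collections import defaultdict
from itertools import combinations

def map_phase(entities):
    """Emit one key-value pair per co-occurring entity pair in a document."""
    for a, b in combinations(sorted(set(entities)), 2):
        yield (a, b), 1

def shuffle(pairs):
    """Group values by key, as a framework's shuffle stage would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum counts per key to get co-occurrence strength for storyline linking."""
    return {key: sum(values) for key, values in grouped.items()}

docs = {"t1": ["acme corp", "j. doe", "city hall"],
        "t2": ["j. doe", "acme corp"]}
emitted = [kv for ents in docs.values() for kv in map_phase(ents)]
print(reduce_phase(shuffle(emitted)))  # e.g. {('acme corp', 'j. doe'): 2, ...}
```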
239

Scalable and Productive Data Management for High-Performance Analytics

Youssef, Karim Yasser Mohamed Yousri 07 November 2023 (has links)
Advancements in data acquisition technologies across different domains, from genome sequencing to satellite and telescope imaging to large-scale physics simulations, are leading to an exponential growth in dataset sizes. Extracting knowledge from this wealth of data enables scientific discoveries at unprecedented scales. However, the sheer volume of the gathered datasets is a bottleneck for knowledge discovery. High-performance computing (HPC) provides a scalable infrastructure to extract knowledge from these massive datasets. However, multiple data management performance gaps exist between big data analytics software and HPC systems. These gaps arise from multiple factors, including the tradeoff between performance and programming productivity, data growth at a faster rate than memory capacity, and the high storage footprints of data analytics workflows. This dissertation bridges these gaps by combining productive data management interfaces with application-specific optimizations of data parallelism, memory operation, and storage management. First, we address the performance-productivity tradeoff by leveraging Spark and optimizing input data partitioning. Our solution optimizes programming productivity while achieving performance comparable to the Message Passing Interface (MPI) for scalable bioinformatics. Second, we address the operating system kernel's limitations for out-of-core data processing by autotuning memory management parameters in userspace. Finally, we address I/O and storage efficiency bottlenecks in data analytics workflows that iteratively and incrementally create and reuse persistent data structures such as graphs, data frames, and key-value datastores. / Doctor of Philosophy / Advancements in various fields, like genetics, satellite imaging, and physics simulations, are generating massive amounts of data. Analyzing this data can lead to groundbreaking scientific discoveries. However, the sheer size of these datasets presents a challenge. High-performance computing (HPC) offers a solution to process and understand this data efficiently. Still, several issues hinder the performance of big data analytics software on HPC systems. These problems include finding the right balance between performance and ease of programming, dealing with the challenges of handling massive amounts of data, and optimizing storage usage. This dissertation focuses on three areas to improve high-performance data analytics (HPDA). First, it demonstrates how using Spark and optimized data partitioning can improve programming productivity while achieving scalability similar to that of the Message Passing Interface (MPI) for scalable bioinformatics. Second, it addresses the limitations of the operating system's memory management for processing data that is too large to fit entirely in memory. Lastly, it tackles the efficiency issues related to input/output operations and storage when dealing with data structures like graphs, data frames, and key-value datastores in iterative and incremental workflows.
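The input-partitioning idea from the first contribution can be sketched in PySpark: repartition the input so each task receives a balanced chunk, then run the analysis per partition. The partition-count heuristic, input path, and per-record work below are illustrative assumptions rather than the dissertation's tuned configuration.

```python
# Minimal PySpark sketch of input partitioning: rebalance the input before a
# per-partition analysis step. Paths and heuristics here are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-analysis").getOrCreate()
sc = spark.sparkContext

# Hypothetical input path; one record per line (e.g., one sequence read).
records = sc.textFile("hdfs:///data/reads.txt")

# Aim for a few partitions per core so stragglers are amortized.
target_parts = sc.defaultParallelism * 3
balanced = records.repartition(target_parts)

def process_partition(lines):
    # Placeholder per-partition work: count records and total characters.
    n = total = 0
    for line in lines:
        n += 1
        total += len(line)
    yield (n, total)

summary = balanced.mapPartitions(process_partition).collect()
print(summary)
spark.stop()
```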
240

The impact of big data analytics on firms’ high value business performance

Popovic, A., Hackney, R., Tassabehji, Rana, Castelli, M. 2016 October 1928 (has links)
Big Data Analytics (BDA) is an emerging phenomenon with the reported potential to transform how firms manage and enhance high-value business performance. The purpose of our study is to investigate the impact of BDA on operations management in the manufacturing sector, an acknowledged but infrequently researched context. Using an interpretive qualitative approach, this empirical study leverages a comparative case study of three manufacturing companies with varying levels of BDA usage (experimental, moderate, and heavy). The information technology (IT) business value literature and a resource-based view informed the development of our research propositions and the conceptual framework that illuminated the relationships between BDA capability and organizational readiness and design. Our findings indicate that BDA capability (in terms of data sourcing, access, integration, and delivery, analytical capabilities, and people's expertise) along with organizational readiness and design factors (such as BDA strategy, top management support, financial resources, and employee engagement) facilitated better utilization of BDA in manufacturing decision making, and thus enhanced high-value business performance. Our results also highlight important managerial implications related to the impact of BDA on the empowerment of employees, and how BDA can be integrated into organizations to augment rather than replace management capabilities. Our research will benefit academics and practitioners in further aiding our understanding of BDA utilization in transforming operations and production management. It adds to the limited body of empirically based knowledge by highlighting the real business value resulting from applying BDA in manufacturing firms, thus encouraging beneficial economic and societal changes.
