231

On the Use of Grouped Covariate Regression in Oversaturated Models

Loftus, Stephen Christopher 11 December 2015 (has links)
As data collection techniques improve, the number of covariates often exceeds the number of observations. When this happens, regression models become oversaturated and, thus, inestimable. Many classical and Bayesian techniques have been designed to combat this difficulty, each addressing the oversaturation in a different way. However, these techniques can be tricky to implement well, difficult to interpret, and unstable. What is proposed is a technique that takes advantage of the natural clustering of variables often found in the biological and ecological datasets known as omics datasets. Generally speaking, omics datasets attempt to classify host species structure or function by characterizing a group of biological molecules, such as genes (genomics), proteins (proteomics), and metabolites (metabolomics). By clustering the covariates and regressing on a single value for each cluster, the model becomes both estimable and stable. In addition, the technique can account for the variability within each cluster, allow for the inclusion of expert judgment, and provide a probability of inclusion for each cluster. / Ph. D.
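A minimal sketch of the general idea in Python, assuming covariates are grouped by hierarchical clustering of their correlation structure and each cluster is summarized by its mean; the clustering method, number of clusters, and variable names are illustrative assumptions, not the dissertation's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 50, 200                      # far more covariates than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Cluster covariates by the similarity of their correlation profiles.
dist = 1 - np.abs(np.corrcoef(X, rowvar=False))    # p x p dissimilarity
Z = linkage(dist[np.triu_indices(p, k=1)], method="average")
labels = fcluster(Z, t=10, criterion="maxclust")   # e.g., 10 clusters

# Regress on one representative value per cluster (here, the cluster mean),
# turning an inestimable p > n problem into a small, stable regression.
X_grouped = np.column_stack(
    [X[:, labels == k].mean(axis=1) for k in np.unique(labels)]
)
fit = LinearRegression().fit(X_grouped, y)
print(fit.coef_)
```

Because the regression now has as many predictors as clusters rather than covariates, it remains estimable even when the covariate count far exceeds the sample size.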
232

Data-Driven Park Planning: Comparative Study of Survey with Social Media Data

Sim, Jisoo 05 May 2020 (has links)
The purpose of this study was (1) to identify visitors’ behaviors in and perceptions of linear parks, (2) to identify social media users’ behaviors in and perceptions of linear parks, and (3) to compare small data with big data. This chapter discusses the main findings and their implications for practitioners such as landscape architects and urban planners. It has three sections. The first addresses the main findings in the order of the research questions at the center of the study. The second describes implications and recommendations for practitioners. The final section discusses the limitations of the study and suggests directions for future work. This study compares two methods of data collection, focused on activities and benefits. The survey asked respondents to check all the activities they did in the park. Social media users’ activities were detected by term frequency in social media data. Both results ordered the activities similarly. For example, according to both methods, social interaction and art viewing were most popular on the High Line, then the 606, then the High Bridge. Both methods also reported that High Line visitors engaged in viewing from overlooks the most. As for benefits, according to both methods visitors to the 606 were more satisfied than High Line visitors with the parks’ social and natural benefits. These results suggest social media analytics can replace surveys when the textual information is sufficient for analysis. Social media analytics also differ from surveys in the accuracy of their results. For example, social media revealed that 606 users were interested in events and worried about housing prices and crimes, but the pre-designed survey could not capture those facts. Social media analytics can also catch hidden and more general information: through cluster analysis, we found possible reasons for the High Line’s success in the arts and in New York City itself. These results involve general information that would be hard to identify through a survey. On the other hand, surveys provide specific information and can describe visitors’ demographics, motivations, travel information, and specific benefits. For example, 606 users tend to be young, high-income, well-educated, white, and female. These data cannot be collected through social media. / Doctor of Philosophy / Turning unused infrastructure into green infrastructure, such as linear parks, is not a new approach to managing brownfields. In the last few decades, changes in the industrial structure and the development of transportation have had a profound effect on urban spatial structure. As the need for infrastructure, which played an important role in the development of past industry, has decreased, many industrial sites, power plants, and military bases have become unused. This study identifies new ways of collecting information about a new type of park, linear parks, using a new method, social media analytics. The results are then compared with survey results to establish the credibility of social media analytics. Lastly, shortcomings of social media analytics are identified. This study is meaningful in helping us understand the users of new types of parks and suggesting design and planning strategies. Regarding methodology, this study also involves evaluating the use of social media analytics and its advantages, disadvantages, and reliability.
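A small hedged sketch of the kind of term-frequency matching described above, assuming hand-built keyword lists for each activity category; the keywords and example posts are invented for illustration and are not the study's actual lexicon or data.

```python
from collections import Counter
import re

# Hypothetical keyword lists mapping activity categories to indicative terms.
ACTIVITY_TERMS = {
    "social interaction": {"friends", "meetup", "picnic", "together"},
    "art viewing": {"art", "mural", "sculpture", "installation"},
    "viewing from overlooks": {"overlook", "view", "skyline"},
}

posts = [
    "Amazing skyline view from the High Line overlook today",
    "Met friends for a picnic and saw a new mural installation",
]

counts = Counter()
for post in posts:
    tokens = set(re.findall(r"[a-z]+", post.lower()))
    for activity, terms in ACTIVITY_TERMS.items():
        if tokens & terms:
            counts[activity] += 1

print(counts.most_common())   # activities ranked by frequency of mention
```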
233

Algorithmic Distribution of Applied Learning on Big Data

Shukla, Manu 16 October 2020 (has links)
Machine Learning and Graph techniques are complex and challenging to distribute. Generally, they are distributed by modeling the problem in much the same way as single-node sequential techniques, except applied to smaller chunks of data and compute, with the results then combined. These techniques focus on stitching together the results from the smaller chunks so that the outcome is as close as possible to the sequential result on the entire data. This approach is not feasible in numerous kernel, matrix, optimization, graph, and other techniques where the algorithm needs access to all the data during execution. In this work, we propose key-value pair based distribution techniques that are widely applicable to statistical machine learning techniques along with matrix, graph, and time series based algorithms. The crucial difference with previously proposed techniques is that all operations are modeled as key-value pair based fine- or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only in robust disk-based frameworks but also in in-memory based systems without significant changes. Key-value pair based techniques also provide the ability to generate the same result as sequential techniques with no edge or overlap effects in structures such as graphs or matrices to resolve. This thesis focuses on key-value pair based distribution of applied machine learning techniques on a variety of problems. For the first method, key-value pair distribution is used for storytelling at scale. Storytelling connects entities (people, organizations) using their observed relationships to establish meaningful storylines. When performed sequentially, these computations become a bottleneck because the massive number of entities makes space and time complexity untenable. We present DISCRN, or DIstributed Spatio-temporal ConceptseaRch based StorytelliNg, a distributed framework for performing spatio-temporal storytelling. The framework extracts entities from microblogs and event data, and links these entities using a novel ConceptSearch to derive storylines in a distributed fashion utilizing the key-value pair paradigm. Performing these operations at scale allows deeper and broader analysis of storylines. The novel parallelization techniques speed up the generation and filtering of storylines on massive datasets. Experiments with microblog posts such as Twitter data and GDELT (Global Database of Events, Language and Tone) events show the efficiency of the techniques in DISCRN. The second work determines brand perception directly from people's comments in social media. Current techniques for determining brand perception, such as surveys of handpicked users by mail, in person, phone or online, are time-consuming and increasingly inadequate. The proposed DERIV system distills storylines from open data representing direct consumer voice into a brand perception. The framework summarizes the perception of a brand in comparison to peer brands with in-memory key-value pair based distributed algorithms utilizing supervised machine learning techniques. Experiments performed with open data and models built with storylines of known peer brands show the technique to be highly scalable and more accurate than sentiment analysis in capturing brand perception from vast amounts of social data. The third work performs event categorization and prospect identification in social media. The problem is challenging due to the endless amount of information generated daily.
In our work, we present DISTL, an event processing and prospect identifying platform. It accepts as input a set of storylines (a sequence of entities and their relationships) and processes them as follows: (1) it uses different algorithms (LDA, SVM, information gain, rule sets) to identify themes from storylines; (2) it identifies top locations and times in storylines and combines them with themes to generate events that are meaningful in a specific scenario for categorizing storylines; and (3) it extracts top prospects as people and organizations from data elements contained in storylines. The output comprises sets of events in different categories, the storylines under them, and the top prospects identified. DISTL utilizes in-memory key-value pair based distributed processing that scales to high data volumes and categorizes generated storylines in near real-time. The fourth work builds drone flight paths in a distributed manner to survey a large area, taking images to determine vegetation growth over power lines while adjusting to the terrain and to the number of drones and their capabilities. Drones are increasingly being used to perform risky and labor-intensive aerial tasks cheaply and safely. To ensure operating costs are low and flights autonomous, their flight plans must be pre-built. In existing techniques, drone flight paths are not automatically pre-calculated based on drone capabilities and terrain information. We present details of an automated flight plan builder, DIMPL, that pre-builds flight plans for drones tasked with surveying a large area to take photographs of electric poles to identify ones with hazardous vegetation overgrowth. DIMPL employs a distributed in-memory key-value pair based paradigm to process subregions in parallel and build flight paths in a highly efficient manner. The fifth work highlights scaling graph operations, particularly pruning and joins. Linking topics to specific experts in technical documents and finding connections between experts are crucial for detecting the evolution of emerging topics and the relationships between their influencers in state-of-the-art research. Current techniques that make such connections are limited to similarity measures. Methods based on weights such as TF-IDF and frequency are generally utilized to identify important topics, and self-joins between topics and experts to identify connections between experts. However, such approaches are inadequate for identifying emerging keywords and experts since the most useful terms in technical documents tend to be infrequent and concentrated in just a few documents. This makes connecting experts through joins on large dense graphs challenging. We present DIGDUG, a framework that identifies emerging topics by applying graph operations to technical terms. The framework identifies connections between authors of patents and journal papers by performing joins on connected topics and topics associated with the authors at scale. The problem of scaling the graph operations for topics and experts is solved through dense graph pruning and graph joins categorized under their own scalable separable dense graph class based on key-value pair distribution. Comparing our graph join and pruning technique against multiple graph and join methods in MapReduce revealed a significant improvement in performance using our approach.
/ Doctor of Philosophy / Distribution of Machine Learning and Graph algorithms is commonly performed by modeling the core algorithm in the same way as the sequential technique, except implemented on a distributed framework. This approach is satisfactory in very few cases, such as depth-first search and subgraph enumerations in graphs, k-nearest neighbors, and a few additional common methods. These techniques focus on stitching together the results from smaller data or compute chunks so that the outcome is as close as possible to the sequential result on the entire data. This approach is not feasible in numerous kernel, matrix, optimization, graph, and other techniques where the algorithm needs to perform exhaustive computations on all the data during execution. In this work, we propose key-value pair based distribution techniques that are exhaustive and widely applicable to statistical machine learning algorithms along with matrix, graph, and time series based operations. The crucial difference with previously proposed techniques is that all operations are modeled as key-value pair based fine- or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only in robust disk-based frameworks but also in in-memory based systems without significant changes. Key-value pair based techniques also provide the ability to generate the same result as sequential techniques with no edge or overlap effects in structures such as graphs or matrices to resolve.
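A minimal sketch of the key-value pair decomposition described above, assuming a toy entity-counting step expressed as a map over data partitions followed by a per-key reduce; the function names and data are illustrative and do not reproduce the thesis's frameworks.

```python
from collections import defaultdict
from multiprocessing import Pool

def map_partition(records):
    # Emit fine-grained (key, value) pairs for one chunk of data,
    # e.g. (entity, 1) counts that could later feed storyline construction.
    return [(entity, 1) for record in records for entity in record.split()]

def reduce_by_key(pairs):
    # Combine all values emitted for the same key.
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

if __name__ == "__main__":
    data = ["acme corp acquires beta", "beta partners with acme corp"]
    chunks = [data[:1], data[1:]]                  # partitions of the input
    with Pool(2) as pool:
        mapped = pool.map(map_partition, chunks)   # distributed map step
    # Shuffle: merge all emitted pairs, then reduce per key.
    totals = reduce_by_key(pair for part in mapped for pair in part)
    print(totals)
```

Because every step is expressed only in terms of (key, value) pairs, the same decomposition can run on disk-based or in-memory frameworks without compounding error across steps.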
234

Scalable and Productive Data Management for High-Performance Analytics

Youssef, Karim Yasser Mohamed Yousri 07 November 2023 (has links)
Advancements in data acquisition technologies across different domains, from genome sequencing to satellite and telescope imaging to large-scale physics simulations, are leading to an exponential growth in dataset sizes. Extracting knowledge from this wealth of data enables scientific discoveries at unprecedented scales. However, the sheer volume of the gathered datasets is a bottleneck for knowledge discovery. High-performance computing (HPC) provides a scalable infrastructure to extract knowledge from these massive datasets. However, multiple data management performance gaps exist between big data analytics software and HPC systems. These gaps arise from multiple factors, including the tradeoff between performance and programming productivity, data growth at a faster rate than memory capacity, and the high storage footprints of data analytics workflows. This dissertation bridges these gaps by combining productive data management interfaces with application-specific optimizations of data parallelism, memory operations, and storage management. First, we address the performance-productivity tradeoff by leveraging Spark and optimizing input data partitioning. Our solution optimizes programming productivity while achieving comparable performance to the Message Passing Interface (MPI) for scalable bioinformatics. Second, we address the operating system's kernel limitations for out-of-core data processing by autotuning memory management parameters in userspace. Finally, we address I/O and storage efficiency bottlenecks in data analytics workflows that iteratively and incrementally create and reuse persistent data structures such as graphs, data frames, and key-value datastores. / Doctor of Philosophy / Advancements in various fields, like genetics, satellite imaging, and physics simulations, are generating massive amounts of data. Analyzing this data can lead to groundbreaking scientific discoveries. However, the sheer size of these datasets presents a challenge. High-performance computing (HPC) offers a solution to process and understand this data efficiently. Still, several issues hinder the performance of big data analytics software on HPC systems. These problems include finding the right balance between performance and ease of programming, dealing with the challenges of handling massive amounts of data, and optimizing storage usage. This dissertation focuses on three areas to improve high-performance data analytics (HPDA). Firstly, it demonstrates how using Spark and optimized data partitioning can optimize programming productivity while achieving similar scalability to the Message Passing Interface (MPI) for scalable bioinformatics. Secondly, it addresses the limitations of the operating system's memory management for processing data that is too large to fit entirely in memory. Lastly, it tackles the efficiency issues related to input/output operations and storage when dealing with data structures like graphs, data frames, and key-value datastores in iterative and incremental workflows.
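A hedged PySpark sketch of the input-partitioning idea from the first contribution, assuming a text-based input; the file path and partition counts are placeholders, not the dissertation's actual configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()
sc = spark.sparkContext

# Reading with an explicit minimum number of partitions spreads the input
# across executors instead of relying on default block-based splitting.
reads = sc.textFile("hdfs:///data/reads.fastq", minPartitions=512)

# Repartitioning rebalances skewed inputs before an expensive per-record stage.
balanced = reads.repartition(512)
print(balanced.getNumPartitions())
```

Tuning the partition count to the cluster and input size is one way such a Spark pipeline can approach MPI-like scalability while keeping the high-level programming model.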
235

Co-creating social licence for sharing health and care data

Fylan, F., Fylan, Beth 25 March 2021 (has links)
Yes / Optimising the use of patient data has the potential to produce a transformational change in healthcare planning, treatment, condition prevention and understanding disease progression. Establishing how people's trust could be secured and a social licence to share data could be achieved is of paramount importance. The study took place across Yorkshire and the Humber, in the North of England, using a sequential mixed methods approach comprising focus groups, surveys and co-design groups. Twelve focus groups explored people's response to how their health and social care data is, could be, and should be used. A survey examined who should be able to see health and care records, acceptable uses of anonymous health and care records, and trust in different organisations. Case study cards addressed willingness for data to be used for different purposes. Co-creation workshops produced a set of guidelines for how data should be used. Focus group participants (n = 80) supported sharing health and care data for direct care and were surprised that this is not already happening. They discussed concerns about the currency and accuracy of their records and possible stigma associated with certain diagnoses, such as mental health conditions. They were less supportive of social care access to their records. They discussed three main concerns about their data being used for research or service planning: being identified; security limitations; and the potential rationing of care on the basis of information in their record, such as their lifestyle choices. Survey respondents (n = 1031) agreed that their GP (98%) and hospital doctors and nurses (93%) should be able to see their health and care records. There was more limited support for pharmacists (37%), care staff (36%), social workers (24%) and researchers (24%). Respondents thought their health and social care records should be used to help plan services (88%), to help people stay healthy (67%), to help find cures for diseases (67%), and for research for the public good (58%), but only 16% supported commercial research. Co-creation groups developed a set of principles for a social licence for data sharing based around good governance, effective processes, the type of organisation, and the ability to opt in and out. People support their data being shared for a range of purposes and co-designed a set of principles that would secure their trust and consent to data sharing. / This work was supported by Humber Teaching NHS Foundation Trust and the National Institute for Health Research (NIHR) Yorkshire and Humber Patient Safety Translational Research Centre (NIHR Yorkshire and Humber PSTRC).
236

Surveillance Technology and the Neoliberal State: Expanding the Power to Criminalize in a Data-Unlimited World

Hurley, Emily Elizabeth 23 June 2017 (has links)
For the past several decades, the neoliberal school of economics has dominated public policy, encouraging American politicians to reduce the size of the government. Despite this trend, the power of the state to surveil, criminalize, and detain has become more extensive, even as the state appears to be growing less powerful. By allowing information technology corporations such as Google to collect location data from users with or without their knowledge, the state can tap into a vast surveillance network at any time, retroactively surveilling and criminalizing at its discretion. Furthermore, neoliberal political theory has eroded the classical liberal conception of freedom so that these surveillance tactics do not appear to restrict individuals' freedom or privacy so long as they give their consent to be surveilled by a private corporation. Neoliberalism also encourages the proliferation of information technologies by making individuals responsible for their economic success and wellbeing in an increasingly competitive world, thus pushing more individuals to use information technologies to enter into the gig economy. The individuating logic of neoliberalism, combined with the rapid economic potentialities of information technology, turns individuals into mere sources of human capital. Even though the American state's commitment to neoliberalism precludes it from covertly managing the labor economy, it can still manage a population through criminalization and incarceration. Access to users' data by way of information technology makes the process of criminalization more manageable and allows the state to more easily incarcerate indiscriminately. / Master of Arts / Since the era of President Reagan, the American economic and political tradition has been committed to opening trade, limiting government regulation, and reducing public benefits in the interest of expanding freedom from the government. Despite this commitment to shrinking the size of the government, the government still considers itself responsible for public security, including both national security and criminalization. At the same time as this wave of deregulation, information technology companies such as Google have expanded their ability to collect and store data of individual users—data which the government has access to when it deems such access necessary. The deregulation of private markets has ushered in an era of extreme labor competition, which pushes many people to use information technology, such as computers and cell phones, to market their labor and make extra money. However, whenever a person is connected to GPS, Wi-Fi, or uses data on their phone, their location information is being stored and the government has access to this information. Neoliberalism therefore encourages people to use technology that allows them to be watched by the government. Location information is one of the main factors of criminalization; historically, a person's location informs the police's decision to arrest them or not. Enforcing laws against vagrancy, homelessness, prostitution, etc. requires law enforcement agencies to know where someone is, which becomes much easier when everyone is connected to their location data by their cell phone. This gives the state a huge amount of power to find and criminalize whoever it wants.
237

Mining Security Risks from Massive Datasets

Liu, Fang 09 August 2017 (has links)
Cyber security risk has been a problem ever since the appearance of telecommunication and electronic computers. In the past 30 years, researchers have developed various tools to protect the confidentiality, integrity, and availability of data and programs. However, new challenges are emerging as the amount of data grows rapidly in the big data era. On one hand, attacks are becoming stealthier by concealing their behaviors in massive datasets. On the other hand, it is becoming more and more difficult for existing tools to handle massive datasets with various data types. This thesis presents attempts to address these challenges and solve different security problems by mining security risks from massive datasets. The attempts are in three aspects: detecting security risks in the enterprise environment, prioritizing security risks of mobile apps, and measuring the impact of security risks between websites and mobile apps. First, the thesis presents a framework to detect data leakage in very large content. The framework can be deployed on the cloud for enterprises and preserves the privacy of sensitive data. Second, the thesis prioritizes the inter-app communication risks in large-scale Android apps by designing a new distributed inter-app communication linking algorithm and performing nearest-neighbor risk analysis. Third, the thesis measures the impact of deep link hijacking risk, one type of inter-app communication risk, on 1 million websites and 160 thousand mobile apps. The measurement reveals the failure of Google's attempts to improve the security of deep links. / Ph. D. / Cyber security risk has been a problem ever since the appearance of telecommunication and electronic computers. In the past 30 years, researchers have developed various tools to prevent sensitive data from being accessed by unauthorized users, protect programs and data from being changed by attackers, and make sure programs and data are available whenever needed. However, new challenges are emerging as the amount of data grows rapidly in the big data era. On one hand, attacks are becoming stealthier by concealing their attack behaviors in massive datasets. On the other hand, it is becoming more and more difficult for existing tools to handle massive datasets with various data types. This thesis presents attempts to address these challenges and solve different security problems by mining security risks from massive datasets. The attempts are in three aspects: detecting security risks in the enterprise environment where massive datasets are involved, prioritizing security risks of mobile apps so that high-risk apps are analyzed first, and measuring the impact of security risks within the communication between websites and mobile apps. First, the thesis presents a framework to detect sensitive data leakage in the enterprise environment from very large content. The framework can be deployed on the cloud for enterprises while preventing the sensitive data from being accessed by the semi-honest cloud. Second, the thesis prioritizes the inter-app communication risks in large-scale Android apps by designing a new distributed inter-app communication linking algorithm and performing nearest-neighbor risk analysis. The algorithm runs on a cluster to speed up the computation. The analysis leverages each app's communication context with all the other apps to prioritize the inter-app communication risks.
Third, the thesis measures the impact of mobile deep link hijacking risk on 1 million websites and 160 thousand mobile apps. Mobile deep link hijacking happens when a user clicks a link that is supposed to be opened by one app but is instead hijacked by another, malicious app. Mobile deep link hijacking is one type of inter-app communication risk between the mobile browser and apps. The measurement reveals the failure of Google's attempts to improve the security of mobile deep links.
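A hedged sketch of one way deep link hijacking risk can be surfaced, assuming intent-filter data has already been extracted per app: any URI pattern claimed by more than one app is flagged. The app names and data structures are invented for illustration and do not reproduce the thesis's measurement pipeline.

```python
from collections import defaultdict

# Hypothetical extracted intent filters: app -> (scheme, host) patterns it claims.
app_filters = {
    "com.example.shop":    [("https", "shop.example.com")],
    "com.malicious.clone": [("https", "shop.example.com"), ("myapp", "pay")],
    "com.example.pay":     [("myapp", "pay")],
}

claims = defaultdict(set)
for app, filters in app_filters.items():
    for scheme, host in filters:
        claims[(scheme, host)].add(app)

# A link pattern claimed by more than one app can be hijacked: the OS may
# route the click to an unintended handler.
risky = {link: apps for link, apps in claims.items() if len(apps) > 1}
print(risky)
```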
238

The impact of big data analytics on firms’ high value business performance

Popovic, A., Hackney, R., Tassabehji, Rana, Castelli, M. 2016 October 1928 (has links)
Yes / Big Data Analytics (BDA) is an emerging phenomenon with the reported potential to transform how firms manage and enhance high value business performance. The purpose of our study is to investigate the impact of BDA on operations management in the manufacturing sector, an acknowledged but infrequently researched context. Using an interpretive qualitative approach, this empirical study leverages a comparative case study of three manufacturing companies with varying levels of BDA usage (experimental, moderate and heavy). The information technology (IT) business value literature and a resource-based view informed the development of our research propositions and the conceptual framework that illuminated the relationships between BDA capability and organizational readiness and design. Our findings indicate that BDA capability (in terms of data sourcing, access, integration, and delivery, analytical capabilities, and people's expertise) along with organizational readiness and design factors (such as BDA strategy, top management support, financial resources, and employee engagement) facilitated better utilization of BDA in manufacturing decision making, and thus enhanced high value business performance. Our results also highlight important managerial implications related to the impact of BDA on the empowerment of employees, and how BDA can be integrated into organizations to augment rather than replace management capabilities. Our research will be of benefit to academics and practitioners in further aiding our understanding of BDA utilization in transforming operations and production management. It adds to the body of limited empirically based knowledge by highlighting the real business value resulting from applying BDA in manufacturing firms, thus encouraging beneficial economic and societal changes.
239

Driving Innovation through Big Open Linked Data (BOLD): Exploring Antecedents using Interpretive Structural Modelling

Dwivedi, Y.K., Janssen, M., Slade, E.L., Rana, Nripendra P., Weerakkody, Vishanth J.P., Millard, J., Hidders, J., Snijders, D. 07 2016 (has links)
Yes / Innovation is vital to find new solutions to problems, increase quality, and improve profitability. Big open linked data (BOLD) is a fledgling and rapidly evolving field that creates new opportunities for innovation. However, none of the existing literature has yet considered the interrelationships between antecedents of innovation through BOLD. This research contributes to knowledge building by utilising interpretive structural modelling to organise nineteen factors, identified by experts in the field, that are linked to innovation through BOLD. The findings show that almost all the variables fall within the linkage cluster, thus having high driving and dependence powers, demonstrating the volatility of the process. It was also found that technical infrastructure, data quality, and external pressure form the fundamental foundations for innovation through BOLD. Deriving a framework to encourage and manage innovation through BOLD offers important theoretical and practical contributions.
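A brief sketch of the interpretive structural modelling step that yields driving and dependence powers, assuming an expert-supplied initial reachability matrix; the 4x4 matrix below is purely illustrative, not the study's nineteen BOLD factors.

```python
import numpy as np

# Hypothetical initial reachability matrix: entry (i, j) = 1 if factor i
# is judged to influence factor j (diagonal set to 1 by convention).
A = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
])

# Final reachability matrix: boolean transitive closure (Warshall's algorithm).
R = A.astype(bool)
n = len(R)
for k in range(n):
    R = R | (R[:, [k]] & R[[k], :])

driving = R.sum(axis=1)      # how many factors each factor reaches
dependence = R.sum(axis=0)   # by how many factors each factor is reached

for i, (d, p) in enumerate(zip(driving, dependence)):
    print(f"factor {i}: driving={d}, dependence={p}")
```

Factors with both high driving and high dependence power fall into the linkage cluster, which is where the study reports almost all of its variables landed.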
240

Decision analysis in Turkey

Gonul, M.S., Soyer, E., Onkal, Dilek 05 1900 (has links)
No
