221

High performance shared state schedulers

Kouzoupis, Antonios January 2016 (has links)
Large organizations and research institutes store a huge volume of data nowadays. In order to gain any valuable insights, distributed processing frameworks over a cluster of computers are needed. Apache Hadoop is the prominent framework for distributed storage and data processing. At SICS Swedish ICT we are building Hops, a new distribution of Apache Hadoop relying on a distributed, highly available MySQL Cluster NDB to improve performance. Hops-YARN is the resource management framework of Hops, which introduces distributed resource management, load-balancing the tracking of resources in a cluster. In Hops-YARN we make heavy use of the back-end database, storing all the resource manager metadata and incoming RPCs to provide high fault tolerance and very short recovery time. This project aims at optimizing the mechanisms used for persisting metadata in NDB, both in terms of transactional commit time and in terms of pre-processing the metadata. Under no condition should the in-memory RM state diverge from the state stored in NDB. With these goals in mind, several solutions were examined that improved the performance of the system, making Hops-YARN comparable to Apache YARN with the extra benefits of high fault tolerance and short recovery time. The solutions proposed in this thesis project improve the pure commit time of a transaction to the MySQL Cluster and the pre-processing and parallelism of our Transaction Manager. The results indicate that the performance of Hops increased dramatically, utilizing more resources on a cluster with thousands of machines. Increasing cluster utilization by a few percentage points can save organizations a large amount of money. / Nowadays large organizations and research institutes store enormous amounts of data. To extract any valuable information from these data, they must be processed by a cluster of computers, which requires a so-called distributed processing framework. Apache Hadoop is currently the most widely used framework for distributed storage and processing of data. This thesis work was carried out at SICS Swedish ICT, where we have built Hops, a new distribution of Apache Hadoop backed by a distributed MySQL Cluster NDB that offers high availability. Hops-YARN is the resource management framework of Hops, with distributed ResourceManagers that load-balance their ResourceTrackerService. In this thesis we use Hops-YARN in a way where the back-end database is used heavily to manage the ResourceManager's metadata and incoming RPC calls. Our configuration offers high fault tolerance and recovers very quickly from failures. Furthermore, the NDB cluster's Event API is used so that the ResourceManager can communicate with the distributed ResourceTrackers. This project aims to optimize the mechanisms used for persisting metadata in NDB, both in terms of transaction commit time and in terms of pre-processing the metadata, while guaranteeing the consistency of the RM state. The ResourceManager's in-memory state must under no circumstances diverge from the state stored in NDB. With these goals in mind, several solutions were examined that improve performance and thereby make Hops-YARN comparable to Apache YARN. The solutions proposed in this thesis improve the pure commit time of a transaction to the MySQL Cluster as well as the pre-processing and parallelism of our Transaction Manager. The results indicate that the performance of Hops increased dramatically, leading to more efficient use of the available resources in a cluster of a thousand machines. When cluster utilization improves by a few percentage points, organizations can save a lot of money.
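To make the commit-time point concrete, the sketch below contrasts committing each metadata update in its own transaction with aggregating a burst of updates into a single batched commit. It is a minimal illustration of the batching principle only: Python's built-in sqlite3 stands in for MySQL Cluster NDB/ClusterJ, and the container_state table and its columns are invented for the example rather than taken from Hops.

```python
import sqlite3
import time

# sqlite3 is only a local stand-in for the NDB-backed metadata store; the
# container_state schema below is invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE container_state (container_id TEXT PRIMARY KEY, node TEXT, memory_mb INTEGER)"
)

updates = [(f"container_{i}", f"node_{i % 50}", 1024) for i in range(10_000)]

# Naive persistence: one transaction (and one commit) per metadata update.
start = time.perf_counter()
for row in updates:
    conn.execute("INSERT OR REPLACE INTO container_state VALUES (?, ?, ?)", row)
    conn.commit()
per_update = time.perf_counter() - start

# Batched persistence: aggregate the updates produced by a burst of RPCs and
# commit them in a single transaction, reducing the commit overhead per event.
conn.execute("DELETE FROM container_state")
conn.commit()
start = time.perf_counter()
conn.executemany("INSERT OR REPLACE INTO container_state VALUES (?, ?, ?)", updates)
conn.commit()
batched = time.perf_counter() - start

print(f"per-update commits: {per_update:.3f}s, single batched commit: {batched:.3f}s")
```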
222

Big-Data Driven Optimization Methods with Applications to LTL Freight Routing

Tamvada, Srinivas January 2020 (has links)
We propose solution strategies for hard Mixed Integer Programming (MIP) problems, with a focus on distributed parallel MIP optimization. Although our proposals are inspired by the Less-than-truckload (LTL) freight routing problem, they are more generally applicable to hard MIPs from other domains. We start by developing an Integer Programming model for the LTL freight routing problem, and present a novel heuristic for solving the model in a reasonable amount of time on large LTL networks. Next, we identify some adaptations to MIP branching strategies that are useful for achieving improved scaling upon distribution when the LTL routing problem (or other hard MIPs) is solved using parallel MIP optimization. Recognizing that our model represents a pseudo-Boolean optimization (PBO) problem, we leverage solution techniques used by PBO solvers to develop a CPLEX-based look-ahead solver for LTL routing and other PBO problems. Our focus once again is on achieving improved scaling upon distribution. We also analyze a technique for implementing subtree parallelism during distributed MIP optimization. We believe that our proposals represent a significant step towards solving big-data driven optimization problems (such as the LTL routing problem) in a more efficient manner. / Thesis / Doctor of Philosophy (PhD) / Less-than-truckload (LTL) freight transportation is a vital part of Canada's economy, with revenues running into billions of dollars and a cascading impact on many other industries. LTL operators often have to deal with large volumes of shipments, unexpected changes in traffic conditions, and uncertainty in demand patterns. In an industry that already has low profit margins, it is therefore vitally important to make good routing decisions without expending a lot of time. The optimization of such LTL freight networks often results in complex big-data driven optimization problems. In addition to the challenge of finding optimal solutions for these problems, analysts often have to deal with the complexities of big-data driven inputs. In this thesis we develop several solution strategies for solving the LTL freight routing problem including an exact model, novel heuristics, and techniques for solving the problem efficiently on a cluster of computers. Although the techniques we develop are inspired by LTL routing, they are more generally applicable for solving big-data driven optimization problems from other domains. Experiments conducted over the years in consultation with industry experts indicate that our proposals can significantly improve solution quality and reduce time to solution. Furthermore, our proposals open up interesting avenues for future research.
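To give a flavor of the underlying model class, the sketch below builds a toy 0-1 (pseudo-Boolean) arc-selection MIP with the PuLP modelling library. The three-node network, costs, and supply values are invented for illustration; the thesis's model and heuristics target large real LTL networks and are not reproduced here.

```python
import pulp

# Toy network: route one unit of freight from terminal A to terminal C by
# selecting arcs. All numbers here are invented for illustration.
arcs = {"AB": ("A", "B", 4), "BC": ("B", "C", 3), "AC": ("A", "C", 9)}  # id -> (tail, head, cost)
nodes = ["A", "B", "C"]
supply = {"A": 1, "B": 0, "C": -1}  # +1 at the origin, -1 at the destination

prob = pulp.LpProblem("toy_ltl_routing", pulp.LpMinimize)
use = pulp.LpVariable.dicts("use", arcs.keys(), cat="Binary")  # 0-1 (pseudo-Boolean) arc choices

# Minimize total routing cost.
prob += pulp.lpSum(arcs[a][2] * use[a] for a in arcs)

# Flow conservation: chosen arcs out of a node minus chosen arcs into it match its supply.
for n in nodes:
    prob += (
        pulp.lpSum(use[a] for a in arcs if arcs[a][0] == n)
        - pulp.lpSum(use[a] for a in arcs if arcs[a][1] == n)
        == supply[n]
    )

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], [a for a in arcs if use[a].value() > 0.5])
```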
223

A drug repurposing study based on clinical big data for the protective role of vitamin D in olanzapine-induced dyslipidemia / 臨床ビッグデータに基づくオランザピン誘発脂質異常症に対するビタミンDの予防作用の解明

ZHOU, ZIJIAN 23 March 2023 (has links)
Kyoto University / New system, doctoral program / Doctor of Pharmaceutical Sciences / Degree No. Kō 24551 / Yakukahaku No. 168 / 新制||薬科||18 (University Library) / Department of Pharmaceutical Sciences, Graduate School of Pharmaceutical Sciences, Kyoto University / (Chief examiner) Professor Shuji Kaneko; Professor Hiroshi Takeshima; Professor Motonari Uesugi / Eligible under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Pharmaceutical Sciences / Kyoto University / DFAM
224

The security of big data in fog-enabled IoT applications including blockchain: a survey

Tariq, N., Asim, M., Al-Obeidat, F., Farooqi, M.Z., Baker, T., Hammoudeh, M., Ghafir, Ibrahim 24 January 2020 (has links)
Yes / The proliferation of inter-connected devices in critical industries, such as healthcare and the power grid, is changing the perception of what constitutes critical infrastructure. The rising interconnectedness of new critical industries is driven by the growing demand for seamless access to information as the world becomes more mobile and connected and as the Internet of Things (IoT) grows. Critical industries are essential to the foundation of today's society, and interruption of service in any of these sectors can reverberate through other sectors and even around the globe. In today's hyper-connected world, critical infrastructure is more vulnerable than ever to cyber threats, whether from state-sponsored actors, criminal groups or individuals. As the number of interconnected devices increases, the number of potential access points for hackers to disrupt critical infrastructure grows. This new attack surface emerges from fundamental changes in the technology systems of critical-infrastructure organizations. This paper aims to improve understanding of the challenges of securing future digital infrastructure while it is still evolving. After introducing the infrastructure generating big data, the functionality-based fog architecture is defined. In addition, a comprehensive review of security requirements in fog-enabled IoT systems is presented. Then, an in-depth analysis of fog computing security challenges and of big data privacy and trust concerns in relation to fog-enabled IoT is given. We also discuss blockchain as a key enabler to address many security-related issues in IoT and consider closely the complementary interrelationships between blockchain and fog computing. In this context, this work formalizes the task of securing big data and its scope, provides a taxonomy to categorize threats to fog-based IoT systems, presents a comprehensive comparison of state-of-the-art contributions in the field according to their security services, and recommends promising research directions for future investigations.
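The tamper-evidence property that motivates pairing blockchain with fog-collected big data can be illustrated with a minimal hash chain. The sketch below is a deliberately simplified stand-in, with no consensus protocol, no distribution, and invented gateway records; it is not an implementation from the survey.

```python
import hashlib
import json
import time

def block_hash(body: dict) -> str:
    # Hash a canonical JSON encoding of the block contents.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, records: list) -> None:
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = {"index": len(chain), "timestamp": time.time(),
            "records": records, "prev_hash": prev}
    chain.append({**body, "hash": block_hash(body)})

def verify(chain: list) -> bool:
    # Editing any earlier block invalidates its hash and every later prev_hash link.
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        if block["hash"] != block_hash(body):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
append_block(chain, [{"gateway": "fog-17", "reading": 23.4}])
append_block(chain, [{"gateway": "fog-04", "reading": 19.1}])
print(verify(chain))                       # True
chain[0]["records"][0]["reading"] = 99.9   # tamper with stored data
print(verify(chain))                       # False
```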
225

Sample Size Determination for Subsampling in the Analysis of Big Data, Multiplicative models for confidence intervals and Free-Knot changepoint models

Sheng Zhang (18468615) 11 June 2024 (has links)
We studied the relationship between subsample size and the accuracy of the resulting estimation in a big-data setup. We also proposed a novel approach to the construction of confidence intervals based on improved concentration inequalities. Lastly, we studied irregular change-point models using free-knot splines.
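As a generic illustration of the subsample-size/accuracy relationship (not the method developed in the thesis), the simulation below draws uniform random subsamples of increasing size from an invented population and tracks the error of the sample mean, which shrinks roughly like 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented "big" dataset: one numeric variable whose population mean we estimate.
N = 1_000_000
population = rng.normal(loc=5.0, scale=2.0, size=N)
true_mean = population.mean()

# For each subsample size, estimate the mean from a uniform random subsample
# and record the average absolute error over repeated draws.
for n in [100, 1_000, 10_000, 100_000]:
    errors = [
        abs(rng.choice(population, size=n, replace=False).mean() - true_mean)
        for _ in range(100)
    ]
    print(f"n={n:>7}: mean absolute error ~ {np.mean(errors):.4f}")
```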
226

Efficient computer experiment designs for Gaussian process surrogates

Cole, David Austin 28 June 2021 (has links)
Due to advancements in supercomputing and algorithms for finite element analysis, today's computer simulation models often contain complex calculations that can result in a wealth of knowledge. Gaussian processes (GPs) are highly desirable models for computer experiments for their predictive accuracy and uncertainty quantification. This dissertation addresses GP modeling when data abounds as well as GP adaptive design when simulator expense severely limits the amount of collected data. For data-rich problems, I introduce a localized sparse covariance GP that preserves the flexibility and predictive accuracy of a GP's predictive surface while saving computational time. This locally induced Gaussian process (LIGP) incorporates latent design points, inducing points, with a local Gaussian process built from a subset of the data. Various methods are introduced for the design of the inducing points. LIGP is then extended to adapt to stochastic data with replicates, estimating noise while relying upon the unique design locations for computation. I also address, through entropy-based adaptive design, the goal of identifying a contour when data collection resources are limited. Unlike existing methods, the entropy-based contour locator (ECL) adaptive design promotes exploration in the design space, performing well in higher dimensions and when the contour corresponds to a high/low quantile. ECL adaptive design can be combined with importance sampling to reduce uncertainty in reliability estimation. / Doctor of Philosophy / Due to advancements in supercomputing and physics-based algorithms, today's computer simulation models often contain complex calculations that can produce larger amounts of data than physical experiments can. Computer experiments conducted with simulation models are sought-after ways to gather knowledge about physical problems but come with design and modeling challenges. In this dissertation, I address both data-size extremes: building prediction models with large data sets and designing computer experiments when scarce resources limit the amount of data. For the former, I introduce a strategy of constructing a series of models including small subsets of observed data along with a set of unobserved data locations (inducing points). This methodology can also perform calculations using only the unique data locations when replicates exist in the data. The locally induced model produces accurate predictions while saving computing time. Various methods are introduced to decide the locations of these inducing points. The focus then shifts to designing an experiment for the purpose of accurate prediction around a particular output quantity of interest (contour). An experimental design approach is detailed that selects new sample locations one at a time through a criterion that maximizes the information gain in the contour region for the overall model. This work is combined with an existing method to estimate the true volume of the contour.
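In the spirit of the local-subset idea (though without LIGP's inducing points or replicate handling), the sketch below fits an ordinary GP surrogate with scikit-learn to the design points nearest a prediction site; the stand-in simulator and all settings are invented for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Cheap stand-in for an expensive simulator output over a 2-d input space.
def simulator(x):
    return np.sin(3 * x[:, 0]) * np.exp(-x[:, 1]) + 0.05 * rng.normal(size=len(x))

X = rng.uniform(0, 1, size=(2000, 2))   # a "large" computer experiment design
y = simulator(X)

# Local-subset idea, in spirit: condition only on the 100 design points nearest
# the prediction site instead of all 2000, keeping the GP solve cheap.
x_star = np.array([[0.4, 0.6]])
nearest = np.argsort(np.linalg.norm(X - x_star, axis=1))[:100]

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2) + WhiteKernel(1e-3),
                              normalize_y=True)
gp.fit(X[nearest], y[nearest])
mean, sd = gp.predict(x_star, return_std=True)
print(f"prediction {mean[0]:.3f} +/- {2 * sd[0]:.3f}")
```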
227

SensAnalysis: A Big Data Platform for Vibration-Sensor Data Analysis

Kumar, Abhinav 26 June 2019 (has links)
The Goodwin Hall building on the Virginia Tech campus is the most instrumented building for vibration monitoring. It houses 225 hard-wired accelerometers which record vibrations arising from internal as well as external activities. The recorded vibration data can be used to develop real-time applications for monitoring the health of the building or detecting human activity in the building. However, the lack of infrastructure to handle the massive scale of the data, and the steep learning curve of the tools required to store and process the data, are major deterrents to researchers performing their experiments. Additionally, researchers want to explore the data to determine the type of experiments they can perform. This work tries to solve these problems by providing a system to store and process the data using existing big data technologies. The system simplifies the process of big data analysis by supporting code re-usability and multiple programming languages. The effectiveness of the system was demonstrated by four case studies. Additionally, three visualizations were developed to help researchers in the initial data exploration. / Master of Science / The Goodwin Hall building on the Virginia Tech campus is an example of a ‘smart building.’ It uses sensors to record the response of the building to various internal and external activities. The recorded data can be used by algorithms to facilitate understanding of the properties of the building or to detect human activity. Accordingly, researchers in the Virginia Tech Smart Infrastructure Lab (VTSIL) run experiments using a part of the complete data. Ideally, they want to run their experiments continuously as new data is collected. However, the massive scale of the data makes it difficult to process new data as soon as it arrives, and to make it available immediately to the researchers. The technologies that can handle data at this scale have a steep learning curve. Starting to use them requires considerable time and effort. This project involved building a system to handle these challenges so that researchers can focus on their core area of research. The system provides visualizations depicting various properties of the data to help researchers explore that data before running an experiment. The effectiveness of this work was demonstrated using four case studies. These case studies used the actual experiments conducted by VTSIL researchers in the past. The first three case studies help in understanding the properties of the building whereas the final case study deals with detecting and locating human footsteps on one of the floors in real time.
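The kind of batch summarization such a platform enables can be sketched with Spark, as below. The CSV layout (sensor_id, timestamp, acceleration) and HDFS paths are hypothetical; the actual SensAnalysis schema and pipeline may differ.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("vibration-summary").getOrCreate()

# Hypothetical layout: CSV files with one row per sample
# (sensor_id, timestamp, acceleration). The real pipeline and schema may differ.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("hdfs:///data/goodwin_hall/*.csv"))

# Per-sensor, per-hour summary statistics: a typical first exploration step
# before running a full experiment on the raw accelerometer stream.
summary = (df
           .withColumn("hour", F.date_trunc("hour", F.col("timestamp").cast("timestamp")))
           .groupBy("sensor_id", "hour")
           .agg(F.count("*").alias("samples"),
                F.avg("acceleration").alias("mean_accel"),
                F.stddev("acceleration").alias("std_accel"),
                F.max(F.abs(F.col("acceleration"))).alias("peak_accel")))

summary.write.mode("overwrite").parquet("hdfs:///results/hourly_summary")
```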
228

Scalable and Productive Data Management for High-Performance Analytics

Youssef, Karim Yasser Mohamed Yousri 07 November 2023 (has links)
Advancements in data acquisition technologies across different domains, from genome sequencing to satellite and telescope imaging to large-scale physics simulations, are leading to an exponential growth in dataset sizes. Extracting knowledge from this wealth of data enables scientific discoveries at unprecedented scales. However, the sheer volume of the gathered datasets is a bottleneck for knowledge discovery. High-performance computing (HPC) provides a scalable infrastructure to extract knowledge from these massive datasets. However, multiple data management performance gaps exist between big data analytics software and HPC systems. These gaps arise from multiple factors, including the tradeoff between performance and programming productivity, data growth at a faster rate than memory capacity, and the high storage footprints of data analytics workflows. This dissertation bridges these gaps by combining productive data management interfaces with application-specific optimizations of data parallelism, memory operation, and storage management. First, we address the performance-productivity tradeoff by leveraging Spark and optimizing input data partitioning. Our solution optimizes programming productivity while achieving comparable performance to the Message Passing Interface (MPI) for scalable bioinformatics. Second, we address the operating system's kernel limitations for out-of-core data processing by autotuning memory management parameters in userspace. Finally, we address I/O and storage efficiency bottlenecks in data analytics workflows that iteratively and incrementally create and reuse persistent data structures such as graphs, data frames, and key-value datastores. / Doctor of Philosophy / Advancements in various fields, like genetics, satellite imaging, and physics simulations, are generating massive amounts of data. Analyzing this data can lead to groundbreaking scientific discoveries. However, the sheer size of these datasets presents a challenge. High-performance computing (HPC) offers a solution to process and understand this data efficiently. Still, several issues hinder the performance of big data analytics software on HPC systems. These problems include finding the right balance between performance and ease of programming, dealing with the challenges of handling massive amounts of data, and optimizing storage usage. This dissertation focuses on three areas to improve high-performance data analytics (HPDA). Firstly, it demonstrates how using Spark and optimized data partitioning can optimize programming productivity while achieving similar scalability as the Message Passing Interface (MPI) for scalable bioinformatics. Secondly, it addresses the limitations of the operating system's memory management for processing data that is too large to fit entirely in memory. Lastly, it tackles the efficiency issues related to input/output operations and storage when dealing with data structures like graphs, data frames, and key-value datastores in iterative and incremental workflows.
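As a sketch of the input-partitioning knob mentioned for the Spark-based bioinformatics work, the snippet below shows where partition counts are controlled when reading data; the paths and the partition multiplier are hypothetical, not values from the dissertation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-tuning").getOrCreate()
sc = spark.sparkContext

# Hypothetical input files on HDFS. By default the partition count follows the
# HDFS block layout; asking for more (smaller) partitions can improve load
# balance when per-record work is uneven, at the cost of scheduling overhead.
default_rdd = sc.textFile("hdfs:///data/reads/sample.fastq")
tuned_rdd = sc.textFile("hdfs:///data/reads/sample.fastq",
                        minPartitions=sc.defaultParallelism * 4)

print("default partitions:", default_rdd.getNumPartitions())
print("tuned partitions:  ", tuned_rdd.getNumPartitions())

# The equivalent knob for DataFrames is an explicit reshuffle after the read.
df = spark.read.text("hdfs:///data/reads/sample.fastq").repartition(sc.defaultParallelism * 4)
```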
229

Co-creating social licence for sharing health and care data

Fylan, F., Fylan, Beth 25 March 2021 (has links)
Yes / Optimising the use of patient data has the potential to produce a transformational change in healthcare planning, treatment, condition prevention and understanding disease progression. Establishing how people's trust could be secured and a social licence to share data could be achieved is of paramount importance. The study took place across Yorkshire and the Humber, in the North of England, using a sequential mixed methods approach comprising focus groups, surveys and co-design groups. Twelve focus groups explored people's response to how their health and social care data is, could, and should be used. A survey examined who should be able to see health and care records, acceptable uses of anonymous health and care records, and trust in different organisations. Case study cards addressed willingness for data to be used for different purposes. Co-creation workshops produced a set of guidelines for how data should be used. Focus group participants (n = 80) supported sharing health and care data for direct care and were surprised that this is not already happening. They discussed concerns about the currency and accuracy of their records and possible stigma associated with certain diagnoses, such as mental health conditions. They were less supportive of social care access to their records. They discussed three main concerns about their data being used for research or service planning: being identified; security limitations; and the potential rationing of care on the basis of information in their record such as their lifestyle choices. Survey respondents (n = 1031) agreed that their GP (98 %) and hospital doctors and nurses (93 %) should be able to see their health and care records. There was more limited support for pharmacists (37 %), care staff (36 %), social workers (24 %) and researchers (24 %). Respondents thought their health and social care records should be used to help plan services (88 %), to help people stay healthy (67 %), to help find cures for diseases (67 %), for research for the public good (58 %), but only 16 % for commercial research. Co-creation groups developed a set of principles for a social licence for data sharing based around good governance, effective processes, the type of organisation, and the ability to opt in and out. People support their data being shared for a range of purposes and co-designed a set of principles that would secure their trust and consent to data sharing. / This work was supported by Humber Teaching NHS Foundation Trust and the National Institute for Health Research (NIHR) Yorkshire and Humber Patient Safety Translational Research Centre (NIHR Yorkshire and Humber PSTRC).
230

Surveillance Technology and the Neoliberal State: Expanding the Power to Criminalize in a Data-Unlimited World

Hurley, Emily Elizabeth 23 June 2017 (has links)
For the past several decades, the neoliberal school of economics has dominated public policy, encouraging American politicians to reduce the size of the government. Despite this trend, the power of the state to surveille, criminalize, and detain has become more extensive, even as the state appears to be growing less powerful. By allowing information technology corporations such as Google to collect location data from users with or without their knowledge, the state can tap into a vast surveillance network at any time, retroactively surveilling and criminalizing at its discretion. Furthermore, neoliberal political theory has eroded the classical liberal conception of freedom so that these surveillance tactics do not appear to restrict individuals' freedom or privacy so long as they give their consent to be surveilled by a private corporation. Neoliberalism also encourages the proliferation of information technologies by making individuals responsible for their economic success and wellbeing in an increasingly competitive world, thus pushing more individuals to use information technologies to enter into the gig economy. The individuating logic of neoliberalism, combined with the rapid economic potentialities of information technology, turns individuals into mere sources of human capital. Even though the American state's commitment to neoliberalism precludes it from covertly managing the labor economy, it can still manage a population through criminalization and incarceration. Access to users' data by way of information technology makes the process of criminalization more manageable and allows the state to more easily incarcerate indiscriminately. / Master of Arts / Since the era of President Reagan, the American economic and political tradition has been committed to opening trade, limiting government regulation, and reducing public benefits in the interest of expanding freedom from the government. Despite this commitment to shrinking the size of the government, the government is still considered responsible for public security, including both national security and criminalization. At the same time as this wave of deregulation, information technology companies such as Google have expanded their ability to collect and store individual users' data, which the government can access when it deems such access necessary. The deregulation of private markets has ushered in an era of extreme labor competition, which pushes many people to use information technology, such as computers and cell phones, to market their labor and make extra money. However, whenever a person is connected to GPS, Wi-Fi, or uses data on their phone, their location information is being stored and the government has access to this information. Neoliberalism therefore encourages people to use technology that allows them to be watched by the government. Location information is one of the main factors of criminalization; historically, a person's location informs the police's decision to arrest them or not. Enforcing laws against vagrancy, homelessness, prostitution, etc. requires law enforcement agencies to know where someone is, which becomes much easier when everyone is connected to their location data by their cell phone. This gives the state a huge amount of power to find and criminalize whoever it wants.
