221 |
Big Data - Stort intresse, nya möjligheter
Hellström, Hampus; Ohm, Oscar (January 2014)
Dagens informationssamhälle har bidragit till att människor, maskiner och företag genererar och lagrar stora mängder data. Hanteringen och bearbetningen av de stora datamängderna har fått samlingsnamnet Big Data. De stora datamängderna ökar bland annat möjligheterna att bedriva kunskapsbaserad verksamhetsutveckling. Med traditionella metoder för insamling och analys av data har kunskapsbaserad verksamhetsutveckling tillämpats genom att skicka ut resurskrävande marknadsundersökningar och kartläggningar, ofta genomförda av specialiserade undersökningsföretag. Efterhand som analyser av samhällets befintliga datamängder blir allt värdefullare har undersökningsföretagen därmed en stor utvecklingsmöjlighet att vaska guld ifrån samhällets enorma datamängder. Studien är genomförd som en explorativ fallstudie som undersöker hur svenska undersökningsföretag arbetar med Big Data och identifierar några av de utmaningar de står inför vid tillämpningen av Big Data-analyser i verksamheten. Resultatet visar att de deltagande undersökningsföretagen använder Big Data som verktyg för att effektivisera befintliga processer och i viss mån komplettera traditionella undersökningar. Trots att man ser möjligheter med tekniken arbetar man passivt med utvecklingen av nya processer som är tänkta att stödjas av Big Data-analyser. Det finns också en utmaning i den bristande kompetens som råder på marknaden. Resultatet behandlar även en etisk aspekt som undersökningsföretagen måste ta hänsyn till; speciellt problematisk är den när data behandlas och analyseras i realtid och kan kopplas till en individ. / Today's information society consists of people, businesses and machines that together generate large amounts of data every day. This exponential growth of data generation has led to the creation of what we call Big Data. Among other things, the data produced, gathered and stored can be used by companies to practise knowledge-based business development. Traditionally, the methods used for generating knowledge about a business environment and market have been time-consuming and expensive, and often conducted by a specialized research company that carries out market research and surveys. Today the analysis of existing data sets is becoming increasingly valuable, and the research companies have a great opportunity to mine value from society's huge amounts of data. The study is designed as an exploratory case study that investigates how research companies in Sweden work with these data sets, and identifies some of the challenges they face in the application of Big Data analysis in their business. The results show that the participating research companies are using Big Data tools to streamline existing business processes and, to some extent, as a complement to traditional research and surveys. Although they see possibilities with the technology, the participating companies are unwilling to drive the development of new business processes that are supported by Big Data analysis. There is also a challenge identified in the lack of competence prevailing in the Swedish market. The result also covers some of the ethical aspects research companies need to take into consideration. The ethical issues are especially problematic when data that can be linked to an individual is processed and analysed in real time.
|
222 |
High performance shared state schedulers
Kouzoupis, Antonios (January 2016)
Large organizations and research institutes store a huge volume of data nowadays. In order to gain any valuable insights, distributed processing frameworks over a cluster of computers are needed. Apache Hadoop is the prominent framework for distributed storage and data processing. At SICS Swedish ICT we are building Hops, a new distribution of Apache Hadoop relying on a distributed, highly available MySQL Cluster NDB to improve performance. Hops-YARN is the resource management framework of Hops, which introduces distributed resource management, load-balancing the tracking of resources in a cluster. In Hops-YARN we make heavy use of the back-end database, storing all the resource manager metadata and incoming RPCs to provide high fault tolerance and very short recovery time. This project aims at optimizing the mechanisms used for persisting metadata in NDB, both in terms of transactional commit time and in terms of pre-processing the metadata. Under no condition should the in-memory RM state diverge from the state stored in NDB. With these goals in mind, several solutions were examined that improved the performance of the system, making Hops-YARN comparable to Apache YARN with the extra benefits of high fault tolerance and short recovery time. The solutions proposed in this thesis project improve the pure commit time of a transaction to the MySQL Cluster as well as the pre-processing and parallelism of our Transaction Manager. The results indicate that the performance of Hops increased dramatically, utilizing more resources on a cluster with thousands of machines. Increasing cluster utilization by a few percent can save organizations a significant amount of money. / Nu för tiden lagrar stora organisationer och forskningsinstitutioner enorma mängder data. För att kunna utvinna någon värdefull information från dessa data behöver den bearbetas av ett kluster av datorer. När flera datorer gemensamt ska bearbeta data behöver de utgå från ett så kallat "distributed processing framework". I dagsläget är Apache Hadoop det mest använda ramverket för distribuerad lagring och behandling av data. Detta examensarbete har genomförts vid SICS Swedish ICT där vi byggt Hops, en ny distribution av Apache Hadoop som drivs av ett distribuerat MySQL Cluster NDB som erbjuder hög tillgänglighet. Hops-YARN är Hops ramverk för resurshantering med distribuerade ResourceManagers som lastbalanserar deras ResourceTrackerService. I detta examensarbete använder vi Hops-YARN på ett sätt där "back-end"-databasen flitigt används för att hantera ResourceManagerns metadata och inkommande RPC-anrop. Vår konfiguration erbjuder en hög feltolerans och återställer sig mycket snabbt vid felberäkningar. Vidare används NDB-klustrets Event API för att ResourceManagern ska kunna kommunicera med de distribuerade ResourceTrackers. Detta projekt syftar till att optimera de mekanismer som används för att persistera metadata i NDB, både vad gäller transaktionernas commit-tid och förbehandlingen av dem, samtidigt som enhetligheten i RM:s tillstånd garanteras. ResourceManagerns tillstånd i RAM-minnet får under inga omständigheter avvika från det tillstånd som finns lagrat i NDB:n.
Med dessa mål i åtanke undersöktes flera lösningar som förbättrar prestandan och därmed gör Hops-YARN jämförbart med Apache YARN. De lösningar som föreslås i denna uppsats förbättrar "pure commit time" när en transaktion görs i ett MySQL Cluster samt förbehandlingen och parallelismen i vår Transaction Manager. Resultaten tyder på att Hops prestanda ökade dramatiskt, vilket ledde till ett effektivare nyttjande av tillgängliga resurser i ett kluster bestående av ett tusental datorer. När nyttjandet av tillgängliga resurser i ett kluster förbättras med några få procent kan organisationer spara mycket pengar.
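To make the batching idea behind these commit-time optimizations concrete, the following minimal Python sketch queues state updates parsed from incoming RPCs and persists them in a single transaction before the in-memory view is advanced. It is only an illustration of the general pattern: SQLite stands in for MySQL Cluster NDB, and the class, table, and column names are hypothetical rather than taken from the Hops code base.

# Illustrative sketch only: batching resource-manager state updates into a single
# transactional commit, in the spirit of the commit-time/pre-processing optimizations
# described above. SQLite stands in for MySQL Cluster NDB; names are hypothetical.
import sqlite3
from collections import deque

class TransactionManager:
    def __init__(self, conn):
        self.conn = conn
        self.pending = deque()          # updates parsed from incoming RPCs
        self.in_memory = {}             # authoritative RM state held in RAM

    def enqueue(self, node_id, used_mb):
        self.pending.append((node_id, used_mb))

    def commit_batch(self):
        """Persist all queued updates in one transaction, then apply them in memory."""
        batch = list(self.pending)
        if not batch:
            return 0
        with self.conn:                 # a single transaction for the whole batch
            self.conn.executemany(
                "INSERT OR REPLACE INTO node_state(node_id, used_mb) VALUES (?, ?)",
                batch)
        # Only after a successful commit does the in-memory view advance,
        # so the RAM state never diverges from what is stored in the database.
        for node_id, used_mb in batch:
            self.in_memory[node_id] = used_mb
        self.pending.clear()
        return len(batch)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_state (node_id TEXT PRIMARY KEY, used_mb INTEGER)")
tm = TransactionManager(conn)
for i in range(1000):
    tm.enqueue(f"node{i % 10}", i)      # many RPCs, few distinct nodes
print("updates committed in one transaction:", tm.commit_batch())

Committing many small updates as one batch amortizes the per-transaction overhead, which is the same trade-off the abstract describes for the Transaction Manager.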
|
223 |
Big-Data Driven Optimization Methods with Applications to LTL Freight Routing
Tamvada, Srinivas (January 2020)
We propose solution strategies for hard Mixed Integer Programming (MIP) problems,
with a focus on distributed parallel MIP optimization. Although our proposals are
inspired by the Less-than-truckload (LTL) freight routing problem, they are more
generally applicable to hard MIPs from other domains. We start by developing an Integer
Programming model for the LTL freight routing problem,
and present a novel heuristic for solving the model in a reasonable amount of time
on large LTL networks. Next, we identify some adaptations to MIP branching strategies
that are useful for achieving improved scaling upon distribution when the LTL
routing problem (or other hard MIPs) are solved using parallel MIP optimization.
Recognizing that our model represents a pseudo-Boolean optimization problem
(PBO), we leverage solution techniques used by PBO solvers to develop a CPLEX
based look-ahead solver for LTL routing and other PBO problems. Our focus once
again is on achieving improved scaling upon distribution. We also analyze a technique
for implementing subtree parallelism during distributed MIP optimization. We
believe that our proposals represent a significant step towards solving big-data driven
optimization problems (such as the LTL routing problem) in a more efficient manner. / Thesis / Doctor of Philosophy (PhD) / Less-than-truckload (LTL) freight transportation is a vital part of Canada's economy,
with revenues running into billions of dollars and a cascading impact on many
other industries. LTL operators often have to deal with large volumes of shipments,
unexpected changes in traffic conditions, and uncertainty in demand patterns. In an
industry that already has low profit margins, it is therefore vitally important to make
good routing decisions without expending a lot of time.
The optimization of such LTL freight networks often results in complex big-data
driven optimization problems. In addition to the challenge of finding optimal solutions
for these problems, analysts often have to deal with the complexities of big-data driven
inputs. In this thesis we develop several solution strategies for solving the LTL freight
routing problem including an exact model, novel heuristics, and techniques for solving
the problem efficiently on a cluster of computers.
Although the techniques we develop are inspired by LTL routing, they are more
generally applicable for solving big-data driven optimization problems from other
domains. Experiments conducted over the years in consultation with industry experts
indicate that our proposals can significantly improve solution quality and reduce
time to solution. Furthermore, our proposals open up interesting avenues for future
research.
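As a rough illustration of how an LTL-style routing decision can be written as an integer program, the sketch below routes a single shipment over a made-up four-node hub network using binary arc variables and flow-conservation constraints. It assumes the open-source PuLP modelling library and is not the formulation or heuristic developed in the thesis; the network, costs, and variable names are invented for the example.

# A minimal illustration (not the thesis model): routing a single LTL shipment from
# an origin to a destination over a made-up hub network, as a binary arc-flow IP.
# Assumes the PuLP library; nodes, arcs and costs below are invented for the example.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, LpStatus

nodes = ["ORIG", "HUB1", "HUB2", "DEST"]
arcs = {  # arc: per-shipment cost
    ("ORIG", "HUB1"): 4, ("ORIG", "HUB2"): 6,
    ("HUB1", "HUB2"): 1, ("HUB1", "DEST"): 7,
    ("HUB2", "DEST"): 3,
}
origin, dest = "ORIG", "DEST"

prob = LpProblem("ltl_routing_sketch", LpMinimize)
x = {a: LpVariable(f"x_{a[0]}_{a[1]}", cat=LpBinary) for a in arcs}

# Objective: total routing cost over the chosen arcs.
prob += lpSum(cost * x[a] for a, cost in arcs.items())

# Flow conservation: one unit leaves the origin, one unit enters the destination,
# and flow is balanced at intermediate hubs.
for n in nodes:
    outflow = lpSum(x[a] for a in arcs if a[0] == n)
    inflow = lpSum(x[a] for a in arcs if a[1] == n)
    rhs = 1 if n == origin else (-1 if n == dest else 0)
    prob += outflow - inflow == rhs

prob.solve()
print(LpStatus[prob.status],
      [a for a in arcs if x[a].value() == 1])  # expected route: ORIG->HUB1->HUB2->DEST

Real LTL instances add capacities, consolidation, and many commodities, which is what makes the resulting MIPs hard and motivates the heuristics and parallel branching strategies described above.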
|
224 |
A drug repurposing study based on clinical big data for the protective role of vitamin D in olanzapine-induced dyslipidemia / 臨床ビッグデータに基づくオランザピン誘発脂質異常症に対するビタミンDの予防作用の解明
ZHOU, ZIJIAN (23 March 2023)
京都大学 / 新制・課程博士 / 博士(薬科学) / 甲第24551号 / 薬科博第168号 / 新制||薬科||18(附属図書館) / 京都大学大学院薬学研究科薬科学専攻 / (主査)教授 金子 周司, 教授 竹島 浩, 教授 上杉 志成 / 学位規則第4条第1項該当 / Doctor of Pharmaceutical Sciences / Kyoto University / DFAM
|
225 |
The security of big data in fog-enabled IoT applications including blockchain: a survey
Tariq, N.; Asim, M.; Al-Obeidat, F.; Farooqi, M.Z.; Baker, T.; Hammoudeh, M.; Ghafir, Ibrahim (24 January 2020)
The proliferation of inter-connected devices in critical industries, such as healthcare and power
grid, is changing the perception of what constitutes critical infrastructure. The rising interconnectedness
of new critical industries is driven by the growing demand for seamless access to information as the
world becomes more mobile and connected and as the Internet of Things (IoT) grows. Critical industries
are essential to the foundation of today’s society, and interruption of service in any of these sectors can
reverberate through other sectors and even around the globe. In today’s hyper-connected world, the
critical infrastructure is more vulnerable than ever to cyber threats, whether from state-sponsored actors, criminal
groups or individuals. As the number of interconnected devices increases, the number of potential
access points for hackers to disrupt critical infrastructure grows. This new attack surface emerges from
fundamental changes in the critical infrastructure of organizations' technology systems. This paper aims
to improve understanding of the challenges of securing future digital infrastructure while it is still evolving.
After introducing the infrastructure generating big data, the functionality-based fog architecture is
defined. In addition, a comprehensive review of security requirements in fog-enabled IoT systems is
presented. Then, an in-depth analysis of the fog computing security challenges and big data privacy and
trust concerns in relation to fog-enabled IoT is given. We also discuss blockchain as a key enabler to
address many security related issues in IoT and consider closely the complementary interrelationships
between blockchain and fog computing. In this context, this work formalizes the task of securing big
data and its scope, provides a taxonomy to categorise threats to fog-based IoT systems, presents a
comprehensive comparison of state-of-the-art contributions in the field according to their security service
and recommends promising research directions for future investigations.
|
226 |
Sample Size Determination for Subsampling in the Analysis of Big Data, Multiplicative Models for Confidence Intervals, and Free-Knot Changepoint Models
Sheng Zhang (11 June 2024)
We studied the relationship between subsample size and the accuracy of the resulting estimation in a big data setup. We also proposed a novel approach to the construction of confidence intervals based on improved concentration inequalities. Lastly, we studied irregular change-point models using free-knot splines.
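The link between subsample size and estimation accuracy can be illustrated with a classical concentration inequality. The sketch below uses the standard Hoeffding bound for data known to lie in a fixed interval to choose a subsample size for a target confidence-interval half-width; the improved inequalities proposed in the thesis are not reproduced here, and the data set and bounds are synthetic.

# A simple illustration of how a concentration inequality links subsample size to
# estimation accuracy. This uses the classical Hoeffding bound for data known to lie
# in [lo, hi]; the sharper inequalities developed in the thesis are not shown.
import math
import random

def hoeffding_halfwidth(n, lo, hi, alpha=0.05):
    """Half-width t such that P(|mean_n - mu| >= t) <= alpha for [lo, hi]-bounded data."""
    return (hi - lo) * math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

def required_subsample_size(target_halfwidth, lo, hi, alpha=0.05):
    """Smallest n guaranteeing the Hoeffding half-width is at most target_halfwidth."""
    return math.ceil((hi - lo) ** 2 * math.log(2.0 / alpha) / (2.0 * target_halfwidth ** 2))

# Synthetic "big" data set (bounded in [0, 1]) and a uniform random subsample of it.
random.seed(0)
population = [random.random() ** 2 for _ in range(1_000_000)]
n = required_subsample_size(0.01, 0.0, 1.0)          # n for a +/-0.01 guarantee
subsample = random.sample(population, n)
est = sum(subsample) / n
t = hoeffding_halfwidth(n, 0.0, 1.0)
print(f"n = {n}, estimate = {est:.4f}, 95% CI = [{est - t:.4f}, {est + t:.4f}]")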
|
227 |
Efficient computer experiment designs for Gaussian process surrogates
Cole, David Austin (28 June 2021)
Due to advancements in supercomputing and algorithms for finite element analysis, today's computer simulation models often contain complex calculations that can result in a wealth of knowledge. Gaussian processes (GPs) are highly desirable models for computer experiments for their predictive accuracy and uncertainty quantification. This dissertation addresses GP modeling when data abounds as well as GP adaptive design when simulator expense severely limits the amount of collected data. For data-rich problems, I introduce a localized sparse covariance GP that preserves the flexibility and predictive accuracy of a GP's predictive surface while saving computational time. This locally induced Gaussian process (LIGP) incorporates latent design points, inducing points, with a local Gaussian process built from a subset of the data. Various methods are introduced for the design of the inducing points. LIGP is then extended to adapt to stochastic data with replicates, estimating noise while relying upon the unique design locations for computation. I also address, through entropy-based adaptive design, the goal of identifying a contour when data collection resources are limited. Unlike existing methods, the entropy-based contour locator (ECL) adaptive design promotes exploration in the design space, performing well in higher dimensions and when the contour corresponds to a high/low quantile. ECL adaptive design can be combined with importance sampling to reduce uncertainty in reliability estimation. / Doctor of Philosophy / Due to advancements in supercomputing and physics-based algorithms, today's computer simulation models often contain complex calculations that can produce larger amounts of data than through physical experiments. Computer experiments conducted with simulation models are sought-after ways to gather knowledge about physical problems but come with design and modeling challenges. In this dissertation, I address both data size extremes - building prediction models with large data sets and designing computer experiments when scarce resources limit the amount of data. For the former, I introduce a strategy of constructing a series of models including small subsets of observed data along with a set of unobserved data locations (inducing points). This methodology also contains the ability to perform calculations with only unique data locations when replicates exist in the data. The locally induced model produces accurate predictions while saving computing time. Various methods are introduced to decide the locations of these inducing points. The focus then shifts to designing an experiment for the purpose of accurate prediction around a particular output quantity of interest (contour). An experimental design approach is detailed that selects new sample locations one-at-a-time through a function to maximize the amount of information gain in the contour region for the overall model. This work is combined with an existing method to estimate the true volume of the contour.
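For readers unfamiliar with the underlying machinery, the sketch below shows the plain Gaussian-process prediction that approaches such as LIGP are designed to accelerate: the full covariance matrix is factorized and used for the posterior mean and variance. The kernel, hyperparameters, and data are fixed by hand for illustration; none of the local-subset or inducing-point structure from the dissertation is included.

# A compact sketch of the Gaussian-process prediction that methods like LIGP accelerate.
# Plain GP regression with a fixed squared-exponential kernel on a small synthetic data
# set; hyperparameters are set by hand and no inducing-point machinery is included.
import numpy as np

def sq_exp_kernel(A, B, lengthscale=0.2, variance=1.0):
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return variance * np.exp(-0.5 * d2.sum(axis=-1) / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))                 # training inputs
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(40)
Xstar = np.linspace(0, 1, 5)[:, None]               # prediction locations

K = sq_exp_kernel(X, X) + 1e-4 * np.eye(len(X))     # nugget for noise/numerical stability
Ks = sq_exp_kernel(Xstar, X)
Kss = sq_exp_kernel(Xstar, Xstar)

L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y)) # K^{-1} y via Cholesky
mean = Ks @ alpha                                   # posterior predictive mean
v = np.linalg.solve(L, Ks.T)
var = np.diag(Kss - v.T @ v)                        # posterior predictive variance
for xs, m, s2 in zip(Xstar[:, 0], mean, var):
    print(f"x*={xs:.2f}  mean={m: .3f}  sd={np.sqrt(s2):.3f}")

LIGP avoids forming and factorizing the full covariance matrix by building a small local model, augmented with inducing points, around each prediction location, which is where the computational savings described in the abstract come from.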
|
228 |
SensAnalysis: A Big Data Platform for Vibration-Sensor Data Analysis
Kumar, Abhinav (26 June 2019)
The Goodwin Hall building on the Virginia Tech campus is the most instrumented building for vibration monitoring. It houses 225 hard-wired accelerometers which record vibrations arising due to internal as well as external activities. The recorded vibration data can be used to develop real-time applications for monitoring the health of the building or detecting human activity in the building. However, the lack of infrastructure to handle the massive scale of the data, and the steep learning curve of the tools required to store and process the data, are major deterrents to researchers performing their experiments. Additionally, researchers want to explore the data to determine the type of experiments they can perform. This work tries to solve these problems by providing a system to store and process the data using existing big data technologies. The system simplifies the process of big data analysis by supporting code re-usability and multiple programming languages. The effectiveness of the system was demonstrated by four case studies. Additionally, three visualizations were developed to help researchers in the initial data exploration. / Master of Science / The Goodwin Hall building on the Virginia Tech campus is an example of a ‘smart building.’ It uses sensors to record the response of the building to various internal and external activities. The recorded data can be used by algorithms to facilitate understanding of the properties of the building or to detect human activity. Accordingly, researchers in the Virginia Tech Smart Infrastructure Lab (VTSIL) run experiments using a part of the complete data. Ideally, they want to run their experiments continuously as new data is collected. However, the massive scale of the data makes it difficult to process new data as soon as it arrives, and to make it available immediately to the researchers. The technologies that can handle data at this scale have a steep learning curve. Starting to use them requires much time and effort. This project involved building a system to handle these challenges so that researchers can focus on their core area of research. The system provides visualizations depicting various properties of the data to help researchers explore that data before running an experiment. The effectiveness of this work was demonstrated using four case studies. These case studies used the actual experiments conducted by VTSIL researchers in the past. The first three case studies help in understanding the properties of the building whereas the final case study deals with detecting and locating human footsteps, on one of the floors, in real-time.
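The abstract does not name the storage and processing technologies used, but the kind of batch computation such a platform supports can be sketched with ordinary Python tools. The example below computes a per-sensor RMS vibration level in fixed time windows over synthetic accelerometer data; the sensor names, sample rate, and column layout are invented for illustration and are not taken from SensAnalysis.

# Illustration of the kind of batch computation a vibration-data platform runs on
# accelerometer streams: per-sensor RMS level in fixed time windows. The column names,
# sample rate, and synthetic data are invented for the example.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
fs = 1000                                            # assumed samples per second
t = pd.date_range("2019-01-01", periods=60 * fs, freq="1ms")
frames = []
for sensor in ["acc_001", "acc_002"]:
    accel = 0.02 * rng.standard_normal(len(t))       # background vibration (g)
    accel[30 * fs:31 * fs] += 0.2                    # a one-second event, e.g. footsteps
    frames.append(pd.DataFrame({"time": t, "sensor": sensor, "accel_g": accel}))
df = pd.concat(frames, ignore_index=True)

# RMS per sensor in 5-second windows: a compact health/activity indicator.
rms = (df.assign(sq=df.accel_g ** 2)
         .groupby(["sensor", pd.Grouper(key="time", freq="5s")])["sq"]
         .mean() ** 0.5)
print(rms.groupby("sensor").nlargest(2))             # windows with the strongest vibration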
|
229 |
Sequential learning, large-scale calibration, and uncertainty quantification
Huang, Jiangeng (23 July 2019)
With remarkable advances in computing power, computer experiments continue to expand the boundaries and drive down the cost of various scientific discoveries. New challenges keep arising from designing, analyzing, modeling, calibrating, optimizing, and predicting in computer experiments. This dissertation consists of six chapters, exploring statistical methodologies in sequential learning, model calibration, and uncertainty quantification for heteroskedastic computer experiments and large-scale computer experiments. For heteroskedastic computer experiments, an optimal lookahead based sequential learning strategy is presented, balancing replication and exploration to facilitate separating signal from input-dependent noise. Motivated by challenges in both large data size and model fidelity arising from ever larger modern computer experiments, highly accurate and computationally efficient divide-and-conquer calibration methods based on on-site experimental design and surrogate modeling for large-scale computer models are developed in this dissertation. The proposed methodology is applied to calibrate a real computer experiment from the gas and oil industry. This on-site surrogate calibration method is further extended to multiple output calibration problems. / Doctor of Philosophy / With remarkable advances in computing power, complex physical systems today can be simulated comparatively cheaply and to high accuracy through computer experiments. Computer experiments continue to expand the boundaries and drive down the cost of various scientific investigations, including biological, business, engineering, industrial, management, health-related, physical, and social sciences. This dissertation consists of six chapters, exploring statistical methodologies in sequential learning, model calibration, and uncertainty quantification for heteroskedastic computer experiments and large-scale computer experiments. For computer experiments with changing signal-to-noise ratio, an optimal lookahead based sequential learning strategy is presented, balancing replication and exploration to facilitate separating signal from complex noise structure. In order to effectively extract key information from massive amount of simulation and make better prediction for the real world, highly accurate and computationally efficient divide-and-conquer calibration methods for large-scale computer models are developed in this dissertation, addressing challenges in both large data size and model fidelity arising from ever larger modern computer experiments. The proposed methodology is applied to calibrate a real computer experiment from the gas and oil industry. This large-scale calibration method is further extended to solve multiple output calibration problems.
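The replication-versus-exploration trade-off mentioned above can be caricatured with a toy sequential-design loop: replicate an input whose sample mean is still noisy, or explore a region of the input space far from the current design. The greedy rule below is purely illustrative and is not the lookahead criterion or the on-site surrogate calibration developed in the dissertation; the simulator, noise model, and thresholds are invented.

# A toy caricature of the replicate-vs-explore decision in sequential design for a
# noisy (heteroskedastic) simulator: replicate where the sample mean is still uncertain,
# explore where the design has no nearby point. Illustrative only; not the dissertation's
# lookahead criterion.
import numpy as np

rng = np.random.default_rng(2)
def simulator(x):                                    # noise level grows with x
    return np.sin(6 * x) + rng.normal(0.0, 0.05 + 0.4 * x)

candidates = np.linspace(0, 1, 21)
design = {x: [simulator(x)] for x in (0.0, 0.5, 1.0)}  # small initial design

for step in range(10):
    # Uncertainty of each sample mean: s^2 / n (needs at least 2 replicates).
    sem2 = {x: (np.var(v, ddof=1) / len(v) if len(v) > 1 else np.inf)
            for x, v in design.items()}
    x_rep = max(sem2, key=sem2.get)
    # Exploration target: the candidate farthest from the current design.
    dist = {c: min(abs(c - x) for x in design) for c in candidates}
    x_new, gap = max(dist.items(), key=lambda kv: kv[1])
    # Greedy choice: explore while large gaps remain, otherwise replicate.
    if gap > 0.15:
        design[x_new] = [simulator(x_new)]
        choice = f"explore x={x_new:.2f}"
    else:
        design[x_rep].append(simulator(x_rep))
        choice = f"replicate x={x_rep:.2f} (n={len(design[x_rep])})"
    print(f"step {step}: {choice}")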
|
230 |
Parallel Mining and Analysis of Triangles and Communities in Big Networks
Arifuzzaman, S M. (19 August 2016)
A network (graph) is a powerful abstraction for interactions among entities in a system. Examples include various social, biological, collaboration, citation, and co-purchase networks. Real-world networks are often characterized by an abundance of triangles and the existence of well-structured communities. Thus, counting triangles and detecting communities in networks have become important algorithmic problems in network mining and analysis. In the era of big data, the network data emerging from numerous scientific disciplines are very large. Online social networks such as Twitter and Facebook have millions to billions of users. Such massive networks often do not fit in the main memory of a single machine, and the existing sequential methods might take a prohibitively long runtime. This motivates the need for scalable parallel algorithms for mining and analysis.
We design MPI-based distributed-memory parallel algorithms for counting triangles and detecting communities in big networks and present related analysis. The dissertation consists of four parts. In Part I, we devise parallel algorithms for counting and enumerating triangles. The first algorithm employs an overlapping partitioning scheme and novel load-balancing schemes, leading to a fast algorithm. We also design a space-efficient algorithm using non-overlapping partitioning and an efficient communication scheme. This space efficiency allows the algorithm to work on even larger networks. We then present our third parallel algorithm based on dynamic load balancing. All these algorithms work on big networks, scale to a large number of processors, and demonstrate very good speedups. An important property of many real-world networks, closely related to triangles, is high transitivity, which states that two nodes having common neighbors tend to become neighbors themselves. In Part II, we characterize networks by quantifying the number of common neighbors and demonstrate its relationship to the community structure of networks. In Part III, we design parallel algorithms for detecting communities in big networks. We propose efficient load balancing and communication approaches, which lead to fast and scalable algorithms. Finally, in Part IV, we present scalable parallel algorithms for a useful graph preprocessing problem: converting an edge list to an adjacency list. We present non-trivial parallelization with efficient HPC-based techniques leading to fast and space-efficient algorithms. / Ph. D.
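A sequential sketch of the two building blocks mentioned above may help fix ideas: converting an edge list to an adjacency list (the Part IV preprocessing step) and counting triangles with the standard degree-based edge orientation that parallel algorithms of this kind typically partition across processors. The MPI distribution, partitioning, and load-balancing schemes from the dissertation are not shown, and the example graph is made up.

# A sequential sketch of the two building blocks discussed above: converting an edge
# list to an adjacency list, then counting triangles with a degree-based node ordering.
# The MPI partitioning and load-balancing schemes are not shown.
from collections import defaultdict
from itertools import combinations

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]  # tiny example graph

# Part IV-style preprocessing: edge list -> adjacency list.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Orient each edge from its lower-ranked endpoint (by degree, ties by id) to the
# higher-ranked one; every triangle is then counted exactly once, at its lowest-ranked vertex.
rank = {v: (len(adj[v]), v) for v in adj}
out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

triangles = 0
for v in adj:
    for u, w in combinations(out[v], 2):
        if w in out[u] or u in out[w]:    # is there an edge between the two out-neighbors?
            triangles += 1
print("triangles:", triangles)            # the example graph contains 2 triangles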
|