21

Scalable Embeddings for Kernel Clustering on MapReduce

Elgohary, Ahmed 14 February 2014 (has links)
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly used data clustering method, having gained popularity for its effectiveness on various data sets and its ease of implementation on different computing architectures. It assumes, however, that data are available in an attribute-value format and that each data instance can be represented as a vector in a feature space where the algorithm can be applied. These assumptions are impractical for much real-world data, and they hinder the use of complex data structures in real-world clustering applications. Kernel k-means is an effective clustering method that extends k-means to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally expensive, as it requires the complete kernel matrix to be computed and stored. Further, the kernelized nature of the algorithm hinders the parallelization of its computations on modern distributed-computing infrastructures. This thesis defines a family of kernel-based low-dimensional embeddings that allows kernel k-means to scale on MapReduce via an efficient and unified parallelization strategy. Three practical low-dimensional embedding methods that adhere to this definition are then proposed. Combining the proposed parallelization strategy with any of the three embedding methods yields a complete, scalable and efficient MapReduce algorithm for kernel k-means. The efficiency and scalability of the presented algorithms are demonstrated analytically and empirically.
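
As a rough illustration of the general idea (approximate the kernel feature space with a low-dimensional explicit embedding, then run ordinary k-means on the embedded points, which parallelizes easily), here is a minimal Nyström-style sketch. The RBF kernel, the landmark count, and scikit-learn's KMeans are illustrative assumptions, not the specific embeddings or MapReduce strategy proposed in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def nystrom_embedding(X, m=200, gamma=0.1, seed=0):
    """Map X into an m-dimensional space approximating the RBF kernel feature space."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    W = rbf_kernel(landmarks, landmarks, gamma=gamma)   # m x m kernel among landmarks
    C = rbf_kernel(X, landmarks, gamma=gamma)           # n x m kernel to landmarks
    vals, vecs = np.linalg.eigh(W)                      # embedding is C @ W^{-1/2}
    vals = np.clip(vals, 1e-12, None)
    return C @ vecs @ np.diag(vals ** -0.5)

X = np.random.rand(5000, 20)                            # placeholder data
Z = nystrom_embedding(X)
labels = KMeans(n_clusters=10, n_init=10).fit_predict(Z)  # plain k-means on the embedding
```

Because the embedding is explicit and low-dimensional, the subsequent k-means iterations reduce to simple per-point distance computations that map naturally onto MapReduce tasks.
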
22

Low-cost Data Analytics for Shared Storage and Network Infrastructures

Mihailescu, Madalin 09 August 2013 (has links)
Data analytics used to depend on specialized, high-end software and hardware platforms. Recent years, however, have brought forth the data-flow programming model, i.e., MapReduce, and with it a flurry of sturdy, scalable open-source software solutions for analyzing data. In essence, the commoditization of software frameworks for data analytics is well underway. Yet, up to this point, data analytics frameworks are still regarded as standalone, dedicated components; deploying these frameworks requires companies to purchase hardware to meet storage and network resource demands, and system administrators to manage data across multiple storage systems. This dissertation explores the low-cost integration of frameworks for data analytics within existing, shared infrastructures. The thesis centers on smart software being the key enabler for the holistic commoditization of data analytics. We focus on two instances of smart software that help realize the low-cost integration objective. For efficient storage integration, we build MixApart, a scalable data analytics framework that removes the dependency on dedicated storage for analytics; with MixApart, a single consolidated storage back-end manages data and services all types of workloads, thereby lowering hardware costs and simplifying data management. We evaluate MixApart at scale with micro-benchmarks and production workload traces, and show that MixApart performs as fast as, or faster than, an analytics framework with dedicated storage. For effective sharing of the networking infrastructure, we implement OX, a virtual machine management framework that allows latency-sensitive web applications to share the data center network with data analytics through intelligent VM placement; OX further protects all applications from hardware failures. The two solutions allow the reuse of existing storage and networking infrastructures when deploying analytics frameworks, and substantiate our thesis that smart software upgrades can enable the end-to-end commoditization of analytics.
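
OX's actual placement policy is not described in the abstract; as a purely hypothetical sketch of latency-aware VM placement, the toy heuristic below keeps analytics VMs off racks that host latency-sensitive web VMs whenever capacity allows. The VM names, rack names, and capacity limit are invented for illustration.

```python
from collections import defaultdict

def place_vms(vms, racks, capacity=8):
    """Greedy toy placement: separate analytics VMs from racks hosting web VMs."""
    placement = {}                       # vm name -> rack
    load = defaultdict(int)              # rack -> number of VMs placed
    web_racks = set()
    for name, kind in vms:               # kind is "web" or "analytics"
        preferred = [r for r in racks
                     if load[r] < capacity and (kind == "web" or r not in web_racks)]
        rack = (preferred or [min(racks, key=lambda r: load[r])])[0]
        placement[name] = rack
        load[rack] += 1
        if kind == "web":
            web_racks.add(rack)
    return placement

vms = [("web-1", "web"), ("web-2", "web"), ("mr-worker-1", "analytics"),
       ("mr-worker-2", "analytics"), ("mr-worker-3", "analytics")]
print(place_vms(vms, racks=["rack-A", "rack-B", "rack-C"]))
```
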
24

Wearable technology model to control and monitor hypertension during pregnancy

Lopez, Betsy Diamar Balbin, Aguirre, Jimmy Alexander Armas, Coronado, Diego Antonio Reyes, Gonzalez, Paola A. 27 June 2018 (has links)
The full text of this work is not available in the UPC Academic Repository due to restrictions imposed by the publisher. / In this paper, we proposed a wearable technology model to control and monitor hypertension during pregnancy. We enhanced prior models by adding a series of health parameters that could potentially prevent and correct hypertensive disorders in pregnancy. Our proposed model also emphasizes the application of real-time data analysis for the healthcare organization. In this process, we also assessed the current technologies and system applications offered in the market. The model consists of four phases: 1. The health parameters of the patient are collected through a wearable device; 2. The data is received by a mobile application; 3. The data is stored in a cloud database; 4. The data is analyzed in real time using a data analytics application. The model was validated and piloted in a public hospital in Lima, Peru. The preliminary results showed an 11% increase in the number of controlled patients and a 7% reduction in maternal deaths, among other relevant health factors that allowed healthcare providers to take corrective and preventive actions. / Peer reviewed
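
The four phases can be pictured as a simple data pipeline. The following is a minimal sketch of that flow, assuming an in-memory stand-in for the cloud database, invented function names, and a 140/90 mmHg alert threshold; none of these details come from the validated model itself.

```python
import json, statistics
from datetime import datetime, timezone

CLOUD_DB = []  # stand-in for the cloud database (phase 3)

def read_wearable():
    """Phase 1: collect health parameters from the wearable device (stubbed)."""
    return {"patient_id": "P-001", "systolic": 142, "diastolic": 91,
            "heart_rate": 88, "ts": datetime.now(timezone.utc).isoformat()}

def mobile_app_receive(reading):
    """Phase 2: the mobile application validates and forwards the reading."""
    assert {"systolic", "diastolic"} <= reading.keys()
    cloud_store(json.dumps(reading))

def cloud_store(payload):
    """Phase 3: persist the reading in the cloud database."""
    CLOUD_DB.append(json.loads(payload))

def analyze_realtime():
    """Phase 4: real-time analysis; flag readings above an assumed hypertension threshold."""
    alerts = [r for r in CLOUD_DB if r["systolic"] >= 140 or r["diastolic"] >= 90]
    mean_sys = statistics.mean(r["systolic"] for r in CLOUD_DB)
    return {"alerts": alerts, "mean_systolic": mean_sys}

mobile_app_receive(read_wearable())
print(analyze_realtime())
```
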
25

A prescriptive analytics approach for energy efficiency in datacentres

Panneerselvam, John January 2018 (has links)
Given the evolution of Cloud Computing in recent years, the number of users and clients adopting Cloud Computing for both personal and business needs has grown at an unprecedented scale. This has naturally led to increased deployments of Cloud datacentres across the globe, and as a consequence Cloud datacentres have become massive energy consumers and environmental polluters. Whilst the energy implications of Cloud datacentres are being addressed from various research perspectives, predicting the future trend and behaviour of workloads at the datacentres, and thereby reducing the active server resources, is one particular dimension of green computing gaining the interest of researchers and Cloud providers. However, this involves various practical and analytical challenges imposed by the increased dynamism of Cloud systems. The behavioural characteristics of Cloud workloads and users are still not perfectly understood, which limits the reliability and accuracy of the predictions made by existing research in this context. To this end, this thesis presents a comprehensive descriptive analytics of Cloud workload and user behaviour, uncovering the causes and energy-related implications of Cloud Computing. Furthermore, the characteristics of Cloud workloads and users, including latency levels, job heterogeneity, user dynamicity, straggling-task behaviour, the energy implications of stragglers, job execution and termination patterns, and the inherent periodicity in Cloud workload and user behaviour, are presented empirically. Driven by this descriptive analytics, a novel user behaviour forecasting framework has been developed, aimed at a three-fold forecast of user behaviour: the session duration of users, the anticipated number of job submissions, and the arrival trend of incoming workloads. Furthermore, a novel resource optimisation framework has been proposed to provision the optimum level of resources for executing jobs with reduced server energy expenditure and fewer job terminations. This optimisation framework encompasses a resource estimation module that predicts the anticipated resource consumption of arriving jobs and a classification module that classifies tasks based on their resource intensiveness. Both proposed frameworks have been verified theoretically and tested experimentally on Google Cloud trace logs. Experimental analysis demonstrates the effectiveness of the proposed frameworks in terms of the reliability of the forecast results and the reduction in server energy spent executing jobs at the datacentres.
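
The abstract does not specify which classifier the classification module uses; as a rough sketch of its role (labelling arriving tasks by resource intensiveness from a few observed features), the following trains a decision tree on synthetic CPU/memory features. The feature names, labelling rule, and model choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 2000
# Synthetic task features: requested CPU share, requested memory share, prior runtime (s)
X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 1, n), rng.exponential(300, n)])
# Illustrative labelling rule: "resource-intensive" if combined CPU + memory demand is high
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)
print("held-out accuracy:", round(clf.score(X_te, y_te), 3))
```
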
26

A Goal-Oriented Method for Regulatory Intelligence

Akhigbe, Okhaide Samson 10 October 2018 (has links)
When creating and administering regulations, regulators have to demonstrate that regulations accomplish intended societal outcomes at costs that do not outweigh their benefits. While regulators have this responsibility as custodians of the regulatory ecosystem, they are also required to create and administer regulations transparently and impartially, addressing the needs and concerns of all stakeholders involved. This is in addition to dealing with various administrative bottlenecks, competing internal priorities, and financial and human resource limitations. Nonetheless, governments, regulated parties, citizens and interest groups can each express different views on the relevance and performance of a piece of regulation. These views range from complaints that too many regulations burden business operations to perceptions that crises in society are the result of insufficient regulation. As such, regulators have to be innovative, employing methods that show that regulations are effective and that justify the introduction, evolution or repeal of regulations. The regulatory process has been the topic of various studies, several of which explore the use of information systems at the software level to confirm compliance with regulations and evaluate issues related to non-compliance. The rationale is that if information systems can improve operational functions in organizations, they can also help measure compliance. However, the research focus has been on enabling regulated parties to comply with regulations rather than on enabling regulators to assess or enforce compliance or to show that regulations are effective. Regulators need to address concerns of too much or too little regulation with data-driven evidence, especially in this age of big data and artificial-intelligence-enhanced tools. A method that facilitates evidence-based decision-making using data for enacting, implementing and reviewing regulations is therefore indispensable. In response to these challenges, this thesis combines a goal-oriented modelling method with data analytics software to create a method that enables monitoring, assessing and reporting on the effectiveness of regulations and regulatory initiatives. This Goal-oriented Regulatory Intelligence Method (GoRIM) provides an intelligent approach to regulatory management, as well as a feedback loop in the use of data from and within the regulatory ecosystem to create and administer regulations. To demonstrate its applicability, GoRIM was applied to three case studies involving regulators in three different real regulatory scenarios, and its feasibility and utility were evaluated. The results indicate that regulators found GoRIM promising in enabling them to show, with evidence, whether their regulations are effective.
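
GoRIM itself is a modelling method rather than an algorithm, but its core feedback step, turning an observed regulatory indicator into a goal-satisfaction level that can be rolled up a goal model, can be sketched in the spirit of GRL-style KPI evaluation. The target/threshold/worst-case values and the linear scaling below are illustrative assumptions, not GoRIM's defined semantics.

```python
def kpi_satisfaction(value, target, threshold, worst):
    """Map an observed indicator value to a 0-100 satisfaction level:
    100 at (or beyond) the target, 50 at the threshold, 0 at (or beyond) the worst case."""
    if target >= worst:  # higher is better
        if value >= target:
            return 100.0
        if value <= worst:
            return 0.0
        if value >= threshold:
            return 50.0 + 50.0 * (value - threshold) / (target - threshold)
        return 50.0 * (value - worst) / (threshold - worst)
    # lower is better: flip the scale by negating all values
    return kpi_satisfaction(-value, -target, -threshold, -worst)

# Hypothetical goal "reduce reportable incidents" (lower is better)
print(kpi_satisfaction(value=12, target=5, threshold=15, worst=40))  # -> 65.0
```
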
27

A Data Analytics Framework for Smart Grids: Spatio-temporal Wind Power Analysis and Synchrophasor Data Mining

January 2013 (has links)
Under the framework of intelligent management of power grids by leveraging advanced information, communication and control technologies, a primary objective of this study is to develop novel data mining and data processing schemes for several critical applications that can enhance the reliability of power systems. Specifically, the study is organized into two parts: I) spatio-temporal wind power analysis for wind generation forecasting and integration, and II) data mining and information fusion of synchrophasor measurements toward secure power grids. Part I centers on wind power generation forecasting and integration. First, a spatio-temporal analysis approach for short-term wind farm generation forecasting is proposed. Specifically, using extensive measurement data from an actual wind farm, the probability distribution and the level crossing rate of wind farm generation are characterized using tools from graphical learning and time-series analysis. Built on these spatial and temporal characterizations, finite-state Markov chain models are developed, and a point forecast of wind farm generation is derived from the Markov chains. Then, multi-timescale scheduling and dispatch with stochastic wind generation and opportunistic demand response is investigated. Part II focuses on incorporating the emerging synchrophasor technology into the security assessment and post-disturbance fault diagnosis of power systems. First, a data-mining framework is developed for on-line dynamic security assessment (DSA) using adaptive ensemble decision-tree learning of real-time synchrophasor measurements. Under this framework, novel on-line DSA schemes are devised to handle various factors (including variations in operating conditions, forced system topology changes, and the loss of critical synchrophasor measurements) that can significantly affect the performance of conventional data-mining-based on-line DSA schemes. Then, in the context of post-disturbance analysis, detection and localization of line outages is investigated using a dependency graph approach. It is shown that a dependency graph for voltage phase angles can be built according to the interconnection structure of the power system, and line-outage events can be detected and localized through networked fusion of synchrophasor measurements collected from multiple locations in the power grid. Along a more practical avenue, a decentralized networked data fusion scheme is proposed for efficient fault detection and localization. / Dissertation/Thesis / Ph.D. Electrical Engineering 2013
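
As a rough illustration of the finite-state Markov chain idea (quantize wind farm output into states, estimate the state transition matrix from history, and take the expected next-state output as the point forecast), the sketch below uses uniform bins and a synthetic generation trace. The bin count and the synthetic series are assumptions, not the spatio-temporal models developed in the dissertation.

```python
import numpy as np

def markov_point_forecast(series, n_states=10):
    """Quantize the series into states, fit a transition matrix, and return
    the expected generation at the next step given the last observed state."""
    lo, hi = series.min(), series.max()
    edges = np.linspace(lo, hi, n_states + 1)
    states = np.clip(np.digitize(series, edges[1:-1]), 0, n_states - 1)
    centers = (edges[:-1] + edges[1:]) / 2

    # Empirical transition matrix with add-one smoothing
    T = np.ones((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        T[a, b] += 1
    T /= T.sum(axis=1, keepdims=True)

    return float(T[states[-1]] @ centers)   # expected next-step output

rng = np.random.default_rng(1)
wind = np.abs(50 + np.cumsum(rng.normal(0, 3, 1000)))   # synthetic generation trace (MW)
print("next-step point forecast:", round(markov_point_forecast(wind), 1), "MW")
```
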
28

Cloud enabled data analytics and visualization framework for health-shock prediction

Mahmud, S. January 2016 (has links)
Health-shock can be defined as a health event that causes severe hardship to a household because of the financial burden of healthcare payments and the income lost through inability to work. It is one of the most prevalent shocks faced by people in underdeveloped and developing countries. In Pakistan especially, policy makers and the healthcare sector face an uphill battle in dealing with health-shock due to the lack of a publicly available dataset and an effective data analytics approach. To address this problem, this thesis presents a data analytics and visualization framework for health-shock prediction based on a large-scale health informatics dataset. The framework is built on Amazon Web Services cloud computing services integrated with Geographical Information Systems (GIS) to facilitate the capture, storage, indexing and visualization of big data for different stakeholders using smart devices. The data was collected through offline questionnaires and an online mobile-based system through Begum Memhooda Welfare Trust (BMWT), and all data was coded in the online system for analysis and visualization. To develop a predictive model for health-shock, a user study was conducted to collect a multidimensional dataset from 1000 households in rural and remotely accessible regions of Pakistan, focusing on their health, access to healthcare facilities and social welfare, as well as economic and environmental factors. The collected data was used to generate a predictive model using a fuzzy rule summarization technique, which provides stakeholders with interpretable linguistic rules explaining the causal factors affecting health-shock. The evaluation of the proposed system in terms of the interpretability and accuracy of the generated data models for classifying health-shock shows promising results: based on k-fold cross-validation of the data samples, the fuzzy model predicts health-shock from the given factors with above 89% accuracy. Such a framework will not only help the government and policy makers to manage and mitigate health-shock effectively and in a timely manner, but will also provide a low-cost, flexible, scalable, and secure architecture for data analytics and visualization. Future work includes extending this study to form Pakistan's first publicly available health informatics tool to help government and healthcare professionals form policies and healthcare reforms. This study has implications at a national and international level for facilitating large-scale health data analytics through cloud computing, minimizing the resource commitments needed to predict and manage health-shock.
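
The thesis evaluates a fuzzy rule summarization model; the k-fold cross-validation protocol itself is generic and can be sketched as below with a placeholder classifier on synthetic household data. The feature set, the five-fold split, and the logistic regression model are assumptions standing in for the fuzzy model and the real survey dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 1000
# Synthetic household features: income index, distance to clinic (km), chronic-illness flag
X = np.column_stack([rng.normal(0, 1, n), rng.exponential(10, n), rng.integers(0, 2, n)])
# Synthetic "health-shock" label loosely driven by the features
y = ((0.8 * X[:, 2] + 0.05 * X[:, 1] - 0.5 * X[:, 0] + rng.normal(0, 1, n)) > 1).astype(int)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print("per-fold accuracy:", np.round(scores, 3), "mean:", round(scores.mean(), 3))
```
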
29

Automated feature synthesis on big data using cloud computing resources

Saker, Vanessa January 2020 (has links)
The data analytics process has many time-consuming steps. Combining data that sits in a relational data warehouse into a single relation, while aggregating important information in a meaningful way and preserving relationships across relations, is complex and time-consuming. This step is exceptionally important because many machine learning algorithms require a single file format as input (e.g. supervised and unsupervised learning, feature representation and feature learning, etc.). During the feature synthesis phase of the feature engineering process that precedes machine learning, an analyst is required to manually combine relations while generating new, more impactful information points from the data. Furthermore, the entire process is complicated by Big Data factors such as processing power and distributed data storage. An open-source package, Featuretools, uses an innovative algorithm called Deep Feature Synthesis to accelerate the feature engineering step. When working with Big Data, however, it has two major limitations. The first is the curse of modularity: Featuretools processes data in memory and thus, if the data is large, it requires a processing unit with a large memory. Secondly, the package depends on data stored in a Pandas DataFrame, which makes using Featuretools with Big Data tools such as Apache Spark a challenge. This dissertation examines the viability and effectiveness of using Featuretools for feature synthesis with Big Data on the cloud computing platform AWS. Exploring the impact of generated features is a critical first step in solving any data analytics problem; if this can be automated in a distributed Big Data environment with a reasonable investment of time and funds, data analytics exercises will benefit considerably. In this dissertation, a framework for automated feature synthesis with Big Data is proposed and an experiment conducted to examine its viability. Using this framework, an infrastructure was built to support feature synthesis on AWS using S3 storage buckets, Elastic Compute Cloud (EC2) instances, and an Elastic MapReduce cluster. A dataset of 95 million customers, 34 thousand fraud cases and 5.5 million transactions across three different relations was then loaded into the distributed relational database on the platform. The infrastructure was used to show how the dataset could be prepared to represent a business problem, and Featuretools was used to generate a single feature matrix suitable for inclusion in a machine learning pipeline. The results show that the approach was viable: the feature matrix produced 75 features from 12 input variables and was time-efficient, with a total end-to-end run time of 3.5 hours and a cost of approximately R 814 (about $52). The framework can be applied to a different set of data and allows analysts to experiment on a small section of the data until a final feature set is decided, then easily scale the feature matrix to the full dataset. This ability to automate feature synthesis, iterate and scale up will save time in the analytics process while providing a richer feature set for better machine learning results.
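
As a toy illustration of Deep Feature Synthesis on a parent/child pair of relations (the dissertation ran this at far larger scale on an EMR cluster), the sketch below builds a small in-memory EntitySet and lets Featuretools aggregate transaction rows up to the customer level. It assumes the Featuretools 1.x API, and the column names and primitives are invented for the example.

```python
import pandas as pd
import featuretools as ft

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "join_date": pd.to_datetime(["2019-01-05", "2019-02-11", "2019-03-20"]),
})
transactions = pd.DataFrame({
    "transaction_id": range(6),
    "customer_id": [1, 1, 2, 2, 3, 3],
    "amount": [25.0, 40.0, 10.0, 5.5, 99.0, 3.2],
    "time": pd.to_datetime(["2019-03-01", "2019-03-02", "2019-03-05",
                            "2019-03-07", "2019-03-09", "2019-03-11"]),
})

es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="time")
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# Deep Feature Synthesis: aggregate child rows up to the customer level
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      agg_primitives=["sum", "mean", "count"], max_depth=1)
print(feature_matrix.head())
```
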
30

An Empirical Analysis of Network Traffic: Device Profiling and Classification

Anbazhagan, Mythili Vishalini 02 July 2019 (has links)
Time and again we have seen the Internet grow and evolve at an unprecedented scale. The number of online users in 1995 was 40 million, but by 2020 the number of online devices is predicted to reach 50 billion, roughly seven times the human population of the earth. Until now, the revolution was in the digital world; now it is happening in the physical world we live in. IoT devices are employed in all sorts of environments, such as domestic houses, hospitals, industrial spaces, and nuclear plants. Since they are employed in many mission-critical or even life-critical environments, their security and reliability are of paramount importance, because compromising them can lead to grave consequences. IoT devices are, by nature, different from conventional Internet-connected devices such as laptops and smartphones: they have small memory, limited storage, and low processing power, and they operate with little to no human intervention. Hence it becomes very important to understand IoT devices better. How do they behave in a network? How different are they from traditional Internet-connected devices? Can they be identified from their network traffic? Is it possible for anyone to identify them just by looking at the network data that leaks outside the network, without even joining the network? That is the aim of this thesis. To the best of our knowledge, no study has collected data from outside the network, without joining it, with the intention of finding out whether IoT devices can be identified from this data. We also identify parameters that distinguish IoT from non-IoT devices. We then group similar devices manually, and afterwards perform the grouping automatically using clustering algorithms. This helps group devices of a similar nature and create a profile for each kind of device.
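
As a rough illustration of automatic grouping from externally observable traffic, the sketch below clusters a few per-device traffic features with k-means. The feature choices, device labels, and cluster count are assumptions for the example, not the parameters identified in the thesis.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Per-device traffic features observed passively: [packets/min, mean packet size (B), distinct ports]
devices = ["thermostat", "ip-camera", "smart-plug", "laptop", "smartphone"]
features = np.array([
    [12,   90,  2],    # thermostat: sparse, tiny packets
    [900, 1200, 3],    # IP camera: constant video stream
    [8,    70,  1],    # smart plug
    [300,  800, 40],   # laptop: bursty, many services
    [250,  650, 35],   # smartphone
], dtype=float)

X = StandardScaler().fit_transform(features)          # scale features to comparable ranges
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for device, label in zip(devices, labels):
    print(f"{device}: cluster {label}")
```
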
