About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
181

Automated feature synthesis on big data using cloud computing resources

Saker, Vanessa January 2020 (has links)
The data analytics process has many time-consuming steps. Combining data that sits in a relational database warehouse into a single relation, while aggregating important information in a meaningful way and preserving relationships across relations, is complex and time-consuming. This step is exceptionally important, as many machine learning algorithms require a single file format as input (e.g. supervised and unsupervised learning, feature representation and feature learning). An analyst is required to manually combine relations while generating new, more impactful information points from the data during the feature synthesis phase of the feature engineering process that precedes machine learning. Furthermore, the entire process is complicated by Big Data factors such as processing power and distributed data storage. An open-source package, Featuretools, uses an innovative algorithm called Deep Feature Synthesis to accelerate the feature engineering step. When working with Big Data, however, it has two major limitations. The first is the curse of modularity: Featuretools processes data in memory, so large data requires a processing unit with a large memory. Second, the package depends on data stored in a Pandas DataFrame, which makes using Featuretools with Big Data tools such as Apache Spark a challenge. This dissertation examines the viability and effectiveness of using Featuretools for feature synthesis with Big Data on the cloud computing platform AWS. Exploring the impact of generated features is a critical first step in solving any data analytics problem; if this can be automated in a distributed Big Data environment with a reasonable investment of time and funds, data analytics exercises will benefit considerably. In this dissertation, a framework for automated feature synthesis with Big Data is proposed and an experiment conducted to examine its viability. Using this framework, an infrastructure was built to support feature synthesis on AWS, making use of S3 storage buckets, Elastic Compute Cloud (EC2) services, and an Elastic MapReduce cluster. A dataset of 95 million customers, 34 thousand fraud cases and 5.5 million transactions across three different relations was then loaded into the distributed relational database on the platform. The infrastructure was used to show how the dataset could be prepared to represent a business problem, and Featuretools was used to generate a single feature matrix suitable for inclusion in a machine learning pipeline. The results show that the approach is viable: the feature matrix comprised 75 features generated from 12 input variables, with a time-efficient end-to-end run time of 3.5 hours at a cost of approximately R 814 (about $52). The framework can be applied to a different set of data and allows analysts to experiment on a small section of the data until a final feature set is decided, then easily scale the feature matrix to the full dataset. This ability to automate feature synthesis, iterate, and scale up will save time in the analytics process while providing a richer feature set for better machine learning results.
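As a hedged illustration of the package at the centre of this work, the sketch below shows a minimal, single-machine Deep Feature Synthesis run on toy customer and transaction tables. The column names, primitives, and data are illustrative assumptions, the API names follow recent Featuretools releases, and the dissertation's distributed AWS pipeline is considerably more involved.

```python
import pandas as pd
import featuretools as ft

# Toy parent/child relations standing in for the customer and transaction tables
customers = pd.DataFrame({"customer_id": [1, 2],
                          "join_date": pd.to_datetime(["2019-01-05", "2019-03-10"])})
transactions = pd.DataFrame({"transaction_id": [10, 11, 12],
                             "customer_id": [1, 1, 2],
                             "amount": [50.0, 20.0, 99.0],
                             "time": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-02"])})

# Register both relations and the key linking them in an EntitySet
es = ft.EntitySet(id="fraud_demo")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id", time_index="join_date")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="time")
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# Deep Feature Synthesis: stack aggregation primitives across the relationship
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      agg_primitives=["sum", "mean", "count"], max_depth=2)
print(feature_matrix)  # one row per customer, e.g. SUM(transactions.amount)
```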
182

Impact of Big Data Analytics in Industry 4.0

Oikonomidi, Sofia January 2020 (has links)
Big data in Industry 4.0 is a major subject both for current research and for organizations motivated to invest in such projects. Big data refers to the large quantities of data collected from various sources that can potentially be analyzed to provide valuable insights and patterns. In Industry 4.0, data production is massive and thus provides the basis for analysis and the extraction of important information. This study examines the impact of big data analytics in Industry 4.0 environments using the SWOT framework, with the intention of providing both a positive and a negative perspective on the subject. Since these implementations are an innovative trend and awareness of the subject is limited, it is valuable to summarize and explore findings from the published literature and to review them in interviews with data scientists. The intention is to increase knowledge of the subject and inform organizations about their potential expectations and challenges. The effects are represented in a SWOT analysis based on findings collected from 22 selected articles, which were afterwards discussed with professionals. The systematic literature review began with the creation of a plan and a defined sequence of steps based on existing scientific papers. Relevant literature was selected using specified inclusion and exclusion criteria and its relevance to the research questions. Following this, the interview questionnaire was built based on the findings in order to gather empirical data on the subject. The results revealed that the insights developed through big data support management towards effective decision-making, since they reduce the ambiguity of actions. The optimization of production, reduction of expenditure, and customer satisfaction follow as the top categories mentioned in the selected articles for the strengths dimension. Among the opportunities, equipment interoperability, real-time information acquisition and exchange, and system self-awareness are reflected in the majority of the papers. By contrast, threats and weaknesses are addressed in fewer studies: infrastructure limitations and security and privacy issues feature substantially, while organizational changes and human resources matters are also raised, but infrequently. The data scientists agreed with the findings and mentioned that decision-making, process effectiveness and customer relationships are their major expectations and objectives, while the experience and knowledge limitations of personnel are their main concern. In general, gaps in the existing literature can be identified around the challenges that big data projects in Industry 4.0 face. Consequently, further research in the field is recommended in order to raise awareness among interested parties and help ensure project success.
183

Análisis de redes sociales en usuarios peruanos acerca del tratamiento para Covid-19 utilizado herramienta de Big data: El caso del Dióxido de Cloro / Social media analysis of Peruvian users regarding Covid-19 treatment using a Big Data tool: the case of chlorine dioxide

Aguirre, Aranza, De la Cruz, Betsy, Gonzales Cobeñas, Joe, Macedo Lozano, Sasha Darlene 14 October 2020 (has links)
Throughout the pandemic, multiple methods have been proposed on social media that supposedly sought to reduce the impact of COVID-19 on people, the consumption of chlorine dioxide being one of them, despite the lack of scientific evidence to support it. In this context, the present study will be carried out with the purpose of performing a social media analysis of Peruvian users regarding the use of chlorine dioxide as a COVID-19 treatment. The intention is that, in the future, this analysis can serve public health surveillance. Publicly available information (Big Data) will be used and examined through Google Trends and Social Searcher to analyse search trends; likewise, sentiment analysis will be performed on social networks (Facebook, Twitter, Instagram, YouTube, Tumblr, Reddit, Flickr, Dailymotion and Vimeo).
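As a hedged sketch of the two instruments the study describes, the snippet below queries Google Trends via the unofficial pytrends library and scores sample posts with NLTK's VADER analyser. Both tools are stand-ins chosen for illustration: the study names Google Trends and Social Searcher, and a production pipeline for Spanish-language posts would need a Spanish-capable sentiment model.

```python
# Requires: pip install pytrends nltk; plus nltk.download("vader_lexicon") once.
from pytrends.request import TrendReq
from nltk.sentiment import SentimentIntensityAnalyzer

# Search-trend analysis: interest in "dioxido de cloro" among Peruvian users
pytrends = TrendReq(hl="es-PE")
pytrends.build_payload(["dioxido de cloro"], geo="PE",
                       timeframe="2020-01-01 2020-10-01")
trend = pytrends.interest_over_time()  # pandas DataFrame of weekly interest
print(trend.tail())

# Sentiment analysis of collected posts (VADER is English-oriented; shown for shape only)
sia = SentimentIntensityAnalyzer()
posts = ["chlorine dioxide cured my cousin",
         "do not drink chlorine dioxide, it is toxic"]
for post in posts:
    print(post, "->", sia.polarity_scores(post)["compound"])
```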
184

Implementierung und Evaluierung einer Verarbeitung von Datenströmen im Big Data Umfeld am Beispiel von Apache Flink / Implementation and evaluation of data stream processing in a Big Data environment using the example of Apache Flink

Oelschlegel, Jan 17 May 2021 (has links)
Stream processing is increasingly becoming a focus in the construction of modern Big Data infrastructures. The industry partner of this master's thesis, integrationfactory GmbH & Co. KG, wants to expand its Big Data business in order to support its customers in these areas as a consultancy. The focus was placed from the outset on Apache Flink, an up-and-coming stream-processing framework. The goal of this thesis is to implement several of the company's typical use cases with Flink and subsequently evaluate them. To this end, the central problem is first established and the objectives derived from it. For better understanding, important basic terms and concepts are then introduced. A separate chapter is devoted to the framework to give the reader a comprehensive yet compact view of Flink. Various sources were consulted, including direct contact with active developers of the framework; this made it possible to clarify issues left unclear by missing information in the primary sources and to incorporate them into the chapter. The main part of the thesis implements the defined use cases using the DataStream API and FlinkSQL, and the choice of these interfaces is justified. The programmed jobs are executed in the company's own Big Data lab, a virtualized environment for testing technologies. As the central question of this thesis, both interfaces are evaluated for their suitability with respect to the use cases. Based on the knowledge from the fundamentals chapters and the experience gained developing the jobs, evaluation criteria are established using the Analytic Hierarchy Process, followed by the evaluation and interpretation of the result. Contents: 1. Introduction (motivation, problem statement, objectives); 2. Fundamentals (definitions: Big Data, bounded vs. unbounded streams, stream vs. table; stateful stream processing: history, requirements, pattern types, how stateful stream processing works); 3. Apache Flink (history, architecture, time-based processing, data types and serialization, state management, checkpoints and recovery, programming interfaces: DataStream API, FlinkSQL & Table API, Hive integration; deployment and operation); 4. Implementation (development environment, server environment, Flink configuration, source data, use cases, implementation as Flink jobs with the DataStream API and FlinkSQL, review of results); 5. Evaluation (Analytic Hierarchy Process: procedure and methodology, problem definition, criteria structure, comparison matrices, evaluation of alternatives; AHP results); 6. Conclusion and outlook.
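As a hedged, minimal sketch of the two interfaces the thesis evaluates, the snippet below contrasts the imperative DataStream API with declarative FlinkSQL using PyFlink. The sample data, column names, and job shape are illustrative assumptions, not the thesis's actual jobs.

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# DataStream API: explicit, record-at-a-time transformations
readings = env.from_collection(
    [("sensor-1", 21.5), ("sensor-2", 19.0), ("sensor-1", 22.1)],
    type_info=Types.TUPLE([Types.STRING(), Types.DOUBLE()]),
)
warm = readings.filter(lambda r: r[1] > 20.0)

# FlinkSQL: register the stream as a table and aggregate declaratively
# (columns of an unnamed tuple stream default to f0, f1, ...)
t_env.create_temporary_view("warm_readings", t_env.from_data_stream(warm))
result = t_env.sql_query(
    "SELECT f0 AS sensor, AVG(f1) AS avg_temp FROM warm_readings GROUP BY f0"
)
result.execute().print()  # triggers the job and prints the aggregated rows
```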
185

The Evolution of Big Data and Its Business Applications

Halwani, Marwah Ahmed 05 1900 (has links)
The arrival of the Big Data era has become a major topic of discussion in many sectors because of the promise of big data utilization and its impact on decision-making. It is an interdisciplinary issue that has captured the attention of scholars and created new research opportunities in information science, business, health care, and many other fields. The problem is that Big Data is not well defined, so confusion exists in IT about which jobs and skill sets the big data area requires. The problem stems from the newness of the Big Data profession: because many aspects of the area are unknown, organizations do not yet possess the IT, human, and business resources necessary to cope with and benefit from big data. These organizations include health care, enterprise, logistics, universities, weather forecasting, oil companies, e-business, recruiting agencies, etc., and are challenged to deal with high-volume, high-variety, and high-velocity big data to facilitate better decision-making. This research proposes a new way to look at Big Data and Big Data analysis. It contributes to the theoretical and methodological foundations of Big Data and addresses an increasing demand for more powerful Big Data analysis from the academic research perspective. Essay 1 provides a strategic overview of the untapped potential of social media Big Data in the business world and describes its challenges and opportunities for aspiring business organizations. It also offers fresh recommendations on how companies can exploit social media data analysis to make better business decisions, ones that embrace the relevant social qualities of their customers and the related ecosystem. The goal of this research is to provide insights for businesses to make better, more informed decisions based on effective social media data analysis. Essay 2 provides a better understanding of the influence of social media during the 2016 American presidential election and develops a model to examine individuals' attitudes toward participating in social media (SM) discussions that might influence their choice between the two presidential candidates, Donald Trump and Hillary Clinton. The goal of this research is to provide a theoretical foundation that supports the influence of social media on individuals' decisions. Essay 3 defines the major job descriptions for careers in the new Big Data profession. It describes the Big Data professional profile as reflected by the demand side and explains the differences and commonalities between company-posted job requirements for data analytics, business analytics, and data scientist jobs. The main aim of this work is to clarify the skill requirements for Big Data professionals, for the joint benefit of the job market where they will be employed and of academia, where such professionals will be prepared in data science programs, aiding the entire process of preparing and recruiting for Big Data positions.
186

Aplikace pro Big Data / Application for Big Data

Blaho, Matúš January 2018 (has links)
This work deals with the description and analysis of the Big Data concept and its processing and use in decision support. The proposed processing is based on the MapReduce model designed for Big Data processing. The theoretical part of this work focuses largely on the Hadoop system, which implements this model; understanding it is key to properly designing applications that run within it. The work also contains designs for specific Big Data processing applications. The implementation part of the thesis describes Hadoop system administration, the implementation of the MapReduce applications, and their testing over data sets.
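As a hedged illustration of the MapReduce model the thesis builds on, the self-contained sketch below simulates the map, shuffle, and reduce phases of the canonical word-count job in plain Python; Hadoop performs the same steps in a distributed fashion across a cluster.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit an intermediate (word, 1) pair for every word in the line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: combine all intermediate values for one key
    return (word, sum(counts))

lines = ["big data needs big tools", "hadoop implements mapreduce"]

# Shuffle phase: group intermediate pairs by key
# (Hadoop does this between the map and reduce stages, across nodes)
groups = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        groups[word].append(count)

results = [reducer(word, counts) for word, counts in sorted(groups.items())]
print(results)  # [('big', 2), ('data', 1), ('hadoop', 1), ...]
```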
187

Návrh řešení pro efektivní analýzu bezpečnostních dat / Design of a Solution for Effective Analysis of Security Data

Podlesný, Šimon January 2021 (has links)
The goal of this thesis is to design an architecture capable of processing big data, with a focus on data leaks. For this purpose, multiple data storage systems were described and compared. The proposed architecture can load, process, store, and provide access to data for analytic purposes, while taking into account user authentication and authorisation and the principles of modern agile infrastructure.
188

Analyses, Mitigation and Applications of Secure Hash Algorithms

Al-Odat, Zeyad Abdel-Hameed January 2020 (has links)
Cryptographic hash functions are among the most widely used cryptographic primitives, serving to ensure the integrity of a system or its data. Hash functions are also utilized in conjunction with digital signatures to provide authentication and non-repudiation services. Secure Hash Algorithms have been developed over time by the National Institute of Standards and Technology (NIST) for security, optimal performance, and robustness. The best-known hash standards are SHA-1, SHA-2, and SHA-3. A secure hash algorithm is considered weak if its security requirements have been broken. The main attacks that threaten the secure hash standards are collision and length extension attacks. A collision attack works by finding two different messages that lead to the same hash. A length extension attack extends the message payload to produce an eligible hash digest. Both attacks have already broken some hash standards that follow the Merkle-Damgård construction. This dissertation proposes methodologies to improve and strengthen weak hash standards against collision and length extension attacks. We propose collision-detection approaches that help to detect a collision attack before it takes place, as well as a proper replacement supported by a proper construction. The collision detection methodology helps to protect weak primitives from any possible collision attack using two approaches: the first employs the near-collision detection mechanism proposed by Marc Stevens; the second is our own proposal. Moreover, this dissertation proposes a model that protects secure hash functions from collision and length extension attacks. The model employs the sponge structure to construct a hash function, and the resulting function is strong against collision and length extension attacks. Furthermore, to keep the general structure of the Merkle-Damgård functions, we propose a model that replaces the SHA-1 and SHA-2 hash standards using the Merkle-Damgård construction. This model employs the compression function of SHA-1, the function manipulators of SHA-2, and the 10*1 padding method. For the case of big data over the cloud, this dissertation presents several schemes to ensure data security and authenticity, including secure storage, anonymous privacy preservation, and auditing of big data over the cloud.
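As a hedged aside, not the dissertation's construction, Python's standard library illustrates the two hash families discussed, a Merkle-Damgård hash (SHA-256) and a sponge-based one (SHA3-256), and shows how HMAC avoids the naive keyed hashing that length extension attacks exploit.

```python
import hashlib
import hmac

msg = b"transfer $100 to alice"

# SHA-2 follows the Merkle-Damgard construction; SHA-3 uses the sponge
# construction, which inherently resists length extension attacks.
print("SHA-256 :", hashlib.sha256(msg).hexdigest())
print("SHA3-256:", hashlib.sha3_256(msg).hexdigest())

# A naive MAC built as H(secret || message) over a Merkle-Damgard hash is what
# a length extension attack exploits: an attacker can append data and extend
# the digest without knowing the secret. HMAC is the standard remedy.
tag = hmac.new(b"secret-key", msg, hashlib.sha256).hexdigest()
print("HMAC-SHA256:", tag)
```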
189

A machine learning approach to detect insider threats in emails caused by human behaviour

Michael, Antonia January 2020 (has links)
In recent years, there has been a significant increase in insider threats within organisations, and these have caused massive losses and damages. Because email communication is a crucial part of the modern working environment, many insider threats exist within organisations' email infrastructure. It is well known that employees not only dispatch ‘business-as-usual’ emails, but also emails that are completely unrelated to company business, perhaps even involving malicious activity and unethical behaviour. Such insider threat activities are mostly carried out by employees who have legitimate access to their organisation's resources, servers, and non-public data, yet abuse their privileges for personal gain or even to inflict malicious damage on the employer. The problem is that the high volume and velocity of email communication make it virtually impossible to minimise the risk of insider threat activities using techniques such as filtering and rule-based systems. The research presented in this dissertation suggests strategies to minimise the risk of insider threats via email systems by employing a machine-learning-based approach. This is done by studying and creating categories of malicious behaviours posed by insiders and mapping these to phrases that would appear in email communications. Furthermore, a large email dataset is classified according to behavioural characteristics of employees. Machine learning algorithms are employed to identify commonly occurring insider threats and to group the occurrences according to insider threat classifications. Dissertation (MSc Computer Science), University of Pretoria, 2020.
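As a hedged sketch of the kind of pipeline described, the snippet below classifies email text into behaviour categories with scikit-learn. The phrases, labels, and model choice are illustrative assumptions, not the dissertation's actual categories or algorithms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labelled corpus mapping email phrases to behaviour categories
emails = [
    "please find the quarterly report attached",
    "send me the customer database before I leave the company",
    "meeting moved to 3pm tomorrow",
    "copy these files to my personal drive tonight",
]
labels = ["benign", "data-exfiltration", "benign", "data-exfiltration"]

# TF-IDF features over unigrams/bigrams feeding a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(emails, labels)

# Score a previously unseen email against the learned behaviour categories
print(clf.predict(["forward the client list to my gmail"]))
```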
190

Big data = Big money? : En kvantitativ studie om big data, förtroende och köp online / A quantitative study of big data, trust, and online purchasing

Lundvall, Helena January 2021 (has links)
Previous research has consistently shown that increased customer trust in purchase situations increases customers' willingness to complete a purchase. The factors that influence customer trust have also been studied extensively, and factors related to the handling of customer data are increasingly cited as decisive. However, these factors are often treated at a general level, and studies that dig into which underlying data-handling factors affect customer trust are lacking. By collecting quantitative data on how customers relate to companies' collection and use of big data, on their trust in e-commerce companies, and on their willingness to make purchases online, this study aims to examine the effect of companies' collection and use of big data on customers' trust in e-commerce companies, and to examine the effect of customers' trust on their willingness to make purchases. The results show that companies' collection of big data has a significant negative effect on customer trust, and that customer trust has a significant positive relationship with purchase intention. Regarding companies' use of big data, however, no significant negative effect on customer trust could be demonstrated.
