431

Performance assessment of Apache Spark applications

AL Jorani, Salam January 2019 (has links)
This thesis addresses the challenges of large, data-intensive software systems. It discusses a Big Data system consisting of substantial Linux configuration, Scala code, and a set of frameworks that work together to achieve smooth system performance. The thesis focuses on the Apache Spark framework and the challenge of measuring the lazily evaluated transformation operations of Spark. Investigating these challenges is essential for performance engineers: it increases their ability to study how the system behaves and to make decisions in early design iterations. We therefore ran experiments and measurements toward this goal, and after analyzing the results we derived a formula that engineers can use to predict the performance of the system in production.
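The measurement difficulty stems from Spark's lazy evaluation: transformations only build an execution plan, and no work happens until an action runs. A minimal PySpark sketch of the kind of timing experiment this implies; the workload, sizes, and timing approach are illustrative assumptions, not the thesis's actual benchmarks:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-eval-timing").getOrCreate()

    # Transformations are lazy: these lines return immediately and do no work.
    df = spark.range(0, 50_000_000)
    transformed = df.selectExpr("id * 2 AS doubled").filter("doubled % 3 = 0")

    # Only an action triggers execution, so the action is what gets timed:
    # the measured wall time covers the whole transformation chain.
    start = time.perf_counter()
    n = transformed.count()
    elapsed = time.perf_counter() - start
    print(f"count={n}, wall time={elapsed:.2f}s")

    spark.stop()

Because the cost of each transformation only materializes inside the action, attributing time to individual transformations requires varying the chain between runs, which is what makes the measurement challenging.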
432

Os dados como base à criação de um método de planejamento de propaganda / Data as basis for developing an advertising planning method

Lima, Carlos Eduardo de 14 March 2018 (has links)
This study aims to identify the many transformations that advertising planning has undergone since the advent of the Internet and of communication and information technologies based on big data, machine learning, clustering and other data-intelligence tools. To this end, a historical and documentary survey of advertising planning models and creative briefs was carried out. It proved essential to trace a brief historical account of how the planning discipline and the planner's role were conceived, how the process developed in Brazil, and how it has evolved. It was also necessary to define concepts of big data and innovation, in order to identify how they affect the structures and methodologies that planning has used until now. The goal is to understand how planners are being led to develop new competencies spanning disciplines beyond those already applied in the research and creation stages of planning. Field research was conducted through in-depth interviews with heads and directors of planning at communication agencies and with market players renowned for their competence and experience in advertising planning. The research concludes by proposing a planning method which, through tools based on software and applications, enables planning professionals to generate innovative ideas and to bring a new culture of thinking to the agency.
433

Réutilisation de données hospitalières pour la recherche d'effets indésirables liés à la prise d'un médicament ou à la pose d'un dispositif médical implantable / Reuse of hospital data to seek adverse events related to drug administration or the placement of an implantable medical device

Ficheur, Grégoire 11 June 2015 (has links)
Introduction: adverse events associated with drug administration or with the placement of an implantable medical device must be sought systematically once the product is on the market. Studies conducted in this phase are observational and can be performed from hospital databases. The objective of this work is to study the value of re-using hospital data to identify such adverse events.
Materials and methods: two hospital databases covering the years 2007 to 2013 were re-used. The first contains 171 million inpatient stays, including diagnostic codes, procedure codes and demographic data, linked by a unique patient identifier. The second, from a single hospital centre, contains the same kinds of information for 80,000 stays, together with laboratory results, drug administrations and discharge letters for each stay. Four studies were conducted on these data to identify, on the one hand, adverse drug events and, on the other, adverse events following the placement of an implantable medical device.
Results: the first study demonstrates the ability of a set of detection rules to identify hyperkalaemia-type adverse drug events automatically. The second describes the variation of a laboratory parameter associated with the presence of a frequent sequential pattern composed of drug administrations and laboratory results. The third resulted in a web tool for exploring, on the fly, the reasons for rehospitalisation of patients who received an implantable medical device. The fourth and final study estimates the thrombotic and bleeding risks following total hip replacement.
Conclusion: the re-use of hospital data from a pharmacoepidemiological perspective allows the identification of adverse events associated with drug administration or with the placement of an implantable medical device. The value of these data lies in the statistical power they bring and in the multiplicity of association analyses they permit.
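The detection rules in the first study pair drug administrations with subsequent laboratory values. A minimal sketch of one such rule, assuming hypothetical table layouts and an illustrative 5.5 mmol/L potassium threshold (the thesis's actual rule set is not reproduced here):

    import pandas as pd

    # Hypothetical inpatient lab results; column names and the 5.5 mmol/L
    # threshold are illustrative assumptions, not the thesis's actual rules.
    labs = pd.DataFrame({
        "stay_id":  [1, 1, 2, 2, 3],
        "test":     ["K+", "K+", "K+", "K+", "K+"],
        "value":    [4.1, 6.2, 4.8, 4.9, 5.9],   # serum potassium, mmol/L
        "taken_at": pd.to_datetime(["2013-01-02", "2013-01-04",
                                    "2013-02-10", "2013-02-12", "2013-03-01"]),
    })
    drugs = pd.DataFrame({
        "stay_id":  [1, 3],
        "drug":     ["potassium chloride", "spironolactone"],
        "given_at": pd.to_datetime(["2013-01-03", "2013-02-27"]),
    })

    # Rule: flag a stay when a potassium-raising drug is administered and a
    # potassium result above threshold follows within 72 hours.
    merged = labs.merge(drugs, on="stay_id")
    flagged = merged[
        (merged["value"] > 5.5)
        & (merged["taken_at"] > merged["given_at"])
        & (merged["taken_at"] - merged["given_at"] <= pd.Timedelta("72h"))
    ]
    print(flagged[["stay_id", "drug", "value"]].drop_duplicates())

The same join-then-threshold pattern generalises to other laboratory-defined adverse events by swapping the test, threshold, and drug list.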
434

Internet, big data e discurso de ódio: reflexões sobre as dinâmicas de interação no Twitter e os novos ambientes de debate político / Internet, big data and hate speech: reflections on interaction dynamics on Twitter and the new environments of political debate

Cappi, Juliano 23 November 2017 (has links)
This research analyses the relations between, on the one hand, the interaction dynamics that have consolidated in digital social networks and, on the other, the increase of online hate speech within these environments. Its objectives are, first, to investigate through the lens of cultural diversity the consequences of the increasingly widespread use of social network environments in political debate, and, second, to investigate possible patterns of dissemination of hate speech in the new sphere of debate that emerges in these environments. The violence manifested in social networks has taken on contours of racial prejudice, misogyny, homophobia and totalitarianism, often spilling beyond the limits of cyberspace. The analysis of the Twitter debates around the conviction of former President Luis Inácio Lula da Silva showed that the filter-bubble phenomenon was present and followed patterns already identified in international research. The analysis of the posts suggests that the construction of mutual identifications between groups of users ends up authorising a systematic discourse of disrespect for dignity, built on characterisations of Lula as a whore, a drunkard, a hobo and a thief. The theoretical framework uses Baitello's notion of the communicational environment to support the assumption that the construction of identity, and therefore of the notion of alterity, is increasingly bound to the environments provided by the Internet applications present in our daily lives. If an environment is a construction associated with subjectivity, an atmosphere generated by the availability of subjects (people and things) and by their intentionality to establish bonds, then the interaction environments of cyberspace contribute to structuring the bonds that are so important for the construction of identity. The research also draws on Eli Pariser's concept of filter bubbles: the new digital browsing environments are bubbles of familiarity, structured by algorithmic systems for collecting, analysing, classifying and distributing information, within which users are embedded. Pariser disputes the widely accepted belief that the Internet environment favours contact with a diversity of expressions. The research extends this approach by proposing that the bubbles often manifest through ideological proximity. Finally, Eugênio Trivinho's concept of cybercultural dromocracy grounds the violent condition in which the recognition of alterity takes place in modern society. The methodological framework is centred on social network analysis (SNA), following the work of Raquel Recuero.
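The social network analysis the methodology refers to can be sketched as: build an interaction graph, partition it into communities, and count cross-community ties, whose scarcity is one signal of bubble-like insulation. The sketch below runs on invented data, and the use of networkx is an assumption; the thesis does not specify tooling:

    import networkx as nx

    # Hypothetical retweet edges (retweeter, author); the data are invented
    # for illustration, not drawn from the thesis's Twitter corpus.
    edges = [
        ("ana", "bia"), ("bia", "ana"), ("ana", "caio"), ("caio", "bia"),
        ("dani", "eli"), ("eli", "dani"), ("dani", "fabi"), ("fabi", "eli"),
        ("caio", "dani"),  # single bridge between the two clusters
    ]
    G = nx.Graph(edges)

    # Greedy modularity communities approximate the "bubbles": densely
    # connected groups with few edges between them.
    communities = list(nx.community.greedy_modularity_communities(G))
    membership = {node: i for i, c in enumerate(communities) for node in c}

    cross = sum(1 for u, v in G.edges() if membership[u] != membership[v])
    print("communities:", [sorted(c) for c in communities])
    print(f"cross-community edges: {cross} of {G.number_of_edges()}")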
435

A importância dos 2 Vs – Velocidade e Variedade – do Big Data em situações de busca da internet: um estudo envolvendo alunos do ensino superior / The importance of Big Data's two Vs – Velocity and Variety – in Internet search situations: a study involving higher-education students

Kadow, André Luis Dal Santo 06 December 2017 (has links)
The amount of digital data, structured or not, generated every day is enormous, and it is possible to observe how the companies that hold this information use the mass of past data to try to predict, or even shape, users' future behaviour. This is the concept of Big Data at work on ever more diverse fronts, driven by algorithms, the Internet of Things (IoT) and more. Merely holding a large amount of data, however, is far less significant than understanding and exploiting its use within the processes involved. Authors such as Turing, Searle and Andersen anticipated the importance and impact of this myriad of data long ago. The analysis of these different points of view is the starting point for understanding how autonomous systems that learn from data, captured from or spontaneously given by users, lead to the central idea of this thesis: to analyse and understand the influence of two of Big Data's Vs, Velocity and Variety, on the digital life of young university students, using a class from a specific faculty as the case study.
436

Utilização da estatística e Big Data na Copa do Mundo FIFA 2014 / Use of statistics and Big Data at the 2014 FIFA World Cup

Benetti, Felipe Nogueira 12 December 2017 (has links)
The objective of this study was to show the importance of statistical analysis and Big Data for the development of sport, especially soccer, and for the results obtained by the German national team (specifically, its victory at the 2014 FIFA World Cup in Brazil). The work covered the emergence of statistics and the types of analysis most used to obtain results with Big Data, including its definition and its contributions to the daily lives of people and companies with access to the Internet and smartphones. It also noted which sports use large-scale data processing with statistical analysis to improve training and matches. Finally, it discussed the importance of Big Data to the German team's World Cup victory in Brazil, the motives behind this investment and the results obtained from the partnership. The work was prepared according to the standards of the Brazilian Association of Technical Standards (ABNT).
437

Opportunities and challenges of Big Data Analytics in healthcare : An exploratory study on the adoption of big data analytics in the Management of Sickle Cell Anaemia.

Saenyi, Betty January 2018 (has links)
Background: With increasing technological advancement, healthcare providers are adopting electronic health records (EHRs) and new health information technology systems. Data from these systems is accumulating at a growing rate, creating a need for more robust ways of capturing, storing and processing it. Big data analytics is used to extract insight from such large amounts of medical data and is increasingly becoming a valuable practice for healthcare organisations. Could these strategies be applied to disease management, especially for rare conditions like Sickle Cell Disease (SCD)? The study answers the following research questions:
1. What data management practices are used in sickle cell anaemia management?
2. What areas in the management of sickle cell anaemia could benefit from the use of big data analytics?
3. What are the challenges of applying big data analytics in the management of sickle cell anaemia?
Purpose: The purpose of this research was to serve as a pre-study establishing the opportunities and challenges of applying big data analytics to the management of SCD.
Method: The study adopted both deductive and inductive approaches. Data was collected through interviews based on a framework modified specifically for this study, then inductively analysed to answer the research questions.
Conclusion: Although big data analytics holds much potential for SCD in areas like population health management, evidence-based medicine and personalised care, its adoption is not assured, owing to the lack of interoperability between existing systems and the strenuous legal compliance processes involved in data acquisition.
438

Development of computational approaches for whole-genome sequence variation and deep phenotyping

Haimel, Matthias January 2019 (has links)
The rare disease pulmonary arterial hypertension (PAH) results in high blood pressure in the lung caused by narrowing of the lung arteries. Genes causative in PAH were discovered through family studies and very often harbour rare variants. However, the genetic cause in heritable (31%) and idiopathic (79%) PAH cases is not yet known, but is speculated to lie in rare variants. Advances in high-throughput sequencing (HTS) technologies have made it possible to detect variants in 98% of the human genome, and a drop in sequencing costs made it feasible to sequence 10,000 individuals, including 1,250 subjects diagnosed with PAH and their relatives, as part of the NIHR BioResource - Rare Diseases (BR-RD) study. This large cohort allows the genome-wide identification of rare variants to discover novel causative genes associated with PAH in a case-control study, advancing our understanding of the underlying aetiology. In the first part of my thesis, I establish a phenotype capture system that allows research nurses to record clinical measurements and other patient-related information for PAH patients recruited to the NIHR BR-RD study. The implemented extensions provide programmatic data transfer and an automated release pipeline for analysis-ready data. The second part is dedicated to the discovery of novel disease genes in PAH. I focus on one well-characterised PAH disease gene to establish variant filter strategies that enrich for rare disease-causing variants. I apply these filter strategies to all known PAH disease genes and describe the phenotypic differences based on clinically relevant values. Genome-wide results from different filter strategies are tested for association with PAH. I describe the findings of the rare-variant association tests and provide a detailed interrogation of two novel disease genes. The last part describes the data characteristics of variant information and the available non-SQL (NoSQL) implementations, and evaluates the suitability and scalability of distributed compute frameworks for storing and analysing population-scale variation data. Based on this evaluation, I implement a variant analysis platform that incrementally merges samples, annotates variants and enables the analysis of 10,000 individuals in minutes; an incremental design for variant merging and annotation has not been described before. Using the framework, I develop a quality score to reduce technical variation and other biases, and compare the result of the rare-variant association test with traditional methods.
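A variant filter strategy of the kind described typically keeps variants that are both rare in a reference population and predicted to damage the protein. A hedged sketch; the field names, thresholds, and example genes are illustrative assumptions rather than the thesis's actual criteria:

    # Keep variants that are rare in the population and predicted deleterious.
    # Field names and thresholds are illustrative assumptions.
    variants = [
        {"gene": "BMPR2", "gnomad_af": 0.00001, "consequence": "stop_gained"},
        {"gene": "BMPR2", "gnomad_af": 0.02,    "consequence": "missense_variant"},
        {"gene": "GDF2",  "gnomad_af": 0.00005, "consequence": "missense_variant"},
        {"gene": "EGLN1", "gnomad_af": 0.0,     "consequence": "synonymous_variant"},
    ]

    MAX_AF = 0.0001  # "rare": below 1 in 10,000 in the reference population
    DAMAGING = {"stop_gained", "frameshift_variant", "splice_donor_variant",
                "splice_acceptor_variant", "missense_variant"}

    rare_candidates = [
        v for v in variants
        if v["gnomad_af"] <= MAX_AF and v["consequence"] in DAMAGING
    ]
    for v in rare_candidates:
        print(v["gene"], v["consequence"], v["gnomad_af"])

In a case-control design, the carriers of the variants that survive filtering are then counted per gene in cases versus controls for the rare-variant association test.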
439

Big Data : le nouvel enjeu de l'apprentissage à partir des données massives / Big Data: the new challenge of learning from massive data

Adjout Rehab, Moufida 01 April 2016 (has links)
In recent years we have witnessed tremendous growth in the volume of data generated, partly due to the continuous development of information technologies. Managing these amounts of data requires fundamental changes in the architecture of data management systems in order to adapt to large and complex data. Single machines do not have the capacity to process such massive data, which motivates the need for scalable solutions. This thesis focuses on building scalable data management systems for treating large amounts of data. Our objective is to study the scalability of supervised machine learning methods in large-scale scenarios. In most existing algorithms and data structures there is a trade-off between efficiency, complexity and scalability. To address these issues, we explore recent techniques for distributed learning in order to overcome the limitations of current learning algorithms. Our contribution consists of two new machine learning approaches for large-scale data. The first, MLR-MR, tackles the scalability of Multiple Linear Regression in distributed environments: it learns quickly from massive volumes of existing data using parallel computing and a divide-and-conquer approach, while producing the same coefficients as the classic method (based on QR factorisation). The second, Bagging MR_PR_D (Bagging-based MapReduce with Distributed Pruning), introduces a scalable approach to ensembles of models in which both learning and model pruning are deployed in a distributed environment. Both approaches were evaluated on a variety of regression datasets ranging from a few thousand to several million examples. The experimental results show that they are competitive in predictive performance while significantly reducing training and prediction time.
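The MLR-MR idea, distributing the QR-based solution so the coefficients match the classic method, can be illustrated with a single-machine simulation of the map and reduce steps. This numpy sketch reflects the general structure of such divide-and-conquer QR schemes and is an assumption about MLR-MR's internals, not its actual implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, blocks = 1_000_000, 5, 10          # rows, features, "map" partitions
    X = rng.normal(size=(n, p))
    beta_true = np.arange(1, p + 1, dtype=float)
    y = X @ beta_true + rng.normal(scale=0.1, size=n)

    # "Map" step: each partition computes a local QR factorisation and the
    # projected response Q_i^T y_i; only the small (p x p) R_i and (p,) c_i
    # factors need to be shuffled, never the raw rows.
    partial = []
    for Xi, yi in zip(np.array_split(X, blocks), np.array_split(y, blocks)):
        Qi, Ri = np.linalg.qr(Xi)            # reduced QR of the local block
        partial.append((Ri, Qi.T @ yi))

    # "Reduce" step: stack the factors and solve the small least-squares
    # problem; algebraically this yields the same coefficients as a single
    # QR factorisation of the full matrix.
    R_stack = np.vstack([Ri for Ri, _ in partial])
    c_stack = np.concatenate([ci for _, ci in partial])
    beta_hat, *_ = np.linalg.lstsq(R_stack, c_stack, rcond=None)
    print(np.round(beta_hat, 3))             # ~ [1. 2. 3. 4. 5.]

Only the small per-block factors cross the network in such a scheme, which is what lets the computation scale with the number of machines.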
440

Extreme Learning Machines: novel extensions and application to Big Data

Akusok, Anton 01 May 2016 (has links)
Extreme Learning Machine (ELM) is a recently introduced way of training single-layer feed-forward neural networks with an explicitly given solution, which exists because the input weights and biases are generated randomly and never change. The method generally achieves performance comparable to error back-propagation, but the training time is up to five orders of magnitude smaller. Despite the random initialization, the regularization procedures explained in the thesis ensure consistently good results. While the general methodology of ELMs is well developed, the sheer speed of the method enables its atypical use in state-of-the-art techniques based on repeated model re-training and re-evaluation. Three such techniques are explained in the third chapter: a way of visualizing high-dimensional data onto a provided fixed set of visualization points; an approach for detecting samples in a dataset with incorrect labels (mistakenly assigned, mistyped, or labelled with low confidence); and a way of computing confidence intervals for ELM predictions. All three methods prove useful and open up further applications. The ELM method is a promising basis for dealing with Big Data because it naturally handles large data sizes. An adaptation of ELM to Big Data problems, and a corresponding toolbox (published and freely available), are described in chapter 4. The adaptation includes an iterative solution of ELM that satisfies limited computer-memory constraints and allows for convenient parallelization. Other tools are GPU-accelerated computation and support for a convenient storage format for huge data. The chapter also provides two real-world examples of dealing with Big Data using ELMs, which exhibit other Big Data problems such as veracity and velocity, and presents solutions to them in the particular problem context.
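The explicit ELM solution the abstract describes can be written in a few lines: a random, fixed hidden layer followed by a least-squares solve for the output weights. A minimal numpy sketch, with toy data and a small ridge term standing in for the regularization procedures mentioned (both are assumptions):

    import numpy as np

    rng = np.random.default_rng(42)

    # Toy regression data standing in for a real dataset (an assumption).
    X = rng.uniform(-1, 1, size=(2000, 3))
    y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.05, size=2000)

    # ELM: random, fixed input weights and biases project the inputs into a
    # hidden feature space...
    L = 200                                   # number of hidden neurons
    W = rng.normal(size=(X.shape[1], L))      # input weights, never trained
    b = rng.normal(size=L)                    # biases, never trained
    H = np.tanh(X @ W + b)                    # hidden-layer output matrix

    # ...and the output weights are the explicit least-squares solution,
    # here with a small ridge term for numerical stability.
    alpha = 1e-3
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(L), H.T @ y)

    pred = H @ beta
    print("train RMSE:", np.sqrt(np.mean((pred - y) ** 2)))

Because only the linear solve for beta constitutes training, re-training the model inside a loop is cheap, which is what makes the repeated re-training techniques of chapter 3 practical.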
