151 |
Scalable Embeddings for Kernel Clustering on MapReduce. Elgohary, Ahmed. 14 February 2014.
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly-used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, that data are available in an attribute-value format, and that each data instance can be represented as a vector in a feature space where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the use of complex data structures in real-world clustering applications.
Kernel k-means is an effective data clustering method that extends the k-means algorithm to work on a similarity matrix defined over complex data structures. The kernel k-means algorithm is, however, computationally demanding, as it requires the complete kernel matrix to be computed and stored. Further, the kernelized nature of the algorithm hinders the parallelization of its computations on modern distributed computing infrastructures. This thesis defines a family of kernel-based low-dimensional embeddings that allows kernel k-means to scale on MapReduce via an efficient and unified parallelization strategy. Three practical low-dimensional embedding methods that adhere to this embedding family are then proposed. Combining the proposed parallelization strategy with any of the three embedding methods yields a complete, scalable, and efficient MapReduce algorithm for kernel k-means. The efficiency and scalability of the presented algorithms are demonstrated analytically and empirically.
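As a rough illustration of the general idea, not of the thesis's actual construction, the sketch below builds an explicit low-dimensional embedding whose inner products approximate a kernel (here a Nyström-style approximation of an RBF kernel, both assumptions) and then runs plain k-means on it; the landmark count, gamma, and data are placeholders.

```python
# Hedged sketch: approximate kernel k-means via a Nystrom-style embedding.
import numpy as np
from sklearn.cluster import KMeans

def rbf_kernel(A, B, gamma=0.1):
    # Pairwise RBF similarities between rows of A and B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_embedding(X, n_landmarks=100, gamma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_landmarks, replace=False)
    L = X[idx]                          # landmark points
    C = rbf_kernel(X, L, gamma)         # n x m slice of the kernel matrix
    W = rbf_kernel(L, L, gamma)         # m x m landmark kernel
    # Embedding E = C W^{-1/2}, so that E E^T approximates the full kernel.
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 1e-12, None)
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return C @ W_inv_sqrt

X = np.random.rand(1000, 20)
E = nystrom_embedding(X, n_landmarks=64)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(E)  # plain k-means on the embedding
```

Because the clustering step reduces to ordinary k-means on short vectors, workers in a distributed setting only need the landmark rows rather than the full n-by-n kernel matrix, which is the property that makes this style of embedding amenable to MapReduce-style parallelization.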
|
152 |
Integrating Fuzzy Decisioning Models With Relational Database Constructs. Durham, Erin-Elizabeth A. 18 December 2014.
Human learning and classification is a nebulous area in computer science. Classic decisioning problems can be solved given enough time and computational power, but discrete algorithms cannot easily solve fuzzy problems. Fuzzy decisioning can resolve more real-world fuzzy problems, but existing algorithms are often slow, cumbersome, and unable to give responses within a reasonable timeframe for anything other than predetermined, small-dataset problems. We have developed a database-integrated, highly scalable solution for training and using fuzzy decision models on large datasets. The Fuzzy Decision Tree algorithm integrates the Quinlan ID3 decision-tree algorithm with fuzzy set theory and fuzzy logic. In existing research, when applied to the microRNA prediction problem, the Fuzzy Decision Tree outperformed other machine learning algorithms, including Random Forest, C4.5, SVM, and k-NN. In this research, we propose that the effectiveness with which large-dataset fuzzy decisions can be resolved via the Fuzzy Decision Tree algorithm is significantly improved when a relational database, rather than traditional storage objects, is used as the storage unit for the fuzzy ID3 objects. Furthermore, we demonstrate that pre-processing parts of the decisioning within the database layer can lead to much swifter membership determinations, especially on Big Data datasets. The proposed algorithm uses concepts inherent to databases (separated schemas, indexing, partitioning, pipe-and-filter transformations, data preprocessing, materialized and regular views, etc.) to present a model with the potential to learn from itself. Further, this work presents a general application model for re-architecting Big Data applications to efficiently present decisioned results: lowering the volume of data handled by the application itself and significantly decreasing response wait times, while retaining the flexibility and permanence of a standard relational SQL database and supplying optimal user satisfaction in today's data analytics world. We experimentally demonstrate the effectiveness of our approach.
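Purely as a hedged, illustrative sketch of two ingredients a fuzzy ID3 split typically relies on (membership degrees and membership-weighted entropy), the snippet below uses triangular fuzzy sets on a toy numeric attribute. The set boundaries, data, and function names are assumptions, and the relational-database preprocessing described above is not shown.

```python
# Hedged sketch: fuzzy memberships and fuzzy entropy for a fuzzy ID3-style split;
# thresholds, set names, and data are illustrative only.
import numpy as np

def triangular(x, a, b, c):
    # Membership degree of x in a triangular fuzzy set with corners (a, b, c).
    return np.clip(np.minimum((x - a) / (b - a + 1e-12),
                              (c - x) / (c - b + 1e-12)), 0.0, 1.0)

def fuzzy_entropy(memberships, labels):
    # Entropy weighted by membership mass rather than crisp counts.
    total = memberships.sum() + 1e-12
    ent = 0.0
    for cls in np.unique(labels):
        p = memberships[labels == cls].sum() / total
        if p > 0:
            ent -= p * np.log2(p)
    return ent

x = np.array([0.1, 0.4, 0.55, 0.8, 0.95])   # one numeric attribute
y = np.array([0, 0, 1, 1, 1])               # class labels
low = triangular(x, 0.0, 0.0, 0.5)          # fuzzy set "low"
high = triangular(x, 0.5, 1.0, 1.0)         # fuzzy set "high"
print(fuzzy_entropy(low, y), fuzzy_entropy(high, y))
```

In the database-integrated setting the abstract describes, membership degrees like these are the kind of values that could be precomputed in views or materialized views so that tree construction and querying touch far less raw data.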
|
153 |
Approches collaboratives pour la classification des données complexes / Collaborative approaches for complex data classification. Rabah, Mazouzi. 12 December 2016.
This thesis focuses on collaborative classification in the context of complex data, in particular Big Data. We draw on several computational paradigms to propose new approaches that exploit high-performance, large-scale computing technologies. In this setting we build massive classifier ensembles, in the sense that the number of elementary classifiers composing the multiple-classifier system can be very high. Conventional methods of interaction between classifiers then no longer apply, and we propose new forms of interaction that are not constrained to combine the predictions of all classifiers in order to build a global prediction. From this perspective we face two problems: the first is the ability of our approaches to scale; the second is the diversity that must be created and maintained within the system to ensure its performance. We therefore study the distribution of classifiers in a cloud-computing environment; such a multiple-classifier system can be massive, and its properties are those of a complex system. Regarding data diversity, we propose an approach that enriches the training data with synthetic data generated from analytical models describing part of the studied phenomenon, so that the resulting data mixture reinforces classifier learning. The experiments carried out show substantial potential for improving classification results.
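As a hedged toy sketch of the "partial interaction" idea described above (a global prediction built without consulting every member of a massive ensemble), the following snippet trains many small classifiers on bootstrap samples and aggregates votes from a random subset only; the ensemble size, the subset size k, and the synthetic data are assumptions, not the thesis's protocol.

```python
# Hedged sketch: a massive ensemble where the overall prediction uses only a
# random subset of member votes (one possible partial-interaction scheme).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

members = []
for _ in range(200):                       # many weak members; could be distributed
    idx = rng.integers(0, len(X), len(X))  # bootstrap sample
    members.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

def predict_partial(x, k=30):
    # Aggregate votes from a random subset of k members instead of all of them.
    chosen = rng.choice(len(members), size=k, replace=False)
    votes = [members[i].predict(x.reshape(1, -1))[0] for i in chosen]
    return np.bincount(votes).argmax()

print(predict_partial(X[0]))
```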
|
154 |
Discovery of novel prognostic tools to stratify high risk stage II colorectal cancer patients utilising digital pathology. Caie, Peter David. January 2015.
Colorectal cancer (CRC) patients are stratified by the Tumour, Node and Metastasis (TNM) staging system for clinical decision making. Additional genomic markers have a limited utility in some cases where precise targeted therapy may be available. Thus, classical clinical pathological staging remains the mainstay of the assessment of this disease. Surgical resection is generally considered curative for Stage II patients, however 20-30% of these patients experience disease recurrence and disease specific death. It is imperative to identify these high risk patients in order to assess if further treatment or detailed follow up could be beneficial to their overall survival. The aim of the thesis was to categorise Stage II CRC patients into high and low risk of disease specific death through novel image based analysis algorithms. Firstly, an image analysis algorithm was developed to quantify and assess the prognostic value of three histopathological features through immunofluorescence: lymphatic vessel density (LVD), lymphatic vessel invasion (LVI) and tumour budding (TB). Image analysis provides the ability to standardise their quantification and negates observer variability. All three histopathological features were found to be predictors of CRC specific death within the training set (n=50): TB (HR = 5.7; 95% CI, 2.38-13.8), LVD (HR = 5.1; 95% CI, 2.04-12.99) and LVI (HR = 9.9; 95% CI, 3.57-27.98). Only TB (HR = 2.49; 95% CI, 1.03-5.99) and LVI (HR = 2.46; 95% CI, 1-6.05), however, were significant predictors of disease specific death in the validation set (n=134). Image analysis was further employed to characterise TB and quantify intra-tumoural heterogeneity. Tumour subpopulations within CRC tissue sections were segmented for the quantification of differential biomarker expression associated with epithelial mesenchymal transition and aggressive disease. Secondly, a novel histopathological feature ‘Sum Area Large Tumour Bud’ (ALTB) was identified through immunofluorescence coupled to a novel tissue phenomics approach. The tissue phenomics approach created a complex phenotypic fingerprint consisting of multiple parameters extracted from the unbiased segmentation of all objects within a digitised image. Data mining was employed to identify the significant parameters within the phenotypic fingerprint. ALTB was found to be a more significant predictor of disease specific death than LVI or TB in both the training set (HR = 20.2; 95% CI, 4.6-87.9) and the validation set (HR = 4; 95% CI, 1.5-11.1). Finally, ALTB was combined with two parameters, ‘differentiation’ and ‘pT stage’, which were exported from the original patient pathology report to form an integrative pathology score. The integrative pathology score was highly significant at predicting disease specific death within the validation set (HR = 7.5; 95% CI, 3-18.5). In conclusion, image analysis allows the standardised quantification of set histopathological features and the heterogeneous expression of biomarkers. A novel image based histopathological feature combined with classical pathology allows the highly significant stratification of Stage II CRC patients into high and low risk of disease specific death.
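For readers unfamiliar with how figures such as "HR = 5.7; 95% CI, 2.38-13.8" are obtained, the sketch below fits a Cox proportional hazards model to a synthetic cohort with one binary image-derived feature. It is purely illustrative: the lifelines library, the column names, the data, and the effect sizes are all assumptions, not the thesis's analysis.

```python
# Hedged sketch: estimating a hazard ratio with a 95% CI for a binary
# image-derived feature (e.g., high vs. low tumour budding); data are synthetic.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 200
budding_high = rng.integers(0, 2, n)                      # 1 = high tumour budding
baseline = rng.exponential(scale=60, size=n)              # months to event
time = baseline / np.where(budding_high == 1, 2.5, 1.0)   # high-risk group fails sooner
event = (rng.random(n) < 0.7).astype(int)                 # 1 = disease-specific death observed

df = pd.DataFrame({"time": time, "event": event, "budding_high": budding_high})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()   # the exp(coef) column is the hazard ratio with its 95% CI
```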
|
155 |
Os dados como base à criação de um método de planejamento de propaganda / Data as the basis for creating an advertising planning method. Lima, Carlos Eduardo de. January 2018.
Advisor: Francisco Machado Filho / Committee: Marcos Americo / Committee: Nirave Reigota Caram / Abstract: This study aims to identify the many transformations that advertising planning has undergone since the advent of the Internet and of communication and information technologies based on big data, machine learning, clustering, and other data intelligence tools. A historical and documentary survey of advertising planning models and creative briefs was carried out. It proved essential to trace a brief historical account of how the planning discipline was conceived for the planner and how this process developed in Brazil, as well as its evolution. It was also necessary to define concepts of big data and innovation in order to identify how they affect the structure and methodologies used in planning until then. The goal is to understand how the planner is being led to develop new competences that span different disciplines, beyond those already applied in the research and creation process of planning. Field research methodologies were employed, with in-depth interviews with heads and directors of planning at communication agencies and with market players recognized for their competence and experience in advertising planning. This research therefore proposes a planning method that, through tools based on software and applications, enables the planning professional to generate innovative ideas and propose new mindsets to agencies. / Master's
|
156 |
MACHINE LEARNING ON BIG DATA FOR STOCK MARKET PREDICTION. Fallahi, Faraz. 01 August 2017.
In recent decades, the rapid development of information technology in the big data field has introduced new opportunities to explore the large amount of data available online. The Global Database of Events, Language, and Tone (GDELT) is the largest, most comprehensive, and highest-resolution open source database of human society; it includes more than 440 million entries capturing information about events that have been covered by local, national, and international news sources since 1979, in over 100 languages. GDELT constructs a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what is happening around the world, what its context is, who is involved, and how the world is feeling about it, every single day. Stock market prediction, in turn, has long been an attractive topic and is extensively studied by researchers in different fields, with numerous studies of the correlation between stock market fluctuations and different data sources derived from the historical data of major world stock indices or from external information such as social media and news. Support Vector Machine (SVM) and Logistic Regression are two of the most widely used machine learning techniques in recent studies. The main objective of this research project is to investigate the worth of information derived from the GDELT project in improving the accuracy of stock market trend prediction, specifically for the price changes of the following days. This research is based on datasets of events from the GDELT database and daily prices of Bitcoin and several other stock market companies and indices from Yahoo Finance, all from March 2015 to May 2017. Multiple machine learning, and specifically classification, algorithms are then applied to the generated datasets, first using only features derived from historical market prices and then including features derived from external sources, in this case GDELT. The performance of each model is evaluated over a range of parameters. Finally, experimental results show that using information gained from GDELT has a direct positive impact on prediction accuracy. Keywords: Machine Learning, Stock Market, GDELT, Big Data, Data Mining
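A minimal sketch of the kind of comparison described above, under the assumption of synthetic stand-ins for both the price series and the GDELT-derived features: two classifier families (Logistic Regression and SVM) are trained on price-only features and then on price plus external event features, and their next-day direction accuracies are compared. The feature names, placeholder event columns, and all numbers are illustrative.

```python
# Hedged sketch: price-only vs. price + external (GDELT-style) features for
# next-day direction classification; data are synthetic placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 500
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, n)))
returns = prices.pct_change().fillna(0)

features = pd.DataFrame({
    "ret_1": returns.shift(1),
    "ret_5": returns.rolling(5).mean().shift(1),
    "event_tone": rng.normal(0, 1, n),        # placeholder for a daily GDELT tone score
    "event_count": rng.poisson(20, n),        # placeholder for a daily GDELT event count
}).fillna(0)
target = (returns.shift(-1) > 0).astype(int)  # next-day up/down label

X_tr, X_te, y_tr, y_te = train_test_split(features[:-1], target[:-1], shuffle=False)
for cols in (["ret_1", "ret_5"], list(features.columns)):
    for model in (LogisticRegression(max_iter=1000), SVC()):
        acc = accuracy_score(y_te, model.fit(X_tr[cols], y_tr).predict(X_te[cols]))
        print(cols, type(model).__name__, round(acc, 3))
```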
|
157 |
Caracterização e modelagem multivariada do desempenho de sistemas de arquivos paralelos / Multivariate characterization and modeling of parallel file system performance. Inacio, Eduardo Camilo. January 2015.
Master's thesis, Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2015.
The amount of digital data generated daily has increased significantly. Consequently, applications need to handle ever larger volumes of data, in a variety of formats and from a variety of sources, at high velocity: the so-called Big Data problem. Since storage devices have not kept up with the performance evolution observed in processors and main memories, they become the bottleneck of these applications. Parallel file systems are software solutions that have been widely adopted to mitigate the input and output (I/O) limitations found in current computing platforms. However, the efficient use of these storage solutions depends on understanding their behavior under different conditions of use. This is a particularly challenging task because of the multivariate nature of the problem, that is, the fact that the overall performance of the system depends on the relationships and influence of a large set of variables. This dissertation proposes an analytical multivariate model to represent storage performance behavior in parallel file systems for different configurations and workloads. An extensive set of experiments, executed in four real computing environments, was conducted in order to identify a significant number of relevant variables, to determine the influence of these variables on overall system performance, and to build and evaluate the proposed model. As a result of the characterization effort, the effect of three factors not explored in previous works is presented. Results of the model evaluation, comparing the behavior and values estimated by the model with those measured in the real environments for different usage scenarios, show that the proposed model successfully represents system performance. Although some deviations were found in the values estimated by the model, the accuracy of the predictions was considered acceptable, given the significantly larger number of usage scenarios evaluated in this research compared to previous proposals in the literature.
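As a hedged illustration of what a multivariate performance model of this kind can look like, and not the dissertation's actual model, the sketch below fits a simple regression of throughput against a few configuration and workload variables, including one interaction term; the variable names, the synthetic response surface, and the choice of linear regression are all assumptions.

```python
# Hedged sketch: fitting a multivariate model of parallel file system throughput
# from configuration and workload variables; names and data are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300
clients = rng.integers(1, 65, n)            # concurrent client processes
request_kb = rng.choice([64, 256, 1024, 4096], n)
stripe_count = rng.integers(1, 9, n)        # number of storage servers used
# Synthetic throughput with diminishing returns and an interaction effect.
throughput = (stripe_count * 120 * (1 - np.exp(-clients / 16))
              * np.log2(request_kb) / 12 + rng.normal(0, 20, n))

X = np.column_stack([clients, np.log2(request_kb), stripe_count,
                     clients * stripe_count])      # include one interaction term
model = LinearRegression().fit(X, throughput)
print(model.coef_, model.score(X, throughput))     # R^2 as a rough goodness-of-fit check
```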
|
158 |
Mapeamento de qualidade de experiência (QoE) através de qualidade de serviço (QoS) focado em bases de dados distribuídas / Mapping quality of experience (QoE) from quality of service (QoS) with a focus on distributed databases. Souza, Ramon Hugo de. January 2017.
Doctoral thesis, Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2017.
The hitherto lack of a congruent conceptualization of quality of service (QoS) for databases (DBs) was the factor that drove the study resulting in this thesis. Defining QoS as a simple check of whether a node is at risk of failure due to the number of accesses, as some commercial systems did at the time of this thesis's bibliometric survey, is an oversimplification of such a complex concept. Other works that claim to deal with these concepts are not mathematically exact and lack concrete definitions, or definitions of sufficient quality to be used or replicated, which makes their application, or even their verification, infeasible. The focus of this study is directed at distributed databases (DDBs), in such a way that the conceptualization developed here is also compatible, at least partially, with non-distributed DB models. The newly developed QoS definitions are then used to deal with the correlated concept of quality of experience (QoE), in a system-level approach focused on QoS completeness. Even though QoE is a multidimensional concept that is hard to measure, the focus is kept on a measurable approach, so that DDB systems can perform self-evaluation. The self-evaluation proposal arises from the need to identify problems amenable to self-correction. With QoS statistically well defined, behavior and behavioral trends can be analyzed in order to predict future states, which allows a correction process to start, through statistical prediction, before unexpected states are reached. The general objective of this thesis is thus the definition of QoS and QoE metrics focused on DDBs, under the hypothesis that QoE can be defined statistically from QoS for system-level purposes; both concepts are new to DDBs when dealing with exact, measurable metrics. With these concepts defined, an architectural recovery model is presented and tested to demonstrate the results obtained when the defined metrics are used for behavioral prediction.
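A minimal sketch, under broad assumptions, of the kind of system-level self-evaluation described above: a QoS metric (query latency) is tracked, a short-term trend is fitted, and recovery is triggered when the predicted value would cross an acceptability threshold that stands in for a QoE "completeness" score. The threshold, horizon, and scoring formula are illustrative, not the thesis's definitions.

```python
# Hedged sketch: trend-based prediction of a QoS metric and a crude QoE proxy;
# all numbers and the scoring rule are illustrative.
import numpy as np

latencies_ms = np.array([12, 13, 12, 15, 18, 22, 27, 33, 41, 50], dtype=float)  # recent samples
threshold_ms = 80.0                                   # assumed limit for acceptable QoE

t = np.arange(len(latencies_ms))
slope, intercept = np.polyfit(t, latencies_ms, 1)     # simple linear trend
horizon = 5                                           # predict a few steps ahead
predicted = intercept + slope * (len(latencies_ms) - 1 + horizon)

qoe_score = max(0.0, 1.0 - predicted / threshold_ms)  # crude completeness-style score in [0, 1]
if predicted > threshold_ms:
    print(f"predicted latency {predicted:.1f} ms exceeds threshold; trigger recovery")
print(f"QoE proxy score: {qoe_score:.2f}")
```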
|
159 |
Aplicação de ETL para a integração de dados com ênfase em big data na área de saúde pública / Application of ETL for data integration with an emphasis on big data in public health. Pinto, Clícia dos Santos. 05 March 2015.
Turning stored data into useful information has become an increasingly large and complex challenge as the volume of data produced every day grows. In recent years, Big Data concepts and technologies have been widely used as a solution for managing large amounts of data in different domains. This work concerns the use of ETL (extract, transform, load) techniques in the development of a pre-processing module for probabilistic record linkage across databases in the public health domain. The use of Spark's distributed processing engine ensures adequate handling of the Big Data context in which this research is situated, producing answers in a timely manner.
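A hedged sketch of what such a Spark-based pre-processing (ETL) step for probabilistic record linkage might look like: fields are standardized and a blocking key is derived so that only records sharing the key are compared in the later linkage stage. The column names, the soundex-plus-birth-year blocking key, and the toy rows are assumptions, not the dissertation's actual pipeline.

```python
# Hedged sketch: Spark ETL pre-processing for probabilistic record linkage;
# column names and the blocking strategy are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("linkage-preprocess").getOrCreate()

records = spark.createDataFrame(
    [(1, "  Maria da Silva ", "1980-03-02"), (2, "MARIA SILVA", "1980-03-02")],
    ["id", "name", "birth_date"],
)

clean = (records
         .withColumn("name_std", F.upper(F.trim(F.regexp_replace("name", r"\s+", " "))))
         .withColumn("birth_year", F.substring("birth_date", 1, 4))
         # Blocking key: only record pairs sharing it are compared during linkage.
         .withColumn("block_key", F.concat_ws("-", F.soundex("name_std"), F.col("birth_year"))))

clean.show(truncate=False)
```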
|
160 |
Uma proposta de modelo conceitual para uso de Big Data e Open Data para Smart Cities / A proposal of a conceptual model for the use of Big Data and Open Data for Smart Cities. Klein, Vinicius Barreto. January 2015.
Master's thesis, Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia e Gestão do Conhecimento, Florianópolis, 2015.
We currently live in a context in which society produces a high volume of data, generated by many different sources, in different formats and schemas, and at an ever increasing speed. This corresponds to the big data phenomenon. Contributing to this phenomenon, the open data movement adds further data sources produced by today's society. Big data and open data sources can serve as input for the generation of knowledge, and smart cities can benefit from this process. Smart cities represent a concept that involves using ICTs (information and communication technologies) as a means of improving the quality of life in today's urban centers. This idea is motivated mainly by the various problems faced by the inhabitants of these cities, such as mismanagement of natural resources, high levels of air pollution, heavy traffic, and crime rates, among other problems caused mainly by the high concentration of people in these places. In this context, in order to identify the main data sources and their characteristics and connect them to the needs of smart cities, a proposal for a conceptual model for smart cities was developed that uses big data and open data as data sources. To this end, an exploratory survey of the topics related to the research was first carried out and organized into the theoretical foundation. Competency questions and other practices of the ontoKEM method for ontology development were then applied, guiding the construction of the model; these questions were answered with the aid of Bunge's CESM model. The model, organized in layers, was then proposed and verified in a usage scenario, where discussions, results, and suggestions for future work were presented.
|