21 |
Analýza nestrukturovaného obsahu z veřejně dostupných sociálních médií za pomocí nástroje Watson společnosti IBM / The analyses of unstructured content from publicly available social media by Watson. Šverák, Martin January 2014 (has links)
This graduate thesis deals with the analysis of unstructured data from public social media, in particular data from the social media channels of Vodafone Czech Republic a.s. The thesis is divided into two parts. The first part provides the theoretical background for the second: it describes social media, structured and unstructured data, and the tools used for analysing unstructured data. In the second part, IBM's Watson tool is used for the analysis of publicly available data. A methodology is then designed to guide the analysis process, and this methodology is used to build a pilot application intended to verify the analysis of unstructured data with Watson. The results of the analysis are presented in the conclusion. The main contributions of this thesis are the development of a pilot application of Watson and the verification of its functionality. The pilot application cannot be equated with a complete analysis that Watson could perform, but it may serve as a demonstration of Watson's functionalities.
|
22 |
Navigating the Risks of Dark Data : An Investigation into Personal Safety. Gautam, Anshu January 2023 (has links)
With the exponential proliferation of data, there has been a surge in data generation from diverse sources, including social media platforms, websites, mobile devices, and sensors. However, not all data is readily visible or accessible to the public, leading to the emergence of the concept known as "dark data." This type of data can exist in structured or unstructured formats and can be stored in various repositories, such as databases, log files, and backups. The reasons behind data being classified as "dark" can vary, encompassing factors such as limited awareness, insufficient resources or tools for data analysis, or a perception of irrelevance to current business operations.

This research employs a qualitative research methodology incorporating audio/video recordings and personal interviews to gather data, aiming to gain insights into individuals' understanding of the risks associated with dark data and their behaviors concerning the sharing of personal information online. Through the thematic analysis of the collected data, patterns and trends in individuals' risk perceptions regarding dark data become evident. The findings of this study illuminate the multiple dimensions of individuals' risk perceptions and their influence on attitudes towards sharing personal information in online contexts. These insights provide valuable understanding of the factors that shape individuals' decisions concerning data privacy and security in the digital era. By contributing to the existing body of knowledge, this research offers a deeper comprehension of the interplay between dark data risks, individuals' perceptions, and their behaviors pertaining to online information sharing. The implications of this study can inform the development of strategies and interventions aimed at fostering informed decision-making and ensuring personal safety in an increasingly data-centric world.
|
23 |
Large Language Models as Advanced Data Preprocessors : Transforming Unstructured Text into Fine-Tuning Datasets. Vangeli, Marius January 2024 (has links)
The digital landscape increasingly generates vast amounts of unstructured textual data, valuable for analytics and various machine learning (ML) applications. These vast stores of data, often likened to digital gold, are challenging to process and utilize. Traditional text processing methods, lacking the ability to generalize, typically struggle with unstructured and unlabeled data. For many complex data management workflows, the solution typically involves human intervention in the form of manual curation and labeling, a time-consuming process. Large Language Models (LLMs) are AI models trained on vast amounts of text data; they have remarkable Natural Language Processing (NLP) capabilities and offer a promising alternative. This thesis serves as an empirical case study of LLMs as advanced data preprocessing tools. It explores the effectiveness and limitations of using LLMs to automate and refine traditionally challenging data preprocessing tasks, highlighting a critical area of research in data management. An LLM-based preprocessing pipeline, designed to clean and prepare raw textual data for use in ML applications, is implemented and evaluated. This pipeline was applied to a corpus of unstructured text documents, extracted from PDFs, with the aim of transforming them into a fine-tuning dataset for LLMs. The efficacy of the pipeline was assessed by comparing its results against a manually curated benchmark dataset using two text similarity metrics: the Levenshtein distance and the ROUGE score. The findings indicate that although LLMs are not yet capable of fully replacing human curation in complex data management workflows, they substantially improve the efficiency and manageability of preprocessing unstructured textual data.
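For readers who want to see what that evaluation step can look like, the sketch below scores a pipeline output against a curated reference with the two metrics named above. It is an illustration rather than code from the thesis, and the example strings are invented placeholders.

```python
# Sketch: comparing preprocessed text against a manually curated reference using
# a plain dynamic-programming Levenshtein distance and a unigram ROUGE-1 F1 score.
# Example strings below are placeholders, not thesis data.
from collections import Counter

def levenshtein(a, b):
    """Character-level edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def rouge1_f1(candidate, reference):
    """Unigram ROUGE-1 F1: overlap of word counts between candidate and reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

llm_output = "the model cleans raw text extracted from pdf files"
benchmark = "the model cleans raw text extracted from pdf documents"
print(levenshtein(llm_output, benchmark))        # edit distance in characters
print(round(rouge1_f1(llm_output, benchmark), 3))  # unigram overlap score
```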
|
24 |
Abordagem para integração automática de dados estruturados e não estruturados em um contexto Big Data / Approach for automatic integration of structured and unstructured data in a Big Data context. Keylla Ramos Saes, 22 November 2018 (has links)
The increase in the amount of data available for use has sparked interest in generating knowledge by integrating such data. However, the integration task requires knowledge of the data and of the data models used to represent them; in other words, carrying out data integration requires the participation of computing experts, which limits the scalability of this kind of task. In the Big Data context, this limitation is reinforced by the presence of a wide variety of sources and of heterogeneous data representation models, such as relational models with structured data and non-relational models with unstructured data, and this variety of representations adds complexity to the data integration process. Handling this scenario requires integration tools that reduce or even eliminate the need for human intervention. As a contribution, this work offers the possibility of integrating diverse data representation models and heterogeneous data sources through an approach that allows the use of varied techniques, such as algorithms that compare data by structural similarity and artificial intelligence algorithms, and that, by generating integrating metadata, enables the integration of heterogeneous data. This flexibility to deal with the growing variety of data is provided by the modularity of the proposed architecture, which enables automatic data integration in a Big Data context without the need for human intervention.
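To make the idea of a structural-similarity comparison concrete, the following sketch (an illustrative assumption, not the architecture proposed in the thesis) compares the attribute names of a relational record and of a JSON document with Jaccard similarity and records the overlap as simple integration metadata.

```python
# Illustrative sketch only: comparing the *structure* (attribute names) of a
# relational record and a JSON document with Jaccard similarity. The attribute
# names and the 0.5 threshold are assumptions made for this example.

def jaccard(a, b):
    """Jaccard similarity between two sets of attribute names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Columns of a relational table (structured source).
relational_columns = {"customer_id", "name", "email", "created_at"}

# A document from a NoSQL store (semi-structured source).
json_document = {"customer_id": 42, "name": "Ana", "email": "ana@example.com",
                 "segment": "retail"}

similarity = jaccard(relational_columns, json_document.keys())
print(similarity)  # 0.6 -> the two structures share most attributes

# Hypothetical integration step: keep the shared attributes as "integrating
# metadata" when the structures are similar enough.
if similarity >= 0.5:
    integration_metadata = sorted(relational_columns & set(json_document))
    print(integration_metadata)  # attributes usable as join/merge keys
```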
|
26 |
探索性資料分析方法在文本資料中的應用─以「新青年」雜誌為例 / A Study of Exploratory Data Analysis on Text Data ── A Case study based on New Youth Magazine. 潘艷艷, Pan, Yan Yan Unknown Date (has links)
With economic prosperity and the rapid development of the internet, enormous volumes of data are generated online and offline at every moment, and roughly 80% of them are unstructured data such as text and images. How to quantify such data and choose suitable methods of analysis is therefore key to effectively extracting and exploiting valuable information. For text data, this thesis proposes a standard operating process of exploratory data analysis (EDA) and uses the language changes in New Youth magazine as a case study to show how textual features are selected, quantified, and analyzed.

First, taking each volume as the unit of analysis, the thesis quantifies the textual structure of the volumes of New Youth from multiple perspectives, including the use of characters, sentences, classical and vernacular function words, and the sharing of common vocabulary. Presented through a combination of charts and tables, these measures trace the course and the characteristics of the magazine's language change, covering both the mechanism of the shift from classical Chinese to vernacular Chinese and the evolution of the vernacular itself. Second, based on these exploratory results, textual feature variables that can separate the two language forms are identified; the first and seventh volumes of New Youth serve as training samples, principal component analysis is combined with logistic regression to train a classifier for classical versus vernacular articles, and the fourth volume is used for testing. The results confirm that the extracted textual variables can effectively distinguish articles written in the two language forms. Finally, drawing on the exploratory results and the experience of humanities scholars, the thesis examines the later change in the magazine's language, namely the shift from the vernacular of the May Fourth Movement to a vernacular characterized as "Red Chinese" (the vernacular used in China after World War II). Training on the seventh and eleventh volumes confirms a clear difference between their language forms; applying the classifier to Taiwan's United Daily News and Mainland China's People's Daily reveals a clear difference in language orientation between the two newspapers, with the style of the United Daily News closer to that of the seventh volume and the style of the People's Daily closer to that of the eleventh. This indicates that the People's Daily is likely to have been influenced by the Soviet Union, a finding that merits further study.
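A minimal sketch of the modelling step described above, combining principal component analysis with logistic regression, is given below. The hand-crafted function-word features and toy texts are assumptions made for illustration, not the feature variables or data used in the thesis.

```python
# Sketch (not the thesis code): reduce simple text features with PCA, then
# separate classical (wenyan) from vernacular (baihua) articles with logistic
# regression. The function-word lists and toy texts below are illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

CLASSICAL_WORDS = ["之", "乎", "者", "也", "矣"]    # assumed classical markers
VERNACULAR_WORDS = ["的", "了", "是", "我們", "這"]  # assumed vernacular markers

def features(text):
    """Per-character rates of classical and vernacular function words."""
    n = max(len(text), 1)
    classical = sum(text.count(w) for w in CLASSICAL_WORDS) / n
    vernacular = sum(text.count(w) for w in VERNACULAR_WORDS) / n
    return [classical, vernacular, len(text) / 100.0]

# Toy training articles: label 0 = classical Chinese, 1 = vernacular Chinese.
train_texts = [
    "學而時習之，不亦說乎。人不知而不慍，不亦君子乎。",
    "道可道，非常道；名可名，非常名。無名天地之始。",
    "我們的雜誌是為了青年而辦的，這是大家都知道的事情。",
    "他說今天的天氣很好，我們可以一起去公園散步了。",
]
train_labels = [0, 0, 1, 1]

model = make_pipeline(PCA(n_components=2), LogisticRegression())
model.fit(np.array([features(t) for t in train_texts]), train_labels)

test_text = "擇其善者而從之，其不善者而改之。"
print(model.predict(np.array([features(test_text)])))  # likely [0], i.e. classical
```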
|
27 |
The Information Value of Unstructured Analyst Opinions / Studies on the Determinants of Information Value and its Relationship to Capital Markets. Eickhoff, Matthias 29 June 2017 (has links)
No description available.
|
28 |
La era del big data y el desarrollo económico / The era of Big Data in economic development. Morales Aldava, Zhander Karolly, Hidalgo Barrera, Jefferson 14 July 2021 (has links)
Big Data is a phenomenon that has seen growing public interest since 2012 (Cao, 2017). The appearance of large amounts of data, often unstructured, is a characteristic of the global economy (Amankwah-Amoah, 2016), and according to Suoniemi et al. (2020), companies adopt Big Data because it helps them improve their capabilities in the market.

Using Big Data requires solving a series of challenges relating to technology, people, and organizations (Alharthi et al., 2017). This work therefore begins with a review of the Big Data landscape, presents the challenges identified in the literature examined, and then explains the impact of Big Data on companies and on the global economy.

The research by the various authors presented in this work makes it possible to address the main controversy identified:

• Can the use of Big Data contribute to economic development, or could it aggravate the gaps that currently exist in the economy?

On this controversy, Blazquez and Domenech (2018), Monleon-Getino (2015), and Fuchs (2003) argue that the use of Big Data contributes to economic development, whereas Nuccio and Guerzoni (2018) and Cuquet and Fensel (2018) take the opposing view.

The positions of these authors allow us to describe how Big Data contributes to economic development, as well as the emerging risks that could aggravate the existing gaps in the economy. / Trabajo de Suficiencia Profesional
|
29 |
An Efficient Framework for Processing and Analyzing Unstructured Text to Discover Delivery Delay and Optimization of Route Planning in Realtime / Un framework efficace pour le traitement et l'analyse des textes non structurés afin de découvrir les retards de livraison et d'optimiser la planification de routes en temps réel. Alshaer, Mohammad 13 September 2019
The Internet of Things (IoT) is leading to a paradigm shift within the logistics industry. The advent of IoT has been changing the logistics service management ecosystem. Logistics service providers today use sensor technologies such as GPS or telemetry to collect data in real time while a delivery is in progress. Real-time data collection enables service providers to track and manage their shipment process efficiently, and its key advantage is that it allows providers to act proactively to prevent outcomes such as delivery delays caused by unexpected or unknown events. Furthermore, providers increasingly use data from external sources such as Twitter, Facebook, and Waze, because these sources provide critical information about events such as traffic, accidents, and natural disasters. Data from such external sources enrich the dataset and add value to the analysis; collecting them in real time also makes it possible to use them for on-the-fly analysis and to prevent unexpected outcomes (such as delivery delay) at run time. However, the data are collected raw and need to be processed for effective analysis. Collecting and processing data in real time is an enormous challenge, mainly because the data come from heterogeneous sources at very high speed. The high speed and variety of the data make it difficult to perform complex processing operations such as cleansing, filtering, and handling incorrect data. The variety of data (structured, semi-structured, and unstructured) raises further challenges for processing data both in batch style and in real time, since different types of data may require different processing techniques. A technical framework that enables the processing of heterogeneous data is very challenging to build and is not currently available. In addition, performing data processing operations in real time is difficult: efficient techniques are required to handle high-speed data, which cannot be done using conventional logistics information systems. Therefore, in order to exploit Big Data in logistics service processes, an efficient solution for collecting and processing data in both real time and batch style is critically important. In this thesis, we developed and experimented with two data processing solutions: SANA and IBRIDIA. SANA is built on a Multinomial Naïve Bayes classifier, whereas IBRIDIA relies on Johnson's hierarchical clustering (HCL) algorithm and is a hybrid technology that enables data collection and processing in batch style and in real time. SANA is a service-based solution that deals with unstructured data. It serves as a multi-purpose system for extracting the relevant events, including the context of the event (such as place, location, time, etc.), and it can also be used to perform text analysis on the targeted events. IBRIDIA was designed to process unknown data stemming from external sources and cluster them on the fly in order to gain an understanding of the data that assists in extracting events that may lead to delivery delay.

According to our experiments, both of these approaches show a unique ability to process logistics data. However, SANA is found more promising, since its underlying technology (the Naïve Bayes classifier) outperformed IBRIDIA from a performance-measurement perspective. SANA was designed to generate a knowledge graph from the events collected immediately in real time, without any need to wait, thus drawing maximum benefit from these events. IBRIDIA, in turn, is valuable within the logistics domain for identifying the most influential category of events affecting delivery. Unfortunately, IBRIDIA must wait for a minimum number of events to arrive and always faces a cold start. Because we are interested in re-optimizing the route on the fly, we adopted SANA as our data processing framework.
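As an illustration of the model family SANA builds on, the sketch below trains a generic Multinomial Naïve Bayes classifier to flag short messages as potential delay events. The training messages, labels, and event categories are invented placeholders, not SANA's actual implementation or data.

```python
# Minimal sketch of a Multinomial Naive Bayes event classifier, the model family
# the abstract names for SANA. All texts and labels below are made up for
# illustration; SANA's real feature extraction and event taxonomy are richer.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_messages = [
    "heavy traffic jam on the ring road near the depot",
    "accident blocking two lanes on highway a6",
    "flash flood warning for the city centre this afternoon",
    "great coffee at the new cafe downtown",
    "concert tickets on sale for next month",
    "looking forward to the weekend",
]
# 1 = event likely to delay a delivery, 0 = irrelevant chatter.
train_labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_messages, train_labels)

incoming = ["road closed after an accident close to the warehouse"]
print(model.predict(incoming))        # e.g. [1] -> treat as a delay-relevant event
print(model.predict_proba(incoming))  # class probabilities for downstream routing
```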
|
30 |
Automatic map generation from nation-wide data sources using deep learning. Lundberg, Gustav January 2020 (has links)
The last decade has seen great advances within the field of artificial intelligence. One of the most noteworthy areas is deep learning, which is nowadays used in everything from self-driving cars to automated cancer screening. During the same time, the amount of spatial data encompassing not only two but three dimensions has also grown, and whole cities and countries are being scanned. Combining these two technological advances enables the creation of detailed maps with a multitude of applications, civilian as well as military. This thesis aims at combining two data sources covering most of Sweden, laser data from LiDAR scans and a surface model from aerial images, with deep learning to create maps of the terrain. The target is to learn a simplified version of orienteering maps, as these are created with high precision by experienced map makers and represent how easy or hard it would be to traverse a given area on foot. The performance on different types of terrain is measured, and it is found that open land and larger bodies of water are identified at a high rate, while trails are hard to recognize. It is further investigated how the different densities found in the source data affect the performance of the models; some terrain types, trails for instance, benefit from higher-density data, while other features of the terrain, like roads and buildings, are predicted with higher accuracy from lower-density data. Finally, the certainty of the predictions is discussed and visualised by measuring the average entropy of the predictions in an area. These visualisations highlight that although the predictions are far from perfect, the models are more certain about their predictions when they are correct than when they are not.
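The uncertainty visualisation described above can be sketched in a few lines: the snippet below (an illustration, not the thesis code) computes the average per-pixel entropy of softmax class probabilities over an area, where higher values indicate less certain predictions.

```python
# Sketch: average prediction entropy over an area, as a measure of model certainty.
# The probability map is random placeholder data; in the thesis it would come
# from the segmentation network's softmax output over terrain classes.
import numpy as np

def mean_entropy(probs, eps=1e-12):
    """Average per-pixel entropy of an (H, W, C) map of class probabilities."""
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)  # per-pixel entropy, (H, W)
    return float(entropy.mean())

# Placeholder probabilities for a 64x64 patch and 4 assumed terrain classes
# (e.g. open land, water, trail, forest).
rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 64, 4))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)  # softmax over the class axis

print(mean_entropy(probs))  # higher value = less certain predictions in this area
```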
|