171 |
Big data of tree species distributions: how big and how good?
Serra-Diaz, Josep M., Enquist, Brian J., Maitner, Brian, Merow, Cory, Svenning, Jens-C. (15 January 2018)
Background: Trees play crucial roles in the biosphere and in societies worldwide, with a total of 60,065 tree species currently identified. Increasingly, a large amount of data on tree species occurrences is being generated worldwide, from inventories to pressed plants. While many of these data are available in big databases, several challenges hamper their use, notably geolocation problems and taxonomic uncertainty. Further, we lack a complete picture of the data coverage and quality of open/public databases of tree occurrences.
Methods: We combined data from five major aggregators of occurrence data (Global Biodiversity Information Facility, Botanical Information and Ecological Network v.3, DRYFLOR, RAINBIO and Atlas of Living Australia) by creating a workflow to integrate, assess and control the quality of tree species occurrence data for species distribution modeling. We further assessed the coverage - the extent of geographical data - of five economically important tree families (Arecaceae, Dipterocarpaceae, Fagaceae, Myrtaceae, Pinaceae).
Results: Globally, we identified 49,206 tree species (84.69% of the total tree species pool) with occurrence records. The total number of occurrence records was 36.69 M, of which 6.40 M could be considered high-quality records for species distribution modeling. The results show that Europe, North America and Australia have considerable spatial coverage of tree occurrence data. Conversely, key biodiverse regions such as South-East Asia, central Africa and parts of the Amazon are still characterized by gaps in open/public geographical data. Such gaps are found even for economically important tree families, although their overall ranges are covered. Only 15,140 species (26.05%) had at least 20 high-quality records.
Conclusions: Our geographical coverage analysis shows that a wealth of easily accessible data exists on tree species occurrences worldwide, but regional gaps and coordinate errors are abundant. Assessing tree distributions will therefore require accurate occurrence quality-control protocols, as well as key collaborations and data aggregation, especially from national forest inventory programs, to improve the currently available public data.
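To illustrate the kind of record-level quality control such a workflow performs, the following sketch filters out occurrences with implausible or (0, 0) coordinates and counts which species retain at least 20 clean records. It is a minimal illustration with hypothetical column names (species, lat, lon), not the authors' actual pipeline.

```python
# Minimal occurrence quality-control sketch (hypothetical column names).
import pandas as pd

def clean_occurrences(df: pd.DataFrame) -> pd.DataFrame:
    """Keep records with plausible, non-trivial coordinates and a species name."""
    ok = (
        df["lat"].between(-90, 90)
        & df["lon"].between(-180, 180)
        & ~((df["lat"] == 0) & (df["lon"] == 0))   # common geolocation error
        & df["species"].notna()
    )
    return df[ok].drop_duplicates(subset=["species", "lat", "lon"])

def modelable_species(df: pd.DataFrame, min_records: int = 20) -> pd.Series:
    """Species with enough clean records for species distribution modeling."""
    counts = clean_occurrences(df).groupby("species").size()
    return counts[counts >= min_records]

if __name__ == "__main__":
    occ = pd.DataFrame({
        "species": ["Quercus robur"] * 25 + ["Pinus pinea"] * 5,
        "lat": [48.1 + i * 0.01 for i in range(25)] + [0, 41.2, 95.0, 41.3, 41.4],
        "lon": [11.5 + i * 0.01 for i in range(25)] + [0, 2.1, 2.2, 2.3, 2.4],
    })
    print(modelable_species(occ))   # only Quercus robur passes the 20-record bar
```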
172 |
Improvements on Scientific System Analysis
Grupchev, Vladimir (01 January 2015)
Thanks to advances in modern computer simulation systems, many scientific applications generate, and require the manipulation of, large volumes of data. Scientific exploration relies substantially on effective and accurate data analysis. The sheer size of the generated data, however, poses major challenges for analyzing the simulated system. In this dissertation we propose novel techniques, as well as novel uses of some known designs, to improve scientific data analysis.
We develop an efficient method to compute an analytical query called the spatial distance histogram (SDH). Special heuristics are exploited to process the SDH efficiently and accurately. We further develop a mathematical model to analyze the mechanism leading to errors. This gives rise to a new approximate algorithm with an improved time/accuracy tradeoff.
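For context, the SDH of a simulation frame is the histogram of all pairwise particle distances with buckets of a fixed width. The brute-force version below is only an illustration of that definition; the dissertation's algorithms are precisely about avoiding this O(N^2) computation at scale.

```python
# Brute-force spatial distance histogram (SDH) for one simulation frame.
# Illustrative only: the dissertation's algorithms avoid the O(N^2) pair loop.
import numpy as np

def sdh(points: np.ndarray, bucket_width: float, num_buckets: int) -> np.ndarray:
    """Histogram of all pairwise distances, with equal-width buckets."""
    n = len(points)
    # Pairwise distance matrix; O(N^2) memory, fine only for small N.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(n, k=1)                      # each unordered pair once
    edges = np.arange(num_buckets + 1) * bucket_width
    hist, _ = np.histogram(dists[iu], bins=edges)
    return hist

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.uniform(0.0, 10.0, size=(500, 3))     # 500 atoms in a 10x10x10 box
    print(sdh(frame, bucket_width=2.0, num_buckets=9))
```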
Known molecular simulation (MS) analysis systems follow a pull-based design, in which the executed queries request the data they need. Such a design introduces redundant and heavy I/O traffic as well as CPU/data latency. To remedy these issues, we design and implement a push-based system, which uses a sequential scan-based I/O framework to push the loaded data to a number of pre-programmed queries.
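The push-based pattern can be sketched in a few lines: one sequential scan over the data, with every pre-registered query consuming each chunk as it is pushed. The code below is a schematic illustration of that pattern with hypothetical query classes, not the system's actual I/O framework.

```python
# Schematic push-based processing: one sequential scan feeds all registered queries.
import numpy as np

class CountQuery:
    """Counts atoms seen across all pushed chunks."""
    def __init__(self):
        self.total = 0
    def consume(self, chunk: np.ndarray) -> None:
        self.total += len(chunk)

class BoundingBoxQuery:
    """Tracks the min/max coordinates seen so far."""
    def __init__(self):
        self.lo, self.hi = None, None
    def consume(self, chunk: np.ndarray) -> None:
        lo, hi = chunk.min(axis=0), chunk.max(axis=0)
        self.lo = lo if self.lo is None else np.minimum(self.lo, lo)
        self.hi = hi if self.hi is None else np.maximum(self.hi, hi)

def push_scan(chunks, queries) -> None:
    """Single scan: every chunk is pushed to every query, so the data is read once."""
    for chunk in chunks:
        for q in queries:
            q.consume(chunk)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    chunks = (rng.uniform(0, 10, size=(1000, 3)) for _ in range(5))  # stands in for a file scan
    queries = [CountQuery(), BoundingBoxQuery()]
    push_scan(chunks, queries)
    print(queries[0].total, queries[1].lo, queries[1].hi)
```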
The efficiency of the proposed system, as well as of the approximate SDH algorithms, is backed by the results of extensive experiments on MS-generated data.
173 |
Användningsområden för Big data inom analytisk CRM
Nilsson, Per (January 2014)
Customer Relationship Management (CRM) is a widely used concept for organizations to manage their customer contacts. An important part of CRM is the use of technical solutions to store and analyze information about customers, for example through data mining to discover patterns in customer behavior. Today, ever larger amounts of data are produced through people's use of information and communication technology. Traditional technology cannot handle the variety and volume of data that now exists, which has led to the development of new technical solutions for these tasks. The term Big Data is commonly used to describe such large data sets. The purpose of this study has been to provide a better understanding of how Big Data can be used within CRM. To achieve this, the study has examined whether Big Data can meet the needs of future data mining in CRM. The qualitative study was carried out through a literature review of CRM and Big Data, followed by semi-structured interviews with Swedish IT consultants. The results suggest that Big Data technology may be a possible solution to the identified needs of future data mining, among them the ability to use a larger number of data sources and to handle large data volumes. The results also suggest that there are problem areas that must be considered. One problem with Big Data and the use of external data is the uncertainty about the reliability of the information. There is also an ongoing discussion of personal privacy and the ethical issues that handling personal data may entail.
174 |
Estructura espacial urbana de movilidad desde datos masivos de transporte público en Santiago de Chile
Hernández Godoy, Felipe Andrés (January 2017)
Magíster en Ciencias, Mención Computación.
Ingeniero Civil en Computación / Urban spatial structure refers to the arrangement of space in the city, the product of its current form together with the underlying relationships between its parts. These interactions are generated by the movement of people, goods or information between an origin and a destination, and are framed within a concept of the city understood as a collection of interrelated components, chief among them the activity system, the transport system and the relationships that arise between them.
This work seeks to characterize the spatial structure of Santiago de Chile through three indicators: centers (zones of the city capable of attracting or concentrating people), pass-through centers (zones that connect pairs of zones, acting as spatial bridges) and communities (zones that show a strong level of internal interaction). The methodology applies network analysis and spatial analysis to smart card data from public transport fare payments.
The data are obtained from smart card validations recorded in part of Santiago's public transport system (Transantiago) between 14 and 21 April 2013 (one week). Using the validations and an alighting-estimation process, it is possible to estimate trip stages and destinations for roughly 80% of trips.
From the origin, stages and destination of each trip, a weighted directed network is built, and each indicator is mapped to a network metric: centers are associated with PageRank, pass-through centers with betweenness, and communities with Infomap.
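As a rough illustration of that mapping from trips to network metrics, the sketch below builds a weighted directed zone-to-zone graph and computes PageRank and betweenness with NetworkX; community detection here uses greedy modularity as a stand-in for Infomap, which requires a separate package. Zone names and trip counts are hypothetical, not the thesis data.

```python
# Zone-to-zone trips -> weighted directed graph -> the three indicators.
# Hypothetical zones and trip counts; greedy modularity stands in for Infomap.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

trips = [
    ("Maipu", "Centro", 1200), ("Centro", "Maipu", 900),
    ("Puente Alto", "Centro", 1500), ("Centro", "Providencia", 800),
    ("Providencia", "Las Condes", 700), ("Las Condes", "Centro", 600),
    ("Maipu", "Providencia", 300), ("Puente Alto", "La Florida", 400),
    ("La Florida", "Centro", 500),
]

G = nx.DiGraph()
for origin, destination, n_trips in trips:
    # Betweenness treats edge weights as distances, so also store 1/trips as a "length".
    G.add_edge(origin, destination, weight=n_trips, length=1.0 / n_trips)

centers = nx.pagerank(G, weight="weight")                      # zones that attract trips
pass_through = nx.betweenness_centrality(G, weight="length")   # bridging zones
communities = greedy_modularity_communities(G.to_undirected(), weight="weight")

print(sorted(centers.items(), key=lambda kv: -kv[1])[:3])
print(sorted(pass_through.items(), key=lambda kv: -kv[1])[:3])
print([sorted(c) for c in communities])
```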
The results show that the city retains its structure, with a central business district (CBD) in the center-east sector; that the pass-through centers are strongly influenced by the metro network; and that the communities are spatially cohesive, with all of them represented in the CBD, showing that the city center is a territory that belongs to everyone. Compared with Singapore, this points to a city structure closer to monocentrism.
175 |
Parallelization of Push-based System for Molecular Simulation Data Analysis with GPU
Akhmedov, Iliiazbek (19 October 2016)
Modern simulation systems generate large amounts of data, which consequently have to be analyzed in a timely fashion. Traditional database management systems follow the principle of pulling the needed data, processing it, and then returning the results. This approach is then optimized by means of caching, storing the data in different structures, or sacrificing some precision in the results to make it faster. When it comes to queries that require analysis of the whole data set, this design has the following disadvantages: considerable overhead from the traditional random-I/O framework while reading the simulation output files, and low data throughput that consequently results in long latency; and if any indexing is used to optimize selections, the overhead of storing the indexes becomes too big as well. Beyond that, indexing also delays write operations, and since most of the queries work with the entire data set, indexing loses its point.
There is a newer approach to this problem, the Push-based System for Molecular Simulation Data Analysis for processing a network of queries, proposed in a previous paper. Its primary steps are: i) it uses a traditional scan-based I/O framework to load the data from files into main memory, and then ii) the data is pushed through a network of queries, which filter the data and collect all the needed information, increasing efficiency and data throughput. This has a considerable advantage in the analysis of molecular simulation data, because such analysis normally requires all the data to be processed by the queries.
In this paper, we propose an improved version of the Push-based System for Molecular Simulation Data Analysis. Its major difference from the previous design is the use of the GPU for the actual processing part of the data flow. Using the same scan-based I/O framework, the data is pushed through the network of queries, which are processed by the GPU; due to the nature of scientific simulation data, this gives a big advantage in processing it faster and more easily (this is explained further in later sections). In the old approach there were custom data structures, such as a quad-tree for the calculation of histograms, to make the processing faster, and those involved a loss of data and certain assumptions about the nature of the data. In the new approach, owing to the high performance and nature of GPU processing, such custom data structures were hardly needed, yet this did not incur any loss in precision or performance.
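To give a rough sense of why the GPU helps with this kind of per-frame number crunching, the sketch below moves the brute-force distance histogram shown earlier onto the GPU, using CuPy as a stand-in for the custom CUDA kernels described here. It is an assumption-laden illustration (it assumes a CUDA-capable GPU and the cupy package), not the system's implementation.

```python
# GPU-side distance histogram for one frame, using CuPy as a stand-in for custom CUDA.
import cupy as cp

def gpu_sdh(frame_cpu, bucket_width: float, num_buckets: int):
    pts = cp.asarray(frame_cpu, dtype=cp.float32)          # host -> device copy
    diffs = pts[:, None, :] - pts[None, :, :]               # pairwise differences on the GPU
    dists = cp.sqrt((diffs ** 2).sum(axis=-1))
    mask = cp.triu(cp.ones((len(pts), len(pts)), dtype=bool), k=1)   # each unordered pair once
    edges = cp.arange(num_buckets + 1, dtype=cp.float32) * bucket_width
    hist, _ = cp.histogram(dists[mask], bins=edges)
    return cp.asnumpy(hist)                                  # device -> host for the caller

if __name__ == "__main__":
    import numpy as np
    frame = np.random.default_rng(0).uniform(0, 10, size=(2000, 3))
    print(gpu_sdh(frame, bucket_width=2.0, num_buckets=9))
```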
176 |
Scaling Big Data Cleansing
Khayyat, Zuhair (31 July 2017)
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to big data scaling. This presents a serious impediment since identifying and repairing dirty data often involves processing huge input datasets, handling sophisticated error discovery approaches and managing huge arbitrary errors. With large datasets, error detection becomes overly expensive and complicated, especially when considering user-defined functions. Furthermore, a distinctive algorithm is desired to optimize inequality joins in sophisticated error discovery rather than naïvely parallelizing them. Also, when repairing large errors, their skewed distribution may obstruct effective error repairs. In this dissertation, I present solutions to overcome the above three problems in scaling data cleansing.
First, I present BigDansing as a general system to tackle efficiency, scalability, and ease-of-use issues in data cleansing for Big Data. It automatically parallelizes the user's code on top of general-purpose distributed platforms. Its programming interface allows users to express data quality rules independently from the requirements of parallel and distributed environments. Without sacrificing their quality, BigDansing also enables parallel execution of serial repair algorithms by exploiting the graph representation of discovered errors. The experimental results show that BigDansing outperforms existing baselines by up to more than two orders of magnitude.
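To make "data quality rules" concrete, the toy sketch below checks a classic functional-dependency-style rule (records sharing a zip code must share a city) over a small table and reports the violating record pairs. It is a single-machine illustration with hypothetical column names of the kind of check a system like BigDansing parallelizes, not BigDansing's API.

```python
# Toy detection of violations of the rule: records sharing a zipcode must share a city.
from itertools import combinations

records = [
    {"id": 1, "zipcode": "10001", "city": "New York"},
    {"id": 2, "zipcode": "10001", "city": "New York"},
    {"id": 3, "zipcode": "10001", "city": "Newark"},     # violates the rule with 1 and 2
    {"id": 4, "zipcode": "60601", "city": "Chicago"},
]

def violations(rows):
    """Yield pairs of record ids that violate zipcode -> city."""
    by_zip = {}
    for r in rows:                           # group once instead of comparing all pairs
        by_zip.setdefault(r["zipcode"], []).append(r)
    for group in by_zip.values():
        for a, b in combinations(group, 2):
            if a["city"] != b["city"]:
                yield (a["id"], b["id"])

print(list(violations(records)))             # [(1, 3), (2, 3)]
```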
Although BigDansing scales cleansing jobs, it still lacks the ability to handle sophisticated error discovery requiring inequality joins. Therefore, I developed IEJoin as an algorithm for fast inequality joins. It is based on sorted arrays and space-efficient bit-arrays to reduce the problem's search space. By comparing IEJoin against well-known optimizations, I show that it is more scalable, and several orders of magnitude faster.
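For intuition on why sorting shrinks the search space of an inequality join, here is a simplified, single-predicate sort-plus-binary-search join (all pairs with left value strictly less than right value). The full IEJoin handles conjunctions of two inequality predicates with permutation and bit arrays, so treat this only as a sketch of the underlying idea.

```python
# Simplified sort-based inequality join: all pairs (l, r) with left[l] < right[r].
# The real IEJoin extends this idea to two inequality predicates using permutation
# arrays and a bit array; this sketch only shows how sorting prunes the search space.
import bisect

def lt_join(left, right):
    """Return (l, r) index pairs where left[l] < right[r]."""
    order = sorted(range(len(right)), key=lambda j: right[j])  # sort the right side once
    keys = [right[j] for j in order]
    result = []
    for i, lv in enumerate(left):
        # First position in the sorted right side strictly greater than lv.
        pos = bisect.bisect_right(keys, lv)
        result.extend((i, order[j]) for j in range(pos, len(keys)))
    return result

if __name__ == "__main__":
    left = [50, 80, 20]
    right = [30, 90, 60]
    print(sorted(lt_join(left, right)))
    # [(0, 1), (0, 2), (1, 1), (2, 0), (2, 1), (2, 2)]
```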
BigDansing depends on vertex-centric graph systems, i.e., Pregel, to efficiently store and process discovered errors. Although Pregel scales general-purpose graph computations, it is not able to handle skewed workloads efficiently. Therefore, I introduce Mizan, a Pregel system that balances the workload transparently during runtime to adapt to changes in computing needs. Mizan is general; it does not assume any a priori knowledge of the graph structure or the algorithm behavior. Through extensive evaluations, I show that Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.
177 |
Interpretable and Scalable Bayesian Models for Advertising and Text
Bischof, Jonathan Michael (04 June 2016)
In the era of "big data", scalable statistical inference is necessary to learn from new and growing sources of quantitative information. However, many commercial and scientific applications also require models to be interpretable to end users in order to generate actionable insights about quantities of interest. We present three case studies of Bayesian hierarchical models that improve the interpretability of existing models while also maintaining or improving the efficiency of inference. The first paper is an application to online advertising that presents an augmented regression model interpretable in terms of the amount of revenue a customer is expected to generate over his or her entire relationship with the company---even if complete histories are never observed. The resulting Poisson Process Regression employs a marginal inference strategy that avoids specifying customer-level latent variables used in previous work that complicate inference and interpretability. The second and third papers are applications to the analysis of text data that propose improved summaries of topic components discovered by these mixture models. While the current practice is to summarize topics in terms of their most frequent words, we show significantly greater interpretability in online experiments with human evaluators by using words that are also relatively exclusive to the topic of interest. In the process we develop a new class of topic models that directly regularize the differential usage of words across topics in order to produce stable estimates of the combined frequency-exclusivity metric as well as proposing efficient and parallelizable MCMC inference strategies. / Statistics
178 |
Využití dat ze sociálních sítí pro BI / The utilisation of social network data in BI
Linhart, Ondřej (January 2014)
The thesis deals with the topic of social networks, particularly with the opportunities that the utilisation of social network data can provide to an enterprise. The thesis is divided into two parts. The theoretical part contains definitions of the terms data, information and knowledge, followed by descriptions of Business Intelligence and Big Data - the two means of data analysis in an enterprise - and finally of social networks themselves. The practical part contains an analysis of the data provided by the social networks Facebook and Twitter, and at the same time defines the process of data extraction. The outcome of the analysis is a set of data that may be obtained by an enterprise; this data is then used to determine the possible ways in which enterprises can leverage it for their business. Finally, data provided by a Czech e-shop is used to show how an entity can utilise social network data.
179 |
Porovnanie metód machine learningu pre analýzu kreditného rizika / Comparison of machine learning methods for credit risk analysis
Bušo, Bohumír (January 2015)
Recently, machine learning has been connected more and more with a field called "Big Data". Usually, a lot of data is available in this field, and we need to extract useful information from it. Nowadays, when ever more data is generated by the use of mobile phones, credit cards, etc., the need for high-performance methods is pressing. In this work, we describe six different methods that serve this purpose: logistic regression, neural networks and deep neural networks, bagging, boosting and stacking. The last three methods form a group called ensemble learning. We apply all six methods to real data, generously provided by one of the loan providers. These methods can help them distinguish between good and bad potential borrowers when the decision about a loan is being made. Lastly, the results of the particular methods are compared, and we also briefly outline possible ways of interpreting them.
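As a concrete, hedged illustration of this kind of model comparison (not the thesis's actual data or settings, and omitting deep networks, which need a dedicated framework), the sketch below trains logistic regression, a small neural network, a bagging ensemble, gradient boosting, and a stacking ensemble on a synthetic, imbalanced "default / no default" dataset with scikit-learn and compares their test AUC scores.

```python
# Compare logistic regression, a neural network, bagging, boosting, and stacking
# on a synthetic, imbalanced credit-style dataset. Illustrative sketch only.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),   # decision trees by default
    "boosting": GradientBoostingClassifier(random_state=0),
}
# Stacking combines the base models above with a logistic-regression meta-learner.
models["stacking"] = StackingClassifier(
    estimators=list(models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
)

for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:20s} test AUC = {auc:.3f}")
```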
180 |
Vliv vývojových trendů na řešení projektu BI / The influence of trends in BI project
Kapitán, Lukáš (January 2012)
The aim of this thesis is to analyse the trends occurring in Business Intelligence. It examines, summarises and judges each of the trends from the point of view of their usability in the real world, and of their influence on and modification of each phase of a Business Intelligence implementation. It is clear that each of these trends has its positives and negatives, which can influence the statements in the evaluation; these factors are taken into consideration and analysed as well. The advantages and disadvantages of the trends appear especially in the areas of economic demands and technical difficulty. The main aim is to compare the methods of implementing Business Intelligence with the current trends in BI. In order to achieve this, a few crucial points were set: to investigate recent trends in BI and to define the methods of implementation in the broadest terms. The expected benefit of this thesis is the aforementioned investigation and analysis of trends in the area of Business Intelligence and their use in implementation.