371 |
Data governance in big data : How to improve data quality in a decentralized organization / Datastyrning och big data. Landelius, Cecilia. January 2021 (has links)
The use of the internet has increased the amount of data available and gathered. Companies are investing in big data analytics to gain insights from this data. However, the value of the analysis, and of the decisions based on it, depends on the quality of the underlying data. For this reason, data quality has become a prevalent issue for organizations. Additionally, failures in data quality management are often due to organizational aspects. Due to the growing popularity of decentralized organizational structures, there is a need to understand how a decentralized organization can improve data quality. This thesis conducts a qualitative single case study of an organization in the logistics industry that is currently shifting towards becoming data driven and struggling to maintain data quality. The purpose of the thesis is to answer the questions: • RQ1: What is data quality in the context of logistics data? • RQ2: What are the obstacles for improving data quality in a decentralized organization? • RQ3: How can these obstacles be overcome? Several data quality dimensions were identified and categorized as critical issues, issues and non-issues. From the gathered data, the dimensions completeness, accuracy and consistency were found to be critical issues of data quality. The three most prevalent obstacles for improving data quality were data ownership, data standardization and understanding the importance of data quality. To overcome these obstacles, the most important measures are creating data ownership structures, implementing data quality practices and changing the mindset of the employees to a data driven mindset. The generalizability of a single case study is low. However, there are insights and trends which can be derived from the results of this thesis and used for further studies and companies undergoing similar transformations. / Den ökade användningen av internet har ökat mängden data som finns tillgänglig och mängden data som samlas in. Företag påbörjar därför initiativ för att analysera dessa stora mängder data för att få ökad förståelse. Dock är värdet av analysen samt besluten som baseras på analysen beroende av kvaliteten av den underliggande data. Av denna anledning har datakvalitet blivit en viktig fråga för företag. Misslyckanden i datakvalitetshantering är ofta på grund av organisatoriska aspekter. Eftersom decentraliserade organisationsformer blir alltmer populära, finns det ett behov av att förstå hur en decentraliserad organisation kan arbeta med frågor som datakvalitet och dess förbättring. Denna uppsats är en kvalitativ studie av ett företag inom logistikbranschen som i nuläget genomgår ett skifte till att bli datadrivna och som har problem med att underhålla sin datakvalitet. Syftet med denna uppsats är att besvara frågorna: • RQ1: Vad är datakvalitet i sammanhanget logistikdata? • RQ2: Vilka är hindren för att förbättra datakvalitet i en decentraliserad organisation? • RQ3: Hur kan dessa hinder överkommas? Flera datakvalitetsdimensioner identifierades och kategoriserades som kritiska problem, problem och icke-problem. Från den insamlade informationen fanns att dimensionerna kompletthet, exakthet och konsekvens var kritiska datakvalitetsproblem för företaget. De tre mest förekommande hindren för att förbättra datakvalité var dataägandeskap, standardisering av data samt att förstå vikten av datakvalitet.
För att överkomma dessa hinder är de viktigaste åtgärderna att skapa strukturer för dataägandeskap, att implementera praxis för hantering av datakvalitet samt att ändra attityden hos de anställda gentemot datakvalitet till en datadriven attityd. Generaliserbarheten av en enfallsstudie är låg. Dock medför denna studie flera viktiga insikter och trender vilka kan användas för framtida studier och för företag som genomgår liknande transformationer.
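To make the critical dimensions above concrete, here is a minimal sketch of how completeness and consistency could be profiled on tabular logistics data. It assumes pandas; the shipment table and column names are hypothetical and not taken from the thesis, and accuracy is omitted because it normally requires an external reference.

```python
import pandas as pd

def profile_data_quality(df: pd.DataFrame, key_columns: list) -> dict:
    """Simple indicators for two data quality dimensions.

    completeness: share of non-missing cells per column
    consistency:  share of duplicated business keys whose rows agree on every field
    """
    completeness = (1 - df.isna().mean()).round(3).to_dict()

    duplicated = df[df.duplicated(subset=key_columns, keep=False)]
    if duplicated.empty:
        consistency = 1.0
    else:
        # Rows sharing the same key should carry identical values in all columns.
        agreeing = duplicated.groupby(key_columns).nunique().le(1).all(axis=1)
        consistency = float(agreeing.mean())

    return {"completeness": completeness, "consistency": consistency}

# Hypothetical shipment records: one missing weight, one missing destination.
shipments = pd.DataFrame({
    "shipment_id": [1, 1, 2, 3],
    "weight_kg": [120.0, 120.0, None, 80.0],
    "destination": ["Göteborg", "Göteborg", "Malmö", None],
})
print(profile_data_quality(shipments, key_columns=["shipment_id"]))
```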
|
372 |
Moving beyond connecting things : What are the factors telecommunication service providers need to consider when developing a Data-as-a-Service offering? / Steget vidare från uppkoppling av produkter : Vilka faktorer bör IoT-operatörer ta hänsyn till vid utveckling av Data-as-a-Service-tjänster? GOHARI MOGHADAM, SHERVIN; ÅHLGREN, THOR. January 2020 (has links)
The Internet of Things and connected devices have been getting more and more recognition in multiple industries over the last few years. At the same time, the gathering of data is taking on a more central role for both companies and civilians. One type of Internet of Things is enabled by telecommunication service providers (TSPs) providing SIM cards in devices, which function over an advanced telecommunication infrastructure. This study aims to examine how these TSPs can leverage data generated by the communication infrastructure by providing an additional data-as-a-service (DaaS) offering to current customers. The study was done at a global TSP within the area of SIM-fleet management/IoT enablement. The number of industries that are starting to connect devices is growing extensively, in order to get all types of information regarding the devices, ranging from electricity usage and geocoordinates to performance or other useful information. The data that is sent by the SIM card belongs to the customer, and the TSPs do not access it. However, the telecommunication infrastructure generates data created by the communication of the devices, which is gathered by the TSP. Since a huge amount of data is attained by the TSP offering the infrastructure, the aim of this study is to examine potential obstacles and opportunities of a DaaS offering. How the data is to be delivered, customers' connectivity needs and how current insight streams are delivered are examples of subjects the study examines. The work has its foundation in a theoretical reference frame and a qualitative empirical study. The theoretical reference frame provides an overview of the industry's development and earlier research within the area. It was created by conducting a literature study combined with reports provided by trade organizations and other stakeholders. The empirical study contains six interviews with employees at a global TSP with an extensive history of connected devices. The two parts were then compiled in order to compare the results with the theoretical background. Many of the challenges of developing a DaaS that emerged from the results agreed with the theoretical reference frame. The customers' differences in connectivity maturity were shown to pose a great challenge to standardizing a DaaS offering, and the clients' analytical needs depended on the same premises. Furthermore, DaaS is considered to have a great effect on the industry's future development, and the consensus is that such offerings will become increasingly data driven. / Internet of Things och uppkopplade produkter har blivit ett allt vanligare begrepp inom flertalet branscher. Samtidigt har datainsamling blivit en mer central del av alltifrån affärsmodeller till något vanliga konsumenter har i åtanke. En variant av Internet of Things tillhandahålls genom SIM-kort i produkter, som tillhandahålls av operatörer, och fungerar genom kommunikationsnätverk. Denna studie är en akademisk utredning kring hur dessa operatörer kan utnyttja data genererat från telekommunikations-infrastruktur till en tjänst för nuvarande kunder. Studien är utförd hos en global operatör inom området av SIM-fleet Management/IoT-enablement. Fler och fler industrier går mot att koppla upp produkter för att få information kring alltifrån prestanda, elanvändning hos produkten, geografisk position eller annan information som önskas. Den data som skickas tillhandahålls av kund, vilket operatören inte har någon tillgång till. Dock så genererar kommunikationen i sig data genom kommunikationsnätverket, som operatören samlar in.
I och med att mängder av data blir tillgänglig för operatörerna som tillhandahåller infrastrukturen, är syftet med denna rapport att undersöka eventuella hinder och möjligheter att erbjuda kunder ytterligare data som en tjänst i sig. Hur datan ska levereras, kundernas analysbehov och hur nuvarande insikter levereras är några exempel på det studien utreder. Arbetet grundar sig i en litteraturstudie och en kvalitativ empirisk studie. Litteraturstudien ger en bakgrund och teoretisk överblick kring branschens utveckling och litteraturens syn på området. Detta gjordes genom vetenskapliga publikationer samt diverse rapporter från branschorganisationer och intressenter. Den empiriska studien genomfördes genom 6 intervjuer med anställda på en global operatör med lång historik inom uppkopplade produkter. De två delarna sammanställdes sedan för att jämföra resultatet med den teoretiska bakgrunden. Det visade sig vara mycket i resultatet som stämde överens med de teoretiska aspekterna kring utmaningar med att erbjuda Data-as-a-Service (DaaS). Kundernas olika mognadsgrad i sin uppkoppling visade sig vara en stor utmaning i att standardisera en DaaS, och kundernas analysbehov gick ofta isär på samma premisser. Vidare anses DaaS ha stor påverkan på hur branschen fortsätter utvecklas i framtiden, och konsensus är att tjänsten i framtiden kommer att bli mer och mer datadriven.
|
373 |
Produktutvecklingsprocesser vid digitalisering av hemprodukter : Påverkan på intern struktur, projekttid, användardata och produktutvecklingsmetod / Product Development in Digitalization of Home Products. BRICK, ADÉLE; HABBERSTAD, HELENA. January 2020 (has links)
Under de senaste åren har digitaliseringen av fysiska produkter ökat, och allt fler företag har därmed börjat implementera digitala komponenter i sina produkter. Att implementera mjukvara i en analog produkt innebär nya utmaningar för produktutvecklingsteam som tidigare arbetat med att ta fram analoga produkter. Många företag har i och med digitaliseringen valt att anpassa sina produktutvecklingsmetoder med målet att integrera de digitala och analoga produktutvecklingsprocesserna med varandra. Syftet med studien är att undersöka hur produktutvecklingsprocesserna ser ut idag på produktutvecklande företag som har genomgått en digitalisering. De aspekter som har tagits extra hänsyn till är projekttid, produktutvecklingsmetod, företagets organisering och struktur samt insamling och implementering av användardata i produktutvecklingsprocessen. Företag som utvecklar uppkopplade produkter för hemmet är exempel på företag som just nu genomgår en digitalisering av tidigare analoga produkter, därför har företag med detta spår valts som inriktning vid denna studie. Studien har utförts genom en inledande litteraturstudie följt av kvalitativa intervjuer med fyra responderande företag, som samtliga utvecklar uppkopplade hemprodukter vilka innehåller IoT-teknologi. Studien visar att företagen strävar mot ett agilt arbetssätt, men att det finns svårigheter med att integrera hårdvaru- och mjukvaruutveckling i produktutvecklingsprocesserna. Trots detta upplevs utvecklingstiden i projekt som oförändrad jämfört med innan digitaliseringen. Det framkommer även att tvärfunktionalitet hos utvecklingsteamen är en fördel i samspelet mellan de digitala och analoga delarna av produktutvecklingen. Studien visade slutligen att kunddata som samlas in via digitaliserade produkter används av företag som ett verktyg för att effektivisera produktutvecklingen. / In recent years the digitalization of physical products has increased, and many companies have therefore started to implement digital components in their products. Adding software to an analog product creates new challenges for product development teams that until then have mainly been developing analog products. Many companies have, as a result of the digitalization, chosen to adapt their product development methods to manage the integration between digital and analog development processes. The purpose of this study is to investigate what the product development process looks like today in companies that have digitalized their products. The aspects that are specifically considered are: project duration, product development method, organizational structure of the company, and implementation of big data in the product development process. Companies that develop products for home use are one example of companies going through a digitalization of their previously analog products, which is why this branch of companies is targeted in this study. The study was conducted through an initial literature study, followed by interviews with four responding companies, which all develop connected home products containing IoT technology. The study shows that the companies aim for a more agile work procedure, but that there are problems with integrating hardware and software development in product development processes. Nonetheless, the time duration of the projects does not appear to have changed significantly in comparison to pre-digitalization.
It is also revealed that cross-functional teams are an advantage in the collaboration between the digital and analog parts of the development process. The study finally shows that big data, collected through digitalized products, is used by the companies as a tool for increasing the effectiveness of product development.
|
374 |
[pt] ENSAIOS SOBRE NOWCASTING COM DADOS EM ALTA DIMENSÃO / [en] ESSAYS ON NOWCASTING WITH HIGH DIMENSIONAL DATA. HENRIQUE FERNANDES PIRES. 02 June 2022 (has links)
[pt] Em economia, Nowcasting é a previsão do presente, do passado recente ou mesmo a previsão do futuro muito próximo de um determinado indicador. Geralmente, um modelo nowcast é útil quando o valor de uma variável de interesse é disponibilizado com um atraso significativo em relação ao seu período de referência e/ou sua realização inicial é notavelmente revisada ao longo do tempo, se estabilizando somente após um tempo. Nesta tese, desenvolvemos e analisamos vários métodos de Nowcasting usando dados de alta dimensão (big data) em diferentes contextos: desde a previsão de séries econômicas até o nowcast de óbitos pela COVID-19. Em um de nossos estudos, comparamos o desempenho de diferentes algoritmos de Machine Learning com modelos mais naive na previsão de muitas variáveis econômicas em tempo real e mostramos que, na maioria das vezes, o Machine Learning supera os modelos de benchmark. Já no restante dos nossos exercícios, combinamos várias técnicas de nowcasting com um grande conjunto de dados (incluindo variáveis de alta frequência, como o Google Trends) para rastrear a pandemia no Brasil, mostrando que fomos capazes de antecipar os números reais de mortes e casos muito antes de estarem disponíveis oficialmente para todos. / [en] Nowcasting in economics is the prediction of the present, the recent past or even the prediction of the very near future of a certain indicator. Generally, a nowcast model is useful when the value of a target variable is released with a significant delay with respect to its reference period and/or when its value gets notably revised over time and stabilizes only after a while. In this thesis, we develop and analyze several Nowcasting methods using high-dimensional (big) data in different contexts: from the forecasting of economic series to the nowcast of COVID-19. In one of our studies, we compare the performance of different Machine Learning algorithms with more naive models in predicting many economic variables in real-time and we show that, most of the time, Machine Learning beats benchmark models. Then, in the rest of our exercises, we combine several nowcasting techniques with a big dataset (including high-frequency variables, such as Google Trends) in order to track the pandemic in Brazil, showing that we were able to nowcast the true numbers of deaths and cases way before they got available to everyone.
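As a stylized illustration of the comparison described in the abstract (not the thesis's actual models or data), the sketch below pits a Machine Learning nowcast against a naive last-value benchmark in a pseudo real-time, expanding-window exercise. It assumes scikit-learn and substitutes synthetic predictors for the real-time economic series and Google Trends variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic target plus 20 noisy high-frequency predictors (stand-ins for big data).
n, p = 240, 20
X = rng.normal(size=(n, p))
y = X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=n)

# Expanding-window pseudo real-time evaluation over the last 60 periods.
ml_errors, naive_errors = [], []
for t in range(n - 60, n):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[:t], y[:t])
    ml_pred = model.predict(X[t:t + 1])[0]
    naive_pred = y[t - 1]  # naive benchmark: carry forward the last observed value
    ml_errors.append((ml_pred - y[t]) ** 2)
    naive_errors.append((naive_pred - y[t]) ** 2)

print("ML RMSE:   ", np.sqrt(np.mean(ml_errors)))
print("Naive RMSE:", np.sqrt(np.mean(naive_errors)))
```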
|
375 |
Case Studies on Fractal and Topological Analyses of Geographic Features Regarding Scale Issues. Ren, Zheng. January 2017 (has links)
Scale is an essential notion in geography and geographic information science (GIScience). However, the complex concepts of scale and traditional Euclidean geometric thinking have created tremendous confusion and uncertainty. Traditional Euclidean geometry uses absolute size, regular shape and direction to describe our surrounding geographic features. In this context, different measuring scales will affect the results of geospatial analysis. For example, if we want to measure the length of a coastline, its length will be different using different measuring scales. Fractal geometry indicates that most geographic features are not measurable because of their fractal nature. In order to deal with such scale issues, topological and scaling analyses are introduced. They focus on the relationships between geographic features instead of geometric measurements such as length, area and slope. A scale change will affect geometric measurements such as length and area but will not affect topological measurements such as connectivity. This study uses three case studies to demonstrate the scale issues of geographic features through fractal analyses. The first case illustrates that the length of the British coastline is fractal and scale-dependent. The length of the British coastline increases as the measuring scale decreases. The yardstick fractal dimension of the British coastline was also calculated. The second case demonstrates that areal geographic features such as the British island are also scale-dependent in terms of area. The box-counting fractal dimension, as an important parameter in fractal analysis, was also calculated. The third case focuses on the scale effects on elevation and the slope of the terrain surface. The relationship between slope value and resolution in this case is not as simple as in the other two cases. Flat and fluctuating areas generate different results. These three cases all show the fractal nature of the geographic features and indicate the fallacies of scale existing in geography. Accordingly, the fourth case tries to exemplify how topological and scaling analyses can be used to deal with such unsolvable scale issues. The fourth case analyzes the London OpenStreetMap (OSM) streets with a topological approach to reveal the scaling or fractal property of street networks. The fourth case further investigates the ability of the topological metric to predict Twitter users' presence. The correlation between the number of tweets and the connectivity of London's named natural streets is relatively high, with a coefficient of determination r2 of 0.5083. Regarding scale issues in geography, the specific technology or method to handle the scale issues arising from the fractal essence of the geographic features does not matter. Instead, the mindset of shifting from traditional Euclidean thinking to novel fractal thinking in the field of GIScience is more important. The first three cases revealed the scale issues of geographic features under Euclidean thinking. The fourth case proved that topological analysis can deal with such scale issues under a fractal way of thinking. With the development of data acquisition technologies, the data itself becomes more complex than ever before. Fractal thinking effectively describes the characteristics of geographic big data across all scales. It also overcomes the drawbacks of traditional Euclidean thinking and provides deeper insights for GIScience research in the big data era.
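As an illustration of the box-counting fractal dimension used in the second case, here is a minimal sketch. It assumes NumPy; the synthetic curve and the box sizes are illustrative and not the coastline data or parameters used in the thesis.

```python
import numpy as np

def box_counting_dimension(points: np.ndarray, box_sizes: np.ndarray) -> float:
    """Estimate the box-counting fractal dimension of a 2D point set.

    For each box size s, count the number of occupied boxes N(s), then fit
    log N(s) = -D * log s + c by least squares; D is the dimension estimate.
    """
    counts = []
    for s in box_sizes:
        occupied = np.unique(np.floor(points / s), axis=0)
        counts.append(len(occupied))
    slope, _ = np.polyfit(np.log(box_sizes), np.log(counts), 1)
    return -slope

# Illustrative jagged curve sampled as points (a rough stand-in for a coastline).
t = np.linspace(0, 1, 5000)
curve = np.column_stack([t, 0.1 * np.sin(50 * np.pi * t) * np.abs(np.sin(5 * np.pi * t))])
sizes = np.array([0.001, 0.002, 0.005, 0.01, 0.02, 0.05])
print("Estimated box-counting dimension:", box_counting_dimension(curve, sizes))
```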
|
376 |
Exploring Spatio-Temporal Patterns of Volunteered Geographic Information : A Case Study on Flickr Data of Sweden. Miao, Yufan. January 2013 (has links)
This thesis aims to seek interesting patterns from massive amounts of Flickr data in Sweden with newly proposed clustering strategies. The aim can be further divided into three objectives. The first one is to acquire a large amount of timestamped geolocation data from Flickr servers. The second objective is to develop effective and efficient methods to process the data. More specifically, the methods to be developed are twofold, namely, the preprocessing method to solve the “Big Data” issue encountered in the study and the new clustering method to extract spatio-temporal patterns from data. The third one is to analyze the extracted patterns with scaling analysis techniques in order to interpret human social activities underlying the Flickr data within the urban environment of Sweden. During the study, the three objectives were achieved sequentially. The data employed for this study was vector points downloaded through the Flickr Application Programming Interface (API). After data acquisition, preprocessing was performed on the raw data. The whole dataset was firstly separated by year based on the temporal information. Then data of each year was accumulated with its former year(s) so that the evolving process could be explored. After that, large datasets were split into small pieces and each piece was clipped, georeferenced, and rectified respectively. Then the pieces were merged together for clustering. With respect to clustering, the strategy was developed based on the Delaunay Triangulation (DT) and head/tail break rule. After that, the generated clusters were analyzed with scaling analysis techniques and spatio-temporal patterns were interpreted from the analysis results. It has been found that the spatial pattern of the human social activities in the urban environment of Sweden generally follows the power-law distribution and the cities defined by human social activities are evolving as time goes by. To conclude, the contributions of this research are threefold and fulfill the objectives of this study, respectively. Firstly, a large amount of Flickr data is acquired and collated as a contribution to other academic research related to Flickr. Secondly, the clustering strategy based on the DT and head/tail break rule is proposed for spatio-temporal pattern seeking. Thirdly, the evolution of the cities in terms of human activities in Sweden is detected from the perspective of scaling. Future work is expected in two major aspects, namely, data and data processing. For the data aspect, the downloaded Flickr data is expected to be employed by other studies, especially those closely related to human social activities within the urban environment. For the processing aspect, new algorithms are expected to either accelerate the processing or make better use of machines with supercomputing capacities.
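The clustering strategy mentioned above can be sketched in a minimal, single-pass form: build a Delaunay Triangulation of the points, apply the head/tail break rule to the edge lengths, and read clusters off the remaining graph. The sketch below assumes NumPy and SciPy, uses synthetic points rather than the georeferenced Flickr data, and omits any repeated application of the rule that the thesis may use.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def head_tail_clusters(points: np.ndarray) -> np.ndarray:
    """Cluster 2D points by cutting long Delaunay edges with the head/tail break rule."""
    tri = Delaunay(points)
    # Collect unique edges from the triangulation simplices.
    edges = set()
    for a, b, c in tri.simplices:
        edges.update({tuple(sorted((a, b))), tuple(sorted((b, c))), tuple(sorted((a, c)))})
    edges = np.array(list(edges))
    lengths = np.linalg.norm(points[edges[:, 0]] - points[edges[:, 1]], axis=1)

    # Head/tail break: drop the 'head' of long edges, keep the 'tail' of short ones.
    keep = lengths < lengths.mean()
    graph = coo_matrix(
        (np.ones(int(keep.sum())), (edges[keep, 0], edges[keep, 1])),
        shape=(len(points), len(points)),
    )
    _, labels = connected_components(graph, directed=False)
    return labels

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in (0, 5, 10)])
print(np.bincount(head_tail_clusters(pts)))  # roughly three clusters of ~100 points
```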
|
377 |
Inhämtning & analys av Big Data med fokus på sociala medier. Åhlander, Niclas; Aldaamsah, Saed. January 2015 (has links)
In a world that increasingly uses social media, information about the users is created and made visible that previously was not easy to analyze in large quantities. This thesis shows the process of creating an automated way of collecting specific data from social media. The collected data is then analyzed with carefully designed algorithms, and finally the usefulness of the process as a whole is demonstrated. The data collection from social media was automated with the help of a number of combined methods. The analysis of the collected data could then be carried out using the specific algorithms presented in this work. Together, the methods resulted in certain patterns emerging in the data, which revealed various types of information about the individuals selected for the analysis.
|
378 |
An Explorative Parameter Sweep: Spatial-temporal Data Mining in Stochastic Reaction-diffusion Simulations. Wrede, Fredrik. January 2016 (has links)
Stochastic reaction-diffusion simulations have become an efficient approach for modelling spatial aspects of intracellular biochemical reaction networks. By accounting for intrinsic noise due to low copy numbers of chemical species, stochastic reaction-diffusion simulations have the ability to more accurately predict and model biological systems. As with much simulation software, exploration of the parameters associated with a model can be needed to yield new knowledge about the underlying system. The exploration can be conducted by executing parameter sweeps for a model. However, with little or no prior knowledge about the modelled system, the effort for practitioners to explore the parameter space can become overwhelming. To address this problem we perform a feasibility study on an explorative behavioural analysis of stochastic reaction-diffusion simulations by applying spatial-temporal data mining to large parameter sweeps. By reducing individual simulation outputs into a feature space involving simple time series and distribution analytics, we were able to find similarly behaving simulations after performing an agglomerative hierarchical clustering.
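A minimal sketch of the reduce-then-cluster idea is given below. It assumes NumPy and SciPy, substitutes simple random-walk trajectories for actual stochastic reaction-diffusion outputs, and the feature choices and cluster count are illustrative rather than those used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def summarize(trajectory: np.ndarray) -> np.ndarray:
    """Reduce one simulated trajectory to a small vector of time-series features."""
    return np.array([
        trajectory.mean(),
        trajectory.std(),
        trajectory.max() - trajectory.min(),
        np.argmax(trajectory) / len(trajectory),  # relative position of the peak
    ])

# Hypothetical parameter sweep: 200 trajectories generated with different drift rates.
rng = np.random.default_rng(2)
trajectories = [np.cumsum(rng.normal(rate, 1.0, size=500))
                for rate in rng.uniform(-0.5, 0.5, 200)]
features = np.array([summarize(tr) for tr in trajectories])

# Standardize the feature space, then cluster hierarchically (Ward linkage).
features = (features - features.mean(axis=0)) / features.std(axis=0)
labels = fcluster(linkage(features, method="ward"), t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # sizes of the three behavioural groups
```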
|
379 |
Sequential estimation in statistics and steady-state simulation. Tang, Peng. 22 May 2014 (links)
At the onset of the "Big Data" age, we are faced with ubiquitous data in various forms and with various characteristics, such as noise, high dimensionality, autocorrelation, and so on. The question of how to obtain
accurate and computationally efficient estimates from such data is one that has stoked the interest of many researchers. This dissertation mainly concentrates on two general problem areas: inference for high-dimensional and noisy data, and estimation of the steady-state mean for univariate data generated by computer simulation experiments. We develop and evaluate three separate sequential algorithms for the two topics. One major
advantage of sequential algorithms is that they allow for careful experimental adjustments as sampling proceeds. Unlike one-step sampling plans, sequential algorithms adapt to different situations arising from the ongoing sampling; this makes these procedures efficacious as problems become more complicated and more-delicate requirements need to be
satisfied. We will elaborate on each research topic in the following discussion. Concerning the first topic, our goal is to develop a robust graphical model for noisy data in a high-dimensional setting. Under a Gaussian distributional assumption, the estimation of undirected Gaussian graphs is equivalent to the estimation of inverse covariance matrices. Particular interest has focused upon estimating a sparse inverse covariance matrix to reveal insight on the data as suggested by the principle of parsimony. For
estimation with high-dimensional data, the influence of anomalous observations becomes severe as the dimensionality increases. To address this problem, we propose a robust estimation procedure for the Gaussian graphical model based on the Integrated Squared Error (ISE) criterion. The
robustness result is obtained by using ISE as a nonparametric criterion for seeking the largest portion of the data that "matches" the model. Moreover, an l₁-type regularization is applied to encourage sparse
estimation. To address the non-convexity of the objective function, we develop a sequential algorithm in the spirit of a
majorization-minimization scheme. We summarize the results of Monte Carlo
experiments supporting the conclusion that our estimator of the inverse covariance matrix converges weakly (i.e., in probability) to the latter matrix as the sample size grows large. The performance of the proposed
method is compared with that of several existing approaches through numerical simulations. We further demonstrate the strength of our method with applications in genetic network inference and financial portfolio optimization. The second topic consists of two parts, and both concern the computation of point and confidence interval (CI) estimators for the mean µ of a
stationary discrete-time univariate stochastic process X \equiv \{X_i:
i=1,2,...} generated by a simulation experiment. The point estimation
is relatively easy when the underlying system starts in steady state; but
the traditional way of calculating CIs usually fails since the data encountered in simulation output are typically serially correlated. We
propose two distinct sequential procedures that each yield a CI for µ with user-specified reliability and absolute or relative precision. The first sequential procedure is based on variance estimators computed from standardized time series applied to nonoverlapping batches of
observations, and it is characterized by its simplicity relative to methods based on batch means and its ability to deliver CIs for the
variance parameter of the output process (i.e., the sum of covariances at all lags). The second procedure is the first sequential algorithm that uses overlapping variance estimators to construct asymptotically valid CI estimators for the steady-state mean based on standardized time series. The advantage of this procedure is that compared with other popular procedures for steady-state simulation analysis, the second procedure yields significant reduction both in the variability of its CI estimator and in the sample size needed to satisfy the precision requirement. The effectiveness of both procedures is evaluated via comparisons with
state-of-the-art methods based on batch means under a series of experimental settings: the M/M/1 waiting-time process with 90% traffic intensity; the M/H_2/1 waiting-time process with 80% traffic
intensity; the M/M/1/LIFO waiting-time process with 80% traffic intensity; and an AR(1)-to-Pareto (ARTOP) process. We find that the new procedures perform comparatively well in terms of their average
required sample sizes as well as the coverage and average half-length of
their delivered CIs.
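For context, the sketch below implements the classical nonoverlapping batch means confidence interval that the proposed standardized-time-series procedures are compared against; it is not the thesis's own procedure. It assumes NumPy and SciPy, and the AR(1) test process, batch count and run length are illustrative.

```python
import numpy as np
from scipy import stats

def batch_means_ci(x: np.ndarray, n_batches: int = 20, alpha: float = 0.05):
    """Confidence interval for the steady-state mean via nonoverlapping batch means.

    The autocorrelated output x is split into n_batches equal batches; if the
    batches are long enough, the batch means are approximately i.i.d. normal,
    so a Student-t interval around their grand mean applies.
    """
    m = len(x) // n_batches
    means = x[: m * n_batches].reshape(n_batches, m).mean(axis=1)
    half_width = stats.t.ppf(1 - alpha / 2, n_batches - 1) * means.std(ddof=1) / np.sqrt(n_batches)
    return means.mean(), half_width

# Illustrative AR(1) output process with steady-state mean 10 and strong autocorrelation.
rng = np.random.default_rng(3)
x = np.empty(100_000)
x[0] = 10.0
for i in range(1, len(x)):
    x[i] = 10.0 + 0.9 * (x[i - 1] - 10.0) + rng.normal()

point, hw = batch_means_ci(x)
print(f"{point:.3f} +/- {hw:.3f}")
```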
|
380 |
Modern Computing Techniques for Solving Genomic Problems. Yu, Ning. 12 August 2016 (links)
With the advent of high-throughput genomics, biological big data brings challenges to scientists in handling, analyzing, processing and mining this massive data. In this new interdisciplinary field, diverse theories, methods, tools and knowledge are utilized to solve a wide variety of problems. As an exploration, this dissertation project is designed to combine concepts and principles in multiple areas, including signal processing, information-coding theory, artificial intelligence and cloud computing, in order to solve the following problems in computational biology: (1) comparative gene structure detection, (2) DNA sequence annotation, (3) investigation of CpG islands (CGIs) for epigenetic studies. Briefly, in problem #1, sequences are transformed into signal series or binary codes. Similar to speech/voice recognition, similarity is calculated between two signal series and subsequently the signals are stitched/matched into a temporal sequence. Because the operations are binary in nature, all calculations/steps can be performed efficiently and accurately. Improving performance in terms of accuracy and specificity is the key for a comparative method. In problem #2, DNA sequences are encoded and transformed into numeric representations for deep learning methods. Encoding schemes greatly influence the performance of deep learning algorithms. Finding the best encoding scheme for a particular application of deep learning is significant. Three applications (detection of protein-coding splicing sites, detection of lincRNA splicing sites and improvement of comparative gene structure identification) are used to show the computing power of deep neural networks. In problem #3, CpG sites are assigned a certain energy and a Gaussian filter is applied to the detection of CpG islands. By using the CpG box and a Markov model, we investigate the properties of CGIs and redefine them using emerging epigenetic data. In summary, these three problems and their solutions are not isolated; they are linked to modern techniques in such diverse areas as signal processing, information-coding theory, artificial intelligence and cloud computing. These novel methods are expected to improve the efficiency and accuracy of computational tools and bridge the gap between biology and scientific computing.
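As a simplified illustration of the Gaussian-filter idea in problem #3 (the thesis additionally assigns energies to CpG sites and uses the CpG box and Markov models, which are not reproduced here), here is a sketch assuming NumPy and SciPy; the sequence, sigma and threshold are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def cpg_island_candidates(seq: str, sigma: float = 100.0, threshold: float = 0.15):
    """Locate candidate CpG islands by smoothing the CpG dinucleotide signal.

    Each position starting a 'CG' dinucleotide gets weight 1; a Gaussian filter
    turns this into a local CpG density, and contiguous runs above the threshold
    are reported as candidate (start, end) intervals.
    """
    signal = np.zeros(len(seq))
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "CG":
            signal[i] = 1.0
    density = gaussian_filter1d(signal, sigma=sigma)

    islands, start = [], None
    for i, above in enumerate(density > threshold):
        if above and start is None:
            start = i
        elif not above and start is not None:
            islands.append((start, i))
            start = None
    if start is not None:
        islands.append((start, len(seq)))
    return islands

# Illustrative sequence: random background with a CpG-rich stretch inserted at 2000.
rng = np.random.default_rng(4)
background = "".join(rng.choice(list("ACGT"), size=5000))
sequence = background[:2000] + "CG" * 300 + background[2000:]
print(cpg_island_candidates(sequence))  # expected: one interval around the insert
```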
|