Global ETD Search

1	Multiple stratification when one of the stratifying variables is time Ghazali, Syed Shakir Ali January 1991 (has links) No description available. 519.5 Data sampling
2	Um modelo econométrico Painel-MIDAS dos retornos dos ativos do mercado acionário brasileiro [A Panel-MIDAS econometric model of asset returns in the Brazilian stock market] Silva, Aline Moura Costa da 17 November 2017 (has links) Doctoral thesis, Universidade de Brasília, Universidade Federal da Paraíba, Universidade Federal do Rio Grande do Norte, Programa Multi-Institucional e Inter-Regional de Pós-Graduação em Ciências Contábeis, 2017. / The purpose of this thesis was to develop a structural econometric model for the Brazilian stock market that explains how the returns of its shares are determined, using a modeling approach known as MIDAS. To that end, the model includes explanatory variables that summarize the specific characteristics of the companies analyzed, as well as variables describing the Brazilian economic environment. As a robustness test of the proposed MIDAS model, a conventional panel data regression model was also estimated with the same variables. Subsequently, the stock return forecasts generated by the MIDAS model were compared with forecasts from the conventional model and from the historical series. Asset portfolios were also built from the MIDAS model as a further way of assessing its forecasts. The sample comprised the non-financial institutions listed on the BM&FBovespa (now B3), and the analysis period ran from 2010 to 2016. The results indicate that the MIDAS model developed in this thesis is robust for explaining and forecasting the quarterly returns of shares listed on the Brazilian stock market, and that it even supports the construction of investment portfolios. The model outperformed the conventional panel data model in explaining stock returns and, in forecasting stock returns, was statistically more accurate than the historical average. These results reinforce the importance of studies on modeling stock returns in emerging markets by providing a robust model for investment analysis and decision-making in Brazil, which contributes to a better understanding and development of its stock market. Stock market Econometric model Mixed Data Sampling (MIDAS)
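The abstract above does not say which lag-weighting scheme the thesis uses. As a minimal sketch of how a MIDAS regressor can be built, the following assumes the exponential Almon weights that are a common choice in the MIDAS literature. The function names, the parameters `theta1` and `theta2`, and the monthly figures are all hypothetical illustrations, not values from the thesis.

```python
import math

def exp_almon_weights(n_lags, theta1, theta2):
    """Normalized exponential Almon weights for lags 1..n_lags (they sum to 1)."""
    raw = [math.exp(theta1 * k + theta2 * k * k) for k in range(1, n_lags + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def midas_aggregate(high_freq, n_lags, theta1, theta2):
    """Collapse the last n_lags high-frequency values into one low-frequency regressor.

    Lag 1 is the most recent observation, so the series is reversed first.
    """
    weights = exp_almon_weights(n_lags, theta1, theta2)
    recent_first = list(reversed(high_freq[-n_lags:]))
    return sum(w * x for w, x in zip(weights, recent_first))

# Hypothetical monthly indicator observed within one quarter (oldest first).
monthly = [0.8, 1.1, 0.9]
flat = midas_aggregate(monthly, 3, 0.0, 0.0)    # equal weights: the plain average
decay = midas_aggregate(monthly, 3, 0.0, -0.5)  # weights shrink for older months
```

With both parameters at zero the weights are equal and the regressor reduces to the simple quarterly average, which is the fixed aggregation that MIDAS generalizes. In an actual estimation the parameters would be fitted jointly with the regression coefficients.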
3	Synthetic Data Generation and Sampling for Online Training of DNN in Manufacturing Supervised Learning Problems Thiyagarajan, Prithivrajan 29 May 2024 (has links) The deployment of Industrial Internet offers abundant passive data from manufacturing systems and networks, which enables data-driven modeling with high-data-demand, advanced statistical models such as Deep Neural Networks (DNNs). Deep Neural Networks (DNNs) have proven to be remarkably effective in supervised learning in critical manufacturing applications, such as AI-enabled automatic inspection, quality modeling, etc. However, there is a lack of performance guarantee of DNN models primarily due to data class imbalance, shifting distribution, multi-modality variables (e.g., time series and images) in training and testing datasets collected in manufacturing. Moreover, implementing these models on the manufacturing shop floor is difficult due to limitations in human-machine interaction. Inspired by active data generation through Design of Experiments (DoE) and passive observational data collection for manufacturing data analytics, we propose a SynthetIc Data gEneration and Sampling (SIDES) framework with a Graphical User Interface named SIDESync. This framework is designed to streamline SIDES execution within manufacturing environments, to provide adequate DNN model performance through the improvement of training data preparation and enhancing human-machine interaction. In the SIDES framework, a bi-level Hierarchical Contextual Bandits is proposed to provide a scientific way to integrate DoE and observational data sampling, which optimizes DNNs' online learning performance. Multimodality-aligned variational Autoencoder transforms the multimodal predictors from manufacturing into a shared low-dimensional latent space for controlled data generation from DoE and effective sampling from observational data. 
The SIDESync Graphical User Interface (GUI), developed using the Streamlit library in Python, simplifies the configuration, monitoring, and analysis of SIDES experiments. This streamlined approach facilitates access to the SIDES framework and enhances human-machine interaction capabilities. The merits of SIDES are evaluated by a real case study of printed electronics with a binary multimodal data classification problem. Results show the advantages of the cost-effective integration of DoE in improving the DNNs' online learning performance. / Master of Science / The Industrial Internet's growth has brought in a massive amount of data from manufacturing systems leading to advanced data analysis methods using techniques like Deep Neural Networks (DNNs). These powerful models have shown great promise in critical manufacturing tasks, such as AI-driven quality control. However, challenges remain in ensuring these models perform well. For example, the lack of good data results in models with poor performance. Furthermore, deploying these models on the manufacturing shop floor poses challenges due to limited human-machine interaction capabilities. To tackle these challenges, we introduce the SynthetIc Data gEneration and Sampling (SIDES) framework with a user-friendly interface called SIDESync to enhance the human-machine interaction. This framework will improve how training data is prepared, ultimately boosting the performance of DNN models. Within this framework, we proposed a method called bi-level Hierarchical Contextual Bandits that combines real-world data sampling with a technique called Design of Experiments (DoE) to help Deep Neural Networks (DNNs) learn more effectively as they operate. We also used a tool called a Multimodality-Aligned Variational Autoencoder, which helps convert various types of manufacturing data (like sensor readings and images) into a standard format. 
This conversion makes it easier to generate new data from experiments and efficiently use real-world data samples. The SIDESync Graphical User Interface (GUI) is created using Python's Streamlit library. It makes setting up, monitoring, and analyzing SIDES experiments much easier. This user-friendly system improves access to the SIDES framework and boosts interactions between humans and machines. To prove how effective SIDES is, we conducted a real case study of data collected from printed electronics manufacturing. We focused on a problem where we needed to classify the final product quality using in-situ data with DNN model prediction. Our results clearly showed that integrating DoE improved how DNNs learned online, all while keeping costs in check. This work opens up exciting possibilities for making data-driven decisions in manufacturing smarter and more efficient. Data Generation Data Sampling Deep Neural Networks Industrial Internet
4	Analysis of the Effects of Sampling Sampled Data Hicks, William T. October 1996 (has links) International Telemetering Conference Proceedings / October 28-31, 1996 / Town and Country Hotel and Convention Center, San Diego, California / The traditional use of active RC-type filters as anti-aliasing filters in Pulse Code Modulation (PCM) systems is being replaced by the use of Digital Signal Processing (DSP) filters, especially when performance requirements are tight and when operation over a wide environmental temperature range is required. In order to keep systems more flexible, it is often desired to let the DSP filters run asynchronous to the PCM sample clock. This results in the PCM output signal being a sampling of the output of the DSP, which is itself a sampling of the input signal. In the analysis of the PCM data, the signal will have a periodic repeat of a previous sample, or a missing sample, depending on the relative sampling rates of the DSP and the PCM. This paper analyzes what effects can be expected in the analysis of the PCM data when these anomalies are present. Results are presented which allow the telemetry engineer to make an effective value judgment based on the type of filtering technology to be employed and on the desired system performance. Digital Signal Processing (DSP) Digital Filters Sampled Data Data Sampling Sampling
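The repeated-sample and missing-sample behavior described in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: the PCM encoder is assumed to take the most recent DSP output at each of its own sample instants (a zero-order hold), and the two rates are hypothetical numbers chosen to make the effect visible. It is not the paper's own analysis.

```python
def resample_zero_order_hold(n_out, dsp_rate, pcm_rate):
    """For each PCM sample, return the index of the latest available DSP sample.

    PCM sample k occurs at time k / pcm_rate; the newest DSP sample at or
    before that time has index floor(k * dsp_rate / pcm_rate). Integer
    arithmetic avoids floating-point rounding at the boundaries.
    """
    return [(k * dsp_rate) // pcm_rate for k in range(n_out)]

def count_anomalies(indices):
    """Count repeated DSP samples and skipped DSP samples in a PCM stream."""
    pairs = list(zip(indices, indices[1:]))
    repeats = sum(1 for a, b in pairs if b == a)
    skipped = sum(b - a - 1 for a, b in pairs if b - a > 1)
    return repeats, skipped

# PCM clock slightly faster than the DSP: a DSP sample is occasionally repeated.
fast_pcm = resample_zero_order_hold(101, dsp_rate=1000, pcm_rate=1010)
# PCM clock slightly slower than the DSP: a DSP sample is occasionally missed.
slow_pcm = resample_zero_order_hold(101, dsp_rate=1010, pcm_rate=1000)
```

In this toy model the anomalies recur at a rate set by the difference between the two clocks, which is consistent with the periodic repeat or omission the abstract describes.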
5	Robustness of the One-Sample Kolmogorov Test to Sampling from a Finite Discrete Population Tucker, Joanne M. (Joanne Morris) 12 1900 (has links) One of the most useful and best-known goodness-of-fit tests is the Kolmogorov one-sample test. The assumptions for the Kolmogorov one-sample test are: 1. A random sample; 2. A continuous random variable; 3. F(x) is a completely specified hypothesized cumulative distribution function. The Kolmogorov one-sample test has a wide range of applications. Knowing the effect of using the test when an assumption is not met is of practical importance. The purpose of this research is to analyze the robustness of the Kolmogorov one-sample test to sampling from a finite discrete distribution. The standard tables for the Kolmogorov test are derived based on sampling from a theoretical continuous distribution. As such, the theoretical distribution is infinite. The standard tables do not include a method or adjustment factor to estimate the effect on table values for statistical experiments where the sample is drawn without replacement from a finite discrete distribution. This research provides an extension of the Kolmogorov test for the case where the hypothesized distribution function is finite and discrete and the sampling distribution is based on sampling without replacement. An investigative study has been conducted to explore possible tendencies and relationships in the distribution of Dn when sampling with and without replacement for various parameter settings. In all, 96 sampling distributions were derived. Results show the standard Kolmogorov table values are conservative, particularly when the sample sizes are small or the sample represents 10% or more of the population. Kolmogorov one-sample test data sampling statistics Goodness-of-fit tests. Statistical hypothesis testing.
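For readers unfamiliar with the statistic Dn mentioned above, here is a minimal sketch of how it is computed in the standard continuous case, which is the setting the three listed assumptions describe. The uniform hypothesized distribution and the three sample values are hypothetical. The finite discrete extension studied in the thesis is not reproduced here.

```python
def kolmogorov_statistic(sample, cdf):
    """Return D_n = sup over x of |F_n(x) - F(x)| for a continuous hypothesized CDF F."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so the largest
        # gap can occur just before or just after the jump: check both.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))

d_n = kolmogorov_statistic([0.1, 0.4, 0.7], uniform_cdf)
```

The resulting value is then compared against a tabulated critical value for the sample size. The abstract's finding is that those standard table values are conservative when the population is finite and discrete.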
6	Rotulação de indivíduos representativos no aprendizado semissupervisionado baseado em redes: caracterização, realce, ganho e filosofia / Representatives labeling for network-based semi-supervised learning: characterization, highlighting, gain and philosophy Araújo, Bilzã Marques de 29 April 2015 (has links) Semi-supervised learning (SSL) is the machine learning paradigm that considers both labeled and unlabeled data. Although often described as a middle ground between the supervised and unsupervised paradigms, it is usually applied to predictive or descriptive tasks. In the predictive task of classification, for example, the goal is to label the unlabeled data according to the labels of the labeled data. In this case, the unlabeled data describe the data distributions and mediate the label propagation, while the labeled data seed the label propagation and guide it to stability. However, data are typically generated unlabeled, and labeling them requires domain specialists to do so by hand. Difficulties in visualizing large volumes of data, as well as the cost of the specialists' involvement, are challenges that may constrain the performance of the labeling task. Therefore, automatically highlighting good candidates for manual labeling, henceforth called representative individuals, is a high-value task that may yield a good tradeoff between specialist cost and learning performance. Among the SSL approaches in the literature, this study focuses on the network-based approach, in which datasets are represented relationally through a graph abstraction. Thus, the study aims to explore the influence of the labeled nodes on network-based SSL performance, that is, the characterization of representative nodes, how the network structure may highlight them, the SSL performance gain from labeling them by hand, and related philosophical aspects. Concerning characterization, criteria for characterizing central nodes were studied on networks with well-defined modular structures. Counterintuitively, highly connected nodes (hubs) are not very representative. Moderately connected nodes in sparsely connected neighborhoods, on the other hand, are; being strictly local, this criterion scales to large volumes of data. In networks with a homogeneous degree distribution (the Girvan-Newman (GN) model), nodes with a high clustering coefficient also prove representative. In networks with a heterogeneous degree distribution (the Lancichinetti-Fortunato-Radicchi (LFR) model), nodes with high betweenness stand out instead. Nodes with a high clustering coefficient in GN networks typically lie in quasi-clique motifs; nodes with high betweenness in LFR networks are hubs lying on community borders. In both cases, the highlighted nodes are excellent regularizers. In addition, because different criteria stand out in networks with different characteristics, unified approaches to characterizing representative nodes were also studied. Network construction from vector-based datasets, which is critical for highlighting representative individuals and for good semi-supervised classification performance, was studied as well. The proposed method, AdaRadius, offers advantages such as adaptability to datasets with varying density, low dependence on parameter settings, and reasonable computational cost on both pool-based and incremental data. The resulting networks are sparse but connected, and they allow semi-supervised classification to benefit from the prior labeling of representative individuals. Lastly, the validation of network construction methods for SSL was studied, and a measure called graph-labels Katz coherence was proposed. In sum, the results support the validity of selecting representative individuals to seed semi-supervised classification, corroborating the central hypothesis of the thesis. Analogies are found in several problems modeled as networks, such as epidemiology, rumor and information spreading, resilience, lethality, grandmother cells, and growth and self-organization. Complex networks Data sampling Semi-supervised learning
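One of the node criteria named in this abstract, the clustering coefficient, is strictly local and easy to compute. The sketch below uses the standard textbook definition on a hypothetical toy graph. It is an illustration only, and it does not reproduce the thesis's selection procedure or the AdaRadius method.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbors that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)

# Hypothetical undirected graph: a triangle (a, b, c) plus a pendant node d on c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
scores = {node: local_clustering(adj, node) for node in adj}
```

Because each score depends only on a node's immediate neighborhood, criteria of this kind scale to large graphs, which is the scalability property the abstract attributes to strictly local criteria.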
8	Predicting the Helpfulness of Online Product Reviews Hjalmarsson, Felicia January 2021 (has links) Review helpfulness prediction has attracted growing attention from researchers, who have proposed various solutions using Machine Learning (ML) techniques. Most studies use online reviews from Amazon to predict helpfulness, where each review is accompanied by information indicating how many people found it helpful. This research aims to analyze the complete process of modelling review helpfulness from several perspectives. Experiments are conducted comparing different methods for representing the review text, as well as analyzing the importance of data sampling for regression compared to using non-sampled datasets. Additionally, a set of review, review meta-data and product features is evaluated on its ability to capture the helpfulness of reviews. The experiments use two Amazon product review datasets and two of the most widely used machine-learning algorithms, Linear Regression and Convolutional Neural Network (CNN). The experiments empirically demonstrate that the choice of representation of the textual data has an impact on performance, with tf-idf and word2vec obtaining the lowest Mean Squared Error (MSE) values. The importance of data sampling is also evident from the experiments: the imbalanced ratios in the unsampled dataset negatively affected the performance of both models, biasing predictions in favor of the majority group of high ratios in the dataset. Lastly, the findings suggest that review features such as unigrams of review text and title, length of review text in words, and polarity of title, along with rating as a review meta-data feature, are the most influential features for determining the helpfulness of reviews. Review helpfulness prediction product reviews machine learning data sampling regression Software Engineering
9	Choosing a data frequency to forecast the quarterly yen-dollar exchange rate Cann, Benjamin 03 October 2016 (has links) Potentially valuable information about the underlying data generating process of a dependent variable is often lost when an independent variable is transformed to fit into the same sampling frequency as the dependent variable. With the mixed data sampling (MIDAS) technique and increasingly available data at high frequencies, the issue of choosing an optimal sampling frequency becomes apparent. We use financial data and the MIDAS technique to estimate thousands of regressions and forecasts at the quarterly, monthly, weekly, and daily sampling frequencies. Model fit and forecast performance measurements are calculated from each estimation and used to generate summary statistics for each sampling frequency so that comparisons can be made between frequencies. Our regression models contain an autoregressive component and five additional independent variables and are estimated with varying lag length specifications that incrementally increase up to five years of lags. Each regression is used to produce rolling, static, one- and two-step-ahead forecasts of the quarterly yen-U.S. dollar spot exchange rate. Our results suggest that it may be favourable to include high frequency variables for closer modeling of the underlying data generating process but not necessarily for increased forecasting performance. / Graduate mixed data sampling forecasting model selection criteria time-series yen dollar exchange rate econometrics economics MIDAS foreign exchange rates
10	Approximate Data Analytics Systems Le Quoc, Do 22 January 2018 Today, most modern online services use big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream at high speed and in huge volumes, and the cost of handling it can be significant. Providing interactive latency in processing the data is often impractical because the data is growing exponentially, faster even than Moore’s law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than exact, output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a subset of the input data instead of all of it. Unfortunately, the advancements in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees in the context of stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed and large-scale stream data to achieve low latency and efficient utilization of resources. To achieve these goals, we have designed and built the following approximate data analytics systems:
• StreamApprox: a data stream analytics system for approximate computing. This system supports approximate computing for low-latency stream analytics in a transparent way and can adapt to rapid fluctuations of input data streams. In this system, we designed an online adaptive stratified reservoir sampling algorithm to produce approximate output with bounded error.
• IncApprox: a data analytics system for incremental approximate computing. This system adopts approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. In this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
• PrivApprox: a data stream analytics system for privacy-preserving and approximate computing. This system supports high-utility, low-latency data analytics while preserving users’ privacy. The system is based on the combination of privacy-preserving data analytics and approximate computing.
• ApproxJoin: an approximate distributed join system. This system improves the performance of joins, which are critical but expensive operations in big data systems. In this system, we employed a sketching technique (Bloom filter) to avoid shuffling non-joinable data items through the network and proposed a novel sampling mechanism that executes during the join to obtain an unbiased representative sample of the join output.
Our evaluation, based on micro-benchmarks and real-world case studies, shows that these systems can achieve significant performance speedup compared to state-of-the-art systems by tolerating negligible accuracy loss in the analytics output. In addition, our systems allow users to systematically make a trade-off between accuracy and throughput/latency and require no or only minor modifications to existing applications. info:eu-repo/classification/ddc/004 ddc:004
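The stratified reservoir sampling mentioned for StreamApprox can be illustrated with a much simpler, non-adaptive variant. The sketch below is not the thesis's algorithm: it assumes a fixed reservoir size per stratum and uses the classic reservoir sampling scheme (Algorithm R) inside each stratum, then scales each stratum's sample sum by the number of items seen to estimate the stream total. The class name `StratifiedReservoir` and the toy stream are hypothetical.

```python
import random
from collections import defaultdict


class StratifiedReservoir:
    """Keeps a fixed-size uniform random sample per stratum of a stream."""

    def __init__(self, capacity_per_stratum, seed=0):
        self.capacity = capacity_per_stratum
        self.rng = random.Random(seed)
        self.samples = defaultdict(list)  # stratum -> sampled items
        self.seen = defaultdict(int)      # stratum -> items observed so far

    def add(self, stratum, item):
        self.seen[stratum] += 1
        reservoir = self.samples[stratum]
        if len(reservoir) < self.capacity:
            reservoir.append(item)
        else:
            # Algorithm R: the new item replaces a random slot with
            # probability capacity / seen, keeping the sample uniform.
            slot = self.rng.randrange(self.seen[stratum])
            if slot < self.capacity:
                reservoir[slot] = item

    def estimate_sum(self):
        """Estimate the stream total by scaling each stratum's sample sum."""
        total = 0.0
        for stratum, reservoir in self.samples.items():
            if reservoir:
                total += self.seen[stratum] / len(reservoir) * sum(reservoir)
        return total


# Toy stream (hypothetical): one busy sub-stream and one sparse sub-stream.
sampler = StratifiedReservoir(capacity_per_stratum=10)
for _ in range(1000):
    sampler.add("busy", 2.0)
for _ in range(3):
    sampler.add("sparse", 5.0)
estimate = sampler.estimate_sum()
```

Sampling per stratum is what keeps the sparse sub-stream from being drowned out: a single reservoir over the whole stream would almost never retain any of its three items.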
