Global ETD Search

411	Machine learning and statistical analysis in fuel consumption prediction for heavy vehicles / Maskininlärning och statistisk analys för prediktion av bränsleförbrukning i tunga fordon Almér, Henrik January 2015 (has links) I investigate how to use machine learning to predict fuel consumption in heavy vehicles. I examine data from several different sources describing road, vehicle, driver and weather characteristics and I find a regression to a fuel consumption measured in liters per distance. The thesis is done for Scania and uses data sources available to Scania. I evaluate which machine learning methods are most successful, how data collection frequency affects the prediction and which features are most influential for fuel consumption. I find that a lower collection frequency of 10 minutes is preferable to a higher collection frequency of 1 minute. I also find that the evaluated models are comparable in their performance and that the most important features for fuel consumption are related to the road slope, vehicle speed and vehicle weight. / Jag undersöker hur maskininlärning kan användas för att förutsäga bränsleförbrukning i tunga fordon. Jag undersöker data från flera olika källor som beskriver väg-, fordons-, förar- och väderkaraktäristiker. Det insamlade datat används för att hitta en regression till en bränsleförbrukning mätt i liter per sträcka. Studien utförs på uppdrag av Scania och jag använder mig av datakällor som är tillgängliga för Scania. Jag utvärderar vilka maskininlärningsmetoder som är bäst lämpade för problemet, hur insamlingsfrekvensen påverkar resultatet av förutsägelsen samt vilka attribut i datat som är mest inflytelserika för bränsleförbrukning. Jag finner att en lägre insamlingsfrekvens av 10 minuter är att föredra framför en högre frekvens av 1 minut. Jag finner även att de utvärderade modellerna ger likvärdiga resultat samt att de viktigaste attributen har att göra med vägens lutning, fordonets hastighet och fordonets vikt. 
machine learning statistical analysis data science fuel consumption prediction support vector regression artificial neural networks random forest linear regression Computer Sciences Datavetenskap (datalogi)
412	Evaluating supervised machine learning algorithms to predict recreational fishing success : A multiple species, multiple algorithms approach / Utvärdering av övervakade maskininlärningsalgoritmer för att förutsäga framgång inom sportfiske Wikström, Johan January 2015 (has links) This report examines three different machine learning algorithms and their effectiveness for predicting recreational fishing success. Recreational fishing is a hugely popular pastime, but reliable methods of predicting fishing success have largely been missing. This report compares random forest, linear regression and multilayer perceptron to a reasonable baseline model for predicting fishing success. Fishing success is defined as the expected weight of the fish caught. Previous reports have mainly focused on commercial fishing or limited the research to examining the impact of a single variable. In this exploratory study, multiple attributes and multiple algorithms are examined to determine if supervised machine learning is a viable tool to predict recreational fishing success. Recreational fishing success can potentially be predicted by a large number of attributes, which may differ between species. In this report, data is fetched from multiple sources and combined into a unified data format. The primary source of data is a database from the fishing app FishBrain, containing data on over 250,000 logged catches. Another is the World Weather Online API, which supplies weather data. The report focuses on the four most common species in the database: largemouth bass (Micropterus salmoides), northern pike (Esox lucius), rainbow trout (Oncorhynchus mykiss) and European perch (Perca fluviatilis), with a focus on largemouth bass since it has the most data available. Algorithms are evaluated using the Weka data mining software. Hyperparameters are found using cross-validation, and some data is held out as a test set to validate the results after cross-validation. 
Results are measured as the error compared to a baseline algorithm. Random forest is the most effective algorithm in the experiments, reducing error compared to the baseline for all the examined fish species. It is also found that no single variable affects the chosen metric of fishing success much, but rather a combination of most of the examined variables is needed to give optimal predictions. In conclusion, the random forest algorithm can be used to predict fishing success across multiple species. It performs significantly better than linear regression, multilayer perceptron and the baseline on crossvalidation and on the testing set. / I denna rapport evalueras tre olika maskininlärningsalgoritmer och deras effektivitet för att förutsäga framgång inom sportfiske. Sport- fiske är en mycket populär hobby, men pålitliga metoder att förutsäga framgångsrikt sportfiske saknas. Denna rapport jämför random forest, linjär regression och flerlagers neurala nätverk mot en rimlig baselinealgorithm för att förutsäga framgång inom sportfiske. Framgång defineras som fiskens förväntade vikt i kg. Tidigare undersökningar har huvudsakligen fokuserat på kommersiellt fiske eller begränsat undersökningen till påverkan av en enskild variabel. I denna studie undersöks flera attribut och algoritmer för att avgöra om övervakad maskininlärning är ett användbart verktyg för att förutsäga framgång inom sportfiske. Framgång inom sportfiske kan potentiellt påverkas av ett stort antal attribut som kan vara olika för olika arter. I denna studie hämtas data från ett flertal källor som kombineras i ett unifierat dataformat. Den primära datakällan är en databas tillhörande sportfiskeappen FishBrain som innehåller över 250000 loggade fångster. En annan källa är World Weather Online:s API som bidrar med väderdata. 
Rapporten fokuserar på de fyra vanligaste arterna i databasen, largemouth bass, Micropterus salmoides, gädda, Esox lucius, regnbågsöring, Oncorhynchus mykiss och europeisk abborre, Perca fluviatilis med ett särskilt fokus på largemouth bass eftersom den har mest data tillgängligt. Algoritmerna evalueras med hjälp av data mining-verktyget Weka. Hyperparametrar bestäms med hjälp av korsvalidering och en delmängd av datan separeras och används för att validera resultaten efter korsvalidering. Resultaten mäts relativt en baseline-algoritm. Random forest är den mest effektiva algoritmen i experimenten och reducerar felet jämfört med baseline-algoritmen för alla undersökta fiskarter. Inget enskilt attribut påverkar slutresultatet mycket utan det behövs en kombination av flera attribut för att ge optimala prediktioner. Slutsatsen blir att random forest kan användas för att förutsäga framgång inom sportfiske för flera olika fiskarter. Den presterar signifikant bättre än linjär regression, flerlagers neuralt nätverk och baselinealgoritmen på korsvalidering och på testdelmängden. sport fishing recreational fishing fishing supervised machine learning random forest linear regression artificial neural networks sportfiske fiske Computer Sciences Datavetenskap (datalogi)
413	Predicción de la demanda para un general sales service agent (GSSA) mediante regresión lineal simple / Demand forecasting for a general sales service agent through simple linear regression Rojas García, Freddy Wiliam 09 December 2020 (has links) Pacific Feeder Services (PFS) es un agente general de venta de espacios aéreos de distintas aerolíneas; por ejemplo, Korean Air, Aeroméxico, Alitalia, Aerolíneas Argentinas y Gol. Estas aerolíneas no cuentan con infraestructura propia en el Perú, de modo que PFS actúa como representante de estas aerolíneas ante sus clientes. En el presente trabajo de investigación se utilizará la metodología iterativa de la ciencia de datos para abordar el problema relacionado a la demanda, puesto que esta es incierta en algunos meses del año. Para ello, se plantea la siguiente hipótesis: ¿Será una regresión lineal simple el modelo adecuado para realizar el pronóstico de los volúmenes de la demanda que tendrá PFS en los próximos meses? El objetivo por alcanzar será proyectar la demanda mediante una regresión lineal simple, para lo cual se está tomando como base los datos de los kilos exportados por PFS en el año 2019. Asimismo, el presente trabajo de investigación académico presenta una arquitectura de datos funcional y una arquitectura de datos tecnológica que da soporte al modelo de regresión lineal simple. La primera explica cuáles son los insumos, almacenamiento y consumo que se requieren para implementar el mencionado modelo, mientras que la segunda expone las herramientas del modelo. Finalmente, el trabajo acaba con las conclusiones y recomendaciones asociadas a la correcta implementación del modelo de regresión lineal simple en el caso específico de PFS. / Pacific Feeder Services (PFS) is a general sales service agent (GSSA) whose main duty is to commercialize air freight capacity of different airlines; for example, Korean Air, Aeroméxico, Alitalia, Aerolineas Argentinas and Gol. 
These airlines do not have their own infrastructure in Peru, so PFS acts as their representative to their customers. In this research paper, the iterative methodology of data science is used to address the problem of demand, which is uncertain in some months of the year. To do this, the following hypothesis is posed as a question: Will a simple linear regression be the appropriate model to forecast the volumes of demand that PFS will have in the coming months? The objective is to project demand through a simple linear regression, taking as a basis the kilos exported by PFS in 2019. The paper also presents a functional data architecture and a technological data architecture that support the simple linear regression model. The first explains the inputs, storage and consumption required to implement the model, while the second presents the tools the model uses. Finally, the paper ends with the conclusions and recommendations associated with the correct implementation of the simple linear regression model in the specific case of PFS. / Trabajo de investigación Modelado de datos Regresión lineal simple Arquitectura de datos Data modeling Simple linear regression Data architecture
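To illustrate the kind of simple linear regression forecast that entry 413 describes, here is a minimal Python sketch. It is not the thesis's implementation: the monthly kilo figures are made-up placeholders rather than PFS data, and the function names `fit_simple_linear_regression` and `forecast` are hypothetical.

```python
# Minimal sketch of a one-predictor (simple) linear regression forecast.
# All numbers are illustrative placeholders, NOT data from the thesis.

def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def forecast(intercept, slope, x):
    """Evaluate the fitted line at a new value of the predictor."""
    return intercept + slope * x

# Hypothetical year of monthly exports (kilos), indexed by month number 1..12.
months = list(range(1, 13))
kilos = [1000 + 50 * m for m in months]  # synthetic, exactly linear for clarity

intercept, slope = fit_simple_linear_regression(months, kilos)
next_month = forecast(intercept, slope, 13)  # projection for the following month
```

With a single year of monthly observations, as in the abstract, the fit rests on only twelve points, so the projected value should be read together with its residuals rather than as a precise figure.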
414	Aplicación de Data Science en la productividad de emisiones de pólizas / Data Science Application in productivity of policies issuance Escobar Pacheco, Víctor Eduardo, Lazo vera, Zadith Elizabeth, Padilla Mantilla, Bryan Obed, Sangay Espinoza, Almendra Alessandra 13 December 2020 (has links) El presente trabajo tiene como objetivo identificar las variables que influyeron en la productividad de las emisiones de nuevas pólizas en el periodo del 2019 en las agencias de Lima, en las siguientes páginas se detalla de manera concreta y pormenorizada. Por otro lado, dicho estudio abarca temas tales como comprensión del negocio y enfoque analítico, compresión y preparación de los datos, producción, análisis e interpretación de los datos, modelado y evaluación de la data. Asimismo, la metodología utilizada para el estudio en mención está basada en la metodología de la ciencia de datos de IBM, que se espera contribuya a la obtención de resultados favorables de cara a responder la pregunta de investigación. El presente trabajo de investigación tiene un enfoque descriptivo que emplea la técnica de aprendizaje supervisado con la ayuda de regresión lineal. Por último, el propósito del presente proyecto de investigación no es el de brindar una solución en concreto para la organización en estudio, sino el de dar alternativas para posibles planes de acción y/o mejor toma de decisiones al problema identificado, los cuales puedan ser implementados en pro de la mejora departamental de la compañía. / The work purpose is to identify the variables that influenced the productivity of the issuance of new policies in the 2019 period in the Lima agencies, which is detailed in a concrete and detailed way in the following pages. On the other hand, this study covers topics such as business understanding and analytical approach, data compression and preparation, data production, analysis and interpretation, data modeling and evaluation. 
Likewise, the methodology used in the study is based on the IBM data science methodology, which is expected to contribute to obtaining favorable results in answering the research question. The research takes a descriptive approach and uses supervised learning with linear regression. Finally, the purpose of this research project is not to provide a specific solution for the organization under study, but to offer alternatives for possible action plans and/or better decision-making on the identified problem, which may be implemented to improve the company's departments. / Trabajo de investigación Ciencia de datos Productividad Regresión lineal Data science Productivity Linear regression
415	Sales Forecasting by Assembly of Multiple Machine Learning Methods : A stacking approach to supervised machine learning Falk, Anton, Holmgren, Daniel January 2021 (has links) Today, digitalization is a key factor for businesses to enhance growth and gain advantages and insight into their operations. Digitalization plays a key role both in planning operations and in understanding customers, and companies are spending more and more resources in these fields to gain critical insights and enhance growth. The fast-food industry is no exception: restaurants need to be highly flexible and agile in their work. As a result, there is an immense demand for knowledge and insights to help restaurants plan their daily operations, and a great need for organizations to continuously integrate new technological solutions into their existing processes. Well-implemented machine learning solutions in combination with feature engineering are likely to bring value to the existing processes. Sales forecasting, the main field of study in this thesis, has a vital role in planning fast-food restaurants' operations, for both budgeting and staffing purposes. The term fast food is self-explanatory, and with it comes a commitment to provide high-quality food and rapid service to customers. Understaffing risks compromising either the quality of the food or the service, while overstaffing leads to low overall productivity. Generating highly reliable sales forecasts is thus vital to maximize profits and minimize operational risk. SARIMA, XGBoost and Random Forest were evaluated on training data consisting of sales numbers, business hours and categorical variables describing date and month. These models served as base learners whose sales predictions on a specific dataset were used as training data for a Support Vector Regression (SVR) model. 
A stacking approach to this type of project shows sufficient results with a significant gain in prediction accuracy for all investigated restaurants on a 6-week aggregated timeline compared to the existing solution. / Digitalisering har idag en nyckelroll för att skapa tillväxt och insikter för företag, dessa insikter ger fördelar både inom planering och i förståelsen om deras kunder. Det här är ett område som företag lägger mer och mer resurser på för att skapa större förståelse om sin verksamhet och på så sätt öka tillväxten. Snabbmatsindustrin är inget undantag då restauranger behöver en hög grad av flexibilitet i sina arbetssätt för att möta kundbehovet. Det här skapar en stor efterfrågan av kunskap och insikter för att hjälpa dem i planeringen av deras dagliga arbete och det finns ett stort behov från företagen att kontinuerligt implementera nya tekniska lösningar i befintliga processer. Med väl implementerade maskininlärningslösningar i kombination med att skapa mer informativa variabler från befintlig data kan aktörer skapa mervärde till redan existerande processer. Försäljningsprognostisering, som är huvudområdet för den här studien, har en viktig roll för verksamhetsplaneringen inom snabbmatsindustrin, både inom budgetering och bemanning. Namnet snabbmat beskriver sig själv, med det följer ett löfte gentemot kunden att tillhandahålla hög kvalitet på maten samt att kunna tillhandahålla snabb service. Underbemanning kan riskera att bryta någon av dessa löften, antingen i undermålig kvalitet på maten eller att inte kunna leverera snabb service. Överbemanning riskerar i stället att leda till ineffektivitet i användandet av resurser. Att generera högst tillförlitliga prognoser är därför avgörande för att kunna maximera vinsten och minimera operativ risk. SARIMA, XGBoost och Random Forest utvärderades på ett träningsset bestående av försäljningssiffror, timme på dygnet och kategoriska variabler som beskriver dag och månad. 
Dessa modeller fungerar som basmodeller vars prediktioner från ett specifikt testset används som träningsdata till en Stödvektorsreggresionsmodell (SVR). Att använda stapling av maskininlärningsmodeller till den här typen av problem visade tillfredställande resultat där det påvisades en signifikant förbättring i prediktionssäkerhet under en 6 veckors aggregerad period gentemot den redan existerande modellen. machine learning statistical learning statistics random forest xgboost sarima stacking support vector regression svr linear regression sales sales forcasting forecasting time series Mathematics Matematik
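The stacking idea in entry 415 (base-learner forecasts on a dedicated dataset become the training inputs of a second-level model) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two base learners below (a seasonal-naive forecast and a moving average) are simple stand-ins for SARIMA, XGBoost and Random Forest, the least-squares meta-learner is a stand-in for the SVR, the sales series is synthetic, and all function names are hypothetical.

```python
# Sketch of stacking for sales forecasting (illustrative only).
# Base learners and meta-learner are simple stand-ins, NOT the thesis models.

def seasonal_naive(history, period=7):
    """Base learner 1: repeat the value observed one period (week) ago."""
    return history[-period]

def moving_average(history, window=7):
    """Base learner 2: mean of the last `window` observations."""
    return sum(history[-window:]) / window

def base_features(series, t):
    """Base-learner forecasts for day t, using only observations before day t."""
    history = series[:t]
    return (seasonal_naive(history), moving_average(history))

def fit_meta_learner(features, targets):
    """Least-squares weights (w1, w2) for y ~ w1*f1 + w2*f2 via 2x2 normal equations."""
    a11 = sum(f1 * f1 for f1, _ in features)
    a12 = sum(f1 * f2 for f1, f2 in features)
    a22 = sum(f2 * f2 for _, f2 in features)
    b1 = sum(f1 * y for (f1, _), y in zip(features, targets))
    b2 = sum(f2 * y for (_, f2), y in zip(features, targets))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("base-learner forecasts are collinear")
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

def stacked_forecast(weights, feats):
    """Combine the base forecasts with the learned meta-learner weights."""
    return weights[0] * feats[0] + weights[1] * feats[1]

def sse(predictions, actuals):
    """Sum of squared errors."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals))

# Synthetic daily sales: weekly pattern plus a slow upward trend (made-up numbers).
sales = [100 + 10 * (d % 7) + d for d in range(56)]

meta_days = range(14, 42)  # days whose base forecasts train the meta-learner
test_days = range(42, 56)  # later days kept aside for evaluation

meta_X = [base_features(sales, t) for t in meta_days]
meta_y = [sales[t] for t in meta_days]
weights = fit_meta_learner(meta_X, meta_y)

# On the meta-training days the stacked fit cannot be worse than either base
# learner alone, because weights (1, 0) and (0, 1) are both feasible solutions.
meta_sse_stacked = sse([stacked_forecast(weights, f) for f in meta_X], meta_y)
meta_sse_naive = sse([f[0] for f in meta_X], meta_y)
meta_sse_ma = sse([f[1] for f in meta_X], meta_y)

# Forecasts on the held-out days; these are what a real evaluation would score.
test_preds = [stacked_forecast(weights, base_features(sales, t)) for t in test_days]
```

The in-sample guarantee noted in the comments does not carry over to the held-out days, which is why a separate evaluation window is kept.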
416	Analizar el incremento de suscriptores de Netflix con respecto a la competencia desde el 2010 hasta lo que va del año 2020 Figueroa López, Romina Beatriz, Uriarte Mori, José André 28 November 2020 (has links) El presente trabajo de investigación tiene como finalidad analizar el incremento de suscriptores de Netflix con respecto a la competencia desde el 2010 hasta lo que va del año 2020. Hemos determinado que el enfoque será predictivo para que la organización a cargo pueda hacer uso del modelo supervisado de la manera que más le favorezca y estos puedan tomar las mejores decisiones estratégicas. Para ello, se ha generado una base de datos recopilada de diversas fuentes públicas confiables para obtener las variables: “cantidad de suscriptores”, “costo de contenido original”, “covid-19” … y posterior a ello, con toda la data adquirida se procederá a realizar cada etapa de la metodología de la ciencia de datos descrita en el curso durante el programa de ciencia de datos. Para aclarar el panorama hemos optado por el uso de la técnica de correlación de Pearson, lo cual nos permitió determinar las variables que tenían mejor correlación entre ellas, esto advierte que la variable más adecuada para determinar futuros pronósticos y analizar el incremento de suscriptores es la del costo de contenido original. Finalmente, para mostrar los resultados de la investigación se ha decidido utilizar como herramienta de visualización Power BI para exponer el presente estudio y responder a los objetivos planteados. / The purpose of this research work is to analyze the increase in Netflix subscribers with respect to the competition from 2010 to so far in 2020. We have determined that the approach will be predictive so that the organization in charge can make use of the supervised model in the way that best suits them and they can make the best strategic decisions. 
For this, a database was compiled from various reliable public sources to obtain the variables "number of subscribers", "cost of original content", "covid-19" ... and after that, with all the data acquired, each stage of the data science methodology described in the course during the data science program will be carried out. To clarify the picture, we opted for the Pearson correlation technique, which allowed us to determine the variables that were best correlated with each other; this indicates that the most appropriate variable for producing future forecasts and analyzing the increase in subscribers is the cost of original content. Finally, Power BI was chosen as the visualization tool to present the results of the study and respond to the objectives set. / Trabajo de investigación Netflix Suscriptores Ciencia de datos Regresión lineal simple Subscribers Data science Simple linear regression
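The variable screening described in entry 416 (ranking candidate variables by their Pearson correlation with subscriber counts) could look roughly like the Python sketch below. The yearly figures are invented placeholders, not the study's data, and the variable and function names are hypothetical.

```python
import math

# Sketch of Pearson-correlation screening of candidate predictors.
# All series are made-up placeholders, NOT the data compiled in the study.

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical yearly observations.
subscribers = [10, 20, 30, 40, 50]   # target variable
candidates = {
    "content_cost": [1, 2, 3, 4, 5],  # rises in step with subscribers
    "covid_flag": [0, 0, 0, 0, 1],    # indicator switched on in the last year only
}

# Correlation of each candidate with the target, then pick the strongest one.
correlations = {name: pearson_r(values, subscribers) for name, values in candidates.items()}
best_variable = max(correlations, key=lambda name: abs(correlations[name]))
```

A high coefficient only indicates a linear association over the observed years; it does not by itself show that the selected variable drives subscriber growth.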
417	Comparison of linear regression and neural networks for stock price prediction Karlsson, Nils January 2021 (has links) Stock market prediction has been a hot topic lately due to advances in computer technology and economics. One economic theory, the Efficient Market Hypothesis (EMH), states that all known information is already factored into prices, which makes it impossible to predict the stock market. Despite the EMH, many researchers have been successful in predicting the stock market using neural networks on historical data. This thesis investigates stock prediction using both linear regression and neural networks (NN), with a twist: the inputs to the proposed methods are a number of profit predictions calculated with stochastic methods such as generalized autoregressive conditional heteroskedasticity (GARCH) and autoregressive integrated moving average (ARIMA), whereas the traditional approach uses raw data as inputs. The proposed methods show superior results in yielding profit: at best 1.1% in the Swedish market and 4.6% in the American market. The neural network yielded more profit than the linear regression model, which is reasonable given its ability to find nonlinear patterns. The historical data was used with different window sizes, which gives a good understanding of the impact of window size on prediction performance. Stock market Stock market prediction prediction neural network feed forward neural network artificial neural network linear regression finance efficient market hypothesis EMH Engineering and Technology Teknik och teknologier
418	Factores que influyeron en la exportación de mango fresco del perú hacia EE.UU durante el periodo 2002-2019 / Factor that influence the export of fresh mango from Peru to the US during the period 2002-2019 Flores Otoya, Brunela Belén, Martinez Suarez, Franco Alonso 13 January 2021 (has links) Dentro del sector no tradicional agrícola, el mango fresco es la quinta fruta peruana más exportada, con un crecimiento promedio anual de 12.92 por ciento desde el 2002 al 2019. Estados Unidos es el segundo destino de las exportaciones peruanas de mango, abarcando un promedio del 24% por ciento de las ventas totales en el 2019. Ante ello, la presente investigación busca determinar los factores que influyeron en la exportación de mango fresco del Perú hacia Estados Unidos durante el periodo 2002 – 2019. El estudio tuvo un enfoque mixto, con un alcance descriptivo, correlacional y causal; con un diseño no experimental longitudinal, donde se analizaron las variables Producción, Precio FOB, Tipo de Cambio, Demanda EE.UU y PBI de EE.UU. Además, de carácter descriptivo se analizaron las variables Gestión empresarial, Apoyo del Estado y Clima. Para el análisis cuantitativo se obtuvo información de fuentes secundarias como Adex Data Trade, BCRP, Banco Mundial, MINAGRI y Bureau of Economic donde se procesaron los datos a través de un modelo de regresión lineal múltiple. Mientras que, para el análisis cualitativo, se usó la técnica de entrevistas semi estructuradas en el cual se entrevistaron a trece actores clave pertenecientes a entidades del sector privado, sector público y gremios relacionados donde se procesó la información a través de la herramienta Atlas Ti. Sobre los resultados de la investigación se concluyeron que las variables producción nacional y demanda de Estados Unidos influyeron en la exportación de mango fresco peruano hacia Estados Unidos. 
/ Within the non-traditional agricultural sector, fresh mango is the fifth most exported Peruvian fruit, with an average annual growth of 12.92 percent from 2002 to 2019. The United States is the second-largest destination for Peruvian mango exports, accounting for an average of 24 percent of total sales in 2019. Given this, this research seeks to determine the factors that influenced the export of fresh mango from Peru to the United States during the period 2002-2019. The study had a mixed approach, with a descriptive, correlational and causal scope, and a non-experimental longitudinal design in which the variables Production, FOB Price, Exchange Rate, US Demand and US GDP were analyzed. In addition, the variables Business Management, State Support and Climate were analyzed descriptively. For the quantitative analysis, information was obtained from secondary sources such as Adex Data Trade, BCRP, World Bank, MINAGRI and Bureau of Economic, and the data was processed through a multiple linear regression model. For the qualitative analysis, semi-structured interviews were conducted with thirteen key actors from private-sector entities, public-sector entities and related trade associations, and the information was processed with the Atlas Ti tool. The investigation concluded that the variables national production and demand from the United States influenced the export of fresh Peruvian mango to the United States. / Tesis Mango Exportación Regresión lineal múltiple Export Multiple linear regression
419	Stigmatizace osob s duševním onemocněním / Stigma toward people with mental illness Weissová, Aneta January 2015 (has links) Stigmatization of people with mental illness has a negative impact on their quality of life. There are few Czech studies focusing on stigma, and they tend to address only one element of the problem. The aim of this thesis is to identify the level of stigma in the Czech Republic and its socio-demographic predictors. This knowledge will help in choosing target groups for a stigma-reducing campaign. The thesis focuses on three elements of stigma: problems in knowledge, attitudes and behaviour. Four datasets are used: one from a survey conducted within this thesis, one from CVVM and two from INRES, which were conducted for NUDZ. Standardised research tools were used to measure knowledge (MAKS scale), attitudes (CAMI scale), behaviour (RIBS scale) and social distance. Predictors were identified using multivariate linear regression analysis. When comparing the level of stigma across the three elements, behaviour has the highest level and knowledge the lowest. A higher level of stigma in knowledge and attitudes is associated with being male, a lower education level and a smaller size of residence. A higher level of stigma in behaviour is related to higher age, region and previous contact with a person with mental illness. However, these relations are rather weak and there are other non-socio-demographic factors influencing...
420	A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data Zuber, Verena 27 June 2012 (has links) In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine, which depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view, the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is a complicated challenge due to the characteristic structure of omics data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external, latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation. Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. 
We derive the CAT and CAR scores from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule. Furthermore, we discuss properties of both scores and relations to established quantities. Above all, the CAT score decomposes Hotelling's T² and the CAR score the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores. To render our approach applicable to high-dimensional omics data, we devise an efficient algorithm for shrinkage estimates of the CAT and CAR scores. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches in terms of more true positives selected and a lower model error. Finally, we illustrate the application of the CAT and CAR scores to real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that the CAT and CAR scores are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error. info:eu-repo/classification/ddc/000 ddc:000
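The two decompositions mentioned in entry 420 can be written compactly. The notation below is an assumption introduced here for illustration and does not appear in the abstract: P is the correlation matrix among the predictors, P_{Xy} the vector of marginal correlations between predictors and outcome, and \tau the vector of t-scores.

```latex
% CAR score: Mahalanobis-decorrelated marginal correlations; squares sum to R^2
\omega = P^{-1/2} P_{Xy},
\qquad
\omega^{\top}\omega = P_{yX}\, P^{-1} P_{Xy} = R^{2}

% CAT score: Mahalanobis-decorrelated t-scores; squares sum to Hotelling's T^2
\tau^{\mathrm{adj}} = P^{-1/2}\,\tau,
\qquad
(\tau^{\mathrm{adj}})^{\top}\tau^{\mathrm{adj}} = \tau^{\top} P^{-1} \tau = T^{2}
```

Read this way, each squared component of \omega is one variable's share of the explained variance, which is what allows the scores to be used for ranking variables under correlation.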
