1

Tidsaspekt för informationsklassificering inom svenska myndigheter / Timescale for information classification in Swedish governmental agencies

Susi, Tommy January 2016 (has links)
No description available.
2

Using active learning for semi-automatically labeling a dataset of fisheye distorted images for object detection

Bourghardt, Olof January 2022 (has links)
Self-driving vehicles have become a hot topic in industry in recent years, and companies around the globe are attempting to solve the complex task of developing vehicles that can safely navigate roads and traffic without the assistance of a driver. As deep learning and computer vision become more streamlined, and with the possibility of using fisheye cameras as a cheap alternative to external sensors, some companies have begun researching assisted driving for vehicles such as electric scooters, aiming to prevent injuries and accidents by detecting dangerous situations and to promote a sustainable infrastructure. Training such a model, however, requires gathering large amounts of data that must be labeled by a human annotator. This process is expensive, time consuming, and requires extensive quality checking, which can be difficult for companies to afford. This thesis presents an application for semi-automatically labeling a dataset with the help of a human annotator and an object detector. The application trains an object detector within an active learning framework on a small labeled subset sampled from the WoodScape dataset of fisheye-distorted images, and uses the knowledge of the trained model, assisted by a human annotator, to label more data. The thesis examines the labels produced by the application and compares their quality with the annotations in the WoodScape dataset. The results show that the model could not produce annotations of comparable quality, so the human annotator had to label all of the data; the model achieved an accuracy of 0.00099 mAP.
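The semi-automatic labeling loop described above can be sketched as pool-based selection by model confidence: predictions above a threshold are accepted automatically, the rest go to the human annotator. This is a minimal illustration, not the thesis's implementation; the names (`predict_confidence`, `split_by_confidence`) and the 0.8 threshold are hypothetical, and the confidence function is a random stand-in for an object detector.

```python
import random

def predict_confidence(sample):
    # Stand-in for an object detector's top detection confidence.
    # Seeding makes the sketch deterministic per sample.
    random.seed(sample)
    return random.random()

def split_by_confidence(pool, threshold=0.8):
    """Auto-accept confident predictions; route the rest to a human annotator."""
    auto_labeled, needs_human = [], []
    for sample in pool:
        if predict_confidence(sample) >= threshold:
            auto_labeled.append(sample)
        else:
            needs_human.append(sample)
    return auto_labeled, needs_human

pool = list(range(100))
auto, manual = split_by_confidence(pool)
# Every sample ends up in exactly one of the two queues.
assert len(auto) + len(manual) == len(pool)
```

In the thesis's outcome, the detector's confidence was never high enough, so effectively everything landed in the human queue; the loop structure is the same either way.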
3

Automatic vs. Manual Data Labeling : A System Dynamics Modeling Approach / Automatisk Kontra Manuell Dataannotering : med Systemdynamiksmodellering

Blank, Clas January 2020 (has links)
Labeled data, a collection of data samples that have been tagged with one or more labels, plays an important role in many software organizations in today's market. It can help in solving automation problems, training and validating machine learning models, or analysing data. Many organizations therefore set up their own labeled-data gathering systems to supply them with the data they require. Labeling can either be done by humans or via some automated process. However, labeling datasets comes with costs to these organizations. This study examines what such a labeled-data gathering system could look like and determines which components play a crucial role in how costly an automatic approach is compared to a manual one, using the company Klarna's label acquisition system as a case study. Two models are presented: one describes a system that relies solely on humans for data annotation, while the other describes a system where labeling is done via an automatic process. These models are used to compare the costs of each approach to an organization. Important findings include the identification of the components that determine which approach is more economically efficient under given circumstances, among them the label decay rate, the expected automatic and manual accuracy, and the number of data points that require labeling.
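A toy cost comparison makes the role of those components concrete. This is a simplified stand-in for the thesis's system dynamics models, not a reproduction of them; every parameter value below (setup cost, per-label costs, accuracy, decay rate) is hypothetical.

```python
def manual_cost(n_points, cost_per_label, decay_rate):
    # Labels that decay must be redone, so effective volume grows with decay.
    return n_points * (1 + decay_rate) * cost_per_label

def automatic_cost(n_points, setup_cost, machine_cost_per_label,
                   manual_cost_per_label, accuracy, decay_rate):
    effective = n_points * (1 + decay_rate)
    # Mislabeled points (1 - accuracy) fall back to manual correction.
    fallback = effective * (1 - accuracy) * manual_cost_per_label
    return setup_cost + effective * machine_cost_per_label + fallback

n = 100_000
m = manual_cost(n, cost_per_label=0.5, decay_rate=0.1)
a = automatic_cost(n, setup_cost=10_000, machine_cost_per_label=0.01,
                   manual_cost_per_label=0.5, accuracy=0.95, decay_rate=0.1)
# At this volume the automatic approach is cheaper in this toy model;
# at small volumes the setup cost dominates and manual wins.
assert a < m
```

Varying `n_points`, `accuracy`, and `decay_rate` shows the break-even behavior the study identifies: the automatic approach pays off only once volume and accuracy are high enough to amortize its fixed cost.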
4

Human mobility behavior : Transport mode detection by GPS data

Sadeghian, Paria January 2021 (has links)
GPS tracking data are widely used to understand human travel behavior and to evaluate the impact of travel. A major advantage of using GPS tracking devices for data collection is that they enable the researcher to collect large amounts of highly accurate and detailed human mobility data. However, unlabeled GPS tracking data does not easily lend itself to detecting transport mode, which has given rise to a range of methods and algorithms for this purpose. These algorithms vary in design and functionality, from hand-defined rules to advanced machine learning algorithms. There is, however, no previous comprehensive review of these algorithms, and this thesis aims to identify their essential features and methods and to develop and demonstrate a method for detecting transport mode in GPS tracking data. To do this, it is necessary to have a detailed description of the particular journey undertaken by an individual. Therefore, as part of the investigation, a microdata analytic approach is applied to the problem area, covering the stages of data collection, data processing, data analysis, and decision making. To fill the research gap, Paper I consists of a systematic literature review of the methods and essential features used for detecting transport mode in unlabeled GPS tracking data. Selected empirical studies were categorized into rule-based, statistical, and machine learning methods. The evaluation shows that machine learning algorithms are the most common. In the evaluation, I compared the methods previously used, the extracted features, the types of dataset, and the model accuracy of transport mode detection. The results show that there is no standard method for transport mode detection. In light of these results, I propose in Paper II a stepwise methodology that takes advantage of the unlabeled GPS data, first using an unsupervised algorithm to detect five transport modes.
A GIS multi-criteria process was then applied to label part of the dataset. The performance of five supervised algorithms was evaluated by applying them to different portions of the labeled dataset. The results show that the stepwise methodology can achieve high accuracy in detecting transport mode while labeling only 10% of the entire dataset. For the future, one interesting direction would be to apply the stepwise methodology to a balanced and larger dataset. A semi-supervised deep learning approach is suggested for further development, since such a method can detect transport modes with only small amounts of labeled data. Thus, the stepwise methodology can be improved upon in further studies.
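Of the method families the review categorizes, the rule-based one is the simplest to illustrate: derive a speed from consecutive GPS fixes and threshold it. This sketch is not the thesis's stepwise method (which uses unsupervised learning plus a GIS multi-criteria process); the speed thresholds in km/h are illustrative assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def classify_segment(points):
    """points: [(lat, lon, unix_seconds), ...] -> a coarse mode label."""
    dist = sum(haversine_km(*points[i][:2], *points[i + 1][:2])
               for i in range(len(points) - 1))
    hours = (points[-1][2] - points[0][2]) / 3600
    speed = dist / hours if hours > 0 else 0.0
    if speed < 7:          # illustrative walking threshold
        return "walk"
    if speed < 25:         # illustrative cycling threshold
        return "bike"
    return "motorized"

walk = [(59.0, 18.0, 0), (59.0005, 18.0, 600)]  # ~55 m in 10 minutes
assert classify_segment(walk) == "walk"
```

Rule-based classifiers like this need no labeled data at all, which is exactly the trade-off the review weighs against machine learning methods that are more accurate but label-hungry.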
5

Duplicate detection of multimodal and domain-specific trouble reports when having few samples : An evaluation of models using natural language processing, machine learning, and Siamese networks pre-trained on automatically labeled data / Dublettdetektering av multimodala och domänspecifika buggrapporter med få träningsexempel : En utvärdering av modeller med naturlig språkbehandling, maskininlärning, och siamesiska nätverk förtränade på automatiskt märkt data

Karlstrand, Viktor January 2022 (has links)
Trouble and bug reports are essential in software maintenance and for identifying faults, a challenging and time-consuming task. When a fault and its report are similar or identical to previous, already resolved ones, the effort can be reduced significantly, making the prospect of automatically detecting duplicates very compelling. In this work, common methods and techniques from the literature are evaluated and compared on domain-specific and multimodal trouble reports from Ericsson software. Few samples are available, a case not well studied in the area. On this basis, both traditional and more recent techniques based on deep learning are considered, with the goal of accurately detecting duplicates. First, the more traditional approach based on natural language processing and machine learning is evaluated using different vectorization techniques and similarity measures adapted and customized to the domain-specific trouble reports. The multimodality and many fields of the trouble reports call for a wide range of techniques, including term frequency-inverse document frequency, BM25, and latent semantic analysis. A pipeline is proposed that processes each data field of the trouble reports independently and automatically weighs the importance of each field. The best-performing model achieves a recall rate of 89% for a duplicate candidate list of size 10. Further, knowledge about which types of data matter most for duplicate detection is obtained through Shapley values. Results indicate that utilizing all types of data indeed improves performance, and that date and code parameters are strong indicators. Second, a Siamese network based on Transformer encoders is evaluated on data fields believed to have some underlying representation of semantic meaning or sequentially important information, which a deep model can capture.
To alleviate the issues of having few samples, pre-training through automatic data labeling is studied. Results show an increase in performance compared to not pre-training the Siamese network. Compared to the more traditional model, however, it performs only on par, indicating that traditional models may perform equally well when few samples are available, while also being simpler, more robust, and faster.
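The traditional approach above can be sketched as TF-IDF vectors plus cosine similarity over report text, yielding a ranked duplicate candidate list. The thesis's pipeline additionally handles multiple data fields and weighs them automatically; this single-field, pure-Python version is a simplification with hypothetical example reports.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Very small TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def duplicate_candidates(query_idx, docs, k=2):
    """Rank the other reports by similarity to the query report."""
    vecs = tfidf_vectors(docs)
    scores = [(cosine(vecs[query_idx], v), i)
              for i, v in enumerate(vecs) if i != query_idx]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

reports = [
    "radio link failure after software upgrade",
    "link failure on radio unit after upgrade",
    "user interface font rendering glitch",
]
# The first two reports share most terms, so each ranks the other first.
assert duplicate_candidates(0, reports)[0] == 1
```

The candidate-list framing matches the evaluation above: recall is measured over the top-k list (k = 10 in the thesis) rather than a single prediction.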
6

Zero/Few-Shot Text Classification : A Study of Practical Aspects and Applications / Textklassificering med Zero/Few-Shot Learning : En Studie om Praktiska Aspekter och Applikationer

Åslund, Jacob January 2021 (has links)
SOTA language models have demonstrated remarkable capabilities in tackling NLP tasks they have not been explicitly trained on, given a few demonstrations of the task (few-shot learning) or even none at all (zero-shot learning). The purpose of this Master's thesis has been to investigate practical aspects and potential applications of zero/few-shot learning in the context of text classification. This includes topics such as combined usage with active learning, automated data labeling, and interpretability. Two different methods for zero/few-shot learning have been investigated, and the results indicate that:  • Active learning can be used to marginally improve few-shot performance, but it seems to be mostly beneficial in settings with very few samples (e.g. fewer than 10). • Zero-shot learning can be used to produce reasonable candidate labels for classes in a dataset, given knowledge of the classification task at hand.  • It is difficult to trust the predictions of zero-shot text classification without access to a validation dataset, but IML methods such as saliency maps could find use in debugging zero-shot models.
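The core idea of zero-shot classification, scoring a text against label descriptions it was never trained on, can be illustrated without a language model by comparing bag-of-words vectors of the text and each description. This is a deliberately simplified stand-in for the language-model methods the thesis studies; the labels and descriptions are hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def zero_shot_classify(text, label_descriptions):
    """Pick the label whose description is most similar to the text."""
    doc = bow(text)
    return max(label_descriptions,
               key=lambda label: cosine(doc, bow(label_descriptions[label])))

labels = {
    "sports": "football match team player score game",
    "finance": "stock market price bank investment earnings",
}
assert zero_shot_classify("the team won the football match", labels) == "sports"
```

Language-model zero-shot classifiers replace the bag-of-words similarity with learned semantic scoring, which is what makes their predictions both stronger and, as the thesis notes, harder to trust without validation data.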
