51

Etude du passage à l'échelle des algorithmes de segmentation et de classification en télédétection pour le traitement de volumes massifs de données / Study of the scalability of segmentation and classification algorithms to process massive datasets for remote sensing applications

Lassalle, Pierre 06 November 2015 (has links)
Recent Earth observation missions will provide optical images at very high spatial, spectral, and temporal resolution, generating massive volumes of data. The objective of this thesis is to provide new solutions for efficiently processing large volumes of data that cannot be held in memory, by developing efficient algorithms that guarantee results identical to those obtained when memory is not a constraint. The first part of the thesis focuses on adapting segmentation methods to very large images. A naive solution consists of dividing the input image into tiles, segmenting each tile independently, and building the final result by stitching the segmented tiles together. This strategy is suboptimal because it alters the resulting segments compared to those obtained by segmenting the image without tiling. A study of region-merging segmentation methods led to a tile-based, scalable solution that segments images of arbitrary size while guaranteeing a result identical to that of the original method without the memory constraint. The feasibility of the solution was demonstrated by segmenting several very high resolution Pléiades scenes, each requiring several gigabytes of memory. The second part of the thesis addresses supervised learning when the training data cannot be held in memory. We focus on the Random Forest algorithm, which builds an ensemble of decision trees. Several solutions have been proposed in the literature to adapt this algorithm to massive training datasets, but they remain either approximate, because the memory constraint restricts the algorithm's view to a portion of the training data at a time, or inefficient, because they require many read and write accesses to the hard disk. To overcome these problems, we propose an exact and efficient solution that gives the algorithm visibility over the entire training dataset while minimizing disk access. The running time is analysed for training datasets of varying size, showing that the proposed solution is competitive with existing approaches and can process hundreds of gigabytes of data.
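The naive tiling baseline is easy to state in code, which makes the boundary artifact concrete. Below is a minimal sketch, assuming scikit-image's felzenszwalb as a stand-in region-based segmenter; the thesis targets region-merging methods, and its exact algorithm is not reproduced here:

```python
# Naive tile-and-stitch segmentation: the suboptimal baseline the thesis improves on.
import numpy as np
from skimage.segmentation import felzenszwalb

def naive_tiled_segmentation(image, tile_size=512):
    """Segment each tile independently, then offset labels so tiles don't clash."""
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    next_label = 0
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            tile_labels = felzenszwalb(tile, scale=100)  # per-tile segmentation
            labels[y:y + tile_size, x:x + tile_size] = tile_labels + next_label
            next_label = labels.max() + 1
    return labels
```

Any segment that straddles a tile border receives two different labels, which is exactly the deviation from the untiled result that the thesis's exact tile-based algorithm eliminates.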
52

Hierarchical Additive Spatial and Spatio-Temporal Process Models for Massive Datasets

Ma, Pulong 29 October 2018 (has links)
No description available.
53

Event and Intrusion Detection Systems for Cyber-Physical Power Systems

Adhikari, Uttam 14 August 2015 (has links)
High speed data from Wide Area Measurement Systems (WAMS) with Phasor Measurement Units (PMU) enables real-time and non-real-time monitoring and control of power systems. The information and communication infrastructure used in WAMS efficiently transports information but introduces cyber security vulnerabilities. Adversaries may exploit such vulnerabilities to create cyber-attacks against the electric power grid. Control centers need to be updated to be resilient not only to well-known power system contingencies but also to cyber-attacks. Therefore, a combined event and intrusion detection system (EIDS) is required that can provide precise classification for optimal response. This dissertation describes a WAMS cyber-physical power system test bed that was developed to generate datasets and to support research on cyber-physical system vulnerabilities, cyber-attack impact studies, and machine learning algorithms for EIDS. The test bed integrates WAMS components with a Real Time Digital Simulator (RTDS) with hardware in the loop (HIL) and includes power systems of various sizes with a wide variety of implemented power system and cyber-attack scenarios. This work developed a novel data processing and compression method to address the WAMS big data problem. The State Tracking and Extraction Method (STEM) tracks system states from measurements and creates a compressed sequence of states for each observed scenario. Experiments showed that STEM reduces data size significantly without losing the key event information needed to train EIDS and classify events. Two EIDS are proposed and evaluated in this dissertation. Non-Nested Generalized Exemplars (NNGE) is a rule-based classifier that creates rules in the form of hyperrectangles to classify events. NNGE uses rule generalization to create a model with high accuracy and fast classification time. Hoeffding Adaptive Trees (HAT) is a decision-tree classifier that uses incremental learning, which is suitable for data-stream mining. HAT creates decision trees on the fly from a limited number of instances, uses little memory, has fast evaluation time, and adapts to concept changes. The experiments showed that NNGE and HAT combined with STEM make effective EIDS with high classification accuracy, low false positives, low memory usage, and fast classification times.
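The abstract does not spell out STEM's rules, but the core idea of tracking states and emitting a compressed sequence can be illustrated. A hedged sketch, assuming states are discretized measurement vectors and compression is run-length style; bin_width and the input layout are illustrative assumptions:

```python
# STEM-style state compression (illustrative only): map raw PMU measurement
# vectors to discrete states, then keep only the points where the state changes.
import numpy as np

def compress_to_state_sequence(measurements, bin_width=0.01):
    """Return a list of (timestep, state) pairs, one per state change."""
    states = [tuple(np.round(row / bin_width).astype(int)) for row in measurements]
    sequence = []
    for t, state in enumerate(states):
        if not sequence or sequence[-1][1] != state:
            sequence.append((t, state))  # record the transition, drop repeats
    return sequence

# e.g. thousands of nearly constant PMU samples collapse to a handful of states
samples = np.array([[1.00, 0.50]] * 1000 + [[0.80, 0.90]] * 1000)
print(len(compress_to_state_sequence(samples)))  # -> 2
```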
54

Enhancing Telecom Churn Prediction: Adaboost with Oversampling and Recursive Feature Elimination Approach

Tran, Long Dinh 01 June 2023 (has links) (PDF)
Churn prediction is a critical task for businesses to retain their valuable customers. This paper presents a comprehensive study of churn prediction in the telecom sector using 15 approaches, including popular algorithms such as Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, and AdaBoost. The study is segmented into three sets of experiments, each focusing on a different approach to building the churn prediction model. In the first set of experiments, the model is constructed using the original training set. The second set involves oversampling the training set to address the issue of imbalanced data. The third set combines oversampling with recursive feature elimination to further enhance the model's performance. The results demonstrate that the Adaptive Boost classifier, implemented with oversampling and recursive feature elimination, outperforms the other 14 techniques. It achieves the highest rank in all three evaluation metrics: recall (0.841), f1-score (0.655), and ROC AUC (0.793), indicating that the proposed approach effectively predicts churn and provides valuable insights into customer behavior.
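The winning configuration maps naturally onto a standard scikit-learn/imbalanced-learn pipeline. A minimal sketch, assuming RandomOverSampler (the abstract does not name the oversampling method), a logistic-regression ranker inside RFE, and 10 selected features; all three are stand-ins, not the paper's confirmed choices:

```python
# Oversampling + recursive feature elimination + AdaBoost, as one pipeline.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

churn_model = Pipeline([
    ("oversample", RandomOverSampler(random_state=42)),  # balance churn / non-churn
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    ("adaboost", AdaBoostClassifier(n_estimators=200, random_state=42)),
])
# churn_model.fit(X_train, y_train), then score with recall, f1, and ROC AUC
```

Using the imblearn Pipeline (rather than sklearn's) matters here: it applies the oversampler only during fit, so evaluation metrics are computed on the untouched test distribution.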
55

A System for Managing Experiments in Data Mining

Myneni, Greeshma 19 August 2010 (has links)
No description available.
56

Mining Formal Concepts in Large Binary Datasets using Apache Spark

Rayabarapu, Varun Raj 29 September 2021 (has links)
No description available.
57

Phosphoproteomic strategies for protein functional characterization of phosphatases and kinases

Andrew G. DeMarco (17103610) 06 April 2024 (has links)
Protein phosphorylation is a ubiquitous post-translational modification controlled by the opposing activities of protein kinases and phosphatases, which regulate diverse biological processes in all kingdoms of life. One of the key challenges to a complete understanding of phosphoregulatory networks is the unambiguous identification of kinase and phosphatase substrates. Liquid chromatography-coupled mass spectrometry (LC-MS/MS) and associated phosphoproteomic tools enable global surveys of phosphoproteome changes in response to signaling events or perturbation of phosphoregulatory network components. Despite the power of LC-MS/MS, it is still challenging to directly link kinases and phosphatases to specific substrate phosphorylation sites in many experiments. Here we describe two methods for the LC-MS/MS-based characterization of protein phosphatases and kinases. The first is an in-vitro method designed to probe the inherent substrate specificity of kinases or phosphatases. It couples an enzyme reaction on synthetic peptides, which serve as substrate proxies, with LC-MS/MS for rapid, accurate, high-throughput quantification of the specificity constant (kcat/KM) for each substrate in the reaction and of the amino acid preference in the enzyme active site, providing insight into the enzymes' cellular roles. The second couples the auxin-inducible degradation (AID) system with phosphoproteomics for protein functional characterization. AID is a surrogate for specific chemical inhibition that minimizes the non-specific effects associated with long-term target perturbation. Using this system, we demonstrate that PP2A in complex with its B-subunit Rox Three Suppressor 1 (PP2A-Rts1) contributes to the phosphoregulation of a conserved fungal-specific membrane protein complex called the eisosome. By maintaining eisosomes in their hypophosphorylated state, PP2A-Rts1 aids fungal cells in preserving metabolic homeostasis. This work demonstrates the power of mass spectrometry as a critical tool for protein functional characterization.
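The specificity-constant quantification follows from standard enzyme kinetics. A hedged sketch, assuming the low-substrate regime where depletion of each peptide is first-order, so [S]_t = [S]_0 * exp(-(kcat/KM) * [E] * t) and kcat/KM can be read from the fraction of each peptide remaining; the concentrations and timepoint below are hypothetical:

```python
# Estimate kcat/KM per peptide from LC-MS/MS intensities at two timepoints,
# under the first-order (low [S] relative to KM) approximation.
import numpy as np

def specificity_constant(intensity_0, intensity_t, enzyme_conc, t):
    """Return kcat/KM in M^-1 s^-1 from remaining-substrate intensities
    at time 0 and t (s), at enzyme concentration enzyme_conc (M)."""
    fraction_remaining = intensity_t / intensity_0
    return -np.log(fraction_remaining) / (enzyme_conc * t)

# hypothetical peptide: 40% of substrate remains after 10 min with 50 nM enzyme
print(specificity_constant(1.0, 0.4, 50e-9, 600))  # ~3.1e4 M^-1 s^-1
```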
58

Automatic Semantic Segmentation of Indoor Datasets

Rachakonda, Sai Swaroop January 2024 (has links)
Background: In recent years, computer vision has undergone significant advancements, revolutionizing fields such as robotics, augmented reality, and autonomous systems. Key to this transformation is Simultaneous Localization and Mapping (SLAM), a fundamental technology that allows machines to navigate and interact intelligently with their surroundings. Challenges persist in harmonizing spatial and semantic understanding, as conventional methods often treat these tasks separately, limiting comprehensive evaluations with shared datasets. As applications continue to evolve, the demand for accurate and efficient image segmentation ground truth becomes paramount. Manual annotation, the traditional approach, proves to be both costly and resource-intensive, hindering the scalability of computer vision systems. This thesis addresses the urgent need for a cost-effective and scalable solution by focusing on the creation of accurate and efficient image segmentation ground truth, bridging the gap between spatial and semantic tasks.

Objective: This thesis addresses the challenge of creating efficient image segmentation ground truth to complement datasets with spatial ground truth. The primary objective is to reduce the time and effort required to annotate datasets.

Method: Our methodology adopts a systematic approach to evaluating and combining existing annotation techniques, focusing on precise object detection and robust segmentation. By merging these approaches, we aim to enhance annotation accuracy while streamlining the annotation process. The approach is systematically applied and evaluated across multiple datasets: NYU V2 (over 1449 images), ARID (a real-world sequential dataset), and Italian flats (a sequential dataset created in Blender).

Results: The developed pipeline demonstrates promising outcomes, showing a substantial reduction in annotation time compared to manual annotation, thereby addressing the cost and resource intensiveness of the traditional approach. Although not initially optimized for SLAM datasets, the pipeline performs exceptionally well on both the ARID and Italian flats datasets, highlighting its adaptability to real-world scenarios.

Conclusion: This research introduces an innovative annotation pipeline offering a systematic and efficient approach to annotation. It aims to bridge the gap between spatial and semantic tasks, addressing the pressing need for comprehensive annotation tools in this domain.
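The merge step of such a pipeline is simple to outline. A hedged sketch in which detector() and segmenter() are hypothetical stand-ins for whatever models the pipeline actually uses, since the abstract does not name them:

```python
# Hypothetical detect-then-segment annotation step: the detector proposes
# labeled boxes, the segmenter produces a mask inside each accepted box.
def annotate_image(image, detector, segmenter, min_score=0.5):
    """Return {class_label: [mask, ...]} for one image."""
    annotation = {}
    for box, class_label, score in detector(image):
        if score < min_score:                     # skip low-confidence detections
            continue
        mask = segmenter(image, prompt_box=box)   # binary mask within the box
        annotation.setdefault(class_label, []).append(mask)
    return annotation
```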
59

Nonstationary Nearest Neighbors Gaussian Process Models

Hanandeh, Ahmad Ali 05 December 2017 (has links)
No description available.
60

Efficient network based approaches for pattern recognition and knowledge discovery from large and heterogeneous datasets

Zhu, Cheng 25 October 2013 (has links)
No description available.