151
Building the Dresden Web Table Corpus: A Classification Approach
Lehner, Wolfgang, Eberius, Julian, Braunschweig, Katrin, Hentsch, Markus, Thiele, Maik, Ahmadov, Ahmad
12 January 2023 (has links)
In recent years, researchers have recognized relational tables on the Web as an important source of information. To support this research, we developed the Dresden Web Table Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC), which contains 3.6 billion web pages and is 266 TB in size. As the vast majority of HTML tables are used for layout purposes and only a small share contains genuine tables with different surface forms, accurate table detection is essential for building a large-scale Web table corpus. Furthermore, correctly recognizing the table structure (e.g. horizontal listings, matrices) is important in order to understand the role of each table cell and to distinguish between label and data cells. In this paper, we present an extensive table layout classification that enables us to identify the main layout categories of Web tables with very high precision. To this end, we identify and develop a broad set of table features, apply different feature selection techniques, and test several classification algorithms. We evaluate the effectiveness of the selected features and compare the performance of various state-of-the-art classification algorithms. Finally, the winning approach is employed to classify millions of tables, resulting in the Dresden Web Table Corpus (DWTC).
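The corpus-building pipeline itself is not reproduced in this abstract; as a minimal illustrative sketch of feature-based table layout classification in the same spirit (all feature names, toy data, and the choice of classifier are assumptions, not the authors' setup), one could write:

```python
# Illustrative sketch (not the authors' code): feature-based classification of
# Web table layout types, in the spirit of the DWTC pipeline described above.
# Feature names, toy data, and the chosen classifier are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline

# Hypothetical per-table features extracted from the HTML of each table
tables = pd.DataFrame({
    "ratio_empty_cells":   [0.05, 0.60, 0.10, 0.75, 0.08, 0.55, 0.12, 0.70],
    "ratio_numeric_cells": [0.70, 0.02, 0.55, 0.01, 0.65, 0.05, 0.40, 0.00],
    "max_colspan":         [1, 4, 1, 6, 1, 5, 2, 4],
    "n_rows":              [25, 3, 40, 2, 18, 4, 33, 2],
    "has_th_tags":         [1, 0, 1, 0, 1, 0, 1, 0],
})
# Layout classes: genuine relational tables (listings, matrices) vs. layout-only tables
labels = ["listing", "layout", "matrix", "layout",
          "listing", "layout", "matrix", "layout"]

clf = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=3)),      # feature selection step
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])
clf.fit(tables, labels)

new_table = pd.DataFrame([{"ratio_empty_cells": 0.07, "ratio_numeric_cells": 0.6,
                           "max_colspan": 1, "n_rows": 30, "has_th_tags": 1}])
print(clf.predict(new_table))   # e.g. ['listing']
```

In practice the features would be extracted from the parsed HTML of each crawled table before any classifier of this kind is trained and compared.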
152
Towards a Hybrid Imputation Approach Using Web Tables
Lehner, Wolfgang, Ahmadov, Ahmad, Thiele, Maik, Eberius, Julian, Wrembel, Robert
12 January 2023 (has links)
Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further reinforced. While the process of filling in missing values is traditionally addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on these, chooses the best imputation approach: either a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus of 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system that is robust, accurate, and at the same time has high coverage.
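As a toy, hypothetical sketch of the hybrid strategy (the actual Web table retrieval and matching system is far more involved; the lookup table, column names, and regression model below are invented for illustration):

```python
# Illustrative sketch (assumptions throughout): hybrid imputation that fills a
# missing value from an external lookup table when a key matches, and falls
# back to a regression-based estimate otherwise.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Incomplete dataset: population values are missing for some cities
df = pd.DataFrame({
    "city": ["Dresden", "Leipzig", "Chemnitz", "Zwickau"],
    "area_km2": [328.8, 297.8, 221.0, 102.5],
    "population": [556_000, 601_000, np.nan, np.nan],
})

# Hypothetical external source, e.g. values matched from a Web table corpus
web_lookup = {"Chemnitz": 246_000}

known = df.dropna(subset=["population"])
model = LinearRegression().fit(known[["area_km2"]].values, known["population"])

def impute(row, model):
    if not np.isnan(row["population"]):
        return row["population"]
    if row["city"] in web_lookup:                    # Web-based lookup
        return web_lookup[row["city"]]
    return model.predict([[row["area_km2"]]])[0]     # statistical fallback

df["population"] = df.apply(impute, axis=1, model=model)
print(df)
```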
153
Development of a Machine Learning Algorithm to Identify Error Causes of Automated Failed Test Results
Pallathadka Shivarama, Anupama
15 March 2024 (has links)
The automotive industry continuously innovates and adopts new technologies. Along with that, companies work towards maintaining the quality of a hardware product and meeting customer demands. Before delivering the product to the customer, it is essential to test it and approve it for safe use. The same applies to software, and adopting modern technologies can further improve the efficiency of software testing.

This thesis aims to build a machine learning algorithm for use during software testing. The evaluation of the test reports generated after testing is generally time-consuming, and the algorithm should reduce the time spent and the manual effort during this evaluation. The machine learning algorithms analyze and learn from the data available in previous test reports and, based on the learned data patterns, suggest possible root causes for failed test cases in the future. The thesis includes a literature survey that helped in understanding how machine learning concepts are applied to similar problems in different industries. The tasks involved in building the model are data loading, data pre-processing, selecting the best conditions for each algorithm, and comparing their performance. The thesis also suggests possible future work towards improving the performance of the models. The entire work is implemented in a Jupyter notebook using the pandas and scikit-learn libraries.
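The thesis mentions pandas and scikit-learn but its notebooks are not reproduced here; a hypothetical minimal sketch of the kind of pipeline described, learning from previously evaluated failed test cases and suggesting a root cause for new ones (column names, example messages, labels, and the classifier are all assumptions), might look like this:

```python
# Hypothetical sketch of a root-cause suggestion pipeline for failed test cases.
# Column names, example data, and the classifier choice are illustrative only.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Past test reports: free-text error message of each failed case plus the
# root cause assigned during manual evaluation
history = pd.DataFrame({
    "error_message": [
        "timeout waiting for ECU response on CAN bus",
        "signal value out of tolerance after calibration step",
        "timeout waiting for diagnostic session",
        "reference file missing for expected signal trace",
    ],
    "root_cause": ["communication", "calibration", "communication", "test_setup"],
})

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(history["error_message"], history["root_cause"])

new_failures = ["timeout on CAN bus during flashing"]
print(model.predict(new_failures))   # suggested root cause, e.g. ['communication']
```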
154
Monocular Depth Estimation with Edge-Based Constraints using Active Learning Optimization
Saleh, Shadi
04 April 2024 (has links)
Depth sensing is pivotal in robotics; however, monocular depth estimation encounters significant challenges. The reliance of existing algorithms on large-scale labeled data and large Deep Convolutional Neural Networks (DCNNs) hinders real-world applications. We propose two lightweight architectures that achieve commendable accuracy rates of 91.2% and 90.1% while reducing the Root Mean Square Error (RMSE) of the depth estimates to 4.815 and 5.036, respectively. Our lightweight depth model operates at 29-44 FPS on the Jetson Nano GPU, showcasing efficient performance with minimal power consumption.
Moreover, we introduce a mask network designed to visualize and analyze the compact depth network, aiding in discerning informative samples for the active learning approach. This contributes to increased model accuracy and enhanced generalization capabilities.
Furthermore, our methodology introduces an active learning framework strategically designed to enhance model performance and accuracy by efficiently utilizing limited labeled training data. This novel framework outperforms previous studies by achieving commendable results with only 18.3% of the KITTI Odometry dataset. This performance reflects a careful balance between computational efficiency and accuracy, tailored for low-cost devices while reducing training data requirements.

1. Introduction
2. Literature Review
3. AI Technologies for Edge Computing
4. Monocular Depth Estimation Methodology
5. Implementation
6. Result and Evaluation
7. Conclusion and Future Scope
Appendix
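The exact selection criterion and training procedure of the active learning framework are not given in the abstract; purely as a generic sketch of an uncertainty-driven active learning loop (the scoring function, seed set size, and labeling budget are placeholders, not the thesis's values):

```python
# Generic active-learning loop sketch (not the thesis implementation).
# The scoring function, budget, and training routine are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def train_depth_model(labeled_indices):
    """Placeholder: train the lightweight depth network on the labeled subset."""
    ...

def informativeness(model, index):
    """Placeholder: score a sample, e.g. via a mask/uncertainty signal."""
    return rng.random()

n_samples = 10_000                                   # size of the unlabeled pool
labeled = set(rng.choice(n_samples, 200, replace=False).tolist())   # seed set
budget_per_round, rounds = 200, 8

for _ in range(rounds):
    model = train_depth_model(sorted(labeled))
    pool = [i for i in range(n_samples) if i not in labeled]
    scores = {i: informativeness(model, i) for i in pool}
    # request labels for the samples the current model finds most informative
    most_informative = sorted(pool, key=scores.get, reverse=True)[:budget_per_round]
    labeled.update(most_informative)

print(f"labeled {len(labeled)} of {n_samples} samples "
      f"({100 * len(labeled) / n_samples:.1f}%)")
```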
155
Prediction of designer-recombinases for DNA editing with generative deep learning
Schmitt, Lukas Theo, Paszkowski-Rogacz, Maciej, Jug, Florian, Buchholz, Frank
04 June 2024 (has links)
Site-specific tyrosine-type recombinases are effective tools for genome engineering, with the first engineered variants having demonstrated therapeutic potential. So far, adaptation of designer-recombinases to new DNA target site selectivity has been achieved mostly through iterative cycles of directed molecular evolution. While effective, directed molecular evolution methods are laborious and time-consuming. Here we present RecGen (Recombinase Generator), an algorithm for the intelligent generation of designer-recombinases. We gather the sequence information of over one million Cre-like recombinase sequences evolved for 89 different target sites, with which we train Conditional Variational Autoencoders for recombinase generation. Experimental validation demonstrates that the algorithm can predict recombinase sequences with activity on novel target sites, indicating that RecGen is useful for accelerating the development of future designer-recombinases.
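RecGen's architecture and training details are not reproduced in this abstract; purely as a rough, hypothetical illustration of a conditional variational autoencoder that generates sequences conditioned on a target site (sequence and site dimensions, layer sizes, and the loss weighting below are invented), a sketch could look like this:

```python
# Hypothetical CVAE sketch for conditional sequence generation (not RecGen itself).
# Sequence length, alphabet, latent size, and architecture are invented for illustration.
import torch
import torch.nn as nn

SEQ_LEN, ALPHABET = 343, 21        # one-hot protein sequence (assumed sizes)
SITE_LEN, BASES = 34, 4            # one-hot target-site DNA (assumed sizes)
LATENT = 32

class CVAE(nn.Module):
    def __init__(self):
        super().__init__()
        x_dim, c_dim = SEQ_LEN * ALPHABET, SITE_LEN * BASES
        self.encoder = nn.Sequential(nn.Linear(x_dim + c_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, LATENT)
        self.logvar = nn.Linear(512, LATENT)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT + c_dim, 512), nn.ReLU(), nn.Linear(512, x_dim))

    def forward(self, x, c):
        h = self.encoder(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        logits = self.decoder(torch.cat([z, c], dim=-1))
        return logits.view(-1, SEQ_LEN, ALPHABET), mu, logvar

def loss_fn(logits, x, mu, logvar):
    recon = nn.functional.cross_entropy(              # per-position reconstruction
        logits.transpose(1, 2), x.view(-1, SEQ_LEN, ALPHABET).argmax(-1))
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld

# Generation: sample a latent vector and decode it conditioned on a new target site
model = CVAE()
c_new = torch.zeros(1, SITE_LEN * BASES)              # placeholder one-hot target site
z = torch.randn(1, LATENT)
with torch.no_grad():
    logits = model.decoder(torch.cat([z, c_new], dim=-1)).view(1, SEQ_LEN, ALPHABET)
    sequence = logits.argmax(-1)                      # amino-acid index per position
```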
156
Segmentation and Tracking of Cells and Nuclei Using Deep Learning
Hirsch, Peter Johannes
27 September 2023 (has links)
Image analysis of large datasets of microscopy data, in particular segmentation and tracking, is an important aspect of many biological studies. Yet, the current state of research is not yet adequate for extensive and reliable everyday use. Existing methods are often hard to use, perform poorly on new datasets, and require vast amounts of training data. I approached this problem from multiple angles: (i) I present clear guidelines on how to operate artifact-free on huge images. (ii) I present an extension of existing methods for instance segmentation of nuclei. By using an auxiliary task, it enables state-of-the-art performance in a simple and straightforward way. In the process I show that weak labels are sufficient for efficient object detection on 3D nuclei data. (iii) I present an innovative method for instance segmentation that performs extremely well on a wide range of objects, from simple shapes to complex image-spanning tree structures and objects with overlaps. (iv) Building upon the above, I present a novel tracking method that operates on huge images but only requires weak and sparse labels, yet outperforms previous state-of-the-art methods. An automated weight search method enables adaptability to new datasets. (v) For practitioners seeking to employ cell tracking, I provide a comprehensive guideline on how to make an informed decision about which methods to use for their project.
157
Aggregate-based Training Phase for ML-based Cardinality Estimation
Woltmann, Lucas, Hartmann, Claudio, Lehner, Wolfgang, Habich, Dirk
22 April 2024 (has links)
Cardinality estimation is a fundamental task in database query processing and optimization. As shown in recent papers, machine learning (ML)-based approaches may deliver more accurate cardinality estimations than traditional approaches. However, a lot of training queries have to be executed during the model training phase to learn a data-dependent ML model, which makes this phase very time-consuming. Many of these training or example queries use the same base data, have the same query structure, and only differ in their selection predicates. To speed up the model training phase, our core idea is to determine a predicate-independent pre-aggregation of the base data and to execute the example queries over this pre-aggregated data. Based on this idea, we present a specific aggregate-based training phase for ML-based cardinality estimation approaches in this paper. As we show with different workloads in our evaluation, we achieve an average speedup of 90 with our aggregate-based training phase and thus outperform indexes.
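The paper's implementation is not shown here; as a toy sketch of the core idea (table, columns, and example predicates are made up), the cardinality labels of many training queries that differ only in their predicate values can be answered from a single predicate-column aggregate instead of repeatedly scanning the base table:

```python
# Toy sketch of the pre-aggregation idea (table, columns, and queries are made up).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = pd.DataFrame({                       # stand-in for a large base table
    "a": rng.integers(0, 50, size=1_000_000),
    "b": rng.integers(0, 20, size=1_000_000),
})

# Predicate-independent pre-aggregation over the predicate columns (computed once)
agg = base.groupby(["a", "b"]).size().rename("cnt").reset_index()

def cardinality_from_aggregate(a_max, b_val):
    """Cardinality of: SELECT count(*) FROM base WHERE a <= a_max AND b = b_val,
    answered from the much smaller aggregate instead of the base table."""
    hit = agg[(agg["a"] <= a_max) & (agg["b"] == b_val)]
    return int(hit["cnt"].sum())

# Example training queries that only differ in their predicate values
training_queries = [(10, 3), (25, 7), (40, 19)]
labels = [cardinality_from_aggregate(a, b) for a, b in training_queries]
print(labels)   # true cardinalities used as training labels for the ML model
```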
158
Advancing Electron Ptychography for High-Resolution Imaging in Electron Microscopy
Schloz, Marcel
13 May 2024 (has links)
This thesis presents advancements in electron ptychography, enhancing its versatility as an electron phase-contrast microscopy technique. Rather than relying on high-resolution electron optics, ptychography reconstructs specimens from their coherent diffraction signals using computational algorithms. This approach allows us to surpass the limitations of conventional optics-based electron microscopy, achieving an unprecedented sub-Angstrom resolution in the resulting images. The thesis first introduces the theoretical, experimental, and algorithmic principles of electron ptychography, contextualizing them within the landscape of existing scanning-based electron microscopy techniques. It then develops an alternative ptychographic phase retrieval algorithm, analyzing its performance as well as the quality and spatial resolution of its reconstructions. Moreover, the thesis delves into the integration of machine learning methods into electron ptychography, proposing a specific approach to enhance reconstruction quality under suboptimal experimental conditions. Furthermore, it highlights the combination of ptychography with defocus series measurements, offering improved depth resolution in ptychographic reconstructions and thereby bringing us closer to the ultimate goal of quantitative reconstructions of arbitrarily thick specimens at atomic resolution in three dimensions. The final part of the thesis introduces a paradigm shift in the scanning requirements for ptychography and showcases applications of this novel approach under low-dose conditions.
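The thesis's own phase retrieval algorithm is not reproduced in this abstract; purely as a generic illustration of iterative ptychographic reconstruction, an ePIE-style object update with invented array sizes, scan positions, and synthetic "measurements" looks roughly like this:

```python
# Generic ePIE-style ptychographic update (illustration only, not the thesis algorithm).
# Array sizes, scan positions, measurements, and step sizes are placeholders.
import numpy as np

rng = np.random.default_rng(1)
N, M = 256, 64                                # object and probe array sizes (assumed)
obj = np.ones((N, N), dtype=complex)          # initial object guess
probe = np.ones((M, M), dtype=complex)        # known (or co-refined) probe
positions = [(y, x) for y in range(0, N - M, 32) for x in range(0, N - M, 32)]
measured = {p: rng.random((M, M)) for p in positions}   # fake diffraction amplitudes

alpha = 0.9                                   # object update step size
for _ in range(10):                           # a few sweeps over all scan positions
    for (y, x) in positions:
        patch = obj[y:y + M, x:x + M]
        exit_wave = probe * patch
        psi = np.fft.fft2(exit_wave)
        # enforce the measured diffraction amplitude, keep the current phase
        psi = measured[(y, x)] * np.exp(1j * np.angle(psi))
        revised = np.fft.ifft2(psi)
        # ePIE object update from the difference of exit waves
        obj[y:y + M, x:x + M] = patch + alpha * np.conj(probe) / (
            np.abs(probe).max() ** 2) * (revised - exit_wave)
```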
159
Diabatization via Gaussian Process Regression
Rabe, Stefan Benjamin
07 August 2024 (has links)
Modern supervised machine learning (ML) techniques have taken a prominent role in academia and industry due to their powerful predictive capabilities. While many large-scale ML models utilize deep artificial neural networks (ANNs), which have shown great success if large amounts of data are provided, ML methods employing Gaussian processes (GPs) outperform ANNs in cases with sparse training data due to their interpretability, resilience to overfitting, and provision of reliable uncertainty measures. GPs have already been successfully applied to pattern discovery and extrapolation. The latter can be done in a controlled manner due to their small number of interpretable hyperparameters.
In this work we develop an approach based on GPs to extract diabatic patterns from energy spectra that are adiabatic under variation of a parameter of the Hamiltonian. The emerging diabatic manifolds (or energy surfaces) exhibit crossings where the original (adiabatic) energy spectra avoid crossing.
In the context of highly excited, classically chaotic dynamics, we demonstrate that our GP regression approach can generate complete diabatic energy spectra for two exemplary systems: two coupled Morse oscillators and hydrogen in a magnetic field. For both, we train GPs on a few classical trajectories in order to inter- and extrapolate actions throughout the whole energy and parameter range and to identify all points where the semiclassical Einstein-Brillouin-Keller (EBK) quantization condition is fulfilled. While the direct EBK method is restricted to regular classical dynamics, the interpretability of the GPs allows for controlled extrapolation into regions where no regular trajectories exist due to irregular motion. Hence, semiclassical diabatic spectra can be continued into chaotic regions, where such manifolds are no longer well-defined.
Further, we investigate the origin of resonant motion in the coupled Morse oscillator system and its contribution to the semiclassical spectra, which provide energies along strongly repelled adiabatic surfaces. For the hydrogen atom in a magnetic field we show that a proper scaling of the coordinates by the magnetic field strength allows for the extraction of an infinite series of semiclassical energies from a single trajectory that fulfills the EBK condition. The implementation of boundary conditions into GPs, as well as scaling techniques for higher dimensions and their properties, are discussed.
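The thesis's actual systems and quantization procedure are not reproduced here; as a schematic, hypothetical illustration of the underlying fitting step, one can regress a classical action S(E) on a few sampled values with a GP and then search for energies satisfying the EBK condition S(E) = 2πħ(n + 1/2). The action model, the units (ħ = 1), and the kernel choice below are assumptions:

```python
# Schematic sketch only: GP regression of a classical action and EBK root search.
# The action model, units (hbar = 1), and kernel choice are illustrative assumptions.
import numpy as np
from scipy.optimize import brentq
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# A few "computed" trajectory actions S(E) (here a made-up smooth function)
E_train = np.linspace(0.1, 2.0, 8).reshape(-1, 1)
S_train = 5.0 * E_train.ravel() ** 1.5

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
gp.fit(E_train, S_train)

def ebk_energy(n, hbar=1.0):
    """Energy where the GP-interpolated action fulfills S(E) = 2*pi*hbar*(n + 1/2)."""
    target = 2.0 * np.pi * hbar * (n + 0.5)
    f = lambda E: gp.predict(np.array([[E]]))[0] - target
    return brentq(f, 0.1, 2.0)        # root search within the sampled energy range

for n in range(2):
    print(f"n = {n}: E = {ebk_energy(n):.3f}")
```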
160
Time Dynamic Topic Models
Jähnichen, Patrick
30 March 2016 (has links) (PDF)
Information extraction from large corpora can be a useful tool for many applications in industry and academia. For instance, political communication science has only recently begun to use the opportunities that come with the availability of massive amounts of information on the Internet and the computational tools that natural language processing provides. We give a linguistically motivated interpretation of topic modeling, a state-of-the-art algorithm for extracting latent semantic sets of words from large text corpora, and extend this interpretation to cover issues and issue-cycles as theoretical constructs from political communication science. We build on a dynamic topic model, a model whose semantic sets of words are allowed to evolve over time governed by a Brownian motion stochastic process, and apply a new form of analysis to its result. This analysis is based on the notion of volatility, known from econometrics as the rate of change of stocks or derivatives. We claim that the rate of change of sets of semantically related words can be interpreted as issue-cycles, with the word sets describing the underlying issues. Generalizing over the existing work, we introduce dynamic topic models driven by general Gaussian processes (Brownian motion is a special case of our model), a family of stochastic processes defined by the function that determines their covariance structure. Using this assumption, we apply a certain class of covariance functions to allow for an appropriate rate of change in the word sets while preserving the semantic relatedness among words. Applying our findings to a large newspaper dataset, the New York Times Annotated Corpus (all articles between 1987 and 2007), we are able to identify sub-topics in time, time-localized topics, and to find patterns in their behavior over time. However, we have to drop the assumption of semantic relatedness over all available time for any one topic. Time-localized topics are consistent in themselves but do not necessarily share semantic meaning with each other. They can, however, be interpreted to capture the notion of issues, and their behavior that of issue-cycles.
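The model itself is not specified in code in this abstract; as a small, hypothetical illustration of the modeling choice, one can compare how different GP covariance functions (a Brownian-motion kernel versus a smoother squared-exponential kernel) shape simulated trajectories of a single topic-word weight over time; the kernels and scales below are arbitrary:

```python
# Illustrative sketch: sampling topic-word weight trajectories from GP priors with
# different covariance functions (kernel choices and scales are assumptions).
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 20, 200)                      # time axis, e.g. years of a corpus

def brownian_cov(t, sigma=1.0):
    """Brownian motion covariance k(s, t) = sigma^2 * min(s, t)."""
    return sigma ** 2 * np.minimum.outer(t, t)

def rbf_cov(t, sigma=1.0, length=3.0):
    """Squared-exponential covariance, giving smoother topic evolution."""
    d = np.subtract.outer(t, t)
    return sigma ** 2 * np.exp(-0.5 * (d / length) ** 2)

def sample_trajectory(cov):
    # jitter on the diagonal keeps the covariance matrix positive definite
    return rng.multivariate_normal(np.zeros(len(t)), cov + 1e-8 * np.eye(len(t)))

rough = sample_trajectory(brownian_cov(t))       # Brownian motion: special case
smooth = sample_trajectory(rbf_cov(t))           # smoother kernel: slower issue-cycles
print(rough[:5], smooth[:5])
```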