Spreadsheets have found wide use in many different domains and settings. They provide a broad range of both basic and advanced functionalities, supporting data collection, transformation, analysis, and reporting. At the same time, spreadsheets maintain a friendly and intuitive interface, and they come at little to no cost. Well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, are free to use. Moreover, Microsoft Excel is widely available, with millions of users worldwide. Thus, spreadsheets are not only powerful tools but also have a very low entrance barrier. Therefore, they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents.
Of particular interest are spreadsheet data in tabular form, since they provide concise, factual, and, to a large extent, structured information. One natural progression is to transfer tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for existing or new business processes. It would be easier to ingest them into data warehouses and to integrate them with other sources. Nevertheless, besides databases, there are other means to work with spreadsheet data. New paradigms, like NoDB, advocate querying raw documents directly. Going one step further, spreadsheets, together with other raw documents, can be stored in a sophisticated centralized repository, i.e., a data lake. From there, they can serve various tasks and applications on demand. All in all, by making spreadsheet data easily accessible, we can prevent information silos, i.e., valuable knowledge remaining isolated and scattered across multiple spreadsheet documents.
Yet, there are considerable challenges to the automatic processing and understanding of these documents. After all, spreadsheets are designed primarily for human consumption, and as such, they favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information (e.g., personal preferences). Multiple tables, with different layout and structure, can be found on the same sheet. Most importantly, the structure of the tables is not known, i.e., it is not explicitly given by the spreadsheet documents. Altogether, spreadsheets are better described as partially structured, with a significant degree of implicit information.
In the literature, the automatic understanding of spreadsheet data has been only scarcely investigated, often under the assumption of a single, uniform table layout. However, due to the manifold possibilities for structuring tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results.
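To make this concrete, consider a minimal sketch in Python. The sheet below is a hypothetical, illustrative cell grid (not taken from any corpus): it holds a title line and two stacked tables. A naive extractor that assumes a uniform layout treats the first row as the header and everything below as data, which is wrong here; even a crude layout-analysis pass that first separates the sheet into regions already recovers both tables.

```python
# Hypothetical sheet: a title line, empty separator rows, and two stacked
# tables. Illustrative data only, not from the thesis corpus.
sheet = [
    ["Quarterly Report", "", ""],   # textual metadata, not a table
    ["", "", ""],                   # empty separator row
    ["Region", "Sales", "Costs"],   # header of table 1
    ["North", "120", "80"],
    ["South", "95", "60"],
    ["", "", ""],                   # empty separator row
    ["Product", "Units", ""],       # header of table 2
    ["Widget", "40", ""],
]

def naive_extract(rows):
    """Uniform-layout assumption: row 0 is the header, the rest is data."""
    return {"header": rows[0], "data": rows[1:]}

def split_on_empty_rows(rows):
    """A very crude layout-analysis step: group runs of non-empty rows
    into candidate regions and drop single-row regions (likely metadata)."""
    regions, current = [], []
    for row in rows:
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return [r for r in regions if len(r) > 1]

# The naive extractor mistakes the title line for a header:
print(naive_extract(sheet)["header"])   # ['Quarterly Report', '', '']

# The layout-aware pass finds two candidate tables:
tables = split_on_empty_rows(sheet)
print(len(tables))                      # 2
```

This sketch is, of course, far simpler than the classification- and graph-based techniques developed in the following chapters; it merely illustrates why layout analysis must precede table extraction.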
In this thesis, we primarily address two fundamental tasks that can lead to more accurate information extraction from spreadsheet documents. Namely, we propose intuitive and effective approaches for layout analysis and table detection in spreadsheets. Moreover, our overall solution is designed as a processing pipeline, where specialized steps build on top of each other to discover the tabular data. One of our main objectives is to eliminate most of the assumptions made in related work. Instead, we target highly diverse sheet layouts, with one or multiple tables. At the same time, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we make use of sophisticated machine learning and optimization techniques. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. Moreover, this intended flexibility makes our approaches transferable to new spreadsheet datasets. Thus, we are not bound to specific domains or settings.

1 INTRODUCTION
1.1 Motivation
1.2 Contributions
1.3 Outline
2 FOUNDATIONS AND RELATED WORK
2.1 The Evolution of Spreadsheet Documents
2.1.1 Spreadsheet User Interface and Functionalities
2.1.2 Spreadsheet File Formats
2.1.3 Spreadsheets Are Partially-Structured
2.2 Analysis and Recognition in Electronic Documents
2.2.1 A General Overview of DAR
2.2.2 DAR in Spreadsheets
2.3 Spreadsheet Research Areas
2.3.1 Layout Inference and Table Recognition
2.3.2 Unifying Databases and Spreadsheets
2.3.3 Spreadsheet Software Engineering
2.3.4 Data Wrangling Approaches
3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS
3.1 Available Corpora
3.2 Creating a Gold Standard Dataset
3.2.1 Initial Selection
3.2.2 Annotation Methodology
3.3 Dataset Analysis
3.3.1 Takeaways from Business Spreadsheets
3.3.2 Comparison Between Domains
3.4 Summary and Discussion
3.4.1 Datasets for Experimental Evaluation
3.4.2 A Processing Pipeline
4 LAYOUT ANALYSIS
4.1 A Method for Layout Analysis in Spreadsheets
4.2 Feature Extraction
4.2.1 Content Features
4.2.2 Style Features
4.2.3 Font Features
4.2.4 Formula and Reference Features
4.2.5 Spatial Features
4.2.6 Geometrical Features
4.3 Cell Classification
4.3.1 Classification Datasets
4.3.2 Classifiers and Assessment Methods
4.3.3 Optimum Under-Sampling
4.3.4 Feature Selection
4.3.5 Parameter Tuning
4.3.6 Classification Evaluation
4.4 Layout Regions
4.5 Summary and Discussions
5 CLASSIFICATION POST-PROCESSING
5.1 Dataset for Post-Processing
5.2 Pattern-Based Revisions
5.2.1 Misclassification Patterns
5.2.2 Relabeling Cells
5.2.3 Evaluating the Patterns
5.3 Region-Based Revisions
5.3.1 Standardization Procedure
5.3.2 Extracting Features from Regions
5.3.3 Identifying Misclassified Regions
5.3.4 Relabeling Misclassified Regions
5.4 Summary and Discussion
6 TABLE DETECTION
6.1 A Method for Table Detection in Spreadsheets
6.2 Preliminaries
6.2.1 Introducing a Graph Model
6.2.2 Graph Partitioning for Table Detection
6.2.3 Pre-Processing for Table Detection
6.3 Rule-Based Detection
6.3.1 Remove and Conquer
6.4 Genetic-Based Detection
6.4.1 Undirected Graph
6.4.2 Header Cluster
6.4.3 Quality Metrics
6.4.4 Objective Function
6.4.5 Weight Tuning
6.4.6 Genetic Search
6.5 Experimental Evaluation
6.5.1 Testing Datasets
6.5.2 Training Datasets
6.5.3 Tuning Rounds
6.5.4 Search and Assessment
6.5.5 Evaluation Results
6.6 Summary and Discussions
7 XLINDY: A RESEARCH PROTOTYPE
7.1 Interface and Functionalities
7.1.1 Front-end Walkthrough
7.2 Implementation Details
7.2.1 Interoperability
7.2.2 Efficient Reads
7.3 Information Extraction
7.4 Summary and Discussions
8 CONCLUSION
8.1 Summary of Contributions
8.2 Directions of Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
A ANALYSIS OF REDUCED SAMPLES
B TABLE DETECTION WITH TIRS
B.1 Tables in TIRS
B.2 Pairing Fences with Data Regions
B.3 Heuristics Framework
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:71518 |
Date | 17 July 2020 |
Creators | Koci, Elvis |
Contributors | Lehner, Wolfgang, Romero, Oscar, Technische Universität Dresden, Polytechnic University of Catalonia |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |
Relation | info:eu-repo/grantAgreement/Education, Audiovisual and Culture Executive Agency/Erasmus Mundus Joint Doctorate/ EMJD IT4BI-DC 2013-0048// Information Technologies for Business Intelligence – Doctoral College/ IT4BI-DC, info:eu-repo/grantAgreement/Bundesministerium für Bildung und Forschung/Big Data “ScaDS Dresden/Leipzig”/01IS14014A-D/ |