Data Preparation and Data Quality
Data-driven systems and machine learning-based decisions are becoming increasingly important and affect our everyday lives. Both depend on good data quality, which must be ensured by data preparation.
Keywords: Data Preparation, Data Quality, Data Cleaning, Data Wrangling, Evaluation
GouDa - Generation of universal Data Sets
GouDa is a tool for the generation of universal data sets to evaluate and compare existing data preparation tools and new research approaches. It supports diverse error types and arbitrary error rates. Ground truth is provided as well. It thus permits better analysis and evaluation of data preparation pipelines and simplifies the reproducibility of results.
- Diverse error types
- Arbitrary error rates
- Ground truth provided
- Scalable
- Publicly available: Zenodo, GitLab
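The idea of injecting errors at a chosen rate while keeping the ground truth can be illustrated with a minimal sketch. This is not GouDa's actual interface: the function `inject_missing_values`, its parameters, and the sample rows are hypothetical and only show one error type (missing values) on a small in-memory table.

```python
import random


def inject_missing_values(rows, error_rate, seed=0):
    """Blank out a share of cells and record the original values.

    Hypothetical helper for illustration only; not part of GouDa.
    Returns (dirty_rows, ground_truth), where ground_truth maps
    (row_index, column_index) to the value that was removed.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")

    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    cells = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    n_errors = round(error_rate * len(cells))

    dirty = [list(row) for row in rows]  # copy, so the clean data stays intact
    ground_truth = {}
    for r, c in rng.sample(cells, n_errors):
        ground_truth[(r, c)] = dirty[r][c]
        dirty[r][c] = None  # the injected error: a missing value
    return dirty, ground_truth


# Made-up sample data: 4 rows x 3 columns = 12 cells.
clean = [
    ["Alice", 34, "Berlin"],
    ["Bob", 27, "Hagen"],
    ["Carol", 45, "Rostock"],
    ["Dave", 52, "Kassel"],
]
dirty, truth = inject_missing_values(clean, error_rate=0.25, seed=42)
```

Because the removed values are stored alongside the polluted table, the output of a data preparation pipeline can later be compared cell by cell against the known truth.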
Publication: V. Restat, G. Boerner, A. Conrad, U. Störl: GouDa - Generation of universal Data Sets, DEEM@SIGMOD 2022
Presentation GouDa DEEM@SIGMOD 2022
Holistic Data Preparation Tool
We propose the design of a holistic tool to support domain experts in data preparation.
Publication: V. Restat, M. Klettke, U. Störl: Towards a Holistic Data Preparation Tool, DataPlat@EDBT 2022
Publication: V. Restat: Towards “all-inclusive” Data Preparation to ensure Data Quality, arXiv:2308.14617
CheDDaR: Checking Data - Data Quality Review
Data-driven systems and machine learning-based decisions depend on good data quality, which must be ensured by preprocessing the data. We therefore propose CheDDaR, a framework of metrics that allows for a flexible evaluation of data quality and data preparation results.
- Detailed evaluation of data quality
- Comparison of data sets
- Considers time of evaluation
- Can be used flexibly
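As a rough illustration of evaluating a data set at different points in time, the sketch below computes one metric before and after a preparation step and reports the difference. Completeness (the share of non-missing cells) is used here as a commonly cited data quality dimension; it is an assumption for the example and not a description of the metrics CheDDaR defines. The names `completeness` and `compare` and the sample data are hypothetical.

```python
def completeness(rows):
    """Share of cells that are not None (1.0 for an empty data set)."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 1.0
    filled = sum(1 for row in rows for value in row if value is not None)
    return filled / total


def compare(before, after, metric):
    """Evaluate one metric on two versions of a data set."""
    score_before = metric(before)
    score_after = metric(after)
    return {
        "before": score_before,
        "after": score_after,
        "delta": score_after - score_before,
    }


# Made-up data: the same table before and after a preparation step.
raw = [["Alice", None], [None, 27], ["Carol", 45]]
prepared = [["Alice", 0], ["unknown", 27], ["Carol", 45]]

report = compare(raw, prepared, completeness)
```

Passing the metric as a function keeps the comparison independent of any single quality dimension, so further metrics could be plugged in the same way.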
Publication: V. Restat, M. Klettke, U. Störl: “FAIR” is not enough – A Metrics Framework to ensure Data Quality through Data Preparation, DE4DS@BTW 2023
Student theses
We regularly publish new topics for theses. An overview of open topics can be found here: dbis theses
Assigned
- Best Practices for the Order of Steps in Data Cleaning Pipelines
Ongoing
- CheDDaR: Design and Prototypical Implementation of a Tool for Analyzing Data Quality - I. Diestelkämper (Master)
- Data Streaming: A Comparative Study of Technologies - F. Meier (Bachelor)
Completed
- Test Data Generation for the Analysis of Data Preparation Pipelines - G. Boerner (Bachelor)
- Analysis of Data Cleaning Tools - L. Lafleur (Bachelor)
- Analysis of Data Cleaning Pipelines - O. Schwammberger (Bachelor)
- Data Cleaning in Data Streaming Pipelines - N. Rodenhausen (Master)
- Reproducibility of Data Cleaning Pipelines - A. Schwarz (Master)
- Analysis of Missing Value Imputation - K. Tejkl (Master)
- Visualization of Missing Values - D. Giesen (Bachelor)
- Reproducibility in Data Preprocessing: An Evaluation of Open Source Tools - S. Grimm (Bachelor)
- Data Streaming Technologies for Data Cleaning - C. Antonin (Bachelor)
- Data preparation of semi-structured data - A. Zeidler (Master)
- Fairness in Data Preprocessing - M. Werner (Bachelor)
- Constraints for Missing Value Imputation - A. Herling (Bachelor)
- Trade-offs between Performance-oriented and Sustainability-oriented Approaches for ML Workloads in the Cloud - D. Senzel (Master)