Data Preparation and Data Quality

Data-driven systems and machine-learning-based decisions are becoming increasingly important and affect our everyday lives. Good results and decisions require good data quality, which must be ensured by data preparation.

Keywords:
Data Preparation
Data Quality
Data Cleaning
Data Wrangling
Evaluation

GouDa - Generation of universal Data Sets

GouDa is a tool for generating universal data sets to evaluate and compare existing data preparation tools and new research approaches. It supports diverse error types and arbitrary error rates and also provides the ground truth. This enables better analysis and evaluation of data preparation pipelines and makes results easier to reproduce.

- Diverse error types
- Arbitrary error rates
- Ground truth provided
- Scalable
- Publicly available: Zenodo, GitLab
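
The idea of injecting errors at a chosen rate while keeping the ground truth can be illustrated with a minimal sketch. This is not GouDa's implementation or API: the function name `inject_missing_values`, its parameters, and the sample data are hypothetical, and only a single error type (missing values) is shown.

```python
import random

def inject_missing_values(rows, error_rate, seed=0):
    """Hypothetical helper: blank out a share of cells in a clean table.

    Returns (dirty_rows, error_cells). The input `rows` is not modified,
    so it can serve as the ground truth.
    """
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    cells = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    n_errors = round(error_rate * len(cells))
    error_cells = set(rng.sample(cells, n_errors))
    dirty = [
        [None if (r, c) in error_cells else value for c, value in enumerate(row)]
        for r, row in enumerate(rows)
    ]
    return dirty, error_cells

# Made-up example data; 12 cells at an error rate of 0.25 yield 3 errors.
clean = [
    ["Alice", 34, "Berlin"],
    ["Bob", 27, "Hagen"],
    ["Carol", 45, "Kiel"],
    ["Dave", 51, "Ulm"],
]
dirty, error_cells = inject_missing_values(clean, error_rate=0.25)
```

Because the positions of the injected errors are known, the output of a data preparation tool run on `dirty` could be compared cell by cell against `clean`.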

Publication: V. Restat, G. Boerner, A. Conrad, U. Störl: GouDa - Generation of universal Data Sets, DEEM@SIGMOD 2022

Presentation GouDa DEEM@SIGMOD 2022

Slides GouDa DEEM@SIGMOD 2022

Holistic Data Preparation Tool

We propose the design of a holistic tool to support domain experts in data preparation.

Publication: V. Restat, M. Klettke, U. Störl: Towards a Holistic Data Preparation Tool, DataPlat@EDBT 2022

Presentation DataPlat@EDBT

Publication: V. Restat: Towards “all-inclusive” Data Preparation to ensure Data Quality, arXiv:2308.14617

CheDDaR: Checking Data - Data Quality Review

Data-driven systems and machine-learning-based decisions are becoming increasingly important and affect our everyday lives. Good results and decisions require good data quality, which must be ensured by preprocessing the data. We therefore propose CheDDaR, a framework of metrics that allows for a flexible evaluation of data quality and data preparation results.

- Detailed evaluation of data quality
- Comparison of data sets
- Considers time of evaluation
- Can be used flexibly
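
As a rough illustration of metric-based comparison of data sets, the sketch below computes one generic data quality metric, per-column completeness, for a table before and after a preparation step. It does not reproduce CheDDaR's actual metrics: the choice of completeness, the function name `completeness`, and the sample data are assumptions made for this example.

```python
def completeness(rows):
    """Share of non-missing (non-None) cells per column."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [
        sum(row[c] is not None for row in rows) / n_rows
        for c in range(n_cols)
    ]

# Made-up data: the same table before and after imputing the age column.
before = [
    ["Alice", None, "Berlin"],
    ["Bob", 27, None],
    ["Carol", 45, "Kiel"],
    ["Dave", None, "Ulm"],
]
after = [
    ["Alice", 36, "Berlin"],
    ["Bob", 27, None],
    ["Carol", 45, "Kiel"],
    ["Dave", 36, "Ulm"],
]

score_before = completeness(before)  # [1.0, 0.5, 0.75]
score_after = completeness(after)    # [1.0, 1.0, 0.75]
improvement = [a - b for a, b in zip(score_after, score_before)]  # [0.0, 0.5, 0.0]
```

Completeness alone says nothing about whether the imputed values are correct, which is one reason a framework would combine several metrics.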

Publication: V. Restat, M. Klettke, U. Störl: “FAIR” is not enough – A Metrics Framework to ensure Data Quality through Data Preparation, DE4DS@BTW 2023

   

Student theses

We regularly publish new topics for theses. An overview of open topics can be found here: dbis theses

Assigned

  • Best Practices für die Reihenfolge von Data Cleaning Pipelines [Best Practices for the Order of Data Cleaning Pipelines]

Ongoing

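  • CheDDaR: Konzeption und prototypische Implementierung eines Tools zur Analyse von Datenqualität [CheDDaR: Design and Prototypical Implementation of a Tool for Analyzing Data Quality] - I. Diestelkämper (Master)
  • Data-Streaming: Technologie-Studien im Vergleich [Data Streaming: A Comparison of Technology Studies] - F. Meier (Bachelor)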

Completed

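  • Testdatengenerierung für die Analyse von Data Preparation Pipelines [Test Data Generation for the Analysis of Data Preparation Pipelines] - G. Boerner (Bachelor)
  • Analyse von Data Cleaning Tools [Analysis of Data Cleaning Tools] - L. Lafleur (Bachelor)
  • Analyse von Data Cleaning Pipelines [Analysis of Data Cleaning Pipelines] - O. Schwammberger (Bachelor)
  • Data Cleaning in Data Streaming Pipelines - N. Rodenhausen (Master)
  • Reproduzierbarkeit von Data Cleaning Pipelines [Reproducibility of Data Cleaning Pipelines] - A. Schwarz (Master)
  • Analyse von Missing Value Imputation [Analysis of Missing Value Imputation] - K. Tejkl (Master)
  • Visualisierung von Missing Values [Visualization of Missing Values] - D. Giesen (Bachelor)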
  • Reproducibility in Data Preprocessing: An Evaluation of Open Source Tools - S. Grimm (Bachelor)
  • Data-Streaming-Technologien für Data Cleaning [Data Streaming Technologies for Data Cleaning] - C. Antonin (Bachelor)
  • Data preparation of semi-structured data - A. Zeidler (Master)
  • Fairness in Data Preprocessing - M. Werner (Bachelor)
  • Constraints for Missing Value Imputation - A. Herling (Bachelor)
  • Trade-offs between Performance-oriented and Sustainability-oriented Approaches for ML Workloads in the Cloud - D. Senzel (Master)

Valerie Restat | 11.04.2024