DatE PiE - Data Engineering Pipeline Evolution
Data engineering pipelines inevitably change over time, especially in the structure and semantics of the data and in the pipeline operators themselves. Dealing with these changes, i.e., providing long-term maintenance, is costly. This shows the need for evolution capabilities in the form of self-awareness and self-adaptation.
Keywords: Data Engineering, Data Pipeline, Workflow, Self-Awareness, Self-Adaptation
Evolving Gracefully: Building Robust and Self-Adaptive Data Cleaning Pipelines for Schema Evolution and Uncertainty
The lifecycle of data cleaning pipelines is accompanied by diverse forms of data and software evolution. Often, these changes are introduced upstream without being communicated to downstream data consumers, which creates uncertainty. Evolution and uncertainty lead to substantial human involvement and thus to high maintenance costs for long-running data cleaning pipelines. A significant contributing factor is the robustness of operators, i.e., whether and how operators are affected by certain types of change and which consequences this entails for the whole pipeline. In the present work, we investigate and define the robustness of data cleaning operators towards schema evolution. To this end, we categorize data cleaning operations based on how they interact with the data on a structural level. Given these categories and the different cases of structural change, we create a decision tree that enables a systematic understanding of the robustness of data cleaning pipelines towards schema evolution. Based on these theoretical findings, we present concepts and techniques that work towards a vision of self-adaptive data cleaning pipelines.
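As a rough illustration of what operator robustness towards schema evolution can mean, the following Python sketch classifies how an operator might react to a structural change. The two operator categories (explicit column list versus whole table), the three change types, and the names `Operator`, `Change`, `Outcome`, and `robustness` are assumptions made for this example; they are not the categories or the decision tree from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Change(Enum):
    ADD_COLUMN = "add_column"
    DROP_COLUMN = "drop_column"
    RENAME_COLUMN = "rename_column"


class Outcome(Enum):
    UNAFFECTED = "unaffected"
    NOT_COVERED = "runs, but ignores the new column"
    FAILS = "fails on a missing column"


@dataclass(frozen=True)
class Operator:
    name: str
    # Hypothetical categorization: None means the operator works on
    # whatever columns exist; otherwise it names its columns explicitly.
    columns: Optional[FrozenSet[str]]


def robustness(op: Operator, change: Change, column: str) -> Outcome:
    """Walk a tiny, illustrative decision tree for one structural change."""
    if op.columns is None:
        # Whole-table operators pick up the current schema at run time.
        return Outcome.UNAFFECTED
    if change is Change.ADD_COLUMN:
        # An explicit column list never sees a newly added column.
        return Outcome.NOT_COVERED
    if column in op.columns:
        # A dropped or renamed column that the operator references is gone.
        return Outcome.FAILS
    return Outcome.UNAFFECTED


trim = Operator("trim_whitespace", frozenset({"name"}))
dedup = Operator("drop_duplicates", None)
for op, change, col in [
    (trim, Change.DROP_COLUMN, "name"),
    (trim, Change.ADD_COLUMN, "email"),
    (dedup, Change.RENAME_COLUMN, "name"),
]:
    print(op.name, change.value, col, "->", robustness(op, change, col).name)
```

Even this simplified tree shows why the distinction matters: the same upstream change can break one operator, silently bypass another, and leave a third untouched.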
Publication: Kramer, K. M., Restat, V., & Störl, U. (2025). Evolving Gracefully: Building Robust and Self-Adaptive Data Cleaning Pipelines for Schema Evolution and Uncertainty. VLDB 2025 Workshop: 14th International Workshop on Quality in Databases (QDB’25). Link to paper
Towards Next Generation Data Engineering Pipelines
Data engineering pipelines are a widespread way to provide high-quality data for all kinds of data science applications. However, numerous challenges remain in the composition and operation of such pipelines. Data engineering pipelines do not always deliver high-quality data. By default, they also do not react to changes: when incoming data deviates from prior data, the pipeline may crash or produce undesired results. We therefore envision three levels of next generation data engineering pipelines: optimized data pipelines, self-aware data pipelines, and self-adapting data pipelines. Pipeline optimization addresses the composition of operators and their parametrization in order to achieve the highest possible data quality. Self-aware data engineering pipelines continuously monitor their current state and notify data engineers of significant changes. Self-adapting data engineering pipelines can additionally react to those changes automatically. We propose approaches to achieve each of these levels.
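To make the self-awareness level more concrete, the following Python sketch compares the schema of an incoming batch with that of prior data and reports the differences a data engineer could be notified about. It is a minimal illustration under stated assumptions, not the approach proposed in the paper: the helper names `schema_of` and `diff_schema`, the row-as-dictionary representation, and the sample data are all hypothetical.

```python
def schema_of(rows):
    """Map each column name to the set of Python type names seen in it."""
    schema = {}
    for row in rows:
        for col, value in row.items():
            schema.setdefault(col, set()).add(type(value).__name__)
    return schema


def diff_schema(reference, incoming):
    """Report columns that were added, removed, or changed type."""
    ref_cols, new_cols = set(reference), set(incoming)
    return {
        "added": sorted(new_cols - ref_cols),
        "removed": sorted(ref_cols - new_cols),
        "retyped": sorted(
            col for col in ref_cols & new_cols
            if reference[col] != incoming[col]
        ),
    }


# Hypothetical batches: the new batch adds a column and changes a type.
prior = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]
new = [{"id": 3, "price": "12,00", "currency": "EUR"}]

report = diff_schema(schema_of(prior), schema_of(new))
if any(report.values()):
    # A self-aware pipeline would notify a data engineer at this point;
    # a self-adapting one would go further and try to repair the batch.
    print("schema drift detected:", report)
```

A check like this only covers structural deviation; changes in the semantics of the data would need additional monitoring.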
Publication: Kramer, K. M., Restat, V., Strasser, S., Störl, U., & Klettke, M. (2025). Towards Next Generation Data Engineering Pipelines. CoRR, abs/2507.13892. Link to paper
Towards Evolution Capabilities in Data Pipelines
Dealing with evolutionary change within data pipelines is a major goal with diverse challenges. At the core of our solution lies a two-step process consisting of self-awareness and self-adaptation. To make these abstract concepts tangible, we created a conceptual requirements model, which encompasses criteria for self-awareness and self-adaptation and covers the dimensions data, operator, pipeline, and environment. The lack of these capabilities in existing frameworks exposes a major gap, which we aim to fill with our future work. We created a roadmap with the most important steps towards this goal, which would benefit scientists and practitioners alike.
Publication: Kramer, K. (2023). Towards Evolution Capabilities in Data Pipelines. In Proceedings of the 34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7–9, 2023 (CEUR Workshop Proceedings, Vol. 3714). CEUR-WS.org. Link to paper
Other publications
LLM-unterstützte Generierung von Intranets für den Einsatz in der Lehre (LLM-Supported Generation of Intranets for Use in Teaching)
Publication: Kramer, K. M., Conrad, A., Restat, V., & Störl, U. (2025). LLM-unterstützte Generierung von Intranets für den Einsatz in der Lehre. Datenbank-Spektrum, 25(2), 103–114. Link to paper