Installation

The following section describes the installation and usage of the software required to complete the practical exercises (and exam) of the Data Engineering for Data Science course.

The required software (Jupyter Notebook, Hadoop, Spark, ...) is provided as a Docker Compose setup. This has two advantages: there is no need to install and configure the individual software stacks, and everyone works with the same software versions, which eliminates version-related problems.

The only requirement is therefore to install Docker and Docker Compose on Linux, or Docker Desktop on Windows (together with WSL) and Mac.

Installation of Docker and Docker Compose

This section provides instructions for installing Docker and Docker Compose on Linux, Windows and Mac.

NOTE: The installation guide primarily uses terminal commands! For those who have never worked with a Linux terminal before and/or are new to working with WSL, we have collected some hints: DEDS_Linux_Mini_Cheat_Sheet.pdf

Versions:

  • Docker-Engine: It is recommended to use at least version 20.10.17.
  • Docker-Compose: Make sure you use at least version 1.29.2, because older versions are not compatible with the docker-compose.yml file used here.
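The minimum versions listed above can be checked from a terminal. The following is a minimal sketch, not part of the course material: the helper names version_ge, extract_version and check_version are hypothetical, and the comparison relies on sort -V (GNU coreutils). If Compose is installed as a Docker plugin, the version is printed by docker compose version instead of docker-compose --version.

```shell
# Succeeds if version $1 is at least version $2 (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the first x.y.z version number found in the given string.
extract_version() {
    printf '%s\n' "$1" | sed -nE 's/^[^0-9]*([0-9]+\.[0-9]+\.[0-9]+).*$/\1/p' | head -n 1
}

# $1 = label, $2 = minimum version, $3 = output of the tool's --version call
check_version() {
    local label="$1" min="$2" found
    found="$(extract_version "$3")"
    if [ -z "$found" ]; then
        echo "$label: version not recognized"
    elif version_ge "$found" "$min"; then
        echo "$label: $found (ok, >= $min)"
    else
        echo "$label: $found (too old, need >= $min)"
    fi
}

# Only query the tools that are actually installed.
if command -v docker >/dev/null 2>&1; then
    check_version "Docker Engine" 20.10.17 "$(docker --version 2>/dev/null)"
fi
if command -v docker-compose >/dev/null 2>&1; then
    check_version "Docker Compose" 1.29.2 "$(docker-compose --version 2>/dev/null)"
fi
```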

Linux

See the following links for installation instructions under Linux (Ubuntu):

NOTE: It is necessary to be able to limit the memory and CPU usage of Docker containers!

Use the following command to verify that the operating system is set up to use resource limits:

sudo docker info

If the output contains the line "WARNING: No swap limit support", resource limiting is not enabled by default on your system!

Here you will find instructions on how to configure the use of resource limitations.
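The check described above can also be scripted instead of reading the whole docker info output by eye. This is only a sketch: the function name check_swap_limit is hypothetical, and it simply searches its input for the warning text quoted above. The warning may be printed on the error stream, so the usage line merges both streams.

```shell
# Reads `docker info` output from standard input and reports whether
# the swap limit warning is present. Returns 1 if the warning was found.
check_swap_limit() {
    if grep -q 'No swap limit support'; then
        echo "swap limit support is NOT enabled"
        return 1
    fi
    echo "no swap limit warning found"
}

# Usage:
#   sudo docker info 2>&1 | check_swap_limit
```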

Windows

Using Docker and Docker Compose on Windows requires a Linux distribution running under WSL (WSL 2 backend)!

NOTE: It is strongly recommended to use WSL 2. Otherwise, Docker for Windows creates a Hyper-V VM in the background. This causes a significant performance overhead, and the default amount of main memory available to the VM has to be increased (at least 6 GB, better 8 GB)!

Follow these instructions to install WSL 2 (and also install Ubuntu >= 18.04).

Docker Desktop can then be installed and configured to use WSL 2.

Mac

Follow these instructions to install Docker Desktop on Mac.

Since Docker Desktop on macOS runs the containers inside a virtual machine, default parameters such as the available main memory may need to be adjusted! Docker containers should have at least 6 GB RAM (better 8 GB) available.
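To see how much memory Docker can actually use, docker info can print the total it reports. The sketch below is an illustration under assumptions: the helper names bytes_to_gib and report_memory are hypothetical, and the 6 GB recommendation is treated as 6 GiB. The reported value is often slightly lower than the configured limit, so a result just under 6 does not necessarily indicate a problem.

```shell
# Converts a byte count to GiB with one decimal place.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Prints the memory Docker reports and a hint if it is below 6 GiB.
report_memory() {
    local bytes="$1" min_bytes=$((6 * 1024 * 1024 * 1024))
    echo "Docker reports $(bytes_to_gib "$bytes") GiB of memory"
    if [ "$bytes" -lt "$min_bytes" ]; then
        echo "This is below the recommended 6 GB; consider raising the limit"
    fi
}

if command -v docker >/dev/null 2>&1; then
    # MemTotal is the total memory (in bytes) visible to the Docker daemon.
    mem_bytes="$(docker info --format '{{.MemTotal}}' 2>/dev/null || true)"
    case "$mem_bytes" in
        ''|*[!0-9]*) echo "Could not read the memory size (is Docker running?)" ;;
        *) report_memory "$mem_bytes" ;;
    esac
fi
```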

NOTE: We provide Docker images for amd64 and also for arm64 (Apple silicon).

Clone the Git Repository

If not already available, you can install git as follows (Ubuntu):

sudo apt install git

To clone the repository, open a terminal, change to a directory of your choice and execute the following command:

git clone https://code.dbis-pro1.fernuni-hagen.de/\
pub-access/data-engineering-infrastructure.git \
data-engineering-infrastructure.git

Change to the cloned directory:

cd data-engineering-infrastructure.git

Check out the current release:

git checkout "$(git tag | sort -V | tail -n 1)"
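The checkout command above lists all tags, sorts them in version order and picks the last one. The following sketch shows why sort -V is needed, using made-up tag names (they are illustrative only and not the tags of the repository); the helper name latest_tag is hypothetical as well.

```shell
# Hypothetical tag list, one tag per line.
tags='v0.5.0
v0.10.0
v0.9.1'

# Picks the highest version from a list of tags read from standard input.
latest_tag() {
    sort -V | tail -n 1
}

# Version sort treats 10 as larger than 9 and prints v0.10.0.
printf '%s\n' "$tags" | latest_tag

# A plain lexical sort compares character by character and prints v0.9.1.
printf '%s\n' "$tags" | sort | tail -n 1
```

Git can also sort tags by version itself with git tag --sort=version:refname, which avoids the dependency on sort -V.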

Alternatively, Download a Release

Go to GitLab and download the newest release.

You can also download a release on the command line (here v0.5.0):

wget https://code.dbis-pro1.fernuni-hagen.de/pub-access/data-engineering-\
infrastructure/-/archive/v0.5.0/data-engineering-infrastructure-v0.5.0.tar.bz2

Extract the downloaded archive:

tar xvjpf data-engineering-infrastructure-v0.5.0.tar.bz2
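The download and extraction steps above can be wrapped so that the version number appears only once. This is a sketch under the assumption that other releases follow the same URL pattern as v0.5.0; the names BASE_URL, release_url and download_release are hypothetical.

```shell
# Base address of the repository, taken from the wget command above.
BASE_URL="https://code.dbis-pro1.fernuni-hagen.de/pub-access/data-engineering-infrastructure"

# Prints the archive URL for the given release tag (assumed URL pattern).
release_url() {
    local version="$1"
    printf '%s/-/archive/%s/data-engineering-infrastructure-%s.tar.bz2\n' \
        "$BASE_URL" "$version" "$version"
}

# Downloads the archive for the given release tag and extracts it.
download_release() {
    local version="$1"
    local archive="data-engineering-infrastructure-${version}.tar.bz2"
    wget -O "$archive" "$(release_url "$version")" && tar xvjpf "$archive"
}

# Usage (v0.5.0 is the release named above):
#   download_release v0.5.0
```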