Data Engineering

IF5OT7 - 6 ECTS - 3rd Edition

"A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him." – Gordon Lindsay Glegg

Overview

The course gives an overview of the foundational concepts of Data Engineering. It is tailored for 1st- and 2nd-year MSc students and PhD students who would like to strengthen their fundamental understanding of Data Engineering, i.e., Data Modelling, Collection, and Wrangling.

Origin and Design

The course material was originally taught across different courses at Politecnico di Milano (🇮🇹) by Emanuele Della Valle and Marco Brambilla. The first edition as a unified journey into the data world dates back to 2020 at the University of Tartu (🇪🇪), taught by Riccardo Tommasini, where it is still held by professor Ahmed Awad (course LTAT.02.007). The course was also adopted by INSA Lyon (🇫🇷) as OT7 (2022) and PLD "Data".

Learning Objectives

Students of this course will obtain two sets of skills: one that is deeply technical and necessarily technologically biased, and one that is more abstract (soft) yet essential to building the professional profile that fits in a data team.

Challenge-Based Learning

The course follows a challenge-based learning approach. Each system is approached independently through a uniform interface (a Python API). The student's main task is to build pipelines, managed by Apache Airflow, that integrate 3 to 5 of the presented systems. The course schedule does not include an explanation of how such integration should be done: it is up to each group to figure it out, whether by developing a custom operator, scheduling scripts, or taking another approach. The students are then encouraged to discuss their approach and present its limitations.
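The intended shape of such a pipeline can be sketched with plain Python functions standing in for Airflow tasks; the data and task names below are made up for illustration, and in a real submission each function would become an Airflow task with explicit dependencies between them.

```python
# Sketch of a three-task extract >> transform >> load pipeline. In Airflow,
# each function would be wrapped in a task and chained with dependencies.

def extract():
    # Stand-in for pulling raw records from a source system (hypothetical data).
    return [{"id": 1, "value": " 42 "}, {"id": 2, "value": "7"}]

def transform(records):
    # Clean the raw records: strip whitespace and cast values to integers.
    return [{"id": r["id"], "value": int(r["value"].strip())} for r in records]

def load(records):
    # Stand-in for writing to a target store; here we just index by id.
    return {r["id"]: r["value"] for r in records}

def run_pipeline():
    # Mirrors Airflow's extract >> transform >> load dependency order.
    return load(transform(extract()))

print(run_pipeline())  # {1: 42, 2: 7}
```

The same chaining idea carries over whether the group schedules scripts, writes a custom operator, or uses built-in operators: what matters is that each stage is an isolated, restartable unit.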

Soft Skills

Creativity Bonus: students are encouraged to come up with their own information needs about a domain of their choice.

Hard Skills

After a general overview of the data lifecycle, the course dives into an (opinionated) view of the modern Data Warehouse. To this extent, it touches on basic notions of Data Wrangling, in particular Cleansing and Transformation.
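To make the two notions concrete, here is a minimal sketch (on made-up records) of what cleansing and transformation each do: cleansing drops records that violate expectations, transformation normalises the survivors into a consistent shape.

```python
# Hypothetical raw records; the second one has a malformed age.
raw = [
    {"name": "  Alice ", "age": "23"},
    {"name": "Bob", "age": "n/a"},   # dropped by cleansing
    {"name": "carol", "age": "31"},
]

def cleanse(rows):
    # Cleansing: keep only rows whose age parses as an integer.
    return [r for r in rows if r["age"].strip().isdigit()]

def transform(rows):
    # Transformation: normalise names and cast ages to integers.
    return [{"name": r["name"].strip().title(), "age": int(r["age"])}
            for r in rows]

clean = transform(cleanse(raw))
print(clean)  # [{'name': 'Alice', 'age': 23}, {'name': 'Carol', 'age': 31}]
```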

At the core of the course learning outcomes is the ability to design, build, and maintain data pipelines.

Technological Choice (Year 2024/25): Apache Airflow

Regarding the technological stack of the course, and subject to the lecturer's choice, the following systems are encouraged.

The interaction with the systems above is via Python using Jupyter notebooks. The environment is powered by [[Docker]] and orchestrated using [[Docker Compose]].

Prerequisites

Classwork

Syllabus

๐Ÿ–๏ธ = Practice ๐Ÿ““ = Lecture

SCHEDULE

| Topic | Type | Day | Date | From | To | Material | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intro | 📓 | Tuesday | 2024/09/24 | 10:00 | 12:00 | Slides | |
| Docker | 🖐 | Wednesday | 2024/09/25 | 14:00 | 18:00 | Slides | |
| Data Modeling I | 📓 | Monday | 2024/10/07 | 10:00 | 12:00 | Slides | |
| Apache Airflow Intro + TD | 🖐 | Monday | 2024/10/07 | 14:00 | 18:00 | Slides, Video | Solutions |
| Data Modeling II | 📓 | Tuesday | 2024/10/08 | 14:00 | 16:00 | Slides | |
| Data Wrangling | 📓 | Wednesday | 2024/10/09 | 14:00 | 18:00 | Slides, cheatsheet | Solutions |
| Data Storage | 📓 | Monday | 2024/10/21 | 10:00 | 12:00 | TODO | |
| Document Stores | 🖐 | Monday | 2024/10/21 | 14:00 | 18:00 | Slides, Video | Solutions |
| Project In Class | 🖐 | Wednesday | 2024/10/23 | 14:00 | 18:00 | | |
| Graph DBs | 🖐 | Monday | 2024/11/04 | 14:00 | 18:00 | Slides | Solutions |
| Key-Value Stores | 🖐 | Wednesday | 2024/11/06 | 14:00 | 18:00 | Slides | Solutions |
| Exam | Exam | Wednesday | 2024/11/27 | 10:00 | 12:00 | TODO | |
| Project In Class | 🖐 | Wednesday | 2024/11/27 | 14:00 | 18:00 | | |
| External Talk (TBA) | Exam | Monday | 2024/12/02 | 10:00 | 12:00 | | |
| Poster Session | Exam | Monday | 2024/12/02 | 14:00 | 18:00 | | |

NB: The course schedule is subject to change!

Practices

Exam

The course exam is taken in class on the date indicated in the schedule. It lasts around 1 hour and includes 2-3 bigger topics (data modelling, pipeline design, etc.) and 3-5 smaller topics (simple questions about Data Engineering in general, e.g., what is an ETL?).

You are allowed to bring one A4 sheet with all the notes you can fit on it; the only requirement is that the notes are handwritten.

Projects

The goal of the project is to implement a few full-stack data pipelines that collect raw data, clean it, transform it, and make it accessible via simple visualisations.

You should identify a domain, two different data sources, and formulate 2-3 questions in natural language that you would like to answer. Such questions are necessary for the data modelling effort (i.e., creating a database).

โ€œDifferentโ€ means different formats, access patterns, frequency of update, etc. In practice, you have to justify your choice!

If you cannot identify a domain yourself, one will be assigned to you by the course teacher.

[Figure: pipeline_physical]

The final frontend can be implemented using a Jupyter notebook, Grafana, Streamlit, or any software of your preference to showcase the results. THIS IS NOT PART OF THE EVALUATION!

The project MUST include all three areas discussed in class (see figure above), i.e., an ingestion zone for (raw) data, a staging zone for cleaned and enriched data, and a curated zone for production data analytics. To connect the various zones, you should implement the necessary data pipelines using Apache Airflow. Any alternative must be approved by the teacher. The minimum number of pipelines is 3.
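The three zones can be sketched end to end with in-memory stand-ins; the payload below is invented, and in the actual project each hand-off between zones would be its own Airflow pipeline writing to a real store rather than a Python object:

```python
import json
import sqlite3

# 1. Ingestion zone: raw data as it arrives (hypothetical JSON lines).
raw_zone = ['{"user": " ada ", "score": "10"}',
            '{"user": "bob", "score": "3"}']

# 2. Staging zone: cleaned and enriched records.
staging_zone = []
for line in raw_zone:
    rec = json.loads(line)
    staging_zone.append({"user": rec["user"].strip(),
                         "score": int(rec["score"])})

# 3. Curated zone: production analytics, here a small SQLite table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scores (user TEXT, score INTEGER)")
db.executemany("INSERT INTO scores VALUES (:user, :score)", staging_zone)
total = db.execute("SELECT SUM(score) FROM scores").fetchone()[0]
print(total)  # 13
```

The three minimum pipelines map onto the two hand-offs shown here (raw to staging, staging to curated) plus at least one more ingestion or analytics flow.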

The figure below is meant to depict the structure of the project using the meme dataset as an example.

[Figure: project pipeline.jpg]

Project Minimal Submission Checklist

Project grading: 0-10 for the project, +5 for the report (accuracy, use of proper terminology, etc.), and +5 for the poster.

Increasing Project Grade

The project grading follows a portfolio-based approach: once you have achieved the minimum level (described above), you can start enhancing the project and collecting extra points. How?

[Figure: pipeline_physical_all.png]

:bangbang::bangbang: Project Registration Form (courtesy of Kevin Kanaan) :bangbang::bangbang:

Pre-Approved Datasets

Example Projects From Previous Years

FAQ

Previous Editions