MSc Machine Learning

Year of entry: 2025

Course unit details:
Data Engineering Technologies

Course unit fact file
Unit code COMP63502
Credit rating 15
Unit level FHEQ level 7 – master's degree or fourth year of an integrated master's degree
Teaching period(s) Semester 2
Available as a free choice unit? No

Overview

In the world of data analytics, preparing and managing data is often the most time-consuming task, estimated by many reports and surveys to take up to 80% of the whole workload. This unit focuses on the essential data engineering techniques that make large-scale data processing and analysis possible and efficient. Students will explore the foundational concepts and tools used in modern data engineering, including scalable data storage systems, advanced querying methods, parallel and distributed data processing, data interpretation, and effective data retrieval strategies. Emphasis is placed not just on theory, but on hands-on, practical skills that prepare students to work with real-world data.

Pre/co-requisites

Unit title Unit code Requirement type Description
Data Engineering Concepts COMP63301 Pre-Requisite Recommended

Prior knowledge of machine learning is needed.

Aims

This unit aims to provide students with exposure to and experience of specialised technologies that support data storage, access, integration and use at scale. Data engineering relates to the processes, tools and techniques required to maximise the value that can be obtained from the data resources an individual or organisation has access to. Many of the challenges faced by data engineers have been prominent for a considerable period, and have benefited from research and development that has given rise to specialised techniques for obtaining value from data. This unit aims to provide potential data engineers with the ability to select, evaluate and apply data engineering technologies to problems that involve complex data at scale.

Learning outcomes

1. Describe technologies that underpin scalability in data intensive systems and their properties.

2. Describe and discuss data integration and data retrieval techniques.

3. Compare and contrast approaches to the development of data intensive applications.

4. Analyse how different algorithms and data structures affect data intensive system performance.

5. Construct and apply different data representations that support data curation and analysis.

6. Design experiments for comparing and analysing different data engineering techniques.

7. Write reports that analyse properties of data engineering techniques.

Syllabus

Part I: Techniques for Scalability

Week 1: Storage: Storing Datasets for Scalability 
•    File Systems
•    Storage structures
•    Indexes on disk and in memory

Week 2: Algorithms
•    Algorithmic strategies
•    Modelling algorithm behaviour

Week 3: Queries
•    Query processing 
•    Modelling query properties

Week 4: Parallelism/Distribution
•    Architectures
•    Paradigms

Week 5: Platforms
•    Batch
•    Interactive
•    Streaming

Week 6: 
•    Complete laboratory work.


Part II: Data Curation and Analysis

Week 7: Graph-based Data Analysis
•    Graph database
•    Graph query

Week 8: Table Representation
•    Models and learning methods
•    Discussion and applications

Week 9: Semantic Table Interpretation
•    Entity annotation
•    Type annotation
•    Attribute and relation annotation
•    Table to graph transformation

Week 10: Data Integration
•    Schema inference
•    Entity alignment

Week 11: Advanced Topics and Recent Developments
•    Question answering
•    Retrieval augmented generation

Week 12: 
•    Complete laboratory work

     

Teaching and learning methods

The unit will adopt a blended learning approach, with videos and quizzes for students to engage with asynchronously, in addition to synchronous activities in the form of: (i) workshops that include both presentation of new material and problem solving; (ii) laboratory sessions that explore specific techniques in more detail and apply them in practice.

Employability skills

Analytical skills
Innovation/creativity
Oral communication
Problem solving
Research
Written communication

Assessment methods

Method Weight
Written exam 50%
Written assignment (inc essay) 50%

Feedback methods

Summative lab-based coursework: individual rubric-based feedback after marking.
Formative weekly quizzes: autograded, with immediate feedback.
Exam: cohort-level feedback after marking.
 

Recommended reading

Martin Kleppmann, Designing Data-Intensive Applications, O’Reilly, 2017.

Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, 3rd Edition, Cambridge University Press, 2020.

Joe Reis and Matt Housley, Fundamentals of Data Engineering, O’Reilly, 2022.

 

Study hours

Scheduled activity hours
Assessment written exam 1.5
Lectures 20
Practical classes & workshops 12
Independent study hours
Independent study 116.5

Teaching staff

Staff member Role
Jiaoyan Chen Unit coordinator
