Course unit details:
Data Engineering
Unit code | IIDS69011 |
---|---|
Credit rating | 15 |
Unit level | FHEQ level 7 – master's degree or fourth year of an integrated master's degree |
Teaching period(s) | Semester 1 |
Available as a free choice unit? | No |
Overview
Clinical Data Scientists need to be able to create data pipelines and merge data sets from different sources before they can be used for onward analysis such as machine learning. They will also be required to 'wrangle' (pre-process) data into different formats and subsets for subsequent analysis. This includes an understanding of structured and unstructured data formats (e.g. tabular form, JSON, XML etc.), how data is modelled in various commonly used database systems, and an awareness of data/cyber security. They will be required to access data in a variety of formats and engineer pipelines for data analysis whilst adhering to wider concepts of data protection/privacy regulations and information governance. This module introduces these concepts with applied examples.
Aims
The unit aims to:
- Give students hands-on experience of applying tools and techniques used to access data in different common formats, and of transforming and combining this data into a format suitable for subsequent data analysis (e.g. application of statistical methods/machine learning algorithms) by creating data processing pipelines
- Provide experience of using, accessing and querying data in different database storage systems (e.g. relational and NoSQL database systems)
- Understand the importance of data security issues both from a technical and legislative perspective
- Explore the benefits and challenges with accessing health/clinical data
- Understand and practise data cleaning, and understand the impact of data provenance and of altering data (e.g. variable encoding, missing values, inconsistently entered data and data validation)
Learning outcomes
On completion of this unit, successful students should be able to:
Category of outcome | Students should be able to: |
A: Knowledge and understanding | LO1: Describe the difference between structured and unstructured data, citing relevant examples of each; LO2: Discuss the consequences of cyber-attacks/data breaches and mitigation strategies; LO3: Discuss principles involved in data sharing and information governance with reference to appropriate guidelines and legislation; LO4: Critique common data standards depending on intended usage; LO5: Explain the challenges and opportunities of big data and approaches for processing such data |
B: Intellectual Skills | |
Syllabus
This unit will cover the following indicative content:
- Fundamental data types and structures
- Structured and unstructured data
- The fundamentals of using Python for data science and associated libraries/modules
- How data is modelled in different database systems
- Querying and filtering data
- Representing data using dataframes
- Data cleaning (imputing missing values, encoding variables)
- Data transformations (wide/long, feature engineering)
- Combining datasets (data linkage)
- Data sharing agreements/plans
- Data and patients
- Data representation in diagrams (e.g. ERM, Data flow and UML)
- Common data standards
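As an informal illustration of three of the syllabus topics above (data cleaning, wide/long transformation and combining datasets), the following is a minimal sketch using pandas. It is not taken from the unit materials: the tables, column names (`patient_id`, `sbp_visit1`, `sex`) and the choice of median imputation are all hypothetical and chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical wide-format observations: one row per patient, one column per visit.
obs_wide = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "sbp_visit1": [120.0, None, 135.0],
        "sbp_visit2": [118.0, 142.0, None],
    }
)

# Hypothetical demographics table from a second source.
demographics = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "sex": ["F", "M", "F"],
    }
)

# Data transformation: reshape wide to long (one row per patient per visit).
obs_long = obs_wide.melt(id_vars="patient_id", var_name="visit", value_name="sbp")

# Data cleaning: flag missing readings, then impute them with the overall median.
# This is a deliberately simple, assumed strategy; keeping the flag preserves
# a record of which values were altered (data provenance).
obs_long["sbp_imputed"] = obs_long["sbp"].isna()
obs_long["sbp"] = obs_long["sbp"].fillna(obs_long["sbp"].median())

# Data cleaning: encode a categorical variable as an integer code.
demographics["sex_code"] = demographics["sex"].map({"F": 0, "M": 1})

# Combining datasets: link the two tables on the shared key, checking that
# each observation matches at most one demographics row.
linked = obs_long.merge(
    demographics, on="patient_id", how="left", validate="many_to_one"
)
```

Keeping an explicit `sbp_imputed` flag is one way to make altered values traceable later in a pipeline; other imputation and encoding choices would be equally valid depending on the analysis.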
The unit will be delivered online, making use of workshops, lectures, labs and self-directed learning material delivered through interactive digital (Jupyter) notebooks to impart core knowledge and skills. A series of synchronous labs using a variety of datasets and formats will be used to foster group work and collaborative working through problem-based learning. Case studies and data will be drawn from The University of Manchester and its affiliates, as well as from NHS and open-source projects where possible.
Assessment methods
Assessment task | Length | Weighting within unit |
Data Management Plan | You will create an authentic data management plan for a fictional scenario or real-world project that you would like to implement in your organisation. | 100% |
Feedback methods
Formative assessment and feedback to students is a key feature of the online learning materials for this unit and is provided through self-directed learning activities in the interactive notebooks.
Recommended reading
- McKinney, W (2017) Python for Data Analysis. Beijing: O'Reilly
- Molin, S (2019) Hands-On Data Analysis with Pandas. Birmingham: Packt
- Medium (2021) Towards data science: A Medium publication sharing concepts, ideas and codes. https://towardsdatascience.com/about
Study hours
Independent study hours | |
---|---|
Independent study | 150 |
Teaching staff
Staff member | Role |
---|---|
Alan Davies | Unit coordinator |
Iliada Eleftheriou | Unit coordinator |