MSc Data Science (Computer Science Data Informatics)

Year of entry: 2024

Course unit details:
Statistics and Machine Learning 1: Statistical Foundations

Course unit fact file
Unit code DATA70121
Credit rating 15
Unit level FHEQ level 7 – master's degree or fourth year of an integrated master's degree
Teaching period(s) Semester 1
Available as a free choice unit? No

Overview

The module combines lectures, designed to communicate key ideas in statistics and machine learning, with practical sessions in which students will apply and, in simple cases, develop tools using Python and, where appropriate, other industry-standard languages such as R.

There are five main sections:

  1. Thinking probabilistically: random variables, distributions and models for data.
  2. Exploratory data analysis: kinds of data, descriptive statistics and visualisation tools.
  3. Statistical estimation: point estimation, bias, maximum likelihood estimates, tests of difference, confidence intervals and hypothesis testing, Bayesian estimation, prior and posterior distributions, conjugate priors.
  4. Comparison and selection of models: linear regression, generalised linear regression, measures of goodness-of-fit and predictive power, comparison of models, generalisation to semi- and non-parametric approaches as well as hierarchical and spatial models, overfitting and regularisation.
  5. Special Topic: Depending on the teaching staff, a special topic will be chosen to demonstrate the general concepts in more depth. A likely example is Social Networks: networks and statistical models for them, including Erdős–Rényi random graphs and exponential random graph models; network statistics including degree distribution, homophily and transitivity.
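
As an informal illustration of the estimation ideas listed in section 3 (not taken from the unit's own materials), the Python sketch below compares the maximum likelihood estimate of a coin's bias with the posterior mean under a conjugate Beta prior. The data (7 heads in 10 tosses), the Beta(2, 2) prior and all function names are assumptions chosen purely for illustration.

```python
from fractions import Fraction


def bernoulli_mle(successes, trials):
    """Maximum likelihood estimate of a Bernoulli success probability."""
    return Fraction(successes, trials)


def beta_posterior(alpha, beta, successes, trials):
    """Conjugate update: a Beta(alpha, beta) prior combined with binomial
    data gives a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


# Hypothetical data: 7 heads observed in 10 tosses.
successes, trials = 7, 10

mle = bernoulli_mle(successes, trials)

# Hypothetical prior: Beta(2, 2), mildly favouring a fair coin.
post_alpha, post_beta = beta_posterior(2, 2, successes, trials)
post_mean = beta_mean(post_alpha, post_beta)

print(f"MLE: {float(mle):.3f}")
print(f"Posterior: Beta({post_alpha}, {post_beta}), mean {float(post_mean):.3f}")
```

With these assumed numbers the posterior mean (9/14) sits between the prior mean (1/2) and the maximum likelihood estimate (7/10), a small instance of how a prior pulls an estimate towards prior belief when data are scarce.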

Aims

The unit aims to:

  • introduce students to the main ideas and methods of statistical approaches to data science, based on probability models, likelihoods and estimators, including such modern developments as Gaussian processes and regularisation;
  • enable students to explore data and to choose, fit, interpret and critique a range of standard and advanced statistical models;
  • enable students to communicate statistical analyses, in writing and in presentations, to audiences with varying levels of technical expertise.

Learning outcomes

Students should be able to:

  • Explain what probabilistic models are and can do: the sorts of relationships they can capture and the sorts of understanding and predictions they can yield;
  • Explain and critique statistical models;
  • Perform exploratory data analyses, fit standard statistical models and prepare illuminating visualisations;
  • Present the results of statistical analyses, both in writing and orally, justifying modelling choices and communicating effectively with audiences at various levels of statistical expertise.

Teaching and learning methods

Lectures will introduce key ideas from probability and statistics and explain how to use these ideas to interpret the results of, for example, regression models. Computer-based practicals will allow students to develop their software skills and to apply standard tools from R, Python or any other industry-standard language to perform statistical analyses and prepare visualisations.

Assessment methods

Method Weight
Written exam 80%
Written assignment (inc essay) 20%

Feedback methods

Feedback available via Turnitin

Recommended reading

  • Sheldon Ross (2014), A First Course in Probability, 9th edition, Pearson. ISBN 9780321926678
  • Thomas Haslwanter (2016), An Introduction to Statistics with Python, Springer. ISBN 9783319283159
  • Simon Rogers & Mark Girolami (2017), A First Course in Machine Learning, 2nd edition, Chapman & Hall/CRC. ISBN 9781498738484
  • G. James, D. Witten, T. Hastie, and R. Tibshirani (2013), An Introduction to Statistical Learning with Applications in R, Springer-Verlag, New York. ISBN 9781461471370
  • T. Hastie, R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, Springer-Verlag, New York. ISBN 9780387848587
  • John Tukey (1977), Exploratory Data Analysis, Addison Wesley. ISBN 0201076160
  • Carl E. Rasmussen and Christopher K. I. Williams (2006), Gaussian Processes for Machine Learning, MIT Press. ISBN 026218253X

Teaching staff

Staff member Role
Mark Muldoon Unit coordinator
