MSc Data Science (Computer Science Data Informatics) / Course details

Year of entry: 2024

Course unit details:
Text Mining

Course unit fact file
Unit code COMP61332
Credit rating 15
Unit level FHEQ level 7 – master's degree or fourth year of an integrated master's degree
Teaching period(s) Semester 2
Available as a free choice unit? Yes

Overview

Text mining has evolved in recent years as a way of mitigating information overload and information overlook, and of helping us discover new knowledge from old. To do this, it employs a battery of techniques from information retrieval, natural language processing and data mining. Although the holy grail of text mining is the discovery of previously unsuspected knowledge, its techniques are applied in a wide range of areas, essentially those concerned with organising, selecting, filtering, combining, associating and exploiting information. Text mining goes far beyond conventional search engine technology.

Aims

This course unit aims to provide students with an understanding of the principles, issues, techniques and solutions connected with text mining, and to enable them to understand how recent advances in text mining relate to innovative approaches to organising, characterising, finding and exploiting large-scale textual information in the search for new knowledge.

Learning outcomes

  • To compare and contrast methods for sentence segmentation, tokenisation, part-of-speech tagging, syntactic parsing and semantic representation
  • To apply techniques such as named entity recognition, entity linking, and relation and event extraction to extract information from text, while leveraging lexical and semantic resources (e.g. FrameNet, VerbNet, WordNet) and terminological repositories

  • To design and customise text annotation workflows, taking into consideration various annotation formats

  • To explain how text mining supports the development of semantic search systems

  • To explain the distributional hypothesis, and to compare (1) count-based and (2) compositional distributional semantics models with each other

  • To apply various evaluation measures (e.g., Kappa, recall, precision and F-score)

  • To investigate methods for social media content analysis
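
As an illustration of the count-based distributional models named in the outcomes above, the following is a minimal sketch and is not taken from the course materials. It assumes a toy three-sentence corpus, a symmetric context window of two tokens, and raw co-occurrence counts compared by cosine similarity; the function names `cooccurrence_vectors` and `cosine` are hypothetical.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (an assumption made purely for illustration).
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
]
vectors = cooccurrence_vectors(corpus)

# Words that occur in similar contexts should receive a higher similarity score.
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["milk"]))
```

In this toy corpus "cat" and "dog" share most of their contexts, so they score higher than "cat" and "milk", which is the intuition behind the distributional hypothesis. Real count-based models typically reweight the raw counts (for example with pointwise mutual information) and use far larger corpora.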

Syllabus

Introduction: background, motivation, dealing with information overload and information overlook, unstructured vs. (semi-)structured data, evolving information needs and knowledge management issues, enhancing user experience of information provision and seeking, the business case for text mining.

The text mining pipeline: information retrieval, information extraction and data mining.

Fundamentals of natural language processing: linguistic foundations, levels of linguistic analysis.

Approaches to text mining: rule-based vs. machine-learning-based vs. hybrid; generic vs. domain-specific; domain adaptation.

Dealing with real text: text types, document formats and conversion, character encodings, markup, low-level processes (sentence splitting, tokenisation, part-of-speech tagging, chunking).
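
To make the first two low-level processes concrete, here is a minimal rule-based sketch of sentence splitting and tokenisation. It is an illustration only, not the method taught in the unit: the function names `split_sentences` and `tokenise` are hypothetical, and the regular expressions are deliberately naive assumptions.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]


def tokenise(sentence):
    """Return words (keeping internal hyphens and apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


text = "Text mining is useful. It goes beyond search!"
for sentence in split_sentences(text):
    print(tokenise(sentence))
```

A rule this simple breaks on abbreviations: "Dr. Smith arrived." is split after "Dr.". Failures of this kind are one reason the choice between rule-based, machine-learning-based and hybrid approaches matters even at the lowest levels of processing.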

Information extraction: term extraction, named entity recognition, relation extraction, fact and event extraction; partial analysis vs. full analysis.

Data mining and visualisation of results from text mining.

Evaluation of text mining systems: evaluation measures, role of evaluation challenges, usability evaluation.
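
The standard measures can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: extracted items are compared to a gold standard by exact match, F-score is the balanced F1, and Kappa is Cohen's kappa for two annotators. The function names and the toy data are hypothetical.

```python
from collections import Counter


def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and balanced F-score over two collections of items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (assumed for illustration): entity mentions and annotator labels.
precision, recall, f1 = precision_recall_f1(gold={"a", "b", "c", "d"}, predicted={"a", "b", "e"})
kappa = cohen_kappa(["PER", "PER", "LOC", "LOC"], ["PER", "LOC", "LOC", "LOC"])
print(precision, recall, f1, kappa)
```

Precision, recall and F-score measure system output against a gold standard, whereas kappa is usually applied to the annotators who produce that gold standard, so the two kinds of measure answer different questions.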

Resources for text mining: annotated corpora, computational lexica, ontologies, computational grammars; design, construction and use issues.

Issues in large-scale processing of text: distributed text mining, scalable text mining systems.

A sampler of text mining applications and services; case studies.

Teaching and learning methods

Lectures

15 hours of lectures.

Laboratories

15 hours of labs.

Consultation

5 hours of consultation.

Transferable skills and personal qualities

Employability skills

  • Analytical skills
  • Problem solving
  • Research

Assessment methods

Method Weight
Written exam 50%
Written assignment (inc essay) 50%

Feedback methods

  • Oral feedback in class.
  • Email.
  • Course Web site.

Study hours

Scheduled activity hours
Assessment written exam 2
Lectures 15
Practical classes & workshops 20
Independent study hours
Independent study 113

Teaching staff

Staff member Role
Riza Batista-Navarro Unit coordinator
