BASS Philosophy and Data Analytics / Course details

Year of entry: 2024

Course unit details:
Quantitative Text Analysis in the Social Sciences

Course unit fact file
Unit code SOST30071
Credit rating 20
Unit level Level 3
Teaching period(s) Semester 1
Available as a free choice unit? No

Overview

The availability of text data has increased exponentially in recent years, alongside a growing demand for its analysis. This course introduces students to the quantitative analysis of text from a social science perspective, with a wide coverage of applications in economics, sociology & communication, and political science. The course adopts an applied approach: while theoretical aspects will be addressed, the primary objective is to equip students with the skills to formulate research questions that can be explored through text data and to understand the methodologies required to answer them. To this end, we begin by explaining how text can be conceptualized and modelled quantitatively, examining methods for comparing textual data. Following this, we delve into both supervised and unsupervised techniques in considerable depth, before addressing several specialized topics pertinent to social science research. Ultimately, the course aims to enable students to undertake their own research projects using text as data, providing a foundation for more advanced and technical investigations.
 


Learning outcomes

The primary objective of this course is to familiarize students with machine learning methods and contemporary quantitative text analysis techniques, equipping them with the skills needed to apply these statistical methods in their own research. In pursuit of this objective, students will also engage with foundational concepts in machine learning and statistics, cultivating skills that are applicable to a broad range of data and inference challenges. Additionally, students will have the opportunity to enhance their programming competencies and develop an original research project.

Syllabus

Lecture Schedule
(10 sessions of 2-hour lectures and weekly 1-hour computer lab sessions)

1. Introduction to Quantitative Text Analysis
Overview of the field, its applications in social sciences, and fundamental principles of text as data.

2. Descriptive Statistical Methods for Text Analysis
Exploration of foundational descriptive statistics in text analysis, focusing on word frequency, term-document matrices, and other basic text preprocessing and summarization techniques.

3. Supervised Techniques with Text Data I
Dictionary-based approaches, including sentiment analysis and the application of tools such as LIWC and other content dictionaries.

4. Supervised Techniques with Text Data II
Document classification, including precision and recall as evaluation metrics, the role of crowdsourcing in supervised learning, and comparisons of various commonly used classifiers.

5. Transition from Supervised to Unsupervised Techniques
Introduction to machine learning fundamentals, covering support vector machines, k-nearest neighbours, random forests, tree-based methods, and ensemble models.

6. Unsupervised Techniques with Text Data I
Basics of unsupervised learning, with a focus on dimensionality reduction methods, including principal component analysis and singular value decomposition.

7. Unsupervised Techniques with Text Data II
Clustering methods for grouping documents, scaling techniques, and various topic modelling approaches (e.g., Latent Dirichlet Allocation, Structural Topic Models, and BERT-based models).

8. Word Embeddings
Examination of word embeddings for semantic analysis, covering methods such as Word2Vec, GloVe, and embeddings derived from language models.
 

9. Neural Network-Based Models
Introduction to neural networks for text analysis, with a focus on recurrent neural networks, convolutional neural networks, and transformer architectures.

10. Advanced Applications of Large Language Models (LLMs)
Exploration of recent developments in LLMs, with an emphasis on their applications, limitations, and ethical considerations in text analysis.
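
As an informal illustration of the kind of exercise sessions 2 to 4 point towards, the sketch below builds a small document-term matrix, labels documents with a dictionary, and scores the labels with precision and recall. It is a minimal sketch, not course material: the corpus, the gold labels, the two word lists, and all function names are hypothetical and invented for this example, and it uses only the Python standard library rather than any package the unit may adopt.

```python
from collections import Counter

# Hypothetical toy corpus and hand-coded labels, invented for illustration.
documents = [
    "the economy is strong and growth is good",
    "unemployment is bad and the outlook is poor",
    "good growth and strong wages",
    "a poor budget and bad policy",
]
gold_labels = ["pos", "neg", "pos", "neg"]

# Hypothetical sentiment dictionary (not LIWC or any published lexicon).
POSITIVE = {"strong", "good", "growth"}
NEGATIVE = {"bad", "poor", "unemployment"}


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines do far more."""
    return text.lower().split()


def document_term_matrix(docs):
    """Return (vocab, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


def dictionary_label(text):
    """Label a document by comparing positive and negative dictionary hits."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score > 0 else "neg"


def precision_recall(gold, predicted, target="pos"):
    """Precision and recall for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    vocab, rows = document_term_matrix(documents)
    predicted = [dictionary_label(d) for d in documents]
    print(vocab)
    print(rows)
    print(predicted, precision_recall(gold_labels, predicted))
```

The whitespace tokenizer and the tie-breaking rule (a score of zero is labelled "neg") are simplifying assumptions; in practice both choices would need to be justified against the research question.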

 

Teaching and learning methods

Description of T&L Methods

Instruction will be conducted over a 10-week period, with each week comprising two one-hour lecture sessions. Additionally, students will engage in a weekly one-hour computer lab session to apply theoretical concepts through hands-on exercises.

 

Knowledge and understanding

• Demonstrate a theoretical understanding of content analysis approaches and machine learning techniques

Intellectual skills

• Visualise, describe, and critically assess the results of quantitative text analysis carried out in R or Python using advanced methods

Practical skills

• Produce reports for academic and non-academic audiences

Transferable skills and personal qualities

• Design and execute small-scale projects applying machine learning to social science research questions using text data

Assessment methods

Method Weight
Report 100%

Formative Assessment (Assignments): modelling and coding of text data, submitted as a single R Markdown document (PDF or HTML) containing both answers and code.

Final paper: final data analysis report summarizing key findings from quantitative text analysis, including visualization.
The report may be substantive or technical in nature (2,000 words, including code, tables and figures; 100%).

 

Recommended reading

Core Textbook:
Grimmer, Justin, Margaret E. Roberts and Brandon M. Stewart (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press, Princeton, NJ. This textbook is a recent survey of quantitative text analysis as used in the social sciences.

Supplementary Texts:
• Jurafsky, Daniel and James H. Martin (2024). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd edition. Online manuscript released August 20, 2024. Available at https://web.stanford.edu/~jurafsky/slp3. This is a great reference book for the more technical aspects of quantitative text analysis.
• Van Atteveldt, Wouter, Damian Trilling and Carlos Arcila Calderón (2022). Computational Analysis of Communication. John Wiley & Sons. Available at https://cssbook.net/ with accompanying code and exercises.

 

Study hours

Scheduled activity hours
Lectures 20
Tutorials 10
Independent study hours
Independent study 170

Teaching staff

Staff member Role
Yan Wang Unit coordinator
