Course unit details:
Corpus Linguistics
Unit code | LELA60112 |
---|---|
Credit rating | 15 |
Unit level | FHEQ level 7 – master's degree or fourth year of an integrated master's degree |
Teaching period(s) | Semester 2 |
Available as a free choice unit? | No |
Overview
Corpus linguistics is the study of language through the use of large (usually digital) collections of text known as corpora. This course guides students through the process of collecting data from such corpora, analysing them with advanced statistical techniques to evaluate hypotheses relevant to the field of linguistics, and writing up the results in a research paper. Students will learn about essential concepts in corpus linguistics, become familiar with design principles of corpora, and apply their knowledge of statistics and visualisation to practical linguistic problems.
Pre/co-requisites
In order to take this unit, students must have sufficient skills/knowledge in statistics (e.g. UG training in fitting and interpreting general and generalized linear models using R). The module co-ordinator will decide whether a student has sufficient prior background following an initial meeting with the student.
Aims
The unit aims to:
- Provide experience with testing linguistic hypotheses using corpus data
- Provide opportunities to engage with linguistic concepts, such as alternations and grammatical functions, using corpora.
- Provide experience with applying programming skills and statistical knowledge to corpus processing and analysis
- Enable students to visualise patterns in data and model predictions
- Foster critical thinking in the discussion of research literature and interpretation of new findings
- Develop students' ability to report on corpus-based research and to demonstrate analytical and presentation skills
Syllabus
Week 1: Introduction (outline and syllabus, definitions, potential applications, history of CL, theoretical vs. corpus linguistics)
Week 2: Corpus basics (corpus design, corpus construction, overview of available corpora, representativeness, important terminology, scientific method, relevance of hypotheses)
Introduce the linguistic topic for the final term paper. Group assignment.
Week 3: Data collection 1 (data storage, retrieving data with concordance software, online interfaces, illustrations of data collection with other software)
Seminar 1: Basic homework exercises. General remarks on writing a good paper in corpus linguistics.
Week 4: Data collection 2 (using Python to get data, regular expressions, scripting)
Week 5: Data collection 3 (using Python to get data, tagged and parsed corpora, other annotations)
Seminar 2: Discuss homework on methodology and data collection.
Week 6: Reading Week
Hand in data collection for the final term paper.
Week 7: Statistics 1 (Chi Square Test, R)
Seminar 3: Group presentation of background literature.
Week 8: Statistics 2 (Revision: Logistic Regression, R)
Week 9: Statistics 3 (Mixed Effects Logistic Regression, R)
Seminar 4: Literature review on the linguistic topic. Discussion of some relevant statistical notions for the final paper, such as effect size measures, uncertainty, and hypothesis testing.
Week 10: Statistics 4 (Model evaluation, comparison, variable selection, R)
Week 11: Statistics 5 (Other statistical methods, R)
Seminar 5: Exercises and Q&A for the statistics discussed. Specific requirements for the term paper.
Week 12: Conclusion
Hand in final term paper.
Teaching and learning methods
Weekly 2-hour synchronous lecture. The theoretical and technical content will be followed up in the seminars.
Five 2-hour synchronous seminars.
Knowledge and understanding
- Apply key concepts in corpus linguistics to specific problems
- Apply statistical techniques relevant to corpus linguistics to specific problems
- Describe a corpus in terms of content, size and annotations as part of a research paper.
- Identify and explain linguistic variables, understand what influences them, and test them using statistical methods.
Intellectual skills
- Critically engage with the results of current corpus-based research and report on them in a research paper.
- Deduce testable hypotheses from the academic literature and apply them to a research paper.
- Select an appropriate statistical model and use it to explain and evaluate a quantitative hypothesis.
Practical skills
- Demonstrate the ability to load, store, search, manipulate, prepare and analyse large amounts of textual data with a computer.
- Apply statistical models to real-world problems
- Write computer code in R and Python
Transferable skills and personal qualities
- Develop time management skills by working to a deadline.
- Present results in a professional manner to a specialist audience using a range of engaging media.
- Demonstrate the ability to work collaboratively on a range of data and industry-related tasks in an academic setting
- Retrieve, gather and organise linguistic data from various sources and use it in a research paper.
Assessment methods
Assessment Task | Formative or Summative | Weighting |
---|---|---|
Coursework | Formative | 0% |
Term Paper | Summative | 65% |
Seminar Presentation | Summative | 10% |
Exam | Summative | 25% |
Feedback methods
Oral feedback on coursework will be given in the seminars.
Written feedback will be given on the Term Paper, Seminar Presentation and Exam.
Recommended reading
Baayen, R. Harald (2008) Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge: Cambridge University Press.
Barth, Danielle and Vsevolod Kapatsinski (2018) 'Evaluating Logistic Mixed-Effects Models of Corpus-Linguistic Data in Light of Lexical Diffusion.' In: Speelman, Dirk, Kris Heylen and Dirk Geeraerts (eds.) Mixed-Effects Regression Models in Linguistics. New York: Springer, 99-116.
Bird, Steven, Ewan Klein and Edward Loper (2019) Natural Language Processing with Python. Boston: O'Reilly.
Bresnan, Joan, Anna Cueni, Tatiana Nikitina and Harald Baayen (2007) 'Predicting the Dative Alternation.' In: Bouma, Gerlof, Irene Kraemer and Joost Zwarts (eds.) Cognitive Foundations of Interpretation. Amsterdam: KNAW, 69-94.
Brezina, Vaclav (2018) Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press.
Desagulier, Guillaume (2017) Corpus Linguistics and Statistics with R: Introduction to Quantitative Methods in Linguistics. Berlin: Springer.
Gries, Stefan (2019) 'On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement.' Corpus Linguistics and Linguistic Theory 16.3, 617-47.
Monsalves, Maria Jose, Ananta Shrikant Bangdiwala, Alex Thabane and Shrikant Ishver Bangdiwala (2020) 'LEVEL (Logical Explanations & Visualizations of Estimates in Linear mixed models): Recommendations for reporting multilevel data and analyses.' BMC Medical Research Methodology 20.3, 1-9.
Study hours
Scheduled activity hours | |
---|---|
Lectures | 22 |
Seminars | 10 |
Independent study hours | |
---|---|
Independent study | 118 |
Teaching staff
Staff member | Role |
---|---|
Richard Zimmermann | Unit coordinator |