CLASSIFICATION OF TWEETS USING MULTIPLE THRESHOLDS WITH SELF-CORRECTION AND WEIGHTED CONDITIONAL PROBABILITIES

Ahmad, Tariq Naseer

[Thesis]. Manchester, UK: The University of Manchester; 2020.

Abstract

Emotion analysis aims to recognise emotions such as anger, joy and trust in text. It is a trending topic because it can be applied in important areas such as marketing, healthcare and customer services. Current state-of-the-art solutions are based on supervised models trained on manually annotated examples, and producing those annotations is subjective, expensive and time-consuming. This thesis explores the problem of multi-label emotion classification of tweets. The task is particularly difficult because tweets are noisy: they may contain unstructured text, abbreviations, slang, acronyms, emoticons, and incorrect grammar and spelling. Furthermore, even tweets with none of these issues are usually short and often carry little context. To overcome some of these problems we propose a new type of corpus. We investigate strategies for linking news articles into news-stories and tweets into tweet-stories, and then linking the news-stories to the tweet-stories to create a corpus of linked tweets that contain emotion-bearing markers. We describe the process of collecting tweets and news articles, the annotation process and the problems therein, and show that a thematically-linked corpus aids the classification process. Preprocessing is an important step in classification, but there is no standard set of steps. We therefore analyse a number of preprocessing steps, evaluating each to establish its contribution, and so form the best combination of steps to carry forward into later experiments. We consider both Arabic and English tweets; whilst there are well-established Natural Language Processing (NLP) tools for English, the same is not true for Arabic. As part of this work we also evaluate a new Arabic tagger designed specifically for tweets, and a stemmer, and compare the results to other methods.
The major contribution of this thesis is a new type of classifier based on conditional probabilities, which are used to build a lexicon of scores indicating the importance of a word to a specific emotion. We show that classifier performance improves when we incorporate automatic mechanisms for autocorrection, which remove words that are unhelpful for an emotion, and calculate an individual threshold for each emotion. To the best of our knowledge, this is the first time these ideas have been explored. The results of this classifier, named CENTEMENT, are compared to those of other common algorithms such as K-Nearest Neighbours (KNN), Support Vector Machines (SVM), and two different configurations of neural networks. We also evaluate the classifier on a number of other datasets and demonstrate that it is robust and performs consistently well. The results are encouraging: our approach led to appreciably better performance than currently established classifiers and than many of the latest state-of-the-art classifiers. To further test the robustness of the classifier, it was entered into the worldwide emotion-classification competition SemEval-2018, where it came second (out of thirteen) in classifying Arabic tweets and twelfth (out of thirty-four) in classifying English tweets.
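
The ideas named in the abstract (a lexicon of conditional-probability scores, one threshold per emotion, and removal of unhelpful words) can be illustrated with a minimal Python sketch. This is not the CENTEMENT implementation, which the abstract does not specify. All function names, the use of P(emotion | word) averaged over a tweet's words, the F1-based threshold search, the pruning rule, and the toy data are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

def build_lexicon(tweets, labels, emotions):
    """Score each word by an estimate of P(emotion | word) from tweet-level counts."""
    word_count = Counter()
    word_emotion_count = defaultdict(Counter)
    for tokens, tweet_emotions in zip(tweets, labels):
        for word in set(tokens):
            word_count[word] += 1
            for emotion in tweet_emotions:
                word_emotion_count[emotion][word] += 1
    return {
        emotion: {w: word_emotion_count[emotion][w] / word_count[w] for w in word_count}
        for emotion in emotions
    }

def score(tokens, lexicon, emotion):
    """Average lexicon score of the tweet's known words (an assumed aggregation)."""
    scores = [lexicon[emotion][w] for w in tokens if w in lexicon[emotion]]
    return sum(scores) / len(scores) if scores else 0.0

def f1(predicted, actual):
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def fit_thresholds(tweets, labels, lexicon, emotions):
    """Choose, for each emotion separately, the candidate threshold with the best F1."""
    candidates = [i / 20 for i in range(1, 20)]
    thresholds = {}
    for emotion in emotions:
        actual = [emotion in tweet_emotions for tweet_emotions in labels]
        scores = [score(tokens, lexicon, emotion) for tokens in tweets]
        thresholds[emotion] = max(
            candidates, key=lambda th: f1([s >= th for s in scores], actual)
        )
    return thresholds

def prune_lexicon(tweets, labels, lexicon, thresholds):
    """Assumed pruning rule: drop words seen more often in false positives than true positives."""
    pruned = {}
    for emotion, word_scores in lexicon.items():
        helped, hurt = Counter(), Counter()
        for tokens, tweet_emotions in zip(tweets, labels):
            if score(tokens, lexicon, emotion) >= thresholds[emotion]:
                target = helped if emotion in tweet_emotions else hurt
                target.update(set(tokens))
        pruned[emotion] = {w: s for w, s in word_scores.items() if hurt[w] <= helped[w]}
    return pruned

def classify(tokens, lexicon, thresholds):
    """Multi-label prediction: every emotion whose score reaches its own threshold."""
    return {e for e in lexicon if score(tokens, lexicon, e) >= thresholds[e]}

# Toy data, invented for illustration only.
tweets = [["so", "happy", "today"], ["happy", "birthday"],
          ["angry", "at", "traffic"], ["traffic", "again", "angry"]]
labels = [{"joy"}, {"joy"}, {"anger"}, {"anger"}]
emotions = ["joy", "anger"]

lexicon = build_lexicon(tweets, labels, emotions)
thresholds = fit_thresholds(tweets, labels, lexicon, emotions)
lexicon = prune_lexicon(tweets, labels, lexicon, thresholds)
print(classify(["happy", "day"], lexicon, thresholds))
```

In this sketch the thresholds are tuned on the training tweets themselves to keep the example short; in practice a held-out development set would be the more defensible choice.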

Bibliographic metadata

Degree type:
Doctor of Philosophy
Degree programme:
PhD Computer Science
Location:
Manchester, UK
Total pages:
286
Language:
en

Record metadata

Manchester eScholar ID:
uk-ac-man-scw:323659
Created by:
Ahmad, Tariq
Created:
13th February, 2020, 10:46:36
Last modified by:
Ahmad, Tariq
Last modified:
2nd March, 2021, 10:58:28
