APPLICATION OF CHEMOMETRICS FOR THE ROBUST ANALYSIS OF CHEMICAL AND BIOCHEMICAL DATA

Gromski, Piotr Sebastian

[Thesis]. Manchester, UK: The University of Manchester; 2015.

Abstract

In the last two decades chemometrics has become an essential tool for the experimental biologist and chemist. The level of contribution varies strongly depending on the type of research performed. Chemometrics may be used to interpret and explain results, to compare experimental data with real-world 'unseen' data, to accurately detect certain chemical vapours, to identify cancer-related metabolites, to identify and rank potentially relevant or important variables, or simply to provide a pictorial interpretation and understanding of the results. Whilst many chemometric methods are well established in chemistry and metabolomics, many scientists still use them with what is often referred to as a 'black-box' approach, that is, without prior knowledge of the methods and their well-recognised statistical properties. This lack of knowledge is largely attributable to the wide availability of powerful computers and, perhaps more notably, of up-to-date, easy-to-use and reliable software. The main aim of this study is to reduce this gap by providing an extensive demonstration of several approaches applied at different stages of the data analysis pipeline, highlighting the importance of appropriate method selection. The comparisons are based on both chemical and biochemical (metabolomics) data and provide a firm basis for researchers in terms of understanding chemometric methods and the influence of parameter selection.
Consequently, this thesis explores and compares different approaches employed at various statistical steps. These include pre-treatment steps such as dealing with missing data and scaling. First, different approaches to substituting missing values, and their influence on unsupervised and supervised learning, were compared; it was shown that metabolites that display skewness in their distribution can have a significant impact on the replacement approach. The scaling approaches were compared in terms of their effect on classification accuracy for a variety of metabolomics data sets. It was shown that the most standard option, autoscaling, is not always the best. In the next step, various variable selection methods commonly used for the analysis of chemical data were compared. The results revealed that random forests, with their variable selection techniques, and support vector machines, combined with recursive feature elimination as a variable selection method, gave the best results in comparison with the other approaches. Moreover, a double cross-validation procedure was applied in this study to minimise the consequences of over-fitting. Finally, seven different algorithms and two model validation procedures, based on either 10-fold cross-validation or bootstrapping, were investigated in order to allow direct comparison between different classification approaches.
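
For readers unfamiliar with the term, autoscaling (also called unit-variance scaling) mean-centres each variable and divides it by its standard deviation, so that every variable contributes on a comparable scale. The following is a minimal sketch of that definition only, not the thesis's own implementation; the function name `autoscale` and the toy data are hypothetical and chosen purely for illustration.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Mean-centre each column and divide it by its sample standard deviation.

    `rows` is a list of samples, each a list of variable values.
    A column with zero variance would cause a ZeroDivisionError here;
    a real pipeline would need to handle that case explicitly.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Hypothetical toy data: three samples (rows) by two variables (columns)
# whose raw magnitudes differ by two orders of magnitude.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = autoscale(data)
```

After scaling, both columns have zero mean and unit standard deviation, so the large-magnitude variable no longer dominates purely because of its units. As the abstract notes, this default is not always the best choice for classification.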

Keyword(s)

Chemometrics

Bibliographic metadata

Degree type: Doctor of Philosophy
Degree programme: PhD Chemistry
Location: Manchester, UK
Total pages: 278
Language: en

Record metadata

Manchester eScholar ID: uk-ac-man-scw:262661
Created by: Gromski, Piotr
Created: 11th April, 2015, 14:03:52
Last modified by: Gromski, Piotr
Last modified: 9th September, 2016, 13:03:43
