    Using Machine Learning to Determine Fold Class and Secondary Structure Content from Raman Optical Activity and Raman Vibrational Spectroscopy

    Kinalwa-Nalule, Myra

    [Thesis]. Manchester, UK: The University of Manchester; 2012.

    Abstract

    The objective of this project was to apply machine learning methods to determine protein secondary structure content and protein fold class from Raman optical activity (ROA) and Raman vibrational spectral data. Raman and ROA spectra are sensitive to biomolecular structure: the bands of each spectrum correspond to structural elements in proteins and, when combined, give a fingerprint of the protein. However, there are many bands about which little is known. There is a need, therefore, to find ways of extracting information from spectral bands and to investigate which regions of the spectra contain the most useful structural information.

    Support Vector Machine (SVM) classification and Random Forest (RF) classification were used to mine protein fold class information, and Partial Least Squares (PLS) regression was used to determine the secondary structure content of proteins. The classification methods were used to group proteins into α-helix, β-sheet, α/β and disordered fold classes. The PLS regression was used to determine percentage protein structural content from Raman and ROA spectral data. The analyses were performed on spectral bins 10 cm⁻¹ wide and on the amide I, II and III spectral regions. The full spectra and different combinations of the amide regions were also analysed.

    The SVM analyses, both classification and regression, generally did not perform well. SVM classification models, for example, had low Matthews Correlation Coefficient (MCC) values, below 0.5, although not negative; an MCC of 0 corresponds to random-chance prediction and a negative value to worse-than-chance prediction. The SVM regression analyses also performed very poorly, with average R² values below 0.5. R² is the square of Pearson's correlation coefficient and shows how well predicted and observed structural content values correlate; an R² value close to 1 indicates a good correlation and therefore a good prediction model. The PLS regression analyses yielded much improved results with very high accuracies. Analyses of the full spectrum and the spectral amide regions produced high R² values of 0.8-0.9 for both ROA and Raman spectral data. This high accuracy was also seen in the analysis of the 850-1100 cm⁻¹ backbone region for both ROA and Raman spectra, which indicates that this region could make an important contribution to protein structure analysis. PLS regression analysis of second-derivative Raman spectra showed markedly improved performance, with R² values of 0.81-0.97.

    The Random Forest algorithm used here for classification showed good performance. The two-dimensional plots used to visualise the classification clusters showed clear clusters in some analyses; for example, tighter clustering was observed for the amide I, amide I & III and amide I & II & III spectral regions than for the amide II, amide III and amide II & III regions. The Random Forest algorithm also determines variable importance, which showed which spectral bins were crucial in the classification decisions. The ROA Random Forest analyses generally performed better than the Raman Random Forest analyses: the highest percentage of correctly classified proteins was 75% for ROA and 50% for Raman.

    The analyses presented in this thesis have shown that Raman and ROA vibrational spectra contain information about protein secondary structure and that this information can be extracted using mathematical methods such as the machine learning techniques presented here. The work demonstrates that these techniques are useful and could be powerful tools in the determination of protein structure from spectral data.
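
    The preprocessing and evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the function names (bin_spectrum, r_squared, mcc) and all numbers are hypothetical, the bins are assumed to hold mean intensities, and MCC is shown only in its two-class form, whereas the thesis works with four fold classes.

```python
import math


def bin_spectrum(wavenumbers, intensities, start, stop, width=10.0):
    """Average intensities in consecutive bins of `width` cm^-1 over [start, stop).

    Assumption: each bin is summarised by its mean intensity; empty bins get 0.0.
    """
    n_bins = int(math.ceil((stop - start) / width))
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for w, y in zip(wavenumbers, intensities):
        if start <= w < stop:
            i = int((w - start) // width)
            sums[i] += y
            counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return 0.0  # correlation is undefined for constant input; report 0
    return (cov * cov) / (var_o * var_p)


def mcc(tp, tn, fp, fn):
    """Two-class Matthews Correlation Coefficient from confusion-matrix counts.

    +1 is perfect prediction, 0 is chance level, -1 is total disagreement.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up spectrum fragment: four points across two 10 cm^-1 bins.
    print(bin_spectrum([800, 805, 812, 819], [1.0, 3.0, 2.0, 4.0], 800, 820))
    # Made-up observed vs. predicted helix percentages.
    print(r_squared([10.0, 35.0, 60.0], [12.0, 30.0, 58.0]))
    # Made-up confusion-matrix counts.
    print(mcc(tp=8, tn=7, fp=3, fn=2))
```

    In practice the binned intensities would be the input variables for the PLS regression or the classifiers, and R² and MCC would be computed on held-out proteins rather than on the training data.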

    Layman's Abstract

    The objective of this project was to apply machine learning methods to determine protein secondary structure content and protein fold class from Raman optical activity (ROA) and Raman vibrational spectral data. Raman and ROA spectra are sensitive to biomolecular structure: the bands of each spectrum correspond to structural elements in proteins and, when combined, give a fingerprint of the protein. This fingerprint is characteristic of the structural elements in the protein and can be analysed using machine learning methods to extract information about the underlying structure. Support Vector Machine (SVM) classification and Random Forest (RF) classification were used to mine protein fold class information, and Partial Least Squares (PLS) regression was used to determine the secondary structure content of proteins. The classification methods were used to group proteins into α-helix, β-sheet, α/β and disordered fold classes. The PLS regression was used to determine the percentage of protein structural content from Raman and ROA spectral data. The Random Forest classification was able to distinguish the different fold classes, with up to 75% of proteins correctly classified. PLS regression analysis reported very high accuracies, with correlation coefficient (R²) values of up to 0.99. The methods used here showed that there is high potential in mining ROA and Raman spectral data of proteins for secondary structure information.

    Additional content not available electronically

    2 published papers:

    Accurate Determination of Protein Secondary Structure Content from Raman and Raman Optical Activity Spectra. Kinalwa M., Blanch E. W. and Doig A. J. Analytical Chemistry, 2010, 82, 6463-6471.

    Determination of Protein Fold Class from Raman or Raman Optical Activity Spectra using Random Forest. Kinalwa M., Blanch E. W. and Doig A. J. Protein Science, 2010, 20, 1668-1674.

    Bibliographic metadata

    Type of resource:
    Content type:
    Form of thesis:
    Degree type:
    Doctor of Philosophy
    Degree programme:
    PhD Bioinformatics
    Publication date:
    Location:
    Manchester, UK
    Total pages:
    282
    Thesis main supervisor(s):
    Thesis co-supervisor(s):
    Thesis advisor(s):
    Funder(s):
    Language:
    en

    Record metadata

    Manchester eScholar ID:
    uk-ac-man-scw:157165
    Created by:
    Kinalwa-Nalule, Myra
    Created:
    9th March, 2012, 16:27:17
    Last modified by:
    Kinalwa-Nalule, Myra
    Last modified:
    19th March, 2012, 19:41:01
