Using Machine Learning to Determine Fold Class and Secondary Structure Content from Raman Optical Activity and Raman Vibrational Spectroscopy (Manchester eScholar

Type of resource:

text

Content type:

Administered thesis

Form of thesis:

Traditional

Type of submission:

Doctoral level ETD - final MPhil (re-classification)

Thesis title:

Using Machine Learning to Determine Fold Class and Secondary Structure Content from Raman Optical Activity and Raman Vibrational Spectroscopy

Degree type:

Doctor of Philosophy

Degree programme:

PhD Bioinformatics

Publication date:

2012-03-09T16:27:16

Institution:

The University of Manchester

Location:

Manchester, UK

Total pages:

282

Abstract:

The objective of this project was to apply machine learning methods to determine protein secondary structure content and protein fold class from ROA and Raman vibrational spectral data. Raman and ROA are sensitive to biomolecular structure: the bands of each spectrum correspond to structural elements in proteins and, when combined, give a fingerprint of the protein. However, little is known about many of these bands. There is a need, therefore, to find ways of extracting information from spectral bands and to investigate which regions of the spectra contain the most useful structural information. Support Vector Machine (SVM) classification and Random Forest (RF) classification were used to mine protein fold class information, and Partial Least Squares (PLS) regression was used to determine the secondary structure content of proteins. The classification methods were used to group proteins into α-helix, β-sheet, α/β and disordered fold classes. The PLS regression was used to determine percentage protein structural content from Raman and ROA spectral data. The analyses were performed on spectral bin widths of 10 cm-1 and on the spectral amide regions I, II and III. The full spectra and different combinations of the amide regions were also analysed. The SVM analyses, both classification and regression, generally did not perform well. SVM classification models, for example, had low Matthews Correlation Coefficient (MCC) values, below 0.5, although these are still better than a value of zero or below, which would indicate a prediction no better than random chance. The SVM regression analyses also performed very poorly, with average R2 values below 0.5. R2 is the square of Pearson's correlation coefficient and shows how well predicted and observed structural content values correlate. An R2 value close to 1 indicates a good correlation and therefore a good prediction model. The Partial Least Squares regression analyses yielded much improved results with very high accuracies.
Analyses of the full spectrum and of the spectral amide regions produced high R2 values of 0.8-0.9 for both ROA and Raman spectral data. This high accuracy was also seen in the analysis of the 850-1100 cm-1 backbone region for both ROA and Raman spectra, which indicates that this region could make an important contribution to protein structure analysis. PLS regression analysis of second-derivative Raman spectra performed better still, with R2 values of 0.81-0.97. The Random Forest algorithm used here for classification showed good performance. The 2-dimensional plots used to visualise the classification clusters showed clear clusters in some analyses; for example, tighter clustering was observed for the amide I, amide I & III and amide I & II & III spectral regions than for the amide II, amide III and amide II & III regions. The Random Forest algorithm also determines variable importance, which indicated the spectral bins that were crucial to the classification decisions. The ROA Random Forest analyses generally performed better than the Raman Random Forest analyses: the highest percentage of correctly classified proteins was 75% for ROA and 50% for Raman. The analyses presented in this thesis have shown that Raman and ROA vibrational spectra contain information about protein secondary structure and that this information can be extracted using mathematical methods such as the machine learning techniques presented here. The machine learning methods applied in this project were used to mine information about protein secondary structure, and the work presented here demonstrates that these techniques are useful and could be powerful tools in the determination of protein structure from spectral data.
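
The two evaluation measures named in the abstract, MCC for classification and R2 for regression, can be illustrated with a minimal sketch. This is not the thesis code: the function names are hypothetical, the MCC shown is the standard binary form (a four-class fold problem would need, for example, a one-vs-rest treatment, which is an assumption here), and R2 is taken as the squared Pearson correlation between observed and predicted values.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from 0/1 labels; returns 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(observed, predicted):
    """R2 as the squared Pearson correlation of observed vs. predicted values."""
    return pearson_r(observed, predicted) ** 2


if __name__ == "__main__":
    # Hypothetical labels: 1 = protein belongs to the fold class, 0 = it does not.
    print(matthews_corrcoef([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
    # Hypothetical observed vs. predicted helix content (percent).
    print(r_squared([10.0, 35.0, 60.0, 80.0], [14.0, 30.0, 55.0, 84.0]))
```

With these definitions, MCC ranges from -1 to 1 with 0 corresponding to chance-level prediction, and R2 ranges from 0 to 1. Because R2 is defined here from the correlation alone, it does not penalise a constant offset between predicted and observed values.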

Layman's abstract:

The objective of this project was to apply machine learning methods to determine protein secondary structure content and protein fold class from ROA and Raman vibrational spectral data. Raman and ROA are sensitive to biomolecular structure: the bands of each spectrum correspond to structural elements in proteins and, when combined, give a fingerprint of the protein. This fingerprint is characteristic of the structural elements in the protein and can be analysed using machine learning methods to extract information about the underlying structure. The methods used were Support Vector Machine (SVM) classification, Random Forest (RF) classification and Partial Least Squares (PLS) regression. SVM and RF classification were used to mine protein fold class information, and PLS regression was used to determine the secondary structure content of proteins. The classification methods were used to group proteins into α-helix, β-sheet, α/β and disordered fold classes. The PLS regression was used to determine the percentage of protein structural content from Raman and ROA spectral data. The Random Forest classification was able to distinguish the different secondary structure classes, with up to 75% of proteins correctly classified. PLS regression analysis reported very high accuracies, with R2 correlation values of up to 0.99. The methods used here showed that there is high potential in mining ROA and Raman spectral data of proteins for secondary structure information.

Additional digital content not deposited electronically:

2 published papers:
Accurate Determination of Protein Secondary Structure Content from Raman and Raman Optical Activity Spectra. Kinalwa M., Blanch E. W. and Doig A. J. Analytical Chemistry, 2010, 82, 6463-6471.
Determination of Protein Fold Class from Raman or Raman Optical Activity Spectra using Random Forest. Kinalwa M., Blanch E. W. and Doig A. J. Protein Science, 2010, 20, 1668–1674.

Keyword(s):

Thesis main supervisor(s):