
Experimenting with a Big Data Framework for Scaling a Data Quality Query System

Cisneros Cabrera, Sonia

[Thesis]. Manchester, UK: The University of Manchester; 2016.


Abstract

This thesis presents the design, implementation and evaluation of extensions to the Data Quality Query System (DQ2S), a state-of-the-art data-quality-aware query processing framework and query language, with the aim of testing and improving its scalability as the volume of data grows. The evaluation assesses to what extent a big data framework, such as Apache Spark, can offer significant performance gains, measured in terms of runtime, memory requirements, processing capacity and resource utilisation, when running in different environments. DQ2S supports the assessment and improvement of data quality in information management by facilitating profiling of the data in use and supporting data cleansing tasks, which are an important step in the big data life-cycle. Despite this, DQ2S, like the majority of data quality management systems, is not designed to process very large amounts of data. This research describes how the data quality extensions were designed, implemented and tested, starting from an earlier implementation that processed two datasets of 50 000 rows each in 397 seconds and arriving at a big data solution capable of processing 105 000 000 rows in 145 seconds. The thesis gives a detailed account of the experiments carried out to extend DQ2S and explore the capabilities of a popular big data framework (Apache Spark), including those used to measure the scalability and usefulness of the approach. The study also provides a roadmap for researchers interested in re-purposing and porting existing information management systems and tools to big data frameworks. Such a roadmap is particularly useful given that re-purposing and re-writing existing software to work with big data frameworks is less costly and less risky than greenfield engineering of information management systems and tools.

Bibliographic metadata

Degree type:
Master of Philosophy
Degree programme:
MPhil Computer Science (CONACyT)
Location:
Manchester, UK
Total pages:
230
Language:
en

Record metadata

Manchester eScholar ID:
uk-ac-man-scw:306252
Created by:
Cisneros Cabrera, Sonia
Created:
16th December, 2016, 21:12:15
Last modified by:
Cisneros Cabrera, Sonia
Last modified:
3rd November, 2017, 11:17:01
