Sanghi, A and Sood, R and Haritsa, J and Tirthapura, S (2018) Scalable and dynamic regeneration of big data volumes. In: 21st International Conference on Extending Database Technology, EDBT 2018, 26 - 29 March 2018, Vienna, pp. 301-312.
Abstract
A core requirement of database engine testing is the ability to create synthetic versions of the customer's data warehouse at the vendor site. A rich body of work exists on synthetic database regeneration, but it suffers from critical limitations with regard to: (a) maintaining statistical fidelity to the client's query processing, and/or (b) scaling to large data volumes. In this paper, we present Hydra, a workload-dependent database regenerator that leverages a declarative approach to data regeneration to assure volumetric similarity, a crucial aspect of statistical fidelity, and materially improves on the prior art by adding scale, dynamism and functionality. First, Hydra uses an optimized linear programming (LP) formulation based on a novel region-partitioning approach. This spatial strategy drastically reduces the LP complexity, enabling it to handle query workloads on which contemporary techniques fail. Second, Hydra incorporates deterministic post-LP processing algorithms that provide high efficiency and improved accuracy. Third, Hydra introduces the concept of dynamic regeneration by constructing a minuscule database summary that can regenerate databases of arbitrary size on the fly during query execution, while obeying volumetric specifications derived from the query workload. A detailed experimental evaluation on standard OLAP benchmarks demonstrates that Hydra can efficiently and dynamically regenerate large warehouses that accurately mimic the desired statistical characteristics.
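To make the ideas of region partitioning, volumetric constraints and summary-based regeneration concrete, the following is a minimal one-dimensional sketch. It is not the paper's algorithm: the function names, the single-attribute domain, the two range predicates, their cardinalities and the hand-picked region counts are all hypothetical, and a real system would obtain the counts from an LP solver rather than by inspection.

```python
def partition_regions(domain, predicates):
    """Split the half-open domain [lo, hi) at every predicate boundary, so that
    each resulting region lies entirely inside or outside each predicate."""
    lo, hi = domain
    cuts = {lo, hi}
    for a, b in predicates.values():
        cuts.update(c for c in (a, b) if lo < c < hi)
    points = sorted(cuts)
    return list(zip(points, points[1:]))


def constraint_rows(regions, predicates):
    """Build one 0/1 row per query: 1 where the region is covered by the
    query's half-open range [a, b). Each row, dotted with the per-region
    tuple counts, must equal the query's output cardinality."""
    return {
        name: [1 if a <= r_lo and r_hi <= b else 0 for r_lo, r_hi in regions]
        for name, (a, b) in predicates.items()
    }


def regenerate(summary):
    """Lazily yield tuples from a compact (value, count) summary, so the
    full table never has to be materialized."""
    for value, count in summary:
        for _ in range(count):
            yield value


# Hypothetical workload: two range predicates over one attribute in [0, 100).
domain = (0, 100)
predicates = {"q1": (10, 50), "q2": (30, 80)}
cardinalities = {"q1": 400, "q2": 700}  # assumed row counts seen at the client
total_rows = 1000                       # assumed table size

regions = partition_regions(domain, predicates)
rows = constraint_rows(regions, predicates)

# Hand-picked feasible assignment of tuple counts to regions (assumption:
# an LP solver would normally produce these values).
counts = [100, 200, 200, 500, 0]

# Check the volumetric constraints against the candidate assignment.
satisfied = sum(counts) == total_rows and all(
    sum(c * x for c, x in zip(rows[q], counts)) == cardinalities[q]
    for q in predicates
)

# Summary: one representative value (the region's lower bound) per non-empty region.
summary = [(r_lo, n) for (r_lo, _), n in zip(regions, counts) if n > 0]
q1_rows = sum(1 for v in regenerate(summary) if 10 <= v < 50)
q2_rows = sum(1 for v in regenerate(summary) if 30 <= v < 80)
```

The point of the sketch is that the number of LP variables depends on the number of regions induced by the workload, not on the number of rows, and that the summary stays small no matter how many tuples are streamed out of `regenerate`.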
Item Type: | Conference Paper |
---|---|
Publication: | Advances in Database Technology - EDBT |
Publisher: | OpenProceedings.org |
Additional Information: | The copyright for this article belongs to OpenProceedings.org. |
Keywords: | Ability testing; Big data; Data warehouses; Linear programming; Contemporary techniques; Dynamic regeneration; Experimental evaluation; Large data volumes; Processing algorithms; Spatial strategies; Statistical characteristics; Synthetic database; Query processing |
Department/Centre: | Division of Electrical Sciences > Computer Science & Automation |
Date Deposited: | 19 Aug 2022 05:07 |
Last Modified: | 19 Aug 2022 05:07 |
URI: | https://eprints.iisc.ac.in/id/eprint/75975 |