Traditional databases incur a significant data-to-query delay because data must be loaded into the system before it can be queried. Since this delay is not acceptable in many domains that generate massive amounts of raw data, e.g., genomics, databases are discarded entirely. External tables, on the other hand, provide instant SQL querying over raw files. Their performance across a query workload, however, is limited by the speed of repeated full scans, tokenizing, and parsing of the entire file.
In this paper, we analyze the shortcomings of the traditional database under different configurations and propose several novel solutions to overcome them. We first propose SCANRAW, an innovative database meta-operator for in-situ processing over raw files that integrates data loading and external tables seamlessly, while preserving their advantages: optimal performance across a query workload and zero time-to-query. We decompose loading and external table processing into atomic components to identify common functionality. We analyze alternative implementations and discuss possible optimizations for each stage. Our primary contribution is a parallel superscalar pipeline design that allows SCANRAW to take advantage of current many- and multi-core processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading by speculatively using the additional I/O bandwidth that becomes available during the conversion process to store data in the database, such that subsequent queries execute faster. As a result, SCANRAW makes optimal use of the available system resources (CPU cycles and I/O bandwidth) by switching dynamically between tasks. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves optimal performance for a query sequence at any point in the processing. Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data, without interfering with normal query processing.
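The idea of overlapping independent stages can be illustrated with a minimal sketch. This is not the SCANRAW implementation: the stage functions `tokenize` and `parse`, the comma delimiter, the all-integer schema, and the queue sizes are assumptions made only for illustration. Each stage runs in its own thread and hands its output to the next stage through a queue, so reading, tokenizing, and parsing of different lines can be in flight at the same time. In CPython the global interpreter lock limits true CPU parallelism, so the sketch shows the structure of the pipeline rather than its speedup.

```python
import queue
import threading

SENTINEL = None  # marks end of stream between stages


def stage(worker, inbox, outbox):
    """Apply `worker` to every item from `inbox` and forward results to `outbox`."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate end of stream downstream
            return
        outbox.put(worker(item))


def tokenize(line):
    # Split a raw text line into string fields (comma delimiter is an assumption).
    return line.rstrip("\n").split(",")


def parse(fields):
    # Convert string fields into typed values (all integers, by assumption).
    return [int(f) for f in fields]


def run_pipeline(lines):
    """Read, tokenize, and parse concurrently; return the parsed rows in order."""
    raw = queue.Queue(maxsize=64)     # read -> tokenize
    tokens = queue.Queue(maxsize=64)  # tokenize -> parse
    rows = queue.Queue()              # unbounded, drained after the workers finish
    threads = [
        threading.Thread(target=stage, args=(tokenize, raw, tokens)),
        threading.Thread(target=stage, args=(parse, tokens, rows)),
    ]
    for t in threads:
        t.start()
    for line in lines:  # the "read" stage runs on the calling thread
        raw.put(line)
    raw.put(SENTINEL)
    for t in threads:
        t.join()
    out = []
    while True:
        row = rows.get()
        if row is SENTINEL:
            return out
        out.append(row)
```

Calling `run_pipeline(open(path))` on a hypothetical comma-separated file of integers would yield one list of integers per line. With one worker per stage and FIFO queues, the output order matches the input order.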
In addition, incorporating the query workload into raw data processing allows us to model raw data processing with partial loading as fully-replicated binary vertical partitioning. We design a two-stage heuristic that combines the concepts of query coverage and attribute usage frequency. The heuristic comes within close range of the optimal solution in a fraction of the time. We extend the optimization formulation and the heuristic to a restricted type of pipelined raw data processing. The results confirm the superior performance of the proposed heuristic over related vertical partitioning algorithms and the accuracy of the formulation in capturing the execution details of a real operator.
Online aggregation (OLA) is an efficient method for data exploration that identifies uninteresting patterns faster by continuously estimating the result of a computation during the actual processing: as soon as the estimate is accurate enough for the result to be deemed uninteresting, the system can stop the query immediately. However, building an efficient OLA system has a high upfront cost of randomly shuffling and loading the data. We then propose OLA-RAW, a novel system for in-situ processing over raw files that integrates data loading and online aggregation seamlessly while preserving their advantages: generating accurate estimates as early as possible and having zero time-to-query. We design an accuracy-driven bi-level sampling process over raw files, and we define and analyze the corresponding estimators. The samples are extracted and loaded adaptively in random order based on the current system resource utilization. We implement OLA-RAW starting from a state-of-the-art in-situ data processing system and evaluate its performance across a variety of datasets and file formats. Our results show that OLA-RAW maximizes resource utilization across a query workload and dynamically chooses the optimal sampling and loading plan that minimizes each query's execution time while guaranteeing the required accuracy. The result is a focused data exploration process that avoids unnecessary work and discards uninteresting data.
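The early-stopping behavior of online aggregation can be sketched for the simplest case, a mean over uniformly shuffled values. This is not the bi-level sampling estimator of OLA-RAW: the function name `online_mean`, the normal-approximation confidence interval, the `min_samples` threshold, and the omission of a finite-population correction are all assumptions made for illustration. The scan keeps a running mean and variance and stops once the half-width of the confidence interval is no larger than the requested accuracy `epsilon`.

```python
import math
import random


def online_mean(values, epsilon, z=1.96, min_samples=30, seed=0):
    """Estimate the mean of `values` from a random-order scan.

    Stops as soon as the half-width of the normal-approximation confidence
    interval is at most `epsilon`. Returns (estimate, half_width, n_seen).
    """
    order = list(values)
    random.Random(seed).shuffle(order)  # random order stands in for sampling
    n, mean, m2 = 0, 0.0, 0.0           # Welford running mean and squared deviations
    half_width = math.inf
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            std_err = math.sqrt(m2 / (n - 1) / n)  # sample std dev / sqrt(n)
            half_width = z * std_err
            if half_width <= epsilon:
                break                    # accurate enough: stop the scan early
    return mean, half_width, n
```

A looser `epsilon` lets the scan stop after fewer values, which is the trade-off between accuracy and execution time that the paragraph above describes.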