Big Data Analytics has been a hot topic in computing systems, and various systems have emerged to support it. Though databases have served as the data hub for decades, they fall short for Big Data Analytics due to inherent limitations. This dissertation presents GLADE-ML, a scalable and efficient parallel database specifically tailored for Big Data Analytics. Unlike traditional databases, GLADE-ML provides iteration management and explicit or implicit randomization in its execution strategy. GLADE-ML offers in-database analytics that outperforms other in-database analytics solutions by several orders of magnitude.
GLADE-ML also introduces the dot-product join operator, which is specifically designed for Big Models. Big Data analytics has been approached exclusively from a data-parallel perspective, where data are partitioned across multiple workers (threads or separate servers) and model training is executed concurrently over the partitions, under various synchronization schemes that guarantee speedup and/or convergence. The dual problem, the Big Model problem, has surprisingly received no attention in database analytics: how to manage models with millions, if not billions, of parameters that do not fit in memory. This distinction in model representation fundamentally changes how in-database analytics tasks are carried out.
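To make the operator concrete, the following is a minimal sketch of the idea behind a dot-product join: sparse data tuples are matched against the pages of a large model vector that they reference, and partial dot products are accumulated page by page, so the full model never has to reside in memory. The page size and helper names are illustrative assumptions, not GLADE-ML's actual interface.

```python
import numpy as np
from collections import defaultdict

PAGE_SIZE = 1024  # model entries per disk page (assumed)

def dot_product_join(tuples, model_pages):
    """tuples: list of (tuple_id, [(index, value), ...]) sparse vectors.
    model_pages: dict page_id -> np.ndarray of length PAGE_SIZE."""
    results = defaultdict(float)
    # Group tuple entries by the model page they touch, so each page is
    # read once per batch instead of once per tuple.
    by_page = defaultdict(list)
    for tid, entries in tuples:
        for idx, val in entries:
            by_page[idx // PAGE_SIZE].append((tid, idx % PAGE_SIZE, val))
    for page_id, refs in by_page.items():
        page = model_pages[page_id]          # single read of this page
        for tid, offset, val in refs:
            results[tid] += val * page[offset]
    return dict(results)

# Tiny usage example with two pages and two sparse tuples.
model_pages = {0: np.ones(PAGE_SIZE), 1: np.ones(PAGE_SIZE)}
tuples = [(0, [(3, 2.0), (1500, 1.0)]), (1, [(10, 0.5)])]
print(dot_product_join(tuples, model_pages))   # {0: 3.0, 1: 0.5}
```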
GLADE-ML supports model parallelism over massive models that cannot fit in memory. It extends the lock-free HOGWILD! family of algorithms to disk-resident models by vertically partitioning the model offline and asynchronously updating the resulting partitions online. Unlike HOGWILD!, concurrent requests to the common model are minimized by a preemptive push-based sharing mechanism that reduces both the number of disk accesses and the cache coherency traffic between workers. Extensive experimental results for three widespread analytics tasks on real and synthetic datasets show that the proposed framework achieves convergence similar to HOGWILD!, while being the only scalable solution for disk-resident models.
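The following is a minimal sketch, under strong assumptions, of HOGWILD!-style asynchronous gradient descent over a vertically partitioned model. The partition layout, the load_partition helper, and the squared-loss objective are illustrative placeholders; in the actual system the partitions are disk-resident and shared through the push-based mechanism described above.

```python
import threading
import numpy as np

NUM_PARTITIONS = 4
PART_DIM = 250_000                      # parameters per vertical partition

partition_cache = {}                    # stands in for disk-resident partitions

def load_partition(pid):
    # Assumed helper: page a model partition into memory on demand.
    if pid not in partition_cache:
        partition_cache[pid] = np.zeros(PART_DIM)
    return partition_cache[pid]

def worker(examples, lr=0.01):
    # Each worker runs SGD over its examples and writes to the shared
    # partitions without taking any locks (the HOGWILD! idea).
    for features, label in examples:
        # features: sparse list of (partition_id, local_index, value)
        pred = sum(v * load_partition(pid)[i] for pid, i, v in features)
        err = pred - label                           # squared-loss residual
        for pid, i, v in features:
            load_partition(pid)[i] -= lr * err * v   # unsynchronized update

# Example: two workers updating the same partitioned model concurrently.
data = [([(0, 7, 1.0), (2, 13, 0.5)], 1.0), ([(1, 3, 2.0)], 0.0)]
threads = [threading.Thread(target=worker, args=(data,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```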
Another distinct feature of GLADE-ML is hyper-parameter tuning. Identifying the optimal hyper-parameters is a time-consuming process because the computation has to be executed from scratch for every dataset/model combination, even by experienced data scientists. GLADE-ML provides speculative parameter testing, which applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are drawn from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the process by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration.
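The sketch below illustrates the control flow of speculative parameter testing under heavy simplification: sample_configs stands in for the continuously learned (Bayesian) proposal distribution, and estimated_loss stands in for the online-aggregation estimate of the objective on a growing data sample. Both are hypothetical placeholders, not GLADE-ML components.

```python
import random

def sample_configs(n):
    # Placeholder: draw n step-size configurations from a log-uniform prior.
    return [{"step_size": 10 ** random.uniform(-4, 0)} for _ in range(n)]

def estimated_loss(config, sample_fraction):
    # Placeholder: noisy estimate of the objective computed on a sample of
    # the training data; the noise shrinks as the sample grows.
    return config["step_size"] + random.gauss(0, 0.1) / sample_fraction ** 0.5

def speculative_search(rounds=5, width=8, keep=0.5):
    configs = sample_configs(width)     # configurations evaluated concurrently
    fraction = 0.1                      # initial sample of the training data
    for _ in range(rounds):
        scored = sorted(configs, key=lambda c: estimated_loss(c, fraction))
        # Drop configurations that already look sub-optimal instead of
        # training them on the full dataset.
        configs = scored[: max(1, int(len(scored) * keep))]
        fraction = min(1.0, 2 * fraction)
    return configs[0]

print(speculative_search())
```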