There is increasing interest, both in industry and academia, in building machine learning frameworks with advanced algebraic capabilities. Many of these frameworks, e.g., TensorFlow, implement their compute-intensive primitives in two flavors: as multi-threaded routines for multi-core CPUs and as highly parallel kernels executed on GPUs. The widely adopted practice is to train deep learning models on specialized hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on linear algebra operations. This practice, however, does not effectively employ the extensive CPU and memory resources available on the servers, which are used only to schedule computation and transfer data. Moreover, in multi-GPU systems, clock and memory speeds can vary considerably even among GPUs of the same model from the same vendor, so GPU heterogeneity has to be carefully considered. In addition, the optimization algorithm used by the deep learning framework plays a pivotal role in training performance. Gradient descent (GD) is the most popular optimization method for model training on modern machine learning platforms. However, its convergence behavior and its adaptation to heterogeneous systems remain open research directions.

In this dissertation, we perform a comprehensive experimental study of parallel GD for training machine learning models. We consider the impact of three factors (computing architecture, synchronous or asynchronous model updates, and data sparsity) on three measures: hardware efficiency, statistical efficiency, and time to convergence. We introduce a generic heterogeneous CPU+GPU framework that exploits the differences in computational power and memory hierarchy between CPU and GPU through asynchronous message passing in order to maximize performance and resource utilization. Based on insights gained from experimentation with this framework, we design two heterogeneous asynchronous GD algorithms: CPU+GPU HogBatch and Adaptive HogBatch. We also build a heterogeneity-aware multi-GPU framework that reduces the synchronization and data transfer costs of training, and we present Adaptive GD, a novel synchronous GD algorithm that tackles the heterogeneity challenge in multi-GPU systems. We show that the implementations of these algorithms in the proposed frameworks greatly accelerate convergence and achieve significantly higher resource utilization than state-of-the-art machine learning systems on several real datasets.