Horizontally Fused Training Array

An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models



Shang Wang

Peiming Yang*

Yuxuan Zheng*

Xin Li*

Gennady Pekhimenko

*Equal Contribution

In Proceedings of Machine Learning and Systems 3 (MLSys 2021)

Abstract

Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models has increased staggeringly in recent years. We analyze GPU cluster usage statistics from a top research institute for more insights into the hardware efficiency achieved by typical DL training jobs. Our study reveals that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while severely under-utilizing the hardware. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models among jobs often have the same types of operators with the same shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners effectively improve the hardware utilization of their novel DL training workloads, we propose Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply down to operators and then trains them simultaneously on a shared accelerator. To show the generality of our solution, we apply HFTA to six DL models training on state-of-the-art accelerators (GPUs and TPUs). Our results indicate that HFTA is highly effective in improving hardware utilization and achieves up to 15.1× higher training throughput vs. the standard practice of running each job on a separate accelerator.


What is Horizontally Fused Training Array (HFTA)?

Horizontally Fused Training Array (HFTA) is a PyTorch extension library that helps machine learning and deep learning researchers and practitioners develop horizontally fused models. Each fused model is functionally and mathematically equivalent to an array of models with the same (or similar) operators.
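
To make "horizontally fused" concrete, here is a minimal sketch in plain PyTorch (not HFTA's actual API; all names below are illustrative): B independent Conv2d operators with identical shapes can be fused into a single grouped convolution, which is mathematically equivalent but launches one larger kernel instead of B small ones.

```python
import torch
import torch.nn as nn

B = 3                       # number of models (e.g., hyper-parameter trials) to fuse
in_ch, out_ch, k = 16, 32, 3

# B separate convolutions, one per model.
convs = [nn.Conv2d(in_ch, out_ch, k, padding=1) for _ in range(B)]

# One fused convolution: stack the B weights/biases and use groups=B.
fused = nn.Conv2d(B * in_ch, B * out_ch, k, padding=1, groups=B)
with torch.no_grad():
    fused.weight.copy_(torch.cat([c.weight for c in convs], dim=0))
    fused.bias.copy_(torch.cat([c.bias for c in convs], dim=0))

# Each model keeps its own input; the fused op consumes them concatenated along channels.
xs = [torch.randn(8, in_ch, 28, 28) for _ in range(B)]
y_separate = torch.cat([c(x) for c, x in zip(convs, xs)], dim=1)
y_fused = fused(torch.cat(xs, dim=1))

print(torch.allclose(y_separate, y_fused, atol=1e-5))  # True (up to floating-point error)
```

HFTA applies this kind of inter-model operator fusion deeply across a model's operators for you, so you don't have to hand-fuse each operator as done above.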

But why do we need HFTA?

Why develop horizontally fused models at all, you ask? Because training certain classes of models can severely under-utilize the underlying accelerators, and this under-utilization is greatly amplified when you train such models repetitively (e.g., when tuning their hyper-parameters). Fortunately, in these use cases, the models under repetitive training often have the same types of operators with the same shapes (e.g., consider what happens to the operators when you only adjust the learning rate: nothing). Therefore, with HFTA, you can improve hardware utilization by training an array of models (as a single fused model) on the same accelerator at the same time.
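
As a sketch of why this works for hyper-parameter tuning, the toy module below (hypothetical name FusedLinearArray, written in plain PyTorch rather than HFTA's API) evaluates B same-shaped Linear layers with a single batched matmul. Since tuning the learning rate changes none of the operator shapes, all B trials can share this one fused operator on the same accelerator while keeping their weights and gradients independent.

```python
import torch
import torch.nn as nn

class FusedLinearArray(nn.Module):
    """Behaves like B independent nn.Linear(in_f, out_f) layers, evaluated at once."""
    def __init__(self, B, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(B, in_f, out_f) * in_f ** -0.5)
        self.bias = nn.Parameter(torch.zeros(B, 1, out_f))

    def forward(self, x):          # x: [B, batch, in_f], one slice per model
        return torch.baddbmm(self.bias, x, self.weight)  # -> [B, batch, out_f]

B, batch, in_f, out_f = 4, 64, 128, 10
layer = FusedLinearArray(B, in_f, out_f)

# The same training batch is broadcast to all B trials; each trial still owns
# its own slice of the fused weights, so losses and gradients stay independent.
x = torch.randn(batch, in_f).expand(B, batch, in_f)
logits = layer(x)                  # one bmm kernel instead of B small matmuls
print(logits.shape)                # torch.Size([4, 64, 10])
```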

How capable is HFTA?

HFTA is device-agnostic. So far, we have tested HFTA and observed significant training performance and hardware utilization improvements on NVIDIA V100, RTX 6000, and A100 GPUs, as well as Google Cloud TPU v3.

Wanna learn more about HFTA?

Watching our MLSys'21 talk before diving into our paper or GitHub repo could be very helpful.
To help you get started, you can open our 10-minute Jupyter Notebook tutorial in Colab and play around with implementing your own horizontally fused models via HFTA.

Citation

@inproceedings{MLSYS2021_HFTA,
 author = {Wang, Shang and Yang, Peiming and Zheng, Yuxuan and Li, Xin and Pekhimenko, Gennady},
 booktitle = {Proceedings of Machine Learning and Systems},
 editor = {A. Smola and A. Dimakis and I. Stoica},
 pages = {599--623},
 title = {Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models},
 url = {https://proceedings.mlsys.org/paper/2021/file/a97da629b098b75c294dffdc3e463904-Paper.pdf},
 volume = {3},
 year = {2021}
}