Report
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes
Title: | Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes |
---|---|
Authors: | Jia, Xianyan, Song, Shutao, He, Wei, Wang, Yangzihao, Rong, Haidong, Zhou, Feihu, Xie, Liqiang, Guo, Zhenyu, Yang, Yuanzhou, Yu, Liwei, Chen, Tiegang, Hu, Guangxiao, Shi, Shaohuai, Chu, Xiaowen |
Publication Year: | 2018 |
Collection: | Computer Science; Statistics |
Subject Terms: | Computer Science - Machine Learning; Computer Science - Distributed, Parallel, and Cluster Computing; Statistics - Machine Learning |
Description: | Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although using larger mini-batch sizes can improve the system scalability by reducing the communication-to-computation ratio, it may hurt the generalization ability of the models. To this end, we build a highly scalable deep learning training system for dense GPU clusters with three main contributions: (1) We propose a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy. (2) We propose an optimization approach for extremely large mini-batch size (up to 64k) that can train CNN models on the ImageNet dataset without losing accuracy. (3) We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedup on AlexNet and ResNet-50, respectively, over NCCL-based training on a cluster with 1024 Tesla P40 GPUs. On training ResNet-50 with 90 epochs, the state-of-the-art GPU-based system with 1024 Tesla P100 GPUs spent 15 minutes and achieved 74.9% top-1 test accuracy, and another KNL-based system with 2048 Intel KNLs spent 20 minutes and achieved 75.4% accuracy. Our training system can achieve 75.8% top-1 test accuracy in only 6.6 minutes using 2048 Tesla P40 GPUs. When training AlexNet with 95 epochs, our system can achieve 58.7% top-1 test accuracy within 4 minutes, which also outperforms all other existing systems. Comment: arXiv admin note: text overlap with arXiv:1803.03383 by other authors |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/1807.11205 |
Accession Number: | edsarx.1807.11205 |
Database: | arXiv |
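
The description's claim that larger mini-batches reduce the communication-to-computation ratio can be illustrated with a rough back-of-the-envelope model. This is only a sketch under stated assumptions, not the authors' cost model: the function name `comm_to_comp_ratio`, the parameter count, the per-sample FLOP count, and the bandwidth and throughput figures are all hypothetical placeholders. The one idea it relies on is that, in synchronous data-parallel SGD, the gradient volume exchanged per step depends on model size, while compute per step grows with the per-GPU batch size.

```python
def comm_to_comp_ratio(
    param_count,
    per_gpu_batch,
    flops_per_sample,
    bytes_per_param=4,            # assumption: FP32 gradients
    bandwidth_bytes_per_s=1e10,   # assumption: hypothetical link bandwidth
    flops_per_s=1e13,             # assumption: hypothetical sustained GPU throughput
):
    """Estimate (gradient exchange time) / (compute time) for one training step.

    Simplified model: communication cost scales with the number of
    parameters only; compute cost scales with the per-GPU batch size.
    """
    comm_time = param_count * bytes_per_param / bandwidth_bytes_per_s
    comp_time = per_gpu_batch * flops_per_sample / flops_per_s
    return comm_time / comp_time


if __name__ == "__main__":
    # Illustrative inputs only; none of these numbers come from the paper.
    for batch in (16, 32, 64, 128):
        ratio = comm_to_comp_ratio(
            param_count=25_000_000,
            per_gpu_batch=batch,
            flops_per_sample=1e10,
        )
        print(f"per-GPU batch {batch:4d}: comm/comp ratio = {ratio:.4f}")
```

In this simplified model, doubling the per-GPU batch halves the ratio, which matches the direction of the scalability argument. It ignores overlap of communication with computation, all-reduce topology, and latency terms, all of which a real system would have to account for.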