Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks

Bibliographic Details
Title: Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks
Authors: Motetti, Beatrice Alessandra; Risso, Matteo; Burrello, Alessio; Macii, Enrico; Poncino, Massimo; Pagliari, Daniele Jahier
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning
Description: The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization, which improve latency and memory occupation. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the baseline networks with all weights quantized at 8-bit and 2-bit, respectively. Our method surpasses a previous state-of-the-art approach with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with significantly lower training time. In addition, we show how well-tailored cost models can improve the cost versus accuracy trade-offs when targeting specific hardware for deployment.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2407.01054
Accession Number: edsarx.2407.01054
Database: arXiv
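
Illustrative note: the description reports size reductions relative to uniformly quantized baselines. The sketch below shows one way such a memory cost could be computed for a channel-wise precision assignment. It is not the authors' implementation. It assumes, for illustration only, that each output channel of a convolutional layer is assigned a bit-width, that a bit-width of 0 stands for a pruned channel, and that pruned channels also disappear from the next layer's input. The function names, the toy bit-width assignment, and the 3x3 kernel size are all hypothetical.

```python
from typing import Sequence


def layer_size_bits(channel_bits: Sequence[int], in_channels: int,
                    kernel_size: int = 3) -> int:
    """Weight storage, in bits, of one conv layer.

    channel_bits[i] is the bit-width of output channel i.
    Assumption: a bit-width of 0 marks the channel as pruned.
    """
    weights_per_channel = in_channels * kernel_size * kernel_size
    return sum(bits * weights_per_channel for bits in channel_bits)


def network_size_bits(per_layer_bits: Sequence[Sequence[int]],
                      in_channels: int, kernel_size: int = 3) -> int:
    """Total weight storage of a plain stack of conv layers.

    Assumption: channels pruned in one layer are removed from the
    input of the next layer, so they shrink that layer as well.
    """
    total = 0
    cin = in_channels
    for channel_bits in per_layer_bits:
        total += layer_size_bits(channel_bits, cin, kernel_size)
        # Only channels with a non-zero bit-width feed the next layer.
        cin = sum(1 for bits in channel_bits if bits > 0)
    return total


# Hypothetical two-layer example with an RGB (3-channel) input.
assignment = [[8, 8, 4, 0], [4, 2, 0, 0, 2, 8]]  # mixed precision + pruning
baseline = [[8] * 4, [8] * 6]                    # every weight at 8-bit

mixed = network_size_bits(assignment, in_channels=3)
uniform8 = network_size_bits(baseline, in_channels=3)
reduction = 100 * (1 - mixed / uniform8)

print(f"mixed: {mixed} bits, uniform 8-bit: {uniform8} bits, "
      f"reduction: {reduction:.2f}%")
```

In this toy setting the saving has two sources: lower bit-widths on the channels that are kept, and fewer input channels in the second layer because one channel of the first layer was pruned. A cost model tailored to a specific hardware target, as the description mentions, would replace this simple bit count with a latency or memory estimate for that target.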