Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration

Bibliographic details
Title: Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration
Authors: Diego Perez, Esteban Meneses, Terry Jones, Leonardo Bautista Gomez, Jon Calhoun, Elvis Rojas
Contributors: Barcelona Supercomputing Center
Source: CLUSTER
UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
Publication information: Institute of Electrical and Electronics Engineers (IEEE), 2021.
Publication year: 2021
Subject terms: Artificial intelligence, Computer science, Deep learning (Machine learning), Distributed computing, HDF5, Silent data corruption, Hierarchical Data Format, Social network analysis, Supercomputers, Fault injection, Convergence (routing), Leverage (statistics), Sensitivity (control systems), High-performance computing, Resilience, Checkpoint, Deep learning, Soft error, Computer science::Artificial intelligence [UPC subject areas], Neural networks
Description: The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for markedly advancing discoveries and for leveraging synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs in DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with minimal impact on accuracy convergence.
File description: application/pdf
Language: English
Access URL: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::69bdb4706ed3b9afcd993b6ed529a1bb
https://hdl.handle.net/2117/364744
Rights: OPEN
Accession number: edsair.doi.dedup.....69bdb4706ed3b9afcd993b6ed529a1bb
Database: OpenAIRE