Academic Journal

From Missteps to Milestones: A Journey to Practical Fail-Slow Detection.

Bibliographic Details
Title: From Missteps to Milestones: A Journey to Practical Fail-Slow Detection.
Authors: RUIMING LU, ERCI XU, YIMING ZHANG, FENGYI ZHU, ZHAOSHENG ZHU, MENGTIAN WANG, ZONGPENG ZHU, GUANGTAO XUE, JIWU SHU, MINGLU LI, JIESHENG WU
Source: ACM Transactions on Storage; Nov 2023, Vol. 19 Issue 4, p1-28, 28p
Subject Terms: ROOT cause analysis
Abstract: The newly emerging "fail-slow" failures plague both software and hardware where the victim components are still functioning yet with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a light regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. Within a 10-month close monitoring on 248K drives, Perseus managed to find 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis on fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
Description
ISSN: 1553-3077
DOI: 10.1145/3617690