Analyzing the Performance Portability of Tensor Decomposition

Bibliographic Details
Title: Analyzing the Performance Portability of Tensor Decomposition
Authors: Anderson, S. Isaac Geronimo; Teranishi, Keita; Dunlavy, Daniel M.; Choi, Jee
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Distributed, Parallel, and Cluster Computing; C.1.2; C.1.4; D.4.8; G.4
Description: We employ pressure point analysis and roofline modeling to identify performance bottlenecks and determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a particular matrix computation, $\Phi^{(n)}$, is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck while higher cache reuse can provide a non-trivial performance improvement. We also utilize grid search on the Kokkos library parallel policy parameters to achieve 2.25x average speedup over the SparTen default for $\Phi^{(n)}$ computation on CPU and 1.70x on GPU. We conclude our investigations by comparing Kokkos implementations of the STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that with a single implementation Kokkos achieves performance comparable to hand-tuned code for fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.
Comment: 28 pages, 19 figures
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2307.03276
Accession Number: edsarx.2307.03276
Database: arXiv
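
Note: The roofline modeling mentioned in the description bounds a kernel's attainable throughput by the smaller of the machine's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity. The following is a minimal sketch of that basic bound, not the authors' analysis code. The function names and the machine figures (1000 GFLOP/s peak, 100 GB/s bandwidth) are hypothetical values chosen only for illustration.

```python
def roofline_bound(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Upper bound on attainable GFLOP/s under the basic roofline model.

    A kernel is limited either by peak compute or by how fast memory
    can feed it: bandwidth (GB/s) times arithmetic intensity (FLOP/byte).
    """
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being
    memory-bound and becomes compute-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine parameters (assumptions, not taken from the paper).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
    for intensity in (0.25, 1.0, 10.0, 50.0):
        bound = roofline_bound(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
        regime = "memory-bound" if intensity < ridge else "compute-bound"
        print(f"intensity={intensity:6.2f} FLOP/byte  "
              f"bound={bound:7.1f} GFLOP/s  ({regime})")
```

Under these assumed figures, a low-intensity kernel sits far below peak compute, which is the regime in which cache reuse and memory traffic tend to decide performance.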