Testing Closeness of Multivariate Distributions via Ramsey Theory

التفاصيل البيبلوغرافية
العنوان: Testing Closeness of Multivariate Distributions via Ramsey Theory
المؤلفون: Diakonikolas, Ilias, Kane, Daniel M., Liu, Sihan
سنة النشر: 2023
المجموعة: Computer Science
Mathematics
Statistics
مصطلحات موضوعية: Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning, Mathematics - Statistics Theory, Statistics - Machine Learning
الوصف: We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $\mathbf p, \mathbf q$ on $\mathbb R^d$, we want to distinguish between the case that $\mathbf p=\mathbf q$ versus $\|\mathbf p-\mathbf q\|_{A_k} > \epsilon$, where $\|\mathbf p-\mathbf q\|_{A_k}$ denotes the generalized ${A}_k$ distance between $\mathbf p$ and $\mathbf q$ -- measuring the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity $O\left((k^{6/7}/ \mathrm{poly}_d(\epsilon)) \log^d(k)\right)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $\Omega(k^{6/7}/\mathrm{poly}(\epsilon))$, even for $d=2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general $A_k$ tester, we obtain $d_{\mathrm TV}$-closeness testers for pairs of $k$-histograms on $\mathbb R^d$ over a common unknown partition, and pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory.
نوع الوثيقة: Working Paper
URL الوصول: http://arxiv.org/abs/2311.13154
رقم الأكسشن: edsarx.2311.13154
قاعدة البيانات: arXiv