MeanEmbeddingTest

class hyppo.ksample.MeanEmbeddingTest(num_randfreq=5)

Mean Embedding test statistic and p-value.

The Mean Embedding test is a two-sample test that uses differences in (analytic) mean embeddings of two data distributions in a reproducing kernel Hilbert space [1].

Parameters

num_randfreq (int, default: 5) -- Number of random test points (frequencies) at which the test is performed. It is used to construct a random array of shape (p, q), where p is the number of dimensions of the data and q is num_randfreq (see Notes).
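As a rough illustration of the shape described above, the following sketch draws a (p, q) array of random test points with NumPy. Drawing from a standard normal distribution and the variable name `test_points` are assumptions made for this example; the sampling scheme used inside hyppo may differ.

```python
import numpy as np

p = 10            # number of dimensions of the data
num_randfreq = 5  # q: number of random test points

# Assumption: test points are drawn from a standard normal distribution.
rng = np.random.default_rng(1234)
test_points = rng.standard_normal((p, num_randfreq))

print(test_points.shape)
```

Fixing the seed makes the test points, and therefore the statistic computed from them, reproducible across runs.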

Notes

The test statistic, like the Smooth CF statistic, takes on the following form:

\[S_n = nW_n^T\Sigma_n^{-1}W_n\]

As seen in the above formulation, this test statistic takes the same form as the Hotelling \(T^2\) statistic found in hyppo.ksample.Hotelling. However, the components are defined differently in this case. Given data sets X and Y, define \(Z_i\), the vector of differences, as:

\[Z_i = (k(X_i, T_1) - k(Y_i, T_1), \ldots, k(X_i, T_J) - k(Y_i, T_J)) \in \mathbb{R}^J\]

Each \(Z_i\) is the vector of differences between kernel evaluations at the test points \(T_j\), where the kernel \(k\) maps into the reproducing kernel Hilbert space. The Smooth CF test uses an analogous construction. With \(Z_i\) in hand, \(W_n\) is defined as:

\[W_n = \frac{1}{n} \sum_{i = 1}^n Z_i\]

This leaves \(\Sigma_n\), the covariance matrix of the \(Z_i\), defined as:

\[\Sigma_n = \frac{1}{n}ZZ^T\]

Once the test statistic, denoted \(S_n\), is calculated, a threshold \(r_{\alpha}\) corresponding to the \(1 - \alpha\) quantile of a chi-squared distribution with J degrees of freedom is chosen. The null hypothesis is rejected if \(S_n\) is larger than this threshold.
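The steps in these notes can be sketched in plain NumPy and SciPy. This is a minimal illustration rather than hyppo's implementation. It assumes a Gaussian kernel with unit bandwidth, test points drawn from a standard normal distribution, equal sample sizes for both data sets, and a statistic scaled by n as in [1]. The names `gaussian_kernel` and `mean_embedding_sketch` are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def gaussian_kernel(data, test_points, bandwidth=1.0):
    """Evaluate k(x, t) = exp(-||x - t||^2 / (2 * bandwidth^2)) for all pairs."""
    sq_dists = ((data[:, None, :] - test_points[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mean_embedding_sketch(x, y, num_test_points=5, alpha=0.05, seed=None):
    """Hypothetical helper: returns (statistic, p-value, reject_null)."""
    if x.shape != y.shape:
        raise ValueError("this sketch assumes x and y have the same shape")
    n, p = x.shape
    rng = np.random.default_rng(seed)

    # Assumption: J test points T_1, ..., T_J drawn from a standard normal.
    test_points = rng.standard_normal((num_test_points, p))

    # Row i holds Z_i: kernel differences at each test point, shape (n, J).
    z = gaussian_kernel(x, test_points) - gaussian_kernel(y, test_points)

    w = z.mean(axis=0)    # W_n, the average of the Z_i
    sigma = z.T @ z / n   # Sigma_n, following the (1/n) Z Z^T form in the notes

    # Assumed scaling: S_n = n * W_n^T Sigma_n^{-1} W_n.
    stat = float(n * w @ np.linalg.solve(sigma, w))

    # Compare against the 1 - alpha quantile of chi-squared with J dof.
    threshold = chi2.ppf(1.0 - alpha, df=num_test_points)
    pvalue = float(chi2.sf(stat, df=num_test_points))
    return stat, pvalue, bool(stat > threshold)


rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))
y = rng.standard_normal((500, 2)) + 2.0  # shifted mean, so the null is false
stat, pvalue, reject = mean_embedding_sketch(x, y, seed=1)
print(stat, pvalue, reject)
```

Solving the linear system with `np.linalg.solve` avoids forming \(\Sigma_n^{-1}\) explicitly, which is usually more stable when some test points lie far from the data.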

References

[1] Kacper P. Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, and Arthur Gretton. Fast two-sample testing with analytic representations of probability measures. Advances in Neural Information Processing Systems, 2015.

Methods Summary

MeanEmbeddingTest.statistic(x, y, random_state)

Calculates the mean embedding test statistic.

MeanEmbeddingTest.test(x, y[, random_state])

Calculates the mean embedding test statistic and p-value.


MeanEmbeddingTest.statistic(x, y, random_state)

Calculates the mean embedding test statistic.

Parameters
  • x,y (ndarray of float) -- Input data matrices. x and y must have the same number of dimensions. That is, the shapes must be (n, p) and (m, p), where n and m are the numbers of samples and p is the number of dimensions.

  • random_state (int) -- Random seed used to generate the test points.

Returns

stat (float) -- The computed mean embedding statistic.

MeanEmbeddingTest.test(x, y, random_state=None)

Calculates the mean embedding test statistic and p-value.

Parameters
  • x,y (ndarray of float) -- Input data matrices. x and y must have the same number of dimensions. That is, the shapes must be (n, p) and (m, p), where n and m are the numbers of samples and p is the number of dimensions.

  • random_state (int) -- Random seed used to generate the test points.

Returns

  • stat (float) -- The computed mean embedding statistic.

  • pvalue (float) -- The computed mean embedding p-value.

Examples

>>> import numpy as np
>>> from hyppo.ksample import MeanEmbeddingTest
>>> np.random.seed(1234)
>>> x = np.random.randn(500, 10)
>>> y = np.random.randn(500, 10)
>>> stat, pvalue = MeanEmbeddingTest().test(x, y, random_state=1234)
>>> '%.2f, %.3f' % (stat, pvalue)
'5.33, 0.377'
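
The figures in this example can be related to the Notes with a short check. Assuming the p-value is the upper-tail probability of a chi-squared distribution whose degrees of freedom equal num_randfreq (an inference from the Notes, not a statement about hyppo's internals), the rounded statistic reported above gives a p-value close to the reported one:

```python
from scipy.stats import chi2

stat = 5.33       # rounded statistic reported in the example
num_randfreq = 5  # default number of random test points

# Assumption: p-value = P(chi-squared with num_randfreq dof > stat).
pvalue = chi2.sf(stat, df=num_randfreq)
print(round(pvalue, 3))
```

Because the statistic is rounded to two decimals here, the result may differ from the reported p-value in the last digit.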
