naginterfaces.library.univar.robust_2var_ci

naginterfaces.library.univar.robust_2var_ci(method, x, y, clevel)

robust_2var_ci calculates a rank based (nonparametric) estimate and confidence interval for the difference in location between two independent populations.

For full information please refer to the NAG Library document for g07eb

https://support.nag.com/numeric/nl/nagdoc_30.3/flhtml/g07/g07ebf.html

Parameters

method (str, length 1)

Specifies the method to be used.

$\mathrm{method} = \text{'E'}$

The exact algorithm is used.

$\mathrm{method} = \text{'A'}$

The iterative algorithm is used.

x (float, array-like, shape $(n)$)

The observations of the first sample, $x_{i}$, for $i = 1, 2, \dots, n$.

y (float, array-like, shape $(m)$)

The observations of the second sample, $y_{j}$, for $j = 1, 2, \dots, m$.

clevel (float)

The confidence level required, $1 - α$; e.g., for a $95\%$ confidence interval set $\mathrm{clevel} = 0.95$.

Returns

theta (float): The estimate of the difference in the location of the two populations, $\hat{θ}$.
thetal (float): The estimate of the lower limit of the confidence interval, $θ_{l}$.
thetau (float): The estimate of the upper limit of the confidence interval, $θ_{u}$.
estcl (float): An estimate of the actual percentage confidence of the interval found, expressed as a proportion in the interval $(0.0, 1.0)$.
ulower (float): The value of the Mann–Whitney $U$ statistic corresponding to the lower confidence limit, $U_{l}$.
uupper (float): The value of the Mann–Whitney $U$ statistic corresponding to the upper confidence limit, $U_{u}$.
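The signature and return values listed above suggest a call pattern along the following lines. This is only a sketch: the helper name `shift_ci` and the dictionary it returns are hypothetical, and it assumes that the six outputs can be unpacked in the order given under Returns. Running it requires an installed and licensed copy of naginterfaces.

```python
def shift_ci(x, y, clevel=0.95, method='E'):
    """Hypothetical convenience wrapper around robust_2var_ci."""
    # Imported lazily so that defining the helper does not itself require NAG.
    from naginterfaces.library import univar

    # Assumption: the outputs unpack in the order listed under Returns.
    theta, thetal, thetau, estcl, ulower, uupper = univar.robust_2var_ci(
        method, x, y, clevel,
    )
    return {
        'theta': theta,
        'interval': (thetal, thetau),
        'estcl': estcl,
        'u_limits': (ulower, uupper),
    }
```

A call such as `shift_ci(x, y, clevel=0.95, method='A')` would select the iterative algorithm instead of the exact one.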

Raises

NagValueError

(errno $1$)

On entry, $m = \langle\mathit{value}\rangle$.

Constraint: $m \geq 1$.

(errno $1$)

On entry, $\mathrm{clevel} = \langle\mathit{value}\rangle$.

Constraint: $0.0 < \mathrm{clevel} < 1.0$.

(errno $1$)

On entry, $n = \langle\mathit{value}\rangle$.

Constraint: $n \geq 1$.

(errno $1$)

On entry, $\mathrm{method} = \langle\mathit{value}\rangle$.

Constraint: $\mathrm{method} = \text{'E'}$ or $\text{'A'}$.

Warns

NagAlgorithmicWarning

(errno $2$): Not enough information to compute an interval estimate since each sample has identical values. The common difference is returned in $\mathrm{theta}$, $\mathrm{thetal}$ and $\mathrm{thetau}$.
(errno $3$): The iterative procedure used to estimate $θ$ has not converged.
(errno $3$): The iterative procedure used to estimate $θ_{u}$, the upper confidence limit, has not converged.
(errno $3$): The iterative procedure used to estimate $θ_{l}$, the lower confidence limit, has not converged.

Notes

Consider two random samples from two populations which have the same continuous distribution except for a shift in the location. Let the random sample, $x = (x_{1}, x_{2}, \dots, x_{n})^{T}$, have distribution $F(x)$ and the random sample, $y = (y_{1}, y_{2}, \dots, y_{m})^{T}$, have distribution $F(x - θ)$.

robust_2var_ci finds a point estimate, $\hat{θ}$, of the difference in location $θ$ together with an associated confidence interval. The estimates are based on the ordered differences $y_{j} - x_{i}$. The estimate $\hat{θ}$ is defined by

$$\hat{θ} = \mathrm{median}\{y_{j} - x_{i},\ i = 1, 2, \dots, n;\ j = 1, 2, \dots, m\}.$$

Let $d_{k}$, for $k = 1, 2, \dots, nm$, denote the $nm$ differences $y_{j} - x_{i}$, for $j = 1, 2, \dots, m$ and $i = 1, 2, \dots, n$, sorted in ascending order. Then

if $nm$ is odd, $\hat{θ} = d_{k}$ where $k = (nm + 1)/2$;

if $nm$ is even, $\hat{θ} = (d_{k} + d_{k+1})/2$ where $k = nm/2$.
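The definition above can be checked on small samples with a brute-force sketch that forms every difference explicitly. The function name `shift_estimate` and the sample data are illustrative only, and this is not how the NAG routine computes the estimate, since it deliberately avoids storing all $nm$ differences.

```python
from statistics import median


def shift_estimate(x, y):
    """Median of all n*m differences y_j - x_i (brute force, O(n*m) storage)."""
    diffs = [yj - xi for yj in y for xi in x]
    # statistics.median returns the middle value for an odd count and the
    # mean of the two middle values for an even count.
    return median(diffs)


x = [1, 2, 3]
y = [2, 4, 6]
theta_hat = shift_estimate(x, y)  # nine differences, so the middle one is used
```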

This estimator arises from inverting the two sample Mann–Whitney rank test statistic, $U(θ_{0})$, for testing the hypothesis that $θ = θ_{0}$. Thus $U(θ_{0})$ is the value of the Mann–Whitney $U$ statistic for the two independent samples $\{(x_{i} + θ_{0}), \text{ for } i = 1, 2, \dots, n\}$ and $\{y_{j}, \text{ for } j = 1, 2, \dots, m\}$. Effectively $U(θ_{0})$ is a monotonically increasing step function of $θ_{0}$ with

$$\mathrm{mean}(U) = μ = \frac{nm}{2}, \qquad \mathrm{var}(U) = σ^{2} = \frac{nm(n + m + 1)}{12}.$$

The estimate $\hat{θ}$ is the solution to the equation $U(\hat{θ}) = μ$; two methods are available for solving this equation. These methods avoid the computation of all the ordered differences $d_{k}$, because for large $n$ and $m$ both the storage requirements and the computation time would be high.

The first is an exact method based on a set partitioning procedure on the set of all differences $y_{j} - x_{i}$, for $j = 1, 2, \dots, m$ and $i = 1, 2, \dots, n$. This is adapted from the algorithm proposed by Monahan (1984) for the computation of the Hodges–Lehmann estimator for a single population.

The second is an iterative algorithm based on the Illinois method, a modification of the regula falsi method; see McKean and Ryan (1977). This algorithm has proved suitable for the function $U(θ_{0})$, which is asymptotically linear as a function of $θ_{0}$.
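For readers unfamiliar with the Illinois method, the following generic sketch shows the idea on a smooth test function. It is not the NAG implementation and does not deal with the step-function nature of $U(θ_{0})$. The function name `illinois`, the tolerances and the test function are all assumptions made for illustration.

```python
def illinois(f, a, b, tol=1e-12, max_iter=100):
    """Find a root of f in [a, b] with the Illinois variant of regula falsi."""
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if fa * fb > 0.0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    side = 0  # records which endpoint was replaced on the previous step
    c = a
    for _ in range(max_iter):
        # False-position step: intersect the chord through the bracket with 0.
        c = (a * fb - b * fa) / (fb - fa)
        fc = f(c)
        if abs(fc) < tol or abs(b - a) < tol:
            break
        if fc * fb > 0.0:
            # c lies on the same side of the root as b, so it replaces b.
            b, fb = c, fc
            if side == -1:
                fa *= 0.5  # Illinois modification: halve the retained value
            side = -1
        else:
            # c lies on the same side of the root as a, so it replaces a.
            a, fa = c, fc
            if side == +1:
                fb *= 0.5
            side = +1
    return c


root = illinois(lambda t: t**3 - 2.0, 0.0, 2.0)
```

Halving the function value at an endpoint that has been retained twice in a row is what stops plain regula falsi from stalling with one end of the bracket fixed.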

The confidence interval limits are also based on the inversion of the Mann–Whitney test statistic.

Given a desired percentage for the confidence interval, $1 - α$, expressed as a proportion between $0.0$ and $1.0$, initial estimates of the lower and upper confidence limits for the Mann–Whitney $U$ statistic are found:

$$\begin{aligned} U_{l} &= μ - 0.5 + (σ \times Φ^{-1}(α/2)) \\ U_{u} &= μ + 0.5 + (σ \times Φ^{-1}(1 - α/2)) \end{aligned}$$

where $Φ^{-1}$ is the inverse cumulative Normal distribution function.
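A minimal sketch of these initial estimates is given below, using the standard library's `statistics.NormalDist` for $Φ^{-1}$. The function name `initial_u_limits` is hypothetical, and the upper quantile is taken as $Φ^{-1}(1 - α/2)$, the symmetric counterpart of $Φ^{-1}(α/2)$; treat both as assumptions rather than as a description of the library's internals.

```python
from math import sqrt
from statistics import NormalDist


def initial_u_limits(n, m, clevel):
    """Rounded Normal-approximation starting values for (U_l, U_u)."""
    alpha = 1.0 - clevel
    mu = n * m / 2.0
    sigma = sqrt(n * m * (n + m + 1) / 12.0)
    z_lower = NormalDist().inv_cdf(alpha / 2.0)        # negative quantile
    z_upper = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # positive quantile
    u_l = round(mu - 0.5 + sigma * z_lower)
    u_u = round(mu + 0.5 + sigma * z_upper)
    return u_l, u_u


u_l, u_u = initial_u_limits(10, 10, 0.95)
```

With this choice of quantiles the two unrounded limits are symmetric about $μ$.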

$U_{l}$ and $U_{u}$ are rounded to the nearest integer values. These estimates are refined using an exact method, without taking ties into account, if $n + m \leq 40$ and $\max(n, m) \leq 30$, and a Normal approximation otherwise, to find $U_{l}$ and $U_{u}$ satisfying

$$\begin{aligned} P(U \leq U_{l}) &\leq α/2 \\ P(U \leq U_{l} + 1) &> α/2 \end{aligned}$$

and

$$\begin{aligned} P(U \geq U_{u}) &\leq α/2 \\ P(U \geq U_{u} - 1) &> α/2. \end{aligned}$$
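One way to evaluate these tail probabilities exactly, ignoring ties, is the classical counting recurrence for the null distribution of $U$. The sketch below is an illustration under that assumption; the names `u_count` and `lower_u_limit` are hypothetical, and the source does not say which exact method the library uses.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def u_count(u, n, m):
    """Number of orderings of n + m untied values with Mann-Whitney U == u."""
    if u < 0:
        return 0
    if n == 0 or m == 0:
        return 1 if u == 0 else 0
    # Condition on which sample supplies the largest observation.
    return u_count(u - m, n - 1, m) + u_count(u, n, m - 1)


def lower_u_limit(n, m, alpha):
    """Largest U_l with P(U <= U_l) <= alpha / 2, or -1 if there is none."""
    total = comb(n + m, n)  # all orderings are equally likely under the null
    cumulative = 0
    u_l = -1
    for u in range(n * m + 1):
        cumulative += u_count(u, n, m)
        if cumulative / total > alpha / 2:
            break
        u_l = u
    return u_l


u_l = lower_u_limit(10, 10, 0.05)
```

Because the null distribution of $U$ is symmetric about $nm/2$, the corresponding upper limit can be taken as $nm - U_{l}$.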

The function $U(θ_{0})$ is a monotonically increasing step function. It is the number of times a score in the second sample, $y_{j}$, precedes a score in the first sample, $x_{i} + θ_{0}$, where we only count a half if a score in the second sample actually equals a score in the first.
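The counting rule just described translates directly into a double loop. The sketch below uses a hypothetical function name, `mann_whitney_u`, and small integer samples chosen only for illustration; it evaluates the statistic on a grid of shifts to show the step-function behaviour.

```python
def mann_whitney_u(x, y, theta0):
    """Count pairs where y_j precedes x_i + theta0; ties count one half."""
    u = 0.0
    for xi in x:
        shifted = xi + theta0
        for yj in y:
            if yj < shifted:
                u += 1.0
            elif yj == shifted:
                u += 0.5
    return u


x = [1, 2, 3]
y = [2, 4, 6]
mu = len(x) * len(y) / 2
# Evaluate U on a grid of integer shifts from -2 to 6.
values = [mann_whitney_u(x, y, t) for t in range(-2, 7)]
```

For these samples the median of the differences $y_{j} - x_{i}$ is 2, and `mann_whitney_u(x, y, 2)` equals `mu`, in line with the defining equation for the point estimate.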

Let $U_{l} = k$; then $θ_{l} = d_{k+1}$. This is the largest value $θ_{l}$ such that $U(θ_{l}) = U_{l}$.

Let $U_{u} = nm - k$; then $θ_{u} = d_{nm-k}$. This is the smallest value $θ_{u}$ such that $U(θ_{u}) = U_{u}$.

As in the case of $\hat{θ}$, these equations may be solved using either the exact or iterative methods to find the values $θ_{l}$ and $θ_{u}$.

Then $(θ_{l}, θ_{u})$ is the confidence interval for $θ$. The confidence interval is thus defined by those values of $θ_{0}$ such that the null hypothesis, $θ = θ_{0}$, is not rejected by the Mann–Whitney two sample rank test at the $(100 \times α)\%$ level.
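For small samples the index rule for the limits, $θ_{l} = d_{k+1}$ and $θ_{u} = d_{nm-k}$, can be applied directly to the sorted differences. The sketch below is a brute-force illustration with a hypothetical function name, `interval_from_differences`; the library itself avoids forming all $nm$ differences.

```python
def interval_from_differences(x, y, k):
    """Return (d_{k+1}, d_{nm-k}) from the ascending differences y_j - x_i."""
    d = sorted(yj - xi for yj in y for xi in x)
    nm = len(d)
    if not 0 <= k < nm - k:
        raise ValueError("k must satisfy 0 <= k and k + 1 <= nm - k")
    # Python lists are 0-based: d_{k+1} is d[k], and d_{nm-k} is d[nm - k - 1].
    return d[k], d[nm - k - 1]


x = [1, 2, 3]
y = [2, 4, 6]
theta_l, theta_u = interval_from_differences(x, y, 1)
```

Here `k` plays the role of the lower limit $U_{l}$ of the Mann–Whitney statistic, so a larger `k` gives a narrower interval.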

References

Lehmann, E L, 1975, Nonparametrics: Statistical Methods Based on Ranks, Holden–Day

McKean, J W and Ryan, T A, 1977, Algorithm 516: An algorithm for obtaining confidence intervals and point estimates based on ranks in the two-sample location problem, ACM Trans. Math. Software (10), 183–185

Monahan, J F, 1984, Algorithm 616: Fast computation of the Hodges–Lehman location estimator, ACM Trans. Math. Software (10), 265–270
