NAG Toolbox: nag_contab_binary (g11sa)
Purpose
nag_contab_binary (g11sa) fits a latent variable model (with a single factor) to data consisting of a set of measurements on individuals in the form of binary-valued sequences (generally referred to as score patterns). Various measures of goodness-of-fit are calculated along with the factor (theta) scores.
Syntax
[x, irl, a, c, niter, alpha, pigam, cm, g, expp, obs, exf, y, iob, rlogl, chi, idf, siglev, ifail] = g11sa(n, gprob, x, irl, a, c, cgetol, chisqr, 'ip', ip, 'ns', ns, 'iprint', iprint, 'maxit', maxit)
[x, irl, a, c, niter, alpha, pigam, cm, g, expp, obs, exf, y, iob, rlogl, chi, idf, siglev, ifail] = nag_contab_binary(n, gprob, x, irl, a, c, cgetol, chisqr, 'ip', ip, 'ns', ns, 'iprint', iprint, 'maxit', maxit)
Description
Given a set of p dichotomous variables x̃ = (x_1, x_2, …, x_p)^T, where T denotes vector or matrix transpose, the objective is to investigate whether the association between them can be adequately explained by a latent variable model of the form (see Bartholomew (1980) and Bartholomew (1987))

  G{π_j(θ)} = α_{j0} + α_{j1}θ.   (1)

The x_j are called item responses and take the value 0 or 1. θ denotes the latent variable, assumed to have a standard Normal distribution over a population of individuals to be tested on p items. Call π_j(θ) = P(x_j = 1 | θ) the item response function: it represents the probability that an individual with latent ability θ will produce a positive response (1) to item j. α_{j0} and α_{j1} are item parameters which can assume any real values. The set of parameters α_{j1}, for j = 1, 2, …, p, being coefficients of the unobserved variable θ, can be interpreted as 'factor loadings'.

G is a function selected by you as either Φ^{-1} or logit, mapping the interval (0, 1) onto the whole real line. Data from a random sample of n individuals takes the form of the matrices X and r defined below:

  X = (x̃_1, x̃_2, …, x̃_s)^T,   r = (r_1, r_2, …, r_s)^T,

where x̃_l = (x_{l1}, x_{l2}, …, x_{lp})^T denotes the lth score pattern in the sample, r_l the frequency with which x̃_l occurs and s the number of different score patterns observed. (Thus Σ_{l=1}^{s} r_l = n.) It can be shown that the log-likelihood function is proportional to

  Σ_{l=1}^{s} r_l log P_l,

where

  P_l = P(x̃ = x̃_l) = ∫_{-∞}^{∞} P(x̃ = x̃_l | θ) φ(θ) dθ   (2)

(φ(θ) being the probability density function of a standard Normal random variable). P_l denotes the unconditional probability of observing score pattern x̃_l. The integral in (2) is approximated using Gauss–Hermite quadrature. If we take G(z) = logit z = log(z / (1 − z)) in (1) and reparameterise as follows,

  α_j = α_{j1},
  π_j = logit^{-1}(α_{j0}),

then (1) reduces to the logit model (see Bartholomew (1980))

  π_j(θ) = π_j / (π_j + (1 − π_j) e^{−α_j θ}).

If we take G(z) = Φ^{-1}(z) (where Φ is the cumulative distribution function of a standard Normal random variable) and reparameterise as follows,

  α_j = α_{j1} / sqrt(1 + α_{j1}^2),
  γ_j = Φ(α_{j0} / sqrt(1 + α_{j1}^2)),

then (1) reduces to the probit model (see Bock and Aitkin (1981))

  π_j(θ) = Φ((Φ^{-1}(γ_j) + α_j θ) / sqrt(1 − α_j^2)).

An E-M algorithm (see Bock and Aitkin (1981)) is used to maximize the log-likelihood function. The number of quadrature points used is set initially to 10 and, once convergence is attained, increased to 20.

The theta score of an individual responding in score pattern x̃_l is computed as the posterior mean, i.e., E(θ | x̃_l). For the logit model the component score X_l = Σ_{j=1}^{p} α_j x_{lj} is also calculated. (Note that in calculating the theta scores and measures of goodness-of-fit nag_contab_binary (g11sa) automatically reverses the coding on item j if the estimated α_j is negative; it is assumed in the model that a response at the 1 level is showing a higher measure of latent ability than a response at the 0 level.)
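For illustration only (not part of the library documentation), the fitted logit item response curves can be evaluated directly from the returned estimates. The sketch below assumes alpha and pigam are the outputs of g11sa for the logit model (gprob = false).

% Illustrative sketch (assumes alpha and pigam returned by g11sa with
% gprob = false, i.e., the logit model):
% pi_j(theta) = pigam(j)/(pigam(j) + (1-pigam(j))*exp(-alpha(j)*theta)).
theta   = linspace(-3, 3, 61)';              % grid of latent ability values
piTheta = zeros(numel(theta), numel(alpha)); % one column per item
for j = 1:numel(alpha)
  piTheta(:, j) = pigam(j) ./ (pigam(j) + (1 - pigam(j)) .* exp(-alpha(j)*theta));
end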
The frequency distribution of score patterns is required as input data. If your data is in the form of individual score patterns (uncounted), then
nag_contab_binary_service (g11sb) may be used to calculate the frequency distribution.
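If the raw data are instead held as an n-by-p logical matrix with one row per individual, the score-pattern form required by nag_contab_binary (g11sa) can also be built with standard MATLAB tools; a minimal sketch, assuming such a matrix xi, is shown below.

% Minimal sketch (assumes xi is an n-by-ip logical matrix of raw responses,
% one row per individual); nag_contab_binary_service (g11sb) offers the
% same facility.
[x, ~, idx] = unique(xi, 'rows');   % distinct score patterns
irl = int64(accumarray(idx, 1));    % frequency of each pattern
ns  = int64(size(x, 1));            % number of distinct patterns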
References
Bartholomew D J (1980) Factor analysis for categorical data (with Discussion) J. Roy. Statist. Soc. Ser. B 42 293–321
Bartholomew D J (1987) Latent Variable Models and Factor Analysis Griffin
Bock R D and Aitkin M (1981) Marginal maximum likelihood estimation of item parameters: Application of an E-M algorithm Psychometrika 46 443–459
Parameters
Compulsory Input Parameters
- 1: n – int64int32nag_int scalar
  n, the number of individuals in the sample.
Constraint:
.
- 2: gprob – logical scalar
  Must be set equal to true if G(z) = Φ^{-1}(z) and false if G(z) = logit z.
- 3: x(ldx,ip) – logical array
  ldx, the first dimension of the array, must satisfy the constraint ldx ≥ ns.
  The first ns rows of x must contain the ns different score patterns. The lth row of x must contain the lth score pattern, with x(l,j) set equal to true if x_{lj} = 1 and false if x_{lj} = 0. All rows of x must be distinct.
- 4: irl(ns) – int64int32nag_int array
  The lth component of irl must be set equal to the frequency with which the lth row of x occurs.
  Constraints:
  - irl(l) ≥ 0, for l = 1, 2, …, ns;
  - Σ_{l=1}^{ns} irl(l) = n.
- 5: a(ip) – double array
  a(j) must be set equal to an initial estimate of α_{j1}. In order to avoid divergence problems with the E-M algorithm you are strongly advised to set all the a(j) to 0.5.
- 6: c(ip) – double array
  c(j) must be set equal to an initial estimate of α_{j0}. In order to avoid divergence problems with the E-M algorithm you are strongly advised to set all the c(j) to 0.0.
- 7: cgetol – double scalar
  The accuracy to which the solution is required.
  If cgetol is set to 10^{-l} and on exit ifail = 0 or 7, then all elements of the gradient vector will be smaller than 10^{-l} in absolute value. For most practical purposes the value 10^{-4} should suffice. You should be wary of setting cgetol too small since the convergence criterion may then have become too strict for the machine to handle.
  If cgetol has been set to a value which is less than the square root of the machine precision, ε, then nag_contab_binary (g11sa) will use the value sqrt(ε) instead (see the sketch below).
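For example, the tolerance could be chosen as follows; this is only a sketch, and the sqrt(eps) guard simply mirrors the substitution described above.

% Sketch: 1.0e-4 suffices for most purposes; anything below sqrt(eps)
% would be replaced by sqrt(machine precision) inside the routine anyway.
cgetol = max(1.0e-4, sqrt(eps));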
- 8: chisqr – logical scalar
  If chisqr is set equal to true, then a likelihood ratio statistic will be calculated (see chi).
  If chisqr is set equal to false, no such statistic will be calculated.
Optional Input Parameters
- 1: ip – int64int32nag_int scalar
  Default: the dimension of the arrays a, c and the second dimension of the array x. (An error is raised if these dimensions are not equal.)
  p, the number of dichotomous variables.
  Constraint: ip ≥ 3.
- 2: ns – int64int32nag_int scalar
  Default: the dimension of the array irl and the first dimension of the array x. (An error is raised if these dimensions are not equal.)
  ns must be set equal to the number of different score patterns in the sample, s.
Constraint:
.
- 3: iprint – int64int32nag_int scalar
Default:
  The frequency with which the maximum likelihood search function is to be monitored.
  - iprint > 0: the search is monitored once every iprint iterations, and when the number of quadrature points is increased, and again at the final solution point.
  - iprint = 0: the search is monitored once at the final point.
  - iprint < 0: the search is not monitored at all.
  iprint should normally be set to a small positive number.
- 4: maxit – int64int32nag_int scalar
Default:
  The maximum number of iterations to be made in the maximum likelihood search. There will be an error exit (see Error Indicators and Warnings) if the search function has not converged in maxit iterations.
  Constraint: maxit ≥ 1.
Output Parameters
- 1: x(ldx,ip) – logical array
  Given a valid parameter set then the first ns rows of x still contain the ns different score patterns. However, the following points should be noted:
  (i) If the estimated factor loading for the jth item is negative then that item is re-coded, i.e., 0s and 1s (or true and false) in the jth column of x are interchanged.
  (ii) The rows of x will be reordered so that the theta scores corresponding to rows of x are in increasing order of magnitude.
- 2: irl(ns) – int64int32nag_int array
  Given a valid parameter set then the first ns components of irl are reordered as are the rows of x.
- 3: a(ip) – double array
  a(j) contains the latest estimate of α_{j1}, for j = 1, 2, …, ip. (Because of possible recoding all elements of a will be positive.)
- 4: c(ip) – double array
  c(j) contains the latest estimate of α_{j0}, for j = 1, 2, …, ip.
- 5: niter – int64int32nag_int scalar
  Given a valid parameter set then niter contains the number of iterations performed by the maximum likelihood search function.
- 6: alpha(ip) – double array
  Given a valid parameter set then alpha(j) contains the latest estimate of α_j, for j = 1, 2, …, ip. (Because of possible recoding all elements of alpha will be positive.)
- 7: pigam(ip) – double array
  Given a valid parameter set then pigam(j) contains the latest estimate of either π_j if gprob = false (logit model) or γ_j if gprob = true (probit model), for j = 1, 2, …, ip.
- 8: cm – double array
  Given a valid parameter set then the strict lower triangle of cm contains the correlation matrix of the parameter estimates held in alpha and pigam on exit. The diagonal elements of cm contain the standard errors. Thus:
  cm(2i−1,2i−1) = standard error of alpha(i)
  cm(2i,2i)     = standard error of pigam(i)
  cm(2i,2i−1)   = correlation between pigam(i) and alpha(i),
  for i = 1, 2, …, ip;
  cm(2i−1,2j−1) = correlation between alpha(i) and alpha(j)
  cm(2i,2j)     = correlation between pigam(i) and pigam(j)
  cm(2i−1,2j)   = correlation between alpha(i) and pigam(j)
  cm(2i,2j−1)   = correlation between pigam(i) and alpha(j),
  for j = 1, 2, …, i−1.
  If the second derivative matrix cannot be computed then all the elements of cm are returned as zero.
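As an illustration (not part of the library example), the standard errors and correlations can be pulled out of cm as follows.

% Sketch: standard errors lie on the diagonal of cm; correlations sit in
% its strict lower triangle (alpha and pigam estimates interleaved).
se        = diag(cm);       % se(2*j-1) -> alpha(j), se(2*j) -> pigam(j)
corr_part = tril(cm, -1);   % correlations between parameter estimates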
- 9: g(2×ip) – double array
  Given a valid parameter set then g contains the estimated gradient vector corresponding to the final point held in the arrays alpha and pigam.
  g(2×j−1) contains the derivative of the log-likelihood with respect to alpha(j), for j = 1, 2, …, ip.
  g(2×j) contains the derivative of the log-likelihood with respect to pigam(j), for j = 1, 2, …, ip.
- 10: expp – double array
  Given a valid parameter set then expp(j,k) contains the expected percentage of individuals in the sample who respond positively to items j and k (j ≥ k), corresponding to the final point held in the arrays alpha and pigam.
- 11: obs – double array
  Given a valid parameter set then obs(j,k) contains the observed percentage of individuals in the sample who responded positively to items j and k (j ≥ k).
- 12: exf(ns) – double array
  Given a valid parameter set then exf(l) contains the expected frequency of the lth score pattern (lth row of x), corresponding to the final point held in the arrays alpha and pigam.
- 13: y(ns) – double array
  Given a valid parameter set then y(l) contains the estimated theta score corresponding to the lth row of x, for the final point held in the arrays alpha and pigam.
- 14: iob(ns) – int64int32nag_int array
  Given a valid parameter set then iob(l) contains the number of items in the lth row of x for which the response was positive (true).
- 15: rlogl – double scalar
  Given a valid parameter set then rlogl contains the value of the log-likelihood kernel corresponding to the final point held in the arrays alpha and pigam, namely
  Σ_{l=1}^{ns} irl(l) × log(P_l).
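Since exf(l) is n times the fitted probability P_l of the lth score pattern, the kernel can be cross-checked from the returned quantities; a minimal sketch:

% Sketch: cross-check of the log-likelihood kernel, using exf(l) = n*P_l.
rlogl_check = sum(double(irl) .* log(exf / double(n)));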
- 16: chi – double scalar
  If chisqr was set equal to true on entry, then given a valid parameter set, chi will contain the value of the likelihood ratio statistic corresponding to the final parameter estimates held in the arrays alpha and pigam, namely
  2 × Σ irl(l) × log(irl(l)/exf(l)).
  The summation is over those elements of irl which are positive. If exf(l) is less than 5, then adjacent score patterns are pooled (the score patterns in x being first put in order of increasing theta score).
  If chisqr has been set equal to false, then chi is not used.
- 17: idf – int64int32nag_int scalar
  If chisqr was set equal to true on entry, then given a valid parameter set, idf will contain the degrees of freedom associated with the likelihood ratio statistic, chi.
  idf = s0 − 2 × ip if s0 < 2^ip;
  idf = s0 − 1 − 2 × ip if s0 = 2^ip,
  where s0 denotes the number of terms summed to calculate chi (s0 = ns only if there is no pooling).
  If chisqr has been set equal to false, then idf is not used.
- 18: siglev – double scalar
  If chisqr was set equal to true on entry, then given a valid parameter set, siglev will contain the significance level of chi based on idf degrees of freedom. If idf is zero or negative then siglev is set to zero.
  If chisqr was set equal to false, then siglev is not used.
- 19: ifail – int64int32nag_int scalar
  ifail = 0 unless the function detects an error (see Error Indicators and Warnings).
Error Indicators and Warnings
Note: nag_contab_binary (g11sa) may return useful information for one or more of the following detected errors or warnings.
Errors or warnings detected by the function:
Cases prefixed with W are classified as warnings and
do not generate an error of type NAG:error_n. See nag_issue_warnings.
- ifail = 1
  On entry, one or more of the input arguments is invalid: an argument is out of range or inconsistent with the others, two or more rows of x are identical, at least one element of irl is negative, or the elements of irl do not sum to n.
- ifail = 2
  For at least one of the ip items the responses are all at the same level. To proceed, you must delete this item from the dataset.
- ifail = 3
  There have been maxit iterations of the maximum likelihood search function. If steady increases in the log-likelihood kernel were monitored up to the point where this exit occurred, then the exit probably occurred simply because maxit was set too small, so the calculations should be restarted from the final point held in a and c (a restart along these lines is sketched at the end of this list). This type of exit may also indicate that there is no maximum to the likelihood surface.
- ifail = 4
  One of the elements of a has exceeded 10 in absolute value (see Heywood Cases). If steady increases in the log-likelihood kernel were monitored up to the point where this exit occurred then this exit may indicate that there is no maximum to the likelihood surface. You are advised to restart the calculations from a different point to see whether the E-M algorithm moves off in the same direction.
- ifail = 5
  This indicates a failure in nag_matop_real_symm_posdef_inv (f01ab), which is used to invert the second derivative matrix for calculating the variance-covariance matrix of the parameter estimates. It was also found that maxit iterations had been performed (see ifail = 3). The elements of cm will then have been set to zero on exit. You are advised to restart the calculations with a larger value for maxit.
- ifail = 6
  This indicates a failure in nag_matop_real_symm_posdef_inv (f01ab), which is used to invert the second derivative matrix for calculating the variance-covariance matrix of the parameter estimates. It was also found that one of the elements of a had exceeded 10 in absolute value (see ifail = 4). The elements of cm will then have been set to zero on exit. You are advised to restart the calculations from a different point to see whether the E-M algorithm moves off in the same direction.
- W ifail = 7
  If chisqr was set equal to true on entry, so that a likelihood ratio statistic was calculated, then ifail = 7 merely indicates that the value of idf on exit is ≤ 0, i.e., the chi-square statistic is meaningless. In this case siglev is returned as zero. All other output information should be correct, i.e., can be treated as if ifail = 0 on exit.
- ifail = -99
  An unexpected error has been triggered by this routine. Please contact NAG.
- ifail = -399
  Your licence key may have expired or may not have been installed correctly.
- ifail = -999
  Dynamic memory allocation failed.
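A restart along the lines suggested for ifail = 3 might look like the following sketch; the larger limit of 2000 iterations is an arbitrary illustrative choice, and all other arguments are reused from the call that failed.

% Sketch: restart the search from the returned a and c, allowing more
% iterations (2000 here is an arbitrary illustrative choice).
[x, irl, a, c, niter, alpha, pigam, cm, g, expp, obs, exf, y, ...
 iob, rlogl, chi, idf, siglev, ifail] = ...
    g11sa(n, gprob, x, irl, a, c, cgetol, chisqr, 'maxit', int64(2000));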
Accuracy
On exit from nag_contab_binary (g11sa), if ifail = 0 or 7 then the following condition will be satisfied:
  max_{1 ≤ j ≤ 2×ip} |g(j)| < cgetol.
If ifail = 3 or 5 on exit (i.e., maxit iterations have been performed but the above condition does not hold), then the elements in a, c, alpha and pigam may still be good approximations to the maximum likelihood estimates. You are advised to inspect the elements of g to see whether this is confirmed.
Further Comments
Timing
The number of iterations required in the maximum likelihood search depends upon the number of observed variables, p, and the distance of the starting point you supply from the solution. The number of multiplications and divisions performed in an iteration is proportional to p.
Initial Estimates
You are strongly advised to use the recommended starting values for the elements of a and c. Divergence may result from other values you supply, even ones very close to the solution. Divergence may also occur when an item has nearly all its responses at one level.
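For a p-item fit the recommended starting point can be set up as follows (a sketch; p = 4 matches the worked example below).

% Sketch: recommended starting point -- all loadings 0.5, all intercepts 0.
p = 4;                  % number of items (value used in the example below)
a = 0.5 * ones(p, 1);
c = zeros(p, 1);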
Heywood Cases
As in normal factor analysis, Heywood cases can often occur, particularly when p is small and n is not very big. To overcome this difficulty the maximum likelihood search function is terminated when the absolute value of one of the α_{j1} exceeds 10.0.
You have the option of deciding whether to exit from nag_contab_binary (g11sa) when this occurs, or to permit the calculation to proceed onwards as if it had exited normally from the maximum likelihood search function (see Error Indicators and Warnings).
The elements in
a,
c,
alpha and
pigam may still be good approximations to the maximum likelihood estimates. You are advised to inspect the elements of g to see whether this is confirmed.
Goodness of Fit Statistic
When n is not very large compared to s, a goodness-of-fit statistic should not be calculated as many of the expected frequencies will then be less than 5.
First and Second Order Margins
The observed and expected percentages of sample members responding to individual and pairs of items, held in the arrays obs and expp on exit, can be converted to observed and expected numbers by multiplying all elements of these two arrays by n/100, as illustrated below.
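For example (a sketch, assuming obs, expp and n are as returned by, or passed to, g11sa):

% Sketch: convert the percentage margins returned in obs and expp to
% observed and expected counts for a sample of size n.
obs_counts  = obs  * double(n) / 100;
expp_counts = expp * double(n) / 100;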
Example
A program to fit the logit latent variable model to the following data:
Index | Score Pattern | Observed Frequency
    1 | 0000          |  154
    2 | 1000          |   11
    3 | 0001          |   42
    4 | 0100          |   49
    5 | 1001          |    2
    6 | 1100          |   10
    7 | 0101          |   27
    8 | 0010          |   84
    9 | 1101          |   10
   10 | 1010          |   25
   11 | 0011          |   75
   12 | 0110          |  129
   13 | 1011          |   30
   14 | 1110          |   50
   15 | 0111          |  181
   16 | 1111          |  121
      |               | ––––
Total |               | 1000
Open in the MATLAB editor:
g11sa_example
function g11sa_example
fprintf('g11sa example results\n\n');
n = int64(1000);
x = [false, false, false, false;
true, false, false, false;
false, false, false, true;
false, true, false, false;
true, false, false, true;
true, true, false, false;
false, true, false, true;
false, false, true, false;
true, true, false, true;
true, false, true, false;
false, false, true, true;
false, true, true, false;
true, false, true, true;
true, true, true, false;
false, true, true, true;
true, true, true, true];
irl = [int64(154); 11; 42; 49;
2; 10; 27; 84;
10; 25; 75; 129;
30; 50; 181; 121];
a = [0.5; 0.5; 0.5; 0.5];
c = [0; 0; 0; 0];
gprob = false;
cgetol = 0.0001;
chisqr = true;
iprint = int64(-1);
[x, irl, a, c, niter, alpha, pigam, cm, g, expp, obs, exf, y, ...
iob, rlogl, chi, idf, siglev, ifail] = ...
g11sa( ...
n, gprob, x, irl, a, c, cgetol, chisqr, 'iprint', iprint);
fprintf('Log likelihood kernel on exit = %14.4e\n\n', rlogl);
fprintf('Maximum likelihood estimates of item parameters are as follows\n');
fprintf('--------------------------------------------------------------\n\n');
fprintf('%8s%12s%11s%16s%10s%13s\n\n', 'item j', 'alpha(j)', 's.e.', ...
'alpha(j,0)', 'pi(j)', 's.e.');
ivar = [1:numel(a)]';
se = diag(cm);
results = [ivar alpha se(1:2:end) c pigam se(2:2:end)];
fprintf('%5d%13.3f%13.3f%13.3f%13.3f%13.3f\n',results');
fprintf('\n\n');
mtitle = 'Expected percentage of cases producing positive responses';
[ifail] = x04ca( ...
'Lower', 'Non-unit', expp, mtitle);
fprintf('\n');
mtitle = 'Observed percentage of cases producing positive responses';
[ifail] = x04ca( ...
'Lower', 'Non-unit', obs, mtitle);
fprintf('\n\n');
fprintf(' Observed Expected Theta Component Raw Score\n');
fprintf(' frequency frequency score score score pattern\n\n');
cs = double(x)*alpha;
results = [ double(irl) exf y cs double(iob) double(x)];
fprintf('%7d%13.3f%8.3f%10.3f%8d %1d%1d%1d%1d\n',results');
fprintf('--------- ---------\n');
fprintf('%7d%13.3f\n\n',n,n);
fprintf('Likelihood ratio goodness of fit statistic = %10.3f\n', chi);
fprintf(' Significance level = %10.3f\n', siglev);
fprintf('(Based on %4d degrees of freedom)\n',idf)
g11sa example results
Log likelihood kernel on exit = -2.4039e+03
Maximum likelihood estimates of item parameters are as follows
--------------------------------------------------------------
item j alpha(j) s.e. alpha(j,0) pi(j) s.e.
1 1.045 0.148 -1.276 0.218 0.017
2 1.409 0.179 0.424 0.604 0.022
3 2.659 0.525 1.615 0.834 0.036
4 1.122 0.140 -0.062 0.485 0.020
Expected percentage of cases producing positive responses
1 2 3 4
1 25.8963
2 19.0888 57.6547
3 22.4987 47.9571 69.4214
4 16.4381 33.8672 40.5658 48.7712
Observed percentage of cases producing positive responses
1 2 3 4
1 25.9000
2 19.1000 57.7000
3 22.6000 48.1000 69.5000
4 16.3000 33.9000 40.7000 48.8000
Observed Expected Theta Component Raw Score
frequency frequency score score score pattern
154 147.061 -1.273 0.000 0 0000
11 13.444 -0.873 1.045 1 1000
42 42.420 -0.846 1.122 1 0001
49 54.818 -0.747 1.409 1 0100
2 5.886 -0.494 2.167 2 1001
10 8.410 -0.399 2.455 2 1100
27 27.511 -0.374 2.531 2 0101
84 92.062 -0.332 2.659 1 0010
10 6.237 -0.019 3.577 3 1101
25 21.847 0.027 3.705 2 1010
75 73.835 0.055 3.781 2 0011
129 123.766 0.162 4.069 2 0110
30 26.899 0.466 4.826 3 1011
50 50.881 0.591 5.114 3 1110
181 179.564 0.626 5.190 3 0111
121 125.360 1.144 6.236 4 1111
--------- ---------
1000 1000.000
Likelihood ratio goodness of fit statistic = 9.027
Significance level = 0.251
(Based on 7 degrees of freedom)
© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2015