Experimental Protocol For Document Image Algorithm Evaluation
-------------------------------------------------------------

Robert M. HARALICK

Department of Electrical Engineering
University of Washington, Seattle, WA 98195 USA
haralick@ee.washington.edu Tel. +1-206-685-4974

Evaluating the performance of document image analysis algorithms is not
simple. In the general case where there is a performance requirement
specification, a point estimate of performance is not sufficient. Yet
point estimates of performance seem to predominate in papers discussing
performance or evaluation. In this paper, we sketch out the elements
of doing any kind of document image analysis performance evaluation.

The context in which performance evaluation is done is the controlled
experiment. Controlled experiments provide evidence that the algorithm
segments, recognizes , locates, and/or measures what it is designed to do
from image data. A properly designed scientific experiment provides
evidence to accept or reject the hypothesis that the algorithm performs
to the specified accuracy level, as given in the requirements specification.
To properly set up such an experiment, in a way that it can be repeated
and the evidence verified by another researcher, considerable attention
must be paid to experimental protocol.

The experimental protocol states the quantities to be estimated, how
they will be estimated, the accuracy to which they will be estimated,
and the population of document images on which the document image algorithm
is to be applied, and how the tuning parameters of the algorithm are to
be set or varied. Then the protocol must give the experimental design and
the data analysis plan.

The experimental design describes how a suitably random, independent, and
representative set of images from the specified population is to be sampled,
generated, or acquired. If the population includes, for example, a range of
sizes or orientations of the characters or zones or layouts or if the document
image objects of interest can appear in a variety of contextual situations,
then the sampling mechanism must assure that a reasonable number of images
are sampled with the document image objects appearing in each of the
different sizes or appearing with sizes and orientations throughout
its permissible range.

In general, the essential variation of the images in the population can
be described by some number, say N variables. If the N variables
X_1,...,X_N have respective range sets R_1,...,R_N, then the sampling
design must assure that images sample the domain
R_1 x R_2 x ... R_N
in a representative way. The number of samples may only be a small fraction
of the number of possibilities. This suggests that the experimental design
may have to make judicious use of a latin square layout. If there are some
free tuning parameters that must be set a priori in the document image
algorithm, then these parameters can as well be part of the N vaiables,
in which case, the statistical analysis will be conditioned on their
values and the experimental results can plot performance as a function
of their values.

The experiments must then be carried out for each image in the sample.
Suppose there are M different measurements Y_1,...,Y_M a document image
algorithm is to make on each image. These variables would be a subset
of the N variables which might describe the images in the population of
interest. As a result of these measurements, which are really statistical
inferences, there will be a difference between the true values y_1,...,y_M
of the measured quantities and the measured values themselves. The accuracy
criterion must state how the comparision between the true values and the
measured values will be evaluated.

Finally, the exerimental data analysis plan must state how the hypothesis
that the algorithm meets the specified requirement will be tested. It
indicates how the observed data (the true values and the corresponding
measured values) will be analyzed. The plan must be supported by
theoretically developed statistical analysis which shows that an experiment
carried out according to the experimental design and analyzed according
to the data analysis plan will produce a statistical test itself having
a given accuracy. That is, since the entire population was only sampled,
the sampling fluctuation will introduce random fluctuation in the test
results. For some fraction of experiments carried out according to the
protocol, the hypothesis to be tested will be accepted but if the algorithm
were to be tried on the complete population of image variations, it would
not meet the specified requirements; and for some fraction of experiments
carried out according to the protocol, the hypotheis to be tested will
be rejected but if the algorithm were to be tried on the complete population
of image variations, it would meet the specified requirements.

The specified size of these errors of false acceptance and missed
acceptance will dictate the number of images to be in the sample. This
relation between sample size and false acceptance rate and missed
acceptance rate of the test for the hypothesis must be determined on the
basis of statistical theory. One would certainly expect that the sample
size would be large enough so that these error rates would be below 20%.

For example, if the error rate of a document image algorithm is to be
less than 1/1000, then in order to be about 85% sure that the performance
meets specification, 10,000 tests will have to be run. If the document
image algorithm performs incorrectly 9 or fewer times, then we can assert
that with 85% probability, the document image algorithm meets specification
(Haralick, 1989).

In summary, any experimental protocol must include the following items.

1. The protocol must state:
- what is to be measured on each experimental trial
- what is to be inferred (estimated) from the measurements
or what hypothesis is to be tested (the requirement)
- what is the theoretically expected behavior of what
is to be estimated
- with what precision (standard deviation) is the inferred
quantity to be estimated or with what misdetect and false
alarm error is the hypothesis to be tested
- over what population of document images and algorithm
tuning parameters are the experiments to be done

2. The protocol must layout an experimental design which describes
- how a suitable random, independent, and representative set
of images from the specified population is to be sampled,
generated, or acquired, including the setting of all parameters
relevant to the population
- what parameters of the algorithm will be varied and
how their values will be varied
- what parameters of the algorithm will be set and what values
they will be set at
- the experiments which must carried out for each image
in the population
- the accuracy criterion which states how the comparison between
the true values and the measured values will be evaluated

3. The protocol must have a data analysis plan which
- states how to statistically compare the behavior
of what is estimated and the theoretically expected behavior
of what is estimated to verify the hypothesis that the
observed behavior is consistent with the theoretical expectations
- states how to test the hypothesis that the algorithm meets the
specified requirement
- indicates how the data (the true values and the corresponding
measured values) will be analyzed
- tells explicitly what performance curves will be generated

4. The data analysis plan
- must be supported by a theoretically developed statistical
analysis; and
- show that an experiment carried out according to the experimental
design and analyzed according the data analysis plan will
produce results having a given accuracy (i.e., the rates of
false alarms and miss detections due to the experimental
sampling variation meet the stated requirement)