Experimental Protocol For Document Image Algorithm Evaluation
      -------------------------------------------------------------
	                Robert M. HARALICK
	       Department of Electrical Engineering
	   University of Washington, Seattle, WA 98195 USA
           haralick@ee.washington.edu  Tel. +1-206-685-4974
   Evaluating the performance of document image analysis algorithms is not
simple.  In the general case where there is a performance requirement
specification, a point estimate of performance is not sufficient.  Yet
point estimates of performance seem to predominate in papers discussing
performance or evaluation.  In this paper, we sketch out the elements
of doing any kind of document image analysis performance evaluation.
   The context in which performance evaluation is done is the controlled
experiment.  Controlled experiments  provide evidence that the algorithm
segments, recognizes , locates, and/or measures what it is designed to do
from image data.  A properly designed scientific experiment provides
evidence to accept or reject the hypothesis that the algorithm performs
to the specified accuracy level, as given in the requirements specification.
To properly set up such an experiment, in a way that it can be repeated
and the evidence verified by another researcher, considerable attention
must be paid to experimental protocol.
   The experimental protocol states the quantities to be estimated, how
they will be estimated, the accuracy to which they will be estimated,
and the population of document images on which the document image algorithm
is to be applied, and how the tuning parameters of the algorithm are to
be set or varied.  Then the protocol must give the experimental design and
the data analysis plan.
   The experimental design describes how a suitably random, independent, and
representative set of images from the specified population is to be sampled,
generated, or acquired.  If the population includes, for example, a range of
sizes or orientations of the characters or zones or layouts or if the document
image objects of interest can appear in a variety of contextual situations, 
then the sampling mechanism must assure that a reasonable number of images 
are sampled with the document image objects appearing in each of the 
different sizes or appearing with sizes and orientations throughout 
its permissible range.
   In general, the essential variation of the images in the population can 
be described by some number, say N variables.  If the N variables
X_1,...,X_N have respective range sets R_1,...,R_N, then the sampling
design must assure that images sample the domain
	R_1 x R_2 x ... R_N
in a representative way.  The number of samples may only be a small fraction 
of the number of possibilities.  This suggests that the experimental design 
may have to make judicious use of a latin square layout.  If there are some 
free tuning parameters that must be set a priori in the document image 
algorithm, then these parameters can as well be part of the N vaiables,
in which case, the statistical analysis will be conditioned on their
values and the experimental results can plot performance as a function
of their values.
   The experiments must then be carried out for each image in the sample.
Suppose there are M different measurements Y_1,...,Y_M a document image
algorithm is to make on each image.  These variables would be a subset
of the N variables which might describe the images in the population of
interest.  As a result of these measurements, which are really statistical
inferences, there will be a difference between the true values y_1,...,y_M
of the measured quantities and the measured values themselves.  The accuracy
criterion must state how the comparision between the true values and the
measured values will be evaluated.
   Finally, the exerimental data analysis plan must state how the hypothesis
that the algorithm meets the specified requirement will be tested.  It
indicates how the observed data (the true values and the corresponding
measured values) will be analyzed.  The plan must be supported by 
theoretically developed statistical analysis which shows that an experiment 
carried out according to the experimental design and analyzed according 
to the data analysis plan will produce a statistical test itself having 
a given accuracy.  That is, since the entire population was only sampled, 
the sampling fluctuation will introduce random fluctuation in the test 
results.  For some fraction of experiments carried out according to the 
protocol, the hypothesis to be tested will be accepted but if the algorithm
were to be tried on the complete population of image variations, it would 
not meet the specified requirements; and for some fraction of experiments 
carried out according to the protocol, the hypotheis to be tested will 
be rejected but if the algorithm were to be tried on the complete population 
of image variations, it would meet the specified requirements.  
   The specified size of these errors of false acceptance and missed 
acceptance will dictate the number of images to be in the sample.  This 
relation between sample size and false acceptance rate and missed 
acceptance rate of the test for the hypothesis must be determined on the 
basis of statistical theory.  One would certainly expect that the sample 
size would be large enough so that these error rates would be below 20%.
   For example, if the error rate of a document image algorithm is to be
less than 1/1000, then in order to be about 85% sure that the performance
meets specification, 10,000 tests will have to be run.  If the document
image algorithm performs incorrectly 9 or fewer times, then we can assert
that with 85% probability, the document image algorithm meets specification
(Haralick, 1989).
   In summary, any experimental protocol must include the following items.
    1. The protocol must state:
	-  what is to be measured  on each experimental trial
	-  what is to be inferred (estimated) from the measurements
	   or what hypothesis is to be tested (the requirement)
	-  what is the theoretically expected behavior of what
	   is to be estimated
	-  with what precision (standard deviation) is the inferred
	   quantity to be estimated or with what misdetect and false
	   alarm error is the hypothesis to be tested
	-  over what population of  document images and algorithm
	   tuning parameters are the experiments to be done
    2.  The protocol must layout an experimental design which describes
	-  how a suitable random, independent, and representative set
	   of images from the specified population is to be sampled,
	   generated, or acquired, including the setting of all parameters 
	   relevant to the population
	-  what parameters of the algorithm will be varied and
	   how their values will be varied
	-  what parameters of the algorithm will be set and what values
	   they will be set at
	-  the experiments which must carried out for each image 
           in the population
	-  the accuracy criterion which states how the comparison between 
           the true values and the measured values will be evaluated
    3.  The protocol must have a data analysis plan which
	-  states how to statistically compare the behavior
	   of what is estimated and the theoretically expected behavior
	   of what is estimated to verify the hypothesis that the
	   observed behavior is consistent with the theoretical expectations
	-  states how to test the  hypothesis that the algorithm meets the 
           specified requirement
	-  indicates how the data (the true values and the corresponding
           measured values) will be analyzed
	-  tells explicitly what performance curves will be generated
    4.  The data analysis plan
	-  must be supported by a theoretically developed statistical
           analysis; and
	-  show that an experiment carried out according to the experimental 
 	   design and  analyzed according the data analysis plan will 
	   produce results having a given accuracy (i.e., the rates of 
	   false alarms and miss detections due to the experimental
	   sampling variation meet the stated requirement)