From: ritter@io.com (Terry Ritter)
Newsgroups: sci.crypt
Subject: Chi-Square Bias in Runs-Up/Down RNG Tests
Date: Mon, 30 Jun 1997 05:04:23 GMT
Organization: Illuminati Online
Message-ID: <33b73dff.2257810@news.io.com>


INTRODUCTION

When (pseudo) random number generators (RNGs) are tested under a "runs up" or "runs down" test, the expectation values described in Knuth II may not be appropriate.  Those expectation values seem designed for RNGs which produce a huge number of different values, and are not accurate for RNGs which generate only a small population of values.  These unreasonable expectations can produce the appearance of a problem, even where no problem exists.


BACKGROUND

Random number generators (RNGs) are state machines designed to produce values which pass statistical tests of randomness.  One of the better such tests, surprisingly, is the "runs up" test, described in Knuth II [1] starting on p. 65.  Basically, we take a sample value, whereupon we have a "run" of length 1.  Then we take another sample, and if this is larger (smaller) than the previous value, we have extended a run up (down) by one count.  Eventually the run "fails" and we increment a count for the previous length.

Typically, we use chi-square to compare each accumulated length count to the expected value.  But this implies that we *know* the expected value.  Knuth II p. 65 refers us to exercise 14 (p. 74), which is the statistically simpler form where a value is discarded between each run.  This in turn refers to the "Answers to Exercises" (p. 536), which gives the probability of each run length as:

|   1/r! - 1/(r+1)!

The first term is the probability of getting the given run length *or more*, and the second is the probability for the next higher run length *or more*.  Subtracting the second from the first gives us the probability of getting a run of *exactly* length r.

After examining figure (9) on p. 65, it is easy to imagine that runs up (down) testing -- and the above formula -- apply directly to small populations (there, 0..9).  The formula probably is correct for real values, and should be close for floats.  But the formula is not right for 8-bit values, and the difference is significant.


EXAMPLE

Suppose we have a total population of 4 values, 0..3.  If the first sample is 0, there are 3 possibilities which continue a run up.  If the first sample is 1, there are 2; if 2, then 1; and if the first sample is 3, there is no possibility of continuing a run up, because there is no higher value in the population.  So, out of the 16 possible ordered pairs of values, only 6 are runs up of length 2 (*):

|        0   1   2   3
|
|   0    -   *   *   *
|   1    -   -   *   *
|   2    -   -   -   *
|   3    -   -   -   -

The figure shows all possible runs up of length 2: the first sample selects the row, and the next selects the column.  The symbol - means we do not have a valid run up of 2; the symbol * means we do.

So we find that the exact probability is 6 out of 16, instead of the 8 out of 16 we expect from the Knuth formula.  This shows that the population size (the number of possible values) does indeed influence the probability value, independent of run length.


IMPLICATIONS

It is possible to count the number of acceptable byte combinations by scanning all possibilities of 2, 3, 4 bytes or more.  Dividing the number of successes by the total number of possibilities gives the probability of success.
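To make the counting concrete, here is a minimal sketch in Python (mine, not code from the original post; the name runs_up_count is hypothetical).  It scans all ordered r-tuples over a population of N values and counts those which are strictly increasing, i.e. the patterns which can begin a run up of length r or more.  The full 8-bit scan out to 6 terms is the expensive computation described next; the demo lines stick to the cheap cases.

    # A sketch of the exhaustive counting approach; not the original code.
    from itertools import product

    def runs_up_count(N, r):
        """Count ordered r-tuples over 0..N-1 which are strictly
        increasing -- the patterns that can begin a run up of
        length r or more."""
        return sum(1 for t in product(range(N), repeat=r)
                   if all(a < b for a, b in zip(t, t[1:])))

    print(runs_up_count(4, 2))              # 6, as in the figure above
    print(runs_up_count(256, 2) / 256**2)   # 0.498046875, vs 1/2! = 0.5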
Approaching the problem this way gives us a handful of results which are both unarguably correct and as precise as we can handle.  For 8-bit values, pure brute force limits us to about 6 terms, which implies a maximum run length of only 5:

|   n    1/n!         by count
|
|   2    0.50000000   0.49804688
|   3    0.16666666   0.16471863
|   4    0.04166666   0.04069708
|   5    0.00833333   0.00801224
|   6    0.00138888   0.00130929

This generates a comparison by run length:

|   r    Knuth       counting    diff
|
|   1    0.500       0.502       0.002
|   2    0.333333    0.333328    0.000005
|   3    0.125       0.124       0.001
|   4    0.0333      0.0327      0.0006
|   5    0.0069      0.0067      0.0002

The difference between any pair of alternative probabilities is small.  But this difference is multiplied by the number of samples.  Then, in the chi-square computation, that result is squared and divided by the expected value.  For a particular trial of 51200 samples (200 * 256), we would have the following expectations and results:

|   r    Knuth      counting    diff     diff sq   chi sq term
|
|   1    25600.0    25497.6     102.4    10486     0.4
|   2    17066.6    17066.4
|   3     6400.0     6348.8      51.2     2621     0.4
|   4     1705.0     1674.2      30.76     946     0.55
|   5      353.3      343.0      10.0      100     0.28
|                                                  ----
|                                                  1.63

So, over lengths 1..5, we see a chi-square bias of 1.63; including lengths 6 and 7, we might predict a total bias of about 2.0 (this does in fact occur at length 6).  At 51,200 samples, we generally expect over 5 counts in each bin through r = 7, so we have 6 degrees of freedom.  With a DF of 6, a chi-square difference of 2.0 is sufficient to move a probability from 0.90 to 0.95, or from 0.95 to 0.98, approximately, and from 0.01 to 0.05; this is a significant bias.  This imbalance is obvious to the eye simply by noting the number of trials which have probabilities of 0.0xxxx versus those with probability 0.9xxxx.


EXPERIMENT

Exhaustive enumeration limits us to rather small term values, but this is hardly the only way to investigate the phenomenon.  Another way is to simply take a huge number of samples and report the accumulated probabilities.  With 100M or 10**8 runs, we can hope to see 4 decimal digits of accuracy in the experimental values:

|   r    Experiment   Knuth        counting
|
|   1    0.50194788   0.50000000   0.50195393
|   2    0.33332273   0.33333333   0.33332825
|   3    0.12400923   0.12500000   0.12402155
|   4    0.03270330   0.03333333   0.03268484
|   5    0.00670438   0.00694444   0.00670295
|   6    0.00112878   0.00119048   -1.0
|   7    0.00016154   0.00017361   -1.0
|   8    0.00001972   0.00002205   -1.0
|   9    0.00000222   0.00000248   -1.0
|  10    0.00000021   0.00000025   -1.0
|  11    0.00000001   0.00000002   -1.0

(Here -1.0 marks run lengths for which no exhaustive counting result is available.)

*  At r = 1 we see an exact 4-place match to counting results, with only a 2-place match for the equation.

*  At r = 2 we get a 5-place match (versus only 4 for the equation), but do not have the accuracy to believe it, so this particular line is inconclusive.

*  At r = 3 we see a 4-place match to counting, versus only 2 for the equation.

*  At r = 4 we would have a 4-place match if we round to 4 places; the equation only matches 2.

*  At r = 5 we again get a 5-place match to counting, versus only 3 for the equation.

*  At r = 6 and above, there are no counting results.

Based on this data, one is forced to conclude that the experimental run length values match the counting values typically 100x better than the alternative.  Further, this is as good as one could reasonably expect, given the size of the experiment.
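The sampling experiment is also easy to sketch.  Here is a minimal version in Python (mine, not the original test code; it assumes Python's built-in generator as the sample source, and implements the exercise-14 variant described above, where the sample which breaks a run is discarded and the next run starts fresh):

    # A sketch of the run-length sampling experiment; not the original code.
    import random
    from collections import Counter

    def tally_runs_up(n_runs, N=256, seed=1):
        rng = random.Random(seed)
        counts = Counter()
        for _ in range(n_runs):
            prev = rng.randrange(N)     # first sample: run of length 1
            length = 1
            while True:
                x = rng.randrange(N)
                if x > prev:            # run up continues
                    length += 1
                    prev = x
                else:                   # run fails; x is discarded
                    break
            counts[length] += 1
        return counts

    n = 10**6
    counts = tally_runs_up(n)
    for r in sorted(counts):
        print(f"{r:2d}  {counts[r] / n:.8f}")

At 10**6 runs this gives only about 3 stable digits; something like the 10**8 runs used above is needed for 4-digit agreement, and correspondingly more patience.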
THE COMBINATORIC MODEL

Consider what can happen with two samples from a population of 4:

|   S1   S2   run-up
|
|   0    0    -
|   0    1    *
|   0    2    *
|   0    3    *
|   1    0    -
|   1    1    -
|   1    2    *
|   1    3    *
|   2    0    -
|   2    1    -
|   2    2    -
|   2    3    *
|   3    0    -
|   3    1    -
|   3    2    -
|   3    3    -
|            ---
|             6

Here we have 6 ordered pairs which each can be a run up.  We do not know that they are runs of length 2, because that depends upon having a run failure (a non-increasing value) in the *next* sample.  But, clearly, each run of length 3 can only be built upon a successful run of length 2, so these pairs obviously constitute the only possibilities for all runs of length 2 or more.

The number of pairs which can participate in a run of length 2 is the number of symbols N taken 2 at a time, with no symbol repeated.  This is just the number of *combinations* of the possible symbols taken 2 at a time.  Each combination is a unique subset of the available symbols.  Each of these combinations can be arranged either in an increasing or a decreasing order, so for each combination there is exactly one run up possibility and one run down possibility.  Clearly, the number of possible runs up equals the number of possible runs down.  We also note that N possibilities (the pairs with both samples equal) can be neither.  So for length 2 we have:

|    N     N            N          2
|   ( ) + ( ) + N  =  2( ) + N  = N
|    2     2            2
|
|   (runs up + runs down + neither = total possibilities)

so that

|    N      2
|   ( ) = (N - N) / 2  =  N (N-1) / 2
|    2

which is both correct and familiar.

For higher run lengths, there are more possible orderings for each combination, but still only one of these will be a valid run up, and only one other will be a valid run down.  So if we are just counting possibilities, we can ignore the idea of ordering within each combination.  Since each possible combination will have a single success, we only need to find the number of combinations, which is a standard calculation.


COMPUTATION

The term probability function (the number of possibilities which can be part of a run up, divided by the total number of possibilities; runs down are identical by symmetry) is just another view of an old friend:

|              N       r
|   Pt(N,r) = ( )  /  N
|              r

The number of combinations of population N taken r at a time is the number of possible runs up of length r *and higher*.  So, to find the probability of runs of a particular length, we have:

|   P(N,r) = Pt(N,r) - Pt(N,r+1)

which, for N = 256, produces the following correct results:

|   Byte Runs Up/Dn Probability by Run Length
|
|    r    prob
|
|    1    0.501953125000
|    2    0.333328247070
|    3    0.124021545053
|    4    0.032684844686
|    5    0.006702946664
|    6    0.001126633669
|    7    0.000160449945
|    8    0.000019817478
|    9    0.000002159795
|   10    0.000000210491
|   11    0.000000018541
|   12    0.000000001489
|   13    0.000000000110
|   14    0.000000000007
|   15    0.000000000000

These values closely match the experimental results from 2 billion (2 * 10**9) total runs.  (The first 5 lines each match the experimental results in 5 places after the decimal; lines 6, 7 and 9 match 6 places; line 8 matches 7 places; and line 10 matches 8 places.)
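The closed form reduces to a few lines of code.  A sketch in Python (mine, not the original computation; math.comb requires Python 3.8 or later), using exact rationals so the only rounding is in the final print; it reproduces the table above:

    # A sketch of the closed-form computation; not the original code.
    from fractions import Fraction
    from math import comb               # Python 3.8+

    def pt(N, r):
        """P(run up of length r or more) = C(N,r) / N**r."""
        return Fraction(comb(N, r), N ** r)

    def p(N, r):
        """P(run up of exactly length r)."""
        return pt(N, r) - pt(N, r + 1)

    for r in range(1, 16):
        print(f"{r:2d}  {float(p(256, r)):.12f}")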
SUMMARY

A source of experimental bias in "runs up" RNG testing has been identified.  This bias is a result of using a formula perhaps intended for real values in tests of small populations.

Based on exact exhaustive results, the number of possible runs of length r and above corresponds to the number of possible combinations of N things taken r at a time, for population N.  This turns out to be a very straightforward and believable combinatoric model for these tests.  The computation also corresponds to experimentally determined values across a wide range of run lengths in massive experiments.

Correct expectation probabilities for byte value runs up or down have been tabulated.


REFERENCE

1.  Knuth, D.  1981.  The Art of Computer Programming, Vol. II: Seminumerical Algorithms.  Addison-Wesley.

---
Terry Ritter   ritter@io.com   http://www.io.com/~ritter/