From: ritter@io.com (Terry Ritter)
Newsgroups: sci.crypt
Subject: Chi-Square Bias in Runs-Up/Down RNG Tests
Date: Mon, 30 Jun 1997 05:04:23 GMT
Organization: Illuminati Online
Message-ID: <33b73dff.2257810@news.io.com>


INTRODUCTION

When (pseudo) random number generators (RNGs) are tested under a "runs up" or "runs down" test, the expectation values described in Knuth II may not be appropriate.  Those expectation values seem designed for RNGs which produce a huge number of different values, and are not accurate for RNGs which generate only a small population of values.  These unreasonable expectations can produce the appearance of a problem, even where no problem exists.


BACKGROUND

Random number generators (RNGs) are state machines designed to produce values which pass statistical tests of randomness.  One of the better such tests, surprisingly, is the "runs up" test, described in Knuth II [1] starting on p. 65.  Basically, we take a sample value, whereupon we have a "run" of length 1.  Then we take another sample, and if this is larger (smaller) than the previous value, we have extended a run up (down) by one count.  Eventually the run "fails" and we increment a count for the previous length.

Typically, we use chi-square to compare each accumulated length count to the expected value.  But this implies that we *know* the expected value.  Knuth II p. 65 refers us to exercise 14 (p. 74), which is the statistically simpler form where a value is discarded between each run.  This in turn refers to the "Answers to Exercises" (p. 536), which gives the probability of each run length as:

|   1/r! - 1/(r+1)!

The first term is the probability of getting the given run length *or more*, and the second is the probability for the next higher run length *or more*.  Subtracting the second from the first gives us the probability of getting a run of *exactly* length r.

After examining figure (9) on p. 65, it is easy to imagine that runs up (down) testing -- and the above formula -- apply directly to small populations (there, 0..9).  The formula probably is correct for real values, and should be close for floats.  But the formula is not right for 8-bit values, and the difference is significant.


EXAMPLE

Suppose we have a total population of 4 values, 0..3.  If the first sample is 0, there are 3 possibilities which continue a run up.  If the first sample is 1, there are 2; if 2, then 1; and if the first sample is 3, there is no possibility of continuing a run up, because there is no higher value in the population.  So, out of the 16 possible ordered pairs of values, only 6 are runs up of length 2 (*):

|        0   1   2   3
|
|   0    -   *   *   *
|   1    -   -   *   *
|   2    -   -   -   *
|   3    -   -   -   -

The figure shows all possible runs up of length 2: the first sample selects the row, and the next selects the column.  The symbol - means we do not have a valid run up of 2; the symbol * means we do.

So we find that the exact probability is 6 out of 16, instead of the 8 out of 16 we expect from the Knuth formula.  This shows that the population size (the number of possible values) does indeed influence the probability value, independent of run length.


IMPLICATIONS

It is possible to count the number of acceptable byte combinations by scanning all possibilities of 2, 3, 4 bytes or more.  Dividing the number of successes by the total number of possibilities gives the probability of success.
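To make the counting concrete, here is a minimal sketch in Python (mine, not code from the original post; the name runs_up_count is hypothetical).  It scans all ordered r-tuples over a population of N values and counts those which are strictly increasing, i.e. the patterns which can begin a run up of length r or more.  The full 8-bit scan out to 6 terms is the expensive computation described next; the demo lines stick to the cheap cases.

    # A sketch of the exhaustive counting approach; not the original code.
    from itertools import product

    def runs_up_count(N, r):
        """Count ordered r-tuples over 0..N-1 which are strictly
        increasing -- the patterns that can begin a run up of
        length r or more."""
        return sum(1 for t in product(range(N), repeat=r)
                   if all(a < b for a, b in zip(t, t[1:])))

    print(runs_up_count(4, 2))              # 6, as in the figure above
    print(runs_up_count(256, 2) / 256**2)   # 0.498046875, vs 1/2! = 0.5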
Approaching the problem this way gives us a handful of results which are both unarguably correct and as precise as we can handle.  For 8-bit values, pure brute force limits us to about 6 terms, which implies a maximum run length of only 5:

|   n    1/n!         by count
|
|   2    0.50000000   0.49804688
|   3    0.16666666   0.16471863
|   4    0.04166666   0.04069708
|   5    0.00833333   0.00801224
|   6    0.00138888   0.00130929

This generates a comparison by run length:

|   r    Knuth       counting    diff
|
|   1    0.500       0.502       0.002
|   2    0.333333    0.333328    0.000005
|   3    0.125       0.124       0.001
|   4    0.0333      0.0327      0.0006
|   5    0.0069      0.0067      0.0002

The difference between any pair of alternative probabilities is small.  But this difference is multiplied by the number of samples.  Then, in the chi-square computation, that result is squared and divided by the expected value.  For a particular trial of 51200 samples (200 * 256), we would have the following expectations and results:

|   r    Knuth      counting    diff     diff sq   chi sq term
|
|   1    25600.0    25497.6     102.4    10486     0.4
|   2    17066.6    17066.4
|   3     6400.0     6348.8      51.2     2621     0.4
|   4     1705.0     1674.2      30.76     946     0.55
|   5      353.3      343.0      10.0      100     0.28
|                                                  ----
|                                                  1.63

So, over lengths 1..5, we see a chi-square bias of 1.63; including lengths 6 and 7, we might predict a total bias of about 2.0 (this does in fact occur at length 6).  At 51,200 samples, we generally expect over 5 counts in each bin through r = 7, so we have 6 degrees of freedom.  With a DF of 6, a chi-square difference of 2.0 is sufficient to move a probability from 0.90 to 0.95, or from 0.95 to 0.98, approximately, and from 0.01 to 0.05; this is a significant bias.  This imbalance is obvious to the eye simply by noting the number of trials which have probabilities of 0.0xxxx versus those with probability 0.9xxxx.


EXPERIMENT

Exhaustive enumeration limits us to rather small term values, but this is hardly the only way to investigate the phenomenon.  Another way is to simply take a huge number of samples and report the accumulated probabilities.  With 100M or 10**8 runs, we can hope to see 4 decimal digits of accuracy in the experimental values:

|   r    Experiment   Knuth        counting
|
|   1    0.50194788   0.50000000   0.50195393
|   2    0.33332273   0.33333333   0.33332825
|   3    0.12400923   0.12500000   0.12402155
|   4    0.03270330   0.03333333   0.03268484
|   5    0.00670438   0.00694444   0.00670295
|   6    0.00112878   0.00119048   -1.0
|   7    0.00016154   0.00017361   -1.0
|   8    0.00001972   0.00002205   -1.0
|   9    0.00000222   0.00000248   -1.0
|  10    0.00000021   0.00000025   -1.0
|  11    0.00000001   0.00000002   -1.0

(Here -1.0 marks run lengths for which no exhaustive counting result is available.)

*  At r = 1 we see an exact 4-place match to counting results, with only a 2-place match for the equation.

*  At r = 2 we get a 5-place match (versus only 4 for the equation), but do not have the accuracy to believe it, so this particular line is inconclusive.

*  At r = 3 we see a 4-place match to counting, versus only 2 for the equation.

*  At r = 4 we would have a 4-place match if we round to 4 places; the equation only matches 2.

*  At r = 5 we again get a 5-place match to counting, versus only 3 for the equation.

*  At r = 6 and above, there are no counting results.

Based on this data, one is forced to conclude that the experimental run length values match the counting values typically 100x better than the alternative.  Further, this is as good as one could reasonably expect, given the size of the experiment.
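The sampling experiment is also easy to sketch.  Here is a minimal version in Python (mine, not the original test code; it assumes Python's built-in generator as the sample source, and implements the exercise-14 variant described above, where the sample which breaks a run is discarded and the next run starts fresh):

    # A sketch of the run-length sampling experiment; not the original code.
    import random
    from collections import Counter

    def tally_runs_up(n_runs, N=256, seed=1):
        rng = random.Random(seed)
        counts = Counter()
        for _ in range(n_runs):
            prev = rng.randrange(N)     # first sample: run of length 1
            length = 1
            while True:
                x = rng.randrange(N)
                if x > prev:            # run up continues
                    length += 1
                    prev = x
                else:                   # run fails; x is discarded
                    break
            counts[length] += 1
        return counts

    n = 10**6
    counts = tally_runs_up(n)
    for r in sorted(counts):
        print(f"{r:2d}  {counts[r] / n:.8f}")

At 10**6 runs this gives only about 3 stable digits; something like the 10**8 runs used above is needed for 4-digit agreement, and correspondingly more patience.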
THE COMBINATORIC MODEL

Consider what can happen with two samples from a population of 4:

|   S1   S2   run-up
|
|   0    0    -
|   0    1    *
|   0    2    *
|   0    3    *
|   1    0    -
|   1    1    -
|   1    2    *
|   1    3    *
|   2    0    -
|   2    1    -
|   2    2    -
|   2    3    *
|   3    0    -
|   3    1    -
|   3    2    -
|   3    3    -
|            ---
|             6

Here we have 6 ordered pairs which each can be a run up.  We do not know that they are runs of length 2, because that depends upon having a run failure (a non-increasing value) in the *next* sample.  But, clearly, each run of length 3 can only be built upon a successful run of length 2, so these pairs obviously constitute the only possibilities for all runs of length 2 or more.

The number of pairs which can participate in a run of length 2 is the number of symbols N taken 2 at a time, with no symbol repeated.  This is just the number of *combinations* of the possible symbols taken 2 at a time.  Each combination is a unique subset of the available symbols.  Each of these combinations can be arranged either in an increasing or a decreasing order, so for each combination there is exactly one run up possibility and one run down possibility.  Clearly, the number of possible runs up equals the number of possible runs down.  We also note that N possibilities (the pairs with both samples equal) can be neither.  So for length 2 we have:

|    N     N            N          2
|   ( ) + ( ) + N  =  2( ) + N  = N
|    2     2            2
|
|   (runs up + runs down + neither = total possibilities)

so that

|    N      2
|   ( ) = (N - N) / 2  =  N (N-1) / 2
|    2

which is both correct and familiar.

For higher run lengths, there are more possible orderings for each combination, but still only one of these will be a valid run up, and only one other will be a valid run down.  So if we are just counting possibilities, we can ignore the idea of ordering within each combination.  Since each possible combination will have a single success, we only need to find the number of combinations, which is a standard calculation.


COMPUTATION

The term probability function (the number of possibilities which can be part of a run up, divided by the total number of possibilities; runs down are identical by symmetry) is just another view of an old friend:

|              N       r
|   Pt(N,r) = ( )  /  N
|              r

The number of combinations of population N taken r at a time is the number of possible runs up of length r *and higher*.  So, to find the probability of runs of a particular length, we have:

|   P(N,r) = Pt(N,r) - Pt(N,r+1)

which, for N = 256, produces the following correct results:

|   Byte Runs Up/Dn Probability by Run Length
|
|    r    prob
|
|    1    0.501953125000
|    2    0.333328247070
|    3    0.124021545053
|    4    0.032684844686
|    5    0.006702946664
|    6    0.001126633669
|    7    0.000160449945
|    8    0.000019817478
|    9    0.000002159795
|   10    0.000000210491
|   11    0.000000018541
|   12    0.000000001489
|   13    0.000000000110
|   14    0.000000000007
|   15    0.000000000000

These values closely match the experimental results from 2 billion (2 * 10**9) total runs.  (The first 5 lines each match the experimental results in 5 places after the decimal; lines 6, 7 and 9 match 6 places; line 8 matches 7 places; and line 10 matches 8 places.)
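The closed form reduces to a few lines of code.  A sketch in Python (mine, not the original computation; math.comb requires Python 3.8 or later), using exact rationals so the only rounding is in the final print; it reproduces the table above:

    # A sketch of the closed-form computation; not the original code.
    from fractions import Fraction
    from math import comb               # Python 3.8+

    def pt(N, r):
        """P(run up of length r or more) = C(N,r) / N**r."""
        return Fraction(comb(N, r), N ** r)

    def p(N, r):
        """P(run up of exactly length r)."""
        return pt(N, r) - pt(N, r + 1)

    for r in range(1, 16):
        print(f"{r:2d}  {float(p(256, r)):.12f}")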
SUMMARY

A source of experimental bias in "runs up" RNG testing has been identified.  This bias is a result of using a formula perhaps intended for real values in tests of small populations.

Based on exact exhaustive results, the number of possible runs of length r and above corresponds to the number of possible combinations of N things taken r at a time, for population N.  This turns out to be a very straightforward and believable combinatoric model for these tests.  The computation also corresponds to experimentally determined values across a wide range of run lengths in massive experiments.

Correct expectation probabilities for byte value runs up or down have been tabulated.


REFERENCE

1.  Knuth, D.  1981.  The Art of Computer Programming, Vol. II: Seminumerical Algorithms.  Addison-Wesley.

---
Terry Ritter   ritter@io.com   http://www.io.com/~ritter/