PUBLISHED: Ritter, T. 1994. Estimating Population from Repetitions in Accumulated Random Samples. Cryptologia. 18(2):155-190.
ADDRESS: Ritter Software Engineering, 2609 Choctaw Trail, Austin, Texas 78745, (512) 892-0494
ABSTRACT: It was desired to estimate the number of unique values or codes produced by a physically-random number generator of the sort used to form cryptographic message keys; only the codes produced by the generator were available for analysis. Subsequently, the classical "birthday paradox" was reversed to estimate statistical population from the average number of repetitions found in trials containing a substantial number of random samples. Although similar in premise to previous work, a new relationship was found which includes combinatorial contributions from higher repetition levels. The resulting equations proved to be simple, intuitive, exact, and novel in that they were easily reversed to produce unique population estimates. The distribution of repetitions in different trials was investigated experimentally and found not inconsistent with a hypothesis of Poisson distribution. In practice, multiple trials establish the mean number of repetitions for a trial of a given size; the new equations then estimate population directly. These reversed birthday methods are able to detect population variations despite the presence of inverse probability effects which can hide variations from common distribution tests. Thus, reversed birthday methods could prove valuable in the certification of physically-random number generators. Reversed birthday methods also could be useful for estimating the effective number of keys in cipher certification tests as well as more general statistical applications.
KEYWORDS: Population estimation, birthday paradox, physically random generators, population deviation testing, augmented repetitions, augmented doubles, cryptography, statistics
Cryptographic systems often need "really random" numbers for various applications such as message keys and authentication values. Really random numbers are valuable because they follow no pattern, so no pattern can be predicted by cryptanalysis. Unfortunately, computing machinery is fully deterministic and so cannot be used to generate really random numbers; algorithmic processes are inherently pseudo-random. But really random numbers can be generated by special "physically random" hardware designs. Normally, such devices measure some complex physical effect such as thermal noise, diode electron leakage, quantum particle decay, chaotic fluid flow, etc.
Various tests of distribution can (and should) be applied to physically random generators, but such tests are not conclusive. In marked contrast to most pseudorandom designs, a physically random generator gives no guarantee that the population of possible values are all available with equal probability. Thus, there is a need for some technique to measure the population (or show that it exceeds some measurable value) and also show that the population is evenly distributed. Notice that any such technique is limited to using only the random values produced by such a generator.
When the issue of testing a really random generator came up in the Usenet sci.crypt group (see Appendix A), it was suggested that "the birthday paradox" could be used, in reverse, to estimate population. As it turns out, the birthday paradox has been discussed in cryptography for the past decade (for example, see Meyer and Matyas (1982) [14: 672-673], and Davies and Price (1984) [4: 116-117, 278-280]). The birthday paradox has been proposed for public-key distribution in Merkle [13: 13-21], used for DES analysis in Kaliski, Rivest and Sherman [6] (also described in Patterson [17: 156-166]), and forms the basis for authentication attacks described in Seberry and Pieprzyk [21: 157]. Birthday methods were apparently used by Letham, Hoff and Folmsbee (1986) [11] during tests of the hardware random number generator on the Intel 27916 "Keprom" (see Appendix B); results were given, but the theory of the test itself was not described.
The method of the investigation in this paper was to find easily-reversed equations for predicting the average number of repeated values found in trials containing substantial numbers of random samples. Some well-known simple equations, previously described as approximations [9], were found to be exact predictors of a new relationship here called augmented repetitions. This relationship was exhaustively analyzed on some small cases, and then tested experimentally with good results. Experiments indicated that the number of augmented repetitions in separate trials apparently occur in Poisson distribution. Multiple trials established a mean value for the augmented repetitions in a trial of a given size, which gave a good estimate of the parent population.
During this development, it became apparent that certain types of population deviation which could be detected by reversed birthday methods probably would not be detected by conventional statistical tests. Because of the importance of the cryptographic objects created from physically random devices, the possibility that some hardware designs may have unsuspected problems is more than just interesting.
There are fewer than 30 students in a particular classroom; is it likely that at least two of the students have the same birthday? (See, for example, Goodman and Ratti [5: 139-140].)
At first the question may seem preposterous: There are 365 possible birthdays, and only 30 students. The paradox is that it is nevertheless probable (probability exceeding 0.5) that some pair of students does have the same birthday, provided that we have at least 23 students.
How can this be? Well, let's think about it: Suppose, as students enter the classroom, they call out their birthday, and any other student in the room with the same birthday calls out "match":
The non-intuitive part of the paradox is that, while the number of students increases linearly, the number of pairs -- the number of possible matches -- increases combinatorially. When there are 23 students, there are (23 C 2) or 253 possible pairs; the next student gives us a total of 276 possible pairs. Whenever we add a single student, we can have a possible match with each student already in the room.
Fig. 1.1 Birthday Probability Logic. On its own, the first sample, W, has no probability of forming a double. With the second sample, X, there is just one possible match and therefore just one chance in N (for population N) of a match. The third sample, Y, presents two chances for a match (W and X), but the total probability of a match in three samples also includes all previous possible matches, for a total of three chances out of N. The logic for subsequent samples continues similarly.
The situation can be described by the flowchart in figure 1.1,
where each sample (each different student) is labeled with a
different letter (W, X, Y and Z). Each elemental test (each
diamond-shaped box) is a comparison to a single previous value;
the branch to the right is "match" or "success," while the path
downward is "no match." For a single comparison, there is one
possible match out of the entire population of N; thus, probability
of a match is just 1/N. If we have two comparisons (both will be
different, or we would have had a match with the previous sample),
the probability of success is 2/N. This leaves 1 - 2/N (of the
cases requiring a new sample!) to continue as the current
"no match" probability. The expression for "success" gets
complex; a good way to think about the problem is to follow the
path down; this is the cumulative probability of having no matches.
By following the path down in figure 1.1, we can write the probability expressions shown in table 1.1.
   Table 1.1  Unmatched Probability Expressions

      Samples   Probability of No Match
         1      1
         2      (1-1/N)
         3      (1-1/N)(1-2/N)
         4      (1-1/N)(1-2/N)(1-3/N)
Clearly, we have a well-defined sequence, and once we recognize this, it is relatively easy to compute the values for the birthday problem. We can formalize this slightly as the cumulative probability (from 0 to 1) of no match (q) for population N and the number of samples (students) s:
   (1.1)  Pq(N,s) = (1-1/N)(1-2/N)..(1-(s-1)/N)

and, since the total probability is 1, the probability of a match (p) is just 1 minus (1.1):

   (1.2)  Pd(N,s) = 1 - (1-1/N)(1-2/N)..(1-(s-1)/N) .
For a small population like 365, equation 1.2 allows us to compute the birthday probabilities shown in table 1.2.
   Table 1.2  Numerical Birthday Probabilities

      Students   Probability
          5        0.0271
         10        0.1169
         15        0.2529
         19        0.3791
         20        0.4114
         22        0.4757
         23        0.5073
         24        0.5383
         25        0.5687
         27        0.6269
         30        0.7063
         35        0.8144
         40        0.8912
         50        0.9704
         60        0.9941
         70        0.9992
         80        0.9999

But for a large population, while (1.2) may be term-by-term computable, there will be too many terms to compute. Fortunately, we can write (1.2) as
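For a small population, the product in (1.1) can be evaluated directly. A short Python sketch (the article presents no code; the function names are ours) reproduces rows of table 1.2:

```python
def p_no_match(N, s):
    """Equation 1.1: cumulative probability of no repeated value
    among s random samples from a population of N."""
    q = 1.0
    for i in range(1, s):
        q *= 1.0 - i / N
    return q

def p_match(N, s):
    """Equation 1.2: probability of at least one match."""
    return 1.0 - p_no_match(N, s)

print(round(p_match(365, 23), 4))  # 0.5073
print(round(p_match(365, 30), 4))  # 0.7063
```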
                    N  N-1  N-2  ..  N-s+1
   Pd(N,s) = 1  -  ------------------------- ,
                    N   N    N   ..    N

then recognize the numerator as N!/(N-s)!, and have
                               N!
   (1.3)  Pd(N,s) = 1  -  ------------ .
                           (N-s)! N^s

This is exact, although it quickly becomes unruly as the factorials exceed any given precision, and the lower factorial limits the number of samples s to the population N. (Kullback [9: 29] promotes e^-s(s-1)/2N as a "good approximation" to the right-hand part of (1.3) for large N; we will see this exponent again in section 2.) Fortunately, we have a good approximation to Ln(x!) for at least modest x (say, x > 9) using Stirling's formula (e.g., Knuth [7: 111]):
   (1.4)  LnFact(x) =(approx.) (x + 1/2) Ln(x) - x + Ln(2 Pi)/2

                          1       1           1           1
                      +  ---  -  -------  +  --------  -  --------
                         12x     360 x^3     1260 x^5     1680 x^7

And so equation 1.3 can be tamed with logs (for s > 9, and s <= N) as:
   (1.5)  Pd(N,s) = 1 - e^( LnFact(N) - LnFact(N-s) - s Ln(N) )
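Most numeric libraries supply a log-gamma function, and Ln(x!) = lgamma(x+1), so (1.5) can be evaluated without hand-coding Stirling's series. A Python illustration (a sketch of ours, not the article's code):

```python
from math import exp, lgamma, log

def ln_fact(x):
    """Ln(x!) via the log-gamma function, accurate even for large x."""
    return lgamma(x + 1)

def p_match_ln(N, s):
    """Equation 1.5: birthday match probability via log factorials."""
    return 1.0 - exp(ln_fact(N) - ln_fact(N - s) - s * log(N))

print(round(p_match_ln(365, 23), 4))       # 0.5073
print(p_match_ln(10**9, 50000))            # feasible far beyond direct factorials
```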
Equation 1.5 is fairly convenient for the machine evaluation of birthday probabilities (although it will not handle s larger than N) because only the log factorials need be evaluated, and these are easy to find with high accuracy. But (1.5) is not easy to reverse for use in estimating population N when given values for doubles d and samples s. Accordingly, let us pursue an alternate development:
Consider a trial of two samples (s=2): One sample is whatever it is, and, if the next sample is really independent, there should be one chance in population N that exactly the same value will repeat, so the expected number of doubles Ed for two samples is: Ed(N,2) = 1 / N.
Consider a trial of three samples (s=3): There are three possible pairs of values, and for each pair, one value is whatever it is, and the probability that the other sample will match is one in N, so the expected number of doubles Ed(N,3) = 3 / N.
Consider a trial of four samples (s=4): There are six possible pairs of values, so the expected number of doubles will be Ed(N,4) = 6 / N.
By now it should be obvious that we are walking down Pascal's triangle and finding the number of combinations of s things taken two at a time: (s C 2), also known as "n choose r" or "the binomial coefficient."
We can relate the number of samples s to the population N through the expected number of augmented doubles Ead we observe:
   (2.1)  Ead(N,s) = (s C 2) / N .

But

   (s C 2) = s(s-1) / 2 ,

so

   (2.2)  Ead(N,s) = s(s-1) / 2N .
As we shall see, in this development the expected value has an unusual meaning. Here the expected value represents not just the number of exact doubles, but also contributions from triples and higher-order repetitions (if any). We will call this augmented doubles.
Consider the cases of N=2 and s=3, and N=3 and s=2 in tables 2.1 and 2.2 (note that equation 2.2 works just fine with s > N). In table 2.1 equation 2.2 is used to predict the number of expected augmented doubles for these cases.
   Table 2.1  Expected Augmented Doubles for N=2 and s=3, and N=3 and s=2

                 3 * 2                         2 * 1
   Ead(2,3) = --------- = 1.5     Ead(3,2) = --------- = 1/3
                 2 * 2                         2 * 3

Table 2.2 illustrates every possible trial for both of these cases, and gives totals for comparison to the predicted values.
   Table 2.2  All Possible Trials for N=2 and s=3, and N=3 and s=2

      N=2  s=3                  N=3  s=2
   Trial  Aug. Doubles       Trial  Aug. Doubles
   0 0 0       3             0 0         1
   0 0 1       1             0 1         0
   0 1 0       1             0 2         0
   0 1 1       1             1 0         0
   1 0 0       1             1 1         1
   1 0 1       1             1 2         0
   1 1 0       1             2 0         0
   1 1 1       3             2 1         0
              ---            2 2         1
              12                        ---
                                         3

   Ead(2,3) = 12 aug. doubles / 8 cases = 1.5
   Ead(3,2) =  3 aug. doubles / 9 cases = 1/3
Note that the equations in this new development predict the number of doubles exactly, if we interpret a triple as three different pairs of doubles. Thus, the augmentation from triples to doubles is (3 C 2) or 3. We could express a particular augmented result as the sum of the contributions from higher repetition levels as:
                   n
   (2.3)  Ar[k] = SUM (i C k) r[i] .
                  i=k

In (2.3), Ar[k] is the augmented repetition count at the k-th level (k=2 for doubles; k=3 for triples, etc.), and r[i] the exact repetition count at the i-th level. Notice that we are using the idea of combinations twice: Once to give us the number of pairs (and, thus, the number of possible matches) in s samples, and then again to represent the number of doubles hidden in any higher-order exact repetitions.
In Section 6 we will encounter equations which predict exact repetitions and thus avoid augmentation calculations. But when we use exact repetitions, if we have a double, and then find that same value again, we will have a new triple, and one fewer doubles. Thus, if we are using exact doubles to predict population, finding another match can actually decrease the number of exact doubles, and so change our prediction in exactly the wrong direction. In contrast, augmented values retain repetition information instead of throwing it away.
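The augmentation sum of (2.3) is a one-liner in practice. A Python sketch (ours); for example, a single value seen six times contributes (6 C 2) = 15 augmented doubles and (6 C 3) = 20 augmented triples:

```python
from math import comb

def augmented(exact, k):
    """Equation 2.3: augmented k-reps from exact repetition counts.
    exact maps each repetition level i to the count r[i] of values
    occurring exactly i times."""
    return sum(comb(i, k) * r for i, r in exact.items() if i >= k)

print(augmented({3: 1}, 2))   # 3: one triple counts as three doubles
print(augmented({6: 1}, 2))   # 15
print(augmented({6: 1}, 3))   # 20
```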
These simple examples have (so far) confirmed the equations. With a beginning confidence in the analysis, we can estimate the population N from the number of samples s and augmented doubles ad:
   (2.4)  Nad(s,ad) = s(s-1) / 2 ad .

With s positive and ad greater than zero, (2.4) produces a unique, unambiguous positive result.
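Equations 2.2 and 2.4 translate directly into code. A minimal Python sketch (function names ours), checked against table 2.1 and the first row of the experimental results in section 7:

```python
def ead(N, s):
    """Equation 2.2: expected augmented doubles in s samples."""
    return s * (s - 1) / (2 * N)

def nad(s, ad):
    """Equation 2.4: population from samples and mean augmented doubles."""
    return s * (s - 1) / (2 * ad)

print(ead(2, 3))               # 1.5, as in table 2.1
print(round(nad(100, 0.492)))  # 10061, as in table 7.1
```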
We can even create an equation for predicting samples s needed from population N to produce (on average) a given number of augmented doubles ad. Using (2.2), we can put s into normal quadratic form, use the quadratic formula, and end up with:
                      1 + Sqrt( 1 + 8 ad N )
   (2.5)  s(N,ad) = -------------------------- .
                                2

(As we shall see in Section 5, the expected value ad for a 50 percent probability of finding an augmented double is -Ln(0.5) or 0.693.)
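As a quick check of (2.5), a Python sketch (ours) recovers the classic 23-student answer:

```python
from math import log, sqrt

def samples_for(N, ad):
    """Equation 2.5: samples expected to yield ad augmented doubles."""
    return (1 + sqrt(1 + 8 * ad * N)) / 2

# Samples for a 50 percent chance of a match (ad = -Ln(0.5)) when N = 365:
print(round(samples_for(365, -log(0.5))))  # 23
```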
Now consider the situation with triples: Given any set of three samples, one is whatever it is, and one time out of N the second is the same value (a double), and of these, one time out of N the third value is also the same (a triple). So we have:
   (3.1)  Eat(N,s) = (s C 3) / N^2 .

But

   (s C 3) = s(s-1)(s-2) / 6 ,

so

   (3.2)  Eat(N,s) = s(s-1)(s-2) / 6N^2 .

Now we can estimate the population N from samples s and augmented triples at:

   (3.3)  Nat(s,at) = Sqrt( s(s-1)(s-2) / 6 at ) .
Consider the cases of N=2 and s=4, and N=3 and s=3, as estimated in table 3.1:
   Table 3.1  Expected Augmented Doubles and Triples for N=2 and s=4, and N=3 and s=3

                 4 * 3                        3 * 2
   Ead(2,4) = --------- = 3      Ead(3,3) = --------- = 1
                 2 * 2                        2 * 3

               4 * 3 * 2                     3 * 2 * 1
   Eat(2,4) = ----------- = 1    Eat(3,3) = ----------- = 1/9
                 6 * 4                         6 * 9

with every possible trial shown in table 3.2.
   Table 3.2  All Possible Trials for N=2 and s=4, and N=3 and s=3

              Augmented  Augmented           Augmented  Augmented
    Trial      Doubles    Triples    Trial    Doubles    Triples
   0 0 0 0        6          4       0 0 0       3          1
   0 0 0 1        3          1       0 0 1       1          0
   0 0 1 0        3          1       0 0 2       1          0
   0 0 1 1        2          0       0 1 0       1          0
   0 1 0 0        3          1       0 1 1       1          0
   0 1 0 1        2          0       0 1 2       0          0
   0 1 1 0        2          0       0 2 0       1          0
   0 1 1 1        3          1       0 2 1       0          0
   1 0 0 0        3          1       0 2 2       1          0
   1 0 0 1        2          0       1 0 0       1          0
   1 0 1 0        2          0       1 0 1       1          0
   1 0 1 1        3          1       1 0 2       0          0
   1 1 0 0        2          0       1 1 0       1          0
   1 1 0 1        3          1       1 1 1       3          1
   1 1 1 0        3          1       1 1 2       1          0
   1 1 1 1        6          4       1 2 0       0          0
                 ---        ---      1 2 1       1          0
                  48         16      1 2 2       1          0
                                     2 0 0       1          0
   Ead(2,4) = 48 doub / 16 = 3       2 0 1       0          0
   Eat(2,4) = 16 trip / 16 = 1       2 0 2       1          0
                                     2 1 0       0          0
                                     2 1 1       1          0
   Ead(3,3) = 27 doub / 27 = 1       2 1 2       1          0
   Eat(3,3) =  3 trip / 27 = 1/9     2 2 0       1          0
                                     2 2 1       1          0
                                     2 2 2       3          1
                                                ---        ---
                                                 27          3

Note that the values from (2.2) and (3.2) are precisely the values we find by exhaustive enumeration.
For some parameters, when the number of samples s exceeds the population N, we can expect more triples than doubles; this is just the realization that (n C 3) is often greater than (n C 2). For example, suppose we have N=2 and s=6, a 6-bit binary value, and suppose all six samples are the same. In this case we have only 15 doubles (6 C 2), but 20 triples (6 C 3). Of course, there are only two such cases (all-zeros and all-ones) at this length, and these are rare; it takes more such cases to affect the expected value (see table 3.3).
   Table 3.3  Expected Augmented Doubles and Triples in Random Bit Arrays

      Bits   Aug. Doubles   Aug. Triples
        2        0.5             0
        4        3               1
        6        7.5             5
        8       14              14
       10       22.5            30
A similar effect occurs on observed counts: An observed count of six occurrences of the same value (a level-6 exact repetition) would be taken as 15 augmented doubles (6 C 2), but 20 augmented triples (6 C 3). Augmentation from higher repetitions can have more effect on triples than doubles.
Every repetition size r (r=2 for doubles; r=3 for triples) gives us an expected augmented repetition value Ear :
   (4.1)  Ear(r,N,s) = (s C r) / N^(r-1) ,

and from this we get an estimation of population N:

   (4.2)  Nr(r,s,ar) = ( (s C r) / ar )^(1/(r-1))

where ar is the augmented repetition count.
To bypass the limitations of (s C r) to parameters s >= r, we note that

   (s C r) = s(s-1)..(s-r+1) / r(r-1)..1  =  s! / r! (s-r)!

so factorial versions are:
                               s!
   (4.3)  Ear(r,N,s) = -------------------- ,
                        r! (s-r)! N^(r-1)

                        (       s!       )
   (4.4)  Nr(r,s,ar) =  ( -------------- ) ^ (1/(r-1)) ,
                        (  r! (s-r)! ar  )

and, in logs:
   (4.7)  LnEar(r,N,s) = LnFact(s) - LnFact(r) - LnFact(s-r) - (r-1) Ln(N) ,

                          LnFact(s) - LnFact(r) - LnFact(s-r) - Ln(ar)
   (4.8)  LnNr(r,s,ar) = ----------------------------------------------- .
                                             r - 1
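Equations 4.7 and 4.8 map directly onto a built-in log-gamma function. A Python sketch (ours), checked against (2.2) for the experimental parameters of section 7:

```python
from math import exp, lgamma, log

def ln_ear(r, N, s):
    """Equation 4.7: log of expected augmented r-reps in s samples."""
    return (lgamma(s + 1) - lgamma(r + 1) - lgamma(s - r + 1)
            - (r - 1) * log(N))

def ln_nr(r, s, ar):
    """Equation 4.8: log of the population estimate."""
    return (lgamma(s + 1) - lgamma(r + 1) - lgamma(s - r + 1)
            - log(ar)) / (r - 1)

# 100 samples from population 10,000: (100 C 2)/10000 = 0.495 aug. doubles
print(round(exp(ln_ear(2, 10000, 100)), 3))   # 0.495
print(round(exp(ln_nr(2, 100, 0.495))))       # 10000
```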
Table 4.1 and table 4.2 illustrate two checks on equation 4.7 and higher-order repetitions. In table 4.1 we have all 64 possible trials of six bits (binary samples). The first column lists each particular trial, the second shows the number of repetitions found (from 6-reps through doubles), the third column gives the augmented repetitions, while the last column accumulates totals. At the bottom, equation 4.7 produces values for comparison; note that augmented 6-reps, 5-reps, 4-reps, 3-reps and doubles all occur exactly as equation 4.7 predicts. (Exact repetitions are also predicted, using (6.4).) Similarly, table 4.2 presents all 81 possible trials of four samples from a population of three, with similar success. Table 4.3 gives other tests, in summary version only.
The new development yields formulas for the expected number of augmented doubles, but sometimes we would prefer to have probability, a value from 0 to 1 (although this is unnecessary for population estimation). Let us assume, for the present, that the number of augmented doubles found in a fixed number of samples obeys a Poisson distribution. A Poisson distribution has only a single parameter, the quantity np (samples n times probability of success p); this is the mean u or expected value (see, for example, Parratt [16]). Given u we can calculate the probability that k augmented doubles will be found by using the Poisson cumulative distribution function shown in equation 5.1:
           n            n    u^k e^-u
   (5.1)  SUM P(k;u) = SUM  ----------  =  1
          k=0          k=0      k!

                                  u^2 e^-u          u^n e^-u
          =  e^-u  +  u e^-u  +  ----------  + ... + --------
                                      2                 n!
Each term in (5.1) gives the probability of k successes. In our case, (5.1) gives us the probability of k doubles given a particular value of u, and a trial of a particular size will imply some value for u. Since the total probability (the sum of all terms) is 1, the probability of having at least one double is 1 minus the probability of zero doubles:
(5.2) Pd(u) = 1 - e^-u .
Now, since we already have an equation for expected augmented doubles, we might use Ead from (2.2) as an approximation to u, and then (5.2) should give us the approximate probability of an augmented double. Right?
Unconvinced? Well, how about this (from Meyer and Matyas [14: 672-673]): We start with (1.1), the probability of no match, from the original development for "at least one match":
   Pq(N,s) = (1-1/N)(1-2/N)..(1-(s-1)/N)

and take the log of each side:

   Ln(Pq(N,s)) = Ln(1-1/N) + Ln(1-2/N) + ... + Ln(1-(s-1)/N) .

Now (assuming s << N), each of these terms can be expressed as a series:
                          1      1       1
   (5.3)  Ln(1-1/N) = - ---  -  ----  -  ----  -  . . .
                          N     2N^2     3N^3

but if N is large (over, say, 10^3), the first term will dominate, and the rest can be discarded. This leaves us with:
   Ln(Pq(N,s)) = -[ 1/N + 2/N + 3/N + ... + (s-1)/N ]

and this series is exactly the average of all the terms (the first plus the last, divided by two), times the number of terms, after factoring out N:

   Ln(Pq(N,s)) = -( ((1 + s-1) / 2) (s-1) ) / N

or

   Ln(Pq(N,s)) = - s(s-1) / 2N .

But look at this equation: The right-hand side is identical with equation (2.2) for expected augmented doubles, except for the sign! So the approximate probability of "at least one match" can be calculated from our development for augmented repetitions. Thus,
   Pq(N,s) = e^-Ead

and

   (5.4)  Pd(N,s) = 1 - e^-Ead ,

which is exactly what we expect from a Poisson distribution of augmented doubles (in separate trials each with the same fixed number of samples). (Kaliski, et al. [6: 87] point out that the right-hand part of (1.3) can be approximated by essentially e^-s(s-1)/2N, this same exponential.)
To finish up, we can easily write:
   (5.5)  Ed = -Ln( 1 - Pd ) .

This is why, in equation 2.4, we could not simply use 0.5 or 1.0 as the number of expected doubles. The number of doubles corresponding to a 50 percent probability of success is -Ln(0.5) or 0.693.
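These Poisson relations are trivial to code; a Python sketch (function names ours):

```python
from math import exp, log

def p_at_least_one(u):
    """Equations 5.2/5.4: probability of at least one augmented double
    when the expected (mean) number of augmented doubles is u."""
    return 1.0 - exp(-u)

def expected_for_p(p):
    """Equation 5.5: expected doubles for a target probability."""
    return -log(1.0 - p)

print(round(expected_for_p(0.5), 3))  # 0.693
```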
As it turns out, Kullback [9] provides formulas which can be used directly on the birthday problem. (To conform to the previous notation, and to prevent confusion between n and N, we change Kullback's N to s (samples), and Kullback's n to N (population).) Starting on p. 29 we find: "For random text of s elements, where there are N different possible elements ... the average number of elements occurring twice each is given by"
                      s(s-1) N (1 - 1/N)^(s-2)
   (6.1)  Ed(N,s) = ---------------------------
                              2 N^2

"the average number of elements occurring r times each is given by"

                        s(s-1)..(s-r+1) N (1 - 1/N)^(s-r)
   (6.2)  Er(r,N,s) = ------------------------------------ .
                                    r! N^r

For machine evaluation, we may express these in logs:
(6.3) LnEd(N,s) = Ln(s(s-1)/2) + (s-2) Ln(1 - 1/N) - Ln(N) ,
(6.4) LnEr(r,N,s) = LnFact(s) - LnFact(s-r) - LnFact(r) + (s-r) Ln(1 - 1/N) - (r-1) Ln(N) .
Although these equations do predict expected exact repetitions, they are going to be difficult to re-arrange to predict population. It would of course be possible to use any of these equations with an indirect root-finding numerical technique (for example, bisection [18]), but because the equations typically include powers higher than the first, any particular root may not be unique.
Interestingly, we see that (6.4) has five terms, and four of those are identical with the ones in (4.7). Thus, the additional term in (6.4) is the log of the fraction which converts the larger augmented repetition value ar into the smaller exact repetition value (or vice versa):
   (6.5)  LnExactReps(r,N,s) = (s-r) Ln(1 - 1/N) + Ln(ar) ,

or without logs:
(6.6) ExactReps(r,N,s) = (1 - 1/N)^(s-r) * ar .
The expression (1 - 1/N) is always fractional, so there can never be more exact repetitions than augmented ones. This expression is close to one when N is large, and as long as s is much smaller than N, the overall factor is still close to one. Thus, under normal circumstances, the number of exact repetitions is not too different from the number of augmented repetitions. (With N=10,000, the ratio is 0.99 at s=100, and 0.96 at s=400.) On the other hand, small N and large s maximize the differences. With N=2 and s=3 (the situation illustrated in table 2.2) there are only half as many exact doubles as augmented doubles, and the ratio increases exponentially with higher s. Of course, in order to use (6.5) or (6.6), we first must know the population N, and this is often what we hope to find.
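Equation 6.6 can be sketched in Python (function name ours) to illustrate both extremes:

```python
def exact_from_augmented(r, N, s, ar):
    """Equation 6.6: convert augmented r-reps to exact r-reps."""
    return (1.0 - 1.0 / N) ** (s - r) * ar

# N=2, s=3 (table 2.2): exact doubles are half the augmented doubles.
print(exact_from_augmented(2, 2, 3, 1.5))                    # 0.75
# N=10,000, s=100: the two counts nearly agree.
print(round(exact_from_augmented(2, 10000, 100, 0.495), 4))
```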
The numerical evaluation of Ln(1 - 1/N) is trickier than it looks. For large N, (1 - 1/N) is going to be very close to 1, and Ln(1) is 0. Consequently, the evaluation of the log may well underflow to zero with a serious loss of precision. But we have seen Ln(1 - 1/N) expanded in equation 5.3 (for large N):
   (6.7)  Ln(1 - 1/N) =(approx.) -1/N ,

and, so, for large N (from 10^3 to 10^6, depending on the local numerical precision):

   (6.8)  LnExactReps(r,N,s) =(approx.) (r-s)/N + Ln(ar) ,

and the same could be done for (6.3) and (6.4).
Kullback does not explain how he came up with (6.1) and (6.2), but they turn out to be very nearly binomial equations (see, for example, Parratt [16]). The binomial cumulative distribution function is:
           n                n
   (6.9)  SUM B(k;n,p)  =  SUM (n C k) p^k q^(n-k)  =  1 .
          k=0              k=0

Each term in the sum gives the probability of finding exactly k successes in n independent tries when each success has probability p. In our terminology, consider the probability of an exact double (k=2) in s samples (n=s) from population N (p=1/N):
   (6.10)  B(2;s,1/N) = (s C 2) ((1/N)^2) (1-1/N)^(s-2)

Observe that the first factor (s C 2) is s(s-1)/2 and that a factor of N is necessary to yield the expected number of doubles, and the link to (6.1) is immediate. (Note that the Poisson and binomial distributions have two different applications here: The binomial approaches are simply alternatives to (1.3) or (2.2), while the distribution of multiple trials requires the Poisson equations to estimate the resulting probability.)
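The link between (6.10) and (6.1) can be checked numerically; a Python sketch (ours, using sample values):

```python
from math import comb

def binom_term(k, n, p):
    """One term of (6.9): probability of exactly k successes."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def ed_exact(N, s):
    """Equation 6.1: Kullback's expected exact doubles."""
    return s * (s - 1) * (1 - 1 / N) ** (s - 2) / (2 * N)

# N * B(2; s, 1/N) reproduces (6.1):
N, s = 10000, 100
print(abs(N * binom_term(2, s, 1 / N) - ed_exact(N, s)) < 1e-9)  # True
```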
Now that we see how the binomial distribution fits in, we can see that it also supports yet another approach: the probability of doubles or better. Since the sum of all terms (the probability of each repetition-level) is 1, the probability of having at least one match is 1 minus the probability of zero successes minus the probability of one success:
   (6.11)  Ebd(N,s) = N( 1 - B(0;s,1/N) - B(1;s,1/N) ) .

Alas, this is not just a new expression for (1.3), but is instead yet another different concept of "doubles." We now have four such concepts, here arranged in decreasing magnitude:
       (1.3)      >     (2.2)     >        (6.11)        >    (6.1)
   "at least one"    "augmented"    "binomial multiples"    "exact"
The classical birthday development in section 1 is for "at least one match." The new development in section 2 is also for two or more occurrences, albeit in the special and numerically enhanced form of "augmented doubles." And the development in this section predicts "binomial multiples" as well as exactly two occurrences of a value. These different viewpoints make direct result comparisons difficult, but augmented repetitions remain by far the easiest to reverse for predicting population.
Computer programs were implemented to simulate random sampling from a known population. That population was then estimated by sampling the population in trials having various numbers of random samples and using those results in the augmented repetition equations.
Ideally, each time we perform a trial we would like to be confident that the result reflects the larger reality. Unfortunately, that is not the way sampling works. Instead, the set of all possible trials form a probability distribution, so some individual trials will always yield misleading results. On the other hand, if we take many trials, we have some reason to hope that we can develop both an accurate distribution, and an accurate mean for use in population prediction.
The question of the distribution of trial values was addressed experimentally. A statistical random number generator was used to generate some number of samples in each trial, and this was repeated for a large number of trials:
Fig. 7.1 Augmented Doubles in Sqrt(N) Samples. 2000 trials of 100 samples each from population 10,000. Each bar represents a particular number of augmented doubles, and the height of each bar represents the number of experimental trials which found that many augmented doubles. The fat white line connects each number of trials predicted by a Poisson distribution having an expected value calculated as e to the power given by equation 4.7 (log of expected augmented repetitions), for augmented doubles, assuming the same overall number of trials.
Fig. 7.2 Augmented Doubles, 1.5 Sqrt(N) Samples. 1333 trials of 150 samples each from population 10,000.
The graphs in figures 7.1 through 7.6 illustrate the distribution
of the number of augmented doubles found in large numbers of trials
using six different quantities of random samples. Each graph
represents essentially the same collection effort. The
vertical bars represent the number of trials which produced the
associated number of augmented doubles.
Fig. 7.3 Augmented Doubles in 2 Sqrt(N) Samples. 1000 trials of 200 samples each from population 10,000.
The thick white line on each graph illustrates the number of trials
predicted to have the given number of doubles, as computed by a
Poisson distribution for expected doubles developed from
(4.7)
and the same total trials.
Fig. 7.4 Augmented Doubles, 2.5 Sqrt(N) Samples. 800 trials of 250 samples each from population 10,000.
The author finds these graphs compelling
evidence that large numbers of trials do find augmented doubles
in a distribution which is essentially Poisson in nature (and this
happy result nicely supports the probability development in
section 5).
Fig. 7.5 Augmented Doubles in 3 Sqrt(N) Samples.
667 trials of 300 samples each from population 10,000.
Fig. 7.6 Augmented Doubles in 4 Sqrt(N) Samples.
500 trials of 400 samples each from population 10,000.
Table 7.1 lists the conditions under which each experiment occurred.
   Table 7.1  Means of Augmented Doubles over Many Trials
              and Other Results from Distribution Experiments

      Samples   Trials    Mean    Chi-Sq   Est. N
        100      2000     0.492     -      10061
        150      1333     1.093     -      10224
        200      1000     1.962    0.336   10143
        250       800     3.111    0.261   10004
        300       667     4.493    0.573    9982
        400       500     8.034    0.033    9933

This was a single contiguous set of experiments; the samples were taken from a fast statistical RNG and a "population" of 10,000 symbols. The mean is the arithmetic average of augmented doubles over all trials. The chi-square column gives the probability of the chi-square "p" statistic (the area of the chi-square density function below the experimental value) comparing the experimental results to the Poisson distribution (but only for experiments which had more than five bins exceeding five counts). The chi-square value of 0.033 in the last row indicates that the results in the last set were unexpectedly close to the Poisson distribution (for random samples); this was unusual. The predicted population values were relatively accurate.
Whenever we randomly sample, we can expect to get various result values. While it is useful to know the expected distribution of those values, we must remember that even very improbable values can and will occur occasionally. We will need enough experimental trials to see the form the distribution is taking. But then the experimental distribution should be fairly obvious, so the theoretical distribution may be of somewhat less import than one might think.
Obtaining samples is expensive in terms of time, while storing and comparing sampled values is expensive in both storage and time. There is ample motive to use the fewest possible samples. However, those who would understand the accuracy of reversed birthday techniques should refer again to the graphs in figures 7.1 through 7.6. Every one of the repetition counts in each bar was formed as a valid result of a separate trial on the exact same population. An experiment measuring population could have produced any of these results in any single trial.
A zero result, common in trials with under 2 Sqrt(N) samples, will give no estimate at all (by itself). Results of 1 and 24, both of which occur in figure 7.6 (with 4 Sqrt(N) samples), estimate populations which differ by a factor of 24. Since we are measuring population, we will not know, beforehand, the shape of the particular distribution we are measuring, and which estimates are most likely. Consequently, the reversed birthday approach is really useful only after we have enough trials to be fairly sure of the mean number of repetitions.
When we develop a mean value across many trials, we expect that the more trials we have, the more accurate the result will be. But how much more accurate? (See, for example, Boas [3: 732-735].) Without assuming a particular underlying distribution, we can use the central limit theorem as grounds for assuming that errors in the mean will be distributed normally (over many trials). Using the sample standard deviation sd, and trials t, the standard deviation of the mean sd(m) is expected to be:
                      sd
   (8.1)   sd(m) = -------- .
                      ____
                    \/ t

(Four times as many trials should give us half the standard deviation and thus twice the accuracy in the sample mean.) For a 50 percent confidence interval (probability 0.5 that the true mean lies within the interval), we have limits u - r and u + r, for mean u, and r the probable error. Since we assume a normal distribution (large t) and a 50 percent confidence interval, we have the probable error r = 0.6745 sd(m).
If we assume the underlying distribution is Poisson, we might think to avoid computing the sample variance sd^2 since that should be equal to the mean u. However, in the experimental data, while the variance and mean do track fairly well, it is not at all unusual to see them differ by ten percent or more. Thus, we probably should compute the variance explicitly from the data. But for a quick-and-dirty answer given only mean u and trials t, the probable error r should be:
                         ___
                0.6745 \/ u
   (8.2)   r = ------------- .
                     ____
                   \/ t
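A minimal sketch (in Python; not from the original article) of both routes to the probable error: the explicit route through the sample variance, as the text recommends, and the quick-and-dirty Poisson shortcut of equation 8.2.

```python
from math import sqrt

def probable_error(values):
    """Mean and 50 percent confidence half-width for a set of per-trial
    counts, using the sample standard deviation (eqs. 8.1 and r = 0.6745 sd(m))."""
    t = len(values)
    u = sum(values) / t
    var = sum((x - u) ** 2 for x in values) / (t - 1)  # sample variance
    sd_m = sqrt(var) / sqrt(t)      # standard deviation of the mean (8.1)
    return u, 0.6745 * sd_m         # probable error r

def quick_probable_error(u, t):
    """Quick-and-dirty probable error assuming Poisson (variance = mean),
    i.e. equation (8.2)."""
    return 0.6745 * sqrt(u) / sqrt(t)
```

For example, a mean of 4.0 augmented doubles over 100 trials gives r = 0.6745 * 2 / 10, roughly 0.135, so the true mean probably lies within about 3.87 to 4.13.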
The normal distribution demands large t and produces a wider confidence interval factor (0.6745) than we really need (especially for our expected small values of the mean u [16: 206]). Nevertheless, using the normal distribution for the confidence interval appears to be standard practice. Excessive concern about accuracy in these statistical predictions may be misplaced anyway, since each experiment will be different.
During statistical tests of the reversed birthday approach, it became apparent that these methods could reveal problems which are virtually invisible to other statistical tests. There are many statistical tests of distribution (see, for example, Knuth [8] and Marsaglia [12]), but most of these can be fooled by simultaneous inverse relationships between population and probability.
Fig. 9.1 Even Distribution without Population Deviation. Each possible symbol has exactly the same probability. The ideal distribution for a discrete random number generator (RNG).
Consider figure 9.1, which is intended to represent the even distribution we expect from a good RNG. The separate spikes represent the discrete individual values (or codes or symbols) that the RNG can produce (any digital result can give us only a finite number of representations, no matter how the values are produced). The height of each spike represents the probability of obtaining that particular value. Figure 9.1 represents the ideal in which any possible value can be obtained with equal probability.
Fig. 9.2 Even Distribution with Population Deviation. A distribution in which half of the range has only half of the expected values, yet each of these values is twice as probable as it should be. Any common test which groups even a couple of adjacent symbols is unlikely to find much wrong with this distribution.
Now consider figure 9.2; strangely, this is also an even distribution. While half of the range contains only half the symbols of the other half (and one-quarter of the symbols are thus unused), each symbol in the reduced side is twice as probable. Consequently, this distribution has essentially the same range, mean, deviation, and comparison to flat distribution as 9.1, yet only three-quarters the uniqueness. (The means will be very close when there are a large number of symbols; a chi-square comparison to a flat distribution will indicate similarity if, as usual, the comparison bins cover multiple symbols.)
Table 9.1 illustrates all possible trials for two cases of N=4 and s=2: First, the expected ideal distribution with four symbols, and then a deviant distribution in which symbol 1 never occurs, and symbol 0 occurs twice as often to compensate for the loss.
Table 9.1  All Possible Trials for N=4, s=2, for Both Ideal and Deviant Populations

     Ideal Population        Deviant Population
   Trial   Aug. Doubles    Trial   Aug. Doubles
    0 0         1           0 0         1
    0 1         0           0 0         1
    0 2         0           0 2         0
    0 3         0           0 3         0
    1 0         0           0 0         1
    1 1         1           0 0         1
    1 2         0           0 2         0
    1 3         0           0 3         0
    2 0         0           2 0         0
    2 1         0           2 0         0
    2 2         1           2 2         1
    2 3         0           2 3         0
    3 0         0           3 0         0
    3 1         0           3 0         0
    3 2         0           3 2         0
    3 3         1           3 3         1
               ---                     ---
                4                       6

Note that reversed birthday methods do detect something strange about the population. (With N=4, the half-symbol difference in means between the two cases is obvious without birthday computations, but for large N, a huge number of samples would be necessary to confirm any such difference.) Table 9.1 shows that repetition counts are sensitive to symbols of higher than expected probability, even if those symbols are nominally compensated by low probability symbols. When low-probability symbols and high-probability symbols are in close proximity, binned tests are going to have a difficult time detecting a deviant population.
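The enumeration in Table 9.1 is small enough to reproduce by machine. The sketch below (Python; not from the original article) walks every equally-likely trial of s=2 draws and totals the augmented doubles, for both the ideal N=4 population and the deviant one in which symbol 1 never occurs and symbol 0 is twice as probable.

```python
from collections import Counter
from itertools import product
from math import comb

def augmented_doubles(trial):
    return sum(comb(k, 2) for k in Counter(trial).values())

def total_aug_doubles(weights, s=2):
    """Enumerate every equally-likely trial of s draws; 'weights' gives
    how many of the equally-likely outcomes map to each symbol."""
    outcomes = [sym for sym, w in enumerate(weights) for _ in range(w)]
    return sum(augmented_doubles(t) for t in product(outcomes, repeat=s))

print(total_aug_doubles([1, 1, 1, 1]))  # ideal:   4 doubles in 16 trials
print(total_aug_doubles([2, 0, 1, 1]))  # deviant: 6 doubles in 16 trials
```

The deviant population yields half again as many doubles, which is exactly the signal the binned tests miss.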
Fig. 9.3 Even Distribution with Smaller Population Deviation.
Figure 9.3 continues the development of a distribution which is nominally flat statistically, but which has local population deviations. Although a large number of high-probability values would clearly dominate repetition results, a small number of high-probability values could be hard to find even using repetition methods.
Actual probability variations could be continuous and arbitrary. One could easily imagine that some "physically random" noise-based designs might well have an inherent inverse relationship between local population and probability. A cryptanalyst who knew of this relationship would naturally try the most-probable values first.
At this point, we can use the development in this paper and compare it to the Keprom results mentioned in the introduction. Table 10.1 illustrates the use of equation 4.8 to estimate population based on augmented repetitions calculated from the Keprom results [11].
Table 10.1  Estimating Keprom Population from Augmented Repetitions

   s: 900000   r: 18   Ear:     1   Aug Pop: 17.9 bits
   s: 900000   r: 17   Ear:    18   Aug Pop: 17.7 bits
   s: 900000   r: 16   Ear:   153   Aug Pop: 17.7 bits
   s: 900000   r: 15   Ear:   816   Aug Pop: 17.6 bits
   s: 900000   r: 14   Ear:  3060   Aug Pop: 17.6 bits
   s: 900000   r: 13   Ear:  8568   Aug Pop: 17.6 bits
   s: 900000   r: 12   Ear: 18564   Aug Pop: 17.7 bits
   s: 900000   r: 11   Ear: 31824   Aug Pop: 17.7 bits
   s: 900000   r: 10   Ear: 43758   Aug Pop: 17.8 bits
   s: 900000   r:  9   Ear: 48621   Aug Pop: 18.0 bits
   s: 900000   r:  8   Ear: 43768   Aug Pop: 18.2 bits
   s: 900000   r:  7   Ear: 31868   Aug Pop: 18.5 bits
   s: 900000   r:  6   Ear: 18676   Aug Pop: 19.0 bits
   s: 900000   r:  5   Ear:  8751   Aug Pop: 19.7 bits
   s: 900000   r:  4   Ear:  3262   Aug Pop: 21.0 bits
   s: 900000   r:  3   Ear:   972   Aug Pop: 23.4 bits
   s: 900000   r:  2   Ear:   362   Aug Pop: 30.1 bits

Of course, these results should be considered a single huge trial, which means that the sampling distribution has not been resolved. Thus, the reported results may not accurately represent the true experimental means.
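Equation 4.8 itself appears earlier in the article; as a sketch (Python, not from the original), the computation below assumes the form E[ar] = C(s,r) / N^(r-1), reversed for N, which reproduces the entries of Table 10.1. The binomial coefficient is handled in log space so that huge values of C(900000, r) cause no overflow.

```python
from math import lgamma, log, log2

def log2_comb(n, r):
    """log2 of C(n,r) via lgamma, safe for astronomically large binomials."""
    return (lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)) / log(2)

def population_bits(s, r, ear):
    """Reverse the assumed E[ar] = C(s,r) / N^(r-1) to estimate log2(N)."""
    return (log2_comb(s, r) - log2(ear)) / (r - 1)

# Last row of Table 10.1: s=900000, r=2, 362 augmented doubles.
print(round(population_bits(900000, 2, 362), 1))  # about 30.1 bits
```

Running the same function down the table recovers each "Aug Pop" figure, including the 17.9-bit estimate implied by the single 18-rep.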
Table 10.2 illustrates the indirect use of equation 6.4 in a numerical root-finding procedure to estimate the population based on exact repetitions.
Table 10.2  Estimating Keprom Population from Smallest Root of Exact Repetition Equations

   s: 900000   r: 18   Er:   1   Xct Pop: 17.4 bits
   s: 900000   r:  9   Er:   1   Xct Pop: 19.8 bits
   s: 900000   r:  8   Er:   1   Xct Pop: 20.3 bits
   s: 900000   r:  5   Er:   1   Xct Pop: 23.0 bits
   s: 900000   r:  4   Er:   1   Xct Pop: 24.8 bits
   s: 900000   r:  3   Er:   2   Xct Pop: 27.9 bits
   s: 900000   r:  2   Er: 123   Xct Pop: 31.6 bits

There is some agreement with table 10.1, as one would expect with a large population. However, the numerical procedures can find more than one root, and selecting a particular root makes a big difference. Here the root-finding procedure was intended to find the smallest root (although, without specific analysis of each case, we cannot be sure it did). This sort of indirect root-search is exceedingly slow in comparison to direct expression evaluation.
Note that both tables predict a much-reduced effective population based on the existence of an 18-rep. If the reported data are actually representative of the device, it would appear that we may be warranted in entertaining serious doubts about the effectiveness of the design. Note that higher-order repetitions can provide the clearest indication of local distribution irregularities.
The ability to estimate population from mean augmented repetitions rests on two questions:
Does the new development really predict augmented repetitions, and
Are augmented repetitions approximately Poisson-distributed over multiple similar trials?
These questions have been investigated experimentally (to some extent), and it appears that both answers are "yes." If so, multiple trials can be used to estimate the mean augmented repetitions (in trials with a fixed number of samples), and that value used in the new reversed equations to estimate the population. (Note that birthday techniques do not address common software linear-congruential or shift-register RNG sequences in which any particular result occurs only once before the system repeats. Birthday techniques require the possibility that multiple samples may produce the same value.)
Reversed birthday techniques are useful in that they allow estimation of population based only on random samples from that population. Useful estimation is possible (over tens or hundreds of trials) with as few as Sqrt(N) samples per trial; with "only" Sqrt(N) samples (per trial), we can begin to estimate N. The difficulty is that we must in some way take, save and compare Sqrt(N) distinct samples, and this can be a lot of samples. Nevertheless, population analysis of 32-bit physically-random RNG's should be well within the range of a modern personal computer.
Triples and higher repetitions do not help us much with the sampling size problem; in fact, they make it worse (see table 11.1).
Table 11.1  Expected Augmented Doubles and Triples for Various Populations

                  2^16 Samples         2^18 Samples
   Population   Doubles   Triples   Doubles   Triples
      2^16        32767     10922    524284    699042
      2^20         2048        43     32768      2731
      2^24          128         0      2048        11
      2^28            8         0       128         0
      2^32          0.5         0         8         0

For repetitions to be useful in population estimation, they must occur in numbers sufficient to support a reliable mean. For any given population, it is far easier to collect data on doubles than triples. But the occasional triple or higher repetition will make a fine contribution to the total of augmented doubles.
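The entries of Table 11.1 can be approximated with a one-line expectation (a Python sketch, not from the original article): each of the C(s,r) possible r-subsets of the samples matches with probability 1/N^(r-1).

```python
from math import comb

def expected_aug_reps(N, s, r):
    """Expected augmented r-reps in s samples from a population of N:
    each of the C(s,r) r-subsets matches with probability 1/N^(r-1)."""
    return comb(s, r) / N ** (r - 1)

# Second row of Table 11.1: population 2^20, 2^16 samples.
print(round(expected_aug_reps(2**20, 2**16, 2)))  # 2048 doubles
print(round(expected_aug_reps(2**20, 2**16, 3)))  # 43 triples
```

The steep falloff of the triples column with increasing N is visible immediately: triples demand roughly N^(2/3) samples, against only Sqrt(N) for doubles.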
On the other hand, higher-order repetitions can be sensitive indicators of deviant populations. If we limit all trials to a size which only produces a few doubles, we could miss indications that our doubles come from a subset of the overall population. While this may not matter if we are interested only in confirming a minimum population, it could be a mistake if we wish to identify problems so they can be fixed.
In practice, it is important first to survey the situation with a few large trials and follow up on any suspect indications. Assuming the initial survey goes well, we should have enough data to fix the size of each trial for record; we might like to see three or more augmented doubles per trial, which implies at least 2.5 Sqrt(N) samples. We would also like to perform as many trials as possible (tens, hopefully; hundreds, if possible), find the augmented doubles in each, and examine the experimental distribution. Then we can come up with a reasonable estimate of the mean, and plug that value into (2.4) to estimate the population. We can do a lot of this automatically, of course, but no simple computation is likely to be as effective as an experimenter who is actively involved with the data.
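The full procedure described above can be sketched end to end (Python; not from the original article, and the trial sizes are simply borrowed from Table 7.1 for illustration): run many trials, count augmented doubles in each, take the mean, attach a probable error, and reverse to a population estimate.

```python
import random
from collections import Counter
from math import comb, sqrt

def run_trials(N, s, t, seed=1):
    """t trials, each drawing s uniform samples from a population of N;
    returns the augmented-doubles count of every trial."""
    rng = random.Random(seed)
    ads = []
    for _ in range(t):
        counts = Counter(rng.randrange(N) for _ in range(s))
        ads.append(sum(comb(k, 2) for k in counts.values()))
    return ads

# One row of Table 7.1: N=10000, 250 samples per trial, 800 trials.
N, s, t = 10000, 250, 800
ads = run_trials(N, s, t)
mean = sum(ads) / t                  # should land near s(s-1)/2N = 3.11
est = s * (s - 1) / (2 * mean)       # reversed birthday estimate of N
r = 0.6745 * sqrt(mean) / sqrt(t)    # probable error of the mean (8.2)
print(round(mean, 3), round(est), round(r, 3))
```

In a real certification run the sampled values would come from the generator under test rather than a software RNG, and the experimenter would also examine the per-trial distribution before trusting the mean.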
An easy and intuitive development has produced simple, exact and easily-reversed equations for estimating population from the number of repeated values found in multiple trials. For a reasonable hope (probability 0.63 or more) of finding at least one augmented double, each trial needs a number of random samples at least equal to the square root of the actual population. Because the numbers of repetitions in separate trials occur in a broad distribution, multiple trials are necessary to calculate a mean augmented repetition value which can be easily used to estimate the parent population.
Earlier equations predict the number of exact repetitions expected from a known population, but those equations are difficult to reverse to predict population. Even if we find numerical solutions to those equations (using inherently slow numerical root-finding techniques), the resulting solutions generally are not unique. The equations from the new development eliminate these problems, at the expense of relatively-simple augmentation computations.
The raw estimation of population has obvious applications in the design of physically random number generators. The ability to identify actual element values which appear in higher than expected quantities is an obvious benefit of repetition methods. But the fact that unexpectedly-probable elements can be detected despite being intermixed with unexpectedly-improbable elements seems to provide new analysis technology. Since population variations would weaken any really random generator, testing for these defects could be an important tool in the future development of physically random generators.
Another potential application is the analysis of cipher designs. Most cipher systems function by hiding a message "needle" in a "haystack" of other decipherings. Ordinarily we assume that the number of different cipherings is equal to the number of possible keys, but in a defective cipher this may not be the case. If many keys produce the same result, the cipher may be unexpectedly weak. Reversed birthday methods may provide an opportunity to test for such a problem.
Reversed birthday techniques presumably could be useful also in more general statistical applications.
Thanks to Nico de Vries (100115.2303@compuserve.com) for proposing (on Usenet sci.crypt) a random number generator computer program which he promoted as physically random. Normally, any computer RNG is a deterministic mechanism [20], and simple examination will easily disclose the "amount of internal data" or "state" which it uses. Nico's design was unusual in that it was based on real-time program interactions with IBM PC-type hardware, so it was difficult to know how much internal state there was, or where that state was located. Such a design begged the question of measuring RNG population when only the values themselves were available.
Thanks to Ross Anderson (Ross.Anderson@cl.cam.ac.uk) for suggesting (also on Usenet sci.crypt) that a birthday test could be used to estimate population, and apologies to him also, for my obsession with this idea. Ross sent me email preprints of his birthday test papers [1,2], but then I developed the concept of augmented repetitions, which produced simple equations and exact unique results. All errors in this development are, of course, mine alone.
REFERENCES

1. Anderson, R., R. Gibbens, C. Jagger, F. Kelly and M. Roe.
Measuring the Diversity of Random Number Generators. In
progress.
2. Anderson, R. Testing Random Number Generators. Submitted
for publication in Electronics Letters.
3. Boas, M. 1983. Mathematical Methods in the Physical
Sciences. John Wiley & Sons.
4. Davies, D. and W. Price. 1984. Security for Computer
Networks. John Wiley & Sons.
5. Goodman, A. and J. Ratti. 1971. Finite Mathematics with
Applications. Macmillan.
6. Kaliski, B., R. Rivest and A. Sherman. 1985. Is the Data
Encryption Standard a Group? (Preliminary Abstract)
Advances in Cryptology -- Eurocrypt '85. 81-95.
7. Knuth, D. 1973. The Art of Computer Programming.
Vol. I, Fundamental Algorithms. Addison-Wesley.
8. Knuth, D. 1981. The Art of Computer Programming.
Vol. II, Seminumerical Algorithms. Addison-Wesley.
9. Kullback, S. 1976 (first published 1938). Statistical
Methods in Cryptanalysis. Aegean Park Press, P.O. Box 2837,
Laguna Hills, CA 92654-0837.
10. Holtzman, J. 1987. The KEPROM: Sinking the Software Pirates.
Radio-Electronics. June. 100-104.
11. Letham, L., D. Hoff and A. Folmsbee. 1986. A 128K EPROM Using
Encryption of Pseudorandom Numbers to Enable Read Access.
IEEE Journal of Solid-State Circuits. SC-21(5): 881-888.
12. Marsaglia, G. 1984. A Current View of Random Number
Generators. Proceedings of the Sixteenth Symposium on the
Interface Between Computer Science and Statistics. 3-10.
13. Merkle, R. 1982. Secrecy, Authentication, and Public Key
Systems. UMI Research Press, University Microfilms
International, Ann Arbor, Michigan 48106.
14. Meyer, C. and S. Matyas. 1982. Cryptography: A New
Dimension in Data Security. John Wiley & Sons.
15. Morita, H., K. Ohta and S. Miyaguchi. 1991. A switching
closure test to analyze cryptosystems. Advances in
Cryptology -- CRYPTO '91. 183-193.
16. Parratt, L. 1961. Probability and Experimental Errors in
Science. Dover.
17. Patterson, W. 1987. Mathematical Cryptology. Rowman &
Littlefield, 81 Adams Drive, Totowa, New Jersey 07512.
18. Press, W. et al. 1986. Numerical Recipes.
Cambridge University Press.
19. Quisquater, J. and J. Delescaille. 1989. How easy is
collision search? Applications to DES. Advances in
Cryptology -- Eurocrypt '89. 429-434.
20. Ritter, T. 1991. The Efficient Generation of Cryptographic
Confusion Sequences. Cryptologia. 15(2): 81-139.
21. Seberry, J. and J. Pieprzyk. 1989. Cryptography: An
Introduction to Computer Security. Prentice Hall.
22. Zollo, S. 1985. IC Keeps Pirates Out. Electronics
Week. February 11: 73, 76.
APPENDIX A: The sci.crypt Newsgroup

Sci.crypt is the name of a news group on Usenet, a
bulletin-board network of computers which use dial-up lines for
network communication. Usenet was originally developed using the
UUCP (Unix-to-Unix Copy) package distributed with Unix, plus free
news software which collects messages in many hundreds of different
topics or "groups." On Usenet, messages are passed from computer
to computer (until the messages hopefully reach every connected
computer), so there is no central facility and no single overall
ownership. Electronic mail is supported in the same way.
Access to Usenet News is generally also available on the
Internet, an exponentially growing interconnection of networks.
Although once confined to higher education and research
laboratories, the Internet is increasingly available to ordinary
citizens who pay a fee for an account on a computer which is
Internet connected.
Those unacquainted with the Internet and News could be
disappointed in what they find. Huge numbers of individuals have
access to these groups, and many of the groups (like sci.crypt) are
unmoderated. Anyone can respond to any message, and much of what
is authoritatively expounded is also wrong, or perhaps just
incomplete. New rank novices appear almost every day. Not only
is it hard to know what to believe, it can be hard simply to find
time to read through the chaff (currently tens of messages per day
in sci.crypt) to get to the occasional kernel of new information.
In addition, the bulletin-board format seems to encourage
vituperative responses or "flames," which naturally produce similar
counter-responses. These "flame wars" can be disconcerting to
those used to respectful intellectual discourse.
At one time sci.crypt supported mostly technical discussions,
and these occasionally still appear. But with continuing exponential
growth, the advent of an available public key package (Pretty Good
Privacy or PGP) and the U.S. government's "Clipper Chip" proposal,
most of the traffic in the past year has addressed political issues.
APPENDIX B: The Intel 27916 Keprom

The 27916 Keprom is an integrated circuit memory device designed
at Intel Corporation in the early 80's
[22].
The Keprom is a "Keyed-Access EPROM" (Erasable Programmable Read-Only
Memory) which performs an authentication handshake, during which
physically-random values are encrypted under a programmed key
[10].
In this way, the data in the EPROM are unlocked only if the
associated equipment contains another Keprom with the correct key.
Apparently the device was intended to be a technical solution to
the problem of software piracy, perhaps especially in the games
market. Unfortunately, the device appeared at about the same
time that major software manufacturers were dropping copy
protection.
The issue of interest here is that the Keprom device included an
on-chip physically-random number generator, and a technical paper
[11]
reported results from evaluation tests on that generator. Briefly,
the design consists of a 32-bit shift-register which accumulates
bits sampled from a drifting on-chip oscillator, noise, and feedback
from the register itself. Because 32-bit values are reported from
the generator, we expect to estimate a 32-bit population. However, the
higher-order repetitions in the reported data imply a value very
much less than this, which may indicate problems in the
physically-random design.
BIOGRAPHICAL SKETCH

Terry Ritter is a registered Professional Engineer who has spent
the last five years conducting independent research and development
on cryptographic systems. Mr. Ritter is a member of the IEEE and
ACM, and holds the US patent on Dynamic Substitution technology; he
has previously contributed articles on reversible nonlinear
cryptographic combiners and a survey of cryptographic random number
generators. Ritter Software Engineering offers fast cipher engines
to hardware and software developers.
LARGE TABLES
Table 4.1 All Possible Trials for N=2 and s=6
Trial Repetitions Aug. Reps. Accum. Aug. Reps.
0 0 0 0 0 0 1 0 0 0 0 1 6 15 20 15 1 6 15 20 15
0 0 0 0 0 1 0 1 0 0 0 0 1 5 10 10 1 7 20 30 25
0 0 0 0 1 0 0 1 0 0 0 0 1 5 10 10 1 8 25 40 35
0 0 0 0 1 1 0 0 1 0 1 0 0 1 4 7 1 8 26 44 42
0 0 0 1 0 0 0 1 0 0 0 0 1 5 10 10 1 9 31 54 52
0 0 0 1 0 1 0 0 1 0 1 0 0 1 4 7 1 9 32 58 59
0 0 0 1 1 0 0 0 1 0 1 0 0 1 4 7 1 9 33 62 66
0 0 0 1 1 1 0 0 0 2 0 0 0 0 2 6 1 9 33 64 72
0 0 1 0 0 0 0 1 0 0 0 0 1 5 10 10 1 10 38 74 82
0 0 1 0 0 1 0 0 1 0 1 0 0 1 4 7 1 10 39 78 89
0 0 1 0 1 0 0 0 1 0 1 0 0 1 4 7 1 10 40 82 96
0 0 1 0 1 1 0 0 0 2 0 0 0 0 2 6 1 10 40 84 102
0 0 1 1 0 0 0 0 1 0 1 0 0 1 4 7 1 10 41 88 109
0 0 1 1 0 1 0 0 0 2 0 0 0 0 2 6 1 10 41 90 115
0 0 1 1 1 0 0 0 0 2 0 0 0 0 2 6 1 10 41 92 121
0 0 1 1 1 1 0 0 1 0 1 0 0 1 4 7 1 10 42 96 128
0 1 0 0 0 0 0 1 0 0 0 0 1 5 10 10 1 11 47 106 138
0 1 0 0 0 1 0 0 1 0 1 0 0 1 4 7 1 11 48 110 145
0 1 0 0 1 0 0 0 1 0 1 0 0 1 4 7 1 11 49 114 152
0 1 0 0 1 1 0 0 0 2 0 0 0 0 2 6 1 11 49 116 158
0 1 0 1 0 0 0 0 1 0 1 0 0 1 4 7 1 11 50 120 165
0 1 0 1 0 1 0 0 0 2 0 0 0 0 2 6 1 11 50 122 171
0 1 0 1 1 0 0 0 0 2 0 0 0 0 2 6 1 11 50 124 177
0 1 0 1 1 1 0 0 1 0 1 0 0 1 4 7 1 11 51 128 184
0 1 1 0 0 0 0 0 1 0 1 0 0 1 4 7 1 11 52 132 191
0 1 1 0 0 1 0 0 0 2 0 0 0 0 2 6 1 11 52 134 197
0 1 1 0 1 0 0 0 0 2 0 0 0 0 2 6 1 11 52 136 203
0 1 1 0 1 1 0 0 1 0 1 0 0 1 4 7 1 11 53 140 210
0 1 1 1 0 0 0 0 0 2 0 0 0 0 2 6 1 11 53 142 216
0 1 1 1 0 1 0 0 1 0 1 0 0 1 4 7 1 11 54 146 223
0 1 1 1 1 0 0 0 1 0 1 0 0 1 4 7 1 11 55 150 230
0 1 1 1 1 1 0 1 0 0 0 0 1 5 10 10 1 12 60 160 240
1 0 0 0 0 0 0 1 0 0 0 0 1 5 10 10 1 13 65 170 250
1 0 0 0 0 1 0 0 1 0 1 0 0 1 4 7 1 13 66 174 257
1 0 0 0 1 0 0 0 1 0 1 0 0 1 4 7 1 13 67 178 264
1 0 0 0 1 1 0 0 0 2 0 0 0 0 2 6 1 13 67 180 270
1 0 0 1 0 0 0 0 1 0 1 0 0 1 4 7 1 13 68 184 277
1 0 0 1 0 1 0 0 0 2 0 0 0 0 2 6 1 13 68 186 283
1 0 0 1 1 0 0 0 0 2 0 0 0 0 2 6 1 13 68 188 289
1 0 0 1 1 1 0 0 1 0 1 0 0 1 4 7 1 13 69 192 296
1 0 1 0 0 0 0 0 1 0 1 0 0 1 4 7 1 13 70 196 303
1 0 1 0 0 1 0 0 0 2 0 0 0 0 2 6 1 13 70 198 309
1 0 1 0 1 0 0 0 0 2 0 0 0 0 2 6 1 13 70 200 315
1 0 1 0 1 1 0 0 1 0 1 0 0 1 4 7 1 13 71 204 322
1 0 1 1 0 0 0 0 0 2 0 0 0 0 2 6 1 13 71 206 328
1 0 1 1 0 1 0 0 1 0 1 0 0 1 4 7 1 13 72 210 335
1 0 1 1 1 0 0 0 1 0 1 0 0 1 4 7 1 13 73 214 342
1 0 1 1 1 1 0 1 0 0 0 0 1 5 10 10 1 14 78 224 352
1 1 0 0 0 0 0 0 1 0 1 0 0 1 4 7 1 14 79 228 359
1 1 0 0 0 1 0 0 0 2 0 0 0 0 2 6 1 14 79 230 365
1 1 0 0 1 0 0 0 0 2 0 0 0 0 2 6 1 14 79 232 371
1 1 0 0 1 1 0 0 1 0 1 0 0 1 4 7 1 14 80 236 378
1 1 0 1 0 0 0 0 0 2 0 0 0 0 2 6 1 14 80 238 384
1 1 0 1 0 1 0 0 1 0 1 0 0 1 4 7 1 14 81 242 391
1 1 0 1 1 0 0 0 1 0 1 0 0 1 4 7 1 14 82 246 398
1 1 0 1 1 1 0 1 0 0 0 0 1 5 10 10 1 15 87 256 408
1 1 1 0 0 0 0 0 0 2 0 0 0 0 2 6 1 15 87 258 414
1 1 1 0 0 1 0 0 1 0 1 0 0 1 4 7 1 15 88 262 421
1 1 1 0 1 0 0 0 1 0 1 0 0 1 4 7 1 15 89 266 428
1 1 1 0 1 1 0 1 0 0 0 0 1 5 10 10 1 16 94 276 438
1 1 1 1 0 0 0 0 1 0 1 0 0 1 4 7 1 16 95 280 445
1 1 1 1 0 1 0 1 0 0 0 0 1 5 10 10 1 17 100 290 455
1 1 1 1 1 0 0 1 0 0 0 0 1 5 10 10 1 18 105 300 465
1 1 1 1 1 1 1 0 0 0 0 1 6 15 20 15 2 24 120 320 480
For all 64 possible trials of 6 samples from a population of 2 symbols:
XctTot: 2 12 30 40 30
XctEr: 2 12 30 40 30
AugTot: 2 24 120 320 480
AugEar: 2 24 120 320 480
Table 4.2 All Possible Trials for N=3 and s=4
Trial Reps Aug. Reps Accum. Aug. Reps
0 0 0 0 1 0 0 1 4 6 1 4 6
0 0 0 1 0 1 0 0 1 3 1 5 9
0 0 0 2 0 1 0 0 1 3 1 6 12
0 0 1 0 0 1 0 0 1 3 1 7 15
0 0 1 1 0 0 2 0 0 2 1 7 17
0 0 1 2 0 0 1 0 0 1 1 7 18
0 0 2 0 0 1 0 0 1 3 1 8 21
0 0 2 1 0 0 1 0 0 1 1 8 22
0 0 2 2 0 0 2 0 0 2 1 8 24
0 1 0 0 0 1 0 0 1 3 1 9 27
0 1 0 1 0 0 2 0 0 2 1 9 29
0 1 0 2 0 0 1 0 0 1 1 9 30
0 1 1 0 0 0 2 0 0 2 1 9 32
0 1 1 1 0 1 0 0 1 3 1 10 35
0 1 1 2 0 0 1 0 0 1 1 10 36
0 1 2 0 0 0 1 0 0 1 1 10 37
0 1 2 1 0 0 1 0 0 1 1 10 38
0 1 2 2 0 0 1 0 0 1 1 10 39
0 2 0 0 0 1 0 0 1 3 1 11 42
0 2 0 1 0 0 1 0 0 1 1 11 43
0 2 0 2 0 0 2 0 0 2 1 11 45
0 2 1 0 0 0 1 0 0 1 1 11 46
0 2 1 1 0 0 1 0 0 1 1 11 47
0 2 1 2 0 0 1 0 0 1 1 11 48
0 2 2 0 0 0 2 0 0 2 1 11 50
0 2 2 1 0 0 1 0 0 1 1 11 51
0 2 2 2 0 1 0 0 1 3 1 12 54
1 0 0 0 0 1 0 0 1 3 1 13 57
1 0 0 1 0 0 2 0 0 2 1 13 59
1 0 0 2 0 0 1 0 0 1 1 13 60
1 0 1 0 0 0 2 0 0 2 1 13 62
1 0 1 1 0 1 0 0 1 3 1 14 65
1 0 1 2 0 0 1 0 0 1 1 14 66
1 0 2 0 0 0 1 0 0 1 1 14 67
1 0 2 1 0 0 1 0 0 1 1 14 68
1 0 2 2 0 0 1 0 0 1 1 14 69
1 1 0 0 0 0 2 0 0 2 1 14 71
1 1 0 1 0 1 0 0 1 3 1 15 74
1 1 0 2 0 0 1 0 0 1 1 15 75
1 1 1 0 0 1 0 0 1 3 1 16 78
1 1 1 1 1 0 0 1 4 6 2 20 84
1 1 1 2 0 1 0 0 1 3 2 21 87
1 1 2 0 0 0 1 0 0 1 2 21 88
1 1 2 1 0 1 0 0 1 3 2 22 91
1 1 2 2 0 0 2 0 0 2 2 22 93
1 2 0 0 0 0 1 0 0 1 2 22 94
1 2 0 1 0 0 1 0 0 1 2 22 95
1 2 0 2 0 0 1 0 0 1 2 22 96
1 2 1 0 0 0 1 0 0 1 2 22 97
1 2 1 1 0 1 0 0 1 3 2 23 100
1 2 1 2 0 0 2 0 0 2 2 23 102
1 2 2 0 0 0 1 0 0 1 2 23 103
1 2 2 1 0 0 2 0 0 2 2 23 105
1 2 2 2 0 1 0 0 1 3 2 24 108
2 0 0 0 0 1 0 0 1 3 2 25 111
2 0 0 1 0 0 1 0 0 1 2 25 112
2 0 0 2 0 0 2 0 0 2 2 25 114
2 0 1 0 0 0 1 0 0 1 2 25 115
2 0 1 1 0 0 1 0 0 1 2 25 116
2 0 1 2 0 0 1 0 0 1 2 25 117
2 0 2 0 0 0 2 0 0 2 2 25 119
2 0 2 1 0 0 1 0 0 1 2 25 120
2 0 2 2 0 1 0 0 1 3 2 26 123
2 1 0 0 0 0 1 0 0 1 2 26 124
2 1 0 1 0 0 1 0 0 1 2 26 125
2 1 0 2 0 0 1 0 0 1 2 26 126
2 1 1 0 0 0 1 0 0 1 2 26 127
2 1 1 1 0 1 0 0 1 3 2 27 130
2 1 1 2 0 0 2 0 0 2 2 27 132
2 1 2 0 0 0 1 0 0 1 2 27 133
2 1 2 1 0 0 2 0 0 2 2 27 135
2 1 2 2 0 1 0 0 1 3 2 28 138
2 2 0 0 0 0 2 0 0 2 2 28 140
2 2 0 1 0 0 1 0 0 1 2 28 141
2 2 0 2 0 1 0 0 1 3 2 29 144
2 2 1 0 0 0 1 0 0 1 2 29 145
2 2 1 1 0 0 2 0 0 2 2 29 147
2 2 1 2 0 1 0 0 1 3 2 30 150
2 2 2 0 0 1 0 0 1 3 2 31 153
2 2 2 1 0 1 0 0 1 3 2 32 156
2 2 2 2 1 0 0 1 4 6 3 36 162
For all 81 possible trials of 4 samples from a population of 3 symbols:
XctTot: 3 24 72
XctEr: 3 24 72
AugTot: 3 36 162
AugEar: 3 36 162
Table 4.3 Summary of All Possible Trials for N=10 s=3 and N=100 s=2
For all 1000 possible trials of 3 samples from a population of 10 symbols:
XctTot: 10 270
XctEr: 10 270
AugTot: 10 300
AugEar: 10 300
For all 10000 possible trials of 2 samples from a population of 100 symbols:
XctTot: 100
XctEr: 100
AugTot: 100
AugEar: 100
Last updated: 2000-09-06