Suppose we have a random-number machine for use in a game of chance. The machine has an opaque covering, and contains some number of tokens which the machine somehow physically mixes. The tokens can be identified by serial number, and each is replaced before another is drawn. Is there a way to externally check the honesty of the system?
If we collect samples until tokens start to appear multiple times, and then repeat trials of that same size, we can make some statistical inferences. The relationship between the size of the population and the expected number of repetitions is discussed in my 1994 article [1], Estimating Population from Repetitions in Accumulated Random Samples.
If we get some token twice (in a single trial), we have a "double" or a "2-rep," and if some token occurs three times, we have a "triple" or a "3-rep." It turns out that one 3-rep has the same probability of occurring as three 2-reps. We can convert each "n-rep" to a count of C(n,2) [that is, the number of combinations of n things taken 2 at a time] 2-reps. When all "n-reps" have been converted to 2-reps and summed, we get a value for augmented 2-reps which is easily related to population, assuming that all tokens have an equal probability. If some tokens appear more frequently than expected (on average), we will get more augmented repetitions, and predict a smaller than expected value for the "effective" population.
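As a small illustration of that conversion (a sketch only, not the worksheet's actual code; the function name and input format are my own), each n-rep is simply weighted by C(n,2):

    from math import comb

    def augmented_2reps(rep_counts):
        """Convert n-rep counts from one trial into equivalent 2-reps.

        rep_counts maps repetition level n (n >= 2) to the number of
        n-reps seen; each n-rep contributes C(n, 2) doubles.
        """
        return sum(count * comb(n, 2) for n, count in rep_counts.items())

    # One 3-rep counts the same as three 2-reps:
    print(augmented_2reps({3: 1}))          # 3
    print(augmented_2reps({2: 2, 3: 1}))    # 2*1 + 1*3 = 5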
The first worksheet accepts an assumed population value and uses it to compute a reasonable sample size. The computed sample size should produce 4 augmented 2-reps on average, provided each element has an equal probability. Any significant number of non-equal probabilities will raise the average number of augmented doubles found (over a number of experiments).
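As a sketch of how such a sample size might be chosen, using the common approximation that a uniform sample of size s from a population of N yields about s(s-1)/(2N) augmented 2-reps on average (the worksheet's exact formula may differ):

    from math import ceil, sqrt

    def sample_size_for_doubles(population, target_doubles=4):
        """Smallest sample size s whose expected augmented 2-reps,
        s*(s-1)/(2*population), reaches target_doubles."""
        # Solve s*(s-1) = 2*population*target_doubles for the positive root.
        s = (1 + sqrt(1 + 8 * population * target_doubles)) / 2
        return ceil(s)

    # For a 32-bit population the suggested sample is roughly
    # sqrt(8 * 2**32), i.e. somewhat over 185,000 values.
    print(sample_size_for_doubles(2**32))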
The first worksheet also accepts the actual sample size, from which it computes the theoretically expected count at each repetition level. The predicted counts are displayed on the second worksheet as benchmarks against which we can compare the counts found by experiment. This comparison will highlight problems in the distribution of n-reps.
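One standard way to compute the expected count at each repetition level (again a sketch under the equal-probability assumption, not necessarily the exact formula in the worksheet): each of the N values appears a Binomial(s, 1/N) number of times, so the expected number of exact r-reps is N times the probability of exactly r occurrences.

    from math import comb

    def expected_rreps(population, sample_size, r):
        """Expected number of values appearing exactly r times when
        sample_size values are drawn uniformly, with replacement,
        from `population` distinct values."""
        p = 1.0 / population
        return (population * comb(sample_size, r)
                * p**r * (1 - p)**(sample_size - r))

    # Expected exact 2-reps and 3-reps for a 32-bit population:
    print(expected_rreps(2**32, 185365, 2))   # about 4 doubles
    print(expected_rreps(2**32, 185365, 3))   # far less than one 3-rep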
The second worksheet performs the computations necessary to find the equivalent number of "augmented 2-reps." The user conducts a substantial number of same-size trials on the population, and develops an average count for the number of occurrences of each n-rep. (In one trial, if the value 17 occurs three times, and the value 5 also occurs three times, we have two 3-reps, for an occurrence count of 2 at the 3-reps level.) By entering the (generally fractional) average occurrence counts at each repetition level, the equivalent number of augmented 2-reps is accumulated as the final value in the rightmost or "Running Total 2-Reps" column.
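A sketch of that accumulation (the averages here are hypothetical; the list of running totals just mirrors the idea of the "Running Total 2-Reps" column):

    from math import comb

    def running_total_2reps(average_counts):
        """Accumulate augmented 2-reps level by level from the average
        occurrence counts; the last entry is the final total."""
        totals, total = [], 0.0
        for n in sorted(average_counts):
            total += average_counts[n] * comb(n, 2)  # each n-rep = C(n,2) doubles
            totals.append(total)
        return totals

    # Hypothetical averages over many trials: 3.5 2-reps and 0.25 3-reps per trial.
    print(running_total_2reps({2: 3.5, 3: 0.25}))   # [3.5, 4.25]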
The third worksheet estimates the population from the entered trial size and the total of augmented 2-reps. In general, finding twice the number of 2-reps predicts a population of half the size, or a one-bit-smaller "address space."
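That estimate can be sketched by inverting the same approximation, expected doubles = s(s-1)/(2N); again, the worksheet's exact computation may differ:

    def estimate_population(sample_size, augmented_2reps):
        """Estimate the effective population from one trial size and its
        total of augmented 2-reps, using N ~= s*(s-1)/(2*doubles)."""
        return sample_size * (sample_size - 1) / (2.0 * augmented_2reps)

    # Doubling the observed 2-reps halves the estimate: one bit of "address space."
    print(estimate_population(185365, 4))   # about 2**32
    print(estimate_population(185365, 8))   # about 2**31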
In the early 1980s, Intel Corporation designed an IC they called a "Keprom," or "Keyed-Access EPROM." The Keprom device had an on-chip physically-random number generator, and a technical paper [2] reported results from tests on that generator. Briefly, the design consists of a 32-bit shift register which accumulates bits sampled from a drifting on-chip oscillator, noise, and feedback from the register itself.
Clicking on the "Keprom" button introduces the measured values into the worksheets. While the resulting 30-bit population value may seem "not too different" from the expected 32 bits, 2^30 values are only 1/4 of the expected 2^32. That is, in this design, 3 out of every 4 possible values might as well not exist.
The Keprom experiment results also show quite a few repetitions above the 4-rep level. Such higher-level repetitions should be very improbable in a true random system of the indicated size. Accordingly, the experimental results would warrant an investigation into the actual values involved and how these particular values could be so vastly more probable than one would expect. For example, given the assumed Keprom population and sample size, we should expect to find about 4 x 10^-73 18-reps on average (that is, we should never actually see one). This makes it extremely unlikely that the 18-rep found experimentally was produced by a flat distribution. The ideal model predicts a certain level of each repetition, and when those levels are not found, the model we are testing is shown to be something less than ideal.
[1] Ritter, T. 1994. Estimating Population from Repetitions in Accumulated Random Samples. Cryptologia. 18(2): 155-190.
[2] Letham, L., D. Hoff and A. Folmsbee. 1986. A 128K EPROM Using Encryption of Pseudorandom Numbers to Enable Read Access. IEEE Journal of Solid-State Circuits. SC-21(5): 881-888.
Last updated: 2005-02-21 (from 1996-09-20)