PUBLISHED: Ritter, T. 1994. Estimating Population from Repetitions in Accumulated Random Samples. Cryptologia. 18(2):155-190.
ADDRESS: Ritter Software Engineering, 2609 Choctaw Trail, Austin, Texas 78745, (512) 892-0494
ABSTRACT: It was desired to estimate the number of unique values or codes produced by a physically-random number generator of the sort used to form cryptographic message keys; only the codes produced by the generator were available for analysis. Subsequently, the classical "birthday paradox" was reversed to estimate statistical population from the average number of repetitions found in trials containing a substantial number of random samples. Although similar in premise to previous work, a new relationship was found which includes combinatorial contributions from higher repetition levels. The resulting equations proved to be simple, intuitive, exact, and novel in that they were easily reversed to produce unique population estimates. The distribution of repetitions in different trials was investigated experimentally and found not inconsistent with a hypothesis of Poisson distribution. In practice, multiple trials establish the mean number of repetitions for a trial of a given size; the new equations then estimate population directly. These reversed birthday methods are able to detect population variations despite the presence of inverse probability effects which can hide variations from common distribution tests. Thus, reversed birthday methods could prove valuable in the certification of physically-random number generators. Reversed birthday methods also could be useful for estimating the effective number of keys in cipher certification tests as well as more general statistical applications.
KEYWORDS: Population estimation, birthday paradox, physically random generators, population deviation testing, augmented repetitions, augmented doubles, cryptography, statistics
Cryptographic systems often need "really random" numbers for various applications such as message keys and authentication values. Really random numbers are valuable because they follow no pattern, so no pattern can be predicted by cryptanalysis. Unfortunately, computing machinery is fully deterministic and so cannot be used to generate really random numbers; algorithmic processes are inherently pseudo-random. But really random numbers can be generated by special "physically random" hardware designs. Normally, such devices measure some complex physical effect such as thermal noise, diode electron leakage, quantum particle decay, chaotic fluid flow, etc.
Various tests of distribution can (and should) be applied to physically random generators, but such tests are not conclusive. In marked contrast to most pseudorandom designs, a physically random generator gives no guarantee that the population of possible values are all available with equal probability. Thus, there is a need for some technique to measure the population (or show that it exceeds some measurable value) and also show that the population is evenly distributed. Notice that any such technique is limited to using only the random values produced by such a generator.
When the issue of testing a really random generator came up in the Usenet sci.crypt group (see Appendix A), it was suggested that "the birthday paradox" could be used, in reverse, to estimate population. As it turns out, the birthday paradox has been discussed in cryptography for the past decade (for example, see Meyer and Matyas (1982) [14: 672-673], and Davies and Price (1984) [4: 116-117, 278-280]). The birthday paradox has been proposed for public-key distribution in Merkle [13: 13-21], used for DES analysis in Kaliski, Rivest and Sherman [6] (also described in Patterson [17: 156-166]), and forms the basis for authentication attacks described in Seberry and Pieprzyk [21: 157]. Birthday methods were apparently used by Letham, Hoff and Folmsbee (1986) [11] during tests of the hardware random number generator on the Intel 27916 "Keprom" (see Appendix B); results were given, but the theory of the test itself was not described.
The method of the investigation in this paper was to find easily-reversed equations for predicting the average number of repeated values found in trials containing substantial numbers of random samples. Some well-known simple equations, previously described as approximations [9], were found to be exact predictors of a new relationship here called augmented repetitions. This relationship was exhaustively analyzed on some small cases, and then tested experimentally with good results. Experiments indicated that the number of augmented repetitions in separate trials apparently occur in Poisson distribution. Multiple trials established a mean value for the augmented repetitions in a trial of a given size, which gave a good estimate of the parent population.
During this development, it became apparent that certain types of population deviation which could be detected by reversed birthday methods probably would not be detected by conventional statistical tests. Because of the importance of the cryptographic objects created from physically random devices, the possibility that some hardware designs may have unsuspected problems is more than just interesting.
There are fewer than 30 students in a particular classroom; is it likely that at least two of the students have the same birthday? (See, for example, Goodman and Ratti [5: 139-140].)
At first the question may seem preposterous: There are 365 possible birthdays, and only 30 students. The paradox is that it is nevertheless probable (probability exceeding 0.5) that some pair of students does have the same birthday, provided that we have at least 23 students.
How can this be? Well, let's think about it: Suppose, as students enter the classroom, they call out their birthday, and any other student in the room with the same birthday calls out "match":
The non-intuitive part of the paradox is that, while the number of students increases linearly, the number of pairs -- the number of possible matches -- increases combinatorially. When there are 23 students, there are (23 C 2) or 253 possible pairs; the next student gives us a total of 276 possible pairs. Whenever we add a single student, we can have a possible match with each student already in the room.
Fig. 1.1 Birthday Probability Logic. On its own, the first sample, W, has no probability of forming a double. With the second sample, X, there is just one possible match and therefore just one chance in N (for population N) of a match. The third sample, Y, presents two chances for a match (W and X), but the total probability of a match in three samples also includes all previous possible matches, for a total of three chances out of N. The logic for subsequent samples continues similarly.
The situation can be described by the flowchart in figure 1.1,
where each sample (each different student) is labeled with a
different letter (W, X, Y and Z). Each elemental test (each
diamond-shaped box) is a comparison to a single previous value;
the branch to the right is "match" or "success," while the path
downward is "no match." For a single comparison, there is one
possible match out of the entire population of N; thus, probability
of a match is just 1/N. If we have two comparisons (both will be
different, or we would have had a match with the previous sample),
the probability of success is 2/N. This leaves 1 - 2/N (of the
cases requiring a new sample!) to continue as the current
"no match" probability. The expression for "success" gets
complex; a good way to think about the problem is to follow the
path down; this is the cumulative probability of having no matches.
By following the path down in figure 1.1, we can write the probability expressions shown in table 1.1.
   Table 1.1  Unmatched Probability Expressions

      Samples   Probability of No Match
         1      1
         2      (1-1/N)
         3      (1-1/N)(1-2/N)
         4      (1-1/N)(1-2/N)(1-3/N)
Clearly, we have a well-defined sequence, and once we recognize this, it is relatively easy to compute the values for the birthday problem. We can formalize this slightly as the cumulative probability (from 0 to 1) of no match (q) for population N and the number of samples (students) s:
   (1.1)  Pq(N,s) = (1-1/N)(1-2/N)..(1-(s-1)/N)

and, since the total probability is 1, the probability of a match (p) is just 1 minus (1.1):

   (1.2)  Pd(N,s) = 1 - (1-1/N)(1-2/N)..(1-(s-1)/N) .
For a small population like 365, equation 1.2 allows us to compute the birthday probabilities shown in table 1.2.
   Table 1.2  Numerical Birthday Probabilities

      Students   Probability
          5        0.0271
         10        0.1169
         15        0.2529
         19        0.3791
         20        0.4114
         22        0.4757
         23        0.5073
         24        0.5383
         25        0.5687
         27        0.6269
         30        0.7063
         35        0.8144
         40        0.8912
         50        0.9704
         60        0.9941
         70        0.9992
         80        0.9999

But for a large population, while (1.2) may be term-by-term computable, there will be too many terms to compute. Fortunately, we can write (1.2) as
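For a small population, the product in (1.1) can be evaluated directly. A short Python sketch (the article presents no code; the function names are ours) reproduces rows of table 1.2:

```python
def p_no_match(N, s):
    """Equation 1.1: cumulative probability of no repeated value
    among s random samples from a population of N."""
    q = 1.0
    for i in range(1, s):
        q *= 1.0 - i / N
    return q

def p_match(N, s):
    """Equation 1.2: probability of at least one match."""
    return 1.0 - p_no_match(N, s)

print(round(p_match(365, 23), 4))  # 0.5073
print(round(p_match(365, 30), 4))  # 0.7063
```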
                    N  N-1  N-2  ..  N-s+1
   Pd(N,s) = 1  -  ------------------------- ,
                    N   N    N   ..    N

then recognize the numerator as N!/(N-s)!, and have
                               N!
   (1.3)  Pd(N,s) = 1  -  ------------ .
                           (N-s)! N^s

This is exact, although it quickly becomes unruly as the factorials exceed any given precision, and the lower factorial limits the number of samples s to the population N. (Kullback [9: 29] promotes e^-s(s-1)/2N as a "good approximation" to the right-hand part of (1.3) for large N; we will see this exponent again in section 2.) Fortunately, we have a good approximation to Ln(x!) for at least modest x (say, x > 9) using Stirling's formula (e.g., Knuth [7: 111]):
   (1.4)  LnFact(x) =(approx.) (x + 1/2) Ln(x) - x + Ln(2 Pi)/2

                          1       1           1           1
                      +  ---  -  -------  +  --------  -  --------
                         12x     360 x^3     1260 x^5     1680 x^7

And so equation 1.3 can be tamed with logs (for s > 9, and s <= N) as:
   (1.5)  Pd(N,s) = 1 - e^( LnFact(N) - LnFact(N-s) - s Ln(N) )
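Most numeric libraries supply a log-gamma function, and Ln(x!) = lgamma(x+1), so (1.5) can be evaluated without hand-coding Stirling's series. A Python illustration (a sketch of ours, not the article's code):

```python
from math import exp, lgamma, log

def ln_fact(x):
    """Ln(x!) via the log-gamma function, accurate even for large x."""
    return lgamma(x + 1)

def p_match_ln(N, s):
    """Equation 1.5: birthday match probability via log factorials."""
    return 1.0 - exp(ln_fact(N) - ln_fact(N - s) - s * log(N))

print(round(p_match_ln(365, 23), 4))       # 0.5073
print(p_match_ln(10**9, 50000))            # feasible far beyond direct factorials
```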
Equation 1.5 is fairly convenient for the machine evaluation of birthday probabilities (although it will not handle s larger than N) because only the log factorials need be evaluated, and these are easy to find with high accuracy. But (1.5) is not easy to reverse for use in estimating population N when given values for doubles d and samples s. Accordingly, let us pursue an alternate development:
Consider a trial of two samples (s=2): One sample is whatever it is, and, if the next sample is really independent, there should be one chance in population N that exactly the same value will repeat, so the expected number of doubles Ed for two samples is: Ed(N,2) = 1 / N.
Consider a trial of three samples (s=3): There are three possible pairs of values, and for each pair, one value is whatever it is, and the probability that the other sample will match is one in N, so the expected number of doubles Ed(N,3) = 3 / N.
Consider a trial of four samples (s=4): There are six possible pairs of values, so the expected number of doubles will be Ed(N,4) = 6 / N.
By now it should be obvious that we are walking down Pascal's triangle and finding the number of combinations of s things taken two at a time: (s C 2), also known as "n choose r" or "the binomial coefficient."
We can relate the number of samples s to the population N through the expected number of augmented doubles Ead we observe:
   (2.1)  Ead(N,s) = (s C 2) / N .

But

   (s C 2) = s(s-1) / 2 ,

so

   (2.2)  Ead(N,s) = s(s-1) / 2N .
As we shall see, in this development the expected value has an unusual meaning. Here the expected value represents not just the number of exact doubles, but also contributions from triples and higher-order repetitions (if any). We will call this augmented doubles.
Consider the cases of N=2 and s=3, and N=3 and s=2 in tables 2.1 and 2.2 (note that equation 2.2 works just fine with s > N). In table 2.1 equation 2.2 is used to predict the number of expected augmented doubles for these cases.
   Table 2.1  Expected Augmented Doubles for N=2 and s=3, and N=3 and s=2

                 3 * 2                         2 * 1
   Ead(2,3) = --------- = 1.5     Ead(3,2) = --------- = 1/3
                 2 * 2                         2 * 3

Table 2.2 illustrates every possible trial for both of these cases, and gives totals for comparison to the predicted values.
   Table 2.2  All Possible Trials for N=2 and s=3, and N=3 and s=2

      N=2  s=3                  N=3  s=2
   Trial  Aug. Doubles       Trial  Aug. Doubles
   0 0 0       3             0 0         1
   0 0 1       1             0 1         0
   0 1 0       1             0 2         0
   0 1 1       1             1 0         0
   1 0 0       1             1 1         1
   1 0 1       1             1 2         0
   1 1 0       1             2 0         0
   1 1 1       3             2 1         0
              ---            2 2         1
              12                        ---
                                         3

   Ead(2,3) = 12 aug. doubles / 8 cases = 1.5
   Ead(3,2) =  3 aug. doubles / 9 cases = 1/3
Note that the equations in this new development predict the number of doubles exactly, if we interpret a triple as three different pairs of doubles. Thus, the augmentation from triples to doubles is (3 C 2) or 3. We could express a particular augmented result as the sum of the contributions from higher repetition levels as:
                   n
   (2.3)  Ar[k] = SUM (i C k) r[i] .
                  i=k

In (2.3), Ar[k] is the augmented repetition count at the k-th level (k=2 for doubles; k=3 for triples, etc.), and r[i] the exact repetition count at the i-th level. Notice that we are using the idea of combinations twice: Once to give us the number of pairs (and, thus, the number of possible matches) in s samples, and then again to represent the number of doubles hidden in any higher-order exact repetitions.
In Section 6 we will encounter equations which predict exact repetitions and thus avoid augmentation calculations. But when we use exact repetitions, if we have a double, and then find that same value again, we will have a new triple, and one fewer doubles. Thus, if we are using exact doubles to predict population, finding another match can actually decrease the number of exact doubles, and so change our prediction in exactly the wrong direction. In contrast, augmented values retain repetition information instead of throwing it away.
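The augmentation sum of (2.3) is a one-liner in practice. A Python sketch (ours); for example, a single value seen six times contributes (6 C 2) = 15 augmented doubles and (6 C 3) = 20 augmented triples:

```python
from math import comb

def augmented(exact, k):
    """Equation 2.3: augmented k-reps from exact repetition counts.
    exact maps each repetition level i to the count r[i] of values
    occurring exactly i times."""
    return sum(comb(i, k) * r for i, r in exact.items() if i >= k)

print(augmented({3: 1}, 2))   # 3: one triple counts as three doubles
print(augmented({6: 1}, 2))   # 15
print(augmented({6: 1}, 3))   # 20
```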
These simple examples have (so far) confirmed the equations. With a beginning confidence in the analysis, we can estimate the population N from the number of samples s and augmented doubles ad:
   (2.4)  Nad(s,ad) = s(s-1) / 2 ad .

With s positive and ad greater than zero, (2.4) produces a unique, unambiguous positive result.
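Equations 2.2 and 2.4 translate directly into code. A minimal Python sketch (function names ours), checked against table 2.1 and the first row of the experimental results in section 7:

```python
def ead(N, s):
    """Equation 2.2: expected augmented doubles in s samples."""
    return s * (s - 1) / (2 * N)

def nad(s, ad):
    """Equation 2.4: population from samples and mean augmented doubles."""
    return s * (s - 1) / (2 * ad)

print(ead(2, 3))               # 1.5, as in table 2.1
print(round(nad(100, 0.492)))  # 10061, as in table 7.1
```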
We can even create an equation for predicting samples s needed from population N to produce (on average) a given number of augmented doubles ad. Using (2.2), we can put s into normal quadratic form, use the quadratic formula, and end up with:
                      1 + Sqrt( 1 + 8 ad N )
   (2.5)  s(N,ad) = -------------------------- .
                                2

(As we shall see in Section 5, the expected value ad for a 50 percent probability of finding an augmented double is -Ln(0.5) or 0.693.)
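As a quick check of (2.5), a Python sketch (ours) recovers the classic 23-student answer:

```python
from math import log, sqrt

def samples_for(N, ad):
    """Equation 2.5: samples expected to yield ad augmented doubles."""
    return (1 + sqrt(1 + 8 * ad * N)) / 2

# Samples for a 50 percent chance of a match (ad = -Ln(0.5)) when N = 365:
print(round(samples_for(365, -log(0.5))))  # 23
```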
Now consider the situation with triples: Given any set of three samples, one is whatever it is, and one time out of N the second is the same value (a double), and of these, one time out of N the third value is also the same (a triple). So we have:
   (3.1)  Eat(N,s) = (s C 3) / N^2 .

But

   (s C 3) = s(s-1)(s-2) / 6 ,

so

   (3.2)  Eat(N,s) = s(s-1)(s-2) / 6N^2 .

Now we can estimate the population N from samples s and augmented triples at:

   (3.3)  Nat(s,at) = Sqrt( s(s-1)(s-2) / 6 at ) .
Consider the cases of N=2 and s=4, and N=3 and s=3, as estimated in table 3.1:
   Table 3.1  Expected Augmented Doubles and Triples for N=2 and s=4, and N=3 and s=3

                 4 * 3                        3 * 2
   Ead(2,4) = --------- = 3      Ead(3,3) = --------- = 1
                 2 * 2                        2 * 3

               4 * 3 * 2                     3 * 2 * 1
   Eat(2,4) = ----------- = 1    Eat(3,3) = ----------- = 1/9
                 6 * 4                         6 * 9

with every possible trial shown in table 3.2.
   Table 3.2  All Possible Trials for N=2 and s=4, and N=3 and s=3

              Augmented  Augmented           Augmented  Augmented
    Trial      Doubles    Triples    Trial    Doubles    Triples
   0 0 0 0        6          4       0 0 0       3          1
   0 0 0 1        3          1       0 0 1       1          0
   0 0 1 0        3          1       0 0 2       1          0
   0 0 1 1        2          0       0 1 0       1          0
   0 1 0 0        3          1       0 1 1       1          0
   0 1 0 1        2          0       0 1 2       0          0
   0 1 1 0        2          0       0 2 0       1          0
   0 1 1 1        3          1       0 2 1       0          0
   1 0 0 0        3          1       0 2 2       1          0
   1 0 0 1        2          0       1 0 0       1          0
   1 0 1 0        2          0       1 0 1       1          0
   1 0 1 1        3          1       1 0 2       0          0
   1 1 0 0        2          0       1 1 0       1          0
   1 1 0 1        3          1       1 1 1       3          1
   1 1 1 0        3          1       1 1 2       1          0
   1 1 1 1        6          4       1 2 0       0          0
                 ---        ---      1 2 1       1          0
                  48         16      1 2 2       1          0
                                     2 0 0       1          0
   Ead(2,4) = 48 doub / 16 = 3       2 0 1       0          0
   Eat(2,4) = 16 trip / 16 = 1       2 0 2       1          0
                                     2 1 0       0          0
                                     2 1 1       1          0
   Ead(3,3) = 27 doub / 27 = 1       2 1 2       1          0
   Eat(3,3) =  3 trip / 27 = 1/9     2 2 0       1          0
                                     2 2 1       1          0
                                     2 2 2       3          1
                                                ---        ---
                                                 27          3

Note that the values from (2.2) and (3.2) are precisely the values we find by exhaustive enumeration.
For some parameters, when the number of samples s exceeds the population N, we can expect more triples than doubles; this is just the realization that (n C 3) is often greater than (n C 2). For example, suppose we have N=2 and s=6, a 6-bit binary value, and suppose all six samples are the same. In this case we have only 15 doubles (6 C 2), but 20 triples (6 C 3). Of course, there are only two such cases (all-zeros and all-ones) at this length, and these are rare; it takes more such cases to affect the expected value (see table 3.3).
   Table 3.3  Expected Augmented Doubles and Triples in Random Bit Arrays

      Bits   Aug. Doubles   Aug. Triples
        2        0.5             0
        4        3               1
        6        7.5             5
        8       14              14
       10       22.5            30
A similar effect occurs on observed counts: An observed count of six occurrences of the same value (a level-6 exact repetition) would be taken as 15 augmented doubles (6 C 2), but 20 augmented triples (6 C 3). Augmentation from higher repetitions can have more effect on triples than doubles.
Every repetition size r (r=2 for doubles; r=3 for triples) gives us an expected augmented repetition value Ear :
   (4.1)  Ear(r,N,s) = (s C r) / N^(r-1) ,

and from this we get an estimation of population N:

   (4.2)  Nr(r,s,ar) = ( (s C r) / ar )^(1/(r-1))

where ar is the augmented repetition count.
To bypass the limitations of (s C r) to parameters s >= r, we note that

   (s C r) = s(s-1)..(s-r+1) / r(r-1)..1  =  s! / r! (s-r)!

so factorial versions are:
                               s!
   (4.3)  Ear(r,N,s) = -------------------- ,
                        r! (s-r)! N^(r-1)

                        (       s!       )
   (4.4)  Nr(r,s,ar) =  ( -------------- ) ^ (1/(r-1)) ,
                        (  r! (s-r)! ar  )

and, in logs:
   (4.7)  LnEar(r,N,s) = LnFact(s) - LnFact(r) - LnFact(s-r) - (r-1) Ln(N) ,

                          LnFact(s) - LnFact(r) - LnFact(s-r) - Ln(ar)
   (4.8)  LnNr(r,s,ar) = ----------------------------------------------- .
                                             r - 1
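Equations 4.7 and 4.8 map directly onto a built-in log-gamma function. A Python sketch (ours), checked against (2.2) for the experimental parameters of section 7:

```python
from math import exp, lgamma, log

def ln_ear(r, N, s):
    """Equation 4.7: log of expected augmented r-reps in s samples."""
    return (lgamma(s + 1) - lgamma(r + 1) - lgamma(s - r + 1)
            - (r - 1) * log(N))

def ln_nr(r, s, ar):
    """Equation 4.8: log of the population estimate."""
    return (lgamma(s + 1) - lgamma(r + 1) - lgamma(s - r + 1)
            - log(ar)) / (r - 1)

# 100 samples from population 10,000: (100 C 2)/10000 = 0.495 aug. doubles
print(round(exp(ln_ear(2, 10000, 100)), 3))   # 0.495
print(round(exp(ln_nr(2, 100, 0.495))))       # 10000
```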
Table 4.1 and table 4.2 illustrate two checks on equation 4.7 and higher-order repetitions. In table 4.1 we have all 64 possible trials of six bits (binary samples). The first column lists each particular trial, the second shows the number of repetitions found (from 6-reps through doubles), the third column gives the augmented repetitions, while the last column accumulates totals. At the bottom, equation 4.7 produces values for comparison; note that augmented 6-reps, 5-reps, 4-reps, 3-reps and doubles all occur exactly as equation 4.7 predicts. (Exact repetitions are also predicted, using (6.4).) Similarly, table 4.2 presents all 81 possible trials of four samples from a population of three, with similar success. Table 4.3 gives other tests, in summary version only.
The new development yields formulas for the expected number of augmented doubles, but sometimes we would prefer to have probability, a value from 0 to 1 (although this is unnecessary for population estimation). Let us assume, for the present, that the number of augmented doubles found in a fixed number of samples obeys a Poisson distribution. A Poisson distribution has only a single parameter, the quantity np (samples n times probability of success p); this is the mean u or expected value (see, for example, Parratt [16]). Given u we can calculate the probability that k augmented doubles will be found by using the Poisson cumulative distribution function shown in equation 5.1:
           n            n    u^k e^-u
   (5.1)  SUM P(k;u) = SUM  ----------  =  1
          k=0          k=0      k!

                                  u^2 e^-u          u^n e^-u
          =  e^-u  +  u e^-u  +  ----------  + ... + --------
                                      2                 n!
Each term in (5.1) gives the probability of k successes. In our case, (5.1) gives us the probability of k doubles given a particular value of u, and a trial of a particular size will imply some value for u. Since the total probability (the sum of all terms) is 1, the probability of having at least one double is 1 minus the probability of zero doubles:
(5.2) Pd(u) = 1 - e^-u .
Now, since we already have an equation for expected augmented doubles, we might use Ead from (2.2) as an approximation to u, and then (5.2) should give us the approximate probability of an augmented double. Right?
Unconvinced? Well, how about this (from Meyer and Matyas [14: 672-673]): We start with (1.1), the probability of no match, from the original development for "at least one match":
   Pq(N,s) = (1-1/N)(1-2/N)..(1-(s-1)/N)

and take the log of each side:

   Ln(Pq(N,s)) = Ln(1-1/N) + Ln(1-2/N) + ... + Ln(1-(s-1)/N) .

Now (assuming s << N), each of these terms can be expressed as a series:
                          1      1       1
   (5.3)  Ln(1-1/N) = - ---  -  ----  -  ----  -  . . .
                          N     2N^2     3N^3

but if N is large (over, say, 10^3), the first term will dominate, and the rest can be discarded. This leaves us with:
   Ln(Pq(N,s)) = -[ 1/N + 2/N + 3/N + ... + (s-1)/N ]

and this series is exactly the average of all the terms (the first plus the last, divided by two), times the number of terms, after factoring out N:

   Ln(Pq(N,s)) = -( ((1 + s-1) / 2) (s-1) ) / N

or

   Ln(Pq(N,s)) = - s(s-1) / 2N .

But look at this equation: The right-hand side is identical with equation (2.2) for expected augmented doubles, except for the sign! So the approximate probability of "at least one match" can be calculated from our development for augmented repetitions. Thus,
   Pq(N,s) = e^-Ead

and

   (5.4)  Pd(N,s) = 1 - e^-Ead ,

which is exactly what we expect from a Poisson distribution of augmented doubles (in separate trials each with the same fixed number of samples). (Kaliski, et al. [6: 87] point out that the right-hand part of (1.3) can be approximated by essentially e^-s(s-1)/2N, this same exponential.)
To finish up, we can easily write:
   (5.5)  Ed = -Ln( 1 - Pd ) .

This is why, in equation 2.4, we could not simply use 0.5 or 1.0 as the number of expected doubles. The number of doubles corresponding to a 50 percent probability of success is -Ln(0.5) or 0.693.
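These Poisson relations are trivial to code; a Python sketch (function names ours):

```python
from math import exp, log

def p_at_least_one(u):
    """Equations 5.2/5.4: probability of at least one augmented double
    when the expected (mean) number of augmented doubles is u."""
    return 1.0 - exp(-u)

def expected_for_p(p):
    """Equation 5.5: expected doubles for a target probability."""
    return -log(1.0 - p)

print(round(expected_for_p(0.5), 3))  # 0.693
```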
As it turns out, Kullback [9] provides formulas which can be used directly on the birthday problem. (To conform to the previous notation, and to prevent confusion between n and N, we change Kullback's N to s (samples), and Kullback's n to N (population).) Starting on p. 29 we find: "For random text of s elements, where there are N different possible elements ... the average number of elements occurring twice each is given by"
                      s(s-1) N (1 - 1/N)^(s-2)
   (6.1)  Ed(N,s) = ---------------------------
                              2 N^2

"the average number of elements occurring r times each is given by"

                        s(s-1)..(s-r+1) N (1 - 1/N)^(s-r)
   (6.2)  Er(r,N,s) = ------------------------------------ .
                                    r! N^r

For machine evaluation, we may express these in logs:
(6.3) LnEd(N,s) = Ln(s(s-1)/2) + (s-2) Ln(1 - 1/N) - Ln(N) ,
(6.4) LnEr(r,N,s) = LnFact(s) - LnFact(s-r) - LnFact(r) + (s-r) Ln(1 - 1/N) - (r-1) Ln(N) .
Although these equations do predict expected exact repetitions, they are going to be difficult to re-arrange to predict population. It would of course be possible to use any of these equations with an indirect root-finding numerical technique (for example, bisection [18]), but because the equations typically include powers higher than the first, any particular root may not be unique.
Interestingly, we see that (6.4) has five terms, and four of those are identical with the ones in (4.7). Thus, the additional term in (6.4) is the log of the fraction which converts the larger augmented repetition value ar into the smaller exact repetition value (or vice versa):
   (6.5)  LnExactReps(r,N,s) = (s-r) Ln(1 - 1/N) + Ln(ar) ,

or without logs:
(6.6) ExactReps(r,N,s) = (1 - 1/N)^(s-r) * ar .
The expression (1 - 1/N) is always fractional, so there can never be more exact repetitions than augmented ones. This expression is close to one when N is large, and as long as s is much smaller than N, the overall factor is still close to one. Thus, under normal circumstances, the number of exact repetitions is not too different from the number of augmented repetitions. (With N=10,000, the ratio is 0.99 at s=100, and 0.96 at s=400.) On the other hand, small N and large s maximize the differences. With N=2 and s=3 (the situation illustrated in table 2.2) there are only half as many exact doubles as augmented doubles, and the ratio increases exponentially with higher s. Of course, in order to use (6.5) or (6.6), we first must know the population N, and this is often what we hope to find.
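Equation 6.6 can be sketched in Python (function name ours) to illustrate both extremes:

```python
def exact_from_augmented(r, N, s, ar):
    """Equation 6.6: convert augmented r-reps to exact r-reps."""
    return (1.0 - 1.0 / N) ** (s - r) * ar

# N=2, s=3 (table 2.2): exact doubles are half the augmented doubles.
print(exact_from_augmented(2, 2, 3, 1.5))                    # 0.75
# N=10,000, s=100: the two counts nearly agree.
print(round(exact_from_augmented(2, 10000, 100, 0.495), 4))
```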
The numerical evaluation of Ln(1 - 1/N) is trickier than it looks. For large N, (1 - 1/N) is going to be very close to 1, and Ln(1) is 0. Consequently, the evaluation of the log may well underflow to zero with a serious loss of precision. But we have seen Ln(1 - 1/N) expanded in equation 5.3 (for large N):
   (6.7)  Ln(1 - 1/N) =(approx.) -1/N ,

and, so, for large N (from 10^3 to 10^6, depending on the local numerical precision):

   (6.8)  LnExactReps(r,N,s) =(approx.) (r-s)/N + Ln(ar) ,

and the same could be done for (6.3) and (6.4).
Kullback does not explain how he came up with (6.1) and (6.2), but they turn out to be very nearly binomial equations (see, for example, Parratt [16]). The binomial cumulative distribution function is:
           n                n
   (6.9)  SUM B(k;n,p)  =  SUM (n C k) p^k q^(n-k)  =  1 .
          k=0              k=0

Each term in the sum gives the probability of finding exactly k successes in n independent tries when each success has probability p. In our terminology, consider the probability of an exact double (k=2) in s samples (n=s) from population N (p=1/N):
   (6.10)  B(2;s,1/N) = (s C 2) ((1/N)^2) (1-1/N)^(s-2)

Observe that the first factor (s C 2) is s(s-1)/2 and that a factor of N is necessary to yield the expected number of doubles, and the link to (6.1) is immediate. (Note that the Poisson and binomial distributions have two different applications here: The binomial approaches are simply alternatives to (1.3) or (2.2), while the distribution of multiple trials requires the Poisson equations to estimate the resulting probability.)
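The link between (6.10) and (6.1) can be checked numerically; a Python sketch (ours, using sample values):

```python
from math import comb

def binom_term(k, n, p):
    """One term of (6.9): probability of exactly k successes."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def ed_exact(N, s):
    """Equation 6.1: Kullback's expected exact doubles."""
    return s * (s - 1) * (1 - 1 / N) ** (s - 2) / (2 * N)

# N * B(2; s, 1/N) reproduces (6.1):
N, s = 10000, 100
print(abs(N * binom_term(2, s, 1 / N) - ed_exact(N, s)) < 1e-9)  # True
```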
Now that we see how the binomial distribution fits in, we can see that it also supports yet another approach: the probability of doubles or better. Since the sum of all terms (the probability of each repetition-level) is 1, the probability of having at least one match is 1 minus the probability of zero successes minus the probability of one success:
   (6.11)  Ebd(N,s) = N( 1 - B(0;s,1/N) - B(1;s,1/N) ) .

Alas, this is not just a new expression for (1.3), but is instead yet another different concept of "doubles." We now have four such concepts, here arranged in decreasing magnitude:
       (1.3)      >     (2.2)     >        (6.11)        >    (6.1)
   "at least one"    "augmented"    "binomial multiples"    "exact"
The classical birthday development in section 1 is for "at least one match." The new development in section 2 is also for two or more occurrences, albeit in the special and numerically enhanced form of "augmented doubles." And the development in this section predicts "binomial multiples" as well as exactly two occurrences of a value. These different viewpoints make direct result comparisons difficult, but augmented repetitions remain by far the easiest to reverse for predicting population.
Computer programs were implemented to simulate random sampling from a known population. That population was then estimated by sampling the population in trials having various numbers of random samples and using those results in the augmented repetition equations.
Ideally, each time we perform a trial we would like to be confident that the result reflects the larger reality. Unfortunately, that is not the way sampling works. Instead, the set of all possible trials form a probability distribution, so some individual trials will always yield misleading results. On the other hand, if we take many trials, we have some reason to hope that we can develop both an accurate distribution, and an accurate mean for use in population prediction.
The question of the distribution of trial values was addressed experimentally. A statistical random number generator was used to generate some number of samples in each trial, and this was repeated for a large number of trials:
Fig. 7.1 Augmented Doubles in Sqrt(N) Samples. 2000 trials of 100 samples each from population 10,000. Each bar represents a particular number of augmented doubles, and the height of each bar represents the number of experimental trials which found that many augmented doubles. The fat white line connects each number of trials predicted by a Poisson distribution having an expected value calculated as e to the power given by equation 4.7 (log of expected augmented repetitions), for augmented doubles, assuming the same overall number of trials.
Fig. 7.2 Augmented Doubles, 1.5 Sqrt(N) Samples. 1333 trials of 150 samples each from population 10,000.
The graphs in figures 7.1 through 7.6 illustrate the distribution
of the number of augmented doubles found in large numbers of trials
using six different quantities of random samples. Each graph
represents essentially the same collection effort. The
vertical bars represent the number of trials which produced the
associated number of augmented doubles.
Fig. 7.3 Augmented Doubles in 2 Sqrt(N) Samples. 1000 trials of 200 samples each from population 10,000.
The thick white line on each graph illustrates the number of trials
predicted to have the given number of doubles, as computed by a
Poisson distribution for expected doubles developed from
(4.7)
and the same total trials.
Fig. 7.4 Augmented Doubles, 2.5 Sqrt(N) Samples. 800 trials of 250 samples each from population 10,000.
The author finds these graphs compelling
evidence that large numbers of trials do find augmented doubles
in a distribution which is essentially Poisson in nature (and this
happy result nicely supports the probability development in
section 5).
Fig. 7.5 Augmented Doubles in 3 Sqrt(N) Samples.
667 trials of 300 samples each from population 10,000.
Fig. 7.6 Augmented Doubles in 4 Sqrt(N) Samples.
500 trials of 400 samples each from population 10,000.
Table 7.1 lists the conditions under which each experiment occurred.
   Table 7.1  Means of Augmented Doubles over Many Trials
              and Other Results from Distribution Experiments

      Samples   Trials    Mean    Chi-Sq   Est. N
        100      2000     0.492     -      10061
        150      1333     1.093     -      10224
        200      1000     1.962    0.336   10143
        250       800     3.111    0.261   10004
        300       667     4.493    0.573    9982
        400       500     8.034    0.033    9933

This was a single contiguous set of experiments; the samples were taken from a fast statistical RNG and a "population" of 10,000 symbols. The mean is the arithmetic average of augmented doubles over all trials. The chi-square column gives the probability of the chi-square "p" statistic (the area of the chi-square density function below the experimental value) comparing the experimental results to the Poisson distribution (but only for experiments which had more than five bins exceeding five counts). The chi-square value of 0.033 in the last row indicates that the results in the last set were unexpectedly close to the Poisson distribution (for random samples); this was unusual. The predicted population values were relatively accurate.
Whenever we randomly sample, we can expect to get various result values. While it is useful to know the expected distribution of those values, we must remember that even very improbable values can and will occur occasionally. We will need enough experimental trials to see the form the distribution is taking. But then the experimental distribution should be fairly obvious, so the theoretical distribution may be of somewhat less import than one might think.
Obtaining samples is expensive in terms of time, while storing and comparing sampled values is expensive in both storage and time. There is ample motive to use the fewest possible samples. However, those who would understand the accuracy of reversed birthday techniques should refer again to the graphs in figures 7.1 through 7.6. Every one of the repetition counts in each bar was formed as a valid result of a separate trial on the exact same population. An experiment measuring population could have produced any of these results in any single trial.
A zero result, common in trials with under 2 Sqrt(N) samples, will give no estimate at all (by itself). Results of 1 and 24, both of which occur in figure 7.6 (with 4 Sqrt(N) samples), estimate populations which differ by a factor of 24. Since we are measuring population, we will not know, beforehand, the shape of the particular distribution we are measuring, and which estimates are most likely. Consequently, the reversed birthday approach is really useful only after we have enough trials to be fairly sure of the mean number of repetitions.
When we develop a mean value across many trials, we expect that the more trials we have, the more accurate the result will be. But how much more accurate? (See, for example, Boas [3: 732-735].) Without assuming a particular underlying distribution, we can use the central limit theorem as grounds for assuming that errors in the mean will be distributed normally (over many trials). Using the sample standard deviation sd, and trials t, the standard deviation of the mean sd(m) is expected to be:
                      sd
   (8.1)   sd(m) = -------- .
                      ____
                    \/ t

(Four times as many trials should give us half the standard deviation and thus twice the accuracy in the sample mean.) For a 50 percent confidence interval (probability 0.5 that the true mean lies within the interval), we have limits u - r and u + r, for mean u, and r the probable error. Since we assume a normal distribution (large t) and a 50 percent confidence interval, we have the probable error r = 0.6745 sd(m).
If we assume the underlying distribution is Poisson, we might think to avoid computing the sample variance sd^2 since that should be equal to the mean u. However, in the experimental data, while the variance and mean do track fairly well, it is not at all unusual to see them differ by ten percent or more. Thus, we probably should compute the variance explicitly from the data. But for a quick-and-dirty answer given only mean u and trials t, the probable error r should be:
                         ___
                0.6745 \/ u
   (8.2)   r = ------------- .
                     ____
                   \/ t
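A minimal sketch (in Python; not from the original article) of both routes to the probable error: the explicit route through the sample variance, as the text recommends, and the quick-and-dirty Poisson shortcut of equation 8.2.

```python
from math import sqrt

def probable_error(values):
    """Mean and 50 percent confidence half-width for a set of per-trial
    counts, using the sample standard deviation (eqs. 8.1 and r = 0.6745 sd(m))."""
    t = len(values)
    u = sum(values) / t
    var = sum((x - u) ** 2 for x in values) / (t - 1)  # sample variance
    sd_m = sqrt(var) / sqrt(t)      # standard deviation of the mean (8.1)
    return u, 0.6745 * sd_m         # probable error r

def quick_probable_error(u, t):
    """Quick-and-dirty probable error assuming Poisson (variance = mean),
    i.e. equation (8.2)."""
    return 0.6745 * sqrt(u) / sqrt(t)
```

For example, a mean of 4.0 augmented doubles over 100 trials gives r = 0.6745 * 2 / 10, roughly 0.135, so the true mean probably lies within about 3.87 to 4.13.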
The normal distribution demands large t and produces a wider confidence interval factor (0.6745) than we really need (especially for our expected small values of the mean u [16: 206]). Nevertheless, using the normal distribution for the confidence interval appears to be standard practice. Excessive concern about accuracy in these statistical predictions may be misplaced anyway, since each experiment will be different.
During statistical tests of the reversed birthday approach, it became apparent that these methods could reveal problems which are virtually invisible to other statistical tests. There are many statistical tests of distribution (see, for example, Knuth [8] and Marsaglia [12]), but most of these can be fooled by simultaneous inverse relationships between population and probability.
Fig. 9.1 Even Distribution without Population Deviation. Each possible symbol has exactly the same probability. The ideal distribution for a discrete random number generator (RNG).
Consider figure 9.1, which is intended to represent the even distribution we expect from a good RNG. The separate spikes represent the discrete individual values (or codes or symbols) that the RNG can produce (any digital result can give us only a finite number of representations, no matter how the values are produced). The height of each spike represents the probability of obtaining that particular value. Figure 9.1 represents the ideal in which any possible value can be obtained with equal probability.
Fig. 9.2 Even Distribution with Population Deviation. A distribution in which half of the range has only half of the expected values, yet each of these values is twice as probable as it should be. Any common test which groups even a couple of adjacent symbols is unlikely to find much wrong with this distribution.
Now consider figure 9.2; strangely, this is also an even distribution. While half of the range contains only half the symbols of the other half (and one-quarter of the symbols are thus unused), each symbol in the reduced side is twice as probable. Consequently, this distribution has essentially the same range, mean, deviation, and comparison to flat distribution as 9.1, yet only three-quarters the uniqueness. (The means will be very close when there are a large number of symbols; a chi-square comparison to a flat distribution will indicate similarity if, as usual, the comparison bins cover multiple symbols.)
Table 9.1 illustrates all possible trials for two cases of N=4 and s=2: First, the expected ideal distribution with four symbols, and then a deviant distribution in which symbol 1 never occurs, and symbol 0 occurs twice as often to compensate for the loss.
Table 9.1  All Possible Trials for N=4, s=2, for Both Ideal and Deviant Populations

     Ideal Population        Deviant Population
   Trial   Aug. Doubles    Trial   Aug. Doubles
    0 0         1           0 0         1
    0 1         0           0 0         1
    0 2         0           0 2         0
    0 3         0           0 3         0
    1 0         0           0 0         1
    1 1         1           0 0         1
    1 2         0           0 2         0
    1 3         0           0 3         0
    2 0         0           2 0         0
    2 1         0           2 0         0
    2 2         1           2 2         1
    2 3         0           2 3         0
    3 0         0           3 0         0
    3 1         0           3 0         0
    3 2         0           3 2         0
    3 3         1           3 3         1
               ---                     ---
                4                       6

Note that reversed birthday methods do detect something strange about the population. (With N=4, the half-symbol difference in means between the two cases is obvious without birthday computations, but for large N, a huge number of samples would be necessary to confirm any such difference.) Table 9.1 shows that repetition counts are sensitive to symbols of higher than expected probability, even if those symbols are nominally compensated by low probability symbols. When low-probability symbols and high-probability symbols are in close proximity, binned tests are going to have a difficult time detecting a deviant population.
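The enumeration in Table 9.1 is small enough to reproduce by machine. The sketch below (Python; not from the original article) walks every equally-likely trial of s=2 draws and totals the augmented doubles, for both the ideal N=4 population and the deviant one in which symbol 1 never occurs and symbol 0 is twice as probable.

```python
from collections import Counter
from itertools import product
from math import comb

def augmented_doubles(trial):
    return sum(comb(k, 2) for k in Counter(trial).values())

def total_aug_doubles(weights, s=2):
    """Enumerate every equally-likely trial of s draws; 'weights' gives
    how many of the equally-likely outcomes map to each symbol."""
    outcomes = [sym for sym, w in enumerate(weights) for _ in range(w)]
    return sum(augmented_doubles(t) for t in product(outcomes, repeat=s))

print(total_aug_doubles([1, 1, 1, 1]))  # ideal:   4 doubles in 16 trials
print(total_aug_doubles([2, 0, 1, 1]))  # deviant: 6 doubles in 16 trials
```

The deviant population yields half again as many doubles, which is exactly the signal the binned tests miss.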
Fig. 9.3 Even Distribution with Smaller Population Deviation.
Figure 9.3 continues the development of a distribution which is nominally flat statistically, but which has local population deviations. Although a large number of high-probability values would clearly dominate repetition results, a small number of high-probability values could be hard to find even using repetition methods.
Actual probability variations could be continuous and arbitrary. One could easily imagine that some "physically random" noise-based designs might well have an inherent inverse relationship between local population and probability. A cryptanalyst who knew of this relationship would naturally try the most-probable values first.
At this point, we can use the development in this paper and compare it to the Keprom results mentioned in the introduction. Table 10.1 illustrates the use of equation 4.8 to estimate population based on augmented repetitions calculated from the Keprom results [11].
Table 10.1  Estimating Keprom Population from Augmented Repetitions

   s: 900000   r: 18   Ear:     1   Aug Pop: 17.9 bits
   s: 900000   r: 17   Ear:    18   Aug Pop: 17.7 bits
   s: 900000   r: 16   Ear:   153   Aug Pop: 17.7 bits
   s: 900000   r: 15   Ear:   816   Aug Pop: 17.6 bits
   s: 900000   r: 14   Ear:  3060   Aug Pop: 17.6 bits
   s: 900000   r: 13   Ear:  8568   Aug Pop: 17.6 bits
   s: 900000   r: 12   Ear: 18564   Aug Pop: 17.7 bits
   s: 900000   r: 11   Ear: 31824   Aug Pop: 17.7 bits
   s: 900000   r: 10   Ear: 43758   Aug Pop: 17.8 bits
   s: 900000   r:  9   Ear: 48621   Aug Pop: 18.0 bits
   s: 900000   r:  8   Ear: 43768   Aug Pop: 18.2 bits
   s: 900000   r:  7   Ear: 31868   Aug Pop: 18.5 bits
   s: 900000   r:  6   Ear: 18676   Aug Pop: 19.0 bits
   s: 900000   r:  5   Ear:  8751   Aug Pop: 19.7 bits
   s: 900000   r:  4   Ear:  3262   Aug Pop: 21.0 bits
   s: 900000   r:  3   Ear:   972   Aug Pop: 23.4 bits
   s: 900000   r:  2   Ear:   362   Aug Pop: 30.1 bits

Of course, these results should be considered a single huge trial, which means that the sampling distribution has not been resolved. Thus, the reported results may not accurately represent the true experimental means.
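Equation 4.8 itself appears earlier in the article; as a sketch (Python, not from the original), the computation below assumes the form E[ar] = C(s,r) / N^(r-1), reversed for N, which reproduces the entries of Table 10.1. The binomial coefficient is handled in log space so that huge values of C(900000, r) cause no overflow.

```python
from math import lgamma, log, log2

def log2_comb(n, r):
    """log2 of C(n,r) via lgamma, safe for astronomically large binomials."""
    return (lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)) / log(2)

def population_bits(s, r, ear):
    """Reverse the assumed E[ar] = C(s,r) / N^(r-1) to estimate log2(N)."""
    return (log2_comb(s, r) - log2(ear)) / (r - 1)

# Last row of Table 10.1: s=900000, r=2, 362 augmented doubles.
print(round(population_bits(900000, 2, 362), 1))  # about 30.1 bits
```

Running the same function down the table recovers each "Aug Pop" figure, including the 17.9-bit estimate implied by the single 18-rep.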
Table 10.2 illustrates the indirect use of equation 6.4 in a numerical root-finding procedure to estimate the population based on exact repetitions.
Table 10.2  Estimating Keprom Population from Smallest Root of Exact Repetition Equations

   s: 900000   r: 18   Er:   1   Xct Pop: 17.4 bits
   s: 900000   r:  9   Er:   1   Xct Pop: 19.8 bits
   s: 900000   r:  8   Er:   1   Xct Pop: 20.3 bits
   s: 900000   r:  5   Er:   1   Xct Pop: 23.0 bits
   s: 900000   r:  4   Er:   1   Xct Pop: 24.8 bits
   s: 900000   r:  3   Er:   2   Xct Pop: 27.9 bits
   s: 900000   r:  2   Er: 123   Xct Pop: 31.6 bits

There is some agreement with table 10.1, as one would expect with a large population. However, the numerical procedures can find more than one root, and selecting a particular root makes a big difference. Here the root-finding procedure was intended to find the smallest root (although, without specific analysis of each case, we cannot be sure it did). This sort of indirect root-search is exceedingly slow in comparison to direct expression evaluation.
Note that both tables predict a much-reduced effective population based on the existence of an 18-rep. If the reported data are actually representative of the device, it would appear that we may be warranted in entertaining serious doubts about the effectiveness of the design. Note that higher-order repetitions can provide the clearest indication of local distribution irregularities.
The ability to estimate population from mean augmented repetitions rests on two questions:
Does the new development really predict augmented repetitions, and
Are augmented repetitions approximately Poisson-distributed over multiple similar trials?
These questions have been investigated experimentally (to some extent), and it appears that both answers are "yes." If so, multiple trials can be used to estimate the mean augmented repetitions (in trials with a fixed number of samples), and that value used in the new reversed equations to estimate the population. (Note that birthday techniques do not address common software linear-congruential or shift-register RNG sequences in which any particular result occurs only once before the system repeats. Birthday techniques require the possibility that multiple samples may produce the same value.)
Reversed birthday techniques are useful in that they allow estimation of population based only on random samples from that population. Useful estimation is possible (over tens or hundreds of trials) with as few as Sqrt(N) samples per trial; with "only" Sqrt(N) samples (per trial), we can begin to estimate N. The difficulty is that we must in some way take, save and compare Sqrt(N) distinct samples, and this can be a lot of samples. Nevertheless, population analysis of 32-bit physically-random RNG's should be well within the range of a modern personal computer.
Triples and higher repetitions do not help us much with the sampling size problem; in fact, they make it worse (see table 11.1).
Table 11.1  Expected Augmented Doubles and Triples for Various Populations

                  2^16 Samples         2^18 Samples
   Population   Doubles   Triples   Doubles   Triples
      2^16        32767     10922    524284    699042
      2^20         2048        43     32768      2731
      2^24          128         0      2048        11
      2^28            8         0       128         0
      2^32          0.5         0         8         0

For repetitions to be useful in population estimation, they must occur in numbers sufficient to support a reliable mean. For any given population, it is far easier to collect data on doubles than triples. But the occasional triple or higher repetition will make a fine contribution to the total of augmented doubles.
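The entries of Table 11.1 can be approximated with a one-line expectation (a Python sketch, not from the original article): each of the C(s,r) possible r-subsets of the samples matches with probability 1/N^(r-1).

```python
from math import comb

def expected_aug_reps(N, s, r):
    """Expected augmented r-reps in s samples from a population of N:
    each of the C(s,r) r-subsets matches with probability 1/N^(r-1)."""
    return comb(s, r) / N ** (r - 1)

# Second row of Table 11.1: population 2^20, 2^16 samples.
print(round(expected_aug_reps(2**20, 2**16, 2)))  # 2048 doubles
print(round(expected_aug_reps(2**20, 2**16, 3)))  # 43 triples
```

The steep falloff of the triples column with increasing N is visible immediately: triples demand roughly N^(2/3) samples, against only Sqrt(N) for doubles.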
On the other hand, higher-order repetitions can be sensitive indicators of deviant populations. If we limit all trials to a size which only produces a few doubles, we could miss indications that our doubles come from a subset of the overall population. While this may not matter if we are interested only in confirming a minimum population, it could be a mistake if we wish to identify problems so they can be fixed.
In practice, it is important first to survey the situation with a few large trials and follow up on any suspect indications. Assuming the initial survey goes well, we should have enough data to fix the size of each trial for record; we might like to see three or more augmented doubles per trial, which implies at least 2.5 Sqrt(N) samples. We would also like to perform as many trials as possible (tens, hopefully; hundreds, if possible), find the augmented doubles in each, and examine the experimental distribution. Then we can come up with a reasonable estimate of the mean, and plug that value into (2.4) to estimate the population. We can do a lot of this automatically, of course, but no simple computation is likely to be as effective as an experimenter who is actively involved with the data.
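The full procedure described above can be sketched end to end (Python; not from the original article, and the trial sizes are simply borrowed from Table 7.1 for illustration): run many trials, count augmented doubles in each, take the mean, attach a probable error, and reverse to a population estimate.

```python
import random
from collections import Counter
from math import comb, sqrt

def run_trials(N, s, t, seed=1):
    """t trials, each drawing s uniform samples from a population of N;
    returns the augmented-doubles count of every trial."""
    rng = random.Random(seed)
    ads = []
    for _ in range(t):
        counts = Counter(rng.randrange(N) for _ in range(s))
        ads.append(sum(comb(k, 2) for k in counts.values()))
    return ads

# One row of Table 7.1: N=10000, 250 samples per trial, 800 trials.
N, s, t = 10000, 250, 800
ads = run_trials(N, s, t)
mean = sum(ads) / t                  # should land near s(s-1)/2N = 3.11
est = s * (s - 1) / (2 * mean)       # reversed birthday estimate of N
r = 0.6745 * sqrt(mean) / sqrt(t)    # probable error of the mean (8.2)
print(round(mean, 3), round(est), round(r, 3))
```

In a real certification run the sampled values would come from the generator under test rather than a software RNG, and the experimenter would also examine the per-trial distribution before trusting the mean.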
An easy and intuitive development has produced simple, exact and easily-reversed equations for estimating population from the number of repeated values found in multiple trials. For a reasonable hope (probability 0.63 or more) of finding at least one augmented double, each trial needs a number of random samples at least equal to the square root of the actual population. Because the numbers of repetitions in separate trials occur in a broad distribution, multiple trials are necessary to calculate a mean augmented repetition value which can be easily used to estimate the parent population.
Earlier equations predict the number of exact repetitions expected from a known population, but those equations are difficult to reverse to predict population. Even if we find numerical solutions to those equations (using inherently slow numerical root-finding techniques), the resulting solutions generally are not unique. The equations from the new development eliminate these problems, at the expense of relatively-simple augmentation computations.
The raw estimation of population has obvious applications in the design of physically random number generators. The ability to identify actual element values which appear in higher than expected quantities is an obvious benefit of repetition methods. But the fact that unexpectedly-probable elements can be detected despite being intermixed with unexpectedly-improbable elements seems to provide new analysis technology. Since population variations would weaken any really random generator, testing for these defects could be an important tool in the future development of physically random generators.
Another potential application is the analysis of cipher designs. Most cipher systems function by hiding a message "needle" in a "haystack" of other decipherings. Ordinarily we assume that the number of different cipherings is equal to the number of possible keys, but in a defective cipher this may not be the case. If many keys produce the same result, the cipher may be unexpectedly weak. Reversed birthday methods may provide an opportunity to test for such a problem.
Reversed birthday techniques presumably could be useful also in more general statistical applications.
Thanks to Nico de Vries (100115.2303@compuserve.com) for proposing (on Usenet sci.crypt) a random number generator computer program which he promoted as physically random. Normally, any computer RNG is a deterministic mechanism [20], and simple examination will easily disclose the "amount of internal data" or "state" which it uses. Nico's design was unusual in that it was based on real-time program interactions with IBM PC-type hardware, so it was difficult to know how much internal state there was, or where that state was located. Such a design begged the question of measuring RNG population when only the values themselves were available.
Thanks to Ross Anderson (Ross.Anderson@cl.cam.ac.uk) for suggesting (also on Usenet sci.crypt) that a birthday test could be used to estimate population, and apologies to him also, for my obsession with this idea. Ross sent me email preprints of his birthday test papers [1,2], but then I developed the concept of augmented repetitions, which produced simple equations and exact unique results. All errors in this development are, of course, mine alone.
REFERENCES

1. Anderson, R., R. Gibbens, C. Jagger, F. Kelly and M. Roe.
Measuring the Diversity of Random Number Generators. In
progress.
2. Anderson, R. Testing Random Number Generators. Submitted
for publication in Electronics Letters.
3. Boas, M. 1983. Mathematical Methods in the Physical
Sciences. John Wiley & Sons.
4. Davies, D. and W. Price. 1984. Security for Computer
Networks. John Wiley & Sons.
5. Goodman, A. and J. Ratti. 1971. Finite Mathematics with
Applications. Macmillan.
6. Kaliski, B., R. Rivest and A. Sherman. 1985. Is the Data
Encryption Standard a Group? (Preliminary Abstract)
Advances in Cryptology -- Eurocrypt '85. 81-95.
7. Knuth, D. 1973. The Art of Computer Programming.
Vol. I, Fundamental Algorithms. Addison-Wesley.
8. Knuth, D. 1981. The Art of Computer Programming.
Vol. II, Seminumerical Algorithms. Addison-Wesley.
9. Kullback, S. 1976 (first published 1938). Statistical
Methods in Cryptanalysis. Aegean Park Press, P.O. Box 2837,
Laguna Hills, CA 92654-0837.
10. Holtzman, J. 1987. The KEPROM: Sinking the Software Pirates.
Radio-Electronics. June. 100-104.
11. Letham, L., D. Hoff and A. Folmsbee. 1986. A 128K EPROM Using
Encryption of Pseudorandom Numbers to Enable Read Access.
IEEE Journal of Solid-State Circuits. SC-21(5): 881-888.
12. Marsaglia, G. 1984. A Current View of Random Number
Generators. Proceedings of the Sixteenth Symposium on the
Interface Between Computer Science and Statistics. 3-10.
13. Merkle, R. 1982. Secrecy, Authentication, and Public Key
Systems. UMI Research Press, University Microfilms
International, Ann Arbor, Michigan 48106.
14. Meyer, C. and S. Matyas. 1982. Cryptography: A New
Dimension in Data Security. John Wiley & Sons.
15. Morita, H., K. Ohta and S. Miyaguchi. 1991. A switching
closure test to analyze cryptosystems. Advances in
Cryptology -- CRYPTO '91. 183-193.
16. Parratt, L. 1961. Probability and Experimental Errors in
Science. Dover.
17. Patterson, W. 1987. Mathematical Cryptology. Rowman &
Littlefield, 81 Adams Drive, Totowa, New Jersey 07512.
18. Press, W. et al. 1986. Numerical Recipes.
Cambridge University Press.
19. Quisquater, J. and J. Delescaille. 1989. How easy is
collision search? Applications to DES. Advances in
Cryptology -- Eurocrypt '89. 429-434.
20. Ritter, T. 1991. The Efficient Generation of Cryptographic
Confusion Sequences. Cryptologia. 15(2): 81-139.
21. Seberry, J. and J. Pieprzyk. 1989. Cryptography: An
Introduction to Computer Security. Prentice Hall.
22. Zollo, S. 1985. IC Keeps Pirates Out. Electronics
Week. February 11: 73, 76.
APPENDIX A: The sci.crypt Newsgroup

Sci.crypt is the name of a news group on Usenet, a
bulletin-board network of computers which use dial-up lines for
network communication. Usenet was originally developed using the
UUCP (Unix-to-Unix Copy) package distributed with Unix, plus free
news software which collects messages in many hundreds of different
topics or "groups." On Usenet, messages are passed from computer
to computer (until the messages hopefully reach every connected
computer), so there is no central facility and no single overall
ownership. Electronic mail is supported in the same way.
Access to Usenet News is generally also available on the
Internet, an exponentially growing interconnection of networks.
Although once confined to higher education and research
laboratories, the Internet is increasingly available to ordinary
citizens who pay a fee for an account on a computer which is
Internet connected.
Those unacquainted with the Internet and News could be
disappointed in what they find. Huge numbers of individuals have
access to these groups, and many of the groups (like sci.crypt) are
unmoderated. Anyone can respond to any message, and much of what
is authoritatively expounded is also wrong, or perhaps just
incomplete. New rank novices appear almost every day. Not only
is it hard to know what to believe, it can be hard simply to find
time to read through the chaff (currently tens of messages per day
in sci.crypt) to get to the occasional kernel of new information.
In addition, the bulletin-board format seems to encourage
vituperative responses or "flames," which naturally produce similar
counter-responses. These "flame wars" can be disconcerting to
those used to respectful intellectual discourse.
At one time sci.crypt supported mostly technical discussions,
and these occasionally still appear. But with continuing exponential
growth, the advent of an available public key package (Pretty Good
Privacy or PGP) and the U.S. government's "Clipper Chip" proposal,
most of the traffic in the past year has addressed political issues.
APPENDIX B: The Intel 27916 Keprom

The 27916 Keprom is an integrated circuit memory device designed
at Intel Corporation in the early 80's
[22].
The Keprom is a "Keyed-Access EPROM" (Erasable Programmable Read-Only
Memory) which performs an authentication handshake, during which
physically-random values are encrypted under a programmed key
[10].
In this way, the data in the EPROM are unlocked only if the
associated equipment contains another Keprom with the correct key.
Apparently the device was intended to be a technical solution to
the problem of software piracy, perhaps especially in the games
market. Unfortunately, the device appeared at about the same
time that major software manufacturers were dropping copy
protection.
The issue of interest here is that the Keprom device included an
on-chip physically-random number generator, and a technical paper
[11]
reported results from evaluation tests on that generator. Briefly,
the design consists of a 32-bit shift-register which accumulates
bits sampled from a drifting on-chip oscillator, noise, and feedback
from the register itself. Because 32-bit values are reported from
the generator, we expect to estimate a 32-bit population. However, the
higher-order repetitions in the reported data imply a value very
much less than this, which may indicate problems in the
physically-random design.
BIOGRAPHICAL SKETCH

Terry Ritter is a registered Professional Engineer who has spent
the last five years conducting independent research and development
on cryptographic systems. Mr. Ritter is a member of the IEEE and
ACM, and holds the US patent on Dynamic Substitution technology; he
has previously contributed articles on reversible nonlinear
cryptographic combiners and a survey of cryptographic random number
generators. Ritter Software Engineering offers fast cipher engines
to hardware and software developers.
LARGE TABLES
Table 4.1 All Possible Trials for N=2 and s=6
Trial Repetitions Aug. Reps. Accum. Aug. Reps.
0 0 0 0 0 0 1 0 0 0 0 1 6 15 20 15 1 6 15 20 15
0 0 0 0 0 1 0 1 0 0 0 0 1 5 10 10 1 7 20 30 25
0 0 0 0 1 0 0 1 0 0 0 0 1 5 10 10 1 8 25 40 35
0 0 0 0 1 1 0 0 1 0 1 0 0 1 4 7 1 8 26 44 42
0 0 0 1 0 0 0 1 0 0 0 0 1 5 10 10 1 9 31 54 52
0 0 0 1 0 1 0 0 1 0 1 0 0 1 4 7 1 9 32 58 59
0 0 0 1 1 0 0 0 1 0 1 0 0 1 4 7 1 9 33 62 66
0 0 0 1 1 1 0 0 0 2 0 0 0 0 2 6 1 9 33 64 72
0 0 1 0 0 0 0 1 0 0 0 0 1 5 10 10 1 10 38 74 82
0 0 1 0 0 1 0 0 1 0 1 0 0 1 4 7 1 10 39 78 89
0 0 1 0 1 0 0 0 1 0 1 0 0 1 4 7 1 10 40 82 96
0 0 1 0 1 1 0 0 0 2 0 0 0 0 2 6 1 10 40 84 102
0 0 1 1 0 0 0 0 1 0 1 0 0 1 4 7 1 10 41 88 109
0 0 1 1 0 1 0 0 0 2 0 0 0 0 2 6 1 10 41 90 115
0 0 1 1 1 0 0 0 0 2 0 0 0 0 2 6 1 10 41 92 121
0 0 1 1 1 1 0 0 1 0 1 0 0 1 4 7 1 10 42 96 128
0 1 0 0 0 0 0 1 0 0 0 0 1 5 10 10 1 11 47 106 138
0 1 0 0 0 1 0 0 1 0 1 0 0 1 4 7 1 11 48 110 145
0 1 0 0 1 0 0 0 1 0 1 0 0 1 4 7 1 11 49 114 152
0 1 0 0 1 1 0 0 0 2 0 0 0 0 2 6 1 11 49 116 158
0 1 0 1 0 0 0 0 1 0 1 0 0 1 4 7 1 11 50 120 165
0 1 0 1 0 1 0 0 0 2 0 0 0 0 2 6 1 11 50 122 171
0 1 0 1 1 0 0 0 0 2 0 0 0 0 2 6 1 11 50 124 177
0 1 0 1 1 1 0 0 1 0 1 0 0 1 4 7 1 11 51 128 184
0 1 1 0 0 0 0 0 1 0 1 0 0 1 4 7 1 11 52 132 191
0 1 1 0 0 1 0 0 0 2 0 0 0 0 2 6 1 11 52 134 197
0 1 1 0 1 0 0 0 0 2 0 0 0 0 2 6 1 11 52 136 203
0 1 1 0 1 1 0 0 1 0 1 0 0 1 4 7 1 11 53 140 210
0 1 1 1 0 0 0 0 0 2 0 0 0 0 2 6 1 11 53 142 216
0 1 1 1 0 1 0 0 1 0 1 0 0 1 4 7 1 11 54 146 223
0 1 1 1 1 0 0 0 1 0 1 0 0 1 4 7 1 11 55 150 230
0 1 1 1 1 1 0 1 0 0 0 0 1 5 10 10 1 12 60 160 240
1 0 0 0 0 0 0 1 0 0 0 0 1 5 10 10 1 13 65 170 250
1 0 0 0 0 1 0 0 1 0 1 0 0 1 4 7 1 13 66 174 257
1 0 0 0 1 0 0 0 1 0 1 0 0 1 4 7 1 13 67 178 264
1 0 0 0 1 1 0 0 0 2 0 0 0 0 2 6 1 13 67 180 270
1 0 0 1 0 0 0 0 1 0 1 0 0 1 4 7 1 13 68 184 277
1 0 0 1 0 1 0 0 0 2 0 0 0 0 2 6 1 13 68 186 283
1 0 0 1 1 0 0 0 0 2 0 0 0 0 2 6 1 13 68 188 289
1 0 0 1 1 1 0 0 1 0 1 0 0 1 4 7 1 13 69 192 296
1 0 1 0 0 0 0 0 1 0 1 0 0 1 4 7 1 13 70 196 303
1 0 1 0 0 1 0 0 0 2 0 0 0 0 2 6 1 13 70 198 309
1 0 1 0 1 0 0 0 0 2 0 0 0 0 2 6 1 13 70 200 315
1 0 1 0 1 1 0 0 1 0 1 0 0 1 4 7 1 13 71 204 322
1 0 1 1 0 0 0 0 0 2 0 0 0 0 2 6 1 13 71 206 328
1 0 1 1 0 1 0 0 1 0 1 0 0 1 4 7 1 13 72 210 335
1 0 1 1 1 0 0 0 1 0 1 0 0 1 4 7 1 13 73 214 342
1 0 1 1 1 1 0 1 0 0 0 0 1 5 10 10 1 14 78 224 352
1 1 0 0 0 0 0 0 1 0 1 0 0 1 4 7 1 14 79 228 359
1 1 0 0 0 1 0 0 0 2 0 0 0 0 2 6 1 14 79 230 365
1 1 0 0 1 0 0 0 0 2 0 0 0 0 2 6 1 14 79 232 371
1 1 0 0 1 1 0 0 1 0 1 0 0 1 4 7 1 14 80 236 378
1 1 0 1 0 0 0 0 0 2 0 0 0 0 2 6 1 14 80 238 384
1 1 0 1 0 1 0 0 1 0 1 0 0 1 4 7 1 14 81 242 391
1 1 0 1 1 0 0 0 1 0 1 0 0 1 4 7 1 14 82 246 398
1 1 0 1 1 1 0 1 0 0 0 0 1 5 10 10 1 15 87 256 408
1 1 1 0 0 0 0 0 0 2 0 0 0 0 2 6 1 15 87 258 414
1 1 1 0 0 1 0 0 1 0 1 0 0 1 4 7 1 15 88 262 421
1 1 1 0 1 0 0 0 1 0 1 0 0 1 4 7 1 15 89 266 428
1 1 1 0 1 1 0 1 0 0 0 0 1 5 10 10 1 16 94 276 438
1 1 1 1 0 0 0 0 1 0 1 0 0 1 4 7 1 16 95 280 445
1 1 1 1 0 1 0 1 0 0 0 0 1 5 10 10 1 17 100 290 455
1 1 1 1 1 0 0 1 0 0 0 0 1 5 10 10 1 18 105 300 465
1 1 1 1 1 1 1 0 0 0 0 1 6 15 20 15 2 24 120 320 480
For all 64 possible trials of 6 samples from a population of 2 symbols:
XctTot: 2 12 30 40 30
XctEr: 2 12 30 40 30
AugTot: 2 24 120 320 480
AugEar: 2 24 120 320 480
Table 4.2 All Possible Trials for N=3 and s=4
Trial Reps Aug. Reps Accum. Aug. Reps
0 0 0 0 1 0 0 1 4 6 1 4 6
0 0 0 1 0 1 0 0 1 3 1 5 9
0 0 0 2 0 1 0 0 1 3 1 6 12
0 0 1 0 0 1 0 0 1 3 1 7 15
0 0 1 1 0 0 2 0 0 2 1 7 17
0 0 1 2 0 0 1 0 0 1 1 7 18
0 0 2 0 0 1 0 0 1 3 1 8 21
0 0 2 1 0 0 1 0 0 1 1 8 22
0 0 2 2 0 0 2 0 0 2 1 8 24
0 1 0 0 0 1 0 0 1 3 1 9 27
0 1 0 1 0 0 2 0 0 2 1 9 29
0 1 0 2 0 0 1 0 0 1 1 9 30
0 1 1 0 0 0 2 0 0 2 1 9 32
0 1 1 1 0 1 0 0 1 3 1 10 35
0 1 1 2 0 0 1 0 0 1 1 10 36
0 1 2 0 0 0 1 0 0 1 1 10 37
0 1 2 1 0 0 1 0 0 1 1 10 38
0 1 2 2 0 0 1 0 0 1 1 10 39
0 2 0 0 0 1 0 0 1 3 1 11 42
0 2 0 1 0 0 1 0 0 1 1 11 43
0 2 0 2 0 0 2 0 0 2 1 11 45
0 2 1 0 0 0 1 0 0 1 1 11 46
0 2 1 1 0 0 1 0 0 1 1 11 47
0 2 1 2 0 0 1 0 0 1 1 11 48
0 2 2 0 0 0 2 0 0 2 1 11 50
0 2 2 1 0 0 1 0 0 1 1 11 51
0 2 2 2 0 1 0 0 1 3 1 12 54
1 0 0 0 0 1 0 0 1 3 1 13 57
1 0 0 1 0 0 2 0 0 2 1 13 59
1 0 0 2 0 0 1 0 0 1 1 13 60
1 0 1 0 0 0 2 0 0 2 1 13 62
1 0 1 1 0 1 0 0 1 3 1 14 65
1 0 1 2 0 0 1 0 0 1 1 14 66
1 0 2 0 0 0 1 0 0 1 1 14 67
1 0 2 1 0 0 1 0 0 1 1 14 68
1 0 2 2 0 0 1 0 0 1 1 14 69
1 1 0 0 0 0 2 0 0 2 1 14 71
1 1 0 1 0 1 0 0 1 3 1 15 74
1 1 0 2 0 0 1 0 0 1 1 15 75
1 1 1 0 0 1 0 0 1 3 1 16 78
1 1 1 1 1 0 0 1 4 6 2 20 84
1 1 1 2 0 1 0 0 1 3 2 21 87
1 1 2 0 0 0 1 0 0 1 2 21 88
1 1 2 1 0 1 0 0 1 3 2 22 91
1 1 2 2 0 0 2 0 0 2 2 22 93
1 2 0 0 0 0 1 0 0 1 2 22 94
1 2 0 1 0 0 1 0 0 1 2 22 95
1 2 0 2 0 0 1 0 0 1 2 22 96
1 2 1 0 0 0 1 0 0 1 2 22 97
1 2 1 1 0 1 0 0 1 3 2 23 100
1 2 1 2 0 0 2 0 0 2 2 23 102
1 2 2 0 0 0 1 0 0 1 2 23 103
1 2 2 1 0 0 2 0 0 2 2 23 105
1 2 2 2 0 1 0 0 1 3 2 24 108
2 0 0 0 0 1 0 0 1 3 2 25 111
2 0 0 1 0 0 1 0 0 1 2 25 112
2 0 0 2 0 0 2 0 0 2 2 25 114
2 0 1 0 0 0 1 0 0 1 2 25 115
2 0 1 1 0 0 1 0 0 1 2 25 116
2 0 1 2 0 0 1 0 0 1 2 25 117
2 0 2 0 0 0 2 0 0 2 2 25 119
2 0 2 1 0 0 1 0 0 1 2 25 120
2 0 2 2 0 1 0 0 1 3 2 26 123
2 1 0 0 0 0 1 0 0 1 2 26 124
2 1 0 1 0 0 1 0 0 1 2 26 125
2 1 0 2 0 0 1 0 0 1 2 26 126
2 1 1 0 0 0 1 0 0 1 2 26 127
2 1 1 1 0 1 0 0 1 3 2 27 130
2 1 1 2 0 0 2 0 0 2 2 27 132
2 1 2 0 0 0 1 0 0 1 2 27 133
2 1 2 1 0 0 2 0 0 2 2 27 135
2 1 2 2 0 1 0 0 1 3 2 28 138
2 2 0 0 0 0 2 0 0 2 2 28 140
2 2 0 1 0 0 1 0 0 1 2 28 141
2 2 0 2 0 1 0 0 1 3 2 29 144
2 2 1 0 0 0 1 0 0 1 2 29 145
2 2 1 1 0 0 2 0 0 2 2 29 147
2 2 1 2 0 1 0 0 1 3 2 30 150
2 2 2 0 0 1 0 0 1 3 2 31 153
2 2 2 1 0 1 0 0 1 3 2 32 156
2 2 2 2 1 0 0 1 4 6 3 36 162
For all 81 possible trials of 4 samples from a population of 3 symbols:
XctTot: 3 24 72
XctEr: 3 24 72
AugTot: 3 36 162
AugEar: 3 36 162
Table 4.3 Summary of All Possible Trials for N=10 s=3 and N=100 s=2
For all 1000 possible trials of 3 samples from a population of 10 symbols:
XctTot: 10 270
XctEr: 10 270
AugTot: 10 300
AugEar: 10 300
For all 10000 possible trials of 2 samples from a population of 100 symbols:
XctTot: 100
XctEr: 100
AugTot: 100
AugEar: 100
Last updated: 2000-09-06