The Redundancy of English

Terry Ritter

A Ciphers By Ritter Page

What is the redundancy of English text, and what does the "index of coincidence" have to do with it?
2000-10-30 JPeschel: "The index of coincidence is a statistical measure of the redundancy of text.
2000-10-31 John Savard: "The index of coincidence itself is the probability that two letters, from a source stream, will be the same letter. This increases as redundancy increases, but it is not itself the redundancy."
2000-11- 1 Bill Unruh: "There are many redundancies."
2000-11- 3 Douglas A. Gwyn: "For a correct answer see Shannon's paper."
2000-11- 3 JPeschel: "I think this is the applicable paragraph by Shannon."
2000-11- 3 Matt Mahoney: "In 1950 Shannon estimated the entropy of written English to be 0.6 to 1.3 bits per character . . . ."


Subject: Re: Calculating the redudancy of english?
Date: 30 Oct 2000 23:57:35 GMT
From: jpeschel@aol.commune.org (JPeschel)
Message-ID: <20001030185735.02982.00000131@ng-cs1.aol.com>
References: <8tktfv$cem$1@nnrp1.deja.com>
Newsgroups: sci.crypt
Lines: 35

Simon Johnson pabalo@dimension.h3o.org writes:

>What is the index of coincidence, Applied crypto doesn't seem to give
>enough info for me to estimate the redudanct of english.

The index of coincidence is a statistical measure of
the redundancy of text. Around 1920, William Friedman 
published the article, "The Index of Coincidence and 
its Applications in Cryptography."

http://www.aegeanparkpress.com/books_by_author.html

The expected IC of English is around .667.

Instead of trying to write the equation here,
I'll refer you to:

http://members.fortunecity.com/jpeschel/gillog1.htm

where you'll find the equation.

On this page, Jim uses the IC to do a ciphertext-only
crack of Enigma messages.

You'll also find C source for an IC program on the
"Historic" page of my web site.

Joe
__________________________________________

Joe Peschel 
D.O.E. SysWorks					
http://members.aol.com/jpeschel/index.htm
__________________________________________



Subject: Re: Calculating the redudancy of english?
Date: Tue, 31 Oct 2000 01:11:04 GMT
From: jsavard@fNrOeSePnAeMt.edmonton.ab.ca (John Savard)
Message-ID: <39fe170f.1005468@news.powersurfr.com>
References: <8tktfv$cem$1@nnrp1.deja.com>
   <20001030171002.18842.00001199@ng-fi1.aol.com>
Newsgroups: sci.crypt
Lines: 79

On Mon, 30 Oct 2000 22:42:08 GMT, Simon Johnson
<pabalo@dimension.h3o.org> wrote, in part:
>In article <20001030171002.18842.00001199@ng-fi1.aol.com>,
>  jpeschel@aol.commune.org (JPeschel) wrote:
>> Simon Johnson pabalo@dimension.h3o.org writes:

>> >How does one calculate the redudancy of english?

>> Use the index of coincidence.

>What is the index of coincidence, Applied crypto doesn't seem to give
>enough info for me to estimate the redudanct of english.

No; Applied Cryptography doesn't discuss things like letter frequency
because it is mostly used in attacking pencil and paper ciphers, which
are the sort of things your "kid sister" might solve as puzzles - and
which are explicitly excluded from coverage in the book.

The index of coincidence, however, will only tell you how redundant
English text, encrypted by a secure method of letter transposition,
is. The actual redundancy of English text cannot be measured, only
lower bounds can be set for it: its single-letter frequencies indicate
it must have this much redundancy, its digraph frequencies indicate it
must also have this much more, its word frequencies indicate it must
have this much more yet.

Thus, getting the "real" redundancy of the English language requires
you construct a mathematical model that only generates sensible
English text, but isn't limited to only a fraction of the possible
English texts.

Still, a formula _is_ required to calculate each one of those lower
bounds. And it's something even my web page does not describe. (If you
were a mathematician, though, my page on optimized Morse armor would
give you enough clues to reinvent it.) "Cryptography: A Primer", by
Alan Konheim, does describe it, and so does "Elementary
Cryptanalysis", by Abraham Sinkov, and so does "The Codebreakers", by
David Kahn.

The index of coincidence itself is the probability that two letters,
from a source stream, will be the same letter. This increases as
redundancy increases, but it is not itself the redundancy.

Its formula is:

( the sum, over all possible letters i, of n(i) times (n(i) - 1) )
divided by ( N times ( N-1 ) )

where N is the number of letters in a message, and n(i) is the number
of times the letter whose index is i occurs in the message.

Given a probability distribution, rather than a specific text, one
would use the formula

the sum, over all possible letters i, of ( p(i) squared )

where p(i) is the probability of the letter whose index is i.

But for the redundancy, what you want is the entropy of the source
stream.

Each letter of an English text consumes log[base 2](26) bits, or
log[base 10](26) digits, of bandwidth.

The amount of bandwidth English text deserves, based only on letter
frequencies, would be:

the sum, over all possible letters i, of ( p(i) * log( 1/p(i) ) )

where p(i) is the probability of the letter whose index is i.

And the redundancy of English text is given by

1 - ( deserved bandwith / consumed bandwidth ).

So this is how to calculate redundancy from frequency tables.

John Savard
http://home.ecn.ab.ca/~jsavard/crypto.htm


Subject: Re: Calculating the redudancy of english?
Date: 1 Nov 2000 21:35:43 GMT
From: unruh@physics.ubc.ca (Bill Unruh)
Message-ID: <8tq2bf$2dt$1@nntp.itservices.ubc.ca>
References: <8tkosd$84d$1@nnrp1.deja.com>
Newsgroups: sci.crypt
Lines: 11

In <8tkosd$84d$1@nnrp1.deja.com> Simon Johnson <pabalo@dimension.h3o.org> writes:

> How does one calculate the redudancy of english?

Depends on what you mean. There are many redundancies. Eg, take the
trigrams and calculate Sum Pi ln Pi for the trigrams. or take the
frequencies of individual letters and calculate the same thing. 

Then you have to decide, are you going to take a typical passage (in
which many words occur far far far more frequently than others) or a
dictionary ( where all words occur equally-- once).


Subject: Re: Calculating the redudancy of english?
Date: Fri, 3 Nov 2000 15:14:43 GMT
From: "Douglas A. Gwyn" <gwyn@arl.army.mil>
Message-ID: <3A02D663.20AE4BB5@arl.army.mil>
References: <20001030171002.18842.00001199@ng-fi1.aol.com>
   <8tkosd$84d$1@nnrp1.deja.com>
Newsgroups: sci.crypt


JPeschel wrote:
> Simon Johnson pabalo@dimension.h3o.org writes:
> >How does one calculate the redudancy of english?
> Use the index of coincidence.

No.

For a correct answer see Shannon's paper.


Subject: Re: Calculating the redudancy of english?
Date: 03 Nov 2000 20:48:55 GMT
From: jpeschel@aol.commune.org (JPeschel)
Message-ID: <20001103154855.06266.00001078@ng-fi1.aol.com>
References: <3A02D663.20AE4BB5@arl.army.mil>
Newsgroups: sci.crypt
Lines: 48

Douglas A. Gwyn writes:

>JPeschel wrote:
>> Simon Johnson pabalo@dimension.h3o.org writes:
>> >How does one calculate the redudancy of english?
>> Use the index of coincidence.
>
>No.
>
>For a correct answer see Shannon's paper.

Oops. 
Thanks, Doug.
I think this is the applicable paragraph by Shannon.

"The ratio of the entropy of a source to the maximum 
value it could have while still restricted to the same
symbols will be called its relative entropy. 
This is the maximum compression possible when we 
encode into the same alphabet. One minus the relative 
entropy is the redundancy. The redundancy of ordinary English,
not considering statistical structure over greater 
distances than about eight letters, is roughly 50%.
This means that when we write English half of what we
write is determined by the structure of the language and
half is chosen freely. The figure 50% was found by 
several independent methods which all gave results in
this neighborhood. One is by calculation of the entropy
of the approximations to English. A second method
is to delete a certain fraction of the letters from a
sample of English text and then let someone attempt to
restore them. If they can be restored when 50% are 
deleted the redundancy must be greater than 50%. A
third method depends on certain known results in 
cryptography."

I'm off to re-read <i>Finnegan's Wake</i> now. :-)

Joe


__________________________________________

Joe Peschel 
D.O.E. SysWorks					
http://members.aol.com/jpeschel/index.htm
__________________________________________



Subject: Re: Calculating the redudancy of english?
Date: Fri, 03 Nov 2000 21:21:11 -0500
From: Matt Mahoney <matmahoney@yahoo.com>
Message-ID: <3A037297.12ED6CB5@yahoo.com>
References: <8tkosd$84d$1@nnrp1.deja.com>
Newsgroups: sci.crypt
Lines: 22

Simon Johnson wrote:
> 
>  How does one calculate the redudancy of english?

In 1950 Shannon estimated the entropy of written English to be 0.6 to
1.3 bits per character, based on how well people can predict successive
characters in text.  Cover and King in 1978 estimated 1.3 bpc using a
gambling game in which people placed bets on the next letter rather than
just ranking them.  I believe that this method is flawed because people
tend to overestimate the odds of unlikely events (e.g. lotteries), which
would overestimate the entropy.  I did some simulations of the Shannon
game which suggest a value around 1.0 bpc.  (See
http://cs.fit.edu/~mmahoney/dissertation/entropy1.html).

I'm surprised that more research hasn't been done on this question.  But
until we solve the AI problem, we have to use human subjects, because a
statistical model of English would be equivalent to passing the Turing
test.  The best model to date is one developed by Rosenfeld in 1996 on
hundreds of megabytes of text at 1.2 bpc.  Among text compressors, RK is
probably the best at 1.8-1.9 bpc.  Gzip gets about 3 bpc.

-- Matt Mahoney, matmahoney@yahoo.com, http://cs.fit.edu/~mmahoney/
Terry Ritter, his current address, and his top page.
Last updated: 2001-06-27
The Redundancy of English

Terry Ritter

A Ciphers By Ritter Page

Contents