EN4: Shannon’s Theory of Information & its Relevance to DNA

In one of the most important engineering breakthroughs of the 20th Century, Claude Shannon discovered just how much information – in principle – could be conveyed between a transmitter and a receiver. When you turn on the radio in the morning to catch a news broadcast, consider that there are a number of factors that determine how much information you can receive, or whether you can get any at all.

If your radio is too far away from the station’s transmitting antenna, you may receive a weak signal and the message may be garbled. That “garbling” comes in part from electronic noise involved in both the transmitting and receiving equipment, sources of radio noise in the environment, and perhaps even interference from some competing radio broadcast whose frequency is tuned too close to that of your favorite radio station. If the broadcast power is increased and if you buy better equipment that minimizes the noise, the information will come through with fewer errors. Shannon was able to quantify exactly how the factors of signal strength, noise, and bandwidth limit the transmission of information.

Bandwidth? You have probably noticed that FM stations “sound better” than AM stations. The FM band resides in a much higher frequency range than AM, allowing each station in the band to use a broader range of frequencies for its own broadcast. This higher bandwidth allows the transmission of a broader range of frequencies in the audio signal that rides on top of the radio station’s transmitter frequency. That is why music programs are generally found on FM, because the music sounds better when you are allowed to hear the full range of audio frequencies generated by a band or a singer.
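The way bandwidth, signal strength, and noise jointly limit a channel is usually written as the Shannon-Hartley formula, C = B log2(1 + S/N). The sketch below is only an illustration: the function name, the channel widths (roughly 10 kHz for AM and 200 kHz for FM), and the 30 dB signal-to-noise ratio are assumptions chosen for the example, not figures from this article.

```python
import math

def channel_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley limit: C = B * log2(1 + S/N), in bits per second."""
    snr_linear = 10 ** (snr_db / 10)  # convert decibels to a plain power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)

# Illustrative numbers only (assumptions, not taken from the article):
# about 10 kHz for an AM channel, 200 kHz for an FM channel, 30 dB SNR for both.
am_capacity = channel_capacity_bps(10_000, 30)
fm_capacity = channel_capacity_bps(200_000, 30)

print(f"AM-like channel: {am_capacity:,.0f} bits/s")
print(f"FM-like channel: {fm_capacity:,.0f} bits/s")
```

With the same signal-to-noise ratio, the capacity scales directly with the bandwidth, so the wider channel can carry proportionally more information.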

To quantify exactly how much information can be sent over a communication channel, Shannon figured out how to measure information. Digital communications are at the heart of all modern technology, with messages converted into a stream of ones and zeroes. Each binary digit, or “bit,” is represented by a one or zero in the message. If you’re receiving a digital message, you know that (in general) the next bit you will see is equally likely to be a one or a zero. The probability of each is ½ . . . just like the probability of either a head or a tail with a flipped coin.

The amount of information in that bit is defined to be the logarithm, to the base 2, of 1/p, where “p” is the probability (1/2 in this case). Therefore the information in the next bit is equal to log(2) 2 = 1. (Since we will always be talking about “base 2” in this article, we’ll usually leave off the parenthetical “2”.)

Let’s say we have a message that is a short stream of just 2 bits. This message has 4 possibilities: 00, 01, 10, 11. The total information here is 2 bits = 2 log 2 = log 2^2 = log 4. “4”, of course, is the number of possible messages we could receive. Thus the amount of information that can be conveyed in this short message is log P, where “P” is the number of equally likely possibilities.

One more short example: A short message of 4 bits has 16 possibilities, or alternatives. Thus the amount of information in such a message is limited to log 16 = 4. So we see that as long as we’re operating in binary, 1s and 0s, the amount of information is just equal to the number of bits in the message. Of course we don’t know whether that message has been garbled or may be simply nonsense. Whether the message is useful or not may be a matter of opinion.
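The definitions above can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for the illustration, and it assumes every message of a given length is equally likely.

```python
import math

def self_information_bits(p: float) -> float:
    """Information, in bits, of an outcome that has probability p: log2(1/p)."""
    return math.log2(1 / p)

def message_information_bits(n_bits: int) -> float:
    """Information in an n-bit message when all 2**n messages are equally likely."""
    possibilities = 2 ** n_bits
    return math.log2(possibilities)

print(self_information_bits(0.5))    # one fair bit -> 1.0
print(message_information_bits(2))   # 4 possibilities -> 2.0
print(message_information_bits(4))   # 16 possibilities -> 4.0
```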

Here is a string of 1s and 0s. Can you tell by looking at this string whether it “means” anything, or is just a random set of bits?

101001110011111010011

If you break the sequence into 3 sets of 7 bits, you can look up the ASCII code to find out that the English capital letters SOS are represented. In ASCII it takes 21 bits of information to represent those 3 letters, which convey a very meaningful message equivalent to “Emergency!! Help!!” The English alphabet, especially including its upper and lower case letters, along with numbers and punctuation marks, has a lot more characters than our simple binary code consisting of 1s and 0s. So it takes a lot of “bits” to represent a message equivalent to an alphabet with lots of characters.
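The lookup described above can be done mechanically. The sketch below splits the string into 7-bit groups and converts each group to its ASCII character; the function name is hypothetical.

```python
def decode_7bit_ascii(bits: str) -> str:
    """Split a bit string into 7-bit groups and map each group to its ASCII character."""
    if len(bits) % 7 != 0:
        raise ValueError("bit string length must be a multiple of 7")
    groups = [bits[i:i + 7] for i in range(0, len(bits), 7)]
    return "".join(chr(int(group, 2)) for group in groups)

message = decode_7bit_ascii("101001110011111010011")
print(message)  # SOS
```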

Alternatively a random string of 21 bits might convey no practical message at all. Thus the “Shannon Information” is a measure of the capacity of a communication channel to convey practical information. Here the idea of “practical information” is used in our everyday sense. We don’t consider “nonsense” to convey practical information. It has to “make sense” for us to accept that a message conveys useful information. So “Shannon’s information” is a very mathematical concept, whereas our everyday use of the word “information” means something more.

The longer a digital message is, the more complex it is. Complexity implies that it takes more hardware to store, it takes longer to transmit and translate, and that you could not predict the string of 1s and 0s even if you were a very lucky guesser. As a counter-example, consider the string of symbols below that represents a chain of atoms in a salt crystal, which consists strictly of sodium and chlorine:

Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl Na Cl . . .

If you had to guess what the next atom in the sequence is going to be . . . not too hard, is it? We note that a crystal is a very regular, orderly structure, and so has almost no complexity at all. The string of 21 bits above is both complex and specified. It is complex because it is long enough and appears to be random enough so that you can’t just easily guess the next bit in the sequence. “Specified” means that the order has to be just right in order to produce a useful message.
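One rough, informal way to see the difference between an orderly sequence and an unpredictable one is to try compressing each. The sketch below uses Python's standard zlib module as a stand-in for "how easy is it to guess what comes next"; compressibility is only a proxy chosen for this illustration, not the definition of complexity used in this article, and the alphabet and seed are arbitrary assumptions.

```python
import random
import zlib

def compressed_size(text: str) -> int:
    """Number of bytes zlib needs to store the text at maximum compression."""
    return len(zlib.compress(text.encode("ascii"), 9))

# A perfectly regular "crystal" sequence.
crystal = "NaCl" * 30

# A scrambled sequence of the same length over an arbitrary 20-letter alphabet.
rng = random.Random(0)  # fixed seed so the example is repeatable
alphabet = "abdefghilmnoprstuvwy"
scrambled = "".join(rng.choice(alphabet) for _ in range(len(crystal)))

print("crystal:  ", len(crystal), "chars ->", compressed_size(crystal), "bytes")
print("scrambled:", len(scrambled), "chars ->", compressed_size(scrambled), "bytes")
```

The repeating pattern shrinks to a small fraction of its length, while the scrambled string barely compresses at all. Note that compression says nothing about whether a sequence is specified; it only reflects how predictable it is.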

The DNA code has 4 characters, representing molecules whose names begin with the letters A, T, C, and G. Your cells use these in groups of 3 to form “codons” which get translated (through a very complicated system of systems – see Meyer’s book below for a wonderful, easy-to-read discussion) to specify the 20 different amino acids from which proteins are built. For example, the codon GCA is translated by cellular machinery into the amino acid alanine, CAC into histidine. You should look up diagrams of all of these molecules. How many “bits” of information are represented in a codon? Each codon has 4x4x4 = 64 possible outcomes, and so the Shannon information is log (64) = 6 bits.
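The codon count can be confirmed by simply listing every triplet. This is a small sketch that, like the text, assumes all 64 codons are equally likely; the variable names are illustrative.

```python
import math
from itertools import product

BASES = "ATCG"

# Every ordered triplet of bases, e.g. "AAA", "AAT", ..., "GGG".
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

codon_count = len(codons)
bits_per_codon = math.log2(codon_count)

print(codon_count)      # 64
print(bits_per_codon)   # 6.0
print("GCA" in codons, "CAC" in codons)  # True True
```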

What is the message produced by the DNA? The message often results in the production of very complex nano-machines called “proteins” used as enzymes to facilitate the chemical reactions of life or as structures to allow cells and tissues to exist and perform their life-sustaining functions. How much information resides in a typical protein?

Let’s first compare how much information is in a simple English sentence. For convenience, we’ll use a famous Bible verse, John 3:16, which makes use of 20 of the 26 letters in the English alphabet. Not used in this sequence are the letters c, j, k, q, x, z:

For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.

Ignore the punctuation, spaces, and the upper / lower case distinctions, and we see a sequence of 113 letters. Below is a fairly random sequence of the same length (hopefully) that I’ll generate as if I were a monkey randomly typing on a keyboard that contains the same letters:

Ueh Pwm lo rbtnb din sgpyo, lhao ty mrsp ubt dauh htrhybtr Ite, mgrl eugdgvolf otyefdbsl wr onr wotfld odh yrlpdg, eni viht hirhufrtpdm dmen.

Both sequences are very complex, but it is clear to a reader of English that only the verse is specified. The complexity is quantified by the fact that an English sequence of 113 characters, using 20 different letters, has 20^113, or about 10^147, different possibilities, or ways that the randomly typing monkey might generate it. For every power of 10 we have about 3.32 powers of 2, so that gives us about 488 bits of information. That’s a very long string of bits, which must be specified perfectly in order to produce the proper meaning.

How long would it take a monkey to come up with John 3:16 with the appropriate keyboard? If we had one monkey plus keyboard for each atom in the entire universe (about 10^80), and they each typed a 113-character string a trillion times per second (10^12/sec) and they all continued persistently for 30 billion years, or about 10^18 seconds (which is about twice the alleged age of the “Big Bang” universe), then they would only generate 10^110 sequences, a factor of 10^37 short of 10^147. So we would have to do this experiment 10^37 times, which is 10 trillion trillion trillion times. Only then could our monkey team expect to get the verse right about once! Note that every 113-character sequence would contain enormous “information” in the Shannon sense, but without perfect specification, there is no meaningful message.
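The arithmetic in this estimate is easy to check with logarithms, which keep the enormous numbers manageable. The sketch below takes the article's inputs as given (20 letters, 113 characters, 10^80 monkeys, 10^12 strings per second, 10^18 seconds); the variable names are illustrative.

```python
import math

ALPHABET_SIZE = 20   # distinct letters used in the verse
VERSE_LENGTH = 113   # letters in the verse, ignoring spaces and punctuation

# Size of the sequence space, expressed as a power of 10 and as a number of bits.
log10_sequences = VERSE_LENGTH * math.log10(ALPHABET_SIZE)
bits = VERSE_LENGTH * math.log2(ALPHABET_SIZE)

# Attempts by the monkey team: 10**80 monkeys * 10**12 strings/sec * 10**18 sec.
log10_attempts = 80 + 12 + 18
log10_shortfall = log10_sequences - log10_attempts

print(f"possible sequences: about 10^{log10_sequences:.0f}")     # about 10^147
print(f"information: about {bits:.0f} bits")                     # about 488 bits
print(f"attempts: 10^{log10_attempts}")                          # 10^110
print(f"shortfall: about a factor of 10^{log10_shortfall:.0f}")  # about 10^37
```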

What about proteins? Proteins use 20 amino acids, somewhat equivalent to the 20 different letters of our abbreviated keyboard. Proteins are constructed of sequences of 300, 400 . . . even 1,000 or more amino acids which must be in the precise order to allow the cell’s machinery to fold these chains into precise 3-dimensional shapes. A protein of only 113 amino acids would be unusually small. Thus the information content of proteins is also huge in the Shannon sense, and must be perfectly well specified to perform their functions. The specification in the protein structure is recorded in the DNA codons, in the complex and specified information of our genetic code.
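The same counting can be applied to amino-acid chains of the lengths mentioned above. This is a simplified sketch that assumes, as the keyboard analogy does, that each of the 20 amino acids is equally likely at every position; the function name is hypothetical.

```python
import math

AMINO_ACIDS = 20

def sequence_information_bits(length: int, alphabet_size: int = AMINO_ACIDS) -> float:
    """Bits needed to pick one sequence of the given length from all equally likely ones."""
    return length * math.log2(alphabet_size)

for length in (113, 300, 400, 1000):
    print(f"{length:>5} amino acids: about {sequence_information_bits(length):.0f} bits")
```

Under these assumptions, each added amino acid contributes about 4.32 bits, so a 300-residue chain corresponds to roughly 1,297 bits and a 1,000-residue chain to roughly 4,322 bits.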

Where did this immense amount of brilliantly specified information come from? Just as it would be impossible for a universe of monkeys to generate a short verse of English, it is evident that a universe full of amino acids or DNA bases, even if they could react and form chains a trillion times per second for billions of years, could not hope to generate a single biological nano-machine. Complex, specified information – anywhere you find it – whether in English literature, computer software, or electrical circuit diagrams, always comes from an intelligent designer. It is reasonable to infer that the most impressively complex specified information ever discovered in the universe, namely the genetic code of biological life, has an Intelligent Designer as its Author.

References:
1. Shannon, C.E., “A Mathematical Theory of Communication,” BSTJ, vol. 27, pp. 379-423 and 623-656, 1948.
2. Shannon, C.E., “Communication in the Presence of Noise,” Proc. IRE, vol. 37, p. 10, 1949.
3. Stephen C. Meyer, Signature in the Cell: DNA and the Evidence for Intelligent Design, Harper Collins, 2009.
