In September 1997, I allowed an Australian film crew into my house in Oxford without realising that their purpose was creationist propaganda. In the course of a suspiciously amateurish interview, they issued a truculent challenge to me to "give an example of a genetic mutation or an evolutionary process which can be seen to increase the information in the genome."
It is the kind of question only a creationist would ask in that way, and it was at this point I tumbled to the fact that I had been duped into granting an interview to creationists - a thing I normally don't do, for good reasons. In my anger I refused to discuss the question further, and told them to stop the camera. However, I eventually withdrew my peremptory termination of the interview as a whole.
This was solely because they pleaded with me that they had come all the way from Australia specifically in order to interview me. Even if this was a considerable exaggeration, it seemed, on reflection, ungenerous to tear up the legal release form and throw them out. I therefore relented.
My generosity was rewarded in a fashion that anyone familiar with fundamentalist tactics might have predicted. When I eventually saw the film a year later, I found that it had been edited to give the false impression that I was incapable of answering the question about information content.
In fairness, this may not have been quite as intentionally deceitful as it sounds. You have to understand that these people really believe that their question cannot be answered! Pathetic as it sounds, their entire journey from Australia seems to have been a quest to film an evolutionist failing to answer it.
With hindsight - given that I had been suckered into admitting them into my house in the first place - it might have been wiser simply to answer the question. But I like to be understood whenever I open my mouth - I have a horror of blinding people with science - and this was not a question that could be answered in a soundbite.
First you have to explain the technical meaning of "information". Then the relevance of information to evolution is complicated too - not really difficult, but it takes time. Rather than engage now in further recriminations and disputes about exactly what happened at the time of the interview (for, to be fair, I should say that the Australian producer's memory of events seems to differ from mine), I shall try to redress the matter now in constructive fashion by answering the original question, the "Information Challenge", at adequate length - the sort of length you can achieve in a proper article.
The technical definition of "information" was introduced by the American engineer Claude Shannon in 1948. An employee of the Bell Telephone Company, Shannon was concerned to measure information as an economic commodity. It is costly to send messages along a telephone line. Much of what passes in a message is not information: it is redundant. You could save money by recoding the message to remove the redundancy.
Redundancy was a second technical term introduced by Shannon, as the inverse of information. Both definitions were mathematical, but we can convey Shannon's intuitive meaning in words. Redundancy is any part of a message that is not informative, either because the recipient already knows it (is not surprised by it) or because it duplicates other parts of the message.
In the sentence "Rover is a poodle dog", the word "dog" is redundant because "poodle" already tells us that Rover is a dog. An economical telegram would omit it, thereby increasing the informative proportion of the message. "Arr JFK Fri pm pls mt BA Cncrd flt" carries the same information as the much longer, but more redundant, "I'll be arriving at John F Kennedy airport on Friday evening; please meet the British Airways Concorde flight". Obviously the brief, telegraphic message is cheaper to send (although the recipient may have to work harder to decipher it - redundancy has its virtues if we forget economics).
Shannon wanted to find a mathematical way to capture the idea that any message could be broken into the information (which is worth paying for), the redundancy (which can, with economic advantage, be deleted from the message because, in effect, it can be reconstructed by the recipient) and the noise (which is just random rubbish).
"It rained in Oxford every day this week" carries relatively little information, because the receiver is not surprised by it. On the other hand, "It rained in the Sahara desert every day this week" would be a message with high information content, well worth paying extra to send. Shannon wanted to capture this sense of information content as "surprise value". It is related to the other sense - "that which is not duplicated in other parts of the message" - because repetitions lose their power to surprise.
Note that Shannon's definition of the quantity of information is independent of whether it is true. The measure he came up with was ingenious and intuitively satisfying. Let's estimate, he suggested, the receiver's ignorance or uncertainty before receiving the message, and then compare it with the receiver's remaining ignorance after receiving the message. The quantity of ignorance-reduction is the information content.
Shannon's unit of information is the bit, short for "binary digit". One bit is defined as the amount of information needed to halve the receiver's prior uncertainty, however great that prior uncertainty was (mathematical readers will notice that the bit is, therefore, a logarithmic measure).
In practice, you first have to find a way of measuring the prior uncertainty - that which is reduced by the information when it comes. For particular kinds of simple message, this is easily done in terms of probabilities. An expectant father watches the Caesarian birth of his child through a window into the operating theatre. He can't see any details, so a nurse has agreed to hold up a pink card if it is a girl, blue for a boy.
How much information is conveyed when, say, the nurse flourishes the pink card to the delighted father? The answer is one bit - the prior uncertainty is halved. The father knows that a baby of some kind has been born, so his uncertainty amounts to just two possibilities - boy and girl - and they are (for purposes of this discussion) equal. The pink card halves the father's prior uncertainty from two possibilities to one (girl). If there'd been no pink card but a doctor had walked out of the operating theatre, shaken the father's hand and said "Congratulations old chap, I'm delighted to be the first to tell you that you have a daughter", the information conveyed by the 17 word message would still be only one bit.
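The arithmetic behind the pink card can be sketched in a few lines of Python (my illustration, not part of the original story): information in bits is the base-2 logarithm of the number of equiprobable alternatives.

```python
import math

# Bits of information gained when N equally likely alternatives
# are narrowed down to one: log2(N).
def bits(n_alternatives: int) -> float:
    return math.log2(n_alternatives)

# The pink card: two equiprobable outcomes (boy, girl) reduced to one.
print(bits(2))   # 1.0 - one bit, however many words the doctor uses
```

Whether the message is a card or a 17-word speech, N is still 2, so the information is still one bit.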
Computer information is held in a sequence of noughts and ones. There are only two possibilities, so each 0 or 1 can hold one bit. The memory capacity of a computer, or the storage capacity of a disc or tape, is often measured in bits, and this is the total number of 0s or 1s that it can hold. For some purposes, more convenient units of measurement are the byte (8 bits), the kilobyte (1000 bytes or 8000 bits), the megabyte (a million bytes or 8 million bits) or the gigabyte (1000 million bytes or 8000 million bits).
Notice that these figures refer to the total available capacity. This is the maximum quantity of information that the device is capable of storing. The actual amount of information stored is something else. The capacity of my hard disc happens to be 4.2 gigabytes. Of this, about 1.4 gigabytes are actually being used to store data at present. But even this is not the true information content of the disc in Shannon's sense.
The true information content is smaller, because the information could be more economically stored. You can get some idea of the true information content by using one of those ingenious compression programs like "Stuffit". Stuffit looks for redundancy in the sequence of 0s and 1s, and removes a hefty proportion of it by recoding - stripping out internal predictability. Maximum information content would be achieved (probably never in practice) only if every 1 or 0 surprised us equally. Before data is transmitted in bulk around the Internet, it is routinely compressed to reduce redundancy.
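Stuffit itself aside, any general-purpose compressor gives the same kind of rough, upper-bound estimate of true information content. A sketch using Python's standard zlib module (my example, not one from the essay):

```python
import zlib

# A highly redundant message: the same sentence fifty times over.
message = b"It rained in Oxford every day this week. " * 50

packed = zlib.compress(message, level=9)

print(len(message), len(packed))
# The compressed form is far smaller, because the compressor strips
# out the internal predictability - a crude stand-in for Shannon's
# true information content.
```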
That's good economics. But on the other hand it is also a good idea to keep some redundancy in messages, to help correct errors. In a message that is totally free of redundancy, after there's been an error there is no means of reconstructing what was intended. Computer codes often incorporate deliberately redundant "parity bits" to aid in error detection. DNA, too, has various error-correcting procedures which depend upon redundancy.
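A single parity bit is the simplest example of deliberately reintroduced redundancy. This sketch (mine, not from the essay) appends one bit so that the count of 1s is even; a single flipped bit then betrays itself, though it cannot be located or corrected:

```python
# Even parity: append one redundant bit so the total number of 1s is even.
def add_parity(bits):
    return bits + [sum(bits) % 2]

def parity_ok(bits):
    return sum(bits) % 2 == 0

word = [1, 0, 1, 1, 0, 1, 0]
sent = add_parity(word)
print(parity_ok(sent))   # True - message arrives intact

sent[3] ^= 1             # one bit corrupted in transit
print(parity_ok(sent))   # False - the error is detected, not corrected
```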
When I come on to talk of genomes, I'll return to the three-way distinction between total information capacity, information capacity actually used, and true information content. It was Shannon's insight that information of any kind, no matter what it means, no matter whether it is true or false, and no matter by what physical medium it is carried, can be measured in bits, and is translatable into any other medium of information.
The great biologist J B S Haldane used Shannon's theory to compute the number of bits of information conveyed by a worker bee to her hivemates when she "dances" the location of a food source (about 3 bits to tell about the direction of the food and another 3 bits for the distance of the food). In the same units, I recently calculated that I'd need to set aside 120 megabits of laptop computer memory to store the triumphal opening chords of Richard Strauss's "Also Sprach Zarathustra" (the "2001" theme) which I wanted to play in the middle of a lecture about evolution.
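Haldane's figure of about 3 bits corresponds to distinguishing roughly eight equiprobable alternatives - eight compass sectors for the direction, say (the sectors are my illustrative reading, not Haldane's actual coding):

```python
import math

# 3 bits single out one of 2**3 = 8 equiprobable alternatives,
# e.g. eight compass sectors for the direction of the food.
sectors = 8
print(math.log2(sectors))   # 3.0 bits for direction; similarly ~3 bits for distance
```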
Shannon's economics enable you to calculate how much modem time it'll cost you to e-mail the complete text of a book to a publisher in another land. Fifty years after Shannon, the idea of information as a commodity, as measurable and interconvertible as money or energy, has come into its own.
DNA carries information in a very computer-like way, and we can measure the genome's capacity in bits too, if we wish. DNA doesn't use a binary code, but a quaternary one. Whereas the unit of information in the computer is a 1 or a 0, the unit in DNA can be T, A, C or G.
If I tell you that a particular location in a DNA sequence is a T, how much information is conveyed from me to you? Begin by measuring the prior uncertainty. How many possibilities are open before the message "T" arrives? Four. How many possibilities remain after it has arrived? One. So you might think the information transferred is four bits, but actually it is two.
Here's why (assuming that the four letters are equally probable, like the four suits in a pack of cards). Remember that Shannon's metric is concerned with the most economical way of conveying the message. Think of it as the number of yes/no questions that you'd have to ask in order to narrow down to certainty, from an initial uncertainty of four possibilities, assuming that you planned your questions in the most economical way. "Is the mystery letter before D in the alphabet?" No. That narrows it down to T or G, and now we need only one more question to clinch it. So, by this method of measuring, each "letter" of the DNA has an information capacity of 2 bits.
Whenever the prior uncertainty of the recipient can be expressed as a number of equiprobable alternatives N, the information content of a message which narrows those alternatives down to one is log2 N (the power to which 2 must be raised in order to yield the number of alternatives N). If you pick a card, any card, from a normal pack, a statement of the identity of the card carries log2 52, or about 5.7 bits of information. In other words, given a large number of guessing games, it would take 5.7 yes/no questions on average to guess the card, provided the questions are asked in the most economical way.
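The same calculation in Python (my sketch), for the DNA letter and the playing card:

```python
import math

def info_bits(n_alternatives):
    """Bits needed to single out one of N equiprobable alternatives: log2(N)."""
    return math.log2(n_alternatives)

print(info_bits(4))    # a DNA letter: 2.0 bits
print(info_bits(52))   # a playing card: about 5.7 bits
```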
The first two questions might establish the suit. (Is it red? Is it a diamond?) The remaining three or four questions would successively divide and conquer the suit (Is it a 7 or higher? etc.), finally homing in on the chosen card. When the prior uncertainty is some mixture of alternatives that are not equiprobable, Shannon's formula becomes a slightly more elaborate weighted average, but it is essentially similar.
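The weighted average in question is Shannon's entropy, H = -sum(p * log2 p) taken over the probabilities p of the alternatives. A sketch (my own) showing that it reduces to log2 N when the alternatives are equiprobable, and shrinks when they are not:

```python
import math

def shannon_entropy(probs):
    # H = -sum(p * log2(p)); alternatives with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.25] * 4))            # 2.0 bits - four equiprobable DNA letters
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))  # under 1.4 bits - skewed letters surprise less
```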
By the way, Shannon's weighted average is the same formula as physicists have used, since the nineteenth century, for entropy. The point has interesting implications but I shall not pursue them here.
Information and evolution
That's enough background on information theory. It is a theory which has long held a fascination for me, and I have used it in several of my research papers over the years. Let's now think how we might use it to ask whether the information content of genomes increases in evolution. First, recall the three-way distinction between total information capacity, the capacity that is actually used, and the true information content when stored in the most economical way possible.
The total information capacity of the human genome is measured in gigabits. That of the common gut bacterium Escherichia coli is measured in megabits. We, like all other animals, are descended from an ancestor which, were it available for our study today, we'd classify as a bacterium. So perhaps, during the billions of years of evolution since that ancestor lived, the information capacity of our genome has gone up about three orders of magnitude (powers of ten) - about a thousandfold.
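To put rough numbers on that thousandfold claim (the base counts below are round modern estimates assumed for illustration, not figures from the essay), at 2 bits per DNA letter:

```python
# Assumed round figures: ~3.2 billion bases (human genome),
# ~4.6 million bases (E. coli genome); 2 bits of capacity per base.
human_bits = 3.2e9 * 2      # ~6.4 gigabits
ecoli_bits = 4.6e6 * 2      # ~9.2 megabits

print(human_bits / ecoli_bits)   # roughly 700-fold - close to three orders of magnitude
```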
This is satisfyingly plausible and comforting to human dignity. Should human dignity feel wounded, then, by the fact that the crested newt, Triturus cristatus, has a genome capacity estimated at 40 gigabits, an order of magnitude larger than the human genome? No, because, in any case, most of the capacity of the genome of any animal is not used to store useful information. There are many nonfunctional pseudogenes (see below) and lots of repetitive nonsense, useful for forensic detectives but not translated into protein in the living cells.
The crested newt has a bigger "hard disc" than we have, but since the great bulk of both our hard discs is unused, we needn't feel insulted. Related species of newt have much smaller genomes. Why the Creator should have played fast and loose with the genome sizes of newts in such a capricious way is a problem that creationists might like to ponder. From an evolutionary point of view the explanation is simple (see The Selfish Gene pp 44-45 and p 275 in the Second Edition).
Evidently the total information capacity of genomes is very variable across the living kingdoms, and it must have changed greatly in evolution, presumably in both directions. Losses of genetic material are called deletions. New genes arise through various kinds of duplication. This is well illustrated by haemoglobin, the complex protein molecule that transports oxygen in the blood.
Human adult haemoglobin is actually a composite of four protein chains called globins, knotted around each other. Their detailed sequences show that the four globin chains are closely related to each other, but they are not identical. Two of them are called alpha globins (each a chain of 141 amino acids), and two are beta globins (each a chain of 146 amino acids). The genes coding for the alpha globins are on chromosome 16; those coding for the beta globins are on chromosome 11. On each of these chromosomes, there is a cluster of globin genes in a row, interspersed with some junk DNA.
The alpha cluster, on chromosome 16, contains seven globin genes. Four of these are pseudogenes, versions of alpha disabled by faults in their sequence and not translated into proteins. Two are true alpha globins, used in the adult. The final one is called zeta and is used only in embryos. Similarly the beta cluster, on chromosome 11, has six genes, some of which are disabled, and one of which is used only in the embryo. Adult haemoglobin, as we've seen, contains two alpha and two beta chains.
Never mind all this complexity. Here's the fascinating point. Careful letter-by-letter analysis shows that these different kinds of globin genes are literally cousins of each other, literally members of a family. But these distant cousins still coexist inside our own genome, and that of all vertebrates. On the scale of whole organisms, the vertebrates are our cousins too. The tree of vertebrate evolution is the family tree we are all familiar with, its branch-points representing speciation events - the splitting of species into pairs of daughter species. But there is another family tree occupying the same timescale, whose branches represent not speciation events but gene duplication events within genomes.
The dozen or so different globins inside you are descended from an ancient globin gene which, in a remote ancestor who lived about half a billion years ago, duplicated, after which both copies stayed in the genome. There were then two copies of it, in different parts of the genome of all descendant animals. One copy was destined to give rise to the alpha cluster (on what would eventually become chromosome 16 in our genome), the other to the beta cluster (on chromosome 11).
As the aeons passed, there were further duplications (and doubtless some deletions as well). Around 400 million years ago the ancestral alpha gene duplicated again, but this time the two copies remained near neighbours of each other, in a cluster on the same chromosome. One of them was destined to become the zeta of our embryos, the other became the alpha globin genes of adult humans (other branches gave rise to the nonfunctional pseudogenes I mentioned). It was a similar story along the beta branch of the family, but with duplications at other moments in geological history.