

The Beginnings of ``Information Theory"

The hierarchical model of information may be applied to the description of any domain in which information appears, such as that used by Claude Shannon in the 1940s when he developed what is now referred to as ``information theory" to study communication systems. His notions concerning information and measures of information are special cases of our functional definition of informativeness and information. Information theory is often considered to have begun with work by Harry Nyquist [Nyq24]. While new knowledge is built by individuals standing on the shoulders of those who performed earlier research, people such as Nyquist can be seen as being extraordinarily creative for putting together previous work to produce a new and unique model.

Writing in the Bell System Technical Journal, Nyquist suggested that two factors determine the ``maximum speed of transmission of intelligence." Each telephone cable is implicitly considered to have a limit imposed on it such that there is a finite, maximum speed for transmitting ``intelligence." This limit was widely understood by practicing electrical engineers of the era to be related to such factors as power, noise, and the frequency of the intelligent signal. Accepting such a limit as a given, Nyquist was able to work backwards towards the study of what was transmitted. He began referring to what was transmitted as ``information."

The two fundamental factors governing the maximum speed of data transmission are the shape of a signal and the choice of code used to represent the intelligence. Responding to the earlier work of Squier and others, Nyquist argued that telegraph signals are most efficiently transmitted when the intelligence-carrying waves are rectangular. Given a particular ``code," rectangular waves allow intelligence to be transmitted faster than sine waves in many practical environments.

Once the proper wave form is selected, a different problem arises: how should ``intelligence" be represented? Telegraphers had long used Morse code and its variants to transmit text messages across distances. Each character was represented by a set of short or long electrical signals, the familiar dots and dashes. The letter C, for example, is represented in modern Morse code by a dash-dot-dash-dot sequence. Experienced telegraphers listen to messages at speeds far exceeding the ability of humans to consciously translate each individual dash or dot into a ``thought representation" of the symbol; instead, Morse code is heard as a rhythm, with the rhythm for letters and common words being learned through long periods of listening.

Working backwards from the maximum telegraph speed, Nyquist considered the characteristics of an ``ideal" code. Morse code is adequate for many applications, but an ``adequate code" is far from being the best or optimal code available. Suggesting that the speed of intelligence transmission is proportional to the logarithm of the number of symbols which need to be represented, Nyquist was able to measure the amount of intelligence that can be transmitted using an ideal code. This is one step away from stating that there is a given amount of intelligence in a representation.
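
Nyquist's proportionality can be illustrated with a short Python sketch. The function name, the constant k, and the use of base 2 logarithms are our assumptions for illustration; only the logarithmic relationship itself comes from the discussion above, so the results are meaningful only as ratios.

```python
import math


def relative_speed(num_values: int, k: float = 1.0) -> float:
    """Speed of transmitting intelligence with an ideal code: k * log(m).

    num_values (m) is the number of distinct symbols the code must
    represent; k is an arbitrary constant (hypothetical default 1.0).
    """
    if num_values < 1:
        raise ValueError("num_values must be at least 1")
    return k * math.log2(num_values)


# Compare codes by ratio so the arbitrary constant k cancels out.
ratio_4_to_2 = relative_speed(4) / relative_speed(2)
ratio_8_to_2 = relative_speed(8) / relative_speed(2)
```

Under these assumptions, moving from 2 to 4 symbols doubles the speed and moving from 2 to 8 symbols triples it, which is the sense in which the speed grows with the logarithm of the number of symbols rather than with the number itself.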

Four years later, another engineer, R.V.L. Hartley [Har28], expanded on these ideas about information. Publishing in the same journal as Nyquist, the Bell System Technical Journal, and yet not citing Nyquist (or anyone else, for that matter), Hartley developed the concept of information based on ``physical as contrasted with psychological considerations" for use in studying electronic communications.

In the first section of his paper, titled The Measurement of Information, he noted that ``information is a very elastic term." In fact, Hartley never adequately defines this core concept. Instead, he addresses the ``precision of ... information" and the ``amount of information." Information exists in the transmission of symbols, with symbols having ``certain meanings to the parties communicating." When someone receives information, each received symbol allows the recipient to ``eliminate possibilities," excluding other possible symbols and their associated meanings. ``The precision of information depends upon what other symbol sequences might have been chosen;" the measure of these other sequences provides an indication of the amount of information transmitted. Hartley then suggests that we take ``as our practical measure of information the logarithm of the number of possible symbol sequences." Thus, if a received symbol is one of 4 different symbols occurring with equal frequency, it represents 2 bits of information.
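
Hartley's measure can be sketched in a few lines of Python. The function and variable names are ours, and base 2 is assumed so that the result is expressed in bits; the sketch also assumes, as discussed in this section, that all symbols are of equal length.

```python
import math


def hartley_information(num_symbols: int, sequence_length: int = 1) -> float:
    """Bits in a sequence: log2 of the number of possible symbol sequences.

    With num_symbols choices at each of sequence_length positions there
    are num_symbols ** sequence_length possible sequences, so the measure
    simplifies to sequence_length * log2(num_symbols).
    """
    if num_symbols < 1 or sequence_length < 0:
        raise ValueError("need num_symbols >= 1 and sequence_length >= 0")
    return sequence_length * math.log2(num_symbols)


one_symbol = hartley_information(4)        # one symbol drawn from 4
three_symbols = hartley_information(4, 3)  # 4**3 = 64 possible sequences
```

A single symbol drawn from 4 possibilities carries 2 bits, and a sequence of three such symbols carries 6 bits, the same value obtained by taking the base 2 logarithm of the 64 possible sequences.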

It is likely that Hartley was aware of the earlier work of Nyquist and that he assumed implicitly, as Nyquist did explicitly, that all symbol sequences were of the same length or size. The formula Hartley uses is consistent with this assumption, but serves only as an approximation of the information amount if the symbols are of different lengths [Bel73]. Symbols need not be equi-probable for Hartley's formula to be correct if symbols are of equal length. It is probable that Hartley did not make a statement concerning the probability of symbol sequences because of his (implicit) assumption of equal length symbols.

Hartley was aware of a relationship between the amount of energy in an information system and the amount of information that could be transmitted. Applying energy to an information transmitting system increases the ease with which the recipient receives or hears the transmitted signal. Energy serves as a component of the transmission process. Increasing the signal-to-noise ratio increases the probability that the information will be received correctly. Information itself isn't energy carrying; it is energy that carries information.

During World War II, Claude Shannon developed a model of the communication process using the earlier work of Nyquist and Hartley. Published in 1948, The Mathematical Theory of Communication became the founding document for much of the future work in information theory. Given a number of properties desirable in an information measure, the Shannon and Hartley measures of information, and only these measures, have those properties [AFN74]. The importance of this work soon became apparent to scholars in a range of disciplines, resulting in its use (and abuse) from the middle of the twentieth century to the present. Shannon does not provide a definition of information; he merely provides a model and the capability to measure information.

Shannon's work was intended to provide exactly what the title indicated: a theory of communication, useful in understanding telecommunications systems. In a private conversation in 1961, Shannon indicated that applications of his work to areas outside of communication theory were ``suspect" [Rit86].

Shannon thought that

the fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages [SW49].

Using this engineering perspective, the communication process may be understood as a source communicating to a destination. The source provides its message to a transmitter through a perfect connection. The transmitter communicates through a channel to the receiver, which receives the message and gives it in a lossless manner to the destination.

Figure 4: Shannon's channel model.

One of the key additions that Shannon made to the earlier work of Nyquist and Hartley was the formal integration of noise into the communication model. Noise is introduced into the channel between the transmitter and the receiver and acts to change messages so that what is received differs from what is transmitted.
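
The effect of noise can be illustrated with a small simulation. The sketch below assumes a binary channel in which each bit is flipped independently with a fixed probability; this particular noise model, the names, and the chosen probabilities are our illustration and are not taken from Shannon's text.

```python
import random


def noisy_channel(bits, flip_probability, rng):
    """Pass bits through a channel that flips each one independently."""
    return [bit ^ 1 if rng.random() < flip_probability else bit for bit in bits]


def count_errors(sent, received):
    """Number of positions where the received bit differs from the sent bit."""
    return sum(s != r for s, r in zip(sent, received))


rng = random.Random(0)  # fixed seed so repeated runs give the same result
sent = [rng.randint(0, 1) for _ in range(1000)]

clean = noisy_channel(sent, 0.0, rng)     # no noise: nothing is changed
noisy = noisy_channel(sent, 0.1, rng)     # roughly one bit in ten is flipped
inverted = noisy_channel(sent, 1.0, rng)  # every bit is flipped
```

Calling count_errors(sent, noisy) then gives the number of symbols for which what was received differs from what was transmitted.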

Sources may be discrete or non-discrete. A discrete source generates ``the message, symbol by symbol. It will choose successive symbols according to certain probabilities depending, in general, on preceding choices as well as the particular symbols in question" [SW49]. Coding takes place at the transmitter. The source of the message does not transmit the message; the coded form of the message is what leaves the transmitting process and moves to the receiving process. The representation of the original message moves to the next process that transforms it, with the process continuing.
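
A discrete source of the kind quoted above, in which each choice depends on the preceding one, might be sketched as follows. The two-symbol alphabet and the transition probabilities are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical table: TRANSITIONS[previous][following] is the probability
# of emitting `following` immediately after `previous`.
TRANSITIONS = {
    "A": {"A": 0.1, "B": 0.9},
    "B": {"A": 0.5, "B": 0.5},
}


def generate_message(length, start, rng):
    """Emit a message symbol by symbol; each choice depends on the last symbol."""
    if length <= 0:
        return ""
    message = [start]
    while len(message) < length:
        options = TRANSITIONS[message[-1]]
        symbols = list(options)
        weights = [options[s] for s in symbols]
        message.append(rng.choices(symbols, weights=weights)[0])
    return "".join(message)


message = generate_message(20, "A", random.Random(1))
```

Because an A is usually followed by a B in this table, long runs of A are rare in the generated messages, which is the kind of statistical structure the source's probabilities are meant to capture.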

Between the source and the channel, the data being transmitted must be encoded, that is, it is represented in some form that can be transmitted by the medium supporting the channel. Transmitting data inherently requires that a change of medium take place, as the information moves from the source to the transmitter to the channel. When a signal moves from one medium to another, it must be physically represented somewhat differently, making an encoder necessary.
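
As a minimal sketch of encoding and decoding, the following Python fragment uses a small subset of International Morse code as the code. The function names and the choice of a space as the separator between letters are our assumptions, and the channel is assumed to be noiseless.

```python
# A small subset of International Morse code, enough for the example.
MORSE = {"C": "-.-.", "O": "---", "D": "-..", "E": ".", "S": "..."}
DECODE = {signal: letter for letter, signal in MORSE.items()}


def encode(message: str) -> str:
    """Transmitter: represent each letter as dots and dashes, space separated."""
    return " ".join(MORSE[letter] for letter in message)


def decode(signal: str) -> str:
    """Receiver: map each group of dots and dashes back to its letter."""
    return "".join(DECODE[group] for group in signal.split())


source_message = "CODES"
transmitted = encode(source_message)  # the coded form that enters the channel
delivered = decode(transmitted)       # what the destination is given
```

Only the coded form in transmitted passes through the channel; the destination recovers the original message because the receiver applies the inverse of the transmitter's mapping.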

Given a source producing symbols at a rate consistent with a set of probabilities governing their frequency of occurrence, Shannon asks ``how much information is `produced' by such a process, or better, at what rate information is produced?" For Shannon, the amount of self-information that is contained in or associated with a message being transmitted, when the probability of its transmission is p, is the logarithm of the inverse of the probability, or $I=\log (1/p)$ [Los90,TS95].
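
The self-information formula can be computed directly. In the sketch below the function names are ours and base 2 is assumed, so that results are in bits; the second function, the probability-weighted average of self-information (Shannon's entropy), is included as one way of answering the question about the rate at which a source produces information.

```python
import math


def self_information(p: float) -> float:
    """Bits of information in a message transmitted with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be greater than 0 and at most 1")
    return math.log2(1.0 / p)


def average_information(probabilities) -> float:
    """Probability-weighted mean self-information, in bits per symbol."""
    return sum(p * self_information(p) for p in probabilities if p > 0)


certain = self_information(1.0)  # a message that is certain carries 0 bits
coin = self_information(0.5)     # 1 bit
rare = self_information(1 / 8)   # 3 bits
skewed_source = average_information([0.5, 0.25, 0.25])  # 1.5 bits per symbol
```

Less probable messages carry more information, and a message that is certain to be sent carries none.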

The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey. A device with two stable positions . . . can store one bit of information. N such devices can store N bits, since the total number of possible states is $2^N$ and $\log_2 2^N = N$ [SW49].

The amount of information in the output of a process increases with the number of different values that the process might return. Given n different, equally likely output values, the amount of information (I) may be computed as $I=\log_2 n$. The amount of information in the output of a process is related to the amount of information that is available about the input to the process combined with the information provided by the process itself. It is not just the amount of information about the input, although if the process always reproduces the input exactly at the output, there would be no difference in the amount of information present at the input to the process and at the output of the process. The input to the process is itself the output of some earlier process, about which it provides information, and the amount of that information is measurable in terms of the earlier process.
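
The relationship between a process and the information in its output can be sketched by counting the distinct values a function returns. The helper name and the three example processes are hypothetical, and the sketch assumes that the distinct output values are equally likely.

```python
import math


def output_information(process, inputs) -> float:
    """log2 of the number of distinct values the process returns for the inputs."""
    distinct_outputs = {process(x) for x in inputs}
    return math.log2(len(distinct_outputs))


inputs = range(8)  # 8 possible input values, or 3 bits at the input

identity_bits = output_information(lambda x: x, inputs)    # reproduces the input
parity_bits = output_information(lambda x: x % 2, inputs)  # keeps only odd or even
constant_bits = output_information(lambda x: 0, inputs)    # always the same value
```

A process that reproduces its input exactly leaves all 3 bits in the output, one that reports only whether the input is odd or even leaves 1 bit, and one that always returns the same value leaves none.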

The model for information transmission proposed by Shannon has been heavily abused by scholars who have applied the theory in domains distant from the electrical communication environment in which it was developed. By this, we mean that it has been frequently used to characterize situations that do not meet the assumptions and constraints of the model as proposed by Shannon. Ordering food at a restaurant might be modeled as a channel-based process. The thoughts concerning food preference might be seen as the source, the vocalized order comes from the transmitting mouth, the waiter's ear is the receiver, and the chef is the destination. There is a symmetrical nature to the Shannon model that is missing from this example but, nevertheless, using the Shannon model may help an individual studying restaurant operations to elucidate aspects of the operation that they hadn't considered before. For example, use of this model may suggest that noise affecting the channel might be examined. Using care in the choice of codes (names for food) might help decrease the error rate in recording customer orders.

A channel requires spatial or temporal distance between the sender and the receiver. Energy is necessary to transmit the message from the sender to the receiver. For Shannon, a channel is defined by a set of conditional probabilities that a certain message is received given what was transmitted. In cases where there is no noise, the conditional probability that a message is received given that it was transmitted is 1, so the probability that a message is received is simply the probability that it was transmitted. In noisy environments, what is transmitted is not always what is received.
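
A channel described by conditional probabilities can be sketched as a table giving, for each transmitted symbol, the probability of each received symbol. The binary alphabet, the input probabilities, and the noise level below are hypothetical values chosen for illustration.

```python
def received_distribution(input_probs, channel):
    """P(received = r) as the sum over t of P(sent = t) * P(received = r | sent = t)."""
    outputs = {}
    for sent, p_sent in input_probs.items():
        for received, p_cond in channel[sent].items():
            outputs[received] = outputs.get(received, 0.0) + p_sent * p_cond
    return outputs


input_probs = {"0": 0.75, "1": 0.25}

# Noiseless channel: each symbol is received as sent with probability 1.
noiseless = {"0": {"0": 1.0, "1": 0.0}, "1": {"0": 0.0, "1": 1.0}}

# Hypothetical noisy channel: each symbol is corrupted with probability 0.25.
noisy = {"0": {"0": 0.75, "1": 0.25}, "1": {"0": 0.25, "1": 0.75}}

clean_out = received_distribution(input_probs, noiseless)
noisy_out = received_distribution(input_probs, noisy)
```

With the noiseless table the distribution of received symbols matches the distribution of transmitted symbols, while with the noisy table it shifts toward the less frequent symbol, since some transmitted symbols are received as the other one.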

The hierarchical model is, in some senses, a generalization of Shannon's model of a communication system. Both allow ``information" to be encoded, ``transmitted," and then decoded. Both provide a channel through which values may be passed. They differ in several respects. Shannon describes a communication system, while the more general hierarchical model can encompass communication, observation, and, in fact, any process that can produce a change in the universe. A communication channel is treated here as a process taking the input to a function in one hierarchy and producing output in another, providing information similar to what is communicated by a Shannonesque channel. Observation represents the presentation of information at a level in the hierarchy from a level in a second hierarchy, with the second level and hierarchy often being different from the first. The hierarchical model is broader than Shannon's model while retaining the ability to describe any particular communication system that Shannon's model can describe.

The communication model popularized by Shannon and Weaver may be understood in functional terms. Each channel may be understood as one function processing the input to produce characteristics in the output that take on values related to the input. Unlike Shannon's model, the hierarchical model provides a definition of information in the system, in addition to measuring the information. The hierarchical model also makes explicit the large number of individual processes that participate in a Shannon channel. This hierarchical model may also be applied to more abstract notions, including perception, observation, belief, and knowledge. Use of a general model of information such as this allows scholars across disciplines to share ideas and use words with the same meaning to describe information phenomena found across the academic spectrum.

Bob Losee