Information Theory and Applications

InformationPractice.org

Bob Losee, losee@unc.edu

U. of North Carolina, Chapel Hill

Under Construction

First, press the Practice Entering Data button that is several lines below.
Then enter two numbers into the boxes near the top and
watch (near the bottom of the box) as the sum is recalculated each time a new number is
entered into the "Enter one number" or "Enter Second Number" boxes and the Enter key is pressed.
Note that there are other times when a number may be entered to replace a number in a bracketed list of numbers,
where the list might look like [1, 2, 3, 4].
Note that it is often necessary to hit *Enter* after entering a new number in a cell.

You might replace something such as textual material between quotes in a list, such as ['Alice', 'George', 'Robin'] with another name. You may add or delete items in a list by making sure that there is one and only one comma between items in the list.

Note that all logarithms below are given to base *2*, so that the amount of information is in
*bits*, binary digits.
The amount of information in bits can be computed as the logarithm, to any base, of the number of interest divided by the logarithm, to
that same base, of the number *2.*
Thus the amount of information associated with a probability of *0.5* is *-log(0.5)/log(2) = 1* bit.
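This base-conversion identity is easy to check in a few lines of Python (a sketch using only the standard library; the function name is mine):

```python
import math

# Any logarithm base works: dividing by log(2) taken in the same base
# converts the result to base 2, that is, to bits.
def log2_via_any_base(x):
    return math.log(x) / math.log(2)

print(log2_via_any_base(8))    # ≈ 3.0
print(log2_via_any_base(0.5))  # ≈ -1.0
```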

Shannon suggested that the amount of information in learning about the occurrence of a single event is inversely related to the probability of the event.
More formally, the amount of information *I(x)* associated with event *x* with probability *p(x)* is

*I(x) = -log2(p(x))*

when the amount of information is desired in bits. The amount of self-information for the probability of
any event may be computed with the calculator below.

A calculator below begins execution when the button is pressed.
You may enter either a probability directly or a fraction, giving the numerator and the denominator
for a set of data, such as *2* for the numerator and *5* for the denominator, producing a
probability of *0.4* and the corresponding amount of information.
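A sketch of what such a calculator computes, in Python (the `self_information` helper name is mine, not part of the page):

```python
import math

def self_information(numerator, denominator):
    """Return -log2(p) in bits, where p = numerator / denominator."""
    p = numerator / denominator
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(p)

print(self_information(2, 5))  # p = 0.4, about 1.32 bits
```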

Note that the subscripts of *2* are often omitted from the logarithms below and in the
information theoretic literature.

**Exercise -- Self Information 1.**
Enter the number of cards that you are carrying with you in your
wallet, purse, or other card carrying device into the denominator of the self-Information
exercise box below.
Enter the number of cards that are credit or debit cards into the numerator box.
How many bits of
information are there in knowing that a single card is a credit or debit card or other card used for
paying, as opposed to other forms of cards, such as a license or a business card?
Remember that probabilities of *0* and *1* are technically impossible here: an event with probability *1* carries no information, and self-information is undefined for a probability of *0*.

**Exercise -- Self Information 2.**
Take a coin, toss it ten times, and enter *10* into the denominator box above.
As you are tossing the coin, note how many heads occur. Enter the number of heads
into the numerator box and then hit Enter.
How much information was there in learning that tossing the coin ten times produced this many heads?

An average may be computed by weighting each possible value by its probability of occurrence, so that the following is computed:

*E[X] = Σ p(x) · x*

This method of computing the average is used in computing the values below when determining the expected value, the predicted average for the future.

Entropy is the average information received over the range of values that may occur for the variable in question.
The entropy *H(X)* for random variable *X* is
computed as

*H(X) = -Σ p(x) log2(p(x))*

where the sum is taken over the possible values *x*. The calculator below computes
the entropy of a random variable, given a list of its observed values.

Try inserting some different numbers in the list in the calculator above to compute the entropy with different data.
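The computation that such an entropy calculator performs can be sketched in a few lines of Python (a sketch only; the `entropy` helper name is mine, and probabilities are estimated from the frequencies in the list):

```python
import math
from collections import Counter

def entropy(values):
    """Entropy H(X) in bits, with p(x) estimated from frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

print(entropy([0, 0, 0, 1, 1, 1]))  # 1.0: both values equally likely
```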

The entropy for a binary variable with a given probability peaks at the probability of *0.5.*
To see the entropy for a range of values, press on the Entropy Graph button below.

We can examine the practical application of entropy with a dataset: [0, 0, 0, 1, 1, 1]. One can execute this by pressing on the Entropy Exercise button below.

**Exercise -- Entropy 1.**
Try the above with all zeroes, that is, enter the list [0, 0, 0, 0, 0, 0]
above (or one can do the same thing with all ones; the result will be the same).
This measures the uncertainty in the random variable.
Note that when there are equal numbers of each variable, the number of bits of entropy represents how
many bits it would take to encode the different options, such as *1* bit indicating that two
values could be encoded, zero and one.

**Exercise -- Entropy 2.**
Enter the following lists and calculate the entropy for each of these three lists:
[0, 0, 0, 0, 0, 1], [1, 1, 1, 1, 1, 0], and [1, 1, 1, 0, 0, 0]. Which list produces the
highest entropy?
Is there a general rule for which entropies have the highest values?

**Exercise -- Entropy 3.**
Use the Entropy Calculator above.
Make a list of values within square brackets, with each value being separated from the next by a comma.
Enter each of the following lists into the calculator above:
[0, 0], [1, 1], [0, 1], and [1, 0].
What is the entropy for each list?
Which list produces the lowest entropy?
Why does this list of data have the lowest uncertainty?

The joint entropy of two random variables or datasets is computed by treating the items in both datasets as pairs of items. By examining the following example (press the Joint Entropy button below), one can see the joint entropy when we have two random variables and thus two sets of data, such as [0, 0, 0, 0, 0, 0] and [0, 0, 0, 1, 1, 1]. This is the entropy of the two random variables combined, with each random variable being associated with one of the lists.
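Treating the two lists as a single list of pairs makes the computation short in Python (a sketch; the helper names are mine):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def joint_entropy(xs, ys):
    """H(X, Y): the entropy of the paired values."""
    return entropy(list(zip(xs, ys)))

print(joint_entropy([0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1]))  # 1.0
```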

**Exercise -- Joint Entropy 1.**
By using the above calculator,
what is the joint entropy of two lists: [0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0]?
How would one obtain higher or lower joint entropy?

**Exercise -- Joint Entropy 2.**
By using the above calculator,
what is the joint entropy of two lists with nothing in common between the two lists, such as [0,1]
and [1,0]?
How would one obtain higher or lower joint entropy?

The conditional entropy *H(X|Y)*
is the entropy or uncertainty of one random variable *X* given a second random variable *Y*.
This view of conditional entropy is useful when examining any cause and
effect relationship, such as a communication channel,
where we are concerned with the amount of entropy at the output,
given the entropy at the input; even a simple linkage
can be modeled as a channel.
The vertical line in
*H(X|Y),*
"|",
can be read as the English language term *given.*
The conditional entropy may be computed as the joint entropy of the two random variables or two
datasets minus the entropy of the second random variable or dataset.
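That identity, H(X|Y) = H(X,Y) - H(Y), can be sketched directly in Python (the helper names are mine):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)

print(conditional_entropy([1, 1], [0, 1]))  # 0.0
```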

**Uncertainty in a Random Variable.**
Equivocation is another term for conditional entropy,
emphasizing the amount of information needed (or randomness that exists)
about one random variable when one knows a second random variable.
Equivocation may be viewed as a measure of the level of ambiguity (uncertainty)
in the first random variable, given the second random variable.

**Information Loss.**
Conditional entropy or equivocation can be treated as a measure of information loss within any cause and
effect situation, such as a channel.
This alternate model of equivocation
is found by
considering a channel that moves from *X* to *Y*: *H(X|Y)*
is the amount of information about *X* that does not make it through the channel to *Y*.
This is an information loss between the cause *X* and the effect *Y*.

Consider two lists of data: [*1, 1*] and [*0, 1*].
The conditional entropy of the first given the second is provided in the example below (Press the
Conditional Entropy or Equivocation button below):

The uncertainty at a receiver that receives [*1, 1*], given that one knows [*0, 1*] is transmitted, is
*0* bits.
Consider the situation where either a *0* or a *1* is transmitted, each with probability
*1/2.*
Receiving [*1, 1*] shows no uncertainty about what is received because only *1*s are received.

Now try entering the following data into the calculator above.
Receiving [*0, 1*] when a [*0, 1*] is transmitted,
leaves no uncertainty at the receiver about
what was transmitted.

When a [*0, 1*] is received and a [*1, 1*] is transmitted, what is transmitted tells us nothing about
what is received.
The transmission provides no information, and thus the uncertainty at the receiver is what happens
when there may be a *0* or a *1*: a single bit.

**Dependencies between Cause and Effect.**
Consider again where one receives [*0, 1*] when [*1, 0*] is sent.
Knowing what was sent could allow one to perfectly predict what is received,
thus there would be zero uncertainty at the receiver.
One would similarly have a perfect dependence if one received [*0, 1*]
when [*0, 1*] was transmitted and there was no uncertainty at the output.
There is complete dependence between cause and effect given these examples.

**Independence between Cause and Effect.**
Consider again where one receives [*0, 1*] when [*1, 1*] is sent.
Knowing what was sent provides no information about what is received,
thus producing complete (*1* bit) uncertainty at the receiver.
This exhibits complete independence between the cause,
what was transmitted, and the effect, what was received.

**Information Loss and Noise.**
Given the independence between cause and effect,
one can see the information that is lost in the cause and effect process or in the channel as
information moves between transmitter and receiver.
When dependence holds between input and output and one knows precisely what the output will be,
given the input, there is no loss of information in the channel.
This produces zero uncertainty at the receiver or effect side of the channel.

The information lost may also be viewed as noise introduced in the channel.
When the effect is fully dependent on the cause, no noise enters the channel.
On the other hand, when the effect is independent of the cause and
the conditional entropy of the cause given the effect is *1* bit,
this may be interpreted as the amount of noise introduced in the channel
between the cause and the effect.

**Exercise -- Equivocation 2.**
How can equivocation be used to measure the randomness quality of a supposedly random sequence of ones and zeroes?
What would be the weakness of this approach?

The mutual information between two random variables is
the amount of information that one provides about the other in a symmetrical way.
Sometimes the information that two random variables provide
about each other is referred to as average mutual information,
but we follow the more common tradition of referring to this as mutual information.
Pressing on the Mutual Information Measure button below,
one can see how much information the values [*0, 0*] and the
[*0, 1*] provide about each other.

In this case, there is no mutual information between the two random variables or datasets.
When one uses two identical lists, [0, 1] and [0, 1], there is one bit of mutual information between
the two variables.
Note that the value for the mutual information measure may go far beyond *1* when a larger number of features is being analyzed. Thus,
one needs to be cautious when looking at the raw value for the mutual information measure if comparing this value with a traditional correlation measure.
Those wishing to have a mutual information measure that remains in the range from *0* to *1* may wish to use one of the
normalized mutual information measures described on the internet.
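One standard identity, I(X; Y) = H(X) + H(Y) - H(X, Y), makes the measure easy to sketch in Python (helper names are mine):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

print(mutual_information([0, 1], [0, 1]))  # 1.0: identical lists
print(mutual_information([0, 0], [0, 1]))  # 0.0: no shared information
```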

**Exercise -- Mutual Information 1.**
When is the mutual information value between two variables maximized?
How can this be interpreted in terms of dependence or independence between the two different variables?

**Exercise -- Mutual Information 2.**
When is the mutual information value between two variables minimized?
How can this be interpreted in terms of dependence or independence between the two different variables?

**Exercise -- Mutual Information 3.**
Using the calculator above for mutual information and the earlier calculator for conditional entropy,
when does one obtain high values for the Mutual Information and for Conditional Entropy?
When are the values for both minimized?

As an example, imagine talking to a friend through cellular telephones. If you speak, you are the source, and your voice moves a few centimeters to the microphone in the cellular phone. The phone now acts as the transmitter, moving the information through to the cellular phone on the other end, with information being processed by the telephone carrier, who may digitize the signal, combine it with other people's signals, or modify the signal's frequencies. When the message arrives at the receiving cellular telephone, this telephone, acting as the receiver, produces the message by vibrating the speaker. This message then goes to the destination, the destination's ears.

A channel may carry information, with the information, in turn, carrying information from another channel. For example, when I speak, the ideas in my head are encoded into English language statements. These ideas may be transmitted to a listener who also understands English. The statements are encoded by a speech part of my brain, taking the ideas and phrasing them into English, and the listener takes the statements and decodes them into ideas. The English statements may be encoded into speech sounds and are transmitted using the vocal cords in my throat. These sound waves are decoded through the listener's eardrum, cochlea, and attached nerves, which allow a signal to move to the listener's English language processing component of the brain. Note that for the speaker there is a channel, or conceptual channel, connecting the speaker's English language processor to the listener's English language processor, and this channel communicates through use of the channel from the speaker's vocal cords to the listener's ear.

This set of channels serves as a hierarchy of channels, with one layer on the transmitting side sending information to the comparable level on the receiving side, often by using the layer below it, which may, in turn, communicate through the layer below it, and so on until the bottom or most physical communicating layer is reached.

The messages moving through both the source and the destination of a channel have a given amount of uncertainty. The source of the information in the channel has some uncertainty about it, being what may be considered a random variable. The uncertainty in this variable may be because of the different types of messages entering and leaving the communication system. For example, a message may consist of a computer machine language program that would consist of a set of zeroes and ones. A text message in American English would have 26 uppercase and 26 lowercase letters, with no accents, providing much more uncertainty with each character in the message than with a one or a zero in binary machine language code. Entropy exists at both the source and transmitter, as well as at the output end of the channel, at the receiver and the destination.

The bandwidth of a communication channel represents the frequency range needed for a signal traversing the channel. Bandwidth is usually measured in Hertz, the frequency measure that replaced the previously used "cycles per second." The highest frequency of the signal placed in the channel might be thousands of Hertz if the signal is produced by the human voice, whereas a video signal may be measured in millions of Hertz. Bandwidth should not be taken as a firm wall, where everything below that frequency makes it through the channel and everything above that frequency is cut off. Instead, the strength of a signal decreases gradually around the bandwidth frequency. Bandwidth is commonly measured to the frequency where the signal has decreased by 3 decibels, which is half the power level of the strongest frequencies that are passed through the channel.

Walking into a busy office results in hearing many different sounds, at different pitches, from telephones ringing, people talking, equipment noises, and other sounds. Each sound occurs at one or more pitches and each works within its own frequency range. Some sounds are in a large range of frequencies, while others may be nearly constant at unique frequencies. Commercial radio stations operate with government licenses for a particular base frequency and a frequency range. Essentially, each station has its own bandwidth and base frequency that are designed to not interfere with other local stations, which have their own base frequencies and probably similar bandwidths. Other stations at dozens or hundreds of kilometers may have the same frequencies, and if one listens to a frequency when geographically halfway between two radio stations, one can often hear the interference. Receivers close to the transmitter will listen to one bandwidth section or a second or a third or so forth, with analog and digital filtering determining the characteristics of the received bandwidth that are passed through to the speaker.

The capacity of a channel is the maximum mutual information between the transmitter and the receiver. It is the greatest amount of information that can be moved between transmitter and receiver per unit of time.

Commercial analog FM stations in many locations require a channel capacity that is several times the bandwidth of the audio signal being transmitted. Commercial analog AM stations, on the other hand, often have a channel capacity that is near the bandwidth of the underlying audio signal. FM stations operate by deviating the carrier frequency by several times the audio frequency, making the overall needed channel capacity several times the bandwidth of the audio signal.

**Exercise -- Channel Capacity 1.**

- What is the channel capacity when one binary string has the opposite value from the other, thus X: 0, 1, 0, 1 and Y: 1, 0, 1, 0, and these are all that is ever transmitted?
- What is the channel capacity when one binary string has values unassociated from the other, thus X: 0, 1, 0, 1 and Y: 0, 1, 1, 0, and these are all that is ever transmitted?
- What would be the characteristics of two lists of binary values such that the channel capacity is maximized?

The rate at which information travels through a channel is determined by the entropy of what is transmitted, the noise in the channel, and the channel equivocation. The rate at which information moves through a channel is clearly limited by the channel capacity, with the information rate being at its maximum when the transmitter's entropy is the same as the channel capacity.

The equivocation in a channel represents the ambiguity or loss in the informative signal as it moves from the input to the output. Given the transmitted rate as the entropy measured as bits per unit of time, the information that actually makes it through the channel is what is not lost through equivocation. Thus, one may compute the information rate as the transmitted entropy minus the equivocation. Press the Information Rate Calculator button below to examine the calculation of the information rate given the input [1,0,1,0,1,0,1,0,1,0] and the output [1,0,1,0,0,1,1,0,1,0].

The information rate here is dependent upon the entropy presented at the input and the information lost moving from the input to the output, and it provides the amount of correct information at the output, given the input and the information loss. This may be viewed in terms of uncertainty, with the amount of uncertainty at the output being the uncertainty at the input minus the loss of uncertainty at the output, given the input. When the uncertainty at the input increases, as when the vocabulary becomes more complex, one would expect the uncertainty and the vocabulary complexity at the output to similarly increase. The information rate at the output increases as the equivocation, the conditional entropy of the output given the input, decreases, given a constant uncertainty at the input.
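Following the formulation above (transmitted entropy minus the equivocation, taken here as H(input | output)), the example lists can be worked through in Python; all helper names are mine:

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

transmitted = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
received    = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Equivocation: H(transmitted | received) = H(joint) - H(received).
equivocation = entropy(list(zip(transmitted, received))) - entropy(received)

# Information rate in bits per symbol: input entropy minus the loss.
rate = entropy(transmitted) - equivocation
print(round(rate, 3))  # 0.278
```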

**Exercise -- Information Rate 1.**

The Kullback-Leibler (KL) Divergence is used to compare the difference between two probability distributions, *X* and *Y*,
and to ask what the consequences are when one is used in place of the other.
The KL value may be understood as the information gained in *KL(X, Y)* when *Y* is treated as the true distribution but then the distribution is revised to be *X*.
It may also be viewed as the information lost in *KL(X, Y)* when *Y* is used as an approximation of *X*.
This occurs when *X* is the correct or true distribution but *Y* is used instead, to simplify the situation or because *Y* is being used erroneously.
In this situation, the KL divergence can be understood as measuring the loss associated with using an inaccurate distribution instead of the accurate, correct
distribution.
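A sketch of the computation in Python, taking KL(X, Y) as the sum of p_X(i) · log2(p_X(i) / p_Y(i)); the function name and the example distributions are mine:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits; p and q are matching probability distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [0.5, 0.5]   # the true distribution of a fair coin
biased = [0.9, 0.1]   # the approximation being used instead

print(round(kl_divergence(fair, biased), 3))  # 0.737
print(kl_divergence(fair, fair))              # 0.0: no loss
```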

**Exercise -- Kullback Leibler Divergence 1.**

Error detection allows the recipient or reader of a message to determine whether an error has occurred while the message was being processed or stored. A typographic error is often recognized by readers with strong spelling abilities and the correct word can be substituted for the incorrectly spelled word. Sometimes a word may be garbled so badly that the reader cannot determine which word was originally intended by the writer.

Error detection supports the user determining that an error has, in fact, occurred. In some circumstances, error correction can occur, when the user can determine what was originally intended by the producer of the information representation.

Note that when working with binary data, there are two possible values for each position in the representation.
When one can determine that a specific bit is erroneous, then the value of the bit should be inverted, so that a zero becomes a one or a one becomes a zero.
Error correction requires, for binary data, that the *location* of the erroneous bit is determined.
Error detection, on the other hand, implies that it is known that an error occurred but not necessarily with knowledge of which bit is erroneous.

One form of error detection is *parity*.
Parity consists of a set of original data with an added bit, referred to as the parity bit, which will be set to one or zero so that the number of ones or zeroes in the
representation, with the parity bit, is an even or an odd number, depending on the particular system used.
If one wants an even number of ones, the representation [1,0,1,0] would have the parity bit set to 0, producing an even number of ones for the four data bits, 1,0,1,0 and the
parity bit, 0.
If the representation was [1,0, 1, 1] then the parity bit would be set to 1 so that the number of 1s was an even number.
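Computing an even-parity bit is a one-liner in Python (a minimal sketch; the function name is mine):

```python
def even_parity_bit(data_bits):
    """Parity bit that makes the total count of ones even."""
    return sum(data_bits) % 2

print(even_parity_bit([1, 0, 1, 0]))  # 0: already an even number of ones
print(even_parity_bit([1, 0, 1, 1]))  # 1: a fourth one makes the count even
```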

When using a single parity bit, one can detect all single bit errors that occur. This is a guarantee; all single bit errors can be detected with a single parity bit being used. However, if there were two bits that changed, such as [1,0,1,1] with parity of 1 changing to the representation of [1,0,0,0] with a parity bit of 1, the system cannot detect that the error has occurred when the data is read. If one desires a guarantee of detecting all two bit errors, it is necessary to have more than one parity bit.

A second approach to determining errors is to add redundancy, such as repeating the original data.

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
---|---|---|---|---|---|---|---|---|---
0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1

This table can be viewed as counting 00, 01, 10, 11 in binary, where each of the four values is represented so that each bit is repeated four additional times, with 5 bits representing each original information bit.

When any 4 bits change in the above table, it is obvious that the representation has changed and that something is wrong. If 5 bits change, however, it may be impossible to always determine that a change has occurred. Thus we have a guarantee of 4 bit error detection.
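The five-fold repetition scheme and its detection guarantee can be sketched in Python (a sketch; function names are mine):

```python
def encode(bits, repeats=5):
    """Repeat each information bit `repeats` times."""
    return [b for bit in bits for b in [bit] * repeats]

def error_detected(codeword, repeats=5):
    """An error shows up whenever a block of repeats disagrees internally."""
    return any(len(set(codeword[i:i + repeats])) > 1
               for i in range(0, len(codeword), repeats))

word = encode([0, 1])          # [0,0,0,0,0,1,1,1,1,1]
word[0] ^= 1                   # flip up to 4 bits of one block...
word[1] ^= 1
word[2] ^= 1
word[3] ^= 1
print(error_detected(word))    # True: the block still disagrees

word[4] ^= 1                   # ...but a 5th flip forms a valid codeword
print(error_detected(word))    # False: the change is now undetectable
```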

Determining the location for an error will allow us to correct the error. Consider a table where the rightmost column represents the parity for data on that row, excepting the parity bit, and the bottom row represents the parity for that column. If the bit in row 2, column 3 inverts its value, then the parity bit for row 2 will be erroneous and the parity bit for column 3 will be erroneous. Given these two parity errors occurring, one can use the row number (2) and the column number (3) to locate the error and thus correct it, meaning that we have single bit error correction. By adding parity bits beyond one dimension, where a set of bits has a single parity bit added, more error correction often will be added.
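This row-and-column parity scheme can be sketched in Python; the grid, the parity lists, and the function name are all mine, for illustration:

```python
def locate_error(grid, row_parity, col_parity):
    """Locate a single flipped bit using stored row and column parities."""
    bad_rows = [r for r, row in enumerate(grid) if sum(row) % 2 != row_parity[r]]
    bad_cols = [c for c in range(len(grid[0]))
                if sum(row[c] for row in grid) % 2 != col_parity[c]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    return None  # no single-bit error found

original = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
row_parity = [sum(row) % 2 for row in original]
col_parity = [sum(row[c] for row in original) % 2 for c in range(3)]

corrupted = [row[:] for row in original]
corrupted[1][2] ^= 1          # flip the bit in row 1, column 2
print(locate_error(corrupted, row_parity, col_parity))  # (1, 2)
```

Inverting the located bit then corrects the error, matching the single bit error correction described above.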

The quality of a coding system is often referred to as *n*-bit error detection and *m*-bit error correction,
where we can correct fewer errors than we can detect.

Name ID | Address | Undergrad Major
---|---|---
Al | 123 Shady Lane | Accounting
Bob | 111 Hilltop Circle | Accounting
Caitlyn | 111 Hilltop Circle | Library Science
Debra | 230 Westbury Drive | Computer Science

This table contains data with four tuples, or rows, each one describing
a relationship between three attributes:
a personal name, a home address, and the department in which they are taking their undergraduate
major.
The entropy in each column is
the average information in receiving an attribute.
When each row or relational tuple in a table has the same set of possible attributes, that is, each has the same
number of columns, the database is said to be in *first normal form*.
Placing relational databases into higher numbered normal forms may result in more efficient operations
of the database.

Click on the above Attribute Entropy Calculator button. Note that when all the attributes are different, the entropy is at its maximum. If there are 4 different names, then the entropy will be 2 bits. If one were to write the first four numbers in binary, they would be 00, 01, 10, and 11. This is the set of all numbers possible using 2 bits.

**Exercise -- Relational Databases 1.**
Using the above calculator, what is the entropy of the addresses if all 4 students live at the same
address?

**Exercise -- Relational Databases 2.**
If Caitlyn changes her name to Debra and Bob changes his name to Al,
what is the entropy for the Name ID attribute?

Attributes in a relation are often dependent on other attributes in the relation. Tony Lee has suggested several problems as illustrative of how information theory may be applied to databases and to their decomposition by noting functional dependence between attributes. All are in Lee's classic articles (1987), but he, in turn, used some files from other sources. Note that when there is a disagreement between a table below and Lee's article, this is because the original source of the data, the article from which Lee took the data, was used in preference to the typesetting for Lee's article.

Student | Course | Teacher
---|---|---
Smith | Math | Prof. White
Smith | Physics | Prof. Green
Jones | Math | Prof. White
Jones | Physics | Prof. Brown

Lee provides the above table, which we may use to examine basic relationships between attributes. This set of relations may be decomposed so that the relationship between courses and teachers is separated from the fact that a particular student is taking a particular course. We may extract the following sub-table:

Course | Teacher
---|---
Math | Prof. White
Physics | Prof. Green
Physics | Prof. Brown

Note that the course is dependent on who is the teacher. When the course attribute is removed from the first relation, the remaining relation need only contain the student and the teacher.

When *X* completely determines *Y*, denoted as *X -> Y*, then *H(X) = H(X,Y)*. This is referred to as a functional dependence. The relationship *Teacher -> Course* is a functional dependence that Course has on Teacher. This occurs because the value of the variable Course is completely dependent on the value of the variable Teacher. Thus, the variable Teacher contains all the information that might exist between Teacher and another variable which it completely controls, such as Course.
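This entropy test for functional dependence can be checked in Python against the table above (the `entropy` helper and the list layout are mine):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

teacher = ['Prof. White', 'Prof. Green', 'Prof. White', 'Prof. Brown']
course  = ['Math', 'Physics', 'Math', 'Physics']

# Teacher -> Course holds exactly when knowing the pair adds nothing
# beyond knowing the teacher: H(Teacher) == H(Teacher, Course).
h_teacher = entropy(teacher)
h_joint = entropy(list(zip(teacher, course)))
print(math.isclose(h_teacher, h_joint))  # True: Course depends on Teacher
```

The reverse dependence fails: H(Course) is 1 bit while the joint entropy is 1.5 bits, so Course does not determine Teacher.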

Consider, for example, the names of readers of this material as a key with the binary attribute HumanYesOrNo. In this case the names will vary widely, with the key thus having a relatively high entropy value, while the HumanYesOrNo variable will likely have a low entropy, with all or almost all readers being human. In a different case, a family's listing of members' names and each item of clothing that they own will have relatively few family members' names but a much larger number of items of clothing. With the names as the key, the entropy of the key will be much lower than the entropy for the clothing attribute.

**Exercise -- Functional Dependencies 1**.
Pressing the Conditional Entropy Calculator button below provides data along with a set of
conditional entropy values for the relationships between the data attributes.
What do these values suggest about the conditional entropy relationships between the attributes
above?

**Exercise -- Functional Dependencies 2**.
Using the calculator below, what is the entropy of the addresses if all students live at the same
address?

**Exercise -- Functional Dependencies 3**.
When computing the entropy or conditional entropy, for example, enter into the
Entropy and Conditional Entropy Calculator list
of the values for the attributes in single quotes as a list,
such as ['A', 'B', 'C']. You may use code numbers instead of typing longer
strings.
Consider the following table and its attributes (columns):

ID Number | First Name | Street Address | Building Color |
---|---|---|---|

1 | Al | 123 Main St. | Blue |

2 | Bob | 123 Main St. | Blue |

3 | Carol | 456 Information Ave. | Green |

4 | Danielle | 456 Information Ave. | Green |

5 | Edie | 789 Maple Blvd | Blue |

6 | Carol | 120 Sage Lane | Green |

- From the above table, what is the entropy, in bits, for the ID Number, for the First Name, and for the Building Color?
- What is the conditional entropy of the ID Number, given the First Name? Enter each name (and text entries below) in single quotes, e.g., 'Danielle'. Thus, a list of Street Addresses would start like this: ['123 Main St.', '123 Main St.', '456 Information Ave.', ..].
- What is the conditional entropy of the ID Number, given the Street Address?
- What is the conditional entropy of the Street Address, given the ID Number?
- What is the conditional entropy of the Street Address, given the Building Color?
- What is the conditional entropy of the Building Color given the Street Address?
- What can one conclude from the above relationships about keys and non-key values?

**Second and Third Normal Forms.**
When there are no non-key fields that are dependent on only a part of the key, rather than on the whole key, the
relation is said to be in *second normal form*.
For example, if I place an order with an online store, they may issue a shipping order for a
particular part number, a warehouse, and the warehouse electronic address.
The key is the part number and the warehouse, and the warehouse electronic address is dependent on
the warehouse itself, a part of the key, so this relation is not in second normal form.
A database is in *third normal form* when there are no non-key items that are functionally
dependent on other non-key items.
If one non-key item is dependent on another, they may be broken out from the original relation by
producing a second relation composed of these non-key attributes, one of which serves as the key for the other
non-key attributes.

When there are multiple many-to-many relationships within a relation, the many-to-many relationships
need to be removed from the relation so that each relation has only one many-to-many relationship.
A relation with only one many-to-many relationship is in *fourth normal
form* if it is also in third normal form.
Consider where a relation contains peoples' names, the foods they like cooking, and the colors for
shirts they own.
Each name has several foods associated with it, just as each name will have several colors associated
with it.
This relation is placed into fourth normal form when it is broken into two relations, one of peoples'
names and the foods they like cooking, and a second relation, with peoples' names and the colors for
shirts they own.
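This split can be checked mechanically: project the relation onto (name, food) and (name, color), then join the projections on the name and confirm the original rows come back. A minimal Python sketch, using hypothetical rows:

```python
# Hypothetical relation: (name, food liked, shirt color owned).
rows = {
    ('Robin', 'pasta', 'blue'), ('Robin', 'pasta', 'green'),
    ('Robin', 'soup', 'blue'), ('Robin', 'soup', 'green'),
    ('Alice', 'curry', 'red'),
}

# Project onto the two independent many-to-many relationships.
name_food = {(n, f) for n, f, _ in rows}
name_color = {(n, c) for n, _, c in rows}

# Natural join of the projections on the name attribute.
rejoined = {(n, f, c) for n, f in name_food for m, c in name_color if n == m}

print(rejoined == rows)  # True: the decomposition is lossless
```

The join reproduces the original rows exactly because, for each person, foods and colors are independent of one another.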

Consider a more complex example: Employees, the names of their Children, and the Salary they receive in the year denoted as Year.

Employee | Child | Salary | Year |
---|---|---|---|
Hilbert | Hubert | $35K | 1975 |
Hilbert | Hubert | $40K | 1976 |
Gauss | Gwendolyn | $40K | 1975 |
Gauss | Gwendolyn | $50K | 1976 |
Gauss | Greta | $40K | 1975 |
Gauss | Greta | $50K | 1976 |
Pythagoras | Peter | $15K | 1975 |
Pythagoras | Peter | $20K | 1976 |

**Decomposition of a Database.**
One can view the entropy values associated with the above table by pressing the
Multi-Valued Dependency Calculator button above.
The entropy of the full database, that is, the joint entropy of the attributes Employee, Child, Salary, and Year, is
log(8)/log(2) = 3.00 bits.
Assume that the database is split (decomposed) into two smaller databases,
one composed of the Employee and Child attributes and the
other of the Employee, Salary, and Year attributes.
Ideally, no information is lost with this decomposition.
Such a lossless decomposition results in a higher normal form for
the resulting databases,
with fewer insertion and deletion anomalies.
The entropy of Employee and Child is log(4)/log(2) = 2.00 bits.
The joint entropy of the Employee, Salary,
and Year database is 2.5 bits,
while the entropy of Employee, which occurs in both sub-tables and thus must be
counted only once when they are combined to reproduce the Employee, Child, Salary,
and Year database, is 1.5 bits.
We find that 2 + 2.5 - 1.5 = 3 bits, the entropy of the
full table.
If we join the two sub-tables,
the Employee and Child sub-table with the sub-table of Employee, Salary, and
Year, based on matching Employee values,
the resulting table is the original table.
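The arithmetic above can be verified directly from the table. The sketch below computes the empirical entropies of the sub-tables and checks the lossless-decomposition identity H(Employee, Child) + H(Employee, Salary, Year) - H(Employee) = H(full table):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a list."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

rows = [
    ('Hilbert', 'Hubert', '$35K', 1975), ('Hilbert', 'Hubert', '$40K', 1976),
    ('Gauss', 'Gwendolyn', '$40K', 1975), ('Gauss', 'Gwendolyn', '$50K', 1976),
    ('Gauss', 'Greta', '$40K', 1975), ('Gauss', 'Greta', '$50K', 1976),
    ('Pythagoras', 'Peter', '$15K', 1975), ('Pythagoras', 'Peter', '$20K', 1976),
]

h_full = entropy(rows)                               # 3.0 bits
h_ec = entropy([(e, c) for e, c, _, _ in rows])      # 2.0 bits
h_esy = entropy([(e, s, y) for e, _, s, y in rows])  # 2.5 bits
h_e = entropy([e for e, _, _, _ in rows])            # 1.5 bits

print(h_ec + h_esy - h_e)  # 3.0: the split on Employee is lossless
```

All the probabilities here are powers of two, so the identity holds exactly, matching the 2 + 2.5 - 1.5 = 3 bits computed above.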

Further dependencies may be illustrated by noting different ways that groups of attributes may be removed from a table to provide separate tables. Multivalued dependencies are one approach, with decomposition pairs a useful way to view groups of attributes. Given three attributes X, Y, and Z of a database, if X serves as a key, then I(Y;Z|X) = 0: given X (the key), Y and Z carry no information about each other. In general, the conditional mutual information satisfies I(Y;Z|X) >= 0, which is equivalent to H(X,Y,Z) <= H(X,Y) + H(X,Z) - H(X). Equality holds when X is the key for a database composed of the three attributes X, Y, and Z, so that the database can be appropriately decomposed into two databases with X as the key for both; in that case H(X,Y,Z) = H(X,Y) + H(X,Z) - H(X).

In this case, the two databases can be joined back together to produce the original database.
When instead only the strict inequality holds,
H(X,Y,Z) < H(X,Y) + H(X,Z) - H(X),
the decomposed databases together
contain more information, or uncertainty, than the full single database.
If we were to decompose the full database into a different set of databases,
we might find that the two combined subdatabases do not have the same amount of
information as the original database.
The entropy of the full Employee, Child, Salary, and Year database is
log(8)/log(2) = 3.00 bits.
If Salary is instead treated as the key,
we might decompose the full database into subdatabases with Salary as the key.
The entropy of a Salary and Year subdatabase is log(8)/(2 log(2)) +
log(4)/(2 log(2)) = 2.50 bits.
If we add to this the entropy of the Salary, Child, and Employee subdatabase,
log(8)/log(2) = 3.00 bits, and subtract the entropy of Salary
(which occurs in both of the subdatabases),
3 log(8)/(8 log(2)) + log(4)/(4 log(2)) + 3 log(8/3)/(8 log(2)) = 2.16 bits,
one can see that this sum exceeds the entropy of the original full database,
log(8)/log(2) = 3.00 bits: 2.50 + 3.00 - 2.16 = 3.34 > 3.00 bits.
Thus, the Salary attribute is *not* a suitable key upon
which the other attributes in the subdatabases are dependent,
and the sub-tables (Salary and Year) and (Salary, Child, Employee) cannot be
joined on the Salary key to reproduce the original table.
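The failed decomposition can be checked empirically as well. This sketch recomputes the three entropies from the table's rows and confirms that their sum exceeds the entropy of the full database:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a list."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

rows = [
    ('Hilbert', 'Hubert', '$35K', 1975), ('Hilbert', 'Hubert', '$40K', 1976),
    ('Gauss', 'Gwendolyn', '$40K', 1975), ('Gauss', 'Gwendolyn', '$50K', 1976),
    ('Gauss', 'Greta', '$40K', 1975), ('Gauss', 'Greta', '$50K', 1976),
    ('Pythagoras', 'Peter', '$15K', 1975), ('Pythagoras', 'Peter', '$20K', 1976),
]

h_full = entropy(rows)                               # 3.00 bits
h_sy = entropy([(s, y) for _, _, s, y in rows])      # 2.50 bits
h_sce = entropy([(s, c, e) for e, c, s, _ in rows])  # 3.00 bits
h_s = entropy([s for _, _, s, _ in rows])            # about 2.16 bits

print(round(h_sy + h_sce - h_s, 2))  # 3.34, which exceeds 3.00
print(h_sy + h_sce - h_s > h_full)   # True: Salary is not a suitable key
```

The strict inequality is the information-theoretic signature of a lossy decomposition: some information in the full table is not recoverable from the Salary-keyed parts.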

**Exercise -- Multivalued Dependencies 1.**
Provide an example with the above table where changing a few values will allow
the Child to be completely determined by
Employee. What is the consequence of this transformation?

**Pairwise Decomposition**
A set of relationships that are informally independent but can be joined together is referred to as a *join dependency*. It may be helpful to view a
possible hierarchy of relationships by noting that they are acyclic: one does not move from one relationship to a second relationship, to a third relationship, and then back
to the original relationship, which would be a cyclic set of relationships.

Department | Researcher | Project | Computer | Language |
---|---|---|---|---|
CS | Taylor | DB | PDP-11 | C |
CS | Taylor | DB | VAX11 | PL1 |
CS | Taylor | DB | VAX11 | Fortran |
EE | Smith | DB | PDP-11 | C |
EE | Smith | DB | VAX11 | PL1 |
EE | Smith | DB | VAX11 | Fortran |
EE | Smith | Sig Proc | VAX11 | PL1 |
EE | Smith | Sig Proc | VAX11 | Fortran |

By selecting the above button and using single letters of the alphabet to represent the five attributes (A for Department, B for Researcher, C for Project, D for Computer, and E for Language), the entropy of the chain of pairs of variables (B is dependent on A, C on B, D on C, and E on D) may be computed as the joint entropies, minus the single-variable entropies that occur twice in the set of joint entropies:

Entropy of A,B: 3 log(8/3)/(8 log(2)) + 5 log(8/5)/(8 log(2)) = 0.954 +
Entropy of B,C: log(4)/(4 log(2)) + 3 log(8/3)/(4 log(2)) = 1.56 +
Entropy of C,D: log(4)/(2 log(2)) + 1/2 = 1.50 +
Entropy of D,E: log(4)/(4 log(2)) + 3 log(8/3)/(4 log(2)) = 1.56 -
Entropy of B: 3 log(8/3)/(8 log(2)) + 5 log(8/5)/(8 log(2)) = 0.954 -
Entropy of C: log(4)/(4 log(2)) + 3 log(4/3)/(4 log(2)) = 0.811 -
Entropy of D: log(4)/(4 log(2)) + 3 log(4/3)/(4 log(2)) = 0.811

= 3.00 bits, which also equals the joint entropy of the entire database, Entropy of A,B,C,D,E: log(8)/log(2) = 3.00 bits. Thus the pairs described above can be joined together to reproduce the full table.
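The chain calculation can be reproduced from the table with a few lines of Python, computing each pairwise joint entropy from the empirical row frequencies and comparing the result against the entropy of the whole database:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a list."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

# Columns: (A) Department, (B) Researcher, (C) Project, (D) Computer, (E) Language.
rows = [
    ('CS', 'Taylor', 'DB', 'PDP-11', 'C'),
    ('CS', 'Taylor', 'DB', 'VAX11', 'PL1'),
    ('CS', 'Taylor', 'DB', 'VAX11', 'Fortran'),
    ('EE', 'Smith', 'DB', 'PDP-11', 'C'),
    ('EE', 'Smith', 'DB', 'VAX11', 'PL1'),
    ('EE', 'Smith', 'DB', 'VAX11', 'Fortran'),
    ('EE', 'Smith', 'Sig Proc', 'VAX11', 'PL1'),
    ('EE', 'Smith', 'Sig Proc', 'VAX11', 'Fortran'),
]

col = lambda i: [r[i] for r in rows]
pair = lambda i, j: [(r[i], r[j]) for r in rows]

# H(A,B) + H(B,C) + H(C,D) + H(D,E) - H(B) - H(C) - H(D)
chain = (entropy(pair(0, 1)) + entropy(pair(1, 2)) + entropy(pair(2, 3))
         + entropy(pair(3, 4)) - entropy(col(1)) - entropy(col(2)) - entropy(col(3)))

print(round(chain, 2), round(entropy(rows), 2))  # 3.0 3.0
```

The two values agree, confirming that the chain of pairwise tables can be joined to reproduce the full table.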

**Failure of Pairs to Form a Decomposition Pair**
The following table may be used to examine potential problems with decomposition pairs:

Agent | Company | Product |
---|---|---|
Smith | Ford | car |
Smith | Ford | truck |
Smith | GM | car |
Smith | GM | truck |
Jones | Ford | car |
Jones | Ford | truck |
Brown | Ford | car |
Brown | GM | car |
Brown | Toyota | car |
Brown | Toyota | bus |

Below, the letter "A" is used to denote Agent, "C" to denote Company, and "P" to denote Product.

By pressing the above button to see the analysis of the above table, one can attempt to decompose the table into two smaller tables, both with key C. In this case, the entropies of the components exceed the entropy of the combined table, suggesting that the decomposition is inappropriate:

Entropy of A,C: log(10)/(5 log(2)) + log(5)/(5 log(2)) + 3 log(10/3)/(5 log(2)) = 2.17 +
Entropy of C,P: 2 log(5)/(5 log(2)) + 3 log(5/3)/(5 log(2)) = 1.37 -
Entropy of C: log(5)/(5 log(2)) + 4 log(5/4)/(5 log(2)) = 0.722 = 2.82 >
Entropy of A,C,P: log(10)/(2 log(2)) + log(5)/(5 log(2)) + 3 log(10/3)/(10 log(2)) = 2.65.

Similarly, decomposing with key P produces components whose combined entropy exceeds the joint entropy:

Entropy of C,P: 2 log(5)/(5 log(2)) + 3 log(5/3)/(5 log(2)) = 1.37 +
Entropy of A,P: log(10)/(5 log(2)) + 2 log(5)/(5 log(2)) + 2 log(5/2)/(5 log(2)) = 2.12 -
Entropy of P: log(5)/(5 log(2)) + 4 log(5/4)/(5 log(2)) = 0.722 = 2.77 >
Entropy of A,C,P: log(10)/(2 log(2)) + log(5)/(5 log(2)) + 3 log(10/3)/(10 log(2)) = 2.65.
Thus the pairs mentioned cannot be joined as suggested to produce the full table.
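The same conclusion can be reached with a sketch that uses plain empirical row-frequency entropies (a simpler probability model than the calculator's, so the individual numbers differ from those above): the sum of the parts strictly exceeds the entropy of the whole, so splitting on Company loses information.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a list."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

rows = [
    ('Smith', 'Ford', 'car'), ('Smith', 'Ford', 'truck'),
    ('Smith', 'GM', 'car'), ('Smith', 'GM', 'truck'),
    ('Jones', 'Ford', 'car'), ('Jones', 'Ford', 'truck'),
    ('Brown', 'Ford', 'car'), ('Brown', 'GM', 'car'),
    ('Brown', 'Toyota', 'car'), ('Brown', 'Toyota', 'bus'),
]

h_acp = entropy(rows)                          # log2(10): all 10 rows distinct
h_ac = entropy([(a, c) for a, c, _ in rows])   # Agent, Company
h_cp = entropy([(c, p) for _, c, p in rows])   # Company, Product
h_c = entropy([c for _, c, _ in rows])         # Company alone

# Strict inequality: the decomposition keyed on Company is lossy.
print(h_ac + h_cp - h_c > h_acp)  # True
```

A lossless decomposition would give equality here, as it did for the Department/Researcher table above.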
Relations may be placed into *fifth normal form* when more redundancy is removed than
when databases are placed into fourth normal form.
Note that the following tables extract information from the table above, yet have less redundancy, since
each has fewer rows than the table above.

Agent | Company |
---|---|
Smith | Ford |
Smith | GM |
Jones | Ford |
Brown | Ford |
Brown | GM |
Brown | Toyota |

Company | Product |
---|---|
Ford | car |
Ford | truck |
GM | car |
GM | truck |
Toyota | car |
Toyota | bus |

Agent | Product |
---|---|
Smith | car |
Smith | truck |
Jones | car |
Jones | truck |
Brown | car |
Brown | bus |

Note that no two of these pairwise tables, joined together, reproduce all of the information in the original three-attribute table; all three are needed.

Index terms may be identified
based on their rarity.
Many such terms will not occur in a randomly selected document,
occurring only in documents specifically on the topic of the term.
One might compute a probability by dividing a term's frequency by the maximum
number of occurrences of any term,
often the frequency of the term *the* in English language documents,
and then take the self-information of that probability.
Common English terms such as
*the*,
*and*, and
*of*
will not serve well as index terms.
Less common terms such as
*cat* or
*dog* might better serve as index terms.

Indexing terms in a document by their rarity could serve as an automated indexing process that does not require knowledge of the semantic meaning of terminology. Thus, taking the terms with the most self-information as index terms might serve as an automatic indexing approach. In the calculator below, each term is in quotes and each document contains a list of these terms, with the overall structure being a list of these document lists. Press the Index Information button below to see some terms and their self-information in the set of document titles.

Note that some words such as *adventures* and *puss* and *gone* are high information
terms, given the documents here.
The two terms that have the lowest indexing value are *in* and *the*.
Using this method, selecting the few terms in a document with the highest such weights can serve as
an automatable indexing method.

The above method is often referred to as Inverse Document Frequency weighting: the more frequently a term occurs across the documents, the lower the term's weight.

While the above indexing technique rewards rarity of terms and is commonly used in automatic indexing systems, it may have weaknesses. Luhn suggested in the 1950s that mid-frequency terms might be the best index terms. Common words should be rejected as index terms, but rare terms may also be so uncommon that few people use them and thus they wouldn't be effective index terms. Mid-frequency terms have an advantage here.

Rare terms may also be misspellings of uncommon term variants. The reader is invited to select a publisher's name that they think might be misspelled. Look up this name under the proper spelling using an Internet search engine (it will be in many web page bibliographies) and search for the name with misspellings that the reader thinks might be made by individuals. The reader might try "Morgan Kaufmann," "Morgan Kaufman," "Morgan Koufmann," and so forth.

The Luhn approach may be approximated by assigning the highest index weights to mid-frequency terms. This may be accomplished by computing the information as before, but with the frequency replaced by the absolute value of the difference between the middle frequency and the frequency of the term in question, so that terms with frequencies closest to the middle rank ahead of those nearer the highest and lowest frequencies. This generates a probability other than the one associated with the raw term frequency.
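One way to realize this idea is sketched below; the exact transformation of distance into a weight is a design choice, and the term frequencies are hypothetical. Each term's distance from the middle frequency is treated as a pseudo-frequency (plus one, to avoid a log of zero), and the self-information of that pseudo-frequency becomes the weight, so terms nearest the middle score highest:

```python
from collections import Counter
from math import log2

def luhn_weights(counts):
    """Mid-frequency weighting: the distance of each term's frequency from
    the middle frequency is treated as a pseudo-frequency, and its
    self-information is the term's weight (a sketch, not Luhn's own formula)."""
    mid = (max(counts.values()) + min(counts.values())) / 2
    dist = {t: abs(f - mid) + 1 for t, f in counts.items()}  # +1 avoids log(0)
    total = sum(dist.values())
    return {t: -log2(d / total) for t, d in dist.items()}

# Hypothetical frequencies: 'the' very common, 'zyzzyva' very rare.
counts = Counter({'the': 20, 'cat': 10, 'hat': 11, 'wind': 9, 'zyzzyva': 1})
weights = luhn_weights(counts)
best = max(weights, key=weights.get)
print(best)  # a mid-frequency term such as 'cat' scores highest
```

Both the very common term and the very rare term end up with the same low weight, which is exactly the behavior Luhn's argument calls for.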

When the button below is selected, the mid-frequency terms receive the highest weights,
with the highest-weighted mid-frequency terms being
*cat*,
*wind*, and
*hat*.

**Exercise -- Automatic Indexing 1.**
Pressing the button above calculates a Luhn-like weight.
Enter 7 book or article titles that address the same topic.
When you are done, what are the index terms selected by the mid-frequency weighting?
How do these compare to the terms assigned as library subject headings by a library or by an indexing
service to the article or book?

The signal to noise ratio for a term may be measured across documents as the frequency in a document divided by the average frequency. This measure is effective when the term frequencies are large enough to produce statistical significance, but is less effective with very small frequencies, those of the terms that receive the highest Inverse Document Frequency weights. The signal to noise ratio is often computed as the average frequency divided by the standard deviation of the frequencies for the term in question.
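The mean-over-standard-deviation form can be sketched as follows; the per-document frequencies are hypothetical:

```python
from statistics import mean, pstdev

def signal_to_noise(freqs):
    """Signal to noise ratio of a term: the average of its per-document
    frequencies divided by their (population) standard deviation."""
    sd = pstdev(freqs)
    return mean(freqs) / sd if sd > 0 else float('inf')

# Hypothetical frequencies of two terms across five documents.
steady = [4, 5, 6, 5, 5]    # occurs at about the same rate everywhere
bursty = [0, 0, 12, 0, 13]  # concentrated in two documents

print(signal_to_noise(steady) > signal_to_noise(bursty))  # True
```

A term used evenly across documents has a high ratio; a term concentrated in a few documents has a low one, since its frequencies vary widely around their mean.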

Press the button above to determine the signal to noise ratios for terms in each document.
Indexing with these lists of signal to noise values for each document might begin with selecting the
*n*
highest signal to noise ratio terms
for each document.
Using a larger dataset of terms might allow one to select all terms with a value over an arbitrarily
chosen constant.

**Exercise -- Signal to Noise 1.**
Using the button above, enter *5* document titles for non-fiction documents that you use to help in
learning.
Given the signal to noise ratios for the terms in these documents, what are the best *3* index
terms to use to represent and to differentiate between these documents?
Could you use just these *3* terms to index the *5* documents?

**Exercise -- Signal to Noise 2.**
Consider using the signal to noise methodology described above.
What could you do to modify the formula
to allow the formula to have higher weights for material you prefer than for the normal signal to
noise ratio?

- Fagin, Ronald. "Multivalued Dependencies and a New Normal Form for Relational Databases." *ACM Transactions on Database Systems* 2(3), September 1977, 262-278.
- Kent, William. "A Simple Guide to Five Normal Forms in Relational Database Theory." *Communications of the ACM* 26(2), February 1983, 120-125.
- Lee, Tony T. "An Information-Theoretic Analysis of Relational Databases--Part I: Data Dependencies and Information Metric." *IEEE Transactions on Software Engineering* SE-13(10), October 1987, 1049-1061.
- Losee, Robert. *Information From Processes: About the Nature of Information Creation, Use, and Representation*. Springer Verlag, 2012.