Information Practice:
Information Theory and Applications

InformationPractice.org
Bob Losee, losee@unc.edu
U. of North Carolina, Chapel Hill
Under Construction

Introduction

First, press the Practice Entering Data button that is several lines below. Then enter two numbers into the boxes near the top and watch (near the bottom of the box) as the sum is recalculated every time a new number is entered into the "Enter one number" or "Enter Second Number" boxes and the Enter key is pressed. Note that there are other times when a number may be entered to replace a number in a bracketed list of numbers, where the list might look like [1, 2, 3, 4]. It is often necessary to hit Enter after entering a new number in a cell.

You might also replace textual material between quotes in a list, such as one of the names in ['Alice', 'George', 'Robin'], with another name. You may add or delete items in a list, making sure that there is one and only one comma between adjacent items in the list.

Note that all logarithms below are given to base 2, so that the amount of information is in bits, binary digits. A logarithm to base 2 can be computed as the logarithm, to any base, of the number of interest divided by the logarithm, to that same base, of the number 2. Thus the logarithm to base 2 of 0.5 is log(0.5)/log(2) = -1, and an event with probability 0.5 carries -log(0.5)/log(2) = 1 bit of information.
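As a small illustration of this conversion (a minimal Python sketch, separate from the page's own calculators):

import math

def log2_of(value):
    # Convert a natural logarithm to a base-2 logarithm.
    return math.log(value) / math.log(2)

print(log2_of(0.5))   # -1.0; an event with probability 0.5 carries 1 bit of information
print(log2_of(8))     # 3.0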

Self-Information

Shannon suggested that the amount of information in learning about the occurrence of a single event is inversely related to the probability of the event. More formally, the amount of information I(x) associated with event x is

I(x) = - log2 Pr(x) = log2 (1/ Pr(x))

when the amount of information is desired in bits. The amount of self-information for a probability of p = 0.10 is about 3.3 bits. When the probability increases to p = 0.5, the self-information decreases to 1.0 bit, which is consistent with a single toss of a fair, two-sided coin landing heads. When the probability becomes almost certain, p = 0.9999, the information becomes 0.00014 bits, providing relatively little information.

A calculator below begins execution when the button is pressed. You may enter either a probability directly or a fraction, giving the numerator and the denominator for a set of data; for example, 2 for the numerator and 5 for the denominator produces a probability of 0.4 and the corresponding amount of information.
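A short Python sketch of this calculation (an illustration only; the page's calculator may behave differently):

import math

def self_information(probability):
    # I(x) = -log2 Pr(x), the self-information in bits.
    return -math.log(probability, 2)

def self_information_from_fraction(numerator, denominator):
    return self_information(numerator / denominator)

print(round(self_information(0.10), 2))                 # about 3.32 bits
print(round(self_information_from_fraction(2, 5), 2))   # probability 0.4, about 1.32 bits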

Note that the subscripts of 2 are often omitted from the logarithms below and in the information theoretic literature.

Self-Information Exercises

Exercise -- Self Information 1. Enter the number of cards that you are carrying with you in your wallet, purse, or other card carrying device into the denominator of the self-Information exercise box below. Enter the number of cards that are credit or debit cards into the numerator box. How many bits of information are there in knowing that a single card is a credit or debit card or other card used for paying, as opposed to other forms of cards, such as a license or a business card? Remember that probabilities of 0 and 1 are technically impossible.

Exercise -- Self Information 2. Take a coin, toss it ten times, and enter "10" into the denominator box above. As you are tossing the coin, note how many "heads" occur. Enter the number of heads into the numerator box and then hit Enter. How much information was there in tossing the coin ten times and producing this many heads?

Entropy

Averaging the information that occurs in a variable may be useful for studying random variables, such as the input or the output of a system. The average of a set of values may be computed using the traditional method taught to most school children: adding the numbers together and then dividing by the number of numbers. For the example numbers (1, 2, 3, 3), this produces (1 + 2 + 3 + 3)/4 = 9/4. An alternative approach, used in computing expected values, is to weight each of the values by its probability of occurrence and then add the weighted values. Thus, one might compute for a list of values 1, 2, 3, and 3, conceptually a random variable X with a set of differing values, as

Average(X) = E(X) = ∑i Pr(xi) xi,

so that the following is computed:

1/4 * 1 + 1/4 * 2 + 1/2 * 3 = 9/4 .

This method of computing the average is used in computing the values below when determining the expected value, the predicted average for the future.
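A small Python sketch of this probability-weighted average (an illustration, not the page's own calculator code):

from collections import Counter

def expected_value(values):
    # Weight each distinct value by its probability of occurrence and sum.
    counts = Counter(values)
    n = len(values)
    return sum((count / n) * value for value, count in counts.items())

print(expected_value([1, 2, 3, 3]))   # 2.25, which equals 9/4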

Entropy is the average information received from a variable, reflecting the uncertainty over the range of values that the variable in question may take. The entropy H(X) for random variable X is computed as

H(X) = ∑ i Pr(xi) log(1/Pr(xi)),

that is, the entropy of random variable X is the information associated with each of the xi values, weighted by the probability of that value. Using the numbers in the previous paragraph, the entropy for these four numbers is obtained by pressing the Entropy Calculator button below, which computes the information associated with each number and then the weighted average:

Try inserting some different numbers in the list in the calculator above to compute the entropy with different data.
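A minimal Python version of such an entropy calculation over a list of values (an illustration; the page's own calculator code is not shown here):

import math
from collections import Counter

def entropy(values):
    # H(X) = sum over values of Pr(x) * log2(1 / Pr(x)).
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(entropy([1, 2, 3, 3]))           # 1.5 bits
print(entropy([0, 0, 0, 1, 1, 1]))     # 1.0 bit
print(entropy([0, 0, 0, 0, 0, 0]))     # 0.0 bits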

The entropy for a binary variable with a given probability peaks at the probability of 0.5. To see the entropy for a range of values, press on the Entropy Graph button below.

We can examine the practical application of entropy with a dataset: [0, 0, 0, 1, 1, 1]. One can execute this by pressing on the Entropy Exercise button below.

Try the above with all zeroes, that is, enter the list [0, 0, 0, 0, 0, 0] above (one can do the same thing with all ones; the result will be the same). This measures the uncertainty in the random variable. Note that when there are equal numbers of each value, the number of bits of entropy represents how many bits it would take to encode the different options; for example, 1 bit indicates that two values, zero and one, can be encoded.

Entropy Exercises

Exercise -- Entropy 1. Begin with the Entropy Calculator above. Make a list of values within square brackets, with each value separated from the next by a comma. For example, a set of coin tosses might be represented as [1,1,0,1,0], where a 1 represents a coin landing heads and a 0 represents tails. Enter a list of the values representing each card you have with you, with 1 representing credit or debit cards and 0 representing cards that are not credit or debit. What is the entropy associated with knowing whether one of your cards is a credit or debit card or not?

Exercise -- Entropy 2. Enter the following lists and calculate the entropy for each of these three lists: [0, 0, 0, 0, 0, 1], [1, 1, 1, 1, 1, 0], and [1, 1, 1, 0, 0, 0]. Which list produces the highest entropy? Is there a general rule for which entropies have the highest values?

Exercise -- Entropy 3. Use the Entropy Calculator above. Make a list of values within square brackets, with each value separated from the next by a comma. Enter each of the following lists in the calculator above: [0, 0], [1,1], [0,1], and [1,0]. What is the entropy for each list? Which lists produce the lowest entropy? Why do these lists of data have the lowest uncertainty?

Joint Entropy

The joint entropy of two random variables or datasets is computed by treating the items in both datasets as pairs of items. If we examine the following example (press the Joint Entropy button below), one can see the joint entropy when we have two random variables and thus two sets of data, such as [0, 0, 0, 0, 0, 0] and [0, 0, 0, 1, 1, 1]. This is the entropy of the two random variables combined, with each random variable being associated with one of the lists.
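A short Python sketch of joint entropy, pairing the items of two equal-length lists (an illustration under that assumption; the page's calculator may handle input differently):

import math
from collections import Counter

def joint_entropy(xs, ys):
    # Treat corresponding items as pairs and compute the entropy of the pairs.
    pairs = list(zip(xs, ys))
    counts = Counter(pairs)
    n = len(pairs)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(joint_entropy([0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1]))   # 1.0 bit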

Joint Entropy Exercises

Exercise -- Joint Entropy 1. By using the above calculator, what is the joint entropy of two lists: [0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0]? How would one obtain higher or lower joint entropy?

Exercise -- Joint Entropy 2. By using the above calculator, what is the joint entropy of two lists with nothing in common between the two lists, such as [0,1] and [1,0]? How would one obtain higher or lower joint entropy?

Conditional Entropy: Cause and Effect

The conditional entropy H(X|Y) is the entropy or uncertainty of one random variable X given a second random variable Y. This view of conditional entropy is useful when examining any cause and effect relationship, whether a communication channel, where we are concerned with the amount of entropy at the output given the entropy at the input, or a simpler linkage, since these all can be modeled as a channel. The vertical line in H(X|Y), "|", can be read as the English language term given. The conditional entropy may be computed as the joint entropy of the two random variables or two datasets minus the entropy of the second random variable or dataset.
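Following that definition, H(X|Y) = H(X,Y) - H(Y); a brief Python sketch (an illustration only, reusing the entropy calculation sketched earlier):

import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def conditional_entropy(xs, ys):
    # H(X|Y) = H(X,Y) - H(Y), the uncertainty left in X once Y is known.
    return entropy(list(zip(xs, ys))) - entropy(ys)

print(conditional_entropy([1, 1], [0, 1]))   # 0.0 bits
print(conditional_entropy([0, 1], [1, 1]))   # 1.0 bit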

Uncertainty in a Random Variable. Equivocation is another term for conditional entropy, emphasizing the amount of information needed (or randomness that exists) about one random variable when one knows a second random variable. Equivocation may be viewed as a measure of the level of ambiguity (uncertainty) in the first random variable, given the second random variable.

Information Loss. Conditional entropy or equivocation can be treated as a measure of information loss within any cause and effect situation, such as a channel. In this alternate model of equivocation, considering a channel that moves from X to Y, H(X|Y) is the amount of information about X that does not make it through the channel to Y. This is an information loss between the cause X and the effect Y.

Consider two lists of data: [1, 1] and [0, 1]. The conditional entropy of the first given the second is provided in the example below (Press the Conditional Entropy or Equivocation button below):

The uncertainty at a receiver that receives [1, 1], given that one knows [0, 1] is transmitted, is 0 bits. Consider the situation where either a 0 or a 1 is transmitted, each with probability 1/2. Receiving [1, 1] shows no uncertainty about what is received because only 1s are received.

Now try entering the following data into the calculator above. Receiving [0, 1] when a [0, 1] is transmitted, leaves no uncertainty at the receiver about what was transmitted.

When a [0, 1] is received and a [1, 1] is transmitted, what is transmitted tells us nothing about what is received. The transmission provides no information, and thus the uncertainty at the receiver is that of an event that may be a 0 or a 1: a single bit.

Dependencies between Cause and Effect. Consider again where one receives [0, 1] when [1, 0] is sent. Knowing what was sent could allow one to perfectly predict what is received, thus there would be zero uncertainty at the receiver. One would similarly have a perfect dependence if one received [0, 1] when [0, 1] was transmitted and there was no uncertainty at the output. There is complete dependence between cause and effect given these examples.

Independence between Cause and Effect. Consider again where one receives [0, 1] when [1, 1] is sent. Knowing what was sent provides no information about what is received, thus producing complete (1 bit) uncertainty at the receiver. This exhibits complete independence between the cause, what was transmitted, and the effect, what was received.

Information Loss and Noise. Given the independence between cause and effect, one can see the information that is lost in the cause and effect process or in the channel as information moves between transmitter and receiver. When dependence holds between input and output and one knows precisely what the output will be, given the input, there is no loss of information in the channel. This produces zero uncertainty at the receiver or effect side of the channel.

The information lost may also be viewed as noise introduced in the channel. When the effect is fully dependent on the cause, no noise enters the channel. On the other hand, when the effect is independent of the cause and the conditional entropy of the cause given the effect is 1 bit, this may be interpreted as the amount of noise introduced in the channel between the cause and the effect.

Equivocation Exercises

Exercise -- Equivocation 1. Begin with the Conditional Entropy or Equivocation Calculator above. For the second variable, create a list of five 1 and 0 values representing the last 5 web pages you looked at, where a 1 represents information that you wanted to see and a 0 represents information that led you to another page, where you eventually found the information in which you were interested. For the first random variable (with the first value corresponding to the first above, the second to the second, and so forth), enter a list of values where 1 represents a web page you would pay the cost of a single snack that you enjoy to see and 0 represents a web page you would not pay to see. What is the relationship between the first random variable and the second random variable? What do equivocation or conditional entropy measures have to say about this?

Exercise -- Equivocation 2. How can equivocation be used to measure the randomness quality of a supposedly random sequence of ones and zeroes? What would be the weakness of this approach?

Mutual Information

The mutual information between two random variables is the amount of information that one provides about the other in a symmetrical way. Sometimes the information that two random variables provide about each other is referred to as average mutual information, but we follow the more common tradition of referring to this as mutual information. Pressing on the Mutual Information Measure button below, one can see how much information the values [0, 0] and the [0, 1] provide about each other.

In this case, there is no mutual information between the two random variables or datasets. When one uses two identical lists, [0, 1] and [0, 1], there is one bit of mutual information between the two variables. Note that the value for the mutual information measure may go far beyond 1 when there are larger numbers of features being analyzed. Thus, one needs to be cautious when looking at the raw value for the mutual information measure if comparing this value with a traditional correlation measure. Those wishing to have a mutual information measure that remains in the range from 0 to 1 may wish to use one of the normalized mutual information measures described on the internet.
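A Python sketch of mutual information, using I(X;Y) = H(X) + H(Y) - H(X,Y) (an illustration only; the page's calculator may differ):

import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def mutual_information(xs, ys):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), the information each variable gives about the other.
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

print(mutual_information([0, 0], [0, 1]))   # 0.0 bits
print(mutual_information([0, 1], [0, 1]))   # 1.0 bit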

Mutual Information Exercises

Exercise -- Mutual Information 1. When is the mutual Information value between two variables maximized? How can this be interpreted in terms of dependence or independence between the two different variables?

Exercise -- Mutual Information 2. When is the mutual Information value between two variables minimized? How can this be interpreted in terms of dependence or independence between the two different variables?

Exercise -- Mutual Information 3. Using the calculator above for mutual information and the earlier calculator for conditional entropy, when does one obtain high values for the Mutual Information and for Conditional Entropy? When are the values for both minimized?

Communication Channels

A channel is a communication link that connects a transmitter with a receiver. The transmitter receives a message from the source, and then the transmitter sends the message to the receiver. The receiver then sends the message received on to the destination. Ultimately, the source is sending a message to the destination, with the message moving between the transmitter and the receiver.

As an example, imagine talking to a friend through cellular telephones. If you speak, you are the source, and your voice moves a few centimeters to the microphone in the cellular phone. The phone now acts as the transmitter, moving the information to the cellular phone on the other end, with the information being processed by the telephone carrier, who may digitize the signal, combine it with other people's signals, or modify the signal's frequencies. When the message arrives at the receiving cellular telephone, this telephone, acting as the receiver, produces the message by vibrating the speaker. This message then goes to the destination, the destination's ears.

Hierarchical Communication Models

A channel may carry information, with the information, in turn, carrying information from another channel. For example, when I speak, the ideas in my head are encoded into English language statements. These ideas may be transmitted to a listener who also understands English. The statements are encoded by a speech part of my brain, taking the ideas and phrasing them into English, and the listener takes the statements and decodes them into ideas. The English statements may be encoded into speech sounds and transmitted using the vocal cords in my throat. These sound waves are decoded through the listener's eardrum and cochlea and attached nerves, which allow a signal to move to the listener's English language processing component of the brain. Note that for the speaker there is a channel, or conceptual channel, connecting the speaker's English language processor to the listener's English language processor, and this channel communicates through use of the channel from the speaker's vocal cords to the listener's ear.

This set of channels serves as a hierarchy of channels, with one layer on the transmitting side sending information to the comparable level on the receiving side, often by using the layer below it, which may, in turn, communicate through the layer below it, and so on until the bottom or most physical communicating layer is reached.

Entropy for Source and Destination

Both the messages leaving the source and the messages arriving at the destination of a channel have a given amount of uncertainty. The source of the information in the channel has some uncertainty about it, and may be considered a random variable. The uncertainty in this variable may be due to the different types of messages entering and leaving the communication system. For example, a message may consist of a computer machine language program that consists of a set of zeroes and ones. A text message in American English would have 26 uppercase and 26 lowercase letters, with no accents, providing much more uncertainty with each character in the message than with a one or a zero in binary machine language code. Entropy occurs at both the source and transmitter, as well as at the output end of the channel, at the receiver and the destination.

Bandwidth

The bandwidth of a communication channel represents the frequency range needed for a signal traversing the channel. Bandwidth is usually measured in hertz, the frequency measure that replaces the previously used "cycles per second." The highest frequency of a signal placed in the channel might be thousands of hertz if it is produced by the human voice, whereas a video signal may be measured in millions of hertz. Bandwidth should not be taken as a firm wall, where everything below that frequency makes it through the channel and everything above that frequency is cut off. Instead, the strength of a signal decreases gradually at the bandwidth frequency. Bandwidth is commonly measured to the frequency where the signal has decreased by 3 decibels, which is half the power level of the strongest frequencies that are passed through the channel.

Walking into a busy office results in hearing many different sounds, at different pitches, from telephones ringing, people talking, equipment noises, and other sources. Each sound occurs at one or more pitches and each works within its own frequency range. Some sounds span a large range of frequencies, while others may be nearly constant at unique frequencies. Commercial radio stations operate with government licenses for a particular base frequency and a frequency range. Essentially, each station has its own bandwidth and base frequency that are designed not to interfere with other local stations, which have their own base frequencies and probably similar bandwidths. Other stations dozens or hundreds of kilometers away may use the same frequencies, and if one listens to a frequency when geographically halfway between two radio stations, one can often hear the interference. Receivers close to the transmitter will listen to one bandwidth section or a second or a third and so forth, with analog and digital filtering determining the characteristics of the received bandwidth that are passed through to the speaker.

Channel Capacity

The capacity of a channel is the maximum mutual information between the transmitter and the receiver. As the greatest amount of information that can be transmitted through the system, it is the highest rate at which information can move between transmitter and receiver.

Commercial analog FM stations in many locations require a channel capacity that is several times the bandwidth of the audio signal being transmitted. Commercial analog AM stations, on the other hand, often have a channel capacity that is near the bandwidth of the underlying audio signal. FM stations operate by deviating the carrier frequency by several times the audio frequency, making the overall needed channel capacity several times the bandwidth of the audio signal.

Channel Capacity Exercises

Exercise -- Channel Capacity 1.

The channel capacity above is determined by computing the maximum mutual information between all input and output sets of data. In this case, there is one set of data pairs, X, and Y, with both of the lists of data being 0, 1, 0, 1.
  1. What is the channel capacity when one binary string has the opposite value from the other, thus X: 0, 1, 0, 1 and Y: 1, 0, 1, 0, and these are all that is ever transmitted?
  2. What is the channel capacity when one binary string has values unassociated from the other, thus X: 0, 1, 0, 1 and Y: 0, 1, 1, 0, and these are all that is ever transmitted?
  3. What would be the characteristics of two lists of binary values such that the channel capacity is maximized?

Information Rate

The rate at which information travels through a channel is determined by the entropy of what is transmitted, the noise in the channel, and the channel equivocation. The rate at which information moves through a channel is clearly limited by the channel capacity, with the information rate being at its maximum when the transmitter's entropy is the same as the channel capacity.

The equivocation in a channel represents the ambiguity or loss in the informative signal as it moves from the input to the output. Given the transmitted rate as the entropy measured as bits per unit of time, the information that actually makes it through the channel is what is not lost through equivocation. Thus, one may compute the information rate as the transmitted entropy minus the equivocation. Press the Information Rate Calculator button below to examine the calculation of the information rate given the input [1,0,1,0,1,0,1,0,1,0] and the output [1,0,1,0,0,1,1,0,1,0].
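A brief Python sketch of this computation, taking the information rate per symbol as the input entropy minus the equivocation H(input|output) (an illustration; the page's calculator may report additional values):

import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def information_rate(inputs, outputs):
    # Rate = H(input) - H(input | output), per transmitted symbol.
    equivocation = entropy(list(zip(inputs, outputs))) - entropy(outputs)
    return entropy(inputs) - equivocation

print(information_rate([1,0,1,0,1,0,1,0,1,0],
                       [1,0,1,0,0,1,1,0,1,0]))   # about 0.28 bits per symbol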

The information rate here depends on the entropy presented at the input and the information lost moving from the input to the output, and provides the amount of correct information at the output, given the input and the information loss. This may be viewed in terms of uncertainty: the information successfully arriving at the output is the uncertainty at the input minus the information lost in the channel. When the uncertainty at the input increases, as when the vocabulary becomes more complex, one would expect the uncertainty and the vocabulary complexity at the output to increase similarly. For a constant uncertainty at the input, the information rate increases as the equivocation, the conditional entropy of the input given the output, decreases.

Information Rate Exercises

Exercise -- Information Rate 1.

  1. Using the above calculator, enter the same list of ones and zeros into both the input list and the output list. What is the Equivocation value returned? What is the information rate?
  2. Again using the above calculator, enter a list of the same number of ones and zeroes for the input list and for the output list, making half of the input zeroes be received at the output as zeroes and half received as ones. Do the same with the input ones, with half received as zeroes and half as ones. What is the equivocation for this communication system? What is the overall information rate?

Kullback-Leibler Divergence

The Kullback-Leibler (KL) divergence is used to compare the difference between two probability distributions, X and Y. When one is used in place of the other, what are the consequences? The KL value may be understood as the information gained in KL(X, Y) when Y is treated as the true distribution but the belief is then revised to be X. It may also be viewed as the information lost in KL(X, Y) when Y is used as an approximation of X. This occurs when X is the correct or true distribution but Y is used instead, to simplify the situation or because Y is being used erroneously. In this situation, the KL divergence can be understood as measuring the loss associated with using an inaccurate distribution instead of using the accurate, correct distribution.
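A minimal Python sketch of the KL divergence between two discrete distributions given as probability lists (an illustration; the page's calculator may instead accept lists of raw data):

import math

def kl_divergence(p, q):
    # KL(P, Q) = sum over outcomes of P(i) * log2(P(i) / Q(i)),
    # the information lost when Q is used to approximate the true distribution P.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(round(kl_divergence([0.5, 0.5], [0.9, 0.1]), 2))   # about 0.74 bits
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))             # 0.0 bits; identical distributions diverge by 0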

Kullback Leibler Divergence Exercises

Exercise -- Kullback Leibler Divergence 1.

  1. Using the above calculator, what is the divergence with the supplied data?
  2. Continuing, what is the KL divergence when both lists of data are identical?

Coding for Error Detection and Correction

Information may be encoded a number of different ways. An idea may be expressed in any of hundreds of natural languages. Other representations may be used besides a specific natural language.

Error detection allows the recipient or reader of a message to determine whether an error has occurred while the message was being processed or stored. A typographic error is often recognized by readers with strong spelling abilities and the correct word can be substituted for the incorrectly spelled word. Sometimes a word may be garbled so badly that the reader cannot determine which word was originally intended by the writer.

Error detection supports the user determining that an error has, in fact, occurred. In some circumstances, error correction can occur, when the user can determine what was originally intended by the producer of the information representation.

Note that when working with binary data, there are two possible values for each position in the representation. When one can determine that a specific bit is erroneous, then the value of the bit should be inverted, so that a zero becomes a one or a one becomes a zero. Error correction requires, for binary data, that the location of the erroneous bit is determined. Error detection, on the other hand, implies that it is known that an error occurred but not necessarily with knowledge of which bit is erroneous.

Error Detection

One form of error detection is parity. Parity consists of a set of original data with an added bit, referred to as the parity bit, which will be set to one or zero so that the number of ones or zeroes in the representation, with the parity bit, is an even or an odd number, depending on the particular system used. If one wants an even number of ones, the representation [1,0,1,0] would have the parity bit set to 0, producing an even number of ones for the four data bits, 1,0,1,0 and the parity bit, 0. If the representation was [1,0, 1, 1] then the parity bit would be set to 1 so that the number of 1s was an even number.

When using a single parity bit, one can detect all single bit errors that occur. This is a guarantee; all single bit errors can be detected when a single parity bit is used. However, if two bits changed, such as [1,0,1,1] with a parity bit of 1 changing to [1,0,0,0] with a parity bit of 1, the system cannot detect that the error has occurred when the data is read. If one desired a guarantee of detecting all two bit errors, it would be necessary to have more than one parity bit.
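A short Python sketch of even-parity generation and checking (assuming even parity, as in the example above):

def add_even_parity(bits):
    # Append a parity bit so that the total number of ones is even.
    parity = sum(bits) % 2
    return bits + [parity]

def parity_ok(bits_with_parity):
    # Detects any single-bit error; an even number of flipped bits goes unnoticed.
    return sum(bits_with_parity) % 2 == 0

codeword = add_even_parity([1, 0, 1, 1])   # [1, 0, 1, 1, 1]
print(parity_ok(codeword))                 # True
codeword[2] ^= 1                           # flip one bit
print(parity_ok(codeword))                 # False: the single-bit error is detected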

Redundancy

A second approach to determining errors is to add redundancy, such as repeating the original data.

0 0 0 0 0   0 0 0 0 0
0 0 0 0 0   1 1 1 1 1
1 1 1 1 1   0 0 0 0 0
1 1 1 1 1   1 1 1 1 1

This table can be viewed as counting 00, 01, 10, 11 in binary, where each bit of these four values is repeated four additional times, so that there are 5 bits representing each original information bit.

When any 4 bits change in the above table, it is obvious that the representation has changed and that something is wrong. If 5 bits change, however, it may be impossible to always determine that a change has occurred. Thus we have a guarantee of 4 bit error detection.

Error Correction

Using the above table, note that if one or two bits change, the new value in the representation, a row in the table, is still closer to the original value than to one of the other values. If three or more bits change, however, the new value may produce a representation that is closer to one of the other representations than to the original value. In this case, one might "correct" the value incorrectly. This coding system may be said to have four bit error detection, by adding the redundancy. If there were two more bits of redundancy beyond the original 5 bits per information bit, then one more bit of error correction and two more bits of error detection would be guaranteed. Thus, with 5 bits representing each original bit of the values 00, 01, 10, or 11, we have two bit error correction and 4 bits of error detection, while 7 bits representing each original bit produces 3 bit error correction and 6 bits of error detection, 9 bits representing each original bit produces 4 bit error correction guaranteed, and so forth.

Determining the location for an error will allow us to correct the error. Consider a table where the rightmost column represents the parity for data on that row, excepting the parity bit, and the bottom row represents the parity for that column. If the bit in row 2, column 3 inverts its value, then the parity bit for row 2 will be erroneous and the parity bit for column 3 will be erroneous. Given these two parity errors occurring, one can use the row number (2) and the column number (3) to locate the error and thus correct it, meaning that we have single bit error correction. By adding parity bits beyond one dimension, where a set of bits has a single parity bit added, more error correction often will be added.
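A small Python sketch of this row-and-column parity idea (a hypothetical layout: data bits in a grid, one even-parity bit per row and per column), locating and correcting a single flipped bit:

def row_col_parity(grid):
    # Compute even-parity bits for each row and each column of a bit grid.
    rows = [sum(row) % 2 for row in grid]
    cols = [sum(col) % 2 for col in zip(*grid)]
    return rows, cols

def correct_single_error(grid, saved_rows, saved_cols):
    # A mismatched row parity and a mismatched column parity locate the flipped bit.
    rows, cols = row_col_parity(grid)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, saved_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, saved_cols)) if a != b]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        grid[bad_rows[0]][bad_cols[0]] ^= 1   # invert the located bit
    return grid

data = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
saved_rows, saved_cols = row_col_parity(data)
data[1][2] ^= 1                               # introduce a single-bit error
print(correct_single_error(data, saved_rows, saved_cols))   # original grid restored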

The quality of a coding system is often referred to as n-bit error detection and m-bit error correction, where we can correct fewer errors than we can detect.

Relational Databases

Data may be stored in a variety of forms. One popular model for storing data is as a table, with columns for data attributes. Each row represents a tuple, a specific set of relationships, while the table as a whole is often referred to as a relation or relational database.

Name ID Address Undergrad Major
Al 123 Shady Lane Accounting
Bob 111 Hilltop Circle Accounting
Caitlyn 111 Hilltop Circle Library Science
Debra 230 Westbury Drive Computer Science

This table contains data with four tuples, or rows, each one describing a relationship between three attributes: a personal name, a home address, and the department in which they are taking their undergraduate major. The entropy in each column is the average information in receiving an attribute. When each row or relational tuple in a table has the same set of possible attributes, that is, each has the same number of columns, the database is said to be in first normal form. Placing relational databases into higher numbered normal forms may result in more efficient operations of the database.

Click on the above Attribute Entropy Calculator button. Note that when all the attribute values are different, the entropy is at its maximum. If there are 4 different names, then the entropy will be 2 bits. If one were to write the first four numbers in binary, they would be 00, 01, 10, and 11; this is the set of all numbers possible using 2 bits.
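A small Python sketch of the entropy of a single attribute, a column of values from the table above (an assumption about what the calculator computes):

import math
from collections import Counter

def attribute_entropy(column):
    # Entropy of one attribute: each distinct value weighted by its relative frequency.
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

names = ['Al', 'Bob', 'Caitlyn', 'Debra']
majors = ['Accounting', 'Accounting', 'Library Science', 'Computer Science']
print(attribute_entropy(names))    # 2.0 bits: four equally likely names
print(attribute_entropy(majors))   # 1.5 bits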

First Normal Form Exercises

Exercise -- Relational Databases 1. Using the above calculator, what is the entropy of the addresses if all 4 students live at the same address?

Exercise -- Relational Databases 2. If Caitlyn changes her name to Debra and Bob changes his name to Al, what is the entropy for the Name ID attribute?

Functional Dependence

Attributes in a relation are often dependent on other attributes in the relation. Tony Lee has suggested several problems as illustrative of how information theory may be applied to databases and to their decomposition by noting functional dependence between attributes. All are in Lee's classic articles (1987), but he, in turn, used some files from other sources. Note that when there is a disagreement between a table below and Lee's article, this is because the original source of the data, the article from which Lee took the data, was used in preference to the typesetting for Lee's article.

Student Course Teacher
Smith Math Prof. White
Smith Physics Prof. Green
Jones Math Prof. White
Jones Physics Prof. Brown

Lee provides the above table, which we may use to examine basic relationships between attributes. This set of relations may be decomposed so that the course-and-teacher relationship is separated from the fact that a particular student is taking a particular course. We may extract the following sub-table:

Course Teacher
Math Prof. White
Physics Prof. Green
Physics Prof. Brown

Note that the course is dependent on who the teacher is. When the course attribute is separated out into this sub-table, the remaining relation need only contain the student and the teacher.

When X completely determines Y, denoted as X -> Y, then H(X) = H(X,Y). This is referred to as a functional dependence. The relationship Teacher --> Course is a functional dependence that Course has on Teacher. This occurs because the value of the variable Course is completely dependent on the value of the variable Teacher. Thus, the variable Teacher contains all the information that might exist between Teacher and another variable which it completely controls, such as Course.
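A brief Python sketch testing X -> Y by checking whether H(X) = H(X,Y), using the Teacher and Course columns above (an illustration only):

import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def functionally_determines(xs, ys):
    # X -> Y holds when knowing X leaves no uncertainty about Y: H(X) = H(X,Y).
    return math.isclose(entropy(xs), entropy(list(zip(xs, ys))))

teachers = ['Prof. White', 'Prof. Green', 'Prof. White', 'Prof. Brown']
courses = ['Math', 'Physics', 'Math', 'Physics']
print(functionally_determines(teachers, courses))   # True: Teacher -> Course
print(functionally_determines(courses, teachers))   # False: Course does not determine Teacher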

Consider, for example, the names of readers of this material as a key with the binary attribute HumanYesOrNo. In this case the names will vary widely, with the key thus having a relatively high entropy value, while the HumanYesOrNo variable will likely have a low entropy, with all or almost all readers being human. In a different case, a family's listing of its members' names and each item of clothing that they own will have relatively few family members' names but a much larger number of items of clothing. With the names as the key, the entropy of the key will be much lower than the entropy for the clothing attribute.

Functional Dependencies Exercises

Exercise -- Functional Dependencies 1. Pressing the Conditional Entropy Calculator button below provides data along with a set of conditional entropy values for the relationships between the data attributes. What do these values suggest about the conditional entropy relationships between the attributes above?

Exercise -- Functional Dependencies 2. Using the calculator below, what is the entropy of the addresses if all students live at the same address?

Exercise -- Functional Dependencies 3. When computing the entropy or conditional entropy, enter into the Entropy and Conditional Entropy Calculator a list of the values for the attributes in single quotes, such as ['A', 'B', 'C']. You may use code numbers instead of typing longer strings. Consider the following table and its attributes (columns):

ID Number First Name Street Address Building Color
1 Al 123 Main St. Blue
2 Bob 123 Main St. Blue
3 Carol 456 Information Ave. Green
4 Danielle 456 Information Ave. Green
5 Edie 789 Maple Blvd Blue
6 Carol 120 Sage Lane Green

  1. From the above table, what is the entropy, in bits, for the ID Number, for the First Name, and for the Building Color?
  2. What is the conditional entropy of the ID Number, given the First Name? Enter each name (and text entries below) in single quotes, e.g., 'Danielle'. Thus, a list of Street Addresses would start like this: ['123 Main St.', '123 Main St.', '456 Information Ave.', ..].
  3. What is the conditional entropy of the ID Number, given the Street Address?
  4. What is the conditional entropy of the Street Address, given the ID Number?
  5. What is the conditional entropy of the Street Address, given the Building Color?
  6. What is the conditional entropy of the Building Color given the Street Address?
  7. What can one conclude from the above relationships about keys and non-key values?

Second and Third Normal Forms. When there are no non-key fields that are dependent on just part of the key, rather than on the whole key, the relation is said to be in second normal form. For example, if I place an order with an online store, they may issue a shipping order for a particular part number and a warehouse and the warehouse electronic address. The key is the part number and the warehouse, and the warehouse electronic address is dependent on the warehouse itself, a part of the key, so this relation is not in second normal form. A database is in third normal form when there are no non-key items that are functionally dependent on other non-key items. If one non-key item is dependent on another, they may be broken out from the original relation by producing a second relation composed of these non-key attributes, one of which serves as the key for the other non-key attributes.

Multivalued Dependencies and Fourth Normal Form

When there are multiple many-to-many relationships within a relation, the many-to-many relationships need to be removed from the relation so that each relation has only one many-to-many relationship. These relations, each with only one many-to-many relationship, are in fourth normal form if they are also in third normal form. Consider where a relation contains peoples' names, the foods they like cooking, and the colors for shirts they own. Each name has several foods associated with it, just as each name will have several colors associated with it. This relation is placed into fourth normal form when it is broken into two relations, one of peoples' names and the foods they like cooking, and a second relation with peoples' names and the colors for shirts they own.

Consider a more complex example of Employees, the names of their Children, and the Salary they received in the year denoted as Year.

Employee Child Salary Year
Hilbert Hubert $35K 1975
Hilbert Hubert $40K 1976
Gauss Gwendolyn $40K 1975
Gauss Gwendolyn $50K 1976
Gauss Greta $40K 1975
Gauss Greta $50K 1976
Pythagoras Peter $15K 1975
Pythagoras Peter $20K 1976

Decomposition of a Database. One can view the entropy values associated with the above table by pressing the Multi-Valued Dependency Calculator button above. The entropy of the full database, the joint entropy of the attributes Employee, Child, Salary, and Year, is log(8)/log(2) = 3.00 bits. Assume that the database was split (decomposed) into two smaller databases, one composed of the Employee and Child attributes, and the other with the Employee, Salary, and Year attributes. Ideally, no information is lost with this decomposition. Such a lossless decomposition results in a higher normal form for the resulting databases, with fewer insertion and deletion anomalies. The entropy of Employee and Child is log(4)/log(2) = 2.00 bits. The joint entropy of the Employee, Salary, and Year combined database is 2.5 bits, while the entropy of the Employee attribute, which occurs in both and thus needs to be removed from one of them when they are combined to produce the Employee, Child, Salary, and Year database, is 1.5 bits. We find that 2 + 2.5 - 1.5 = 3 bits, the entropy of the full table. If we join the two sub-tables, the Employee and Child sub-table with the sub-table of Employee, Salary, and Year, based on matching Employee values, the resulting table is the original table.
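The decomposition test above can be sketched in Python as follows (an illustration, checking H(X,Y) + H(X,Z) - H(X) against H(X,Y,Z) for the employee table):

import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

rows = [('Hilbert', 'Hubert', '$35K', '1975'), ('Hilbert', 'Hubert', '$40K', '1976'),
        ('Gauss', 'Gwendolyn', '$40K', '1975'), ('Gauss', 'Gwendolyn', '$50K', '1976'),
        ('Gauss', 'Greta', '$40K', '1975'), ('Gauss', 'Greta', '$50K', '1976'),
        ('Pythagoras', 'Peter', '$15K', '1975'), ('Pythagoras', 'Peter', '$20K', '1976')]

employee = [r[0] for r in rows]
child = [r[1] for r in rows]
salary_year = [(r[0], r[2], r[3]) for r in rows]

h_full = entropy(rows)                                # 3.0 bits
h_emp_child = entropy(list(zip(employee, child)))     # 2.0 bits
h_emp_sal_year = entropy(salary_year)                 # 2.5 bits
h_emp = entropy(employee)                             # 1.5 bits

# Lossless decomposition: the parts account for exactly the entropy of the whole.
print(math.isclose(h_emp_child + h_emp_sal_year - h_emp, h_full))   # True: 2.0 + 2.5 - 1.5 = 3.0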

Further dependencies may be illustrated by noting different ways that groups of attributes may be removed from a table to provide separate tables. Multivalued dependencies are one approach, with decomposition pairs providing a useful way to view groups of attributes. If there are three attributes, X, Y, and Z, all attributes of a particular database, and X serves as a key, then I(Y;Z|X) = 0, meaning that, given X (the key), Y and Z provide no additional information about each other. This condition is equivalent to H(X,Y,Z) = H(X,Y) + H(X,Z) - H(X). Equality holds when X is the key for a database composed of the three attributes X, Y, and Z, such that the database can be appropriately decomposed into two databases with X as the key for both.

In this case, the two databases can be joined back together to produce the original database. In general the inequality H(X,Y,Z) <= H(X,Y) + H(X,Z) - H(X) holds; when the inequality is strict, the decomposed databases contain more information, or uncertainty, than the full single database. If we were to decompose the full database into a different set of databases, we might find that the two combined subdatabases do not have the same amount of information as the original database. The entropy of the full Employee, Child, Salary, and Year database is log(8)/log(2) = 3.00 bits. If the key is treated as the Salary, then we might decompose the full database into subdatabases with Salary as the key. The entropy of a Salary and Year subdatabase is log(8)/(2 log(2)) + log(4)/(2 log(2)) = 2.50 bits. If we add to this the entropy of Salary, Child, and Employee, log(8)/log(2) = 3.00 bits, and subtract the entropy of the Salary (which occurs in both of the subdatabases), 3 log(8)/(8 log(2)) + log(4)/(4 log(2)) + 3 log(8/3)/(8 log(2)) = 2.16 bits, one can see that this value exceeds the entropy of the original full database: 2.50 + 3.00 - 2.16 = 3.34 > 3.00 bits. Thus, the Salary variable is not a suitable key upon which the other attributes in the subdatabases are dependent, and the sub-tables (Salary and Year) and (Salary, Child, Employee) cannot be joined together using the Salary key to recover the full table.

Multivalued Dependencies Exercises

Exercise -- Multivalued Dependencies 1. Provide an example with the above table where changing a few values will allow the Child to be completely determined by Employee. What is the consequence of this transformation?

Fifth Normal Form

Pairwise Decomposition. A set of relationships that are informally independent but can be joined together is referred to as a join dependency. It may be helpful to view a possible hierarchy of relationships by noting that they are acyclic: one does not move from one relationship to another relationship to a third relationship and then back to the original relationship, which would be a set of cyclic relationships.

Department Researcher Project Computer Language
CS Taylor DB PDP-11 C
CS Taylor DB VAX11 PL1
CS Taylor DB VAX11 Fortran
EE Smith DB PDP-11 C
EE Smith DB VAX11 PL1
EE Smith DB VAX11 Fortran
EE Smith Sig Proc VAX11 PL1
EE Smith Sig Proc VAX11 Fortran

By selecting the above button and using single letters of the alphabet to represent the five attributes (A for Department, B for Researcher, C for Project, D for Computer, and E for Language), the entropy of these pairs of variables (B is dependent on A, C on B, D on C, and E on D) may be computed as the joint entropies, minus the single-variable entropies that occur twice in the set of joint entropies:

  Entropy of A,B: 3 log(8/3)/(8 log(2)) + 5 log(8/5)/(8 log(2)) = 0.954
+ Entropy of B,C: log(4)/(4 log(2)) + 3 log(8/3)/(4 log(2)) = 1.56
+ Entropy of C,D: log(4)/(2 log(2)) + 1/2 = 1.50
+ Entropy of D,E: log(4)/(4 log(2)) + 3 log(8/3)/(4 log(2)) = 1.56
- Entropy of B: 3 log(8/3)/(8 log(2)) + 5 log(8/5)/(8 log(2)) = 0.954
- Entropy of C: log(4)/(4 log(2)) + 3 log(4/3)/(4 log(2)) = 0.811
- Entropy of D: log(4)/(4 log(2)) + 3 log(4/3)/(4 log(2)) = 0.811
= 3.00 bits

This equals the joint entropy of the entire database, the entropy of ABCDE: log(8)/log(2) = 3.00 bits. Thus each of the pairs described earlier can be joined together to produce the full table.

Failure of Pairs to Form a Decomposition Pair. The following table may be used to examine potential problems with decomposition pairs:

Agent Company Product
Smith Ford car
Smith Ford truck
Smith GM car
Smith GM truck
Jones Ford car
Jones Ford truck
Brown Ford car
Brown GM car
Brown Toyota car
Brown Toyota bus

Below, the letter "A" is used to denote Agent, "C" to denote Company, and "P" to denote Product.

By pressing the above button to see the analysis of the above table, one can attempt to decompose the table into two smaller tables, both with key C. In this case, the entropies of the components exceed the entropy of the combined table, suggesting that the decomposition is inappropriate:

  Entropy of AC: log(10)/(5 log(2)) + log(5)/(5 log(2)) + 3 log(10/3)/(5 log(2)) = 2.17
+ Entropy of CP: 2 log(5)/(5 log(2)) + 3 log(5/3)/(5 log(2)) = 1.37
- Entropy of C: log(5)/(5 log(2)) + 4 log(5/4)/(5 log(2)) = 0.722
= 2.82
> Entropy of ACP: log(10)/(2 log(2)) + log(5)/(5 log(2)) + 3 log(10/3)/(10 log(2)) = 2.65

Similarly, decomposing with key P results in components whose entropies exceed the joint entropy:

  Entropy of CP: 2 log(5)/(5 log(2)) + 3 log(5/3)/(5 log(2)) = 1.37
+ Entropy of AP: log(10)/(5 log(2)) + 2 log(5)/(5 log(2)) + 2 log(5/2)/(5 log(2)) = 2.12
- Entropy of P: log(5)/(5 log(2)) + 4 log(5/4)/(5 log(2)) = 0.722
= 2.77
> Entropy of ACP: log(10)/(2 log(2)) + log(5)/(5 log(2)) + 3 log(10/3)/(10 log(2)) = 2.65

Thus the pairs mentioned cannot be joined as suggested to produce the full table. Relations may be placed into fifth normal form when more redundancy is removed than when databases are placed into fourth normal form. Note that the following tables extract information from the table above, yet have less redundancy, since each has fewer relations than those found in the table above.

Agent Company
Smith Ford
Smith GM
Jones Ford
Brown Ford
Brown GM
Brown Toyota
and
Company Product
Ford car
Ford truck
GM car
GM truck
Toyota car
Toyota bus
and
Agent Product
Smith car
Smith truck
Jones car
Jones truck
Brown car
Brown bus

Note that no two of these pairwise tables, joined together, reproduce all the information in the original table with three attributes.

Indexing & Text Mining

Indexing is the generation of terminology that can serve as access points for documents when searching for topics within the documents. The index terms may be either natural language terms or phrases, or they may be synthesized identifiers. Natural language terms may be relatively easy to generate and search for by speakers of the language. Index terms may be described as uncontrolled vocabulary, sometimes known as a folksonomy, where any term can serve as an index term, or index terms may be in a controlled vocabulary, where only a subset of the possible index terms is allowed. Controlled vocabularies often have only one term per conceptual topic, and searching for an assigned controlled vocabulary term will result in finding all items addressing the topic, assuming that vocabulary terms are assigned in a systematic and effective manner. Placing all documents with terms such as cat, cats, kittens, or kitty under the controlled vocabulary term cat brings together items on the same topic, regardless of which of a number of synonyms the author chose to use.

Information in Less Common Features

Index terms may be identified based on their rarity. Many of these terms will not occur in a randomly selected document, only occurring in documents specifically on the topic of the term. One might estimate a probability for a term by dividing the term frequency by the maximum number of occurrences of any term, often the frequency of the term 'the' in English language documents, and then compute the self-information of that probability. Common English language terms such as 'the', 'and', and 'of' will not serve well as index terms. Less common terms such as cat or dog might better serve as index terms.

Indexing terms in a document by their rarity could serve as an automated indexing process that does not require knowledge of the semantic meaning of terminology. Thus, taking the terms with the most self-information as index terms might serve as an automatic indexing approach. In the calculator below, each term is in quotes and each document contains a list of these terms, with the overall structure being a list of these document lists. Press the Index Information button below to see some terms and their self-information in the set of document titles.

Note that some words such as 'adventures', 'puss', and 'gone' are high information terms, given the documents here. The two terms that have the lowest indexing value are 'in' and 'the'. Using this method and selecting the best few terms that occur in a document can be automated, providing a form of automatic indexing.

The above method is often referred to as Inverse Document Frequency weighting, as the higher the term frequency, the lower the term's weight.
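A rough Python sketch of this rarity-based weighting, estimating a term's probability as its frequency divided by the frequency of the most common term and taking the self-information of that probability (an approximation of the approach described above, using hypothetical title word lists that echo the terms mentioned; the page's calculator and document set may differ):

import math
from collections import Counter

def rarity_weights(documents):
    # documents: a list of documents, each a list of term strings.
    counts = Counter(term for doc in documents for term in doc)
    max_count = max(counts.values())
    # Weight = -log2(term frequency / frequency of the most common term).
    return {term: -math.log2(c / max_count) for term, c in counts.items()}

titles = [['the', 'cat', 'in', 'the', 'hat'],
          ['gone', 'with', 'the', 'wind'],
          ['the', 'adventures', 'of', 'puss', 'in', 'boots']]
weights = rarity_weights(titles)
for term in ('the', 'in', 'adventures', 'puss'):
    print(term, round(weights[term], 2))   # rarer terms receive the higher weights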

Luhn's Model of Term Occurrences

While the above indexing technique rewards rarity of terms and is commonly used in automatic indexing systems, it may have weaknesses. Luhn suggested in the 1950s that mid-frequency terms might be the best index terms. Common words should be rejected as index terms, but rare terms may also be so uncommon that few people use them and thus they wouldn't be effective index terms. Mid-frequency terms have an advantage here.

Rare terms may also be misspellings of uncommon term variants. The reader is invited to select a publisher's name that they think might be misspelled. Look up this name under the proper spelling using an Internet search engine (it will be in many web page bibliographies) and search for the name with misspellings that the reader thinks might be made by individuals. The reader might try "Morgan Kaufmann," "Morgan Kaufman," "Morgan Koufmann," and so forth.

The Luhn approach may be approximated by assigning the highest index weights to terms that are mid-frequency. This may be accomplished by computing the information as before, but where the frequency used is based on the absolute value of the difference between the middle frequency and the frequency of the term in question, ranking terms with frequencies closest to the middle ahead of those nearer the highest and lowest frequencies. This provides a probability other than the one associated with the raw term frequency.

Selecting the button below, the mid-frequency terms receive the highest weights:

with high mid-frequency terms being cat, wind, and hat.
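One possible Python sketch of such a mid-frequency ranking (one way to implement the idea, with hypothetical titles; the page's calculator may use a different formula): terms whose frequencies lie closest to the middle frequency are ranked first.

from collections import Counter

def mid_frequency_ranking(documents):
    # documents: a list of documents, each a list of term strings.
    counts = Counter(term for doc in documents for term in doc)
    mid = (max(counts.values()) + min(counts.values())) / 2
    # Rank terms by how close their frequency is to the middle frequency.
    return sorted(counts, key=lambda term: abs(counts[term] - mid))

titles = [['the', 'cat', 'in', 'the', 'hat'],
          ['the', 'cat', 'and', 'the', 'wind'],
          ['a', 'hat', 'in', 'the', 'wind']]
print(mid_frequency_ranking(titles)[:3])   # the three most mid-frequency terms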

Automatic Indexing Exercises

Exercise -- Automatic Indexing 1. Pressing the button above calculates a Luhn-like weight. Enter 7 book or article titles that address the same topic. When you are done, what are the index terms selected by the mid-frequency weighting? How do these compare to the terms assigned as library subject headings by a library or by an indexing service to the article or book?

Signal to Noise Ratios and Indexing

The signal to noise ratio for a term may be measured across documents as the frequency in a document divided by the average frequency. This measure is effective when the term frequencies are large enough to produce statistical significance but is less effective when working with very small frequencies, those terms that have the highest Inverse Document Frequency weighting. The signal to noise ratio is often computed as the average frequency divided by the standard deviation of the frequencies for the term in question.
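A small Python sketch of the mean-over-standard-deviation form of this ratio for a term's frequencies across documents (an illustration with hypothetical frequencies; a term with constant frequency gives a zero standard deviation and is guarded against here):

import statistics

def signal_to_noise(frequencies):
    # Ratio of the average frequency to the standard deviation of the frequencies.
    mean = statistics.mean(frequencies)
    spread = statistics.pstdev(frequencies)
    return float('inf') if spread == 0 else mean / spread

# Hypothetical per-document frequencies for two terms across five documents.
print(round(signal_to_noise([3, 4, 3, 4, 3]), 2))   # steady term: high ratio
print(round(signal_to_noise([0, 0, 9, 0, 0]), 2))   # bursty term: low ratio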

Press the button above to determine the signal to noise ratios for terms in each document. Indexing with these lists of signal to noise values for each document might begin with selecting the n highest signal to noise ratio terms for each document. Using a larger dataset of terms might allow one to select all terms with a value over an arbitrarily chosen constant.

Indexing with Signal to Noise Ratio Exercise

Exercise -- Signal to Noise 1. Using the button above, enter 5 document titles for non-fiction documents that you use to help in learning. Given the signal to noise ratios for the terms in these documents, what are the best 3 index terms to use to represent and allow one to differentiate between these documents? Could you use just these 3 terms to index the 5 documents?

Exercise -- Signal to Noise 2. Consider using the signal to noise methodology described above. What could you do to modify the formula to allow the formula to have higher weights for material you prefer than for the normal signal to noise ratio?

References

  1. Fagin, Ronald. "Multivalued Dependencies and a New Normal Form for Relational Databases." ACM Transactions on Database Systems 2(3), September 1977, 262-278.
  2. Kent, William. "A Simple Guide to Five Normal Forms in Relational Database Theory." Communications of the ACM 26(2), February 1983, 120-125.
  3. Lee, Tony T. "An Information-Theoretic Analysis of Relational Databases--Part I: Data Dependencies and Information Metric." IEEE Transactions on Software Engineering SE-13(10), October 1987, 1049-1061.
  4. Losee, Robert. Information From Processes: About the Nature of Information Creation, Use, and Representation. Springer Verlag, 2012.
Copyright 2017 ©