{{Unreferenced|date=September 2012}}[[Image:Entropy-mutual-information-relative-entropy-relation-diagram.svg|thumb|A simple [[information diagram]] illustrating the relationships among some of Shannon's basic quantities of information.]]
The [[Information theory|mathematical theory of information]] is based on [[probability theory]] and [[statistics]], and measures information with several '''quantities of information'''. The choice of logarithmic base in the following formulae determines the [[units of measurement|unit]] of [[information entropy]] that is used. The most common unit of information is the [[bit]], based on the [[binary logarithm]]. Other units include the [[nat (information)|nat]], based on the [[natural logarithm]], and the [[deciban|hartley]], based on the base 10 or [[common logarithm]].

In what follows, an expression of the form <math>p \log p \,</math> is considered by convention to be equal to zero whenever ''p'' is zero. This is justified because <math>\lim_{p \rightarrow 0+} p \log p = 0</math> for any logarithmic base.
==Self-information==
Shannon derived a measure of information content called the '''[[self-information]]''' or '''"surprisal"''' of a message ''m'':

: <math> I(m) = \log \left( \frac{1}{p(m)} \right) = - \log( p(m) ) \, </math>

where <math>p(m) = \mathrm{Pr}(M=m)</math> is the probability that message ''m'' is chosen from all possible choices in the message space <math>M</math>. The base of the logarithm only affects a scaling factor and, consequently, the units in which the measured information content is expressed. If the logarithm is base 2, the measure of information is expressed in units of [[bit]]s.
Information is transferred from a source to a recipient only if the recipient of the information did not already have the information to begin with. Messages that convey information that is certain to happen and already known by the recipient contain no real information. Infrequently occurring messages contain more information than more frequently occurring messages. This fact is reflected in the above equation: a certain message, i.e. one of probability 1, has an information measure of zero. In addition, a compound message of two (or more) unrelated (or mutually independent) messages has a quantity of information that is the sum of the measures of information of each message individually. That fact is also reflected in the above equation, supporting the validity of its derivation.

An example: The weather forecast broadcast is: "Tonight's forecast: Dark. Continued darkness until widely scattered light in the morning." This message contains almost no information. However, a forecast of a snowstorm would certainly contain information, since such an event does not happen every evening. There would be an even greater amount of information in an accurate forecast of snow for a warm location, such as [[Miami]]. The amount of information in a forecast of snow for a location where it never snows (an impossible event) is the highest (infinity).
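As an illustration of these properties, the following Python sketch (with probabilities chosen only for this example) computes the self-information of a few messages in bits and checks that the surprisal of a pair of independent messages is the sum of their individual surprisals.

<syntaxhighlight lang="python">
from math import log2

def self_information(p):
    """Self-information (surprisal) in bits of a message with probability p."""
    return log2(1 / p)

# Illustrative probabilities, chosen only for this example
print(self_information(1.0))       # 0.0 bits: a certain message carries no information
print(self_information(0.5))       # 1.0 bit
print(self_information(1 / 1024))  # 10.0 bits: rarer messages carry more information

# For two independent messages, the surprisal of the pair is the sum of the surprisals
p, q = 0.5, 0.25
assert abs(self_information(p * q) - (self_information(p) + self_information(q))) < 1e-9
</syntaxhighlight>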
==Entropy==
The '''[[information entropy|entropy]]''' of a discrete message space <math>M</math> is a measure of the amount of '''uncertainty''' one has about which message will be chosen. It is defined as the [[expected value|average]] self-information of a message <math>m</math> from that message space:

:<math> H(M) = \mathbb{E} \{I(M)\} = \sum_{m \in M} p(m) I(m) = -\sum_{m \in M} p(m) \log p(m),</math>
where <math>\mathbb{E} \{ \cdot \}</math> denotes the [[expected value]] operation.
An important property of entropy is that it is maximized when all the messages in the message space are equiprobable, i.e. <math>p(m) = 1/|M|</math>. In this case <math>H(M) = \log |M|</math>.
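The following Python sketch (with a distribution chosen only for this example) computes the entropy of a four-message space and checks that the uniform distribution attains the maximum <math>\log |M|</math>.

<syntaxhighlight lang="python">
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)  # convention: 0 log 0 = 0

# Illustrative distribution over four messages, chosen only for this example
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits

# The uniform distribution over the same four messages maximizes the entropy
print(entropy([0.25] * 4))                 # 2.0 bits = log2(4)
</syntaxhighlight>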
Sometimes the function ''H'' is expressed in terms of the probabilities of the distribution:

:<math>H(p_1, p_2, \ldots , p_k) = -\sum_{i=1}^k p_i \log p_i,</math> where each <math>p_i \geq 0</math> and <math> \sum_{i=1}^k p_i = 1. </math>
An important special case of this is the '''[[binary entropy function]]''':

:<math>H_\mbox{b}(p) = H(p, 1-p) = - p \log p - (1-p)\log (1-p).\,</math>
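As a sketch (with values chosen only for this example), the binary entropy function can be evaluated directly; it is maximized at ''p'' = 1/2 and vanishes at ''p'' = 0 and ''p'' = 1.

<syntaxhighlight lang="python">
from math import log2

def binary_entropy(p):
    """Binary entropy function H_b(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: a fair coin toss is maximally uncertain
print(binary_entropy(0.11))  # about 0.5 bits
print(binary_entropy(1.0))   # 0.0 bits: a certain outcome
</syntaxhighlight>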
==Joint entropy==
The '''[[joint entropy]]''' of two discrete random variables <math>X</math> and <math>Y</math> is defined as the entropy of the [[joint distribution]] of <math>X</math> and <math>Y</math>:

:<math>H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y) \,</math>
If <math>X</math> and <math>Y</math> are [[statistical independence|independent]], then the joint entropy is simply the sum of their individual entropies.

(Note: The joint entropy should not be confused with the [[cross entropy]], despite similar notations.)
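The following Python sketch (with marginal distributions chosen only for this example) builds the joint distribution of two independent variables and checks that its entropy equals the sum of the individual entropies.

<syntaxhighlight lang="python">
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Illustrative marginal distributions, chosen only for this example
p_x = [0.5, 0.5]
p_y = [0.25, 0.75]

# For independent X and Y the joint probabilities are products of the marginals,
# so the joint entropy equals H(X) + H(Y).
p_joint = [px * py for px in p_x for py in p_y]
assert abs(entropy(p_joint) - (entropy(p_x) + entropy(p_y))) < 1e-9
print(entropy(p_joint))  # about 1.811 bits
</syntaxhighlight>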
==Conditional entropy (equivocation)==
Given a particular value of a random variable <math>Y</math>, the conditional entropy of <math>X</math> given <math>Y=y</math> is defined as:

: <math> H(X|y) = \mathbb{E}_{{X|Y}} [-\log p(x|y)] = -\sum_{x \in X} p(x|y) \log p(x|y)</math>
where <math>p(x|y) = \frac{p(x,y)}{p(y)}</math> is the [[conditional probability]] of <math>x</math> given <math>y</math>.

The '''[[conditional entropy]]''' of <math>X</math> given <math>Y</math>, also called the '''equivocation''' of <math>X</math> about <math>Y</math>, is then given by:
:<math> H(X|Y) = \mathbb E_Y \{H(X|y)\} = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = \sum_{x,y} p(x,y) \log \frac{p(y)}{p(x,y)}.</math>

A basic property of the conditional entropy is that:

: <math> H(X|Y) = H(X,Y) - H(Y).\,</math>
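This identity can be checked numerically. The sketch below (with a joint distribution chosen only for this example) computes <math>H(X|Y)</math> from its definition and compares it with <math>H(X,Y) - H(Y)</math>.

<syntaxhighlight lang="python">
from math import log2

# Illustrative joint distribution p(x, y) over two binary variables, chosen only for this example
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal distribution of Y
p_y = {}
for (x, y), p in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + p

H_joint = -sum(p * log2(p) for p in p_xy.values())
H_Y = -sum(p * log2(p) for p in p_y.values())

# Conditional entropy from the definition H(X|Y) = sum_{x,y} p(x,y) log( p(y) / p(x,y) )
H_X_given_Y = sum(p * log2(p_y[y] / p) for (x, y), p in p_xy.items())

assert abs(H_X_given_Y - (H_joint - H_Y)) < 1e-9
print(H_X_given_Y)  # about 0.88 bits
</syntaxhighlight>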
==Kullback–Leibler divergence (information gain)==
The '''[[Kullback–Leibler divergence]]''' (or '''information divergence''', '''information gain''', or '''relative entropy''') is a way of comparing two distributions: a "true" [[probability distribution]] ''p'' and an arbitrary probability distribution ''q''. If we compress data in a manner that assumes ''q'' is the distribution underlying some data, when, in reality, ''p'' is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression, or, mathematically,

:<math>D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.</math>

It is in some sense the "distance" from ''q'' to ''p'', although it is not a true [[Metric (mathematics)|metric]] because it is not symmetric.
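The asymmetry can be seen in a direct computation. The Python sketch below (with distributions chosen only for this example) evaluates the divergence in both directions.

<syntaxhighlight lang="python">
from math import log2

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits between two discrete distributions."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative "true" distribution p and model distribution q, chosen only for this example
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl_divergence(p, q))  # about 0.085 extra bits per datum when coding p with a code built for q
print(kl_divergence(q, p))  # about 0.082 bits: the divergence is not symmetric
</syntaxhighlight>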
==Mutual information (transinformation)==
One of the most useful and important measures of information is the '''[[mutual information]]''', or '''transinformation'''. This is a measure of how much information can be obtained about one random variable by observing another. The mutual information of <math>X</math> relative to <math>Y</math> (which represents conceptually the average amount of information about <math>X</math> that can be gained by observing <math>Y</math>) is given by:

:<math>I(X;Y) = \sum_{y\in Y} p(y)\sum_{x\in X} p(x|y) \log \frac{p(x|y)}{p(x)} = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}.</math>
A basic property of the mutual information is that:

: <math>I(X;Y) = H(X) - H(X|Y).\,</math>

That is, knowing ''Y'', we can save an average of <math>I(X; Y)</math> bits in encoding ''X'' compared to not knowing ''Y''. Mutual information is [[symmetric function|symmetric]]:

: <math>I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).\,</math>
Mutual information can be expressed as the average [[Kullback–Leibler divergence]] (information gain) of the [[posterior probability|posterior probability distribution]] of ''X'' given the value of ''Y'' from the [[prior probability|prior distribution]] on ''X'':

: <math>I(X;Y) = \mathbb E_{p(y)} \{D_{\mathrm{KL}}( p(X|Y=y) \| p(X) )\}.</math>

In other words, this is a measure of how much, on average, the probability distribution on ''X'' will change if we are given the value of ''Y''. This is often restated as the divergence of the actual joint distribution from the product of the marginal distributions:

: <math>I(X; Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)).</math>
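These expressions can be checked against each other numerically. The sketch below (reusing an illustrative joint distribution chosen only for this example) computes the mutual information both as <math>H(X) + H(Y) - H(X,Y)</math> and as the divergence of the joint distribution from the product of the marginals.

<syntaxhighlight lang="python">
from math import log2

# Illustrative joint distribution over two binary variables, chosen only for this example
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}   # marginal of X
p_y = {0: 0.6, 1: 0.4}   # marginal of Y

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi_from_entropies = entropy(p_x.values()) + entropy(p_y.values()) - entropy(p_xy.values())

# I(X;Y) as the divergence of the joint distribution from the product of the marginals
mi_from_divergence = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

assert abs(mi_from_entropies - mi_from_divergence) < 1e-9
print(mi_from_entropies)  # about 0.12 bits
</syntaxhighlight>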
Mutual information is closely related to the [[likelihood-ratio test|log-likelihood ratio test]] in the context of contingency tables and the [[multinomial distribution]] and to [[Pearson's chi-squared test|Pearson's χ<sup>2</sup> test]]: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
==Differential entropy==
:''See main article: [[Differential entropy]]''.

The basic measures of discrete entropy have been extended by analogy to [[continuum (mathematics)|continuous]]{{dn|date=May 2012}} spaces by replacing sums with integrals and [[probability mass function]]s with [[probability density function]]s. Although, in both cases, mutual information expresses the number of bits of information common to the two sources in question, the analogy does ''not'' imply identical properties; for example, differential entropy may be negative.
The differential analogues of entropy, joint entropy, conditional entropy, and mutual information are defined as follows:
: <math> h(X) = -\int_X f(x) \log f(x) \,dx </math>

: <math> h(X,Y) = -\int_Y \int_X f(x,y) \log f(x,y) \,dx \,dy</math>

: <math> h(X|y) = -\int_X f(x|y) \log f(x|y) \,dx </math>

: <math> h(X|Y) = \int_Y \int_X f(x,y) \log \frac{f(y)}{f(x,y)} \,dx \,dy</math>

: <math> I(X;Y) = \int_Y \int_X f(x,y) \log \frac{f(x,y)}{f(x)f(y)} \,dx \,dy </math>

where <math>f(x,y)</math> is the joint density function, <math>f(x)</math> and <math>f(y)</math> are the marginal densities, and <math>f(x|y)</math> is the conditional density.
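As a sketch of the continuous case (with the standard deviation chosen only for this example), the differential entropy of a Gaussian density can be approximated by a simple numerical integral and compared with the closed form <math>\tfrac{1}{2}\ln(2 \pi e \sigma^2)</math> nats; for sufficiently small <math>\sigma</math> the same formula gives a negative value, illustrating that differential entropy, unlike discrete entropy, may be negative.

<syntaxhighlight lang="python">
from math import e, exp, log, pi, sqrt

sigma = 2.0  # illustrative standard deviation, chosen only for this example

def f(x):
    """Gaussian probability density with mean 0 and standard deviation sigma."""
    return exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))

# Numerical approximation of h(X) = -integral of f(x) ln f(x) dx over [-20, 20] (about +/- 10 sigma)
dx = 1e-3
h_numeric = -sum(f(k * dx) * log(f(k * dx)) * dx for k in range(-20000, 20000))

h_closed_form = 0.5 * log(2 * pi * e * sigma ** 2)

print(h_numeric, h_closed_form)  # both about 2.112 nats; with sigma = 0.1 the value would be negative
</syntaxhighlight>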
== See also ==
* [[Information theory]]

[[Category:Information theory]]

[[ja:情報量]]