Sunday, September 12, 2010

Entropy of Joint Distribution

Reading Elements of Information Theory.

With respect to my research, I am only accountable to myself. I tend to flit between topics as the mood or opportunity strikes me. A few weeks ago, the library got a copy of the book linked above. I've been wanting to read it for years, so I checked it out and started reading.

Chapter 1 was a high level overview of the book.

Chapter 2 covers entropy, joint entropy, conditional entropy, relative entropy, mutual information, and conditional relative entropy. It includes a bunch of theorems (and chain rules) related to these values, as well as relationships between them that can be used to determine properties and bounds of a distribution. There is also a section about the number of samples you need to accurately represent a distribution. This is something that I would have found quite useful years ago (Dieter Fox uses KLD-Sampling to do this sort of thing, but I never quite understood KLD-Sampling).

The chapter focuses on discrete distributions, but then mixes in continuous numerical distributions when it talks about convex and concave functions (if you have a distribution over unordered colors, can it be concave or convex?).

I've read through the entire chapter, but there is a lot to digest. I also do not have practical applications for many of the relationships, which makes it difficult for me to learn the material well.

As always, I'm writing code based on the book. I do not know what I'm going to do for the relationships.

The main structure I am using so far is a discrete joint distribution P(X,Y). This is simply a 2D array of floating point values; I sum the values and use the total as a normalizing divisor.

P(X,Y) =
    P(X0,Y0)  P(X1,Y0)  P(X2,Y0)
    P(X0,Y1)  P(X1,Y1)  P(X2,Y1)
    P(X0,Y2)  P(X1,Y2)  P(X2,Y2)
    P(X0,Y3)  P(X1,Y3)  P(X2,Y3)
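
Here is a minimal sketch of that structure (Python with numpy, purely for illustration; the entries are the unnormalized counts from the worked example further down):

import numpy as np

# Unnormalized joint weights: rows index Y (Y0..Y3), columns index X (X0..X2).
counts = np.array([[1.0, 2.0, 4.0],
                   [1.0, 2.0, 2.0],
                   [1.0, 1.0, 2.0],
                   [1.0, 1.0, 1.0]])

# Divide by the total so the entries form a proper joint distribution P(X,Y).
pxy = counts / counts.sum()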

The Joint Entropy of P(X,Y) (labeled as H(X,Y)) is the same as the Entropy of P(U) = { P(X0,Y0), P(X1,Y0), P(X2,Y0), P(X0,Y1) ... P(X2,Y3)}. H(X,Y) = H(U).
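
Continuing the sketch above, that identity is easy to check (I'm using log base 2, so entropy comes out in bits):

def entropy(p):
    # Entropy in bits of a flat array of probabilities; zero entries contribute nothing.
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# H(X,Y) computed by summing over the 2D table directly...
h_xy = -sum(p * np.log2(p) for p in pxy.flat if p > 0)

# ...equals H(U), the entropy of the flattened distribution U.
h_u = entropy(pxy.flatten())
assert abs(h_xy - h_u) < 1e-12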

I think that this implies that any distribution P(U) can be refactored into a joint distribution P(X,Y). Maximizing or minimizing some entropy calculation might reveal a useful hidden variable, either as independent components or as a smaller domain that is highly predictive of the other domain.

Chapter 2 talks a lot about the Conditional Entropy H(Y|X) of P(X,Y). Recall that P(X,Y) = P(Y|X) * P(X). In information space, this becomes H(X,Y) = H(Y|X) + H(X). So H(Y|X) = H(X,Y) - H(X).

The conditional entropy is the information in the joint distribution minus the information in one component of the joint distribution.

In the definition of P(X,Y) above, we can calculate P(X) by summing the columns (and P(Y) by summing the rows).
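
In the numpy sketch, where rows index Y and columns index X, those marginals are just axis sums:

# Marginals of the joint: sum out the other variable.
px = pxy.sum(axis=0)  # sum down each column -> P(X), one entry per value of X
py = pxy.sum(axis=1)  # sum across each row  -> P(Y), one entry per value of Y

# Both marginals are proper distributions.
assert abs(px.sum() - 1.0) < 1e-12 and abs(py.sum() - 1.0) < 1e-12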

              X0   X1   X2 | P(Y)
P(X,Y):  Y0    1    2    4 |   7
         Y1    1    2    2 |   5
         Y2    1    1    2 |   4
         Y3    1    1    1 |   3
         ------------------+-----
         P(X)  4    6    9 | sum = 19

H(Y|X) = H(X,Y) - H(X) = H(X,Y) - (-4/19*log(4/19) - 6/19*log(6/19) - 9/19*log(9/19))
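
In code (still assuming the numpy sketch above, everything in bits), the chain-rule version can be cross-checked against the direct definition H(Y|X) = sum over x of P(x) * H(Y|X=x):

h_x = entropy(px)              # H(X) = -4/19*log(4/19) - 6/19*log(6/19) - 9/19*log(9/19)
h_xy = entropy(pxy.flatten())  # H(X,Y)
h_y_given_x = h_xy - h_x       # chain rule: H(Y|X) = H(X,Y) - H(X)

# Direct definition: weight the entropy of each column (P(Y|X=x)) by P(x).
direct = sum(px[i] * entropy(pxy[:, i] / px[i]) for i in range(pxy.shape[1]))
assert abs(h_y_given_x - direct) < 1e-12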

That gets me up to section 2.3. I'll write more as I slowly grind my way through.
