[time 289] Re: [time 288] Fisher Information


Hitoshi Kitada (hitoshi@kitada.com)
Sun, 9 May 1999 11:48:23 +0900


Dear Stephen,

A miscellaneous remark...

----- Original Message -----
From: Stephen P. King <stephenk1@home.com>
To: <time@kitada.com>
Sent: Saturday, May 08, 1999 10:47 PM
Subject: [time 288] Fisher Information

> On Mon, 26 Apr 1999, Stephen Paul King wrote:
>
> > Could someone give a definition of Fisher Information that a
> > mindless philosopher would understand? :)
>
> Let me start with a couple of intuitive ideas which should help to
> orient you. Fisher information is related to the notion of information
> (a kind of "entropy") developed by Shannon in 1948, but it is not the
> same. Roughly speaking:
>
> 1. Shannon entropy is the volume of a "typical set"; Fisher information
> is the area of a "typical set",
>
> 2. Shannon entropy is allied to "nonparametric statistics"; Fisher
> information is allied to "parametric statistics".
>
> Now, for the definition.
>
> Let f(x,t) be a family of probability densities parametrized by t.
> For example,
>
> f(x,t) = (1/t) exp(-x/t), on x >= 0
>
> In parametric statistics, we want to estimate which t gives the best
> fit to a finite data set, say of size n. An estimator is a function
> from n-tuple data samples to the set of possible parameter values,
> e.g. (0,infty) in the example above. Given an estimator, its bias, as
> a function of t, is the difference between the expected value (as we
> range over x) of the estimator, according to the density f(.,t), and
> the actual value of t. The variance of the estimator, as a function of
> t, is the expectation (as we range over x), according to f(.,t), of
> the squared difference between the value of the estimator and its
> expected value; when the bias vanishes, this is the same as the mean
> squared difference between the estimator and t. If the bias vanishes
> (in this case the estimator is called unbiased), the variance will
> usually still be a positive function of t. It is natural to try to
> minimize the variance over the set of unbiased estimators defined for
> a given family of densities f(.,t).
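
As a concrete illustration of the bias/variance bookkeeping above, here is
a minimal sketch in Python with numpy; the choice of the sample mean as the
estimator and the sample sizes are illustrative assumptions, not part of
the quoted text.

    import numpy as np

    rng = np.random.default_rng(0)

    t_true = 2.0      # the parameter we pretend not to know
    n = 50            # size of each data set
    trials = 20000    # number of simulated data sets

    # Draw data sets from f(x,t) = (1/t) exp(-x/t), i.e. an exponential
    # distribution with mean t.
    samples = rng.exponential(scale=t_true, size=(trials, n))

    # Estimator: the sample mean (unbiased for t in this family).
    estimates = samples.mean(axis=1)

    bias = estimates.mean() - t_true   # should be close to 0
    variance = estimates.var()         # should be close to t^2/n
    print("bias     ~", bias)
    print("variance ~", variance, "   t^2/n =", t_true**2 / n)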
>
> Given a family of densities, the score is the logarithmic derivative
>
> V(x,t) = d/dt log f(x,t) = [d/dt f(x,t)] / f(x,t)
>
> (We are now tacitly assuming some differentiability properties of our
> parametrized family of densities.) The mean of the score (as we
> average over x) is always zero. The Fisher information is the variance
> of the score:
>
> J(t) = expected value of the square of V(x,t) as we vary x
>
> Notice this is a function of t defined in terms of a specific
> parametrized family of densities. (Of course, the definition is
> readily generalized to more than one parameter.)
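
For the exponential family above, log f(x,t) = -log t - x/t, so the score
is V(x,t) = -1/t + x/t^2 and hence J(t) = Var(x)/t^4 = 1/t^2. A minimal
numerical check in Python with numpy (the particular value of t is an
arbitrary choice for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    t = 2.0

    # Score of f(x,t) = (1/t) exp(-x/t):  V(x,t) = -1/t + x/t^2
    def score(x, t):
        return -1.0 / t + x / t**2

    x = rng.exponential(scale=t, size=1_000_000)
    v = score(x, t)

    print("mean of score    ~", v.mean())   # should be near 0
    print("Fisher info J(t) ~", v.var())    # should be near 1/t^2 = 0.25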
>
> The fundamentally important Cramer-Rao inequality says that
>
> variance of any unbiased estimator >= 1/J(t)
>
> Thus, in parametric statistics one wants to find unbiased estimators
> which achieve the optimal variance, the reciprocal of the Fisher
> information. From this point of view, the larger the Fisher
> information, the more precisely one can (using a suitable estimator)
> fit a distribution from the given parametrized family to the data.
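
Continuing the exponential example above (an illustration, not part of the
quoted text): there J(t) = 1/t^2, and for n independent samples the Fisher
information adds up to n/t^2, so the Cramer-Rao bound on the variance of an
unbiased estimator of t is t^2/n. The sample mean has exactly this
variance, so it attains the bound and is an efficient estimator for this
family.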
>
> (Incidentally: someone has mentioned the work of Roy Frieden, who has
> attempted to relate the Cramer-Rao inequality to the Heisenberg
> inequality. See the simple "folklore" theorem (with complete proof)

His "folklore" theorem (and proof) in
http://members.home.net/stephenk1/Outlaw/Frieden.txt is given in Von Neumann's
book "Mathematical Foundations of Quantum Mechanics," 1932, Chapter III,
section 4, and is not a folklore at all, but has been well-known. Chris
Hillman seems a young inexperienced mathematician.

> I posted on a generalized Heisenberg inequality in sci.physics.research a
> few months ago--- you should be able to find it using Deja News.)
>
> This setup is more flexible than might at first appear. For instance,
> given a density f(x), where x is real, define the family of densities
> f(x-t); then the Fisher information is
>
> J(t) = expectation of [d/dt log f(x-t)]^2
>
> = int f(x-t) [d/dt log f(x-t)]^2 dx
>
> By a change of variables, we find that for a fixed density f, this is
> constant in t. In this way, we can change our point of view and define
> a (nonlinear) functional on densities f:
>
> J(f) = int f(x) [f'(x)/f(x)]^2 dx
>
> On the other hand, Shannon's "continuous" entropy is the (nonlinear)
> functional:
>
> H(f) = -int f(x) log f(x) dx
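
Both functionals are straightforward to evaluate numerically. A minimal
sketch in Python with numpy, using a Gaussian density as a convenient test
case (an illustrative choice, not part of the quoted text); for a Gaussian
with standard deviation sigma, analytically J(f) = 1/sigma^2 and
H(f) = (1/2) log(2 pi e sigma^2):

    import numpy as np

    sigma = 1.5
    x = np.linspace(-12.0, 12.0, 200_001)
    dx = x[1] - x[0]
    f = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    score = np.gradient(np.log(f), x)    # f'(x)/f(x)
    J = np.sum(f * score**2) * dx        # int f (f'/f)^2 dx
    H = -np.sum(f * np.log(f)) * dx      # -int f log f dx

    print("J(f) ~", J, "  analytic:", 1 / sigma**2)
    print("H(f) ~", H, "  analytic:",
          0.5 * np.log(2 * np.pi * np.e * sigma**2))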
>
> Suppose that X is a random variable with finite variance and Z is an
> independent normally distributed random variable with zero mean and
> unit variance ("standard noise"), so that X + sqrt(t) Z is another
> random variable with density f_t, representing X perturbed by noise.
> Then de Bruijn's identity says that
>
> J(f_t) = 2 d/dt H(f_t)
>
> and if the limit t -> 0 exists, we have a formula for the Fisher
> information of the density f_0 associated with X.
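
A quick closed-form check of de Bruijn's identity (an illustrative choice
of X, not part of the quoted text): if X is standard normal, then
X + sqrt(t) Z is normal with variance 1 + t, for which
H(f_t) = (1/2) log(2 pi e (1 + t)) and J(f_t) = 1/(1 + t), and indeed
2 d/dt H(f_t) = 1/(1 + t). In Python:

    import numpy as np

    # X ~ N(0,1), Z standard noise:  X + sqrt(t) Z ~ N(0, 1 + t)
    def H(t):                     # differential entropy of f_t
        return 0.5 * np.log(2 * np.pi * np.e * (1 + t))

    t, dt = 0.5, 1e-6
    lhs = 1 / (1 + t)                              # J(f_t) for N(0, 1 + t)
    rhs = 2 * (H(t + dt) - H(t - dt)) / (2 * dt)   # 2 d/dt H(f_t)
    print("J(f_t) =", lhs, "   2 dH/dt ~", rhs)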
>
> See Elements of Information Theory, by Cover & Thomas, Wiley, 1991,
> for details on the above and for general orientation to the enormous
> body of ideas which constitutes modern information theory, including
> typical sets and comment (1) above. Then see some of the many other
> books which cover Fisher information in more detail. In one of the
> books by J. N. Kapur on maximum entropy you will find a particularly
> simple and nice connection between the multivariable Fisher
> information and Shannon's discrete "information" (arising from the
> discrete "entropy" -sum p_j log p_j).
>
> (Come to think of it, if you search under my name using Deja News you
> should find a previous posting of mine in which I gave considerable
> detail on some inequalities which are closely related to the
> area-volume interpretations of Fisher information and entropy. If
> you've ever heard of Hadamard's inequality on matrices, you should
> definitely look at the discussion in Cover & Thomas.)
>
> Hope this helps!
>
> Chris Hillman
>

Best wishes,
Hitoshi



This archive was generated by hypermail 2.0b3 on Sun Oct 17 1999 - 22:10:31 JST