On Fri, 17 Oct 2014, Ranjan Maitra wrote:
> > What I mean is that R has the capability of generating PDFs, and R has
> > the capability of calculating various goodness of fit measures, but if
> > you want to check goodness of fit measures against, say, 50 PDFs, then
> > you have to write the package. It's easier for me to use easyfit than
> > write the package.
>
> Never having heard of "easyfit" before now, I guess I am confused as to
> what you mean when you say fitting a PDF. What is the form of the PDFs
> that you want to fit? It is very unusual to want to fit 50 different
> parametric PDFs, unless what you mean is something totally different. In
> that case, have you considered going the (nonparametric) density
> estimation route?
>
> Many thanks,
> Ranjan
Well, this isn't really a Fedora thing, but since I think it's interesting
I'll impose a little longer.
Here's the problem. Say you have a set of data and you want to characterize
it in order to use it as the basis of a model. To do that, you really need to
know the underlying PDF.
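For what it's worth, the easyfit-style workflow -- fit a pile of candidate
distributions and rank them by goodness of fit -- is only a few lines in
Python with scipy. This is just a sketch with made-up data and a handful of
candidates, not the easyfit tool itself, and note that K-S p-values are
optimistic when the parameters were fitted to the same sample:

```python
# Fit several candidate distributions to a sample and rank them by the
# Kolmogorov-Smirnov statistic (smaller = better fit). Illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=200)  # stand-in for observed data

candidates = ["norm", "uniform", "gamma", "lognorm", "expon"]
results = []
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)  # maximum-likelihood parameter estimates
    ks_stat, p_value = stats.kstest(data, name, args=params)
    results.append((name, ks_stat, p_value))

# Best-fitting candidate first
for name, ks, p in sorted(results, key=lambda r: r[1]):
    print(f"{name:8s}  KS={ks:.4f}  p={p:.3f}")
```

Extending the candidate list to 50 distributions is just a longer list; the
loop doesn't change.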
Here are two simple examples that I've run into in the past couple of years. I'm
a forensic pathologist, and investigate unnatural death. One common problem in the field
is the issue of abusive head trauma -- can you tell from the injuries on a child that the
injuries *must* have been inflicted by another person, or could they have come from some
sort of accident?
There has been a great deal of biomechanical modeling around this issue. Some of
these models are based on physical measurements of the amount of force it takes to
fracture the skull of a child. One very commonly cited study actually uses a very
small data set of donated skulls. The data is reported as if it were Gaussian, but in
fact, if you look at it, it is a uniform distribution.
It's a uniform distribution because the investigators took one or two skulls from
infants of varying ages -- so what they are really measuring is the change in skull
properties over time. It's as if they did a study of "average human height"
and took one sample at each month of age from birth to 3 years.
In situations like this, it's important to see and understand the underlying PDF,
because the data are then used *as if they were Gaussian* to create biomechanical models.
And it's wrong to do that -- it's wrong to apply the "average" and
"standard deviation" of the heights of people from birth to 3 years as the supposed
"average" height of a newborn baby. If you look at the distribution, the error
becomes obvious.
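You can see the pitfall with a toy simulation -- hypothetical numbers, not the
skull study's data. One measurement per month of age, where the property grows
with age: the pooled sample comes out close to uniform, and the pooled
mean/SD say almost nothing about a newborn.

```python
# Pooled-ages pitfall: one sample per age, property increasing with age.
# The pooled distribution is near-uniform, not Gaussian, and the pooled
# summary statistics badly misrepresent any single age.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ages_months = np.arange(0, 36)  # birth to 3 years
height = 50.0 + 1.5 * ages_months + rng.normal(0, 0.5, ages_months.size)

pooled_mean, pooled_sd = height.mean(), height.std()
print(f"pooled mean = {pooled_mean:.1f}, pooled sd = {pooled_sd:.1f}")
print(f"newborn value = {height[0]:.1f}")  # far below the pooled mean

# K-S fits: the pooled data is described at least as well by a uniform
# distribution as by a normal one.
for name in ("uniform", "norm"):
    dist = getattr(stats, name)
    params = dist.fit(height)
    ks, p = stats.kstest(height, name, args=params)
    print(f"{name:8s} KS={ks:.3f} p={p:.3f}")
```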
A second example occurred when a group attempted to apply Benford's Law to look for
bias in manner of death determination in forensic death investigation. The investigators
looked at the number of homicides, suicides, accidents, and natural deaths in their
jurisdiction each month over a period of a couple of years, and it *seemed* as if it
followed Benford's Law.
However, it was an artifact of their workload. My office has about twice their caseload,
so the same counts are scaled by two and the first digits shift accordingly. The
distribution is really a pretty simple Gaussian, and such distributions tend not to
follow Benford's Law. Thus, knowing that the distribution fits a normal well is an
argument that manner determination should *not* follow Benford's Law. If, however, the
data fit something like a gamma distribution well, that would *not* argue against the
applicability of Benford's Law.
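The scale dependence is easy to demonstrate. Here's a sketch with made-up
monthly counts (not real caseload data): counts drawn from a normal
distribution cluster around the mean, so their first digits cluster too, and
doubling the workload changes which digit dominates -- whereas Benford's Law
predicts the same digit frequencies at any scale.

```python
# First digits of normally distributed monthly counts, at two workloads.
# A normal distribution's first digits depend on the mean (the scale);
# Benford's Law predicts scale-free frequencies (digit 1 ~30.1%).
import numpy as np

def first_digit(x):
    return int(str(int(abs(x)))[0])

rng = np.random.default_rng(1)
counts = np.clip(rng.normal(loc=15, scale=3, size=1000), 1, None)

freqs = {}
for scale in (1, 2):  # my-office workload is roughly twice theirs
    digits = [first_digit(c * scale) for c in counts]
    freqs[scale] = np.bincount(digits, minlength=10)[1:10] / len(digits)
    print(f"workload x{scale}: digit 1-9 frequencies", np.round(freqs[scale], 2))

benford = np.log10(1 + 1 / np.arange(1, 10))
print("Benford's Law:      ", np.round(benford, 2))
```

At the original workload nearly all counts fall in the teens, so digit 1
dominates far beyond Benford's 30%; double the workload and digit 1 almost
vanishes. A Benford-compliant process wouldn't care about the doubling.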
billo