Improving metrics gathering
jspaleta at gmail.com
Fri Feb 5 11:49:55 UTC 2010
On Thu, Feb 4, 2010 at 7:16 AM, Matt Domsch <matt at domsch.com> wrote:
> One thing that's painfully obvious is that the "Unique IP addresses"
> method of counting the number of installations  is woefully
> under-counting the actual number of installs.
How is it obvious? How do you know that a significant chunk of the IP
addresses showing up are roaming systems? This computer I'm on right
now has checked for updates from no less than 10 different networks this
I'm going to counter all of this by saying for the purpose of global
or regional map making... does getting more accurate numbers matter or
do you expect the undercounting factor to have a regional bias that is
skewing the relative client densities for one region compared to
another on the global map? Exact numbers are nice...but do you need
We aren't going to get an exact number for userbase ever. I'd be more
interested in standing up a correction factor with an error bar that
can be used in a statistically significant way to get from the numbers
we do have to an estimate of active userbase. My first cut at doing
that involved comparing the rate of growth of smolt UUIDs to the rate
of growth of unique IPs over a 16 month period. I wouldn't call what I
saw a huge undercount in unique IPs. My method pegs the correction
factor at about 1.15 with a stdev of 0.03... or to say that in English,
we are under-counting by about 15% globally. I never found the
time to go back and check to see if that factor varied significantly
region by region.
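
For anyone who wants to poke at this, here is a rough sketch of one
possible reading of that comparison. It assumes two equal-length series
of counts sampled at the same times (say monthly), takes the fractional
growth of each over every interval, and reports the mean and stdev of
the ratio. The function name, the inputs, and the choice of fractional
growth are all assumptions made for illustration; this is not
necessarily the exact calculation behind the 1.15 figure.

```python
from statistics import mean, stdev


def correction_factor(uuid_counts, ip_counts):
    """Return (mean, stdev) of per-interval growth-rate ratios.

    uuid_counts and ip_counts are hypothetical inputs: equal-length
    sequences of counts (smolt UUIDs and unique IPs) sampled at the
    same times. Growth is taken as fractional growth per interval,
    which is an assumption about how the comparison is done.
    """
    if len(uuid_counts) != len(ip_counts):
        raise ValueError("series must have the same length")

    ratios = []
    for i in range(1, len(uuid_counts)):
        uuid_growth = (uuid_counts[i] - uuid_counts[i - 1]) / uuid_counts[i - 1]
        ip_growth = (ip_counts[i] - ip_counts[i - 1]) / ip_counts[i - 1]
        if ip_growth <= 0:
            # Skip intervals where the IP count did not grow; the
            # ratio is undefined or meaningless there.
            continue
        ratios.append(uuid_growth / ip_growth)

    if len(ratios) < 2:
        raise ValueError("need at least two usable intervals for a stdev")
    return mean(ratios), stdev(ratios)
```

The stdev here only describes the scatter of the per-interval ratios.
Checking for a regional bias would mean running the same thing on
per-region series and comparing the results.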
So you've done the frequency analysis. Have you gone further and,
assuming an update request cadence for a given client, worked out what
the weighted adjustment of the long tail looks like in aggregate?
Assuming every client is the same and checks in X number of times per
day... on average what is the number of such clients per IP
address? You should be able to determine that number by
integrating over the histogram of the number of IP addresses binned by
connections per day, dividing by the number of IP addresses seen that
day and dividing by whatever X you choose. You'd have to convince me
with some math and some plots that the long tail is really dragging
things off by a large factor.
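
A minimal sketch of that calculation, assuming the histogram is
available as a mapping from connections per day to the number of IP
addresses seen with that many connections. "Integrating over the
histogram" is read here as the connection-weighted sum, i.e. the total
number of connections that day; that reading, the function name, and
the data layout are assumptions for illustration only.

```python
def clients_per_ip(histogram, checkins_per_client_per_day):
    """Estimate the average number of clients behind each IP address.

    histogram: dict mapping connections-per-day -> number of IP
        addresses seen with that many connections (hypothetical layout).
    checkins_per_client_per_day: the assumed cadence X, identical for
        every client.
    """
    if checkins_per_client_per_day <= 0:
        raise ValueError("X must be positive")

    total_ips = sum(histogram.values())
    if total_ips == 0:
        raise ValueError("histogram is empty")

    # Connection-weighted integral of the histogram: total connections.
    total_connections = sum(conns * n_ips for conns, n_ips in histogram.items())

    # Average connections per IP, divided by the per-client cadence X.
    return total_connections / total_ips / checkins_per_client_per_day
```

With X = 1, a result close to 1 would say the long tail is not moving
the aggregate much; a result well above 1 would be the evidence for a
large undercount.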