Improving metrics gathering
matt at domsch.com
Thu Feb 4 16:16:09 UTC 2010
I've spent quite a bit of time over the last week fixing up the
scripts that generate Fedora's worldwide user maps, including the
worldwide map for all Fedora versions currently in use, as
determined by yum requests for mirrorlists.
One thing that's painfully obvious is that the "Unique IP addresses"
method of counting the number of installations is woefully
under-counting the actual number of installs. Looking at a single
day's worth of checkins (over 3 million), we see ~40k unique IP
addresses checking in twice a day, another 40k checking in between
4x/day and up to say 20x/day, and then a long tail, fairly evenly
distributed, where a small number of single IPs are checking in up to
2000x/day. It takes quite a bit of effort to cause yum to make that
many mirrorlist requests using a single machine and a single IP
address - but it's highly likely there are 1000-2000 machines behind a
NAT making those requests.
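
To make the per-IP tally above concrete, here is a minimal sketch in
Python. The log format (client IP as the first whitespace-separated
field on each line) and the bucket boundaries are assumptions chosen
for illustration; they are not the actual mirrorlist log layout or the
scripts used to produce the numbers above.

```python
from collections import Counter


def checkins_per_ip(lines):
    """Count mirrorlist requests per client IP.

    Assumption: the client IP is the first whitespace-separated
    field of each log line (hypothetical format).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def bucket(counts):
    """Group IPs by how many times they checked in during the day.

    The bucket boundaries are illustrative only.
    """
    buckets = Counter()
    for n in counts.values():
        if n <= 2:
            buckets["1-2/day"] += 1
        elif n <= 20:
            buckets["3-20/day"] += 1
        else:
            buckets["over 20/day"] += 1
    return buckets
```

An IP that lands in the last bucket is most likely a NAT with many
machines behind it, yet a unique-IP count treats it as one install.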
This just shows that we currently have no way to know, within even a
2-4x margin of error, how many current installs of Fedora there are.
But this number, and its growth (positive or negative), would be
interesting to know, if only it were more accurate.
To this end, I would like to see yum enhanced to provide information
which we can use to more accurately count the number of installed
Fedora systems. This has been discussed before, and documented on the
wiki, but for various reasons has never been acted upon. While I'll
leave the implementation details to the appropriate teams, I think
including some form of UUID in yum mirrorlist queries would be both
appropriate, and safe.
The biggest concern people have with using any UUID in any form is the
"trackability" that is inherent in it. Given enough log data
that includes UUIDs, one could potentially use it to understand
something about a user that they otherwise wouldn't want you to know.
For example: if I have the public IP address and UUID for a system,
and if I have the HTTP/FTP logfiles from _all_ our mirrors, which
include public IP addresses (which I don't have today), I could
potentially guess at which RPMs one system at that IP address has
installed. If there is only one system at that IP address, I'd have
even more certainty as to what they have installed.
Personally, I don't think this is a big problem. Maybe it is. If it
(and even more so) would have huge security, privacy, and other
lawsuit concerns which I just don't hear about. Whatever we do will
have to run past Legal.
For implementation details, I suggest yum create and persist a single
UUID for each installed system. This UUID would be separate from any
smolt UUID. Yum would include this UUID in HTTP requests. Yum would
only provide this UUID when making mirrorlist requests, not when
downloading content (from mirrors or other). All yumlib-using
applications such as PackageKit would then inherit this capability.
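
As a rough sketch of the client side described above, something like
the following could work. Everything specific here is an assumption
made for illustration: the file path passed in, the "uuid" query
parameter name, and the helper names are hypothetical, and this is
not existing yum code. It is written for current Python 3.

```python
import os
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def get_or_create_uuid(path):
    """Return the UUID stored at `path`, creating it on first use.

    `path` is a hypothetical location; where yum would really
    keep this file is left to the implementers.
    """
    try:
        with open(path) as f:
            value = f.read().strip()
        uuid.UUID(value)  # raises ValueError if the file is corrupt
        return value
    except (OSError, ValueError):
        pass  # missing or invalid file: fall through and create one
    value = str(uuid.uuid4())  # random UUID, not derived from hardware
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        f.write(value + "\n")
    return value


def add_uuid_to_mirrorlist_url(url, system_uuid):
    """Append the UUID to a mirrorlist URL's query string.

    Meant to be called for mirrorlist requests only, never for
    package downloads. The "uuid" parameter name is an assumption.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("uuid", system_uuid))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

A random (version 4) UUID is used in the sketch so that the value
carries no information about the machine itself.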
On the back end, Fedora Infrastructure would add capability to log
this UUID for each request, just as it logs mirrorlist requests
today. FI scripts would then use this UUID to count the number of
installed instances more accurately. Systems can get re-installed
(and thus get new yum UUIDs), but over time this would still
provide more accurate trending than we can get today.
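
The back-end counting could look roughly like the sketch below. The
record layout (day, IP, UUID) is a hypothetical parsed form of the
logs, and falling back to the IP address for clients that send no
UUID (older yum versions, for example) is my own assumption about how
a transition period might be handled.

```python
from collections import defaultdict


def unique_installs_per_day(records):
    """Count distinct installs per day.

    `records` is an iterable of (day, ip, system_uuid) tuples, a
    hypothetical parsed form of the mirrorlist logs. `system_uuid`
    may be None for clients that do not send one.
    """
    seen = defaultdict(set)
    for day, ip, system_uuid in records:
        # Prefer the UUID; fall back to the IP when none was sent.
        key = ("uuid", system_uuid) if system_uuid else ("ip", ip)
        seen[day].add(key)
    return {day: len(keys) for day, keys in sorted(seen.items())}
```

With this, many machines behind one NAT address count separately as
long as each sends its own UUID, and a machine that checks in many
times a day still counts once.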
I'd like to hear your thoughts.