On Tue, 8 Jan 2019 at 00:30, Christopher Tubbs <ctubbsii(a)fedoraproject.org> wrote:
A few concerns/comments (inline):
> === The problem ===
> * A. Currently, we can only count Fedora OS use by observing IP
> addresses. This is subject to undercounting due to NAT — and to
> overcounting due to short DHCP leases and laptops moving between work
> or school and home or coffee shop.
"Counts are estimates" is not necessarily a problem. Please explain why
this is a problem. Also, why not use statistical modeling to try to improve
the estimates based on these known behaviors?
In the past when I looked at this, it was always a problem of choosing
which model best fit your data. You can come up with all kinds of models to
prove whatever you want but you need some sort of 'accurate' count at some
point to test those models against. There is also the fact that different
installations fit under different 'models'. The IoT systems will have a
different representational set than the laptop versus the livecd versus...
because how they are installed and how they look on the network are
different. Using the same model for all of them seemed questionable.
Currently the statistics are done off of the http logs from the proxies,
which just see a basic set of information. Because the proxy/cache boxes
are remote, we wait N days for the rsync to complete, then merge all the
logs and then do some simple processing on an 8 year old 24 GB server.
The data in this merged log file is noisy because dnf/yum tries to be as
resilient as possible. A single 'dnf update' or 'yum update' may show up as
multiple requests for the same data on different proxies because something
didn't look right, or it might show up just once. Either way, I can't tell
whether I have 10 systems behind a firewall or just 1. At the moment I
assume 1 by saving the tuple (date,ip=x,arch=x,rel=y) only once per day.
Counting the number of times that tuple occurred was very, very noisy:
specific IP addresses that I knew had N systems would show up as either
N hundred systems or none, depending on the vagaries of the internet and
whatever the systems decided to do that day.
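
To make the difference concrete, here is a rough sketch of the two counting
approaches. It is not the actual processing script; the function names and
the assumption that each log line has already been parsed into a
(date, ip, arch, rel) tuple are made up for illustration.

```python
from collections import Counter


def count_once_per_day(records):
    """Count each (date, ip, arch, rel) tuple at most once.

    `records` is assumed to be an iterable of already-parsed
    (date, ip, arch, rel) tuples, one per logged request.
    Returns a Counter keyed by (date, arch, rel) whose values are the
    number of distinct IP addresses seen for that key.
    """
    unique = set(records)  # repeated requests from one IP collapse to one
    per_day = Counter()
    for date, _ip, arch, rel in unique:
        per_day[(date, arch, rel)] += 1
    return per_day


def count_every_request(records):
    """Count how often each tuple occurred (the noisy alternative)."""
    return Counter(records)


if __name__ == "__main__":
    # Hypothetical sample: one IP retried three times, another hit once.
    sample = [
        ("2019-01-07", "192.0.2.1", "x86_64", "29"),
        ("2019-01-07", "192.0.2.1", "x86_64", "29"),
        ("2019-01-07", "192.0.2.1", "x86_64", "29"),
        ("2019-01-07", "192.0.2.2", "x86_64", "29"),
    ]
    print(count_once_per_day(sample))
    print(count_every_request(sample))
```

The first function undercounts anything behind NAT (10 systems on one IP
look like 1), while the second inflates the count whenever dnf/yum retries.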
Stephen J Smoogen.