On Tue, 8 Jan 2019 at 00:30, Christopher Tubbs <ctubbsii@fedoraproject.org> wrote:
A few concerns/comments (inline):

> === The problem ===
> * A. Currently, we can only count Fedora OS use by observing IP
> addresses. This is subject to undercounting due to NAT — and to
> overcounting due to short DHCP leases and laptops moving between work
> or school and home or coffee shop.

"Counts are estimates" is not necessarily a problem. Please explain why this is a problem. Also, why not use statistical modeling to try to improve the estimates based on these known behaviors?

When I looked at this in the past, the problem was always choosing which model best fit the data. You can come up with all kinds of models to prove whatever you want, but at some point you need some sort of 'accurate' count to test those models against. Different installations also fit different 'models'. IoT systems will have a different representational set than laptops, live CDs, and so on, because how they are installed and how they look on the network differ. Using the same model for all of them seemed questionable.

Currently the statistics are computed from the HTTP logs of the proxies, which see only a basic set of information. Because the proxy/cache boxes are remote, we wait N days for the rsync to complete, then merge all the logs and run some simple processing on an 8-year-old 24 GB server.

The data in this merged log file is noisy because dnf/yum try to be as resilient as possible. A single 'dnf update' or 'yum update' may show up as multiple requests for the same data on different proxies because something didn't look right, or it might show up just once. However, I don't know whether I have 10 systems behind a firewall or just 1. At the moment I assume that I have 1 by saving the tuple (date,ip=x,arch=x,rel=y) only once per day.
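
To make the once-per-day tuple concrete, here is a rough sketch in Python. It is not the actual processing script: the `Hit` record, its field names, and the function `systems_per_day` are assumptions made up for illustration, and the sketch presumes the merged log has already been parsed into such records.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One parsed request from the merged proxy logs (hypothetical layout)."""
    date: str  # day of the request, e.g. "2019-01-07"
    ip: str    # client address as seen by the proxy
    arch: str  # architecture requested, e.g. "x86_64"
    rel: str   # release requested, e.g. "29"


def systems_per_day(hits: Iterable[Hit]) -> dict:
    """Count each distinct (date, ip, arch, rel) tuple once per day.

    Repeated requests from the same address for the same arch and release
    on the same day collapse into one entry, so 10 machines behind one
    NAT address still count as 1.
    """
    unique = set(hits)  # NamedTuples hash by value, so duplicates drop out
    return dict(Counter(h.date for h in unique))


if __name__ == "__main__":
    sample = [
        Hit("2019-01-07", "203.0.113.5", "x86_64", "29"),
        Hit("2019-01-07", "203.0.113.5", "x86_64", "29"),  # retry on another proxy
        Hit("2019-01-07", "203.0.113.5", "aarch64", "29"),
        Hit("2019-01-08", "203.0.113.5", "x86_64", "29"),
    ]
    print(systems_per_day(sample))
```

The set-based deduplication is what produces the undercount behind NAT: anything sharing an address, arch, and release on a given day is indistinguishable.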

Trying to count the number of times that tuple occurred was very noisy: specific IP addresses that I knew had N systems behind them would show up as either N hundred systems or none, depending on the vagaries of the internet and whatever the systems decided to do that day.
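
For comparison, counting raw occurrences instead of distinct tuples might look like the sketch below. Again this is only an illustration under assumptions: the records are taken to be plain (date, ip, arch, rel) tuples, and `raw_hits_per_address` is a made-up name, not part of any existing tooling.

```python
from collections import Counter
from typing import Iterable, Tuple

Record = Tuple[str, str, str, str]  # (date, ip, arch, rel), hypothetical layout


def raw_hits_per_address(records: Iterable[Record]) -> Counter:
    """Count every request per (date, ip) without any deduplication.

    Retries and requests spread across several proxies are all counted,
    so the number depends on client behaviour that day rather than on
    how many machines sit behind the address.
    """
    return Counter((date, ip) for date, ip, _arch, _rel in records)


if __name__ == "__main__":
    records = [
        ("2019-01-07", "198.51.100.9", "x86_64", "29"),
        ("2019-01-07", "198.51.100.9", "x86_64", "29"),
        ("2019-01-07", "198.51.100.9", "x86_64", "29"),
        ("2019-01-08", "192.0.2.44", "x86_64", "28"),
    ]
    for key, hits in sorted(raw_hits_per_address(records).items()):
        print(key, hits)
```

A day with no requests from an address simply has no key in the result, which corresponds to the "or none" case.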

Stephen J Smoogen.