Good Monday. :)
I am trying to find package installation/download statistics for Fedora. https://fedoraproject.org/wiki/Statistics is the page that pops up first in search results, and it is about 6 years out of date. It would be really great to update it to reflect the current situation. I cannot do this myself, because acquiring CLA+1 for wiki edits is too hard and didn't happen even though I sent several requests to different groups. The second reason: I don't know what to write, as there seems to be no contact point except this group.
There is also an unlinked https://fedoraproject.org/wiki/Statistics_2.0 page with many good stories, including mine (`Parse mirror logs: what packages are being the most downloaded?`), but no links or tracker items to see the status and jump in.
I found a request to open a stats@ list, https://pagure.io/fedora-infrastructure/issue/2223, which mentions https://github.com/fedora-infra/datanommer/ as a new location. There are still no examples of getting package popularity data.
I need stats to see how many people are using the qdigidoc package, to make it more official.
On Mon, 5 Nov 2018 at 09:56, Anatoli Babenia anatoli@rainforce.org wrote:
> Good Monday. :)
> I am trying to find package installation/download statistics for Fedora. https://fedoraproject.org/wiki/Statistics is the page that pops up first in search results, and it is about 6 years out of date. It would be really great to update it to reflect the current situation. I cannot do this myself, because acquiring CLA+1 for wiki edits is too hard and didn't happen even though I sent several requests to different groups. The second reason: I don't know what to write, as there seems to be no contact point except this group.
I think the page should be archived/removed, mainly because a lot of the questions people want answered also get in the way of people wanting privacy. Currently there is no way to know which packages are being installed/downloaded the most. yum and dnf downloads do not provide those answers on purpose (it would require more computational power on the servers than we have, and it can't easily be made anonymous). The data we can get is only basic information like 'what version of yum/dnf was used', 'what arch was asked for', 'what version of Fedora/EPEL was wanted' and 'what was the public IP address'. This loses all kinds of additional information and masks things like proxies, mock builds, etc., which inflate/deflate numbers in different ways.
> There is also an unlinked https://fedoraproject.org/wiki/Statistics_2.0 page with many good stories, including mine (`Parse mirror logs: what packages are being the most downloaded?`), but no links or tracker items to see the status and jump in.
That page is even older than the one you pointed to and should also be archived/removed. We are probably on Statistics 5.0
> I found a request to open a stats@ list, https://pagure.io/fedora-infrastructure/issue/2223, which mentions https://github.com/fedora-infra/datanommer/ as a new location. There are still no examples of getting package popularity data.
> I need stats to see how many people are using the qdigidoc package, to make it more official.
I am sorry but there is no way to answer that question.
infrastructure mailing list -- infrastructure@lists.fedoraproject.org To unsubscribe send an email to infrastructure-leave@lists.fedoraproject.org Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedorapro...
On Mon, 5 Nov 2018 at 09:56, Anatoli Babenia <anatoli(a)rainforce.org> wrote:
> I think the page should be archived/removed, mainly because a lot of the questions people want answered also get in the way of people wanting privacy.
I agree that the page in its current state is not useful, but why do you propose to censor information about how Fedora handles privacy instead of explaining it on a case-by-case basis?
Without statistics, people are pretty much limited in synchronizing their view of the world to take joint action. For example, with qdigidoc stats we could try to get some funding for Fedora development from the EU. It could also be an opt-in feature like https://popcon.debian.org/
I am not saying that the stats reflect anything exactly, but with some adjustments they can still be useful.
> Currently there is no way to know which packages are being installed/downloaded the most. yum and dnf downloads do not provide those answers on purpose (it would require more computational power on the servers than we have, and it can't easily be made anonymous). The data we can get is only basic information like 'what version of yum/dnf was used', 'what arch was asked for', 'what version of Fedora/EPEL was wanted' and 'what was the public IP address'. This loses all kinds of additional information and masks things like proxies, mock builds, etc., which inflate/deflate numbers in different ways.
Just a hypothesis. If HTTPS/SSH and the dnf protocol use fixed-size packets and encryption increases the size proportionally, then I can guess the combination of packages being installed from the time and size of the request, so it doesn't help to hide that.
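A minimal sketch of that hypothesis, assuming an observer knows the size of every package in the repository and sees only the total number of bytes transferred. The package sizes below are made up for illustration; real values would come from repository metadata.

```python
from itertools import combinations

# Hypothetical package sizes in bytes (illustrative only, not real data).
PACKAGE_SIZES = {
    "qdigidoc": 1_250_000,
    "libdigidocpp": 830_000,
    "opensc": 1_410_000,
    "pcsc-lite": 95_000,
}


def candidate_sets(observed_bytes, sizes, tolerance=0):
    """Return every package combination whose total size lies within
    `tolerance` bytes of the observed transfer size (brute force)."""
    names = sorted(sizes)
    matches = []
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            total = sum(sizes[name] for name in combo)
            if abs(total - observed_bytes) <= tolerance:
                matches.append(combo)
    return matches


# An observer sees one transfer whose size equals two packages combined.
observed = PACKAGE_SIZES["qdigidoc"] + PACKAGE_SIZES["libdigidocpp"]
exact = candidate_sets(observed, PACKAGE_SIZES)
fuzzy = candidate_sets(observed, PACKAGE_SIZES, tolerance=100_000)
print(exact)
print(len(fuzzy))
```

With an exact size the combination is unambiguous in this toy data; allowing for padding or protocol overhead through `tolerance` makes more than one combination plausible, which is the kind of adjustment a real analysis would have to account for.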
Recording IPs is a big deal on its own. But for stats it can be replaced with simply incrementing a counter. You also forgot to mention virtual machines and containers, which inflate the numbers too. I don't believe that anybody right now has an incentive to keep the usage numbers for `qdigidoc` higher than the real usage, and even if that's the case, the people on the other side can validate the data against the number of sessions with unique ID cards from Fedora to their servers. That's the whole point of it: making the first step to go further and pass the ball to the other side.
Also, for file-serving mirrors I'd expect the bottleneck to be bandwidth, not processing power. Storing IPs for each request can be inefficient, but can we get some statistics about that? I could not find any example mirror at https://nagios.fedoraproject.org/nagios/
> That page is even older than the one you pointed to and should also be archived/removed. We are probably on Statistics 5.0
> I am sorry but there is no way to answer that question.
I want to believe, but because you touched my paranoia from the start: is there a dump of a client-server session, with logs, for doing a proper privacy audit? Now I need to feed my inner lawyer. :D
On Tue, 6 Nov 2018 at 03:42, Anatoli Babenia anatoli@rainforce.org wrote:
I am just going to concentrate on one part of this email to try and cover things once.
> > Currently there is no way to know which packages are being installed/downloaded the most. yum and dnf downloads do not provide those answers on purpose (it would require more computational power on the servers than we have, and it can't easily be made anonymous). The data we can get is only basic information like 'what version of yum/dnf was used', 'what arch was asked for', 'what version of Fedora/EPEL was wanted' and 'what was the public IP address'. This loses all kinds of additional information and masks things like proxies, mock builds, etc., which inflate/deflate numbers in different ways.
> Just a hypothesis. If HTTPS/SSH and the dnf protocol use fixed-size packets and encryption increases the size proportionally, then I can guess the combination of packages being installed from the time and size of the request, so it doesn't help to hide that.
The packages aren't downloaded from Fedora but from any of a thousand mirrors, so there is no way for us to see what was downloaded or installed. All that is logged is data like the following:
209.132.184.33 - - [01/Nov/2018:04:02:57 +0000] "GET /metalink?repo=fedora-27&arch=x86_64 HTTP/1.1" 200 18196 "-" "dnf/2.7.5"
209.132.184.33 - - [01/Nov/2018:04:02:58 +0000] "GET /metalink?repo=fedora-27&arch=x86_64 HTTP/1.1" 200 18196 "-" "dnf/2.7.5"
209.132.184.33 - - [01/Nov/2018:04:03:11 +0000] "GET /mirrorlist?repo=epel-7&arch=x86_64 HTTP/1.1" 200 2428 "-" "urlgrabber/3.10.1 yum/3.4.3"
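As a rough sketch of what can be extracted from entries like these, the snippet below tallies requests per repository, architecture and client tool while discarding the IP address. It assumes the entries follow the common Apache "combined" log layout, which the three sample lines appear to match; the regular expression and the function name `tally` are illustrative, not an existing Fedora tool.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# The three sample entries from the message above.
LOG_LINES = [
    '209.132.184.33 - - [01/Nov/2018:04:02:57 +0000] "GET /metalink?repo=fedora-27&arch=x86_64 HTTP/1.1" 200 18196 "-" "dnf/2.7.5"',
    '209.132.184.33 - - [01/Nov/2018:04:02:58 +0000] "GET /metalink?repo=fedora-27&arch=x86_64 HTTP/1.1" 200 18196 "-" "dnf/2.7.5"',
    '209.132.184.33 - - [01/Nov/2018:04:03:11 +0000] "GET /mirrorlist?repo=epel-7&arch=x86_64 HTTP/1.1" 200 2428 "-" "urlgrabber/3.10.1 yum/3.4.3"',
]

# Assumed layout: ip, two unused fields, [time], "request", status, size,
# "referer", "user agent".
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def tally(lines):
    """Count requests per (repo, arch, client); the IP is never stored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip entries that do not fit the assumed layout
        query = parse_qs(urlsplit(match.group("target")).query)
        repo = query.get("repo", ["unknown"])[0]
        arch = query.get("arch", ["unknown"])[0]
        # Use the last user-agent token, e.g. "yum/3.4.3" -> "yum".
        tokens = match.group("agent").split()
        client = tokens[-1].split("/")[0] if tokens else "unknown"
        counts[(repo, arch, client)] += 1
    return counts


counts = tally(LOG_LINES)
for key, number in sorted(counts.items()):
    print(key, number)
```

Note that nothing in these entries names a package: the tally can only say which release, architecture and client asked for a mirror list.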
The size in each entry is just that of the metalink/mirrorlist XML file, which tells your client which mirrors to go to. yum/dnf then goes to a mirror and pulls the files it wants from that repository. Data on that is harder to get because yum/dnf do not say 'this person wanted to install qdigidoc'. Instead the client gets the repository data, calculates the dependencies and requirements, and starts pulling down all the files needed so that qdigidoc can be installed successfully. For 'leaf' packages it might be easier to figure out, but if something requires qdigidoc or something similar, it gets lost in the shuffle.
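To illustrate why a download does not reveal intent, here is a toy sketch of the dependency step described above. The dependency graph is entirely hypothetical (every name except qdigidoc is made up), and the real resolver in yum/dnf is considerably more involved than this simple transitive walk.

```python
# Hypothetical "requires" graph; only the name qdigidoc comes from the thread.
REQUIRES = {
    "qdigidoc": ["libfoo", "libbar"],
    "id-card-suite": ["qdigidoc", "libbaz"],
    "libfoo": ["libbar"],
    "libbar": [],
    "libbaz": [],
}


def download_set(requested, requires):
    """Return the requested package plus everything it transitively requires."""
    seen = set()
    stack = [requested]
    while stack:
        package = stack.pop()
        if package in seen:
            continue
        seen.add(package)
        stack.extend(requires.get(package, []))
    return seen


# One user asks for qdigidoc directly, another for something that requires it.
direct = download_set("qdigidoc", REQUIRES)
indirect = download_set("id-card-suite", REQUIRES)
print(sorted(direct))
print(sorted(indirect))
```

A mirror sees qdigidoc fetched in both cases, so its logs alone cannot distinguish a deliberate install from one pulled in as a requirement.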
The reason you did not see any mirrors in nagios is that the vast majority of them are run by volunteers who are also doing the same for Debian, Mandrake, SUSE, etc. They do not share their data with us, and sharing would be hard because they also have other sites intermixed. We do not monitor them in our nagios system because we do not run them and do not have any way to fix their problems.
Now I see. Then porting https://popcon.debian.org/ to Fedora and providing infrastructure for incoming data is the only way to collect the stats. How can I find out whether that is possible or of interest to Fedora?
On Wed, 7 Nov 2018 at 03:34, Anatoli Babenia anatoli@rainforce.org wrote:
> Now I see. Then porting https://popcon.debian.org/ to Fedora and providing infrastructure for incoming data is the only way to collect the stats. How can I find out whether that is possible or of interest to Fedora?
There is probably some interest in it. I would contact Matthew Miller as he has wanted something like it in the past.
On Wed, Nov 07, 2018 at 07:04:23AM -0500, Stephen John Smoogen wrote:
> > Now I see. Then porting https://popcon.debian.org/ to Fedora and providing infrastructure for incoming data is the only way to collect the stats. How can I find out whether that is possible or of interest to Fedora?
> There is probably some interest in it. I would contact Matthew Miller as he has wanted something like it in the past.
Yeah, definitely very interested. Just not had anyone with time to work on it :(
On 11/7/18 10:34 AM, Matthew Miller wrote:
> On Wed, Nov 07, 2018 at 07:04:23AM -0500, Stephen John Smoogen wrote:
> > > Now I see. Then porting https://popcon.debian.org/ to Fedora and providing infrastructure for incoming data is the only way to collect the stats. How can I find out whether that is possible or of interest to Fedora?
> > There is probably some interest in it. I would contact Matthew Miller as he has wanted something like it in the past.
> Yeah, definitely very interested. Just not had anyone with time to work on it :(
Now that the messaging infrastructure work is wrapping up, I've started looking into the hardware database service that was discussed at the beginning of the year.
Since a hardware database with user counts requires tracking the number of users in an anonymous way, it's essentially also an installation tracker. Adding another table or two would make it a software database so it might make sense to run them all as a single back-end service and have a single end-user tool for reporting.
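A minimal sketch of the "another table or two" idea, assuming each installation reports under a random identifier generated on the client. The schema, table names and helper functions are all hypothetical; nothing here describes an existing Fedora service.

```python
import sqlite3
import uuid

# Hypothetical schema: one row per installation, plus hardware and package
# tables that reference it. No IP address or account name is stored.
SCHEMA = """
CREATE TABLE install (id TEXT PRIMARY KEY);
CREATE TABLE hardware (install_id TEXT REFERENCES install(id), device TEXT);
CREATE TABLE package (install_id TEXT REFERENCES install(id), name TEXT);
"""


def submit(conn, install_id, devices, packages):
    """Store one report under a random, client-generated identifier."""
    conn.execute("INSERT OR IGNORE INTO install (id) VALUES (?)", (install_id,))
    conn.executemany(
        "INSERT INTO hardware VALUES (?, ?)", [(install_id, d) for d in devices]
    )
    conn.executemany(
        "INSERT INTO package VALUES (?, ?)", [(install_id, p) for p in packages]
    )


def users_of(conn, package):
    """Count distinct installations that reported the given package."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT install_id) FROM package WHERE name = ?",
        (package,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
submit(conn, str(uuid.uuid4()), ["gpu-a"], ["qdigidoc", "bash"])
submit(conn, str(uuid.uuid4()), ["gpu-b"], ["bash"])
print(users_of(conn, "qdigidoc"), users_of(conn, "bash"))
```

The same `install` table backs both the hardware and the package counts, which is why a single back-end service and a single reporting tool could cover both use cases.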
On 11/7/18 8:24 AM, Jeremy Cline wrote:
> On 11/7/18 10:34 AM, Matthew Miller wrote:
> > On Wed, Nov 07, 2018 at 07:04:23AM -0500, Stephen John Smoogen wrote:
> > > > Now I see. Then porting https://popcon.debian.org/ to Fedora and providing infrastructure for incoming data is the only way to collect the stats. How can I find out whether that is possible or of interest to Fedora?
> > > There is probably some interest in it. I would contact Matthew Miller as he has wanted something like it in the past.
> > Yeah, definitely very interested. Just not had anyone with time to work on it :(
> Now that the messaging infrastructure work is wrapping up, I've started looking into the hardware database service that was discussed at the beginning of the year.
> Since a hardware database with user counts requires tracking the number of users in an anonymous way, it's essentially also an installation tracker. Adding another table or two would make it a software database so it might make sense to run them all as a single back-end service and have a single end-user tool for reporting.
You may want to look at https://pagure.io/fedora-infrastructure/issue/6497 and talk with npmccallum if he's still around and interested in this area.
kevin
On Wed, Nov 07, 2018 at 04:55:38PM -0800, Kevin Fenzi wrote:
> You may want to look at https://pagure.io/fedora-infrastructure/issue/6497 and talk with npmccallum if he's still around and interested in this area.
Yes. Last I talked with him, he is still around and interested, but doesn't have spare time -- so help welcome!
I need only package info like https://popcon.debian.org/ but getting stats about Fedora hardware similar to https://store.steampowered.com/hwsurvey/ is interesting too.
I cannot find the source code for Smolt. The first point at https://fedoraproject.org/wiki/Smolt_retirement#Rationale is that Smolt had not been maintained for more than 10 months, but the census GitHub repository is 5 years old.
I am not a fan of rewrites. It seems to me that taking http://web.archive.org/web/20121029093725/http://www.smolts.org/static/stats... and then fixing that is way faster than writing everything from scratch.
I wish the rationale above contained more technical details, because a rewrite is always a step back until everything is implemented, and it would be interesting to know what the design limitations were and why the census design is better.
On Sat, Nov 10, 2018 at 09:49:00PM -0000, Anatoli Babenia wrote:
> I need only package info like https://popcon.debian.org/ but getting stats about Fedora hardware similar to https://store.steampowered.com/hwsurvey/ is interesting too.
> I cannot find the source code for Smolt. The first point at https://fedoraproject.org/wiki/Smolt_retirement#Rationale is that Smolt had not been maintained for more than 10 months, but the census GitHub repository is 5 years old.
> I am not a fan of rewrites. It seems to me that taking http://web.archive.org/web/20121029093725/http://www.smolts.org/static/stats... and then fixing that is way faster than writing everything from scratch.
> I wish the rationale above contained more technical details, because a rewrite is always a step back until everything is implemented, and it would be interesting to know what the design limitations were and why the census design is better.
It's a little old now, but back then Nathaniel presented Census at Flock, and the presentation may answer some of your questions: https://www.youtube.com/watch?v=hcNYmBrSi14
Pierre
On 05. 11. 18 at 16:21, Stephen John Smoogen wrote:
> Currently there is no way to know which packages are being installed/downloaded the most.
Some time ago, I supervised a student's bachelor thesis in which he added rpm support to Debian's PopCon [1], which is an opt-in system for reporting package usage. Alas, I never got to the point of setting up an instance and packaging it properly for Fedora.
If someone wants to continue, I can provide the thesis itself. IIRC it is in English.
[1] https://popcon.debian.org/
Miroslav
On 12. 11. 18 at 15:41, Miroslav Suchý wrote:
> Some time ago, I supervised a student's bachelor thesis in which he added rpm support to Debian's PopCon [1], which is an opt-in system for reporting package usage.
Ahh, I forgot to add the link to the port: https://github.com/xsuchy/popcon-for-fedora-old
Miroslav