Anonymized access log from a fedora mirror

Pablo Iranzo Gómez Pablo.Iranzo at redhat.com
Fri May 3 12:43:15 UTC 2013


Hi,


----- Original Message -----
> From: "Lukas Zapletal" <lzap at redhat.com>
> To: infrastructure at lists.fedoraproject.org
> Sent: Friday, 3 May 2013 14:35:45
> Subject: Anonymized access log from a fedora mirror
> 
> Hello,
> 
> I have two students interested in a diploma thesis called "Yum
> plugin for suggesting packages based on usage":
> 
> http://bit.ly/18hrHbL
> 
> TL;DR - from anonymized access log, create a database of suggested
> packages using data mining techniques and provide a Yum plugin that
> would suggest "Users of vim also installed: ctags, git, ..."
> 
> I am gonna create a Fedora Feature wiki page shortly describing this
> in more detail. Our goal is to offer this project for integration
> into Fedora later on, at least provide Fedora packages for it.
> 
> To do that, we need a good source of data. It would be best to
> collect access logs from one or two main Fedora mirrors. We would
> provide a short Python script that parses the access logs,
> anonymizes the data (salted hash of the IP address) and keeps only
> relevant entries (RPM files from the latest Fedora release or
> updates repositories). That would be phase one, which should give us
> sample data.
> 
> Phase two would be to integrate this script with logrotate, and for
> one Fedora release cycle (Fedora 19) the script would collect
> relevant anonymized data into a file. The final suggested-package
> database would be created from this file (or maybe files, to allow
> us to move them on the fly out of the stat directory).
> 
> The big (legal) question is whether we are able to provide this
> anonymized data to the public, or whether we want to sign an NDA
> with all people involved. I am CCing Tom for this question.
> 
> I need your help with connecting to relevant people. Any comments are
> appreciated.
> 
> Many thanks and I hope this effort will lead to improving user
> experience with Fedora packaging.

I'm not sure about our side, but Debian has long had a package, "popularity-contest", which automatically submits each system's package list so that this kind of recommended-package list can be built.

It may need some input from the legal team, but as an initiative it sounds good :)
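
For reference, the phase-one script described in the quoted message could look roughly like the sketch below. It is only an illustration, not the students' actual script: it assumes the mirror writes Apache-style "combined" access logs, that only successful GET requests matter, and that relevant RPMs can be recognized by "/releases/<release>/" or "/updates/<release>/" in the request path. The function names, the regular expression and the path layout are all hypothetical.

```python
import hashlib
import re

# Assumption: Apache/nginx "combined" log format, GET requests only.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def anonymize_ip(ip, salt):
    """Return the SHA-256 hex digest of salt + IP (salt is secret bytes)."""
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()


def is_relevant(path, release):
    """Keep only RPM files under the release or updates trees.

    The directory layout checked here is an assumption for illustration.
    """
    if not path.endswith(".rpm"):
        return False
    return f"/releases/{release}/" in path or f"/updates/{release}/" in path


def anonymize_log(lines, salt, release="19"):
    """Yield (hashed_ip, rpm_file_name) pairs for relevant log lines."""
    for line in lines:
        match = LOG_LINE.match(line)
        # Skip unparsable lines and anything that is not a plain 200 response.
        if match is None or match.group("status") != "200":
            continue
        path = match.group("path")
        if is_relevant(path, release):
            yield anonymize_ip(match.group("ip"), salt), path.rsplit("/", 1)[-1]
```

The same client always maps to the same digest as long as the salt stays fixed, which is what lets downloads be grouped per client without storing the address. The salt has to stay secret, because the IPv4 address space is small enough to brute-force an unsalted hash.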

Regards,
Pablo


-- 

Pablo Iranzo Gómez (Pablo.Iranzo at redhat.com)
Senior Global Professional Services Consultant (RHCA, RHCSS, RHCDS, RHCVA, RHCE, RHCSA, RHCSP, JBCAA) #110-215-852
Phone: +34 645 01 01 49 (CET/CEST)
GnuPG KeyID: 0x5BD8E1E4 


