Anonymized access log from a Fedora mirror

Lukas Zapletal lzap at redhat.com
Fri May 3 12:35:45 UTC 2013


Hello,

I have two students interested in a diploma thesis titled "Yum plugin
for suggesting packages based on usage":

http://bit.ly/18hrHbL

TL;DR - from an anonymized access log, create a database of suggested
packages using data mining techniques and provide a Yum plugin that
would suggest "Users of vim also installed: ctags, git, ..."
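
To make the idea concrete, here is a minimal sketch of one possible
technique, plain co-occurrence counting. It is only an illustration: the
thesis may well use a different data mining method, and the function name
build_suggestions, its top_n parameter, and the input shape (anonymized
client id mapped to a set of package names) are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_suggestions(installs, top_n=3):
    """Return {package: [most frequently co-fetched packages]}.

    `installs` is a hypothetical input shape: it maps an anonymized
    client id to the set of package names that client downloaded.
    """
    pair_counts = defaultdict(Counter)
    for packages in installs.values():
        # Count every unordered pair once per client, in both directions.
        for a, b in combinations(sorted(packages), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1

    suggestions = {}
    for pkg, counts in pair_counts.items():
        # Highest count first; ties are broken alphabetically.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        suggestions[pkg] = [name for name, _ in ranked[:top_n]]
    return suggestions
```

With such a table, the plugin would only need to look up the package the
user is installing and print the associated list.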

I am going to create a Fedora Feature wiki page shortly, describing this
in more detail. Our goal is to offer this project for integration into
Fedora later on, or at least to provide Fedora packages for it.

To do that, we need a good source of data. It would be best to collect
access logs from one or two main Fedora mirrors. We would provide a short
Python script that parses the access logs, anonymizes the data (IP
addresses replaced by salted hashes), and keeps only the relevant entries
(RPM files from the latest Fedora release or updates repositories). That
would be phase one, which should give us sample data.
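
A rough sketch of what the phase-one script could look like follows. It
is not the actual script. It assumes the mirror writes Apache-style
common/combined log lines, that a salted SHA-256 is an acceptable form of
"hash-salted", and that the relevant repositories can be recognized by
the hypothetical path fragments in RELEVANT_FRAGMENTS. All function and
constant names are made up for illustration.

```python
import hashlib
import posixpath
import re

# Assumption: Apache "common"/"combined" style access log lines.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

# Assumption: hypothetical path fragments marking the repositories we
# care about; the real mirror layout has to be checked.
RELEVANT_FRAGMENTS = ("/releases/19/", "/updates/19/")


def anonymize_ip(ip, salt):
    """Replace an IP address with a salted SHA-256 hex digest."""
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()


def package_name(rpm_filename):
    """Extract the name from a name-version-release.arch.rpm filename."""
    stem = rpm_filename[: -len(".rpm")]
    name_version_release = stem.rsplit(".", 1)[0]  # drop the arch
    return name_version_release.rsplit("-", 2)[0]  # drop version, release


def anonymized_records(lines, salt):
    """Yield (hashed_ip, package_name) for successful, relevant RPM fetches."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not a line in the assumed format
        path = match.group("path")
        if match.group("method") != "GET" or match.group("status") != "200":
            continue
        if not path.endswith(".rpm"):
            continue
        if not any(fragment in path for fragment in RELEVANT_FRAGMENTS):
            continue
        yield (
            anonymize_ip(match.group("ip"), salt),
            package_name(posixpath.basename(path)),
        )
```

The salt (a bytes value) would have to stay secret on the mirror, since
the IPv4 address space is small enough to enumerate once the salt is
known.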

Phase two would be to integrate this script with logrotate so that, for
one Fedora release cycle (Fedora 19), it collects the relevant
anonymized data into a file. The final suggested-package database would
be created from this file (or maybe files, to allow us to move them out
of the stat directory on the fly).

The big (legal) question is whether we are able to provide this
anonymized data to the public, or whether we want to sign an NDA with
all people involved. I am CCing Tom for this question.

I need your help connecting with the relevant people. Any comments are
appreciated.

Many thanks, and I hope this effort will improve the user experience
with Fedora packaging.

-- 
Later,

 Lukas "lzap" Zapletal
 irc: lzap #theforeman

