Anonymized access log from a Fedora mirror

Kevin Fenzi kevin at scrye.com
Fri May 3 15:21:07 UTC 2013


On Fri, 3 May 2013 14:35:45 +0200
Lukas Zapletal <lzap at redhat.com> wrote:

> Hello,
> 
> I have two students interested in diploma thesis called Yum plugin for
> suggesting packages based on usage:
> 
> http://bit.ly/18hrHbL
> 
> TL;DR - from anonymized access log, create a database of suggested
> packages using data mining techniques and provide a Yum plugin that
> would suggest "Users of vim also installed: ctags, git, ..."

So can you explain how this would work? 

How do we know that any particular person who installed one package
installed anything else? Are you using the IP address to try to see what
the user behind each IP installed? I can think of... a lot of ways that
won't work. ;)
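
If the idea really is to group requests by (anonymized) client, my guess at
the core of it is something like the sketch below. This is only an
illustration under that assumption, not the students' design: the function
names and the (client_id, package) record shape are made up here. It also
shows one of the failure modes: every machine behind a single NAT address
collapses into one "client", and one machine with a changing address becomes
several.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_install_counts(records):
    """Count how often two packages were fetched by the same client ID.

    `records` is an iterable of (client_id, package_name) pairs; both the
    shape and the names are assumptions made for this sketch.
    """
    per_client = defaultdict(set)
    for client_id, package in records:
        per_client[client_id].add(package)

    pairs = Counter()
    for packages in per_client.values():
        # Sort so each unordered pair always gets the same key.
        for a, b in combinations(sorted(packages), 2):
            pairs[(a, b)] += 1
    return pairs


def also_installed(pairs, package, top=3):
    """Return up to `top` packages most often seen alongside `package`."""
    related = Counter()
    for (a, b), n in pairs.items():
        if a == package:
            related[b] += n
        elif b == package:
            related[a] += n
    return [name for name, _ in related.most_common(top)]
```

Feeding it a handful of records and calling also_installed(pairs, "vim")
would give the "Users of vim also installed: ..." list from the proposal,
for whatever that list is worth given the caveats above.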

Another approach might be to work on https://fedorahosted.org/census/
It is meant to be the replacement for smolt, but it never seems to have
gotten very far. It would be an application that end users install.

> I am gonna create a Fedora Feature wiki page shortly describing this
> in more detail. Our goal is to offer this project for integration into
> Fedora later on, at least provide Fedora packages for it.
> 
> To do that, we need good source of data. It would be best to collect
> access logs from one or two main Fedora mirrors. We would provide
> short script in Python that would parse access logs and anonymize the
> data (IP address hash-salted) and filtered only relevant data (RPM
> files from latest Fedora release or updates repositories). That would
> be phase one which should give us a sample data.

We had a discussion about making our logs public a while back, and I
think that discussion ended with us saying the IP addresses wouldn't be
safe to publish, even hashed.
 
http://lists.fedoraproject.org/pipermail/infrastructure/2012-April/011658.html
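
For reference, here is a rough sketch of what I understand the phase-one
script to be: parse each log line, keep only RPM downloads for one release,
and replace the IP with a salted hash. Everything specific in it is an
assumption on my part: the "combined" log format, the releases/ and updates/
path layout, and the function names. Note that a salted hash is only as
private as the salt. The IPv4 space is small enough that anyone who learns
the salt can hash every address and reverse the mapping.

```python
import hashlib
import hmac
import re

# Assumed Apache/nginx "combined" log format; real mirror logs may differ.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

# Hypothetical filter: RPMs under a releases/ or updates/ tree.
RPM_RE = re.compile(
    r'/(?:releases|updates)/(?P<release>\d+)/.*/(?P<rpm>[^/]+\.rpm)$'
)


def anonymize_ip(ip, salt):
    """Return a keyed (salted) SHA-256 digest of the IP address."""
    return hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()


def filter_line(line, salt, release="19"):
    """Return (client_id, rpm_filename) for a relevant request, else None."""
    m = LOG_RE.match(line)
    if m is None or m.group("status") != "200":
        return None
    p = RPM_RE.search(m.group("path"))
    if p is None or p.group("release") != release:
        return None
    return anonymize_ip(m.group("ip"), salt), p.group("rpm")
```

The salt has to be a secret bytes value that never leaves the mirror, and
even then the output still links all requests from one address together.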

> Phase two would be to integrate this script with logrotate and for one
> Fedora release cycle (Fedora 19) the script would collect relevant
> anonymized data into a file. Final suggested package database would be
> created from this file (or maybe files to allow us to move them on the
> fly out of the stat directory).
> 
> The big (legal) question is if we are able to provide this anonymized
> data to public, or if we want to sign NDA with all people involved. I
> am CCing Tom for this question.

It's been asked before.

I want to be cautious about this. ;) 

> I need your help with connecting to relevant people. Any comments are
> appreciated.
> 
> Many thanks and I hope this effort will lead to improving user
> experience with Fedora packaging.

kevin