Request for test data based off of obfuscated live data

John Palmieri johnp at
Wed Nov 19 15:07:50 UTC 2008

----- "Toshio Kuratomi" <a.badger at> wrote:

> Getting koji data munged and transferred may be a problem as it is just
> so darn big.  If we don't have to make changes to the data in koji, just
> get it distributed, then we could give access to a backup... but that's
> still a lot of information to transfer.

We would only need a portion of the data.  Ideally that would be everything since the last supported version of each distribution (or one after, so that we get obsolete data to test against), but in reality the last month of activity should be suitable.

> pkgdb, fas, and bodhi are relatively small.
> fas is where we'd have our major security problems.  We can't give the
> information out unmunged.  I've munged it before, though, so it's
> doable.  How strict we need to be is an issue, though.  If we remove all
> the identifying information in the people table except for the userid,
> is that sufficient?  *Note: We probably also need to munge data in the
> configs table.

That is sufficient as long as we randomly generate data for the rest (well, the username at least).  Note that UIDs are easily mapped back to usernames, so you might want to randomize those as well.  Also, I believe packagedb and bodhi use usernames as the key instead of UIDs, so those would have to match accounts in the munged FAS db.  I would suggest generating a list of names from a dictionary and using that list to randomize names in the other services.  Of course, the names need to correspond to group permissions, so some logic would be needed to make sure the records associated with a given name are valid.  However, the ability to recreate the associated usernames may not be an issue, since all of that data is public.  More importantly, we need to make sure we aren't giving out addresses, phone numbers, password hashes and other such keys.
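
A minimal sketch of that name-mapping idea, in Python.  Everything named here is an assumption made for illustration: the column names (id, username, email, and so on), the row-as-dict table shape, and the helpers build_name_map, munge_people and remap do not come from the actual FAS, pkgdb or bodhi schemas.  It only shows the shape of the logic: one shared mapping, applied everywhere, with the sensitive columns dropped.

```python
import random

# Hypothetical column names; the real FAS schema may differ.
SENSITIVE_FIELDS = ("email", "telephone", "postal_address", "password")


def build_name_map(usernames, wordlist, rng):
    """Map every real username to a distinct word drawn from wordlist."""
    real = sorted(set(usernames))
    words = sorted(set(wordlist))
    if len(words) < len(real):
        raise ValueError("word list is too small for the number of users")
    return dict(zip(real, rng.sample(words, len(real))))


def munge_people(people, name_map, rng):
    """Return (munged rows, old-id -> new-id map) for the people table."""
    # Draw fresh, distinct ids so old UIDs cannot be mapped back to users.
    new_ids = rng.sample(range(100000, 100000 + 10 * len(people)), len(people))
    id_map = {row["id"]: new_id for row, new_id in zip(people, new_ids)}
    munged = []
    for row in people:
        # Drop addresses, phone numbers, password hashes and the like.
        clean = {k: v for k, v in row.items() if k not in SENSITIVE_FIELDS}
        clean["id"] = id_map[row["id"]]
        clean["username"] = name_map[row["username"]]
        munged.append(clean)
    return munged, id_map


def remap(rows, column, mapping):
    """Rewrite one key column so rows still point at the munged accounts."""
    return [{**row, column: mapping[row[column]]} for row in rows]
```

The same name_map would be passed to remap for the username-keyed tables in the other services, and id_map for any table keyed by UID (group memberships, for example), so that the munged records stay consistent with each other.  A real run would also need to check that no generated name collides with a real username.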

> pkgdb and bodhi don't have information that is privacy policy sensitive.
> (Which doesn't mean that some users won't like it... just that I think
> we're covered.)

Mike's suggestion of running it by legal sounds like the best route.

John (J5) Palmieri
Software Engineer
Red Hat, Inc.
