Request for test data based off of obfuscated live data

Tue Nov 18 19:01:06 UTC 2008

Hey guys,

On IRC the other day I asked for the ability to use live data, stripped of all personally identifying information, to create a test bed for MyFedora development.  I was asked to write up my reasons for needing this data, as they were hard to explain in detail on IRC.

Current Development

First, let's go over my current development process.  Right now I work on live data.  Since most of my code only reads data, this isn't an issue, except perhaps for the load it puts on the servers when testing.  It becomes less ideal, however, when I need to test code that modifies data, such as pushing a build.  Even more daunting is adding functionality to one of the other apps: creating data that somewhat reflects the real world is time-consuming and is often a blocker that makes me move on to something else.

Why I need the data

So why is it important to have real world data - or at least a semblance of it?  Because I am working on something that consolidates a lot of data into one interface, I hit the vast majority of our infrastructure while treating it as one entity.  If each piece of infrastructure lived in isolation this wouldn't be as big an issue, but as it stands the data has keys that link each record in one piece of infrastructure to a record in another.  For instance, FAS usernames link to builds in Koji, whose build numbers link to releases in Bodhi.  I need data with those links intact so I can follow the workflow from one tool to another, test access rights, and simulate the progression of data through the pieces of infrastructure.  I also need to do this without worrying about stomping on the data, which means being able to quickly restore it to its initial state.  Finally, I can't hit every edge case, so I need to concentrate on how the data most commonly flows, and having something that resembles what we see on the production servers is key there.
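As a rough illustration of what "links intact" means, a sanity check over a sample could look like the sketch below.  The record shapes and field names are made up for the example; they are not the real FAS, Koji or Bodhi schemas.

```python
# Hypothetical, simplified records; the real FAS, Koji and Bodhi schemas differ.
fas_users = {"alice", "bob"}
koji_builds = [
    {"build_id": 101, "owner": "alice", "nvr": "foo-1.0-1.fc9"},
    {"build_id": 102, "owner": "bob", "nvr": "bar-2.1-3.fc9"},
]
bodhi_updates = [
    {"title": "example-update-1", "build_id": 101},
]


def find_broken_links(users, builds, updates):
    """Return descriptions of references that point at missing records."""
    problems = []
    build_ids = {build["build_id"] for build in builds}

    # Every build owner must exist as a user.
    for build in builds:
        if build["owner"] not in users:
            problems.append(
                "build %s: unknown owner %r" % (build["build_id"], build["owner"])
            )

    # Every update must refer to a build that is in the sample.
    for update in updates:
        if update["build_id"] not in build_ids:
            problems.append(
                "update %s: unknown build %s" % (update["title"], update["build_id"])
            )

    return problems


if __name__ == "__main__":
    for problem in find_broken_links(fas_users, koji_builds, bodhi_updates):
        print(problem)
```

A sample that passes a check like this can be walked from user to build to update without hitting a dead end.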

What I am asking for

As stated above, I would like a data set representing the data one would see in our infrastructure.  Ideally this would mean a secure process that dumps data from Koji, Bodhi, FAS and pkgdb while obfuscating all personally identifying data.  This could include switching package owners and uids at random so that the data can't be traced back (though in reality one could gather this data slowly by querying each of the infrastructure pieces).  I only need a relatively small sample: say, a month's worth of data and a semi-random drawing of the most active contributors and their packages.  I can update dates to keep the data "current" for testing purposes.  Every once in a while I would need a fresh sample to make sure the code doesn't just work with my sample set.
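One way the obfuscation could keep the cross-tool links while hiding identities is sketched below.  This is only an illustration under my own assumptions: the field names, the "userNNNN" placeholder format and both helper functions are hypothetical, and it replaces usernames with placeholders rather than literally swapping owners among each other.

```python
import random


def build_pseudonyms(usernames, rng=None):
    """Map each real username to a randomly assigned placeholder name."""
    # SystemRandom draws from the OS entropy source; pass random.Random(seed)
    # instead when a repeatable mapping is wanted for testing.
    rng = rng or random.SystemRandom()
    names = sorted(set(usernames))
    placeholders = ["user%04d" % i for i in range(len(names))]
    rng.shuffle(placeholders)
    return dict(zip(names, placeholders))


def obfuscate(records, field, pseudonyms):
    """Return copies of the records with `field` replaced by its pseudonym."""
    return [dict(record, **{field: pseudonyms[record[field]]}) for record in records]


if __name__ == "__main__":
    # Hypothetical sample rows standing in for dumps from two different tools.
    builds = [{"owner": "alice", "nvr": "foo-1.0-1.fc9"}]
    packages = [{"owner": "alice", "name": "foo"}, {"owner": "bob", "name": "bar"}]

    mapping = build_pseudonyms(["alice", "bob"])
    print(obfuscate(builds, "owner", mapping))
    print(obfuscate(packages, "owner", mapping))
```

Because the same mapping is applied to every dump, a record that pointed at a given user in one tool still points at the same placeholder in another.  The mapping itself would have to be thrown away after the dump for this to be of any use.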

Why pure random data isn't sufficient

Random data does not produce the relationships needed to work with the entire Fedora infrastructure.  Even if it did, the data would not cover real world scenarios, and most likely the relationships would be largely invalid (like a build tagged for F-8 released in F-9).  Also, things like Koji tags and group information absolutely need to conform to the structure we have set up.  For instance, I key off of the string "updates-candidate" to determine whether I should show a button to push the build to Bodhi.  The button also relies on FAS telling Bodhi that the currently logged-in user is in the correct group to push.  If the build is not an updates candidate or the user is not in the correct group, the button does not show.
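The button rule above boils down to something like the following sketch.  It is not the actual MyFedora code: the function name, the argument shapes and the group name used in the example are assumptions made for illustration.

```python
def can_show_push_button(build_tags, user_groups, required_group):
    """Show the push button only for updates candidates and permitted users."""
    # The tag check keys off the literal string "updates-candidate".
    is_candidate = any("updates-candidate" in tag for tag in build_tags)
    # `required_group` stands in for whichever group is allowed to push.
    is_allowed = required_group in user_groups
    return is_candidate and is_allowed


if __name__ == "__main__":
    # Illustrative tag and group names only.
    print(can_show_push_button(["dist-f9-updates-candidate"], {"packager"}, "packager"))
    print(can_show_push_button(["dist-f9"], {"packager"}, "packager"))
```

Randomly generated tag strings or group memberships would almost never satisfy both conditions, so this code path would go untested.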

What I would do with this data

I would be able to accelerate development of the more interesting bits of MyFedora while also being able to experiment and quickly produce patches for various bits of infrastructure.  For instance, FAS already has all the API I need to edit my profile, but it is not exposed outside of FAS for lack of a simple @allow_json decorator, so I had to drop that feature until the development freeze is over and a new FAS with the patch is put into production.  Even then, modifying data on a production server, even if it is my own profile, is not an ideal way to test.  If I had a data set I could set up my own test environment, apply the patch and test before we deploy.  I could then go on to patch other parts of the infrastructure to, say, speed up a query or add queries I need, and generally improve the base infrastructure as I develop MyFedora.  The patches would then be sent to trac and accepted or rejected in the usual manner.

Others could also more easily get into hacking on infrastructure bits, as they would have a place to start instead of a daunting blank slate.  If I can get the data I am more than happy to write scripts and kickstart files to easily set up and tear down a Fedora Infrastructure test and development instance.

Whatever solution the infrastructure team thinks is good for what I need will be workable.  Above is what I think I need and an explanation of why it is needed.  Hopefully there will be some solution we can agree on so we can move forward fairly quickly.  Thanks for your time.

--
John (J5) Palmieri
Software Engineer
Red Hat, Inc.