Request for test data based off of obfuscated live data

Tue Nov 18 21:12:28 UTC 2008

Mike McGrath wrote:
> On Tue, 18 Nov 2008, John Palmieri wrote:
> 
>> Hey guys,
>>
>> On IRC the other day there was a discussion where I had requested the ability to use live data, stripped of all personally identifying data, for creating a test bed for development of MyFedora.  I was asked to write up a reason for needing this data, as it was hard to explain in detail on IRC.
>>
>> Current Development
>>
>> First, let's go over my current development process.  Right now I work on live data.  Since most of my code involves reading data, this isn't an issue except for perhaps putting a load on the servers when testing.  This, however, becomes less ideal when I need to test code that modifies data, such as pushing a build.  Even more daunting is when I need to add functionality to one of the other apps: creating data that somewhat reflects the real world is time consuming and often a blocker, which has me move on to something else.
>>
>> Why I need the data
>>
>> So why is it important to have real world data - or at least a semblance of it?  Working on something that will consolidate a lot of the data into one interface, I hit a vast majority of our infrastructure while treating it as one entity.  If each piece of infrastructure lived in isolation, it wouldn't be as big of an issue, but as it stands the data has keys which link each record in one piece of infrastructure to a record in another.  For instance, FAS usernames link to builds in Koji, whose build numbers link to releases in Bodhi.  I need data with those links intact so I can follow the workflow from one tool to another, test access rights and simulate the progression of various data through the pieces of infrastructure without worrying about stomping on the data, because I can quickly restore it to its initial state.  Also, I can't hit every edge case; I need to concentrate on how the data most commonly flows, and having something that resembles what we see on the production servers is key there.
>>
>> What I am asking for
>>
>> As stated above, I would like a data set representing the data one would see in our infrastructure.  Ideally this would mean a secure process that would dump data from koji, bodhi, fas and pkgdb while obfuscating all personally identifying data.  This could include switching package owners and uids at random so that the data cannot be traced back (though in reality one could gather this data slowly by querying each of the infrastructure pieces).  I only need a relatively small sampling of, say, a month's worth of data and a semi-random drawing of the most active contributors and their packages.  I can update dates to keep the data "current" for testing purposes.  Every once in a while I would need a fresh sampling to make sure the code doesn't just work with my sample set.
>>
>> Why pure random data isn't sufficient
>>
>> Random data does not produce the relationships needed to work with the entire Fedora infrastructure, and even if it did, the data would not cover real world scenarios and most likely the relationships would be largely invalid (like a build tagged for F-8 released in F-9).  Also, things like koji tags and group information need to absolutely conform to the structure we have set up.  For instance, I key off of the string "updates-candidate" to determine if I should show a button to push the build to bodhi.  The button also relies on FAS telling bodhi that the currently logged in user is in the correct group to push.  If it is not an updates candidate or the user is not in the correct group, the button does not show.
>>
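
A minimal sketch of the button rule described above, for illustration only. The function name, the "packager" group name and the example tag are assumptions and are not taken from the MyFedora, Koji or FAS code.

```python
# Hypothetical sketch: show the "push to bodhi" button only when the build
# is an updates candidate AND the logged-in user is in the pushing group.

def can_show_push_button(build_tags, user_groups, required_group="packager"):
    """Return True only if both conditions from the paragraph above hold.

    build_tags     -- tag names on the build (as Koji would report them)
    user_groups    -- group names of the logged-in user (as FAS would report them)
    required_group -- assumed group name; the real one may differ
    """
    # The rule keys off the literal string "updates-candidate" in a tag name.
    is_candidate = any("updates-candidate" in tag for tag in build_tags)
    return is_candidate and required_group in user_groups


# Example values, made up for illustration.
show = can_show_push_button(["dist-f9-updates-candidate"], ["packager"])
```

Random tag or group strings would never satisfy either check, which is why a test data set has to keep the real naming structure.
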
>> What I would do with this data
>>
>> I would be able to accelerate development of the more interesting bits of MyFedora while also being able to experiment and quickly produce patches to various bits of infrastructure.  For instance, FAS already has all the API I need to edit my profile, except it is not exposed outside of FAS because of the lack of a simple @allow_json decorator, so I had to drop that feature until after the development freeze, when a new FAS with the patch is put into production.  Even then, modifying data on a production server, even if it is my own profile, is not an ideal way to test.  If I had a data set I could set up my own test environment, apply the patch and test before we deploy.  I could then go and patch other parts of the infrastructure to, say, speed up a query, add queries I need and generally improve the base infrastructure as I develop MyFedora.  The patches would then be sent to trac and accepted or rejected in the usual manner.
>>
>> Others could also more easily get into hacking on infrastructure bits, as they would have a place to start instead of a daunting blank slate.  If I can get the data, I am more than happy to write scripts and kickstart files to easily set up and tear down a Fedora Infrastructure test and development instance.
>>
>> Whatever solution the infrastructure team thinks is good for what I need will be workable.  Above is what I think I need and an explanation of why it is needed.  Hopefully there will be some solution we can agree on to move forward fairly quickly.  Thanks for your time.
>>
> 
> We're actually in a pretty unique situation in that most of our data is
> public anyway, replicating pkgdb and bodhi data for example should be
> fairly easy.  Replicating the fas stuff should be easy too.
> 
> We're going to need to replicate not only the data but access to the data
> and this, to me at least, sounds like another development environment that
> is more mature than the pt setup but still not as strict as the staging
> environment.
> 
> 
Yep, that seems to be where the need fits in.

> What do others think on this?  I like the low overhead of the pt servers
> since people are kind of on their own in getting stuff done and it doesn't
> cause extra work to the sysadmin-web guys.  But there are drawbacks to it.
> 
I'm not sure what's best.  There are a lot of problems with doing this in
a shared development environment.  Even if we're controlling the access
to the data we'd still be more open with it here than in production or
staging.  For instance, people who are not primary fas authors or system
admins would have access to make modifications to fas.  So I think we'd
still end up wanting to modify the data before it hits this environment.
 We'd also have to devote resources to it... another db server, hosts to
run koji-web, hub, builder, etc.  We'd have to update them.  We'd have to
work out conflicts between different developers, for instance if we work
on CSRF fixes in this environment and it makes developing other apps
like myfedora just flat out fail for a while.

If we can munge the data enough to be comfortable releasing it to the
public, it seems like that would cost us less man hours.  However, it
isn't entirely free.  We'd still have to make new dumps of data, modify
it for changes in the data model, etc.  Then the developer would become
responsible for downloading the sanitised data and running it on their
network.  Which is good because it isn't us but bad because it's not
trivial to set all this up.
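
As a rough illustration of the kind of munging meant here, the sketch below
replaces identifying names with stable pseudonyms so that the links between
services survive, and drops personal fields outright.  The record layouts,
the field names and the sample values are all assumptions made for the
example; they are not the real fas, koji, bodhi or pkgdb schemas.

```python
import hashlib
import hmac
import os

# Assumed field names, for illustration only.
DROP_FIELDS = ("email", "password", "human_name")   # removed entirely
ID_FIELDS = ("username", "owner")                   # replaced by pseudonyms


def pseudonym(value, secret):
    """Map a real identifier to a stable fake one using a keyed hash."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:10]


def sanitise(records, secret):
    """Return copies of the records with personal data dropped or replaced."""
    cleaned = []
    for record in records:
        row = {}
        for key, val in record.items():
            if key in DROP_FIELDS:
                continue
            if key in ID_FIELDS and val is not None:
                row[key] = pseudonym(val, secret)
            else:
                row[key] = val
        cleaned.append(row)
    return cleaned


# One random secret per dump; throwing it away afterwards means the mapping
# cannot be recomputed from a list of known usernames.
secret = os.urandom(32)

# Made-up sample records standing in for dumps from two services.
fas_users = [{"username": "jdoe", "email": "jdoe@example.com", "groups": ["packager"]}]
koji_builds = [{"nvr": "foo-1.0-1.fc9", "owner": "jdoe", "tag": "dist-f9-updates-candidate"}]

clean_users = sanitise(fas_users, secret)
clean_builds = sanitise(koji_builds, secret)
```

Because the same secret is used across every dump in one run, "jdoe" in the
user records and "jdoe" as a build owner end up with the same pseudonym, so
the cross-service links stay intact.  Whether that is munged enough to
release publicly is a separate question this sketch does not answer.
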

-Toshio
