Mike McGrath wrote:
On Tue, 18 Nov 2008, John Palmieri wrote:
> Hey guys,
> On IRC the other day there was a discussion where I had requested the ability to use
live data stripped of all personal identification and data, for creating a test bed for
development of MyFedora. It was asked I write up a reason for needing this data as it was
hard to explain in detail on IRC.
> Current Development
> First let's go over my current development process. Right now I work on live data.
Since most of my code involves reading data this isn't an issue except for perhaps
putting a load on the servers when testing. This however becomes less ideal as I need to
test code that modifies data, such as pushing a build. Even more daunting is if I need to
add functionality to one of the other apps, creating data that would somewhat reflect the
real world is time consuming and often a blocker, which has me move on to something else.
> Why I need the data
> So why is it important to have real world data - or at least a semblance of it?
Working on something that will consolidate a lot of the data into one interface, I hit a
vast majority of our infrastructure while treating it as one entity. If each piece of
infrastructure lived in isolation, it wouldn't be as big of an issue but as it stands
the data has keys which link each record in one piece of infrastructure to a record in
another. For instance, FAS usernames link to builds in Koji, whose build numbers link
to releases in Bodhi. I need data with those links intact so I can follow the workflow
from one tool to another, test access rights and simulate the progression of various data
through the pieces of infrastructure without worrying about stomping on the data because I
can quickly restore it to its initial state. Also, I can't hit every edge case; I
need to concentrate on how the data most commonly flows and having something that
resembles what we see on the production servers is key there.
> What I am asking for
> As stated above, I would like a data set representing the data one would see in our
infrastructure. Ideally this would mean a secure process that would dump data from koji,
bodhi, fas and pkgdb while obfuscating all personally identifying data. This could
include switching package owners and uids at random so as not to be able to trace the data
back (though in reality one could gather this data slowly by querying each of the
infrastructure pieces). I only need a relatively small sampling of, say, a month's worth of
data and a semi-random drawing of the most active contributors and their packages. I can
update dates to keep the data "current" for testing purposes. Every once in
a while I would need a fresh sampling to make sure the code didn't just work with my
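
One way the obfuscation described above could keep the cross-tool links intact is to replace each username with a keyed pseudonym, so the same person maps to the same placeholder in every dump. The following is only a minimal sketch under stated assumptions: the record layouts, the field names (`username`, `owner`, `submitter`, `email`), the sample rows, and the helper names `pseudonym` and `scrub` are all hypothetical and do not reflect the real FAS, Koji, or Bodhi schemas.

```python
import hashlib
import hmac


def pseudonym(username, secret):
    """Map a username to a stable placeholder using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret, username.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


def scrub(records, user_fields, drop_fields, secret):
    """Return copies of records with user fields pseudonymized and private fields dropped."""
    cleaned = []
    for record in records:
        copy = {k: v for k, v in record.items() if k not in drop_fields}
        for field in user_fields:
            if copy.get(field) is not None:
                copy[field] = pseudonym(copy[field], secret)
        cleaned.append(copy)
    return cleaned


# Hypothetical key; it would have to stay with whoever produces the dump.
SECRET = b"replace-with-a-random-key"

# Hypothetical sample rows standing in for dumps from each tool.
fas_users = [{"username": "alice", "email": "alice@example.com"}]
koji_builds = [{"build_id": 101, "owner": "alice", "nvr": "foo-1.0-1.fc9"}]
bodhi_updates = [{"build_id": 101, "submitter": "alice", "release": "F-9"}]

scrubbed_users = scrub(fas_users, ["username"], ["email"], SECRET)
scrubbed_builds = scrub(koji_builds, ["owner"], [], SECRET)
scrubbed_updates = scrub(bodhi_updates, ["submitter"], [], SECRET)
```

Because the same key is used for every dump, a scrubbed FAS user still matches the owner of their scrubbed Koji builds, and non-personal keys such as build ids pass through unchanged. As the message notes, this would not stop someone from slowly reconstructing the mapping by querying the public tools.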
> Why pure random data isn't sufficient
> Random data does not produce the relationships needed to work with the entire Fedora
infrastructure and even if it did the data would not cover real world scenarios and most
likely the relationships would be largely invalid (like a build tagged for F-8 released in
F-9). Also things like koji tags and group information need to absolutely conform to the
structure we have set up. For instance I key off of the string
"updates-candidate" to determine if I should show a button to push the build to
bodhi. The button also relies on FAS telling bodhi that the current logged in user is in
the correct group to push. If it is not an updates candidate or the user is not in the
correct group, the button does not show.
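
The button rule described above can be sketched roughly as follows. This is an illustration, not the actual MyFedora code: the function name `can_show_push_button`, the substring match on the tag, the example tag names, and the group name `packager` are assumptions made for the example.

```python
def can_show_push_button(build_tags, user_groups, required_group):
    """Show the push button only for updates candidates owned by an allowed user."""
    # Assumption: a build counts as a candidate if any of its Koji tags
    # contains the string "updates-candidate".
    is_candidate = any("updates-candidate" in tag for tag in build_tags)
    # Assumption: FAS group membership arrives as a plain list of group names.
    in_group = required_group in user_groups
    return is_candidate and in_group


# Hypothetical tag and group names, for illustration only.
shown = can_show_push_button(["dist-f9-updates-candidate"], ["packager"], "packager")
hidden_wrong_tag = can_show_push_button(["dist-f9"], ["packager"], "packager")
hidden_wrong_group = can_show_push_button(["dist-f9-updates-candidate"], [], "packager")
```

Random test data would almost never produce a tag containing the exact string and a user in the matching group at the same time, which is why the check only gets exercised with data shaped like production.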
> What I would do with this data
> I would be able to accelerate development of the more interesting bits of myfedora
while also being able to experiment and quickly produce patches to various bits of
infrastructure. For instance, FAS already has all the API I need to edit my profile,
but it is not exposed outside of FAS for lack of a simple @allow_json decorator, so I
had to drop that feature until after the development freeze, when a new FAS with the
patch is put into production. Even then, modifying data on a production server,
even if it is my own profile, is not an ideal way to test. If I had a data set I could
set up my own test environment, apply the patch and test before we deploy. I could then
go and patch other parts of the infrastructure to, say, speed up a query, add queries I
needed and generally improve the base infrastructure as I developed MyFedora. The patches
would then be sent to trac and accepted or rejected in the usual manner.
> Others could also more easily get into hacking on infrastructure bits as they would
have a place to start instead of a daunting blank slate. If I can get the data I am more
than happy to write scripts and kickstart files to easily set up and tear down a Fedora
Infrastructure test and development instance.
> Whatever solution the infrastructure team thinks is good for what I need will be
workable. Above is what I think I need and an explanation on why it is needed. Hopefully
there will be some solution we can agree on to move forward fairly quickly. Thanks for
We're actually in a pretty unique situation in that most of our data is
public anyway, replicating pkgdb and bodhi data for example should be
fairly easy. Replicating the fas stuff should be easy too.
We're going to need to replicate not only the data but access to the data
and this, to me at least, sounds like another development environment that
is more mature than the pt setup but still not as strict as the staging
Yep, that seems to be where the need fits in.
What do others think on this? I like the low overhead of the pt
since people are kind of on their own in getting stuff done and it doesn't
cause extra work to the sysadmin-web guys. But there are drawbacks to it.
I'm not sure what's best. There are a lot of problems with doing this as
a shared development environment. Even if we're controlling the access
to the data we'd still be more open with it here than in production or
staging. For instance, people who are not primary fas authors or system
admins would have access to make modifications to fas. So I think we'd
still end up wanting to modify the data before it hits this environment.
We'd also have to devote resources to it: another db server, a host to
run koji-web, hub, builder, etc. We'd have to update them. We'd have to
work out conflicts between different developers, for instance if we work
on CSRF fixes in this environment and it makes developing other apps
like myfedora just flat out fail for a while.
If we can munge the data enough to be comfortable releasing it to the
public, it seems like that would cost us less man hours. However, it
isn't entirely free. We'd still have to make new dumps of data, modify
it for changes in the data model, etc. Then the developer would become
responsible for downloading the sanitised data and running it on their
network. Which is good because it isn't us but bad because it's not
trivial to set all this up.