John Palmieri wrote:
----- "Toshio Kuratomi" <a.badger(a)gmail.com> wrote:
<snip>
> Getting koji data munged and transferred may be a problem as it is just
> so darn big. If we don't have to make changes to the data in koji, just
> get it distributed, then we could give access to a backup... but that's
> still a lot of information to transfer.
We would only need a portion of the data. Ideally everything since the last supported
version of each distribution (or one after so we get obsolete data to test against) but in
reality the last month of activity should be suitable.
This gets us into the realm of figuring out what we can delete from the
entire koji data store, which seems like a big can of worms. Some things,
like usernames, have to be there in their entirety. Other things, like builds,
can be less than the entirety, but since there are dependencies between
builds it wouldn't be a simple "remove everything before this timestamp".
It gets us back into munging the koji data, which is what I think we
should be avoiding.
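
To illustrate why a plain timestamp cutoff isn't enough, here is a minimal
sketch of the dependency-closure problem. It assumes a drastically simplified
model: `builds` maps a build id to a completion timestamp and `deps` maps a
build id to the builds it depends on. Both structures and the function name
are hypothetical; the real koji schema has many more interlinked tables that
this does not model.

```python
from collections import deque


def builds_to_keep(builds, deps, cutoff):
    """Return ids of builds at or after `cutoff`, plus everything they depend on.

    builds: dict of build_id -> completion timestamp (hypothetical shape)
    deps:   dict of build_id -> iterable of build_ids it depends on
    """
    # Start with every build that is recent enough on its own.
    keep = {b for b, ts in builds.items() if ts >= cutoff}
    queue = deque(keep)
    # Walk the dependency graph so that older builds still needed
    # by a kept build are kept as well.
    while queue:
        current = queue.popleft()
        for dep in deps.get(current, ()):
            if dep not in keep:
                keep.add(dep)
                queue.append(dep)
    return keep


if __name__ == "__main__":
    builds = {1: 10, 2: 20, 3: 30, 4: 5}
    deps = {3: [2], 2: [1]}
    # Build 3 is the only recent one, but it drags in 2 and 1.
    print(sorted(builds_to_keep(builds, deps, cutoff=25)))
```

Even in this toy version the kept set can grow well past the cutoff, which is
the reason trimming by date alone would leave dangling references.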
> pkgdb, fas, and bodhi are relatively small.
>
> fas is where we'd have our major security problems. We can't give the
> information out unmunged. I've munged it before, though, so it's
> doable. How strict we need to be is an issue, though. If we remove all
> the identifying information in the people table except for the userid,
> is that sufficient? *Note: We probably also need to munge data in the
> configs table.
As long as we randomly generate data for that (well, the username at least). Note that
UIDs are easily mapped back to usernames, so you might want to randomize those too. Also, I
believe packagedb and bodhi use usernames as the key instead of UIDs, so those would
have to match accounts in the munged FAS db. I would suggest generating a list of names
from a dictionary and using that list to randomize names in the other services. Of course
the names need to correspond to group permissions, so some logic would be needed to make
sure records associated with a given name are valid. However, having the ability to
recreate the associated usernames may not be an issue since all of that data is public.
More importantly, we need to make sure we aren't giving out addresses, phone numbers,
password hashes and other such keys.
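
A rough sketch of that idea, assuming rows have already been dumped into
plain dicts: build one username map and one UID map up front, then apply the
same maps to every service so the records still line up. The column names
(`email`, `telephone`, `postal_address`, `password`, `submitter`), the tiny
word list and all function names are made up for illustration; the real FAS,
pkgdb and bodhi schemas will differ.

```python
import random

# Hypothetical word list; a real run would load a much larger dictionary.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "heron"]

# Columns assumed to be sensitive; adjust to the actual people table.
SENSITIVE = ("email", "telephone", "postal_address", "password")


def build_name_map(usernames, words, seed=0):
    """Map each real username to a unique dictionary word."""
    rng = random.Random(seed)
    unique = sorted(set(usernames))
    if len(unique) > len(words):
        raise ValueError("need at least as many words as usernames")
    return dict(zip(unique, rng.sample(words, len(unique))))


def build_id_map(ids, seed=0, start=100000):
    """Map each real UID to a shuffled replacement so order leaks nothing."""
    rng = random.Random(seed)
    unique = sorted(set(ids))
    new_ids = list(range(start, start + len(unique)))
    rng.shuffle(new_ids)
    return dict(zip(unique, new_ids))


def munge_person(row, name_map, id_map):
    """Copy a people row, dropping sensitive columns and swapping identity."""
    clean = {k: v for k, v in row.items() if k not in SENSITIVE}
    clean["username"] = name_map[row["username"]]
    clean["id"] = id_map[row["id"]]
    return clean


def munge_rows(rows, name_map, column):
    """Rewrite a username column in rows dumped from another service."""
    return [dict(r, **{column: name_map[r[column]]}) for r in rows]
```

Because group membership and the other services' rows are rewritten through
the same `name_map`, a munged name keeps the permissions and records of the
account it replaces.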
pkgdb uses userids in the db. Bodhi and koji use usernames. I'm
migrating pkgdb to usernames (internally right now; the db and
public-facing APIs for 0.4).
If we have to munge usernames that makes things harder as we can't just
dump the koji and bodhi dbs but also have to post-process them. (Note:
usernames are another thing that the privacy policy allows us to give out.)
> pkgdb and bodhi don't have information that is privacy policy
> sensitive.
> (Which doesn't mean that some users won't like it... just that I think
> we're covered.)
Mike's suggestion of running it by legal sounds like the best route.
Running it by legal just to be sure we're doing the right thing is good,
although we do have a list of things that we are allowed to have public
per the privacy policy and pretty good criteria for deciding on other
data. I'm commenting more on the perception aspect rather than the pure
legal obligation. And I'm not saying I think it's going to be a problem,
just that we should be prepared for a few complaints even if it's
perfectly legal.
-Toshio