On Fri, 2008-07-18 at 11:38 -0400, Mike McLean wrote:
Mike Bonnet wrote:
> On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
>> If the remote_repo_url data is going to be inherited (and I tend to
>> think it should be), then I think it should be in a separate table. I'd
>> like to reserve tag_config for data that is local to individual tags.
>> This will also make it easier to represent multiple remote repos.
> I don't have any problem with this, though it does mean we'll need to
> duplicate quite a bit of the inheritance-walking code, or make it
> configurable as to which inheritance it's walking. This new table would
> also have to be versioned, the same way the tag_config table is.
Walking inheritance is just a matter of determining the inheritance
order and scanning data on the parent tags in sequence. Currently,
nothing scans tag_config in this way because no data in tag_config is
inherited. (Well, in a sense tag_changed_since_event() does walk
tag_config, but that's a little different.)
Sorry, I was referring to walking tag_inheritance. I'd rather have one
place that walks the inheritance hierarchy and aggregates data from it,
than two places that are doing almost the same thing.
Each tag has a set of builds associated with it. We walk the
inheritance hierarchy, aggregating the builds from each tag in the
hierarchy into a flat list, and then pass that list to createrepo. We
would do essentially the same thing for external repos. When walking
the hierarchy, if a tag has an external repo associated with it, we
would append that repo url to a flat list, and pass that list to
mergerepo. In both cases we're working with collections of packages
that are associated with a tag, just in different formats.
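To make that concrete, here is a rough sketch of one walk that aggregates both kinds of data. The tag-graph layout, field names, and helpers are stand-ins, not Koji's real schema or API:

```python
# Sketch only: one depth-first walk yields both the flat build list
# (input to createrepo) and the ordered external repo list (input to
# mergerepo).  The dict-of-dicts tag graph is a hypothetical stand-in.

def walk_inheritance(tag, tags):
    """Yield a tag and its ancestors in depth-first order."""
    yield tag
    for parent in tags[tag].get('parents', []):
        for ancestor in walk_inheritance(parent, tags):
            yield ancestor

def repo_inputs(tag, tags):
    """Flatten the hierarchy into a build list and an ordered list of
    external repo urls."""
    builds = []
    ext_repos = []
    for t in walk_inheritance(tag, tags):
        builds.extend(tags[t].get('builds', []))
        url = tags[t].get('external_repo')
        if url:
            ext_repos.append(url)
    return builds, ext_repos
```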
We need to figure out how we'll deal with multiplicity for the
repos. If tag A uses repo X and inherits from tag B which uses repo Y,
then does tag A use both X and Y, or does the X entry override Y?
A (+repo X)
+- B (+repo Y)
My inclination is that it should override, because I think we'll want
some way to do overrides, and that mechanism seems the easiest.
In discussing this with Jesse, I think we want external repos to be
inherited. This is probably the easiest way to deal with having
multiple external repos getting pulled in to a single buildroot, which
is essential for Fedora (think F9 GA and F9 Updates).
The idea was that, by convention, we would have external-repo-only tags,
each with a single external repo and no packages/builds
associated. These external-repo-only tags could then be
inserted into the build hierarchy where appropriate. An ordered list of
external repos could then be constructed by performing the current
depth-first search of the inheritance hierarchy. The ordered list would
then be passed to mergerepo, which would ensure that packages in repos
earlier in the list supersede packages (by srpm name) in repos later in
the list. This would preserve the "first-match-wins" inheritance policy
that Koji currently implements, and that admins expect. For example, a
hierarchy like:
dist-custom-build
+- dist-custom
   +- dist-f9-updates-external
      +- dist-f9-ga-external
would result in mergerepo creating a single repo that would only contain
packages from dist-f9-ga-external if they did not exist in the
Koji-generated repo (dist-custom-build + dist-custom),
dist-f9-updates-external, or the blacklist of blocked packages. This is
consistent with how Koji package inheritance currently works, and I
think is the most intuitive approach.
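A first-match-wins merge of that ordered repo list could be sketched like this (the data shapes here are assumptions; in practice the work would happen inside mergerepo itself):

```python
def merge_first_wins(repos, blocked=()):
    """Merge an ordered list of repos, each modeled as a dict mapping
    srpm name -> list of rpm NVRAs.  Earlier repos win on srpm-name
    collisions, and blocked packages are dropped entirely, mirroring
    Koji's first-match-wins inheritance policy."""
    merged = {}
    for repo in repos:
        for srpm_name, rpms in repo.items():
            if srpm_name in blocked or srpm_name in merged:
                continue
            merged[srpm_name] = rpms
    return merged
```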
Also, I think we'll probably want to allow multiple external repos per
tag, something which will be much easier to represent in a separate
table. We can include an explicit priority field to make a sane
uniqueness condition (and to provide a clear ordering for the repo merge).
As outlined above, I'd prefer to keep it to one external repo per tag,
along with repo inheritance. I think this is easier from a management
perspective, and more consistent with the way Koji currently works.
Ordering for mergerepo will be represented by the location of the tag in
the inheritance hierarchy. With a 1-to-1 tag->external repo mapping, it
then makes sense to store the external repo url in the tag_config table.
> The big win here is that the methods and tools that query
> information about what was present in the buildroot at build time
I see all that, and I'm almost convinced. The flipside is that by
default all the code will treat these external rpms the same as the
local ones, which will not be correct for a number of cases. Obviously,
part of this will involve changing code to behave differently for the
external ones, I'm just worried about how much we might have to change,
or what we might miss.
Personally I'd prefer adding a few special cases to the existing code,
rather than maintaining a whole heap of almost-but-not-quite-the-same code
to manage external rpms. I think that conceptually they're alike enough
that the number of special cases will be minimal.
> Yes, I realize that the "not null" constraint should exist now, and in
> fact all rpms in the Fedora database do reference builds. However, I
> think logically having a remote rpm not reference a local build makes
> sense. The alternative is to create the build object from the srpm info
> in the repodata (along with some namespacing similar to rpminfo).
> However, this would significantly clutter the build table with
> information that is pretty non-essential.
The idea of grouping them into builds appeals to me, but I don't think
it's possible in general (though maybe we could fake it well enough
somehow). The only data we're (mostly) guaranteed to have to work with
is the sourcerpm header field. The catch is that in case of an
nvr-collision we can't determine which build it belongs to (or indeed
whether we should create a new build with the same nvr).
I think that synthesizing builds for the sake of maintaining the
not-null constraint is more pain than it's worth, and would make
enforcing our nvr-uniqueness constraints (which we definitely want to do
for local builds) more difficult. Having locally-built rpms always
associated with a build, and external rpms not, makes sense to me.
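A sketch of those two rules together (the function and field names are hypothetical, and the set stands in for the database's uniqueness index):

```python
def check_new_rpm(rpm, local_nvrs):
    """Enforce the proposed split: a locally-built rpm must reference a
    build and have a unique n-v-r; an external rpm (build_id is None)
    is exempt from both checks."""
    if rpm.get('build_id') is not None:
        if rpm['nvr'] in local_nvrs:
            raise ValueError('duplicate local n-v-r: %s' % rpm['nvr'])
        local_nvrs.add(rpm['nvr'])
    # external rpms: no build reference, no n-v-r uniqueness check
```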
> I'm open to suggestions on how to modify the uniqueness constraint to
> handle this case. We care about ensuring that a locally-built rpm
> doesn't have the same n-v-r as another locally-built rpm. I don't think
> we care at all about n-v-r uniqueness amongst remote rpms. However, we
> probably want to avoid creating 2 rpminfo entries when the same remote
> rpm is used in 2 different buildroots. Using the sigmd5 is a good way
> to avoid that.
Agreed. same sigmd5 ==> same rpm.
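For example, keying rpminfo lookups on the sigmd5 would make the re-use automatic (the dict is a stand-in for the database table; names are hypothetical):

```python
def ensure_rpminfo(rpminfo_by_sigmd5, rpm):
    """Return the rpminfo id for this rpm, inserting a row only if its
    sigmd5 has not been seen before.  Two buildroots pulling in the
    same remote rpm therefore share a single rpminfo entry."""
    entry = rpminfo_by_sigmd5.get(rpm['sigmd5'])
    if entry is None:
        entry = {'id': len(rpminfo_by_sigmd5) + 1,
                 'nvra': rpm['nvra'],
                 'sigmd5': rpm['sigmd5']}
        rpminfo_by_sigmd5[rpm['sigmd5']] = entry
    return entry['id']
```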
> However, what happens if a remote rpm with the same
> n-v-r and sigmd5 gets pulled in from 2 different remote repos?
This gets into part of what bugs me about this and why I'm somewhat
inclined to keep the ext repo data a step removed. It's so potentially
dirty. Koji has all these consistency constraints that an external repo
(much less many of them in aggregate) lacks.
It's quite possible that an external repo might respin a package keeping
the same nvr, so we don't even need 2 external repos to hit this case.
> the "origin" field should be pushed down to the buildroot_listing table,
> so the buildroots can reference the same rpminfo object, but indicate
> that it came from a different repo in each buildroot?
Interesting. Yeah, I think that is probably the right answer.
Also, I'm thinking we need to have some sort of rpm_origin table so that
all these references can be managed cleanly.
That sounds reasonable to me. Note that we may end up with a lot of
rows in this table, since we're allowing variable substitution in the
external_repo_url (tag name and arch). But I don't see that as a
problem.
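The substitution itself is trivial; the placeholder spellings here ($tag, $arch) are an assumption about the eventual convention, and each distinct expansion would be a separate origin row:

```python
def expand_repo_url(url, tag_name, arch):
    """Expand tag-name and arch placeholders in an external_repo_url.
    The $tag/$arch placeholder names are assumed, not settled."""
    return url.replace('$tag', tag_name).replace('$arch', arch)
```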
> Also, what happens when we find 2 remote rpms with the same n-v-r but
> different sigmd5s? Should that be an error?
Certainly we have to allow the possibility that two origins might have
overlapping nvras. Within a single origin, I'm not so sure. I suppose we
can get away with some small consistency demands. As long as we're only
enforcing unique nvra for local builds and indexing by sigmd5/similar, I
don't think we /have/ to make this an error condition.
Yeah, it's probably safest to not make this an error condition, since we
have very little control over the remote repos.
In the same vein, what happens when an external repo has an nvra
matching a /local/ rpm? Maybe it doesn't matter, though I guess
technically we want to record the origin properly when it gets into a
buildroot via external repo vs internal tag.
Right, we would record the origin as the remote repo it came from (by
parsing the merged repodata and looking at the baseurl).
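That bookkeeping could look something like this (the parsed-entry and id-mapping shapes are assumed, not mergerepo's actual output format):

```python
def origins_for_buildroot(merged_entries, repo_id_by_baseurl):
    """Map each rpm in the merged repodata back to an origin id for
    the buildroot_listing rows.  merged_entries is a list of
    (nvra, baseurl) pairs; the local Koji repo is just another
    baseurl in the mapping."""
    return dict((nvra, repo_id_by_baseurl[baseurl])
                for nvra, baseurl in merged_entries)
```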
>> First, I'd like to be able to support external koji
>> servers (or rather a
> I agree that this is a desirable goal. I believe this is more the
> domain of the Koji secondary-arch daemon. It would be talking directly
Well, it has some similarities to 2nd arch, but still quite different.
The more I think about it, the more I think that supporting an external
koji server will probably be much different from the ext repo
business. Most of the issues with rpminfo will carry over, but with a
koji server we will be able to determine build data and can probably
actually pull off something like "inherit from tag X on koji server Y."
And in the external Koji server case, it might actually make sense to
create build objects for the external rpms, since we'll be able to query
the external Koji about which build an rpm came from.
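With a real Koji hub on the other end, the lookup could be as simple as the following sketch, where session is any object exposing the hub's getRPM/getBuild calls:

```python
def build_for_external_rpm(session, nvra):
    """Ask an external Koji hub which build an rpm belongs to, so that
    a local build object could be created for it.  Returns None when
    the hub doesn't know the rpm."""
    rpminfo = session.getRPM(nvra)
    if rpminfo is None:
        return None
    return session.getBuild(rpminfo['build_id'])
```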
> The tag content may be managed by build, but when it's time
> for it to
> actually get used (in the form of a yum repo) it gets unfolded into a
> big list of rpms. And what gets associated with a buildroot is simply a
> big list of rpms. Conceptually I don't really have a problem with the
> idea of a tag as a big list of rpms, that we happen to group by srpm
> within Koji because it's more convenient for us. So adding the external
> repo information to tag_config is just an extension of the big list of
> rpms model.
Yeah, I almost wish I hadn't made the build structure quite the way I did.
> However, we will already be parsing the remote repodata, which contains
> information like the srpm name for each rpm, so we could do something
> more sophisticated here.
> The repomerge tool seems like it solves the problem better, and would be
> more useful in general.
If we're going to have our fingers in the repodata, we'll probably want
to have them in the merge too. Perhaps we can get createrepo and/or this
repomerge tool usefully libified?
I was thinking we would probably just call out to the tool the way we do
for createrepo, but I'm certainly not against using an API. I'm a
little concerned about memory usage when doing the create/mergerepo
in-process, since we know python and mod_python have garbage-collection
issues, but that may be a "cross the bridge when we come to it" problem.
Seth, is it feasible to provide an API to mergerepo that we could use?