Mike Bonnet wrote:
On Fri, 2008-07-18 at 11:38 -0400, Mike McLean wrote:
> Mike Bonnet wrote:
>> On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
>>> If the remote_repo_url data is going to be inherited (and I tend to
>>> think it should be), then I think it should be in a separate table.
...
>> I don't have any problem with this, though it does mean we'll need to
>> duplicate quite a bit of the inheritance-walking code,
...
> Walking inheritance is just a matter of determining the inheritance
> order and scanning data on the parent tags in sequence.
...
Sorry, I was referring to walking tag_inheritance. I'd rather have one
place that walks the inheritance hierarchy and aggregates data from it,
than two places that are doing almost the same thing.
We're talking about inherently different data. External repos to be
merged in are quite different from builds in the system.
Each tag has a set of builds associated with it. We walk the
inheritance hierarchy, aggregating the builds from each tag in the
hierarchy into a flat list, and then pass that list to createrepo. We
would do essentially the same thing for external repos. When walking
the hierarchy, if a tag has an external repo associated with it, we
would append that repo url to a flat list, and pass that list to
mergerepo. In both cases we're working with collections of packages
that are associated with a tag, just in different formats.
Sure, we can do this with one call to readFullInheritance, and traverse
both the build table and external repo table from the given order.
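A minimal sketch of that single walk, assuming the inheritance order is already available as a plain list of tag names (the real return shape of readFullInheritance is not shown in this thread). The function name, the two dictionaries standing in for the build and external repo tables, and the example URLs are all hypothetical:

```python
def collect_for_repo(order, builds_by_tag, external_repo_by_tag):
    """Walk the tags once, in inheritance order, and gather both kinds of data.

    order: tag names, nearest tag first (assumed precomputed)
    builds_by_tag: tag name -> list of (package_name, nvr) tuples (hypothetical)
    external_repo_by_tag: tag name -> external repo URL (hypothetical)
    """
    builds = []
    seen_packages = set()
    repo_urls = []
    for tag in order:
        # Builds: the first tag in the order that provides a package wins.
        for package, nvr in builds_by_tag.get(tag, []):
            if package not in seen_packages:
                seen_packages.add(package)
                builds.append(nvr)
        # External repos: keep them in the same inheritance order.
        url = external_repo_by_tag.get(tag)
        if url is not None and url not in repo_urls:
            repo_urls.append(url)
    return builds, repo_urls
```

The builds list would go to createrepo and the URL list to mergerepo, so only one place needs to know how to traverse the hierarchy.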
In discussing this with Jesse, I think we want external repos to be
inherited. This is probably the easiest way to deal with having
multiple external repos getting pulled in to a single buildroot, which
is essential for Fedora (think F9 GA and F9 Updates).
The idea was that, by convention, we would have external-repo-only tags,
each with a single external repo and no packages/builds associated with
it. These external-repo-only tags could then be
inserted into the build hierarchy where appropriate. An ordered list of
external repos could then be constructed by performing the current
depth-first search of the inheritance hierarchy. The ordered list would
then be passed to mergerepo, which would ensure that packages in repos
earlier in the list supersede packages (by srpm name) in repos later in
the list. This would preserve the "first-match-wins" inheritance policy
that Koji currently implements, and that admins expect. For example:
dist-custom-build
  ├─dist-custom
  └─dist-f9-updates-external
     └─dist-f9-ga-external
would result in mergerepo creating a single repo that would only contain
packages from dist-f9-ga-external if they did not exist in the
Koji-generated repo (dist-custom-build + dist-custom),
dist-f9-updates-external, or the blacklist of blocked packages. This is
consistent with how Koji package inheritance currently works, and I
think is the most intuitive approach.
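To make the proposed ordering concrete, here is a rough sketch of first-match-wins merging keyed on srpm name. It is not mergerepo's actual code; the function name, the dictionary layout of each rpm, and the repo names in the test data are assumptions for illustration only:

```python
def merge_first_match_wins(repos, blocked=()):
    """Merge an ordered list of repos so that earlier repos supersede later ones.

    repos: list of (repo_name, rpms), where each rpm is a dict with
           "name" and "srpm_name" keys (hypothetical layout)
    blocked: srpm names that must never appear in the merged result
    Returns a list of (repo_name, rpm_name) pairs.
    """
    blocked = set(blocked)
    owner = {}   # srpm name -> name of the first repo that provided it
    merged = []
    for repo_name, rpms in repos:
        for rpm in rpms:
            srpm = rpm["srpm_name"]
            if srpm in blocked:
                continue
            # setdefault records the first repo to offer this srpm and
            # returns that owner on every later lookup.
            if owner.setdefault(srpm, repo_name) == repo_name:
                merged.append((repo_name, rpm["name"]))
    return merged
```

All rpms built from one srpm come from the same repo, so a subpackage from GA can never be mixed with a newer main package from Updates.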
It is similar, but different in potentially confusing ways. External
repos do not have build structure, so we can't really have the same sort
of inheritance behavior with a combination of external repo tags and
normal tags.
We order the external repos in inheritance order, but ultimately those
repos are merged with the internal one in a way that does not honor
inheritance in the way that the admin might expect.
Using tags to represent external repos fails intuition because external
repos are very much not like tags. When we get to supporting external
koji systems, we can do something like this, but for external repos the
"bolted-on" nature needs to be clear. This is why I'd prefer to have the
data a little more removed.
> I see all that, and I'm almost convinced. The flipside is that by
> default all the code will treat these external rpms the same as the
> local ones, which will not be correct for a number of cases.
Personally I'd prefer adding a few special cases to the existing code,
rather than maintain a whole heap of almost-but-not-quite-the-same code
to manage external rpms. I think that conceptually they're alike enough
that the number of special cases will be minimal.
I think I'm ok with using the rpminfo table.
I think that synthesizing builds for the sake of maintaining the
not-null constraint is more pain than it's worth, and would make
enforcing our nvr-uniqueness constraints (which we definitely want to do
for local builds) more difficult. Having locally-built rpms always
associated with a build, and external rpms not, makes sense to me.
Ok, agreed.
> Also, I'm thinking we need to have some sort of rpm_origin table so that
> all these references can be managed cleanly.
That sounds reasonable to me. Note that we may end up with a lot of
rows in this table, since we're allowing variable substitution in the
external_repo_url (tag name and arch). But I don't see that as a
problem.
I'm thinking the only substitution we should support is arch. Anything
else sort of constitutes a different repo.
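As an illustration of arch-only substitution, something along these lines would reject any other variable outright. The `$arch` placeholder syntax, the function name, and the example URL are assumptions, not an agreed format:

```python
from string import Template


def expand_repo_url(url_template, arch):
    """Expand the $arch placeholder in an external repo URL.

    Template.substitute() raises KeyError for any placeholder we do not
    supply, so a URL containing e.g. $tag is rejected instead of being
    silently passed through.
    """
    return Template(url_template).substitute(arch=arch)
```

With this, a URL that needs a tag name substituted would have to be registered as a separate repo.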
If we use an origin table like this we can abstract out the arch.
Something like:
create table external_repo (
    id SERIAL PRIMARY KEY,
    name TEXT
);
create table external_repo_config (
    external_repo_id INTEGER NOT NULL REFERENCES external_repo (id),
    url TEXT NOT NULL,
    -- plus versioning fields
    -- ...
);
This way if upstream repo changes url scheme or moves to a different
host, you can keep some notion of connectedness. External rpms would
simply reference external_repo_id.
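A small sketch of how that connectedness could work, using an in-memory SQLite database so it can be run standalone. The `active` column is only a stand-in for the unspecified versioning fields, and the `external_rpm` table, the helper names, and the URLs are hypothetical:

```python
import sqlite3

SCHEMA = """
CREATE TABLE external_repo (
    id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE external_repo_config (
    external_repo_id INTEGER NOT NULL REFERENCES external_repo (id),
    url TEXT NOT NULL,
    active INTEGER NOT NULL DEFAULT 1  -- hypothetical stand-in for versioning
);
CREATE TABLE external_rpm (  -- hypothetical table of rpms seen in external repos
    name TEXT NOT NULL,
    external_repo_id INTEGER NOT NULL REFERENCES external_repo (id)
);
"""


def move_repo(db, repo_id, new_url):
    """Retire the current URL and record a new one for the same repo id."""
    db.execute(
        "UPDATE external_repo_config SET active = 0 WHERE external_repo_id = ?",
        (repo_id,),
    )
    db.execute(
        "INSERT INTO external_repo_config (external_repo_id, url) VALUES (?, ?)",
        (repo_id, new_url),
    )


def current_url_for_rpm(db, rpm_name):
    """Look up the active URL of the repo an external rpm came from."""
    row = db.execute(
        "SELECT c.url FROM external_rpm r "
        "JOIN external_repo_config c ON c.external_repo_id = r.external_repo_id "
        "WHERE r.name = ? AND c.active = 1",
        (rpm_name,),
    ).fetchone()
    return row[0] if row else None
```

Because the rpm rows point at external_repo rather than at a URL, moving the repo touches only the config table and the old URL stays on record.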
> In the same vein, what happens when an external repo has an nvra+sigmd5
> matching a /local/ rpm? Maybe it doesn't matter, though I guess
> technically we want to record the origin properly when it gets into a
> buildroot via external repo vs internal tag.
Right, we would record the origin as the remote repo it came from (by
parsing the merged repodata and looking at the baseurl).
So where do we draw the line between code that we add to koji and code
that we add to createrepo (or some external merge-repo tool)?
>> However, we will already be parsing the remote repodata, which contains
>> information like the srpm name for each rpm, so we could do something
>> more sophisticated here.
> -snipsnip-
> ...
>> The repomerge tool seems like it solves the problem better, and would be
>> more useful in general.
> If we're going to have our fingers in the repodata, we'll probably want
> to have them in the merge too. Perhaps we can get createrepo and/or this
> repomerge tool usefully libified?
I was thinking we would probably just call out to the tool the way we do
for createrepo, but I'm certainly not against using an API. I'm a
little concerned about memory usage when doing the create/mergerepo
in-process, since we know python and mod_python have garbage-collection
issues, but that may be a "cross the bridge when we come to it" problem.
Seth, is it feasible to provide an API to mergerepo that we could use
directly?
I don't think I even saw a reply from Seth on this. Where does the
mergerepo code stand now?