Thoughts and question about MM2's UMDL script

Kevin Fenzi kevin at scrye.com
Fri Jun 26 16:50:07 UTC 2015


On Fri, 26 Jun 2015 18:11:44 +0200
Pierre-Yves Chibon <pingou at pingoured.fr> wrote:

> Dear all,
> 
> Yesterday and today I spent a little time going over the UMDL script
> of MirrorManager2.
> Going through it, I ended up with a few questions regarding it.
> 
> * Repository name
> UMDL's code clearly says:
>   # historically, Repository.name was a longer string with
>   # product and category deliniations.  But we were getting
>   # unique constraint conflicts once we started introducing
>   # repositories under repositories.  And .name isn't used for
>   # anything meaningful.  So simply have it match dir.name,
>   # which can't conflict.
> And quickly grepping through MM2's sources, I could not find a
> reference to this; we always rely on the repository's prefix, not its
> name.
> 
> Question: Should we drop this?
> It makes things confusing and is basically noise since we do not use
> it anywhere.

Sounds fine to me if we don't use it. 

> * Readable status of directories
> The Directory table has a 'readable' property, but all of our
> directories are readable.
> 
> Question is: what is the use-case for this boolean?

Could it be for when we have a release about to come out, but it's not
readable to the public yet? Typically we stage a release on Friday, and
until the actual release on Tuesday the directory isn't open; only
mirrors with the ACLs can sync it.
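
If 'readable' is meant to capture that staging window, one possible check is
whether the directory's permission bits let anonymous users list and enter
it. This is only a sketch under that assumption: the helper name
is_world_readable is hypothetical, and nothing in this thread confirms that
the UMDL computes the flag this way.

```python
import os
import stat


def is_world_readable(path):
    """Return True if 'other' users can both list and enter the directory.

    Hypothetical helper: assumes 'readable' maps to POSIX permission bits.
    """
    mode = os.stat(path).st_mode
    # A directory needs r (list) and x (traverse) for 'other' to be usable
    # by an anonymous rsync or HTTP client.
    needed = stat.S_IROTH | stat.S_IXOTH
    return stat.S_ISDIR(mode) and (mode & needed) == needed
```

A staged release directory with mode 0750 would be reported as not readable
until it is opened up to 0755.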

> * Changes while running
> Looking at the code, the UMDL seems to be very careful to handle
> changes on the FS while it is running.
> One hope I have is to speed up the UMDL run time, but I'm curious.
> 
> Question: Does anyone know if the FS changes often while the UMDL is
> actually running?
> Gaining speed of course does not mean being reckless but I'm curious
> as to how often this situation occurs. IIRC, we trigger the UMDL via
> fedmsg now, right? So in theory, the FS shouldn't change too much
> under the UMDL's feet.

Well, I can think of one common case: 

1. Fedora updates push finishes, umdl starts. 
2. EPEL updates push finishes while umdl is in the middle of its
directories.

We could of course fix this by making it crawl them separately?
For each category?

> 
> * The directory table
> So looking at the database and more precisely the directory table in
> that database, it seems we store all the directories of the tree, ie:
> /pub/alt/
> /pub/alt/anaconda/
> /pub/alt/bfo/
> /pub/alt/bfo/gpxe-20120514
> ...
> This makes me wonder a little. What is the benefit of keeping the
> whole list of directories in the DB?
> After all, as far as I understand, the UMDL finds the repo in the
> tree (repo being defined by the presence of a 'repodata' folder
> containing the repomd.xml or by the presence of a 'summary' file and
> an 'objects' folder). For these repos, we look for the most recent
> files, store this info in the DB and later use it to check if the
> mirrors are up to date.
>
> But do we need to check that ``pub/fedora/linux`` exists when we
> later check that ``pub/fedora/linux/updates/testing/21/x86_64/``
> exists and is up to date?
> 
> I am currently under the impression that dropping unnecessary
> directories would save DB space (each directory is also linked in
> the host_category_dir table, which lists for each host and category
> which directories are present) as well as crawling time (both in the
> UMDL and in the crawler).

Yeah, I can't think of a reason we keep all that. 
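
The repository test described in the quoted text above (a 'repodata' folder
containing repomd.xml, or a 'summary' file next to an 'objects' folder) could
be sketched roughly as follows. The function names is_repo_dir and
find_repo_dirs are hypothetical and this is not MirrorManager2's actual code;
it only illustrates keeping repository directories instead of every
directory in the tree.

```python
import os


def is_repo_dir(path):
    """Apply the two repository markers described in the thread."""
    # First marker: a 'repodata' folder containing repomd.xml.
    if os.path.isfile(os.path.join(path, "repodata", "repomd.xml")):
        return True
    # Second marker: a 'summary' file next to an 'objects' folder.
    return (os.path.isfile(os.path.join(path, "summary"))
            and os.path.isdir(os.path.join(path, "objects")))


def find_repo_dirs(top):
    """Return the sorted directories under 'top' that look like repositories."""
    repos = []
    for dirpath, dirnames, _filenames in os.walk(top):
        if is_repo_dir(dirpath):
            repos.append(dirpath)
            # Skip the repo's own metadata folders, but keep descending,
            # since repositories can live under other repositories.
            dirnames[:] = [d for d in dirnames
                           if d not in ("repodata", "objects")]
    return sorted(repos)
```

With this approach a parent such as pub/fedora/linux would not be recorded
on its own; only the directories that carry repository markers would be.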
 
> * Non-directory based support in UMDL.
> 
> So the UMDL script currently supports three ways of crawling the tree:
> * file
> * rsync
> * directory
> 
> We, in Fedora, are only using the last one. I believe the `rsync`
> mode was added to support Ubuntu and the file mode is basically a
> simplified version of the directory mode, but one that we do not use
> at the moment.
> 
> I would like to propose that we drop support for rsync. I feel that
> it may be simpler and easier to create a UMDL and a crawler for each
> distro that would like to use MirrorManager than maintaining a
> one-script-fits-all UMDL that is in fact tested for only one of the
> scenarios. That being said, if we ever have interest from Ubuntu,
> CentOS or any other communities, we should definitely look into
> making the UMDL and crawler as re-usable as possible for them, but
> keeping the distro-specific bits separated.

Sounds fine to me, although it seems rpmfusion might use the rsync one. 

> 
> Looking forward to hearing your thoughts about these points and
> questions. Thanks,

I had one additional thought based on recent issues we have had:

Right now when an updates push happens, umdl starts and crawls
everything in all directory trees. Perhaps we could be much more
targeted here? If the fedmsg says 'rawhide' was updated, only crawl that
area. If it says "Fedora 21" or "EPEL 7" only crawl those.

And then of course we would need some way to have it crawl everything
in order to add new releases like Fedora 23 Alpha or whatever, but it
could just do that once a day? Or on demand?
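
A rough sketch of that idea, assuming the notification can be reduced to a
short label such as 'rawhide' or "EPEL 7". The label strings, the paths in
SUBTREES and the name crawl_targets are all illustrative assumptions, not
the real fedmsg payload or the real tree layout.

```python
# Hypothetical mapping from a notification label to the subtrees that
# would need crawling; labels and paths are illustrative only.
SUBTREES = {
    "rawhide": ["pub/fedora/linux/development/rawhide"],
    "Fedora 21": [
        "pub/fedora/linux/updates/21",
        "pub/fedora/linux/updates/testing/21",
    ],
    "EPEL 7": ["pub/epel/7", "pub/epel/testing/7"],
}

# Fallback used for the daily or on-demand full run.
FULL_CRAWL = ["pub"]


def crawl_targets(label=None):
    """Return the subtrees to crawl for a notification label.

    An unknown or missing label falls back to a full crawl, which is how
    a brand new release would get picked up.
    """
    return list(SUBTREES.get(label, FULL_CRAWL))
```

A daily cron job would call crawl_targets() with no label, while a message
handler would pass whatever label it extracted from the push notification.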

Just a thought. 

kevin
