Thoughts and question about MM2's UMDL script

Fri Jun 26 18:45:18 UTC 2015

On Fri, Jun 26, 2015 at 06:11:44PM +0200, Pierre-Yves Chibon wrote:
> Yesterday and today I spent a little time going over the UMDL script of
> MirrorManager2.
> Going through it, I ended up with a few questions regarding it.
> 
> * Repository name
> UMDL's code clearly says:
>   # historically, Repository.name was a longer string with
>   # product and category deliniations.  But we were getting
>   # unique constraint conflicts once we started introducing
>   # repositories under repositories.  And .name isn't used for
>   # anything meaningful.  So simply have it match dir.name,
>   # which can't conflict.
> And quickly grepping through MM2's sources, I could not find a reference to
> this; we always rely on the repository's prefix, not its name.
> 
> Question: Should we drop this?
> It makes things confusing and is basically noise since we do not use it anywhere.

It was a helpful column for fixing errors with the repos. But as the
database is so huge, everything we can drop should be dropped.

[...]

> * The directory table
> So looking at the database and more precisely the directory table in that
> database, it seems we store all the directories of the tree, ie:
> /pub/alt/
> /pub/alt/anaconda/
> /pub/alt/bfo/
> /pub/alt/bfo/gpxe-20120514
> ...
> This makes me wonder a little. What is the point of keeping the whole
> list of directories in the DB?
> After all, as far as I understand, the UMDL finds the repos in the tree (a repo
> being defined by the presence of a 'repodata' folder containing the repomd.xml
> or by the presence of a 'summary' file and an 'objects' folder).
> For these repos, we look for the most recent files, store this info in the DB
> and later use it to check if the mirrors are up to date.
> 
> But do we need to check that ``pub/fedora/linux`` exists when we later check
> that ``pub/fedora/linux/updates/testing/21/x86_64/`` exists and is up to date?
> 
> I am under the impression currently that dropping unnecessary directories would
> save DB space (the directories being then linked in the host_category_dir table
> listing for each host, in each category which dir are present) as well as
> crawling time (both in the UMDL and in the crawler).

Again, dropping unnecessary information from the database sounds good.
Although this one sounds a bit more complex as you always have to delete
directories if subdirectories appear and add directories if
subdirectories disappear.
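
To make that bookkeeping concrete, here is a rough sketch, not MM2 code.
It assumes a repo is detected by the rule quoted above (a 'repodata'
folder containing repomd.xml, or a 'summary' file plus an 'objects'
folder). The names is_repo, find_repos and plan_sync are hypothetical.

```python
import os


def is_repo(path):
    """True if path matches the repo rule quoted above (assumed rule)."""
    if os.path.isfile(os.path.join(path, "repodata", "repomd.xml")):
        return True
    return (os.path.isfile(os.path.join(path, "summary"))
            and os.path.isdir(os.path.join(path, "objects")))


def find_repos(top):
    """Walk the tree below top and return the set of repo directories."""
    return {dirpath for dirpath, _dirs, _files in os.walk(top)
            if is_repo(dirpath)}


def plan_sync(stored, found):
    """Compare directories already in the DB with those found on disk.

    Returns (to_add, to_delete) as two sets.
    """
    return found - stored, stored - found
```

Every run would have to compute both sets, since a directory that was a
repo yesterday may no longer be one today and the other way around.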

> * Non-directory based support in UMDL.
> 
> So the UMDL script currently supports three ways of crawling the tree:
> * file
> * rsync
> * directory
> 
> We, in Fedora, are only using the last one. I believe the `rsync` mode was added
> to support Ubuntu and the file mode is basically a simplified version of the
> directory mode, but one that we do not use at the moment.
> 
> I would like to propose that we drop support for rsync. I feel that it may be
> simpler and easier to create a UMDL and a crawler for each distro that would
> like to use MirrorManager than to maintain a one-script-fits-all UMDL that is
> in fact tested for only one of the scenarios.
> That being said, if we ever have interest from Ubuntu, CentOS or any other
> communities, we should definitely look into making the UMDL and crawler as
> re-usable as possible for them, but keeping the distro-specific bits separated.

As already mentioned, RPM Fusion uses the rsync mode as the master
mirror is 'far' away from the MirrorManager installation. It is still
using MM1 on CentOS 5 and currently I am not planning to upgrade to
MM2 any time soon. So it could be removed and I should be able to write
the necessary umdl rsync crawler once I need it.

Another thought I had about umdl concerns the file mode. For the
categories 'Fedora EPEL' and 'Fedora Linux' we have files called
'fullfilelist'. Maybe umdl could use those to reduce I/O on the NFS
mounts, only actually reading files and metadata from NFS when it is
necessary. Just one of those ideas.
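
A minimal sketch of the idea, assuming 'fullfilelist' contains one
relative file path per line (I have not checked the exact format, so
treat that as an assumption). dirs_from_filelist is a hypothetical
helper, not something that exists in umdl.

```python
import posixpath


def dirs_from_filelist(lines):
    """Derive the set of directories from a list of file paths.

    Assumes one relative file path per line; no filesystem access
    is needed, so nothing is read from NFS here.
    """
    dirs = set()
    for line in lines:
        path = line.strip()
        if not path:
            continue
        # Normalise things like a leading "./" before splitting.
        parent = posixpath.dirname(posixpath.normpath(path))
        # Add the parent and all of its ancestors not seen so far.
        while parent and parent not in dirs:
            dirs.add(parent)
            parent = posixpath.dirname(parent)
    return dirs
```

It could be fed with the open file directly, for example
dirs_from_filelist(open('fullfilelist')), and only the directories that
changed would then need a real look on NFS.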

		Adrian