Thoughts and question about MM2's UMDL script

Pierre-Yves Chibon pingou at pingoured.fr
Fri Jun 26 16:11:44 UTC 2015


Dear all,

Yesterday and today I spent a little time going over the UMDL script of
MirrorManager2.
Going through it, I ended up with a few questions.

* Repository name
UMDL's code clearly says:
  # historically, Repository.name was a longer string with
  # product and category deliniations.  But we were getting
  # unique constraint conflicts once we started introducing
  # repositories under repositories.  And .name isn't used for
  # anything meaningful.  So simply have it match dir.name,
  # which can't conflict.
And quickly grepping through MM2's sources, I could not find any use of
Repository.name: we always rely on the repository's prefix, not its name.

Question: Should we drop this?
It makes things confusing and is basically noise since we do not use it anywhere.

* Readable status of directories
The Directory table has a 'readable' property, but none of our directories is
marked as not readable.

Question: What is the use case for this boolean?

* Changes while running
Looking at the code, the UMDL seems to be very careful to handle changes on the
FS while it is running.
One hope I have is to speed up the UMDL run time, but I'm curious.

Question: Does anyone know if the FS changes often while the UMDL is actually
running?
Gaining speed of course does not mean being reckless, but I'm curious as to how
often this situation occurs. IIRC, we trigger the UMDL via fedmsg now, right?
So in theory, the FS shouldn't change too much under the UMDL's feet.
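
One cheap way to put a number on this would be to snapshot the directory
mtimes just before and just after a run and compare the two. The sketch below
is only an illustration of that idea: `snapshot` and `changed_dirs` are
hypothetical helpers, not functions from the UMDL, and the approach assumes
that directory mtimes are a good enough signal of change.

```python
import os


def snapshot(top):
    """Map every directory under `top` to its modification time (in ns)."""
    result = {}
    for dirpath, _dirnames, _filenames in os.walk(top):
        try:
            result[dirpath] = os.stat(dirpath).st_mtime_ns
        except FileNotFoundError:
            # The directory vanished while we were walking the tree.
            continue
    return result


def changed_dirs(before, after):
    """Return the directories added, removed or modified between snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Calling `snapshot()` before and after a run and logging the length of
`changed_dirs(before, after)` over a few days should tell us how often the
tree really moves under the UMDL.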

* The directory table
Looking at the database, and more precisely at the directory table in that
database, it seems we store all the directories of the tree, i.e.:
/pub/alt/
/pub/alt/anaconda/
/pub/alt/bfo/
/pub/alt/bfo/gpxe-20120514
...
This makes me wonder: what is the point of keeping the whole
list of directories in the DB?
After all, as far as I understand, the UMDL finds the repos in the tree (a repo
being defined by the presence of a 'repodata' folder containing the repomd.xml
or by the presence of a 'summary' file and an 'objects' folder).
For these repos, we look for the most recent files, store this info in the DB
and later use it to check if the mirrors are up to date.
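
To make the detection rule concrete, here is a minimal Python sketch of it as
described in the paragraph above. It is not copied from the UMDL: the names
`is_repo` and `find_repos` are hypothetical, and the real script may well
apply additional checks.

```python
import os


def is_repo(path):
    """Return True if `path` looks like a repository root.

    Rule as described in this mail (an assumption, not the UMDL code):
    - a 'repodata' folder containing repomd.xml, or
    - a 'summary' file next to an 'objects' folder.
    """
    has_repomd = os.path.isfile(os.path.join(path, "repodata", "repomd.xml"))
    has_summary = os.path.isfile(os.path.join(path, "summary"))
    has_objects = os.path.isdir(os.path.join(path, "objects"))
    return has_repomd or (has_summary and has_objects)


def find_repos(top):
    """Walk `top` and return the sorted list of repository roots.

    The walk keeps descending below a repository, since repositories
    can live under other repositories.
    """
    return sorted(
        dirpath
        for dirpath, _dirnames, _filenames in os.walk(top)
        if is_repo(dirpath)
    )
```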

But do we need to check that ``pub/fedora/linux`` exists when we later check
that ``pub/fedora/linux/updates/testing/21/x86_64/`` exists and is up to date?

I am currently under the impression that dropping unnecessary directories would
save DB space (since the directories are then linked in the host_category_dir
table, which lists for each host, in each category, which dirs are present) as
well as crawling time (both in the UMDL and in the crawler).
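
To get a rough idea of how many rows this could save, one could compute which
stored directories are pure ancestors of a repository, since their existence
is implied by the repository itself. This is only a sketch under that
assumption: `ancestors` and `implied_by_repos` are hypothetical names, and the
inputs are plain lists of paths rather than rows from the actual tables.

```python
def ancestors(path):
    """Yield every proper ancestor of a slash-separated path."""
    parts = path.strip("/").split("/")
    for i in range(1, len(parts)):
        yield "/".join(parts[:i])


def implied_by_repos(all_dirs, repo_dirs):
    """Return the directories of `all_dirs` that sit above a repository.

    Repositories themselves are never returned, and neither are
    directories that have no repository below them.
    """
    repos = {d.strip("/") for d in repo_dirs}
    implied = {a for repo in repos for a in ancestors(repo)}
    candidates = {d.strip("/") for d in all_dirs}
    return sorted((candidates & implied) - repos)
```

For instance, with `pub`, `pub/fedora`, `pub/fedora/linux` and
`pub/fedora/linux/updates/testing/21/x86_64` stored and only the last one
being a repository, the first three come back as candidates for dropping.
Directories with no repository below them (if any) are left out and would
need a separate decision.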

* Non-directory based support in UMDL

So the UMDL script currently supports three ways of crawling the tree:
  * file
  * rsync
  * directory

We, in Fedora, are only using the last one. I believe the `rsync` mode was added
to support Ubuntu and the file mode is basically a simplified version of the
directory mode, but we do not use it at the moment.

I would like to propose that we drop support for rsync. I feel that it may be
simpler and easier to create a UMDL and a crawler for each distro that would
like to use MirrorManager than to maintain a one-script-fits-all UMDL that is
in fact tested for only one of the scenarios.
That being said, if we ever have interest from Ubuntu, CentOS or any other
communities, we should definitely look into making the UMDL and crawler as
re-usable as possible for them, while keeping the distro-specific bits separated.


Looking forward to hearing your thoughts on these points and questions,
Thanks,

Pierre