Thoughts and question about MM2's UMDL script

Pierre-Yves Chibon pingou at pingoured.fr
Mon Jun 29 09:08:07 UTC 2015


On Fri, Jun 26, 2015 at 06:00:18PM +0000, Matt_Domsch at Dell.com wrote:
> * Readable status of directories
> The Directory table has a 'readable' property, but none of our directories is
> unreadable.
> 
> Question is: what is the use-case for this boolean?
> 
> == MD == Pre-bitflip content, which UMDL can see but the normal public can't yet.  Are you no longer bitflipping?  Then it doesn't matter.

Ok, I see the use-case in the crawler, but in the UMDL, how did it work?
The UMDL would not be allowed to read a given folder?

> 
> * Changes while running
> Looking at the code, the UMDL seems to be very careful to handle changes on the
> FS while it is running.
> One hope I have is to speed up the UMDL run time, but I'm curious.
> 
> Question: Does anyone know if the FS changes often while the UMDL is actually
> running?
> Gaining speed of course does not mean being reckless but I'm curious as to how
> often this situation occurs. IIRC, we trigger the UMDL via fedmsg now, right?
> So in theory, the FS shouldn't change too much under the UMDL's feet.
> 
> * The directory table
> So looking at the database and more precisely the directory table in that
> database, it seems we store all the directories of the tree, ie:
> /pub/alt/
> /pub/alt/anaconda/
> /pub/alt/bfo/
> /pub/alt/bfo/gpxe-20120514
> ...
> This makes me ponder a little. What is the interest of keeping the whole
> list of directories in the DB?
> After all, as far as I understand, the UMDL finds the repo in the tree (repo
> being defined by the presence of a 'repodata' folder containing the repomd.xml
> or by the presence of a 'summary' file and an 'objects' folder).
> For these repos, we look for the most recent files, store this info in the DB
> and later use it to check if the mirrors are up to date.
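
To make the repo definition above concrete, here is a minimal sketch of the
detection rule as I understand it. The function names `is_repo` and
`find_repos` are made up for illustration and are not taken from the actual
UMDL code, which may well do this differently.

```python
import os


def is_repo(path):
    """Return True if `path` matches one of the two repo layouts."""
    # First layout: a 'repodata' folder containing repomd.xml
    if os.path.isfile(os.path.join(path, "repodata", "repomd.xml")):
        return True
    # Second layout: a 'summary' file next to an 'objects' folder
    return (os.path.isfile(os.path.join(path, "summary"))
            and os.path.isdir(os.path.join(path, "objects")))


def find_repos(top):
    """Yield every directory under `top` that looks like a repo."""
    for dirpath, _dirnames, _filenames in os.walk(top):
        if is_repo(dirpath):
            yield dirpath
```

Something like `sorted(find_repos("/pub"))` would then list only the repo
directories and none of their parents.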
> 
> But do we need to check that ``pub/fedora/linux`` exists when we later check
> that ``pub/fedora/linux/updates/testing/21/x86_64/`` exists and is up to date?
> 
> I am under the impression currently that dropping unnecessary directories would
> save DB space (the directories being then linked in the host_category_dir table
> listing for each host, in each category which dir are present) as well as
> crawling time (both in the UMDL and in the crawler).
> 
> 
> == MD == You need non-repo directories for ISOs at least; there was a time when we were able to mirror the entire Fedora static web content too; able only because MM tracked all directories, not just repository directories.  MM1 also tried to be a "generic" mirror manager, not just a Fedora-specific mirror manager, so I intentionally tracked everything, not just Yum repos.
 
Idea: what if we tracked only the folders that have files in them? For
example, http://dl.fedoraproject.org/pub/epel/5/ would not end up in the
database.

In addition, we could add a sort of blacklist to avoid storing
http://dl.fedoraproject.org/pub/ just due to the presence of the
DIRECTORY_SIZES.txt file.

This would reduce the number of directories we store for the Atomic tree.
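
A rough sketch of what I have in mind, assuming a plain walk over the local
tree. `IGNORED_FILES` and `directories_to_track` are hypothetical names, not
existing MirrorManager code, and the blacklist content is only the one example
given above.

```python
import os

# Hypothetical blacklist: files whose presence alone should not cause
# their directory to be stored in the database.
IGNORED_FILES = {"DIRECTORY_SIZES.txt"}


def directories_to_track(top, ignored=IGNORED_FILES):
    """Yield directories under `top` holding at least one non-ignored file."""
    for dirpath, _dirnames, filenames in os.walk(top):
        # Folders containing only sub-folders (or only blacklisted files)
        # are skipped; the walk still descends into their children.
        if any(name not in ignored for name in filenames):
            yield dirpath
```

With this, a folder that only contains sub-folders is skipped, while its
children that do hold files are still reported.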

> * Non-directory based support in UMDL.
> 
> So the UMDL script currently supports three ways of crawling the tree:
> * file
> * rsync
> * directory
> 
> We, in Fedora, are only using the last one. I believe the `rsync` mode was added
> to support Ubuntu and the file mode is basically a simplified version of the
> directory mode, but one that we do not use at the moment.
> 
> I would like to propose that we drop support for rsync. I feel that it may be
> simpler and easier to create an UMDL and a crawler for each distro that would
> like to use MirrorManager than maintaining a one-script-fits-all UMDL that is
> in fact tested for only one of the scenarios.
> That being said, if we ever have interest from Ubuntu, CentOS or any other
> communities, we should definitely look into making the UMDL and crawler as
> re-usable as possible for them, but keeping the distro-specific bits separated.
> 
> 
> == MD == [file] was used early on for dev and testing.  It's not interesting.  [rsync] would be used when you don't have access to a master mirror (or very close replica).  Perhaps the rpmfusion setup still needs this.  I would have for testing Ubuntu, certainly.  It shouldn't be needed for production when the content being mirrored out is managed by the same people operating mirrormanager, as is the Fedora case.

Apparently RPMFusion does need this, so it needs to stay, the question becoming:
Should we split the different UMDL types into different scripts?
The idea being that this would allow easier optimization.
(Note: I'm having this idea now, but since I have not looked at what/how we could
optimize, it may end up remaining in the same file)


Pierre

