Thoughts and question about MM2's UMDL script

Tue Jun 30 06:34:51 UTC 2015

On Mon, Jun 29, 2015 at 02:01:46PM +0000, Matt_Domsch at Dell.com wrote:
>    > I am under the impression currently that dropping un-necessary
>    > directories would save DB space (the directories being then linked in
>    > the host_category_dir table listing for each host, in each category
>    > which dir are present) as well as crawling time (both in the UMDL and in
>    > the crawler).
>    >
>    >
>    > == MD == You need non-repo directories for ISOs at least; there was a
>    time when we were able to mirror the entire Fedora static web content too;
>    able only because MM tracked all directories, not just repository
>    directories. MM1 also tried to be a "generic" mirror manager, not just a
>    Fedora-specific mirror manager, so I intentionally tracked everything, not
>    just Yum repos.
> 
>    Idea: what if we were tracking only the folders that have files in them,
>    so for example http://dl.fedoraproject.org/pub/epel/5/ would not end-up in
>    the database.
> 
>    In addition, we could add a sort of blacklist to avoid storing
>    http://dl.fedoraproject.org/pub/ just due to the presence of the
>    DIRECTORY_SIZES.txt file
> 
>    This would reduce the number of directories we store for the Atomic tree.
> 
>    == MD == I didn't optimize for a few non-file-containing directories. 
>    You're welcome to if you see a need.  But it's saving just a few entries
>    out of hundreds/thousands.

I got curious to see how this looks in reality, so I wrote a quick Python script
that walks the entire tree and counts the files, the folders, and the folders
with no files in them. Here is what the results look like:

fedora
  1814562 files found
  4460 folders found
  293 folders w/o files
  Ran in 11.111 min

epel
  222697 files found
  492 folders found
  19 folders w/o files
  Ran in 1.400 min

alt
  1309830 files found
  13692 folders found
  1614 folders w/o files
  Ran in 4.633 min

fedora-secondary
  3774530 files found
  5576 folders found
  651 folders w/o files
  Ran in 26.701 min

archive
  2705931 files found
  3095 folders found
  351 folders w/o files
  Ran in 22.042 min

Total time: 65.887 min
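
The script itself is not included in this message. A minimal sketch of how
such a count could be done, assuming a plain os.walk() over a locally mounted
tree (the function names count_tree and report are made up for illustration,
and symlinked directories are not followed):

```python
import os
import sys
import time


def count_tree(root):
    """Return (files, folders, folders_without_files) found under root.

    The root itself is counted as a folder. A folder that only contains
    sub-folders counts as a folder without files.
    """
    files = folders = empty = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        folders += 1
        files += len(filenames)
        if not filenames:
            empty += 1
    return files, folders, empty


def report(root):
    """Print the counts for one tree in the same layout as the results above."""
    start = time.time()
    files, folders, empty = count_tree(root)
    minutes = (time.time() - start) / 60
    print(os.path.basename(os.path.normpath(root)))
    print("  %d files found" % files)
    print("  %d folders found" % folders)
    print("  %d folders w/o files" % empty)
    print("  Ran in %.3f min" % minutes)


if __name__ == "__main__":
    # Each command-line argument is the top of one tree to walk.
    for path in sys.argv[1:]:
        report(path)
```

Since count_tree() takes any starting directory, the same sketch can be
pointed at a whole tree or at a single sub-directory of it.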

So it would save a few hundred entries in the directory table, but it should
still save some space in the host_category_dir table, which lists the
directories present per host and per category.
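
For reference, the per-tree figures listed above can be added up as follows.
This is only arithmetic on the numbers already reported, nothing newly
measured:

```python
# (folders found, folders w/o files) per tree, copied from the results above
results = {
    "fedora": (4460, 293),
    "epel": (492, 19),
    "alt": (13692, 1614),
    "fedora-secondary": (5576, 651),
    "archive": (3095, 351),
}

total_folders = sum(folders for folders, _empty in results.values())
total_empty = sum(empty for _folders, empty in results.values())
share = 100.0 * total_empty / total_folders

print("%d of %d folders have no files (%.1f%%)"
      % (total_empty, total_folders, share))
# prints: 2928 of 27315 folders have no files (10.7%)
```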

Also, seeing these numbers, it feels to me that we should be more flexible about
which part of the tree we run against. It could even be a sub-part (e.g. a
specific secondary arch).
I would also like to see if we can parallelize the walk of the tree.
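
One possible way to parallelize the walk, as a rough sketch only: split the
tree at its first level and walk each top-level sub-directory in its own
thread. This assumes the walk is I/O-bound, so threads can overlap the waits;
whether it actually helps on the real storage is untested. The names
count_tree and count_tree_parallel and the default of 4 workers are
hypothetical, and this is not how the UMDL script currently works.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_tree(root):
    """Sequential count: (files, folders, folders_without_files) under root."""
    files = folders = empty = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        folders += 1
        files += len(filenames)
        if not filenames:
            empty += 1
    return files, folders, empty


def count_tree_parallel(root, workers=4):
    """Same counts as count_tree(), walking each top-level sub-directory
    of root in a separate worker thread."""
    subdirs = []
    top_files = 0
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir():
                # Like os.walk(), do not descend into symlinked directories.
                if not entry.is_symlink():
                    subdirs.append(entry.path)
            else:
                top_files += 1

    # Start with the root folder itself.
    files, folders, empty = top_files, 1, (0 if top_files else 1)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for sub_files, sub_folders, sub_empty in pool.map(count_tree, subdirs):
            files += sub_files
            folders += sub_folders
            empty += sub_empty
    return files, folders, empty
```

Splitting only at the first level is the simplest option; if one
sub-directory is much larger than the others, it will dominate the run time,
so a finer split may be needed.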

Pierre