Thoughts and question about MM2's UMDL script
pingou at pingoured.fr
Tue Jun 30 06:34:51 UTC 2015
On Mon, Jun 29, 2015 at 02:01:46PM +0000, Matt_Domsch at Dell.com wrote:
> > I am under the impression currently that dropping un-necessary
> > directories would save DB space (the directories being then linked in
> > the host_category_dir table listing for each host, in each category
> > which dir are present) as well as crawling time (both in the UMDL and in
> > the crawler).
> == MD == You need non-repo directories for ISOs at least; there was a
> time when we were able to mirror the entire Fedora static web content too;
> able only because MM tracked all directories, not just repository
> directories. MM1 also tried to be a "generic" mirror manager, not just a
> Fedora-specific mirror manager, so I intentionally tracked everything, not
> just Yum repos.
> > Idea: what if we were tracking only the folders that have files in them,
> > so for example http://dl.fedoraproject.org/pub/epel/5/ would not end-up in
> > the database.
> > In addition, we could add a sort of blacklist to avoid storing
> > http://dl.fedoraproject.org/pub/ just due to the presence of the
> > DIRECTORY_SIZES.txt file
> > This would reduce the number of directories we store for the Atomic tree.
> == MD == I didn't optimize for a few non-file-containing directories.
> You're welcome to if you see a need. But it's saving just a few entries
> out of hundreds/thousands.
I got curious to see how it looks in reality, so I wrote a quick Python script
that walks the entire tree and counts the number of files, folders, and
folders with no files in them. This is what the results look like:
1814562 files found
4460 folders found
293 folders w/o files
Ran in 11.111 min

222697 files found
492 folders found
19 folders w/o files
Ran in 1.400 min

1309830 files found
13692 folders found
1614 folders w/o files
Ran in 4.633 min

3774530 files found
5576 folders found
651 folders w/o files
Ran in 26.701 min

2705931 files found
3095 folders found
351 folders w/o files
Ran in 22.042 min

Total time: 65.887 min
So it would save only 2928 of the 27315 entries counted above in the directory
table (roughly 11%), but it should still save some space in the
host_category_dir table.
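
For reference, a minimal sketch of such a counting script is below. It is not
the exact script that produced the numbers above: it assumes the tree is
available on a local or mounted filesystem and walked with os.walk, and the
names count_tree and report are only illustrative. A folder counts as
"w/o files" when it holds no files directly, even if it has sub-folders.

```python
import os
import time


def count_tree(root):
    """Return (files, folders, folders_without_files) for the tree at root.

    The root itself is counted as a folder. A folder is "without files"
    when it contains no files directly (it may still contain sub-folders).
    """
    n_files = n_folders = n_empty = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        n_folders += 1
        n_files += len(filenames)
        if not filenames:
            n_empty += 1
    return n_files, n_folders, n_empty


def report(root):
    """Print the counts for root in the same layout as the results above."""
    start = time.time()
    n_files, n_folders, n_empty = count_tree(root)
    print("%s files found" % n_files)
    print("%s folders found" % n_folders)
    print("%s folders w/o files" % n_empty)
    print("Ran in %.3f min" % ((time.time() - start) / 60))
```

Calling report() on the top of a tree prints one block of four lines like the
ones above.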
Seeing these run times, I also feel that we should be more flexible about
which part of the tree we run against; it could even be a sub-part (e.g. a
specific secondary arch).
I would also like to see whether we can parallelize the browsing of the tree.
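
One possible shape for both ideas, as a rough sketch only: take an explicit
list of sub-trees to run against and walk each one in its own thread. The
function names and the worker count are assumptions for illustration, not
anything that exists in MM2 today, and threads would only help if the walk is
dominated by I/O waits, which I have not measured.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_subtree(root):
    """Return (files, folders, folders_without_files) for one sub-tree."""
    n_files = n_folders = n_empty = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        n_folders += 1
        n_files += len(filenames)
        if not filenames:
            n_empty += 1
    return n_files, n_folders, n_empty


def count_many(roots, workers=4):
    """Walk each sub-tree in roots concurrently.

    Returns a dict mapping each root to its (files, folders, empty) tuple.
    The caller decides which parts of the tree to include, so a single
    sub-part can be walked on its own by passing a one-element list.
    """
    roots = list(roots)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input roots.
        results = list(pool.map(count_subtree, roots))
    return dict(zip(roots, results))
```

Passing the first-level directories of the tree as roots would spread the walk
over the workers, while passing a single path would restrict the run to that
sub-part.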