HOTFIX: MM crawler (ticket #3268)

Matt Domsch Matt_Domsch at dell.com
Sat May 12 01:51:15 UTC 2012


https://fedorahosted.org/fedora-infrastructure/ticket/3268
notes that a mirror might not be removed from the list even though
it's stale.

In particular, there is a code path called add_parents() whose job is
to record, for each parent directory of a target directory, whether it
is up-to-date, if that has not already been determined for the parent
directory itself.  This can happen if a directory has no files in it,
for example, only child directories.  This code path had an incorrect
key lookup, specifically:

-        parent = '/'.join(splitpath[:-1])
-        try:
-            hcd = host_category_dirs[(hc, parent)]

which was looking up the parent directory in the host_category_dirs
cache (which is later operated on).  However, the actual key here is
not the string form of the parent directory name; it is a Directory
object.  So the code looks up the wrong thing, fails the lookup, and
then proceeds to mark all the parent directories up-to-date
incorrectly.  In particular, it marks all parent directories
up-to-date (e.g. pub/epel/5/i386) when a child subdirectory
(pub/epel/5/i386/repoview/layout) is marked up-to-date, even if the
parent directory is not in fact up-to-date.
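
As a minimal illustration of the key mismatch (not part of the patch),
the sketch below uses a hypothetical stand-in Directory class and a
placeholder string for the host category.  Only the shape of the cache
key, (host category, Directory object), is taken from the crawler.

```python
# Hypothetical stand-in for the crawler's SQLObject Directory class.
class Directory(object):
    def __init__(self, name):
        self.name = name


hc = "host-category"  # placeholder for a real HostCategory object
parent_dir = Directory("pub/epel/5/i386")

# The cache is keyed by (host category, Directory object).
host_category_dirs = {(hc, parent_dir): False}

# Old lookup: a string in the key never equals a Directory object,
# so this test misses even though the directory is in the cache.
string_key_found = (hc, parent_dir.name) in host_category_dirs

# Fixed lookup: use the Directory object itself in the key.
object_key_found = (hc, parent_dir) in host_category_dirs

print(string_key_found, object_key_found)
```

In the old code the miss raised KeyError, and the handler then stored
True for the parent.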

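The patch below fixes this by splitting the parent directory lookup
out into its own function for readability, and by correcting the key
lookup.  Parents not already in the cache are now recorded with an
unknown (None) up2date state rather than as up-to-date.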
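
For anyone who wants to check the intended behavior without a
database, here is a rough sketch of the fixed walk.  It makes several
assumptions: Directory is an in-memory stand-in, KeyError stands in
for SQLObjectNotFound, and topdir is passed as an argument instead of
being read from hc.category.topdir.

```python
class Directory(object):
    """Hypothetical in-memory stand-in for the SQLObject Directory class."""
    _by_name = {}

    def __init__(self, name):
        self.name = name
        Directory._by_name[name] = self

    @classmethod
    def byName(cls, name):
        # The real byName raises SQLObjectNotFound; KeyError stands in here.
        return cls._by_name[name]


def parent(directory):
    """Return the parent Directory object, or None if there is none."""
    parts = directory.name.split("/")
    if len(parts) > 1:
        try:
            return Directory.byName("/".join(parts[:-1]))
        except KeyError:
            pass
    return None


def add_parents(host_category_dirs, hc, d, topdir):
    parent_dir = parent(d)
    if parent_dir is not None:
        # Record an unknown (None) state only if nothing is cached yet.
        host_category_dirs.setdefault((hc, parent_dir), None)
        if parent_dir != topdir:  # stop at top of the category
            return add_parents(host_category_dirs, hc, parent_dir, topdir)
    return host_category_dirs


topdir = Directory("pub/epel")
five = Directory("pub/epel/5")
i386 = Directory("pub/epel/5/i386")
repoview = Directory("pub/epel/5/i386/repoview")

hc = "hc"  # placeholder for a real HostCategory object
cache = {(hc, repoview): True, (hc, i386): False}
add_parents(cache, hc, repoview, topdir)
```

Here the stale state already recorded for pub/epel/5/i386 is left
alone, and the parents above it are added with state None.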

I've tested this on bapp02 against a stale mirror that was previously
marked up-to-date incorrectly, and it fixes it.

I'd like to hotfix bapp02 to address this.


Thanks,
Matt

-- 
Matt Domsch
Technology Strategist
Dell | Office of the CTO

--- crawler_perhost	2010-09-06 14:46:21.000000000 +0000
+++ crawler_perhost	2012-05-12 01:20:54.604906708 +0000
@@ -348,21 +348,24 @@
                 break
     return pref
         
-
-def add_parents(host_category_dirs, hc, d):
-    splitpath = d.name.split('/')
+def parent(directory):
+    parentDir = None
+    splitpath = directory.name.split(u'/')
     if len(splitpath[:-1]) > 0:
-        parent = '/'.join(splitpath[:-1])
+        parentPath = u'/'.join(splitpath[:-1])
         try:
-            hcd = host_category_dirs[(hc, parent)]
-        except KeyError:
-            try:
-                parentDir = Directory.byName(parent)
-                host_category_dirs[(hc, parentDir)] = True
-            except SQLObjectNotFound: # recursed out of the directory structure
-                parentDir = None
-                
-        if parentDir and parentDir != hc.category.topdir: # stop at top of the category
+            parentDir = Directory.byName(parentPath)
+        except SQLObjectNotFound:
+            pass
+    return parentDir
+
+def add_parents(host_category_dirs, hc, d):
+    parentDir = parent(d)
+    if parentDir is not None:
+        if (hc, parentDir) not in host_category_dirs:
+            print "directory %s adding parent %s, unknown up2date state" % (d.name, (hc, parentDir))
+            host_category_dirs[(hc, parentDir)] = None
+        if parentDir != hc.category.topdir: # stop at top of the category
             return add_parents(host_category_dirs, hc, parentDir)
     
     return host_category_dirs

