Commops,

This is some nitty-gritty but super informative "sausage making" stuff here.

Could this be pulled wholesale, with a link to Copr added?
--RemyD.



---------- Forwarded message ----------
From: Miroslav Suchý <msuchy@redhat.com>
Date: Thu, Nov 12, 2015 at 10:16 AM
Subject: Post-mortem
To: Cool Other Package Repositories <copr-devel@lists.fedorahosted.org>


For the last two days we had a problem processing the queue. This is a post-mortem of what happened.

Mmraka sent several thousand builds to Copr. That is fine: it was discussed with me in advance, and in fact I
encourage such tests and rebuilds. However, this triggered one bug: this user was unable to get the list of builds
because we have an inefficient SQL query on that page [1]. As a result, Michal (and very likely somebody else too)
tried to delete several hundred builds at once.
This resulted in bad JobGrabber behaviour: it fetched a few dozen tasks and then stopped without any error.
When I debugged it (on production (!) because it did not happen in stage), it processed the first round of builds
and the first action, and then stopped.
I then learned that JobGrabber was waiting for a lock left behind by a previously killed JobGrabber. After I
removed it, I found that there was a big number of tasks to be executed. And our code in JobGrabber looks like:

while True:
    if some builds:
        put builds in queue
    if some tasks:
        execute them immediately

That is because previously users sent only a few tasks at once, and those operations are basically very cheap (usually
just an unlink, followed by a quick createrepo_c --update).
However, the repositories to which those actions belong are big (several GBs), and even createrepo_c ran for more than
a minute. So it effectively blocked the next fetch of builds from the frontend for several hours.
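
To make the arithmetic concrete, here is a minimal Python sketch of the loop above with a simulated clock. It is
not the actual JobGrabber code: the function name, the numbers (300 pending actions, 60 seconds each) and the
max_actions_per_round cap are all illustrative assumptions, and the cap is just one possible mitigation, not
necessarily what was changed in master.

```python
def fetch_times(pending_actions, seconds_per_action, max_actions_per_round=None):
    """Return the simulated clock times (in seconds) at which builds are fetched.

    Each round fetches builds first and then runs actions before the next round.
    max_actions_per_round=None mimics "execute them immediately" for every
    pending action; an integer caps the work done between two fetches.
    (Hypothetical model, not the real Copr backend code.)
    """
    if max_actions_per_round is not None and max_actions_per_round < 1:
        raise ValueError("max_actions_per_round must be at least 1")

    clock = 0
    times = []
    while True:
        times.append(clock)  # "if some builds: put builds in queue"
        if pending_actions == 0:
            break
        if max_actions_per_round is None:
            batch = pending_actions
        else:
            batch = min(pending_actions, max_actions_per_round)
        pending_actions -= batch
        clock += batch * seconds_per_action  # "if some tasks: execute them"
    return times


# Assumed numbers: 300 delete actions, 60 seconds of createrepo_c each.
unbounded = fetch_times(300, 60)
capped = fetch_times(300, 60, max_actions_per_round=5)
print(unbounded[1] / 3600)    # hours until the next fetch without a cap
print(capped[1] - capped[0])  # seconds between fetches with a cap of 5
```

With these assumed numbers the uncapped loop does not fetch builds again for 18000 seconds (5 hours), while the
capped loop fetches every 300 seconds and still finishes all the actions at the same total time.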

Right now the task queue is empty, so builds are processed in a timely manner, and our code in master has already been
changed to be resistant to such behaviour.

I am really sorry if you had to wait for your build in the past two days.
We learned a lesson from this massive usage of Copr and identified some other potential performance issues; fixing
them will result in even better service in the upcoming days.


[1] Adam is fixing the code right now.
--
Miroslav Suchy, RHCA
Red Hat, Senior Software Engineer, #brno, #devexp, #fedora-buildsys



--
Remy DeCausemaker
Fedora Community Lead & Council
https://whatcanidoforfedora.org