For the last two days we had problems processing the queue. This is a post-mortem of what happened.
Mmraka sent several thousand builds to Copr. That is fine: it was discussed with me in
advance, and in fact I encourage such tests and rebuilds. However, it triggered one bug:
this user was unable to get the list of builds, because we have an inefficient SQL query
on that page. As a result, Michal (and very likely somebody else too) tried to delete
several hundred builds at once.
This resulted in bad JobGrabber behaviour: it fetched a few dozen tasks and then stopped
without any error.
When I debugged it (on production (!), because it did not happen on stage), it processed
the first round of builds and the first action, and then stopped.
I then learned that JobGrabber was waiting for a lock left behind by a previously killed
JobGrabber. After I removed the lock, I found that there was a large number of tasks
waiting to be executed. And our code in JobGrabber looks like this:
    if some builds:
        put builds in queue
    if some tasks:
        execute them immediately
That is because previously users sent only a few tasks at once, and those operations are
usually very cheap (typically just an unlink, followed by a quick createrepo_c --update).
However, the repositories those actions belong to are big (several GB), and even
createrepo_c ran for more than a minute on them. So the tasks effectively blocked the
next fetch of builds from the frontend for several hours.
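
To illustrate the starvation described above, here is a minimal Python sketch of one way
to keep a long task backlog from blocking the build fetch: run at most a fixed number of
tasks per cycle. This is not the actual JobGrabber code or the fix that went into master;
the names run_cycle, fetch_builds, run_action and the cap of 10 are all hypothetical.

```python
from collections import deque

# Hypothetical cap; a real value would be tuned to how long one action takes.
MAX_ACTIONS_PER_CYCLE = 10


def run_cycle(fetch_builds, pending_actions, build_queue, run_action,
              max_actions=MAX_ACTIONS_PER_CYCLE):
    """Run one grabber iteration and return the number of actions executed.

    Builds are always enqueued first; then at most max_actions actions are
    executed, so a large backlog cannot delay the next fetch indefinitely.
    """
    for build in fetch_builds():
        build_queue.append(build)

    executed = 0
    while pending_actions and executed < max_actions:
        run_action(pending_actions.popleft())
        executed += 1
    return executed


if __name__ == "__main__":
    # 25 queued delete actions are drained over several cycles,
    # and builds are still fetched in every cycle.
    actions = deque(f"delete-{i}" for i in range(25))
    builds, done = [], []
    cycles = 0
    while actions:
        run_cycle(lambda: [f"build-{cycles}"], actions, builds, done.append)
        cycles += 1
    print(cycles, len(builds), len(done))
```

With a cap like this, the delay before the next build fetch is bounded by the cap times
the cost of one action, instead of growing with the number of queued tasks.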
Right now the task queue is empty, so builds are processed in a timely manner, and our
code in master has already been changed to be resistant to such behaviour.
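
As for the lock left behind by the killed JobGrabber, one common guard is to record the
owner's PID in the lock and to treat the lock as stale when that process no longer
exists. The sketch below assumes a simple PID-file lock on a POSIX system, which may not
match how the JobGrabber lock is actually implemented; acquire_lock and pid_is_alive are
hypothetical names.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_lock(path):
    """Try to take the lock at path; return True on success.

    A lock file whose recorded owner is dead is removed first. This is only
    a sketch: it does not protect against PID reuse, or against two
    processes cleaning up the same stale lock at the same moment.
    """
    try:
        with open(path) as f:
            owner = int(f.read().strip())
        if not pid_is_alive(owner):
            os.unlink(path)  # stale lock left by a dead process
    except FileNotFoundError:
        pass  # no lock file, nothing to clean up
    except ValueError:
        os.unlink(path)  # content is not a PID, treat the lock as stale

    try:
        # O_EXCL makes the creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True
```

With a check like this, a restarted process could take over the lock by itself instead of
waiting until somebody removes it by hand.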
I am really sorry if you had to wait for your build in the past two days.
We learned a lesson from this massive usage of Copr, and we identified some other
potential performance issues. Fixing them will result in an even better service in the
upcoming days.
Adam is fixing the code right now.
Miroslav Suchy, RHCA
Red Hat, Senior Software Engineer, #brno, #devexp, #fedora-buildsys