For the last two days we had problems processing the queue. This is a post-mortem of what happened.
Mmraka sent several thousand builds to Copr. That is fine: it was discussed with me in advance, and in fact I encourage such tests and rebuilds. However, it triggered a bug: the user was unable to get the list of builds, because we have an inefficient SQL query on that page [1]. As a result, Michal (and very likely somebody else too) tried to delete several hundred builds at once. This led to bad JobGrabber behaviour: it fetched a few dozen tasks and then stopped without any error. When I debugged it (on production (!), because it did not happen on stage), it processed the first round of builds and the first action, and then stopped. I then learned that JobGrabber was waiting for a lock left behind by a previously killed JobGrabber. After I removed the lock, I found that a large number of tasks were waiting to be executed. And our code in JobGrabber looks like this:
    while True:
        if some builds:
            put builds in queue
        if some tasks:
            execute them immediately
That is because users previously sent only a few tasks at once, and those operations are basically very cheap (usually just an unlink followed by a quick createrepo_c --update). However, the repositories these actions belong to are big (several GB), and even createrepo_c ran for more than a minute. So it effectively blocked the next fetch of builds from the frontend for several hours.
Right now the task queue is empty, so builds are processed in a timely manner, and our code in master has already been changed to be resistant to such behaviour.
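For illustration only, here is a minimal sketch of one way such a loop can be made resistant to a flood of tasks: cap how many tasks run per iteration, so fetching builds is never starved. This is not the actual change in master. The function name run_cycle, its callback parameters, and the limit of 5 are all hypothetical and chosen just for this example.

```python
from collections import deque

# Hypothetical cap; the right value depends on how long one action takes.
MAX_ACTIONS_PER_CYCLE = 5


def run_cycle(fetch_builds, pending_actions, enqueue_build, execute_action,
              max_actions=MAX_ACTIONS_PER_CYCLE):
    """Run one loop iteration: queue all new builds, then run a bounded
    number of pending actions. Returns how many actions were executed."""
    # Builds are handled first, so a long action backlog cannot delay them
    # by more than one bounded batch of actions.
    for build in fetch_builds():
        enqueue_build(build)

    executed = 0
    while pending_actions and executed < max_actions:
        execute_action(pending_actions.popleft())
        executed += 1
    return executed


if __name__ == "__main__":
    # Stand-in data: 12 slow actions are waiting, 2 new builds arrive.
    queued, done = [], []
    actions = deque(range(12))
    ran = run_cycle(lambda: ["build-1", "build-2"], actions,
                    queued.append, done.append)
    print(ran, len(actions), queued)
```

With a cap like this, several hundred delete actions would be worked off a few at a time, and each iteration would still reach the build fetch.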
I am really sorry if you had to wait for your build in the past two days. We learned a lesson from this massive usage of Copr and identified some other potential performance issues, which will result in an even better service in the upcoming days.
[1] Adam is fixing the code right now.
copr-devel@lists.fedorahosted.org