Today I spent a while looking at datagrepper. Our website build process
calls it, and that build has been failing for a long while now.
We went over some suggestions in
and decided that a db
upgrade could help out, but that's very intrusive, so we didn't want to
do it during freeze.
I noticed that haproxy was seeing it stop responding, disabling it, and
then re-enabling it before nagios alerted. The queries that the website
build was doing were running for a long time and then returning a 500,
so I started tweaking the wsgi settings.
Moving it to use processes instead of threads, and adding many more of
them, seems to have made it quite stable. In the worst case it now takes
60 seconds or so to get the last 2 weeks of atomic composes, and usually
it's 1 second or so.
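For anyone who wants to reproduce timings like these, here is a rough sketch. The base URL, the topic string in the usage comment, and the helper names are assumptions for illustration only; this is not the query the website build actually runs. It relies on datagrepper's `delta` (seconds back from now) and `topic` query parameters.

```python
import time
import urllib.parse
import urllib.request

# Assumed base URL of the datagrepper "raw" endpoint; adjust for your instance.
BASE_URL = "https://apps.fedoraproject.org/datagrepper/raw"


def build_query_url(topic, days=14, base_url=BASE_URL):
    """Build a query for messages on `topic` from the last `days` days."""
    params = {"delta": days * 24 * 60 * 60, "topic": topic}
    return base_url + "?" + urllib.parse.urlencode(params)


def time_query(url, timeout=120):
    """Fetch `url` and return (HTTP status, elapsed seconds).

    urlopen raises urllib.error.HTTPError for a 500 response, so a
    failing query surfaces as an exception rather than a status code.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
        status = response.status
    return status, time.monotonic() - start


# Example usage (hypothetical topic name, needs network access):
#   url = build_query_url("org.example.hypothetical.compose.topic")
#   status, elapsed = time_query(url)
#   print(f"{status} in {elapsed:.1f}s")
```

Running this a few times before and after a config change gives a crude comparison of worst-case and typical response times.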
My only theory is that all the threads in a process share a db
connection somehow, so moving to processes lets it handle many more
requests at once?
The number of db connections went from about 20 to about 50, but the
load on the db is down by about half.
Anyhow, the change is below... hopefully it will get us through release.
diff --git a/roles/datagrepper/templates/datagrepper-app.conf
index 6944fb0..86f3ace 100644
@@ -23,7 +23,7 @@ AddOutputFilterByType DEFLATE text/html text/plain
# Static resources for the datagrepper app.
-WSGIDaemonProcess datagrepper user=fedmsg group=fedmsg maximum-requests=50000 display-name=datagrepper processes=2 threads=2
+WSGIDaemonProcess datagrepper user=fedmsg group=fedmsg maximum-requests=50000 display-name=datagrepper processes=20 threads=1
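As a back-of-the-envelope check on what the change does to capacity: in mod_wsgi daemon mode, the number of requests that can be handled at the same time is processes times threads. The small sketch below computes that from the two directive lines. The `worker_slots` helper is a made-up name for illustration, and it says nothing about whether the shared-connection theory is right.

```python
def worker_slots(directive):
    """Return processes * threads for a WSGIDaemonProcess directive line."""
    # Collect the key=value options from the directive.
    options = dict(tok.split("=", 1) for tok in directive.split() if "=" in tok)
    # mod_wsgi defaults to 1 process and 15 threads when an option is omitted.
    processes = int(options.get("processes", 1))
    threads = int(options.get("threads", 15))
    return processes * threads


old_directive = (
    "WSGIDaemonProcess datagrepper user=fedmsg group=fedmsg "
    "maximum-requests=50000 display-name=datagrepper processes=2 threads=2"
)
new_directive = (
    "WSGIDaemonProcess datagrepper user=fedmsg group=fedmsg "
    "maximum-requests=50000 display-name=datagrepper processes=20 threads=1"
)

print(worker_slots(old_directive))  # 4
print(worker_slots(new_directive))  # 20
```

So the old config could serve at most 4 requests at once and the new one 20, which may account for much of the improvement on its own, independent of how threads handle db connections.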