On 8 Jan 2014, at 4:58 am, Gao,Yan <ygao(a)suse.com> wrote:
Hi Andrew, David,
This is a scenario from a user:
Two nodes with 64 DRBD resources, running the latest throttling code.
The user wants the highest possible concurrency, so they started with the
following configuration:
LRMD_MAX_CHILDREN=64
This is almost certainly a horribly inappropriate value to use.
How many cores do these boxes have? I'm guessing less than 16.
load-threshold=0%
1. Start/Promote 64 DRBD resources [PASS].
2. Shut down one node at 09:32:51. Many failures like the following
were encountered when the notify actions invoked "crm_master":
Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
Called /usr/sbin/crm_master -Q -l reboot -v 10000
Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
Exit code 107
Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
Command output:
Jan 5 09:33:08 liona lrmd[25655]: notice: operation_finished:
happy21-drbdclone_notify_0:6714:stderr [ Could not establish cib_rw
connection: Resource temporarily unavailable (11) ]
Jan 5 09:33:08 liona lrmd[25655]: notice: operation_finished:
happy21-drbdclone_notify_0:6714:stderr [ Error signing on to the CIB
service: Transport endpoint is not connected ]
...
More than a minute after the node was told to shut down, the throttling code reported:
Jan 5 09:34:08 lionb crmd[13212]: notice: throttle_mode: High CIB
load detected: 0.960333
According to the code, cib_max_cpu is 0.95 here. But apparently, by the
time the cib's load is found to have exceeded 0.95, the cib has
already been overloaded.
The throttling code can only do so much.
With LRMD_MAX_CHILDREN=64, there can still be around 128 updates queued for the cib to
process.
Of course, one of the options here is to tune down "load-threshold" to
a value appropriate for the deployment -- though it may not be easy to
find a value that is optimal for all possible scenarios.
I'll let David judge the patch, but not specifying ridiculous values for
LRMD_MAX_CHILDREN would be the best initial path forward.
It's fine to want "as high as possible concurrency", but subverting the
throttling code by setting unrealistic job limits achieves the opposite.
Meanwhile, the user sought a way to prevent such failures: lengthening the
listen queue of libqb's IPC:
diff -uNr libqb/lib/util_int.h libqbfio/lib/util_int.h
--- libqb/lib/util_int.h	2013-10-23 08:44:54.000000000 -0600
+++ libqbfio/lib/util_int.h	2014-01-06 13:12:18.471097320 -0700
@@ -99,7 +99,7 @@
  */
 void qb_socket_nosigpipe(int32_t s);
 
-#define SERVER_BACKLOG 5
+#define SERVER_BACKLOG 128
 
 #ifndef UNIX_PATH_MAX
 #define UNIX_PATH_MAX 108
And it did help: the failures are no longer encountered in the tests.
We'd want the user to tune down the thresholds, since we believe the cib
is being overloaded. Meanwhile, with the larger listen backlog, cib
requests apparently get a better chance of receiving a (somewhat delayed)
response instead of being rejected immediately. Do you think this change
makes sense?
Regards,
Gao,Yan
--
Gao,Yan <ygao(a)suse.com>
Software Engineer
China Server Team, SUSE.
_______________________________________________
quarterback-devel mailing list
quarterback-devel(a)lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/quarterback-devel