Hi Andrew,
On 01/08/14 05:35, Andrew Beekhof wrote:
On 8 Jan 2014, at 4:58 am, Gao,Yan <ygao(a)suse.com> wrote:
> Hi Andrew, David,
>
> This is a scenario from a user:
> Two nodes with 64 DRBD resources, running the latest throttling code.
> The user wants as high as possible concurrency. So they started with the
> following configuration:
>
> LRMD_MAX_CHILDREN=64
This is almost certainly a horribly inappropriate value to use.
How many cores do these boxes have? I'm guessing less than 16.
Each of them has 24 cores.
> load-threshold=0%
>
> 1. Start/Promote 64 DRBD resources [PASS].
>
> 2. Shut down one node at 09:32:51. Quite a few failures like the
> following are encountered when the notify actions invoke "crm_master":
>
> Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
> Called /usr/sbin/crm_master -Q -l reboot -v 10000
> Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
> Exit code 107
> Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
> Command output:
> Jan 5 09:33:08 liona lrmd[25655]: notice: operation_finished:
> happy21-drbdclone_notify_0:6714:stderr [ Could not establish cib_rw
> connection: Resource temporarily unavailable (11) ]
> Jan 5 09:33:08 liona lrmd[25655]: notice: operation_finished:
> happy21-drbdclone_notify_0:6714:stderr [ Error signing on to the CIB
> service: Transport endpoint is not connected ]
> ...
>
>
> Over a minute after the node was told to shut down, the throttle code
> says:
>
> Jan 5 09:34:08 lionb crmd[13212]: notice: throttle_mode: High CIB
> load detected: 0.960333
>
> According to the code, cib_max_cpu is 0.95 here. But apparently, by the
> time the cib's load is found to have exceeded 0.95, the cib has
> already been overloaded.
The throttling code can only do so much.
With LRMD_MAX_CHILDREN=64, there can still be around 128 updates queued
for the cib to process.
Indeed.
>
> Of course, one of the options here is to tune down "load-threshold" to
> find an appropriate value for the deployment -- though it might not be
> easy to find a value that is optimal for all possible scenarios.
I'll let David judge the patch, but not specifying ridiculous values for
LRMD_MAX_CHILDREN would be the best initial path forward.
It's fine to want "as high as possible concurrency", but subverting the
throttling code by setting unrealistic job limits achieves the opposite.
Yes, agreed. We've been telling them that tuning down the concurrency
could actually speed up the overall process.
Thanks a lot for your comments and suggestions!
Regards,
Gao,Yan
>
> Meanwhile, the user sought a way to prevent such failures by lengthening
> the listen queue of libqb's IPC:
>
> diff -uNr libqb/lib/util_int.h libqbfio/lib/util_int.h
> --- libqb/lib/util_int.h	2013-10-23 08:44:54.000000000 -0600
> +++ libqbfio/lib/util_int.h	2014-01-06 13:12:18.471097320 -0700
> @@ -99,7 +99,7 @@
> */
> void qb_socket_nosigpipe(int32_t s);
>
> -#define SERVER_BACKLOG 5
> +#define SERVER_BACKLOG 128
>
> #ifndef UNIX_PATH_MAX
> #define UNIX_PATH_MAX 108
>
>
> And it did help: the failures are no longer encountered in the tests.
>
> We'd want the user to tune down the thresholds, since we believe the cib
> is being overloaded. Meanwhile, with the larger listen backlog, cib
> requests apparently get a better chance of receiving a delayed response
> instead of being rejected immediately. Do you think this change makes
> sense?
>
> Regards,
> Gao,Yan
> --
> Gao,Yan <ygao(a)suse.com>
> Software Engineer
> China Server Team, SUSE.
> _______________________________________________
> quarterback-devel mailing list
> quarterback-devel(a)lists.fedorahosted.org
>
https://lists.fedorahosted.org/mailman/listinfo/quarterback-devel
_______________________________________________
Pcmk-devel mailing list
Pcmk-devel(a)oss.clusterlabs.org
http://oss-2.clusterlabs.org/mailman/listinfo/pcmk-devel
--
Gao,Yan <ygao(a)suse.com>
Software Engineer
China Server Team, SUSE.