On 8 Jan 2014, at 4:58 am, Gao,Yan <ygao(a)suse.com> wrote:
Hi Andrew, David,
This is a scenario from a user:
Two nodes with 64 DRBD resources, running the latest throttling code.
The user wants the highest possible concurrency, so they started with the
following configuration:
LRMD_MAX_CHILDREN=64
This is almost certainly a horribly inappropriate value to use.
How many cores do these boxes have? I'm guessing less than 16.
load-threshold=0%
1. Start/Promote 64 DRBD resources [PASS].
2. Shut down one node at 09:32:51. Many failures like the following
were encountered when the notify actions invoked "crm_master":
Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
Called /usr/sbin/crm_master -Q -l reboot -v 10000
Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
Exit code 107
Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
Command output:
Jan 5 09:33:08 liona lrmd[25655]: notice: operation_finished:
happy21-drbdclone_notify_0:6714:stderr [ Could not establish cib_rw
connection: Resource temporarily unavailable (11) ]
Jan 5 09:33:08 liona lrmd[25655]: notice: operation_finished:
happy21-drbdclone_notify_0:6714:stderr [ Error signing on to the CIB
service: Transport endpoint is not connected ]
...
More than a minute after the node was told to shut down, the throttling code reported:
Jan 5 09:34:08 lionb crmd[13212]: notice: throttle_mode: High CIB
load detected: 0.960333
According to the code, cib_max_cpu is 0.95 here. But apparently, by the
time the cib's load is found to have exceeded 0.95, the cib has
already been overloaded.
The throttling code can only do so much.
With LRMD_MAX_CHILDREN=64, there can still be around 128 updates queued for the cib to
process.
Of course, one of the options here is to tune down "load-threshold" to
a value appropriate for the deployment -- though it may not be easy to
find a value that is optimal for all possible scenarios.
I'll let David judge the patch, but not specifying ridiculous values for
LRMD_MAX_CHILDREN would be the best initial path forward.
It's fine to want "as high as possible concurrency", but subverting the
throttling code by setting unrealistic job limits achieves the opposite.
Meanwhile, the user sought a way to prevent such failures: lengthening the
listen queue of libqb's IPC:
diff -uNr libqb/lib/util_int.h libqbfio/lib/util_int.h
--- libqb/lib/util_int.h	2013-10-23 08:44:54.000000000 -0600
+++ libqbfio/lib/util_int.h	2014-01-06 13:12:18.471097320 -0700
@@ -99,7 +99,7 @@
  */
 void qb_socket_nosigpipe(int32_t s);
 
-#define SERVER_BACKLOG 5
+#define SERVER_BACKLOG 128
 
 #ifndef UNIX_PATH_MAX
 #define UNIX_PATH_MAX 108
And it did help: the failures are no longer encountered in the tests.
We'd want the user to tune down the thresholds, since we believe the cib
is being overloaded. Meanwhile, with the larger listen backlog, cib
requests apparently get a better chance of receiving a (somewhat delayed)
response instead of being rejected immediately. Do you think this change
makes sense?
Regards,
Gao,Yan
--
Gao,Yan <ygao(a)suse.com>
Software Engineer
China Server Team, SUSE.
_______________________________________________
quarterback-devel mailing list
quarterback-devel(a)lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/quarterback-devel