On 8 Jan 2014, at 4:34 pm, Gao,Yan <ygao(a)suse.com> wrote:
On 01/08/14 12:19, Andrew Beekhof wrote:
>
> On 8 Jan 2014, at 2:42 pm, Gao,Yan <ygao(a)suse.com> wrote:
>
>> Hi Andrew,
>> On 01/08/14 05:35, Andrew Beekhof wrote:
>>>
>>> On 8 Jan 2014, at 4:58 am, Gao,Yan <ygao(a)suse.com> wrote:
>>>
>>>> Hi Andrew, David,
>>>>
>>>> This is a scenario from a user:
>>>> Two nodes with 64 DRBD resources, running the latest throttling code.
>>>> The user wants the highest possible concurrency, so they started
>>>> with the following configuration:
>>>>
>>>> LRMD_MAX_CHILDREN=64
>>>
>>> This is almost certainly a horribly inappropriate value to use.
>>> How many cores do these boxes have? I'm guessing less than 16.
>> Each of them has 24 cores.
>
> Thats actually pretty respectable :)
> Alas, it probably makes things worse for the cib - since the updates
> are likely taking longer than the actual operations.
>
> These guys are going to love the new cib code :)
Really looking forward to that :-)
Have you run it yet? It should be in a good state.
Regards,
Gao,Yan
>
>>
>>>
>>>> load-threshold=0%
>>>>
>>>> 1. Start/Promote 64 DRBD resources [PASS].
>>>>
>>>> 2. Shutdown one node at 09:32:51. Quite a few failures like the
>>>> following were encountered when the notify actions invoked "crm_master":
>>>>
>>>> Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
>>>> Called /usr/sbin/crm_master -Q -l reboot -v 10000
>>>> Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
>>>> Exit code 107
>>>> Jan 5 09:33:08 liona drbd(happy21-drbdclone)[6714]: ERROR: happy21:
>>>> Command output:
>>>> Jan 5 09:33:08 liona lrmd[25655]: notice: operation_finished:
>>>> happy21-drbdclone_notify_0:6714:stderr [ Could not establish cib_rw
>>>> connection: Resource temporarily unavailable (11) ]
>>>> Jan 5 09:33:08 liona lrmd[25655]: notice: operation_finished:
>>>> happy21-drbdclone_notify_0:6714:stderr [ Error signing on to the CIB
>>>> service: Transport endpoint is not connected ]
>>>> ...
>>>>
>>>>
>>>> Over a minute after the node was told to shut down, the throttling
>>>> code says:
>>>>
>>>> Jan 5 09:34:08 lionb crmd[13212]: notice: throttle_mode: High CIB
>>>> load detected: 0.960333
>>>>
>>>> According to the code, cib_max_cpu is 0.95 here. Apparently, before
>>>> the cib's load was found to have exceeded 0.95, the cib had already
>>>> been overloaded.
>>>
>>> The throttling code can only do so much.
>>> By setting LRMD_MAX_CHILDREN=64, there are still around 128 updates
>>> queued for the cib to process.
>> Indeed.
>>>
>>>>
>>>> Of course, one of the options here is to tune down "load-threshold"
>>>> to find an appropriate value for the deployment -- it might not be
>>>> easy to find a value that is optimal for all possible scenarios.
>>>
>>> I'll let David judge the patch, but not specifying ridiculous values
>>> for LRMD_MAX_CHILDREN would be the best initial path forward.
>>> It's fine to want "as high as possible concurrency", but subverting
>>> the throttling code by setting unrealistic job limits achieves the
>>> opposite.
>> Yes, agreed. We've been telling them that tuning down the concurrency
>> could actually speed up the overall process.
>>
>> Thanks a lot for your comments and suggestions!
>>
>> Regards,
>> Gao,Yan
>>
>>>
>>>>
>>>> Meanwhile, the user sought a way to prevent such failures -- by
>>>> lengthening the listen queue of libqb's IPC:
>>>>
>>>> diff -uNr libqb/lib/util_int.h libqbfio/lib/util_int.h
>>>> --- libqb/lib/util_int.h    2013-10-23 08:44:54.000000000 -0600
>>>> +++ libqbfio/lib/util_int.h 2014-01-06 13:12:18.471097320 -0700
>>>> @@ -99,7 +99,7 @@
>>>> */
>>>> void qb_socket_nosigpipe(int32_t s);
>>>>
>>>> -#define SERVER_BACKLOG 5
>>>> +#define SERVER_BACKLOG 128
>>>>
>>>> #ifndef UNIX_PATH_MAX
>>>> #define UNIX_PATH_MAX 108
>>>>
>>>>
>>>> And it did help: the failures are no longer encountered in the tests.
>>>>
>>>> We'd want the user to tune down the thresholds, since we believe the
>>>> cib is being overloaded. Meanwhile, with the larger listen backlog,
>>>> the cib requests apparently get a better chance of receiving a
>>>> (somewhat delayed) response instead of being rejected immediately.
>>>> Do you think this change makes sense?
>>>>
>>>> Regards,
>>>> Gao,Yan
>>>> --
>>>> Gao,Yan <ygao(a)suse.com>
>>>> Software Engineer
>>>> China Server Team, SUSE.
>>>> _______________________________________________
>>>> quarterback-devel mailing list
>>>> quarterback-devel(a)lists.fedorahosted.org
>>>> https://lists.fedorahosted.org/mailman/listinfo/quarterback-devel
>>>
>>>
>>>
>>> _______________________________________________
>>> Pcmk-devel mailing list
>>> Pcmk-devel(a)oss.clusterlabs.org
>>> http://oss-2.clusterlabs.org/mailman/listinfo/pcmk-devel
>>>
>>
>>
>
>
>
>
--
Gao,Yan <ygao(a)suse.com>
Software Engineer
China Server Team, SUSE.
_______________________________________________
quarterback-devel mailing list
quarterback-devel(a)lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/quarterback-devel