[389-devel] Re: System tap performance results,

Tuesday, 10 December 2019

...
 On 10 Dec 2019, at 19:15, thierry bordaz <tbordaz(a)redhat.com&gt;
wrote:

 Hi William,

 Thanks for these very interesting results.
 It would help if you can provide the stap scripts to make sure what you are accounting
the latency. 
Yes, I plan to put them into a PR soon once I have done a bit more data collection and
polishing of the script setup. 

...
 Also just to be sure is latency a synonym for response time ? 
Yep, here I mean the execution time of a single operation. 

...

 Regarding the comparison (tail) 1client/16client. It looks to me that the tail are
equivalent: The more we have clients the more we have long latency. So in a first approach
I would eliminate contention effect. 
I disagree, the tail is much more pronounced in the 16 client version, and there is a
significant shift of response times from the 32768 bucket to 65536 indicating that many
operations are now being "held up". 

...

 Regarding the ratio shorter/longer latency (assuming the search are equivalent) this is
interesting to know why we have such effect. One of the possible cause I was thinking of
is the impact of DB thread (checkpointing or trickling). But if it exists similar long
tail in ldbm_back, the absolute value is much lower than the opshare_search: in short
ldbm_back accounted at most for 4ms while opshared for 67ms. So there is something else
(aci, network, frontend..).

 Regarding USDT I think it is very good idea. However, just to share some recent stap
experience, I found it intrusive. In short, I had not the same throughput with and
without. In my case it was not a problem, as I wanted to investigate a reproducible
contention. But if we want support/user/customers to use it, the performance impact in
production will be valid point. 
I haven't noticed any "intrusiveness" from USDT so far? How were you running
the stap scripts? 

...

 best regards
 thierry

 On 12/9/19 6:16 AM, William Brown wrote:
> Hi all,
> 
> Following last weeks flamegraph runs, I wanted to find more details on exactly what
was happening. While flamegraphs highlighted that a changed existed between single and
multithreaded servers, it did not help to isolate where
> the change was occuring.
> 
> To understand this I have started to work on a set of systemtap scripts that we can
use to profile 389ds. These will be included in a future PR.
> 
> Here are the hisograms from an initial test of profiling "do_search"
> 
> 1 thread
> 
> stap test_do_search.stp
> ^CDistribution of latencies (in nanoseconds) for 441148 samples
> max/avg/min: 35911542/85694/38927
>     value |-------------------------------------------------- count
>      8192 |                                                        0
>     16384 |                                                        0
>     32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 167285
>     65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  239280
>    131072 |@@@@@                                               25788
>    262144 |@                                                    8530
>    524288 |                                                      252
>   1048576 |                                                        6
>   2097152 |                                                        1
>   4194304 |                                                        3
>   8388608 |                                                        0
>  16777216 |                                                        2
>  33554432 |                                                        1
>  67108864 |                                                        0
> 134217728 |                                                        0
> 
> 16 thread
> 
>  stap test_do_search.stp
> ^CDistribution of latencies (in nanoseconds) for 407806 samples
> max/avg/min: 100315928/112407/39368
>     value |-------------------------------------------------- count
>      8192 |                                                        0
>     16384 |                                                        0
>     32768 |@@@@@@@@@@@@@@@@@@@@                               100281
>     65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  249656
>    131072 |@@@@@@@                                             37837
>    262144 |@@@                                                 18322
>    524288 |                                                     1171
>   1048576 |                                                      203
>   2097152 |                                                       90
>   4194304 |                                                       74
>   8388608 |                                                       83
>  16777216 |                                                       58
>  33554432 |                                                       21
>  67108864 |                                                       10
> 134217728 |                                                        0
> 268435456 |                                                        0
> 
> 
> It's interesting to note the tail latency here: On the 16 thread version we see
67000 less in the 32768 buckets, shifting mostly through the 131072 and 262144 buckets, as
well as showing a much greater number of calls in the tail. In thread 1 no operation made
it to 67108864, but 10 did in 16thread, along with ~200 more that are higher than 1048567,
and ~1500 more that are greater than 524288. This kind of tailing means we have
"spikes" of latency throughout the execution, which then have a minor flow on
cause other operations to be increased in latency.
> 
> These are all in nanoseconds, so this means most operations are in do_search for
0.000131072 seconds or less, while there are spikes of operations taking between nearly
0.001048576 and 0.067108864 seconds to complete (which is an 8x to 512x increase in
execution time.
> 
> From these point I took two measurements of back_ldbm and the access log, knowing
these were easy targets for potential areas of latency. Both of these were performed with
the 16thread server. I didn't need to do this on the 1 thread because I'm looking
for tail latency, not comparisons now. Because we know that there is an increase of ~1500
events of high latency, we are looking for similar numbers to appear in other tail
latencies.
> 
> 
> ^CDistribution of back_ldbm latencies (in nanoseconds) for 364223 samples
> max/avg/min: 4757744/31192/13622
>    value |-------------------------------------------------- count
>     2048 |                                                        0
>     4096 |                                                        0
>     8192 |@@                                                  14318
>    16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  261265
>    32768 |@@@@@@@@@@@@                                        63533
>    65536 |@@@                                                 20624
>   131072 |                                                     4062
>   262144 |                                                      304
>   524288 |                                                       68
>  1048576 |                                                       33
>  2097152 |                                                       15
>  4194304 |                                                        1
>  8388608 |                                                        0
> 16777216 |                                                        0
> 
> This shows some tail latency in back_ldbm however it's infrequent, only showing a
handful of spikes, but they do exist and tend to taper off quickly as they approach
2097152. Certainly these should be looked at as they will likely add up to affecting
do_search.
> 
> 
> stap -v test_access_log.stp
> ^CDistribution of slapi_log_access latencies (in nanoseconds) for 1284350 samples
> max/avg/min: 23843992/5138/1036
>    value |-------------------------------------------------- count
>      256 |                                                        0
>      512 |                                                        0
>     1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     379710
>     2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        361243
>     4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  412480
>     8192 |@@@@@@@@@@                                          83815
>    16384 |@@@@                                                40087
>    32768 |                                                     5076
>    65536 |                                                     1253
>   131072 |                                                      406
>   262144 |                                                      130
>   524288 |                                                       72
>  1048576 |                                                       33
>  2097152 |                                                       16
>  4194304 |                                                       19
>  8388608 |                                                        1
> 16777216 |                                                        9
> 33554432 |                                                        0
> 67108864 |                                                        0
> 
> 
> From this, I can see that while some tail latency exists in some conditions of
back_ldbm, there is a much longer and present tail in the access log, with nearly 2000
operations in or exceeding the 65536 mark - remember, in do_search most of the operations
as a whole take 65536, so we have latency spikes in slapi_log_access that are as large or
larger than whole search operations as a total.
> 
> Now we can see "most! operations are fimly below 16384. This is what we want to
see, but it's interesting that a slapi_log_access can "take up to" a quarter
of a whole operations processing on the avg case.
> 
> What's truly telling here is the tail down to 16777216, with ~250 operations
exceeding 262144 ns. This shows that within slapi_log_access, we have have large spiked
delays in the behaviour, which will be affecting both the tail latency of do_search as a
whole, but also causing other slapi_log_access operations to "hold up".
> 
> My next steps from here will be:
> 
> * To add USDT probes to the logs and back_ldbm to get better, fine detail about what
is causing the excessive latency.
>   * these probes are also needed to resolve what appears to be a symbol resolution
issue with systemtap when optimisations are enabled.
> * To add probes in other parts of the code base to get better visualisation about
where and how long timings are taking through an operation.
> * To run a lock contention profile (I was unable to do this today due to bugs in
systemtap)
> * To document the setup proceedures
> * Commit all these into a PR
> * Document some actionable improvements we can make to improve the server
performance.
> 
> 
> Hope that is interesting to everyone,
> 
> 
> --
> Sincerely,
> 
> William
> _______________________________________________
> 389-devel mailing list -- 389-devel(a)lists.fedoraproject.org
> To unsubscribe send an email to 389-devel-leave(a)lists.fedoraproject.org
> Fedora Code of Conduct:
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproje...

--
Sincerely,

William

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

[389-devel] Re: System tap performance results,