Hi all,
Thanks to Howard's advice on mutrace, and the addition of systemtap probes, I now have
some more data and a few things I can start to action and work on to improve the server
performance.
The next step was twofold: measure the timing of "do_search" excluding the
connection code, and identify lock contention with mutrace. Both yielded good data.
Due to an issue with systemtap, it appears that hooking on process().function is not
always reliable - I believe this could be related to inlining of some functions. This
meant that I needed to add static probe points to allow the tools to hook the code reliably
at known locations. It also allows probes to sit in the middle of a function, so we
can break functions down and see how each stage behaves.
For example, in do_search I was able to add probes for the function entry, query
preparation, op_shared_search, the finalisation, and the function return.
This is the histogram of the "full" do_search:
Distribution of do_search_full latencies (in nanoseconds) for 776771 samples
max/avg/min: 1576111187091993/2029055283/46
value |-------------------------------------------------- count
8 | 0
16 | 0
32 | 106
64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 755031
128 |@ 18609
256 | 1886
512 | 394
1024 | 135
2048 | 56
4096 | 67
8192 | 199
16384 | 0
32768 | 0
65536 | 243
131072 | 23
262144 | 1
524288 | 20
1048576 | 0
2097152 | 0
~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 1
2251799813685248 | 0
4503599627370496 | 0
These are the three stages (preparation, op_shared execution, and finalisation).
Distribution of do_search_prepared latencies (in nanoseconds) for 776772 samples
max/avg/min: 1576111187091369/2029052531/3
value |-------------------------------------------------- count
0 | 0
1 | 0
2 | 7
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 612338
8 |@@@@@@@@@@@ 144976
16 |@ 18222
32 | 873
64 | 189
128 | 127
256 | 30
512 | 5
1024 | 0
2048 | 1
4096 | 1
8192 | 2
16384 | 0
32768 | 0
~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 1
2251799813685248 | 0
4503599627370496 | 0
Distribution of do_search_complete latencies (in nanoseconds) for 776772 samples
max/avg/min: 662727/137/38
value |-------------------------------------------------- count
8 | 0
16 | 0
32 | 14543
64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 748736
128 | 10848
256 | 1613
512 | 302
1024 | 127
2048 | 54
4096 | 66
8192 | 196
16384 | 0
32768 | 0
65536 | 243
131072 | 23
262144 | 1
524288 | 20
1048576 | 0
2097152 | 0
Distribution of do_search_finalise latencies (in nanoseconds) for 776771 samples
max/avg/min: 8940/1/1
value |-------------------------------------------------- count
0 | 0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 338484
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 414648
4 | 3582
8 |@@ 18382
16 | 1497
32 | 81
64 | 44
128 | 38
256 | 8
512 | 5
1024 | 1
2048 | 0
4096 | 0
8192 | 1
16384 | 0
32768 | 0
So once again, in do_search_full we can see spikes throughout the whole execution of
do_search. Observing do_search_prepared, we can see one major outlier (which I have no idea
about the cause of ...), but otherwise most of the values stay in the 8 - 256ns range. Not bad
indeed! This probably won't yield much improvement if optimised.
do_search_finalise really is just "log access" and some frees. Again,
this mostly stays low, but a few spikes go up to 512+ns. We'll come back to this.
And of course, do_search_complete is measuring op_shared_search. This is where we
expect most of the work, and it is where we see the latency spikes occurring.
So next I dug into op_shared_search:
Distribution of op_shared_search_full latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124962139717/19054067382/41
value |-------------------------------------------------- count
8 | 0
16 | 0
32 | 144
64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 728581
128 | 13318
256 | 1724
512 | 240
1024 | 48
2048 | 42
4096 | 39
8192 | 114
16384 | 0
32768 | 2
65536 | 154
131072 | 9
262144 | 33
524288 | 10
1048576 | 0
2097152 | 0
~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 9
2251799813685248 | 0
4503599627370496 | 0
Distribution of op_shared_search_prepared latencies (in nanoseconds) for 744462 samples
max/avg/min: 1576124962139653/6351398480/7
value |-------------------------------------------------- count
1 | 0
2 | 0
4 | 1
8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 578754
16 |@@@@@@@@@@@@@ 160359
32 | 3822
64 | 358
128 | 700
256 | 224
512 | 51
1024 | 17
2048 | 23
4096 | 22
8192 | 79
16384 | 0
32768 | 2
65536 | 0
131072 | 6
262144 | 31
524288 | 10
1048576 | 0
2097152 | 0
~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 3
2251799813685248 | 0
4503599627370496 | 0
Distribution of op_shared_search_complete latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124962139598/12702711719/19
value |-------------------------------------------------- count
4 | 0
8 | 0
16 | 132
32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 544392
64 |@@@@@@@@@@@@@@@@@ 195765
128 | 3206
256 | 647
512 | 105
1024 | 14
2048 | 17
4096 | 13
8192 | 21
16384 | 0
32768 | 0
65536 | 147
131072 | 2
262144 | 0
524288 | 0
~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 6
2251799813685248 | 0
4503599627370496 | 0
Distribution of op_shared_search_finalise latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124909455854/6351355714/2
value |-------------------------------------------------- count
0 | 0
1 | 0
2 | 1
4 | 1
8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 644654
16 |@@@@@@@ 95535
32 | 3005
64 | 364
128 | 688
256 | 166
512 | 15
1024 | 5
2048 | 2
4096 | 4
8192 | 13
16384 | 0
32768 | 0
65536 | 8
131072 | 3
262144 | 0
524288 | 0
~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 3
2251799813685248 | 0
4503599627370496 | 0
Again we have probes throughout, measuring the full time, then the preparation, backend
execution, and finalisation stages of the query. In the full histogram we see most operations
around the 64ns mark, but there are some spikes up to 524288ns, which is a bit concerning. We
shouldn't be seeing such drastic spikes.
With this in mind, we can quickly identify that it's in the query preparation that this
occurs, as it's the only stage whose histogram replicates the spikes up to 524288ns.
We can also see a few spikes in the backend, via the jumps to 65536ns. So this means we
probably need to look at the preparation and backend parts of search next to work out which
deeper parts are causing this.
I also followed up to check the log access times based on the finalise from do_search.
Again we break this into time to prepare a log entry, and then the time to buffer it.
Distribution of log_access_full latencies (in nanoseconds) for 943213 samples
max/avg/min: 573628/19/4
value |-------------------------------------------------- count
1 | 0
2 | 0
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 813261
8 |@@@@@ 85422
16 |@@ 41677
32 | 1567
64 | 308
128 | 575
256 | 139
512 | 49
1024 | 14
2048 | 23
4096 | 8
8192 | 130
16384 | 0
32768 | 10
65536 | 0
131072 | 10
262144 | 10
524288 | 10
1048576 | 0
2097152 | 0
Distribution of log_access_prepared latencies (in nanoseconds) for 943213 samples
max/avg/min: 2281/4/2
value |-------------------------------------------------- count
0 | 0
1 | 0
2 |@@@@@@@@@@ 149924
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 734833
8 |@@ 35399
16 |@ 21736
32 | 721
64 | 219
128 | 290
256 | 73
512 | 14
1024 | 3
2048 | 1
4096 | 0
8192 | 0
Distribution of log_access_complete latencies (in nanoseconds) for 943213 samples
max/avg/min: 573624/14/0
value |-------------------------------------------------- count
0 | 1035
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 822635
2 |@@@@@ 93607
4 | 5363
8 |@ 17309
16 | 2494
32 | 130
64 | 71
128 | 267
256 | 58
512 | 34
1024 | 11
2048 | 21
4096 | 8
8192 | 130
16384 | 0
32768 | 10
65536 | 0
131072 | 10
262144 | 10
524288 | 10
1048576 | 0
2097152 | 0
From this we can see that the time to prepare the entry is of almost no consequence.
It's the log_append_buffer2 function that can take up to 524288ns to execute, when in
the "normal" case it takes around 8ns. That's a 65536x degradation in performance!
Next was to look at the most contended mutexes. This generated a lot of data, so I'll
only include the top results.
mutrace: Showing 23867 mutexes in order of (write) contention count:
Mutex # Locked Changed Cont. cont.Time[ms] tot.Time[ms] avg.Time[ms] Flags
21209 8348033 931370 202356 3728.122 960.397 0.000 M-.--.
4593 5616358 1434100 85812 195.660 428.424 0.000 M-.a-.
6556 2089049 1841307 44116 1441.666 2430.681 0.001 M-.a-.
6983 2368664 307564 36171 14461.861 17827.474 0.008 M-.r-.
6976 2344301 303284 33996 14139.214 18479.060 0.008 M-.r-.
6974 2421485 311208 33944 11542.366 15021.549 0.006 M-.r-.
6975 2366735 304595 33769 12534.205 17024.613 0.007 M-.r-.
6977 2298666 298851 33745 15032.955 19988.776 0.009 M-.r-.
6982 2354910 304511 33690 14949.869 18772.182 0.008 M-.r-.
21187 5565777 691964 33006 551.482 547.347 0.000 M-.--.
6978 2288445 298816 32834 16983.236 20672.063 0.009 M-.r-.
6980 2330792 304620 31609 15700.067 19536.323 0.008 M-.r-.
6981 2301734 298641 31374 16685.940 20538.976 0.009 M-.r-.
6979 2257093 295274 31100 17128.837 21545.489 0.010 M-.r-.
38 1395342 1088064 20207 508.864 2613.544 0.002 M-.a-.
6710 1392690 1257073 8006 84.416 176.202 0.000 M-.a-.
967 1391484 673342 6016 39.003 138.278 0.000 M-.a-.
6709 1394704 1086851 4235 94.286 174.136 0.000 M-.a-.
21111 1511494 655694 2991 62.206 304.168 0.000 M-.--.
21174 120109 10777 2316 64.759 15.292 0.000 M-.--.
29 1391466 1081661 2222 35772.114 3425.260 0.002 M-.a-.
25 695996 647807 1174 9.887 83.750 0.000 M-.a-.
6194 1394704 1088158 558 16.913 203.315 0.000 M-.a-.
4595 36131 7253 193 2.617 3.034 0.000 M-.a-.
21106 143568 34113 143 8.706 44.730 0.000 M-.--.
21167 24045 6548 112 1.354 3.898 0.000 M-.--.
The column of interest is Cont., which is "how many times the mutex/rwlock was
contended". cont.Time is "how long it was contended for", and tot.Time tells us
"how long the lock was held for".
The top two most contended locks are:
Mutex #21209 (0x0x7fd8a5b27868) first referenced by:
/usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
/usr/lib64/libdb-4.8.so(__db_pthread_mutex_init+0x17a) [0x7fd8a687fc6a]
Mutex #4593 (0x0x2138358) first referenced by:
/usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
/usr/lib64/libnspr4.so(PR_NewMonitor+0x47) [0x7fd9c7373be7]
/opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_instance_create+0x84)
[0x7fd8a665438e]
/opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(+0x590d7) [0x7fd8a66730d7]
/opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_instance_add_instance_entry_callback+0x105)
[0x7fd8a6673833]
/opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_config_read_instance_entries+0x133)
[0x7fd8a6661847]
/opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_config_load_dse_info+0x270)
[0x7fd8a6661b4f]
/opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(dblayer_setup+0xea)
[0x7fd8a663fa98]
/opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_back_start+0x69)
[0x7fd8a6685931]
/opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc5bea) [0x7fd9c7cc6bea]
/opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc5a80) [0x7fd9c7cc6a80]
/opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc52e1) [0x7fd9c7cc62e1]
/opt/dirsrv/lib64/dirsrv/libslapd.so.0(plugin_startall+0x39) [0x7fd9c7cc6a0f]
/opt/dirsrv/sbin/ns-slapd(main+0xb81) [0x4271ca]
/lib64/libc.so.6(__libc_start_main+0xeb) [0x7fd9c7772e0b]
Everything needs access to the db. However, the contended "time" was reasonably
low, so I think these aren't "too" concerning.
It's the 4th through 10th locks that concern me, as they have very large
"contended" times.
Mutex #6983 (0x0x27132e8) first referenced by:
/usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
/opt/dirsrv/sbin/ns-slapd(connection_table_get_connection+0x1d2) [0x41c940]
/opt/dirsrv/sbin/ns-slapd() [0x420c69]
/opt/dirsrv/sbin/ns-slapd() [0x41e481]
/opt/dirsrv/sbin/ns-slapd(slapd_daemon+0x643) [0x41f11c]
/opt/dirsrv/sbin/ns-slapd(main+0xc5f) [0x4272a8]
/lib64/libc.so.6(__libc_start_main+0xeb) [0x7fd9c7772e0b]
They all have the same backtrace - these will be the connection locks, and the contention
will be the attempts to try-lock them to find a free slot. Given that I already made one
change to the conntable this year that yielded a 30% performance improvement, I think this is
the obvious first port of call, and once it's improved we'll be able to more
easily see the "next bottleneck".
Actionable Items:
I've now submitted the stap tools and the changes needed in a PR.
The obvious code changes are:
* improving the access log buffer (https://pagure.io/389-ds-base/issue/50785)
* improving connection table slot acquisition (https://pagure.io/389-ds-base/issue/50786 - I plan to start work on this asap).
Further investigations:
* the "preparation" steps in op_shared_search
* profiling the backend query times.
Hope that helps,
--
Sincerely,
William