> My next steps from here will be:
> * To add USDT probes to the logs and back_ldbm to get better, fine-grained detail about what is causing the excessive latency.
> * These probes are also needed to resolve what appears to be a symbol resolution issue with systemtap when optimisations are enabled.
> * To add probes in other parts of the code base to get better visibility into where, and for how long, time is spent during an operation.
> * To run a lock contention profile (I was unable to do this today due to bugs in systemtap).
> * To document the setup procedures.
> * To commit all of this in a PR.
> * To document some actionable improvements we can make to server performance.
mutrace will give you a good audit of lock contention, and it's simple to use, takes no setup.
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
After Ludwig's suggestion to separate the machines, and after some heart-stopping LVM issues last week, I've now established a load-testing environment with two VMs: one for ldclt and one for DS.
I'm currently altering only the "thread" parameter, since the initial tests showed steep changes in performance based on thread count. I have run two search tests, with one DS instance set to a single thread and the other to 16. The VM in use is dual core on top of an i5 2.9GHz, with 4GB of RAM and 40GB of SSD storage.
I'm going to attach two SVGs to this email for the two tests: 1837 is the single-thread result, 1942 is the 16-thread result.
There are still some quirks with perf resolving symbol names, but the stacks otherwise look correct. I'll be submitting my tooling in a PR when I have completed some more work.
The idea of a flamegraph is to allow visual inspection to determine high-latency or resource-consuming areas of the codebase. At a glance, these flamegraphs already show some differences. We should expect these differences are due to lock contention, as a single-thread instance should have NO lock contention, whereas a multi-thread instance will show it.
connection_threadmain (our worker threads, with poll/read) shows a decrease in time usage. This makes sense, since there are no locks in connection_threadmain, which means our "delays" or latency must be coming from other code paths.
We spend fractionally more time in libback. This is to be expected, though: with extra threads that can each be at a different stage of a search, we expect to see slightly more concurrent threads in each frame.
We spend more time in acl_access_allowed (7.7% vs 5.9%). This would indicate that we may have some delays due to ACL processing, or other locking in that region.
There is also a minor increase in time spent in log_result (3.6% to 3.9%). It's already a bit concerning that we spend ~4% of our time in logging, when it's only, I think, 3 or 4 function calls in do_search out of hundreds. The change appears to be due to a higher number of vsnprintf calls, likely from the slightly higher throughput of the multi-thread version.
I think my next step will be to update and look at a lock contention profile from systemtap, as well as potentially adding timing markers to the code base and using systemtap to record function entry/exit timings, to determine where the latency changes are occurring. It appears we have no single source of pain, so much as what I have suspected: lots of little locks, throughout the code base, that each cause a small, yet noticeable, progression delay.
Note: you can click into the content of the .svg to "zoom" into specific function areas and see their percentages, etc.