Hi, everyone.
I am the author of liblfds (a lock-free data structure library), which is used by nunc-stans, which is used by 389.
By various events of no interest to anyone, I've ended up idling in the #389 channel on Freenode and earlier today had a conversation with richm (who used liblfds in nunc-stans), regarding performance, data structures in use, locking and so on.
I was asked by richm to post the substantive content of our conversation to this mailing list.
===
I am currently working on a benchmark application for liblfds.
This benchmarks the lock-free data structures in the library, and also *locking* versions of those same data structures, where the locks tested are all those available in the platform.
I've not finished yet, and so currently only have two gnuplots available, from an earlier prototype benchmark application.
Note the 16-core image is, I think, 1800x2400 pixels. The file size is still small (~250kb).
http://www.liblfds.org/freelist_4core.png http://www.liblfds.org/freelist_16core.png
The CPU topology diagram can be found at the top of each image.
The four core machine is my laptop, real hardware, real Linux.
The sixteen core machine is a dedicated-hardware Amazon VM, so all bets are off. It looks nice, but God knows what the Xen configuration is. I can't even be sure there's a one-to-one mapping between virtual cores and logical cores - I've seen cloud providers where this is *not* the case.
Turning first to the four core plot, what we see is that while we are within the confines of a single physical core, spinlocks win. This is because the physical core knows what its logical cores are up to.
As soon as we step outside of a single physical core, lock-free wins, and wins big - an order of magnitude.
We see however, looking at the sad, short bars in the final graph, that the freelist in its current form does not scale linearly with cores. This is because of memory contention on the single cache line holding the freelist head pointer.
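To make the contention concrete, here is a minimal sketch (in C11 atomics, not liblfds's actual implementation, which also carries a counter to defeat the ABA problem) of a CAS-based freelist: every push and every pop, from every core, must CAS the one head pointer, so they all fight over the same cache line.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; };

/* The single point of contention: all pushes and pops CAS on this
   one pointer, so all cores hammer the same cache line. */
static _Atomic(struct node *) head;

void push(struct node *n)
{
    struct node *old = atomic_load(&head);
    do {
        n->next = old;
        /* on failure, old is reloaded with the current head */
    } while (!atomic_compare_exchange_weak(&head, &old, n));
}

struct node *pop(void)
{
    struct node *old = atomic_load(&head);
    /* NB: simplified - a real freelist pairs the pointer with a
       counter (or uses SMR) to avoid the ABA problem */
    while (old != NULL &&
           !atomic_compare_exchange_weak(&head, &old, old->next))
        ;
    return old;
}
```

Correct and lock-free, but the throughput ceiling is set by how fast that one cache line can bounce between cores.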
There is a method to improve on this, though: something called an elimination layer. It is based on the insight that when one thread pushes at, or almost at, the same time as another thread pops, those two threads can service each other's requests without ever touching the freelist head pointer.
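The elimination idea can be sketched as follows (heavily simplified and hypothetical - a real layer uses an array of slots chosen at random, and a pusher waits briefly and retracts its offer on timeout rather than parking indefinitely):

```c
#include <stdatomic.h>
#include <stddef.h>

/* One elimination slot; a real layer uses many, picked at random,
   to spread contention away from any single location. */
static _Atomic(void *) slot;

/* Pusher: offer the element directly to a concurrent popper.
   Returns 1 on success; on failure (slot occupied) the caller
   falls back to the normal CAS on the freelist head. */
int try_eliminate_push(void *element)
{
    void *expected = NULL;
    return atomic_compare_exchange_strong(&slot, &expected, element);
}

/* Popper: take whatever a pusher has parked in the slot, if
   anything; returns NULL if no pusher was eliminated against. */
void *try_eliminate_pop(void)
{
    return atomic_exchange(&slot, NULL);
}
```

When a push and a pop meet in the slot, the pair completes without touching the head pointer at all, which is what lets throughput scale past the single-cache-line bottleneck.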
Turning to the (questionable - VM, don't forget) sixteen core plot, we see much the same. Lock-free wins, but it's not scaling. We do see one or two other items of note; looking at the second graph, we can see a huge imbalance between the spinlock performances on the two cores; in both cases, one core dominated the other and stopped it from getting much work done.
The total amount of work done is then fairly high - because with this behaviour we are in fact starting to approach a single-threaded scenario!
The pthread mutex absolutely does not display this behaviour, since it serializes access and, I believe, internally maintains a queue of requestors, so everyone gets a fair chance.
The spinlocks of course just do their thing; we are then much more reliant upon the hardware to provide fairness, and as we see, this may not be a safe bet.
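The unfairness is inherent in what a bare spinlock is. A minimal test-and-set spinlock (a generic sketch, not any particular implementation from the benchmark) looks like this - note there is no queue of waiters anywhere, so which core acquires next is decided purely by the hardware's arbitration of the cache line:

```c
#include <stdatomic.h>

typedef atomic_flag spinlock_t;

/* Whichever core's test-and-set lands first wins; a core that is
   "closer" to the cache line can win over and over, starving the
   others - exactly the imbalance seen in the 16-core plots. */
void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the holder clears the flag */
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}
```

Ticket locks and queue locks (e.g. MCS) exist precisely to restore FIFO fairness here, at the cost of a little extra work per acquisition.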
We also see in the third and fifth graphs that the first core performed badly compared to the others. I do not know why, but it needs explanation.
Relating to this, although I don't have other plots, I did do quite a lot of benchmarking with the prototype benchmark app, and I seemed to find on some machines that some cores *always* did badly, for *all* lock types. I have a weak suspicion it reflects internal memory access arrangements in the core itself. However, this was on VMs - although they were all dedicated hardware, so I had no neighbours on the box - so it could just be Xen doing strange things.
I need to add RW locks to the benchmark, as they are used by nunc-stans/389.
The recent release of liblfds included - and this was done for implementation speed, to at least get some functionality out there ASAP - an *addonly* list, btree and hash. The list will become normal (read-write) soon enough, as I have a safe memory reclamation implementation about to come out. There is now in the literature a lock-free red-black tree, invented last year, and Cliff Click has had a proper lock-free hash out for some time. Each of these will take a month to implement, whereas the addonly versions were straightforward, so I got them out first.