Hi, everyone.
I am the author of liblfds (a lock-free data structure library), which is used by nunc-stans, which is used by 389.
By various events of no interest to anyone, I've ended up idling in the #389 channel on Freenode and earlier today had a conversation with richm (who used liblfds in nunc-stans), regarding performance, data structures in use, locking and so on.
I was asked by richm to post the substantive content of our conversation to this mailing list.
===
I am currently working on a benchmark application for liblfds.
This benchmarks the lock-free data structures in the library, and also *locking* versions of those same data structures, where the locks tested are all those available in the platform.
I've not finished yet, and so currently only have two gnuplots available, from an earlier prototype benchmark application.
Note the 16-core image is, I think, 1800x2400 pixels. The file size is still small (~250kb).
http://www.liblfds.org/freelist_4core.png http://www.liblfds.org/freelist_16core.png
The CPU topology diagram can be found at the top of each image.
The four core machine is my laptop, real hardware, real Linux.
The sixteen core machine is a dedicated-hardware Amazon VM, so all bets are off. It looks nice, but God knows what the Xen configuration is. I can't even be sure there's a one-to-one mapping between virtual cores and logical cores - I've seen cloud providers where this is *not* the case.
Turning first to the four core plot, what we see is that while we are within the confines of a single physical core, spinlocks win. This is because the physical core knows what its logical cores are up to.
As soon as we step outside of a single physical core, lock-free wins, and wins big - an order of magnitude.
We see however, looking at the sad, short bars in the final graph, that the freelist in its current form does not scale linearly with cores. This is because of memory contention on the single cache line holding the freelist head pointer.
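To make the contention concrete, here is a minimal sketch (in C11 atomics, not liblfds's actual implementation, which also carries a counter to defeat the ABA problem) of a CAS-based freelist: every push and every pop, from every core, must CAS the one head pointer, so they all fight over the same cache line.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; };

/* The single point of contention: all pushes and pops CAS on this
   one pointer, so all cores hammer the same cache line. */
static _Atomic(struct node *) head;

void push(struct node *n)
{
    struct node *old = atomic_load(&head);
    do {
        n->next = old;
        /* on failure, old is reloaded with the current head */
    } while (!atomic_compare_exchange_weak(&head, &old, n));
}

struct node *pop(void)
{
    struct node *old = atomic_load(&head);
    /* NB: simplified - a real freelist pairs the pointer with a
       counter (or uses SMR) to avoid the ABA problem */
    while (old != NULL &&
           !atomic_compare_exchange_weak(&head, &old, old->next))
        ;
    return old;
}
```

Correct and lock-free, but the throughput ceiling is set by how fast that one cache line can bounce between cores.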
There is a method to improve on this, though: something called an elimination layer. It is based on the insight that when one thread pushes at, or almost at, the same time as another thread pops, those two threads can service each other's requests without ever touching the freelist head pointer.
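The elimination idea can be sketched as follows (heavily simplified and hypothetical - a real layer uses an array of slots chosen at random, and a pusher waits briefly and retracts its offer on timeout rather than parking indefinitely):

```c
#include <stdatomic.h>
#include <stddef.h>

/* One elimination slot; a real layer uses many, picked at random,
   to spread contention away from any single location. */
static _Atomic(void *) slot;

/* Pusher: offer the element directly to a concurrent popper.
   Returns 1 on success; on failure (slot occupied) the caller
   falls back to the normal CAS on the freelist head. */
int try_eliminate_push(void *element)
{
    void *expected = NULL;
    return atomic_compare_exchange_strong(&slot, &expected, element);
}

/* Popper: take whatever a pusher has parked in the slot, if
   anything; returns NULL if no pusher was eliminated against. */
void *try_eliminate_pop(void)
{
    return atomic_exchange(&slot, NULL);
}
```

When a push and a pop meet in the slot, the pair completes without touching the head pointer at all, which is what lets throughput scale past the single-cache-line bottleneck.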
Turning to the (questionable - VM, don't forget) sixteen core plot, we see much the same. Lock-free wins, but it's not scaling. We do see one or two other items of note; looking at the second graph, we can see a huge imbalance between the spinlock performances on the two cores; in both cases, one core dominated the other and stopped it from getting much work done.
The total amount of work done is then fairly high - because with this behaviour we are in fact starting to approach a single-threaded scenario!
The pthread mutex absolutely does not display this behaviour, since it serializes access and, I believe, internally maintains a queue of requestors, so everyone gets a fair chance.
The spinlocks of course just do their thing; we are then much more reliant upon the hardware to provide fairness, and as we see, this may not be a safe bet.
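The unfairness is inherent in what a bare spinlock is. A minimal test-and-set spinlock (a generic sketch, not any particular implementation from the benchmark) looks like this - note there is no queue of waiters anywhere, so which core acquires next is decided purely by the hardware's arbitration of the cache line:

```c
#include <stdatomic.h>

typedef atomic_flag spinlock_t;

/* Whichever core's test-and-set lands first wins; a core that is
   "closer" to the cache line can win over and over, starving the
   others - exactly the imbalance seen in the 16-core plots. */
void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the holder clears the flag */
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}
```

Ticket locks and queue locks (e.g. MCS) exist precisely to restore FIFO fairness here, at the cost of a little extra work per acquisition.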
We also see in the third and fifth graphs that the first core performed badly compared to the others. I do not know why, but it needs explanation.
Relating to this, although I don't have other plots, I did do quite a lot of benchmarking with the prototype benchmark app, and I seemed to find on some machines that some cores *always* did badly, for *all* lock types. I have a weak suspicion it reflects internal memory access arrangements in the core itself. However, this was on VMs - although they were all dedicated hardware, so I had no neighbours on the box - so it could just be Xen doing strange things.
I need to add RW locks to the benchmark, as they are used by nunc-stans/389.
The recent release of liblfds included - and this was done for implementation speed, to at least get some functionality out there ASAP - an *addonly* list, btree and hash. The list will become normal (read-write) soon enough, as I have a safe memory reclamation implementation about to come out. There is now in the literature a lock-free red-black tree, invented last year, and Cliff Click has had a proper lock-free hash out for some time. Each of these will take a month to implement, whereas the addonly versions were straightforward, so I got them out first.