[fedora-arm] builder io issue

Brendan Conoboy blc at redhat.com
Mon Dec 26 03:57:31 UTC 2011


On 12/25/2011 03:47 AM, Gordan Bobic wrote:
> On 12/25/2011 06:16 AM, Brendan Conoboy wrote:
>> Allocating builders to individual rather than a single raid volume will
>> help dramatically.
> Care to explain why?

Sure, see below.

> Is this a "proper" SAN or just another Linux box with some disks in it?
> Is NFS backed by a SAN "volume"?

As I understand it, the server is a Linux host using RAID 0 with 512k 
chunks across 4 SATA drives.  This md device is then formatted with some 
filesystem (ext4?).  Directories on this filesystem are then exported to 
individual builders such that each builder has its own private space. 
Each private directory contains a large file that is used as a loopback 
ext4 filesystem (i.e., the builder mounts the NFS share, then loopback 
mounts the file on that NFS share as ext4).  This is where /var/lib/mock 
comes from.  Just to be clear, if you looked at the NFS-mounted 
directory on a build host you would see a single large file representing 
a filesystem, which makes traditional ext?fs tuning a bit more 
complicated.
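
Roughly, from a builder's point of view the arrangement looks like this 
(the export path and image file name are made up for illustration; the 
real ones will differ):

  # On the builder: mount its private NFS export from the server
  mount -t nfs server:/export/builder17 /mnt/scratch
  # Then loopback-mount the big image file as ext4; this becomes /var/lib/mock
  mount -o loop /mnt/scratch/mock.img /var/lib/mock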

The structural complication is that we have something like 30-40 systems 
all vying for the attention of those 4 spindles.  It's really important 
that each builder not cause more than one disk to perform an operation, 
because seeks are costly: if just 2 disks get called up by a single 
builder, 50% of the storage resources are tied up by a single host 
until the operation completes.  With 40 hosts you'll just end up 
thrashing (and with considerably fewer hosts, too).  RAID 0 gives great 
throughput, but at the cost of latency.  With so many 100mbit 
builders, throughput is less important and latency is key.
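
You can watch this happening on the server.  Something along these lines 
(device names assumed) shows the array's chunk size and whether a single 
builder's traffic is lighting up several spindles at once:

  # Chunk size and layout of the md array
  mdadm --detail /dev/md0 | grep -i chunk
  # Per-disk utilization; if one builder's I/O shows up on all four
  # drives simultaneously, requests are straddling chunk boundaries
  iostat -x sda sdb sdc sdd 5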

Roughly put, the two goals for good performance in this scenario are:

1. Make sure each builder only activates one disk per operation.

2. Make sure each io operation causes the minimum amount of seeking.

You're right that good alignment and block sizes and whatnot will help 
this cause, but even in the best case there is still a fair likelihood 
of I/O operations periodically crossing spindle boundaries.  You'd need 
a chunk size roughly equal to the fs image file size to pull that off. 
Perhaps an LVM setup with a strictly defined layout for each lvcreate 
would make it a bit more manageable, but for simplicity's sake I 
advocate simply treating the 4 disks like 4 disks, exported according 
to expected usage patterns.
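
For instance (mount points, hostnames and export options here are 
hypothetical), each disk could carry its own filesystem and roughly a 
quarter of the builders:

  # /etc/fstab: one filesystem per disk, no striping
  /dev/sdb1  /srv/disk1  ext4  defaults  0 2
  /dev/sdc1  /srv/disk2  ext4  defaults  0 2

  # /etc/exports: spread the builders evenly across the disks
  /srv/disk1/builder01  builder01(rw,async,no_root_squash)
  /srv/disk2/builder11  builder11(rw,async,no_root_squash)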

In the end, if all of this is done and the builders are still delayed 
by nfsds stuck in deep sleep, the only remaining options are to move 
/var/lib/mock to local storage or to increase the number of spindles on 
the server.

>> Disable fs
>> journaling (normally dangerous, but this is throw-away space).
>
> Not really dangerous - the only danger is that you might have to wait
> for fsck to do its thing on an unclean shutdown (which can take hours
> on a full TB-scale disk, granted).

I mean dangerous in the sense that if the server goes down, there might 
be data loss, but the builders using the space won't know that.  This is 
particularly true if nfs exports are async.
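
For what it's worth, dropping the journal would look something like the 
following on whichever filesystem ends up holding the mock trees (device 
names are placeholders), with async going into the export options:

  # Create the filesystem without a journal...
  mkfs.ext4 -O ^has_journal /dev/sdb1
  # ...or remove the journal from an existing, unmounted ext4
  tune2fs -O ^has_journal /dev/sdb1

  # /etc/exports: async acknowledges writes before they hit the disk
  /srv/disk1/builder01  builder01(rw,async)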

> Speaking of "dangerous" tweaks, you could LD_PRELOAD libeatmydata (add
> to a profile.d file in the mock config, and add the package to the
> buildsys-build group). That will eat fsync() calls which will smooth out
> commits and make a substantial difference to performance. Since it's
> scratch space anyway it doesn't matter WRT safety.

Sounds good to me :-)
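
Something along these lines ought to do it; the exact path to the 
library is an assumption and depends on how the libeatmydata package 
lays itself out:

  # Dropped into the build chroot as /etc/profile.d/eatmydata.sh
  # so every build process inherits it
  export LD_PRELOAD=/usr/lib/libeatmydata.so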

> Build of zsh will break on NFS whateveryou do. It will also break on a
> local FS with noatime. There may be other packages that suffer from this
> issue but I don't recall them off the top of my head. Anyway, that is an
> issue for a build policy - have one builder using block level storage
> with atime and the rest on NFS.

Since loopback files representing filesystems are being used, with NFS 
as the storage mechanism, this should be a non-issue.  You just can't 
have the builder mount its loopback fs noatime (I hadn't thought of 
that previously).

>> Once all that is done, tweak the number of nfsds such that
>> there are as many as possible without most of them going into deep
>> sleep. Perhaps somebody else can suggest some optimal sysctl and ext4fs
>> settings?
>
> As mentioned in a previous post, have a look here:
> http://www.altechnative.net/?p=96
>
> Deadline scheduler might also help on the NAS/SAN end, plus all the
> usual tweaks (e.g. make sure write caches on the disks are enabled, if
> the disks support write-read-verify disable it, etc.)

Definitely worth testing.  Well-ordered I/O is critical here.
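
For the record, the knobs in question are roughly these (device names 
assumed, and the defaults vary by distro):

  # The "th" line shows the nfsd thread count and how often all of
  # them were busy at once
  grep ^th /proc/net/rpc/nfsd
  # Raise the thread count (RPCNFSDCOUNT in /etc/sysconfig/nfs makes it stick)
  rpc.nfsd 32
  # Deadline elevator on each member disk
  echo deadline > /sys/block/sdb/queue/scheduler
  # Make sure the drive's write cache is enabled
  hdparm -W1 /dev/sdb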

-- 
Brendan Conoboy / Red Hat, Inc. / blc at redhat.com

