[fedora-arm] builder io issue

Gordan Bobic gordan at bobich.net
Mon Dec 26 05:06:04 UTC 2011


On 12/26/2011 03:57 AM, Brendan Conoboy wrote:
> On 12/25/2011 03:47 AM, Gordan Bobic wrote:
>> On 12/25/2011 06:16 AM, Brendan Conoboy wrote:
>>> Allocating builders to individual rather than a single raid volume will
>>> help dramatically.
>> Care to explain why?
>
> Sure, see below.
>
>> Is this a "proper" SAN or just another Linux box with some disks in it?
>> Is NFS backed by a SAN "volume"?
>
> As I understand it, the server is a Linux host using raid0 with 512k
> chunks across 4 sata drives. This md device is then formatted with some
> filesystem (ext4?). Directories on this filesystem are then exported to
> individual builders such that each builder has its own private space.
> These private directories contain a large file that is used as a
> loopback ext4fs (i.e., the builder mounts the NFS share, then loopback
> mounts the file on that NFS share as an ext4fs). This is where
> /var/lib/mock comes from. Just to be clear, if you looked at the
> NFS-mounted directory on a build host you would see a single large
> file that represented a filesystem, making traditional ext?fs tuning
> a bit more complicated.
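
(For reference, and guessing at names and paths, I assume the
builder-side arrangement described above boils down to something like:

  # mount the private NFS export, then loopback-mount the image file
  # it contains as the builder's ext4 filesystem
  mount -t nfs -o vers=3 nfsserver:/export/builder01 /mnt/build-nfs
  mount -o loop /mnt/build-nfs/mock.img /var/lib/mock

- correct me if I've misread the setup.)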

Why not just mount /var/lib/mock directly over NFS? It'd be a lot
quicker, not to mention easier to tune. It would work for building all
but a handful of packages (e.g. zsh); those you could handle by keeping
a single builder on a normal local filesystem and adding a build policy
that routes packages whose self-tests fail on NFS to it.
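
Concretely, something like this in each builder's /etc/fstab (server
name and export path made up), with no image file or loopback layer in
between:

  nfsserver:/export/builder01  /var/lib/mock  nfs  vers=3  0  0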

> The structural complication is that we have something like 30-40 systems
> all vying for the attention of those 4 spindles. It's really important
> that each builder not cause more than one disk to perform an operation
> because seeks are costly, and if just 2 disks get called up by a single
> builder, 50% of the storage resources will be taken up by a single host
> until the operation completes. With 40 hosts, you'll just end up
> thrashing (with considerably fewer hosts, too). RAID0 gives great
> throughput, but it's at the cost of latency. With so many 100Mbit
> builders, throughput is less important and latency is key.

512KB chunks sound vastly oversized for this sort of workload. But if
you are running ext4 on top of a loopback file on top of NFS, it's no
wonder the performance sucks.
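
If the array is ever being rebuilt anyway, a much smaller chunk would
give a far more seek-friendly layout - something like this (device
names made up, and it obviously destroys the existing data):

  # recreate the stripe with 32KiB chunks instead of 512KiB
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=32 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1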

> Roughly put, the two goals for good performance in this scenario are:
>
> 1. Make sure each builder only activates one disk per operation.

A better way to ensure that would be to re-architect the storage
solution more sensibly. If you really want to use block level storage,
use iSCSI on top of raw partitions. Provided those partitions are
suitably aligned (e.g. to 4KB physical sectors, erase block sizes, the
underlying RAID, etc.), the FS on top of those iSCSI exports will also
end up properly aligned, and the stride, stripe-width and block group
size will all still line up.
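
As a rough sketch of what "lining up" means in mkfs terms, assuming
4KiB blocks and a 32KiB chunk across 4 data spindles (adjust to
whatever geometry you actually end up with):

  # stride       = chunk / block size            = 32KiB / 4KiB = 8
  # stripe-width = stride * number of data disks = 8 * 4        = 32
  mkfs.ext4 -b 4096 -E stride=8,stripe-width=32 /dev/md0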

But even with 40 builders each hammering only one disk, you'll still
get 10 builders per spindle generating a purely random seek pattern.
I'd be shocked if you saw any measurable improvement from just
splitting up the RAID.

> 2. Make sure each io operation causes the minimum amount of seeking.
>
> You're right that good alignment and block sizes and whatnot will help
> this cause, but there is still greater likelihood of io operations
> traversing spindle boundaries periodically in the best situation. You'd
> need a chunk size about equal to the fs image file size to pull that
> off.

Using an fs image over loopback over NFS sounds so eye-wateringly wrong
that I'm just going to give up on this thread if that part is
immutable. I don't think the problem is meaningfully fixable while that
approach remains.

> Perhaps an lvm setup with strictly defined layouts with each
> lvcreate would make it a bit more manageable, but for simplicity's sake
> I advocate simply treating the 4 disks like 4 disks, exported according
> to expected usage patterns.

I don't see why you think that seeking within a single disk is any
less problematic than seeking across multiple disks. Crossing spindles
will only happen when a file exceeds the chunk size, and that will
typically only happen at the end of a build, when linking - there
aren't many cases where a single code file is bigger than a sensible
chunk size (and in the 4-disk RAID0 case you're pretty much forced
into a 32KB chunk size if you want the block group beginnings to be
distributed across spindles).
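
To make the spindle-crossing point concrete: for a contiguously
allocated file the RAID0 mapping is just arithmetic, so which spindle
serves a given byte offset is easy to work out (illustrative numbers
only):

  # spindle serving byte OFFSET on a 4-disk RAID0 with 32KiB chunks
  CHUNK=$((32 * 1024)); NDISKS=4; OFFSET=$((5 * 1024 * 1024))
  echo $(( (OFFSET / CHUNK) % NDISKS ))   # prints 0; the answer
                                          # changes every 32KiB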

> In the end, if all this is done and the builders are delayed by deep
> sleeping nfsds, the only options are to move /var/lib/mock to local
> storage or increase the number of spindles on the server.

And local storage will be what? SD cards? There's only one model line
of SD cards I have seen to date that actually produces random-write
results beginning to approach those of a ~5000 rpm disk (up to ~100
IOPS), and those are SLC and quite expensive. Having spent the last
few months patching, fixing up and rebuilding RHEL6 packages for ARM,
I have a pretty good understanding of what works for backing storage
and what doesn't - and SD cards are not the approach to take if
performance matters. Even expensive, big-brand Class 10 SD cards only
manage ~20 IOPS (80 KB/s) on random writes.

>>> Disable fs
>>> journaling (normally dangerous, but this is throw-away space).
>>
>> Not really dangerous - the only danger is that you might have to wait
>> for fsck to do its thing on an unclean shutdown (which can take hours
>> on a full TB-scale disk, granted).
>
> I mean dangerous in the sense that if the server goes down, there might
> be data loss, but the builders using the space won't know that. This is
> particularly true if nfs exports are async.

Strictly speaking, the journal is about preserving the integrity of
the FS so that you don't have to fsck it after an unclean shutdown,
not about preventing data loss as such. But I guess you could argue
the two are related.
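
If the journal does get dropped on the throw-away space, it's a
one-off tune2fs operation on the unmounted fs (device name made up):

  umount /dev/md0                    # fs must not be mounted
  tune2fs -O ^has_journal /dev/md0   # remove the ext4 journal
  e2fsck -f /dev/md0                 # force a full check afterwards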

>> Build of zsh will break on NFS whatever you do. It will also break on a
>> local FS with noatime. There may be other packages that suffer from this
>> issue but I don't recall them off the top of my head. Anyway, that is an
>> issue for a build policy - have one builder using block level storage
>> with atime and the rest on NFS.
>
> Since loopback files representing filesystems are being used with nfs as
> the storage mechanism, this would probably be a non-issue. You just
> can't have the builder mount its loopback fs noatime (hadn't thought of
> that previously).

I'm still not sure what the point is of using a loopbacked file for
storage instead of raw NFS. NFSv3 mounted with nolock,noatime,proto=udp
works exceedingly well for me.
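
Concretely, on each builder (server and export path made up):

  mount -t nfs -o vers=3,nolock,noatime,proto=udp \
      nfsserver:/export/builder01 /var/lib/mock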

>>> Once all that is done, tweak the number of nfsds such that
>>> there are as many as possible without most of them going into deep
>>> sleep. Perhaps somebody else can suggest some optimal sysctl and ext4fs
>>> settings?
>>
>> As mentioned in a previous post, have a look here:
>> http://www.altechnative.net/?p=96
>>
>> Deadline scheduler might also help on the NAS/SAN end, plus all the
>> usual tweaks (e.g. make sure write caches on the disks are enabled, if
>> the disks support write-read-verify disable it, etc.)
>
> Definitely worth testing. Well ordered IO is critical here.

Well, deadline is about favouring reads over writes. Writes you can
buffer for as long as you have RAM to spare (especially with
libeatmydata LD_PRELOAD-ed); reads, however, block everything until
they complete. So favouring reads over writes may well get you ahead
in terms of keeping the builders busy.
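
In practice that amounts to something like this on the server (disk
names made up, and the nfsd count is something to experiment with
rather than a recommendation):

  # deadline elevator and write cache on the disks backing the exports
  for d in sda sdb sdc sdd; do
      echo deadline > /sys/block/$d/queue/scheduler
      hdparm -W1 /dev/$d
  done

  # more nfsd threads (RHEL/Fedora style - edit the existing line if
  # RPCNFSDCOUNT is already set), then restart the service
  echo 'RPCNFSDCOUNT=32' >> /etc/sysconfig/nfs
  service nfs restart

On the builder side, libeatmydata is just an LD_PRELOAD that turns
fsync() and friends into no-ops, e.g. (library name/path varies by
distro, and the srpm name is obviously a placeholder):

  LD_PRELOAD=libeatmydata.so mock --rebuild some-package.src.rpm

That's only sane because the mock space is throw-away anyway.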

Gordan

