Re: builder io issue

On 12/26/2011 03:57 AM, Brendan Conoboy wrote:
On 12/25/2011 03:47 AM, Gordan Bobic wrote:
On 12/25/2011 06:16 AM, Brendan Conoboy wrote:
Allocating builders to individual disks rather than a single raid volume
will help dramatically.
Care to explain why?

Sure, see below.

Is this a "proper" SAN or just another Linux box with some disks in it?
Is NFS backed by a SAN "volume"?

As I understand it, the server is a Linux host using raid0 with 512k
chunks across 4 sata drives. This md device is then formatted with some
filesystem (ext4?). Directories on this filesystem are then exported to
individual builders such that each builder has its own private space.
These private directories contain a large file that is used as a
loopback ext4fs (i.e. the builder mounts the nfs share, then loopback
mounts the file on that nfs share as an ext4fs). This is where
/var/lib/mock comes from. Just to be clear, if you looked at the
nfs-mounted directory on a build host you would see a single large file
representing a filesystem, which makes traditional ext?fs tuning a bit
more complicated.
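
To make that stack concrete: from a builder's point of view this amounts
to roughly the following (server name, export path and image file name
are made up for illustration):

  # mount the builder's private NFS export from the storage server
  mount -t nfs storage:/exports/builder01 /mnt/builder01
  # loopback-mount the large image file on that export as ext4
  mount -o loop -t ext4 /mnt/builder01/mock.img /var/lib/mock

So every read or write under /var/lib/mock goes through ext4, then the
loop device, then the NFS client, before it ever reaches the server's
md array.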

Why not just mount directly via NFS? It'd be a lot quicker, not to mention easier to tune. It'd work for building all but a handful of packages (e.g. zsh), and those you could handle with a single builder on a normal fs, plus a build policy pointing the packages that fail their self-tests on NFS at it.

The structural complication is that we have something like 30-40 systems
all vying for the attention of those 4 spindles. It's really important
that each builder not cause more than one disk to perform an operation
because seeks are costly, and if just 2 disks get called up by a single
builder, 50% of the storage resources will be taken up by a single host
until the operation completes. With 40 hosts, you'll just end up
thrashing (and with considerably fewer hosts, too). Raid0 gives great
throughput, but it's at the cost of latency. With so many 100mbit
builders, throughput is less important and latency is key.

512KB chunks sound vastly oversized for this sort of a workload. But if you are running ext4 on top of a loopback file on top of NFS, no wonder the performance sucks.

Roughly put, the two goals for good performance in this scenario are:

1. Make sure each builder only activates one disk per operation.

Sounds like a better way to ensure that would be to re-architect the storage solution more sensibly. If you really want to use block level storage, use iSCSI on top of raw partitions. Providing those partitions are suitably aligned (e.g. for 4KB physical sector disks, erase block sizes, underlying RAID, etc.), your FS on top of those iSCSI exports will also end up being properly aligned, and the stride, stripe-width and block group size will all still line up properly.
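
To give a rough idea of what lining up means in practice: with the
current geometry (512KB chunks, 4 data disks, 4KB fs blocks) the fs
created on top of the exported block device would be told something
like this, with /dev/sdX standing in for whatever the iSCSI-attached
device shows up as on the builder:

  # stride = chunk size / fs block size = 512KB / 4KB = 128
  # stripe-width = stride * number of data disks = 128 * 4 = 512
  mkfs.ext4 -b 4096 -E stride=128,stripe-width=512 /dev/sdX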

But with 40 builders, each builder only hammering one disk, you'll still get 10 builders hammering each spindle and causing a purely random seek pattern. I'd be shocked if you see any measurable improvement from just splitting up the RAID.

2. Make sure each io operation causes the minimum amount of seeking.

You're right that good alignment, block sizes and whatnot will help
this cause, but even in the best case io operations will still
periodically cross spindle boundaries. You'd need a chunk size about
equal to the fs image file size to pull that off.

Using the fs image over loopback over NFS sounds so eyewateringly wrong that I'm just going to give up on this thread if that part is immutable. I don't think the problem is significantly fixable if that approach remains.

Perhaps an lvm setup with a strictly defined layout for each
lvcreate would make it a bit more manageable, but for simplicity's sake
I advocate simply treating the 4 disks like 4 disks, exported according
to expected usage patterns.
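
(I.e. pinning each logical volume to a specific physical disk, along
these lines - VG and LV names made up:

  # builder01's space lives entirely on the first disk, builder02's on the second, etc.
  lvcreate -L 50G -n builder01 vg_builders /dev/sda1
  lvcreate -L 50G -n builder02 vg_builders /dev/sdb1

so each export only ever touches one spindle.)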

I don't see why you think that seeking within a single disk is any less problematic than seeking across multiple disks. That will only happen when the file exceeds the chunk size, and that will typically happen only at the end when linking - there aren't many cases where a single code file is bigger than a sensible chunk size (and in a 4-disk RAID0 case, you're pretty much forced to use a 32KB chunk size if you intend for the block group beginnings to be distributed across spindles).
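
For reference, the geometry that a 32KB chunk implies (device names made up):

  # 4-disk raid0 with 32KB chunks -> 128KB full stripe;
  # matching ext4 geometry would be stride = 32KB/4KB = 8, stripe-width = 8*4 = 32
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=32 /dev/sd[abcd]1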

In the end, if all this is done and the builders are delayed by deep
sleeping nfsds, the only options are to move /var/lib/mock to local
storage or increase the number of spindles on the server.

And local storage will be what? SD cards? There's only one model line of SD cards I have seen to date that actually produces random-write results beginning to approach a ~5000 rpm disk (up to 100 IOPS), and those are SLC and quite expensive. Having spent the last few months patching, fixing up and rebuilding RHEL6 packages for ARM, I have a pretty good understanding of what works as backing storage and what doesn't - and SD cards are not the way to go if performance is an issue. Even expensive, highly branded Class 10 SD cards only manage ~20 IOPS (80KB/s) on random writes.
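
If anyone wants to check their own cards, a quick way to measure small
random writes is fio against a scratch file on the mounted card (path
made up):

  # 4KB random writes, O_DIRECT, 60 seconds; fio reports IOPS at the end
  fio --name=sd-randwrite --directory=/mnt/sdcard --size=256m \
      --rw=randwrite --bs=4k --direct=1 --ioengine=sync \
      --runtime=60 --time_based

Direct I/O matters here, otherwise you're mostly measuring the page cache.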

Disable fs
journaling (normally dangerous, but this is throw-away space).

Not really dangerous - the only danger is that you might have to wait
for fsck to do its thing on an unclean shutdown (which can take hours
on a full TB-scale disk, granted).
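
(For reference, dropping the journal is just:

  # on an existing, unmounted ext4 fs
  tune2fs -O ^has_journal /dev/sdXn
  # or create it without one in the first place
  mkfs.ext4 -O ^has_journal /dev/sdXn

with /dev/sdXn standing in for whichever device backs the throw-away space.)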

I mean dangerous in the sense that if the server goes down, there might
be data loss, but the builders using the space won't know that. This is
particularly true if nfs exports are async.

Strictly speaking, the journal is about preserving the integrity of the FS so you don't have to fsck it after an unclean shutdown, not about preventing data loss as such. But I guess you could argue the two are related.

The build of zsh will break on NFS whatever you do. It will also break on a
local FS with noatime. There may be other packages that suffer from this
issue but I don't recall them off the top of my head. Anyway, that is an
issue for a build policy - have one builder using block level storage
with atime and the rest on NFS.

Since loopback files representing filesystems are being used with nfs as
the storage mechanism, this would probably be a non-issue. You just
can't have the builder mount its loopback fs noatime (hadn't thought of
that previously).

I'm still not sure what the point is of using a loopback-ed file for storage instead of raw NFS. NFS mounted with nolock,noatime,proto=udp works exceedingly well for me with NFSv3.
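
I.e. on each builder, just something along the lines of (server and
export names made up):

  mount -t nfs -o vers=3,nolock,noatime,proto=udp \
      storage:/exports/builder01 /var/lib/mock

with no image file or loop device anywhere in the path.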

Once all that is done, tweak the number of nfsds such that
there are as many as possible without most of them going into deep
sleep. Perhaps somebody else can suggest some optimal sysctl and ext4fs
settings?
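
A starting point for the nfsd side, assuming a Fedora/RHEL-style server
(the thread count here is a guess to tune, not a recommendation):

  # bump the number of nfsd threads; RPCNFSDCOUNT in /etc/sysconfig/nfs
  # makes it stick across restarts
  rpc.nfsd 32
  # the "th" line shows how often all threads have been busy - if the
  # rightmost columns keep growing, add more threads
  grep th /proc/net/rpc/nfsd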

As mentioned in a previous post, have a look here:
http://www.altechnative.net/?p=96

Deadline scheduler might also help on the NAS/SAN end, plus all the
usual tweaks (e.g. make sure write caches on the disks are enabled, if
the disks support write-read-verify disable it, etc.)
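
Concretely, per member disk on the server, something like:

  # switch to the deadline elevator
  echo deadline > /sys/block/sda/queue/scheduler
  # make sure the drive's write cache is on
  hdparm -W1 /dev/sda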

Definitely worth testing. Well ordered IO is critical here.

Well, deadline is about favouring reads over writes. Writes you can buffer for as long as you have RAM to spare (especially with libeatmydata LD_PRELOAD-ed). Reads, however, block everything until they complete. So favouring reads over writes may well get you ahead in terms of keeping the builders busy.
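
The libeatmydata part is just a preload around the build so fsync()/sync()
become no-ops - fine for throw-away chroots, nowhere else. Library path
and mock config name below are placeholders:

  LD_PRELOAD=/usr/lib/libeatmydata/libeatmydata.so \
      mock -r <config> --rebuild something.src.rpm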

Gordan
_______________________________________________
arm mailing list
arm@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/arm


