Re: builder io issue

On 12/26/2011 06:43 AM, Brendan Conoboy wrote:
On 12/25/2011 09:06 PM, Gordan Bobic wrote:
Why not just mount direct via NFS? It'd be a lot quicker, not to mention
easier to tune. It'd work for building all but a handful of packages
(e.g. zsh), but you could handle those by having a single builder with a
normal local fs, plus a build policy that points the packages which fail
their self-tests on NFS at it.

I'm not acquainted with the rationale for the decision, so perhaps
somebody else can comment. Beyond the packages that demand a local
filesystem, perhaps there were issues with .nfsXXX files, or some
stability problem not seen when working with a single open file? Not sure.

I have rebuilt the entire distro using mock on NFSv3 with noatime,nolock,proto=udp and had no problems at all, apart from zsh, which has to be on a local fs mounted with atime, and that only to pass its self-tests.
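
For reference, this is roughly the sort of mount I mean - a minimal sketch in Python; the server path and mountpoint are made up, only the option string matters:

  # Sketch: mount the build area over NFSv3 with the options above.
  # "nfshost:/export/builds" and "/mnt/builds" are placeholders.
  import subprocess

  opts = "vers=3,noatime,nolock,proto=udp"
  subprocess.check_call(["mount", "-t", "nfs", "-o", opts,
                         "nfshost:/export/builds", "/mnt/builds"])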

512KB chunks sound vastly oversized for this sort of a workload. But if
you are running ext4 on top of a loopback file on top of NFS, no wonder
the performance sucks.

Well, 512KB chunks are oversized for traditional NFS use, but perhaps
undersized for this unusual use case.

The problem with the current method is that you can pretty much guarantee that you will not have proper alignment at any layer, which will cause considerable I/O imbalances and hot-spots.

Sounds like a better way to ensure that would be to re-architect the
storage solution more sensibly. If you really want to use block level
storage, use iSCSI on top of raw partitions. Providing those partitions
are suitably aligned (e.g. for 4KB physical sector disks, erase block
sizes, underlying RAID, etc.), your FS on top of those iSCSI exports
will also end up being properly aligned, and the stride, stripe-width
and block group size will all still line up properly.
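
To be concrete, the mkfs values follow mechanically from the chunk size and the number of data-bearing disks - a quick sketch, with example numbers only:

  # Derive mkfs.ext4 -E stride/stripe_width from the RAID geometry.
  # Example numbers only; adjust to the real chunk size and disk count.
  FS_BLOCK = 4096                                # ext4 block size in bytes

  def ext4_raid_params(chunk_bytes, data_disks):
      stride = chunk_bytes // FS_BLOCK           # fs blocks per RAID chunk
      stripe_width = stride * data_disks         # fs blocks per full stripe
      return stride, stripe_width

  stride, width = ext4_raid_params(chunk_bytes=64 * 1024, data_disks=4)
  print("mkfs.ext4 -E stride=%d,stripe_width=%d /dev/sdX1" % (stride, width))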

I understand there was an issue with iSCSI stability about a year ago.
One of our engineers tried it on his trimslice recently and had no
problems so it may be time to reevaluate its use.

Or you could just use bare NFS, unless somebody can provide a concrete example of why that won't work, combined with a single local-storage builder for the one or two packages that require a local fs. Having rebuilt 2000+ packages at least 3 times in the past month, I have found it not to be an issue. Whether some of the other 2000+ packages in Fedora have issues I don't know, but it is not at all clear that there are enough packages with NFS problems to make it unworkable, given a single iSCSI builder with a suitable build policy (similar to what is used for "heavy" builders).

But with 40 builders, each builder only hammering one disk, you'll still
get 10 builders hammering each spindle and causing a purely random seek
pattern. I'd be shocked if you see any measurable improvement from just
splitting up the RAID.

Let's say 10 (40/4) builders are using one disk at the same time - that's
not necessarily a doomsday scenario, since their network speed is only
100Mbps. The one situation you want to avoid is having numerous mock
setups running at one time; that will amount to a hammering. How much
time on average is spent composing the chroot vs. building? Sure, at some
point builders will simply overwhelm any given disk, but what is that
point? My guess is that 10 is really pushing it. 5 would be better.

LD_PRELOAD=libeatmydata.so makes a _massive_ difference to the mock setup times when not using a cached tarballed mock root image. Less so when using a cached mock root image, but it still makes a difference.
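
Roughly what I mean, as a sketch - the mock config name here is only a placeholder:

  # Sketch: run mock's chroot init with libeatmydata preloaded so the
  # rpm database fsync()s become no-ops. Config name is a placeholder.
  import os, subprocess

  env = dict(os.environ, LD_PRELOAD="libeatmydata.so")
  subprocess.check_call(["mock", "-r", "fedora-arm", "--init"], env=env)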

Using the fs image over loopback over NFS sounds so eye-wateringly wrong
that I'm just going to give up on this thread if that part is immutable.
I don't think the problem is significantly fixable if that approach
remains.

Why is that?

Because you are virtually guaranteed to have a file system that is not aligned, especially WRT block groups.

I don't see why you think that seeking within a single disk is any less
problematic than seeking across multiple disks. That will only happen
when the file exceeds the chunk size, and that will typically happen
only at the end when linking - there aren't many cases where a single
code file is bigger than a sensible chunk size (and in a 4-disk RAID0
case, you're pretty much forced to use a 32KB chunk size if you intend
for the block group beginnings to be distributed across spindles).
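
To illustrate, for a given chunk size you can map a contiguously allocated file straight onto spindles - a quick sketch, assuming the file starts on a chunk boundary (the best case):

  # Which RAID0 members a contiguously allocated file touches.
  # Assumes the file starts on a chunk boundary (best case).
  def spindles_touched(file_bytes, chunk_bytes, disks):
      chunks = (file_bytes + chunk_bytes - 1) // chunk_bytes
      return sorted(set(c % disks for c in range(chunks)))

  print(spindles_touched(20 * 1024, 32 * 1024, 4))        # small source file -> [0]
  print(spindles_touched(4 * 1024 * 1024, 32 * 1024, 4))  # large link output -> [0, 1, 2, 3]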

It's the chroot composition that makes me think seeking across multiple
disks is an issue.

If you are talking about untarring the cached mock rootfs, I doubt it. If you are talking about creating a mock root from scratch, then LD_PRELOAD=libeatmydata.so is what will make the biggest difference, due to all the rpm database-induced I/O.
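
For reference, the cached root I mean is just mock's root_cache plugin - roughly this in the mock config (values are only an example):

  # Sketch of the relevant mock config bits: keep a tarball of the composed
  # chroot so it isn't rebuilt from rpms on every build. Example values only.
  config_opts['plugin_conf']['root_cache_enable'] = True
  config_opts['basedir'] = '/var/lib/mock'   # where chroots and the cache live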

And local storage will be what? SD cards? There's only one model line of
SD cards I have seen to date that actually produces random-write results
that begin to approach a ~5000 rpm disk (up to 100 IOPS), and those are
SLC and quite expensive. Having spent the last few months patching,
fixing up and rebuilding RHEL6 packages for ARM, I have a pretty good
understanding of what works for backing storage and what doesn't - and
SD cards are not an approach to take if performance is an issue. Even
expensive, highly branded Class 10 SD cards only manage ~20 IOPS
(80KB/s) on random writes.

80KB/s? Really? That sounds like bad alignment.

That is with optimal alignment. It gets worse if you don't make sure the FS is aligned to the erase block size. 80KB/s = 20 IOPS (random write) with 4KB blocks. A lot of SD cards do even worse than that.

On random reads it is not unusual to see 1000-1500 IOPS, but the performance of random writes on SD cards is pretty dire, and unfortunately, the I/O-heavy things like rootfs setup (especially when not using the cached one) are almost pure writes.
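
The arithmetic behind that figure is simply IOPS times I/O size:

  # Random-write throughput is just IOPS times the I/O size.
  iops = 20
  io_size = 4 * 1024                            # 4KB random writes
  print("%d KB/s" % (iops * io_size // 1024))   # -> 80 KB/s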

I'm still not sure what the point is of using a loopback-mounted file for
storage instead of raw NFS. NFS mounted with nolock,noatime,proto=udp
works exceedingly well for me with NFSv3.

I didn't think udp was a good idea any longer.

It will give you a bit less network overhead, and if your network is reliable (i.e. no significant packet loss), it certainly won't be any worse than running over TCP.

Well, deadline is about favouring reads over writes. Writes you can
buffer for as long as you have RAM to spare (especially with libeatmydata
LD_PRELOAD-ed). Reads, however, block everything until they complete. So
favouring reads over writes may well get you ahead in terms of keeping
the builders busy.
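
Switching a builder's disk to deadline is a one-liner via sysfs - a sketch, with the device name made up:

  # Select the deadline elevator for a disk via sysfs (needs root).
  # "sda" is just an example device name.
  with open("/sys/block/sda/queue/scheduler", "w") as f:
      f.write("deadline")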

It really begs the question: what are builders blocking on right now? I'd
assumed chroot composition, which is rather write-heavy.

That is certainly the most I/O bound part of the task.

Gordan
_______________________________________________
arm mailing list
arm@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/arm


