Re: sleeps and waits during io_submit

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 2 Dec 2015 10:41:39 +1100

On Tue, Dec 01, 2015 at 10:56:01PM +0200, Avi Kivity wrote:
> On 12/01/2015 10:45 PM, Dave Chinner wrote:
> >On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote:
> >The difference is an allocation can block waiting on IO, and the
> >CPU can then go off and run another process, which then tries to do
> >an allocation. So you might only have 4 CPUs, but a workload that
> >can have a hundred active allocations at once (not uncommon in
> >file server workloads).
> 
> But for us, probably not much more.  We try to restrict active I/Os
> to the effective disk queue depth (more than that and they just turn
> sour waiting in the disk queue).
> 
> 
> >On workloads that are roughly 1 process per CPU, it's typical that
> >agcount = 2 * N cpus gives pretty good results on large filesystems.
> 
> This is probably using sync calls.  Using async calls you can have
> many more I/Os in progress (but still limited by effective disk
> queue depth).

Ah, no. Even with async IO you don't want unbounded allocation
concurrency. The allocation algorithms rely on having contiguous
free space extents that are much larger than the allocations being
done to work effectively and minimise file fragmentation. If you
chop the filesystem up into lots of small AGs, then it accelerates
the rate at which the free space gets chopped up into smaller
extents and performance then suffers. It's the same problem as
running a large filesystem near ENOSPC for an extended period of
time, which again is something we most definitely don't recommend
you do in production systems.

> >If you've got 400GB filesystems or you are using spinning disks,
> >then you probably don't want to go above 16 AGs, because then you
> >have problems with maintaining contiguous free space and you'll
> >seek the spinning disks to death....
> 
> We're concentrating on SSDs for now.

Sure, so "problems with maintaining contiguous free space" is what
you need to be concerned about.
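
To put rough numbers on the rules of thumb quoted above, here is a
minimal sketch. The function names, the parameters and the reading of
"400GB" as a hard cutoff are assumptions made for illustration; this is
not what mkfs.xfs does internally, and it is a starting point for
measurement, not a recommendation.

```python
def suggested_agcount(ncpus, fs_size_gb, rotational):
    """Hypothetical helper encoding the rules of thumb from this thread."""
    # About 1 process per CPU: agcount = 2 * N cpus on large filesystems.
    agcount = 2 * ncpus
    # Small filesystems (assumed cutoff: 400GB) or spinning disks:
    # probably don't go above 16 AGs.
    if fs_size_gb <= 400 or rotational:
        agcount = min(agcount, 16)
    return agcount


def ag_size_gb(fs_size_gb, agcount):
    """Size of each AG; smaller AGs leave less room for large free extents."""
    return fs_size_gb / agcount


if __name__ == "__main__":
    for ncpus, size_gb, rotational in [(4, 2000, False), (16, 400, False)]:
        n = suggested_agcount(ncpus, size_gb, rotational)
        print(ncpus, size_gb, rotational, n, ag_size_gb(size_gb, n))
```

The per-AG size is the number to watch: the more AGs a filesystem of a
given size is split into, the smaller each AG is relative to the
allocations being done in it.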

> >>>>'mount -o ikeep,'
> >>>
> >>>Interesting.  Our files are large so we could try this.
> >Keep in mind that ikeep means that inode allocation permanently
> >fragments free space, which can affect how large files are allocated
> >once you truncate/rm the original files.
> 
> We can try to prime this by allocating a lot of inodes up front,
> then removing them, so that this doesn't happen.

Again - what problem have you measured that inode preallocation will
solve in your application? Don't make changes just because you
*think* it will fix what you *think* is a problem. Measure, analyse,
solve, in that order.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
