Re: xfs_buf_lock vs aio

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 13 Feb 2018 16:18:51 +1100

On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
> On 02/10/2018 01:10 AM, Dave Chinner wrote:
> >On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
> >>On 02/09/2018 12:11 AM, Dave Chinner wrote:
> >>>On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
> >>>>On 02/08/2018 01:33 AM, Dave Chinner wrote:
> >>>>>On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
> >>>>>>As usual, I'm having my lovely io_submit()s sleeping. This time some
> >>>>>>detailed traces. 4.14.15.
> >>>[....]
> >>>
> >>>>>>Forcing the log, so sleeping with ILOCK taken.
> >>>>>Because it's trying to reallocate an extent that is pinned in the
> >>>>>log and is marked stale. i.e. we are reallocating a recently freed
> >>>>>metadata extent that hasn't been committed to disk yet. IOWs, it's
> >>>>>the metadata form of the "log force to clear a busy extent so we can
> >>>>>re-use it" condition....
> >>>>>
> >>>>>There's nothing you can do to reliably avoid this - it's a sign that
> >>>>>you're running low on free space in an AG because it's recycling
> >>>>>recently freed space faster than the CIL is being committed to disk.
> >>>>>
> >>>>>You could speed up background journal syncs to try to reduce the
> >>>>>async checkpoint latency that allows busy extents to build up
> >>>>>(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
> >>>>>journal overhead and IO latency, etc.
> >>>>Perhaps xfs should auto-tune this variable.
> >>>That's not a fix. That's a nasty hack that attempts to hide the
> >>>underlying problem of selecting AGs and/or free space that requires
> >>>a log force to be used instead of finding other, un-encumbered
> >>>freespace present in the filesystem.
> >>Isn't the underlying problem that you have a foreground process
> >>depending on the progress of a background process?
> >At a very, very basic level.
> >
> >>i.e., no matter
> >>how AG and free space selection improves, you can always find a
> >>workload that consumes extents faster than they can be laundered?
> >Sure, but that doesn't mean we have to fall back to a synchronous
> >algorithm to handle collisions. It's that synchronous behaviour that
> >is the root cause of the long lock stalls you are seeing.
> 
> Well, having that algorithm be asynchronous will be wonderful. But I
> imagine it will be a monstrous effort.

It's not clear yet whether we have to do any of this stuff to solve
your problem.

> >>I'm not saying that free extent selection can't or shouldn't be
> >>improved, just that it can never completely fix the problem on its
> >>own.
> >Righto, if you say so.
> >
> >After all, what do I know about the subject at hand? I'm just the
> >poor dumb guy
> 
> 
> Just because you're an XFS expert, and even wrote the code at hand,
> doesn't mean I have nothing to contribute. If I'm wrong, it's enough
> to tell me that and why.

It takes time and effort to have to explain why someone's suggestion
for fixing a bug will not work. It's tiring, unproductive work and I
get no thanks for it at all. I'm just seen as the nasty guy who says
"no" to everything because I eventually run out of patience trying
to explain everything in simple enough terms for non-XFS people to
understand that they don't really understand XFS or what I'm talking
about.

IOWs, sometimes the best way to contribute is to know when you're in
way over your head and to step back and simply help the master
crafters get on with weaving their magic.....

> >  who wrote the current busy extent list handling
> >algorithm years ago.  Perhaps you'd like to read the commit message
> >(below), because it explains these synchronous slow paths and why
> >they exist. I'll quote the part relevant to the discussion here,
> >though:
> >
> >	    Ideally we should not reallocate busy extents. That is a
> >	    much more complex fix to the problem as it involves
> >	    direct intervention in the allocation btree searches in
> >	    many places. This is left to a future set of
> >	    modifications.
> 
> Thanks, that commit was interesting.
> 
> So, this future set of modifications is to have the extent allocator
> consult this rbtree and continue searching if locked?

See, this is exactly what I mean.

You're now trying to guess how we'd solve the busy extent blocking
problem. i.e. you now appear to be assuming we have a plan to fix
this problem and are going to do it immediately.  Nothing could be
further from the truth - I said:

> >this is now important, and so we now need to revisit the issues we
> >laid out some 8 years ago and work from there.

That does not mean "we're going to fix this now" - it means we need
to look at the problem again and determine if it's the best solution
to the problem being presented to us. There are other avenues we
still need to explore.

Indeed, do your application and/or users even care about
[acm]times on your files being absolutely accurate and crash
resilient? i.e. do you use fsync() or fdatasync() to guarantee the
data is on stable storage?
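
To make the question concrete, here is a minimal sketch of the two
calls, assuming a Linux host where Python's os.fdatasync() is available.
The helper name write_durably() and its metadata_too flag are
hypothetical, made up for this illustration; they are not taken from
your application or from XFS.

```python
import os


def write_durably(path, payload, metadata_too=True):
    """Write payload to path and flush it before returning.

    Hypothetical helper for illustration only.
    metadata_too=True  -> fsync(): file data plus inode metadata,
                          timestamps included.
    metadata_too=False -> fdatasync(): file data plus only the metadata
                          needed to read it back (e.g. the file size).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        # os.write() may write fewer bytes than requested, so loop.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        if metadata_too:
            os.fsync(fd)
        else:
            os.fdatasync(fd)
    finally:
        os.close(fd)
```

A caller that only needs the data itself to survive a crash would pass
metadata_too=False; one that also needs [acm]times to be durable would
keep the default.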

[....]

> I still think reducing the amount of outstanding busy extents is
> important.  Modern disks write multiple GB/s, and big-data
> applications like to do large sequential writes and deletes,

Hah! "modern disks"

You need to recalibrate what "big data" and "high performance IO"
mean. This was what we were doing with XFS on Linux back in 2006:

https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

i.e. 10 years ago we were already well into the *tens of GB/s* on
XFS filesystems for big-data applications with large sequential
reads and writes. These "modern disks" are so slow! :)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx