Re: xfs_buf_lock vs aio

On 02/13/2018 07:18 AM, Dave Chinner wrote:
On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
On 02/10/2018 01:10 AM, Dave Chinner wrote:
On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
On 02/09/2018 12:11 AM, Dave Chinner wrote:
On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
On 02/08/2018 01:33 AM, Dave Chinner wrote:
On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
As usual, I'm having my lovely io_submit()s sleeping. This time some
detailed traces. 4.14.15.
[....]

Forcing the log, so sleeping with ILOCK taken.
Because it's trying to reallocate an extent that is pinned in the
log and is marked stale. i.e. we are reallocating a recently freed
metadata extent that hasn't been committed to disk yet. IOWs, it's
the metadata form of the "log force to clear a busy extent so we can
re-use it" condition....

There's nothing you can do to reliably avoid this - it's a sign that
you're running low on free space in an AG because it's recycling
recently freed space faster than the CIL is being committed to disk.

You could speed up background journal syncs to try to reduce the
async checkpoint latency that allows busy extents to build up
(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
journal overhead and IO latency, etc.
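For reference, that knob is a plain procfs file (units are centiseconds, default 3000, i.e. 30 seconds). A sketch of reading and lowering it; the value 1000 is a hypothetical example, not a recommendation from this thread:

```shell
# Read the current background journal sync interval (centiseconds).
cat /proc/sys/fs/xfs/xfssyncd_centisecs

# Hypothetical tuning: checkpoint every 10 seconds to shrink the
# window in which busy extents accumulate, at the cost of more
# frequent journal I/O.
echo 1000 > /proc/sys/fs/xfs/xfssyncd_centisecs
```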
Perhaps xfs should auto-tune this variable.
That's not a fix. That's a nasty hack that attempts to hide the
underlying problem of selecting AGs and/or free space that requires
a log force to be used instead of finding other, un-encumbered
freespace present in the filesystem.
Isn't the underlying problem that you have a foreground process
depending on the progress of a background process?
At a very, very basic level.

i.e., no matter
how AG and free space selection improves, you can always find a
workload that consumes extents faster than they can be laundered?
Sure, but that doesn't mean we have to fall back to a synchronous
algorithm to handle collisions. It's that synchronous behaviour that
is the root cause of the long lock stalls you are seeing.
Well, having that algorithm be asynchronous will be wonderful. But I
imagine it will be a monstrous effort.
It's not clear yet whether we have to do any of this stuff to solve
your problem.

I was going by "is the root cause" above. But if we don't have to touch it, great.


I'm not saying that free extent selection can't or shouldn't be
improved, just that it can never completely fix the problem on its
own.
Righto, if you say so.

After all, what do I know about the subject at hand? I'm just the
poor dumb guy

Just because you're an XFS expert, and even wrote the code at hand,
doesn't mean I have nothing to contribute. If I'm wrong, it's enough
to tell me that and why.
It takes time and effort to have to explain why someone's suggestion
for fixing a bug will not work. It's tiring, unproductive work and I
get no thanks for it at all.

Isn't that part of being a maintainer? When everything works, the users are off the mailing list.

I'm just seen as the nasty guy who says
"no" to everything because I eventually run out of patience trying
to explain everything in simple enough terms for non-XFS people to
understand that they don't really understand XFS or what I'm talking
about.

IOWs, sometimes the best way to contribute is to know when you're in
way over your head and to step back and simply help the master
crafters get on with weaving their magic.....

Are you suggesting that I should go away? Or something else?


  who wrote the current busy extent list handling
algorithm years ago.  Perhaps you'd like to read the commit message
(below), because it explains these synchronous slow paths and why
they exist. I'll quote the part relevant to the discussion here,
though:

	    Ideally we should not reallocate busy extents. That is a
	    much more complex fix to the problem as it involves
	    direct intervention in the allocation btree searches in
	    many places. This is left to a future set of
	    modifications.
Thanks, that commit was interesting.

So, this future set of modifications is to have the extent allocator
consult this rbtree and continue searching if locked?
See, this is exactly what I mean.

You're now trying to guess how we'd solve the busy extent blocking
problem. i.e. you now appear to be assuming we have a plan to fix
this problem and are going to do it immediately.  Nothing could be
further from the truth - I said:

this is now important, and so we now need to revisit the issues we
laid out some 8 years ago and work from there.
That does not mean "we're going to fix this now" - it means we need
to look at the problem again and determine if it's the best solution
to the problem being presented to us. There are other avenues we
still need to explore.

Indeed, does your application and/or users even care about
[acm]times on your files being absolutely accurate and crash
resilient? i.e. do you use fsync() or fdatasync() to guarantee the
data is on stable storage?

We use fdatasync and don't care about mtime much. So lazytime would work for us.
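That durability contract can be sketched as below. The helper name and path are illustrative, not from the thread: fdatasync() flushes the file's data and only the metadata needed to read it back (e.g. file size), skipping pure timestamp updates, which is why a lazytime mount costs such a workload nothing.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write buf to path and make the data durable with fdatasync().
 * Returns 0 on success, -1 on error. Sketch only; a real
 * application would keep the fd open across many writes. */
static int write_durable(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fdatasync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Pairing this with a `lazytime` mount option would keep timestamp updates in memory and only write them back occasionally, avoiding the extra metadata traffic.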


[....]

I still think reducing the amount of outstanding busy extents is
important.  Modern disks write multiple GB/s, and big-data
applications like to do large sequential writes and deletes,
Hah! "modern disks"

You need to recalibrate what "big data" and "high performance IO"
means. This was what we were doing with XFS on linux back in 2006:

https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

i.e. 10 years ago we were already well into the *tens of GB/s* on
XFS filesystems for big-data applications with large sequential
reads and writes. These "modern disks" are so slow! :)

Today, that's one or a few disks, not 90, and you can rent such a setup for a few dollars an hour and do millions of IOPS on it.

Cheers,

Dave.




