On 02/15/2018 01:56 AM, Dave Chinner wrote:
On Wed, Feb 14, 2018 at 02:07:42PM +0200, Avi Kivity wrote:
On 02/13/2018 07:18 AM, Dave Chinner wrote:
On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
On 02/10/2018 01:10 AM, Dave Chinner wrote:
On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
i.e., no matter
how AG and free space selection improves, you can always find a
workload that consumes extents faster than they can be laundered?
Sure, but that doesn't mean we have to fall back to a synchronous
algorithm to handle collisions. It's that synchronous behaviour that
is the root cause of the long lock stalls you are seeing.
Well, having that algorithm be asynchronous will be wonderful. But I
imagine it will be a monstrous effort.
It's not clear yet whether we have to do any of this stuff to solve
your problem.
I was going by "is the root cause" above. But if we don't have to
touch it, great.
Remember that triage - which is all about finding the root cause of
an issue - is a separate process to finding an appropriate fix for
the issue that has been triaged.
Sure.
I'm not saying that free extent selection can't or shouldn't be
improved, just that it can never completely fix the problem on its
own.
Righto, if you say so.
After all, what do I know about the subject at hand? I'm just the
poor dumb guy
Just because you're an XFS expert, and even wrote the code at hand,
doesn't mean I have nothing to contribute. If I'm wrong, it's enough
to tell me that and why.
It takes time and effort to explain why someone's suggestion
for fixing a bug will not work. It's tiring, unproductive work, and I
get no thanks for it at all.
Isn't that part of being a maintainer?
I'm not the maintainer. That burnt me out, and this was one of the
aspects of the job that contributes significantly to burn-out.
I'm sorry to hear that. As an ex-kernel maintainer (and current
non-kernel maintainer), I can certainly sympathize, though it was never
so bad for me.
I don't want the current maintainer to suffer from the same fate.
I can handle some stress, so I'm happy to play the bad guy because
it shares the stress around.
However, I'm not going to make the same mistake I did the first time
around - internalising these issues doesn't make them go away. Hence
I'm going to speak out about it in the hope that users realise that
their demands can have a serious impact on the people that are
supporting them. Sure, I could have put it better, but this is still
an unfamiliar, learning-as-I-go process for me and so next time I
won't make the same mistakes....
Well, I'm happy to adjust in order to work better with you, just tell me
what will work.
When everything works, the
users are off the mailing list.
That often makes things worse :/ Users are always asking questions
about configs, optimisations, etc. And then there's all the other
developers who want their projects merged and supported. The need to
say no doesn't go away just because "everything works"....
I'm just seen as the nasty guy who says
"no" to everything because I eventually run out of patience trying
to explain everything in simple enough terms for non-XFS people to
understand that they don't really understand XFS or what I'm talking
about.
IOWs, sometimes the best way to contribute is to know when you're in
way over your head and to step back and simply help the master
crafters get on with weaving their magic.....
Are you suggesting that I should go away? Or something else?
Something else.
Avi, your help and insight are most definitely welcome (and needed!)
because we can't find a solution that would suit your needs without
it. All I'm asking for is a little bit of patience as we go
through the process of gathering all the info we need to determine
the best approach to solving the problem.
Thanks. I'm under pressure to find a solution quickly, so maybe I'm
pushing too hard.
I'm certainly all for the right long-term fix rather than creating
mountains of workarounds that later create more problems.
Be aware that when you are asked triage questions that seem
illogical or irrelevant, the best thing to do is to answer them as
well as you can and save your own questions for later. Those
questions are usually asked to rule out complex, convoluted cases
that would take a long, long time to explain, and responding with
questions rather than answers derails the process of expedient
triage and analysis.
IOWs, let's talk about the merits and mechanisms of solutions when
they are proposed, not while questions are still being asked about
the application, requirements, environment, etc. needed to determine
what the best potential solution may be.
OK. I also ask these questions to increase my understanding of
the topic; it's not just about getting a quick fix in.
Indeed, does your application and/or users even care about
[acm]times on your files being absolutely accurate and crash
resilient? i.e. do you use fsync() or fdatasync() to guarantee the
data is on stable storage?
We use fdatasync and don't care about mtime much. So lazytime would
work for us.
OK, so let me explore that in a bit more detail and see whether it's
something we can cleanly implement....
I still think reducing the number of outstanding busy extents is
important. Modern disks write multiple GB/s, and big-data
applications like to do large sequential writes and deletes.
Hah! "modern disks"
You need to recalibrate what "big data" and "high performance IO"
means. This was what we were doing with XFS on linux back in 2006:
https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
i.e. over a decade ago we were already well into the *tens of GB/s* on
XFS filesystems for big-data applications with large sequential
reads and writes. These "modern disks" are so slow! :)
Today, that's one or a few disks, not 90, and you can rent such a
setup for a few dollars an hour, doing millions of IOPS.
Sure, but that's not "big-data" anymore - it's pretty common
nowadays in enterprise server environments. Big data applications
these days are measured in TB/s and hundreds of PBs.... :)
Across a cluster, with each node having tens of cores and tens/hundreds
of TB, not more. The nodes I described are fairly typical.
Meanwhile, we've tried inode32 on a newly built filesystem (to avoid any
inherited imbalance). The old filesystem had a large AGF imbalance; the
new one did not, as expected. However, the stalls remain.
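(The experiment amounts to mounting the new filesystem with the inode32
option; expressed via mount(2), it would look something like the sketch
below, with a made-up device and mountpoint:)

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /*
         * Hypothetical device and mountpoint. "inode32" constrains
         * inode allocation so inode numbers fit in 32 bits, which
         * also concentrates inodes in the lower AGs.
         */
        if (mount("/dev/nvme0n1", "/data", "xfs", 0, "inode32") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }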
A little bird whispered in my ear to try XFS_IOC_OPEN_BY_HANDLE to avoid
the time update lock, so we'll be trying that next, to emulate lazytime.
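For anyone following along, the sketch below shows what we have in mind,
using the libhandle wrappers from xfsprogs (link with -lhandle; the path
is made up, and this typically needs root privileges). Whether it
actually avoids the timestamp update is exactly what we'll be testing:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <xfs/handle.h>    /* from xfsprogs; link with -lhandle */

    int main(void)
    {
        char path[] = "/data/sstable";   /* hypothetical path on XFS */
        void *hanp;
        size_t hlen;
        int fd;

        /* Translate the path into an opaque filesystem handle. */
        if (path_to_handle(path, &hanp, &hlen) < 0) {
            perror("path_to_handle");
            return 1;
        }

        /*
         * open_by_handle() issues XFS_IOC_OPEN_BY_HANDLE against the
         * filesystem and returns an ordinary fd, bypassing the normal
         * path-based open.
         */
        fd = open_by_handle(hanp, hlen, O_RDONLY);
        free_handle(hanp, hlen);
        if (fd < 0) {
            perror("open_by_handle");
            return 1;
        }

        /* ... normal reads/writes on fd ... */
        close(fd);
        return 0;
    }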