On Mon, Sep 25, 2017 at 7:46 AM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> Honestly, what I really wanted to do was kill that call completely.

I can understand why you'd want to, but at the same time it is not
entirely clear that is possible.

So the problem is that we need more memory for buffer head allocations,
and we *have* to get it (since we very likely need the buffer heads for
writeout under low-mem circumstances) - but at the same time we
obviously must not recurse into the filesystem.

The end result being that any synchronous freeing is going to be very
limited (GFP_NOFS really ends up limiting a lot), and so there's a
fairly strong argument that we should then wake up other threads that
*can* do filesystem writeback.

Of course, the problem with _that_ is that it's quite likely that the
flusher threads doing filesystem writeback may in turn get back to this
very same "we need buffer heads".

And that is, I assume, the case that Facebook then hits.

So it's kind of a "damned if you do, damned if you don't" situation.

But this is very much also why that free_more_memory() code then

 (a) limits the writeout

 (b) has that "yield()" in it

with the logic being to not cause unnecessarily much writeout, and
hopefully avoid huge latency spikes because of the (potential) looping
over the unsuccessful GFP_NOFS allocations.

But yes, a lot has changed over time. For example, originally
filesystems that used buffer heads used them both for reading and
writing, so the act of reading in a page tended to also populate the
buffer heads for that page.

I think most filesystems these days avoid buffer heads entirely, and
the dynamics of bh allocations have likely changed radically.

So I wouldn't remove the "wakeup_flusher_threads()" call, but maybe it
could be moved to be an unusual fallback case if the direct freeing
doesn't work.
But none of this happens unless that alloc_buffer_head() failed, so at
least one GFP_NOFS allocation has already failed miserably, which is why
that "make other threads that _can_ do filesystem IO try to free things"
exists in the first place.

> If you read the free_more_memory() function, it looks like crap.

I really don't see that. Think of what the preconditions are:

 - a GFP_NOFS allocation has failed

 - we obviously cannot call back into filesystem code

and that function makes a fair amount of sense. It's _supposed_ to be an
unusual nasty case, after all.

Now,

> What I really wanted to do was kill the
> wakeup_flusher_threads() completely, we want to rely on
> try_to_free_pages() starting targeted writeback instead of some magic
> number of pages across devices.

Yes, hopefully the regular memory allocation should already do the right
thing, but the problem here is that we do need to loop over it, which it
isn't really designed for.

But one option might be to just say "_we_ shouldn't loop, the page
allocator should just loop until it gets a successful end result".

So remove _all_ of the fallback code entirely, and just use __GFP_NOFAIL.

That is in fact what some filesystems use for their own metadata: you'll
find that the GFP_NOFS | __GFP_NOFAIL combination is not entirely
unheard of, and in fact the buffer code itself actually uses it for the
backing page itself (not the buffer heads) in grow_dev_page().

So rather than play more games in free_more_memory(), I'd personally
just rather remove that function entirely, and say "the buffer head
allocations simply cannot fail" and leave it to the VM code to do the
right thing.

> But I decided that we should probably
> just keep that random flusher wakeup, and tackle that part in a v2 for
> 4.15.

Oh, absolutely, I'd be more than happy to play with this code, but as
mentioned, I actually would do even more radical surgery on it.

So I'm not at all objecting to changing that function in major ways.
I just don't think that changing the flush count is in any way obvious.
In many ways it's a *much* scarier change than adding __GFP_NOFAIL, in
my opinion.

At least with __GFP_NOFAIL, we have some fairly good reason to believe
that now the memory management code won't be doing enormous flushing
work "just because".

So my only real objection was really the timing, and the "it's obviously
a fix".

> The workload is nothing special. Getting to the point where it's bad
> enough to trigger a softlockup probably requires some serious beating up
> of the box, but getting to a suboptimal state wrt work units pending and
> the flusher threads is pretty trivial.

I'm actually surprised that you have what appears to be a
buffer-head-heavy workload at all. What is it that triggers that many
buffer heads in the first place? Because I thought we'd gotten to the
point where all normal file IO can avoid the buffer heads entirely, and
just directly work with making bio's from the pages.

So I assume this is some metadata writeout (ie either directory
writeback or just the btrfs metadata tree)?

             Linus