Re: [GIT PULL] Block fixes for 4.14-rc2

Chris Mason <clm@xxxxxx> · Mon, 25 Sep 2017 18:48:57 -0400

On 09/25/2017 06:21 PM, Linus Torvalds wrote:
> On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason <clm@xxxxxx> wrote:
>>
>> My understanding is that for order-0 page allocations and
>> kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or
>> at the very least OOM kill something before returning NULL?
>
> That should generally be true. We've occasionally screwed up in the
> VM, so an explicit GFP_NOFAIL would definitely be best if we then
> remove the looping in fs/buffer.c.

Right, I wouldn't remove the looping without the NOFAIL.  But in the 
normal case, it shouldn't be possible for free_more_memory() to be 
called without an OOM and without already having triggered the full flush.

That's why using the full flush in free_more_memory() felt like a small 
change to me.  But if you'd rather see GFP_NOFAIL in the next merge 
window I don't have any objections to that method either.
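
To make the two options concrete, here is a minimal userspace sketch of where the retry loop lives in each case. Everything in it (fake_allocator, try_alloc, reclaim, alloc_with_retry, alloc_nofail) is a hypothetical stand-in and not the fs/buffer.c or page allocator code; it only models the difference between a caller that loops around a reclaim helper and an allocator that promises never to return NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of an allocator under memory pressure: it fails
 * a fixed number of times before it starts succeeding. */
struct fake_allocator {
	int failures_left;   /* remaining simulated failures */
	int reclaim_calls;   /* how often the caller asked for reclaim */
};

/* A single allocation attempt that is allowed to fail. */
static void *try_alloc(struct fake_allocator *a, size_t size)
{
	if (a->failures_left > 0) {
		a->failures_left--;
		return NULL;
	}
	return malloc(size);
}

/* Stand-in for a "free more memory" helper; here it only counts calls. */
static void reclaim(struct fake_allocator *a)
{
	a->reclaim_calls++;
}

/* Option 1: the caller owns the loop and kicks reclaim on each failure. */
static void *alloc_with_retry(struct fake_allocator *a, size_t size)
{
	void *p;

	while ((p = try_alloc(a, size)) == NULL)
		reclaim(a);
	return p;
}

/* Option 2: the allocator owns the loop (the role a no-fail flag would
 * play), so the caller needs neither a NULL check nor its own reclaim. */
static void *alloc_nofail(struct fake_allocator *a, size_t size)
{
	void *p;

	do {
		p = try_alloc(a, size);
	} while (p == NULL);
	return p;
}

/* Usage: three simulated failures cost the looping caller three reclaim
 * calls; the no-fail variant never calls the caller's reclaim helper. */
static void alloc_demo(void)
{
	struct fake_allocator a = { .failures_left = 3, .reclaim_calls = 0 };
	void *p = alloc_with_retry(&a, 64);

	assert(p != NULL && a.reclaim_calls == 3);
	free(p);

	a.failures_left = 3;
	p = alloc_nofail(&a, 64);
	assert(p != NULL && a.reclaim_calls == 3);
	free(p);
}
```

In this model the caller-side loop is the only place the extra reclaim work gets triggered, which is why dropping the loop only makes sense together with a no-fail guarantee from the allocator.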

>>> What is it that triggers that many buffer heads in the first place?
>>> Because I thought we'd gotten to the point where all normal file IO
>>> can avoid the buffer heads entirely, and just directly work with
>>> making bio's from the pages.
>>
>> We're not triggering free_more_memory().  I ran a probe on a few production
>> machines and it didn't fire once over a 90 minute period of heavy load.  The
>> main target of Jens' patchset was preventing shrink_inactive_list() ->
>> wakeup_flusher_threads() from creating millions of work items without any
>> rate limiting at all.
>
> So the two things I reacted to in that patch series were apparently
> things that you guys don't even care about.

Yes and no.  fs/buffer.c didn't explode in prod recently, but we do 
exercise the OOM code often.  Even though I haven't seen it happen, I'd 
rather not leave fs/buffer.c able to trigger the same work explosion 
during an OOM spiral.  I know it's really unlikely, but java.
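
As an illustration of the kind of rate limiting being discussed, the sketch below coalesces repeated "flush everything" requests behind a single pending flag, so a burst of callers queues at most one work item until that item completes. This is a hypothetical userspace model using C11 atomics; struct flusher, request_full_flush() and full_flush_done() are invented names and are not taken from the patchset.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a flusher that allows one pending full flush. */
struct flusher {
	atomic_flag full_flush_pending;  /* set while a full flush is queued */
	atomic_ulong queued;             /* work items actually queued */
	atomic_ulong coalesced;          /* requests absorbed by a pending one */
};

static void flusher_init(struct flusher *f)
{
	atomic_flag_clear(&f->full_flush_pending);
	atomic_init(&f->queued, 0);
	atomic_init(&f->coalesced, 0);
}

/* Returns true if this request queued new work, false if an already
 * pending full flush covers it. */
static bool request_full_flush(struct flusher *f)
{
	if (atomic_flag_test_and_set(&f->full_flush_pending)) {
		atomic_fetch_add(&f->coalesced, 1);
		return false;
	}
	atomic_fetch_add(&f->queued, 1);
	return true;
}

/* Called when the queued flush has run; the next request queues again. */
static void full_flush_done(struct flusher *f)
{
	atomic_flag_clear(&f->full_flush_pending);
}

/* Usage: a burst of 1000 requests produces a single queued work item. */
static void flusher_demo(void)
{
	struct flusher f;

	flusher_init(&f);
	for (int i = 0; i < 1000; i++)
		request_full_flush(&f);
	assert(atomic_load(&f.queued) == 1);
	assert(atomic_load(&f.coalesced) == 999);

	full_flush_done(&f);
	assert(request_full_flush(&f));
}
```

Under this scheme the number of queued items is bounded by the number of completed flushes plus one, no matter how many callers pile up during an OOM spiral.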

> I reacted to the fs/buffer.c code, and to the change in laptop mode to
> not do circular writeback.
>
> The latter is another "it's probably ok, but it can be a subtle
> change". In particular, things that re-write the same thing over and
> over again can get very different behavior, even when you write out
> "all" pages.
>
> And I'm assuming you're not using laptop mode either on your servers
> (that sounds insane, but I remember somebody actually ended up using
> laptop mode even on servers, simply because they did *not* want the
> regular timed writeback model, so it's not quite as insane as it
> sounds).

I'd honestly have to do a query to make sure we aren't using it in some 
dark corner.  I sometimes think people try things just to see how long 
it takes Jens to notice.

-chris