Re: [PATCH] xfs: allocate sector sized IO buffer via page_frag_alloc

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 26 Feb 2019 07:11:22 +1100

On Mon, Feb 25, 2019 at 04:46:25PM +0800, Ming Lei wrote:
> On Mon, Feb 25, 2019 at 03:36:48PM +1100, Dave Chinner wrote:
> > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
> > > XFS uses kmalloc() to allocate sector sized IO buffer.
> > ....
> > > Use page_frag_alloc() to allocate the sector sized buffer, then the
> > > above issue can be fixed because offset_in_page of allocated buffer
> > > is always sector aligned.
> > 
> > Didn't we already reject this approach because page frags cannot be
> 
> I remembered there is this kind of issue mentioned, but just not found
> the details, so post out the patch for restarting the discussion.

As previously discussed, the only solution that fits all the use
cases we have to support is a slab cache that does not break object
alignment when slab debug options are turned on.

> > reused and that pages allocated to the frag pool are pinned in
> > memory until all fragments allocated on the page have been freed?
> 
> Yes, that is one problem. But if one page is consumed, sooner or later,
> all fragments will be freed, then the page becomes available again.

Ah, no, your assumption about how metadata caching in XFS works is
flawed. Some metadata ends up being cached for the life of the
filesystem because it is so frequently referenced it never gets
reclaimed. AG headers, btree root blocks, etc.  And the XFS metadata
cache hangs on to such metadata even under extreme memory pressure
because if we reclaim it then any filesystem operation will need to
reallocate that memory to clean dirty pages and that is the very
last thing we want to do under extreme memory pressure conditions.

If allocation cannot reuse holes in pages (i.e. works as a proper
slab cache) then we are going to blow out the amount of memory that
the XFS metadata cache uses very badly on filesystems where block
size != page size. 
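
To put rough numbers on that, here is a toy userspace model. It is not
kernel code and not how either allocator is actually implemented; the
function name, the workload shape (one long-lived buffer per page,
followed by a burst of new allocations) and the "no per-page overhead"
simplification are all assumptions made for illustration. It only
counts how many 64k pages end up in use when holes left by freed 4k
buffers can or cannot be handed out again.

```python
# Toy model only: counts pages in use, nothing more. All names here are
# hypothetical and do not correspond to any kernel API.

PAGE_SIZE = 64 * 1024   # 64k page machine
BLOCK_SIZE = 4 * 1024   # 4k metadata buffers


def pages_after_churn(n_initial, keep_every, n_later, reuse_holes):
    """Return the number of pages in use after three phases:

    1. n_initial buffers are packed sequentially into pages.
    2. All of them are freed except every keep_every-th one, which
       stands in for long-lived metadata (AG headers, btree roots).
       Pages with no live buffer left are returned to the system.
    3. n_later new buffers are allocated. With reuse_holes=True they
       first fill the holes in pinned pages (slab-like behaviour);
       with reuse_holes=False every one needs fresh page space
       (page-frag-like behaviour).
    """
    per_page = PAGE_SIZE // BLOCK_SIZE

    # Phase 1 and 2: count surviving buffers on each page.
    n_pages = -(-n_initial // per_page)          # ceiling division
    live = [0] * n_pages
    for i in range(n_initial):
        if i % keep_every == 0:
            live[i // per_page] += 1
    pinned = [n for n in live if n > 0]

    # Phase 3: place the new buffers.
    free_slots = sum(per_page - n for n in pinned) if reuse_holes else 0
    overflow = max(0, n_later - free_slots)
    new_pages = -(-overflow // per_page)
    return len(pinned) + new_pages


if __name__ == "__main__":
    # 1600 buffers fill 100 pages; one survivor per page pins all 100.
    for reuse in (True, False):
        pages = pages_after_churn(1600, 16, 1500, reuse_holes=reuse)
        print(f"reuse_holes={reuse}: {pages} pages, "
              f"{pages * PAGE_SIZE // 1024} KiB")
```

With those inputs the hole-reusing case stays at 100 pages, while the
no-reuse case grows to 194 pages for the same 1600 live buffers. The
workload is deliberately the worst case for pinning, so treat the
ratio as an illustration of the mechanism, not as a measurement.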

> > i.e. when we consider 64k page machines and 4k block sizes (i.e.
> > default config), every single metadata allocation is a sub-page
> > allocation and so will use this new page frag mechanism. IOWs, it
> > will result in fragmenting memory severely and typical memory
> > reclaim not being able to fix it because the metadata that pins each
> > page is largely unreclaimable...
> 
> It can be an issue in case of IO timeout & retry.

This makes no sense to me. Exactly how does filesystem memory
allocation affect IO timeouts and any retries the filesystem might
issue?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx