Re: [PATCH 2/3] xfs: add kmem_alloc_io()

On Thu, Aug 22, 2019 at 07:14:52AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 09:35:33AM -0400, Brian Foster wrote:
> > On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > 
> > > Memory we use to submit IO needs strict alignment to the
> > > underlying driver constraints. Worst case, this is 512 bytes. Given
> > > that all allocations for IO are always a power of 2 multiple of 512
> > > bytes, the kernel heap provides natural alignment for objects of
> > > these sizes and that suffices.
> > > 
> > > Until, of course, memory debugging of some kind is turned on (e.g.
> > > red zones, poisoning, KASAN) and then the alignment of the heap
> > > objects is thrown out the window. Then we get weird IO errors and
> > > data corruption problems because drivers don't validate alignment
> > > and do the wrong thing when passed unaligned memory buffers in bios.
> > > 
> > > TO fix this, introduce kmem_alloc_io(), which will guarantee at least
> > 
> > s/TO/To/
> > 
> > > 512 byte alignment of buffers for IO, even if memory debugging
> > > options are turned on. It is assumed that the minimum allocation
> > > size will be 512 bytes, and that sizes will be power of 2 multiples
> > > of 512 bytes.
> > > 
> > > Use this everywhere we allocate buffers for IO.
> > > 
> > > The following test no longer fails with log recovery errors when KASAN
> > > is enabled due to the brd driver not handling unaligned memory buffers:
> > > 
> > > # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/kmem.c            | 61 +++++++++++++++++++++++++++++-----------
> > >  fs/xfs/kmem.h            |  1 +
> > >  fs/xfs/xfs_buf.c         |  4 +--
> > >  fs/xfs/xfs_log.c         |  2 +-
> > >  fs/xfs/xfs_log_recover.c |  2 +-
> > >  fs/xfs/xfs_trace.h       |  1 +
> > >  6 files changed, 50 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index edcf393c8fd9..ec693c0fdcff 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > ...
> > > @@ -62,6 +56,39 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
> > >  	return ptr;
> > >  }
> > >  
> > > +/*
> > > + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
> > > + * returned. vmalloc always returns an aligned region.
> > > + */
> > > +void *
> > > +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
> > > +{
> > > +	void	*ptr;
> > > +
> > > +	trace_kmem_alloc_io(size, flags, _RET_IP_);
> > > +
> > > +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> > > +	if (ptr) {
> > > +		if (!((long)ptr & 511))
> > > +			return ptr;
> > > +		kfree(ptr);
> > > +	}
> > > +	return __kmem_vmalloc(size, flags);
> > > +}
> > 
> > Even though it is unfortunate, this seems like a quite reasonable and
> > isolated temporary solution to the problem to me. The one concern I have
> > is whether, and how much, this could affect performance under certain
> > circumstances.
> 
> Can't measure a difference on 4k block size filesystems. It's only
> used for log recovery and then for allocating AGF/AGI buffers on
> 512 byte sector devices. Anything using 4k sectors only hits it
> during mount. So for default configs with memory poisoning/KASAN
> enabled, the massive overhead of poisoning/tracking makes this
> disappear in the noise.
> 
> For 1k block size filesystems, it gets hit much harder, but
> there's no noticeable increase in runtime of xfstests vs 4k block
> size with KASAN enabled. The big increase in overhead comes from
> enabling KASAN (takes 3x longer than without), not doing one extra
> allocation/free pair.
> 
> > I realize that these callsites are isolated in the common
> > scenario. Less common scenarios like sub-page block sizes (whether due
> > to explicit mkfs time format or default configurations on larger page
> > size systems) can fall into this path much more frequently, however.
> 
> *nod*
> 
> > Since this implies some kind of vm debug option is enabled, performance
> > itself isn't critical when this solution is active. But how bad is it in
> > those cases where we might depend on this more heavily? Have you
> > confirmed that the end configuration is still "usable," at least?
> 
> No noticeable difference, most definitely still usable.
> 

Ok, thanks.

> > I ask because the repeated alloc/free behavior can easily be avoided via
> > something like an mp flag (which may require a tweak to the
> 
> What's an "mp flag"?
> 

A bool or something similar added to xfs_mount to control whether we
make further slab allocation attempts for I/O buffers on this mount. I
was just throwing this out there in case the performance hit happened
to be bad (on top of whatever vm debug option is enabled) in those
configurations where slab based buffer allocations are more common. If
the performance hit is negligible in practice, then I'm not worried
about it.
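
For reference, a rough sketch of the kind of thing I had in mind (the
xfs_mount field and the interface tweak are hypothetical, purely for
illustration):

void *
kmem_alloc_io(struct xfs_mount *mp, size_t size, xfs_km_flags_t flags)
{
	void	*ptr;

	trace_kmem_alloc_io(size, flags, _RET_IP_);

	if (!mp->m_unaligned_kmem) {
		ptr = kmem_alloc(size, flags | KM_MAYFAIL);
		if (ptr) {
			if (!((long)ptr & 511))
				return ptr;
			/* heap alignment is broken, stop trying kmem_alloc */
			mp->m_unaligned_kmem = true;
			kfree(ptr);
		}
	}
	return __kmem_vmalloc(size, flags);
}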

> > kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
> > path once we see one unaligned allocation. That assumes this behavior is
> > tied to functionality that isn't dynamically configured at runtime, of
> > course.
> 
> vmalloc() has a _lot_ more overhead than kmalloc (especially when
> vmalloc has to do multiple allocations itself to set up page table
> entries) so there is still an overall gain to be had by trying
> kmalloc even if 50% of allocations are unaligned.
> 

I had the impression that this unaligned allocation behavior is tied to
debug options that aren't switched on/off dynamically at runtime. Hence,
the unaligned allocation behavior is persistent for a particular mount,
and repeated slab attempts are pointless once we've seen at least one
such result. Is that not the case?

Again, I don't think performance is a huge deal so long as testing shows
that an fs is still usable with XFS running this kind of allocation
pattern. Thinking about it further, aren't we essentially bypassing
these tools for the affected allocations if they don't offer similar
functionality for vmalloc? It might be worth 1.) noting that as a
consequence of this change in the commit log and 2.) emitting a
one-shot warning when we first hit this problem, so somebody using one
of these tools realizes that enabling it changes XFS's allocation
behavior. For example:

XFS ...: WARNING: Unaligned I/O memory allocation. VM debug enabled?
Disabling slab allocations for I/O.

... or alternatively just add a WARN_ON_ONCE() or something with a
similar comment in the code.
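
E.g., a sketch against the hunk quoted above:

 	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
 	if (ptr) {
 		if (!((long)ptr & 511))
 			return ptr;
+		/*
+		 * Some VM debug option has broken kmalloc alignment; warn
+		 * once so whoever enabled it knows IO allocations are now
+		 * falling back to vmalloc.
+		 */
+		WARN_ON_ONCE(1);
 		kfree(ptr);
 	}
 	return __kmem_vmalloc(size, flags);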

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx


