On Fri, Aug 23, 2019 at 08:39:59AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 09:40:17AM -0400, Brian Foster wrote:
> > On Thu, Aug 22, 2019 at 07:14:52AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 21, 2019 at 09:35:33AM -0400, Brian Foster wrote:
> > > > On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> > > > kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
> > > > path once we see one unaligned allocation. That assumes this behavior is
> > > > tied to functionality that isn't dynamically configured at runtime, of
> > > > course.
> > >
> > > vmalloc() has a _lot_ more overhead than kmalloc (especially when
> > > vmalloc has to do multiple allocations itself to set up page table
> > > entries), so there is still an overall gain to be had by trying
> > > kmalloc even if 50% of allocations are unaligned.
> > >
> > I had the impression that this unaligned allocation behavior is tied to
> > enablement of debug options that otherwise aren't enabled/disabled
> > dynamically. Hence, the unaligned allocation behavior is persistent for
> > a particular mount and repeated attempts are pointless once we see at
> > least one such result. Is that not the case?
>
> The alignment for a given slab is likely to be persistent, but it
> will be different for different sized slabs. e.g. I saw 128 byte offsets
> for 512 byte slabs, and 1024 byte offsets for 4k slabs. The 1024 byte
> offsets worked just fine (because they're a multiple of 512 bytes!) but
> the 128 byte ones didn't.
>

Interesting. I wasn't aware offsets could be large enough to maintain
alignment requirements.

> Hence it's not a black and white "everything is unaligned and
> unsupportable" situation, nor is the alignment necessarily an issue
> for the underlying driver. e.g. most scsi and nvme handle 8
> byte alignment of buffers, and if the buffer is not aligned they
> bounce it (detected via the same blk_rq_alignment() check I added)
> and can still do the IO anyway. So a large amount of the IO stack
> just doesn't care about user buffers being unaligned....
>
> > Again, I don't think performance is a huge deal so long as testing shows
> > that an fs is still usable with XFS running this kind of allocation
> > pattern.
>
> It's added 10 minutes to the runtime of a full auto run with KASAN
> enabled on pmem. To put that in context, the entire run took:
>
> real    461m36.831s
> user    44m31.779s
> sys     708m37.467s
>
> More than 7.5 hours to complete, so ten minutes here or there is
> noise.
>

Agreed. Thanks for the data.

> > In thinking further about it, aren't we essentially bypassing
> > these tools for associated allocations if they don't offer similar
> > functionality for vmalloc allocations?
>
> kasan uses guard pages around vmalloc allocations to detect out of
> bounds accesses. It still tracks the page allocations, etc., so we
> still get use-after-free tracking and the like. i.e. AFAICT we don't
> actually lose any debugging functionality by using vmalloc here.
>

Ok.

> > It might be worth 1.) noting that as a consequence of this change in
> > the commit log and 2.) having a one-shot warning somewhere when we
> > initially hit this problem so somebody actually using one of these
> > tools realizes that enabling it actually changes allocation behavior.
> > For example:
> >
> >   XFS ...: WARNING: Unaligned I/O memory allocation. VM debug enabled?
> >   Disabling slab allocations for I/O.
> >
> > ... or alternatively just add a WARN_ON_ONCE() or something with a
> > similar comment in the code.
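To be concrete, the kind of thing I had in mind is below. This is just
an illustrative sketch on my part (the function name, the hardcoded
512 byte alignment and the exact warning text are all made up here,
not taken from the actual patch):

	/*
	 * Try the slab allocator first; if the returned buffer isn't
	 * aligned for sector-sized I/O (e.g. because a VM debug option
	 * is padding objects), warn once and fall back to vmalloc(),
	 * which always returns page-aligned memory.
	 */
	#include <linux/kernel.h>
	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	static void *
	kmem_alloc_io_sketch(size_t size, gfp_t flags)
	{
		void	*ptr;

		ptr = kmalloc(size, flags);
		if (ptr) {
			if (IS_ALIGNED((unsigned long)ptr, 512))
				return ptr;
			WARN_ONCE(1,
	"XFS: unaligned IO allocation, VM debug enabled? Falling back to vmalloc.\n");
			kfree(ptr);
		}
		return vmalloc(size);
	}

The point being that the warning fires only once, so anybody running
with a debug kernel at least gets a breadcrumb explaining the changed
allocation pattern without spamming the log.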
>
> Well, the WARN_ON_ONCE is in xfs_bio_add_page() when it detects an
> invalid alignment. So we'll get this warning on production systems
> as well as debug/test systems. I think that's the important case to
> catch, because misalignment will result in silent data corruption...
>

That's in the subsequent patch that I thought was being dropped (or
more accurately, deferred until Christoph has a chance to try and get
it fixed in the block layer)...? A rough sketch of my understanding of
that check is at the bottom of this mail, for reference.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
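P.S. For anyone following the thread, the bio-level alignment check
being referred to above is (as I understand it) roughly the following
shape. Again, this is only a sketch with made up naming and placement
on my part, not the actual deferred patch:

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/mm.h>

	/*
	 * Verify a buffer satisfies the queue's DMA alignment before
	 * adding it to the bio, and warn if it doesn't, because a
	 * driver that can't bounce the buffer may otherwise corrupt
	 * data silently.
	 */
	static int
	xfs_bio_add_page_sketch(struct bio *bio, struct page *page,
			unsigned int len, unsigned int off)
	{
		struct request_queue	*q = bio->bi_disk->queue;
		unsigned long		addr;

		addr = (unsigned long)page_address(page) + off;

		/* same test the block layer applies to kernel buffers */
		WARN_ON_ONCE(!blk_rq_aligned(q, addr, len));

		return bio_add_page(bio, page, len, off);
	}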