On Thu, Jan 06, 2022 at 12:52:28AM -0800, Krister Johansen wrote:
> On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> > On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > > However, linux is using 64-bit block pointers in the inode now and
> > > the XFS_ALLOCTYPE_START_BNO case in xfs_alloc_vextent() seems to
> > > try to ensure that it never considers an AG that's less than the
> > > agno for the fsbno passed in via args.
> >
> > Because otherwise allocations ABBA deadlock on AG locking.
>
> Sorry, what I'm really trying to ask is: are there still cases in XFS
> where using XFS_ALLOCTYPE_START_BNO can give you this kind of
> deadlock?

I'd say yes - I can think of several scenarios where we make multiple
allocations per transaction (directory and attribute code) and I think
that they don't actually use t_firstblock correctly to avoid AGF
locking issues. i.e. I think that tp->t_firstblock should actually be
tracking the highest locked AGF in the transaction, not the first AGF
we locked...
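To illustrate what I mean (untested sketch with a made-up function
name, not the code as it stands today): if tp->t_firstblock were
tracking the highest locked AGF, the ordering check that keeps
allocations deadlock safe boils down to something like:

/*
 * Illustrative sketch only, not the current allocator code.  Assuming
 * tp->t_firstblock records the highest AGF this transaction has
 * locked, an allocation must never go back and lock an AGF in a lower
 * numbered AG, because taking AGF locks out of ascending order is
 * what opens up the ABBA deadlock.
 */
static bool
xfs_alloc_ag_lock_order_ok(
        struct xfs_mount        *mp,
        struct xfs_trans        *tp,
        xfs_agnumber_t          agno)
{
        /* no AGF locked yet in this transaction - anything goes */
        if (tp->t_firstblock == NULLFSBLOCK)
                return true;

        /* only AGs at or above the highest locked AGF are safe */
        return agno >= XFS_FSB_TO_AGNO(mp, tp->t_firstblock);
}

i.e. an allocation in the transaction is only ever allowed to consider
AGs at or above the highest AGF we already hold, which is also why
XFS_ALLOCTYPE_START_BNO never walks backwards from the starting agno.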
> There's some deadlock avoidance in the START_BNO implementation, but
> there are plenty of places where XFS_ALLOCTYPE_NEAR_BNO is still used
> if t_firstblock is not NULLFSBLOCK.

Right, NEAR_BNO is used because we likely have physical locality
constraints (e.g. seek minimisation by placing related blocks as close
to each other as possible) and potentially other limits, such as
allocations that are for per-AG data and therefore must be placed
within the specified AG...

> I'm trying to work out if that code still exists because of concerns
> about deadlocks, or if it's an attempt to limit the number of AGs
> searched instead. (Or some other policy choice, even.)

All of the above, and more... :(

> > > be a reasonable way to address the WARN? Or does this open a box
> > > of problems that are obvious to the experienced, but just subtle
> > > enough to elude the unfamiliar?
> >
> > No, yes, and yes.
>
> Thank you for humoring my questions nonetheless.
>
> > > The xfs_db freesp report after the problem occurred. (N.B. it was
> > > a few hours before I was able to get to this machine to
> > > investigate)
> > >
> > > xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
> > >    from      to extents  blocks    pct
> > >       1       1      48      48   0.00
> > >       2       3     119     303   0.02
> > >       4       7      46     250   0.01
> > >       8      15      22     255   0.01
> > >      16      31      17     374   0.02
> > >      32      63      16     728   0.04
> > >      64     127       9     997   0.05
> > >     128     255     149   34271   1.83
> > >     256     511       7    2241   0.12
> > >     512    1023       4    2284   0.12
> > >    1024    2047       1    1280   0.07
> > >    2048    4095       1    3452   0.18
> > > 1048576 2097151       1 1825182  97.52
> > > total free extents 440
> > > total free blocks 1871665
> > > average free extent size 4253.78
> >
> > So 1,871,665 of 228,849,020 blocks free in the AG. That's 99.2%
> > full, so it's extremely likely you are hitting a full AG condition.
> >
> > /me goes and looks at xfs_iomap_write_direct()....
> >
> > .... and notices that it passes "0" as the total allocation block
> > count, which means it isn't reserving space in the AG for both the
> > data extent and the BMBT blocks...
> >
> > ... and several other xfs_bmapi_write() callers have the same
> > issue...
> >
> > Ok, let me spend a bit more time looking into this, but it looks
> > like the problem is at the xfs_bmapi_write() caller level, not deep
> > in the allocator itself.
>
> At least on 5.4 xfs_bmapi_write is still passing resblks instead of
> zero, which is computed in xfs_iomap_write_direct.

Yup, I missed commit da781e64b28c ("xfs: don't set bmapi total block
req where minleft is") back in 2019, where that behaviour was changed.
Instead, xfs_bmapi_write() now implicitly manages space for BMBT
blocks via args->minleft whilst still explicitly requiring the caller
to reserve those blocks at transaction allocation time. Bit of a mess,
really, because multi-allocation transactions are still required to
pass both the data blocks + the possible BMBT blocks that might be
needed to xfs_bmapi_write(). I suspect that for this case the implicit
args->minleft reservation is double accounted...

> Related to your comment about alloc_args.total, I did a bit of
> tracing of the xfs_alloc tracepoints on my system and found that
> total seems to be getting set in both cases, but that a) it's
> actually a larger value for directio; and b) in the buffered write
> case the code is requesting more blocks at one time which causes a
> larger allocation to occur. I'm not certain, but wondered if this
> could be causing us to select an AG with more space by luck.
>
> directio:
>
> dd-102229 [005] .... 4969662.383215: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 14240 minlen 4 maxlen 4 mod 0 prod 1 minleft 2 total 8 alignment 1 minalignslop 3 len 4 type THIS_BNO otype THIS_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

That one is correct - 4 extra blocks on top of the data extent...

> buffered write + fsync:
>
> dd-109921 [010] .... 4972814.844429: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 21280 minlen 16 maxlen 16 mod 0 prod 1 minleft 2 total 4 alignment 1 minalignslop 3 len 16 type THIS_BNO otype THIS_BNO wasdel 1 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

But that one is clearly wrong - we're asking for 16 blocks for the
data extent, yet telling the allocator the total block count of the
allocation is only 4 blocks. Even though we've reserved 16 blocks
during the initial delayed allocation, we've still got to have 16 + 4
blocks free in the AG for the allocation to succeed. That's one of the
bugs the commit I mentioned above fixed...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx