On Fri, Sep 13, 2019 at 10:58:02AM -0400, Brian Foster wrote:
> On Fri, Sep 13, 2019 at 08:35:19AM +1000, Dave Chinner wrote:
> > On Thu, Sep 12, 2019 at 10:32:22AM -0400, Brian Foster wrote:
> > > The bmap block allocation code issues a sequence of retries to
> > > perform an optimal allocation, gradually loosening constraints as
> > > allocations fail. For example, the first attempt might begin at a
> > > particular bno, with maxlen == minlen and alignment incorporated. As
> > > allocations fail, the parameters fall back to different modes, drop
> > > alignment requirements and reduce the minlen and total block
> > > requirements.
> > >
> > > For large extent allocations with an args.total value that exceeds
> > > the allocation length (i.e., non-delalloc), the total value tends to
> > > dominate despite these fallbacks. For example, an aligned extent
> > > allocation request of tens to hundreds of MB that cannot be
> > > satisfied from a particular AG will not succeed after dropping
> > > alignment or minlen because xfs_alloc_space_available() never
> > > selects an AG that can't satisfy args.total. The retry sequence
> > > eventually reduces total and ultimately succeeds if a minlen extent
> > > is available somewhere, but the first several retries are
> > > effectively pointless in this scenario.
> > >
> > > Beyond simply being inefficient, another side effect of this
> > > behavior is that we drop alignment requirements too aggressively.
> > > Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
> > > unit:
> > >
> > > # xfs_io -c "falloc 0 1g" /mnt/file
> > > # <xfstests>/src/t_stripealign /mnt/file 32
> > > /mnt/file: Start block 347176 not multiple of sunit 32
> >
> > Ok, so what Carlos and I found last night was an issue with the
> > agresv code leading to the maximum free extent calculated
> > by xfs_alloc_longest_free_extent() being longer than the largest
> > allowable extent allocation (mp->m_ag_max_usable), resulting in the
> > situation where blen > args->maxlen, and so in the case of initial
> > allocation here, we never run this:
> >
> > 	/*
> > 	 * Adjust for alignment
> > 	 */
> > 	if (blen > args.alignment && blen <= args.maxlen)
> > 		args.minlen = blen - args.alignment;
> > 	args.minalignslop = 0;
> >
> > ....
> >
> > > As a step towards addressing this problem, insert a new retry in the
> > > bmap allocation sequence to drop minlen (from maxlen) before tossing
> > > alignment. This should still result in as large an extent as
> > > possible, as the block allocator prioritizes extent size in all but
> > > exact allocation modes. By itself, this does not change the behavior
> > > of the command above because the preallocation code still specifies
> > > total based on maxlen. Instead, this facilitates preservation of
> > > alignment once extra reservation is separated from the extent length
> > > portion of the total block requirement.
> >
> > AFAICT this is not necessary. The prototype patch I wrote last
> > night while working through this with Carlos is attached below. I
> > updated it with a variant of your patch 2 to demonstrate that it does
> > actually solve the problem of full AG allocation failing to be
> > aligned.
>
> I agree that this addresses the reported issue, but I can reproduce
> other corner cases affected by the original patch that aren't affected
> by this one.
> For example, if the allocation request happens to be
> slightly less than blen but not enough to allow for alignment, minlen
> isn't dropped and we can run through the same allocation retry sequence
> that kills off alignment before success.

But isn't that just another variation of the initial conditions
(minlen/maxlen) not being set up correctly for alignment when the AG
is empty? i.e. take the above condition and change it like this:

	/*
	 * Adjust for alignment
	 */
-	if (blen > args.alignment && blen <= args.maxlen)
+	if (blen > args.alignment && blen <= args.maxlen + args.alignment)
		args.minlen = blen - args.alignment;
	args.minalignslop = 0;

and now we cover all the cases when blen covers an aligned maxlen
allocation...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx