Re: [PATCH] btrfs: handle shrink_delalloc pages calculation differently

David Sterba <dsterba@xxxxxxx> · Tue, 22 Jun 2021 13:25:50 +0200

On Tue, Jun 22, 2021 at 01:16:04PM +0200, David Sterba wrote:
> On Tue, Jun 01, 2021 at 03:45:08PM -0400, Josef Bacik wrote:
> > We have been hitting some early ENOSPC issues in production with more
> > recent kernels, and I tracked it down to us simply not flushing delalloc
> > as aggressively as we should be.  With tracing I was seeing us failing
> > all tickets with all of the block rsvs at or around 0, with very little
> > pinned space, but still around 120MiB of outstanding bytes_may_use.
> > Upon further investigation I saw that we were flushing around 14 pages
> > per shrink call for delalloc, despite having around 2GiB of delalloc
> > outstanding.
> > 
> > Consider the example of an 8-way machine, with all CPUs trying to
> > create a file in parallel, which at the time of this commit requires 5
> > items each.  Assuming a 16KiB leaf size, we have 10MiB of total metadata
> > reclaim size waiting on reservations.  Now assume we have 128MiB of
> > delalloc outstanding.  With our current math we would set items to 20,
> > and then set to_reclaim to 20 * 256KiB, or 5MiB.
> > 
> > Assuming that we went through this loop all 3 times, for both
> > FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
> > twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
> > leave a fair bit of delalloc reservations still hanging around by the
> > time we go to ENOSPC out all the remaining tickets.
> > 
> > Fix this in two ways.  First, change the calculations to be a fraction
> > of the total delalloc bytes on the system.  Prior to my change we were
> > calculating based on dirty inodes, so our math made more sense; now
> > it's just completely unrelated to what we're actually doing.
> > 
> > Second, add a FLUSH_DELALLOC_FULL state that we hold off on until
> > we've gone through the flush states at least once.  This will empty
> > the system of all delalloc so we're sure to be truly out of space when
> > we start failing tickets.
> > 
> > I'm tagging stable 5.10 and forward, because this is where we started
> > using the page stuff heavily again.  This affects earlier kernel
> > versions as well, but would be a pain to backport to them as the
> > flushing mechanisms aren't the same.
> > 
> > CC: stable@xxxxxxxxxxxxxxx # 5.10
> > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> 
> As this is going to be resent, I'll remove it from misc-next for now.
> Updated version can go in as a fix after rc1.

OK, so that does not work: the patchset "[PATCH 0/4][v2] btrfs: commit
the transaction unconditionally for ensopc"
https://lore.kernel.org/linux-btrfs/cover.1623421213.git.josef@xxxxxxxxxxxxxx/
touches the same defines, and the conflict can't be trivially resolved.
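
For reference, the arithmetic in the quoted changelog can be checked with
a small userspace sketch. This is not the kernel code: the helper names
are hypothetical, and the 256KiB per-item figure, the 3 loop iterations,
the 2 flush states and the 2 full passes are taken from the changelog
as-is rather than derived from the source tree.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Per-item reservation quoted in the changelog for a 16KiB leaf size. */
#define BYTES_PER_ITEM (256ULL * KIB)

/* Metadata reclaim size waiting on reservations (hypothetical helper). */
static uint64_t waiting_reservations(unsigned int cpus,
				     unsigned int items_per_create)
{
	return (uint64_t)cpus * items_per_create * BYTES_PER_ITEM;
}

/* Bytes targeted by a single shrink call for a given item count. */
static uint64_t reclaim_per_call(unsigned int items)
{
	return (uint64_t)items * BYTES_PER_ITEM;
}

/* Upper bound on bytes flushed before the tickets are failed. */
static uint64_t total_flushed(uint64_t per_call, unsigned int loops,
			      unsigned int states, unsigned int passes)
{
	return per_call * loops * states * passes;
}

/* Reproduce the numbers from the changelog example. */
static void check_changelog_example(void)
{
	uint64_t per_call = reclaim_per_call(20);

	assert(waiting_reservations(8, 5) == 10 * MIB);
	assert(per_call == 5 * MIB);
	/* 3 loops, FLUSH_DELALLOC + FLUSH_DELALLOC_WAIT, 2 full passes */
	assert(total_flushed(per_call, 3, 2, 2) == 60 * MIB);
}
```

With 128MiB of delalloc outstanding, that upper bound of 60MiB leaves
68MiB unflushed when the remaining tickets are failed, which is the gap
the changelog describes.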