On Tue, Jun 01, 2021 at 03:45:08PM -0400, Josef Bacik wrote:
> We have been hitting some early ENOSPC issues in production with more
> recent kernels, and I tracked it down to us simply not flushing delalloc
> as aggressively as we should be.  With tracing I was seeing us failing
> all tickets with all of the block rsvs at or around 0, with very little
> pinned space, but still around 120MiB of outstanding bytes_may_use.
> Upon further investigation I saw that we were flushing around 14 pages
> per shrink call for delalloc, despite having around 2GiB of delalloc
> outstanding.
>
> Consider the example of an 8-way machine, with all CPUs trying to
> create a file in parallel, which at the time of this commit requires 5
> items to do.  Assuming a 16KiB leaf size, we have 10MiB of total
> metadata reclaim size waiting on reservations.  Now assume we have
> 128MiB of delalloc outstanding.  With our current math we would set
> items to 20, and then set to_reclaim to 20 * 256KiB, or 5MiB.
>
> Assuming that we went through this loop all 3 times, for both
> FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
> twice, we'd only flush 60MiB of the 128MiB of delalloc space.  This
> could leave a fair bit of delalloc reservations still hanging around by
> the time we go to ENOSPC out all the remaining tickets.
>
> Fix this in two ways.  First, change the calculations to be a fraction
> of the total delalloc bytes on the system.  Prior to my change we were
> calculating based on dirty inodes, so our math made more sense; now
> it's just completely unrelated to what we're actually doing.
>
> Second, add a FLUSH_DELALLOC_FULL state that we hold off until we've
> gone through the flush states at least once.  This will empty the
> system of all delalloc, so we're sure to be truly out of space when we
> start failing tickets.
>
> I'm tagging stable 5.10 and forward, because this is where we started
> using the page stuff heavily again.  This affects earlier kernel
> versions as well, but would be a pain to backport to them as the
> flushing mechanisms aren't the same.
>
> CC: stable@xxxxxxxxxxxxxxx # 5.10
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>

As this is going to be resent, I'll remove it from misc-next for now.
The updated version can go in as a fix after rc1.
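
To make the arithmetic in the quoted message concrete, here is a minimal
standalone sketch (userspace C, not the actual btrfs patch) using the
numbers from the example above.  The 1/8 fraction of total delalloc is
an assumption for illustration; the exact scaling in the final patch may
differ.

#include <stdio.h>
#include <stdint.h>

#define SZ_256K  (256ULL * 1024)
#define SZ_1M    (1024ULL * 1024)

int main(void)
{
	uint64_t delalloc_bytes = 128 * SZ_1M;  /* outstanding delalloc */
	uint64_t old_target, new_target;

	/* Old math: a fixed number of items, each costing 256KiB. */
	old_target = 20 * SZ_256K;              /* 5MiB per pass */

	/*
	 * New math (sketch): never aim for less than a fraction of the
	 * total delalloc on the system, so the target tracks what is
	 * actually outstanding rather than the dirty-inode count.
	 */
	new_target = old_target;
	if (new_target < delalloc_bytes / 8)
		new_target = delalloc_bytes / 8;

	printf("old per-pass target: %llu MiB\n",
	       (unsigned long long)(old_target / SZ_1M));
	printf("new per-pass target: %llu MiB\n",
	       (unsigned long long)(new_target / SZ_1M));

	/* 2 states x 3 loops x 2 full passes = 12 passes under old math: */
	printf("old math, 12 passes: %llu of %llu MiB flushed\n",
	       (unsigned long long)(12 * old_target / SZ_1M),
	       (unsigned long long)(delalloc_bytes / SZ_1M));
	return 0;
}

Run as written, this reproduces the 60MiB-of-128MiB shortfall from the
commit message, and shows how the fraction-based target closes the gap.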
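The second half of the fix is the new flush state.  Below is an
illustrative state-machine sketch (again standalone C, not the btrfs
code) of the ordering described above: FLUSH_DELALLOC_FULL sits after
FLUSH_DELALLOC_WAIT but is skipped until the other states have had one
full pass.  The state names mirror btrfs; the reduced state list and the
two-pass loop structure are assumptions for illustration.

#include <stdio.h>

enum flush_state {
	FLUSH_DELAYED_ITEMS,
	FLUSH_DELALLOC,
	FLUSH_DELALLOC_WAIT,
	FLUSH_DELALLOC_FULL,	/* new: empty all remaining delalloc */
	COMMIT_TRANS,
	NR_STATES,
};

static const char *const names[] = {
	"FLUSH_DELAYED_ITEMS", "FLUSH_DELALLOC", "FLUSH_DELALLOC_WAIT",
	"FLUSH_DELALLOC_FULL", "COMMIT_TRANS",
};

int main(void)
{
	int pass, state;

	for (pass = 0; pass < 2; pass++) {
		for (state = 0; state < NR_STATES; state++) {
			/* Hold the full flush off until the second pass. */
			if (state == FLUSH_DELALLOC_FULL && pass == 0)
				continue;
			printf("pass %d: %s\n", pass, names[state]);
		}
	}
	return 0;
}

The point of the ordering is that the cheaper states get a chance first;
only when they have all run and tickets are still unsatisfied does the
system pay for a full delalloc flush, so an eventual ENOSPC really means
out of space.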