On 2020/10/5 9:05 PM, Josef Bacik wrote:
> On 9/30/20 8:01 AM, Qu Wenruo wrote:
>> [BUG]
>> There are quite some bug reports of btrfs falling into an ENOSPC trap,
>> where btrfs can't even start a transaction to add new devices.
>>
>> [CAUSE]
>> Most of the reports involve multi-device profiles, like
>> RAID1/RAID10/RAID5/RAID6, where the involved disks have very unbalanced
>> sizes.
>>
>> It turns out that the overcommit calculation in btrfs_can_overcommit()
>> is just a factor-based calculation, which can't check whether the devices
>> can really fulfill the requirement of the desired profile.
>>
>> This makes btrfs_can_overcommit() always over-confident about the
>> usable space, and when we can't allocate any new metadata chunk but
>> still allow new metadata operations, we fall into the ENOSPC trap and
>> have no way to exit it.
>>
>> [WORKAROUND]
>> The proper fix needs a device-layout-aware available space calculation,
>> one that mirrors what the chunk allocator can actually do.
>>
>> Such a patchset was submitted to the mailing list before, but the extra
>> failure mode is tricky to handle for chunk allocation, thus that
>> patchset needs more time to mature.
>>
>> Meanwhile, to prevent such problems from reaching more users, work around
>> the problem by:
>> - Halving the reported over-commit available space
>>   So that we won't always be that over-confident.
>>   But this won't really help if we have extremely unbalanced disk sizes.
>>
>> - Not over-committing if the space info is already full
>>   This may already be too late, but it is still better than doing nothing
>>   and believing the over-commit values.
>>
>
> I just had a thought, what if we simply cap the free_chunk_space to the
> min of the free space of all the devices.

Sure, reducing the number will never be a problem.

> Simply walk through all the
> devices on mount, and we do the initial set of whatever the smallest one
> is.  The rest of the math would work out fine, and the rest of the
> modifications would work fine.

But I still prefer to do the minimal device size update as part of my
per-profile available space calculation, so we never get a chance to
over-estimate.

> The only "tricky" part would be when we
> do a shrink or grow, we'd have to re-calculate the sizes for everybody,
> but that's not a big deal.  Thanks,

As long as we don't over-estimate, everything will be fine; the only
question is how much extra metadata flushing is needed (and thus how
much extra overhead).

The rest is just a spectrum between "I don't really like over-commit at
all, so let's make it really hard to do any over-commit" and "I'm a super
smart guy and here is the best algorithm to estimate how much space we
really have for over-commit".

Thanks,
Qu

>
> Josef
>
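To make the over-estimation concrete, here is a tiny stand-alone sketch
(user-space C, not kernel code; the 1 TiB + 10 GiB layout and all names
below are made up for illustration) comparing the factor-based estimate,
the halved workaround value, and Josef's min-device cap for a RAID1
metadata profile:

	/*
	 * Illustrative sketch only: shows why a factor-based over-commit
	 * estimate over-shoots on unbalanced devices, and why capping the
	 * budget at the smallest device's free space is much closer to
	 * what the chunk allocator can really do.
	 */
	#include <stdio.h>
	#include <stdint.h>

	#define NUM_DEVICES 2
	#define GiB (1024ULL * 1024 * 1024)

	int main(void)
	{
		/* Hypothetical unbalanced array: 1 TiB + 10 GiB free. */
		uint64_t dev_free[NUM_DEVICES] = { 1024 * GiB, 10 * GiB };
		uint64_t total_free = 0, min_free = UINT64_MAX;
		int raid1_factor = 2;	/* RAID1 stores two copies */

		for (int i = 0; i < NUM_DEVICES; i++) {
			total_free += dev_free[i];
			if (dev_free[i] < min_free)
				min_free = dev_free[i];
		}

		/* Factor-based estimate: total free space / profile factor. */
		uint64_t factor_based = total_free / raid1_factor;

		/*
		 * A RAID1 chunk needs space on two different devices, so the
		 * smallest device is the real limit.
		 */
		printf("factor-based estimate : %llu GiB\n",
		       (unsigned long long)(factor_based / GiB));
		printf("halved estimate       : %llu GiB (workaround)\n",
		       (unsigned long long)(factor_based / 2 / GiB));
		printf("really allocatable    : %llu GiB (min device free)\n",
		       (unsigned long long)(min_free / GiB));
		return 0;
	}

With these numbers the factor-based estimate reports ~517 GiB and the
halved workaround still ~258 GiB, while only ~10 GiB of RAID1 chunks can
actually be allocated, which is why the min-device cap (or the per-profile
calculation) is the direction being discussed here.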