Re: [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option

Jan Kara <jack@xxxxxxx> · Wed, 30 Aug 2023 17:30:35 +0200

Hello!

On Thu 24-08-23 17:26:03, Zhang Yi wrote:
> The delayed allocation method allocates blocks during page writeback in
> ext4_writepages(), which cannot handle block allocation failures due to
> e.g. ENOSPC if acquires more extent blocks. In order to deal with this,
> commit '79f0be8d2e6e ("ext4: Switch to non delalloc mode when we are low
> on free blocks count.")' introduce ext4_nonda_switch() to convert to no
> delalloc mode if the space if the free blocks is less than 150% of dirty
> blocks or the watermark.

Well, that functionality is there mainly so that we can really allocate all
blocks available in the filesystem. But you are right that since we are not
reserving metadata blocks explicitely anymore, it is questionable whether
we really still need this.

> In the meantime, '27dd43854227 ("ext4:
> introduce reserved space")' reserve some of the file system space (2% or
> 4096 clusters, whichever is smaller). Both of those to solutions can
> make sure that space is not exhausted when mapping delalloc blocks in
> most cases, but cannot guarantee work in all cases, which could lead to
> infinite loop or data loss (please see patch 14 for details).

OK, I agree that in principle there could be problems due to percpu
counters inaccuracy etc. but were you able to reproduce the problem under
some at least somewhat realistic conditions? We were discussing making
free space percpu counters switch to exact counting in case we are running
tight on space to avoid these problems but it never proved to be a problem
in practice so we never bothered to actually implement it.

> This patch set wants to reserve metadata space more accurate for
> delalloc mount option. The metadata blocks reservation is very tricky
> and is also related to the continuity of physical blocks, an effective
> way is to reserve as the worst case, which means that every data block
> is discontinuous and one data block costs an extent entry. Reserve
> metadata space as the worst case can make sure enough blocks reserved
> during data writeback, the unused reservaion space can be released after
> mapping data blocks.

Well, as you say, there is a problem with the worst case estimates - either
you *heavily* overestimate the number of needed metadata blocks or the code
to estimate the number of needed metadata blocks is really complex. We used
to have estimates of needed metadata and we ditched that code (in favor of
reserved clusters) exactly because it was complex and suffered from
cornercases that were hard to fix. I haven't quite digested the other
patches in this series to judge which case is it but it seems to lean on
the "complex" side :).

So I'm somewhat skeptical this complexity is really needed but I can be
convinced :).

> After doing this, add a worker to submit delayed
> allocations to prevent excessive reservations. Finally, we could
> completely drop the policy of switching back to non-delayed allocation.

BTW the worker there in patch 15 seems really pointless. If you do:
queue_work(), flush_work() then you could just directly do the work inline
and get as a bonus more efficiency and proper lockdep tracking of
dependencies. But that's just a minor issue I have noticed.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR