Hello! On Thu 24-08-23 17:26:03, Zhang Yi wrote: > The delayed allocation method allocates blocks during page writeback in > ext4_writepages(), which cannot handle block allocation failures due to > e.g. ENOSPC if acquires more extent blocks. In order to deal with this, > commit '79f0be8d2e6e ("ext4: Switch to non delalloc mode when we are low > on free blocks count.")' introduce ext4_nonda_switch() to convert to no > delalloc mode if the space if the free blocks is less than 150% of dirty > blocks or the watermark. Well, that functionality is there mainly so that we can really allocate all blocks available in the filesystem. But you are right that since we are not reserving metadata blocks explicitely anymore, it is questionable whether we really still need this. > In the meantime, '27dd43854227 ("ext4: > introduce reserved space")' reserve some of the file system space (2% or > 4096 clusters, whichever is smaller). Both of those to solutions can > make sure that space is not exhausted when mapping delalloc blocks in > most cases, but cannot guarantee work in all cases, which could lead to > infinite loop or data loss (please see patch 14 for details). OK, I agree that in principle there could be problems due to percpu counters inaccuracy etc. but were you able to reproduce the problem under some at least somewhat realistic conditions? We were discussing making free space percpu counters switch to exact counting in case we are running tight on space to avoid these problems but it never proved to be a problem in practice so we never bothered to actually implement it. > This patch set wants to reserve metadata space more accurate for > delalloc mount option. The metadata blocks reservation is very tricky > and is also related to the continuity of physical blocks, an effective > way is to reserve as the worst case, which means that every data block > is discontinuous and one data block costs an extent entry. Reserve > metadata space as the worst case can make sure enough blocks reserved > during data writeback, the unused reservaion space can be released after > mapping data blocks. Well, as you say, there is a problem with the worst case estimates - either you *heavily* overestimate the number of needed metadata blocks or the code to estimate the number of needed metadata blocks is really complex. We used to have estimates of needed metadata and we ditched that code (in favor of reserved clusters) exactly because it was complex and suffered from cornercases that were hard to fix. I haven't quite digested the other patches in this series to judge which case is it but it seems to lean on the "complex" side :). So I'm somewhat skeptical this complexity is really needed but I can be convinced :). > After doing this, add a worker to submit delayed > allocations to prevent excessive reservations. Finally, we could > completely drop the policy of switching back to non-delayed allocation. BTW the worker there in patch 15 seems really pointless. If you do: queue_work(), flush_work() then you could just directly do the work inline and get as a bonus more efficiency and proper lockdep tracking of dependencies. But that's just a minor issue I have noticed. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR