On 10/15/21 5:57 AM, Theodore Ts'o wrote:
> On Fri, Oct 15, 2021 at 02:06:52AM +0800, Gao Xiang wrote:
>> On Thu, Oct 14, 2021 at 12:54:14PM +0000, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:
>>> Hi,
>>>
>>> I'm seeing these i_reserved_data_blocks not cleared! messages when using ext4
>>> with nodelalloc, message added in:
>>>
>>> commit 6fed83957f21eff11c8496e9f24253b03d2bc1dc
>>> Author: Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>> Date: Mon Aug 23 14:13:58 2021 +0800
>>>
>>> ext4: fix reserved space counter leakage
>>>
>>> I can quickly reproduce in 5.15.0-rc5-00041-g348949d9a444 by doing some
>>> filesystem I/O while toggling delalloc:
>>>
>>>
>>> while true; do mount -o remount,nodelalloc /; sleep 1; mount -o remount,delalloc /; sleep 1; done &
>>> git clone linux xxx; rm -rf xxx
>>
>> If I understand correctly, switching such option implies
>> sync inodes to write back exist delayed allocation blocks.
>
> Well, no. What it implies is that all writes after the remount into
> an unallocated portion of the file will be allocated at the time when
> the page is dirtied, instead of when the page is written back. It's
> possible for some pages to be written using delayed allocation, and
> some other pages in the legacy "allocate on page dirty" mechanism.
> This can happen when the file system is remounted; it can also happen
> when the file system starts getting close to 100% full. See the
> comment in ext4_nonda_switch:
>
> /*
> * switch to non delalloc mode if we are running low
> * on free block. The free block accounting via percpu
> * counters can get slightly wrong with percpu_counter_batch getting
> * accumulated on each CPU without updating global counters
> * Delalloc need an accurate free block accounting. So switch
> * to non delalloc when we are near to error range.
> */
>

So it seems possible that the s_dirtyclusters_counter and
i_reserved_data_blocks counters are no longer maintained once the
filesystem is remounted from 'delalloc' to 'nodelalloc', even while
writing back page cache that was delay-allocated when the filesystem
was still mounted as 'delalloc'. The counters can therefore be non-zero
when the inode is finally evicted and destroyed.

IMHO this inconsistency is problematic. For example, if the filesystem
is remounted from 'delalloc' to 'nodelalloc' and then runs for a while,
the s_dirtyclusters_counter/i_reserved_data_blocks counters become
inconsistent. If it is then remounted back to 'delalloc', it starts out
with counters that are already incorrect.

--
Thanks,
Jeffle
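
To make the suspected sequence concrete, here is a toy user-space model of it. This is a sketch of the hypothesis only, not ext4 code: the `toy_*` names and structs are made up for illustration, and the key assumption (the reservation is released at writeback only if the filesystem is mounted 'delalloc' at that moment) is exactly the unverified behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the superblock and inode state. */
struct toy_sb {
    bool delalloc;        /* current mount option */
    long dirty_clusters;  /* models s_dirtyclusters_counter */
};

struct toy_inode {
    long reserved_blocks; /* models i_reserved_data_blocks */
    long delayed_pages;   /* dirty pages still waiting for allocation */
};

/* Dirty one page in an unallocated part of the file. */
static void toy_write(struct toy_sb *sb, struct toy_inode *inode)
{
    if (sb->delalloc) {
        /* delalloc: reserve now, allocate at writeback time */
        inode->reserved_blocks++;
        inode->delayed_pages++;
        sb->dirty_clusters++;
    }
    /* nodelalloc: allocated when dirtied, nothing is reserved */
}

/*
 * Write back every delayed page.
 * ASSUMPTION (the hypothesis under discussion): the reservation is
 * only released if the mount is 'delalloc' when writeback runs.
 */
static void toy_writeback(struct toy_sb *sb, struct toy_inode *inode)
{
    if (sb->delalloc) {
        inode->reserved_blocks -= inode->delayed_pages;
        sb->dirty_clusters -= inode->delayed_pages;
    }
    inode->delayed_pages = 0;
    assert(inode->reserved_blocks >= 0);
}

/* Evict the inode; returns the number of leaked reserved blocks. */
static long toy_evict(const struct toy_inode *inode)
{
    if (inode->reserved_blocks)
        printf("toy: i_reserved_data_blocks (%ld) not cleared!\n",
               inode->reserved_blocks);
    return inode->reserved_blocks;
}

/* Three delalloc writes, an optional remount, then writeback and evict. */
static long toy_scenario(bool remount_before_writeback)
{
    struct toy_sb sb = { .delalloc = true, .dirty_clusters = 0 };
    struct toy_inode inode = { .reserved_blocks = 0, .delayed_pages = 0 };

    for (int i = 0; i < 3; i++)
        toy_write(&sb, &inode);
    if (remount_before_writeback)
        sb.delalloc = false; /* mount -o remount,nodelalloc */
    toy_writeback(&sb, &inode);
    return toy_evict(&inode);
}
```

In this model, `toy_scenario(false)` leaves nothing reserved at eviction, while `toy_scenario(true)` leaves all three reservations behind, which is the kind of leak the warning reports.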