Re: [PATCH] ext4: fix reserved space counter leakage

Eric Whitney <enwlinux@xxxxxxxxx> · Sun, 22 Aug 2021 17:52:14 -0400

* Joseph Qi <joseph.qi@xxxxxxxxxxxxxxxxx>:
> 
> 
> On 8/22/21 9:06 PM, Joseph Qi wrote:
> > 
> > 
> > On 8/21/21 12:45 AM, Eric Whitney wrote:
> >> * Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx>:
> >>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
> >>> previously reserved space is not released as the error handling,
> >>> in which case @s_dirtyclusters_counter is left over. Since this delayed
> >>> extent failes to be inserted into extent status tree, when inode is
> >>> written back, the extra @s_dirtyclusters_counter won't be subtracted and
> >>> remains there forever.
> >>>
> >>> This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
> >>> non-zero even when syncfs is executed on the filesystem.
> >>>
> >>
> >> Hi:
> >>
> >> I think the fix below looks fine.  However, this comment doesn't look right
> >> to me.  Are you really seeing delayed_allocation_blocks values that remain
> >> incorrectly elevated across last closes (or across file system unmounts and
> >> remounts)?  s_dirtyclusters_counter isn't written out to stable storage -
> >> it's an in-memory only variable that's created when a file is first opened
> >> and destroyed on last close.
> >>
> > 
> > Actually we've encountered a real case in our production environment,
> > which has about 20G space lost (df - du = ~20G).
> > After some investigation, we've confirmed that it cause by leaked
> > s_dirtyclusters_counter (~5M), and even we do manually sync, it remains.
> > Since there is no error messages, we've checked all logic around
> > s_dirtyclusters_counter and found this. Also we can manually inject
> > error and reproduce the leaked s_dirtyclusters_counter.
> > 

Sure - as I noted, the fix looks good - I agree that you could see inaccurate
s_dirtyclusters_counter (and i_reserved_data_blocks) values.  This is a good
catch and a good fix.  It's the comment I find misleading / inaccurate, and
I'd like to see that improved for the sake of developers reading commit
histories in the future.

Also, Gao Xiang's idea of checking i_reserved_data_blocks in the inode evict
path sounds good to me - I'd considered doing that in the past but never
actually did it.

> 
> BTW, it's a runtime lost, but not about on-disk.
> If umount and then mount it again, it becomes normal. But
> application also should be restarted...

And this is where the comment could use a little help.  "when inode is
written back, the extra @s_dirtyclusters_counter won't be subtracted and
remains there forever" suggests to me that s_dirtyclusters_counter is
being persisted on stable storage.  But as you note, simply umounting and
remounting the filesystem clears up the problem.  (And in my rush to get
my feedback out earlier I incorrectly stated that s_dirtyclusters_counter
would get created and destroyed on first open and last close - that's
i_reserved_data_blocks, of course.)

So, in order to speed things along, please allow me to suggest some edits
for the commit comment:

When ext4_insert_delayed block receives and recovers from an error from
ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
space it has reserved for that block insertion as it should.  One effect
of this bug is that s_dirtyclusters_counter is not decremented and remains
incorrectly elevated until the file system has been unmounted.  This can
result in premature ENOSPC returns and apparent loss of free space.

Another effect of this bug is that /sys/fs/ext4/<dev>/delayed_allocation_blocks
can remain non-zero even after syncfs has been executed on the filesystem.

Does that sound right?

Regards,
Eric

> 
> Thanks,
> Joseph