On 8/23/21 5:52 AM, Eric Whitney wrote:
> * Joseph Qi <joseph.qi@xxxxxxxxxxxxxxxxx>:
>>
>> On 8/22/21 9:06 PM, Joseph Qi wrote:
>>>
>>> On 8/21/21 12:45 AM, Eric Whitney wrote:
>>>> * Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx>:
>>>>> When ext4_es_insert_delayed_block() returns an error, e.g., ENOMEM,
>>>>> previously reserved space is not released in the error handling path,
>>>>> in which case @s_dirtyclusters_counter is left over. Since this delayed
>>>>> extent fails to be inserted into the extent status tree, when inode is
>>>>> written back, the extra @s_dirtyclusters_counter won't be subtracted and
>>>>> remains there forever.
>>>>>
>>>>> This can lead to /sys/fs/ext4/<dev>/delayed_allocation_blocks remaining
>>>>> non-zero even when syncfs is executed on the filesystem.
>>>>>
>>>>
>>>> Hi:
>>>>
>>>> I think the fix below looks fine. However, this comment doesn't look right
>>>> to me. Are you really seeing delayed_allocation_blocks values that remain
>>>> incorrectly elevated across last closes (or across file system unmounts and
>>>> remounts)? s_dirtyclusters_counter isn't written out to stable storage -
>>>> it's an in-memory only variable that's created when a file is first opened
>>>> and destroyed on last close.
>>>>
>>>
>>> Actually, we've encountered a real case in our production environment,
>>> which had about 20G of space lost (df - du = ~20G).
>>> After some investigation, we confirmed that it was caused by a leaked
>>> s_dirtyclusters_counter (~5M), and even when we run sync manually, it
>>> remains. Since there were no error messages, we checked all the logic
>>> around s_dirtyclusters_counter and found this. We can also manually
>>> inject an error and reproduce the leaked s_dirtyclusters_counter.
>>>
>
> Sure - as I noted, the fix looks good - I agree that you could see inaccurate
> s_dirtyclusters_counter (and i_reserved_data_blocks) values. This is a good
> catch and a good fix. It's the comment I find misleading / inaccurate, and
> I'd like to see that improved for the sake of developers reading commit
> histories in the future.
>
> Also, Gao Xiang's idea of checking i_reserved_data_blocks in the inode evict
> path sounds good to me - I'd considered doing that in the past but never
> actually did it.
>
>>
>> BTW, it's a runtime loss, not an on-disk one.
>> If we umount and then mount it again, it becomes normal. But the
>> application also has to be restarted...
>
> And this is where the comment could use a little help. "when inode is
> written back, the extra @s_dirtyclusters_counter won't be subtracted and
> remains there forever" suggests to me that s_dirtyclusters_counter is
> being persisted on stable storage. But as you note, simply umounting and
> remounting the filesystem clears up the problem. (And in my rush to get
> my feedback out earlier I incorrectly stated that s_dirtyclusters_counter
> would get created and destroyed on first open and last close - that's
> i_reserved_data_blocks, of course.)
>
> So, in order to speed things along, please allow me to suggest some edits
> for the commit comment:
>
> When ext4_insert_delayed_block() receives and recovers from an error from
> ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
> space it has reserved for that block insertion as it should. One effect
> of this bug is that s_dirtyclusters_counter is not decremented and remains
> incorrectly elevated until the file system has been unmounted. This can
> result in premature ENOSPC returns and apparent loss of free space.
>
> Another effect of this bug is that
> /sys/fs/ext4/<dev>/delayed_allocation_blocks can remain non-zero even
> after syncfs has been executed on the filesystem.
>
> Does that sound right?
>

Yes, looks better. Thanks for your comments.
We'll update the commit log in v2.

Thanks,
Joseph
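
[For readers of the archive: the fix being discussed amounts to releasing
the block reservation when the extent status insertion fails. Below is a
minimal sketch in the shape of the ext4_insert_delayed_block() error path
in fs/ext4/inode.c; the variable names reserved and allocated follow that
function, but the hunk is illustrative rather than the literal patch.]

	ret = ext4_es_insert_delayed_block(inode, lblk, allocated);
	/*
	 * Insertion into the extent status tree failed (e.g. ENOMEM):
	 * give back the cluster reserved earlier in this function,
	 * otherwise s_dirtyclusters_counter and i_reserved_data_blocks
	 * stay elevated until unmount.
	 */
	if (ret && reserved)
		ext4_da_release_space(inode, 1);

errout:
	return ret;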
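
[Similarly, a sketch of the i_reserved_data_blocks check Gao Xiang
suggested; whether it lands in the evict path or in the inode destruction
path, and the exact message text, are choices left to the patch author.]

	/*
	 * Hypothetical check in the inode teardown path: any leftover
	 * reservation at this point means space was leaked somewhere.
	 */
	if (EXT4_I(inode)->i_reserved_data_blocks)
		ext4_msg(inode->i_sb, KERN_ERR,
			 "Inode %lu (%p): i_reserved_data_blocks (%u) not cleared!",
			 inode->i_ino, EXT4_I(inode),
			 EXT4_I(inode)->i_reserved_data_blocks);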