On 8/22/21 9:06 PM, Joseph Qi wrote:
>
>
> On 8/21/21 12:45 AM, Eric Whitney wrote:
>> * Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx>:
>>> When ext4_es_insert_delayed_block() returns an error, e.g., ENOMEM,
>>> the previously reserved space is not released in the error handling
>>> path, in which case @s_dirtyclusters_counter is left over. Since this
>>> delayed extent fails to be inserted into the extent status tree, when
>>> the inode is written back, the extra @s_dirtyclusters_counter won't be
>>> subtracted and remains there forever.
>>>
>>> This can lead to /sys/fs/ext4/<dev>/delayed_allocation_blocks remaining
>>> non-zero even when syncfs is executed on the filesystem.
>>>
>>
>> Hi:
>>
>> I think the fix below looks fine. However, this comment doesn't look right
>> to me. Are you really seeing delayed_allocation_blocks values that remain
>> incorrectly elevated across last closes (or across file system unmounts and
>> remounts)? s_dirtyclusters_counter isn't written out to stable storage -
>> it's an in-memory only variable that's created when a file is first opened
>> and destroyed on last close.
>>
>
> Actually, we've encountered a real case in our production environment,
> where about 20G of space was lost (df - du = ~20G).
> After some investigation, we confirmed that it was caused by a leaked
> s_dirtyclusters_counter (~5M), and even if we manually sync, it remains.
> Since there were no error messages, we checked all the logic around
> s_dirtyclusters_counter and found this. Also, we can manually inject
> an error and reproduce the leaked s_dirtyclusters_counter.
> BTW, it's a runtime loss, not an on-disk one.

If we umount and then mount it again, it becomes normal. But the
application also has to be restarted...

Thanks,
Joseph