On 2023/4/28 9:41, Ming Lei wrote:
On Thu, Apr 27, 2023 at 07:27:04PM +0800, Ming Lei wrote:
On Thu, Apr 27, 2023 at 07:19:35PM +0800, Baokun Li wrote:
On 2023/4/27 18:01, Ming Lei wrote:
On Thu, Apr 27, 2023 at 02:36:51PM +0800, Baokun Li wrote:
On 2023/4/27 12:50, Ming Lei wrote:
Hello Matthew,
On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
Hello Guys,
I got one report in which buffered write IO hangs in balance_dirty_pages() after one NVMe block device is unplugged physically; umount can't succeed afterwards.
That's a feature, not a bug ... the dd should continue indefinitely?
Can you explain what the feature is? And we don't see such an 'issue' or 'feature' on xfs.
The device is gone, so IMO it is reasonable for FS buffered write IO to fail. Actually dmesg has shown 'EXT4-fs (nvme0n1): Remounting filesystem read-only'. These things may confuse the user.
The reason for this difference is that ext4 and xfs handle errors differently.
ext4 remounts the filesystem read-only or even just continues, and vfs_write does not check for either of these states.
vfs_write may not find anything wrong, but an ext4 remount could see that the disk is gone, though that might happen during or after the remount.
xfs shuts down the filesystem, so it returns a failure from xfs_file_write_iter when it finds an error.
``` ext4
ksys_write
 vfs_write
  ext4_file_write_iter
   ext4_buffered_write_iter
    ext4_write_checks
     file_modified
      file_modified_flags
       __file_update_time
        inode_update_time
         generic_update_time
          __mark_inode_dirty
           ext4_dirty_inode           ---> 2. void function, errors are not propagated out
            __ext4_journal_start_sb
             ext4_journal_check_start ---> 1. error found, remount-ro
    generic_perform_write             ---> 3. no error sensed, continue
     balance_dirty_pages_ratelimited
      balance_dirty_pages_ratelimited_flags
       balance_dirty_pages
        // 4. sleep waiting for dirty pages to be freed
        __set_current_state(TASK_KILLABLE)
        io_schedule_timeout(pause);
```
``` xfs
ksys_write
 vfs_write
  xfs_file_write_iter
   if (xfs_is_shutdown(ip->i_mount))
           return -EIO;               ---> dd fail
```
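To make step 2 above concrete: ext4_dirty_inode() has a void return type, so whatever __ext4_journal_start_sb() reports is dropped. A simplified sketch of fs/ext4/inode.c (not the exact source) looks like:

```c
/* Simplified sketch of fs/ext4/inode.c:ext4_dirty_inode() */
void ext4_dirty_inode(struct inode *inode, int flags)
{
	handle_t *handle;

	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
	if (IS_ERR(handle))
		return;	/* the error (and the remount-ro it triggered) is dropped here */
	ext4_mark_inode_dirty(handle, inode);
	ext4_journal_stop(handle);
}
```

Because the ->dirty_inode() hook returns void, generic_perform_write() never learns about the failure and keeps dirtying pages until balance_dirty_pages() throttles it.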
Thanks for the info, which is really helpful for me to understand the problem.
balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
the dd process should succeed.
Yeah, dd can be killed, but it could be any application(s). :-)
Fortunately it won't cause trouble during reboot/power off, given that userspace will be killed at that point.
Thanks,
Ming
Don't worry about that; we always set the current thread to TASK_KILLABLE while waiting in balance_dirty_pages().
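For reference, the throttling loop in mm/page-writeback.c looks roughly like this (simplified sketch, details omitted), which is why only a fatal signal such as kill -9 gets the task out:

```c
/* Simplified sketch of the loop in mm/page-writeback.c:balance_dirty_pages() */
for (;;) {
	/* ... compute 'pause' from the dirty thresholds ... */

	__set_current_state(TASK_KILLABLE);
	io_schedule_timeout(pause);

	/* ... break out once dirty pages drop below the limits ... */

	/* a fatal signal (e.g. kill -9) also ends the wait */
	if (fatal_signal_pending(current))
		break;
}
```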
I have another concern: if 'dd' isn't killed, the dirty pages won't be cleaned, and this (potentially large) amount of memory becomes unusable; a typical scenario would be a USB HDD being unplugged.
thanks,
Ming
Yes, it is unreasonable to keep writing data through the previously opened fd after the file system becomes read-only, resulting in dirty page accumulation.
I provided a patch in another reply. Could you help test whether it solves your problem? If it does, I will officially send it to the mailing list.
OK, I will test it tomorrow.
Your patch avoids the dd hang with the default bs of 512, but if bs is increased to 1G and more 'dd' tasks are started, the dd hang can still be observed.
Thank you for your testing!
Yes, my patch only prevents the adding of new dirty pages, but it doesn't clear the dirty pages that already exist. The reason it doesn't work once bs grows is that there are already enough dirty pages to trigger balance_dirty_pages().
Executing drop_caches at this point may make dd fail and exit, but the dirty pages are still not cleared, nor does ext4 implement a shutdown. These dirty pages will not be cleared until the filesystem is unmounted.
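Just to illustrate the idea in code (a hypothetical sketch with my own naming, not the actual patch from the other reply): rejecting buffered writes once the filesystem has gone read-only stops new dirty pages from being added, but it does nothing about pages that are already dirty:

```c
/*
 * Hypothetical sketch, not the actual patch: fail the buffered write
 * path early once the filesystem is read-only, so no new dirty pages
 * are added for a dead device. Already-dirty pages are untouched.
 */
static ssize_t ext4_buffered_write_checked(struct kiocb *iocb,
					   struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (unlikely(sb_rdonly(inode->i_sb)))
		return -EROFS;	/* hypothetical early bail-out */

	return ext4_buffered_write_iter(iocb, from);
}
```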
This is the result of my test at bs=512:

ext4 -- remount read-only:
  OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
313872  313872  100%    0.10K   8048        39      32192K  buffer_head  ---> wait to max
 82602   82465   99%    0.10K   2118        39       8472K  buffer_head  ---> kill dd && drop_caches
   897     741   82%    0.10K     23        39         92K  buffer_head  ---> umount

patched:
 25233   25051   99%    0.10K    647        39       2588K  buffer_head
The reason should be as described in the next paragraph I posted.
Another thing: does remount read-only even make sense on a dead disk? Yeah, the block layer doesn't export an interface for querying whether the bdev is dead. However, I think it is reasonable to export such an interface if the FS needs it.
Ext4 just detects an I/O error and remounts the filesystem read-only; it doesn't know whether the current disk is dead or not.
I asked Yu Kuai, and he said that disk_live() can be used to determine whether a disk has been removed, based on the status of the inode corresponding to the block device, but this is generally not done in file systems.
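For illustration only (a hypothetical helper, not existing ext4 code), a filesystem that wanted to ask the block layer directly could use the existing disk_live() helper like this:

```c
#include <linux/blkdev.h>
#include <linux/fs.h>

/*
 * Hypothetical helper, not existing ext4 code: report whether the
 * gendisk backing this super_block has already been removed.
 */
static bool fs_backing_disk_dead(struct super_block *sb)
{
	return sb->s_bdev && !disk_live(sb->s_bdev->bd_disk);
}
```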
But I am afraid it may not avoid the issue completely, because the old write task hanging in balance_dirty_pages() may still write/dirty pages if it is one very large write IO.
thanks,
Ming
Those dirty pages that are already there are piling up and can't be written back, which I think is a real problem. Can the block layer clear those dirty pages when it detects that the disk is deleted?
--
With Best Regards,
Baokun Li