Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

Ming Lei <ming.lei@xxxxxxxxxx> · Fri, 28 Apr 2023 09:41:15 +0800

On Thu, Apr 27, 2023 at 07:27:04PM +0800, Ming Lei wrote:
> On Thu, Apr 27, 2023 at 07:19:35PM +0800, Baokun Li wrote:
> > On 2023/4/27 18:01, Ming Lei wrote:
> > > On Thu, Apr 27, 2023 at 02:36:51PM +0800, Baokun Li wrote:
> > > > On 2023/4/27 12:50, Ming Lei wrote:
> > > > > Hello Matthew,
> > > > > 
> > > > > On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
> > > > > > On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> > > > > > > Hello Guys,
> > > > > > > 
> > > > > > > I got one report in which buffered write IO hangs in balance_dirty_pages,
> > > > > > > after one nvme block device is unplugged physically, then umount can't
> > > > > > > succeed.
> > > > > > That's a feature, not a bug ... the dd should continue indefinitely?
> > > > > Can you explain what the feature is? I don't see such an 'issue' or 'feature'
> > > > > on xfs.
> > > > > 
> > > > > The device is gone, so IMO it is reasonable for FS buffered write IO to
> > > > > fail. Actually dmesg has shown 'EXT4-fs (nvme0n1): Remounting
> > > > > filesystem read-only'. These things may confuse users.
> > > > 
> > > > The reason for this difference is that ext4 and xfs handle errors
> > > > differently.
> > > > 
> > > > ext4 remounts the filesystem read-only or even just continues, and vfs_write
> > > > does not check for either.
> > > vfs_write may not find anything wrong, but ext4 remount could see that the
> > > disk is gone, although that might happen during or after the remount.
> > > 
> > > > xfs shuts down the filesystem, so it returns a failure at
> > > > xfs_file_write_iter when it finds an error.
> > > > 
> > > > 
> > > > ``` ext4
> > > > ksys_write
> > > >   vfs_write
> > > >    ext4_file_write_iter
> > > >     ext4_buffered_write_iter
> > > >      ext4_write_checks
> > > >       file_modified
> > > >        file_modified_flags
> > > >         __file_update_time
> > > >          inode_update_time
> > > >           generic_update_time
> > > >            __mark_inode_dirty
> > > >             ext4_dirty_inode ---> 2. void func, No propagating errors out
> > > >              __ext4_journal_start_sb
> > > >               ext4_journal_check_start ---> 1. Error found, remount-ro
> > > >      generic_perform_write ---> 3. No error sensed, continue
> > > >       balance_dirty_pages_ratelimited
> > > >        balance_dirty_pages_ratelimited_flags
> > > >         balance_dirty_pages
> > > >          // 4. Sleeping waiting for dirty pages to be freed
> > > >          __set_current_state(TASK_KILLABLE)
> > > >          io_schedule_timeout(pause);
> > > > ```
> > > > 
> > > > ``` xfs
> > > > ksys_write
> > > >   vfs_write
> > > >    xfs_file_write_iter
> > > >     if (xfs_is_shutdown(ip->i_mount))
> > > >       return -EIO;    ---> dd fail
> > > > ```
> > > Thanks for the info which is really helpful for me to understand the
> > > problem.
> > > 
> > > > > > balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
> > > > > > the dd process should succeed.
> > > > > Yeah, dd can be killed, however it may be any application(s), :-)
> > > > > 
> > > > > Fortunately it won't cause trouble during reboot/power off, given
> > > > > userspace will be killed at that time.
> > > > > 
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > Ming
> > > > > 
> > > > Don't worry about that, we always set the current thread to TASK_KILLABLE
> > > > while waiting in balance_dirty_pages().
> > > I have another concern: if 'dd' isn't killed, dirty pages won't be cleaned, and
> > > this (large amount of) memory becomes unusable. A typical scenario could be a
> > > USB HDD being unplugged.
> > > 
> > > 
> > > thanks,
> > > Ming
> > Yes, it is unreasonable to continue writing data with the previously opened
> > fd after the file system becomes read-only, resulting in dirty page accumulation.
> > 
> > I provided a patch in another reply.
> > Could you help test if it can solve your problem?
> > If it can indeed solve your problem, I will officially send it to the mailing
> > list.
> 
> OK, I will test it tomorrow.

Your patch avoids the dd hang when bs is the default 512, but if bs is
increased to 1G and more 'dd' tasks are started, the dd hang can still
be observed.

The reason should be what I described in the paragraph quoted below.

Another question is whether remounting read-only makes sense on a dead
disk. True, the block layer doesn't export an interface for querying
whether a bdev is dead. However, I think it is reasonable to export such
an interface if the FS needs it.

> 
> But I am afraid it can't avoid the issue completely, because an
> old write task hung in balance_dirty_pages() may still write/dirty pages
> if it is one very big write IO.

thanks,
Ming