write call hangs in kernel space after virtio hot-remove

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I'm looking into an issue where removing a virtio disk via sysfs while another
process is issuing write() calls results in the writing task going into a
livelock:


root@guest # cat test.sh
#!/bin/bash

dd if=/dev/zero of=/dev/vda bs=1M count=10000 &
sleep 1
echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove

root@guest # ls /dev/vd*
/dev/vda

root@guest # grep Dirty /proc/meminfo
Dirty:                 0 kB

root@guest # sh test.sh

root@guest # ps aux | grep "[d]d if"
root      1699 38.6  0.0 111424  1216 hvc0     D+   10:48   0:01 dd if=/dev/zero of=/dev/vda bs=1M count=10000

root@guest # ls /dev/vd*
ls: cannot access /dev/vd*: No such file or directory

root@guest # cat /proc/1699/stack
[<0>] 0xc0000000ffe28218
[<0>] __switch_to+0x31c/0x480
[<0>] balance_dirty_pages+0x990/0xb90
[<0>] balance_dirty_pages_ratelimited+0x50c/0x6c0
[<0>] generic_perform_write+0x1b0/0x260
[<0>] __generic_file_write_iter+0x200/0x240
[<0>] blkdev_write_iter+0xa4/0x150
[<0>] __vfs_write+0x14c/0x240
[<0>] vfs_write+0xd0/0x240
[<0>] ksys_write+0x6c/0x110
[<0>] system_call+0x58/0x6c

root@guest # grep Dirty /proc/meminfo
Dirty:           1506816 kB

---

I have done some tracing and I believe this is caused by the clearing of
'WB_registered' in 'wb_shutdown':

<snip>
sh-1697  [000] ....  3994.541664: sysfs_remove_link <-del_gendisk
sh-1697  [000] ....  3994.541671: wb_shutdown <-bdi_unregister
<snip>

Later, when 'balance_dirty_pages' tries to start writeback, it doesn't happen
because 'WB_registered' is not set:

fs/fs-writeback.c

static void wb_wakeup(struct bdi_writeback *wb)
{
        spin_lock_bh(&wb->work_lock);
        if (test_bit(WB_registered, &wb->state))
                mod_delayed_work(bdi_wq, &wb->dwork, 0);
        spin_unlock_bh(&wb->work_lock);
}

So we get stuck in a loop in 'balance_dirty_pages':

root@guest # cat /sys/kernel/debug/tracing/set_ftrace_filter
balance_dirty_pages_ratelimited
balance_dirty_pages
wb_wakeup
wb_workfn
io_schedule_timeout

<snip>
dd-1699  [000] .... 11192.535946: wb_wakeup <-balance_dirty_pages
dd-1699  [000] .... 11192.535950: io_schedule_timeout <-balance_dirty_pages
dd-1699  [000] .... 11192.745968: wb_wakeup <-balance_dirty_pages
dd-1699  [000] .... 11192.745972: io_schedule_timeout <-balance_dirty_pages
<snip>


The test on 'WB_registered' before starting the writeback task is introduced
by: "5acda9 bdi: avoid oops on device removal".

I have made a *naive* attempt at fixing it by allowing writeback to happen even
without 'WB_registered':


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d4d04fe..050b067 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -982,7 +982,7 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
 	 * writeback as soon as there is no other work to do.
 	 */
 	trace_writeback_wake_background(wb);
-	wb_wakeup(wb);
+	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 }
 
 /*
@@ -1933,7 +1933,7 @@ void wb_workfn(struct work_struct *work)
 						struct bdi_writeback, dwork);
 	long pages_written;
 
-	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
+	set_worker_desc("flush-%s", wb->bdi->dev ? dev_name(wb->bdi->dev) : "?" );
 	current->flags |= PF_SWAPWRITE;
 
 	if (likely(!current_is_workqueue_rescuer() ||
--


The effect of that is that the 'dd' process now finishes successfully and we
get "Buffer I/O error"s for the dirty pages that remain. I believe this to be
in conformance with existing interfaces since dd does not issue any fsync()
calls.


Does my analysis make any sense and would something along these lines be
acceptable as a solution?


Cheers




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux