Hello Fengguang,
On 10/07/2011 03:37 PM, Wu Fengguang wrote:
Hi Bernd,
On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
Hello,
while I'm working on the page cache mode in FhGFS (*), I noticed a
deadlock in balance_dirty_pages().
sysrq-w showed that it never started background write-out via

    if (bdi_nr_reclaimable > bdi_thresh) {
            pages_written += writeback_inodes_wb(&bdi->wb,
                                                 write_chunk);

and therefore also never left that loop through

    if (pages_written >= write_chunk)
            break;          /* We've done our duty */

So my process stays in uninterruptible D-state forever.
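For reference, the throttle loop in mm/page-writeback.c of that era looks
roughly like the simplified sketch below (paraphrased from memory, not the
exact 3.1 source); the io_schedule_timeout() in it is where the trace
further down shows dd sleeping:

    /* simplified sketch of balance_dirty_pages(), not the exact source */
    for (;;) {
            /* sample the global and per-bdi dirty/writeback counters and
             * compute bdi_thresh from this bdi's share of writeout ... */

            if (bdi_nr_reclaimable > bdi_thresh) {
                    pages_written += writeback_inodes_wb(&bdi->wb,
                                                         write_chunk);
                    if (pages_written >= write_chunk)
                            break;          /* We've done our duty */
            }

            __set_current_state(TASK_UNINTERRUPTIBLE);
            io_schedule_timeout(pause);     /* the D-state sleep seen below */

            /* pause grows (up to a cap) and the loop repeats until enough
             * pages were written or the dirty counts drop below the limits */
    }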
If writeback_inodes_wb() is not triggered, the process should still be
able to proceed, presumably with longer delays, but it should never be
stuck forever.
That's because the flusher thread should still be cleaning pages in the
background, which will knock down the number of dirty pages and
eventually unthrottle the dirtier process.
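Roughly speaking (a simplified sketch, not the exact source), the dirtier
itself kicks that background writeback at the end of
balance_dirty_pages(), and the flusher's progress is what eventually lets
the throttled task out of the loop:

    /* sketch: tail of balance_dirty_pages(), simplified from memory */
    if ((laptop_mode && pages_written) ||
        (!laptop_mode && (nr_reclaimable > background_thresh)))
            bdi_start_background_writeback(bdi);

    /* the per-bdi flusher thread then writes dirty pages out, the dirty
     * counters fall, and the throttled dirtier can make progress again */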
Hmm, that does not seem to work:
1330 pts/0 D+ 0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M
count=100
So the process has been in D state ever since I wrote the first mail,
just for a 100MB write. Even if it were still making progress, it would
be extremely slow. sysrq-w then shows:
[ 6727.616976] SysRq : Show Blocked State
[ 6727.617575] task PC stack pid father
[ 6727.618252] dd D 0000000000000000 3544 1330 1306 0x00000000
[ 6727.619002] ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
[ 6727.620157] 0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
[ 6727.620466] ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
[ 6727.620466] Call Trace:
[ 6727.620466] [<ffffffff81398627>] ? __schedule+0x697/0x7e0
[ 6727.620466] [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
[ 6727.620466] [<ffffffff8139884f>] schedule+0x3f/0x60
[ 6727.620466] [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
[ 6727.620466] [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
[ 6727.620466] [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
[ 6727.620466] [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
[ 6727.620466] [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
[ 6727.620466] [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
[ 6727.620466] [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
[ 6727.620466] [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
[ 6727.620466] [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
[ 6727.620466] [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
[ 6727.620466] [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
[ 6727.620466] [<ffffffff8115af8a>] do_sync_write+0xda/0x120
[ 6727.620466] [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
[ 6727.620466] [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
[ 6727.620466] [<ffffffff8115b661>] sys_write+0x51/0x90
[ 6727.620466] [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
[ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
Once I added basic inode->i_data.backing_dev_info bdi support to our
file system, the deadlock did not happen anymore.
What's the workload and change exactly?
I wish I could simply send the patch, but until all the paperwork is
done I'm not allowed to :(
The basic idea is:
1) During mount, a per-mount bdi is set up and attached to the super
   block. The file system type is registered as

        static struct file_system_type fhgfs_fs_type =
        {
                .mount = fhgfs_mount,
        };

   and then in fhgfs_mount():

        bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
        sb->s_bdi = &sbInfo->bdi;

2) When new (S_IFREG) inodes are allocated, for example from the handlers
   in

        static struct inode_operations fhgfs_dir_inode_ops =
        {
                .lookup = ...,
                .create = ...,
                .link   = ...,
        };

   the inode is pointed at the same bdi:

        inode->i_data.backing_dev_info = &sbInfo->bdi;
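Put together, the change amounts to something like the following sketch.
It is illustrative only; names such as FhgfsSbInfo, fhgfs_fill_super and
fhgfs_set_inode_bdi are placeholders, not the real (not yet publishable)
patch:

    #include <linux/fs.h>
    #include <linux/slab.h>
    #include <linux/backing-dev.h>

    /* placeholder per-mount info; the real structure looks different */
    struct FhgfsSbInfo {
            struct backing_dev_info bdi;
            /* ... other per-mount state ... */
    };

    /* 1) during mount: register a per-mount bdi and hook it into the sb */
    static int fhgfs_fill_super(struct super_block *sb, void *data, int silent)
    {
            struct FhgfsSbInfo *sbInfo;
            int err;

            sbInfo = kzalloc(sizeof(*sbInfo), GFP_KERNEL);
            if (!sbInfo)
                    return -ENOMEM;

            err = bdi_setup_and_register(&sbInfo->bdi, "fhgfs",
                                         BDI_CAP_MAP_COPY);
            if (err) {
                    kfree(sbInfo);
                    return err;
            }

            sb->s_fs_info = sbInfo;
            sb->s_bdi = &sbInfo->bdi;

            /* ... remaining super block / root inode setup ... */
            return 0;
    }

    /* 2) whenever a new S_IFREG inode is set up (lookup/create/link paths) */
    static void fhgfs_set_inode_bdi(struct inode *inode)
    {
            struct FhgfsSbInfo *sbInfo = inode->i_sb->s_fs_info;

            if (S_ISREG(inode->i_mode))
                    inode->i_data.backing_dev_info = &sbInfo->bdi;
    }

    /* the bdi also needs a matching bdi_destroy(&sbInfo->bdi) on unmount */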
So my question is simply whether we should expect this deadlock if the
file system does not set up backing device information, and if so,
shouldn't this be documented?
Such a deadlock is not expected...
Ok, thanks, then we should figure out why it happens. Due to a network
outage here I won't have time before Monday to track down which kernel
version introduced it, though.
Thanks,
Bernd