Re: deadlock balance_dirty_pages() to be expected?

Hello Fengguang,

On 10/07/2011 03:37 PM, Wu Fengguang wrote:
Hi Bernd,

On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
Hello,

while I'm working on the page cache mode in FhGFS (*), I noticed a
deadlock in balance_dirty_pages().

sysrq-w showed that it never started background write-out due to

if (bdi_nr_reclaimable > bdi_thresh) {
	pages_written += writeback_inodes_wb(&bdi->wb,
					     write_chunk);


and therefore also did not leave that loop with

	if (pages_written >= write_chunk)
		break;	/* We've done our duty */
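
(For context, a heavily simplified sketch of the 3.1-era balance_dirty_pages()
throttling loop; this is paraphrased from memory rather than quoted verbatim
from mm/page-writeback.c, so take the details with a grain of salt:)

for (;;) {
	/* recompute global and per-bdi dirty page counts and thresholds ... */

	if (bdi_nr_reclaimable > bdi_thresh) {
		/* synchronous writeback of this bdi's dirty inodes */
		pages_written += writeback_inodes_wb(&bdi->wb, write_chunk);
	}

	if (pages_written >= write_chunk)
		break;	/* We've done our duty */

	/* still over the dirty limits: sleep and retry -- this is the
	 * io_schedule_timeout() frame seen in the sysrq-w trace below */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	io_schedule_timeout(pause);
}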


So my process stays in uninterruptible D state forever.

If writeback_inodes_wb() is not triggered, the process should still be
able to proceed, presumably with longer delays, but it should never be
stuck forever. That's because the flusher thread should still be cleaning
pages in the background, which will knock down the dirty count and
eventually unthrottle the dirtier process.
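
(For reference, the flusher's decision whether to keep doing background
writeback is based on the global dirty counts; roughly, from memory -- the
real code lives in fs/fs-writeback.c and may differ in detail:)

static inline bool over_bground_thresh(void)
{
	unsigned long background_thresh, dirty_thresh;

	global_dirty_limits(&background_thresh, &dirty_thresh);

	return global_page_state(NR_FILE_DIRTY) +
	       global_page_state(NR_UNSTABLE_NFS) > background_thresh;
}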

Hmm, that does not seem to work:

1330 pts/0 D+ 0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M count=100

So the process has been in D state ever since I wrote the first mail, just for a 100MB write. Even if it were still making progress, it would be extremely slow. sysrq-w then shows:

[ 6727.616976] SysRq : Show Blocked State
[ 6727.617575]   task                        PC stack   pid father
[ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
[ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
[ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
[ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
[ 6727.620466] Call Trace:
[ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
[ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
[ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
[ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
[ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
[ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
[ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
[ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
[ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
[ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
[ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
[ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
[ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
[ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
[ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
[ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
[ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
[ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
[ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
[ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47



Once I added basic inode->i_data.backing_dev_info bdi support to our
file system, the deadlock did not happen anymore.

What's the workload and change exactly?

I wish I could simply send the patch, but until all the paperwork is done I'm not allowed to :(

The basic idea is:

1) During mount, when setting up the super block, starting from

static struct file_system_type fhgfs_fs_type =
{
	.mount = fhgfs_mount,
};

Then in fhgfs_mount():

bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
sb->s_bdi = &sbInfo->bdi;
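
A slightly fuller sketch of how that mount path could look -- the real FhGFS
code is not public, so the fill_super helper and the fhgfs_sb_info layout
below are made up for illustration; only the two lines above reflect the
actual change:

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/backing-dev.h>

struct fhgfs_sb_info {			/* hypothetical per-sb private data */
	struct backing_dev_info bdi;
	/* ... other FhGFS state ... */
};

/* hypothetical fill_super callback, called from fhgfs_mount() */
static int fhgfs_fill_super(struct super_block *sb, void *data, int silent)
{
	struct fhgfs_sb_info *sbInfo;
	int ret;

	sbInfo = kzalloc(sizeof(*sbInfo), GFP_KERNEL);
	if (!sbInfo)
		return -ENOMEM;

	/* give the file system its own backing_dev_info so that per-bdi
	 * dirty accounting and writeback have something to work on */
	ret = bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
	if (ret) {
		kfree(sbInfo);
		return ret;
	}

	sb->s_fs_info = sbInfo;
	sb->s_bdi = &sbInfo->bdi;

	/* ... allocate the root inode, set sb->s_op, etc. ... */
	return 0;
}

On unmount the bdi then needs a matching bdi_destroy(&sbInfo->bdi)
(e.g. from the kill_sb/put_super path) before sbInfo is freed.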



2) When new (S_IFREG) inodes are allocated, for example from the handlers in

static struct inode_operations fhgfs_dir_inode_ops =
{
	.lookup = ...,	/* FhGFS handlers, names elided */
	.create = ...,
	.link   = ...,
};

inode->i_data.backing_dev_info = &sbInfo->bdi;
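
For illustration, a minimal sketch of where that assignment could live --
again the helper name and surrounding code are made up, only the
backing_dev_info line reflects the actual change:

/* hypothetical inode allocation helper; sbInfo as in the sketch above */
static struct inode *fhgfs_new_inode(struct super_block *sb, umode_t mode)
{
	struct fhgfs_sb_info *sbInfo = sb->s_fs_info;
	struct inode *inode = new_inode(sb);

	if (!inode)
		return NULL;

	inode->i_mode = mode;

	if (S_ISREG(mode)) {
		/* point the inode's address space at our bdi instead of the
		 * default one, so dirty pages of this file are accounted to
		 * (and written back via) the fhgfs bdi */
		inode->i_data.backing_dev_info = &sbInfo->bdi;
	}

	/* ... set i_op/i_fop, i_ino, timestamps, etc. ... */
	return inode;
}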



So my question is simply whether we should expect this deadlock if the file
system does not set up backing device information, and if so, shouldn't
this be documented?

Such a deadlock is not expected.

OK, thanks; then we should figure out why it happens. Due to a network outage here I won't have time before Monday to track down which kernel version introduced it, though.


Thanks,
Bernd