On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
> Hello Fengguang,
>
> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> > Hi Bernd,
> >
> > On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> >> Hello,
> >>
> >> while I'm working on the page cache mode in FhGFS (*) I noticed a
> >> deadlock in balance_dirty_pages().
> >>
> >> sysrq-w showed that it never started background write-out due to
> >>
> >> 	if (bdi_nr_reclaimable > bdi_thresh) {
> >> 		pages_written += writeback_inodes_wb(&bdi->wb,
> >> 						write_chunk);
> >>
> >> and therefore also did not leave that loop with
> >>
> >> 	if (pages_written >= write_chunk)
> >> 		break;		/* We've done our duty */
> >>
> >> So my process stays in uninterruptible D-state forever.
> >
> > If writeback_inodes_wb() is not triggered, the process should still be
> > able to proceed, presumably with longer delays, but it should never get
> > stuck forever. That's because the flusher thread should still be
> > cleaning pages in the background, which will knock down the dirty page
> > count and eventually unthrottle the dirtying process.
>
> Hmm, that does not seem to work:
>
> 1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M count=100

That's normal: dd will be in D state the vast majority of the time, but the
point is, a single balance_dirty_pages() call should not take forever, and
dd should be able to leave the D state (and re-enter it almost immediately)
from time to time.

> So the process has been in D state ever since I wrote the first mail, just
> for 100MB of writes. Even if it were still making progress, it would be
> extremely slow. Sysrq-w then shows:

So it's normal to catch such a trace 99% of the time. But do you mean the
writeout bandwidth is lower than expected?
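[Editor's aside: the control flow under discussion can be modeled as a small
userspace sketch. Everything below is hypothetical (names, numbers, the
`bdi_model` struct); the real kernel derives its thresholds from
vm.dirty_ratio and friends. The sketch only illustrates the shape of the
problem: if bdi_nr_reclaimable never exceeds bdi_thresh, the task itself
never writes anything back, so it can only leave the loop once something
else, i.e. a per-bdi flusher thread, drains the dirty counts.]

```c
#include <stdbool.h>

/* Hypothetical, userspace model of the balance_dirty_pages() loop
 * discussed in this thread. Not kernel code. */
struct bdi_model {
	long nr_reclaimable;	/* dirty pages attributed to this bdi */
	long thresh;		/* per-bdi dirty threshold */
	long global_dirty;	/* total dirty pages in the system */
	long global_thresh;	/* global dirty threshold */
	bool flusher_running;	/* does a per-bdi flusher thread exist? */
};

/* Returns the number of iterations until the dirtier is unthrottled,
 * or -1 if it would block "forever" (capped here for the model). */
static int balance_dirty_pages_model(struct bdi_model *bdi, long write_chunk)
{
	long pages_written = 0;

	for (int iter = 0; iter < 1000; iter++) {
		if (bdi->global_dirty <= bdi->global_thresh)
			return iter;	/* below the limits: unthrottled */

		if (bdi->nr_reclaimable > bdi->thresh)
			pages_written += write_chunk; /* writeback_inodes_wb() */

		if (pages_written >= write_chunk)
			return iter;	/* "We've done our duty" */

		/* io_schedule_timeout(): sleep and hope the flusher
		 * thread cleans pages in the background. */
		if (bdi->flusher_running)
			bdi->global_dirty -= write_chunk;
	}
	return -1;	/* no writeout path ever made progress */
}
```

With flusher_running = false and nr_reclaimable below thresh, the model
returns -1, matching the observed permanent D state; flipping
flusher_running to true lets global_dirty drain and the task leave the
loop, matching the behavior after the bdi was registered.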
> > [ 6727.616976] SysRq : Show Blocked State
> > [ 6727.617575]   task                        PC stack   pid father
> > [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
> > [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> > [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> > [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> > [ 6727.620466] Call Trace:
> > [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> > [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> > [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
> > [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> > [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> > [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> > [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> > [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> > [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> > [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> > [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> > [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> > [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> > [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> > [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> > [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> > [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> > [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
> > [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> > [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
> >
> >
> >> Once I added basic inode->i_data.backing_dev_info bdi support to our
> >> file system, the deadlock did not happen anymore.
> >
> > What's the workload and change, exactly?
>
> I wish I could simply send the patch, but until all the paper work is
> done I'm not allowed to :(
>
> The basic idea is:
>
> 1) During mount, when setting up the super block from
>
> static struct file_system_type fhgfs_fs_type =
> {
> 	.mount	= fhgfs_mount,
> };
>
> Then in fhgfs_mount():
>
> 	bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
> 	sb->s_bdi = &sbInfo->bdi;
>
>
> 2) When new (S_IFREG) inodes are allocated, for example from
>
> static struct inode_operations fhgfs_dir_inode_ops =
> {
> 	.lookup,
> 	.create,
> 	.link
> };
>
> 	inode->i_data.backing_dev_info = &sbInfo->bdi;

Ah, when you didn't register the "fhgfs" bdi, there was no dedicated flusher
thread doing the writeout. Which is obviously suboptimal.

> >> So my question is simply if we should expect this deadlock if the file
> >> system does not set up backing device information, and if so, shouldn't
> >> this be documented?
> >
> > Such a deadlock is not expected..
>
> Ok thanks, then we should figure out why it happens. Due to a network
> outage here I won't have time before Monday to track down which kernel
> version introduced it, though.

I suspect it dates back to when per-bdi writeback was introduced, a long
time ago.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html