On Fri, Oct 07, 2011 at 10:30:18PM +0800, Bernd Schubert wrote:
> On 10/07/2011 04:21 PM, Wu Fengguang wrote:
> > On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
> >> Hello Fengguang,
> >>
> >> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> >>> Hi Bernd,
> >>>
> >>> On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> >>>> Hello,
> >>>>
> >>>> while I'm working on the page cached mode in FhGFS (*) I noticed a
> >>>> deadlock in balance_dirty_pages().
> >>>>
> >>>> sysrq-w showed that it never started background write-out due to
> >>>>
> >>>>         if (bdi_nr_reclaimable > bdi_thresh) {
> >>>>                 pages_written += writeback_inodes_wb(&bdi->wb,
> >>>>                                                      write_chunk);
> >>>>
> >>>> and therefore also did not leave that loop with
> >>>>
> >>>>         if (pages_written >= write_chunk)
> >>>>                 break;          /* We've done our duty */
> >>>>
> >>>> So my process stays in uninterruptible D-state forever.
> >>>
> >>> If writeback_inodes_wb() is not triggered, the process should still be
> >>> able to proceed, presumably with longer delays, but never stuck forever.
> >>> That's because the flusher thread should still be cleaning the pages
> >>> in the background, which will knock down the dirty pages and eventually
> >>> unthrottle the dirtier process.
> >>
> >> Hmm, that does not seem to work:
> >>
> >> 1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M count=100
> >
> > That's normal: dd will be in D state the vast majority of the time, but
> > the point is, one single balance_dirty_pages() call should not take
> > forever, and dd should be able to go out of the D state (and
> > re-enter it almost immediately) from time to time.
> >
> >> So the process has been in D state ever since I wrote the first mail, just for
> >> 100MB of writes. Even if it still would do something, it would be extremely
> >> slow. Sysrq-w then shows:
> >
> > So it's normal to catch such a trace 99% of the time. But do you mean the
> > writeout bandwidth is lower than expected?
>
> If it really is still doing something, it is *way* slower. Once I added
> bdi support, it finished writing the 100MB file in my kvm test instance
> within a few seconds. Right now it has been running for hours already... As I
> added a dump_stack() to our writepages() method, I also see that this
> function is never called.

In your case it should be the default/forker thread that's doing the
(suboptimal) writeout:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        17  0.0  0.0      0     0 ?        S    21:12   0:00 [bdi-default]

In normal cases there are the flush-* threads doing the writeout:

root      1146  0.0  0.0      0     0 ?        S    21:12   0:00 [flush-8:0]
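For reference, the two fragments quoted at the top of the thread sit inside the throttle loop of balance_dirty_pages(). The following is a heavily abridged paraphrase of the ~v3.1 mm/page-writeback.c loop shape, written out only to show why progress depends on background writeback; it is not the verbatim kernel source, and the elided parts are marked:

/* Heavily abridged sketch of balance_dirty_pages() around v3.1
 * (mm/page-writeback.c) -- paraphrased for illustration, not verbatim. */
static void balance_dirty_pages(struct address_space *mapping,
				unsigned long write_chunk)
{
	struct backing_dev_info *bdi = mapping->backing_dev_info;
	unsigned long bdi_nr_reclaimable, bdi_nr_writeback, bdi_thresh;
	unsigned long pages_written = 0;
	unsigned long pause = 1;

	for (;;) {
		/* ... recompute global and per-bdi dirty counts and the
		 * per-bdi threshold (bdi_thresh) here ... */

		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
			break;			/* under the limit: unthrottle */

		/* Foreground writeout only when this bdi is over its limit */
		if (bdi_nr_reclaimable > bdi_thresh)
			pages_written += writeback_inodes_wb(&bdi->wb,
							     write_chunk);

		if (pages_written >= write_chunk)
			break;			/* We've done our duty */

		/* Otherwise nap and rely on background writeback (the
		 * per-bdi flusher thread) to clean pages in the meantime. */
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(pause);

		pause <<= 1;
		if (pause > HZ / 10)
			pause = HZ / 10;
	}
	/* ... finally, kick background writeback if over background_thresh ... */
}

If writeback_inodes_wb() never makes pages_written grow, the only remaining exit is for someone else to push the bdi back under its threshold, which is exactly the job of the flusher thread that was never created here.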
> >
> >>> [ 6727.616976] SysRq : Show Blocked State
> >>> [ 6727.617575]   task                        PC stack   pid father
> >>> [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
> >>> [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> >>> [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> >>> [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> >>> [ 6727.620466] Call Trace:
> >>> [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> >>> [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> >>> [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
> >>> [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> >>> [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> >>> [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> >>> [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> >>> [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> >>> [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> >>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> >>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> >>> [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> >>> [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> >>> [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> >>> [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> >>> [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
> >>> [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> >>> [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
> >>
> >>
> >>>
> >>>> Once I added basic inode->i_data.backing_dev_info bdi support to our
> >>>> file system, the deadlock did not happen anymore.
> >>>
> >>> What's the workload and change exactly?
> >>
> >> I wish I could simply send the patch, but until all the paper work is
> >> done I'm not allowed to :(
> >>
> >> The basic idea is:
> >>
> >> 1) During mount, the super block is set up from
> >>
> >> static struct file_system_type fhgfs_fs_type =
> >> {
> >>         .mount = fhgfs_mount,
> >> };
> >>
> >> Then in fhgfs_mount():
> >>
> >>         bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
> >>         sb->s_bdi = &sbInfo->bdi;
> >>
> >> 2) When new (S_IFREG) inodes are allocated, for example from
> >>
> >> static struct inode_operations fhgfs_dir_inode_ops
> >> {
> >>         .lookup,
> >>         .create,
> >>         .link
> >> };
> >>
> >>         inode->i_data.backing_dev_info = &sbInfo->bdi;
> >
> > Ah, when you didn't register the "fhgfs" bdi, there would be no
> > dedicated flusher thread doing the writeout. Which is obviously
> > suboptimal.
> >
> >>>> So my question is simply whether we should expect this deadlock if the file
> >>>> system does not set up backing device information, and if so, shouldn't
> >>>> this be documented?
> >>>
> >>> Such a deadlock is not expected.
> >>
> >> Ok, thanks, then we should figure out why it happens. Due to a network
> >> outage here I won't have time before Monday to track down which kernel
> >> version introduced it, though.
> >
> > It was a long time ago that per-bdi writeback was introduced, I suspect.
>
> Ok, I can start testing whether 2.6.32 already deadlocks.

I found the commit; it was introduced right in .32, hehe.

commit 03ba3782e8dcc5b0e1efe440d33084f066e38cae
Author: Jens Axboe <jens.axboe@xxxxxxxxxx>
Date:   Wed Sep 9 09:08:54 2009 +0200

    writeback: switch to per-bdi threads for flushing data

    This gets rid of pdflush for bdi writeout and kupdated style cleaning.

Thanks,
Fengguang
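Pulling the fix described in the thread together: a minimal sketch of the mount-time BDI registration against the ~2.6.34-3.1 era APIs (bdi_setup_and_register, bdi_destroy, sb->s_bdi, inode->i_data.backing_dev_info). The struct layout and the helper names fhgfs_fill_super, fhgfs_init_inode_bdi and fhgfs_put_super are illustrative stand-ins, not FhGFS's actual code:

#include <linux/backing-dev.h>
#include <linux/fs.h>
#include <linux/slab.h>

struct fhgfs_sb_info {
	struct backing_dev_info bdi;
	/* ... other per-superblock state ... */
};

/*
 * 1) At mount time (e.g. from a fill_super helper called by the .mount
 *    callback): register a dedicated BDI and attach it to the superblock,
 *    so a per-bdi flusher thread does the background writeout.
 */
static int fhgfs_fill_super(struct super_block *sb, void *data, int silent)
{
	struct fhgfs_sb_info *sbInfo;
	int err;

	sbInfo = kzalloc(sizeof(*sbInfo), GFP_KERNEL);
	if (!sbInfo)
		return -ENOMEM;

	err = bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
	if (err) {
		kfree(sbInfo);
		return err;
	}

	sb->s_fs_info = sbInfo;
	sb->s_bdi = &sbInfo->bdi;	/* writeback now targets this bdi */

	/* ... set up s_op, root inode, etc. ... */
	return 0;
}

/*
 * 2) Whenever a new inode is set up (lookup/create/link/...): point its
 *    page cache at the same bdi, so dirty accounting and
 *    balance_dirty_pages() see the registered device rather than the
 *    default one.
 */
static void fhgfs_init_inode_bdi(struct inode *inode)
{
	struct fhgfs_sb_info *sbInfo = inode->i_sb->s_fs_info;

	inode->i_data.backing_dev_info = &sbInfo->bdi;
}

/* 3) On unmount, the bdi has to be torn down again. */
static void fhgfs_put_super(struct super_block *sb)
{
	struct fhgfs_sb_info *sbInfo = sb->s_fs_info;

	bdi_destroy(&sbInfo->bdi);
	kfree(sbInfo);
	sb->s_fs_info = NULL;
}

With the bdi registered, a flush-fhgfs-N thread (rather than bdi-default) picks up the dirty pages, so the throttled writer in balance_dirty_pages() is unthrottled again once the bdi drops back under its threshold.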