On Fri, Oct 07, 2011 at 10:30:18PM +0800, Bernd Schubert wrote:
> On 10/07/2011 04:21 PM, Wu Fengguang wrote:
> > On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
> >> Hello Fengguang,
> >>
> >> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> >>> Hi Bernd,
> >>>
> >>> On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> >>>> Hello,
> >>>>
> >>>> while I'm working on the page cached mode in FhGFS (*) I noticed a
> >>>> deadlock in balance_dirty_pages().
> >>>>
> >>>> sysrq-w showed that it never started background write-out due to
> >>>>
> >>>>         if (bdi_nr_reclaimable > bdi_thresh) {
> >>>>                 pages_written += writeback_inodes_wb(&bdi->wb,
> >>>>                                                      write_chunk);
> >>>>
> >>>> and therefore also did not leave that loop with
> >>>>
> >>>>         if (pages_written >= write_chunk)
> >>>>                 break;          /* We've done our duty */
> >>>>
> >>>> So my process stays in uninterruptible D-state forever.
> >>>
> >>> If writeback_inodes_wb() is not triggered, the process should still be
> >>> able to proceed, presumably with longer delays, but never stuck forever.
> >>> That's because the flusher thread should still be cleaning the pages
> >>> in the background, which will knock down the dirty pages and eventually
> >>> unthrottle the dirtier process.
> >>
> >> Hmm, that does not seem to work:
> >>
> >> 1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M count=100
> >
> > That's normal: dd will be in D state the vast majority of the time, but
> > the point is, one single balance_dirty_pages() call should not take
> > forever, and dd should be able to go out of the D state (and
> > re-enter it almost immediately) from time to time.
> >
> >> So the process has been in D state ever since I wrote the first mail, just for
> >> 100MB of writes. Even if it still would do something, it would be extremely
> >> slow. Sysrq-w then shows:
> >
> > So it's normal to catch such a trace 99% of the time. But do you mean the
> > writeout bandwidth is lower than expected?
>
> If it really is still doing something, it is *way* slower. Once I added
> bdi support, it finished writing the 100MB file in my kvm test instance
> within a few seconds. Right now it has been running for hours already... As I
> added a dump_stack() to our writepages() method, I also see that this
> function is never called.

In your case it should be the default/forker thread that's doing the
(suboptimal) writeout:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        17  0.0  0.0      0     0 ?        S    21:12   0:00 [bdi-default]

In normal cases there are the flush-* threads doing the writeout:

root      1146  0.0  0.0      0     0 ?        S    21:12   0:00 [flush-8:0]
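For reference, the two fragments quoted at the top of the thread sit inside the throttle loop of balance_dirty_pages(). The following is a heavily abridged paraphrase of the ~v3.1 mm/page-writeback.c loop shape, written out only to show why progress depends on background writeback; it is not the verbatim kernel source, and the elided parts are marked:

/* Heavily abridged sketch of balance_dirty_pages() around v3.1
 * (mm/page-writeback.c) -- paraphrased for illustration, not verbatim. */
static void balance_dirty_pages(struct address_space *mapping,
				unsigned long write_chunk)
{
	struct backing_dev_info *bdi = mapping->backing_dev_info;
	unsigned long bdi_nr_reclaimable, bdi_nr_writeback, bdi_thresh;
	unsigned long pages_written = 0;
	unsigned long pause = 1;

	for (;;) {
		/* ... recompute global and per-bdi dirty counts and the
		 * per-bdi threshold (bdi_thresh) here ... */

		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
			break;			/* under the limit: unthrottle */

		/* Foreground writeout only when this bdi is over its limit */
		if (bdi_nr_reclaimable > bdi_thresh)
			pages_written += writeback_inodes_wb(&bdi->wb,
							     write_chunk);

		if (pages_written >= write_chunk)
			break;			/* We've done our duty */

		/* Otherwise nap and rely on background writeback (the
		 * per-bdi flusher thread) to clean pages in the meantime. */
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(pause);

		pause <<= 1;
		if (pause > HZ / 10)
			pause = HZ / 10;
	}
	/* ... finally, kick background writeback if over background_thresh ... */
}

If writeback_inodes_wb() never makes pages_written grow, the only remaining exit is for someone else to push the bdi back under its threshold, which is exactly the job of the flusher thread that was never created here.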
> >
> >>> [ 6727.616976] SysRq : Show Blocked State
> >>> [ 6727.617575]   task                        PC stack   pid father
> >>> [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
> >>> [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> >>> [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> >>> [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> >>> [ 6727.620466] Call Trace:
> >>> [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> >>> [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> >>> [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
> >>> [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> >>> [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> >>> [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> >>> [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> >>> [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> >>> [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> >>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> >>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> >>> [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> >>> [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> >>> [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> >>> [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> >>> [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
> >>> [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> >>> [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
> >>
> >>
> >>>
> >>>> Once I added basic inode->i_data.backing_dev_info bdi support to our
> >>>> file system, the deadlock did not happen anymore.
> >>>
> >>> What's the workload and change exactly?
> >>
> >> I wish I could simply send the patch, but until all the paper work is
> >> done I'm not allowed to :(
> >>
> >> The basic idea is:
> >>
> >> 1) During mount, the super block is set up from
> >>
> >> static struct file_system_type fhgfs_fs_type =
> >> {
> >>         .mount = fhgfs_mount,
> >> };
> >>
> >> Then in fhgfs_mount():
> >>
> >>         bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
> >>         sb->s_bdi = &sbInfo->bdi;
> >>
> >> 2) When new (S_IFREG) inodes are allocated, for example from
> >>
> >> static struct inode_operations fhgfs_dir_inode_ops
> >> {
> >>         .lookup,
> >>         .create,
> >>         .link
> >> };
> >>
> >>         inode->i_data.backing_dev_info = &sbInfo->bdi;
> >
> > Ah, when you didn't register the "fhgfs" bdi, there would be no
> > dedicated flusher thread doing the writeout. Which is obviously
> > suboptimal.
> >
> >>>> So my question is simply whether we should expect this deadlock if the file
> >>>> system does not set up backing device information, and if so, shouldn't
> >>>> this be documented?
> >>>
> >>> Such a deadlock is not expected.
> >>
> >> Ok, thanks, then we should figure out why it happens. Due to a network
> >> outage here I won't have time before Monday to track down which kernel
> >> version introduced it, though.
> >
> > It was a long time ago that per-bdi writeback was introduced, I suspect.
>
> Ok, I can start testing whether 2.6.32 already deadlocks.

I found the commit; it was introduced right in .32, hehe.

commit 03ba3782e8dcc5b0e1efe440d33084f066e38cae
Author: Jens Axboe <jens.axboe@xxxxxxxxxx>
Date:   Wed Sep 9 09:08:54 2009 +0200

    writeback: switch to per-bdi threads for flushing data

    This gets rid of pdflush for bdi writeout and kupdated style cleaning.

Thanks,
Fengguang
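Pulling the fix described in the thread together: a minimal sketch of the mount-time BDI registration against the ~2.6.34-3.1 era APIs (bdi_setup_and_register, bdi_destroy, sb->s_bdi, inode->i_data.backing_dev_info). The struct layout and the helper names fhgfs_fill_super, fhgfs_init_inode_bdi and fhgfs_put_super are illustrative stand-ins, not FhGFS's actual code:

#include <linux/backing-dev.h>
#include <linux/fs.h>
#include <linux/slab.h>

struct fhgfs_sb_info {
	struct backing_dev_info bdi;
	/* ... other per-superblock state ... */
};

/*
 * 1) At mount time (e.g. from a fill_super helper called by the .mount
 *    callback): register a dedicated BDI and attach it to the superblock,
 *    so a per-bdi flusher thread does the background writeout.
 */
static int fhgfs_fill_super(struct super_block *sb, void *data, int silent)
{
	struct fhgfs_sb_info *sbInfo;
	int err;

	sbInfo = kzalloc(sizeof(*sbInfo), GFP_KERNEL);
	if (!sbInfo)
		return -ENOMEM;

	err = bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
	if (err) {
		kfree(sbInfo);
		return err;
	}

	sb->s_fs_info = sbInfo;
	sb->s_bdi = &sbInfo->bdi;	/* writeback now targets this bdi */

	/* ... set up s_op, root inode, etc. ... */
	return 0;
}

/*
 * 2) Whenever a new inode is set up (lookup/create/link/...): point its
 *    page cache at the same bdi, so dirty accounting and
 *    balance_dirty_pages() see the registered device rather than the
 *    default one.
 */
static void fhgfs_init_inode_bdi(struct inode *inode)
{
	struct fhgfs_sb_info *sbInfo = inode->i_sb->s_fs_info;

	inode->i_data.backing_dev_info = &sbInfo->bdi;
}

/* 3) On unmount, the bdi has to be torn down again. */
static void fhgfs_put_super(struct super_block *sb)
{
	struct fhgfs_sb_info *sbInfo = sb->s_fs_info;

	bdi_destroy(&sbInfo->bdi);
	kfree(sbInfo);
	sb->s_fs_info = NULL;
}

With the bdi registered, a flush-fhgfs-N thread (rather than bdi-default) picks up the dirty pages, so the throttled writer in balance_dirty_pages() is unthrottled again once the bdi drops back under its threshold.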