On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
> Hello Fengguang,
>
> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> > Hi Bernd,
> >
> > On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> >> Hello,
> >>
> >> while I'm working on the page cache mode in FhGFS (*) I noticed a
> >> deadlock in balance_dirty_pages().
> >>
> >> sysrq-w showed that it never started background write-out due to
> >>
> >> 	if (bdi_nr_reclaimable > bdi_thresh) {
> >> 		pages_written += writeback_inodes_wb(&bdi->wb,
> >> 						write_chunk);
> >>
> >> and therefore also did not leave that loop with
> >>
> >> 	if (pages_written >= write_chunk)
> >> 		break;		/* We've done our duty */
> >>
> >> So my process stays in uninterruptible D-state forever.
> >
> > If writeback_inodes_wb() is not triggered, the process should still be
> > able to proceed, presumably with longer delays, but it should never get
> > stuck forever. That's because the flusher thread should still be
> > cleaning pages in the background, which will knock down the dirty page
> > count and eventually unthrottle the dirtying process.
>
> Hmm, that does not seem to work:
>
> 1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M count=100

That's normal: dd will be in D state the vast majority of the time, but the
point is, a single balance_dirty_pages() call should not take forever, and
dd should be able to leave the D state (and re-enter it almost immediately)
from time to time.

> So the process has been in D state ever since I wrote the first mail, just
> for 100MB of writes. Even if it were still making progress, it would be
> extremely slow. Sysrq-w then shows:

So it's normal to catch such a trace 99% of the time. But do you mean the
writeout bandwidth is lower than expected?
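[Editor's aside: the control flow under discussion can be modeled as a small
userspace sketch. Everything below is hypothetical (names, numbers, the
`bdi_model` struct); the real kernel derives its thresholds from
vm.dirty_ratio and friends. The sketch only illustrates the shape of the
problem: if bdi_nr_reclaimable never exceeds bdi_thresh, the task itself
never writes anything back, so it can only leave the loop once something
else, i.e. a per-bdi flusher thread, drains the dirty counts.]

```c
#include <stdbool.h>

/* Hypothetical, userspace model of the balance_dirty_pages() loop
 * discussed in this thread. Not kernel code. */
struct bdi_model {
	long nr_reclaimable;	/* dirty pages attributed to this bdi */
	long thresh;		/* per-bdi dirty threshold */
	long global_dirty;	/* total dirty pages in the system */
	long global_thresh;	/* global dirty threshold */
	bool flusher_running;	/* does a per-bdi flusher thread exist? */
};

/* Returns the number of iterations until the dirtier is unthrottled,
 * or -1 if it would block "forever" (capped here for the model). */
static int balance_dirty_pages_model(struct bdi_model *bdi, long write_chunk)
{
	long pages_written = 0;

	for (int iter = 0; iter < 1000; iter++) {
		if (bdi->global_dirty <= bdi->global_thresh)
			return iter;	/* below the limits: unthrottled */

		if (bdi->nr_reclaimable > bdi->thresh)
			pages_written += write_chunk; /* writeback_inodes_wb() */

		if (pages_written >= write_chunk)
			return iter;	/* "We've done our duty" */

		/* io_schedule_timeout(): sleep and hope the flusher
		 * thread cleans pages in the background. */
		if (bdi->flusher_running)
			bdi->global_dirty -= write_chunk;
	}
	return -1;	/* no writeout path ever made progress */
}
```

With flusher_running = false and nr_reclaimable below thresh, the model
returns -1, matching the observed permanent D state; flipping
flusher_running to true lets global_dirty drain and the task leave the
loop, matching the behavior after the bdi was registered.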
> > [ 6727.616976] SysRq : Show Blocked State
> > [ 6727.617575]   task                        PC stack   pid father
> > [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
> > [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> > [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> > [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> > [ 6727.620466] Call Trace:
> > [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> > [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> > [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
> > [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> > [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> > [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> > [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> > [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> > [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> > [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> > [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> > [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> > [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> > [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> > [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> > [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> > [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> > [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
> > [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> > [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
> >
> >
> >> Once I added basic inode->i_data.backing_dev_info bdi support to our
> >> file system, the deadlock did not happen anymore.
> >
> > What's the workload and change, exactly?
>
> I wish I could simply send the patch, but until all the paper work is
> done I'm not allowed to :(
>
> The basic idea is:
>
> 1) During mount, when setting up the super block from
>
> static struct file_system_type fhgfs_fs_type =
> {
> 	.mount	= fhgfs_mount,
> };
>
> Then in fhgfs_mount():
>
> 	bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
> 	sb->s_bdi = &sbInfo->bdi;
>
>
> 2) When new (S_IFREG) inodes are allocated, for example from
>
> static struct inode_operations fhgfs_dir_inode_ops =
> {
> 	.lookup,
> 	.create,
> 	.link
> };
>
> 	inode->i_data.backing_dev_info = &sbInfo->bdi;

Ah, when you didn't register the "fhgfs" bdi, there was no dedicated flusher
thread doing the writeout. Which is obviously suboptimal.

> >> So my question is simply if we should expect this deadlock if the file
> >> system does not set up backing device information, and if so, shouldn't
> >> this be documented?
> >
> > Such a deadlock is not expected..
>
> Ok thanks, then we should figure out why it happens. Due to a network
> outage here I won't have time before Monday to track down which kernel
> version introduced it, though.

I suspect it dates back to when per-bdi writeback was introduced, a long
time ago.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html