Re: deadlock balance_dirty_pages() to be expected?

On 10/07/2011 04:21 PM, Wu Fengguang wrote:
On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
Hello Fengguang,

On 10/07/2011 03:37 PM, Wu Fengguang wrote:
Hi Bernd,

On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
Hello,

while working on the page-cache mode in FhGFS (*), I noticed a
deadlock in balance_dirty_pages().

sysrq-w showed that it never started background write-out due to

if (bdi_nr_reclaimable > bdi_thresh) {
	pages_written += writeback_inodes_wb(&bdi->wb,
					     write_chunk);


and therefore also did not leave that loop with

	if (pages_written >= write_chunk)
		break;		/* We've done our duty */


So my process stays in uninterruptible D-state forever.

If writeback_inodes_wb() is not triggered, the process should still be
able to proceed, presumably with longer delays, but it should never be
stuck forever.
That's because the flusher thread should still be cleaning the pages
in the background which will knock down the dirty pages and eventually
unthrottle the dirtier process.
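
For context, the surrounding loop in mm/page-writeback.c looks roughly
like this in 3.1 (a heavily trimmed sketch, not verbatim source):

	for (;;) {
		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
				 global_page_state(NR_UNSTABLE_NFS);
		nr_writeback = global_page_state(NR_WRITEBACK);

		global_dirty_limits(&background_thresh, &dirty_thresh);

		/* Once the flusher has cleaned enough pages, the dirtier
		 * leaves the loop here even if it never managed to write
		 * anything itself. */
		if (nr_reclaimable + nr_writeback <=
				(background_thresh + dirty_thresh) / 2)
			break;

		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
		...

		if (bdi_nr_reclaimable > bdi_thresh) {
			pages_written += writeback_inodes_wb(&bdi->wb,
							     write_chunk);
			if (pages_written >= write_chunk)
				break;		/* We've done our duty */
		}

		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(pause);
	}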

Hmm, that does not seem to work:

1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M
count=100

That's normal: dd will be in D state the vast majority of the time, but
the point is, one single balance_dirty_pages() call should not take
forever, and dd should be able to get out of the D state (and
re-enter it almost immediately) from time to time.

So the process has been in D state ever since I wrote the first mail,
just for 100MB of writes. Even if it were still doing something, it
would be extremely slow. Sysrq-w then shows:

So it's normal to catch such a trace 99% of the time.  But do you mean
the writeout bandwidth is lower than expected?

If it really is still doing something, it is *way* slower. Once I added bdi support, it finished writing the 100MB file in my kvm test instance within a few seconds. Right now it has been running for hours already... As I added a dump_stack() to our writepages() method, I also see that this function is never called.


[ 6727.616976] SysRq : Show Blocked State
[ 6727.617575]   task                        PC stack   pid father
[ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
[ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
[ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
[ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
[ 6727.620466] Call Trace:
[ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
[ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
[ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
[ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
[ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
[ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
[ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
[ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
[ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
[ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
[ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
[ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
[ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
[ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
[ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
[ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
[ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
[ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
[ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
[ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
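
The dump_stack() sits at the top of our writepages() method, roughly
like this (the callback name and body here are only illustrative, not
the actual FhGFS code):

	static int FhgfsOps_writepages(struct address_space *mapping,
				       struct writeback_control *wbc)
	{
		dump_stack();	/* never printed during the hung dd run */

		/* ... hand the dirty pages over to the FhGFS client ... */
		return 0;
	}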



Once I added basic inode->i_data.backing_dev_info bdi support to our
file system, the deadlock no longer happened.

What's the workload and change exactly?

I wish I could simply send the patch, but until all the paperwork is
done I'm not allowed to :(

The basic idea is:

1) During mount, when setting up the super block from

static struct file_system_type fhgfs_fs_type =
{
	.mount = fhgfs_mount,
	/* ... */
};

Then in fhgfs_mount():

bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
sb->s_bdi = &sbInfo->bdi;
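
(bdi_setup_and_register() can fail, so the real code presumably also
checks the return value; a minimal sketch, with the error path being my
assumption rather than the actual patch:)

	err = bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
	if (err)
		return ERR_PTR(err);	/* .mount returns a struct dentry * */
	sb->s_bdi = &sbInfo->bdi;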



2) When new (S_IFREG) inodes are allocated, for example from

static struct inode_operations fhgfs_dir_inode_ops =
{
	.lookup = ...,
	.create = ...,
	.link   = ...,
};

inode->i_data.backing_dev_info = &sbInfo->bdi;
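
(The counterpart on unmount is not shown here; the usual pattern,
assumed rather than taken from the actual patch, would be:)

	/* in the ->kill_sb() / ->put_super() path, after the super
	 * block is torn down: */
	bdi_destroy(&sbInfo->bdi);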

Ah, if you didn't register the "fhgfs" bdi, there would be no
dedicated flusher thread doing the writeout.  Which is obviously
suboptimal.

So my question is simply whether we should expect this deadlock if the
file system does not set up backing device information, and if so,
shouldn't this be documented?

Such a deadlock is not expected...

Ok thanks, then we should figure out why it happens. Due to a network
outage here I won't have time before Monday to track down which kernel
version introduced it, though.

It was a long time ago that per-bdi writeback was introduced, I suspect.

Ok, I can start by testing whether 2.6.32 already deadlocks as well.

Thanks,
Bernd

