Miklos Szeredi <miklos@xxxxxxxxxx> and me identified a writeback bug: > The following strange behavior can be observed: > > 1. large file is written > 2. after 30 seconds, nr_dirty goes down by 1024 > 3. then for some time (< 30 sec) nothing happens (disk idle) > 4. then nr_dirty again goes down by 1024 > 5. repeat from 3. until whole file is written > > So basically a 4Mbyte chunk of the file is written every 30 seconds. > I'm quite sure this is not the intended behavior. It can be produced by the following test scheme: # cat bin/test-writeback.sh grep nr_dirty /proc/vmstat echo 1 > /proc/sys/fs/inode_debug dd if=/dev/zero of=/var/x bs=1K count=204800& while true; do grep nr_dirty /proc/vmstat; sleep 1; done # bin/test-writeback.sh nr_dirty 19207 nr_dirty 19207 nr_dirty 30924 204800+0 records in 204800+0 records out 209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s nr_dirty 47150 nr_dirty 47141 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47205 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47215 nr_dirty 47216 nr_dirty 47216 nr_dirty 47216 nr_dirty 47154 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47134 nr_dirty 47134 nr_dirty 47135 nr_dirty 47135 nr_dirty 47135 nr_dirty 46097 <== -1038 nr_dirty 46098 nr_dirty 46098 nr_dirty 46098 [...] nr_dirty 46091 nr_dirty 46092 nr_dirty 46092 nr_dirty 45069 <== -1023 nr_dirty 45056 nr_dirty 45056 nr_dirty 45056 [...] nr_dirty 37822 nr_dirty 36799 <== -1023 [...] nr_dirty 36781 nr_dirty 35758 <== -1023 [...] nr_dirty 34708 nr_dirty 33672 <== -1024 [...] nr_dirty 33692 nr_dirty 32669 <== -1023 % ls -li /var/x 847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x % dmesg|grep 847824 # generated by a debug printk [ 529.263184] redirtied inode 847824 line 548 [ 564.250872] redirtied inode 847824 line 548 [ 594.272797] redirtied inode 847824 line 548 [ 629.231330] redirtied inode 847824 line 548 [ 659.224674] redirtied inode 847824 line 548 [ 689.219890] redirtied inode 847824 line 548 [ 724.226655] redirtied inode 847824 line 548 [ 759.198568] redirtied inode 847824 line 548 # line 548 in fs/fs-writeback.c: 543 if (wbc->pages_skipped != pages_skipped) { 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 547 */ 548 redirty_tail(inode); 549 } More debug efforts show that __block_write_full_page() never has the chance to call submit_bh() for that big dirty file: the buffer head is *clean*. So basicly no page io is issued by __block_write_full_page(), hence pages_skipped goes up. This patch fixes this bug. I'm not quite sure about it. But at least the comment in generic_sync_sb_inodes(): 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 547 */ and the comment in __block_write_full_page(): 1713 /* 1714 * The page was marked dirty, but the buffers were 1715 * clean. Someone wrote them back by hand with 1716 * ll_rw_block/submit_bh. A rare case. 1717 */ do not quite agree with each other. The page writeback is skipped not because of 'locked buffer', but 'clean buffer'. This is the new behavior after the patch: # bin/test-writeback.sh nr_dirty 60 847824 /var/x nr_dirty 60 nr_dirty 31139 204800+0 records in 204800+0 records out 209715200 bytes (210 MB) copied, 1.55338 seconds, 135 MB/s nr_dirty 47137 nr_dirty 46147 nr_dirty 46147 nr_dirty 46147 nr_dirty 46148 nr_dirty 46148 nr_dirty 46148 nr_dirty 46148 nr_dirty 46193 nr_dirty 46193 nr_dirty 46193 nr_dirty 46193 nr_dirty 46126 nr_dirty 46126 nr_dirty 46126 nr_dirty 46126 nr_dirty 46126 nr_dirty 46109 nr_dirty 46109 nr_dirty 46109 nr_dirty 46113 nr_dirty 46113 nr_dirty 46106 nr_dirty 46106 nr_dirty 46106 nr_dirty 46106 nr_dirty 46106 nr_dirty 46089 nr_dirty 46089 nr_dirty 46090 nr_dirty 46093 nr_dirty 46093 nr_dirty 15 nr_dirty 15 nr_dirty 15 nr_dirty 15 It is pretty numbers: wait 30s and write ALL:) But another run is not so good: # sh bin/test-writeback.sh mount: proc already mounted nr_dirty 223 nr_dirty 223 nr_dirty 23664 204800+0 records in 204800+0 records out 209715200 bytes (210 MB) copied, 1.51092 seconds, 139 MB/s nr_dirty 47299 nr_dirty 47271 nr_dirty 47260 nr_dirty 47260 nr_dirty 47267 nr_dirty 47267 nr_dirty 47329 nr_dirty 47352 nr_dirty 47352 nr_dirty 47352 nr_dirty 47352 nr_dirty 47352 nr_dirty 47352 nr_dirty 47352 nr_dirty 47352 nr_dirty 47352 nr_dirty 47606 nr_dirty 47604 nr_dirty 47604 nr_dirty 47604 nr_dirty 47604 nr_dirty 47604 nr_dirty 47604 nr_dirty 47604 nr_dirty 47604 nr_dirty 47604 nr_dirty 47480 nr_dirty 47492 nr_dirty 47492 nr_dirty 47492 nr_dirty 47492 nr_dirty 46470 nr_dirty 46473 nr_dirty 46473 nr_dirty 46473 nr_dirty 46473 nr_dirty 45428 nr_dirty 45435 nr_dirty 45436 nr_dirty 45436 nr_dirty 45436 nr_dirty 257 nr_dirty 259 nr_dirty 259 nr_dirty 259 nr_dirty 259 nr_dirty 16 nr_dirty 16 nr_dirty 16 nr_dirty 16 nr_dirty 16 Basicly they are - during the dd: ~16M - after 30s: ~4M - after 5s: ~4M - after 5s: ~176M The box has 2G memory. Question 1: How come the 5s delays? I run 4 tests in total, 2 of which have such 5s delays. Question 2: __block_write_full_page() is virtually doing nothing for the whole dirty file. Isn't it abnormal? Who did the actual write back for us? The jounal? How to fix it? Any suggestions? Thank you. Cc: Ken Chen <kenchen@xxxxxxxxxx> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> Signed-off-by: Fengguang Wu <wfg@xxxxxxxxxxxxxxxx> --- fs/buffer.c | 1 - 1 file changed, 1 deletion(-) --- linux-2.6.23-rc2-mm2.orig/fs/buffer.c +++ linux-2.6.23-rc2-mm2/fs/buffer.c @@ -1713,7 +1713,6 @@ done: * The page and buffer_heads can be released at any time from * here on. */ - wbc->pages_skipped++; /* We didn't write this page */ } return err; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html