On Thursday June 6, neilb@cse.unsw.edu.au wrote:
>
> I'm not in a position at the moment to spend time testing whether the
> inefficacy of bdflush is ext3 specific or applies to ext2 as well.  I
> may try that next week (we have a long weekend coming up).  If it
> applies to ext2, it could be a showstopper for 2.4.19....

Actually, I did find some time.

I added the attached patch to 2.4.19-pre8 + ext3 0.9.18.  It checks the
"age" of buffers being written by bdflush and shows me the high water
mark.  By "age" I mean the time between b_flushtime and now, when now
is after b_flushtime.  This is the "age-since-it-should-have-been-flushed"
rather than the "age-since-it-was-dirtied".  The high water mark is
maintained separately for each scsi drive and decays at 1 second in
every 5, so that I can get a good idea of what is happening.

What should I see?  Given the default bdflush settings of flushing
30-second-old buffers every 5 seconds, I should see buffers with "ages"
of up to 5 seconds being flushed.  Also, given that lru_list[BUF_DIRTY]
is kept in oldest-first order, the first buffer on the list should set
the high water mark, and the remaining buffers should not cause a
message to be printed, at least until the next run of
write_some_buffers, by which time the high water mark will have
decayed.

And this is exactly what I see when using ext2:

Jun  6 12:22:35 elfman kernel: buffer on 4 has age 325
Jun  6 12:22:36 elfman kernel: buffer on 0 has age 32
Jun  6 12:22:41 elfman kernel: buffer on 4 has age 498
Jun  6 12:22:41 elfman kernel: buffer on 0 has age 57
Jun  6 12:22:46 elfman kernel: buffer on 4 has age 487
Jun  6 12:22:51 elfman kernel: buffer on 4 has age 446
Jun  6 12:22:56 elfman kernel: buffer on 4 has age 454
Jun  6 12:23:01 elfman kernel: buffer on 4 has age 486
Jun  6 12:23:06 elfman kernel: buffer on 4 has age 466
Jun  6 12:23:06 elfman kernel: buffer on 0 has age 57
Jun  6 12:23:11 elfman kernel: buffer on 4 has age 401
Jun  6 12:23:16 elfman kernel: buffer on 4 has age 440
Jun  6 12:23:16 elfman kernel: buffer on 0 has age 10
Jun  6 12:23:21 elfman kernel: buffer on 4 has age 411
Jun  6 12:23:46 elfman kernel: buffer on 4 has age 118
Jun  6 12:23:46 elfman kernel: buffer on 0 has age 4
Jun  6 12:23:51 elfman kernel: buffer on 4 has age 499

"0" == sda, which holds my root filesystem (ext2 at this point), and
"4" == sde, which holds a test ext2 filesystem.

But with ext3 I see very different numbers:

Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9511
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9512
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9513
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9514
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9515
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9516
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9517
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9518
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9520
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9521
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9529
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9530
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9531
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9532
Jun  6 12:04:44 elfman kernel: buffer on 5 has age 9533

"5" == sdf, which holds a test ext3 filesystem.

Two things to note:

 1/ Ages of up to 95 seconds (the numbers are in jiffies).  This is the
    worst I got in a little bit of testing; 20-30 seconds was quite
    common.
 2/ Consecutive buffers have increasing age: young buffers are in
    front of old buffers in the lru.
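To spell out the invariant behind that expectation (and what point 2/
violates): with an oldest-first list, a scan that stops at the first
buffer not yet due can only ever see the head of the list exceed the
high water mark.  Here is a toy user-space sketch of that; struct buf
and scan_dirty are made-up names for illustration, not the real
fs/buffer.c structures:

#include <stdio.h>

struct buf {
	unsigned long flushtime;	/* when this buffer becomes due */
	struct buf *next;
};

/* Walk an oldest-first list, "flushing" everything already due. */
static void scan_dirty(struct buf *head, unsigned long now,
		       unsigned long *high_water)
{
	for (struct buf *b = head; b; b = b->next) {
		if (b->flushtime > now)
			break;	/* ordered list: nothing later is due */
		if (now - b->flushtime > *high_water) {
			printf("buffer has age %lu\n", now - b->flushtime);
			*high_water = now - b->flushtime;
		}
	}
}

int main(void)
{
	/* oldest-first: due at 500, 700, 900; "now" is 1000 */
	struct buf c = { 900, NULL }, b = { 700, &c }, a = { 500, &b };
	unsigned long hw = 0;

	scan_dirty(&a, 1000, &hw);	/* only the head (age 500) prints */
	return 0;
}

The ext2 trace above matches this; the ext3 trace, where consecutive
buffers have increasing age, shows the ordering assumption no longer
holding.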
(I think having a large journal, or data=writeback, will display the
largest times.)

I tried the same test on 2.4.16 and got results like:

Jun  6 13:09:30 elfman kernel: buffer on 5 has age 529
Jun  6 13:09:31 elfman kernel: buffer on 5 has age 558
Jun  6 13:09:31 elfman kernel: buffer on 5 has age 560
Jun  6 13:09:31 elfman kernel: buffer on 5 has age 561
Jun  6 13:09:31 elfman kernel: buffer on 5 has age 563
Jun  6 13:09:36 elfman kernel: buffer on 5 has age 475
Jun  6 13:09:37 elfman kernel: buffer on 5 has age 471
Jun  6 13:09:37 elfman kernel: buffer on 5 has age 472
Jun  6 13:09:37 elfman kernel: buffer on 5 has age 475
Jun  6 13:09:37 elfman kernel: buffer on 5 has age 476
Jun  6 13:09:37 elfman kernel: buffer on 5 has age 477
Jun  6 13:09:37 elfman kernel: buffer on 5 has age 478
Jun  6 13:09:37 elfman kernel: buffer on 5 has age 488
Jun  6 13:09:37 elfman kernel: buffer on 5 has age 489

I never got the same loooong times (9 seconds was the max), but the
buffers are still in the wrong order.

My guess is that ext3 (actually jbd) is making buffers young without
pushing them to the end of the BUF_DIRTY queue, possibly by calling
set_buffer_flushtime (jbd is the only caller).  This causes bdflush to
stop processing when it sees these buffers, and to leave older buffers
that are later in the queue unflushed (there is a sketch of this after
the patch below).

I don't know why the effect became so much worse in the latest code,
but it definitely did.  In real life, my servers are occasionally
holding dirty buffers that must be over 10 minutes old, and that are
playing pin-the-tail-on-the-journal.

NeilBrown

---------------------------

--- fs/buffer.c	2002/05/26 23:13:05	1.2
+++ fs/buffer.c	2002/06/06 01:41:33
@@ -219,6 +219,28 @@
 			__refile_buffer(bh);
 			get_bh(bh);
 			array[count++] = bh;
+			if (MAJOR(bh->b_dev) == 8 && (MINOR(bh->b_dev)>>4) < 10) {
+				static unsigned long age[10];
+				static unsigned long then[10];
+				int i = MINOR(bh->b_dev)>>4;
+				unsigned long j = jiffies;
+				if (time_before(then[i]+20, j)) {
+					if (age[i] > (j-then[i])/5)
+						age[i] -= (j-then[i])/5;
+					else age[i] = 0;
+					then[i] = j;
+				}
+				if (time_before(bh->b_flushtime, j)) {
+					unsigned long a;
+					a = j - bh->b_flushtime;
+					if (a > age[i]) {
+						printk("buffer on %d has age %ld\n",
+							i, a);
+						age[i] = a;
+						then[i] = j;
+					}
+				}
+			}
 			if (count < NRSYNC)
 				continue;
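To make the guess above concrete, here is a toy user-space sketch of
the suspected failure mode, and of what preserving the ordering would
look like.  This is an assumption about jbd's behaviour, not kernel
code; renew_in_place and renew_and_refile are hypothetical names (the
real operations would be set_buffer_flushtime plus a refile to the
tail of lru_list[BUF_DIRTY]):

#include <stdio.h>

struct buf {
	unsigned long flushtime;
	struct buf *next;
};

/* What jbd is suspected of doing: renew the flushtime in place. */
static void renew_in_place(struct buf *b, unsigned long new_flushtime)
{
	b->flushtime = new_flushtime;	/* oldest-first order now violated */
}

/* What would preserve the invariant: renew and move to the tail. */
static void renew_and_refile(struct buf **head, struct buf *b,
			     unsigned long new_flushtime)
{
	struct buf **pp = head;

	while (*pp != b)		/* unlink b ... */
		pp = &(*pp)->next;
	*pp = b->next;
	while (*pp)			/* ... and append it at the tail */
		pp = &(*pp)->next;
	b->flushtime = new_flushtime;
	b->next = NULL;
	*pp = b;
}

int main(void)
{
	struct buf c = { 900, NULL }, b = { 700, &c }, a = { 500, &b };
	struct buf *head = &a;

	renew_in_place(&a, 1500);	/* head now "young": a scan stops
					   here, stranding b and c */
	renew_and_refile(&head, &a, 1500);	/* order restored: 700, 900, 1500 */

	for (struct buf *p = head; p; p = p->next)
		printf("%lu\n", p->flushtime);
	return 0;
}

After renew_in_place alone, a bdflush-style scan stops at the now-young
head and never reaches the buffers due at 700 and 900, however old they
get; renew_and_refile keeps the list oldest-first, so they drain
normally.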