On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> On Sun, 12 Aug 2007 17:11:20 +0800
> Fengguang Wu <wfg@xxxxxxxxxxxxxxxx> wrote:
>
> > Andrew and Ken,
> >
> > Here are some more experiments on the writeback stuff.
> > Comments are highly welcome~
>
> I've been doing benchmarks lately to try and trigger fragmentation, and
> one of them is a simulation of make -j N. It takes a list of all
> the .o files in the kernel tree, randomly sorts them and then
> creates bogus files with the same names and sizes in clean kernel trees.
>
> This is basically creating a whole bunch of files in random order in a
> whole bunch of subdirectories.
>
> The results aren't pretty:
>
> http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
>
> The top graph shows one dot for each write over time. It shows that
> ext3 is basically writing all over the place the whole time. But, ext3
> actually wins the read phase, so the layout isn't horrible. My guess
> is that if we introduce some write clustering by sending a group of
> inodes down at the same time, it'll go much much better.
>
> Andrew has mentioned bringing a few radix trees into the writeback paths
> before, it seems like file servers and other general uses will benefit
> from better clustering here.
>
> I'm hoping to talk you into trying it out ;)

Thank you for describing the problem. So far I have a similar one in
mind: if we are to delay writeback of atime-dirty-only inodes for more
than 1 hour, some grouping/piggy-backing scheme would be beneficial.
(Which I guess does not deserve the complexity now that we have Ingo's
make-relatime-default patch.)

My vague idea is to
- keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching queue.
- convert s_dirty to some radix-tree/rbtree based data structure.
  It would have dual functions: delayed-writeback and clustered-writeback.

clustered-writeback:
- Use inode number as clue of locality, hence the key for the sorted tree.
- Drain some more s_dirty inodes into s_io on every kupdate wakeup,
  but do it in the ascending order of inode number instead of
  ->dirtied_when.

delayed-writeback:
- Make sure that a full scan of the s_dirty tree takes <=30s,
  i.e. dirty_expire_interval.

Notes:
(1) I'm not sure inode number is correlated to disk location in
    filesystems other than ext2/3/4. Or parent dir?
(2) It duplicates some function of elevators. Why is it necessary?
    Maybe we have no clue on the exact data location at this time?

Fengguang
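
For illustration only, the clustered/delayed scheme above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not kernel code and not a proposed patch: the names DirtyInodeTree, mark_dirty and drain are hypothetical, a sorted Python list stands in for the radix tree/rbtree, and the 5 s wakeup interval is an assumed value (only the 30 s dirty_expire_interval comes from the text above).

```python
import bisect
import math


class DirtyInodeTree:
    """Toy model of an s_dirty set kept sorted by inode number (hypothetical)."""

    def __init__(self, expire_interval=30.0, wakeup_interval=5.0):
        # Sorted list of dirty inode numbers; stands in for a radix tree/rbtree.
        self.inos = []
        # Scan position: the next sweep step starts at the first key >= cursor.
        self.cursor = 0
        # Number of wakeups a full scan may take (assumed: 30s / 5s = 6).
        self.wakeups_per_sweep = max(1, int(expire_interval // wakeup_interval))
        self.wakeups_left = self.wakeups_per_sweep

    def mark_dirty(self, ino):
        """Insert an inode number, keeping the list sorted and duplicate-free."""
        i = bisect.bisect_left(self.inos, ino)
        if i == len(self.inos) or self.inos[i] != ino:
            self.inos.insert(i, ino)

    def drain(self):
        """One kupdate-style wakeup: return the next batch in ascending order.

        The batch is sized so that everything ahead of the cursor is gone
        by the time the current sweep runs out of wakeups.
        """
        if self.wakeups_left == 0:
            # Previous sweep finished; restart from the lowest inode number.
            self.cursor = 0
            self.wakeups_left = self.wakeups_per_sweep
        start = bisect.bisect_left(self.inos, self.cursor)
        ahead = len(self.inos) - start
        batch_size = math.ceil(ahead / self.wakeups_left)
        batch = self.inos[start:start + batch_size]
        del self.inos[start:start + batch_size]
        self.wakeups_left -= 1
        if batch:
            self.cursor = batch[-1] + 1
        return batch  # in the scheme above, these would move to s_io
```

Each call to drain() plays the role of one wakeup. Inodes dirtied behind the cursor are left for the next sweep, so the model only bounds the length of one full scan; it says nothing about whether inode number actually tracks disk location, which is the open question in note (1).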