On Wed, 22 Aug 2007 09:18:41 +0800 Fengguang Wu <wfg@xxxxxxxxxxxxxxxx> wrote:

> On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > On Sun, 12 Aug 2007 17:11:20 +0800
> > Fengguang Wu <wfg@xxxxxxxxxxxxxxxx> wrote:
> >
> > > Andrew and Ken,
> > >
> > > Here are some more experiments on the writeback stuff.
> > > Comments are highly welcome~
> >
> > I've been doing benchmarks lately to try and trigger fragmentation,
> > and one of them is a simulation of make -j N.  It takes a list of
> > all the .o files in the kernel tree, randomly sorts them, and then
> > creates bogus files with the same names and sizes in clean kernel
> > trees.
> >
> > This is basically creating a whole bunch of files in random order
> > in a whole bunch of subdirectories.
> >
> > The results aren't pretty:
> >
> > http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
> >
> > The top graph shows one dot for each write over time.  It shows
> > that ext3 is basically writing all over the place the whole time.
> > But ext3 actually wins the read phase, so the layout isn't
> > horrible.  My guess is that if we introduce some write clustering
> > by sending a group of inodes down at the same time, it'll go much,
> > much better.
> >
> > Andrew has mentioned bringing a few radix trees into the writeback
> > paths before; it seems like file servers and other general uses
> > will benefit from better clustering here.
> >
> > I'm hoping to talk you into trying it out ;)
>
> Thank you for the description of the problem.  So far I have a
> similar one in mind: if we are to delay writeback of atime-dirty-only
> inodes for more than an hour, some grouping/piggy-backing scheme
> would be beneficial.  (Which I guess does not deserve the complexity
> now that we have Ingo's make-relatime-default patch.)

Good clustering would definitely help any delayed atime writeback
scheme.

> My vague idea is to
> - keep s_io/s_more_io as a FIFO/cyclic writeback dispatching queue;
> - convert s_dirty to some radix-tree/rbtree based data structure,
>   which would serve dual functions: delayed writeback and clustered
>   writeback.
>
> clustered writeback:
> - Use the inode number as a clue to locality, hence the key for the
>   sorted tree.
> - Drain some more s_dirty inodes into s_io on every kupdate wakeup,
>   but do it in ascending order of inode number instead of
>   ->dirtied_when.
>
> delayed writeback:
> - Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
>   dirty_expire_interval.

I think we should assume a full scan of s_dirty is impossible in the
presence of concurrent writers.  We want to be able to pick a start
time (right now) and find all the inodes older than that start time.
New things will come in while we're scanning.  But perhaps that's what
you're saying...

At any rate, we've got two types of lists now.  One keeps track of age
and the other two keep track of what is currently being written.  I
would try two things:

1) s_dirty stays a list for FIFO.  s_io becomes a radix tree indexed
by inode number (or some arbitrary field the FS can set in the inode).
Radix tree tags are used to indicate which things in s_io are already
in progress and which are pending (hand waving because I'm not sure
exactly).  Inodes are pulled off s_dirty, and the corresponding slot
in s_io is tagged to indicate IO has started.  Any nearby inodes in
s_io are also sent down.

2) s_dirty and s_io both become radix trees.  s_dirty is indexed by a
sequence number that corresponds to age, and is treated as a big
circular indexed list that can wrap around over time.  Radix tree tags
are used on both s_dirty and s_io to flag which inodes are in
progress.
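A minimal sketch of 1), assuming the stock <linux/radix-tree.h> API
(the radix-tree calls are real; the wb_* names, the tag value, and the
locking are made up for illustration, and error handling is omitted):

#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/radix-tree.h>

#define WB_TAG_IN_PROGRESS	0	/* made-up tag: IO submitted */

/* would live in struct super_block next to s_dirty/s_io */
static RADIX_TREE(s_io_tree, GFP_ATOMIC);

/* move an inode from the s_dirty list into the s_io tree;
 * the caller is assumed to hold inode_lock */
static void wb_stage_inode(struct inode *inode)
{
	list_del_init(&inode->i_list);		/* off s_dirty */
	radix_tree_insert(&s_io_tree, inode->i_ino, inode);
}

/* when inode 'ino' is picked for writeback, sweep up the inodes
 * right after it too, so the writes are clustered on disk
 * (a real version would probably scan in both directions) */
static void wb_write_cluster(unsigned long ino)
{
	struct inode *batch[16];
	unsigned int i, nr;

	nr = radix_tree_gang_lookup(&s_io_tree, (void **)batch,
				    ino, ARRAY_SIZE(batch));
	for (i = 0; i < nr; i++) {
		radix_tree_tag_set(&s_io_tree, batch[i]->i_ino,
				   WB_TAG_IN_PROGRESS);
		/* hand batch[i] to the normal writeback path here */
	}
}

For 2), s_dirty would become a second tree declared the same way but
keyed by a wrapping sequence number instead of i_ino, e.g. (wb_seq and
wb_mark_dirty are equally made up):

static RADIX_TREE(s_dirty_tree, GFP_ATOMIC);
static unsigned long wb_seq;	/* per-sb dirty sequence counter */

static void wb_mark_dirty(struct inode *inode)
{
	/* ascending key == FIFO age order; wraps like jiffies */
	radix_tree_insert(&s_dirty_tree, wb_seq++, inode);
}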
> Notes:
> (1) I'm not sure the inode number is correlated with disk location
>     in filesystems other than ext2/3/4.  Or the parent dir?

In general, it is a better assumption than sorting by time.  It may
make sense to one day let the FS provide a clustering hint
(corresponding to the first block in the file?), but for starters it
makes sense to just go with the inode number.

> (2) It duplicates some function of the elevators.  Why is it
>     necessary?  Maybe we have no clue about the exact data location
>     at this time?

The elevator can only sort the pending IO, and we send down a
relatively small window of all the dirty pages at a time.  If we sent
down all the dirty pages and let the elevator sort it out, we wouldn't
need this clustering at all.  But that has other issues ;)

-chris