On Thu, Sep 24, 2009 at 09:29:50PM +0800, Jens Axboe wrote:
> On Thu, Sep 24 2009, Wu Fengguang wrote:
> > On Thu, Sep 24, 2009 at 08:35:19PM +0800, Jens Axboe wrote:
> > > On Thu, Sep 24 2009, Wu Fengguang wrote:
> > > > On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > > > > __mark_inode_dirty adds inodes to the wb dirty list in random
> > > > > order. If a disk has several partitions, writeback might keep the
> > > > > spindle moving between partitions. To reduce that movement, it is
> > > > > better to write a big chunk of one partition and then move to
> > > > > another. Inodes from one fs are usually in one partition, so
> > > > > ideally moving inodes from one fs together should reduce spindle
> > > > > movement. This patch tries to address this. Before per-bdi
> > > > > writeback was added, the behavior was to write inodes from one fs
> > > > > first and then another, so the patch restores the previous
> > > > > behavior. The loop in the patch is a bit ugly; should we add a
> > > > > dirty list for each superblock in bdi_writeback?
> > > > >
> > > > > A test on a two-partition disk with the attached fio script shows
> > > > > about 3% ~ 6% improvement.
> > > >
> > > > A side note: given the noticeable performance gain, I wonder if it
> > > > is worth generalizing the idea to do whole-disk location-ordered
> > > > writeback. That should benefit many small-file workloads by more
> > > > than 10%, because this patch only sorts 2 partitions and inodes in
> > > > a 5s time window, while the patch below would roughly divide the
> > > > disk into 5 areas and sort inodes in a larger 25s time window:
> > > >
> > > > http://lkml.org/lkml/2007/8/27/45
> > > >
> > > > Judging from this old patch, the complexity cost would be about 250
> > > > lines of code (it needs an rbtree).
> > >
> > > First of all, nice patch, I'll add it to the current tree. I too was
> >
> > You mean Shaohua's patch? It should be a good addition for 2.6.32.
>
> Yes indeed, the parent patch.
>
> > In the long term, move_expired_inodes() needs some rework, because it
> > could be time-consuming to move around all the inodes in a large
> > system, and thus hold inode_lock for too long (and this patch scales
> > up the locked time).
>
> It does. As mentioned in my reply, for 100 inodes or less, it will
> still be faster than eg using an rbtree. But the more "reliable"
> runtime of an rbtree based solution is appealing. It's not hugely
> critical, though.

Agreed. Desktops are not a big worry, and servers rarely have many
partitions per disk.

> > So we would need to split the list moves into smaller pieces in the
> > future, or change the data structure.
>
> Yes, those are the two options.
>
> > > pondering using an rbtree for sb+dirty_time insertion and extraction.

Note that dirty_time may not be unique, so some workaround is needed.
And the resulting rbtree implementation may not be more efficient than
several list traversals, even for a very large list (as long as the
number of superblocks stays low).

The good side is that once the sb+dirty_time rbtree is implemented, it
should be trivial to switch the key to sb+inode_number (which also may
not be unique), and so do location-ordered writeback ;)

Thanks,
Fengguang

> > FYI Michael Rubin did some work on an rbtree implementation, just in
> > case you are interested:
> >
> > http://lkml.org/lkml/2008/1/15/25
>
> Thanks, I'll take a look.
>
> --
> Jens Axboe
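
[For illustration, a minimal sketch of the sb+dirty_time rbtree
discussed above, against the kernel's <linux/rbtree.h> API. It is not
taken from any of the patches referenced in the thread; the dirty_node
wrapper and the function names are made up here (a real patch would
more likely embed the rb_node in struct inode, as in Michael Rubin's
work). Non-unique keys get the usual workaround of sending equal keys
to one side, so duplicates simply coexist in the tree.]

#include <linux/rbtree.h>
#include <linux/fs.h>
#include <linux/jiffies.h>

/* Hypothetical wrapper: one tree node per dirty inode, keyed by
 * (i_sb, dirtied_when). */
struct dirty_node {
	struct rb_node rb;
	struct inode *inode;
};

/* Order by superblock first, then by dirty time within each sb.
 * Never returns 0, so equal keys are allowed in the tree. */
static int dirty_cmp(struct inode *a, struct inode *b)
{
	if (a->i_sb != b->i_sb)
		return a->i_sb < b->i_sb ? -1 : 1;
	return time_before(a->dirtied_when, b->dirtied_when) ? -1 : 1;
}

static void dirty_tree_insert(struct rb_root *root, struct dirty_node *new)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct dirty_node *entry =
			rb_entry(*p, struct dirty_node, rb);

		parent = *p;
		if (dirty_cmp(new->inode, entry->inode) < 0)
			p = &(*p)->rb_left;
		else
			/* equal keys fall right: duplicates allowed */
			p = &(*p)->rb_right;
	}
	rb_link_node(&new->rb, parent, p);
	rb_insert_color(&new->rb, root);
}

[Extraction would then be an in-order walk with rb_first()/rb_next(),
which visits all dirty inodes of one superblock before moving to the
next, oldest first within each sb. Switching to sb+inode_number (or
block number) for location-ordered writeback would only mean changing
dirty_cmp().]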