On Dec 17, 2008 06:47 -0500, Theodore Ts'o wrote:
> I've played with this a bit, and changing extents.c to pass in
> EXT4_MB_HINT_DATA for directories does work, although it's a toss-up
> regarding exactly how effective it really is. It does seem to reduce
> fragmentation of directories, but I'm concerned that it might impact
> the long-term performance of the filesystem as it ages.

How can reducing fragmentation of the directories hurt long-term
performance?

> My current thinking is that we should consider changing the block
> allocation algorithms as follows:
>
> 1) Change the inode allocator to strongly avoid (unless no other
> inodes are available) block groups where the block group number is an
> even multiple of the flex blockgroup size. The reasoning behind this
> is these bg's have a fewer number of blocks given that the inode table
> blocks are all allocated there, so they are much more likely to
> overflow into other bg's when used. So we should try to avoid these
> bg's by the inode allocator unless there is no other choice.

With flex_bg, does it really matter at all where the blocks for an inode
are located? There will ALWAYS be a seek between reading the inode and
reading the first data block, so I don't see any significance to whether
the inode's "group" has more free blocks or not.

> 2) Directory blocks for inodes in the flex bg metagroup should be
> allocated in this first bg of the flexbg metagroup. This keeps the
> filesystem metadata together, and keeps directory blocks (which tend
> to be much longer-lived than data blocks, especially for source/build
> directories) in different block allocation regions, which is a good
> thing. It may be that all metadata blocks (i.e., also long symlinks
> and extent-tree blocks) should also be located here, although that's
> probably less important, simply because there are so few of such
> blocks in most ext4 filesystems.

I do agree with this, and if (1) is just a mechanism to ensure that
there is space for (2) then I would tend to agree with that as well.

This would also allow implementation of my long-held idea of using LVM
to put some parts of the filesystem (the metadata) on one type of device
(e.g. RAID-1 and/or SSD), and the rest (data blocks) on RAID-5/6. I had
always thought of doing this by putting the first N MB of each 128 MB
group on the fast storage. Putting the first of each N whole groups on
the fast storage would be equivalent, and probably less work to
configure. Having the allocator also put other metadata there (index and
directory blocks) is a bonus.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
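
The group arithmetic behind proposals (1) and (2), and behind placing
the first group of each flex group on fast storage, can be sketched
roughly as below. This is only an illustration: the function names are
hypothetical and are not taken from the ext4 sources, every group is
assumed to have the same fixed size, and any offset before the first
data block is ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only; none of these names
 * exist in the ext4 code.
 */

/* First block group of the flex group that contains `group`. */
uint32_t flex_first_group(uint32_t group, uint32_t flex_size)
{
    assert(flex_size > 0);
    return group - (group % flex_size);
}

/*
 * Nonzero if `group` is the first group of its flex group, i.e. the
 * group described above as holding the inode table blocks. Proposal (1)
 * would steer inode allocation away from such groups, and proposal (2)
 * would steer directory blocks toward them.
 */
int is_flex_metadata_group(uint32_t group, uint32_t flex_size)
{
    return flex_first_group(group, flex_size) == group;
}

/*
 * Byte offset at which `group` starts, ASSUMING equal-sized groups and
 * no offset before group 0. An LVM layout would map the range starting
 * here (for each metadata group) onto the fast device.
 */
uint64_t group_start_bytes(uint32_t group, uint64_t group_bytes)
{
    return (uint64_t)group * group_bytes;
}
```

For example, with a flex group size of 16 and 128 MB groups, groups 0,
16, 32, ... would be the metadata groups, and group 16 would start at
16 * 128 MB = 2 GB into the device.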