On Nov 18, 2008 21:40 -0500, Theodore Ts'o wrote:
> Looking at the blkparse profiles, doing an rm -rf given the ext4
> produced layout required 5130 megabytes.  The exact same directory
> hierarchy, as laid out by ext3, required only 1294 megabytes.
>
> Looking at a few selected inode allocation bitmaps, we see that ext4
> will often need to write (and thus journal) the same block allocation
> bitmap block 4 or 5 times:
>
> 254,7    0      352     0.166492349   9376  C  R 8216 + 8 [0]
> 254,7    0   348788   212.885545554      0  C  W 8216 + 8 [0]
> 254,7    0   461448   309.533613765      0  C  W 8216 + 8 [0]
> 254,7    0   827687   558.781690434      0  C  W 8216 + 8 [0]
> 254,7    0  1210492   760.738217014      0  C  W 8216 + 8 [0]
>
> However, the same block allocation bitmap block is only written once
> or twice:
>
> 254,8    0     3119     9.535331283      0  C  R 524288 + 8 [0]
> 254,8    0    24504    45.253431031      0  C  W 524288 + 8 [0]
> 254,8    0    85476   144.455205555  23903  C  W 524288 + 8 [0]

Looking at the seekwatcher graphs, it is clear that the ext4 layout is
doing fewer seeks and packing the data into a smaller part of the
filesystem, which is counter-intuitive given the performance result.
Even though the IO bandwidth is ostensibly higher (usually a good thing
on metadata benchmarks), that doesn't help if we are doing many more
writes in total.

It isn't immediately clear that _just_ rewriting the same block multiple
times is the culprit in itself, because in the ext3 case there would be
more block bitmaps affected, _each_ written out 1 or 2 times, while the
closer packing of ext4 allocations results in fewer total bitmaps being
used.  One would think that more sharing of a block bitmap would result
in a performance _increase_, because there is a better chance that it
will be re-used within the same transaction.

> ext4:
> Reads Completed:    59947,  239788KiB
> Writes Completed:   1282K,    5130MiB
>
> ext3:
> Reads Completed:    64856,  259424KiB
> Writes Completed:  323582,    1294MiB

The reads look about the same; the writes are 4x higher for ext4.

What would be useful to examine is the inode number grouping of files
in the same subdirectory, along with the blocks they are allocating.
It seems like the inodes are being packed more closely together, but
the blocks (and hence the block bitmap writes) are spread further apart.

That may be a side-effect of the mballoc per-CPU cache again, where
files being written in the same subdirectory are spread apart because
the writing thread is rescheduled onto different cores.  I discussed
this in the past with Eric, in the case of a file doing small
writes+fsync, where the blocks were fragmented needlessly between
different parts of the filesystem.

The proposed solution in that case (which Aneesh could probably fix
quickly) is to attach an inode to the per-CPU preallocation group on
the first write (for small files).  If it doesn't get any more writes,
that is fine, but if it does, then the same PA would be used for
further allocations regardless of which CPU is doing the IO.

Another solution for that case, and (I speculate) for this case, is to
attach the PA to the parent directory and have all small files in the
same directory use that PA.  This would ensure that blocks allocated to
small inodes in the same directory are kept together.  The drawback is
that this could hurt performance when multiple threads write to the
same directory.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
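
As an aside, the repeated bitmap writes quoted above can be tallied
directly from the raw blkparse text output.  The sketch below is only
illustrative (it is not a tool used in this thread) and assumes the
default blkparse field order shown in the quoted traces, i.e. device,
cpu, sequence, timestamp, pid, action, RWBS, sector, "+", blocks,
process:

    #!/usr/bin/env python
    # count_rewrites.py - tally how many times each sector is written,
    # based on 'C' (complete) events in blkparse text output.
    # Illustrative sketch only; assumes the field order shown in the
    # traces quoted above.
    import sys
    from collections import Counter

    def count_write_completions(lines):
        """Return a Counter mapping sector -> number of completed writes."""
        writes = Counter()
        for line in lines:
            fields = line.split()
            if len(fields) < 8:
                continue                # skip blank/summary lines
            action, rwbs, sector = fields[5], fields[6], fields[7]
            if action == 'C' and 'W' in rwbs:
                writes[sector] += 1
        return writes

    if __name__ == '__main__':
        counts = count_write_completions(sys.stdin)
        # Report sectors written more than once, e.g. bitmap blocks that
        # get journalled repeatedly during the rm -rf.
        for sector, n in sorted(counts.items(), key=lambda kv: -kv[1]):
            if n > 1:
                print("%10s written %d times" % (sector, n))

Piping the parsed trace into it (e.g. "blkparse -i <trace> | python
count_rewrites.py") would list sectors such as 8216 above that are
written repeatedly, making the write-amplification easy to compare
between the ext3 and ext4 layouts.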