On Nov 18, 2008 21:40 -0500, Theodore Ts'o wrote:
> Looking at the blkparse profiles, doing an rm -rf given the ext4
> produced layout required 5130 megabytes.  The exact same directory
> hierarchy, as laid out by ext3, required only 1294 megabytes.
>
> Looking at a few selected inode allocation bitmaps, we see that ext4
> will often need to write (and thus journal) the same block allocation
> bitmap block 4 or 5 times:
>
> 254,7    0      352     0.166492349   9376  C  R 8216 + 8 [0]
> 254,7    0   348788   212.885545554      0  C  W 8216 + 8 [0]
> 254,7    0   461448   309.533613765      0  C  W 8216 + 8 [0]
> 254,7    0   827687   558.781690434      0  C  W 8216 + 8 [0]
> 254,7    0  1210492   760.738217014      0  C  W 8216 + 8 [0]
>
> However, the same block allocation bitmap block is only written once
> or twice:
>
> 254,8    0     3119     9.535331283      0  C  R 524288 + 8 [0]
> 254,8    0    24504    45.253431031      0  C  W 524288 + 8 [0]
> 254,8    0    85476   144.455205555  23903  C  W 524288 + 8 [0]

Looking at the seekwatcher graphs, it is clear that the ext4 layout is
doing fewer seeks and packing the data into a smaller part of the
filesystem, which is counter-intuitive given the performance result.
Even though the IO bandwidth is ostensibly higher (usually a good thing
on metadata benchmarks), that doesn't help if we are doing many more
writes in total.

It isn't immediately clear that _just_ rewriting the same block multiple
times is the culprit in itself, because in the ext3 case there would be
more block bitmaps affected, _each_ written out 1 or 2 times, while the
closer packing of ext4 allocations results in fewer total bitmaps being
used.  One would think that more sharing of a block bitmap would result
in a performance _increase_, because there is a better chance that it
will be re-used within the same transaction.

> ext4:
> Reads Completed:    59947,  239788KiB
> Writes Completed:   1282K,    5130MiB
>
> ext3:
> Reads Completed:    64856,  259424KiB
> Writes Completed:  323582,    1294MiB

The reads look about the same; the writes are 4x higher for ext4.

What would be useful to examine is the inode number grouping of files
in the same subdirectory, along with the blocks they are allocating.
It seems like the inodes are being packed more closely together, but
the blocks (and hence the block bitmap writes) are spread further apart.

That may be a side-effect of the mballoc per-CPU cache again, where
files being written in the same subdirectory are spread apart because
the writing thread is rescheduled onto different cores.  I discussed
this in the past with Eric, in the case of a file doing small
writes+fsync, where the blocks were fragmented needlessly between
different parts of the filesystem.

The proposed solution in that case (which Aneesh could probably fix
quickly) is to attach an inode to the per-CPU preallocation group on
the first write (for small files).  If it doesn't get any more writes,
that is fine, but if it does, then the same PA would be used for
further allocations regardless of which CPU is doing the IO.

Another solution for that case, and (I speculate) for this case, is to
attach the PA to the parent directory and have all small files in the
same directory use that PA.  This would ensure that blocks allocated to
small inodes in the same directory are kept together.  The drawback is
that this could hurt performance when multiple threads write to the
same directory.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
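
As an aside, the repeated bitmap writes quoted above can be tallied
directly from the raw blkparse text output.  The sketch below is only
illustrative (it is not a tool used in this thread) and assumes the
default blkparse field order shown in the quoted traces, i.e. device,
cpu, sequence, timestamp, pid, action, RWBS, sector, "+", blocks,
process:

    #!/usr/bin/env python
    # count_rewrites.py - tally how many times each sector is written,
    # based on 'C' (complete) events in blkparse text output.
    # Illustrative sketch only; assumes the field order shown in the
    # traces quoted above.
    import sys
    from collections import Counter

    def count_write_completions(lines):
        """Return a Counter mapping sector -> number of completed writes."""
        writes = Counter()
        for line in lines:
            fields = line.split()
            if len(fields) < 8:
                continue                # skip blank/summary lines
            action, rwbs, sector = fields[5], fields[6], fields[7]
            if action == 'C' and 'W' in rwbs:
                writes[sector] += 1
        return writes

    if __name__ == '__main__':
        counts = count_write_completions(sys.stdin)
        # Report sectors written more than once, e.g. bitmap blocks that
        # get journalled repeatedly during the rm -rf.
        for sector, n in sorted(counts.items(), key=lambda kv: -kv[1]):
            if n > 1:
                print("%10s written %d times" % (sector, n))

Piping the parsed trace into it (e.g. "blkparse -i <trace> | python
count_rewrites.py") would list sectors such as 8216 above that are
written repeatedly, making the write-amplification easy to compare
between the ext3 and ext4 layouts.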