On Thu, Aug 03, 2023 at 10:44:31PM -0700, Corey Hickey wrote:
> Hello,
>
> I am having a problem with write performance via direct I/O. My setup is:
> * Debian Sid
> * Linux 6.3.0-2 (Debian Kernel)
> * 3-disk MD RAID-5 of hard disks
> * XFS
>
> When I do large sequential writes via direct I/O, sometimes the writes are
> fast, but sometimes the RAID ends up doing RMW and performance gets slow.
>
> If I use regular buffered I/O, then performance is better, presumably due to
> the MD stripe cache. I could just use buffered writes, of course, but I am
> really trying to make sure I get the alignment correct to start with.
>
>
> I can reproduce the problem on a fresh RAID.
> -----------------------------------------------------------------------
> $ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi]
> mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1%
> Continue creating array? y
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md10 started.
> -----------------------------------------------------------------------
> For testing, I'm using "-z 30G" to limit the duration of the initial RAID
> resync.
>
>
> For XFS I can use default options:
> -----------------------------------------------------------------------
> $ sudo mkfs.xfs /dev/md10
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10              isize=512    agcount=16, agsize=983040 blks

So an AG size of just under 2GB.

>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
> data     =                       bsize=4096   blocks=15728640, imaxpct=25
>          =                       sunit=128    swidth=68352 blks
                                                ^^^^^^^^^^^^^^^^^

Something is badly broken in MD land.

.....

> The default chunk size is 512K
> -----------------------------------------------------------------------
> $ sudo mdadm --detail /dev/md10 | grep Chunk
>        Chunk Size : 512K
> $ sudo blkid -i /dev/md10
> /dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792"
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yup, that's definitely broken.

> PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
> -----------------------------------------------------------------------
> I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I would
> expect this to be 1024K (due to two data disks in a three-disk RAID-5).

Yup, it's broken. :/

> Translating into 512-byte sectors, I think the topology should be:
> chunk size (sunit): 1024 sectors
> stripe size (swidth): 2048 sectors

Yup, or as it reports from mkfs, sunit=128 fsbs, swidth=256 fsbs.

> -----------------------------------------------------------------------
> $ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q '
>   9,10   15        1     0.000000000 186548  Q  WS 3829760 + 2048 [dd]
>   9,10   15        3     0.021087119 186548  Q  WS 3831808 + 2048 [dd]
>   9,10   15        5     0.023605705 186548  Q  WS 3833856 + 2048 [dd]
>   9,10   15        7     0.026093572 186548  Q  WS 3835904 + 2048 [dd]
>   9,10   15        9     0.028595887 186548  Q  WS 3837952 + 2048 [dd]
>   9,10   15       11     0.031171221 186548  Q  WS 3840000 + 2048 [dd]
> [...]
>   9,10    5      441    14.601942400 186608  Q  WS 8082432 + 2048 [dd]
>   9,10    5      443    14.620316654 186608  Q  WS 8084480 + 2048 [dd]
>   9,10    5      445    14.646707430 186608  Q  WS 8086528 + 2048 [dd]
>   9,10    5      447    14.654519976 186608  Q  WS 8088576 + 2048 [dd]
>   9,10    5      449    14.680901605 186608  Q  WS 8090624 + 2048 [dd]
>   9,10    5      451    14.689156421 186608  Q  WS 8092672 + 2048 [dd]
>   9,10    5      453    14.706529362 186608  Q  WS 8094720 + 2048 [dd]
>   9,10    5      455    14.732451407 186608  Q  WS 8096768 + 2048 [dd]
> -----------------------------------------------------------------------
> In the beginning, writes queued are stripe-aligned. For example:
> 3829760 / 2048 == 1870
>
> Later on, writes end up getting misaligned by half a stripe. For example:
> 8082432 / 2048 == 3946.5

So it's aligned to sunit, not swidth. That will match up with a
discontiguity in the file layout, i.e. an extent boundary. And given this
is at just under 4GB written, and the AG size is just under 2GB, this
discontiguity is going to occur as writing fills AG 1 and allocation
switches to AG 2.

> I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, but
> that had pretty much the same behavior when writing (the RMW starts later,
> but it still starts).

It won't change anything, actually. The first allocation in an AG will
determine which stripe unit the new extent starts on, and then for the
entire AG the writes will be aligned to that choice.

If you do IOs much larger than the stripe width (e.g. 16MB at a time), the
impact of the head/tail RMW will largely go away. The problem is that you
are doing exactly stripe-width-sized IOs, and so this is the worst case
for any allocation misalignment that might occur.

> Am I doing something wrong, or is there a bug, or are my expectations
> incorrect? I had expected that large sequential writes would be aligned with
> swidth.

Expectations are wrong. Large allocations are aligned to the stripe unit
in XFS by default. This is because XFS was tuned for *large* multi-layer
RAID setups like RAID-50 that had hardware RAID-5 luns striped together
via RAID-0 in the volume manager. In these setups, the stripe unit is the
hardware RAID-5 lun stripe width (the minimum size that avoids RMW) and
the stripe width is the RAID-0 width. Hence for performance, it didn't
matter which stripe unit an allocation aligned to, as long as writes
spanned the entire stripe width. That way they would hit every lun.

In general, we don't want stripe width aligned allocation, because that
hot-spots the first stripe unit in the stripe as all file data first
writes to that unit. A RAID stripe is only as fast as its slowest disk,
and so having a hot stripe unit slows everything down. Hence by default
we move the initial allocation around the stripe units, and that largely
removes the hotspots in the RAID luns...

So, yeah, there are good reasons for stripe unit aligned allocation
rather than stripe width aligned. The problem is that MD has never
behaved this way - it has always exposed its individual disk chunk size
as the minimum IO size (i.e. the stripe unit) and the stripe width as the
optimal IO size to avoid RMW cycles.

If you want to force XFS to do stripe width aligned allocation for large
files to match how MD exposes its topology to filesystems, use the
'swalloc' mount option. The downside is that you'll hotspot the first
disk in the MD array....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
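
A minimal sketch of the two workarounds described above, assuming the
/dev/md10 array from the reproduction; the /mnt/scratch mount point and
test file path are placeholders, not taken from the thread:
-----------------------------------------------------------------------
# Force stripe width aligned allocation with the 'swalloc' mount option
# (trades the RMW problem for a hotspot on the first chunk of each
# stripe, as noted above):
$ sudo mount -o swalloc /dev/md10 /mnt/scratch

# Or keep the default sunit-aligned allocation and issue direct I/O in
# chunks much larger than the stripe width, e.g. 16MB at a time:
$ sudo dd if=/dev/zero of=/mnt/scratch/testfile bs=16M count=256 oflag=direct
-----------------------------------------------------------------------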