read-modify-write occurring for direct I/O on RAID-5

Hello,

I am having a problem with write performance via direct I/O. My setup is:
* Debian Sid
* Linux 6.3.0-2 (Debian Kernel)
* 3-disk MD RAID-5 of hard disks
* XFS

When I do large sequential writes via direct I/O, the writes are sometimes fast, but sometimes the RAID ends up doing read-modify-write (RMW) and performance drops.

If I use regular buffered I/O, then performance is better, presumably due to the MD stripe cache. I could just use buffered writes, of course, but I am really trying to make sure I get the alignment correct to start with.
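
In case it matters, this is how I check (and enlarge) the MD stripe cache while testing. Just a sketch; I am assuming the usual raid456 sysfs knob here, and the value is a number of cached stripe entries:
-----------------------------------------------------------------------
$ cat /sys/block/md10/md/stripe_cache_size                    # current size
$ echo 4096 | sudo tee /sys/block/md10/md/stripe_cache_size   # enlarge for testing
-----------------------------------------------------------------------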


I can reproduce the problem on a fresh RAID.
-----------------------------------------------------------------------
$ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi]
mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md10 started.
-----------------------------------------------------------------------
For testing, I'm using "-z 30G" to limit the duration of the initial RAID resync.
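
The resync progress can be checked before testing, so it does not skew the numbers:
-----------------------------------------------------------------------
$ cat /proc/mdstat              # test only once the resync/recovery line is gone
$ watch -n 5 cat /proc/mdstat   # or keep an eye on the resync progress
-----------------------------------------------------------------------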


For XFS I can use default options:
-----------------------------------------------------------------------
$ sudo mkfs.xfs /dev/md10
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md10              isize=512    agcount=16, agsize=983040 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=15728640, imaxpct=25
         =                       sunit=128    swidth=68352 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount /dev/md10 /mnt/tmp
-----------------------------------------------------------------------
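
The geometry XFS actually picked can be read back after mounting (it should just echo the sunit/swidth values from the mkfs output above):
-----------------------------------------------------------------------
$ sudo xfs_info /mnt/tmp | grep -E 'sunit|swidth'
-----------------------------------------------------------------------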


I am testing via dd:
-----------------------------------------------------------------------
$ sudo dd if=/dev/zero of=/mnt/tmp/test.bin iflag=fullblock oflag=direct bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 100.664 s, 107 MB/s
-----------------------------------------------------------------------

I can monitor performance with dstat (the I/O reported at the start seems to be an artifact of dstat's monitoring).
-----------------------------------------------------------------------
$ dstat -dD sdg,sdh,sdi 2
--dsk/sdg-----dsk/sdh-----dsk/sdi--
 read  writ: read  writ: read  writ
  16G 5673M:  16G 5673M: 537M   21G  # <--not a real reading
   0     0 :   0     0 :   0     0
   0     0 :   0     0 :   0     0
   0    29M:   0    29M:   0    29M  # <--test starts here
   0   126M:   0   126M:   0   126M
   0   134M:   0   134M:   0   134M
   0   145M:   0   145M:   0   144M
  16k  137M:   0   137M:   0   138M
   0   152M:   0   152M:   0   152M
   0   140M:   0   140M:   0   140M
5632k  110M:5376k  110M:5376k  111M  # <--RMW begins here
  12M   49M:  12M   49M:  12M   49M
  14M   53M:  13M   54M:  13M   53M
  12M   50M:  12M   50M:  12M   50M
  12M   49M:  12M   50M:  12M   49M
  12M   50M:  12M   49M:  12M   49M
  13M   50M:  13M   51M:  12M   51M
  12M   50M:  12M   50M:  12M   50M
  12M   48M:  12M   48M:  12M   48M
  13M   53M:  13M   52M:  13M   53M
  13M   50M:  12M   50M:  13M   50M
  13M   52M:  13M   52M:  13M   52M
  12M   47M:  12M   46M:  12M   46M
  13M   52M:  13M   52M:  13M   52M
-----------------------------------------------------------------------
(I truncated the output--the rest looks the same)

Note how the I/O starts out as pure writes, but then continues with a large number of reads. I am fairly sure this is RAID-5 read-modify-write caused by misaligned writes.
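
For what it's worth, I can also watch the raid456 knobs while the test runs (a rough check; I am assuming these sysfs files are the right ones, and I have not changed rmw_level from its default):
-----------------------------------------------------------------------
$ cat /sys/block/md10/md/rmw_level                        # default is 1; 0 disables RMW
$ watch -n 2 cat /sys/block/md10/md/stripe_cache_active   # stripes in flight during the run
-----------------------------------------------------------------------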


The default chunk size is 512K:
-----------------------------------------------------------------------
$ sudo mdadm --detail /dev/md10 | grep Chunk
        Chunk Size : 512K
$ sudo blkid -i /dev/md10
/dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792" PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
-----------------------------------------------------------------------
I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I would expect this to be 1024K (due to two data disks in a three-disk RAID-5).

Translating into 512-byte sectors, I think the topology should be:
chunk size (sunit): 1024 sectors
stripe size (swidth): 2048 sectors
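
The same numbers can be read back from the block queue limits (I believe blkid just reports these in bytes), and the sector math is:
-----------------------------------------------------------------------
$ cat /sys/block/md10/queue/minimum_io_size   # 524288 = one 512K chunk
$ cat /sys/block/md10/queue/optimal_io_size   # would expect 1048576; blkid showed 279969792
$ echo $((524288 / 512))       # sunit in 512-byte sectors -> 1024
$ echo $((2 * 524288 / 512))   # swidth in 512-byte sectors -> 2048
-----------------------------------------------------------------------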


I can see the write alignment with blktrace.
-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q '
  9,10  15        1     0.000000000 186548  Q  WS 3829760 + 2048 [dd]
  9,10  15        3     0.021087119 186548  Q  WS 3831808 + 2048 [dd]
  9,10  15        5     0.023605705 186548  Q  WS 3833856 + 2048 [dd]
  9,10  15        7     0.026093572 186548  Q  WS 3835904 + 2048 [dd]
  9,10  15        9     0.028595887 186548  Q  WS 3837952 + 2048 [dd]
  9,10  15       11     0.031171221 186548  Q  WS 3840000 + 2048 [dd]
[...]
  9,10   5      441    14.601942400 186608  Q  WS 8082432 + 2048 [dd]
  9,10   5      443    14.620316654 186608  Q  WS 8084480 + 2048 [dd]
  9,10   5      445    14.646707430 186608  Q  WS 8086528 + 2048 [dd]
  9,10   5      447    14.654519976 186608  Q  WS 8088576 + 2048 [dd]
  9,10   5      449    14.680901605 186608  Q  WS 8090624 + 2048 [dd]
  9,10   5      451    14.689156421 186608  Q  WS 8092672 + 2048 [dd]
  9,10   5      453    14.706529362 186608  Q  WS 8094720 + 2048 [dd]
  9,10   5      455    14.732451407 186608  Q  WS 8096768 + 2048 [dd]
-----------------------------------------------------------------------
In the beginning, the queued writes are stripe-aligned. For example:
3829760 / 2048 == 1870

Later on, writes end up getting misaligned by half a stripe. For example:
8082432 / 2048 == 3946.5
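
A quick way to flag the misaligned ones in bulk (a rough one-liner, assuming the blkparse column layout shown above, where the starting sector is field 8):
-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - \
    | awk '$6 == "Q" && $7 ~ /W/ && ($8 % 2048) != 0 { print "misaligned:", $0 }'
-----------------------------------------------------------------------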

I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, but writing behaved much the same (the RMW starts later, but it still starts).
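
For reference, the geometry override I tried was along these lines (and I believe the byte-based su/sw form is equivalent):
-----------------------------------------------------------------------
$ sudo mkfs.xfs -f -d sunit=1024,swidth=2048 /dev/md10
$ # equivalently (su = stripe unit in bytes, sw = number of data disks):
$ sudo mkfs.xfs -f -d su=512k,sw=2 /dev/md10
-----------------------------------------------------------------------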


Am I doing something wrong, or is there a bug, or are my expectations incorrect? I had expected that large sequential writes would be aligned with swidth.

Thank you,
Corey


