On Wed, Jul 17, 2019 at 05:04:27PM +0200, Guoqing Jiang wrote:
> > I've been running into an issue with background fstrim on large xfs
> > filesystems on RAID10d SSDs taking a lot of time to complete and
> > starving out other I/O to the filesystem. There seem to be a few
> > different issues involved here, but the main one appears to be that
> > BLKDISCARD on a RAID10 md block device sends many small discard
> > requests down to the underlying component devices (while this doesn't
> > seem to be an issue for RAID0 or for RAID1).
> >
> > It's quite easy to reproduce this with just using in-memory loop
> > devices, for example by doing:
> >
> > cd /dev/shm
> > touch loop0
> > touch loop1
> > touch loop2
> > touch loop3
> > truncate -s 7681501126656 loop0
> > truncate -s 7681501126656 loop1
> > truncate -s 7681501126656 loop2
> > truncate -s 7681501126656 loop3
> > losetup /dev/loop0 loop0
> > losetup /dev/loop1 loop1
> > losetup /dev/loop2 loop2
> > losetup /dev/loop3 loop3
> >
> > mdadm --create -n 4 -c 512 -l 0 --assume-clean /dev/md0 /dev/loop[0123]
> > time blkdiscard /dev/md0
> >
> > mdadm --stop /dev/md0
> >
> > mdadm --create -n 4 -c 512 -l 1 --assume-clean /dev/md0 /dev/loop[0123]
> > time blkdiscard /dev/md0
> >
> > mdadm --stop /dev/md0
> >
> > mdadm --create -n 4 -c 512 -l 10 --assume-clean /dev/md0 /dev/loop[0123]
> > time blkdiscard /dev/md0
> >
> > This simulates trimming RAID0/1/10 arrays with 4x7.68TB component
> > devices, and the blkdiscard completion times are as follows:
> >
> > RAID0     0m0.213s
> > RAID1     0m2.667s
> > RAID10   10m44.814s
>
> IIUC, there is no dedicated function for discard request for raid10
> and raid1, raid1 has better performance than raid10 because of the new
> barrier mechanism or it doesn't need to translate the address from
> virtual to physical.

Thank you for the reply and the helpful pointer!

I had another look at this, and I noticed something odd. When I build a
RAID10 array as above, blkdiscard the whole array, and use blktrace to
look at the I/Os being sent down to one of the individual component
devices during this process, it starts off looking fairly normal,
discarding in 512 KiB chunks:

  7,0    1       89     3.740837190   846  A   D 264192 + 1024 <- (9,0) 0
  7,0    1       90     3.740848185   846  A   D 265216 + 1024 <- (9,0) 2048
  7,0    1       91     3.740854505   846  A   D 266240 + 1024 <- (9,0) 4096
  7,0    1       92     3.740860124   846  A   D 267264 + 1024 <- (9,0) 6144
  7,0    1       93     3.740871191   846  A   D 268288 + 1024 <- (9,0) 8192
[...]
  7,0    1     4182     3.752027147   846  A   D 4455424 + 1024 <- (9,0) 8382464
  7,0    1     4183     3.752028736   846  A   D 4456448 + 1024 <- (9,0) 8384512
  7,0    1     4184     3.752030923   846  A   D 4457472 + 1024 <- (9,0) 8386560

But then it starts discarding only the first 4 KiB of every subsequent
512 KiB chunk, skipping holes of 508 KiB at a time:

  7,0    1     4185     3.752034845   846  A   D 4459512 + 8 <- (9,0) 8389624
  7,0    1     4186     3.752037523   846  A   D 4460536 + 8 <- (9,0) 8391672
  7,0    1     4187     3.752039573   846  A   D 4461560 + 8 <- (9,0) 8393720
  7,0    1     4188     3.752042164   846  A   D 4462584 + 8 <- (9,0) 8395768
[...]
  7,0    1     8278     3.765219445   846  A   D 8650744 + 8 <- (9,0) 16772088
  7,0    1     8279     3.765221463   846  A   D 8651768 + 8 <- (9,0) 16774136
  7,0    1     8280     3.765224025   846  A   D 8652792 + 8 <- (9,0) 16776184
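(For reference, a per-component trace like the above can be captured
roughly as follows; this is a minimal sketch rather than the exact
invocation, and "loop0.trace" is just an example file name:)

# terminal 1: trace the requests that md remaps onto one component
# device, parsing them on the fly and saving them for later inspection
blktrace -d /dev/loop0 -o - | blkparse -i - > loop0.trace

# terminal 2: discard the whole array, then stop the trace with ^C
time blkdiscard /dev/md0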
It eventually does get around to discarding the 508 KiB holes it left
behind... but it discards those in reverse (downwards) block order:

  7,0    1     8281     3.765225941   846  A   D 8651776 + 1016 <- (9,0) 16775168
  7,0    1     8282     3.765227699   846  A   D 8650752 + 1016 <- (9,0) 16773120
  7,0    1     8283     3.765228878   846  A   D 8649728 + 1016 <- (9,0) 16771072
  7,0    1     8284     3.765230104   846  A   D 8648704 + 1016 <- (9,0) 16769024
  7,0    1     8285     3.765231271   846  A   D 8647680 + 1016 <- (9,0) 16766976
  7,0    1     8286     3.765233010   846  A   D 8646656 + 1016 <- (9,0) 16764928
[...]
  7,0    1    12374     3.771357037   846  A   D 4460544 + 1016 <- (9,0) 8392704
  7,0    1    12375     3.771358244   846  A   D 4459520 + 1016 <- (9,0) 8390656
  7,0    1    12376     3.771359690   846  A   D 4458496 + 1016 <- (9,0) 8388608

This dance seems to continue for the rest of the discard process:
discard 4 KiB out of every 512 KiB chunk in forward block order, and
then discard the remaining 508 KiB of every 512 KiB chunk in backward
block order. (I'm seeing the same behavior on "real" hardware.)

Most of my kernel experience and contributions are related to arch/arm/
and net/, and I can't say I know much about storage at all, but I think
I'll get to the bottom of this issue eventually. :-)
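As a quick way to double-check the pattern, the saved blkparse output
can be summarized by discard request size with a bit of awk. This is
only a sketch: it assumes the default blkparse column layout shown in
the excerpts above (action in field 6, RWBS in field 7, request size in
sectors in field 10) and the example loop0.trace file from earlier:

# count discard remap events per request size; sizes are in 512-byte
# sectors, so 1024 = a full 512 KiB chunk, 8 = 4 KiB, 1016 = 508 KiB
awk '$6 == "A" && $7 == "D" {
         count[$10]++
         total += $10
     }
     END {
         for (n in count)
             printf "%8d requests of %4d KiB\n", count[n], n / 2
         printf "total discarded: %.1f GiB\n", total / 2 / 1024 / 1024
     }' loop0.trace

If the 4 KiB + 508 KiB split really does hit every chunk, the 8-sector
and 1016-sector buckets should come out roughly equal in count.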