I have noticed in simple tests, making arrays with tmpfs, that Intel
CPUs seem to be able to xor at about 2x the speed of AMD. The speed
may also vary with CPU generation.

Also, grow differs in that blocks get moved around, hence the writes.

On the raid you are building, is there other IO going on to the
disks? That will cause seeks, and the more IO there is (outside of
the rebuild) the worse it will be.

Here is everything I set on my arrays:

find /sys -name "*sync_speed_min*" -exec /usr/local/bin/set_sync_speed 15000 {} \;

# MB Intel controller
find /sys/devices/pci0000:00/0000:00:1f.2/ -name "*queue_depth*" -exec /usr/local/bin/set_queue_depth 1 {} \;
find /sys/devices/pci0000:00/0000:00:1f.2/ -name "nr_requests" -exec /usr/local/bin/set_queue_depth 4 {} \;
#
# AMD FM2 MB
find /sys/devices/pci0000:00/0000:00:11.0/ -name "queue_depth" -exec /usr/local/bin/set_queue_depth 8 {} \;
find /sys/devices/pci0000:00/0000:00:11.0/ -name "nr_requests" -exec /usr/local/bin/set_queue_depth 16 {} \;

echo 30000 > /proc/sys/dev/raid/speed_limit_min

for mddev in md13 md14 md15 md16 md17 md18 ; do
    blockdev --setfra 65536 /dev/${mddev}
    blockdev --setra 65536 /dev/${mddev}
    echo 32768 > /sys/block/${mddev}/md/stripe_cache_size
    echo 30000 > /sys/block/${mddev}/md/sync_speed_min
    echo 2 > /sys/block/${mddev}/md/group_thread_cnt
done

You will need to adjust my find/pci* paths to match your device (see
the readlink sketch further down), and you will need to test a bit
with queue_depth/nr_requests to see what is best for your
controller/disk combination. You may also want to test different
values of group_thread_cnt.

The set_queue_depth file (and the set_sync_speed file) looks like
this:

cat /usr/local/bin/set_queue_depth
#!/bin/sh
echo $1 > $2

On mine you will notice I have 6 arrays. Four of those arrays are
built from 3TB disks split into four 750GB partitions, to minimize
the time a single grow takes to complete. The other two are the
remaining 3TB of space split into two 1.5TB pieces, again to keep the
grow time down.

I have also found that when a disk fails, often only a single
partition gets a bad block and fails, so I only have to
--re-add/--add one device. And if the disk has not failed outright
you can do a --replace, so long as you can get the old and new
devices into the chassis at the same time. With the multiple
partitions it usually means only 1 of the 4 partitions has failed in
mdadm; a --re-add gets that one working again, and I can then do the
--replace, which reads only from the disk it is replacing and as such
is much faster.

I also carefully set up the partition naming such that the last digit
of the partition number matches the last digit of the md device, i.e.:

md16 : active raid6 sdh6[10] sdi6[12] sdj6[7] sdg6[9] sde6[1] sdb6[8] sdf6[11]
      3615495680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
      bitmap: 0/6 pages [0KB], 65536KB chunk

as that makes the adding/re-adding simpler, since I always know which
device belongs to which array.
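To make the recovery steps above concrete, the sequence looks roughly
like this (sdh6 as the kicked partition and sdk6 as the new disk's
partition are made up for the example):

# one partition of the disk got kicked; try putting it back first
mdadm /dev/md16 --re-add /dev/sdh6
# if the old disk is still readable, add the new disk's partition as a
# spare and have md copy old -> new; this reads only the disk being
# replaced, so it is much faster than a full parity rebuild
mdadm /dev/md16 --add /dev/sdk6
mdadm /dev/md16 --replace /dev/sdh6 --with /dev/sdk6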
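For the partition layout above, carving a disk into four equal pieces
can be done with something like this (just a sketch; /dev/sdX is a
placeholder and my real disks differ in the exact sizes and numbers):

# four equal GPT partitions, one per array, flagged for raid
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart raid13 1MiB 25%
parted -s /dev/sdX mkpart raid14 25% 50%
parted -s /dev/sdX mkpart raid15 50% 75%
parted -s /dev/sdX mkpart raid16 75% 100%
for n in 1 2 3 4 ; do parted -s /dev/sdX set ${n} raid on ; done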
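And for finding the right pci path to plug into the find commands:
the sysfs path of any disk on the controller shows it (sda is just an
example device here, and the path shown is only an illustration of
the general shape):

readlink -f /sys/block/sda
# -> /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda
# the 0000:00:1f.2 component is the controller; that directory is
# what the find commands walk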
On Tue, Jan 11, 2022 at 3:59 AM Jaromír Cápík <jaromir.capik@xxxxxxxx> wrote:
>
> Hello Roger.
>
> I just ran atop on different and much better hardware doing mdadm
> --grow on a raid5 with 4 drives, and it shows the following:
>
> DSK | sdl | busy 90% | read  950 | write 502 | KiB/r 1012 | KiB/w 506 | MBr/s 94.0 | MBw/s 24.9 | avq 1.29 | avio 6.22 ms |
> DSK | sdk | busy 89% | read  968 | write 499 | KiB/r  995 | KiB/w 509 | MBr/s 94.1 | MBw/s 24.8 | avq 0.92 | avio 6.09 ms |
> DSK | sdj | busy 88% | read 1004 | write 503 | KiB/r  958 | KiB/w 505 | MBr/s 94.0 | MBw/s 24.8 | avq 0.66 | avio 5.91 ms |
> DSK | sdi | busy 87% | read 1013 | write 499 | KiB/r  949 | KiB/w 509 | MBr/s 94.0 | MBw/s 24.8 | avq 0.65 | avio 5.81 ms |
>
> Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
> md3 : active raid5 sdi1[5] sdl1[6] sdk1[4] sdj1[2]
>       46877237760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
>       [=================>...]  resync = 88.5% (13834588672/15625745920) finish=293.1min speed=101843K/sec
>       bitmap: 8/59 pages [32KB], 131072KB chunk
>
> Surprisingly, all 4 drives show approximately 94MB/s read and 25MB/s write.
> Even though each of the drives can read 270MB/s and write 250MB/s, the sync speed is only 100MB/s, so?
>
> Does --grow differ from --add?
>
> Thanks,
> Jaromir
>
>
> ---------- Original e-mail ----------
> From: Roger Heflin <rogerheflin@xxxxxxxxx>
> To: Wols Lists <antlists@xxxxxxxxxxxxxxx>
> Date: 11. 1. 2022 1:15:17
> Subject: Re: Feature request: Add flag for assuming a new clean drive
> completely dirty when adding to a degraded raid5 array in order to
> increase the speed of the array rebuild
>
> I just did a "--add" with sdd on a raid6 array missing a volume, and
> here is what sar shows:
>
> 06:08:12 PM  sdb   91.03  34615.97      0.36  0.00  380.26  0.41   4.47  30.31
> 06:08:12 PM  sdc    0.02      0.00      0.00  0.00    0.00  0.00   0.00   0.00
> 06:08:12 PM  sdd   77.12     26.28  34563.36  0.00  448.54  0.64   8.23  27.40
> 06:08:12 PM  sde   36.45  34598.82      0.36  0.00  949.22  1.43  38.78  70.37
> 06:08:12 PM  sdf   46.87  34598.89      0.36  0.00  738.25  1.23  26.13  57.81
>
> 06:09:12 PM  sda    5.12      0.93     75.33  0.00   14.91  0.01   1.48   0.39
> 06:09:12 PM  sdb  122.57  46819.67      0.40  0.00  382.00  0.54   4.38  35.85
> 06:09:12 PM  sdc    0.00      0.00      0.00  0.00    0.00  0.00   0.00   0.00
> 06:09:12 PM  sdd  105.92      0.00  46775.73  0.00  441.63  1.12  10.53  35.80
> 06:09:12 PM  sde   48.47  46817.53      0.40  0.00  965.98  1.95  40.00  97.89
> 06:09:12 PM  sdf   56.95  46834.53      0.40  0.00  822.39  1.73  30.32  82.33
>
> 06:10:12 PM  sda    4.55      1.20     48.20  0.00   10.86  0.01   0.97   0.27
> 06:10:12 PM  sdb  123.67  46616.93      0.40  0.00  376.96  0.52   4.15  34.66
> 06:10:12 PM  sdc    0.00      0.00      0.00  0.00    0.00  0.00   0.00   0.00
> 06:10:12 PM  sdd  109.82      0.00  46623.40  0.00  424.56  1.30  11.80  36.15
> 06:10:12 PM  sde   49.18  46602.00      0.40  0.00  947.52  1.93  39.17  97.27
> 06:10:12 PM  sdf   54.88  46601.07      0.40  0.00  849.10  1.75  31.82  85.16
>
> 06:11:12 PM  sda    4.07      1.00     50.80  0.00   12.74  0.01   1.77   0.30
> 06:11:12 PM  sdb  121.93  46363.20      0.40  0.00  380.24  0.51   4.10  34.72
> 06:11:12 PM  sdc    0.00      0.00      0.00  0.00    0.00  0.00   0.00   0.00
> 06:11:12 PM  sdd  109.58      0.00  46372.47  0.00  423.17  1.37  12.44  35.69
> 06:11:12 PM  sde   49.38  46371.00      0.40  0.00  939.01  1.93  38.88  97.09
> 06:11:12 PM  sdf   55.12  46352.53      0.40  0.00  841.00  1.73  31.39  85.25
>
> 06:12:12 PM  sda    5.75     14.20     79.05  0.00   16.22  0.01   1.78   0.40
> 06:12:12 PM  sdb  120.73  45994.13      0.40  0.00  380.97  0.51   4.20  34.72
> 06:12:12 PM  sdc    0.00      0.00      0.00  0.00    0.00  0.00   0.00   0.00
> 06:12:12 PM  sdd  110.95      0.00  45982.87  0.00  414.45  1.43  12.81  35.39
> 06:12:12 PM  sde   49.63  46020.46      0.40  0.00  927.37  1.91  38.39  96.18
> 06:12:12 PM  sdf   54.27  46022.80      0.40  0.00  847.97  1.75  32.14  86.65
>
> So there are very few reads going on for sdd, but a lot of reads of
> the other disks to recalculate what the data on that disk should be.
> This is on raid6, but if raid6 is not doing a pointless check read of
> a newly added disk, I would not expect raid5 to be doing one either.
>
> This is on a 5.14 kernel.
>
> On Mon, Jan 10, 2022 at 5:15 PM Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
> > On 09/01/2022 14:21, Jaromír Cápík wrote:
> > > In case of huge arrays (48TB in my case) the array rebuild takes a
> > > couple of days with the current approach, even when the array is
> > > idle, and during that time any of the drives could fail, causing a
> > > fatal data loss.
> > >
> > > Does it make at least a bit of sense, or are my understanding and
> > > assumptions wrong?
> >
> > It does make sense, but have you read the code to see if it already
> > does it?
> >
> > And if it doesn't, someone's going to have to write it, in which case
> > it doesn't make sense not to have that as the default.
> >
> > Bear in mind that rebuilding the array with a new drive is completely
> > different logic to doing an integrity check, so it will need its own
> > code, so I expect it already works that way.
> >
> > I think you've got two choices. Firstly, raid or not, you should have
> > backups! Raid is for high availability, not for keeping your data
> > safe! And secondly, go raid-6, which gives you that bit of extra
> > redundancy.
> >
> > Cheers,
> > Wol