Hi,
I'm trying to provide a bunch of storage based on SAS3 shared SSDs (12x
WD SS530). So far I have enabled blk-mq and scsi-mq, as well as
md-cluster and clvm.
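For reference, blk-mq/scsi-mq were switched on via module parameters on
the kernel command line, something like this (the exact parameters
depend on the kernel version; newer kernels use blk-mq unconditionally):

# kernel command line (pre-5.0 kernels)
scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1
# verify after boot:
cat /sys/module/scsi_mod/parameters/use_blk_mq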
I created 6 raid1 arrays out of the 12 disks and now want to join them.
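The mirrors were created along these lines (the /dev/sdX names below are
placeholders for the 12 SSDs; with md-cluster in the picture a clustered
bitmap may be wanted as well):

# one mirror per SSD pair - device names are placeholders
mdadm --create /dev/md10 --metadata=1.2 --level=1 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md11 --metadata=1.2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
# ... and so on up to /dev/md15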
I have tried CLVM and a non-clustered raid0 to get some performance
numbers, but as soon as I activate them, the performance is abysmal.
For the testing command:
fio --ioengine=libaio --direct=1 --name=test --filename=test --bs=4k
--iodepth=128 --size=64G --runtime=10000 --readwrite=randwrite
--numjobs=2
I vary numjobs (1 - 16) and iodepth (1 - 128) to find the sweet spot -
this is a purely synthetic benchmark to see what's possible.
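The sweep itself is just a nested loop over the two parameters, along
these lines (with a shortened runtime so the sweep finishes in
reasonable time):

for jobs in 1 2 4 8 16; do
  for qd in 1 4 16 64 128; do
    fio --ioengine=libaio --direct=1 --name=test --filename=test \
        --bs=4k --iodepth=$qd --size=64G --runtime=60 --time_based \
        --readwrite=randwrite --numjobs=$jobs --group_reporting \
        --output=result-j${jobs}-qd${qd}.txt
  done
done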
Since this affects both mdadm and clvm, I might have to post to both
lists, but let's start here.
mdadm --create /dev/md20 --metadata=1.2 --level=0 --raid-devices=6
/dev/md10 /dev/md11 /dev/md12 /dev/md13 /dev/md14 /dev/md15
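Before benchmarking it is worth double-checking what md actually built,
in particular the RAID0 chunk size (512K by default) versus the 4k
blocks fio issues:

cat /proc/mdstat                # array layout and state
mdadm --detail /dev/md20        # chunk size, member devices
mdadm --examine /dev/md10       # per-member metadata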
Since I'm only using it on one server, even though it would be
available on both, I think we are good. By the way, I noticed an immense
degradation in performance when it was mounted on two servers (roughly
90% less).
So, on the RAID10 (i.e. the RAID0 over the six RAID1s):
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test
--filename=test --bs=4k --iodepth=128 --size=64G --runtime=10000
--readwrite=randwrite --numjobs=8
Jobs: 8 (f=8): [w(8)][0.5%][r=0KiB/s,w=228MiB/s][r=0,w=58.2k IOPS][eta
40m:03s]
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test
--filename=test --bs=4k --iodepth=128 --size=64G --runtime=10000
--readwrite=write --numjobs=8
Jobs: 8 (f=8): [W(8)][0.6%][r=0KiB/s,w=197MiB/s][r=0,w=50.4k IOPS][eta
37m:41s]
The disks are practically idling at 10 - 15% utilization and around 10k
requests/s per raid1 volume; the RAID1 numbers below show that more is
possible - how do I reach it?
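(The utilization above was read off like this while fio was running:)

# one-second samples, keep only the header and the md/sd lines
iostat -xk 1 | grep -E '^(Device|md|sd)'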
CPU load per process is 20% - it's an 8-core Intel Xeon Silver 4110
with 256 GB RAM.
On a single RAID1:
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test
--filename=test --bs=4k --iodepth=128 --size=64G --runtime=10000
--readwrite=randwrite --numjobs=8
Jobs: 8 (f=8): [w(8)][1.5%][r=0KiB/s,w=182MiB/s][r=0,w=46.6k IOPS][eta
50m:12s]
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test
--filename=test --bs=4k --iodepth=128 --size=64G --runtime=10000
--readwrite=write --numjobs=8
Jobs: 8 (f=8): [W(8)][2.4%][r=0KiB/s,w=609MiB/s][r=0,w=156k IOPS][eta
13m:08s]
So, the 6x ~45k IOPS end up as only 58k IOPS combined, and sequential
write is even worse (50.4k on the RAID0 vs. 156k on a single RAID1)? -
something seems wrong!
Let's also check read:
RAID10
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test
--filename=test --bs=4k --iodepth=128 --size=64G --runtime=10000
--readwrite=randread --numjobs=8
Jobs: 8 (f=8): [r(8)][33.3%][r=8015MiB/s,w=0KiB/s][r=2052k,w=0 IOPS][eta
00m:44s]
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test
--filename=test --bs=4k --iodepth=128 --size=64G --runtime=10000
--readwrite=read --numjobs=8
Jobs: 8 (f=8): [R(8)][7.3%][r=3063MiB/s,w=0KiB/s][r=784k,w=0 IOPS][eta
02m:57s]
Whoa, what's happening here? For the first test only ~700 MB/s are
actually hitting md20 (per iostat -xk) - who is caching here with
--direct=1?! What's going on??
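One possible culprit - an assumption on my side, not verified: if the
test file still contains sparse/unwritten ranges, the filesystem can
answer those reads with zeros without issuing any I/O to md20, even
with --direct=1. Prefilling the file once would rule that out:

# write every block of the test file once so reads must hit the device
fio --ioengine=libaio --direct=1 --name=prefill --filename=test \
    --bs=1M --iodepth=16 --size=64G --readwrite=write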
The second read test seems okay, but let's check the raid1:
RAID1:
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test
--filename=test --bs=4k --iodepth=128 --size=64G --runtime=10000
--readwrite=randread --numjobs=8
Jobs: 8 (f=8): [r(8)][36.4%][r=8178MiB/s,w=0KiB/s][r=2094k,w=0 IOPS][eta
00m:42s]
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test
--filename=test --bs=4k --iodepth=128 --size=64G --runtime=10000
--readwrite=read --numjobs=8
Jobs: 8 (f=8): [R(8)][30.6%][r=8547MiB/s,w=0KiB/s][r=2188k,w=0 IOPS][eta
00m:43s]
The same picture: only around 100 MB/s actually hits the disks.
I tested this beforehand and the I/O was always hitting the disks,
until I started to play around with raid0 and clvm - how could that
influence the RAID1?
Can someone explain to me what's happening here?
Or how can I analyze what's going on with the RAID0?
What options are there to make it faster, i.e. actually use all disks?
What options are there to make the RAID1 writes faster? (Schedulers?
Something else? The disks should do 220k IOPS @4k.)
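For the scheduler angle, the knobs I plan to look at (sda stands in for
each member SSD; 'none' is typically preferred for fast SSDs under
blk-mq):

cat /sys/block/sda/queue/scheduler           # see which scheduler is active
echo none > /sys/block/sda/queue/scheduler   # bypass I/O scheduling entirely
cat /sys/block/sda/queue/nr_requests         # queue depth towards the device
mpstat -P ALL 1                              # check if a single core saturates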
Thanks
Thomas