On Wed, Jul 27, 2016 at 7:26 AM, Peter Grandi <pg@xxxxxxxxxxxxxxxxxxxx> wrote:
> First a terminology point: you are reporting a *speed* problem,
> not a performance problem. My impression is that you are getting
> pretty good performance, given your hardware, configuration and
> workload. The difference between "speed" and "performance" is
> that the first is a simple rate of stuff done per unit of time,
> while the second is an envelope embodying several tradeoffs,
> among them with cost. In your question you describe a low rate
> of stuff done per unit of time. :-)
>
>> Currently I am using 8 x 480GB Intel SSD in a RAID5, then LVM
>> on top, DRBD on top, and finally iSCSI on top (and then used
>> as VM raw disks for mostly Windows VMs).
>
> A very brave configuration, a shining example of the "syntactic"
> mindset, according to which any arbitrary combination of
> legitimate features must be fine :-).

While you may call this configuration "brave", it is actually quite
common for VDI "appliance" deployments. If you look at cluster
solutions like Ganeti, it is exactly this stack less the iSCSI
(Ganeti runs storage and VM compute on the same nodes). It is also a
LOT faster, and causes a lot less flash wear, than using a file
system like ZFS to create ZVOLs and exporting those. The other
option is to run a file system and export files as block devices;
that just replaces LVM with EXT4/XFS, and for static "blobs" LVM is
quite a bit faster and safer.

> First server DRBD primary disks:
>
>> Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
>> sdi         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
>> sda        78.00    59.00    79.00    86.00     0.74     0.52    15.55     0.02     0.15     0.20     0.09     0.15     2.40
>> sdg        35.00    48.00    68.00    79.00     0.52     0.44    13.39     0.02     0.14     0.24     0.05     0.11     1.60
>> sdf        46.00    65.00    86.00    98.00     0.76     0.58    14.96     0.03     0.17     0.09     0.24     0.09     1.60
>> sdh        97.00    45.00    70.00   141.00     0.66     0.68    12.96     0.08     0.36     0.29     0.40     0.34     7.20
>> sde       101.00    75.00    87.00    94.00     0.79     0.61    15.76     0.08     0.42     0.32     0.51     0.29     5.20
>> sdb        85.00    54.00    94.00   102.00     0.84     0.56    14.62     0.01     0.04     0.09     0.00     0.04     0.80
>> sdc        85.00    74.00    98.00   106.00     0.79     0.66    14.53     0.01     0.06     0.04     0.08     0.04     0.80
>> sdd       230.00   199.00   266.00   353.00     2.19     2.11    14.24     0.18     0.28     0.23     0.32     0.16     9.60
>
> Second server DRBD secondary disks:
>
>> Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
>> sdf        67.00    76.00    64.00   113.00     0.52     0.62    13.17     0.26     1.47     0.06     2.27     1.45    25.60
>> sdg        39.00    61.00    50.00   114.00     0.35     0.56    11.38     0.45     2.76     0.08     3.93     2.71    44.40
>> sdd        49.00    67.00    50.00   109.00     0.39     0.57    12.40     0.75     4.73     0.00     6.90     4.70    74.80
>> sdh        55.00    54.00    52.00   104.00     0.42     0.51    12.12     0.81     5.21     0.23     7.69     5.13    80.00
>> sde        67.00    67.00    75.00   129.00     0.56     0.65    12.13     0.94     4.59     0.69     6.85     4.24    86.40
>> sda        64.00    76.00    58.00   109.00     0.48     0.61    13.29     0.84     5.03     0.21     7.60     4.89    81.60
>> sdb        35.00    72.00    57.00   104.00     0.36     0.57    11.84     0.69     4.27     0.14     6.54     4.22    68.00
>> sdc       118.00   144.00   228.00   269.00     1.39     1.50    11.92     1.21     2.43     1.88     2.90     1.50    74.40
>> md1         0.00     0.00     0.00   260.00     0.00     1.70    13.38     0.00     0.00     0.00     0.00     0.00     0.00
>
>> I've confirmed that the problem is that we have mixed two
>> models of SSD (520 series and 530 series), and that the 530
>> series drives perform significantly worse (under load) in
>> comparison.
>
> The queue sizes and waiting times on the second server are very
> low (on a somewhat similar system using 4TB disks I see waiting
> times in the 1-5 seconds range, not milliseconds).
The performance expectation for VDI is quite high. VMware likes to
say you can get away with 8-12 IOPS per virtual desktop; most people
think you only get good performance with 100 IOPS per desktop. The
"bad" time for VDI is what is called a "boot storm": boot, or
reboot, all of the Windows clients at the same time and see how long
they take to settle. The IO workload for this is 80%+ 4K random
writes. At 100 IOPS, Windows takes about 2 minutes to boot, so if
you need to support 500 VDI seats from a storage node, that node
needs to sustain 500 x 100 = 50,000 IOPS of 4K random writes. This
is so far past what hard disks can do as to be silly. Even with
SSDs, you need reasonably large arrays running RAID-10 to sustain
this. If you want to support 5000 VDI seats like this, stock Linux
just can't get there, but it can be done.

> The impression I get is that there is some issue with DRBD
> latency, because the second server's storage seems to me very
> underutilized. This latency may be related to the flash SSDs
> that you are using, because by default DRBD uses the "C"
> synchronization protocol. Probably if you switched to the "B" or
> even "A" protocols speed could improve, maybe a lot, even if
> performance arguably would be the same or much worse.
>
> Thus the most likely issue here is the 'fsync' problem: for
> "consumerish" SSDs barrier-writes are synchronous, because they
> don't have a battery/capacitor-backed cache, and rather slow for
> small writes, because of the large size of erase blocks, which
> can be mitigated with higher over-provisioning. These have much
> of the story:

On many consumer SSDs, barrier writes are only barriers, and are not
syncs at all: you are guaranteed serialization, but not that the
data has actually reached stable media. Then again, in a server
setup, especially with redundant power supplies, power loss to the
SSDs is rare; you are mostly protecting against system hangs and
other interconnect issues. The real system solution is to have some
quantity of non-volatile DRAM to which you can stage writes (either
a PCIe card like a Flashtec, or one or more NVDIMMs). This is how
the "major vendors" deal with sync writes.

> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> https://www.redhat.com/en/resources/ceph-pcie-ssd-performance-part-1
> http://www.spinics.net/lists/ceph-users/msg25928.html
>
> The 520s seem not too bad, but still a long way from the disks
> with battery/capacitor-backed cache.
>
>> the actual work-load is small random read/write, with the
>> writes causing the biggest load.
>
> Here most of the wise comments from the reply from D Dumitru
> apply, to summarize:
>
> * Small writes are a challenging workload for DRBD, regardless
>   of other issues.

My comment about DRBD was not that small writes are harder, but that
if your target can keep up with them at low queue depths, DRBD can
saturate GigE at 4K q=1 on a single thread. So DRBD is not really
the issue; the issue is the latency/IOPS behaviour of the target.

> * Small writes are a very challenging workload for flash SSDs
>   without battery/capacitor-backed caches.

Even with battery backup, small writes create garbage collection, so
while batteries may give you some short-term bursts, longer term you
still have to do the writes. A main benefit of battery backup in the
SSDs is that the metadata (mapping information) does not need to be
flushed with the actual data in real time, which makes the FTL
algorithms easier to implement.
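For anyone who wants to check their own drives, here is a minimal
sketch of the kind of synchronous small-write latency test the links
above describe (they use dd/fio; this is the same idea in Python).
The path is a placeholder; point it at a scratch file on the SSD
under test, never at a device holding data:

    #!/usr/bin/env python3
    # Rough sync-write latency probe: time COUNT 4K writes on a file
    # opened with O_DSYNC, so every write(2) must reach stable media
    # before returning (roughly what DRBD protocol C demands of the
    # backing store).
    import os
    import time

    PATH = "/mnt/test/syncfile"  # placeholder: scratch file on the SSD under test
    COUNT = 1000
    buf = b"\0" * 4096

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    t0 = time.monotonic()
    for i in range(COUNT):
        os.pwrite(fd, buf, i * 4096)
    elapsed = time.monotonic() - t0
    os.close(fd)

    print("%d sync 4K writes in %.2fs -> %.0f IOPS, %.2f ms/write"
          % (COUNT, elapsed, COUNT / elapsed, elapsed / COUNT * 1000.0))

A drive with a capacitor-backed cache will typically come in well
under a millisecond per write; consumer drives like the ones
discussed here can be an order of magnitude slower.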
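As a back-of-the-envelope check on the boot-storm sizing above, and
on Peter's parity-RAID point just below, here is a sketch of the
write-amplification arithmetic (the seat count and per-seat IOPS are
the rules of thumb from this thread, not measurements):

    #!/usr/bin/env python3
    # Boot-storm sizing plus the RAID5 read-modify-write (RMW) penalty.
    SEATS = 500
    IOPS_PER_SEAT = 100        # "good performance" rule of thumb
    DRIVES = 8

    host_write_iops = SEATS * IOPS_PER_SEAT           # 50,000

    # A sub-stripe random write on RAID5 is RMW: read old data, read
    # old parity, write new data, write new parity = 4 device ops.
    raid5_per_drive = host_write_iops * 4 / DRIVES    # 25,000 ops/drive

    # RAID10 only writes the two mirror legs: 2 device ops.
    raid10_per_drive = host_write_iops * 2 / DRIVES   # 12,500 ops/drive

    print("host 4K write IOPS:      %d" % host_write_iops)
    print("per-drive ops on RAID5:  %d" % raid5_per_drive)
    print("per-drive ops on RAID10: %d" % raid10_per_drive)

Half the per-drive load, and no parity reads in the write path, is
why RAID-10 is the usual choice for this workload.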
> * Parity RAID is a bad idea in general, in particular for
>   workloads with many small writes, for they amplify writes via
>   RMW.
>
> etc. etc. :-)

--
Doug Dumitru
WildFire Storage
http://www.wildfire-storage.com