Re: RBD fio Performance concerns

On 11/15/2012 12:23 PM, Sébastien Han wrote:

First of all, I would like to thank you for this well explained,
structured and clear answer. I guess I got better IOPS thanks to the 10K disks.

10K RPM would bring your per-drive throughput (for 4K random writes)
up to 142 IOPS and your aggregate cluster throughput up to 1700.
This would predict a corresponding RADOSbench throughput somewhere
above 425 (how much better depends on write aggregation and
cylinder affinity).  Your RADOSbench 708 now seems even more
reasonable.
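
For anyone who wants to check the arithmetic, here is a rough
Python sketch.  The 4ms seek and the 3ms half-rotation at 10K RPM
are assumed round numbers, not measurements:

    # back-of-envelope for twelve 10K RPM drives (assumed round numbers)
    seek_ms = 4.0            # short seek plus write settle-down (assumption)
    half_rotation_ms = 3.0   # half of a ~6ms rotation at 10,000 RPM
    per_drive_iops = 1000.0 / (seek_ms + half_rotation_ms)     # ~143
    cluster_iops = 12 * per_drive_iops                         # ~1700
    writes_per_create = 4    # 2 copies x 2 raw writes per copy (estimated below)
    print(round(per_drive_iops), round(cluster_iops),
          round(cluster_iops / writes_per_create))             # 143 1714 429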

To be really honest I wasn't so concerned about the RADOS benchmarks
but more about the RBD fio benchmarks and the amount of IOPS that
comes out of them, which I found a bit too low.

Sticking with 4K random writes, it looks to me like you were running
fio with libaio (which means direct, no buffer cache).  Because it
is direct, every I/O operation is really happening and the best
sustained throughput you should expect from this cluster is
the aggregate raw fio 4K write throughput (1700 IOPS) divided
by two copies = 850 random 4K writes per second.  If I read the
output correctly you got 763 or about 90% of back-of-envelope.
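
As a sanity check, the same divide-by-two in a tiny Python sketch
(2x replication, no credit given for write aggregation):

    raw_cluster_iops = 1700      # aggregate raw 4K random-write IOPS, 12 drives
    replicas = 2
    expected = raw_cluster_iops / replicas           # 850 client-visible writes/s
    measured = 763
    print(expected, round(measured / expected, 2))   # 850.0 0.9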

BUT, there are some footnotes (there always are with performance)

If you had been doing buffered I/O you would have seen a lot more
(up front) benefit from page caching ... but you wouldn't have been
measuring real (and hence sustainable) I/O throughput ... which is
ultimately limited by the heads on those twelve disk drives, where
all of those writes ultimately wind up.  It is easy to be fast
if you aren't really doing the writes :-)

I would have expected write aggregation and cylinder affinity to
have eliminated some seeks and improved rotational latency resulting
in better than theoretical random write throughput.  Against those
expectations 763/850 IOPS is not so impressive.  But, it looks to
me like you were running fio in a 1G file with 100 parallel requests.
The default RBD stripe width is 4M.  This means that those 100
parallel requests were being spread across 256 (1G/4M) objects.
People in the know tell me that writes to a single object are
serialized, which means that many of those (potentially) parallel
writes were to the same object, and hence serialized.  This would
increase the average request time for the colliding operations,
and reduce the aggregate throughput correspondingly.  Use a
bigger file (or a narrower stripe) and this will get better.
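
To put a rough number on those collisions, here is a sketch that
assumes the outstanding requests land uniformly at random across
the objects (fio's actual pattern will differ, so treat the exact
figures as illustrative only):

    # expected number of parallel requests that share an object
    def colliding_requests(queue_depth, num_objects):
        # expected distinct objects hit by queue_depth uniform random writes
        distinct = num_objects * (1 - (1 - 1.0 / num_objects) ** queue_depth)
        return queue_depth - distinct

    print(colliding_requests(100, 256))    # 1G file / 4M objects  -> ~17 of 100
    print(colliding_requests(100, 2048))   # 8G file / 4M objects  -> ~2 of 100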

Thus, getting 763 random 4K write IOPs out of those 12 drives
still sounds about right to me.


On 15 nov. 2012, at 19:43, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:

Dear Sebastien,

Ross Turk forwarded me your e-mail.  You sent a great deal
of information, but it was not immediately obvious to me
what your specific concern was.

You have 4 servers, 3 OSDs per, 2 copy, and you measured a
radosbench (4K object creation) throughput of 2.9MB/s
(or 708 IOPS).  I infer that you were disappointed by
this number, but it looks right to me.

Assuming typical 7200 RPM drives, I would guess that each
of them would deliver a sustained direct 4K random write
performance in the general neighborhood of:
    4ms seek (short seeks with write-settle-downs)
    4ms latency (1/2 rotation)
    0ms write (4K/144MB/s ~ 30us)
    -----
    8ms or about 125 IOPS

Your twelve drives should therefore have a sustainable
aggregate direct 4K random write throughput of 1500 IOPS.

Each 4K object create involves four writes (two copies, each
getting one journal write and one data write).  Thus I would
expect a (crude) 4K create rate of 375 IOPS (1500/4).
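
Putting that whole chain into one small Python sketch (the seek
and rotation figures are assumed round numbers, not measurements):

    seek_ms, half_rotation_ms = 4.0, 4.0   # assumed seek; ~half of 8.3ms rotation at 7200 RPM
    per_drive_iops = 1000.0 / (seek_ms + half_rotation_ms)     # 125
    cluster_iops = 12 * per_drive_iops                         # 1500 raw 4K writes/s
    writes_per_create = 4                                      # 2 copies x 2 writes each
    expected_creates = cluster_iops / writes_per_create        # 375
    print(per_drive_iops, cluster_iops, expected_creates)      # 125.0 1500.0 375.0
    # the measured 708 creates/s is roughly 1.9x this crude floor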

You are getting almost twice the expected raw IOPS ...
and we should expect that a large number of parallel
operations would realize some write/seek aggregation
benefits ... so these numbers look right to me.

Is this the number you were concerned about, or have I
misunderstood?

