Re: EC backend benchmark

Nick, thanks for your feedback.
Please find my response inline.

Regards
Somnath

-----Original Message-----
From: Nick Fisk [mailto:nick@xxxxxxxxxx]
Sent: Tuesday, May 12, 2015 1:02 PM
To: Somnath Roy; 'Christian Balzer'; ceph-users@xxxxxxxxxxxxxx
Subject: RE:  EC backend benchmark

Hi Somnath,

Firstly, thank you for sharing these results.

I suspect you are struggling to saturate anything due to the effects of serial latency. Have you tried scaling the clients above 8?

[Somnath] No; the reason is that scaling is minimal between 4 and 8 clients, so I am not expecting much from adding more clients.

I noticed a similar ceiling, albeit at a much lower performance threshold, when using 1Gb networking. I found that as soon as I started approaching ~30-40MB/s with small IOs (64k), the utilisation on network+disks started increasing the average latency, so there were diminishing returns from increasing the queue depth. By 50MB/s I couldn't push any more no matter what the queue depth was; of course, larger object sizes allowed me to max out the 1Gb network. Upgrading to 10Gb removed this bottleneck for smaller IOs due to lower latency. Is it possible you are starting to hit the same ceiling at 40Gb with 4MB IOs? Have you tried larger objects?

[Somnath] Yes, as expected, we are getting more BW with larger objects.
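
As a side note, the ceiling described above falls straight out of per-IO latency and queue depth. Here is a rough sketch of the arithmetic; the latency and queue-depth values are picked purely for illustration and are not from the attached results:

# Bandwidth ceiling implied by average per-IO latency at a fixed queue depth.
# By Little's law, achievable IOPS ~= queue_depth / avg_latency, so
# throughput ~= IOPS * io_size.

def ceiling_mb_per_s(io_size_kb, avg_latency_ms, queue_depth):
    iops = queue_depth / (avg_latency_ms / 1000.0)
    return iops * io_size_kb / 1024.0

# Illustrative numbers only: if average latency climbs to ~10 ms at QD=8,
# 64k IOs cap out around 50 MB/s, while 4M IOs could still push ~3200 MB/s.
print(ceiling_mb_per_s(64, 10.0, 8))    # ~50 MB/s
print(ceiling_mb_per_s(4096, 10.0, 8))  # ~3200 MB/s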

Also, Christian raises a very good point: there are a lot of extra writes that happen to the filestore beyond just the main data write. When you are doing <100 IOPS these are insignificant, but they become increasingly apparent as the number of IOPS goes up.

[Somnath] Yes, filestore has an inherent 2X WA plus Ceph metadata overhead. But in the case of EC, it should be writing less data than replication, and that is one of the reasons we are getting better write throughput than the replicated pool.
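
To put rough numbers on the "EC writes less data" point, here is a minimal sketch comparing raw bytes written per client byte. The m values below are assumptions (only the k values appear in the results), and filestore journal/metadata overhead, which applies to both cases, is ignored:

# Raw data written per client byte: N-way replication writes N copies,
# while EC(k, m) writes (k + m) / k copies of the client data.

def replication_factor(copies):
    return float(copies)

def ec_factor(k, m):
    return (k + m) / k

for k, m in [(4, 2), (6, 2), (9, 3), (15, 5)]:   # m values are examples only
    print("EC %d+%d writes %.2fx the client data" % (k, m, ec_factor(k, m)))
print("3x replication writes %.2fx the client data" % replication_factor(3))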

Looking at the first couple of lines, I can see that your latency nearly doubles as you double the number of clients, which probably reduces the benefit you would normally get from the extra clients. Also, the fact that increasing the number of k chunks seems to help is probably because each OSD has less data to write, so it has a slightly lower latency and can scale slightly better. I would imagine that at 16 clients the k=9 and k=15 configurations would probably scale a bit more, whereas k=4 and k=6 would still be stuck at the same speed. EC is probably quite similar to RAID, where latency is lowest when "queue depth <= number of disks".
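
For reference, the "less data per OSD" effect is just the chunking: each of the k data chunks is roughly object_size / k, so bigger k means smaller writes per OSD. A quick sketch, assuming 4MB objects and ignoring the coding chunks:

# Per-OSD data chunk size when a 4 MB object is striped across k data chunks.
OBJECT_SIZE_KB = 4 * 1024

for k in (4, 6, 9, 15):
    print("k=%2d: each data chunk is ~%d KB" % (k, OBJECT_SIZE_KB // k))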

It might be interesting to watch the average latency of the SSDs via iostat to see if it increases as the number of clients goes up.

One thing I did notice, which confused me slightly: your latency figures don't seem to match the bandwidth you are getting. Take the first line as an example:

Latency = 0.5 ms, which should give you about 2000 IOPS.

[Somnath] Well, I think we can't correlate latency and IOPS like that. IOPS are heavily dependent on the QD you are driving with; with more QD, IOPS will go up and latency will go up as well. Also, consider this an average latency, not a 99th percentile.

2000 IOPS at a 4M block size should give you 8000 MB/s.
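
For what it's worth, both readings of the numbers can be expressed with the same relation (IOPS ~= queue depth / average latency). A minimal sketch, where the QD=16 value is only an example of why average latency alone does not pin down IOPS:

# Little's law ties the three together: in-flight IOs (QD) = IOPS * avg_latency,
# so 0.5 ms average latency implies ~2000 IOPS only at an effective QD of 1.

def implied_bandwidth_mb_s(avg_latency_ms, io_size_mb, queue_depth=1):
    iops = queue_depth / (avg_latency_ms / 1000.0)
    return iops * io_size_mb

print(implied_bandwidth_mb_s(0.5, 4))      # 8000 MB/s at QD=1, as above
print(implied_bandwidth_mb_s(0.5, 4, 16))  # much higher at QD=16, hence the caveat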

Nick


> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> Of Somnath Roy
> Sent: 12 May 2015 16:28
> To: Christian Balzer; ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  EC backend benchmark
>
> Hi Christian,
> Wonder why you are saying EC will write more data than replication?
> Anyway, as you suggested, I will see how I can measure WA for EC vs.
> replication.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: Monday, May 11, 2015 11:28 PM
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Somnath Roy; Loic Dachary (loic@xxxxxxxxxxx)
> Subject: Re:  EC backend benchmark
>
>
> Hello,
>
> Could you have another EC run with differing block sizes, as described
> here:
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
> and look for write amplification?
>
> I'd suspect that, by the very nature of EC and the additional local
> checksums it (potentially) writes, write amplification would be worse than
> with replication.
>
> Which is something very much to consider with SSDs.
>
> Christian
>
> On Mon, 11 May 2015 21:23:40 +0000 Somnath Roy wrote:
>
> > Hi Loic and community,
> >
> > I have gathered the following data on the EC backend (all flash). I have
> > decided to use Jerasure since space saving is the utmost priority.
> >
> > Setup:
> > --------
> > 41 OSDs (each on 8 TB flash) in a 5-node Ceph cluster; 48-core
> > HT-enabled CPU / 64 GB RAM. Tested with Rados Bench clients.
> >
> > Result:
> > ---------
> >
> > It is attached in the doc.
> >
> > Summary :
> > -------------
> >
> > 1. It is doing pretty well on reads, and 4 Rados Bench clients are
> > saturating the 40 GbE network. With more physical servers, it is scaling
> > almost linearly and saturating 40 GbE on both hosts.
> >
> > 2. As suspected with Ceph, the problem is again with writes.
> > Throughput-wise it is beating replicated pools by a significant margin,
> > but it is not scaling with multiple clients and is not saturating anything.
> >
> > So, my questions are the following.
> >
> > 1. This probably has nothing to do with the EC backend; we are suffering
> > because of filestore inefficiencies. Do you think any tunable like the EC
> > stripe size (or anything else) will help here?
> >
> > 2. I couldn't make the fault domain 'host' because of a HW limitation.
> > Do you think that will play a role in performance for bigger k values?
> >
> > 3. Even though it is not saturating 40 GbE for writes, do you think
> > separating out public/private network will help in terms of performance?
> >
> > Any feedback on this is much appreciated.
> >
> > Thanks & Regards
> > Somnath
> >
> >
> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>
>






_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



