Re: NVMe + SSD + HDD RBD Replicas with Bluestore...

Mark, thanks for the information. 

Well, maybe EC and RBD, once Luminous is released, make sense for a lower-speed storage tier where costs matter more than performance. Let's see if I can find some time (pretty busy with other projects) to test it with Luminous and Bluestore.
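For whenever that testing happens, the Luminous recipe is roughly the following; this is only a sketch, and the pool names, PG counts and the 2+1 profile are placeholders (EC overwrites also require all OSDs in the pool to be on Bluestore):

    # 2+1 erasure profile with host as the failure domain
    ceph osd erasure-code-profile set rbd-ec-profile k=2 m=1 crush-failure-domain=host

    # EC data pool; overwrites are what make RBD on EC possible in Luminous
    ceph osd pool create rbd-ec-data 256 256 erasure rbd-ec-profile
    ceph osd pool set rbd-ec-data allow_ec_overwrites true

    # a small replicated pool still holds the image headers/omap,
    # since EC pools cannot store that metadata
    ceph osd pool create rbd-meta 64 64 replicated
    ceph osd pool application enable rbd-meta rbd
    ceph osd pool application enable rbd-ec-data rbd

    # the image's data objects land in the EC pool
    rbd create --size 100G --data-pool rbd-ec-data rbd-meta/test-img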

Thanks!

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On behalf of Mark Nelson
Sent: Thursday, August 24, 2017 2:18
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: NVMe + SSD + HDD RBD Replicas with Bluestore...



On 08/23/2017 06:18 PM, Xavier Trilla wrote:
> Oh man, what do you know!... I'm quite amazed. I've been reviewing more documentation about min_replica_size and it seems like it doesn't work the way I thought (although I remember specifically reading it somewhere some years ago :/ ).
>
> And, as all replicas need to be written before the primary OSD tells the client the write is complete, we cannot have the third replica on HDDs, no way. It would kill latency.
>
> Well, we'll just keep adding NVMe drives to our cluster (I mean, the price difference between the S4500 and P4500 is negligible) and we'll decrease the primary affinity weight of the SATA SSDs, just to be sure we get the most out of NVMe.
>
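For reference, lowering the odds that a SATA SSD OSD acts as primary is just a per-OSD knob; a quick sketch with a made-up OSD id (pre-Luminous monitors also need mon osd allow primary affinity = true):

    # 1.0 is the default; 0 means never primary unless there is no other choice
    ceph osd primary-affinity osd.42 0.25

    # the PRI-AFF column in the tree output confirms the change
    ceph osd tree
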
> BTW, does anybody have any experience so far with erasure coding and
> RBD? A 2/3 profile would really save space on SSDs, but I'm worried
> about the extra calculations needed and how they will affect
> performance... Well, maybe I'll look into it and start a new thread :)

There's a decent chance you'll get higher performance with something like EC 6+2 vs 3X replication for large writes due simply to having less data to write (we see somewhere between 2x and 3x rep performance in the lab for 4MB writes to RBD). Small random writes will almost certainly be slower due to increased latency.  Reads in general will be slower as well.  With replication the read comes entirely from the primary but in EC you have to fetch chunks from the secondaries and reconstruct the object before sending it back to the client.

So basically compared to 3X rep you'll likely gain some performance on large writes, lose some performance on large reads, and lose more performance on small writes/reads (dependent on cpu speed and various other factors).
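If you want to quantify that tradeoff on your own hardware before committing, rbd bench against a throwaway image whose data pool is the EC pool gives a quick first pass (the image name below is a placeholder; fio with the rbd engine gives more control):

    # large sequential writes: the case where EC tends to win
    rbd bench --io-type write --io-size 4M --io-threads 16 --io-total 10G rbd-meta/test-img

    # small random writes: the case where EC tends to lose
    rbd bench --io-type write --io-size 4K --io-pattern rand --io-total 1G rbd-meta/test-img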

Mark

>
> Anyway, thanks for the info!
> Xavier.
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: Tuesday, August 22, 2017 2:40
> To: ceph-users@xxxxxxxxxxxxxx
> CC: Xavier Trilla <xavier.trilla@xxxxxxxxxxxxxxxx>
> Subject: Re: NVMe + SSD + HDD RBD Replicas with Bluestore...
>
>
> Hello,
>
>
> Firstly, what David said.
>
> On Mon, 21 Aug 2017 20:25:07 +0000 Xavier Trilla wrote:
>
>> Hi,
>>
>> I'm working on improving the costs of our current Ceph cluster. We currently keep 3 replicas, all of them on SSDs (that cluster hosts the RBD disks of several hundred VMs), and lately I've been wondering if the following setup would make sense in order to improve cost / performance.
>>
>
> Have you done a full analysis of your current cluster, as in utilization of your SSDs (IOPS), CPU, etc. with atop/iostat/collectd/grafana?
> During peak utilization times?
>
> If so, you should have a decent enough idea of what level IOPS you need and can design from there.
>
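For that kind of check, something as simple as the following, run on each OSD node during peak hours, already shows how close the current SSDs and CPUs are to their limits:

    # per-device IOPS, throughput, queue size and %util, refreshed every 5 seconds
    iostat -xm 5

    # same idea interactively, including per-process CPU and disk usage
    atop 5
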
>> The ideal would be to move the PG primaries to high-performance nodes using NVMe, keep the secondary replica on SSDs and move the third replica to HDDs.
>>
>> Most probably the hardware will be:
>>
>> 1st Replica: Intel P4500 NVMe (2TB)
>> 2nd Replica: Intel S3520 SATA SSD (1.6TB)
> Unless you have:
> a) a lot of these and/or
> b) very few writes,
> what David said.
>
> Aside from that, the whole replica idea does not work the way you think.
>
>> 3rd Replica: WD Gold hard drives (2 TB) (I'm considering either the 1TB or
>> 2TB model, as I want to have as many spindles as possible)
>>
>> Also, hosts running OSDs would have a quite different HW
>> configuration (in our experience, NVMe drives need crazy CPU power
>> in order to get the best out of them)
>>
> Correct, one might run into that with pure NVMe/SSD nodes.
>
>> I know the NVMe and SATA SSD replicas will work, no problem about that (we'll just adjust the primary affinity and crushmap in order to get the desired data layout + primary OSDs); what I'm worried about is the HDD replica.
>>
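For what it's worth, the usual way to pin the first replica to NVMe and the others to SSD/HDD on Luminous is a custom CRUSH rule using device classes; a rough sketch of the workflow (rule name and id are placeholders, and NVMe OSDs may need their class set by hand, since autodetection only distinguishes hdd/ssd):

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # add a rule like this to crush.txt; the first OSD chosen becomes the
    # primary (subject to primary affinity):
    #   rule nvme-ssd-hdd {
    #       id 5
    #       type replicated
    #       min_size 3
    #       max_size 3
    #       step take default class nvme
    #       step chooseleaf firstn 1 type host
    #       step emit
    #       step take default class ssd
    #       step chooseleaf firstn 1 type host
    #       step emit
    #       step take default class hdd
    #       step chooseleaf firstn 1 type host
    #       step emit
    #   }

    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new
    ceph osd pool set <pool> crush_rule nvme-ssd-hdd
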
>> Also, the pool will have min_size 1 (I would love to use min_size 2, but it would kill latency) so, even if we have to do some maintenance on the NVMe nodes, writes to HDDs will always be "lazy".
>>
>> Before Bluestore (we are planning to move to Luminous most probably by the end of the year or the beginning of 2018, once it is released and properly tested) I would just use SSD/NVMe journals for the HDDs. So, all writes would go to the SSD journal and then be moved to the HDD. But now, with Bluestore, I don't think that's an option anymore.
>>
> Bluestore bits are still a bit of dark magic in terms of concise and complete documentation, but the essentials have been mentioned here before.
>
> Essentially, if you can get the needed IOPS with SSD/NVMe journals and HDDs, Bluestore won't be worse than that, if done correctly.
>
> With Bluestore, use NVMe for the WAL (small space, high IOPS), SSDs for the actual RocksDB and the (surprise, surprise!) journal for small writes (larger space, nobody knows for sure how large is large enough), and finally the HDDs for the data.
>
> If you're trying to optimize costs, decent SSDs to hold both the WAL and DB, maybe the S4600 or P4600 (good luck finding anything else, with the Intel 37xx and 36xx basically unavailable), should do the trick.
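For completeness, that layout maps directly onto the OSD creation flags; a sketch assuming ceph-volume on Luminous and placeholder device names (sizing the DB partition is still mostly guesswork, as noted above):

    # HDD for data, DB on a SATA SSD partition, WAL on an NVMe partition
    ceph-volume lvm create --bluestore --data /dev/sdb \
        --block.db /dev/sdc1 --block.wal /dev/nvme0n1p1

    # with a single fast device for both, --block.db alone is enough;
    # the WAL then automatically lives on the DB device
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p2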
>
> Christian
>
>> What I'm worried about is how having a quite slow third replica would affect the NVMe primary OSDs. WD Gold hard drives seem quite decent (for a SATA drive) but obviously their performance is nowhere near SSDs or NVMe.
>>
>> So, what do you think? Does anybody have opinions or experience they would like to share?
>>
>> Thanks!
>> Xavier.
>>
>>
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


