Oh man, what do you know!... I'm quite amazed. I've been reviewing more documentation about min_size and it seems it doesn't work as I thought (although I specifically remember reading it somewhere some years ago :/ ). And, as all replicas need to be written before the primary OSD informs the client that the write has completed, we cannot have the third replica on HDDs, no way. It would kill latency.

Well, we'll just keep adding NVMe drives to our cluster (I mean, the S4500 and P4500 price difference is negligible) and we'll decrease the primary affinity weight for the SATA SSDs, just to be sure we get the most out of NVMe.

BTW, does anybody have any experience so far with erasure coding and RBD? A 2/3 profile would really save space on SSDs, but I'm afraid of the extra calculations needed and how they will affect performance... Well, maybe I'll look into it and start a new thread :)

Anyway, thanks for the info!
Xavier.

-----Original Message-----
From: Christian Balzer [mailto:chibi@xxxxxxx]
Sent: Tuesday, August 22, 2017 2:40
To: ceph-users@xxxxxxxxxxxxxx
Cc: Xavier Trilla <xavier.trilla@xxxxxxxxxxxxxxxx>
Subject: Re: NVMe + SSD + HDD RBD Replicas with Bluestore...

Hello,

Firstly, what David said.

On Mon, 21 Aug 2017 20:25:07 +0000 Xavier Trilla wrote:

> Hi,
>
> I'm working on reducing the costs of our current Ceph cluster. We currently keep 3 replicas, all of them on SSDs (that cluster hosts the RBD disks of several hundred VMs), and lately I've been wondering if the following setup would make sense, in order to improve cost / performance.
>

Have you done a full analysis of your current cluster, as in utilization of your SSDs (IOPS), CPU, etc. with atop/iostat/collectd/grafana?
During peak utilization times?

If so, you should have a decent enough idea of what level of IOPS you need and can design from there.

> The ideal would be to move PG primaries to high performance nodes using NVMe, keep the secondary replica on SSDs and move the third replica to HDDs.
>
> Most probably the hardware will be:
>
> 1st Replica: Intel P4500 NVMe (2TB)
> 2nd Replica: Intel S3520 SATA SSD (1.6TB)

Unless you have:
a) a lot of these and/or
b) very few writes
what David said.

Aside from that, the whole replica idea doesn't work the way you think.

> 3rd Replica: WD Gold hard drives (2TB) (I'm considering either the 1TB or the 2TB model, as I want to have as many spindles as possible)
>
> Also, hosts running OSDs would have a quite different HW configuration (in our experience NVMe needs crazy CPU power in order to get the best out of it)
>

Correct, one might run into that with pure NVMe/SSD nodes.

> I know the NVMe and SATA SSD replicas will work, no problem about that (we'll just adjust the primary affinity and CRUSH map in order to have the desired data layout + primary OSDs); what I'm worried about is the HDD replica.
>
> Also the pool will have min_size 1 (would love to use min_size 2, but it would kill latency times) so, even if we have to do some maintenance on the NVMe nodes, writes to HDDs will always be "lazy".
>
> Before Bluestore (we are planning to move to Luminous most probably by the end of the year or the beginning of 2018, once it is released and tested properly) I would just use SSD/NVMe journals for the HDDs. So, all writes would go to the SSD journal and then be moved to the HDD. But now, with Bluestore, I don't think that's an option anymore.
>

Bluestore bits are still a bit of dark magic in terms of concise and complete documentation, but the essentials have been mentioned here before.

Essentially, if you can get the needed IOPS with SSD/NVMe journals and HDDs, Bluestore won't be worse than that, if done correctly.

With Bluestore, either use NVMe for the WAL (small space, high IOPS/data), SSDs for the actual RocksDB and the (surprise, surprise!) journal for small writes (large space, nobody knows for sure how large is large enough) and finally the HDDs.
If you're trying to optimize costs, decent SSDs (good luck finding any, with the Intel 37xx and 36xx basically unavailable), maybe the S or P 4600, holding both the WAL and DB should do the trick.

Christian

> What worries me is how having a quite slow third replica would affect the NVMe primary OSDs. WD Gold hard drives seem quite decent (for a SATA drive) but obviously their performance is nowhere near that of SSDs or NVMe.
>
> So, what do you think? Does anybody have some opinions or experience they would like to share?
>
> Thanks!
> Xavier.
>
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
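
For anyone reading this thread later, the two points it ends on can be put into rough numbers. The sketch below is only a toy model, not how Ceph computes anything: it assumes the client-visible write latency is simply the slowest replica's commit latency (network hops and primary-side processing are ignored), and it reads the "2/3 profile" as an erasure-coded pool with k=2 data chunks and m=1 coding chunk, which is my assumption. The function names and the latency figures are hypothetical placeholders, not measurements from any cluster.

```python
# Toy model only: all latency figures are assumed placeholders, not measurements.
ASSUMED_COMMIT_LATENCY_MS = {"nvme": 0.1, "sata_ssd": 0.5, "hdd": 10.0}


def replicated_write_latency_ms(replica_latencies_ms):
    """Latency seen by the client when the primary acknowledges a write
    only after every replica has committed it (slowest replica wins)."""
    if not replica_latencies_ms:
        raise ValueError("need at least one replica")
    return max(replica_latencies_ms)


def raw_space_multiplier(k, m):
    """Raw bytes stored per byte of user data for an erasure-coded pool
    with k data chunks and m coding chunks."""
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    return (k + m) / k


lat = ASSUMED_COMMIT_LATENCY_MS
all_flash = replicated_write_latency_ms([lat["nvme"], lat["sata_ssd"], lat["sata_ssd"]])
with_hdd = replicated_write_latency_ms([lat["nvme"], lat["sata_ssd"], lat["hdd"]])

print(f"NVMe + SSD + SSD write latency: {all_flash} ms")
print(f"NVMe + SSD + HDD write latency: {with_hdd} ms")
print(f"EC k=2, m=1 raw space per user byte: {raw_space_multiplier(2, 1)}x (3 replicas: 3x)")
```

Under these assumptions the HDD replica sets the write latency no matter how fast the primary is, which is the reason given above for dropping the HDD tier. The space figure says nothing about the CPU cost of erasure coding, which is the open question in the thread.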