Re: NVMe + SSD + HDD RBD Replicas with Bluestore...

I don't understand why min_size = 2 would kill latency times.  Regardless of your min_size, a write to Ceph does not ack until it has completed on all copies.  That means that even with min_size = 1 the write will not be successful until it's written to the NVMe, the SSD, and the HDD (given your proposed setup).  Every write will have to hit the HDD before it acks, every time.  The performance boost you gain from using primary affinity to keep HDDs as secondary storage and SSDs/NVMes as primary storage is in the reads, not the writes.
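For the read side, this is roughly what that primary affinity tuning looks like (a sketch only -- the OSD IDs are placeholders, and depending on your release you may need "mon osd allow primary affinity = true" on the monitors before the values take effect):

  # Prefer NVMe OSDs as primaries, allow SSDs as a fallback,
  # and never pick the HDD OSDs as primaries.
  ceph osd primary-affinity osd.0  1.0   # NVMe OSD
  ceph osd primary-affinity osd.10 0.5   # SSD OSD
  ceph osd primary-affinity osd.20 0     # HDD OSD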

Having journals in front of the HDDs still means the write initially lands on the SSD.  If you can't configure that in Bluestore for the HDDs then don't use Bluestore... There's no reason you can't use Filestore with SSD/NVMe journals in front of your HDDs if it performs better for your configuration.  Bluestore is not the fastest solution for every use case, and Filestore is not getting deprecated... yet.
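If you go that route it's just the usual Filestore deployment with the journal pointed at the flash device, roughly like this (device paths are placeholders and the exact flags depend on your release):

  # Data on the HDD, journal on a partition of the SSD/NVMe.
  # On Luminous add --filestore, since Bluestore is the default there.
  ceph-disk prepare /dev/sdb /dev/nvme0n1p1
  ceph-disk activate /dev/sdb1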

Another note: with 100GB DC S3510 and DC S3500 SSDs journaling 4 HDDs each, they ran out of write endurance in under 18 months in a large RBD cluster.  The DC S3520 is not drastically more durable than those, so I wouldn't recommend using it.  The DC S3610s are much more durable and not much more expensive.

On Mon, Aug 21, 2017 at 4:25 PM Xavier Trilla <xavier.trilla@xxxxxxxxxxxxxxxx> wrote:

Hi,

 

I’m working on reducing the cost of our current Ceph cluster. We currently keep 3 replicas, all of them on SSDs (that cluster hosts the RBD disks of several hundred VMs), and lately I’ve been wondering if the following setup would make sense in order to improve cost / performance.

 

Ideally, we would move the PG primaries to high-performance nodes using NVMe, keep the secondary replica on SSDs and move the third replica to HDDs.

 

Most probably the hardware will be:

 

1st Replica: Intel P4500 NVMe (2TB)

2nd Replica: Intel S3520 SATA SSD (1.6TB)

3rd Replica: WD Gold hard drives (2 TB) (I’m considering either the 1TB or the 2TB model, as I want to have as many spindles as possible)

 

Also, the hosts running the OSDs would have quite different HW configurations (in our experience NVMe needs crazy CPU power in order to get the best out of it)

 

I know the NVMe and SATA SSD replicas will work, no problem there (we’ll just adjust the primary affinity and the crushmap in order to get the desired data layout + primary OSDs); what I’m worried about is the HDD replica.
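Roughly what I have in mind for the crushmap part is a rule that picks one host from each device class, something along these lines (just a sketch assuming the Luminous device classes; pool and rule names are placeholders):

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # Edit crush.txt and add a replicated rule that does
  # "step take default class nvme" + "step chooseleaf firstn 1 type host" + "step emit",
  # then repeats the same for the ssd and hdd classes; the first OSD emitted
  # ends up as the primary.
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new
  ceph osd pool set rbd-vms crush_rule nvme_ssd_hdd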

 

Also, the pool will have min_size 1 (I would love to use min_size 2, but it would kill latency) so, even if we have to do some maintenance on the NVMe nodes, writes to HDDs will always be “lazy”.
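Pool-wise that’s just the standard settings, something like this (the pool name is a placeholder):

  ceph osd pool set rbd-vms size 3
  ceph osd pool set rbd-vms min_size 1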

 

Before Bluestore (we are planning to move to Luminous, most probably by the end of the year or the beginning of 2018, once it is released and properly tested) I would just use SSD/NVMe journals for the HDDs. All writes would go to the SSD journal first and then be flushed to the HDD. But now, with Bluestore, I don’t think that’s an option anymore.

 

What I’m worried about is how having a rather slow third replica would affect the NVMe primary OSDs. WD Gold hard drives seem quite decent (for SATA drives), but obviously their performance is nowhere near that of SSDs or NVMe.

 

So, what do you think? Does anybody have opinions or experience they would like to share?

 

Thanks!

Xavier.

 

 

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
