Hello,

re-adding the ML, so everybody benefits from this.

On Thu, 20 Oct 2016 14:03:56 +0530 Subba Rao K wrote:

> Hi Christian,
>
> I have seen one of your responses in the CEPH user group and wanted some
> help from you.
>
> Can you please share the HW configuration of the CEPH cluster which can
> service within 5 msec.
>
Make sure you re-read that mail and understand what I wrote there.
It's not the entire cluster, just the cache-tier.

And I have a very specific use case, ideally suited for cache-tiering.
There are 450 VMs, all running the same application, which tends to write
small logs and, more importantly, status and lock files.
It basically never does any reads after booting, and if it does, those come
from the in-VM pagecache.

A typical state of affairs from the view of Ceph is this:
---
client io 12948 kB/s wr, 2626 op/s
---

Again, these writes tend to be mostly to the same files (and thus Ceph
objects) over and over.
So the hot data and working set is rather small, significantly smaller than
the actual cache-pool: 4x DC S3610 800GB x2 (nodes) / 2 (replication).

> To meet the 5 msec latency, I was contemplating between an All-SSD Ceph
> cluster and a Cache-tier Ceph cluster with SAS drives. Without test data
> I am unable to decide.
>
Both can work, but without knowing your use case and working set that's
also impossible to answer.

Do read the current "RBD with SSD journals and SAS OSDs" thread, it has
lots of valuable information pertaining to this.

Based on that thread, the cache-tier of my test cluster can do this from
inside a VM:

fio --size=1G --ioengine=libaio --invalidate=1 --sync=1 --numjobs=1 --rw=write --name=fiojob --blocksize=4K --iodepth=1
---
  write: io=31972KB, bw=1183.7KB/s, iops=295, runt= 27012msec
    slat (msec): min=1, max=12, avg= 3.38, stdev= 1.31
    clat (usec): min=0, max=13, avg= 1.16, stdev= 0.44
     lat (msec): min=1, max=12, avg= 3.38, stdev= 1.31
---

The cache-tier HW is 2 nodes with 32GB RAM, one rather meek E5-2620 v3
(running with the PERFORMANCE governor though), 2x DC S3610 400GB (split
into 4 OSDs) and QDR (40Gb/s) Infiniband (IPoIB). Hammer, replication 2.

So obviously something beefier in the CPU and storage department (NVMe
comes to mind) should be even better; people have reached about 1ms for 4k
sync writes.

So if you have a DB-type application with a well-known working set and can
fit that into an NVMe cache-tier you can afford, that would be perfect.
Settings like "readforward" on the cache-tier can also keep it from getting
"polluted" by reads and thus free for all your writes (a rough sketch of
the relevant commands is at the bottom of this mail).

An ideal cache-tier node would have something like 2 NVMes (both Intel and
Samsung make decent ones, specifics depend on your needs like endurance), a
single CPU with FAST cores (6-8 cores over 3GHz) and the lowest-latency
networking you can afford (40Gb/s better than 10, etc.).
You may get away with a replication of 2 here IF the NVMes are well known,
trusted AND monitored, thus saving you a good deal of latency (0.5ms at
least, I reckon).

I'd still go for SSD journals for any HDD OSDs, though.

The inherent (write) latency of SSDs is higher than that of NVMes, but if
you were to go for a full-SSD cluster you should still be able to meet that
5ms easily and wouldn't have to worry about the complexity and risks of
cache-tiering.
OTOH you will want a replication of 3, with the resulting latency penalty
(and costs).

Then at the top end of cost and performance, you'd have an SSD cluster with
NVMe journals.
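For reference, since the cache-mode and sizing knobs are where people
usually stumble: a rough sketch of how such a cache-tier could be wired up
on Hammer. The pool names ("rbd" as the base pool, "cache-pool" as the
cache), PG count, ruleset number and thresholds below are placeholders, not
the values from my cluster, so verify everything against the documentation
for your release before copying any of it.

# create the cache pool and map it to the SSD/NVMe OSDs (placeholder values)
ceph osd pool create cache-pool 128 128
ceph osd pool set cache-pool crush_ruleset 1
ceph osd pool set cache-pool size 2
ceph osd pool set cache-pool min_size 1
# attach it in front of the base pool and set the cache mode
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool readforward
ceph osd tier set-overlay rbd cache-pool
# hit-set tracking and flush/evict thresholds (placeholder values)
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool hit_set_count 1
ceph osd pool set cache-pool hit_set_period 3600
ceph osd pool set cache-pool target_max_bytes 1000000000000
ceph osd pool set cache-pool cache_target_dirty_ratio 0.5
ceph osd pool set cache-pool cache_target_full_ratio 0.8

With "readforward", reads that miss the cache are redirected to the base
tier instead of promoting the object, which is what keeps the tier free for
the (small) write working set.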
Christian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com