Hello,

re-adding the ML, so everybody benefits from this.

On Thu, 20 Oct 2016 14:03:56 +0530 Subba Rao K wrote:

> Hi Christian,
>
> I have seen one of your responses in the CEPH user group and wanted some
> help from you.
>
> Can you please share the HW configuration of the CEPH cluster which can
> service within 5 msec.
>
Make sure you re-read that mail and understand what I wrote there.
It's not the entire cluster, just the cache-tier.

And I have a very specific use case, ideally suited for cache-tiering.
There are 450 VMs, all running the same application, which tends to write
small logs and, more importantly, status and lock files.
It basically never does any reads after booting, and if it does, those come
from the in-VM pagecache.

A typical state of affairs from the view of Ceph is this:
---
client io 12948 kB/s wr, 2626 op/s
---

Again, these writes tend to be mostly to the same files (and thus Ceph
objects) over and over.
So the hot data and working set is rather small, significantly smaller than
the actual cache-pool: 4x DC S3610 800GB x2 (nodes) / 2 (replication).

> To meet the 5 msec latency, I was contemplating between an All-SSD Ceph
> cluster and a Cache-tier Ceph cluster with SAS drives. Without test data
> I am unable to decide.
>
Both can work, but without knowing your use case and working set that's
also impossible to answer.

Do read the current "RBD with SSD journals and SAS OSDs" thread, it has
lots of valuable information pertaining to this.

Based on that thread, the cache-tier of my test cluster can do this from
inside a VM:

fio --size=1G --ioengine=libaio --invalidate=1 --sync=1 --numjobs=1 --rw=write --name=fiojob --blocksize=4K --iodepth=1
---
  write: io=31972KB, bw=1183.7KB/s, iops=295, runt= 27012msec
    slat (msec): min=1, max=12, avg= 3.38, stdev= 1.31
    clat (usec): min=0, max=13, avg= 1.16, stdev= 0.44
     lat (msec): min=1, max=12, avg= 3.38, stdev= 1.31
---

The cache-tier HW is 2 nodes with 32GB RAM, one rather meek E5-2620 v3
(running with the PERFORMANCE governor though), 2x DC S3610 400GB (split
into 4 OSDs) and QDR (40Gb/s) Infiniband (IPoIB). Hammer, replication 2.

So obviously something beefier in the CPU and storage department (NVMe
comes to mind) should be even better; people have reached about 1ms for 4k
sync writes.

So if you have a DB-type application with a well-known working set and can
fit that into an NVMe cache-tier you can afford, that would be perfect.
Settings like "readforward" on the cache-tier can also keep it from getting
"polluted" by reads and thus free for all your writes (a rough sketch of
the relevant commands is at the bottom of this mail).

An ideal cache-tier node would have something like 2 NVMes (both Intel and
Samsung make decent ones, specifics depend on your needs like endurance), a
single CPU with FAST cores (6-8 cores over 3GHz) and the lowest-latency
networking you can afford (40Gb/s better than 10, etc.).
You may get away with a replication of 2 here IF the NVMes are well known,
trusted AND monitored, thus saving you a good deal of latency (0.5ms at
least, I reckon).

I'd still go for SSD journals for any HDD OSDs, though.

The inherent (write) latency of SSDs is higher than that of NVMes, but if
you were to go for a full-SSD cluster you should still be able to meet that
5ms easily and wouldn't have to worry about the complexity and risks of
cache-tiering.
OTOH you will want a replication of 3, with the resulting latency penalty
(and costs).

Then at the top end of cost and performance, you'd have an SSD cluster with
NVMe journals.
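For reference, since the cache-mode and sizing knobs are where people
usually stumble: a rough sketch of how such a cache-tier could be wired up
on Hammer. The pool names ("rbd" as the base pool, "cache-pool" as the
cache), PG count, ruleset number and thresholds below are placeholders, not
the values from my cluster, so verify everything against the documentation
for your release before copying any of it.

# create the cache pool and map it to the SSD/NVMe OSDs (placeholder values)
ceph osd pool create cache-pool 128 128
ceph osd pool set cache-pool crush_ruleset 1
ceph osd pool set cache-pool size 2
ceph osd pool set cache-pool min_size 1
# attach it in front of the base pool and set the cache mode
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool readforward
ceph osd tier set-overlay rbd cache-pool
# hit-set tracking and flush/evict thresholds (placeholder values)
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool hit_set_count 1
ceph osd pool set cache-pool hit_set_period 3600
ceph osd pool set cache-pool target_max_bytes 1000000000000
ceph osd pool set cache-pool cache_target_dirty_ratio 0.5
ceph osd pool set cache-pool cache_target_full_ratio 0.8

With "readforward", reads that miss the cache are redirected to the base
tier instead of promoting the object, which is what keeps the tier free for
the (small) write working set.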
Christian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com