One option you left out: you could put the journals on NVMe plus use the leftover space for a writeback bcache device which caches those 5 OSDs. This is exactly what I’m testing at the moment - 4xNVMe + 20 disks per box. Or just use the NVMe itself as a bcache cache device (don’t partition it) and let the journal be a file on the writeback-cached OSD :-)
Might be interesting to compare this to the cache pool version.
I’d love to hear others’ opinions on this!
On 03 Feb 2016, at 13:01, Sascha Vogt <sascha.vogt@xxxxxxxxx> wrote:
Hi all,
we recently tried adding a cache tier to our ceph cluster. We had 5 spinning disks per host with a single journal NVMe disk, hosting the 5 journals (1 OSD per spinning disk). We have 4 hosts so far, so overall 4 NVMes hosting 20 journals for 20 spinning disks.
As we had some space left on the NVMes, we made two additional partitions on each NVMe and created a 4 OSD cache tier.
To our surprise the 4 OSD cache pool was able to deliver the same performance as the previous 20 OSD pool while reducing the OPs on the spinning disks to zero, as long as the cache pool was large enough to hold all / most of the data (ceph is used for very short-lived KVM virtual machines which do pretty heavy disk IO).
As we don't need that much more storage right now we decided to extend our cluster by adding 8 additional NVMe disks solely as a cache pool and freeing the journal NVMes again. Now the question is: How to organize the OSDs on the NVMe disks (2 per host)?
As the NVMes peak around 5-7 concurrent sequential writes (tested with fio) I thought about using 5 OSDs per NVMe. That would mean 10 partitions (5 journals, 5 data). On the other hand the NVMes are only 400 GB in size, so that would result in OSD disk sizes of <80 GB (depending on the journal size).
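The sizing arithmetic above can be sketched roughly as follows. This is only an illustration: the function name and the journal sizes tried are assumptions, and it ignores partition alignment and filesystem overhead.

```python
def osd_data_size_gb(nvme_gb, osds_per_nvme, journal_gb):
    """Space left for each OSD's data partition after carving out its journal."""
    per_osd_gb = nvme_gb / osds_per_nvme  # even split of the NVMe across OSDs
    return per_osd_gb - journal_gb

# 400 GB NVMe, 5 OSDs each; the journal sizes below are illustrative only.
for journal_gb in (1, 5, 10):
    print(journal_gb, osd_data_size_gb(400, 5, journal_gb))
```

With 5 OSDs per 400 GB device, every GB given to a journal comes straight out of the 80 GB share of that OSD.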
Would it make sense to skip the separate journal partition, leave the journal on the data disk itself and limit it to a rather small size (let's say 1 GB or even less), as SSDs typically don't like sequential writes anyway?
Or, if I leave journal and data on separate partitions, should I reduce the number of OSDs per disk to 3, as Ceph will most likely write to journal and data in parallel and I therefore already get 6 parallel "threads" of IO?
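The reasoning behind "3 OSDs" can be written down as a small sketch. It takes the assumption stated in the question at face value (each OSD with a separate journal partition produces two parallel write streams) and uses the 5-7 concurrent writer peak from the fio test; the names are hypothetical and this is not a measurement of how Ceph actually schedules its writes.

```python
# 5-7 concurrent sequential writers, taken from the fio result quoted above.
PEAK_WRITERS = range(5, 8)

def parallel_streams(osds, streams_per_osd=2):
    """Write streams hitting one NVMe, assuming journal and data are written in parallel."""
    return osds * streams_per_osd

# OSD counts per NVMe whose stream count lands inside the measured peak range.
fitting = [n for n in range(1, 6) if parallel_streams(n) in PEAK_WRITERS]
print(fitting)
```

Under that assumption only 3 OSDs per NVMe (6 streams) falls inside the 5-7 range; if each OSD counted as a single stream, 5 OSDs would fit instead.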
Any feedback is highly appreciated :)
Greetings -Sascha-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com