Hi all,

we recently tried adding a cache tier to our Ceph cluster. We have 5 spinning disks per host with a single NVMe journal disk hosting the 5 journals (1 OSD per spinning disk). With 4 hosts so far, that is 4 NVMes hosting 20 journals for 20 spinning disks. As we had some space left on the NVMes, we made two additional partitions on each NVMe and created a 4-OSD cache tier. To our surprise the 4-OSD cache pool delivered the same performance as the previous 20-OSD pool while reducing the ops on the spinning disks to zero, as long as the cache pool was large enough to hold all or most of the data (Ceph is used for very short-lived KVM virtual machines which do pretty heavy disk IO).

As we don't need much more storage right now, we decided to extend the cluster by adding 8 additional NVMe disks solely for a cache pool, freeing the journal NVMes again. Now the question is: how should we organize the OSDs on the NVMe disks (2 per host)?

As the NVMes peak at around 5-7 concurrent sequential writes (tested with fio, rough job sketch in the P.S. below), I thought about using 5 OSDs per NVMe. That would mean 10 partitions (5 journal, 5 data). On the other hand the NVMes are only 400 GB, so that would result in OSD data partitions of less than 80 GB (depending on the journal size).

Would it make sense to skip the separate journal partition, keep the journal on the data partition itself and limit it to a rather small size (let's say 1 GB or even less), since SSDs don't depend on sequential writes for their performance anyway? Or, if I keep journal and data on separate partitions, should I reduce the number of OSDs per disk to 3, as Ceph will most likely write to journal and data in parallel and I therefore already get 6 parallel "threads" of IO?

Any feedback is highly appreciated :)

Greetings
-Sascha-
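P.S. A few details in case they are useful. The 5-7 figure above came from plain fio sequential-write runs against the raw NVMe. The job below is only a rough sketch of that kind of test, not my exact job file; device name, block size and sizes are placeholders, and it overwrites the device, so only run it on an empty disk:

    [global]
    ioengine=libaio
    direct=1
    rw=write
    bs=4M
    iodepth=1
    time_based
    runtime=60
    group_reporting
    # placeholder device, gets overwritten!
    filename=/dev/nvme0n1

    [seqwrite]
    # vary numjobs between 1 and 8 and compare aggregate bandwidth
    numjobs=5
    # keep the parallel writers in separate regions of the disk
    offset_increment=40G
    size=40G

The idea is simply to watch where the aggregate bandwidth stops scaling as numjobs goes up.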
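The layout I have in mind for the 5-OSDs-per-NVMe variant would be something along these lines (again only a sketch; partition sizes and device names are placeholders, and ceph-disk may want to handle the partitioning / partition type GUIDs itself rather than taking pre-created partitions):

    # 5 journal partitions plus 5 data partitions on one 400 GB NVMe
    for i in 1 2 3 4 5; do sgdisk --new=${i}:0:+10G /dev/nvme0n1; done    # journals
    for i in 6 7 8 9 10; do sgdisk --new=${i}:0:+68G /dev/nvme0n1; done   # data
    # one OSD = one data partition plus one journal partition
    ceph-disk prepare /dev/nvme0n1p6 /dev/nvme0n1p1

That is also where the "less than 80 GB per OSD" number comes from: 400 GB minus the journals, divided by 5.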
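For the colocated-journal variant the only real knob I see is the journal size in ceph.conf, roughly like this (1 GB here, purely as an example):

    [osd]
    # journal size in MB; the journal lives as a file on the OSD's data
    # partition when no separate journal device is given
    osd journal size = 1024

whereas the 3-OSDs-per-NVMe variant would keep separate journal partitions and simply use fewer, larger data partitions.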