Re: Optimal OSD count for SSDs / NVMe disks

One option you left out: you could put the journals on NVMe and use the leftover space as a writeback bcache device that caches those 5 OSDs. This is exactly what I’m testing at the moment: 4 NVMe devices + 20 disks per box.
Or just use the whole NVMe as a bcache cache device (don’t partition it) and let the journal be a file on the writeback-cached OSD :-)

Might be interesting to compare this to the cache pool version.

I’d love to hear others’ opinions on this!

On 03 Feb 2016, at 13:01, Sascha Vogt <sascha.vogt@xxxxxxxxx> wrote:

Hi all,

we recently tried adding a cache tier to our Ceph cluster. We had 5
spinning disks per host with a single journal NVMe disk hosting the 5
journals (1 OSD per spinning disk). We have 4 hosts so far, so
overall 4 NVMes hosting 20 journals for 20 spinning disks.

As we had some space left on the NVMes, we made two additional
partitions on each NVMe and created a 4 OSD cache tier.

To our surprise the 4 OSD cache pool was able to deliver the same
performance as the previous 20 OSD pool while reducing the ops on the
spinning disks to zero, as long as the cache pool was large enough to hold
all / most data (Ceph is used for very short-lived KVM virtual machines
which do pretty heavy disk IO).

As we don't need that much more storage right now we decided to extend
our cluster by adding 8 additional NVMe disks solely as a cache pool and
freeing the journal NVMes again. Now the question is: How to organize
the OSDs on the NVMe disks (2 per host)?

As the NVMes peak at around 5-7 concurrent sequential writes (tested with
fio), I thought about using 5 OSDs per NVMe. That would mean 10
partitions (5 journals, 5 data). On the other hand the NVMes are only
400 GB in size, so that would result in OSD data sizes of <80 GB
(depending on the journal size).
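
The sizing arithmetic above can be sketched roughly as follows. This is only an illustration: the function name and the sample journal sizes are made up for the example, capacities are treated as plain decimal GB, and partition table or filesystem overhead is ignored.

```python
# Rough per-OSD data partition size when one NVMe is split into
# N journal partitions and N data partitions.
# Hypothetical helper; ignores partition-table and filesystem overhead.

def osd_data_size_gb(disk_gb, osds, journal_gb):
    """Space left for each OSD's data partition, in GB."""
    per_osd_share = disk_gb / osds       # equal slice of the disk per OSD
    return per_osd_share - journal_gb    # journal comes out of that slice


# 400 GB NVMe, 5 OSDs, a few illustrative journal sizes (assumed values)
for journal_gb in (1, 5, 10):
    size = osd_data_size_gb(400, 5, journal_gb)
    print(f"journal {journal_gb:>2} GB -> data partition {size:.0f} GB")
```

With any non-zero journal size the data partition ends up below 80 GB, which matches the "<80 GB" estimate.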

Would it make sense to skip the separate journal partition and leave the
journal on the data disk itself, limiting it to a rather small
size (let's say 1 GB or even less), as SSDs typically don't like
sequential writes anyway?

Or, if I leave journal and data on separate partitions, should I reduce
the number of OSDs per disk to 3, as Ceph will most likely write to
journal and data in parallel and I would therefore already get 6 parallel
"threads" of IO?
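
To make the counting behind that question explicit, here is a small sketch. It simply encodes the assumption stated above (an OSD with a separate journal partition produces two parallel write streams, one with a co-located journal counts as one) and compares the total with the 5-7 range measured with fio. The helper name and the "one stream" figure for the co-located case are assumptions for illustration, not measured Ceph behaviour.

```python
# Count parallel write streams per NVMe under the assumption from the
# question: a separate journal partition doubles the streams per OSD.
# Hypothetical helper for illustration only.

PEAK_LOW, PEAK_HIGH = 5, 7   # concurrent sequential writes where fio peaked


def io_streams(osds, separate_journal):
    """Assumed number of parallel write streams hitting one NVMe."""
    streams_per_osd = 2 if separate_journal else 1
    return osds * streams_per_osd


for osds in (3, 5):
    for separate in (True, False):
        n = io_streams(osds, separate)
        in_range = PEAK_LOW <= n <= PEAK_HIGH
        layout = "separate journal" if separate else "journal on data partition"
        print(f"{osds} OSDs, {layout}: {n} streams, within peak range: {in_range}")
```

Under this assumption, 3 OSDs with separate journals and 5 OSDs with co-located journals both land inside the measured range, while 5 OSDs with separate journals would exceed it.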

Any feedback is highly appreciated :)

Greetings
-Sascha-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


