Once we put in our cache tier, the I/O on the spindles was so low that
we just moved the journals off the SSDs onto the spindles and left the
SSD space for cache. There has been testing showing that better
performance can be achieved by putting more OSDs on an NVMe disk, but
you also have to balance that against data not being evenly distributed
across OSDs, so some OSDs will use more space than others. I probably
wouldn't go beyond four 100 GB partitions, but it really depends on the
number of PGs and your data distribution. Also, even with all the data
in the cache, there is still a performance penalty for having the
caching tier vs. a native SSD pool. So if you are not using the
tiering, move to a straight SSD pool.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, Feb 3, 2016 at 5:01 AM, Sascha Vogt wrote:
> Hi all,
>
> we recently tried adding a cache tier to our ceph cluster. We had 5
> spinning disks per host with a single journal NVMe disk, hosting the 5
> journals (1 OSD per spinning disk). We have 4 hosts up to now, so
> overall 4 NVMes hosting 20 journals for 20 spinning disks.
>
> As we had some space left on the NVMes, we made two additional
> partitions on each NVMe and created a 4 OSD cache tier.
>
> To our surprise the 4 OSD cache pool was able to deliver the same
> performance as the previous 20 OSD pool while reducing the ops on the
> spinning disks to zero as long as the cache pool was sufficient to hold
> all / most data (ceph is used for very short-lived KVM virtual machines
> which do pretty heavy disk IO).
>
> As we don't need that much more storage right now we decided to extend
> our cluster by adding 8 additional NVMe disks solely as a cache pool and
> freeing the journal NVMes again. Now the question is: How to organize
> the OSDs on the NVMe disks (2 per host)?
>
> As the NVMes peak around 5-7 concurrent sequential writes (tested with
> fio) I thought about using 5 OSDs per NVMe. That would mean 10
> partitions (5 journals, 5 data). On the other hand the NVMes are only
> 400 GB in size, so that would result in OSD disk sizes of <80 GB
> (depending on the journal size).
>
> Would it make sense to skip the separate journal partition and leave the
> journal on the data disk itself, limiting it to a rather small
> amount (let's say 1 GB or even less?) as SSDs typically don't like
> sequential writes anyway?
>
> Or, if I leave journal and data on separate partitions, should I reduce
> the number of OSDs per disk to 3, as Ceph will most likely write to
> journal and data in parallel and I therefore already get 6 parallel
> "threads" of IO?
>
> Any feedback is highly appreciated :)
>
> Greetings
> -Sascha-
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
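
For anyone wanting to compare the partitioning options raised in this
thread, the arithmetic can be sketched roughly as below. This is only a
back-of-the-envelope helper, not anything Ceph provides: the function
and field names are made up for illustration, the 5 GB journal size for
the separate-partition cases is an assumed value (the thread only says
"depending on the journal size"), and the sketch ignores filesystem
overhead and uneven data distribution across OSDs.

```python
from typing import NamedTuple


class Layout(NamedTuple):
    """Result of splitting one NVMe device among several OSDs."""
    data_gb_per_osd: float  # space left for OSD data after the journal
    partitions: int         # number of partitions needed on the device


def nvme_layout(nvme_gb: float, osds: int, journal_gb: float,
                separate_journal: bool) -> Layout:
    """Estimate per-OSD data space and partition count (hypothetical helper).

    The journal consumes space whether it lives on its own partition or
    inside the data partition; only the partition count differs.
    """
    if osds < 1:
        raise ValueError("need at least one OSD per device")
    share_gb = nvme_gb / osds            # equal slice of the device per OSD
    data_gb = share_gb - journal_gb      # what remains for object data
    if data_gb <= 0:
        raise ValueError("journal does not fit in the per-OSD share")
    partitions = osds * 2 if separate_journal else osds
    return Layout(data_gb, partitions)


if __name__ == "__main__":
    nvme_gb = 400  # device size mentioned in the thread
    options = [
        ("5 OSDs, separate 5 GB journal (assumed size)", 5, 5.0, True),
        ("5 OSDs, 1 GB journal on the data partition", 5, 1.0, False),
        ("3 OSDs, separate 5 GB journal (assumed size)", 3, 5.0, True),
        ("4 OSDs, 1 GB journal on the data partition", 4, 1.0, False),
    ]
    for label, osds, journal_gb, separate in options:
        layout = nvme_layout(nvme_gb, osds, journal_gb, separate)
        print(f"{label}: {layout.data_gb_per_osd:.1f} GB data per OSD, "
              f"{layout.partitions} partitions")
```

The smaller the per-OSD data size, the more an unevenly filled OSD can
hurt, which is the balance mentioned in the reply above.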