If you are not already tweaking ceph.conf settings when using NVRAM as
the journal, I would highly recommend trying the following.

1. Since you have a very small journal, try reducing
filestore_max_sync_interval / filestore_min_sync_interval significantly.

2. If you are using Jewel, there are a bunch of new filestore throttle
parameters (discussed on ceph-devel) which do no throttling by default.
But since your journal is small and the NVRAM is much faster, you may
need to tweak those to extract better and more stable performance (see
the sketch below).
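For example, something along these lines in ceph.conf -- option names as
in Jewel, and the values are only rough assumptions to start testing
from, not measured recommendations:

    [osd]
    # flush the journal more aggressively than the 5 s default,
    # since a ~600 MB journal fills up quickly
    filestore min sync interval = 0.01
    filestore max sync interval = 2
    # Jewel's filestore throttle: the delay multiples default to 0,
    # i.e. no throttling; backoff only kicks in once they are set
    filestore expected throughput ops = 200
    filestore expected throughput bytes = 536870912
    filestore queue high delay multiple = 2
    filestore queue max delay multiple = 10

These can also be changed on a running OSD with "ceph tell osd.*
injectargs" to experiment before committing anything to ceph.conf.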
Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Brian ::
Sent: Tuesday, May 24, 2016 1:37 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: NVRAM cards as OSD journals

Hello List

To confirm what Christian has said: we have been playing with a 3-node
cluster with 4 SSDs (3610) per node. With the journals on the OSD SSDs
we were getting 770 MB/s sustained with large sequential writes, and
35 MB/s and about 9200 IOPS with small random writes.

Moving the journals to an NVMe decreased the sustained throughput
marginally, by perhaps 40 MB/s, and consistently increased the small
random writes by about 10 MB/s and 3100 IOPS or so.

But now, with my small cluster, I've got a huge failure domain in each
OSD server. As the number of OSDs increases, I would imagine the value
of backing SSDs with NVMe journals diminishes.

B

On Tue, May 24, 2016 at 3:28 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Fri, 20 May 2016 15:52:45 +0000 EP Komarla wrote:
>
>> Hi,
>>
>> I am contemplating using an NVRAM card for OSD journals in place of
>> SSD drives in our ceph cluster.
>>
>> Configuration:
>>
>> * 4 Ceph servers
>>
>> * Each server has 24 OSDs (each OSD is a 1TB SAS drive)
>>
>> * 1 PCIe NVRAM card of 16GB capacity per ceph server
>>
>> * Both the client & cluster networks are 10Gbps
>>
> Since you were afraid of losing just 5 OSDs if a single journal SSD
> were to fail, putting all your eggs in one NVRAM basket is quite the
> leap.
>
> Your failure domains should match your cluster size and abilities, and
> 4 nodes is a small cluster; losing one because your NVRAM card failed
> will have a massive impact during re-balancing, and then you'll have a
> 3-node cluster with less overall performance until you can fix things.
>
> And while a node can of course fail in its entirety (bad mainboard,
> CPU, RAM), these things can often be fixed quickly (especially if you
> have spares on hand) and don't need to involve a full re-balancing if
> Ceph is configured accordingly (mon_osd_down_out_subtree_limit = host).
>
> As for your question, this has been discussed to some extent less than
> two months ago, especially concerning journal size and usage:
> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html
>
> That being said, it would be best to have a comparison between a
> normal-sized journal on a fast SSD/NVMe and the 600MB NVRAM journals.
>
> I'd expect small write IOPS to be faster with the NVRAM and _maybe_ to
> see some slowdown compared to SSDs when it comes to large writes, like
> during a backfill.
>
>> As per the ceph documents:
>> The expected throughput number should include the expected disk
>> throughput (i.e., sustained data transfer rate) and network
>> throughput. For example, a 7200 RPM disk will likely have
>> approximately 100 MB/s. Taking the min() of the disk and network
>> throughput should provide a reasonable expected throughput. Some users
>> just start off with a 10GB journal size. For example:
>> osd journal size = 10000
>>
>> Given that I have a single 16GB card per server that has to be carved
>> up among all 24 OSDs, I will have to configure each OSD journal to be
>> much smaller, around 600MB, i.e., 16GB/24 drives. This value is much
>> smaller than the 10GB/OSD journal that is generally used. So, I am
>> wondering if this configuration and journal size is valid. Is there a
>> performance benefit of having a journal that is this small? Also, do I
>> have to reduce the default "filestore max sync interval" from 5
>> seconds to a smaller value, say 2 seconds, to match the smaller
>> journal size?
>>
> Yes, just to be on the safe side.
>
> Regards,
>
> Christian
>
>> Have people used NVRAM cards in Ceph clusters as journals? What is
>> their experience?
>>
>> Any thoughts?
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
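As a back-of-the-envelope check on the sizing above, applying the usual
rule of thumb from the Ceph docs (journal size >= 2 * expected throughput
* filestore max sync interval) to the numbers quoted in this thread --
all of which are assumptions taken from the posts above, not
measurements:

    # 16 GB NVRAM / 24 OSDs  ~= 680 MB of journal per OSD
    # 680 MB / (2 * 100 MB/s) ~= 3.4 s
    # so a max sync interval of 2-3 s (rather than the 5 s default)
    # should keep a ~600 MB journal from filling between syncs
    [osd]
    osd journal size = 600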