Hello,

On Tue, 24 May 2016 14:30:41 +0000 Somnath Roy wrote:

> If you are not tweaking ceph.conf settings when using NVRAM as journal,
> I would highly recommend trying the following.
>
> 1. Since you have a very small journal, try to reduce
> filestore_max_sync_interval/min_sync_interval significantly.
>
Already mentioned that to the OP; confirmed, really, as he had thought
about this himself.

> 2. If you are using Jewel, there are a bunch of filestore throttle
> parameters introduced (discussed on ceph-devel) which now do no
> throttling by default. But since your journal size is small and NVRAM
> is much faster, you may need to tweak those to extract better and more
> stable performance.
>
Interesting. I suppose these parameters are not actually documented
outside the ML and/or named in such a fashion that guessing at their
purpose and parameters is an exercise in futility? ^_-

Christian
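For concreteness, a minimal ceph.conf sketch along the lines of point 1
above -- the interval values are illustrative assumptions for a ~600 MB
journal, not tested recommendations, and the subtree limit is the
setting mentioned further down the thread:

  [osd]
  # Flush the small journal far more aggressively than the 5 s
  # default so it never comes close to filling between syncs.
  # Illustrative values only; verify against your own workload.
  filestore max sync interval = 2
  filestore min sync interval = 0.1

  [mon]
  # Don't automatically mark a whole failed host "out"; a quick
  # repair then avoids a full re-balance.
  mon osd down out subtree limit = host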
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Brian ::
> Sent: Tuesday, May 24, 2016 1:37 AM
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: NVRAM cards as OSD journals
>
> Hello List
>
> To confirm what Christian has said: we have been playing with a 3-node
> cluster with 4 SSDs (3610) per node. Putting the journals on the OSD
> SSDs, we were getting 770 MB/s sustained with large sequential writes,
> and 35 MB/s and about 9200 IOPS with small random writes. Putting an
> NVMe in as journal decreased the sustained throughput marginally,
> probably by 40 MB/s, and consistently increased the small random
> writes by about 10 MB/s and 3100 IOPS or so. But now, with my small
> cluster, I've got a huge failure domain in each OSD server.
>
> As the number of OSDs increases, I would imagine the value of backing
> SSDs with NVMe journals diminishes.
>
> B
>
> On Tue, May 24, 2016 at 3:28 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Fri, 20 May 2016 15:52:45 +0000 EP Komarla wrote:
> >
> >> Hi,
> >>
> >> I am contemplating using an NVRAM card for OSD journals in place of
> >> SSD drives in our ceph cluster.
> >>
> >> Configuration:
> >>
> >> * 4 Ceph servers
> >> * Each server has 24 OSDs (each OSD is a 1TB SAS drive)
> >> * 1 PCIe NVRAM card of 16GB capacity per ceph server
> >> * Both client & cluster networks are 10Gbps
> >>
> > Since you were afraid of losing just 5 OSDs if a single journal SSD
> > were to fail, putting all your eggs in one NVRAM basket is quite the
> > leap.
> >
> > Your failure domains should match your cluster size and abilities,
> > and 4 nodes is a small cluster; losing one because your NVRAM card
> > failed will have massive impacts during re-balancing, and then you'll
> > have a 3-node cluster with less overall performance until you can fix
> > things.
> >
> > And while a node can of course fail in its entirety as well (bad
> > mainboard, CPU, RAM), these things can often be fixed quickly
> > (especially if you have spares on hand) and don't need to involve a
> > full re-balancing if Ceph is configured accordingly
> > (mon_osd_down_out_subtree_limit = host).
> >
> > As for your question, this has been discussed to some extent less
> > than two months ago, especially concerning journal size and usage:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html
> >
> > That being said, it would be best to have a comparison between a
> > normal-sized journal on a fast SSD/NVMe versus the 600MB NVRAM
> > journals.
> >
> > I'd expect small write IOPS to be faster with the NVRAM and _maybe_
> > to see some slowdown compared to SSDs when it comes to large writes,
> > like during a backfill.
> >
> >> As per the ceph documents:
> >> The expected throughput number should include the expected disk
> >> throughput (i.e., sustained data transfer rate) and network
> >> throughput. For example, a 7200 RPM disk will likely have
> >> approximately 100 MB/s. Taking the min() of the disk and network
> >> throughput should provide a reasonable expected throughput. Some
> >> users just start off with a 10GB journal size, for example:
> >> osd journal size = 10000
> >> Given that I have a single 16GB card per server that has to be
> >> carved up among all 24 OSDs, I will have to configure each OSD
> >> journal to be much smaller, around 600MB, i.e., 16GB/24 drives.
> >> This value is much smaller than the 10GB/OSD journal that is
> >> generally used. So I am wondering if this configuration and journal
> >> size is valid. Is there a performance benefit of having a journal
> >> that is this small? Also, do I have to reduce the default "filestore
> >> max sync interval" from 5 seconds to a smaller value, say 2 seconds,
> >> to match the smaller journal size?
> >>
> > Yes, just to be on the safe side.
> >
> > Regards,
> >
> > Christian
> >
> >> Have people used NVRAM cards in Ceph clusters as journals? What is
> >> their experience?
> >>
> >> Any thoughts?
> >>
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
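As a postscript, a quick back-of-envelope check using the journal
sizing rule the Ceph docs give alongside the text quoted above
(osd journal size >= 2 * expected throughput * filestore max sync
interval). The per-OSD throughput below is an assumed round figure, not
something measured on this cluster:

  #!/usr/bin/env python
  # Rough check of whether a ~600-700 MB journal fits the default
  # 5 s filestore max sync interval.  100 MB/s per OSD is the
  # ballpark the docs use for a 7200 RPM disk; treat it as an
  # assumption, not a measurement.
  nvram_mb = 16 * 1024.0            # 16 GB NVRAM card per server
  osds_per_node = 24
  per_osd_throughput_mb = 100.0     # assumed sustained MB/s per OSD

  journal_mb = nvram_mb / osds_per_node
  print("journal per OSD  : %.0f MB" % journal_mb)        # ~683 MB

  # journal size >= 2 * throughput * sync interval, so the longest
  # sync interval such a journal can absorb is roughly:
  max_interval_s = journal_mb / (2 * per_osd_throughput_mb)
  print("max sync interval: %.1f s" % max_interval_s)     # ~3.4 s

Which is why dropping "filestore max sync interval" from the 5 second
default to something like 2 seconds, as asked above, looks prudent
rather than optional.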