Re: NVRAM cards as OSD journals

Hello,

On Tue, 24 May 2016 14:30:41 +0000 Somnath Roy wrote:

> If you are not tweaking ceph.conf settings when using NVRAM as the
> journal, I would highly recommend trying the following.
> 
> 1. Since you have a very small journal, try reducing
> filestore_max_sync_interval/filestore_min_sync_interval significantly.
> 
Already mentioned that to the OP; really just a confirmation, as he had
thought of this himself.
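
(For the archives, a minimal sketch of what that could look like in
ceph.conf; the numbers are purely illustrative assumptions on my part,
not tested recommendations:

  [osd]
  filestore max sync interval = 1
  filestore min sync interval = 0.001

The defaults are 5 and 0.01 seconds respectively; the idea with a
~600MB journal is to flush often enough that it never fills up between
syncs.)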

> 2. If you are using Jewel, there are a bunch of filestore throttle
> parameters introduced (discussed over on ceph-devel) which do no
> throttling by default. But since your journal size is small and NVRAM
> is much faster, you may need to tweak those to extract better and more
> stable performance.
> 
Interesting.
I suppose these parameters are not actually documented outside the ML
and/or are named in such a fashion that guessing at their purpose and
valid values is an exercise in futility? ^_-
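
(If anyone goes digging through the archives later: from memory, and
therefore an assumption on my part rather than gospel, the knobs in
question look to be the filestore_queue_* /
filestore_expected_throughput_* family, roughly along these lines in
ceph.conf, with purely illustrative values:

  [osd]
  filestore expected throughput bytes = 209715200
  filestore expected throughput ops = 200
  filestore queue high delay multiple = 2
  filestore queue max delay multiple = 10

As Somnath says, the defaults apparently apply no throttling, so these
would have to be set explicitly to get any back-pressure before a
600MB journal fills up.)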

Christian

> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> Of Brian ::
> Sent: Tuesday, May 24, 2016 1:37 AM
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  NVRAM cards as OSD journals
> 
> Hello List
> 
> To confirm what Christian has said: we have been playing with a 3-node
> cluster with 4 SSDs (3610s) per node. Putting the journals on the OSD
> SSDs, we were getting 770 MB/s sustained with large sequential writes,
> and 35 MB/s and about 9200 IOPS with small random writes. Moving the
> journals to an NVMe decreased the sustained throughput marginally,
> probably by 40 MB/s, and consistently increased the small random
> writes by about 10 MB/s and 3100 IOPS or so. But now, with my small
> cluster, I've got a huge failure domain in each OSD server.
> 
> As the number of OSDs increases, I would imagine the value of backing
> SSDs with NVMe journals diminishes.
> 
> B
> 
> On Tue, May 24, 2016 at 3:28 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Fri, 20 May 2016 15:52:45 +0000 EP Komarla wrote:
> >
> >> Hi,
> >>
> >> I am contemplating using an NVRAM card for OSD journals in place of
> >> SSD drives in our Ceph cluster.
> >>
> >> Configuration:
> >>
> >> *         4 Ceph servers
> >>
> >> *         Each server has 24 OSDs (each OSD is a 1TB SAS drive)
> >>
> >> *         1 PCIe NVRAM card of 16GB capacity per ceph server
> >>
> >> *         Both Client & cluster network is 10Gbps
> >>
> > Since you were afraid of losing just 5 OSDs if a single journal SSD
> > were to fail, putting all your eggs in one NVRAM basket is quite the
> > leap.
> >
> > Your failure domains should match your cluster size and abilities,
> > and 4 nodes is a small cluster; losing one because your NVRAM card
> > failed will have massive impacts during re-balancing, and then you'll
> > have a 3-node cluster with less overall performance until you can fix
> > things.
> >
> > And while a node can of course fail in its entirety as well (bad
> > mainboard, CPU, RAM), these things can often be fixed quickly
> > (especially if you have spares on hand) and don't need to involve a
> > full re-balancing if Ceph is configured accordingly
> > (mon_osd_down_out_subtree_limit = host).
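> >
> > (In ceph.conf that would simply be something along the lines of:
> >
> >   [global]
> >   mon osd down out subtree limit = host
> >
> > so that a whole host going down is not automatically marked out and
> > doesn't trigger a full re-balance by itself.)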
> >
> > As for your question, this has been discussed to some extent less
> > than two months ago, especially concerning journal size and usage:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html
> >
> > That being said, it would be best to have a comparison of a
> > normal-sized journal on a fast SSD/NVMe versus the 600MB NVRAM
> > journals.
> >
> > I'd expect small write IOPS to be faster with the NVRAM, and _maybe_
> > some slowdown compared to SSDs when it comes to large writes, like
> > during a backfill.
> >
> >> As per the Ceph documentation: "The expected throughput number
> >> should include the expected disk throughput (i.e., sustained data
> >> transfer rate), and network throughput. For example, a 7200 RPM disk
> >> will likely have approximately 100 MB/s. Taking the min() of the
> >> disk and network throughput should provide a reasonable expected
> >> throughput. Some users just start off with a 10GB journal size. For
> >> example: osd journal size = 10000"
> >>
> >> Given that I have a single 16GB card per server that has to be
> >> carved up among all 24 OSDs, I will have to configure each OSD
> >> journal to be much smaller, around 600MB (i.e., 16GB / 24 drives).
> >> This value is much smaller than the 10GB/OSD journal that is
> >> generally used, so I am wondering if this configuration and journal
> >> size is valid.  Is there a performance benefit to having a journal
> >> this small?  Also, do I have to reduce the default "filestore max
> >> sync interval" from 5 seconds to a smaller value, say 2 seconds, to
> >> match the smaller journal size?
> >>
> > Yes, just to be on the safe side.
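> >
> > (Back-of-the-envelope, using the sizing rule from that part of the
> > documentation as I recall it, so treat my numbers as assumptions:
> >
> >   osd journal size = 2 * expected throughput * filestore max sync interval
> >
> > so a ~600MB journal at the default 5s interval only covers about
> > 600 / (2 * 5) = 60 MB/s per OSD, which the disks can exceed during
> > sequential writes or backfill, while at 2s it covers ~150 MB/s.)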
> >
> > Regards,
> >
> > Christian
> >
> >> Have people used NVRAM cards in their Ceph clusters as journals?
> >> What is their experience?
> >>
> >> Any thoughts?
> >>
> >>
> >>
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


