Hello list,

To confirm what Christian has said: we have been playing with a 3-node cluster with 4 SSDs (3610) per node. With the journals co-located on the OSD SSDs we were getting 770 MB/s sustained on large sequential writes, and 35 MB/s at about 9200 IOPS on small random writes. Moving the journals onto an NVMe decreased the sustained throughput marginally, probably by 40 MB/s, and consistently increased the small random writes by about 10 MB/s and 3100 IOPS or so.

But now, with my small cluster, I've got a huge failure domain in each OSD server. As the number of OSDs increases, I would imagine the value of backing SSDs with NVMe journals diminishes.

B
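P.S. For anyone wanting to repeat this kind of test: figures like the above are typically gathered with rados bench for the large sequential case and fio's rbd engine for the small random case. A rough sketch follows; the pool and image names, block sizes, queue depth and runtimes are placeholders, not the exact commands behind the numbers quoted here:

    # large sequential writes: 4MB objects, 16 writers in flight (pool name is a placeholder)
    rados bench -p testpool 60 write -b 4194304 -t 16 --no-cleanup
    rados -p testpool cleanup

    # small random writes: 4k randwrite against a test RBD image (names are placeholders)
    fio --name=randwrite-4k --ioengine=rbd --clientname=admin --pool=testpool \
        --rbdname=bench-img --rw=randwrite --bs=4k --iodepth=32 \
        --runtime=60 --time_based --direct=1 --group_reporting

Watching iostat on the journal devices while these run makes it easy to see where the small-write IOPS are actually landing.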
On Tue, May 24, 2016 at 3:28 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Fri, 20 May 2016 15:52:45 +0000 EP Komarla wrote:
>
>> Hi,
>>
>> I am contemplating using an NVRAM card for OSD journals in place of SSD
>> drives in our Ceph cluster.
>>
>> Configuration:
>>
>> * 4 Ceph servers
>>
>> * Each server has 24 OSDs (each OSD is a 1TB SAS drive)
>>
>> * 1 PCIe NVRAM card of 16GB capacity per Ceph server
>>
>> * Both client & cluster networks are 10Gbps
>>
> Since you were afraid of losing just 5 OSDs if a single journal SSD were
> to fail, putting all your eggs in one NVRAM basket is quite the leap.
>
> Your failure domains should match your cluster size and abilities. Four
> nodes is a small cluster; losing one because your NVRAM card failed will
> have a massive impact during re-balancing, and then you'll have a 3-node
> cluster with less overall performance until you can fix things.
>
> And while a node can of course fail in its entirety as well (bad
> mainboard, CPU, RAM), those things can often be fixed quickly
> (especially if you have spares on hand) and don't need to involve a full
> re-balancing if Ceph is configured accordingly
> (mon_osd_down_out_subtree_limit = host).
>
> As for your question, this was discussed to some extent less than two
> months ago, especially concerning journal size and usage:
> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html
>
> That being said, it would be best to have a comparison between a
> normal-sized journal on a fast SSD/NVMe and the 600MB NVRAM journals.
>
> I'd expect small write IOPS to be faster with the NVRAM, and _maybe_ some
> slowdown compared to SSDs when it comes to large writes, such as during a
> backfill.
>
>> As per the Ceph documents:
>> The expected throughput number should include the expected disk
>> throughput (i.e., sustained data transfer rate) and network throughput.
>> For example, a 7200 RPM disk will likely have approximately 100 MB/s.
>> Taking the min() of the disk and network throughput should provide a
>> reasonable expected throughput. Some users just start off with a 10GB
>> journal size. For example:
>>
>>     osd journal size = 10000
>>
>> Given that I have a single 16GB card per server that has to be carved up
>> among all 24 OSDs, I will have to configure each OSD journal to be much
>> smaller, around 600MB, i.e., 16GB/24 drives. This is much smaller than
>> the 10GB/OSD journal that is generally used, so I am wondering if this
>> configuration and journal size is valid. Is there a performance benefit
>> to having a journal that is this small? Also, do I have to reduce the
>> default "filestore max sync interval" from 5 seconds to a smaller value,
>> say 2 seconds, to match the smaller journal size?
>>
> Yes, just to be on the safe side.
>
> Regards,
>
> Christian
>
>> Have people used NVRAM cards in Ceph clusters as journals? What is
>> their experience?
>>
>> Any thoughts?
>>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
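For anyone doing the arithmetic at home: 16GB spread across 24 OSDs is roughly 680MB per journal, so 600MB leaves a little headroom. A ceph.conf sketch along the lines discussed above; the values are illustrative only and not something tested on the hardware in this thread:

    [global]
    # keep a failed host from being marked out and re-balanced automatically
    mon osd down out subtree limit = host

    [osd]
    # osd journal size is stated in MB: 16GB NVRAM / 24 OSDs ~= 680MB, rounded down
    osd journal size = 600
    # flush more often than the 5s default so the small journal does not fill up
    filestore max sync interval = 2

As Christian notes, the shorter sync interval is mainly there to stay on the safe side with a journal that small.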