HW recommendations for OSD journals?

The good SSDs will report how much of their estimated life has been used.
 It's not in the SMART spec though, so different manufacturers do it
differently (or not at all).


I've got Intel DC S3700s, and the SMART attributes include:
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0

I'm planning to monitor that value, and replace the SSD when it "gets old".
I don't know exactly what that means yet, but I'll figure it out.  It's
easy to replace SSDs before they fail, without losing the whole OSD.

With my write volume, I might just make it a quarterly manual task instead
of adding it to my monitoring tool.  TBD.
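
If I do automate it, it'll be something like the minimal sketch below.  It
just shells out to smartctl and reads attribute 233.  It assumes smartctl
is installed and that the drive reports the attribute the way the S3700s
do; the device path and threshold here are only examples, not my setup.

#!/usr/bin/env python
# Minimal sketch: warn when an SSD's Media_Wearout_Indicator drops below a
# chosen threshold.  Assumes smartctl is installed and the drive reports
# attribute 233; the device path and threshold are examples only.
import subprocess
import sys

DEVICE = "/dev/sdb"   # example journal SSD
THRESHOLD = 10        # normalized value starts at 100 and counts down

def wearout_value(device):
    out = subprocess.check_output(["smartctl", "-A", device])
    for line in out.decode().splitlines():
        fields = line.split()
        # SMART attribute rows start with the attribute ID.
        if fields and fields[0] == "233":
            return int(fields[3])   # current normalized VALUE column
    return None

if __name__ == "__main__":
    value = wearout_value(DEVICE)
    if value is None:
        print("attribute 233 not reported on %s" % DEVICE)
        sys.exit(1)
    print("Media_Wearout_Indicator on %s: %d" % (DEVICE, value))
    if value <= THRESHOLD:
        print("time to schedule a journal SSD swap")
        sys.exit(2)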



Of course, this won't prevent other types of SSD failure.  I've lost both
SSDs in a RAID1 when I triggered an Intel firmware bug.  I've lost both
SSDs in a RAID1 when the colo lost power (older SSDs without super caps).

The only way I can think of that would make RAID1 SSDs safer than a single
SSD is if you use two SSDs from different manufacturers.

Ceph's mantra is "failure is constant".  I'm not going to RAID my journal
devices.  I will use SSDs with power loss protection though.  I can handle
one or two SSDs dropping out at a time.  I can't handle a large percentage
of them dropping out at the same time.
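
To put rough numbers on that, here's a back-of-the-envelope sketch; the
OSD count and journals-per-SSD below are made-up examples, not my actual
layout.

# Back-of-the-envelope blast radius for journal SSD failures (illustrative
# numbers only, not a real cluster layout).
osds_total = 72          # e.g. 12 hosts x 6 OSDs
journals_per_ssd = 4     # OSD journals hosted on each journal SSD

def osds_lost(ssd_failures):
    """How many OSDs drop out when this many journal SSDs die at once."""
    return ssd_failures * journals_per_ssd

for failures in (1, 2, 6):
    lost = osds_lost(failures)
    print("%d SSD(s) down -> %d OSDs down (%.0f%% of the cluster)"
          % (failures, lost, 100.0 * lost / osds_total))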




On Wed, Jul 16, 2014 at 8:28 AM, Mark Nelson <mark.nelson at inktank.com>
wrote:

> On 07/16/2014 09:58 AM, Riccardo Murri wrote:
>
>> Hello,
>>
>> I am new to Ceph; the group I'm working in is currently evaluating it
>> for our new large-scale storage.
>>
>> Is there any recommendation for the OSD journals?  E.g., does it make
>> sense to keep them on SSDs?  Would it make sense to host the journal
>> on a RAID-1 array for added safety? (IOW: what happens if the journal
>> device fails and the journal is lost?)
>>
>> Thanks for any explanation and suggestion!
>>
>
> Hi,
>
> There are a couple of common configurations that make sense imho:
>
> 1) Leave journals on the same disks as the data (best to have them in
> their own partition).  This is a fairly safe option since the OSDs only
> have a single disk they rely on (IE minimize potential failures).  It can
> be slow, but it depends on the controller you use and possibly the IO
> scheduler.  Oftentimes a controller with writeback cache seems to help
> avoid seek contention during writes, but you will currently lose about half
> your disk throughput to journal writes during sequential write IO.
>
> 2) Put journals on SSDs.  In this scenario you want to match your per
> journal SSD speed and disk speed.  IE if you have an SSD that can do
> 400MB/s and disks that can do ~125MB/s of sequential writes, you probably
> want to put somewhere around 3-5 journals on the SSD depending on how much
> sequential write throughput matters to you.  OSDs are now dependent on both
> the spinning disk and the SSD not to fail, and one SSD failure will take
> down multiple OSDs.  You gain speed though and may not need more expensive
> controllers with WB cache (though they may still be useful to protect
> against power failure).
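>
> (A rough sketch of that sizing arithmetic, using the example figures
> above; real numbers depend on your hardware and on how much sequential
> write throughput you actually need.)
>
> ssd_write_mbps = 400.0    # example SSD sequential write speed
> disk_write_mbps = 125.0   # example per-disk sequential write speed
>
> # Every client write also lands in the journal, so the SSD needs roughly
> # the combined sequential write bandwidth of the disks behind it.
> journals_per_ssd = ssd_write_mbps / disk_write_mbps
> print("~%.1f journals per SSD before it becomes the bottleneck"
>       % journals_per_ssd)   # ~3.2, hence the 3-5 range above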
>
> Some folks have used raid-1 LUNs for the journals and it works fine, but
> I'm not really a fan of it, especially with SSDs.  You are causing double
> the writes to the SSDs, and SSDs tend to fail in clumps based on the number
> of writes.  If the choice is between 6 journals per SSD RAID-1 or 3
> journals per SSD JBOD, I'd choose the latter.  I'd want to keep my overall
> OSD count high though to minimize the fallout from 3 OSDs going down at
> once.
>
> Arguably if you do the RAID1, can swap failed SSDs quickly, and anticipate
> that the remaining SSD is likely going to die soon after the first, maybe
> the RAID1 is worth it.  The disadvantages seem pretty steep to me though.
>
> Mark
>
>
>
>> Riccardo
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

