SSD journal deployment experiences

> On 05 Sep 2014, at 11:04, Christian Balzer <chibi at gol.com> wrote:
> 
> 
> Hello Dan,
> 
> On Fri, 5 Sep 2014 07:46:12 +0000 Dan Van Der Ster wrote:
> 
>> Hi Christian,
>> 
>>> On 05 Sep 2014, at 03:09, Christian Balzer <chibi at gol.com> wrote:
>>> 
>>> 
>>> Hello,
>>> 
>>> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
>>> 
>>>> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
>>>> <daniel.vanderster at cern.ch> wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> 1) How often are DC S3700's failing in your deployments?
>>>>> 
>>>> 
>>>> None of mine have failed yet.  I am planning to monitor the wear level
>>>> indicator, and preemptively replace any SSDs that go below 10%.
>>>> Manually flushing the journal, replacing the SSD, and building a new
>>>> journal is much faster than backfilling all the dependent OSDs.
>>>> 
>>> What Craig said.
>>> 
>>> Hell, none of the consumer Intels (3xx, 520s) I have has ever failed
>>> either, though they are aging faster of course. 
>>> Still got some ancient X-25s that haven't gone below 96% on the wearout
>>> indicator.
>>> 
>>> I expect my DC 3700s to outlive 2 HDD generations. ^o^ 
>>> 
>>> Monitor and replace them accordingly and I doubt you'll ever lose one
>>> in operation.
>> 
>> OK, that's good to know.
>> 
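For the wear-level monitoring, a minimal check would be something like the
following, assuming Intel drives that expose the Media_Wearout_Indicator
SMART attribute (the device name is just an example):

    # normalized value starts at 100 and counts down towards 0
    smartctl -A /dev/sdb | grep -i Media_Wearout_Indicator

Alerting once the normalized value drops below ~10 would match Craig's 10%
replacement threshold.
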
>>> 
>>>> 
>>>> 
>>>>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
>>>>> the backfilling which results from an SSD failure? Have you
>>>>> considered tricks like increasing the down out interval so
>>>>> backfilling doesn't happen in this case (leaving time for the SSD to
>>>>> be replaced)?
>>>>> 
>>>> 
>>>> Replacing a failed SSD won't help your backfill.  I haven't actually
>>>> tested it, but I'm pretty sure that losing the journal effectively
>>>> corrupts your OSDs.  I don't know what steps are required to complete
>>>> this operation, but it wouldn't surprise me if you need to re-format
>>>> the OSD.
>>>> 
>>> This.
>>> All the threads I've read about this indicate that journal loss during
>>> operation means OSD loss. Total OSD loss, no recovery.
>>> From what I gathered the developers are aware of this and it might be
>>> addressed in the future.
>>> 
>> 
>> I suppose I need to try it then. I don't understand why you can't just
>> use ceph-osd -i 10 --mkjournal to rebuild osd 10's journal, for example.
>> 
> I think the logic is if you shut down an OSD cleanly beforehand you can
> just do that.
> However from what I gathered there is no logic to re-issue transactions
> that made it to the journal but not the filestore.
> So a journal SSD failing mid-operation with a busy OSD would certainly be
> in that state.
> 

I had thought that the journal write and the buffered filestore write happen at the same time. So all the previous journal writes that succeeded are already on their way to the filestore. My (could be incorrect) understanding is that the real purpose of the journal is to be able to replay writes after a power outage (since the buffered filestore writes would be lost in that case). If there is no power outage, then filestore writes are still good regardless of a journal failure.
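If that understanding is right, then at least a planned journal replacement
(SSD still healthy) should be roughly the following sequence. This is an
untested sketch, and the OSD id is just an example:

    # stop the OSD cleanly and flush whatever is still only in the journal
    service ceph stop osd.10          # or however your init system stops it
    ceph-osd -i 10 --flush-journal
    # swap the SSD, repoint the journal symlink/partition at the new device
    ceph-osd -i 10 --mkjournal
    service ceph start osd.10

The open question is only the unplanned case, i.e. whether --mkjournal alone
is safe when the journal died mid-write.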


> I'm sure (hope) somebody from the Ceph team will pipe up about this.

Ditto!


>>> Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5
>>> ratio is sensible. However these will be the ones limiting your max
>>> sequential write speed if that is of importance to you. In nearly all
>>> use cases you run out of IOPS (on your HDDs) long before that becomes
>>> an issue, though.
>> 
>> IOPS is definitely the main limit, but we also only have 1 single
>> 10Gig-E NIC on these servers, so 4 drives that can write (even only
>> 200MB/s) would be good enough.
>> 
> Fair enough. ^o^
> 
>> Also, we'll put the SSDs in the first four ports of an SAS2008 HBA which
>> is shared with the other 20 spinning disks. Counting the double writes,
>> the HBA will run out of bandwidth before these SSDs, I expect.
>> 
> Depends on what PCIe slot it is and so forth. A 2008 should give you 4GB/s,
> enough to keep the SSDs happy at least. ^o^
> 
> A 2008 has only 8 SAS/SATA ports, so are you using port expanders on your
> case backplane? 
> In that case you might want to spread the SSDs out over channels, as in
> have 3 HDDs sharing one channel with one SSD.

We use a Promise VTrak J830sS, and now I'll go ask our hardware team whether there would be any benefit to storing the SSDs row-wise or column-wise.

With the current config, when I dd to all drives in parallel I can write at 24*74MB/s = 1776MB/s.
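
For reference, that test is just parallel sequential writes with dd, roughly
like this (device names obviously depend on the box, and since it destroys
data it is only run on drives not yet in use):

    for d in /dev/sd{b..y}; do
        dd if=/dev/zero of=$d bs=4M count=4096 oflag=direct &
    done
    wait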

> 
>>> Raiding the journal SSDs seems wasteful given the cost and quality of
>>> the DC 3700s. 
>>> Configure your cluster in a way that re-balancing doesn't happen unless
>>> you want it to (when the load is low) by:
>>> a) Setting the "mon osd downout subtree limit" so that a host going
>>> down doesn't result in a full re-balancing and the resulting IO shit
>>> storm. In nearly all cases nodes are recoverable, and if one isn't, the
>>> OSDs may be. And even if that fails, you get to pick the time for the
>>> recovery.
>> 
>> This is a good point; I have it set at the rack level now. The whole
>> node failure we experienced manifested as a device remove of all 24
>> drives followed quickly by a hot-insert. Restarting the daemons brought
>> those OSDs back online (though it was outside of working hours, so
>> backfilling kicked in before anyone noticed).
>> 
> Lucky! ^o^
> 
>> 
>>> b) As you mentioned and others have before, set the out interval so you
>>> can react to things. 
>> 
>> We use 15 minutes, which is so we can reboot a host without backfilling.
>> What do you use?
>> 
> I'm not using it right now, but for the cluster I'm currently deploying I
> will go with something like 4 hours (as do others here), or more if I feel
> that I might not be in time to set the cluster to "noout" if warranted.

Hmm, not sure I'd be comfortable with 4 hours. According to the rados reliability tool that would drop your durability from ~11-nines to ~9-nines (assuming 3 replicas, consumer drives).

auto mark-out:     15 minutes
    storage               durability    PL(site)  PL(copies)     PL(NRE)     PL(rep)    loss/PiB
    ----------            ----------  ----------  ----------  ----------  ----------  ----------
    RADOS: 3 cp             11-nines   0.000e+00   4.644e-10   0.000020%   0.000e+00   3.813e+02


auto mark-out:    240 minutes
    storage               durability    PL(site)  PL(copies)     PL(NRE)     PL(rep)    loss/PiB
    ----------            ----------  ----------  ----------  ----------  ----------  ----------
    RADOS: 3 cp              9-nines   0.000e+00   7.836e-08   0.000254%   0.000e+00   5.016e+04


> 
>>> c) Configure the various backfill options to have only a small impact.
>>> Journal SSDs will improve things compared to your current situation.
>>> And if I recall correctly, you're using a replica size of 3 to 4, so
>>> you can afford a more sedate recovery.
>> 
>> It's already at 1 backfill, 1 recovery, and the lowest queue priority
>> (1/63) for recovery IOs.
>> 
> So how long does that take you to recover 1TB then in the case of a
> single OSD failure?

Single OSD failures take us ~1 hour to backfill. The 24 OSD failure took ~2 hours to backfill.

> And is that setting still impacting your performance more than you'd like?

Single failures are transparent, but the 24-OSD failure was noticeable. Journal SSDs will improve the situation, like you said, so a 5-OSD failure (losing one journal SSD) would probably be close to transparent.
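
For completeness, the "1 backfill, 1 recovery, priority 1/63" settings
mentioned above correspond to something like this in ceph.conf:

    [osd]
    osd max backfills = 1            # one concurrent backfill per OSD
    osd recovery max active = 1      # one active recovery op per OSD
    osd recovery op priority = 1     # vs. the default client op priority of 63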


> 
>>> Journals on a filesystem go against KISS. 
>>> Not only do you add one more layer of complexity that can fail (and
>>> filesystems do have bugs as people were reminded when Firefly came
>>> out), you're also wasting CPU cycles that might be needed over in the
>>> less than optimal OSD code. ^o^
>>> And you gain nothing from putting journals on a filesystem.
>> 
>> Well the gains that I had in mind resulted from my assumption that you
>> can create a new empty journal on another device, then restart the OSD.
>> If that's not possible, then I agree there are no gains to speak of.
>> 
> Can always create a new partition as well, if there is enough space.

True..
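
Something like this is what I'd try if there is free space left on the SSD;
the partition number and size are made up for the example:

    # carve a new journal partition out of the free space
    sgdisk --new=5:0:+10G /dev/sdb
    # repoint the OSD's journal at it and initialize
    ln -sf /dev/sdb5 /var/lib/ceph/osd/ceph-10/journal
    ceph-osd -i 10 --mkjournal

(In practice the symlink should go via /dev/disk/by-partuuid/ rather than the
raw sdX name, since device names can move around between boots.)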

Cheers, Dan


> 
> Regards,
> 
> Christian
> 
>> 
>>> You might want to look into cache pools (and dedicated SSD servers with
>>> fast controllers and CPUs) in your test cluster and for the future.
>>> Right now my impression is that there is quite a bit more polishing to
>>> be done (retention of hot objects, etc) and there have been stability
>>> concerns raised here.
>> 
>> Right, Greg already said publicly not to use the cache tiers for RBD.
>> 
>> Thanks for your thorough response; you've provided a lot of confidence
>> that the traditional journal deployment is still a good or even the best
>> option.
>> 
>> Cheers, Dan
>> 
>> 
>>> 
>>> Regards,
>>> 
>>> Christian
>>> -- 
>>> Christian Balzer        Network/Systems Engineer                
>>> chibi at gol.com   	Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/


