SSD journal deployment experiences

Hi Christian,

> On 05 Sep 2014, at 03:09, Christian Balzer <chibi at gol.com> wrote:
> 
> 
> Hello,
> 
> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> 
>> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
>> <daniel.vanderster at cern.ch> wrote:
>> 
>>> 
>>> 
>>> 1) How often are DC S3700's failing in your deployments?
>>> 
>> 
>> None of mine have failed yet.  I am planning to monitor the wear level
>> indicator, and preemptively replace any SSDs that go below 10%.  Manually
>> flushing the journal, replacing the SSD, and building a new journal is
>> much faster than backfilling all the dependent OSDs.
>> 
> What Craig said.
> 
> Hell, even the consumer Intels (3xx, 520s) I have haven't failed yet,
> though they are aging faster of course. 
> Still got some ancient X-25s that haven't gone below 96% wearout.
> 
> I expect my DC 3700s to outlive 2 HDD generations. ^o^ 
> 
> Monitor and replace them accordingly and I doubt you'll ever lose one in
> operation.

OK, that's good to know.
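
(For the monitoring itself I'm assuming plain smartctl is enough; on the Intel SSDs the wear level shows up as SMART attribute 233, Media_Wearout_Indicator, so something like

  smartctl -A /dev/sdX | grep -i wearout

per journal device, wired into our normal monitoring, should cover the pre-emptive replacement.)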

> 
>> 
>> 
>>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
>>> the backfilling which results from an SSD failure? Have you considered
>>> tricks like increasing the down out interval so backfilling doesn't
>>> happen in this case (leaving time for the SSD to be replaced)?
>>> 
>> 
>> Replacing a failed SSD won't help your backfill.  I haven't actually
>> tested it, but I'm pretty sure that losing the journal effectively
>> corrupts your OSDs.  I don't know what steps are required to complete
>> this operation, but it wouldn't surprise me if you need to re-format the
>> OSD.
>> 
> This.
> All the threads I've read about this indicate that journal loss during
> operation means OSD loss. Total OSD loss, no recovery.
> From what I gathered the developers are aware of this and it might be
> addressed in the future.
> 

I suppose I need to try it then. I don't understand why you can't just use ceph-osd -i 10 --mkjournal to rebuild osd 10's journal, for example.
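
For the record, what I had in mind (untested; the commands assume a sysvinit setup, and osd.10 is just an example) is essentially the pre-emptive swap:

  ceph osd set noout                 # keep the cluster from marking anything out meanwhile
  service ceph stop osd.10           # stop the OSD cleanly
  ceph-osd -i 10 --flush-journal     # drain the old journal to the data disk
  # swap the SSD and recreate the journal partition/symlink here
  ceph-osd -i 10 --mkjournal         # write a fresh, empty journal
  service ceph start osd.10
  ceph osd unset noout

What I still need to test is whether --mkjournal on its own is safe after an uncontrolled journal loss, i.e. with unflushed writes already gone, which is presumably the case you and Craig are warning about.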

> Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5 ratio
> is sensible. However these will be the ones limiting your max sequential
> write speed if that is of importance to you. In nearly all use cases you
> run out of IOPS (on your HDDs) long before that becomes an issue, though.

IOPS is definitely the main limit, but we also have only a single 10Gig-E NIC on these servers, so four drives that can each write even just 200MB/s would be good enough.

Also, we'll put the SSDs in the first four ports of a SAS2008 HBA that is shared with the other 20 spinning disks. Counting the double writes, I expect the HBA will run out of bandwidth before these SSDs do.
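
Back of the envelope (my numbers, not measured): four of those SSDs at roughly 350-400MB/s is about 1.5GB/s of journal writes, which becomes about 3GB/s through the HBA once you count the matching writes to the data disks -- around the practical limit of a SAS2008 on PCIe 2.0 x8, and in any case more than the ~1.2GB/s the single 10Gig-E NIC can feed it.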

> Raiding the journal SSDs seems wasteful given the cost and quality of the
> DC 3700s. 
> Configure your cluster in a way that re-balancing doesn't happen unless
> you want to (when the load is low) by:
> a) Setting the "mon osd down out subtree limit" so that a host going down
> doesn't result in a full re-balancing and the resulting IO shit storm. In
> nearly all cases nodes are recoverable, and if a node isn't, its OSDs may be. And
> even if that fails, you get to pick the time for the recovery.

This is a good point; I have it set at the rack level now. The whole-node failure we experienced manifested as a device removal of all 24 drives, followed quickly by a hot-insert. Restarting the daemons brought those OSDs back online (though it happened outside of working hours, so backfilling kicked in before anyone noticed).
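
If I understand the option correctly, catching exactly that kind of whole-node event would mean setting it to host rather than leaving it at rack, i.e. in ceph.conf:

  [mon]
  mon osd down out subtree limit = host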


> b) As you mentioned and others have before, set the out interval so you
> can react to things. 

We use 15 minutes, so that we can reboot a host without triggering backfilling. What do you use?
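
(Concretely, that is

  [mon]
  mon osd down out interval = 900

in our ceph.conf.)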

> c) Configure the various backfill options to have only a small impact.
> Journal SSDs will improve things compared to your current situation. And
> if I recall correctly, you're using a replica size of 3 to 4, so you can
> afford a more sedate recovery.

It's already at 1 backfill, 1 recovery, and the lowest queue priority (1/63) for recovery IOs.
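
For completeness, in ceph.conf terms those are:

  [osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1    # lowest priority on the 1-63 scale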

> Journals on a filesystem go against KISS. 
> Not only do you add one more layer of complexity that can fail (and
> filesystems do have bugs as people were reminded when Firefly came out),
> you're also wasting CPU cycles that might be needed over in the less than
> optimal OSD code. ^o^
> And you gain nothing from putting journals on a filesystem.

Well, the gains I had in mind resulted from my assumption that you can create a new empty journal on another device and then restart the OSD. If that's not possible, then I agree there are no gains to speak of.


> You might want to look into cache pools (and dedicated SSD servers with
> fast controllers and CPUs) in your test cluster and for the future.
> Right now my impression is that there is quite a bit more polishing to be
> done (retention of hot objects, etc) and there have been stability concerns
> raised here.

Right, Greg already said publicly not to use the cache tiers for RBD.

Thanks for your thorough response; you've provided a lot of confidence that the traditional journal deployment is still a good, or even the best, option.

Cheers, Dan


> 
> Regards,
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


