SSD journal deployment experiences

chibi@xxxxxxx (Christian Balzer) · Fri, 5 Sep 2014 10:09:25 +0900

Hello,

On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:

> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> <daniel.vanderster at cern.ch> wrote:
> 
> >
> >
> > 1) How often are DC S3700's failing in your deployments?
> >
> 
> None of mine have failed yet.  I am planning to monitor the wear level
> indicator, and preemptively replace any SSDs that go below 10%.  Manually
> flushing the journal, replacing the SSD, and building a new journal is
> much faster than backfilling all the dependent OSDs.
>
What Craig said.
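
For the archive, a rough sketch of the planned swap Craig describes (flush the journal, replace the SSD, build a new journal). It only builds the command list and runs nothing. The `ceph-osd --flush-journal` and `--mkjournal` flags and `ceph osd set noout` are real; the `service ceph ...` lines and the OSD ids are assumptions and depend on your init system and layout.

```python
# Sketch only: builds (does not run) the commands for a planned journal swap.
# The service start/stop syntax is an assumption and varies by distro/init system.

def journal_swap_commands(osd_ids):
    """Return the ordered shell commands for moving the given OSDs to a new journal SSD."""
    cmds = ["ceph osd set noout"]  # keep the cluster from marking the stopped OSDs out
    for osd in osd_ids:
        cmds.append(f"service ceph stop osd.{osd}")        # assumed sysvinit-style invocation
        cmds.append(f"ceph-osd -i {osd} --flush-journal")  # write pending journal entries to the data disk
    # Placeholder for the manual step; not a real command.
    cmds.append("# replace the SSD and recreate the journal partitions/symlinks")
    for osd in osd_ids:
        cmds.append(f"ceph-osd -i {osd} --mkjournal")      # initialise a fresh, empty journal
        cmds.append(f"service ceph start osd.{osd}")
    cmds.append("ceph osd unset noout")
    return cmds


if __name__ == "__main__":
    # Hypothetical OSD ids sharing one journal SSD.
    print("\n".join(journal_swap_commands([3, 4, 5])))
```

The point of flushing first is that the OSD's data disk is consistent before the journal goes away, which is what makes this so much cheaper than a backfill.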

Hell, not one of the consumer Intels (3xx, 520s) I have has ever failed,
though they are aging faster of course.
Still got some ancient X-25s that haven't gone below 96% wearout.

I expect my DC 3700s to outlive 2 HDD generations. ^o^ 

Monitor and replace them accordingly and I doubt you'll ever lose one in
operation.
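
A minimal sketch of that kind of monitoring, assuming the wear level is read from `smartctl -A` output and that the drive reports it as the normalized value of `Media_Wearout_Indicator` (as Intel SSDs generally do; other vendors use other attribute names). The sample text is made up for illustration, and the 10% threshold is the one Craig mentions.

```python
# Sketch only: parses text in the format printed by `smartctl -A /dev/sdX`.
# SAMPLE is illustrative, not output captured from a real drive.

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12345
233 Media_Wearout_Indicator 0x0032   096   096   000    Old_age   Always       -       0
"""


def wear_remaining(smartctl_output, attr_name="Media_Wearout_Indicator"):
    """Return the normalized VALUE column for attr_name, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Columns: ID#, ATTRIBUTE_NAME, FLAG, VALUE, ...
        if len(fields) >= 4 and fields[1] == attr_name:
            return int(fields[3])
    return None


def needs_replacement(smartctl_output, threshold=10):
    """True when the wear indicator is present and has dropped below threshold."""
    value = wear_remaining(smartctl_output)
    return value is not None and value < threshold


if __name__ == "__main__":
    print(wear_remaining(SAMPLE), needs_replacement(SAMPLE))
```

Hook something like this into whatever alerting you already run, so the replacement happens on your schedule rather than the SSD's.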

> 
> 
> > 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
> > the backfilling which results from an SSD failure? Have you considered
> > tricks like increasing the down out interval so backfilling doesn't
> > happen in this case (leaving time for the SSD to be replaced)?
> >
> 
> Replacing a failed SSD won't help your backfill.  I haven't actually
> tested it, but I'm pretty sure that losing the journal effectively
> corrupts your OSDs.  I don't know what steps are required to complete
> this operation, but it wouldn't surprise me if you need to re-format the
> OSD.
>
This.
All the threads I've read about this indicate that journal loss during
operation means OSD loss. Total OSD loss, no recovery.