SSD journal deployment experiences

chibi@xxxxxxx (Christian Balzer) · Sat, 6 Sep 2014 20:27:18 +0900

On Fri, 5 Sep 2014 09:42:02 +0000 Dan Van Der Ster wrote:

> 
> > On 05 Sep 2014, at 11:04, Christian Balzer <chibi at gol.com> wrote:
> > 
> > On Fri, 5 Sep 2014 07:46:12 +0000 Dan Van Der Ster wrote:
> >> 
> >>> On 05 Sep 2014, at 03:09, Christian Balzer <chibi at gol.com> wrote:
> >>> 
> >>> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> >>> 
> >>>> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> >>>> <daniel.vanderster at cern.ch> wrote:
> >>>> 
[snip]
> >>>>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful
> >>>>> is the backfilling which results from an SSD failure? Have you
> >>>>> considered tricks like increasing the down out interval so
> >>>>> backfilling doesn't happen in this case (leaving time for the SSD
> >>>>> to be replaced)?
> >>>>> 
> >>>> 
> >>>> Replacing a failed SSD won't help your backfill.  I haven't actually
> >>>> tested it, but I'm pretty sure that losing the journal effectively
> >>>> corrupts your OSDs.  I don't know what steps are required to
> >>>> complete this operation, but it wouldn't surprise me if you need to
> >>>> re-format the OSD.
> >>>> 
> >>> This.
> >>> All the threads I've read about this indicate that journal loss
> >>> during operation means OSD loss. Total OSD loss, no recovery.
> >>> From what I gathered the developers are aware of this and it might be
> >>> addressed in the future.
> >>> 
> >> 
> >> I suppose I need to try it then. I don't understand why you can't just
> >> use ceph-osd -i 10 --mkjournal to rebuild osd 10's journal, for
> >> example.
> >> 
> > I think the logic is if you shut down an OSD cleanly beforehand you can
> > just do that.
> > However from what I gathered there is no logic to re-issue transactions
> > that made it to the journal but not the filestore.
> > So a journal SSD failing mid-operation with a busy OSD would certainly
> > be in that state.
> > 
> 
> I had thought that the journal write and the buffered filestore write
> happen at the same time. 

Nope, definitely not.

That's why we have tunables such as the filestore sync intervals,
documented at:
http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals

And people (me included) tend to crank those up (to eleven ^o^).

The write-out to the filestore may start at roughly the same time as the
journal receives the writes, but it can and will fall behind.
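
To make that lag concrete, here is a toy sketch in Python. It is not Ceph
code: the ToyOSD class and its sync_every parameter are hypothetical, and it
flushes after a fixed number of writes, whereas the real sync intervals are
time-based. It only shows that whatever sits in the journal but has not yet
been flushed to the filestore is gone if the journal device dies.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """Toy model: writes land in the journal first, filestore only on sync."""
    sync_every: int  # hypothetical knob: flush after this many journaled writes
    journal: list = field(default_factory=list)    # journaled, not yet flushed
    filestore: list = field(default_factory=list)  # flushed to the data disk

    def write(self, op):
        # Every write goes to the journal first.
        self.journal.append(op)
        if len(self.journal) >= self.sync_every:
            self.sync()

    def sync(self):
        # Flush journaled writes to the filestore, then trim the journal.
        self.filestore.extend(self.journal)
        self.journal.clear()

    def lose_journal(self):
        # Journal device dies: unflushed writes are lost, return them.
        lost = list(self.journal)
        self.journal.clear()
        return lost


def writes_lost_on_journal_failure(total_writes, sync_every):
    """Issue total_writes writes, then kill the journal and report the loss."""
    osd = ToyOSD(sync_every=sync_every)
    for i in range(total_writes):
        osd.write(i)
    return osd.lose_journal()
```

In this model a larger sync_every means fewer flushes but a bigger window of
writes that exist only in the journal, which is the trade-off behind cranking
the sync interval up.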

> So all the previous journal writes that
> succeeded are already on their way to the filestore. My (could be
> incorrect) understanding is that the real purpose of the journal is to
> be able to replay writes after a power outage (since the buffered
> filestore writes would be lost in that case). If there is no power
> outage, then filestore writes are still good regardless of a journal
> failure.
>