On Fri, 5 Sep 2014 09:42:02 +0000 Dan Van Der Ster wrote:

>
> > On 05 Sep 2014, at 11:04, Christian Balzer <chibi at gol.com> wrote:
> >
> > On Fri, 5 Sep 2014 07:46:12 +0000 Dan Van Der Ster wrote:
> >>
> >>> On 05 Sep 2014, at 03:09, Christian Balzer <chibi at gol.com> wrote:
> >>>
> >>> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> >>>
> >>>> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> >>>> <daniel.vanderster at cern.ch> wrote:
> >>>> [snip]
> >>>>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful
> >>>>> is the backfilling which results from an SSD failure? Have you
> >>>>> considered tricks like increasing the down out interval so
> >>>>> backfilling doesn't happen in this case (leaving time for the SSD
> >>>>> to be replaced)?
> >>>>>
> >>>>
> >>>> Replacing a failed SSD won't help your backfill. I haven't actually
> >>>> tested it, but I'm pretty sure that losing the journal effectively
> >>>> corrupts your OSDs. I don't know what steps are required to
> >>>> complete this operation, but it wouldn't surprise me if you need to
> >>>> re-format the OSD.
> >>>>
> >>> This.
> >>> All the threads I've read about this indicate that journal loss
> >>> during operation means OSD loss. Total OSD loss, no recovery.
> >>> From what I gathered the developers are aware of this and it might be
> >>> addressed in the future.
> >>>
> >>
> >> I suppose I need to try it then. I don't understand why you can't just
> >> use ceph-osd -i 10 --mkjournal to rebuild osd 10's journal, for
> >> example.
> >>
> > I think the logic is if you shut down an OSD cleanly beforehand you can
> > just do that.
> > However from what I gathered there is no logic to re-issue transactions
> > that made it to the journal but not the filestore.
> > So a journal SSD failing mid-operation with a busy OSD would certainly
> > be in that state.
> >
>
> I had thought that the journal write and the buffered filestore write
> happen at the same time.

Nope, definitely not.
That's why we have tunables like the ones at:
http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals

And people (me included) tend to crank that up (to eleven ^o^).

The write-out to the filestore may start roughly at the same time as the
journal gets things, but it can and will fall behind.

> So all the previous journal writes that
> succeeded are already on their way to the filestore. My (could be
> incorrect) understanding is that the real purpose of the journal is to
> be able to replay writes after a power outage (since the buffered
> filestore writes would be lost in that case). If there is no power
> outage, then filestore writes are still good regardless of a journal
> failure.
>
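
To make the "fall behind" point concrete, here is a toy Python model. It is
not Ceph code and makes no claim about how the OSD is actually implemented.
The class ToyOSD, its methods, and the max_ops parameter are all made up for
illustration. It assumes a write is acknowledged once it is in the journal,
that the store only holds what a later sync has made durable, and that the
OSD goes down together with the journal device, so nothing buffered in
memory survives.

```python
from collections import deque


class ToyOSD:
    """Toy write-ahead model: acknowledge on journal commit, sync to the store later."""

    def __init__(self):
        self.journal = deque()  # journaled transactions not yet synced to the store
        self.store = {}         # stand-in for what is durable in the filestore
        self.acked = 0          # number of writes acknowledged to clients

    def write(self, key, value):
        # The write is acknowledged as soon as it is in the journal.
        self.journal.append((key, value))
        self.acked += 1

    def sync(self, max_ops):
        # Make at most max_ops journaled transactions durable in the store.
        applied = 0
        while self.journal and applied < max_ops:
            key, value = self.journal.popleft()
            self.store[key] = value
            applied += 1
        return applied

    def lose_journal(self):
        # Simulate the journal device failing: unsynced transactions are gone.
        lost = list(self.journal)
        self.journal.clear()
        return lost


osd = ToyOSD()
for i in range(10):
    osd.write(f"obj{i}", i)
osd.sync(max_ops=6)  # the store has fallen behind the journal
lost = osd.lose_journal()
print(len(osd.store), "applied,", len(lost), "acknowledged but lost")
# prints: 6 applied, 4 acknowledged but lost
```

The longer the gap between syncs, the more acknowledged writes sit only in
the journal at any moment. In this toy model those are the ones that cannot
be re-issued once the journal is gone.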