Re: Is it possible to recover from block.db failure?

Wido den Hollander <wido@xxxxxxxx> · Thu, 19 Oct 2017 17:02:02 +0200 (CEST)

> On 19 October 2017 at 16:47, Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
> 
> 
> Hi David,
> 
> Thank you for your answer, but wouldn't scrub (deep-scrub) handle
> that? It will flag the PGs affected by the unflushed journal as
> inconsistent, and you would have to repair those PGs. Or am I
> overlooking something here? The official blog doesn't state anything
> about this method being a bad idea.
> 

No, it doesn't. You would have to wipe the whole OSD when you lose a journal after an unclean shutdown.

The same goes for BlueStore and its WAL+DB. It's not a cache; it contains vital information about the OSD.

If you lose either the WAL or the DB, you can't be sure the OSD is still consistent, so you have to treat the OSD as lost.

Look at it the other way around: why would the WAL or the DB require persistent storage if they could simply be thrown away?
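
To make that decision concrete, here is a rough shell sketch. It is an illustration only, not something taken from the blog post: the function name plan_recovery and its three arguments are made up, and it only prints the commands it would suggest, it runs nothing. It assumes a Luminous cluster with systemd-managed OSDs, where "ceph osd purge" is available.

```shell
#!/bin/sh
# Dry-run sketch: prints suggested commands, executes none of them.
# plan_recovery STORE CLEAN OSD_ID   (hypothetical helper)
#   STORE : filestore | bluestore
#   CLEAN : yes if the FileStore journal was flushed
#           (ceph-osd -i ID --flush-journal) before the device was lost
plan_recovery() {
    store=$1
    clean=$2
    osd_id=$3
    if [ "$store" = "filestore" ] && [ "$clean" = "yes" ]; then
        # The journal was empty when it went away, so a new one is safe.
        echo "ceph-osd -i $osd_id --mkjournal"
    else
        # Unflushed journal, or a lost BlueStore WAL/DB: consistency of
        # the OSD is unknown, so remove it and let the cluster backfill.
        echo "ceph osd out $osd_id"
        echo "systemctl stop ceph-osd@$osd_id"
        echo "ceph osd purge $osd_id --yes-i-really-mean-it"
        echo "# then wipe the devices and create a new OSD"
    fi
}

plan_recovery filestore yes 3
plan_recovery bluestore no 7
```

Only the first branch keeps the existing data, and only because the journal was known to be empty. Every other combination ends in wiping and re-creating the OSD.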

Wido

> Caspar
> 
> 2017-10-19 16:14 GMT+02:00 David Turner <drakonstein@xxxxxxxxx>:
> > I'm speaking to the method in general and don't know the specifics of
> > bluestore.  Recovering from a failed journal in this way is only a good idea
> > if you were able to flush the journal before making a new one.  If the
> > journal failed during operation and you couldn't cleanly flush the journal,
> > then the data on the OSD could not be guaranteed and would need to be wiped
> > and started over.  The same would go for the block.wal and block.db
> > partitions if you can find the corresponding commands for them.
> >
> > On Thu, Oct 19, 2017 at 7:44 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
> >>
> >> Hi all,
> >>
> >> I'm testing some scenarios with the new Ceph Luminous/BlueStore
> >> combination.
> >>
> >> I've created a demo setup with 3 nodes (each has 10 HDDs and 2 SSDs).
> >> So I created 10 BlueStore OSDs with a separate 20GB block.db on the
> >> SSDs (5 HDDs per block.db SSD).
> >>
> >> I'm testing a failure of one of those SSDs (block.db failure).
> >>
> >> With FileStore I have used the following blog/script to recover from a
> >> journal SSD failure:
> >>
> >> http://ceph.com/no-category/ceph-recover-osds-after-ssd-journal-failure/
> >>
> >> I tried to adapt the script to BlueStore, but I couldn't find any
> >> BlueStore equivalent of the following command (where the journal is
> >> re-created):
> >>
> >> sudo ceph-osd --mkjournal -i $osd_id
> >>
> >> Tracing the 'ceph-disk prepare' command didn't reveal a separate
> >> command that initializes the BlueStore block.db. It looks like the
> >> --mkfs switch does all the work (including the data part). Am I
> >> correct?
> >>
> >> Is there any way a separate block.db can be initialized after the OSD
> >> was created? In other words: is it possible to recover from a block.db
> >> failure, or do I need to start over?
> >>
> >> block.db is probably not equivalent to a FileStore journal, but what
> >> about block.wal? If I use only a separate block.wal device, will the
> >> --mkjournal command re-initialize it, or is the --mkjournal command
> >> only used for FileStore?
> >>
> >> Kind regards and thanks in advance for any reply,
> >> Caspar Smit
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com