Re: rocksdb corruption, stale pg, rebuild bucket index

On Thu, 13 Jun 2019, Harald Staub wrote:
> On 13.06.19 15:52, Sage Weil wrote:
> > On Thu, 13 Jun 2019, Harald Staub wrote:
> [...]
> > I think that increasing the various suicide timeout options will allow
> > it to stay up long enough to clean up the ginormous objects:
> > 
> >   ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
> 
> ok
> 
> > > It looks healthy so far:
> > > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> > > fsck success
> > > 
> > > Now we have to choose how to continue, trying to reduce the risk of losing
> > > data (most bucket indexes are intact currently). My guess would be to let
> > > this
> > > OSD (which was not the primary) go in and hope that it recovers. In case
> > > of a
> > > problem, maybe we could still use the other OSDs "somehow"? In case of
> > > success, we would bring back the other OSDs as well?
> > > 
> > > OTOH we could try to continue with the key dump from earlier today.
> > 
> > I would start all three osds the same way, with 'noout' set on the
> > cluster.  You should try to avoid triggering recovery because it will have
> > a hard time getting through the big index object on that bucket (i.e., it
> > will take a long time, and might trigger some blocked ios and so forth).
> 
> This I do not understand, how would I avoid recovery?

Well, simply doing 'ceph osd set noout' is sufficient to avoid 
recovery, I suppose.  But in any case, getting at least 2 of the 
existing copies/OSDs online (assuming your pool's min_size=2) will mean 
you can finish the reshard process and clean up the big object without 
copying the PG anywhere.

I think you may as well do all 3 OSDs this way, then clean up the big 
object--that way in the end no data will have to move.
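
(For reference, the sequence could look roughly like this; the index pool name,
pg id, and the second/third osd ids are placeholders -- adjust for your cluster:

  ceph osd set noout                        # keep the down OSDs from being marked out (no data migration)
  ceph osd pool get <index-pool> min_size   # confirm how many copies need to be up
  systemctl start ceph-osd@266              # repeat for the other two OSDs holding the PG
  ceph pg ls | grep <pgid>                  # wait for the PG to report active+clean

No recovery or backfill should kick in as long as all the existing copies come
back on their original OSDs.)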

This is Nautilus, right?  If you scrub the PGs in question, that will also 
now raise the health alert if there are any remaining big omap objects... 
if that warning goes away, you'll know you're done cleaning up.  A final 
rocksdb compaction should then be enough to remove any remaining weirdness 
from rocksdb's internal layout.
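
(Something along these lines, e.g.; the pg id is a placeholder and osd.266
stands in for each of the three OSDs:

  ceph pg deep-scrub <pgid>    # re-scrub so the large-omap check is re-evaluated
  ceph health detail           # LARGE_OMAP_OBJECTS warning should clear once the big object is gone
  ceph tell osd.266 compact    # final rocksdb compaction on each affected OSD
)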
 
> > (Side note that since you started the OSD read-write using the internal
> > copy of rocksdb, don't forget that the external copy you extracted
> > (/mnt/ceph/db?) is now stale!)
> 
> As suggested by Paul Emmerich (see next E-mail in this thread), I exported
> this PG. It did not take that long (20 minutes).

Great :)
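
For the archives, that export would be something along the lines of the
following, run with the OSD stopped; the pgid and output file are placeholders:

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-266 \
      --pgid <pgid> --op export --file /mnt/pg-export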

sage


