Re: rocksdb corruption, stale pg, rebuild bucket index

Nice to hear this was resolved in the end.

Coming back to the beginning -- is it clear to anyone what the root
cause was, and how other users can prevent this from happening? Maybe
some better default configs could warn users earlier about too-large
omaps?
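
(A hedged sketch, not a tested recommendation: the warning threshold for
large omap objects is already tunable, so lowering it should surface
oversized bucket index shards at the next deep scrub, e.g.

   ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 200000

The option name exists; the value 200000 is only an illustrative choice.)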

Cheers, Dan

On Thu, Jun 13, 2019 at 7:36 PM Harald Staub <harald.staub@xxxxxxxxx> wrote:
>
> Looks fine (at least so far), thank you all!
>
> After having exported all 3 copies of the bad PG, we decided to try it
> in-place. We also set norebalance to make sure that no data is moved.
> When the PG was up, the resharding finished with a "success" message.
> The large omap warning is gone after deep-scrubbing the PG.
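>
> (Roughly, and with placeholder names since the bucket and PG id are not
> spelled out here, that sequence corresponds to something like:
>
>    ceph osd set norebalance
>    radosgw-admin reshard status --bucket=<bucket>
>    ceph pg deep-scrub <pgid>
>
> -- a sketch of the commands involved, not a verbatim record.)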
>
> Then we set the 3 OSDs to out. Soon after, they went down one after the
> other (maybe for 2 minutes) and we got degraded PGs, but only once.
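>
> (Presumably marked out with something like "ceph osd out <id1> <id2>
> <id3>" -- the actual OSD ids are not given here, so the exact command is
> an assumption.)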
>
> Thank you!
>   Harry
>
> On 13.06.19 16:14, Sage Weil wrote:
> > On Thu, 13 Jun 2019, Harald Staub wrote:
> >> On 13.06.19 15:52, Sage Weil wrote:
> >>> On Thu, 13 Jun 2019, Harald Staub wrote:
> >> [...]
> >>> I think that increasing the various suicide timeout options will allow
> >>> it to stay up long enough to clean up the ginormous objects:
> >>>
> >>>    ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
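> >>>
> >>> (Presumably the override can be dropped again after the cleanup, e.g.
> >>> with "ceph config rm osd.NNN osd_op_thread_suicide_timeout".)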
> >>
> >> ok
> >>
> >>>> It looks healthy so far:
> >>>> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> >>>> fsck success
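> >>>>
> >>>> (As an optional extra check -- not something suggested in the thread --
> >>>> a deep fsck would also read and verify object data, at the cost of
> >>>> scanning everything, e.g.
> >>>>
> >>>>    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 --deep 1 fsck
> >>>> )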
> >>>>
> >>>> Now we have to choose how to continue, trying to reduce the risk of
> >>>> losing data (most bucket indexes are intact currently). My guess would
> >>>> be to let this OSD (which was not the primary) go in and hope that it
> >>>> recovers. In case of a problem, maybe we could still use the other
> >>>> OSDs "somehow"? In case of success, we would bring back the other OSDs
> >>>> as well?
> >>>>
> >>>> OTOH we could try to continue with the key dump from earlier today.
> >>>
> >>> I would start all three osds the same way, with 'noout' set on the
> >>> cluster.  You should try to avoid triggering recovery because it will have
> >>> a hard time getting through the big index object on that bucket (i.e., it
> >>> will take a long time, and might trigger some blocked ios and so forth).
> >>
> >> This I do not understand, how would I avoid recovery?
> >
> > Well, simply doing 'ceph osd set noout' is sufficient to avoid
> > recovery, I suppose.  But in any case, getting at least 2 of the
> > existing copies/OSDs online (assuming your pool's min_size=2) will mean
> > you can finish the reshard process and clean up the big object without
> > copying the PG anywhere.
> >
> > I think you may as well do all 3 OSDs this way, then clean up the big
> > object--that way in the end no data will have to move.
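> >
> > (In command form, assuming the index pool and bucket names are known,
> > that would look roughly like:
> >
> >    ceph osd set noout
> >    ceph osd pool get <index-pool> min_size
> >    radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<n>
> >
> > -- only a sketch; the exact pool, bucket, and shard count are not given
> > in the thread.)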
> >
> > This is Nautilus, right?  If you scrub the PGs in question, that will also
> > now raise the health alert if there are any remaining big omap objects...
> > if that warning goes away you'll know you're done cleaning up.  A final
> > rocksdb compaction should then be enough to remove any remaining weirdness
> > from rocksdb's internal layout.
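> >
> > (Concretely, something along these lines -- with placeholders for the
> > PG and OSD ids, which are not spelled out here:
> >
> >    ceph pg deep-scrub <pgid>
> >    ceph health detail        # the LARGE_OMAP_OBJECTS warning should clear
> >    ceph tell osd.NNN compact
> > )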
> >
> >>> (Side note: since you started the OSD read-write using the internal
> >>> copy of rocksdb, don't forget that the external copy you extracted
> >>> (/mnt/ceph/db?) is now stale!)
> >>
> >> As suggested by Paul Emmerich (see the next e-mail in this thread), I
> >> exported this PG. It did not take that long (20 minutes).
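> >>
> >> (That export would have been done with the OSD stopped, using something
> >> along the lines of
> >>
> >>    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-266 \
> >>        --pgid <pgid> --op export --file /mnt/pg.export
> >>
> >> -- the pgid and destination path here are assumptions.)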
> >
> > Great :)
> >
> > sage
> >
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


