Re: rocksdb corruption, stale pg, rebuild bucket index

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Mon, 17 Jun 2019 11:19:58 +0200

We have resharded a bucket with 60 million objects from 32 to 64
shards without any problems. (There were several slow ops during the
"stalls after counting the objects" phase, though, so I set nodown as a
precaution.)
We're now resharding that bucket from 64 to 1024.
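
As a rough sizing aid for the shard counts mentioned above, here is a
minimal shell sketch that computes the average number of index entries
per shard and the smallest shard count that keeps that average under a
per-shard target. The function names are made up for illustration, and
the target of 100000 is an assumption (I believe it matches the default
of rgw_max_objs_per_shard, but please check your own configuration).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any Ceph tool.

# Average number of index entries per shard, rounded up.
objs_per_shard() {
    objects=$1
    shards=$2
    echo $(( (objects + shards - 1) / shards ))
}

# Smallest shard count that keeps the average at or below the target.
min_shards() {
    objects=$1
    target=${2:-100000}   # assumed per-shard target, adjust as needed
    echo $(( (objects + target - 1) / target ))
}

objs_per_shard 60000000 32
objs_per_shard 60000000 64
objs_per_shard 60000000 1024
min_shards 60000000
```

For 60 million objects this prints 1875000, 937500 and 58594 entries
per shard for 32, 64 and 1024 shards, and 600 as the minimum shard
count for the assumed target.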

In your case I wonder if it was the large step up to 1024 shards that
caused the crashes somehow? Or maybe your bluefs didn't have enough
free space for the compaction after the large omaps were removed?

-- dan

On Mon, Jun 17, 2019 at 11:14 AM Harald Staub <harald.staub@xxxxxxxxx> wrote:
>
> We received the large omap warning earlier, but for various reasons we
> could not react quickly. We accepted the risk of the bucket becoming
> slow, but had not thought of further risks ...
>
> On 17.06.19 10:15, Dan van der Ster wrote:
> > Nice to hear this was resolved in the end.
> >
> > Coming back to the beginning -- is it clear to anyone what the root
> > cause was and how other users can prevent this from happening? Maybe
> > some better default configs to warn users earlier about too-large
> > omaps?
> >
> > Cheers, Dan
> >
> > On Thu, Jun 13, 2019 at 7:36 PM Harald Staub <harald.staub@xxxxxxxxx> wrote:
> >>
> >> Looks fine (at least so far), thank you all!
> >>
> >> After having exported all 3 copies of the bad PG, we decided to try it
> >> in-place. We also set norebalance to make sure that no data is moved.
> >> When the PG was up, the resharding finished with a "success" message.
> >> The large omap warning is gone after deep-scrubbing the PG.
> >>
> >> Then we set the 3 OSDs to out. Soon after, they went down one after the
> >> other (for maybe 2 minutes) and we got degraded PGs, but only once.
> >>
> >> Thank you!
> >>    Harry
> >>
> >> On 13.06.19 16:14, Sage Weil wrote:
> >>> On Thu, 13 Jun 2019, Harald Staub wrote:
> >>>> On 13.06.19 15:52, Sage Weil wrote:
> >>>>> On Thu, 13 Jun 2019, Harald Staub wrote:
> >>>> [...]
> >>>>> I think that increasing the various suicide timeout options will allow
> >>>>> it to stay up long enough to clean up the ginormous objects:
> >>>>>
> >>>>>     ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
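
Since the override above is only meant to keep the OSD alive during the
cleanup, it probably makes sense to remove it again afterwards. A small
sketch that only prints the matching set/remove pair for review; the
function name is hypothetical, the OSD id is whatever you pass in, and
only the one option named above is touched.

```shell
#!/bin/sh
# Prints (does not run) the commands to set and later remove the
# temporary suicide timeout override for a single OSD.
suicide_timeout_cmds() {
    osd=$1
    timeout=${2:-2h}   # 2h is the value suggested in the thread
    echo "ceph config set osd.$osd osd_op_thread_suicide_timeout $timeout"
    echo "ceph config rm osd.$osd osd_op_thread_suicide_timeout"
}

suicide_timeout_cmds 266
```

Run the first printed command before starting the cleanup and the
second one once the OSD is healthy again.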
> >>>>
> >>>> ok
> >>>>
> >>>>>> It looks healthy so far:
> >>>>>> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> >>>>>> fsck success
> >>>>>>
> >>>>>> Now we have to choose how to continue, trying to reduce the risk of
> >>>>>> losing data (most bucket indexes are intact currently). My guess would
> >>>>>> be to let this OSD (which was not the primary) go in and hope that it
> >>>>>> recovers. In case of a problem, maybe we could still use the other
> >>>>>> OSDs "somehow"? In case of success, we would bring back the other
> >>>>>> OSDs as well?
> >>>>>>
> >>>>>> OTOH we could try to continue with the key dump from earlier today.
> >>>>>
> >>>>> I would start all three osds the same way, with 'noout' set on the
> >>>>> cluster.  You should try to avoid triggering recovery because it will have
> >>>>> a hard time getting through the big index object on that bucket (i.e., it
> >>>>> will take a long time, and might trigger some blocked ios and so forth).
> >>>>
> >>>> This I do not understand: how would I avoid recovery?
> >>>
> >>> Well, simply doing 'ceph osd set noout' is sufficient to avoid
> >>> recovery, I suppose.  But in any case, getting at least 2 of the
> >>> existing copies/OSDs online (assuming your pool's min_size=2) will mean
> >>> you can finish the reshard process and clean up the big object without
> >>> copying the PG anywhere.
> >>>
> >>> I think you may as well do all 3 OSDs this way, then clean up the big
> >>> object--that way in the end no data will have to move.
> >>>
> >>> This is Nautilus, right?  If you scrub the PGs in question, that will also
> >>> now raise the health alert if there are any remaining big omap objects...
> >>> if that warning goes away you'll know you're done cleaning up.  A final
> >>> rocksdb compaction should then be enough to remove any remaining weirdness
> >>> from rocksdb's internal layout.
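
The sequence described above (noout, scrub, check the health alert,
compact) could be sketched roughly as follows. This is only a dry-run
helper that prints commands for review, not something to paste blindly:
the function name is made up, the PG id 1.2f and the OSD ids 267 and
268 are placeholders (only osd.266 appears in this thread), and I am
assuming that 'ceph tell osd.N compact' is available on your release.

```shell
#!/bin/sh
# Prints (does not run) a cleanup sequence for one PG and its OSDs.
# Usage: cleanup_plan PGID OSD_ID [OSD_ID ...]
cleanup_plan() {
    pgid=$1
    shift
    echo "ceph osd set noout"          # avoid recovery while OSDs restart
    echo "ceph pg deep-scrub $pgid"    # re-evaluates the large omap check
    echo "ceph health detail"          # the warning should be gone afterwards
    for osd in "$@"; do
        # assumed to exist on this release; verify before relying on it
        echo "ceph tell osd.$osd compact"
    done
    echo "ceph osd unset noout"
}

cleanup_plan 1.2f 266 267 268
```

The reshard itself has to finish between setting noout and the
deep-scrub; that step is left out here because it depends on the state
of the bucket.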
> >>>
> >>>>> (Side note that since you started the OSD read-write using the internal
> >>>>> copy of rocksdb, don't forget that the external copy you extracted
> >>>>> (/mnt/ceph/db?) is now stale!)
> >>>>
> >>>> As suggested by Paul Emmerich (see the next e-mail in this thread), I
> >>>> exported this PG. It did not take that long (20 minutes).
> >>>
> >>> Great :)
> >>>
> >>> sage
> >>>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com