Re: rocksdb corruption, stale pg, rebuild bucket index

Harald Staub <harald.staub@xxxxxxxxx> · Mon, 17 Jun 2019 11:12:56 +0200

We received the large omap warning before, but for some reason we could
not react quickly. We accepted the risk of the bucket becoming slow, but
had not thought of further risks ...

On 17.06.19 10:15, Dan van der Ster wrote:
Nice to hear this was resolved in the end.

Coming back to the beginning -- is it clear to anyone what the root
cause was and how other users can prevent this from happening? Maybe
some better default configs to warn users earlier about too-large
omaps?

Cheers, Dan

On Thu, Jun 13, 2019 at 7:36 PM Harald Staub <harald.staub@xxxxxxxxx> wrote:

Looks fine (at least so far), thank you all!

After having exported all 3 copies of the bad PG, we decided to try it
in-place. We also set norebalance to make sure that no data is moved.
When the PG was up, the resharding finished with a "success" message.
The large omap warning is gone after deep-scrubbing the PG.
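
For readers following along, the export and norebalance steps described
above can be sketched roughly as below. This is only an illustration, not
the exact commands that were run: the PG ID, the backup directory and the
OSD IDs 267 and 268 are placeholders (only OSD 266 is named in this
thread), and it assumes each OSD is stopped while ceph-objectstore-tool
reads its store. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run by default: print each command. Set RUN=1 to actually execute.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# export_pg PGID OUTDIR OSD_ID...
# Export one PG from each listed (stopped) OSD into OUTDIR.
export_pg() {
    local pgid="$1" outdir="$2"
    shift 2
    local id
    for id in "$@"; do
        run ceph-objectstore-tool \
            --data-path "/var/lib/ceph/osd/ceph-${id}" \
            --pgid "$pgid" --op export \
            --file "${outdir}/pg-${pgid}-osd-${id}.export"
    done
}

# Keep data from being moved while the PG is brought back in place.
freeze_rebalance() {
    run ceph osd set norebalance
}

# Placeholder PG ID, backup path and OSD IDs (only 266 is from the thread).
export_pg 1.2f /mnt/backup 266 267 268
freeze_rebalance
```

Once the cluster is healthy again, the flag would be cleared with
'ceph osd unset norebalance'.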

Then we set the 3 OSDs to out. Soon after, one after the other went down
(for maybe 2 minutes) and we got degraded PGs, but only once.

Thank you!
   Harry

On 13.06.19 16:14, Sage Weil wrote:
On Thu, 13 Jun 2019, Harald Staub wrote:
On 13.06.19 15:52, Sage Weil wrote:
On Thu, 13 Jun 2019, Harald Staub wrote:
[...]
I think that increasing the various suicide timeout options will allow
it to stay up long enough to clean up the ginormous objects:

    ceph config set osd.NNN osd_op_thread_suicide_timeout 2h

ok

It looks healthy so far:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
fsck success

Now we have to choose how to continue, trying to reduce the risk of losing
data (most bucket indexes are intact currently). My guess would be to let
this OSD (which was not the primary) go in and hope that it recovers. In
case of a problem, maybe we could still use the other OSDs "somehow"? In
case of success, we would bring back the other OSDs as well?

OTOH we could try to continue with the key dump from earlier today.

I would start all three osds the same way, with 'noout' set on the
cluster.  You should try to avoid triggering recovery because it will have
a hard time getting through the big index object on that bucket (i.e., it
will take a long time, and might trigger some blocked ios and so forth).

This I do not understand, how would I avoid recovery?

Well, simply doing 'ceph osd set noout' is sufficient to avoid
recovery, I suppose.  But in any case, getting at least 2 of the
existing copies/OSDs online (assuming your pool's min_size=2) will mean
you can finish the reshard process and clean up the big object without
copying the PG anywhere.

I think you may as well do all 3 OSDs this way, then clean up the big
object--that way in the end no data will have to move.
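
A rough sketch of that sequence, for illustration only: set noout, raise
the suicide timeout on each OSD, then start it. The OSD IDs 267 and 268
are placeholders (only 266 appears in this thread), and the systemd unit
name ceph-osd@ID is an assumption that holds for typical package-based
deployments. The script prints the commands unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Dry-run by default: print each command. Set RUN=1 to actually execute.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# bring_up_osds TIMEOUT OSD_ID...
# Set noout, then raise the suicide timeout on each OSD and start it.
bring_up_osds() {
    local timeout="$1"
    shift
    local id
    run ceph osd set noout
    for id in "$@"; do
        run ceph config set "osd.${id}" osd_op_thread_suicide_timeout "$timeout"
        run systemctl start "ceph-osd@${id}"
    done
}

# restore_defaults OSD_ID...
# After the big object is cleaned up: drop the overrides and clear noout.
restore_defaults() {
    local id
    for id in "$@"; do
        run ceph config rm "osd.${id}" osd_op_thread_suicide_timeout
    done
    run ceph osd unset noout
}

# Placeholder OSD IDs (only 266 is from the thread); 2h as suggested above.
bring_up_osds 2h 266 267 268
```

restore_defaults would be called with the same OSD IDs once the reshard
has finished and the large index object is gone.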

This is Nautilus, right?  If you scrub the PGs in question, that will also
now raise the health alert if there are any remaining big omap objects...
if that warning goes away you'll know you're done cleaning up.  A final
rocksdb compaction should then be enough to remove any remaining weirdness
from rocksdb's internal layout.
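
The check-and-compact steps might look roughly like this. It is a sketch
under assumptions, not a tested procedure: the PG ID and the OSD IDs 267
and 268 are placeholders, and online compaction via 'ceph tell' is one of
several ways to compact rocksdb. Note that 'ceph pg deep-scrub' only
schedules the scrub, so the health check is meaningful only after the
scrub has completed. The script prints the commands unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Dry-run by default: print each command. Set RUN=1 to actually execute.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Ask the primary OSD to deep-scrub one PG (asynchronous request).
request_deep_scrub() {
    run ceph pg deep-scrub "$1"
}

# Show health details; look for LARGE_OMAP_OBJECTS in the output.
show_health() {
    run ceph health detail
}

# compact_osds OSD_ID...
# Trigger an online rocksdb compaction on each listed OSD.
compact_osds() {
    local id
    for id in "$@"; do
        run ceph tell "osd.${id}" compact
    done
}

# Placeholder PG ID and OSD IDs (only 266 is from the thread).
request_deep_scrub 1.2f
show_health
compact_osds 266 267 268
```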

(Side note that since you started the OSD read-write using the internal
copy of rocksdb, don't forget that the external copy you extracted
(/mnt/ceph/db?) is now stale!)

As suggested by Paul Emmerich (see next e-mail in this thread), I exported
this PG. It did not take that long (20 minutes).

Great :)

sage

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com