Hi,

For the last few months I've been getting questions from people seeing warnings about large OMAP objects after scrubs. I've been digging into this for a while (you'll also find multiple threads about it) and it all seemed to trace back to RGW indexes: resharding didn't clean up old index objects properly, which caused the RGW index pool to keep growing and growing in number of objects.

Last week I got a case where an RGW-only cluster running on HDD became unusably slow. OSDs flapping, slow requests, the whole package. (yay!)

I traced it down to OSDs sometimes scanning through RocksDB (seen with debug bluefs), during which the HDD would be 100% busy for a few minutes. Compacting these OSDs could take more than 30 minutes and only helped for a while.

This cluster was running 12.2.8 and we upgraded it to 12.2.11 to run:

$ radosgw-admin reshard stale-instances list > instances.json
$ cat instances.json | jq -r '.[]' | wc -l

It showed that there were 88k stale instances. The rgw.buckets.index pool showed 222k objects according to 'ceph df'.

So we started to clean up the stale instances, as they are mainly stored as OMAP in RocksDB:

$ radosgw-admin reshard stale-instances rm

While this was running, OSDs would sometimes start to flap and we had to cancel, compact and restart the rm. After 6 days (!) of rm'ing, all the stale indexes were gone. The index pool went from 222k objects to just 43k objects. We compacted all the OSDs, which now took just 3 minutes, and things are running properly again.

As a precaution NVMe devices have been added, and using device classes we moved the index pool to NVMe-backed OSDs only. Nevertheless, this would not have worked on NVMe either: for some reason RocksDB couldn't handle the tens of millions of OMAP entries stored in these OSDs and would start to scan the whole DB. It could be that the 4GB of memory per OSD was simply not enough for RocksDB to hold all the index data, but I wasn't able to confirm that.

This cluster has ~1200 buckets in RGW and had 222k objects in the index pool prior to the cleanup. I got another call yesterday about a cluster with identical symptoms that has just 250 buckets, but ~700k (!!) objects in the RGW index pool.

My advice: upgrade to 12.2.11 and run the stale-instances list asap to see if you need to rm data. This isn't available in 13.2.4, but should be in 13.2.5, so on Mimic you will need to wait. But this might bite you at some point.

I hope I can prevent some admins from having sleepless nights over a flapping Ceph cluster.

Wido
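
PS: a few example commands, in case they help others triage this on their own clusters. The pool, object and OSD names below are just examples; adjust them to your environment and treat these as rough sketches rather than a recipe.

To see which index objects triggered the large OMAP warning, the cluster log on a mon host normally records them when a deep scrub finds one, and you can count the OMAP keys of a suspect index object directly:

$ grep -i 'large omap object' /var/log/ceph/ceph.log
$ rados -p rgw.buckets.index ls | wc -l
$ rados -p rgw.buckets.index listomapkeys <index-object> | wc -l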
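
Compacting an OSD's RocksDB can be done online through the admin socket on the OSD host, or offline with ceph-kvstore-tool while the OSD is stopped (osd.0 here is just an example id):

$ ceph daemon osd.0 compact
$ systemctl stop ceph-osd@0
$ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
$ systemctl start ceph-osd@0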
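
And for moving the index pool to NVMe-backed OSDs with device classes, a replicated CRUSH rule restricted to the nvme class is enough; the rule name and pool name here are again just examples:

$ ceph osd crush rule create-replicated rgw-index-nvme default host nvme
$ ceph osd pool set rgw.buckets.index crush_rule rgw-index-nvme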