Hi list,

We need to recover an index pool distributed over 4 SSD-based OSDs. We had to kick one of the OSDs out because it was blocking all RGW access while leveldb was compacting. Since then we've restarted that OSD with "leveldb compact on mount = true" and the noup flag set, running the leveldb compaction offline, but the index PGs are now degraded.

The goal is to make the recovery as fast as possible during a small maintenance window and/or with minimal client impact. The cluster runs jewel 10.2.7 (recently upgraded from hammer) and has ongoing backfill operations (from changing the tunables). Some buckets contain a very large number of objects, so bucket index resharding would be needed, but we don't have the opportunity to do that right now.

Plan so far (rough commands in the P.S. below):

* set global I/O scheduling priority to 7 (lowest)
* set index pool OSD specifics:
  - set recovery priority to highest (63)
  - set client priority to lowest (1)
  - increase recovery threads to 2
  - set disk thread priority to highest (0)
  - limit omap entries per chunk for recovery to 32k (64k seems to give timeouts)
* unset the noup flag to let the misbehaving OSD come back up and start recovery

Any further ideas, experience or remarks would be very much appreciated...

r,
Sam
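
P.S. For reference, here is roughly how we intend to apply those settings, assuming the "global I/O scheduling priority" step maps to the OSD disk thread ioprio options. The osd.10..13 ids are placeholders for our four index pool OSDs, and the option names are from memory for jewel, so please double-check them against your build; I'm also not sure all of them take effect at runtime via injectargs (osd_recovery_threads may need an OSD restart, and the disk thread ioprio options only matter with the CFQ scheduler as far as I know):

    # global: drop the OSD disk thread I/O scheduling priority to 7 (lowest best-effort)
    ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class be --osd_disk_thread_ioprio_priority 7'

    # index pool OSDs only (osd.10..13 are placeholders for our index OSDs)
    for i in 10 11 12 13; do
        # favour recovery over client I/O on these OSDs
        ceph tell osd.$i injectargs '--osd_recovery_op_priority 63 --osd_client_op_priority 1'
        # more recovery threads, highest disk thread priority (0)
        ceph tell osd.$i injectargs '--osd_recovery_threads 2 --osd_disk_thread_ioprio_priority 0'
        # smaller omap chunks per recovery op (64k seemed to give timeouts)
        ceph tell osd.$i injectargs '--osd_recovery_max_omap_entries_per_chunk 32768'
    done

    # finally, let the compacted OSD rejoin and start recovery
    ceph osd unset noup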