Re: Unrepairable PG

There is a tracker [1] open for this issue. There are two steps that should get a PG to scrub/repair when the scrub has been issued but never actually runs. First, increase osd_max_scrubs on the OSDs involved in the PG. If that doesn't fix it, try increasing osd_deep_scrub_interval on all OSDs in your cluster. Both settings can be injected at runtime, and in my experience that should allow your PG to repair/deep-scrub.
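For reference, here's a minimal sketch of injecting both settings and re-issuing the repair. The OSD ids are taken from the acting set shown in the log further down ([62,53,163,113]), and the values are illustrative, not recommendations:

  # Allow more concurrent scrubs on the OSDs acting for the PG
  ceph tell osd.62 injectargs '--osd_max_scrubs 3'
  ceph tell osd.53 injectargs '--osd_max_scrubs 3'
  ceph tell osd.163 injectargs '--osd_max_scrubs 3'
  ceph tell osd.113 injectargs '--osd_max_scrubs 3'

  # If that isn't enough, relax the deep-scrub interval on all OSDs
  # (example: 6 weeks in seconds) so scheduled deep-scrubs stop crowding out manual ones
  ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800'

  # Then re-issue the deep-scrub / repair
  ceph pg deep-scrub 9.3cd
  ceph pg repair 9.3cd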

The idea is that your cluster isn't able to keep up with its deep-scrub schedule, and the deep-scrubs the cluster forces to run because of the interval take priority over the ones you submit manually. That was definitely the case when I had this problem a few weeks ago, and these steps resolved it. When I had it a few months ago I just let it run its course, and the repair finally happened about three weeks after I issued it. My osd_deep_scrub_interval was set to 30 days, but apparently it was taking closer to seven weeks to get through all of the PGs.
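A quick way to sanity-check that (a sketch only; 'ceph daemon' has to be run on the host where that OSD lives, and the field names are from the Luminous-era pg query JSON) is to compare the configured interval with how stale the PG's deep-scrub stamp really is:

  # Current settings on one of the acting OSDs (run on that OSD's host)
  ceph daemon osd.62 config get osd_deep_scrub_interval
  ceph daemon osd.62 config get osd_max_scrubs

  # When the PG was last (deep-)scrubbed, from its own stats
  ceph pg 9.3cd query | grep -E 'last_deep_scrub_stamp|last_scrub_stamp'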


[1] https://tracker.ceph.com/issues/23576#change-119460

On Tue, Aug 28, 2018, 5:16 AM Maks Kowalik <maks_kowalik@xxxxxxxxx> wrote:
Scrubs discovered the following inconsistency:

2018-08-23 17:21:07.933458 osd.62 osd.62 10.122.0.140:6805/77767 6 : cluster [ERR] 9.3cd shard 113: soid 9:b3cd8d89:::.dir.default.153398310.112:head omap_digest 0xea4ba012 != omap_digest 0xc5acebfd from shard 62, omap_digest 0xea4ba012 != omap_digest 0xc5acebfd from auth oi 9:b3cd8d89:::.dir.default.153398310.112:head(138609'2009129 osd.250.0:64658209 dirty|omap|data_digest|omap_digest s 0 uv 1995230 dd ffffffff od c5acebfd alloc_hint [0 0 0])

The omap_digest mismatch appears on a non-primary OSD in a pool with 4 replicas. In this situation I decided to issue "pg repair", expecting Ceph to repair the broken object. The command was accepted, but the repair on 9.3cd never started.

Then I tried the procedure described here (setting a temporary key on the object to force recalculation of the omap_digest; a sketch follows the log below), but the deep-scrub on 9.3cd didn't start either. The OSD marked 9.3cd for scrubbing, but that's all that happened:

2018-08-27 14:36:22.703848 7faa7e860700 20 osd.62 713813 OSD::ms_dispatch: scrub([9.3cd] deep) v2
2018-08-27 14:36:22.703869 7faa7e860700 20 osd.62 713813 _dispatch 0x55725b76d180 scrub([9.3cd] deep) v2
2018-08-27 14:36:22.703871 7faa7e860700 10 osd.62 713813 handle_scrub scrub([9.3cd] deep) v2
2018-08-27 14:36:22.703878 7faa7e860700 10 osd.62 713813 marking pg[9.3cd( v 713813'2359292 (713107'2357731,713813'2359292] local-lis/les=711049/711050 n=41419 ec=178/178 lis/c 711049/711049 les/c/f 711050/711149/222921 711049/711049/710352) [62,53,163,113] r=0 lpr=711049 crt=713813'2359292 lcod 713813'2359291 mlcod 713813'2359291 active+clean+inconsistent MUST_DEEP_SCRUB MUST_SCRUB] for scrub
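
For context, the temporary-key procedure mentioned above usually looks roughly like this (a sketch only: <index-pool> stands in for the pool holding the bucket index object, and the key name is arbitrary):

  # Touch the object's omap so the omap_digest is recalculated on the next deep-scrub
  rados -p <index-pool> setomapval .dir.default.153398310.112 temporary-key anything

  # Ask for the deep-scrub, then the repair once it has run
  ceph pg deep-scrub 9.3cd
  ceph pg repair 9.3cd

  # Clean up the temporary key afterwards
  rados -p <index-pool> rmomapkey .dir.default.153398310.112 temporary-key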

Does anyone know how to recover from an inconsistency in such a case?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
