Re: deep scrub error caused by missing object

ceph@xxxxxxxxxx · Fri, 05 Oct 2018 19:46:33 +0200

Hello Roman,

I am not sure whether I can be of much help, but perhaps these commands can help you find the objects in question:

ceph health detail
rados list-inconsistent-pg rbd
rados list-inconsistent-obj 2.10d
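
If the output of the last command is requested as JSON (for example with rados list-inconsistent-obj 2.10d --format=json-pretty), it can be condensed into one line per object. The following is only a minimal sketch: the field names ("inconsistents", "object", "name", "errors", "union_shard_errors", "shards", "osd") are my assumption about the JSON layout and should be checked against what your Ceph version actually prints. The helper name summarize_inconsistents and the sample data are made up for illustration.

```python
import json


def summarize_inconsistents(report_json):
    """Return one (object name, errors, OSD ids) tuple per reported object.

    Assumes the JSON has a top-level "inconsistents" list; a missing or
    empty list simply yields an empty result.
    """
    report = json.loads(report_json)
    summary = []
    for entry in report.get("inconsistents", []):
        name = entry.get("object", {}).get("name", "<unknown>")
        # Merge object-level and shard-level error flags into one sorted list.
        errors = sorted(
            set(entry.get("errors", [])) | set(entry.get("union_shard_errors", []))
        )
        osds = [shard.get("osd") for shard in entry.get("shards", [])]
        summary.append((name, errors, osds))
    return summary


# Made-up sample in the assumed layout; NOT output from a real cluster.
SAMPLE = json.dumps({
    "epoch": 1,
    "inconsistents": [{
        "object": {"name": "example_object"},
        "errors": [],
        "union_shard_errors": ["missing"],
        "shards": [
            {"osd": 93, "errors": ["missing"]},
            {"osd": 119, "errors": []},
        ],
    }],
})

if __name__ == "__main__":
    for name, errors, osds in summarize_inconsistents(SAMPLE):
        print(name, ",".join(errors) or "no error flags", "OSDs:", osds)
```

In practice you would feed the function the text captured from the rados command instead of SAMPLE.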

I guess it would also be interesting to know whether you use BlueStore or FileStore.

Hth
- Mehmet 

On 4 October 2018 14:06:07 CEST, Roman Steinhart <roman@xxxxxxxxxxx> wrote:
Hi all,

For a few weeks now we have had a small problem with one of the PGs on our Ceph cluster.
Every time PG 2.10d is deep scrubbed, the scrub fails with this:
2018-08-06 19:36:28.080707 osd.14 osd.14 *.*.*.110:6809/3935 133 : cluster [ERR] 2.10d scrub stat mismatch, got 397/398 objects, 0/0 clones, 397/398 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 2609281919/2609293215 bytes, 0/0 hit_set_archive bytes. 
2018-08-06 19:36:28.080905 osd.14 osd.14 *.*.*.110:6809/3935 134 : cluster [ERR] 2.10d scrub 1 errors
As far as I understand, Ceph is missing an object on osd.14 that should be stored there. Running ceph pg repair 2.10d fixes the problem, but as soon as a deep scrub of that PG runs again (manually or automatically), the problem is back.

I tried to find out which object is missing, but a quick search led me to conclude that there is no real way to find out which objects are stored in this PG or exactly which object is missing.
That's why I went for some "unconventional" methods.
I completely removed OSD.14 from the cluster. I waited until everything was balanced and then added the OSD again.
Unfortunately the problem was still there.

Some weeks later we added a large number of OSDs to our cluster, which had a big impact on the CRUSH map.
Since then PG 2.10d has been placed on two other OSDs -> [119,93] (we use a replica count of 2).

Still the same error message, but another OSD:
2018-10-03 03:39:22.776521 7f12d9979700 -1 log_channel(cluster) log [ERR] : 2.10d scrub stat mismatch, got 728/729 objects, 0/0 clones, 728/729 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7281369687/7281381269 bytes, 0/0 hit_set_archive bytes.
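
The size of the discrepancy can be read straight from these log lines. The following minimal sketch extracts every "A/B label" pair after "scrub stat mismatch, got" and reports the pairs that differ. The helper name stat_deltas is made up for illustration, and it only reports "second number minus first number"; which of the two is the scrubbed value and which is the recorded PG statistic is left to the reader to confirm.

```python
import re

# One "A/B label" pair, e.g. "728/729 objects" or "0/0 hit_set_archive bytes".
PAIR = re.compile(r"(\d+)/(\d+) ([\w ]+)")
MARKER = "scrub stat mismatch, got "


def stat_deltas(line):
    """Return {label: second - first} for every pair whose numbers differ."""
    _, found, stats = line.partition(MARKER)
    if not found:
        raise ValueError("not a scrub stat mismatch line")
    deltas = {}
    for first, second, label in PAIR.findall(stats):
        if first != second:
            deltas[label] = int(second) - int(first)
    return deltas


# The message logged on 2018-10-03, copied from above.
EXAMPLE = (
    "2018-10-03 03:39:22.776521 7f12d9979700 -1 log_channel(cluster) log [ERR] : "
    "2.10d scrub stat mismatch, got 728/729 objects, 0/0 clones, 728/729 dirty, "
    "0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, "
    "7281369687/7281381269 bytes, 0/0 hit_set_archive bytes."
)

if __name__ == "__main__":
    for label, delta in stat_deltas(EXAMPLE).items():
        print(f"{label}: {delta}")
```

For the message above this gives a difference of 1 object, 1 dirty object and 11582 bytes; the osd.14 message from 2018-08-06 gives 1 object, 1 dirty object and 11296 bytes.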

As a first step it would be enough for me to find out which object is the problematic one. Then I can check whether the object is critical, whether any recovery is required, or whether I can simply drop it (which would be the case 90% of the time).

I hope someone can help me get rid of this.
It's not really a problem for us. Ceph runs despite this message without further problems.
It's just a bit annoying that every time the error occurs our monitoring triggers a big alarm because Ceph is in ERROR status. :)

Thanks in advance,
Roman

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com