Hello Ceph-Users,
For about 3 weeks now I have been seeing batches of scrub errors on a
4-node Octopus cluster:
# ceph health detail
HEALTH_ERR 7 scrub errors; Possible data damage: 6 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 7 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 6 pgs inconsistent
    pg 5.3 is active+clean+inconsistent, acting [9,12,6]
    pg 5.4 is active+clean+inconsistent, acting [15,17,18]
    pg 7.2 is active+clean+inconsistent, acting [13,15,10]
    pg 7.9 is active+clean+inconsistent, acting [5,19,4]
    pg 7.e is active+clean+inconsistent, acting [1,15,20]
    pg 7.18 is active+clean+inconsistent, acting [5,10,0]
This cluster only serves RADOSGW and is a multisite master.
I already found another thread
(https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/LXMQSRNSCPS5YJMFXIS3K5NMROHZKDJU/),
but it has no recent comments about such an issue.
In my case I am still seeing new scrub errors every few days. All of these
inconsistencies are "omap_digest_mismatch" errors in the "zone.rgw.log" or
"zone.rgw.buckets.index" pools, and they are spread across all nodes and OSDs.
I already raised a bug ticket (https://tracker.ceph.com/issues/53663),
but I am wondering whether any of you have ever observed something similar?
Traffic to and from the object storage looks completely fine, and I can even
run a manual deep-scrub that reports no errors, only to get 3-4 new errors
the next day.
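Just to be clear what I mean by a manual deep-scrub, it is roughly this per
affected PG:

# ceph pg deep-scrub 5.3
# ceph pg 5.3 query | grep -E 'last_deep_scrub_stamp|num_scrub_errors'   (check once the scrub has finished)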
Is there anything I could look into / collect when the next
inconsistency occurs?
Could there be any misconfiguration causing this?
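For the first question, what I had in mind so far is roughly the following
for the next time a PG turns up inconsistent (the debug level and log path
are just my assumptions, I am happy to collect something else instead):

# ceph tell osd.<id> config set debug_osd 10/10                         (raise OSD logging beforehand)
# rados list-inconsistent-obj <pgid> --format=json-pretty > inconsistent-<pgid>.json
# grep -i 'scrub' /var/log/ceph/ceph-osd.<id>.log                       (entries around the failing deep-scrub)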
Thanks and with kind regards
Christian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx