Hi,
you wrote that this cluster was initially installed with Octopus, so
there has been no Ceph upgrade so far? Are all RGW daemons on exactly
the same Ceph (minor) version?
I remember one of our customers reporting inconsistent objects on a
regular basis although no hardware issues were detectable. They
replicate between two sites, too. A couple of months ago both sites
were updated to the exact same Ceph minor version (also Octopus), and they
haven't faced any inconsistencies since then. I don't have details about
the ceph version(s) though, only that both sites were initially
installed with Octopus. Maybe it's worth checking your versions?
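Something along these lines should show whether all daemons report
the same build (the rgw section only appears if the gateways register
in the service map, so comparing the installed package versions on the
RGW hosts directly is a reasonable fallback):

# ceph versions
# ceph tell osd.* version

The first aggregates the running versions per daemon type, the second
queries each OSD individually.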
Regards,
Eugen
Quoting Christian Rohmann <christian.rohmann@xxxxxxxxx>:
Hello Ceph-Users,
for about 3 weeks now I have been seeing batches of scrub errors on a
4-node Octopus cluster:
# ceph health detail
HEALTH_ERR 7 scrub errors; Possible data damage: 6 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 7 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 6 pgs inconsistent
    pg 5.3 is active+clean+inconsistent, acting [9,12,6]
    pg 5.4 is active+clean+inconsistent, acting [15,17,18]
    pg 7.2 is active+clean+inconsistent, acting [13,15,10]
    pg 7.9 is active+clean+inconsistent, acting [5,19,4]
    pg 7.e is active+clean+inconsistent, acting [1,15,20]
    pg 7.18 is active+clean+inconsistent, acting [5,10,0]
This cluster only serves RADOSGW and it is the multisite master.
I already found another thread
(https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/LXMQSRNSCPS5YJMFXIS3K5NMROHZKDJU/), but there are no recent comments about such an
issue.
In my case I am still seeing more scrub errors every few days. All
those inconsistencies are "omap_digest_mismatch" in the
"zone.rgw.log" or "zone.rgw.buckets.index" pool and are spread all
across nodes and OSDs.
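For reference, the per-object details come from something along these
lines (pg 5.3 just as one of the affected PGs, pool name as an example):

# rados list-inconsistent-pg zone.rgw.buckets.index
# rados list-inconsistent-obj 5.3 --format=json-pretty

The second command is what shows the omap_digest_mismatch on the
individual shards.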
I already raised a bug ticket
(https://tracker.ceph.com/issues/53663), but am wondering if any of
you have ever observed something similar?
Traffic to and from the object storage seems totally fine, and I can
even run a manual deep-scrub with no errors and then receive 3-4
errors the next day.
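(The manual deep-scrubs are simply triggered per PG, e.g.:

# ceph pg deep-scrub 5.3

for one of the affected PGs.)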
Is there anything I could look into / collect when the next
inconsistency occurs?
Could there be any misconfiguration causing this?
Thanks and with kind regards
Christian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx