Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)

Hi,

you wrote that this cluster was initially installed with Octopus, so no Ceph upgrade has been performed? Are all RGW daemons on the exact same Ceph (minor) version? I remember one of our customers reporting inconsistent objects on a regular basis although no hardware issues were detectable. They replicate between two sites, too. A couple of months ago both sites were updated to the exact same Ceph minor version (also Octopus), and they haven't faced any inconsistencies since. I don't have details about the Ceph version(s), though, only that both sites were initially installed with Octopus. Maybe it's worth checking your versions?
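The version spread across the running daemons can be listed with the standard CLI, e.g.:

# ceph versions
# ceph tell osd.* version

The first command groups the daemons by their exact version string, so a single straggler stands out immediately; the second queries each OSD individually.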

Regards,
Eugen

Zitat von Christian Rohmann <christian.rohmann@xxxxxxxxx>:

Hello Ceph-Users,

for about 3 weeks now I see batches of scrub errors on a 4 node Octopus cluster:

# ceph health detail
HEALTH_ERR 7 scrub errors; Possible data damage: 6 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 7 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 6 pgs inconsistent
    pg 5.3 is active+clean+inconsistent, acting [9,12,6]
    pg 5.4 is active+clean+inconsistent, acting [15,17,18]
    pg 7.2 is active+clean+inconsistent, acting [13,15,10]
    pg 7.9 is active+clean+inconsistent, acting [5,19,4]
    pg 7.e is active+clean+inconsistent, acting [1,15,20]
    pg 7.18 is active+clean+inconsistent, acting [5,10,0]

This cluster only serves RADOSGW and is the master of a multisite setup.

I already found another thread (https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/LXMQSRNSCPS5YJMFXIS3K5NMROHZKDJU/), but with no recent comments about such an issue.

In my case I am still seeing more scrub errors every few days. All those inconsistencies are "omap_digest_mismatch" in the "zone.rgw.log" or "zone.rgw.buckets.index" pool and are spread all across nodes and OSDs.
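The affected objects and the mismatching shards can be inspected per PG with the standard rados tooling, e.g. for one of the PGs listed above:

# rados list-inconsistent-pg zone.rgw.buckets.index
# rados list-inconsistent-obj 7.2 --format=json-pretty

The JSON output includes the omap_digest each replica reported, so it shows which OSD disagrees with the others.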

I already raised a bug ticket (https://tracker.ceph.com/issues/53663), but I am wondering whether any of you have ever observed something similar. Traffic to and from the object storage seems totally fine, and I can even run a manual deep-scrub with no errors, only to receive 3-4 new errors the next day.
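For reference, the manual deep-scrub and a subsequent repair are the standard per-PG commands, e.g. for pg 7.2:

# ceph pg deep-scrub 7.2
# ceph pg repair 7.2

As far as I understand, repair overwrites the divergent copy with the one Ceph considers authoritative, so it clears the error but does not explain why the digests diverged in the first place.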


Is there anything I could look into / collect when the next inconsistency occurs?
Could there be any misconfiguration causing this?


Thanks and with kind regards


Christian

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


