Re: PG_DAMAGED

In my experience, inconsistencies caused by IO errors are always
accompanied by a SCSI Medium Error in the kernel logs (dmesg,
journalctl -k, /var/log/messages, ...).
(The one exception was a very bad non-enterprise SMR drive I run at
home, not at work.)
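
For example (the device name sdX is illustrative; run this on each of
the OSD hosts involved):

    dmesg -T | grep -iE 'medium error|i/o error'
    journalctl -k | grep -iE 'medium error|i/o error'
    smartctl -a /dev/sdX    # check reallocated/pending sector counts

If nothing shows up there, the disk itself is probably not the cause.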

-- dan

On Fri, Dec 4, 2020 at 11:03 AM Hans van den Bogert
<hansbogert@xxxxxxxxx> wrote:
>
> Interesting, your comment implies that it is a replication issue, which
> does not stem from a faulty disk. But couldn't the disk have had a bit
> flip? Or would you argue that would have shown up as a disk read error
> somewhere (because of the ECC on the disk)?
>
> On 12/4/20 10:51 AM, Dan van der Ster wrote:
> > Note that in this case the inconsistencies are not coming from object
> > reads, but from comparing the omap digests of an rgw index shard.
> > This seems to be a result of a replication issue sometime in the past
> > on this cluster.
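> >
> > For example (using pg 11.2 from the scrub report below), the objects
> > with mismatching omap digests can be listed with:
> >
> >     rados list-inconsistent-obj 11.2 --format=json-pretty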
> >
> > On Fri, Dec 4, 2020 at 10:32 AM Eugen Block <eblock@xxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> this is not necessarily, but most likely, a hint at a (slowly)
> >> failing disk. Check all OSDs hosting this PG for disk errors in
> >> dmesg and smartctl.
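> >>
> >> For example (pg/OSD ids taken from the log below; the device path is
> >> illustrative):
> >>
> >>     ceph pg map 11.2        # up/acting OSDs for the PG
> >>     ceph osd find 40        # host that carries osd.40
> >>     smartctl -a /dev/sdX    # on that host, against the OSD's device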
> >>
> >> Regards,
> >> Eugen
> >>
> >>
> >> Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:
> >>
> >>> Hi,
> >>>
> >>> Not sure if it is related to my 15.2.7 update, but today I got this
> >>> issue many times:
> >>>
> >>> 2020-12-04T15:14:23.910799+0700 osd.40 (osd.40) 11 : cluster [DBG]
> >>> 11.2 deep-scrub starts
> >>> 2020-12-04T15:14:23.947255+0700 osd.40 (osd.40) 12 : cluster [ERR]
> >>> 11.2 soid
> >>> 11:434f049b:::.dir.75333f99-93d0-4238-91a4-ba833a0edd24.1744118.372.1:head :
> >>> omap_digest 0x48532c00 != omap_digest 0x8a18f5d7 from shard 40
> >>> 2020-12-04T15:14:23.977138+0700 mgr.hk-cephmon-2s02 (mgr.2120884)
> >>> 4330 : cluster [DBG] pgmap v4338: 209 pgs: 209 active+clean; 2.8 GiB
> >>> data, 21 TiB used, 513 TiB / 534 TiB avail; 32 KiB/s rd, 32 op/s
> >>> 2020-12-04T15:14:24.030888+0700 osd.40 (osd.40) 13 : cluster [ERR]
> >>> 11.2 soid
> >>> 11:4b86603b:::.dir.75333f99-93d0-4238-91a4-ba833a0edd24.1744118.197.3:head :
> >>> omap_digest 0xcb62779b != omap_digest 0xefef7471 from shard 40
> >>> 2020-12-04T15:14:24.229000+0700 osd.40 (osd.40) 14 : cluster [ERR]
> >>> 11.2 deep-scrub 0 missing, 2 inconsistent objects
> >>> 2020-12-04T15:14:24.229003+0700 osd.40 (osd.40) 15 : cluster [ERR]
> >>> 11.2 deep-scrub 2 errors
> >>> 2020-12-04T15:14:25.978189+0700 mgr.hk-cephmon-2s02 (mgr.2120884)
> >>> 4331 : cluster [DBG] pgmap v4339: 209 pgs: 1
> >>> active+clean+scrubbing+deep, 208 active+clean; 2.8 GiB data, 21 TiB
> >>> used, 513 TiB / 534 TiB avail; 55 KiB/s rd, 0 B/s wr, 61 op/s
> >>> 2020-12-04T15:14:27.978588+0700 mgr.hk-cephmon-2s02 (mgr.2120884)
> >>> 4332 : cluster [DBG] pgmap v4340: 209 pgs: 1
> >>> active+clean+scrubbing+deep, 208 active+clean; 2.8 GiB data, 21 TiB
> >>> used, 513 TiB / 534 TiB avail; 43 KiB/s rd, 0 B/s wr, 49 op/s
> >>> 2020-12-04T15:14:30.293180+0700 mon.hk-cephmon-2s01 (mon.0) 4475 :
> >>> cluster [ERR] Health check failed: 2 scrub errors (OSD_SCRUB_ERRORS)
> >>> 2020-12-04T15:14:30.293196+0700 mon.hk-cephmon-2s01 (mon.0) 4476 :
> >>> cluster [ERR] Health check failed: Possible data damage: 1 pg
> >>> inconsistent (PG_DAMAGED)
> >>>
> >>> I had to repair the PG and it worked fine, but I'm not sure where
> >>> this comes from. This is all I have in the log :/
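> >>>
> >>> (For reference, the repair was the standard command:
> >>>
> >>>     ceph pg repair 11.2
> >>>
> >>> and the errors cleared afterwards.)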
> >>>
> >>> Thank you.
> >>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


