Re: pg inconsistent, scrub stat mismatch on bytes

Adrian <aussieade@xxxxxxxxx> · Thu, 7 Jun 2018 10:57:36 +1000

Update to this.

The affected pg is still flagged inconsistent, but rados reports no inconsistent objects for it:

[root@admin-ceph1-qh2 ~]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
   pg 6.20 is active+clean+inconsistent, acting [114,26,44]
[root@admin-ceph1-qh2 ~]# rados list-inconsistent-obj 6.20 --format=json-pretty
{
   "epoch": 210034,
   "inconsistents": []
}

However, pg query showed that the primary's info.stats.stat_sum.num_bytes differs from that of the peers.
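For anyone wanting to make the same comparison, here is a minimal Python sketch that pulls num_bytes for the primary and each peer out of saved pg query JSON. The primary's path (info.stats.stat_sum.num_bytes) is the one mentioned above; the peer_info layout (a list of entries with "peer" and "stats" keys) is an assumption about the query output, and the SAMPLE values are illustrative placeholders, not the real output from this cluster.

```python
import json

def num_bytes_by_osd(pg_query):
    """Map 'primary' and each peer OSD to its stat_sum.num_bytes."""
    result = {"primary": pg_query["info"]["stats"]["stat_sum"]["num_bytes"]}
    # Assumption: peers are listed under "peer_info", each with "peer" and "stats".
    for peer in pg_query.get("peer_info", []):
        name = "osd." + str(peer["peer"])
        result[name] = peer["stats"]["stat_sum"]["num_bytes"]
    return result

def differing(values):
    """Return peers whose num_bytes differs from the primary, with the delta."""
    ref = values["primary"]
    return {k: v - ref for k, v in values.items() if v != ref}

# Illustrative placeholder data only; which side holds which value is not
# stated in this thread.
SAMPLE = json.loads("""
{
  "info": {"stats": {"stat_sum": {"num_bytes": 25952462336}}},
  "peer_info": [
    {"peer": "26", "stats": {"stat_sum": {"num_bytes": 25952454144}}},
    {"peer": "44", "stats": {"stat_sum": {"num_bytes": 25952454144}}}
  ]
}
""")

if __name__ == "__main__":
    values = num_bytes_by_osd(SAMPLE)
    print(values)
    print(differing(values))
```

To use it on real data, save the output of "ceph pg 6.20 query" to a file and pass the result of json.load() to num_bytes_by_osd() in place of SAMPLE.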

A pg repair on 6.20 seems to have resolved the issue for now, but info.stats.stat_sum.num_bytes still differs, so presumably the pg will become inconsistent again the next time it scrubs.

Adrian.

On Tue, Jun 5, 2018 at 12:09 PM, Adrian <aussieade@xxxxxxxxx> wrote:
Hi Cephers,

We recently upgraded one of our clusters from hammer to jewel and then to luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 OSDs). After some deep-scrubs we have an inconsistent pg with a log message we've not seen before:

HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 6.20 is active+clean+inconsistent, acting [114,26,44]

Ceph log shows
2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395 : cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87 clones, 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 25952454144/25952462336 bytes, 0/0 hit_set_archive bytes.
2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396 : cluster [ERR] 6.20 scrub 1 errors
2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41298 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41299 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41345 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent

There are no EC pools. It looks like it may be the same as https://tracker.ceph.com/issues/22656, although, as in #7, this is not a cache pool.
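To make the mismatch easier to spot, here is a small Python sketch that splits the "scrub stat mismatch" message into its a/b pairs and reports only the counters that disagree. The regex and the function name stat_mismatches are my own, written against the single log line quoted above, so treat them as an illustration rather than a general parser for Ceph logs.

```python
import re

# The message portion of the osd.114 log line quoted above.
LOG_LINE = (
    "6.20 scrub stat mismatch, got 6526/6526 objects, 87/87 clones, "
    "6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, "
    "0/0 whiteouts, 25952454144/25952462336 bytes, "
    "0/0 hit_set_archive bytes."
)

# Matches "<left>/<right> <name>", where name may end in " bytes".
PAIR = re.compile(r"(\d+)/(\d+) ([a-z_]+(?: bytes)?)")

def stat_mismatches(line):
    """Return {counter: (left, right, right - left)} for unequal pairs."""
    _, _, stats = line.partition("got ")
    out = {}
    for left, right, name in PAIR.findall(stats):
        if left != right:
            out[name] = (int(left), int(right), int(right) - int(left))
    return out

if __name__ == "__main__":
    for name, (left, right, delta) in stat_mismatches(LOG_LINE).items():
        print(f"{name}: {left} vs {right} (difference {delta})")
```

On the line above, only the bytes counter differs, by 8192 bytes; objects, clones, dirty and the rest all match.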

Wondering if it is ok to issue a pg repair on 6.20, or if there's something else we should be looking at first?

Thanks in advance,
Adrian.

---
Adrian : aussieade@xxxxxxxxx
If violence doesn't solve your problem, you're not using enough of it.


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com