Significant uptick in inconsistent pgs in Jewel 10.2.9

Hi,

Our clusters were upgraded to v10.2.9, from ~v10.2.7 (actually a local
git snapshot that was not quite 10.2.7), and since then, we're seeing a
LOT more scrub errors than previously.

In some cases the digest logging for the scrub errors is also now maddeningly
short: it contains no information at all about what the mismatch was, and many
of the errors also appear to be 3-way digest mismatches :-(.

I'm wondering whether other people have seen a similar rise in scrub errors
after the upgrade, and/or the missing digest output. I did hear one anecdotal
report that 10.2.9 seemed much more likely to fail out marginal disks.

The only two changesets I can spot in Jewel that I think might be related are these:
1.
http://tracker.ceph.com/issues/20089
https://github.com/ceph/ceph/pull/15416
2.
http://tracker.ceph.com/issues/19404
https://github.com/ceph/ceph/pull/14204

Two example PGs that are inconsistent (chosen because they don't contain any private information, so I only had to redact IPs):
$ sudo ceph health detail |grep -e 5.3d40 -e 5.f1c0
pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655]
pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91]

$ fgrep 5.3d40 /var/log/ceph/ceph.log
2017-09-07 19:50:16.231523 osd.1322 [REDACTED::8861]:6808/3479303 1736 : cluster [INF] osd.1322 pg 5.3d40 Deep scrub errors, upgrading scrub to deep-scrub
2017-09-07 19:50:16.231862 osd.1322 [REDACTED::8861]:6808/3479303 1737 : cluster [INF] 5.3d40 deep-scrub starts
2017-09-07 19:54:38.631232 osd.1322 [REDACTED::8861]:6808/3479303 1738 : cluster [ERR] 5.3d40 shard 655: soid 5:02bc4def:::.dir.default.64449186.344176:head omap_digest 0x3242b04e != omap_digest 0x337cf025 from auth oi 5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 osd.1322.0:537914 dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd ffffffff od 337cf025 alloc_hint [0 0])
2017-09-07 19:54:38.631332 osd.1322 [REDACTED::8861]:6808/3479303 1739 : cluster [ERR] 5.3d40 shard 1322: soid 5:02bc4def:::.dir.default.64449186.344176:head omap_digest 0xc90d06a8 != omap_digest 0x3242b04e from shard 655, omap_digest 0xc90d06a8 != omap_digest 0x337cf025 from auth oi 5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 osd.1322.0:537914 dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd ffffffff od 337cf025 alloc_hint [0 0])
2017-09-07 20:03:54.721681 osd.1322 [REDACTED::8861]:6808/3479303 1740 : cluster [ERR] 5.3d40 deep-scrub 0 missing, 1 inconsistent objects
2017-09-07 20:03:54.721687 osd.1322 [REDACTED::8861]:6808/3479303 1741 : cluster [ERR] 5.3d40 deep-scrub 3 errors
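To make the 3-way mismatch in that log concrete, here is a rough sketch of pulling the omap_digest values out of one of those `[ERR]` cluster-log lines. The regex and helper are hypothetical and written against exactly the log format shown above; the point is that the shard line mentions three distinct digests, so no two replicas agree and there is no majority copy to vote on.

```python
import re

# Matches each "omap_digest 0x...." field in the deep-scrub [ERR] lines
# above. Hypothetical helper, written against this exact log format.
DIGEST_RE = re.compile(r"omap_digest (0x[0-9a-f]+)")

def digests_in_line(line):
    """Return every omap_digest value mentioned in one cluster-log line."""
    return [int(m, 16) for m in DIGEST_RE.findall(line)]

# The shard 1322 line from the 5.3d40 log above, trimmed to the digest fields.
line = ("5.3d40 shard 1322: soid 5:02bc4def:::.dir.default.64449186.344176:head "
        "omap_digest 0xc90d06a8 != omap_digest 0x3242b04e from shard 655, "
        "omap_digest 0xc90d06a8 != omap_digest 0x337cf025 from auth oi")

distinct = set(digests_in_line(line))
# Three distinct digests on one object: every copy disagrees with every
# other, which is what makes these errors hard to repair by majority.
print(len(distinct), sorted(hex(v) for v in distinct))
```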

$ fgrep 5.f1c0   /var/log/ceph/ceph.log
2017-09-07 11:11:36.773986 osd.631 [REDACTED::8877]:6813/4036028 4234 : cluster [INF] osd.631 pg 5.f1c0 Deep scrub errors, upgrading scrub to deep-scrub
2017-09-07 11:11:36.774127 osd.631 [REDACTED::8877]:6813/4036028 4235 : cluster [INF] 5.f1c0 deep-scrub starts
2017-09-07 11:25:26.231502 osd.631 [REDACTED::8877]:6813/4036028 4236 : cluster [ERR] 5.f1c0 deep-scrub 0 missing, 1 inconsistent objects
2017-09-07 11:25:26.231508 osd.631 [REDACTED::8877]:6813/4036028 4237 : cluster [ERR] 5.f1c0 deep-scrub 1 errors
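For anyone unfamiliar with what these digest values are: to my understanding they are CRC-32C checksums over the object's omap (or data) contents. A minimal sketch, assuming Ceph's convention of seeding the CRC with -1 and skipping the final XOR, which is consistent with the `dd ffffffff` printed above for the size-0 object:

```python
def crc32c(seed, data):
    """Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).

    Seeded and returned without the final XOR -- my reading of Ceph's
    convention: a zero-length buffer hashed with seed -1 comes back as
    0xffffffff, matching the 'dd ffffffff' shown above for the s=0 object.
    """
    crc = seed & 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc

# Empty input with seed -1 reproduces the 0xffffffff data digest above.
print(hex(crc32c(0xFFFFFFFF, b"")))

# Sanity check against the standard CRC-32C test vector, which adds the
# conventional final XOR on top: crc32c("123456789") == 0xe3069283.
print(hex(crc32c(0xFFFFFFFF, b"123456789") ^ 0xFFFFFFFF))
```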

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Asst. Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
