Hi Greg,

Curiously, some of these scrub errors went away on their own. The example pg in the original post is now active+clean, and there is nothing interesting in the logs:

# zgrep "36.277b" ceph-osd.244*gz
ceph-osd.244.log-20170510.gz:2017-05-09 06:56:40.739855 7f0184623700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170510.gz:2017-05-09 06:58:01.872484 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170511.gz:2017-05-10 20:40:47.536974 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170511.gz:2017-05-10 20:41:38.399614 7f0184623700 0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170514.gz:2017-05-13 20:49:47.063789 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170514.gz:2017-05-13 20:50:42.085718 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170515.gz:2017-05-15 00:10:39.417578 7f0184623700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170515.gz:2017-05-15 00:11:26.189777 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub ok

(No matches in the logs for osd 175 and osd 297 -- perhaps already rotated away?)

Other PGs still exhibit this behavior, though:

# rados list-inconsistent-obj 36.2953 | jq .
{
  "epoch": 737940,
  "inconsistents": [
    {
      "object": {
        "name": "1002378da6c.00000001",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 2213621
      },
      "errors": [],
      "union_shard_errors": [
        "size_mismatch_oi"
      ],
      "selected_object_info": "36:ca95a23b:::1002378da6c.00000001:head(737930'2177823 client.36346283.1:5635626 dirty s 4067328 uv 2213621)",
      "shards": [
        {
          "osd": 113,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 123,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 173,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        }
      ]
    }
  ]
}

Perhaps new data being written to this pg cleared things up?

The only other data point I can add is that, due to some tweaking of the cache tier size before this happened, the cache tier was briefly (maybe <1hr?) reporting near full / full in `ceph -s`.

Thanks for looking into this.

--Lincoln

> On May 15, 2017, at 4:50 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Mon, May 1, 2017 at 9:28 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
>> Hi all,
>>
>> I’ve run across a peculiar issue on 10.2.7. On my 3x replicated cache tiering cache pool, routine scrubbing suddenly found a bunch of PGs with size_mismatch_oi errors. From the “rados list-inconsistent-obj” tool[1], I see that all OSDs are reporting size 0 for a particular object. I’ve checked this object on disk, and it is indeed 0 bytes:
>> -rw-r--r-- 1 root root 0 Apr 29 06:12 100235614fe.00000005__head_6E9A677B__24
>>
>> I’ve tried re-issuing a scrub, which informs me that the object info size (2994176) doesn’t match the on-disk size (0) (see [2]). I’ve tried a repair operation as well, to no avail.
>>
>> For what it’s worth, this particular cluster is currently migrating several disks from one CRUSH root to another, and there is a nightly cache flush/eviction script that lowers the cache_target_*_ratios before raising them again in the morning.
>>
>> This issue is currently affecting ~10 PGs in my cache pool. Any ideas how to proceed here?
>
> Did anything come from this? It's tickling my brain (especially with
> the cache pool) but I'm not seeing anything relevant when I search my
> email.
>
>>
>> Thanks,
>> Lincoln
>>
>> [1]:
>> {
>>   "epoch": 721312,
>>   "inconsistents": [
>>     {
>>       "object": {
>>         "name": "100235614fe.00000005",
>>         "nspace": "",
>>         "locator": "",
>>         "snap": "head",
>>         "version": 2233551
>>       },
>>       "errors": [],
>>       "union_shard_errors": [
>>         "size_mismatch_oi"
>>       ],
>>       "selected_object_info": "36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)",
>>       "shards": [
>>         {
>>           "osd": 175,
>>           "errors": [
>>             "size_mismatch_oi"
>>           ],
>>           "size": 0
>>         },
>>         {
>>           "osd": 244,
>>           "errors": [
>>             "size_mismatch_oi"
>>           ],
>>           "size": 0
>>         },
>>         {
>>           "osd": 297,
>>           "errors": [
>>             "size_mismatch_oi"
>>           ],
>>           "size": 0
>>         }
>>       ]
>>     }
>>   ]
>> }
>>
>> [2]:
>> 2017-05-01 10:50:13.812992 7f0184623700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
>> 2017-05-01 10:51:02.495229 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 175: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495234 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 244: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495326 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 297: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495328 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b soid 36:dee65976:::100235614fe.00000005:head: failed to pick suitable auth object
>> 2017-05-01 10:51:02.495450 7f0186e28700 -1 log_channel(cluster) log [ERR] : scrub 36.277b 36:dee65976:::100235614fe.00000005:head on disk size (0) does not match object info size (2994176) adjusted for ondisk to (2994176)
>> 2017-05-01 10:51:20.223733 7f0184623700 -1 log_channel(cluster) log [ERR] : 36.277b scrub 4 errors
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
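
For anyone trying to follow along with the cache-tier tuning discussed above: a minimal sketch of what such a nightly flush/eviction job might look like is below. The pool name and ratio values are illustrative placeholders, not the actual script or settings from the cluster described in this thread.

#!/bin/bash
# Illustrative sketch of a nightly cache flush/eviction job.
# Pool name and ratio values are placeholders, not the settings
# from the cluster discussed in this thread.
CACHE_POOL="cachepool"

# Evening: lower the target ratios so the tiering agent flushes
# dirty objects and evicts clean ones aggressively overnight.
ceph osd pool set "$CACHE_POOL" cache_target_dirty_ratio 0.1
ceph osd pool set "$CACHE_POOL" cache_target_full_ratio 0.2

# (The tiering agent drains the cache tier while the ratios stay low.)

# Morning: restore the normal working ratios.
ceph osd pool set "$CACHE_POOL" cache_target_dirty_ratio 0.4
ceph osd pool set "$CACHE_POOL" cache_target_full_ratio 0.8

With this approach the flushing and eviction are driven entirely by the tiering agent reacting to the lowered ratios; no explicit per-object flush commands are needed.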