Inconsistent pgs with size_mismatch_oi

Lincoln Bryant <lincolnb@xxxxxxxxxxxx> · Mon, 1 May 2017 11:28:43 -0500

Hi all,

I’ve run across a peculiar issue on 10.2.7. On the 3x replicated cache pool of my cache tier, routine scrubbing suddenly found a bunch of PGs with size_mismatch_oi errors. From the output of “rados list-inconsistent-obj” [1], I see that all three OSDs report size 0 for a particular object in one of these PGs (36.277b). I’ve checked this object on disk, and it is indeed 0 bytes:
	-rw-r--r--  1 root root    0 Apr 29 06:12 100235614fe.00000005__head_6E9A677B__24

I’ve tried re-issuing a scrub, which reports that the on-disk size (0) doesn’t match the object info size (2994176) on any of the three shards, so it fails to pick an authoritative copy (see [2]). I’ve tried a repair operation as well, to no avail.

For what it’s worth, this particular cluster is currently migrating several disks from one CRUSH root to another, and there is a nightly cache flush/eviction script that lowers the cache_target_*_ratio settings at night and raises them again in the morning.

This issue is currently affecting ~10 PGs in my cache pool. Any ideas how to proceed here? 

Thanks,
Lincoln

[1]:
{
  "epoch": 721312,
  "inconsistents": [
    {
      "object": {
        "name": "100235614fe.00000005",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 2233551
      },
      "errors": [],
      "union_shard_errors": [
        "size_mismatch_oi"
      ],
      "selected_object_info": "36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)",
      "shards": [
        {
          "osd": 175,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 244,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 297,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        }
      ]
    }
  ]
}
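
For anyone who wants to check the other affected PGs the same way, here is a minimal Python sketch that reads a report shaped like [1] and lists each shard's on-disk size next to the size recorded in the object info. The function names (oi_size, summarize) are hypothetical, the embedded report is trimmed from [1] to the fields used, and the "s <bytes>" pattern in selected_object_info is assumed from this output only, not from any documented format.

```python
import json
import re

# Trimmed copy of the report in [1]; only the fields used below are kept.
REPORT = """
{
  "epoch": 721312,
  "inconsistents": [
    {
      "object": {"name": "100235614fe.00000005", "snap": "head"},
      "union_shard_errors": ["size_mismatch_oi"],
      "selected_object_info": "36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)",
      "shards": [
        {"osd": 175, "errors": ["size_mismatch_oi"], "size": 0},
        {"osd": 244, "errors": ["size_mismatch_oi"], "size": 0},
        {"osd": 297, "errors": ["size_mismatch_oi"], "size": 0}
      ]
    }
  ]
}
"""


def oi_size(selected_object_info):
    """Pull the size out of the object info string.

    Assumption: the size appears as ' s <bytes> ', as in the report above.
    Returns None when no such field is found.
    """
    match = re.search(r"\ss (\d+)\s", selected_object_info)
    return int(match.group(1)) if match else None


def summarize(report):
    """Return (object name, osd, on-disk size, object info size) per shard."""
    rows = []
    for item in report["inconsistents"]:
        expected = oi_size(item["selected_object_info"])
        for shard in item["shards"]:
            rows.append((item["object"]["name"], shard["osd"], shard["size"], expected))
    return rows


if __name__ == "__main__":
    for name, osd, on_disk, expected in summarize(json.loads(REPORT)):
        print(f"{name} osd.{osd}: on-disk {on_disk}, object info {expected}")
```

In practice the report would come from the rados output for each inconsistent PG rather than from an embedded string.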

[2]:
2017-05-01 10:50:13.812992 7f0184623700  0 log_channel(cluster) log [INF] : 36.277b scrub starts
2017-05-01 10:51:02.495229 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 175: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
2017-05-01 10:51:02.495234 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 244: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
2017-05-01 10:51:02.495326 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 297: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
2017-05-01 10:51:02.495328 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b soid 36:dee65976:::100235614fe.00000005:head: failed to pick suitable auth object
2017-05-01 10:51:02.495450 7f0186e28700 -1 log_channel(cluster) log [ERR] : scrub 36.277b 36:dee65976:::100235614fe.00000005:head on disk size (0) does not match object info size (2994176) adjusted for ondisk to (2994176)
2017-05-01 10:51:20.223733 7f0184623700 -1 log_channel(cluster) log [ERR] : 36.277b scrub 4 errors
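
The same comparison can be pulled out of the cluster log. The sketch below is a rough Python example that extracts the per-shard "size X != size Y from auth oi" errors from lines like those in [2]. The regular expression is written against these specific lines and is an assumption about the message format; the name shard_mismatches is hypothetical, and the sample lines are cut off after "auth oi" for brevity.

```python
import re

# Sample lines taken from [2]; the shard lines are cut off after "auth oi".
LOG = """\
2017-05-01 10:50:13.812992 7f0184623700  0 log_channel(cluster) log [INF] : 36.277b scrub starts
2017-05-01 10:51:02.495229 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 175: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi
2017-05-01 10:51:02.495234 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 244: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi
2017-05-01 10:51:02.495326 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 297: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi
2017-05-01 10:51:20.223733 7f0184623700 -1 log_channel(cluster) log [ERR] : 36.277b scrub 4 errors
"""

# Assumed message shape: "<pgid> shard <osd>: soid <object> size <disk> != size <oi> from auth oi"
SHARD_RE = re.compile(
    r"(?P<pg>\d+\.[0-9a-f]+) shard (?P<osd>\d+): soid (?P<soid>\S+) "
    r"size (?P<disk>\d+) != size (?P<oi>\d+) from auth oi"
)


def shard_mismatches(text):
    """Return (pgid, osd, soid, on-disk size, object info size) per matching line."""
    found = []
    for line in text.splitlines():
        match = SHARD_RE.search(line)
        if match:
            found.append(
                (match["pg"], int(match["osd"]), match["soid"],
                 int(match["disk"]), int(match["oi"]))
            )
    return found


if __name__ == "__main__":
    for pg, osd, soid, disk, oi in shard_mismatches(LOG):
        print(f"pg {pg} osd.{osd} {soid}: on-disk {disk} vs object info {oi}")
```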

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com