Hi Greg,

Curiously, some of these scrub errors went away on their own. The example pg in the original post is now active+clean, and there is nothing interesting in the logs:

# zgrep "36.277b" ceph-osd.244*gz
ceph-osd.244.log-20170510.gz:2017-05-09 06:56:40.739855 7f0184623700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170510.gz:2017-05-09 06:58:01.872484 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170511.gz:2017-05-10 20:40:47.536974 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170511.gz:2017-05-10 20:41:38.399614 7f0184623700 0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170514.gz:2017-05-13 20:49:47.063789 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170514.gz:2017-05-13 20:50:42.085718 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170515.gz:2017-05-15 00:10:39.417578 7f0184623700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170515.gz:2017-05-15 00:11:26.189777 7f0186e28700 0 log_channel(cluster) log [INF] : 36.277b scrub ok

(No matches in the logs for osd 175 and osd 297 -- perhaps already rotated away?)

Other PGs still exhibit this behavior, though:

# rados list-inconsistent-obj 36.2953 | jq .
{
  "epoch": 737940,
  "inconsistents": [
    {
      "object": {
        "name": "1002378da6c.00000001",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 2213621
      },
      "errors": [],
      "union_shard_errors": [
        "size_mismatch_oi"
      ],
      "selected_object_info": "36:ca95a23b:::1002378da6c.00000001:head(737930'2177823 client.36346283.1:5635626 dirty s 4067328 uv 2213621)",
      "shards": [
        {
          "osd": 113,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 123,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 173,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        }
      ]
    }
  ]
}

Perhaps new data being written to this pg cleared things up?

The only other data point I can add is that, due to some tweaking of the cache tier size before this happened, the cache tier was briefly (maybe <1hr?) reporting near full / full in `ceph -s`.

Thanks for looking into this.

--Lincoln

> On May 15, 2017, at 4:50 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Mon, May 1, 2017 at 9:28 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
>> Hi all,
>>
>> I’ve run across a peculiar issue on 10.2.7. On my 3x replicated cache tiering cache pool, routine scrubbing suddenly found a bunch of PGs with size_mismatch_oi errors. From the “rados list-inconsistent-obj” tool[1], I see that all OSDs are reporting size 0 for a particular object. I’ve checked this object on disk, and it is indeed 0 bytes:
>> -rw-r--r-- 1 root root 0 Apr 29 06:12 100235614fe.00000005__head_6E9A677B__24
>>
>> I’ve tried re-issuing a scrub, which informs me that the object info size (2994176) doesn’t match the on-disk size (0) (see [2]). I’ve tried a repair operation as well, to no avail.
>>
>> For what it’s worth, this particular cluster is currently migrating several disks from one CRUSH root to another, and there is a nightly cache flush/eviction script that lowers the cache_target_*_ratios before raising them again in the morning.
>>
>> This issue is currently affecting ~10 PGs in my cache pool. Any ideas how to proceed here?
>
> Did anything come from this? It's tickling my brain (especially with
> the cache pool) but I'm not seeing anything relevant when I search my
> email.
>
>>
>> Thanks,
>> Lincoln
>>
>> [1]:
>> {
>>   "epoch": 721312,
>>   "inconsistents": [
>>     {
>>       "object": {
>>         "name": "100235614fe.00000005",
>>         "nspace": "",
>>         "locator": "",
>>         "snap": "head",
>>         "version": 2233551
>>       },
>>       "errors": [],
>>       "union_shard_errors": [
>>         "size_mismatch_oi"
>>       ],
>>       "selected_object_info": "36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)",
>>       "shards": [
>>         {
>>           "osd": 175,
>>           "errors": [
>>             "size_mismatch_oi"
>>           ],
>>           "size": 0
>>         },
>>         {
>>           "osd": 244,
>>           "errors": [
>>             "size_mismatch_oi"
>>           ],
>>           "size": 0
>>         },
>>         {
>>           "osd": 297,
>>           "errors": [
>>             "size_mismatch_oi"
>>           ],
>>           "size": 0
>>         }
>>       ]
>>     }
>>   ]
>> }
>>
>> [2]:
>> 2017-05-01 10:50:13.812992 7f0184623700 0 log_channel(cluster) log [INF] : 36.277b scrub starts
>> 2017-05-01 10:51:02.495229 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 175: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495234 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 244: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495326 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 297: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495328 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b soid 36:dee65976:::100235614fe.00000005:head: failed to pick suitable auth object
>> 2017-05-01 10:51:02.495450 7f0186e28700 -1 log_channel(cluster) log [ERR] : scrub 36.277b 36:dee65976:::100235614fe.00000005:head on disk size (0) does not match object info size (2994176) adjusted for ondisk to (2994176)
>> 2017-05-01 10:51:20.223733 7f0184623700 -1 log_channel(cluster) log [ERR] : 36.277b scrub 4 errors
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
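
For anyone trying to follow along with the cache-tier tuning discussed above: a minimal sketch of what such a nightly flush/eviction job might look like is below. The pool name and ratio values are illustrative placeholders, not the actual script or settings from the cluster described in this thread.

#!/bin/bash
# Illustrative sketch of a nightly cache flush/eviction job.
# Pool name and ratio values are placeholders, not the settings
# from the cluster discussed in this thread.
CACHE_POOL="cachepool"

# Evening: lower the target ratios so the tiering agent flushes
# dirty objects and evicts clean ones aggressively overnight.
ceph osd pool set "$CACHE_POOL" cache_target_dirty_ratio 0.1
ceph osd pool set "$CACHE_POOL" cache_target_full_ratio 0.2

# (The tiering agent drains the cache tier while the ratios stay low.)

# Morning: restore the normal working ratios.
ceph osd pool set "$CACHE_POOL" cache_target_dirty_ratio 0.4
ceph osd pool set "$CACHE_POOL" cache_target_full_ratio 0.8

With this approach the flushing and eviction are driven entirely by the tiering agent reacting to the lowered ratios; no explicit per-object flush commands are needed.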