Re: Inconsistent pgs with size_mismatch_oi


 



Sorry to bring up an old post, but on Kraken I am unable to repair a PG that is inconsistent in a cache tier. We removed the bad object but are still seeing the following errors in the OSD's logs.
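(For reference, removing a single bad replica like this is typically done with ceph-objectstore-tool while the affected OSD is stopped; a rough sketch with default paths, using the PG and object from the logs below:

# systemctl stop ceph-osd@143
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-143 --journal-path /var/lib/ceph/osd/ceph-143/journal --pgid 1.15f 10006cdc2c5.00000000 remove
# systemctl start ceph-osd@143
# ceph pg repair 1.15f

Shard 143 is the copy that now shows up as "missing" in the second set of log lines below.)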



Prior to removing the invalid object:

/var/log/ceph/ceph-osd.126.log:928:2017-07-03 08:07:55.331479 7f95a73eb700 -1 log_channel(cluster) log [ERR] : 1.15f shard 63:  soid 1:fa86fe35:::10006cdc2c5.00000000:head data_digest 0x931041e9 != data_digest 0xcd130b55 from auth oi 1:fa86fe35:::10006cdc2c5.00000000:head(25726'1664129 client.8168902.0:607753 dirty|data_digest s 1713351 uv 1664129 dd cd130b55 alloc_hint [0 0 0])
/var/log/ceph/ceph-osd.126.log:929:2017-07-03 08:07:55.331483 7f95a73eb700 -1 log_channel(cluster) log [ERR] : 1.15f shard 126: soid 1:fa86fe35:::10006cdc2c5.00000000:head data_digest 0x931041e9 != data_digest 0xcd130b55 from auth oi 1:fa86fe35:::10006cdc2c5.00000000:head(25726'1664129 client.8168902.0:607753 dirty|data_digest s 1713351 uv 1664129 dd cd130b55 alloc_hint [0 0 0])
/var/log/ceph/ceph-osd.126.log:930:2017-07-03 08:07:55.331487 7f95a73eb700 -1 log_channel(cluster) log [ERR] : 1.15f shard 143: soid 1:fa86fe35:::10006cdc2c5.00000000:head data_digest 0x931041e9 != data_digest 0xcd130b55 from auth oi 1:fa86fe35:::10006cdc2c5.00000000:head(25726'1664129 client.8168902.0:607753 dirty|data_digest s 1713351 uv 1664129 dd cd130b55 alloc_hint [0 0 0])
/var/log/ceph/ceph-osd.126.log:931:2017-07-03 08:07:55.331491 7f95a73eb700 -1 log_channel(cluster) log [ERR] : 1.15f soid 1:fa86fe35:::10006cdc2c5.00000000:head: failed to pick suitable auth object
/var/log/ceph/ceph-osd.126.log:932:2017-07-03 08:08:27.605139 7f95a4be6700 -1 log_channel(cluster) log [ERR] : 1.15f repair 3 errors, 0 fixed



After removing the invalid object:
/var/log/ceph/ceph-osd.126.log:3433:2017-07-03 08:37:03.780584 7f95a73eb700 -1 log_channel(cluster) log [ERR] : 1.15f shard 63:  soid 1:fa86fe35:::10006cdc2c5.00000000:head data_digest 0x931041e9 != data_digest 0xcd130b55 from auth oi 1:fa86fe35:::10006cdc2c5.00000000:head(25726'1664129 client.8168902.0:607753 dirty|data_digest s 1713351 uv 1664129 dd cd130b55 alloc_hint [0 0 0])
/var/log/ceph/ceph-osd.126.log:3434:2017-07-03 08:37:03.780591 7f95a73eb700 -1 log_channel(cluster) log [ERR] : 1.15f shard 126: soid 1:fa86fe35:::10006cdc2c5.00000000:head data_digest 0x931041e9 != data_digest 0xcd130b55 from auth oi 1:fa86fe35:::10006cdc2c5.00000000:head(25726'1664129 client.8168902.0:607753 dirty|data_digest s 1713351 uv 1664129 dd cd130b55 alloc_hint [0 0 0])
/var/log/ceph/ceph-osd.126.log:3435:2017-07-03 08:37:03.780593 7f95a73eb700 -1 log_channel(cluster) log [ERR] : 1.15f shard 143  missing 1:fa86fe35:::10006cdc2c5.00000000:head
/var/log/ceph/ceph-osd.126.log:3436:2017-07-03 08:37:03.780594 7f95a73eb700 -1 log_channel(cluster) log [ERR] : 1.15f soid 1:fa86fe35:::10006cdc2c5.00000000:head: failed to pick suitable auth object
/var/log/ceph/ceph-osd.126.log:3437:2017-07-03 08:37:39.278991 7f95a4be6700 -1 log_channel(cluster) log [ERR] : 1.15f repair 3 errors, 0 fixed



Is it possible this thread is related to the error we are seeing?


Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222





From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: Monday, May 15, 2017 6:28 PM
To: Lincoln Bryant; Weil, Sage
Cc: ceph-users
Subject: Re: Inconsistent pgs with size_mismatch_oi
 


On Mon, May 15, 2017 at 3:19 PM Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
Hi Greg,

Curiously, some of these scrub errors went away on their own. The example pg in the original post is now active+clean, and nothing interesting in the logs:

# zgrep "36.277b" ceph-osd.244*gz
ceph-osd.244.log-20170510.gz:2017-05-09 06:56:40.739855 7f0184623700  0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170510.gz:2017-05-09 06:58:01.872484 7f0186e28700  0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170511.gz:2017-05-10 20:40:47.536974 7f0186e28700  0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170511.gz:2017-05-10 20:41:38.399614 7f0184623700  0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170514.gz:2017-05-13 20:49:47.063789 7f0186e28700  0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170514.gz:2017-05-13 20:50:42.085718 7f0186e28700  0 log_channel(cluster) log [INF] : 36.277b scrub ok
ceph-osd.244.log-20170515.gz:2017-05-15 00:10:39.417578 7f0184623700  0 log_channel(cluster) log [INF] : 36.277b scrub starts
ceph-osd.244.log-20170515.gz:2017-05-15 00:11:26.189777 7f0186e28700  0 log_channel(cluster) log [INF] : 36.277b scrub ok

(No matches in the logs for osd 175 and osd 297  — perhaps already rotated away?)

Other PGs still exhibit this behavior though:

# rados list-inconsistent-obj 36.2953 | jq .
{
  "epoch": 737940,
  "inconsistents": [
    {
      "object": {
        "name": "1002378da6c.00000001",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 2213621
      },
      "errors": [],
      "union_shard_errors": [
        "size_mismatch_oi"
      ],
      "selected_object_info": "36:ca95a23b:::1002378da6c.00000001:head(737930'2177823 client.36346283.1:5635626 dirty s 4067328 uv 2213621)",
      "shards": [
        {
          "osd": 113,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 123,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 173,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        }
      ]
    }
  ]
}
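(If it helps, a loop like this lists every flagged object per inconsistent PG; "cephfs-cache" below is just a placeholder for the cache pool name:)

# for pg in $(rados list-inconsistent-pg cephfs-cache | jq -r '.[]'); do echo "== $pg =="; rados list-inconsistent-obj "$pg" | jq -r '.inconsistents[] | "\(.object.name): \(.union_shard_errors | join(","))"'; done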

Perhaps new data being written to this pg cleared things up?

Hmm, somebody else did report the same thing (and the symptoms disappearing) recently as well. I wonder if we broke the synchronization around eviction and scrubbing within cache pools. Sage, you've done work on cache pools recently; any thoughts?
-Greg
 

The only other data point that I can add is that, due to some tweaking of the cache tier size before this happened, the cache tier was reporting near full / full in `ceph -s` for a brief amount of time (maybe <1hr ?).
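(For reference, the tweaking was done with the standard cache pool knobs, along the lines of the following; the pool name and values here are illustrative, not our actual settings:)

# ceph osd pool set cephfs-cache target_max_bytes 5000000000000
# ceph osd pool set cephfs-cache cache_target_dirty_ratio 0.4
# ceph osd pool set cephfs-cache cache_target_full_ratio 0.8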

Thanks for looking into this.

--Lincoln

> On May 15, 2017, at 4:50 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Mon, May 1, 2017 at 9:28 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
>> Hi all,
>>
>> I’ve run across a peculiar issue on 10.2.7. On my 3x replicated cache tiering cache pool, routine scrubbing suddenly found a bunch of PGs with size_mismatch_oi errors. From the “rados list-inconsistent-pg tool”[1], I see that all OSDs are reporting size 0 for a particular pg. I’ve checked this pg on disk, and it is indeed 0 bytes:
>>        -rw-r--r--  1 root root    0 Apr 29 06:12 100235614fe.00000005__head_6E9A677B__24
>>
>> I’ve tried re-issuing a scrub, which informs me that the object info size (2994176) doesn’t match the on-disk size (0) (see [2]). I’ve tried a repair operation as well to no avail.
>>
>> For what it’s worth, this particular cluster is currently migrating several disks from one CRUSH root to another, and there is a nightly cache flush/eviction script that is lowering the cache_target_*_ratios before raising them again in the morning.
>>
>> This issue is currently affecting ~10 PGs in my cache pool. Any ideas how to proceed here?
>
> Did anything come from this? It's tickling my brain (especially with
> the cache pool) but I'm not seeing anything relevant when I search my
> email.
>
>>
>> Thanks,
>> Lincoln
>>
>> [1]:
>> {
>>  "epoch": 721312,
>>  "inconsistents": [
>>    {
>>      "object": {
>>        "name": "100235614fe.00000005",
>>        "nspace": "",
>>        "locator": "",
>>        "snap": "head",
>>        "version": 2233551
>>      },
>>      "errors": [],
>>      "union_shard_errors": [
>>        "size_mismatch_oi"
>>      ],
>>      "selected_object_info": "36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)",
>>      "shards": [
>>        {
>>          "osd": 175,
>>          "errors": [
>>            "size_mismatch_oi"
>>          ],
>>          "size": 0
>>        },
>>        {
>>          "osd": 244,
>>          "errors": [
>>            "size_mismatch_oi"
>>          ],
>>          "size": 0
>>        },
>>        {
>>          "osd": 297,
>>          "errors": [
>>            "size_mismatch_oi"
>>          ],
>>          "size": 0
>>        }
>>      ]
>>    }
>>  ]
>> }
>>
>> [2]:
>> 2017-05-01 10:50:13.812992 7f0184623700  0 log_channel(cluster) log [INF] : 36.277b scrub starts
>> 2017-05-01 10:51:02.495229 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 175: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495234 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 244: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495326 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b shard 297: soid 36:dee65976:::100235614fe.00000005:head size 0 != size 2994176 from auth oi 36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)
>> 2017-05-01 10:51:02.495328 7f0186e28700 -1 log_channel(cluster) log [ERR] : 36.277b soid 36:dee65976:::100235614fe.00000005:head: failed to pick suitable auth object
>> 2017-05-01 10:51:02.495450 7f0186e28700 -1 log_channel(cluster) log [ERR] : scrub 36.277b 36:dee65976:::100235614fe.00000005:head on disk size (0) does not match object info size (2994176) adjusted for ondisk to (2994176)
>> 2017-05-01 10:51:20.223733 7f0184623700 -1 log_channel(cluster) log [ERR] : 36.277b scrub 4 errors
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
