Re: Inconsistent PGs after upgrade to Pacific

Hi Pascal,

It's not clear to me how the upgrade procedure you described would
lead to inconsistent PGs.

Even if you didn't record every step, do you have the ceph.log, the
mds logs, perhaps some osd logs from this time?
And which versions did you upgrade from and to?

Cheers, Dan

On Wed, Jun 22, 2022 at 7:41 PM Pascal Ehlert <pascal@xxxxxxxxxxxx> wrote:
>
> Hi all,
>
> I am currently battling inconsistent PGs after a far-reaching mistake
> during the upgrade from Octopus to Pacific.
> While otherwise following the upgrade guide, I restarted the Ceph MDS
> daemons (which brought up the Pacific daemons) without first reducing
> the number of ranks from 2 to 1.
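>
> For anyone reading along: the step that was skipped is the documented
> "reduce the MDS cluster to a single rank before restarting" sequence. As a
> rough sketch only (the filesystem name "cephfs", the restart unit, and the
> idea of driving this from Python are all assumptions, not our setup):

```python
# Sketch: the MDS rank-reduction order that the Octopus -> Pacific upgrade
# notes prescribe, expressed as an ordered command list. Actually running
# the commands (e.g. via subprocess) is deliberately left out; this only
# builds and prints the sequence so the ordering is explicit.
FS_NAME = "cephfs"  # placeholder filesystem name


def mds_upgrade_commands(fs_name: str, old_max_mds: int) -> list[str]:
    cmds = [f"ceph fs set {fs_name} max_mds 1"]       # shrink to one rank
    cmds.append(f"ceph fs status {fs_name}")           # wait until ranks > 0 have stopped
    cmds.append("systemctl restart ceph-mds.target")   # only now restart/upgrade the MDS
    cmds.append(f"ceph fs set {fs_name} max_mds {old_max_mds}")  # restore rank count
    return cmds


for cmd in mds_upgrade_commands(FS_NAME, 2):
    print(cmd)
```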
>
> This resulted in daemons not coming up and reporting inconsistencies.
> After later reducing the ranks and bringing the MDS back up (I did not
> record every step as this was an emergency situation), we started seeing
> health errors on every scrub.
>
> Now, three weeks later, while our CephFS is still working fine and we
> haven't noticed any data damage, we have realized that every single PG of
> the cephfs metadata pool is affected.
> Below is some information on the current status and a detailed
> inspection of one of the affected PGs. I am of course happy to provide
> any other information that could be useful.
>
> A repair of the affected PGs does not resolve the issue.
> Does anyone else here have an idea what we could try apart from copying
> all the data to a new CephFS pool?
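>
> In case it helps others triage the same situation: the scope of the damage
> can be summarized by tallying the shard-level errors across the JSON that
> `rados list-inconsistent-obj <pgid> --format=json-pretty` prints for each
> inconsistent PG. A minimal sketch, using a sample document that mirrors the
> snippet further down in this mail (the helper name is ours, not a Ceph API):

```python
import json
from collections import Counter

# Sample report shaped like `rados list-inconsistent-obj` output,
# trimmed to the fields the tally below actually reads.
sample = json.loads("""
{
  "epoch": 4962617,
  "inconsistents": [
    {
      "object": {"name": "1000000cc8e.00000000", "snap": 1},
      "union_shard_errors": ["omap_digest_mismatch_info"],
      "shards": [
        {"osd": 20, "errors": ["omap_digest_mismatch_info"]},
        {"osd": 27, "errors": ["omap_digest_mismatch_info"]},
        {"osd": 43, "errors": ["omap_digest_mismatch_info"]}
      ]
    }
  ]
}
""")


def tally_shard_errors(reports):
    """Count every shard-level error across a list of per-PG reports."""
    counts = Counter()
    for report in reports:
        for obj in report.get("inconsistents", []):
            for shard in obj.get("shards", []):
                counts.update(shard.get("errors", []))
    return counts


print(tally_shard_errors([sample]))  # Counter({'omap_digest_mismatch_info': 3})
```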
>
>
>
> Thank you!
>
> Pascal
>
>
>
>
> root@srv02:~# ceph status
>    cluster:
>      id:     f0d6d4d0-8c17-471a-9f95-ebc80f1fee78
>      health: HEALTH_ERR
>              insufficient standby MDS daemons available
>              69262 scrub errors
>              Too many repaired reads on 2 OSDs
>              Possible data damage: 64 pgs inconsistent
>
>    services:
>      mon: 3 daemons, quorum srv02,srv03,srv01 (age 3w)
>      mgr: srv03(active, since 3w), standbys: srv01, srv02
>      mds: 2/2 daemons up, 1 hot standby
>      osd: 44 osds: 44 up (since 3w), 44 in (since 10M)
>
>    data:
>      volumes: 2/2 healthy
>      pools:   13 pools, 1217 pgs
>      objects: 75.72M objects, 26 TiB
>      usage:   80 TiB used, 42 TiB / 122 TiB avail
>      pgs:     1153 active+clean
>               55   active+clean+inconsistent
>               9    active+clean+inconsistent+failed_repair
>
>    io:
>      client:   2.0 MiB/s rd, 21 MiB/s wr, 240 op/s rd, 1.75k op/s wr
>
>
> {
>    "epoch": 4962617,
>    "inconsistents": [
>      {
>        "object": {
>          "name": "1000000cc8e.00000000",
>          "nspace": "",
>          "locator": "",
>          "snap": 1,
>          "version": 4253817
>        },
>        "errors": [],
>        "union_shard_errors": [
>          "omap_digest_mismatch_info"
>        ],
>        "selected_object_info": {
>          "oid": {
>            "oid": "1000000cc8e.00000000",
>            "key": "",
>            "snapid": 1,
>            "hash": 1369745244,
>            "max": 0,
>            "pool": 7,
>            "namespace": ""
>          },
>          "version": "4962847'6209730",
>          "prior_version": "3916665'4306116",
>          "last_reqid": "osd.27.0:757107407",
>          "user_version": 4253817,
>          "size": 0,
>          "mtime": "2022-02-26T12:56:55.612420+0100",
>          "local_mtime": "2022-02-26T12:56:55.614429+0100",
>          "lost": 0,
>          "flags": [
>            "dirty",
>            "omap",
>            "data_digest",
>            "omap_digest"
>          ],
>          "truncate_seq": 0,
>          "truncate_size": 0,
>          "data_digest": "0xffffffff",
>          "omap_digest": "0xe5211a9e",
>          "expected_object_size": 0,
>          "expected_write_size": 0,
>          "alloc_hint_flags": 0,
>          "manifest": {
>            "type": 0
>          },
>          "watchers": {}
>        },
>        "shards": [
>          {
>            "osd": 20,
>            "primary": false,
>            "errors": [
>              "omap_digest_mismatch_info"
>            ],
>            "size": 0,
>            "omap_digest": "0xffffffff",
>            "data_digest": "0xffffffff"
>          },
>          {
>            "osd": 27,
>            "primary": true,
>            "errors": [
>              "omap_digest_mismatch_info"
>            ],
>            "size": 0,
>            "omap_digest": "0xffffffff",
>            "data_digest": "0xffffffff"
>          },
>          {
>            "osd": 43,
>            "primary": false,
>            "errors": [
>              "omap_digest_mismatch_info"
>            ],
>            "size": 0,
>            "omap_digest": "0xffffffff",
>            "data_digest": "0xffffffff"
>          }
>        ]
>      },
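>
> To make the error above legible: "omap_digest_mismatch_info" flags a shard
> whose omap digest disagrees with the digest recorded in the selected
> object_info. In this report all three replicas show 0xffffffff while
> object_info records 0xe5211a9e, i.e. the replicas agree with each other but
> not with the recorded metadata, which may be why repair does not converge
> here. A small sketch of that comparison, with values copied from the report:

```python
# Sketch of the check behind "omap_digest_mismatch_info": a shard is
# flagged when its omap digest differs from the one stored in the
# selected object_info. Values taken from the report above.
info_omap_digest = 0xE5211A9E  # selected_object_info.omap_digest
shards = {20: 0xFFFFFFFF, 27: 0xFFFFFFFF, 43: 0xFFFFFFFF}  # osd -> omap_digest

flagged = {osd: "omap_digest_mismatch_info"
           for osd, digest in shards.items()
           if digest != info_omap_digest}

print(sorted(flagged))  # [20, 27, 43] -- every replica disagrees with object_info

# The replicas are unanimous among themselves, so scrub has no
# "odd one out" shard to overwrite from the others:
replicas_agree = len(set(shards.values())) == 1
print(replicas_agree)  # True
```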
>
>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx


