Re: Inconsistent PGs after upgrade to Pacific

Hi,

From what I can tell, the ceph osd pool mksnap command is indeed the same as
rados mksnap internally.

But bizarrely I just created a new snapshot, changed max_mds, then
removed the snap -- this time I can't manage to "fix" the
inconsistency.
It may be that my first test was so simple (no client IO, no fs
snapshots) that removing the snap fixed it.
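
(For reference, this second test was roughly the following sequence; the snap
name "testsnap" and the max_mds value are illustrative, and 3.6 is the PG from
the output below:)

# ceph osd pool mksnap cephfs.cephfs.meta testsnap
# ceph fs set cephfs max_mds 1
# ceph osd pool rmsnap cephfs.cephfs.meta testsnap
# ceph pg deep-scrub 3.6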

In this case, the inconsistent object appears to be an old version of
mds0_openfiles.0

# rados list-inconsistent-obj 3.6 | jq .
{
  "epoch": 7754,
  "inconsistents": [
    {
      "object": {
        "name": "mds0_openfiles.0",
        "nspace": "",
        "locator": "",
        "snap": 3,
        "version": 2467
      },

I tried modifying the current (head) version of that object with setomapval,
but the object stays inconsistent.
I even removed the head version from the pool, and somehow that old
snapshotted clone remains with the wrong checksum even though the snap no
longer exists.
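
(The setomapval attempt mentioned above was along these lines; the key and
value are placeholders, not the real openfiles omap keys:)

# rados -p cephfs.cephfs.meta setomapval mds0_openfiles.0 dummykey dummyval
# ceph pg deep-scrub 3.6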

# rados rm -p cephfs.cephfs.meta mds0_openfiles.0
#

# ceph pg ls inconsistent
PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES     OMAP_BYTES*  OMAP_KEYS*  LOG  STATE                      SINCE  VERSION    REPORTED    UP         ACTING     SCRUB_STAMP                      DEEP_SCRUB_STAMP
3.6       13         0          0        0  20971520            0           0   41  active+clean+inconsistent     2m  7852'2479  7852:12048  [0,3,2]p0  [0,3,2]p0  2022-06-24T11:31:05.605434+0200  2022-06-24T11:31:05.605434+0200

# rados lssnap -p cephfs.cephfs.meta
0 snaps

This is getting super weird (I can list the object but not stat it):

# rados ls -p cephfs.cephfs.meta | grep open
mds1_openfiles.0
mds3_openfiles.0
mds0_openfiles.0
mds2_openfiles.0

# rados stat -p cephfs.cephfs.meta mds0_openfiles.0
 error stat-ing cephfs.cephfs.meta/mds0_openfiles.0: (2) No such file
or directory

I then failed over the mds to a standby so mds0_openfiles.0 exists
again, but the PG remains inconsistent with that old version of the
object.
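
(The failover was a plain MDS fail so the standby takes over and rewrites the
object; roughly, with the rank given here just as an example:)

# ceph mds fail cephfs:0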

I will add this to the tracker.

Clearly the objects are not all trimmed correctly when the pool
snapshot is removed.
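
(The leftover clone can be checked per object with listsnaps, e.g.:)

# rados -p cephfs.cephfs.meta listsnaps mds0_openfiles.0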

-- dan



On Fri, Jun 24, 2022 at 11:10 AM Pascal Ehlert <pascal@xxxxxxxxxxxx> wrote:
>
> Hi Dan,
>
> Just a quick addition here:
>
> I have not used the rados command to create the snapshot but "ceph osd
> pool mksnap $POOL $SNAPNAME" - which I think is the same internally?
>
> And yes, our CephFS has numerous snapshots itself for backup purposes.
>
>
> Cheers,
> Pascal
>
>
>
> Dan van der Ster wrote on 24.06.22 11:06:
> > Hi Pascal,
> >
> > I'm not sure why you don't see that snap, and I'm also not sure if you
> > can just delete the objects directly.
> > BTW, does your CephFS have snapshots itself (e.g. created via mkdir
> > .snap/foobar)?
> >
> > Cheers, Dan
> >
> > On Fri, Jun 24, 2022 at 10:34 AM Pascal Ehlert <pascal@xxxxxxxxxxxx> wrote:
> >> Hi Dan,
> >>
> >> Thank you so much for going through the effort of reproducing this!
> >> I was just about to plan how to bring up a test cluster but it would've
> >> taken me much longer.
> >>
> >> While I fully expect this is the root cause of our issues, there is
> >> one small difference.
> >> rados lssnap does not list any snapshots for me:
> >>
> >> root@srv01:~# rados lssnap -p kubernetes_cephfs_metadata
> >> 0 snaps
> >>
> >> I definitely recall having made a snapshot, and apparently there are
> >> snapshot objects present in the pool.
> >> I'm not sure how the reference seemingly got lost.
> >>
> >> Do you have any ideas how I could remove the broken snapshot objects anyway?
> >>
> >>
> >> Cheers,
> >>
> >> Pascal
> >>
> >>
> >> Dan van der Ster wrote on 24.06.22 09:27:
> >>> Hi,
> >>>
> >>> It's trivial to reproduce. Running 16.2.9 with max_mds=2, take a pool
> >>> snapshot of the meta pool, then decrease to max_mds=1, then deep scrub
> >>> each meta pg.
> >>>
> >>> In my test I could list and remove the pool snap; deep-scrubbing again
> >>> then cleared the inconsistencies.
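> >>>
> >>> (That is, roughly, with pool, snap, and PG names as placeholders:)
> >>>
> >>> # rados lssnap -p $METAPOOL
> >>> # ceph osd pool rmsnap $METAPOOL $SNAPNAME
> >>> # ceph pg deep-scrub $PGID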
> >>>
> >>> https://tracker.ceph.com/issues/56386
> >>>
> >>> Cheers, Dan
> >>>
> >>> On Fri, Jun 24, 2022 at 8:41 AM Ansgar Jazdzewski
> >>> <a.jazdzewski@xxxxxxxxxxxxxx> wrote:
> >>>> Hi,
> >>>>
> >>>> I would say yes, but it would be nice if other people could confirm it too.
> >>>>
> >>>> Also, can you create a test cluster and do the same tasks:
> >>>> * create it with octopus
> >>>> * create snapshot
> >>>> * reduce rank to 1
> >>>> * upgrade to pacific
> >>>>
> >>>> and then try to fix the PG, assuming that you will have the same
> >>>> issues in your test cluster.
> >>>>
> >>>> cheers,
> >>>> Ansgar
> >>>>
> >>>> Am Do., 23. Juni 2022 um 22:12 Uhr schrieb Pascal Ehlert <pascal@xxxxxxxxxxxx>:
> >>>>> Hi,
> >>>>>
> >>>>> I have now tried to run "ceph osd pool rmsnap $POOL beforefixes", and it says the snapshot could not be found, although I definitely ran "ceph osd pool mksnap $POOL beforefixes" about three weeks ago.
> >>>>> When running rados list-inconsistent-obj $PG on one of the affected PGs, all of the objects returned have "snap" set to 1:
> >>>>>
> >>>>> root@srv01:~# for i in $(rados list-inconsistent-pg $POOL | jq -er .[]); do rados list-inconsistent-obj $i | jq -er .inconsistents[].object; done
> >>>>> [..]
> >>>>> {
> >>>>>     "name": "200020744f4.00000000",
> >>>>>     "nspace": "",
> >>>>>     "locator": "",
> >>>>>     "snap": 1,
> >>>>>     "version": 5704208
> >>>>> }
> >>>>> {
> >>>>>     "name": "200021aeb16.00000000",
> >>>>>     "nspace": "",
> >>>>>     "locator": "",
> >>>>>     "snap": 1,
> >>>>>     "version": 6189078
> >>>>> }
> >>>>> [..]
> >>>>>
> >>>>> Running listsnaps on any of them then looks like this:
> >>>>>
> >>>>> root@srv01:~# rados listsnaps 200020744f4.00000000 -p $POOL
> >>>>> 200020744f4.00000000:
> >>>>> cloneid    snaps    size    overlap
> >>>>> 1    1    0    []
> >>>>> head    -    0
> >>>>>
> >>>>>
> >>>>> Is it safe to assume that these objects belong to a somewhat broken snapshot and can be removed safely without causing further damage?
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Pascal
> >>>>>
> >>>>>
> >>>>>
> >>>>> Ansgar Jazdzewski wrote on 23.06.22 20:36:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> We could identify the RBD images that were affected and did an export beforehand, but in the case of CephFS metadata I have no plan that will work.
> >>>>>
> >>>>> Can you try to delete the snapshot?
> >>>>> Also, if the filesystem can be shut down, try to do a backup of the metadata pool first.
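> >>>>>
> >>>>> (A minimal sketch of such a backup, assuming the fs really can be taken down; the fs name and output file are examples:)
> >>>>>
> >>>>> # ceph fs set $FS down true
> >>>>> # rados -p $POOL export /root/metadata-pool-backup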
> >>>>>
> >>>>> hope you will have some luck, let me know if I can help,
> >>>>> Ansgar
> >>>>>
> >>>>> Pascal Ehlert <pascal@xxxxxxxxxxxx> schrieb am Do., 23. Juni 2022, 16:45:
> >>>>>> Hi Ansgar,
> >>>>>>
> >>>>>> Thank you very much for the response.
> >>>>>> Running your first command to obtain inconsistent objects, I retrieve a
> >>>>>> total of 23114 objects, only some of which are snaps.
> >>>>>>
> >>>>>> Your mentioning snapshots did remind me, however, that I created a
> >>>>>> snapshot on the Ceph metadata pool via "ceph osd pool mksnap $POOL"
> >>>>>> before I reduced the number of ranks.
> >>>>>> Maybe that has caused the inconsistencies, which would explain why the
> >>>>>> actual file system appears unaffected?
> >>>>>>
> >>>>>> Is there any way to validate that theory? I am a bit hesitant to just
> >>>>>> run "rmsnap". Could that cause inconsistent data to be written back to
> >>>>>> the actual objects?
> >>>>>>
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Pascal
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Ansgar Jazdzewski wrote on 23.06.22 16:11:
> >>>>>>> Hi Pascal,
> >>>>>>>
> >>>>>>> We just had a similar situation on our RBD pool and found some bad data
> >>>>>>> in RADOS. Here is how we did it:
> >>>>>>>
> >>>>>>> for i in $(rados list-inconsistent-pg $POOL | jq -er .[]); do rados list-inconsistent-obj $i | jq -er .inconsistents[].object.name | awk -F'.' '{print $2}'; done
> >>>>>>>
> >>>>>>> We then found inconsistent snaps on the object:
> >>>>>>>
> >>>>>>> rados list-inconsistent-snapset $PG --format=json-pretty | jq
> >>>>>>> .inconsistents[].name
> >>>>>>>
> >>>>>>> List the data on the OSDs (find them with ceph pg map $PG):
> >>>>>>>
> >>>>>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${OSD}/ --op list ${OBJ} --pgid ${PG}
> >>>>>>>
> >>>>>>> and finally remove the object, like:
> >>>>>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-459/ --op list rbd_data.762a94d768c04d.000000000036b7ac --pgid 2.704
> >>>>>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-459/ '["2.704",{"oid":"rbd_data.801e1d1d9c719d.0000000000044943","key":"","snapid":125458,"hash":4136961796,"max":0,"pool":2,"namespace":"","max":0}]' remove
> >>>>>>>
> >>>>>>> We had to do it for each OSD, one after the other; after this, a 'pg repair' worked.
> >>>>>>>
> >>>>>>> I hope it will help,
> >>>>>>> Ansgar
> >>>>>>>
> >>>>>>> Am Do., 23. Juni 2022 um 15:02 Uhr schrieb Dan van der Ster
> >>>>>>> <dvanders@xxxxxxxxx>:
> >>>>>>>> Hi Pascal,
> >>>>>>>>
> >>>>>>>> It's not clear to me how the upgrade procedure you described would
> >>>>>>>> lead to inconsistent PGs.
> >>>>>>>>
> >>>>>>>> Even if you didn't record every step, do you have the ceph.log, the
> >>>>>>>> mds logs, perhaps some osd logs from this time?
> >>>>>>>> And which versions did you upgrade from / to ?
> >>>>>>>>
> >>>>>>>> Cheers, Dan
> >>>>>>>>
> >>>>>>>> On Wed, Jun 22, 2022 at 7:41 PM Pascal Ehlert <pascal@xxxxxxxxxxxx> wrote:
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> I am currently battling inconsistent PGs after a far-reaching mistake
> >>>>>>>>> during the upgrade from Octopus to Pacific.
> >>>>>>>>> While otherwise following the guide, I restarted the Ceph MDS daemons
> >>>>>>>>> (and this started the Pacific daemons) without previously reducing the
> >>>>>>>>> ranks to 1 (from 2).
> >>>>>>>>>
> >>>>>>>>> This resulted in daemons not coming up and reporting inconsistencies.
> >>>>>>>>> After later reducing the ranks and bringing the MDS back up (I did not
> >>>>>>>>> record every step as this was an emergency situation), we started seeing
> >>>>>>>>> health errors on every scrub.
> >>>>>>>>>
> >>>>>>>>> Now after three weeks, while our CephFS is still working fine and we
> >>>>>>>>> haven't noticed any data damage, we realized that every single PG of the
> >>>>>>>>> cephfs metadata pool is affected.
> >>>>>>>>> Below you can find some information on the current status and a detailed
> >>>>>>>>> inspection of one of the affected PGs. I am happy to provide any other
> >>>>>>>>> information that could be useful, of course.
> >>>>>>>>>
> >>>>>>>>> A repair of the affected PGs does not resolve the issue.
> >>>>>>>>> Does anyone else here have an idea what we could try apart from copying
> >>>>>>>>> all the data to a new CephFS pool?
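> >>>>>>>>>
> >>>>>>>>> (For context, "repair" here means the usual per-PG repair, e.g. with a placeholder PG id:)
> >>>>>>>>>
> >>>>>>>>> # ceph pg repair $PGID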
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thank you!
> >>>>>>>>>
> >>>>>>>>> Pascal
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> root@srv02:~# ceph status
> >>>>>>>>>       cluster:
> >>>>>>>>>         id:     f0d6d4d0-8c17-471a-9f95-ebc80f1fee78
> >>>>>>>>>         health: HEALTH_ERR
> >>>>>>>>>                 insufficient standby MDS daemons available
> >>>>>>>>>                 69262 scrub errors
> >>>>>>>>>                 Too many repaired reads on 2 OSDs
> >>>>>>>>>                 Possible data damage: 64 pgs inconsistent
> >>>>>>>>>
> >>>>>>>>>       services:
> >>>>>>>>>         mon: 3 daemons, quorum srv02,srv03,srv01 (age 3w)
> >>>>>>>>>         mgr: srv03(active, since 3w), standbys: srv01, srv02
> >>>>>>>>>         mds: 2/2 daemons up, 1 hot standby
> >>>>>>>>>         osd: 44 osds: 44 up (since 3w), 44 in (since 10M)
> >>>>>>>>>
> >>>>>>>>>       data:
> >>>>>>>>>         volumes: 2/2 healthy
> >>>>>>>>>         pools:   13 pools, 1217 pgs
> >>>>>>>>>         objects: 75.72M objects, 26 TiB
> >>>>>>>>>         usage:   80 TiB used, 42 TiB / 122 TiB avail
> >>>>>>>>>         pgs:     1153 active+clean
> >>>>>>>>>                  55   active+clean+inconsistent
> >>>>>>>>>                  9    active+clean+inconsistent+failed_repair
> >>>>>>>>>
> >>>>>>>>>       io:
> >>>>>>>>>         client:   2.0 MiB/s rd, 21 MiB/s wr, 240 op/s rd, 1.75k op/s wr
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>>       "epoch": 4962617,
> >>>>>>>>>       "inconsistents": [
> >>>>>>>>>         {
> >>>>>>>>>           "object": {
> >>>>>>>>>             "name": "1000000cc8e.00000000",
> >>>>>>>>>             "nspace": "",
> >>>>>>>>>             "locator": "",
> >>>>>>>>>             "snap": 1,
> >>>>>>>>>             "version": 4253817
> >>>>>>>>>           },
> >>>>>>>>>           "errors": [],
> >>>>>>>>>           "union_shard_errors": [
> >>>>>>>>>             "omap_digest_mismatch_info"
> >>>>>>>>>           ],
> >>>>>>>>>           "selected_object_info": {
> >>>>>>>>>             "oid": {
> >>>>>>>>>               "oid": "1000000cc8e.00000000",
> >>>>>>>>>               "key": "",
> >>>>>>>>>               "snapid": 1,
> >>>>>>>>>               "hash": 1369745244,
> >>>>>>>>>               "max": 0,
> >>>>>>>>>               "pool": 7,
> >>>>>>>>>               "namespace": ""
> >>>>>>>>>             },
> >>>>>>>>>             "version": "4962847'6209730",
> >>>>>>>>>             "prior_version": "3916665'4306116",
> >>>>>>>>>             "last_reqid": "osd.27.0:757107407",
> >>>>>>>>>             "user_version": 4253817,
> >>>>>>>>>             "size": 0,
> >>>>>>>>>             "mtime": "2022-02-26T12:56:55.612420+0100",
> >>>>>>>>>             "local_mtime": "2022-02-26T12:56:55.614429+0100",
> >>>>>>>>>             "lost": 0,
> >>>>>>>>>             "flags": [
> >>>>>>>>>               "dirty",
> >>>>>>>>>               "omap",
> >>>>>>>>>               "data_digest",
> >>>>>>>>>               "omap_digest"
> >>>>>>>>>             ],
> >>>>>>>>>             "truncate_seq": 0,
> >>>>>>>>>             "truncate_size": 0,
> >>>>>>>>>             "data_digest": "0xffffffff",
> >>>>>>>>>             "omap_digest": "0xe5211a9e",
> >>>>>>>>>             "expected_object_size": 0,
> >>>>>>>>>             "expected_write_size": 0,
> >>>>>>>>>             "alloc_hint_flags": 0,
> >>>>>>>>>             "manifest": {
> >>>>>>>>>               "type": 0
> >>>>>>>>>             },
> >>>>>>>>>             "watchers": {}
> >>>>>>>>>           },
> >>>>>>>>>           "shards": [
> >>>>>>>>>             {
> >>>>>>>>>               "osd": 20,
> >>>>>>>>>               "primary": false,
> >>>>>>>>>               "errors": [
> >>>>>>>>>                 "omap_digest_mismatch_info"
> >>>>>>>>>               ],
> >>>>>>>>>               "size": 0,
> >>>>>>>>>               "omap_digest": "0xffffffff",
> >>>>>>>>>               "data_digest": "0xffffffff"
> >>>>>>>>>             },
> >>>>>>>>>             {
> >>>>>>>>>               "osd": 27,
> >>>>>>>>>               "primary": true,
> >>>>>>>>>               "errors": [
> >>>>>>>>>                 "omap_digest_mismatch_info"
> >>>>>>>>>               ],
> >>>>>>>>>               "size": 0,
> >>>>>>>>>               "omap_digest": "0xffffffff",
> >>>>>>>>>               "data_digest": "0xffffffff"
> >>>>>>>>>             },
> >>>>>>>>>             {
> >>>>>>>>>               "osd": 43,
> >>>>>>>>>               "primary": false,
> >>>>>>>>>               "errors": [
> >>>>>>>>>                 "omap_digest_mismatch_info"
> >>>>>>>>>               ],
> >>>>>>>>>               "size": 0,
> >>>>>>>>>               "omap_digest": "0xffffffff",
> >>>>>>>>>               "data_digest": "0xffffffff"
> >>>>>>>>>             }
> >>>>>>>>>           ]
> >>>>>>>>>         },
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


