Re: MDS: corrupted header/values: decode past end of struct encoding: Malformed input

hmmm... more and more PGs are broken:


# ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds daemon; 1 filesystem is offline; insufficient standby MDS daemons available; 46 scrub errors; Possible data damage: 32 pgs inconsistent; 2625 daemons have recently crashed
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] FS_WITH_FAILED_MDS: 1 filesystem has a failed mds daemon
    fs cephfs has 1 failed mds
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs cephfs is offline because no MDS is active for it.
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[ERR] OSD_SCRUB_ERRORS: 46 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 32 pgs inconsistent
    pg 1.3 is active+clean+inconsistent, acting [4,15,25]
    pg 1.5 is active+clean+inconsistent, acting [22,8,11]
    pg 1.a is active+clean+inconsistent, acting [23,19,6]
    pg 1.10 is active+clean+inconsistent, acting [18,22,0]
    pg 1.1c is active+clean+inconsistent+failed_repair, acting [28,16,9]
    pg 1.1e is active+clean+inconsistent, acting [22,10,6]
    pg 1.26 is active+clean+inconsistent, acting [22,2,17]
    pg 1.35 is active+clean+inconsistent, acting [27,7,11]
    pg 1.37 is active+clean+inconsistent+failed_repair, acting [7,16,26]
    pg 1.3d is active+clean+inconsistent, acting [0,17,22]
    pg 5.47 is active+clean+inconsistent, acting [8,28,13]
    pg 5.90 is active+clean+inconsistent+failed_repair, acting [13,9,21]
    pg 5.a6 is active+clean+inconsistent, acting [20,19,8]
    pg 5.b0 is active+clean+inconsistent, acting [20,3,17]
    pg 5.b2 is active+clean+inconsistent, acting [24,11,9]
    pg 5.b3 is active+clean+inconsistent, acting [1,23,18]
    pg 5.d0 is active+clean+inconsistent, acting [27,4,14]
    pg 5.d2 is active+clean+inconsistent, acting [15,24,0]
    pg 11.5 is active+clean+inconsistent, acting [11,3,25]
    pg 11.17 is active+clean+inconsistent, acting [24,19,8]
    pg 16.0 is active+clean+inconsistent, acting [5,15,24]
    pg 16.2 is active+clean+inconsistent, acting [12,1,27]
    pg 16.15 is active+clean+inconsistent, acting [0,28,11]
    pg 16.17 is active+clean+inconsistent, acting [2,21,13]
    pg 16.1c is active+clean+inconsistent, acting [25,7,15]
    pg 16.25 is active+clean+inconsistent, acting [15,4,25]
    pg 16.2f is active+clean+inconsistent, acting [20,13,1]
    pg 16.38 is active+clean+inconsistent, acting [2,18,22]
    pg 16.3a is active+clean+inconsistent, acting [12,1,20]
    pg 16.3d is active+clean+inconsistent, acting [21,19,6]
    pg 16.3e is active+clean+inconsistent, acting [14,9,21]
    pg 16.3f is active+clean+inconsistent, acting [23,5,15]
[WRN] RECENT_CRASH: 2625 daemons have recently crashed
    client.admin crashed on host pve06 at 2021-09-30T05:08:19.213324Z
    mds.pve05 crashed on host pve05 at 2021-09-30T06:09:49.543530Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:22.059405Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:26.077956Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:30.117664Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:34.149385Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:37.607766Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:41.639585Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:45.684791Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:49.711284Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:53.757538Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:10:57.622000Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:11:01.798656Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:11:05.821116Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:11:09.860788Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:11:13.903719Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:11:17.630383Z
    mds.pve04 crashed on host pve04 at 2021-09-30T13:11:21.948918Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:25.979666Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:30.013149Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:34.044069Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:37.633660Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:41.664662Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:45.690034Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:49.735077Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:53.765387Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:11:57.655313Z
    mds.pve05 crashed on host pve05 at 2021-09-30T13:12:01.812882Z
    mds.pve06 crashed on host pve06 at 2021-09-30T13:12:05.838469Z
    mds.pve06 crashed on host pve06 at 2021-09-30T13:12:09.874958Z
    and 2595 more

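Judging by the timestamps, most of those 2625 crash entries seem to be the MDS daemons respawning every few seconds. If the individual backtraces are of interest, I assume they can be pulled from the crash module:

# ceph crash ls
# ceph crash info <crash-id>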

For now, I have stopped all three MDS daemons.


At the risk of making a fool of myself: how do I check what data is in a PG?
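
My naive approach would be something along these lines (the number before the dot in the PG id should be the pool id, and I believe recent rados versions can be pointed at a single PG with --pgid, please correct me if not):

# ceph osd pool ls detail                                 (map the pool ids 1, 5, 11, 16 to pool names)
# rados --pgid 1.1c ls                                    (list the objects stored in one of the inconsistent PGs)
# rados list-inconsistent-obj 1.1c --format=json-pretty   (show which objects the scrub actually flagged)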

I already made a backup at the very beginning using "cephfs-journal-tool journal export backup.bin"; of the data itself in Ceph there is only a limited backup.
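
Before trying anything destructive I am also considering a raw copy of the metadata pool itself as an extra safety net, roughly like this (the pool name here is a guess; it is whatever the CephFS metadata pool is actually called in this cluster):

# rados -p cephfs_metadata export metadata-backup.bin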

regards, volker.

________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: Sunday, 3 October 2021 10:39:20
To: von Hoesslin, Volker; ceph-users@xxxxxxx
Subject: Re: MDS: corrupted header/values: decode past end of struct encoding: Malformed input

On 10/1/21 14:07, von Hoesslin, Volker wrote:
> is there any chance to fix this? there are some "advanced metadata repair
> tools" (https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/)
> but i'm not really sure it is the right way to handle this issue?

Are you sure the PGs that are inconsistent don't have anything to do
with the MDS issues? What data is on those PGs?

> i have created a "backup" before any tries with this command:
>
> cephfs-journal-tool journal export backup.bin
>
> can i maybe delete the mds database and recreate it? is this possible?

The experts link you pasted shows how you can do this. But I would
consider this a last resort. Do you have backups?

Does an increased debug level for the MDS show any more clues (debug_mds
20/20)?

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



