MDS: corrupted header/values: decode past end of struct encoding: Malformed input

"von Hoesslin, Volker" <Volker.Hoesslin@xxxxxxx> · Fri, 1 Oct 2021 05:07:54 +0000

Hi!

My CephFS is broken and I cannot recover the MDS daemons. Yesterday I updated my Ceph cluster from v15 to v16 and thought everything was working fine. The next day (today) some of my services went down and threw errors, so I dug into the problem and found that my CephFS is down: all MDS daemons are in standby mode, none is active, and they cannot be restarted successfully.

My current status is:

# ceph status
  cluster:
    id:     acd880fe-5f42-4930-8071-c4894c9b678e
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            11 scrub errors
            Possible data damage: 3 pgs inconsistent
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum pve04,pve05,pve06 (age 103m)
    mgr: pve04(active, since 107m), standbys: pve05, pve06
    mds: 0/1 daemons up, 3 standby
    osd: 30 osds: 30 up (since 103m), 30 in (since 8M)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   12 pools, 800 pgs
    objects: 483.49k objects, 1.8 TiB
    usage:   5.3 TiB used, 104 TiB / 109 TiB avail
    pgs:     797 active+clean
             3 active+clean+inconsistent+failed_repair

  io:
    client: 255 B/s rd, 229 KiB/s wr, 0 op/s rd, 17 op/s wr

I know there are also 3 inconsistent PGs, but that is another story. My next attempt was to mark the MDS as repaired:

# ceph mds repaired 0
repaired: restoring rank 1:0

The log output says something about "corrupt values"; see: https://pastebin.com/AePicagc

So I do not know which file is corrupted. ceph.conf?

The reported error, "corrupt sessionmap values: Corrupt entity name in sessionmap", is thrown by this code:

https://github.com/ceph/ceph/blob/master/src/mds/SessionMap.cc

There is also no "sessionmap" file on the hard drive: # find / -name '*.sessionmap' -> no results!
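
For reference, a minimal sketch of where the session map could be looked for instead. As far as I understand, it is not a file on the local disk but a RADOS object named mds0_sessionmap (for rank 0) in the CephFS metadata pool. The pool name cephfs_metadata and the helper name show_sessionmap are assumptions for illustration; `ceph fs ls` shows the real metadata pool name.

```shell
# Assumed pool name; replace with the metadata pool reported by `ceph fs ls`.
METADATA_POOL="${METADATA_POOL:-cephfs_metadata}"
# Session map object for MDS rank 0.
SESSIONMAP_OBJ="mds0_sessionmap"

# Hypothetical helper: read-only look at the session map object.
show_sessionmap() {
    # List the objects in the metadata pool whose name contains "sessionmap".
    rados -p "$METADATA_POOL" ls | grep sessionmap
    # Print the omap keys stored on the rank 0 session map object.
    rados -p "$METADATA_POOL" listomapkeys "$SESSIONMAP_OBJ"
}

# Only run on a host where the rados CLI is installed.
if command -v rados >/dev/null 2>&1; then
    show_sessionmap
fi
```

Both rados calls only read from the pool, so this should be safe to run while the file system is down.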

My next attempt was the harder way. So far I have tried this:

# systemctl stop ceph-mds@pve04.service
# cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
# cephfs-journal-tool --rank=cephfs:0 journal reset
# cephfs-table-tool all reset session
# systemctl start ceph-mds@pve04.service
# ceph mds repaired 0
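
A journal reset like the one above discards the journal, so it can be worth exporting a copy first. The following is only a sketch under assumptions: it reuses the file system name and rank from the commands above, and the backup path and the helper name backup_journal are placeholders.

```shell
# File system name and rank, as used in the commands above.
FS_RANK="cephfs:0"
# Hypothetical backup location; pick a path with enough free space.
BACKUP_FILE="${BACKUP_FILE:-/root/mds0-journal-backup.bin}"

# Hypothetical helper: inspect the journal, then export it to a local file.
backup_journal() {
    # Report the journal's integrity without modifying it.
    cephfs-journal-tool --rank="$FS_RANK" journal inspect
    # Write a copy of the journal to BACKUP_FILE.
    cephfs-journal-tool --rank="$FS_RANK" journal export "$BACKUP_FILE"
}

# Only run on a host where cephfs-journal-tool is installed.
if command -v cephfs-journal-tool >/dev/null 2>&1; then
    backup_journal
fi
```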

This is the log output after the reset and restart: https://pastebin.com/DBRq8iwM

Not the same, but similar errors... I'm also a little confused by the `ceph::buffer::v15_2_0::list` in the output; am I really running Ceph v16?!
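
One way to check which release is actually running, as a small sketch. My understanding (an assumption, not verified against the v16 source) is that v15_2_0 in that symbol is a C++ namespace name used for versioning the buffer code, not the release of the running daemon. The helper name show_versions is made up for illustration.

```shell
# Hypothetical helper: show installed and running Ceph versions.
show_versions() {
    # Version of the MDS binary installed on this host.
    ceph-mds --version
    # Versions reported by the running daemons, grouped by daemon type.
    ceph versions
}

# Only run on a host where both CLIs are installed.
if command -v ceph >/dev/null 2>&1 && command -v ceph-mds >/dev/null 2>&1; then
    show_versions
fi
```

If every daemon type in the `ceph versions` output reports a 16.x release, the upgrade itself completed on all daemons.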

My virtual environment runs on top of this Ceph cluster. Most of my VMs are still running, but for how long? I would be very happy for any support!

Regards, Volker.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx