Hi,

Do you have the mds log from the initial crash?

Also, I don't see the new global_id warnings in your status output --
did you change any settings from the defaults during this upgrade?
(A read-only way to check this is sketched at the end of this message.)

Cheers, Dan

On Tue, May 18, 2021 at 10:22 AM Eugen Block <eblock@xxxxxx> wrote:
>
> Hi *,
>
> I tried a minor update (14.2.9 --> 14.2.20) on our ceph cluster today
> and ended up with a damaged CephFS. It's rather urgent since no one can
> really work right now, so any quick help is highly appreciated.
>
> As for the update process, I followed the usual procedure. When all
> MONs were finished I started to restart the OSDs, but suddenly our
> CephFS became unresponsive (and still is).
>
> I believe these lines are the critical ones:
>
> ---snap---
>    -12> 2021-05-18 09:53:01.488 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply up:replay seq 906 rtt 0
>    -11> 2021-05-18 09:53:01.624 7f7e9f583700 10 monclient: get_auth_request con 0x5608a5171600 auth_method 0
>    -10> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
>     -9> 2021-05-18 09:53:03.732 7f7e94d6e700  0 mds.0.log _replay journaler got error -22, aborting
>     -8> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
>     -7> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01 set_want_state: up:replay -> down:damaged
>     -6> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client log_queue is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
>     -5> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client will send 2021-05-18 09:53:03.735824 mds.mds01 (mds.0) 1 : cluster [ERR] Error loading MDS rank 0: (22) Invalid argument
>     -4> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
>     -3> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01 Sending beacon down:damaged seq 907
>     -2> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
>     -1> 2021-05-18 09:53:03.908 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply down:damaged seq 907 rtt 0.176001
>      0> 2021-05-18 09:53:03.908 7f7e94d6e700  1 mds.mds01 respawn!
> ---snap---
>
> These logs are from the attempt to bring the MDS rank back up with
>
> ceph mds repaired 0
>
> I attached a longer excerpt of the log files if it helps. Before
> trying anything from the disaster recovery steps I'd like to ask for
> your input, since one can damage it even more. The current status is
> below; please let me know if more information is required.
>
> Thanks!
> Eugen
>
>
> ceph01:~ # ceph -s
>   cluster:
>     id:     655cb05a-435a-41ba-83d9-8549f7c36167
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem is offline
>             1 mds daemon damaged
>             noout flag(s) set
>             Some pool(s) have the nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 116m)
>     mgr: ceph03(active, since 118m), standbys: ceph02, ceph01
>     mds: cephfs:0/1 3 up:standby, 1 damaged
>     osd: 32 osds: 32 up (since 64m), 32 in (since 8w)
>          flags noout
>
>   data:
>     pools:   14 pools, 512 pgs
>     objects: 5.08M objects, 8.6 TiB
>     usage:   27 TiB used, 33 TiB / 59 TiB avail
>     pgs:     512 active+clean
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
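
For reference, the global_id settings Dan is asking about can be
queried read-only. A minimal sketch, assuming 14.2.20 where these
options exist; all three ship with a default of "true", so a "false"
anywhere would explain the missing warnings:

  # do the mons still allow insecure global_id reclaim? (default: true)
  ceph config get mon auth_allow_insecure_global_id_reclaim
  # are the corresponding health warnings enabled? (default: true for both)
  ceph config get mon mon_warn_on_insecure_global_id_reclaim
  ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed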
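
And regarding "before trying anything from the disaster recovery
steps": the upstream disaster-recovery documentation starts with a
journal export, which is non-destructive, as is inspecting the
journal. A minimal sketch, assuming the filesystem is named "cephfs"
as in the status output above:

  # back up the rank 0 journal before any repair attempt
  cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
  # report the journal's integrity without modifying anything
  cephfs-journal-tool --rank=cephfs:0 journal inspect

The inspect output should point at the damaged region that caused the
"decode error from _is_readable" during replay.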