On Tue, May 18, 2021 at 4:00 PM Eugen Block <eblock@xxxxxx> wrote:
>
> Hi,
>
> sorry for not responding, our mail server was affected too; I got
> your response after we got our CephFS back online.

Glad to hear it's back online!

> > Do you have the mds log from the initial crash?
>
> I would need to take a closer look, but we're currently dealing with
> the affected clients to get everything back in order.

No rush, but this would be useful for analyzing what broke the fs initially.

> > Also, I don't see the new global_id warnings in your status output --
> > did you change any settings from the defaults during this upgrade?
>
> I definitely did deactivate that warning; it could indeed have been
> during the upgrade. Could that have caused the MDS damage? We have
> older clients and it may take some time to update them, so I decided
> to silence that warning. Was that a mistake? Maybe I just missed that
> information, but I didn't find any warnings for the update case. Do you
> have more information?

If you set these, it should be safe:

mon  advanced  mon_warn_on_insecure_global_id_reclaim          false
mon  advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false

But if you changed these other settings to false, it might have caused
the old mds to error:

auth_allow_insecure_global_id_reclaim
auth_expose_insecure_global_id_reclaim
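
For reference, setting and checking those via the config database would
look roughly like this (I'm assuming you applied them to the mon section;
adjust if you used global or per-daemon settings):

    # Silence only the two health warnings; this does not change auth behaviour:
    ceph config set mon mon_warn_on_insecure_global_id_reclaim false
    ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false

    # The auth_* options should not show up here at all if they are still
    # at their defaults:
    ceph config dump | grep global_id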

Cheers, Dan

>
> Thanks!
> Eugen
>
>
> Quoting Dan van der Ster <dan@xxxxxxxxxxxxxx>:
>
> > Hi,
> >
> > Do you have the mds log from the initial crash?
> >
> > Also, I don't see the new global_id warnings in your status output --
> > did you change any settings from the defaults during this upgrade?
> >
> > Cheers, Dan
> >
> > On Tue, May 18, 2021 at 10:22 AM Eugen Block <eblock@xxxxxx> wrote:
> >>
> >> Hi *,
> >>
> >> I tried a minor update (14.2.9 --> 14.2.20) on our ceph cluster today
> >> and ended up with a damaged CephFS. It's rather urgent since no one can
> >> really work right now, so any quick help is highly appreciated.
> >>
> >> As for the update process, I followed the usual procedure; when all
> >> MONs were finished I started to restart the OSDs, but suddenly our
> >> cephfs became unresponsive (and still is).
> >>
> >> I believe these lines are the critical ones:
> >>
> >> ---snap---
> >>    -12> 2021-05-18 09:53:01.488 7f7e9ed82700  5 mds.beacon.mds01
> >> received beacon reply up:replay seq 906 rtt 0
> >>    -11> 2021-05-18 09:53:01.624 7f7e9f583700 10 monclient:
> >> get_auth_request con 0x5608a5171600 auth_method 0
> >>    -10> 2021-05-18 09:53:03.732 7f7e94d6e700 -1
> >> mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
> >>     -9> 2021-05-18 09:53:03.732 7f7e94d6e700  0 mds.0.log _replay
> >> journaler got error -22, aborting
> >>     -8> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 log_channel(cluster)
> >> log [ERR] : Error loading MDS rank 0: (22) Invalid argument
> >>     -7> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01
> >> set_want_state: up:replay -> down:damaged
> >>     -6> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client log_queue
> >> is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
> >>     -5> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client will send
> >> 2021-05-18 09:53:03.735824 mds.mds01 (mds.0) 1 : cluster [ERR] Error
> >> loading MDS rank 0: (22) Invalid argument
> >>     -4> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient:
> >> _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
> >>     -3> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01
> >> Sending beacon down:damaged seq 907
> >>     -2> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient:
> >> _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
> >>     -1> 2021-05-18 09:53:03.908 7f7e9ed82700  5 mds.beacon.mds01
> >> received beacon reply down:damaged seq 907 rtt 0.176001
> >>      0> 2021-05-18 09:53:03.908 7f7e94d6e700  1 mds.mds01 respawn!
> >> ---snap---
> >>
> >> These logs are from the attempt to bring the mds rank back up with
> >>
> >>     ceph mds repaired 0
> >>
> >> I attached a longer excerpt of the log files in case it helps. Before
> >> trying anything from the disaster recovery steps I'd like to ask for
> >> your input, since one can damage the fs even more. The current status
> >> is below; please let me know if more information is required.
> >>
> >> Thanks!
> >> Eugen
> >>
> >>
> >> ceph01:~ # ceph -s
> >>   cluster:
> >>     id:     655cb05a-435a-41ba-83d9-8549f7c36167
> >>     health: HEALTH_ERR
> >>             1 filesystem is degraded
> >>             1 filesystem is offline
> >>             1 mds daemon damaged
> >>             noout flag(s) set
> >>             Some pool(s) have the nodeep-scrub flag(s) set
> >>
> >>   services:
> >>     mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 116m)
> >>     mgr: ceph03(active, since 118m), standbys: ceph02, ceph01
> >>     mds: cephfs:0/1 3 up:standby, 1 damaged
> >>     osd: 32 osds: 32 up (since 64m), 32 in (since 8w)
> >>          flags noout
> >>
> >>   data:
> >>     pools:   14 pools, 512 pgs
> >>     objects: 5.08M objects, 8.6 TiB
> >>     usage:   27 TiB used, 33 TiB / 59 TiB avail
> >>     pgs:     512 active+clean
> >>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
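
Related to the "journaler got error -22" replay failure above: the usual
read-only first steps before attempting any CephFS disaster recovery are
to inspect and back up the damaged rank's journal with cephfs-journal-tool.
A rough sketch (rank cephfs:0 matches the damaged rank in the status output):

    # Check the MDS journal of rank 0 for corruption without modifying anything:
    cephfs-journal-tool --rank=cephfs:0 journal inspect

    # Save a copy of the journal to a local file before running any recovery command:
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin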