Hi,
sorry for not responding sooner; our mail server was affected as well,
so I only saw your reply after we got our CephFS back online.
> Do you have the mds log from the initial crash?
I would need to take a closer look, but we're currently dealing with
the affected clients to get everything back in order.
> Also, I don't see the new global_id warnings in your status output --
> did you change any settings from the defaults during this upgrade?
I did indeed disable that warning, and it could well have been during
the upgrade. Could that have caused the MDS damage? We have older
clients that will take some time to update, so I decided to silence
the warning. Was that a mistake? Maybe I just missed it, but I didn't
find any warning about this for the upgrade case. Do you have more
information?
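For reference, I believe I silenced it with something like the
following (quoting from memory, so treat the exact commands as an
assumption on my part):

  # silence the warning about clients still using insecure global_id reclaim
  ceph config set mon mon_warn_on_insecure_global_id_reclaim false
  # silence the warning that insecure reclaim is still allowed at all
  ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false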
Thanks!
Eugen
Quoting Dan van der Ster <dan@xxxxxxxxxxxxxx>:
Hi,
Do you have the mds log from the initial crash?
Also, I don't see the new global_id warnings in your status output --
did you change any settings from the defaults during this upgrade?
Cheers, Dan
On Tue, May 18, 2021 at 10:22 AM Eugen Block <eblock@xxxxxx> wrote:
Hi *,
I tried a minor update (14.2.9 --> 14.2.20) on our ceph cluster today
and ended up with a damaged CephFS. It's rather urgent since no one
can really work right now, so any quick help is highly appreciated.
As for the update itself, I followed the usual procedure: once all
MONs were finished I started restarting the OSDs, but suddenly our
CephFS became unresponsive (and still is). The restart sequence I used
is roughly sketched below.
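This is only a sketch of what I ran (again from memory, so the exact
commands are an assumption), one node at a time, waiting for the
cluster to settle in between:

  ceph osd set noout                  # set once before the restarts
  systemctl restart ceph-mon.target   # on each MON node first
  systemctl restart ceph-osd.target   # then on each OSD node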
I believe these lines are the critical ones:
---snap---
   -12> 2021-05-18 09:53:01.488 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply up:replay seq 906 rtt 0
   -11> 2021-05-18 09:53:01.624 7f7e9f583700 10 monclient: get_auth_request con 0x5608a5171600 auth_method 0
   -10> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
    -9> 2021-05-18 09:53:03.732 7f7e94d6e700  0 mds.0.log _replay journaler got error -22, aborting
    -8> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
    -7> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01 set_want_state: up:replay -> down:damaged
    -6> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client log_queue is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
    -5> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client will send 2021-05-18 09:53:03.735824 mds.mds01 (mds.0) 1 : cluster [ERR] Error loading MDS rank 0: (22) Invalid argument
    -4> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
    -3> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01 Sending beacon down:damaged seq 907
    -2> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
    -1> 2021-05-18 09:53:03.908 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply down:damaged seq 907 rtt 0.176001
     0> 2021-05-18 09:53:03.908 7f7e94d6e700  1 mds.mds01 respawn!
---snap---
These logs are from the attempt to bring the MDS rank back up with:
  ceph mds repaired 0
I attached a longer excerpt of the log files in case it helps. Before
trying anything from the disaster recovery steps (the ones I'm
considering are sketched below) I'd like to ask for your input, since
those steps can make the damage even worse. The current status is
below; please let me know if more information is required.
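For completeness, these are the journal recovery steps I would
consider, taken from the CephFS disaster recovery documentation (I
have not run any of them yet, and the --rank=cephfs:0 syntax is my
assumption based on our single-rank filesystem named cephfs):

  # back up the journal before touching anything
  cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
  # check whether the journal itself is damaged
  cephfs-journal-tool --rank=cephfs:0 journal inspect
  # recover what can be recovered, then reset the journal
  cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
  cephfs-journal-tool --rank=cephfs:0 journal reset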
Thanks!
Eugen
ceph01:~ # ceph -s
  cluster:
    id:     655cb05a-435a-41ba-83d9-8549f7c36167
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            noout flag(s) set
            Some pool(s) have the nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 116m)
    mgr: ceph03(active, since 118m), standbys: ceph02, ceph01
    mds: cephfs:0/1 3 up:standby, 1 damaged
    osd: 32 osds: 32 up (since 64m), 32 in (since 8w)
         flags noout

  data:
    pools:   14 pools, 512 pgs
    objects: 5.08M objects, 8.6 TiB
    usage:   27 TiB used, 33 TiB / 59 TiB avail
    pgs:     512 active+clean
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx