Re: MDS rank 0 damaged after update to 14.2.20

Hi,

I took a closer look at what happened yesterday during the update. I'll summarize it for documentation purposes; maybe it helps other users in the same situation (sorry for the long post):

The update process started at around 8 am.
The MONs/MGRs updated successfully. I restarted the OSD daemons on the first host (colocated with MONs) at 8:17 am, the second host at 8:19 and the third at 8:20, and all daemons came back up (the restart sequence itself was nothing special, see the aside below). But I didn't get to update the rest of the OSD servers (EC pool) or the MDS servers, because my user session became unresponsive (home directories are mounted via CephFS) shortly after I restarted the osd.target on the third host. The hang may have been caused by a couple of slow MDS requests, which eventually resolved, but my session never came back. Here's a short excerpt from the logs:

2021-05-18 08:35:51.498 7fcf3f399700 0 log_channel(cluster) log [WRN] : 5 slow requests, 0 included below; oldest blocked for > 975.015743 secs
2021-05-18 08:35:21.494 7fcf3f399700 0 log_channel(cluster) log [WRN] : slow request 124.603846 seconds old, received at 2021-05-18 08:33:16.896561: client_request(client.517094781:5126586689 getattr pAsLsXsFs #0x10000144138 2021-05-18 08:33:16.892492 caller_uid=<user_id>, caller_gid=100{}) currently failed to rdlock, waiting

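As an aside, the per-host OSD restart was just the usual rolling sequence with noout set for the maintenance window (a sketch, not the literal shell history):

ceph osd set noout
systemctl restart ceph-osd.target    # on each OSD host in turn, waiting for active+clean in between
ceph osd unset noout                 # after the last host, which I never got to

The noout flag is still visible in the ceph -s output quoted further down.
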
Then the MDS shut down without an obvious reason; at least I couldn't find any hint of the root cause. These are the last log lines before the respawn:

---snip---
2021-05-18 08:35:51.498 7fcf3f399700 0 log_channel(cluster) log [WRN] : 5 slow requests, 0 included below; oldest blocked for > 975.015743 secs
2021-05-18 08:35:52.790 7fcf4139d700 1 mds.mds01 Updating MDS map to version 27067 from mon.1
2021-05-18 08:35:54.158 7fcf4139d700 1 mds.mds01 Updating MDS map to version 27068 from mon.1
2021-05-18 08:35:55.934 7fcf41e26700 -1 received signal: Terminated from /usr/lib/systemd/systemd --switched-root --system --deserialize 24 (PID: 1) UID: 0
2021-05-18 08:35:55.934 7fcf41e26700 -1 mds.mds01 *** got signal Terminated ***
2021-05-18 08:35:55.934 7fcf41e26700 1 mds.mds01 suicide! Wanted state up:active
2021-05-18 08:35:56.534 7fcf41e26700 1 mds.0.24535 shutdown: shutting down rank 0
2021-05-18 08:35:56.534 7fcf4139d700 0 ms_deliver_dispatch: unhandled message 0x5600e9d4c100 client_request(client.945151521:16960146 setfilelock rule 1, type 4, owner 11936550059977471360, pid 31266, start 0, length 0, wait 0 #0x100000697f5 2021-05-18 08:35:56.101588 caller_uid=<user_id>, caller_gid=100{}) v2 from client.945151521 v1:<IP>:0/4172979547
2021-05-18 08:35:56.534 7fcf4139d700 0 ms_deliver_dispatch: unhandled message 0x5602a0be3c00 osd_map(296392..296392 src has 295891..296392) v4 from mon.1 v1:<IP>:6789/0
2021-05-18 08:35:57.778 7f130d3ab540  0 set uid:gid to 167:167 (ceph:ceph)
2021-05-18 08:35:57.778 7f130d3ab540 0 ceph version 14.2.9-899-gc02349c600 (c02349c60052aaa6c7bd0c2270c7f7be16fab632) nautilus (stable), process ceph-mds, pid 9937
2021-05-18 08:35:57.778 7f130d3ab540  0 pidfile_write: ignore empty --pid-file
2021-05-18 08:35:57.822 7f12fb6fa700 1 mds.mds01 Updating MDS map to version 27070 from mon.1
2021-05-18 08:35:59.214 7f12fb6fa700 1 mds.mds01 Updating MDS map to version 27071 from mon.1
2021-05-18 08:36:03.130 7f12fb6fa700 1 mds.mds01 Updating MDS map to version 27072 from mon.1
2021-05-18 08:36:03.130 7f12fb6fa700 1 mds.mds01 Monitors have assigned me to become a standby.
---snip---

These are the log lines from one of the MONs:

---snip---
2021-05-18 08:35:56.130088 mgr.ceph03 (mgr.969654526) 648 : cluster [DBG] pgmap v720: 512 pgs: 512 active+clean; 8.6 TiB data, 27 TiB used, 33 TiB / 59 TiB avail; 21 MiB/s rd, 55 MiB/s wr, 680 op/s
2021-05-18 08:35:56.415375 mon.ceph01 (mon.0) 2260 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2021-05-18 08:35:56.415416 mon.ceph01 (mon.0) 2261 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2021-05-18 08:35:56.415435 mon.ceph01 (mon.0) 2262 : cluster [WRN] Health check update: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-05-18 08:35:56.415450 mon.ceph01 (mon.0) 2263 : cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2021-05-18 08:35:56.550700 mon.ceph01 (mon.0) 2264 : cluster [DBG] mds.0 v1:<IP>:6800/3681851469 down:dne
2021-05-18 08:35:56.550751 mon.ceph01 (mon.0) 2265 : cluster [INF] Standby daemon mds.mds02 assigned to filesystem cephfs as rank 0
2021-05-18 08:35:56.550880 mon.ceph01 (mon.0) 2266 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
2021-05-18 08:35:56.581947 mon.ceph01 (mon.0) 2267 : cluster [DBG] fsmap cephfs:0/1 1 up:standby-replay 1 up:standby, 1 failed
2021-05-18 08:35:56.662170 mon.ceph01 (mon.0) 2268 : cluster [DBG] fsmap cephfs:1/1 {0=mds02=up:replay} 1 up:standby
2021-05-18 08:35:59.195292 mon.ceph01 (mon.0) 2269 : cluster [DBG] mds.0 v1:<IP>:6800/1315589958 up:replay
2021-05-18 08:35:59.195335 mon.ceph01 (mon.0) 2270 : cluster [DBG] fsmap cephfs:1/1 {0=mds02=up:replay} 1 up:standby
2021-05-18 08:36:03.105907 mon.ceph01 (mon.0) 2273 : cluster [DBG] mds.0 v1:<IP>:6800/1315589958 up:replay
2021-05-18 08:36:03.105941 mon.ceph01 (mon.0) 2274 : cluster [DBG] mds.? v1:<IP>:6800/357038325 up:boot
2021-05-18 08:36:03.105990 mon.ceph01 (mon.0) 2275 : cluster [DBG] fsmap cephfs:1/1 {0=mds02=up:replay} 2 up:standby
2021-05-18 08:36:06.161627 mon.ceph01 (mon.0) 2278 : cluster [DBG] mds.0 v1:<IP>:6800/1315589958 up:replay
2021-05-18 08:36:06.161668 mon.ceph01 (mon.0) 2279 : cluster [DBG] fsmap cephfs:1/1 {0=mds02=up:replay} 2 up:standby
---snip---

But the replay never finished. It took some time to get into a couple of SSH sessions without my home directory; then at 08:47 I tried to provoke a failover with

ceph mds fail 0
ceph mds repaired 0

without success:

---snip---
2021-05-18 08:48:33.678 7f12fb6fa700 1 mds.mds01 Updating MDS map to version 27218 from mon.1
2021-05-18 08:48:33.678 7f12fb6fa700 1 mds.0.27218 handle_mds_map i am now mds.0.27218
2021-05-18 08:48:33.678 7f12fb6fa700 1 mds.0.27218 handle_mds_map state change up:boot --> up:replay
2021-05-18 08:48:33.678 7f12fb6fa700  1 mds.0.27218 replay_start
2021-05-18 08:48:33.678 7f12fb6fa700  1 mds.0.27218  recovery set is
2021-05-18 08:48:33.678 7f12fb6fa700 1 mds.0.27218 waiting for osdmap 296394 (which blacklists prior instance)
2021-05-18 08:48:33.686 7f12f46ec700 0 mds.0.cache creating system inode with ino:0x100
2021-05-18 08:48:33.686 7f12f46ec700 0 mds.0.cache creating system inode with ino:0x1
2021-05-18 08:48:35.770 7f12f36ea700 -1 mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
2021-05-18 08:48:35.770 7f12f36ea700 0 mds.0.log _replay journaler got error -22, aborting
2021-05-18 08:48:35.774 7f12f36ea700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
2021-05-18 08:48:35.858 7f12f36ea700  1 mds.mds01 respawn!
---snip---

The journal was apparently corrupted, as indicated by this line:

2021-05-18 08:48:35.770 7f12f36ea700 -1 mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
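
(For anyone hitting this: the journal state can also be checked non-destructively before touching anything, e.g. with

cephfs-journal-tool --rank=cephfs:0 journal inspect

which in a situation like this should report the overall journal integrity as DAMAGED.)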

I tried a couple of harmless things, like setting max_mds to 2, restarting daemons, and finishing the update on the rest of the servers including the MDS daemons, but nothing really helped.
At 08:57 I silenced the global_id warnings:

ceph config set mon mon_warn_on_insecure_global_id_reclaim false
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false

But I don't believe it had any impact on the MDS failure.
In the end I dumped the journal as a backup and executed these commands:

---snip---
mds01:~ # systemctl stop ceph-mds@mds01.service

mds01:~ # cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
recover_dentries summary
Events by type:
  OPEN: 9499
  SESSION: 2
  SUBTREEMAP: 161
  UPDATE: 116903
Errors: 0

mds01:~ # cephfs-journal-tool --rank=cephfs:0 journal reset

mds01:~ # cephfs-table-tool all reset session

mds01:~ # systemctl start ceph-mds@mds01.service

mds01:~ # ceph mds repaired 0

mds01:~ # ceph daemon mds.mds01 scrub_path / recursive repair
---snip---
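
(The journal backup mentioned above isn't shown in the snippet; the usual way to dump it before a reset is the export subcommand, something like

cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

so there is at least a copy to fall back to if the recovery attempt makes things worse.)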

This brought the MDS back up, and the standby daemons also started successfully. The scrub didn't show any errors, but shortly after that the MDS reported damaged metadata, so I analysed the damage with

ceph tell mds.mds01 damage ls

Fortunately, those were only a couple of temporary directories from some compile jobs. The developers cleaned them up, but the error didn't go away. In a different mailing list thread a user reported that he had to restart the active MDS daemon, which I also did, and then the cluster finally got back into HEALTH_OK state.
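
(For reference: if restarting the active MDS hadn't cleared it, individual entries can apparently also be dropped from the damage table with something like

ceph tell mds.mds01 damage rm <damage_id>

using the IDs reported by "damage ls", once the affected paths are gone. The restart was enough in our case, so I didn't try that.)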

I'm not sure what the conclusion is, but to me it appears to be just a coincidence that the MDS failed during the update. But if you interpret it differently, I'd be interested to learn more.

Best regards,
Eugen


Quoting Dan van der Ster <dan@xxxxxxxxxxxxxx>:

On Tue, May 18, 2021 at 4:00 PM Eugen Block <eblock@xxxxxx> wrote:

Hi,

sorry for not responding, our mail server was affected, too, I got
your response after we got our CephFS back online.

Glad to hear it's back online!

> Do you have the mds log from the initial crash?

I would need to take a closer look but we're currently dealing with
the affected clients to get everything back in order.

No rush, but this would be useful to analyze what broke the fs initially.

> Also, I don't see the new global_id warnings in your status output --
> did you change any settings from the defaults during this upgrade?

I definitely did deactivate that warning, it could have indeed been
during the upgrade. Could that have caused the MDS damage? We have
older clients, it may take some time to update so I decided to silence
that warning. Was that a mistake? Maybe I just missed that information
but I didn't find any warnings for the update case. Do you have more
information?

If you set these, it should be safe:

  mon       advanced mon_warn_on_insecure_global_id_reclaim         false
  mon       advanced mon_warn_on_insecure_global_id_reclaim_allowed false

but if you changed these other settings to false, it might have caused the
old mds to error:

auth_allow_insecure_global_id_reclaim
auth_expose_insecure_global_id_reclaim

Cheers, Dan




Thanks!
Eugen


Quoting Dan van der Ster <dan@xxxxxxxxxxxxxx>:

> Hi,
>
> Do you have the mds log from the initial crash?
>
> Also, I don't see the new global_id warnings in your status output --
> did you change any settings from the defaults during this upgrade?
>
> Cheers, Dan
>
> On Tue, May 18, 2021 at 10:22 AM Eugen Block <eblock@xxxxxx> wrote:
>>
>> Hi *,
>>
>> I tried a minor update (14.2.9 --> 14.2.20) on our ceph cluster today
>> and got into a damaged CephFS. It's rather urgent since noone can
>> really work right now, so any quick help is highly appreciated.
>>
>> As for the update process I followed the usual update procedure, when
>> all MONs were finished I started to restart the OSDs, but suddenly our
>> cephfs got unresponsive (and still is).
>>
>> I believe these lines are the critical ones:
>>
>> ---snap---
>>     -12> 2021-05-18 09:53:01.488 7f7e9ed82700  5 mds.beacon.mds01
>> received beacon reply up:replay seq 906 rtt 0
>>     -11> 2021-05-18 09:53:01.624 7f7e9f583700 10 monclient:
>> get_auth_request con 0x5608a5171600 auth_method 0
>>     -10> 2021-05-18 09:53:03.732 7f7e94d6e700 -1
>> mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
>>      -9> 2021-05-18 09:53:03.732 7f7e94d6e700  0 mds.0.log _replay
>> journaler got error -22, aborting
>>      -8> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 log_channel(cluster)
>> log [ERR] : Error loading MDS rank 0: (22) Invalid argument
>>      -7> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01
>> set_want_state: up:replay -> down:damaged
>>      -6> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client  log_queue
>> is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
>>      -5> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client  will send
>> 2021-05-18 09:53:03.735824 mds.mds01 (mds.0) 1 : cluster [ERR] Error
>> loading MDS rank 0: (22) Invalid argument
>>      -4> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient:
>> _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
>>      -3> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01
>> Sending beacon down:damaged seq 907
>>      -2> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient:
>> _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
>>      -1> 2021-05-18 09:53:03.908 7f7e9ed82700  5 mds.beacon.mds01
>> received beacon reply down:damaged seq 907 rtt 0.176001
>>       0> 2021-05-18 09:53:03.908 7f7e94d6e700  1 mds.mds01 respawn!
>> ---snap---
>>
>> These logs are from the attempt to bring the mds rank back up with
>>
>> ceph mds repaired 0
>>
>> I attached a longer excerpt of the log files if it helps. Before
>> trying anything from the disaster recovery steps I'd like to ask for
>> your input since one can damage it even more. The current status is
>> below, please let me know if more information is required.
>>
>> Thanks!
>> Eugen
>>
>>
>> ceph01:~ # ceph -s
>>    cluster:
>>      id:     655cb05a-435a-41ba-83d9-8549f7c36167
>>      health: HEALTH_ERR
>>              1 filesystem is degraded
>>              1 filesystem is offline
>>              1 mds daemon damaged
>>              noout flag(s) set
>>              Some pool(s) have the nodeep-scrub flag(s) set
>>
>>    services:
>>      mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 116m)
>>      mgr: ceph03(active, since 118m), standbys: ceph02, ceph01
>>      mds: cephfs:0/1 3 up:standby, 1 damaged
>>      osd: 32 osds: 32 up (since 64m), 32 in (since 8w)
>>           flags noout
>>
>>    data:
>>      pools:   14 pools, 512 pgs
>>      objects: 5.08M objects, 8.6 TiB
>>      usage:   27 TiB used, 33 TiB / 59 TiB avail
>>      pgs:     512 active+clean
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx





_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


