Re: recovery from catastrophic mon and mds failure after reboot and ip address change

Florian Jonas <florian.jonas@xxxxxxx> · Tue, 28 Jun 2022 18:33:25 +0200

Dear all,

just when we received Eugen's message, we managed (with additional help 
via Zoom from other experts) to recover our filesystem. Thank you again 
for your help. I briefly document our solution here. The monitors were 
corrupted by repeated destruction and recreation, which destroyed the 
store.db of the monitors. The OSDs were intact. We followed the procedure 
here to recover the monitors from the store.db collected from the OSDs:

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds

However, we had made one mistake during one of the steps. For anyone 
reading this: make sure that the OSD services are not running before 
running the procedure. We then stopped all Ceph services and replaced 
the corrupted store.db on each node:

mv $extractedstoredb/store.db /var/lib/ceph/mon/mon.foo/store.db

chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db
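
For anyone repeating this step, the swap can be wrapped in a small shell function. This is only a sketch and not part of the documented procedure: the function name replace_mon_store, the ".corrupt.<timestamp>" backup suffix and the optional owner argument are assumptions made for illustration. It refuses to run while a ceph-mon or ceph-osd process is visible and keeps the old store.db instead of overwriting it.

```shell
#!/bin/sh
# Sketch only: swap a rebuilt store.db into a monitor's data directory.
# Usage: replace_mon_store <rebuilt_store.db> <mon_data_dir> [owner]
replace_mon_store() {
    src=$1
    mon_dir=$2
    owner=${3:-ceph:ceph}

    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    [ -d "$mon_dir" ] || { echo "no such directory: $mon_dir" >&2; return 1; }

    # Refuse to continue unless we can verify that the daemons are stopped.
    if ! command -v pgrep >/dev/null 2>&1; then
        echo "pgrep not found, cannot verify that the daemons are stopped" >&2
        return 1
    fi
    if pgrep -x ceph-mon >/dev/null || pgrep -x ceph-osd >/dev/null; then
        echo "ceph-mon or ceph-osd is still running, stop it first" >&2
        return 1
    fi

    # Keep the corrupted store for later inspection instead of deleting it.
    if [ -e "$mon_dir/store.db" ]; then
        mv "$mon_dir/store.db" \
           "$mon_dir/store.db.corrupt.$(date +%Y%m%d%H%M%S)" || return 1
    fi

    # Move the rebuilt store into place and hand it to the daemon's user.
    mv "$src" "$mon_dir/store.db" || return 1
    chown -R "$owner" "$mon_dir/store.db"
}
```

With the paths from above, the call would be: replace_mon_store "$extractedstoredb/store.db" /var/lib/ceph/mon/mon.foo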

We then started the monitors one by one and then started the OSD 
services again. At this stage the pools were visible again. We then 
roughly followed the guide here:

https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/

to restore the filesystem, while making sure that NO MDS is running. 
However, I think the exact commands depend on the Ceph version, so I 
would double-check the last step with an expert: as far as I 
understood, it can lead to erasure of files if the --recover flag is 
not properly implemented.
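
To make that last step easier to review with someone experienced before anything is executed, here is a small sketch that only prints the two commands described in the Quincy guide linked above and runs nothing. The function name and the file system and pool names are placeholders. The --recover flag is taken from that guide; whether a given release (for example 13.2.10) supports it is exactly what needs to be checked first.

```shell
#!/bin/sh
# Sketch only: print (do not run) the file system recreation commands from
# the Quincy "recover-fs-after-mon-store-loss" guide for manual review.
# Usage: print_fs_recover_cmds <fs_name> <metadata_pool> <data_pool>
print_fs_recover_cmds() {
    if [ "$#" -ne 3 ]; then
        echo "usage: print_fs_recover_cmds <fs_name> <metadata_pool> <data_pool>" >&2
        return 1
    fi
    # Per the guide, --recover marks rank 0 as existing but failed, so that
    # an MDS reads the metadata already in RADOS instead of overwriting it.
    echo "ceph fs new $1 $2 $3 --force --recover"
    # Only afterwards are MDS daemons allowed to join the file system again.
    echo "ceph fs set $1 joinable true"
}

# Placeholder names; substitute the real file system and pool names.
print_fs_recover_cmds cephfs cephfs_metadata cephfs_data
```

Printing instead of executing is deliberate: the output can be compared against the documentation for the installed release before a single command touches the cluster.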

Best regards,

Florian

On 28/06/2022 15:12, Eugen Block wrote:
I agree, having one MON out of quorum should not result in hanging 
ceph commands, maybe a little delay until all clients have noticed it. 
So the first question is, what happened there? Did you notice anything 
else that could disturb the cluster? Do you have the logs from the 
remaining two MONs and do they reveal anything? But this is just 
relevant for the analysis and maybe prevent something similar from 
happening in the future. Have you tried restarting the MGR after the 
OSDs came back up? If not, I would restart it (do you have a second 
MGR to be able to failover?) and then also restart a single OSD to see 
if anything changes in the cluster status. You're right about the MDS, 
of course. First you need the cephfs pools to be available again 
before the MDS can start its work.
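
A sketch of the suggested order (MGR first, then a single OSD, checking the status after each step) could look like this. It assumes a package-based, systemd-managed installation where the units are named ceph-mgr@<id> and ceph-osd@<id>; the ids dip01 and 6 are simply the ones that appear in this thread, and the DRY_RUN switch is a hypothetical helper so the commands can be printed before they are run.

```shell
#!/bin/sh
# Sketch only: restart the MGR, then one OSD, and look at the cluster
# status after each step. With DRY_RUN=1 commands are printed, not run.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_mgr_then_one_osd() {
    mgr_id=$1   # for example dip01, the active MGR in the status output
    osd_id=$2   # for example 6, any single OSD
    run systemctl restart "ceph-mgr@$mgr_id" || return 1
    run ceph status
    run systemctl restart "ceph-osd@$osd_id" || return 1
    run ceph status
}

# Print the commands first; unset DRY_RUN to actually execute them.
DRY_RUN=1
restart_mgr_then_one_osd dip01 6
```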

Quoting Florian Jonas <florian.jonas@xxxxxxx>:

Hi,

thanks a lot for getting back to me. I will try to clarify what 
happened and reconstruct the timeline. For context, our computing 
cluster is part of a bigger network infrastructure that is managed by 
someone else, and for the particular node running the MON and MDS we 
had not assigned a static IP address due to an oversight on our part. 
The cluster is run semi-professionally by me and a colleague and 
started as a small test but quickly grew in scale, so we are still 
somewhat beginners. The machine got stuck due to some unrelated issue 
and we had to reboot, and after reboot only this one address changed 
(last three digits).

After the reboot, the ceph status command was no longer working, 
which caused a bit of a panic. In principle, it should have still 
worked since the other two machines still should have had quorum. We 
quickly realized the IP address change and destroyed the monitor in 
question and re-created it after we had changed the mon ip in the 
ceph config. However, I think this was a mistake since in general the 
system was not in a good state (I assume due to the crashed MDS). In 
the rush to get things back online (second mistake), the other two 
monitors were also destroyed and re-created, even though their IP 
address did not change. At this point the ceph status command was 
still not available and just hanging.

We proceeded following the procedure outlined here:

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds 

in order to restore the monitors using the OSDs on each node. After 
following this procedure we managed to get all three monitors back 
online and they now all show a quorum. This is the current situation. 
I think this whole mess is a mix of unlucky circumstances and 
panicked incompetence on our part ...

By restarting the MDS, do you mean restarting the MDS service on the 
node in question? All three of them currently show up as "inactive", 
I think because no filesystem is recognized and they see no reason to 
become active. Regarding your question why the backup MDS did not 
start, I do not know. It is indeed strange!

Best regards,

Florian Jonas

On 28/06/2022 13:29, Eugen Block wrote:
Hi,

just to clarify, only one of the MONs had a different IP address 
(how and why, DHCP?), but you got it up again (since your cluster 
shows quorum). So the subnet didn't change, only the one address? 
Did you already try to restart the MDS? And what about the standby 
MDS, it could have taken over, couldn't it? The "0 in" OSDs could be 
a MGR issue, I'm not sure how that worked in Mimic. But they appear 
to be working, so it's not really clear yet what the actual problem 
is, but data loss is unlikely since the OSDs have not been wiped and 
they also load their PGs, it appears:

2022-06-24 09:16:44.527 7fdc165d5c00 0 osd.6 13035 load_pgs
2022-06-24 09:16:50.375 7fdc165d5c00  0 osd.6 13035 load_pgs opened 67 pgs
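
To run the same check across all OSD logs rather than reading them by eye, a small filter along these lines might help. It is only a sketch: the function name loaded_pgs is made up, and it assumes each log entry sits on a single line, as it does in the log file itself (the wrapping seen in mail is an artifact).

```shell
#!/bin/sh
# Sketch only: print "<osd> <pg count>" for every "load_pgs opened" entry.
# Reads OSD log text on stdin; assumes one log entry per line.
loaded_pgs() {
    awk '/load_pgs opened/ {
        for (i = 4; i < NF; i++)
            if ($i == "opened") print $(i - 3), $(i + 1)
    }'
}

# The entry quoted above, as a single unwrapped line; prints "osd.6 67".
echo '2022-06-24 09:16:50.375 7fdc165d5c00  0 osd.6 13035 load_pgs opened 67 pgs' | loaded_pgs
```

On a node with the default log location this could be fed with: cat /var/log/ceph/ceph-osd.*.log | loaded_pgs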

Quoting Florian Jonas <florian.jonas@xxxxxxx>:

Dear experts,

we have a small computing cluster with 21 OSDs, 3 monitors and 
3 MDS daemons running Ceph version 13.2.10 on Ubuntu 18.04. A few days 
ago we had an unexpected reboot of all machines, as well as a 
change of the IP address of one machine, which was hosting an MDS as 
well as a monitor. I am not exactly sure what played out during 
that night, but we lost quorum of all three monitors and no 
filesystem was visible anymore, so we are starting to get quite 
worried about data loss. We tried destroying and recreating the 
monitor whose IP address changed, but it did not help (and 
might in fact have been a mistake).

Long story short, we updated the changed IP address in the config 
and tried to recover the monitors using the information from the 
OSDs, following the procedure outlined here:

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds 

We are now in a situation where ceph status shows the following:

  cluster:
    id:     61fd9a61-89d6-4383-a2e6-ec4f4a13830f
    health: HEALTH_WARN
            43 slow ops, oldest one blocked for 57132 sec, daemons [mon.dip01,mon.pc078,mon.pc147] have slow ops.

  services:
    mon: 3 daemons, quorum pc147,pc078,dip01
    mgr: dip01(active)
    osd: 22 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0  objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

The monitors show a quorum (I think that's a good start), but we do 
not see any of the pools that were previously there and also no 
filesystem is visible. Running the command "ceph fs status" shows 
that all MDS daemons are in standby and no filesystem is active.
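
When watching for a change after each recovery attempt, the OSD counts can be pulled out of the status text with a one-line filter. This is only a sketch under the assumption that the "osd:" line keeps the layout shown above; the function name osd_counts is made up.

```shell
#!/bin/sh
# Sketch only: read "ceph status" text on stdin and print
# "<total> <up> <in>" taken from the "osd:" line of the services section.
osd_counts() {
    sed -n 's/^ *osd: *\([0-9][0-9]*\) osds: *\([0-9][0-9]*\) up[^,]*, *\([0-9][0-9]*\) in.*/\1 \2 \3/p'
}

# The line from the output above; prints "22 0 0".
echo '    osd: 22 osds: 0 up, 0 in' | osd_counts
```

On a live cluster this would be used as: ceph status | osd_counts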

I looked into the HEALTH_WARN by checking journalctl -xe on 
the monitor machines, where one finds errors of the type:

Jun 24 09:10:30 dip01 ceph-mon[69148]: 2022-06-24 09:10:30.978 7f0173e02700 -1 mon.dip01@2(peon) e15 get_health_metrics reporting 4 slow ops, oldest is osd_boot(osd.12 booted 0 features 4611087854031667195 v13031)
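
To see which OSDs are affected without scrolling through the journal, the osd ids can be extracted from these messages. The following is a sketch only; the function name stuck_osd_boots is made up, and it assumes one journal entry per line (the wrapping seen in mail is an artifact).

```shell
#!/bin/sh
# Sketch only: list the OSDs whose osd_boot message is reported as a slow op.
# Reads monitor log or journal text on stdin; assumes one entry per line.
stuck_osd_boots() {
    sed -n 's/.*osd_boot(\(osd\.[0-9][0-9]*\) .*/\1/p' | sort -u
}

# The message quoted above, shortened to the relevant part; prints "osd.12".
echo 'get_health_metrics reporting 4 slow ops, oldest is osd_boot(osd.12 booted 0 features 4611087854031667195 v13031)' | stuck_osd_boots
```

Assuming the monitor runs as the systemd unit ceph-mon@dip01, the input could come from: journalctl -u ceph-mon@dip01 | stuck_osd_boots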

In order to check what is going on with the osd_boot error, I 
checked the logs on the OSD machines and found warnings such as:

2022-06-24 09:16:42.383 7fdc165d5c00  0 <cls> /build/ceph-13.2.10/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
2022-06-24 09:16:42.383 7fdc165d5c00  0 _get_class not permitted to load kvs
2022-06-24 09:16:42.383 7fdc165d5c00  0 <cls> /build/ceph-13.2.10/src/cls/hello/cls_hello.cc:296: loading cls_hello
2022-06-24 09:16:42.383 7fdc165d5c00  0 _get_class not permitted to load lua
2022-06-24 09:16:42.387 7fdc165d5c00  0 _get_class not permitted to load sdk
2022-06-24 09:16:42.387 7fdc165d5c00  1 osd.6 13035 warning: got an error loading one or more classes: (1) Operation not permitted
2022-06-24 09:16:42.387 7fdc165d5c00  0 osd.6 13035 crush map has features 288514051259236352, adjusting msgr requires for clients
2022-06-24 09:16:42.387 7fdc165d5c00  0 osd.6 13035 crush map has features 288514051259236352 was 8705, adjusting msgr requires for mons
2022-06-24 09:16:42.387 7fdc165d5c00  0 osd.6 13035 crush map has features 1009089991638532096, adjusting msgr requires for osds
2022-06-24 09:16:42.387 7fdc165d5c00  1 osd.6 13035 check_osdmap_features require_osd_release 0 ->
2022-06-24 09:16:44.527 7fdc165d5c00  0 osd.6 13035 load_pgs
2022-06-24 09:16:50.375 7fdc165d5c00  0 osd.6 13035 load_pgs opened 67 pgs
2022-06-24 09:16:50.375 7fdc165d5c00  0 osd.6 13035 using weightedpriority op queue with priority op cut off at 64.
2022-06-24 09:16:50.375 7fdc165d5c00 -1 osd.6 13035 log_to_monitors {default=true}
2022-06-24 09:16:50.383 7fdc165d5c00  0 osd.6 13035 done with init, starting boot process
2022-06-24 09:16:50.383 7fdc165d5c00  1 osd.6 13035 start_boot
2022-06-24 09:16:50.495 7fdbec933700  1 osd.6 pg_epoch: 13035 pg[5.1( v 2785'2 (0'0,2785'2] local-lis/les=12997/12999 n=1 ec=2782/2782 lis/c 12997/12997 les/c/f 12999/12999/0 12997/12997/12954) [6,17,14] r=0 lpr=13021 crt=2785'2 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary

The 21 OSDs themselves show as "exists,new" in ceph osd status, 
even though they remained untouched during the whole incident 
(which I hope means they still contain all our data somewhere).

We only started operating our distributed filesystem about one year 
ago, and I must admit that with this problem we are a bit out of our 
depth, so we would very much appreciate any leads or help we can 
get on getting our filesystem up and running again. Alternatively, 
if all else fails, we would also appreciate any information about 
the possibility of recovering the data from the 21 OSDs, which 
amounts to over 60 TB.

Attached you will find our ceph.conf file, as well as the logs from 
one example monitor and one OSD node. If you need any other 
information, let us know.

Thank you in advance for your help, I know your time is valuable!

Best regards,

Florian Jonas

p.s. to the moderators: This message is a resubmit with smaller log 
files. I was not aware of the 1MB limit. The previously bounced 
message can be ignored!

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
