Re: recovery from catastrophic mon and mds failure after reboot and ip address change

Thanks for sharing, hope I never need the info, but glad to know it’s here
and doable!

On Tue, Jun 28, 2022 at 10:36 AM Florian Jonas <florian.jonas@xxxxxxx>
wrote:

> Dear all,
>
> Just as we received Eugen's message, we managed (with additional help
> via Zoom from other experts) to recover our filesystem. Thank you again
> for your help. I will briefly document our solution here. The monitors
> were corrupted by the repeated destruction and recreation, which
> destroyed the monitors' store.db. The OSDs were intact. We followed the
> procedure here to recover the monitors from a store.db collected from
> the OSDs:
>
>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
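>
> For anyone finding this later, the core of that documented procedure is
> roughly the following sketch (the paths are placeholders, and flags may
> differ slightly between releases, so please verify against the linked
> page for your own version):
>
> # with ALL ceph-osd services stopped, on each OSD node:
> for osd in /var/lib/ceph/osd/ceph-*; do
>     ceph-objectstore-tool --data-path $osd \
>         --op update-mon-db --mon-store-path /root/mon-store
> done
>
> # after collecting /root/mon-store from every node onto one machine,
> # rebuild the monitor store (the keyring needs mon and client.admin caps):
> ceph-monstore-tool /root/mon-store rebuild -- --keyring /path/to/admin.keyring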
>
> However, we had made one mistake during one of the steps. For anyone
> reading this: make sure that the OSD services are stopped before
> running the procedure. We then stopped all ceph services and replaced
> the corrupted store.db on each node:
>
> mv $extractedstoredb/store.db /var/lib/ceph/mon/mon.foo/store.db
>
> chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db
>
> We then started the monitors one by one and then started the OSD
> services again. At this stage the pools were visible again. We then
> roughly followed the guide here:
>
> https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/
>
> to restore the filesystem, while making sure that NO MDS is running.
> However, I think the exact commands depend on the Ceph version, so I
> would double-check the last step with an expert, since, as far as I
> understood, it can lead to erasure of files if the --recover flag is
> not properly implemented.
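>
> For reference, on recent releases the key steps of that guide look
> roughly like this (the filesystem and pool names below are placeholders
> for your existing pools; as said above, older releases may not support
> the --recover flag, so verify against the docs for your version first):
>
> # with NO MDS running, re-create the FSMap on top of the existing pools
> ceph fs new cephfs cephfs_metadata cephfs_data --force --recover
>
> # --recover keeps the filesystem non-joinable; once everything looks
> # sane, allow the MDS daemons to join again:
> ceph fs set cephfs joinable true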
>
> Best regards,
>
> Florian
>
>
>
> On 28/06/2022 15:12, Eugen Block wrote:
> > I agree, having one MON out of quorum should not result in hanging
> > ceph commands, at most a little delay until all clients have noticed
> > it. So the first question is, what happened there? Did you notice
> > anything else that could disturb the cluster? Do you have the logs
> > from the remaining two MONs, and do they reveal anything? This is
> > mainly relevant for the analysis and for preventing something similar
> > from happening in the future. Have you tried restarting the MGR after
> > the OSDs came back up? If not, I would restart it (do you have a
> > second MGR to be able to fail over?) and then also restart a single
> > OSD to see if anything changes in the cluster status. You're right
> > about the MDS, of course. First you need the cephfs pools to be
> > available again before the MDS can start its work.
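> >
> > Something along these lines, assuming plain systemd units on the
> > nodes (the daemon ids below are just examples taken from your status
> > output and logs, adjust them to your setup):
> >
> > # restart the active MGR (or fail over to a standby, if one exists)
> > systemctl restart ceph-mgr@dip01
> >
> > # restart a single OSD and watch whether it gets marked up/in
> > systemctl restart ceph-osd@6
> > ceph -s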
> >
> > Zitat von Florian Jonas <florian.jonas@xxxxxxx>:
> >
> >> Hi,
> >>
> >> thanks a lot for getting back to me. I will try to clarify what
> >> happened and reconstruct the timeline. For context, our computing
> >> cluster is part of a bigger network infrastructure that is managed by
> >> someone else, and for the particular node running the MON and MDS we
> >> had not assigned a static IP address due to an oversight on our part.
> >> The cluster is run semi-professionally by me and a colleague and
> >> started as a small test but quickly grew in scale, so we are still
> >> somewhat beginners. The machine got stuck due to an unrelated issue
> >> and we had to reboot it; after the reboot only this one address had
> >> changed (the last three digits).
> >>
> >> After the reboot, the ceph status command was no longer working,
> >> which caused a bit of a panic. In principle, it should still have
> >> worked, since the other two monitors should still have had quorum. We
> >> quickly noticed the IP address change, destroyed the monitor in
> >> question, and re-created it after changing the mon IP in the ceph
> >> config. However, I think this was a mistake, since the system was
> >> generally not in a good state (I assume due to the crashed MDS). In
> >> the rush to get things back online (second mistake), the other two
> >> monitors were also destroyed and re-created, even though their IP
> >> addresses did not change. At this point the ceph status command was
> >> still not available and just hanging.
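> >>
> >> In hindsight, for anyone reading along: as far as I understand, a
> >> single MON address change can be handled without destroying anything,
> >> roughly along these lines (mon id, new address, port and path are
> >> placeholders; check the monitor troubleshooting docs before trying
> >> this):
> >>
> >> # with the affected mon stopped, get the current monmap
> >> ceph mon getmap -o /tmp/monmap
> >> # (or, without quorum: ceph-mon -i <mon-id> --extract-monmap /tmp/monmap)
> >>
> >> # replace the old address and inject the edited map
> >> monmaptool --rm <mon-id> /tmp/monmap
> >> monmaptool --add <mon-id> <new-ip>:6789 /tmp/monmap
> >> ceph-mon -i <mon-id> --inject-monmap /tmp/monmap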
> >>
> >> We proceeded following the procedure outlined here:
> >>
> >>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
> >>
> >>
> >> in order to restore the monitors using the OSDs on each node. After
> >> following this procedure we managed to get all three monitors back
> >> online, and they are now all in quorum. This is the current situation.
> >> I think this whole mess is a mix of unlucky circumstances and
> >> panicked incompetence on our part ...
> >>
> >> By restarting the MDS, do you mean restarting the MDS service on the
> >> node in question? All three of them currently show up as "inactive",
> >> I think because no filesystem is recognized and they see no reason to
> >> become active. Regarding your question about why the backup MDS did
> >> not start, I do not know. It is indeed strange!
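> >>
> >> For completeness, the MDS and filesystem state can be cross-checked
> >> with the usual status commands, roughly:
> >>
> >> ceph fs ls       # lists filesystems and their pools
> >> ceph mds stat    # shows ranks and standby daemons
> >> ceph fs dump     # full FSMap (filesystem map) details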
> >>
> >> Best regards,
> >>
> >> Florian Jonas
> >>
> >>
> >> On 28/06/2022 13:29, Eugen Block wrote:
> >>> Hi,
> >>>
> >>> just to clarify, only one of the MONs had a different IP address
> >>> (how and why, DHCP?), but you got it up again (since your cluster
> >>> shows quorum). So the subnet didn't change, only the one address?
> >>> Did you already try to restart the MDS? And what about the standby
> >>> MDS, it could have taken over, couldn't it? The "0 in" OSDs could be
> >>> an MGR issue, I'm not sure how that worked in Mimic. But they appear
> >>> to be working, so it's not really clear yet what the actual problem
> >>> is. Data loss is unlikely, though, since the OSDs have not been
> >>> wiped and they apparently also load their PGs:
> >>>
> >>>> 2022-06-24 09:16:44.527 7fdc165d5c00 0 osd.6 13035 load_pgs
> >>>> 2022-06-24 09:16:50.375 7fdc165d5c00  0 osd.6 13035 load_pgs opened
> >>>> 67 pgs
> >>>
> >>>
> >>> Zitat von Florian Jonas <florian.jonas@xxxxxxx>:
> >>>
> >>>> Dear experts,
> >>>>
> >>>> we have a small computing cluster with 21 OSDs, 3 monitors and
> >>>> 3 MDS running on Ceph version 13.2.10 on Ubuntu 18.04. A few days
> >>>> ago we had an unexpected reboot of all machines, as well as a
> >>>> change of the IP address of one machine, which was hosting an MDS
> >>>> as well as a monitor. I am not exactly sure what played out during
> >>>> that night, but we lost quorum of all three monitors and no
> >>>> filesystem was visible anymore, so we are starting to get quite
> >>>> worried about data loss. We tried destroying and recreating the
> >>>> monitor whose IP address had changed, but it did not help (which,
> >>>> in hindsight, might have been a mistake).
> >>>>
> >>>> Long story short, we adapted the changed IP address in the config
> >>>> and tried to recover the monitors using the information from the
> >>>> OSDs, following the procedure outlined here:
> >>>>
> >>>>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
> >>>> We are now in a situation where ceph status shows the following:
> >>>>
> >>>>   cluster:
> >>>>     id:     61fd9a61-89d6-4383-a2e6-ec4f4a13830f
> >>>>     health: HEALTH_WARN
> >>>>             43 slow ops, oldest one blocked for 57132 sec, daemons
> >>>> [mon.dip01,mon.pc078,mon.pc147] have slow ops.
> >>>>
> >>>>   services:
> >>>>     mon: 3 daemons, quorum pc147,pc078,dip01
> >>>>     mgr: dip01(active)
> >>>>     osd: 22 osds: 0 up, 0 in
> >>>>
> >>>>   data:
> >>>>     pools:   0 pools, 0 pgs
> >>>>     objects: 0  objects, 0 B
> >>>>     usage:   0 B used, 0 B / 0 B avail
> >>>>     pgs:
> >>>>
> >>>> The monitors show a quorum (I think that's a good start), but we do
> >>>> not see any of the pools that were previously there, and no
> >>>> filesystem is visible either. Running the command "ceph fs status"
> >>>> shows that all MDS are in standby and no filesystem is active.
> >>>>
> >>>> I looked into the HEALTH_WARN by checking journalctl -xe on the
> >>>> monitor machines and found errors of the type:
> >>>>
> >>>> Jun 24 09:10:30 dip01 ceph-mon[69148]: 2022-06-24 09:10:30.978
> >>>> 7f0173e02700 -1 mon.dip01@2(peon) e15 get_health_metrics reporting
> >>>> 4 slow ops, oldest is osd_boot(osd.12 booted 0 features
> >>>> 4611087854031667195 v13031)
> >>>>
> >>>> In order to check what is going on with the osd_boot error, I
> >>>> checked the logs on the OSD machines and found warnings such as:
> >>>>
> >>>> 2022-06-24 09:16:42.383 7fdc165d5c00  0 <cls>
> >>>> /build/ceph-13.2.10/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
> >>>> 2022-06-24 09:16:42.383 7fdc165d5c00  0 _get_class not permitted to
> >>>> load kvs
> >>>> 2022-06-24 09:16:42.383 7fdc165d5c00  0 <cls>
> >>>> /build/ceph-13.2.10/src/cls/hello/cls_hello.cc:296: loading cls_hello
> >>>> 2022-06-24 09:16:42.383 7fdc165d5c00  0 _get_class not permitted to
> >>>> load lua
> >>>> 2022-06-24 09:16:42.387 7fdc165d5c00  0 _get_class not permitted to
> >>>> load sdk
> >>>> 2022-06-24 09:16:42.387 7fdc165d5c00  1 osd.6 13035 warning: got an
> >>>> error loading one or more classes: (1) Operation not permitted
> >>>> 2022-06-24 09:16:42.387 7fdc165d5c00  0 osd.6 13035 crush map has
> >>>> features 288514051259236352, adjusting msgr requires for clients
> >>>> 2022-06-24 09:16:42.387 7fdc165d5c00  0 osd.6 13035 crush map has
> >>>> features 288514051259236352 was 8705, adjusting msgr requires for mons
> >>>> 2022-06-24 09:16:42.387 7fdc165d5c00  0 osd.6 13035 crush map has
> >>>> features 1009089991638532096, adjusting msgr requires for osds
> >>>> 2022-06-24 09:16:42.387 7fdc165d5c00  1 osd.6 13035
> >>>> check_osdmap_features require_osd_release 0 ->
> >>>> 2022-06-24 09:16:44.527 7fdc165d5c00  0 osd.6 13035 load_pgs
> >>>> 2022-06-24 09:16:50.375 7fdc165d5c00  0 osd.6 13035 load_pgs opened
> >>>> 67 pgs
> >>>> 2022-06-24 09:16:50.375 7fdc165d5c00  0 osd.6 13035 using
> >>>> weightedpriority op queue with priority op cut off at 64.
> >>>> 2022-06-24 09:16:50.375 7fdc165d5c00 -1 osd.6 13035 log_to_monitors
> >>>> {default=true}
> >>>> 2022-06-24 09:16:50.383 7fdc165d5c00  0 osd.6 13035 done with init,
> >>>> starting boot process
> >>>> 2022-06-24 09:16:50.383 7fdc165d5c00  1 osd.6 13035 start_boot
> >>>> 2022-06-24 09:16:50.495 7fdbec933700  1 osd.6 pg_epoch: 13035
> >>>> pg[5.1( v 2785'2 (0'0,2785'2] local-lis/les=12997/12999 n=1
> >>>> ec=2782/2782 lis/c 12997/12997 les/c/f 12999/12999/0
> >>>> 12997/12997/12954) [6,17,14] r=0 lpr=13021 crt=2785'2 lcod 0'0
> >>>> mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> >>>>
> >>>> The 21 OSDs themselves show as "exists,new" in ceph osd status,
> >>>> even though they remained untouched during the whole incident
> >>>> (which I hope means they still contain all our data somewhere).
> >>>>
> >>>> We only started operating our distributed filesystem about one year
> >>>> ago, and I must admit that with this problem we are a bit out of
> >>>> our depth, so we would very much appreciate any leads/help we can
> >>>> get on getting our filesystem up and running again. Alternatively,
> >>>> if all else fails, we would also appreciate any information about
> >>>> the possibility of recovering the data from the 21 OSDs, which
> >>>> amounts to over 60 TB.
> >>>>
> >>>> Attached you will find our ceph.conf file, as well as the logs from
> >>>> one example monitor and one OSD node. If you need any other
> >>>> information, let us know.
> >>>>
> >>>> Thank you in advance for your help, I know your time is valuable!
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Florian Jonas
> >>>>
> >>>> p.s. to the moderators: This message is a resubmit with smaller log
> >>>> files. I was not aware of the 1MB limit. The previously bounced
> >>>> message can be ignored!
> >>>
> >>>
> >>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



