Hi,
thanks a lot for getting back to me. I will try to clarify what happened
and reconstruct the timeline. For context, our computing cluster is part
of a bigger network infrastructure that is managed by someone else, and
for the particular node running the MON and MDS we had not assigned a
static IP address due to an oversight on our part. The cluster is run
semi-professionally by me and a colleague and started as a small test
but quickly grew in scale, so we are still somewhat beginners. The
machine got stuck due to some unrelated issue and we had to reboot, and
after reboot only this one address changed (last three digits).
After the reboot, the ceph status command was no longer working, which
caused a bit of a panic. In principle, it should still have worked, since
the other two machines should still have had quorum. We quickly noticed
the IP address change, destroyed the monitor in question, and
re-created it after changing the mon IP in the Ceph config.
However, I think this was a mistake since in general the system was not
in a good state (I assume due to the crashed MDS). In the rush to get
things back online (second mistake), the other two monitors were also
destroyed and re-created, even though their IP address did not change.
At this point the ceph status command was still not available and just
hanging.
We then followed the procedure outlined here:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
in order to restore the monitors using the OSDs on each node. After
following this procedure we managed to get all three monitors back
online and they now all show a quorum. This is the current situation. I
think this whole mess is a mix of unlucky circumstances and panicked
incompetence on our part ...
By restarting the MDS, do you mean restarting the MDS service on the
node in question? All three of them currently show up as "inactive", I
think because no filesystem is recognized and they see no reason to
become active. Regarding your question why the backup MDS did not start,
I do not know. It is indeed strange!
Best regards,
Florian Jonas
On 28/06/2022 13:29, Eugen Block wrote:
Hi,
just to clarify, only one of the MONs had a different IP address (how
and why, DHCP?), but you got it up again (since your cluster shows
quorum). So the subnet didn't change, only the one address? Did you
already try to restart the MDS? And what about the standby MDS, it
could have taken over, couldn't it? The "0 in" OSDs could be a MGR
issue, I'm not sure how that worked in Mimic. But they appear to be
working, so it's not really clear yet what the actual problem is, but
data loss is unlikely since the OSDs have not been wiped and they also
load their PGs, it appears:
2022-06-24 09:16:44.527 7fdc165d5c00 0 osd.6 13035 load_pgs
2022-06-24 09:16:50.375 7fdc165d5c00 0 osd.6 13035 load_pgs opened
67 pgs
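The same check could be repeated for every OSD log rather than a single one. The following is only a minimal sketch: it assumes each log entry sits on one unwrapped line in the format quoted above, and the names LOAD_PGS and pgs_opened are made up for illustration.

```python
import re
from typing import Dict, Iterable

# Matches the "load_pgs opened N pgs" line an OSD prints during startup.
# Assumption: the entry is on a single line, as in the raw log file.
LOAD_PGS = re.compile(r"osd\.(\d+) \d+ load_pgs opened (\d+) pgs")


def pgs_opened(lines: Iterable[str]) -> Dict[int, int]:
    """Map OSD id to the number of PGs it reported opening."""
    result: Dict[int, int] = {}
    for line in lines:
        match = LOAD_PGS.search(line)
        if match:
            result[int(match.group(1))] = int(match.group(2))
    return result


# The two log entries quoted above, joined back onto single lines.
sample = [
    "2022-06-24 09:16:44.527 7fdc165d5c00 0 osd.6 13035 load_pgs",
    "2022-06-24 09:16:50.375 7fdc165d5c00 0 osd.6 13035 load_pgs opened 67 pgs",
]
print(pgs_opened(sample))
```

An OSD that is missing from the result, or that reports 0 PGs, would be the one to look at more closely.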
Quoting Florian Jonas <florian.jonas@xxxxxxx>:
Dear experts,
we have a small computing cluster with 21 OSDs, 3 monitors and
3 MDS running Ceph version 13.2.10 on Ubuntu 18.04. A few days ago
we had an unexpected reboot of all machines, as well as a change of
the IP address of one machine, which was hosting a MDS as well as a
monitor. I am not exactly sure what played out during that night, but
we lost quorum of all three monitors and no filesystem was visible
anymore, so we are starting to get quite worried about data loss. We
tried destroying and recreating the monitor whose IP address had
changed, but it did not help (and may in fact have been a mistake).
Long story short, we updated the changed IP address in the config
and tried to recover the monitors using the information from the
OSDs, following the procedure outlined here:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
We are now in a situation where ceph status shows the following:
  cluster:
    id:     61fd9a61-89d6-4383-a2e6-ec4f4a13830f
    health: HEALTH_WARN
            43 slow ops, oldest one blocked for 57132 sec, daemons [mon.dip01,mon.pc078,mon.pc147] have slow ops.

  services:
    mon: 3 daemons, quorum pc147,pc078,dip01
    mgr: dip01(active)
    osd: 22 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
The monitors show a quorum (I think that's a good start), but we do
not see any of the pools that were previously there and also no
filesystem is visible. Running the command "ceph fs status" shows all
MDS are in standby and no filesystem is activated.
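To watch whether the OSDs ever get marked up and in again, the "osd:" line of the status output can be checked from a script. This is just a rough sketch that assumes the plain-text layout shown above; OSD_LINE and osd_counts are illustrative names, not part of any Ceph tooling.

```python
import re
from typing import Optional, Tuple

# Matches the services line "osd: 22 osds: 0 up, 0 in" from `ceph status`.
# Assumption: plain-text output in the layout pasted above.
OSD_LINE = re.compile(r"osd:\s*(\d+) osds:\s*(\d+) up,\s*(\d+) in")


def osd_counts(status_text: str) -> Optional[Tuple[int, int, int]]:
    """Return (total, up, in) from the status text, or None if not found."""
    match = OSD_LINE.search(status_text)
    if match is None:
        return None
    total, up, in_ = (int(group) for group in match.groups())
    return total, up, in_


status = """
services:
mon: 3 daemons, quorum pc147,pc078,dip01
mgr: dip01(active)
osd: 22 osds: 0 up, 0 in
"""
counts = osd_counts(status)
print(counts)
if counts is not None and counts[1] == 0:
    print("no OSD is marked up")
```

In practice the text would come from running ceph status on a node with a working admin keyring; here it is pasted in so the sketch runs on its own.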
I looked into the HEALTH_WARN by checking journalctl -xe on the
monitor machines, where one finds errors of the type:
Jun 24 09:10:30 dip01 ceph-mon[69148]: 2022-06-24 09:10:30.978
7f0173e02700 -1 mon.dip01@2(peon) e15 get_health_metrics reporting 4
slow ops, oldest is osd_boot(osd.12 booted 0 features
4611087854031667195 v13031)
In order to check what is going on with the osd_boot error, I checked
the logs on the OSD machines and found warnings such as:
2022-06-24 09:16:42.383 7fdc165d5c00 0 <cls>
/build/ceph-13.2.10/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
2022-06-24 09:16:42.383 7fdc165d5c00 0 _get_class not permitted to
load kvs
2022-06-24 09:16:42.383 7fdc165d5c00 0 <cls>
/build/ceph-13.2.10/src/cls/hello/cls_hello.cc:296: loading cls_hello
2022-06-24 09:16:42.383 7fdc165d5c00 0 _get_class not permitted to
load lua
2022-06-24 09:16:42.387 7fdc165d5c00 0 _get_class not permitted to
load sdk
2022-06-24 09:16:42.387 7fdc165d5c00 1 osd.6 13035 warning: got an
error loading one or more classes: (1) Operation not permitted
2022-06-24 09:16:42.387 7fdc165d5c00 0 osd.6 13035 crush map has
features 288514051259236352, adjusting msgr requires for clients
2022-06-24 09:16:42.387 7fdc165d5c00 0 osd.6 13035 crush map has
features 288514051259236352 was 8705, adjusting msgr requires for mons
2022-06-24 09:16:42.387 7fdc165d5c00 0 osd.6 13035 crush map has
features 1009089991638532096, adjusting msgr requires for osds
2022-06-24 09:16:42.387 7fdc165d5c00 1 osd.6 13035
check_osdmap_features require_osd_release 0 ->
2022-06-24 09:16:44.527 7fdc165d5c00 0 osd.6 13035 load_pgs
2022-06-24 09:16:50.375 7fdc165d5c00 0 osd.6 13035 load_pgs opened
67 pgs
2022-06-24 09:16:50.375 7fdc165d5c00 0 osd.6 13035 using
weightedpriority op queue with priority op cut off at 64.
2022-06-24 09:16:50.375 7fdc165d5c00 -1 osd.6 13035 log_to_monitors
{default=true}
2022-06-24 09:16:50.383 7fdc165d5c00 0 osd.6 13035 done with init,
starting boot process
2022-06-24 09:16:50.383 7fdc165d5c00 1 osd.6 13035 start_boot
2022-06-24 09:16:50.495 7fdbec933700 1 osd.6 pg_epoch: 13035 pg[5.1(
v 2785'2 (0'0,2785'2] local-lis/les=12997/12999 n=1 ec=2782/2782
lis/c 12997/12997 les/c/f 12999/12999/0 12997/12997/12954) [6,17,14]
r=0 lpr=13021 crt=2785'2 lcod 0'0 mlcod 0'0 unknown mbc={}]
state<Start>: transitioning to Primary
The 21 OSDs themselves show as "exists,new" in ceph osd status, even
though they remained untouched during the whole incident (which I
hope means they still contain all our data somewhere).
We only started operating our distributed filesystem about one year
ago, and I must admit that with this problem we are a bit out of our
depth, so we would very much appreciate any leads or help on
getting our filesystem up and running again. Alternatively, if
all else fails, we would also appreciate any information about the
possibility of recovering the data from the 21 OSDs, which amounts to
over 60TB.
Attached you will find our ceph.conf file, as well as the logs from one
example monitor and one OSD node. If you need any other information,
let us know.
Thank you in advance for your help, I know your time is valuable!
Best regards,
Florian Jonas
p.s. to the moderators: This message is a resubmit with smaller log
files. I was not aware of the 1MB limit. The previously bounced
message can be ignored!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx