Dear experts,
we have a small computing cluster with 21 OSDs, 3 monitors, and 3 MDS daemons
running ceph version 13.2.10 on Ubuntu 18.04. A few days ago we had an
unexpected reboot of all machines, as well as a change of the IP address of
one machine, which was hosting an MDS as well as a monitor. I am not exactly
sure what played out during that night, but we lost quorum of all three
monitors and no filesystem was visible anymore, so we are starting to get
quite worried about data loss. We tried destroying and recreating the monitor
whose IP address had changed, but it did not help (which, in hindsight, may
have been a mistake).
Long story short: we updated the changed IP address in the config and tried
to recover the monitors using the information from the OSDs, following the
procedure outlined here:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
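For reference, the commands we ran were essentially the ones from that page.
The sketch below is from memory, and the paths, monitor directory, and
keyring location are placeholders rather than our exact setup:

ms=/root/mon-store
mkdir -p $ms

# On every OSD host (with the OSD daemons stopped), pull the cluster map out
# of each local OSD; the $ms directory is carried from host to host (rsync)
# so the maps accumulate, as described in the linked procedure.
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd \
        --op update-mon-db --mon-store-path $ms
done

# Rebuild the monitor store from the collected maps (we have cephx disabled,
# so the keyring part of the procedure mattered less in our case).
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring

# Back up the old store.db and put the rebuilt one in place on the monitor.
mv /var/lib/ceph/mon/ceph-dip01/store.db /var/lib/ceph/mon/ceph-dip01/store.db.corrupted
cp -r $ms/store.db /var/lib/ceph/mon/ceph-dip01/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-dip01/store.db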
We are now in a situation where ceph status shows the following:
  cluster:
    id:     61fd9a61-89d6-4383-a2e6-ec4f4a13830f
    health: HEALTH_WARN
            43 slow ops, oldest one blocked for 57132 sec, daemons
            [mon.dip01,mon.pc078,mon.pc147] have slow ops.

  services:
    mon: 3 daemons, quorum pc147,pc078,dip01
    mgr: dip01(active)
    osd: 22 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
The monitors show a quorum (I think that's a good start), but we do not see
any of the pools that were previously there, and no filesystem is visible
either. Running "ceph fs status" shows all MDS daemons in standby and no
active filesystem.
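For completeness, these are the CLI checks behind that observation; they all
come back empty or standby-only:

ceph osd lspools   # no pools listed
ceph fs ls         # no filesystems reported
ceph fs status     # every MDS shown as standby, no ranks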
I looked into the HEALTH_WARN by checking journalctl -xe on the monitor
machines, and one finds errors of the type:
Jun 24 09:10:30 dip01 ceph-mon[69148]: 2022-06-24 09:10:30.978
7f0173e02700 -1 mon.dip01@2(peon) e15 get_health_metrics reporting 4
slow ops, oldest is osd_boot(osd.12 booted 0 features
4611087854031667195 v13031)
To check what is going on with the osd_boot messages, I looked at the logs
on the OSD machines and found warnings such as:
2022-06-24 09:16:42.383 7fdc165d5c00 0 <cls>
/build/ceph-13.2.10/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
2022-06-24 09:16:42.383 7fdc165d5c00 0 _get_class not permitted to load kvs
2022-06-24 09:16:42.383 7fdc165d5c00 0 <cls>
/build/ceph-13.2.10/src/cls/hello/cls_hello.cc:296: loading cls_hello
2022-06-24 09:16:42.383 7fdc165d5c00 0 _get_class not permitted to load lua
2022-06-24 09:16:42.387 7fdc165d5c00 0 _get_class not permitted to load sdk
2022-06-24 09:16:42.387 7fdc165d5c00 1 osd.6 13035 warning: got an
error loading one or more classes: (1) Operation not permitted
2022-06-24 09:16:42.387 7fdc165d5c00 0 osd.6 13035 crush map has
features 288514051259236352, adjusting msgr requires for clients
2022-06-24 09:16:42.387 7fdc165d5c00 0 osd.6 13035 crush map has
features 288514051259236352 was 8705, adjusting msgr requires for mons
2022-06-24 09:16:42.387 7fdc165d5c00 0 osd.6 13035 crush map has
features 1009089991638532096, adjusting msgr requires for osds
2022-06-24 09:16:42.387 7fdc165d5c00 1 osd.6 13035
check_osdmap_features require_osd_release 0 ->
2022-06-24 09:16:44.527 7fdc165d5c00 0 osd.6 13035 load_pgs
2022-06-24 09:16:50.375 7fdc165d5c00 0 osd.6 13035 load_pgs opened 67 pgs
2022-06-24 09:16:50.375 7fdc165d5c00 0 osd.6 13035 using
weightedpriority op queue with priority op cut off at 64.
2022-06-24 09:16:50.375 7fdc165d5c00 -1 osd.6 13035 log_to_monitors
{default=true}
2022-06-24 09:16:50.383 7fdc165d5c00 0 osd.6 13035 done with init,
starting boot process
2022-06-24 09:16:50.383 7fdc165d5c00 1 osd.6 13035 start_boot
2022-06-24 09:16:50.495 7fdbec933700 1 osd.6 pg_epoch: 13035 pg[5.1( v
2785'2 (0'0,2785'2] local-lis/les=12997/12999 n=1 ec=2782/2782 lis/c
12997/12997 les/c/f 12999/12999/0 12997/12997/12954) [6,17,14] r=0
lpr=13021 crt=2785'2 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>:
transitioning to Primary
The 21 OSDs themselves show as "exists,new" in ceph osd status, even though
they remained untouched during the whole incident (which I hope means they
still contain all of our data somewhere).
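In case it is relevant, this is how we would double-check that the placement
groups are still present in an individual OSD's store (with the daemon
stopped; osd.6 and its path are just an example):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --op list-pgs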
We only started operating our distributed filesystem about one year ago, and
I must admit that with this problem we are a bit out of our depth, so we
would very much appreciate any leads or help we can get on getting our
filesystem up and running again. Alternatively, if all else fails, we would
also appreciate any information about the possibility of recovering the data
from the 21 OSDs, which amounts to over 60 TB.
Attached you will find our ceph.conf file, as well as the logs from one
example monitor and one OSD node. If you need any other information, let us
know.
Thank you in advance for your help; I know your time is valuable!
Best regards,
Florian Jonas
p.s. to the moderators: This message is a resubmit with smaller log
files. I was not aware of the 1MB limit. The previously bounced message
can be ignored!

[global]
fsid = 61fd9a61-89d6-4383-a2e6-ec4f4a13830f
mon_initial_members = pc078, dip01, pc147
mon_host = XXX.XXX.XXX.XX,XXX.XXX.XXX.XXX,XXX.XXX.XXX.XX
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
public_network = XXX.XXX.XXX.X/XX
[mon]
mon_mds_skip_sanity = true