On Tue, May 18, 2021 at 4:00 PM Eugen Block <eblock@xxxxxx> wrote:
>
> Hi,
>
> sorry for not responding, our mail server was affected too; I got
> your response after we got our CephFS back online.

Glad to hear it's back online!

> > Do you have the mds log from the initial crash?
>
> I would need to take a closer look, but we're currently dealing with
> the affected clients to get everything back in order.

No rush, but this would be useful for analyzing what broke the fs initially.

> > Also, I don't see the new global_id warnings in your status output --
> > did you change any settings from the defaults during this upgrade?
>
> I definitely did deactivate that warning; it could indeed have been
> during the upgrade. Could that have caused the MDS damage? We have
> older clients and it may take some time to update them, so I decided
> to silence that warning. Was that a mistake? Maybe I just missed that
> information, but I didn't find any warnings for the update case. Do you
> have more information?

If you set these, it should be safe:

mon  advanced  mon_warn_on_insecure_global_id_reclaim          false
mon  advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false

But if you changed these other settings to false, it might have caused
the old mds to error:

auth_allow_insecure_global_id_reclaim
auth_expose_insecure_global_id_reclaim
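
For reference, setting and checking those via the config database would
look roughly like this (I'm assuming you applied them to the mon section;
adjust if you used global or per-daemon settings):

    # Silence only the two health warnings; this does not change auth behaviour:
    ceph config set mon mon_warn_on_insecure_global_id_reclaim false
    ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false

    # The auth_* options should not show up here at all if they are still
    # at their defaults:
    ceph config dump | grep global_id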

Cheers, Dan

>
> Thanks!
> Eugen
>
>
> Quoting Dan van der Ster <dan@xxxxxxxxxxxxxx>:
>
> > Hi,
> >
> > Do you have the mds log from the initial crash?
> >
> > Also, I don't see the new global_id warnings in your status output --
> > did you change any settings from the defaults during this upgrade?
> >
> > Cheers, Dan
> >
> > On Tue, May 18, 2021 at 10:22 AM Eugen Block <eblock@xxxxxx> wrote:
> >>
> >> Hi *,
> >>
> >> I tried a minor update (14.2.9 --> 14.2.20) on our ceph cluster today
> >> and ended up with a damaged CephFS. It's rather urgent since no one can
> >> really work right now, so any quick help is highly appreciated.
> >>
> >> As for the update process, I followed the usual procedure; when all
> >> MONs were finished I started to restart the OSDs, but suddenly our
> >> cephfs became unresponsive (and still is).
> >>
> >> I believe these lines are the critical ones:
> >>
> >> ---snap---
> >>    -12> 2021-05-18 09:53:01.488 7f7e9ed82700  5 mds.beacon.mds01
> >> received beacon reply up:replay seq 906 rtt 0
> >>    -11> 2021-05-18 09:53:01.624 7f7e9f583700 10 monclient:
> >> get_auth_request con 0x5608a5171600 auth_method 0
> >>    -10> 2021-05-18 09:53:03.732 7f7e94d6e700 -1
> >> mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
> >>     -9> 2021-05-18 09:53:03.732 7f7e94d6e700  0 mds.0.log _replay
> >> journaler got error -22, aborting
> >>     -8> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 log_channel(cluster)
> >> log [ERR] : Error loading MDS rank 0: (22) Invalid argument
> >>     -7> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01
> >> set_want_state: up:replay -> down:damaged
> >>     -6> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client log_queue
> >> is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
> >>     -5> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client will send
> >> 2021-05-18 09:53:03.735824 mds.mds01 (mds.0) 1 : cluster [ERR] Error
> >> loading MDS rank 0: (22) Invalid argument
> >>     -4> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient:
> >> _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
> >>     -3> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01
> >> Sending beacon down:damaged seq 907
> >>     -2> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient:
> >> _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
> >>     -1> 2021-05-18 09:53:03.908 7f7e9ed82700  5 mds.beacon.mds01
> >> received beacon reply down:damaged seq 907 rtt 0.176001
> >>      0> 2021-05-18 09:53:03.908 7f7e94d6e700  1 mds.mds01 respawn!
> >> ---snap---
> >>
> >> These logs are from the attempt to bring the mds rank back up with
> >>
> >>     ceph mds repaired 0
> >>
> >> I attached a longer excerpt of the log files in case it helps. Before
> >> trying anything from the disaster recovery steps I'd like to ask for
> >> your input, since one can damage the fs even more. The current status
> >> is below; please let me know if more information is required.
> >>
> >> Thanks!
> >> Eugen
> >>
> >>
> >> ceph01:~ # ceph -s
> >>   cluster:
> >>     id:     655cb05a-435a-41ba-83d9-8549f7c36167
> >>     health: HEALTH_ERR
> >>             1 filesystem is degraded
> >>             1 filesystem is offline
> >>             1 mds daemon damaged
> >>             noout flag(s) set
> >>             Some pool(s) have the nodeep-scrub flag(s) set
> >>
> >>   services:
> >>     mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 116m)
> >>     mgr: ceph03(active, since 118m), standbys: ceph02, ceph01
> >>     mds: cephfs:0/1 3 up:standby, 1 damaged
> >>     osd: 32 osds: 32 up (since 64m), 32 in (since 8w)
> >>          flags noout
> >>
> >>   data:
> >>     pools:   14 pools, 512 pgs
> >>     objects: 5.08M objects, 8.6 TiB
> >>     usage:   27 TiB used, 33 TiB / 59 TiB avail
> >>     pgs:     512 active+clean
> >>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
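
Related to the "journaler got error -22" replay failure above: the usual
read-only first steps before attempting any CephFS disaster recovery are
to inspect and back up the damaged rank's journal with cephfs-journal-tool.
A rough sketch (rank cephfs:0 matches the damaged rank in the status output):

    # Check the MDS journal of rank 0 for corruption without modifying anything:
    cephfs-journal-tool --rank=cephfs:0 journal inspect

    # Save a copy of the journal to a local file before running any recovery command:
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin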