I hit a very similar issue with my Ceph cluster that I had assumed was caused by faulty RAM, but the timing also matches my upgrade from 17.2.5 to 17.2.6. I took the same recovery steps, stopping at the online deep scrub, and everything appears to be working fine now.
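In case it helps, the recovery steps I am referring to roughly follow the generic CephFS journal recovery procedure from the upstream documentation, finished off with the online scrub. A minimal sketch is below; the file system name "myfs" and the backup path are placeholders, so adapt them to your cluster and double-check each command against the docs for your exact release before running anything destructive:

    # Placeholders: "myfs" = file system name; the backup path is only an example.
    # Take the ranks offline and back up the journal before touching anything.
    ceph fs fail myfs
    cephfs-journal-tool --rank=myfs:0 journal export /root/myfs-journal-backup.bin

    # Salvage what the journal still holds, then reset the journal and session table.
    cephfs-journal-tool --rank=myfs:0 event recover_dentries summary
    cephfs-journal-tool --rank=myfs:0 journal reset
    cephfs-table-tool myfs:all reset session

    # Allow the MDS to join again, then run the online recursive scrub/repair.
    ceph fs set myfs joinable true
    ceph tell mds.myfs:0 scrub start / recursive,repair
    ceph tell mds.myfs:0 scrub status

Afterwards it is probably worth checking "ceph fs status" and "ceph tell mds.myfs:0 damage ls" before letting clients back onto the file system.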
On Wed, Jun 14, 2023, 6:55 AM Ben Stöver <bstoever@xxxxxxxxxxx> wrote:

> Hi everyone,
>
> As discussed on this list before, we had an issue upgrading the metadata
> servers while performing an upgrade from 17.2.5 to 17.2.6. (See also
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/U3VEPCXYDYO2YSGF76CJLU25YOPEB3XU/#EVEP2MEMEI5HAXLYAXMHMWM6ZLJ2KUR6 .)
> We had to pause the upgrade, leaving us in a state where the MDS were
> running on the old version, while most other cluster components were
> already running the newer version. In the meantime we were able to solve
> this issue and also reduced the number of active MDS from 2 to 1.
>
> Today we resumed the upgrade and all components are on version 17.2.6
> now. Right after the upgrade a file system inconsistency (backtrace) was
> found, which we solved by scrubbing. Shortly after users came back onto
> the cluster and started using the CephFS again, all MDS failed and cannot
> be started again. The first MDS gets stuck in state "replay (laggy)",
> while both standby MDS crash right away ("Caught signal (Segmentation
> fault)").
>
> All MDS report multiple instances of the following problems in the log:
> - "replayed ESubtreeMap at xxx subtree root 0x1 not in cache"
> - "replayed ESubtreeMap at xxx subtree root 0x1 is not mine in cache (it's -2,-2)"
>
> As a result, we are currently not able to access our CephFS anymore.
> Does anyone have an idea how to solve this or what could be the cause?
> (I have found this ticket, which seems to be very similar, but there does
> not seem to be a solution mentioned there:
> https://bugzilla.redhat.com/show_bug.cgi?id=2056935 )
>
> Below are some parts of the MDS logs that seem relevant to us for this
> issue.
>
> We are thankful for any ideas. :-)
>
> Best
> Ben
>
>
> Log excerpt of Active MDS (replay):
>
> -140> 2023-06-14T07:51:59.585+0000 7feb588c0700 5 asok(0x55e56e23a000) register_command objecter_requests hook 0x55e56e1ca380
> -139> 2023-06-14T07:51:59.585+0000 7feb588c0700 10 monclient: _renew_subs
> -138> 2023-06-14T07:51:59.585+0000 7feb588c0700 10 monclient: _send_mon_message to mon.ceph-service-01 at v2:131.220.126.65:3300/0
> -137> 2023-06-14T07:51:59.585+0000 7feb588c0700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
> -136> 2023-06-14T07:51:59.585+0000 7feb588c0700 4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
> -135> 2023-06-14T07:51:59.585+0000 7feb588c0700 4 mds.0.0 apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
> -134> 2023-06-14T07:51:59.585+0000 7feb588c0700 1 mds.0.172004 handle_mds_map i am now mds.0.172004
> -133> 2023-06-14T07:51:59.585+0000 7feb588c0700 1 mds.0.172004 handle_mds_map state change up:standby --> up:replay
> -132> 2023-06-14T07:51:59.585+0000 7feb588c0700 5 mds.beacon.fdi-cephfs.ceph-service-01.gfwudy set_want_state: up:standby -> up:replay
> -131> 2023-06-14T07:51:59.585+0000 7feb588c0700 1 mds.0.172004 replay_start
> -130> 2023-06-14T07:51:59.585+0000 7feb588c0700 1 mds.0.172004 waiting for osdmap 341143 (which blocklists prior instance)
> -129> 2023-06-14T07:51:59.585+0000 7feb588c0700 10 monclient: _send_mon_message to mon.ceph-service-01 at v2:131.220.126.65:3300/0
> -128> 2023-06-14T07:51:59.585+0000 7feb548b8700 2 mds.0.cache Memory usage: total 300004, rss 34196, heap 182556, baseline 182556, 0 / 0 inodes have caps, 0 caps, 0 caps per inode
> -127> 2023-06-14T07:51:59.603+0000 7feb588c0700 10 monclient: _renew_subs
> -126> 2023-06-14T07:51:59.603+0000 7feb588c0700 10 monclient: _send_mon_message to mon.ceph-service-01 at v2:131.220.126.65:3300/0
> -125> 2023-06-14T07:51:59.603+0000 7feb588c0700 10 monclient: handle_get_version_reply finishing 1 version 341143
> -124> 2023-06-14T07:51:59.603+0000 7feb528b4700 2 mds.0.172004 Booting: 0: opening inotable
> -123> 2023-06-14T07:51:59.603+0000 7feb528b4700 2 mds.0.172004 Booting: 0: opening sessionmap
> -122> 2023-06-14T07:51:59.603+0000 7feb528b4700 2 mds.0.172004 Booting: 0: opening mds log
> -121> 2023-06-14T07:51:59.603+0000 7feb528b4700 5 mds.0.log open discovering log bounds
> -120> 2023-06-14T07:51:59.603+0000 7feb528b4700 2 mds.0.172004 Booting: 0: opening purge queue (async)
> -119> 2023-06-14T07:51:59.603+0000 7feb528b4700 4 mds.0.purge_queue open: opening
> -118> 2023-06-14T07:51:59.603+0000 7feb528b4700 1 mds.0.journaler.pq(ro) recover start
> -117> 2023-06-14T07:51:59.603+0000 7feb528b4700 1 mds.0.journaler.pq(ro) read_head
> -116> 2023-06-14T07:51:59.603+0000 7feb520b3700 4 mds.0.journalpointer Reading journal pointer '400.00000000'
> -115> 2023-06-14T07:51:59.603+0000 7feb528b4700 2 mds.0.172004 Booting: 0: loading open file table (async)
> -114> 2023-06-14T07:51:59.604+0000 7feb528b4700 2 mds.0.172004 Booting: 0: opening snap table
> -113> 2023-06-14T07:51:59.604+0000 7feb5c0c7700 10 monclient: get_auth_request con 0x55e56ef7e000 auth_method 0
> -112> 2023-06-14T07:51:59.604+0000 7feb5b8c6700 10 monclient: get_auth_request con 0x55e56ef7e800 auth_method 0
> -111> 2023-06-14T07:51:59.604+0000 7feb5b0c5700 10 monclient: get_auth_request con 0x55e56ef7f000 auth_method 0
> -110> 2023-06-14T07:51:59.604+0000 7feb5c0c7700 10 monclient: get_auth_request con 0x55e56ef92000 auth_method 0
> -109> 2023-06-14T07:51:59.605+0000 7feb538b6700 1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 18366857216, expire 18367220777, write 18367220777, stream_format 1). probing for end of log (from 18367220777)...
> -108> 2023-06-14T07:51:59.605+0000 7feb538b6700 1 mds.0.journaler.pq(ro) probing for end of the log
> -107> 2023-06-14T07:51:59.605+0000 7feb520b3700 1 mds.0.journaler.mdlog(ro) recover start
> -106> 2023-06-14T07:51:59.605+0000 7feb520b3700 1 mds.0.journaler.mdlog(ro) read_head
> -105> 2023-06-14T07:51:59.605+0000 7feb520b3700 4 mds.0.log Waiting for journal 0x200 to recover...
> -104> 2023-06-14T07:51:59.605+0000 7feb5b8c6700 10 monclient: get_auth_request con 0x55e56ef7f400 auth_method 0
> -103> 2023-06-14T07:51:59.605+0000 7feb5c0c7700 10 monclient: get_auth_request con 0x55e56ef92800 auth_method 0
> -102> 2023-06-14T07:51:59.605+0000 7feb5b0c5700 10 monclient: get_auth_request con 0x55e56ef7f800 auth_method 0
> -101> 2023-06-14T07:51:59.606+0000 7feb528b4700 1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 320109193199616, expire 320109196838948, write 320109612836052, stream_format 1). probing for end of log (from 320109612836052)...
> -100> 2023-06-14T07:51:59.606+0000 7feb528b4700 1 mds.0.journaler.mdlog(ro) probing for end of the log
> -99> 2023-06-14T07:51:59.606+0000 7feb5b0c5700 10 monclient: get_auth_request con 0x55e56f2ea800 auth_method 0
> -98> 2023-06-14T07:51:59.606+0000 7feb5b8c6700 10 monclient: get_auth_request con 0x55e56f2ea000 auth_method 0
> -97> 2023-06-14T07:51:59.606+0000 7feb538b6700 1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 18367220777 (header had 18367220777). recovered.
> -96> 2023-06-14T07:51:59.606+0000 7feb538b6700 4 mds.0.purge_queue operator(): open complete
> -95> 2023-06-14T07:51:59.606+0000 7feb538b6700 1 mds.0.journaler.pq(ro) set_writeable
> -94> 2023-06-14T07:51:59.607+0000 7feb528b4700 1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 320109613139120 (header had 320109612836052). recovered.
> -93> 2023-06-14T07:51:59.607+0000 7feb520b3700 4 mds.0.log Journal 0x200 recovered.
> -92> 2023-06-14T07:51:59.607+0000 7feb520b3700 4 mds.0.log Recovered journal 0x200 in format 1
> -91> 2023-06-14T07:51:59.607+0000 7feb520b3700 2 mds.0.172004 Booting: 1: loading/discovering base inodes
> -90> 2023-06-14T07:51:59.607+0000 7feb520b3700 0 mds.0.cache creating system inode with ino:0x100
> -89> 2023-06-14T07:51:59.607+0000 7feb520b3700 0 mds.0.cache creating system inode with ino:0x1
> -88> 2023-06-14T07:51:59.608+0000 7feb5c0c7700 10 monclient: get_auth_request con 0x55e56f2eb400 auth_method 0
> -87> 2023-06-14T07:51:59.615+0000 7feb528b4700 2 mds.0.172004 Booting: 2: replaying mds log
> -86> 2023-06-14T07:51:59.615+0000 7feb528b4700 2 mds.0.172004 Booting: 2: waiting for purge queue recovered
> -85> 2023-06-14T07:51:59.615+0000 7feb5b8c6700 10 monclient: get_auth_request con 0x55e56f1cec00 auth_method 0
> -84> 2023-06-14T07:51:59.615+0000 7feb5b0c5700 10 monclient: get_auth_request con 0x55e56f2eb800 auth_method 0
> -83> 2023-06-14T07:51:59.615+0000 7feb5c0c7700 10 monclient: get_auth_request con 0x55e56f1ce400 auth_method 0
> -82> 2023-06-14T07:51:59.615+0000 7feb5b8c6700 10 monclient: get_auth_request con 0x55e56ef7e400 auth_method 0
> -81> 2023-06-14T07:51:59.615+0000 7feb5b0c5700 10 monclient: get_auth_request con 0x55e56f1cf400 auth_method 0
> -80> 2023-06-14T07:51:59.634+0000 7feb510b1700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 320109197452549 subtree root 0x1 not in cache
> -79> 2023-06-14T07:51:59.634+0000 7feb510b1700 0 mds.0.journal journal subtrees: {0x1=[],0x100=[]}
> -78> 2023-06-14T07:51:59.634+0000 7feb510b1700 0 mds.0.journal journal ambig_subtrees:
> -77> 2023-06-14T07:51:59.637+0000 7feb510b1700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 320109198067908 subtree root 0x1 not in cache
> -76> 2023-06-14T07:51:59.637+0000 7feb510b1700 0 mds.0.journal journal subtrees: {0x1=[],0x100=[]}
> -75> 2023-06-14T07:51:59.637+0000 7feb510b1700 0 mds.0.journal journal ambig_subtrees:
> -74> 2023-06-14T07:51:59.640+0000 7feb510b1700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 320109198682422 subtree root 0x1 not in cache
> -73> 2023-06-14T07:51:59.640+0000 7feb510b1700 0 mds.0.journal journal subtrees: {0x1=[],0x100=[]}
>
>
> Log excerpt of standby MDS:
>
> -23> 2023-06-14T07:48:56.590+0000 7f14fcb97700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 320109205645870 subtree root 0x1 is not mine in cache (it's -2,-2)
> -22> 2023-06-14T07:48:56.590+0000 7f14fcb97700 0 mds.0.journal journal subtrees: {0x1=[],0x100=[]}
> -21> 2023-06-14T07:48:56.590+0000 7f14fcb97700 0 mds.0.journal journal ambig_subtrees:
> -20> 2023-06-14T07:48:56.592+0000 7f14fcb97700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 320109206260384 subtree root 0x1 is not mine in cache (it's -2,-2)
> -19> 2023-06-14T07:48:56.592+0000 7f14fcb97700 0 mds.0.journal journal subtrees: {0x1=[],0x100=[]}
> -18> 2023-06-14T07:48:56.592+0000 7f14fcb97700 0 mds.0.journal journal ambig_subtrees:
> -17> 2023-06-14T07:48:56.595+0000 7f14fcb97700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 320109206875743 subtree root 0x1 is not mine in cache (it's -2,-2)
> -16> 2023-06-14T07:48:56.595+0000 7f14fcb97700 0 mds.0.journal journal subtrees: {0x1=[],0x100=[]}
> -15> 2023-06-14T07:48:56.595+0000 7f14fcb97700 0 mds.0.journal journal ambig_subtrees:
> -14> 2023-06-14T07:48:56.598+0000 7f14fcb97700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 320109207491102 subtree root 0x1 is not mine in cache (it's -2,-2)
> -13> 2023-06-14T07:48:56.598+0000 7f14fcb97700 0 mds.0.journal journal subtrees: {0x1=[],0x100=[]}
> -12> 2023-06-14T07:48:56.598+0000 7f14fcb97700 0 mds.0.journal journal ambig_subtrees:
> -11> 2023-06-14T07:48:56.601+0000 7f14fcb97700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 320109208105616 subtree root 0x1 is not mine in cache (it's -2,-2)
> -10> 2023-06-14T07:48:56.601+0000 7f14fcb97700 0 mds.0.journal journal subtrees: {0x1=[],0x100=[]}
> -9> 2023-06-14T07:48:56.601+0000 7f14fcb97700 0 mds.0.journal journal ambig_subtrees:
> -8> 2023-06-14T07:48:56.638+0000 7f1507bad700 10 monclient: get_auth_request con 0x556935cb3400 auth_method 0
> -7> 2023-06-14T07:48:56.677+0000 7f15073ac700 10 monclient: get_auth_request con 0x556935c92c00 auth_method 0
> -6> 2023-06-14T07:48:56.687+0000 7f1506bab700 10 monclient: get_auth_request con 0x5569359f6400 auth_method 0
> -5> 2023-06-14T07:48:56.967+0000 7f1507bad700 10 monclient: get_auth_request con 0x556935cb2800 auth_method 0
> -4> 2023-06-14T07:48:57.008+0000 7f15073ac700 10 monclient: get_auth_request con 0x556935cb8400 auth_method 0
> -3> 2023-06-14T07:48:57.159+0000 7f1506bab700 10 monclient: get_auth_request con 0x556935cb3800 auth_method 0
> -2> 2023-06-14T07:48:57.169+0000 7f1507bad700 10 monclient: get_auth_request con 0x556935cb9000 auth_method 0
> -1> 2023-06-14T07:48:57.197+0000 7f15073ac700 10 monclient: get_auth_request con 0x5569359f7400 auth_method 0
> 0> 2023-06-14T07:48:57.317+0000 7f14fcb97700 -1 *** Caught signal (Segmentation fault) **
> in thread 7f14fcb97700 thread_name:md_log_replay

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx