Hi Thomas, On Thu, Jan 19, 2023 at 7:15 PM Thomas Widhalm <thomas.widhalm@xxxxxxxxxx> wrote: > > Hi, > > Unfortunately the workaround didn't work out: > > [ceph: root@ceph05 /]# ceph config show mds.mds01.ceph06.hsuhqd | grep > mds_wipe > mds_wipe_sessions true > > mon > [ceph: root@ceph05 /]# ceph config show mds.mds01.ceph04.cvdhsx | grep > mds_wipe > mds_wipe_sessions true > > mon > [ceph: root@ceph05 /]# ceph config show mds.mds01.ceph05.pqxmvt | grep > mds_wipe > mds_wipe_sessions true > > mon > [ceph: root@ceph05 /]# ceph tell mds.mds01.ceph05.pqxmvt flush journal > 2023-01-19T13:38:07.403+0000 7ff94e7fc700 0 client.61855055 > ms_handle_reset on v2:192.168.23.65:6800/957802673 > 2023-01-19T13:38:07.427+0000 7ff94e7fc700 0 client.61855061 > ms_handle_reset on v2:192.168.23.65:6800/957802673 > Error ENOSYS: > [ceph: root@ceph05 /]# ceph tell mds.mds01.ceph06.hsuhqd flush journal > 2023-01-19T13:38:34.694+0000 7f789effd700 0 client.61855142 > ms_handle_reset on v2:192.168.23.66:6810/2868317045 > 2023-01-19T13:38:34.728+0000 7f789effd700 0 client.61855148 > ms_handle_reset on v2:192.168.23.66:6810/2868317045 > { > "message": "", > "return_code": 0 > } > [ceph: root@ceph05 /]# ceph tell mds.mds01.ceph04.cvdhsx flush journal > 2023-01-19T13:38:46.402+0000 7fdee77fe700 0 client.61855172 > ms_handle_reset on v2:192.168.23.64:6800/1605877585 > 2023-01-19T13:38:46.435+0000 7fdee77fe700 0 client.61855178 > ms_handle_reset on v2:192.168.23.64:6800/1605877585 > { > "message": "", > "return_code": 0 > } The journal flush did work which is an indication that the mds made progress. > [ceph: root@ceph05 /]# ceph fs dump > e198622 > enable_multiple, ever_enabled_multiple: 1,1 > default compat: compat={},rocompat={},incompat={1=base v0.20,2=client > writeable ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no > anchor table,9=file layout v2,10=snaprealm v2} > legacy client fscid: 2 > > Filesystem 'cephfs' (2) > fs_name cephfs > epoch 198622 > flags 12 joinable allow_snaps allow_multimds_snaps > created 2023-01-14T14:30:05.723421+0000 > modified 2023-01-19T13:39:25.239395+0000 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > required_client_features {} > last_failure 0 > last_failure_osd_epoch 13541 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable > ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds > uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} > max_mds 1 > in 0 > up {0=61834171} > failed > damaged > stopped > data_pools [4] > metadata_pool 5 > inline_data disabled > balancer > standby_count_wanted 1 > [mds.mds01.ceph04.cvdhsx{0:61834171} state up:replay seq 240 addr > [v2:192.168.23.64:6800/1605877585,v1:192.168.23.64:6801/1605877585] > compat {c=[1],r=[1],i=[7ff]}] > > > Filesystem 'cephfs_insecure' (3) > fs_name cephfs_insecure > epoch 198621 > flags 12 joinable allow_snaps allow_multimds_snaps > created 2023-01-14T14:22:46.360062+0000 > modified 2023-01-19T13:39:22.799446+0000 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > required_client_features {} > last_failure 0 > last_failure_osd_epoch 13539 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable > ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds uses versioned encoding,6=dirfrag 
is stored in omap,7=mds > uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} > max_mds 1 > in 0 > up {0=61834120} > failed > damaged > stopped > data_pools [7] > metadata_pool 6 > inline_data disabled > balancer > standby_count_wanted 1 > [mds.mds01.ceph06.hsuhqd{0:61834120} state up:replay seq 241 addr > [v2:192.168.23.66:6810/2868317045,v1:192.168.23.66:6811/2868317045] > compat {c=[1],r=[1],i=[7ff]}] > > > Standby daemons: > > [mds.mds01.ceph05.pqxmvt{-1:61834887} state up:standby seq 1 addr > [v2:192.168.23.65:6800/957802673,v1:192.168.23.65:6801/957802673] compat > {c=[1],r=[1],i=[7ff]}] > dumped fsmap epoch 198622 Hmmm.. the MDSs are going through the replay state which is expected. Do you see the MDSs crashing again? > > On 19.01.23 14:01, Venky Shankar wrote: > > Hi Thomas, > > > > On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm > > <thomas.widhalm@xxxxxxxxxx> wrote: > >> > >> Another new thing that just happened: > >> > >> One of the MDS just crashed out of nowhere. > >> > >> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: > >> In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, > >> MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000 > >> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: > >> 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions) > >> > >> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy > >> (stable) > >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > >> const*)+0x135) [0x7fccd759943f] > >> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605] > >> 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) > >> [0x55fb2b98e89c] > >> 4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0] > >> 5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443] > >> 6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31] > >> 7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca] > >> 8: clone() > > > > To workaround this (for now) till the bug is fixed, set > > > > mds_wipe_sessions = true > > > > in ceph.conf, allow the MDS to transition to `active` state. Once > > done, flush the journal: > > > > ceph tell mds.<> flush journal > > > > then you can safely remove the config. > > > >> > >> > >> and > >> > >> > >> > >> *** Caught signal (Aborted) ** > >> in thread 7fccc7153700 thread_name:md_log_replay > >> > >> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy > >> (stable) > >> 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0] > >> 2: gsignal() > >> 3: abort() > >> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char > >> const*)+0x18f) [0x7fccd7599499] > >> 5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605] > >> 6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) > >> [0x55fb2b98e89c] > >> 7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0] > >> 8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443] > >> 9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31] > >> 10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca] > >> 11: clone() > >> NOTE: a copy of the executable, or `objdump -rdS <executable>` is > >> needed to interpret this. > >> > >> Is what I found in the logs. 
Since it's referring to log replaying, > >> could this be related to my issue? > >> > >> On 17.01.23 10:54, Thomas Widhalm wrote: > >>> Hi again, > >>> > >>> Another thing I found: Out of pure desperation, I started MDS on all > >>> nodes. I had them configured in the past so I was hoping, they could > >>> help with bringing in missing data even when they were down for quite a > >>> while now. I didn't see any changes in the logs but the CPU on the hosts > >>> that usually don't run MDS just spiked. So high I had to kill the MDS > >>> again because otherwise they kept killing OSD containers. So I don't > >>> really have any new information, but maybe that could be a hint of some > >>> kind? > >>> > >>> Cheers, > >>> Thomas > >>> > >>> On 17.01.23 10:13, Thomas Widhalm wrote: > >>>> Hi, > >>>> > >>>> Thanks again. :-) > >>>> > >>>> Ok, that seems like an error to me. I never configured an extra rank for > >>>> MDS. Maybe that's where my knowledge failed me but I guess, MDS is > >>>> waiting for something that was never there. > >>>> > >>>> Yes, there are two filesystems. Due to "budget restrictions" (it's my > >>>> personal system at home, I configured a second CephFS with only one > >>>> replica for data that could be easily restored. > >>>> > >>>> Here's what I got when turning up the debug level: > >>>> > >>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> Sending beacon up:replay seq 11107 > >>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> sender thread waiting interval 4s > >>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> received beacon reply up:replay seq 11107 rtt 0.00200002 > >>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > >>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 > >>>> schedule_update_timer_task > >>>> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > >>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 > >>>> schedule_update_timer_task > >>>> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:21 ceph05 
ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> Sending beacon up:replay seq 11108 > >>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> sender thread waiting interval 4s > >>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> received beacon reply up:replay seq 11108 rtt 0.00200002 > >>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > >>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 > >>>> schedule_update_timer_task > >>>> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > >>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 > >>>> schedule_update_timer_task > >>>> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> Sending beacon up:replay seq 11109 > >>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> sender thread waiting interval 4s > >>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > >>>> received beacon reply up:replay seq 11109 rtt 0.00600006 > >>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > >>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 > >>>> schedule_update_timer_task > >>>> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free memory > >>>> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: 
total > >>>> 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > >>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 > >>>> schedule_update_timer_task > >>>> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > >>>> 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have caps, > >>>> 0 caps, 0 caps per inode > >>>> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > >>>> trimming > >>>> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > >>>> interval 1.000000000s > >>>> > >>>> > >>>> The only thing that gives me hope here is that the line > >>>> mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109 is > >>>> chaning its sequence number. > >>>> > >>>> Anything else I can provide? > >>>> > >>>> Cheers, > >>>> Thomas > >>>> > >>>> On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote: > >>>>> Hi Thomas, > >>>>> > >>>>> Sorry, I misread the mds state to be stuck in 'up:resolve' state. The > >>>>> mds is stuck in 'up:replay' which means the MDS taking over a failed > >>>>> rank. > >>>>> This state represents that the MDS is recovering its journal and other > >>>>> metadata. > >>>>> > >>>>> I notice that there are two filesystems 'cephfs' and 'cephfs_insecure' > >>>>> and the active mds for both filesystems are stuck in 'up:replay'. The > >>>>> mds > >>>>> logs shared are not providing much information to infer anything. > >>>>> > >>>>> Could you please enable the debug logs and pass on the mds logs ? > >>>>> > >>>>> Thanks, > >>>>> Kotresh H R > >>>>> > >>>>> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm > >>>>> <thomas.widhalm@xxxxxxxxxx <mailto:thomas.widhalm@xxxxxxxxxx>> wrote: > >>>>> > >>>>> Hi Kotresh, > >>>>> > >>>>> Thanks for your reply! > >>>>> > >>>>> I only have one rank. 
Here's the output of all MDS I have: > >>>>> > >>>>> ################### > >>>>> > >>>>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status > >>>>> 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 > >>>>> ms_handle_reset on v2:192.168.23.65:6800/2680651694 > >>>>> <http://192.168.23.65:6800/2680651694> > >>>>> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 > >>>>> ms_handle_reset on v2:192.168.23.65:6800/2680651694 > >>>>> <http://192.168.23.65:6800/2680651694> > >>>>> { > >>>>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > >>>>> "whoami": 0, > >>>>> "id": 60984167, > >>>>> "want_state": "up:replay", > >>>>> "state": "up:replay", > >>>>> "fs_name": "cephfs", > >>>>> "replay_status": { > >>>>> "journal_read_pos": 0, > >>>>> "journal_write_pos": 0, > >>>>> "journal_expire_pos": 0, > >>>>> "num_events": 0, > >>>>> "num_segments": 0 > >>>>> }, > >>>>> "rank_uptime": 150224.982558844, > >>>>> "mdsmap_epoch": 143757, > >>>>> "osdmap_epoch": 12395, > >>>>> "osdmap_epoch_barrier": 0, > >>>>> "uptime": 150225.39968057699 > >>>>> } > >>>>> > >>>>> ######################## > >>>>> > >>>>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status > >>>>> 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 > >>>>> ms_handle_reset on v2:192.168.23.64:6800/3930607515 > >>>>> <http://192.168.23.64:6800/3930607515> > >>>>> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 > >>>>> ms_handle_reset on v2:192.168.23.64:6800/3930607515 > >>>>> <http://192.168.23.64:6800/3930607515> > >>>>> { > >>>>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > >>>>> "whoami": 0, > >>>>> "id": 60984134, > >>>>> "want_state": "up:replay", > >>>>> "state": "up:replay", > >>>>> "fs_name": "cephfs_insecure", > >>>>> "replay_status": { > >>>>> "journal_read_pos": 0, > >>>>> "journal_write_pos": 0, > >>>>> "journal_expire_pos": 0, > >>>>> "num_events": 0, > >>>>> "num_segments": 0 > >>>>> }, > >>>>> "rank_uptime": 150450.96934037199, > >>>>> "mdsmap_epoch": 143815, > >>>>> "osdmap_epoch": 12395, > >>>>> "osdmap_epoch_barrier": 0, > >>>>> "uptime": 150451.93533502301 > >>>>> } > >>>>> > >>>>> ########################### > >>>>> > >>>>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status > >>>>> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 > >>>>> resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom' > >>>>> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap: > >>>>> cephfs:1/1 cephfs_insecure:1/1 > >>>>> > >>>>> {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay} > >>>>> 2 up:standby > >>>>> Error ENOENT: problem getting command descriptions from > >>>>> mds.mds01.ceph06.wcfdom > >>>>> > >>>>> ############################ > >>>>> > >>>>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status > >>>>> 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 > >>>>> ms_handle_reset on v2:192.168.23.67:6800/942898192 > >>>>> <http://192.168.23.67:6800/942898192> > >>>>> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 > >>>>> ms_handle_reset on v2:192.168.23.67:6800/942898192 > >>>>> <http://192.168.23.67:6800/942898192> > >>>>> { > >>>>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > >>>>> "whoami": -1, > >>>>> "id": 60984161, > >>>>> "want_state": "up:standby", > >>>>> "state": "up:standby", > >>>>> "mdsmap_epoch": 97687, > >>>>> "osdmap_epoch": 0, > >>>>> "osdmap_epoch_barrier": 0, > >>>>> "uptime": 150508.29091721401 > >>>>> } > 
>>>>> > >>>>> The error message from ceph06 is new to me. That didn't happen the > >>>>> last > >>>>> times. > >>>>> > >>>>> [ceph: root@ceph06 /]# ceph fs dump > >>>>> e143850 > >>>>> enable_multiple, ever_enabled_multiple: 1,1 > >>>>> default compat: compat={},rocompat={},incompat={1=base > >>>>> v0.20,2=client > >>>>> writeable ranges,3=default file layouts on dirs,4=dir inode in > >>>>> separate > >>>>> object,5=mds uses versioned encoding,6=dirfrag is stored in > >>>>> omap,8=no > >>>>> anchor table,9=file layout v2,10=snaprealm v2} > >>>>> legacy client fscid: 2 > >>>>> > >>>>> Filesystem 'cephfs' (2) > >>>>> fs_name cephfs > >>>>> epoch 143850 > >>>>> flags 12 joinable allow_snaps allow_multimds_snaps > >>>>> created 2023-01-14T14:30:05.723421+0000 > >>>>> modified 2023-01-16T09:00:53.663007+0000 > >>>>> tableserver 0 > >>>>> root 0 > >>>>> session_timeout 60 > >>>>> session_autoclose 300 > >>>>> max_file_size 1099511627776 > >>>>> required_client_features {} > >>>>> last_failure 0 > >>>>> last_failure_osd_epoch 12321 > >>>>> compat compat={},rocompat={},incompat={1=base v0.20,2=client > >>>>> writeable > >>>>> ranges,3=default file layouts on dirs,4=dir inode in separate > >>>>> object,5=mds uses versioned encoding,6=dirfrag is stored in > >>>>> omap,7=mds > >>>>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} > >>>>> max_mds 1 > >>>>> in 0 > >>>>> up {0=60984167} > >>>>> failed > >>>>> damaged > >>>>> stopped > >>>>> data_pools [4] > >>>>> metadata_pool 5 > >>>>> inline_data disabled > >>>>> balancer > >>>>> standby_count_wanted 1 > >>>>> [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr > >>>>> [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694 > >>>>> > >>>>> <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>] > >>>>> compat {c=[1],r=[1],i=[7ff]}] > >>>>> > >>>>> > >>>>> Filesystem 'cephfs_insecure' (3) > >>>>> fs_name cephfs_insecure > >>>>> epoch 143849 > >>>>> flags 12 joinable allow_snaps allow_multimds_snaps > >>>>> created 2023-01-14T14:22:46.360062+0000 > >>>>> modified 2023-01-16T09:00:52.632163+0000 > >>>>> tableserver 0 > >>>>> root 0 > >>>>> session_timeout 60 > >>>>> session_autoclose 300 > >>>>> max_file_size 1099511627776 > >>>>> required_client_features {} > >>>>> last_failure 0 > >>>>> last_failure_osd_epoch 12319 > >>>>> compat compat={},rocompat={},incompat={1=base v0.20,2=client > >>>>> writeable > >>>>> ranges,3=default file layouts on dirs,4=dir inode in separate > >>>>> object,5=mds uses versioned encoding,6=dirfrag is stored in > >>>>> omap,7=mds > >>>>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} > >>>>> max_mds 1 > >>>>> in 0 > >>>>> up {0=60984134} > >>>>> failed > >>>>> damaged > >>>>> stopped > >>>>> data_pools [7] > >>>>> metadata_pool 6 > >>>>> inline_data disabled > >>>>> balancer > >>>>> standby_count_wanted 1 > >>>>> [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr > >>>>> [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515 > >>>>> > >>>>> <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>] > >>>>> compat {c=[1],r=[1],i=[7ff]}] > >>>>> > >>>>> > >>>>> Standby daemons: > >>>>> > >>>>> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr > >>>>> [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 > >>>>> > >>>>> <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>] > >>>>> compat > >>>>> {c=[1],r=[1],i=[7ff]}] > >>>>> 
[mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr > >>>>> [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518 > >>>>> > >>>>> <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>] > >>>>> compat {c=[1],r=[1],i=[7ff]}] > >>>>> dumped fsmap epoch 143850 > >>>>> > >>>>> ############################# > >>>>> > >>>>> [ceph: root@ceph06 /]# ceph fs status > >>>>> > >>>>> (doesn't come back) > >>>>> > >>>>> ############################# > >>>>> > >>>>> All MDS show log lines similar to this one: > >>>>> > >>>>> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143927 from mon.1 > >>>>> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143929 from mon.1 > >>>>> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143930 from mon.1 > >>>>> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143931 from mon.1 > >>>>> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143933 from mon.1 > >>>>> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143935 from mon.1 > >>>>> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143936 from mon.1 > >>>>> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143937 from mon.1 > >>>>> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143939 from mon.1 > >>>>> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143941 from mon.1 > >>>>> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > >>>>> Updating > >>>>> MDS map to version 143942 from mon.1 > >>>>> > >>>>> Anything else, I can provide? > >>>>> > >>>>> Cheers and thanks again! > >>>>> Thomas > >>>>> > >>>>> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote: > >>>>> > Hi Thomas, > >>>>> > > >>>>> > As the documentation says, the MDS enters up:resolve from > >>>>> |up:replay| if > >>>>> > the Ceph file system has multiple ranks (including this one), > >>>>> i.e. it’s > >>>>> > not a single active MDS cluster. > >>>>> > The MDS is resolving any uncommitted inter-MDS operations. All > >>>>> ranks in > >>>>> > the file system must be in this state or later for progress to be > >>>>> made, > >>>>> > i.e. no rank can be failed/damaged or |up:replay|. > >>>>> > > >>>>> > So please check the status of the other active mds if it's > >>>>> failed. > >>>>> > > >>>>> > Also please share the mds logs and the output of 'ceph fs dump' > >>>>> and > >>>>> > 'ceph fs status' > >>>>> > > >>>>> > Thanks, > >>>>> > Kotresh H R > >>>>> > > >>>>> > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm > >>>>> > <thomas.widhalm@xxxxxxxxxx <mailto:thomas.widhalm@xxxxxxxxxx> > >>>>> <mailto:thomas.widhalm@xxxxxxxxxx > >>>>> <mailto:thomas.widhalm@xxxxxxxxxx>>> wrote: > >>>>> > > >>>>> > Hi, > >>>>> > > >>>>> > I'm really lost with my Ceph system. I built a small cluster > >>>>> for home > >>>>> > usage which has two uses for me: I want to replace an old NAS > >>>>> and I want > >>>>> > to learn about Ceph so that I have hands-on experience. 
We're > >>>>> using it > >>>>> > in our company but I need some real-life experience without > >>>>> risking any > >>>>> > company or customers data. That's my preferred way of > >>>>> learning. > >>>>> > > >>>>> > The cluster consists of 3 Raspberry Pis plus a few VMs > >>>>> running on > >>>>> > Proxmox. I'm not using Proxmox' built in Ceph because I want > >>>>> to focus on > >>>>> > Ceph and not just use it as a preconfigured tool. > >>>>> > > >>>>> > All hosts are running Fedora (x86_64 and arm64) and during an > >>>>> Upgrade > >>>>> > from F36 to F37 my cluster suddenly showed all PGs as > >>>>> unavailable. I > >>>>> > worked nearly a week to get it back online and I learned a > >>>>> lot about > >>>>> > Ceph management and recovery. The cluster is back but I still > >>>>> can't > >>>>> > access my data. Maybe you can help me? > >>>>> > > >>>>> > Here are my versions: > >>>>> > > >>>>> > [ceph: root@ceph04 /]# ceph versions > >>>>> > { > >>>>> > "mon": { > >>>>> > "ceph version 17.2.5 > >>>>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) > >>>>> > quincy (stable)": 3 > >>>>> > }, > >>>>> > "mgr": { > >>>>> > "ceph version 17.2.5 > >>>>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) > >>>>> > quincy (stable)": 3 > >>>>> > }, > >>>>> > "osd": { > >>>>> > "ceph version 17.2.5 > >>>>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) > >>>>> > quincy (stable)": 5 > >>>>> > }, > >>>>> > "mds": { > >>>>> > "ceph version 17.2.5 > >>>>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) > >>>>> > quincy (stable)": 4 > >>>>> > }, > >>>>> > "overall": { > >>>>> > "ceph version 17.2.5 > >>>>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) > >>>>> > quincy (stable)": 15 > >>>>> > } > >>>>> > } > >>>>> > > >>>>> > > >>>>> > Here's MDS status output of one MDS: > >>>>> > [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt > >>>>> status > >>>>> > 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 > >>>>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 > >>>>> <http://192.168.23.65:6800/2680651694> > >>>>> > <http://192.168.23.65:6800/2680651694 > >>>>> <http://192.168.23.65:6800/2680651694>> > >>>>> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 > >>>>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 > >>>>> <http://192.168.23.65:6800/2680651694> > >>>>> > <http://192.168.23.65:6800/2680651694 > >>>>> <http://192.168.23.65:6800/2680651694>> > >>>>> > { > >>>>> > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > >>>>> > "whoami": 0, > >>>>> > "id": 60984167, > >>>>> > "want_state": "up:replay", > >>>>> > "state": "up:replay", > >>>>> > "fs_name": "cephfs", > >>>>> > "replay_status": { > >>>>> > "journal_read_pos": 0, > >>>>> > "journal_write_pos": 0, > >>>>> > "journal_expire_pos": 0, > >>>>> > "num_events": 0, > >>>>> > "num_segments": 0 > >>>>> > }, > >>>>> > "rank_uptime": 1127.54018615, > >>>>> > "mdsmap_epoch": 98056, > >>>>> > "osdmap_epoch": 12362, > >>>>> > "osdmap_epoch_barrier": 0, > >>>>> > "uptime": 1127.957307273 > >>>>> > } > >>>>> > > >>>>> > It's staying like that for days now. If there was a counter > >>>>> moving, I > >>>>> > just would wait but it doesn't change anything and alle stats > >>>>> says, the > >>>>> > MDS aren't working at all. > >>>>> > > >>>>> > The symptom I have is that Dashboard and all other tools I > >>>>> use say, it's > >>>>> > more or less ok. (Some old messages about failed daemons and > >>>>> scrubbing > >>>>> > aside). But I can't mount anything. 
When I try to start a VM > >>>>> that's on > >>>>> > RDS I just get a timeout. And when I try to mount a CephFS, > >>>>> mount just > >>>>> > hangs forever. > >>>>> > > >>>>> > Whatever command I give MDS or journal, it just hangs. The > >>>>> only thing I > >>>>> > could do, was take all CephFS offline, kill the MDS's and do > >>>>> a "ceph fs > >>>>> > reset <fs name> --yes-i-really-mean-it". After that I > >>>>> rebooted all > >>>>> > nodes, just to be sure but I still have no access to data. > >>>>> > > >>>>> > Could you please help me? I'm kinda desperate. If you need > >>>>> any more > >>>>> > information, just let me know. > >>>>> > > >>>>> > Cheers, > >>>>> > Thomas > >>>>> > > >>>>> > -- > >>>>> > Thomas Widhalm > >>>>> > Lead Systems Engineer > >>>>> > > >>>>> > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | > >>>>> > D-90429 Nuernberg > >>>>> > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > >>>>> > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > >>>>> > https://www.netways.de <https://www.netways.de> > >>>>> <https://www.netways.de <https://www.netways.de>> | > >>>>> > thomas.widhalm@xxxxxxxxxx <mailto:thomas.widhalm@xxxxxxxxxx> > >>>>> <mailto:thomas.widhalm@xxxxxxxxxx > >>>>> <mailto:thomas.widhalm@xxxxxxxxxx>> > >>>>> > > >>>>> > ** stackconf 2023 - September - https://stackconf.eu > >>>>> <https://stackconf.eu> > >>>>> > <https://stackconf.eu <https://stackconf.eu>> ** > >>>>> > ** OSMC 2023 - November - https://osmc.de <https://osmc.de> > >>>>> <https://osmc.de <https://osmc.de>> ** > >>>>> > ** New at NWS: Managed Database - > >>>>> > https://nws.netways.de/managed-database > >>>>> <https://nws.netways.de/managed-database> > >>>>> > <https://nws.netways.de/managed-database > >>>>> <https://nws.netways.de/managed-database>> ** > >>>>> > ** NETWAYS Web Services - https://nws.netways.de > >>>>> <https://nws.netways.de> > >>>>> > <https://nws.netways.de <https://nws.netways.de>> ** > >>>>> > _______________________________________________ > >>>>> > ceph-users mailing list -- ceph-users@xxxxxxx > >>>>> <mailto:ceph-users@xxxxxxx> > >>>>> > <mailto:ceph-users@xxxxxxx <mailto:ceph-users@xxxxxxx>> > >>>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >>>>> <mailto:ceph-users-leave@xxxxxxx> > >>>>> > <mailto:ceph-users-leave@xxxxxxx > >>>>> <mailto:ceph-users-leave@xxxxxxx>> > >>>>> > > >>>>> > >>>>> -- > >>>>> Thomas Widhalm > >>>>> Lead Systems Engineer > >>>>> > >>>>> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | > >>>>> D-90429 Nuernberg > >>>>> Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > >>>>> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > >>>>> https://www.netways.de <https://www.netways.de> | > >>>>> thomas.widhalm@xxxxxxxxxx <mailto:thomas.widhalm@xxxxxxxxxx> > >>>>> > >>>>> ** stackconf 2023 - September - https://stackconf.eu > >>>>> <https://stackconf.eu> ** > >>>>> ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** > >>>>> ** New at NWS: Managed Database - > >>>>> https://nws.netways.de/managed-database > >>>>> <https://nws.netways.de/managed-database> ** > >>>>> ** NETWAYS Web Services - https://nws.netways.de > >>>>> <https://nws.netways.de> ** > >>>>> > >>>> > >>>> -- > >>>> Thomas Widhalm > >>>> Lead Systems Engineer > >>>> > >>>> NETWAYS Professional Services GmbH | Deutschherrnstr. 
15-19 | D-90429 > >>>> Nuernberg > >>>> Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > >>>> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > >>>> https://www.netways.de | thomas.widhalm@xxxxxxxxxx > >>>> > >>>> ** stackconf 2023 - September - https://stackconf.eu ** > >>>> ** OSMC 2023 - November - https://osmc.de ** > >>>> ** New at NWS: Managed Database - > >>>> https://nws.netways.de/managed-database ** > >>>> ** NETWAYS Web Services - https://nws.netways.de ** > >>>> _______________________________________________ > >>>> ceph-users mailing list -- ceph-users@xxxxxxx > >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx > >>> > >>> -- > >>> Thomas Widhalm > >>> Lead Systems Engineer > >>> > >>> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 > >>> Nuernberg > >>> Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > >>> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > >>> https://www.netways.de | thomas.widhalm@xxxxxxxxxx > >>> > >>> ** stackconf 2023 - September - https://stackconf.eu ** > >>> ** OSMC 2023 - November - https://osmc.de ** > >>> ** New at NWS: Managed Database - > >>> https://nws.netways.de/managed-database ** > >>> ** NETWAYS Web Services - https://nws.netways.de ** > >>> _______________________________________________ > >>> ceph-users mailing list -- ceph-users@xxxxxxx > >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> > >> -- > >> Thomas Widhalm > >> Lead Systems Engineer > >> > >> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg > >> Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > >> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > >> https://www.netways.de | thomas.widhalm@xxxxxxxxxx > >> > >> ** stackconf 2023 - September - https://stackconf.eu ** > >> ** OSMC 2023 - November - https://osmc.de ** > >> ** New at NWS: Managed Database - https://nws.netways.de/managed-database ** > >> ** NETWAYS Web Services - https://nws.netways.de ** > >> _______________________________________________ > >> ceph-users mailing list -- ceph-users@xxxxxxx > >> To unsubscribe send an email to ceph-users-leave@xxxxxxx > > > > > > > > -- > Thomas Widhalm > Lead Systems Engineer > > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > https://www.netways.de | thomas.widhalm@xxxxxxxxxx > > ** stackconf 2023 - September - https://stackconf.eu ** > ** OSMC 2023 - November - https://osmc.de ** > ** New at NWS: Managed Database - https://nws.netways.de/managed-database ** > ** NETWAYS Web Services - https://nws.netways.de ** > -- Cheers, Venky _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx
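
P.S. For anyone finding this thread in the archives: below is a minimal sketch of the workaround sequence discussed above. The daemon names are the ones from this thread (substitute your own), and the `ceph config set`/`ceph config rm` form is an assumption on my part as a cluster-wide alternative to editing ceph.conf — Thomas's `ceph config show` output ("mds_wipe_sessions true ... mon") suggests the value was applied via the mon config database. This only papers over the FAILED ceph_assert(g_conf()->mds_wipe_sessions) crash during journal replay until the underlying bug is fixed.

  # apply the workaround flag to all MDS daemons via the mon config db
  # (equivalent in effect to putting "mds_wipe_sessions = true" in ceph.conf)
  ceph config set mds mds_wipe_sessions true

  # watch the MDSs work through up:replay and reach up:active
  ceph fs dump

  # once active, flush the journal of each (formerly crashing) active MDS
  ceph tell mds.mds01.ceph04.cvdhsx flush journal
  ceph tell mds.mds01.ceph06.hsuhqd flush journal

  # then remove the workaround again
  ceph config rm mds mds_wipe_sessions

A flush that comes back with "return_code": 0, as in the output above, indicates that MDS has an active journal and is making progress; the "Error ENOSYS" from mds01.ceph05.pqxmvt most likely just reflects that it is a standby daemon (per the fs dump) and so does not serve the "flush journal" command.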