Hi Thomas,

I have created the tracker https://tracker.ceph.com/issues/58489 to track this. Please upload the debug mds logs here.

Thanks,
Kotresh H R

On Wed, Jan 18, 2023 at 4:56 PM Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:

> Hi Thomas,
>
> This looks like it requires more investigation than I expected. What's the current status?
> Did the crashed MDS come back and become active?
>
> Increase the debug log level to 20 and share the mds logs. I will create a tracker and share it here.
> You can upload the mds logs there.
>
> Thanks,
> Kotresh H R
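
For reference, on a cephadm-managed Quincy cluster the debug level can usually be raised for all MDS daemons through the central config. The snippet below is only a sketch: the daemon name is the one from this thread, and `cephadm logs` has to be run on the host that carries that daemon.

# raise MDS and messenger debug levels for every MDS daemon
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# capture the log of the active MDS (name as shown in the fs map)
cephadm logs --name mds.mds01.ceph05.pqxmvt > mds01.ceph05.pqxmvt.log

# revert to the defaults once the logs are collected
ceph config rm mds debug_mds
ceph config rm mds debug_ms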

> On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm <thomas.widhalm@xxxxxxxxxx> wrote:
>
>> Another new thing that just happened:
>>
>> One of the MDS just crashed out of nowhere.
>>
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
>>
>> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fccd759943f]
>> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
>> 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c]
>> 4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
>> 5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
>> 6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
>> 7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
>> 8: clone()
>>
>> and
>>
>> *** Caught signal (Aborted) **
>> in thread 7fccc7153700 thread_name:md_log_replay
>>
>> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
>> 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0]
>> 2: gsignal()
>> 3: abort()
>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fccd7599499]
>> 5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
>> 6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c]
>> 7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
>> 8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
>> 9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
>> 10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
>> 11: clone()
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> That is what I found in the logs. Since it's referring to log replaying, could this be related to my issue?
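
The assert above fires while the MDS is replaying an EMetaBlob from its journal, so before anyone attempts a repair it is usually worth taking a read-only look at, and a backup of, that journal. A minimal sketch with cephfs-journal-tool, assuming rank 0 of the 'cephfs' filesystem and ideally run while no MDS is actively replaying that rank:

# sanity-check the journal for the affected rank
cephfs-journal-tool --rank=cephfs:0 journal inspect

# keep a copy before any recovery steps from the disaster-recovery docs
cephfs-journal-tool --rank=cephfs:0 journal export backup.cephfs-0.bin

Whether any further journal manipulation is appropriate should come out of the tracker discussion, not from this sketch.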

>> On 17.01.23 10:54, Thomas Widhalm wrote:
>>
>>> Hi again,
>>>
>>> Another thing I found: out of pure desperation, I started MDS daemons on all nodes. I had them configured in the past, so I was hoping they could help bring in missing data even though they had been down for quite a while now. I didn't see any changes in the logs, but the CPU on the hosts that usually don't run MDS spiked so high that I had to kill the MDS again, because otherwise they kept killing OSD containers. So I don't really have any new information, but maybe that could be a hint of some kind?
>>>
>>> Cheers,
>>> Thomas
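
As an aside, with cephadm the number and placement of MDS daemons is easier to control through the orchestrator than by starting extra daemons by hand. A sketch only, with the placement count, host list and the stray daemon name assumed as examples from this thread:

# pin the MDS service for 'cephfs' to the hosts that should run it
ceph orch apply mds cephfs --placement="3 ceph05 ceph06 ceph07"

# stop a single unwanted daemon cleanly instead of killing the process
ceph orch daemon stop mds.mds01.ceph06.hsuhqd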

>>> On 17.01.23 10:13, Thomas Widhalm wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks again. :-)
>>>>
>>>> Ok, that seems like an error to me. I never configured an extra rank for MDS. Maybe that's where my knowledge failed me, but I guess the MDS is waiting for something that was never there.
>>>>
>>>> Yes, there are two filesystems. Due to "budget restrictions" (it's my personal system at home), I configured a second CephFS with only one replica for data that could be easily restored.
>>>>
>>>> Here's what I got when turning up the debug level:
>>>>
>>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11107
>>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s
>>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11107 rtt 0.00200002
>>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
>>>> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
>>>> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
>>>> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
>>>> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11108
>>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s
>>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11108 rtt 0.00200002
>>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
>>>> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
>>>> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
>>>> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
>>>> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109
>>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s
>>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11109 rtt 0.00600006
>>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
>>>> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
>>>> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free memory
>>>> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
>>>> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
>>>> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
>>>> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
>>>> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>>>>
>>>> The only thing that gives me hope here is that the line "mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109" is changing its sequence number.
>>>>
>>>> Anything else I can provide?
>>>>
>>>> Cheers,
>>>> Thomas
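
For what it's worth, the beacon sequence number only shows that the daemon is alive and talking to the monitors; actual replay progress is easier to judge from the replay_status counters in the status command already used in this thread. A small sketch (daemon name taken from above):

# journal_read_pos should keep growing while up:replay is making progress
ceph tell mds.mds01.ceph05.pqxmvt status | grep -E '"journal_(read|write|expire)_pos"'

If those values never move from 0, replay is most likely not advancing at all.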

>>>> On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote:
>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> Sorry, I misread the mds state as being stuck in 'up:resolve'. The mds is stuck in 'up:replay', which means the MDS is taking over a failed rank: in this state it is recovering its journal and other metadata.
>>>>>
>>>>> I notice that there are two filesystems, 'cephfs' and 'cephfs_insecure', and the active mds of both filesystems is stuck in 'up:replay'. The mds logs shared are not providing much information to infer anything.
>>>>>
>>>>> Could you please enable the debug logs and pass on the mds logs?
>>>>>
>>>>> Thanks,
>>>>> Kotresh H R

>>>>> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm <thomas.widhalm@xxxxxxxxxx> wrote:
>>>>>
>>>>>> Hi Kotresh,
>>>>>>
>>>>>> Thanks for your reply!
>>>>>>
>>>>>> I only have one rank. Here's the output of all MDS I have:
>>>>>>
>>>>>> ###################
>>>>>>
>>>>>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status
>>>>>> 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 ms_handle_reset on v2:192.168.23.65:6800/2680651694
>>>>>> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 ms_handle_reset on v2:192.168.23.65:6800/2680651694
>>>>>> {
>>>>>>     "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>>>>>>     "whoami": 0,
>>>>>>     "id": 60984167,
>>>>>>     "want_state": "up:replay",
>>>>>>     "state": "up:replay",
>>>>>>     "fs_name": "cephfs",
>>>>>>     "replay_status": {
>>>>>>         "journal_read_pos": 0,
>>>>>>         "journal_write_pos": 0,
>>>>>>         "journal_expire_pos": 0,
>>>>>>         "num_events": 0,
>>>>>>         "num_segments": 0
>>>>>>     },
>>>>>>     "rank_uptime": 150224.982558844,
>>>>>>     "mdsmap_epoch": 143757,
>>>>>>     "osdmap_epoch": 12395,
>>>>>>     "osdmap_epoch_barrier": 0,
>>>>>>     "uptime": 150225.39968057699
>>>>>> }
>>>>>>
>>>>>> ########################
>>>>>>
>>>>>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status
>>>>>> 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 ms_handle_reset on v2:192.168.23.64:6800/3930607515
>>>>>> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 ms_handle_reset on v2:192.168.23.64:6800/3930607515
>>>>>> {
>>>>>>     "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>>>>>>     "whoami": 0,
>>>>>>     "id": 60984134,
>>>>>>     "want_state": "up:replay",
>>>>>>     "state": "up:replay",
>>>>>>     "fs_name": "cephfs_insecure",
>>>>>>     "replay_status": {
>>>>>>         "journal_read_pos": 0,
>>>>>>         "journal_write_pos": 0,
>>>>>>         "journal_expire_pos": 0,
>>>>>>         "num_events": 0,
>>>>>>         "num_segments": 0
>>>>>>     },
>>>>>>     "rank_uptime": 150450.96934037199,
>>>>>>     "mdsmap_epoch": 143815,
>>>>>>     "osdmap_epoch": 12395,
>>>>>>     "osdmap_epoch_barrier": 0,
>>>>>>     "uptime": 150451.93533502301
>>>>>> }
>>>>>>
>>>>>> ###########################
>>>>>>
>>>>>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status
>>>>>> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom'
>>>>>> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap: cephfs:1/1 cephfs_insecure:1/1 {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay} 2 up:standby
>>>>>> Error ENOENT: problem getting command descriptions from mds.mds01.ceph06.wcfdom
>>>>>>
>>>>>> ############################
>>>>>>
>>>>>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status
>>>>>> 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 ms_handle_reset on v2:192.168.23.67:6800/942898192
>>>>>> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 ms_handle_reset on v2:192.168.23.67:6800/942898192
>>>>>> {
>>>>>>     "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>>>>>>     "whoami": -1,
>>>>>>     "id": 60984161,
>>>>>>     "want_state": "up:standby",
>>>>>>     "state": "up:standby",
>>>>>>     "mdsmap_epoch": 97687,
>>>>>>     "osdmap_epoch": 0,
>>>>>>     "osdmap_epoch_barrier": 0,
>>>>>>     "uptime": 150508.29091721401
>>>>>> }
>>>>>>
>>>>>> The error message from ceph06 is new to me. That didn't happen the last few times.
>>>>>>
>>>>>> [ceph: root@ceph06 /]# ceph fs dump
>>>>>> e143850
>>>>>> enable_multiple, ever_enabled_multiple: 1,1
>>>>>> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>>>>> legacy client fscid: 2
>>>>>>
>>>>>> Filesystem 'cephfs' (2)
>>>>>> fs_name cephfs
>>>>>> epoch 143850
>>>>>> flags 12 joinable allow_snaps allow_multimds_snaps
>>>>>> created 2023-01-14T14:30:05.723421+0000
>>>>>> modified 2023-01-16T09:00:53.663007+0000
>>>>>> tableserver 0
>>>>>> root 0
>>>>>> session_timeout 60
>>>>>> session_autoclose 300
>>>>>> max_file_size 1099511627776
>>>>>> required_client_features {}
>>>>>> last_failure 0
>>>>>> last_failure_osd_epoch 12321
>>>>>> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>>>>> max_mds 1
>>>>>> in 0
>>>>>> up {0=60984167}
>>>>>> failed
>>>>>> damaged
>>>>>> stopped
>>>>>> data_pools [4]
>>>>>> metadata_pool 5
>>>>>> inline_data disabled
>>>>>> balancer
>>>>>> standby_count_wanted 1
>>>>>> [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694] compat {c=[1],r=[1],i=[7ff]}]
>>>>>>
>>>>>> Filesystem 'cephfs_insecure' (3)
>>>>>> fs_name cephfs_insecure
>>>>>> epoch 143849
>>>>>> flags 12 joinable allow_snaps allow_multimds_snaps
>>>>>> created 2023-01-14T14:22:46.360062+0000
>>>>>> modified 2023-01-16T09:00:52.632163+0000
>>>>>> tableserver 0
>>>>>> root 0
>>>>>> session_timeout 60
>>>>>> session_autoclose 300
>>>>>> max_file_size 1099511627776
>>>>>> required_client_features {}
>>>>>> last_failure 0
>>>>>> last_failure_osd_epoch 12319
>>>>>> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>>>>> max_mds 1
>>>>>> in 0
>>>>>> up {0=60984134}
>>>>>> failed
>>>>>> damaged
>>>>>> stopped
>>>>>> data_pools [7]
>>>>>> metadata_pool 6
>>>>>> inline_data disabled
>>>>>> balancer
>>>>>> standby_count_wanted 1
>>>>>> [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515] compat {c=[1],r=[1],i=[7ff]}]
>>>>>>
>>>>>> Standby daemons:
>>>>>>
>>>>>> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192] compat {c=[1],r=[1],i=[7ff]}]
>>>>>> [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518] compat {c=[1],r=[1],i=[7ff]}]
>>>>>> dumped fsmap epoch 143850
>>>>>>
>>>>>> #############################
>>>>>>
>>>>>> [ceph: root@ceph06 /]# ceph fs status
>>>>>>
>>>>>> (doesn't come back)
>>>>>>
>>>>>> #############################
>>>>>>
>>>>>> All MDS show log lines similar to this one:
>>>>>>
>>>>>> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143927 from mon.1
>>>>>> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143929 from mon.1
>>>>>> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143930 from mon.1
>>>>>> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143931 from mon.1
>>>>>> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143933 from mon.1
>>>>>> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143935 from mon.1
>>>>>> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143936 from mon.1
>>>>>> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143937 from mon.1
>>>>>> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143939 from mon.1
>>>>>> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143941 from mon.1
>>>>>> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143942 from mon.1
>>>>>>
>>>>>> Anything else I can provide?
>>>>>>
>>>>>> Cheers and thanks again!
>>>>>> Thomas
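
Since replay_status shows every position at 0 and 'ceph fs status' hangs, one read-only check that might help here is to look at the journal header stored in the metadata pool. A sketch, assuming it is run with an admin keyring (for example inside 'cephadm shell') against rank 0 of each filesystem:

# print write_pos/expire_pos/trimmed_pos as stored in RADOS, without changing anything
cephfs-journal-tool --rank=cephfs:0 header get
cephfs-journal-tool --rank=cephfs_insecure:0 header get

Comparing those values with the zeros in replay_status can hint at whether the MDS has even managed to read the journal header yet.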

>>>>>> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote:
>>>>>>
>>>>>>> Hi Thomas,
>>>>>>>
>>>>>>> As the documentation says, the MDS enters up:resolve from up:replay if the Ceph file system has multiple ranks (including this one), i.e. it's not a single-active-MDS cluster. The MDS is then resolving any uncommitted inter-MDS operations. All ranks in the file system must be in this state or later for progress to be made, i.e. no rank can be failed/damaged or up:replay.
>>>>>>>
>>>>>>> So please check whether the other active mds has failed.
>>>>>>>
>>>>>>> Also please share the mds logs and the output of 'ceph fs dump' and 'ceph fs status'.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kotresh H R

>>>>>>> On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm <thomas.widhalm@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm really lost with my Ceph system. I built a small cluster for home usage which has two uses for me: I want to replace an old NAS, and I want to learn about Ceph so that I have hands-on experience. We're using it in our company, but I need some real-life experience without risking any company or customer data. That's my preferred way of learning.
>>>>>>>>
>>>>>>>> The cluster consists of 3 Raspberry Pis plus a few VMs running on Proxmox. I'm not using Proxmox's built-in Ceph because I want to focus on Ceph and not just use it as a preconfigured tool.
>>>>>>>>
>>>>>>>> All hosts are running Fedora (x86_64 and arm64), and during an upgrade from F36 to F37 my cluster suddenly showed all PGs as unavailable. I worked nearly a week to get it back online, and I learned a lot about Ceph management and recovery. The cluster is back, but I still can't access my data. Maybe you can help me?
>>>>>>>>
>>>>>>>> Here are my versions:
>>>>>>>>
>>>>>>>> [ceph: root@ceph04 /]# ceph versions
>>>>>>>> {
>>>>>>>>     "mon": {
>>>>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>>>>     },
>>>>>>>>     "mgr": {
>>>>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>>>>     },
>>>>>>>>     "osd": {
>>>>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5
>>>>>>>>     },
>>>>>>>>     "mds": {
>>>>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
>>>>>>>>     },
>>>>>>>>     "overall": {
>>>>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Here's the MDS status output of one MDS:
>>>>>>>>
>>>>>>>> [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
>>>>>>>> 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694
>>>>>>>> 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694
>>>>>>>> {
>>>>>>>>     "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>>>>>>>>     "whoami": 0,
>>>>>>>>     "id": 60984167,
>>>>>>>>     "want_state": "up:replay",
>>>>>>>>     "state": "up:replay",
>>>>>>>>     "fs_name": "cephfs",
>>>>>>>>     "replay_status": {
>>>>>>>>         "journal_read_pos": 0,
>>>>>>>>         "journal_write_pos": 0,
>>>>>>>>         "journal_expire_pos": 0,
>>>>>>>>         "num_events": 0,
>>>>>>>>         "num_segments": 0
>>>>>>>>     },
>>>>>>>>     "rank_uptime": 1127.54018615,
>>>>>>>>     "mdsmap_epoch": 98056,
>>>>>>>>     "osdmap_epoch": 12362,
>>>>>>>>     "osdmap_epoch_barrier": 0,
>>>>>>>>     "uptime": 1127.957307273
>>>>>>>> }
>>>>>>>>
>>>>>>>> It has been staying like that for days now. If there was a counter moving, I would just wait, but nothing changes, and all stats say the MDS aren't working at all.
>>>>>>>>
>>>>>>>> The symptom I have is that the dashboard and all other tools I use say it's more or less ok (some old messages about failed daemons and scrubbing aside). But I can't mount anything. When I try to start a VM that's on RBD I just get a timeout, and when I try to mount a CephFS, mount just hangs forever.
>>>>>>>>
>>>>>>>> Whatever command I give the MDS or the journal, it just hangs. The only thing I could do was take all CephFS offline, kill the MDSs and do a "ceph fs reset <fs name> --yes-i-really-mean-it". After that I rebooted all nodes, just to be sure, but I still have no access to data.
>>>>>>>>
>>>>>>>> Could you please help me? I'm kinda desperate. If you need any more information, just let me know.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Thomas
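
For anyone reading along later: the "take it offline and reset" step described above roughly corresponds to the command sequence below. This is only a reconstruction of what was apparently done, with the filesystem and daemon names assumed from this thread; 'ceph fs reset' discards MDS map state for the filesystem and is a last resort that should only be used by following the disaster-recovery documentation.

# mark a filesystem down and fail its active MDS
ceph fs fail cephfs
ceph mds fail mds01.ceph05.pqxmvt

# last-resort reset of the filesystem's MDS map
ceph fs reset cephfs --yes-i-really-mean-it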

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx