I did a simple OS update and reboot. Now the MDS is stuck in replay. I'm running Octopus.

Setting debug mds = 20 shows some pretty lame logs:

# tail -f ceph-mds.bridge.log
2021-05-11T18:24:04.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s
2021-05-11T18:24:05.860-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming
2021-05-11T18:24:05.860-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s
2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 get_task_status
2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 send_task_status: updating 1 status keys
2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 schedule_update_timer_task
2021-05-11T18:24:06.859-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming
2021-05-11T18:24:06.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s
2021-05-11T18:24:07.859-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming
2021-05-11T18:24:07.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s

# cephfs-journal-tool event recover_dentries summary

This gets stuck on an object and stays stuck.

I tried to run rados -p cephfs_metadata_pool rmomapkey per https://tracker.ceph.com/issues/38452 but the command ran for hours and never completed.

# cephfs-journal-tool --rank cephfs:0 journal reset
2021-05-11T18:31:26.860-0700 7f2e9c2a9700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
2021-05-11T18:31:26.860-0700 7f2f2989ba80 4 waiting for MDS map...
2021-05-11T18:31:26.860-0700 7f2f2989ba80 4 Got MDS map 2898629
2021-05-11T18:31:26.861-0700 7f2f2989ba80 10 main: JournalTool::main
2021-05-11T18:31:26.861-0700 7f2f2989ba80 4 main: JournalTool: connecting to RADOS...
2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: JournalTool: resolving pool 1
2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: JournalTool: creating IoCtx..
2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: Executing for rank 0
2021-05-11T18:31:26.864-0700 7f2edc2aa700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
2021-05-11T18:31:26.864-0700 7f2f2989ba80 4 waiting for MDS map...
2021-05-11T18:31:26.865-0700 7f2f2989ba80 4 Got MDS map 2898629
2021-05-11T18:31:26.865-0700 7f2f2989ba80 4 client.2024650.journalpointer Reading journal pointer '400.00000000'
2021-05-11T18:31:26.865-0700 7f2f2989ba80 1 client.2024650.journaler.resetter(ro) recover start
2021-05-11T18:31:26.865-0700 7f2f2989ba80 1 client.2024650.journaler.resetter(ro) read_head
2021-05-11T18:31:26.865-0700 7f291c293700 1 client.2024650.journaler.resetter(ro) _finish_read_head loghead(trim 14172553216, expire 14174788378, write 14400838791, stream_format 1). probing for end of log (from 14400838791)...
2021-05-11T18:31:26.865-0700 7f291c293700 1 client.2024650.journaler.resetter(ro) probing for end of the log

I've been stuck here for hours.

# strace -f -p 10357
[pid 10360] <... sendmsg resumed>) = 9
[pid 10361] read(14, <unfinished ...>
[pid 10360] epoll_wait(7, <unfinished ...>
[pid 10361] <... read resumed>0x55e95d982000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 10360] <... epoll_wait resumed>[{EPOLLIN, {u32=16, u64=16}}, {EPOLLIN, {u32=18, u64=18}}], 5000, 30000) = 2
[pid 10361] epoll_wait(10, <unfinished ...>
[pid 10360] read(16, "\23\1\10\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\354^\340;"..., 4096) = 57
[pid 10360] read(16, 0x55e95d9a8000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 10360] read(18, "\17\264R\233`\327\275\222+", 4096) = 9
[pid 10360] read(18, 0x55e95d9f4000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 10360] epoll_wait(7, ^X <unfinished ...>
[pid 10370] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 10381] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 10370] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 10389] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 10381] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 10370] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731038214}) = 0
[pid 10389] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 10381] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731105584}) = 0
[pid 10389] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731125991}) = 0
[pid 10370] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 10381] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 10389] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 10370] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731162065}) = 0
[pid 10389] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731184311}) = 0
[pid 10381] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731174345}) = 0
[pid 10370] futex(0x55e95d97c2d8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 10381] futex(0x55e95d8a5320, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 10370] <... futex resumed>) = 0
[pid 10389] futex(0x55e95d97fad8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 10381] <... futex resumed>) = 0
[pid 10370] futex(0x55e95d97c31c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731161399}, 0xffffffff <unfinished ...>
[pid 10389] <... futex resumed>) = 0
[pid 10381] futex(0x55e95d8a5364, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731173986}, 0xffffffff <unfinished ...>
[pid 10389] futex(0x55e95d97fb1c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731183618}, 0xffffffff^C
strace: Process 10357 detached

Any help would be great.

Thanks,
/C