Re: cephfs mds issues

Mazzystr <mazzystr@xxxxxxxxx> · Tue, 11 May 2021 21:55:45 -0700

I jogged my own memory... My mon servers came back up and didn't take the
full ratio settings.  ceph osd state reported OSDs in full status (96%).
That caused the pools to report full.  I run hotter than the default
settings.  We buy disk when we hit 98% capacity, not sooner.  Arguing that
policy is like yelling at a brick wall.

Setting the correct values with ceph osd set-full-ratio, set-nearfull-ratio,
and set-backfillfull-ratio let my OSDs flip back into good states.
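
For anyone landing here later, a rough sketch of what that looks like. The
ratio values below are placeholders I made up for illustration, not the ones
from my cluster; pick your own and keep nearfull <= backfillfull <= full. The
ratio_cmds helper is hypothetical and only prints the commands so you can
review them before running anything.

```shell
#!/bin/sh
# Assumed example ratios; the values are NOT taken from the post above.
NEARFULL="${NEARFULL:-0.95}"
BACKFILLFULL="${BACKFILLFULL:-0.97}"
FULL="${FULL:-0.98}"

# Hypothetical helper: print the three ceph commands rather than run them.
# Arguments: nearfull, backfillfull, full (in that order).
ratio_cmds() {
    echo "ceph osd set-nearfull-ratio $1"
    echo "ceph osd set-backfillfull-ratio $2"
    echo "ceph osd set-full-ratio $3"
}

ratio_cmds "$NEARFULL" "$BACKFILLFULL" "$FULL"

# To see the ratios currently stored in the OSD map:
#   ceph osd dump | grep ratio
```

Pipe the output to sh once the numbers look right, then check the OSD map
again to confirm the ratios stuck.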

And restarting mds shows nice lush logs :)

Documentation for troubleshooting fs doesn't include this information.
Hopefully this helps someone having trouble in the future.

/C

On Tue, May 11, 2021 at 9:04 PM Mazzystr <mazzystr@xxxxxxxxx> wrote:

> I did a simple OS update and reboot.  Now mds is stuck in replay.  I'm
> running Octopus
>
> debug mds = 20 shows some pretty lame logs
>
> # tail -f ceph-mds.bridge.log
> 2021-05-11T18:24:04.859-0700 7f41314a1700 20 mds.0.cache upkeep thread
> waiting interval 1s
> 2021-05-11T18:24:05.860-0700 7f41314a1700 10 mds.0.cache cache not ready
> for trimming
> 2021-05-11T18:24:05.860-0700 7f41314a1700 20 mds.0.cache upkeep thread
> waiting interval 1s
> 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 get_task_status
> 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629
> send_task_status: updating 1 status keys
> 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629
> schedule_update_timer_task
> 2021-05-11T18:24:06.859-0700 7f41314a1700 10 mds.0.cache cache not ready
> for trimming
> 2021-05-11T18:24:06.859-0700 7f41314a1700 20 mds.0.cache upkeep thread
> waiting interval 1s
> 2021-05-11T18:24:07.859-0700 7f41314a1700 10 mds.0.cache cache not ready
> for trimming
> 2021-05-11T18:24:07.859-0700 7f41314a1700 20 mds.0.cache upkeep thread
> waiting interval 1s
>
>
> # cephfs-journal-tool event recover_dentries summary
> gets stuck on an object and stays stuck.  I tried to run rados -p
> cephfs_metadata_pool rmomapkey per https://tracker.ceph.com/issues/38452
> but the cmd ran for hours and never completed.
>
>
> # cephfs-journal-tool --rank cephfs:0 journal reset
> 2021-05-11T18:31:26.860-0700 7f2e9c2a9700 -1 NetHandler create_socket
> couldn't create socket (97) Address family not supported by protocol
> 2021-05-11T18:31:26.860-0700 7f2f2989ba80  4 waiting for MDS map...
> 2021-05-11T18:31:26.860-0700 7f2f2989ba80  4 Got MDS map 2898629
> 2021-05-11T18:31:26.861-0700 7f2f2989ba80 10 main: JournalTool::main
> 2021-05-11T18:31:26.861-0700 7f2f2989ba80  4 main: JournalTool: connecting
> to RADOS...
> 2021-05-11T18:31:26.863-0700 7f2f2989ba80  4 main: JournalTool: resolving
> pool 1
> 2021-05-11T18:31:26.863-0700 7f2f2989ba80  4 main: JournalTool: creating
> IoCtx..
> 2021-05-11T18:31:26.863-0700 7f2f2989ba80  4 main: Executing for rank 0
> 2021-05-11T18:31:26.864-0700 7f2edc2aa700 -1 NetHandler create_socket
> couldn't create socket (97) Address family not supported by protocol
> 2021-05-11T18:31:26.864-0700 7f2f2989ba80  4 waiting for MDS map...
> 2021-05-11T18:31:26.865-0700 7f2f2989ba80  4 Got MDS map 2898629
> 2021-05-11T18:31:26.865-0700 7f2f2989ba80  4 client.2024650.journalpointer
> Reading journal pointer '400.00000000'
> 2021-05-11T18:31:26.865-0700 7f2f2989ba80  1
> client.2024650.journaler.resetter(ro) recover start
> 2021-05-11T18:31:26.865-0700 7f2f2989ba80  1
> client.2024650.journaler.resetter(ro) read_head
> 2021-05-11T18:31:26.865-0700 7f291c293700  1
> client.2024650.journaler.resetter(ro) _finish_read_head loghead(trim
> 14172553216, expire 14174788378, write 14400838791, stream_format 1).
>  probing for end of log (from 14400838791)...
> 2021-05-11T18:31:26.865-0700 7f291c293700  1
> client.2024650.journaler.resetter(ro) probing for end of the log
>
> I've been stuck here for hours
>
>
> # strace -f -p 10357
> [pid 10360] <... sendmsg resumed>)      = 9
> [pid 10361] read(14,  <unfinished ...>
> [pid 10360] epoll_wait(7,  <unfinished ...>
> [pid 10361] <... read resumed>0x55e95d982000, 4096) = -1 EAGAIN (Resource
> temporarily unavailable)
> [pid 10360] <... epoll_wait resumed>[{EPOLLIN, {u32=16, u64=16}},
> {EPOLLIN, {u32=18, u64=18}}], 5000, 30000) = 2
> [pid 10361] epoll_wait(10,  <unfinished ...>
> [pid 10360] read(16,
> "\23\1\10\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\354^\340;"...,
> 4096) = 57
> [pid 10360] read(16, 0x55e95d9a8000, 4096) = -1 EAGAIN (Resource
> temporarily unavailable)
> [pid 10360] read(18, "\17\264R\233`\327\275\222+", 4096) = 9
> [pid 10360] read(18, 0x55e95d9f4000, 4096) = -1 EAGAIN (Resource
> temporarily unavailable)
> [pid 10360] epoll_wait(7, ^X <unfinished ...>
> [pid 10370] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed
> out)
> [pid 10381] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed
> out)
> [pid 10370] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 10389] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed
> out)
> [pid 10381] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 10370] <... clock_gettime resumed>{tv_sec=1620791989,
> tv_nsec=731038214}) = 0
> [pid 10389] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 10381] <... clock_gettime resumed>{tv_sec=1620791989,
> tv_nsec=731105584}) = 0
> [pid 10389] <... clock_gettime resumed>{tv_sec=1620791989,
> tv_nsec=731125991}) = 0
> [pid 10370] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 10381] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 10389] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 10370] <... clock_gettime resumed>{tv_sec=1620791989,
> tv_nsec=731162065}) = 0
> [pid 10389] <... clock_gettime resumed>{tv_sec=1620791989,
> tv_nsec=731184311}) = 0
> [pid 10381] <... clock_gettime resumed>{tv_sec=1620791989,
> tv_nsec=731174345}) = 0
> [pid 10370] futex(0x55e95d97c2d8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 10381] futex(0x55e95d8a5320, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 10370] <... futex resumed>)        = 0
> [pid 10389] futex(0x55e95d97fad8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 10381] <... futex resumed>)        = 0
> [pid 10370] futex(0x55e95d97c31c,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990,
> tv_nsec=731161399}, 0xffffffff <unfinished ...>
> [pid 10389] <... futex resumed>)        = 0
> [pid 10381] futex(0x55e95d8a5364,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990,
> tv_nsec=731173986}, 0xffffffff <unfinished ...>
> [pid 10389] futex(0x55e95d97fb1c,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990,
> tv_nsec=731183618}, 0xffffffff^Cstrace: Process 10357 detached
>
>
> Any help would be great.
>
> Thanks,
> /C
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx