Hello Ceph users,

I have two issues affecting the mon nodes in my Ceph cluster (Octopus 15.2.6).

1) Mon store keeps growing

The store.db directory (/var/lib/ceph/mon/ceph-v60/store.db/) has grown by almost 20 GB over the last two days. I have been clearing space in /var and have grown /var a few times. I have also compacted the mon store with ceph-monstore-tool several times, but after ceph-mon has been running for a few hours, /var fills up again and store.db is larger than before. ceph-monstore-tool compact completes without reporting any errors.

2) Mon fails to start with ceph_assert

One of my three mon nodes fails to start and logs the assertion failure shown below. I have tried injecting a monmap taken from a healthy mon node, but the crash is unchanged. The commands I ran for both issues are sketched right after this paragraph, and the log excerpt from the failing mon follows.
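For reference, here is roughly what I ran for both issues. This is reconstructed from shell history, so treat the exact invocations as approximate; /tmp/monmap is just the path I used to stage the map:

    # Issue 1: compact the mon store on v60, offline with the mon
    # stopped, then online via the mon admin interface
    systemctl stop ceph-mon@v60
    ceph-monstore-tool /var/lib/ceph/mon/ceph-v60 compact
    systemctl start ceph-mon@v60
    ceph tell mon.v60 compact

    # Issue 2: fetch the monmap from a healthy mon and inject it
    # into the failing mon (v62)
    ceph mon getmap -o /tmp/monmap    # run on a healthy node
    systemctl stop ceph-mon@v62
    ceph-mon -i v62 --inject-monmap /tmp/monmap
    systemctl start ceph-mon@v62

The log from the failing mon (v62) follows: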
Apr 19 10:44:33 v62 ceph-mon[1877692]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.6/rpm/el7/BUILD/ceph-15.2.6/src/mon/AuthMonitor.cc: 279: FAILED ceph_assert(ret == 0)
Apr 19 10:44:33 v62 ceph-mon[1877692]: ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)
Apr 19 10:44:33 v62 ceph-mon[1877692]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x7f519e1ec665]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 2: (()+0x26882d) [0x7f519e1ec82d]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 3: (AuthMonitor::update_from_paxos(bool*)+0x2832) [0x559a623d6282]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 4: (PaxosService::refresh(bool*)+0x103) [0x559a62474cc3]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 5: (Monitor::refresh_from_paxos(bool*)+0x17c) [0x559a62355e4c]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 6: (Monitor::init_paxos()+0xfc) [0x559a6235611c]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 7: (Monitor::preinit()+0xd5f) [0x559a623773ef]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 8: (main()+0x2398) [0x559a6230e908]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 9: (__libc_start_main()+0xf5) [0x7f519add9555]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 10: (()+0x2305d0) [0x559a6233f5d0]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 0> 2022-04-19T10:44:33.779-0700 7f51a6faa340 -1 *** Caught signal (Aborted) **
Apr 19 10:44:33 v62 ceph-mon[1877692]: in thread 7f51a6faa340 thread_name:ceph-mon
Apr 19 10:44:33 v62 ceph-mon[1877692]: ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)
Apr 19 10:44:33 v62 ceph-mon[1877692]: 1: (()+0xf630) [0x7f519bffa630]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 2: (gsignal()+0x37) [0x7f519aded387]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 3: (abort()+0x148) [0x7f519adeea78]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x7f519e1ec6b4]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 5: (()+0x26882d) [0x7f519e1ec82d]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 6: (AuthMonitor::update_from_paxos(bool*)+0x2832) [0x559a623d6282]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 7: (PaxosService::refresh(bool*)+0x103) [0x559a62474cc3]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 8: (Monitor::refresh_from_paxos(bool*)+0x17c) [0x559a62355e4c]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 9: (Monitor::init_paxos()+0xfc) [0x559a6235611c]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 10: (Monitor::preinit()+0xd5f) [0x559a623773ef]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 11: (main()+0x2398) [0x559a6230e908]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 12: (__libc_start_main()+0xf5) [0x7f519add9555]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 13: (()+0x2305d0) [0x559a6233f5d0]
Apr 19 10:44:33 v62 ceph-mon[1877692]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 19 10:44:33 v62 systemd[1]: ceph-mon@v62.service: main process exited, code=killed, status=6/ABRT
Apr 19 10:44:33 v62 systemd[1]: Unit ceph-mon@v62.service entered failed state.
Apr 19 10:44:33 v62 systemd[1]: ceph-mon@v62.service failed.
Apr 19 10:44:35 v62 ceph-mgr[1050242]: ::ffff:10.8.12.51 - - [19/Apr/2022:10:44:35] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.27.1"
Apr 19 10:44:43 v62 systemd[1]: ceph-mon@v62.service holdoff time over, scheduling restart.
Apr 19 10:44:43 v62 systemd[1]: start request repeated too quickly for ceph-mon@v62.service
Apr 19 10:44:43 v62 systemd[1]: Unit ceph-mon@v62.service entered failed state.
Apr 19 10:44:43 v62 systemd[1]: ceph-mon@v62.service failed.

Please let me know what else I can provide to help debug these issues.

Thanks,
Ilhaan Rasheed