Hello Ceph users,

I have two issues affecting the mon nodes in my Ceph cluster (Octopus 15.2.6).

1) Mon store keeps growing

The store.db directory (/var/lib/ceph/mon/ceph-v60/store.db/) has grown by almost 20 GB over the last two days. I have been clearing space in /var and have grown /var a few times. I have also compacted the mon store with ceph-monstore-tool several times, but after ceph-mon has been running for a few hours, /var fills up again and store.db is larger than before. ceph-monstore-tool compact completes without reporting any errors.

2) Mon fails to start with ceph_assert

One of my three mon nodes fails to start and logs the assertion failure shown below. I have tried injecting a monmap taken from a healthy mon node, but the crash is unchanged. The commands I ran for both issues are sketched right after this paragraph, and the log excerpt from the failing mon follows.
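For reference, here is roughly what I ran for both issues. This is reconstructed from shell history, so treat the exact invocations as approximate; /tmp/monmap is just the path I used to stage the map:

    # Issue 1: compact the mon store on v60, offline with the mon
    # stopped, then online via the mon admin interface
    systemctl stop ceph-mon@v60
    ceph-monstore-tool /var/lib/ceph/mon/ceph-v60 compact
    systemctl start ceph-mon@v60
    ceph tell mon.v60 compact

    # Issue 2: fetch the monmap from a healthy mon and inject it
    # into the failing mon (v62)
    ceph mon getmap -o /tmp/monmap    # run on a healthy node
    systemctl stop ceph-mon@v62
    ceph-mon -i v62 --inject-monmap /tmp/monmap
    systemctl start ceph-mon@v62

The log from the failing mon (v62) follows: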
Apr 19 10:44:33 v62 ceph-mon[1877692]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.6/rpm/el7/BUILD/ceph-15.2.6/src/mon/AuthMonitor.cc: 279: FAILED ceph_assert(ret == 0)
Apr 19 10:44:33 v62 ceph-mon[1877692]: ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)
Apr 19 10:44:33 v62 ceph-mon[1877692]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x7f519e1ec665]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 2: (()+0x26882d) [0x7f519e1ec82d]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 3: (AuthMonitor::update_from_paxos(bool*)+0x2832) [0x559a623d6282]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 4: (PaxosService::refresh(bool*)+0x103) [0x559a62474cc3]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 5: (Monitor::refresh_from_paxos(bool*)+0x17c) [0x559a62355e4c]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 6: (Monitor::init_paxos()+0xfc) [0x559a6235611c]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 7: (Monitor::preinit()+0xd5f) [0x559a623773ef]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 8: (main()+0x2398) [0x559a6230e908]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 9: (__libc_start_main()+0xf5) [0x7f519add9555]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 10: (()+0x2305d0) [0x559a6233f5d0]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 0> 2022-04-19T10:44:33.779-0700 7f51a6faa340 -1 *** Caught signal (Aborted) **
Apr 19 10:44:33 v62 ceph-mon[1877692]: in thread 7f51a6faa340 thread_name:ceph-mon
Apr 19 10:44:33 v62 ceph-mon[1877692]: ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)
Apr 19 10:44:33 v62 ceph-mon[1877692]: 1: (()+0xf630) [0x7f519bffa630]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 2: (gsignal()+0x37) [0x7f519aded387]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 3: (abort()+0x148) [0x7f519adeea78]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x7f519e1ec6b4]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 5: (()+0x26882d) [0x7f519e1ec82d]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 6: (AuthMonitor::update_from_paxos(bool*)+0x2832) [0x559a623d6282]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 7: (PaxosService::refresh(bool*)+0x103) [0x559a62474cc3]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 8: (Monitor::refresh_from_paxos(bool*)+0x17c) [0x559a62355e4c]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 9: (Monitor::init_paxos()+0xfc) [0x559a6235611c]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 10: (Monitor::preinit()+0xd5f) [0x559a623773ef]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 11: (main()+0x2398) [0x559a6230e908]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 12: (__libc_start_main()+0xf5) [0x7f519add9555]
Apr 19 10:44:33 v62 ceph-mon[1877692]: 13: (()+0x2305d0) [0x559a6233f5d0]
Apr 19 10:44:33 v62 ceph-mon[1877692]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 19 10:44:33 v62 systemd[1]: ceph-mon@v62.service: main process exited, code=killed, status=6/ABRT
Apr 19 10:44:33 v62 systemd[1]: Unit ceph-mon@v62.service entered failed state.
Apr 19 10:44:33 v62 systemd[1]: ceph-mon@v62.service failed.
Apr 19 10:44:35 v62 ceph-mgr[1050242]: ::ffff:10.8.12.51 - - [19/Apr/2022:10:44:35] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.27.1"
Apr 19 10:44:43 v62 systemd[1]: ceph-mon@v62.service holdoff time over, scheduling restart.
Apr 19 10:44:43 v62 systemd[1]: start request repeated too quickly for ceph-mon@v62.service
Apr 19 10:44:43 v62 systemd[1]: Unit ceph-mon@v62.service entered failed state.
Apr 19 10:44:43 v62 systemd[1]: ceph-mon@v62.service failed.

Please let me know what else I can provide to help debug these issues.

Thanks,
Ilhaan Rasheed