I have a 3-node Ceph cluster at home that I have been using for a few years now without issue. Each node runs a MON, MGR, and MDS, and has 2-3 OSDs. It has, however, been slow, so I decided to finally move the BlueStore DBs to SSDs. I did one OSD as a test case to make sure everything would go OK: I deleted the OSD, then created a new OSD with the ceph-deploy tool and pointed the DB at an LVM partition on an SSD (a rough sketch of the commands is at the end of this mail). Everything went fine and recovery started.

Later in the day I noticed that my MDS daemon is damaged (PGs are still recovering). I've tried

    cephfs-journal-tool --rank=cephfs:all journal export backup.bin

but it gave me:

    2020-02-23 17:50:03.589 7f7d8b225740 -1 Missing object 200.00c30b6d
    2020-02-23 17:50:07.919 7f7d8b225740 -1 Bad entry start ptr (0x30c2dbb92003) at 0x30c2d3a125ea

(both lines repeat several times) and the export will not complete.

The log file of the MDS that was active at the time shows:

    2020-02-23 17:13:09.091 7fad40029700 0 mds.0.journaler.mdlog(ro) _finish_read got error -2
    2020-02-23 17:13:09.091 7fad40029700 0 mds.0.journaler.mdlog(ro) _finish_read got error -2
    2020-02-23 17:13:09.091 7fad40029700 0 mds.0.journaler.mdlog(ro) _finish_read got error -2
    2020-02-23 17:13:09.091 7fad3e826700 0 mds.0.log _replay journaler got error -2, aborting
    2020-02-23 17:13:09.091 7fad3e826700 -1 log_channel(cluster) log [ERR] : missing journal object

One other thing that happened around the same time: I noticed memory pressure on all the nodes, with only 200 MB of free RAM. I've since tweaked osd_memory_target to try to keep that from happening again (sketch at the end of this mail). Even so, I'm a bit confused how that could cause a catastrophic failure, as I had two other MDSes on standby.

Any help would be appreciated.
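For reference, the delete/recreate step was the standard ceph-deploy flow. The OSD id, device, LV, and host names below are placeholders rather than my exact ones, and the removal commands may have differed slightly from what I actually typed:

    # take the old OSD out and remove it (osd.5 is a placeholder id)
    ceph osd out 5
    ceph osd purge 5 --yes-i-really-mean-it

    # recreate it with the data on the HDD and the BlueStore DB on an LV carved out of the SSD
    ceph-deploy osd create --data /dev/sdb --block-db ssd-vg/db-osd5 node1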
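The osd_memory_target tweak was done through ceph config; the 2 GiB value below is just an example figure, not necessarily the one I settled on:

    # lower the per-OSD memory target from the 4 GiB default to 2 GiB
    ceph config set osd osd_memory_target 2147483648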