Re: 12.2.4 Both Ceph MDS nodes crashed. Please help.

So I think I can reliably reproduce this crash from a ceph client. 

```
root@kh08-8:~# ceph -s
  cluster:
    id:     9f58ee5a-7c5d-4d68-81ee-debe16322544
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
    mgr: kh08-8(active)
    mds: cephfs-1/1/1 up  {0=kh09-8=up:active}, 1 up:standby
    osd: 570 osds: 570 up, 570 in
```


Then, from a client, try to mount aufs over the CephFS mount:
```
mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
```
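This assumes CephFS is already mounted at /cephfs on the client. A minimal sketch of that mount (kernel client; the monitor address and secret file are placeholders, not my actual values):

```
mount -t ceph kh08-8:6789:/ /cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
```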

Now watch as your ceph mds servers fail:

```
root@kh08-8:~# ceph -s
  cluster:
    id:     9f58ee5a-7c5d-4d68-81ee-debe16322544
    health: HEALTH_WARN
            insufficient standby MDS daemons available
 
  services:
    mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
    mgr: kh08-8(active)
    mds: cephfs-1/1/1 up  {0=kh10-8=up:active(laggy or crashed)}
```


I am now stuck in a degraded state, and I can't seem to get the MDS daemons to start again.
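For reference, this is roughly the sort of thing I am trying on the MDS hosts to get them back. A sketch only; the systemd unit ids are assumed to match the hostnames above and depend on how the daemons were deployed:

```
# Check filesystem and MDS state (Luminous)
ceph fs status
ceph mds stat

# Restart the MDS daemon on an MDS host and check it
systemctl restart ceph-mds@kh09-8
systemctl status ceph-mds@kh09-8

# Watch the MDS log for the crash backtrace
tail -f /var/log/ceph/ceph-mds.kh09-8.log
```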

On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan <lookcrabs@xxxxxxxxx> wrote:
I had two MDS servers (one active, one standby) and both were down. I took a dumb chance and marked the active one as failed (it said it was up but laggy), then started the primary again, and now both are back up. I have never seen this before, and I am not sure what I just did.
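In case it helps anyone else, it was something along these lines (a rough sketch rather than my exact commands; rank 0 and the daemon id are assumptions based on the status output above):

```
# Mark the laggy "active" rank as failed so the mon drops it
ceph mds fail 0

# Start the MDS daemon on the former active host again
systemctl start ceph-mds@kh09-8

# Confirm one MDS comes back up:active and the other up:standby
ceph mds stat
```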

On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan <lookcrabs@xxxxxxxxx> wrote:
I was creating a new user and mount point. On another hardware node, I mounted CephFS as admin (as root), created /aufstest, and then unmounted. At that point it seems that both of my MDS nodes crashed for some reason, and I can't start them any more.
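For context, the steps were roughly the following. A sketch only, with the client name, mount point, monitor address, and secret file as placeholders rather than the exact values:

```
# Create the new CephFS client (placeholder name client.aufstest)
ceph fs authorize cephfs client.aufstest / rw

# Mount CephFS as admin on the other node (kernel client)
mount -t ceph kh08-8:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# Create the new directory at the root of the filesystem, then unmount
mkdir /mnt/cephfs/aufstest
umount /mnt/cephfs
```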

https://pastebin.com/1ZgkL9fa -- my mds log

I have never had this happen in my tests, and now I have live data here. If anyone can lend a hand or point me in the right direction while troubleshooting, that would be a godsend!

I tried cephfs-journal-tool's journal inspect and it reports that the journal should be fine, so I am not sure why the MDS is crashing:

```
/home/lacadmin# cephfs-journal-tool journal inspect
Overall journal integrity: OK
```
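For completeness, the same tool has a few other read-only checks I can run without touching the journal (assuming the default rank; so far I have only run the inspect above):

```
cephfs-journal-tool journal inspect      # overall integrity, as above
cephfs-journal-tool header get           # dump the journal header as JSON
cephfs-journal-tool event get summary    # count the event types in the journal
```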





