I forgot that I left my VM mount command running. It hangs my VM, but more alarming is that it crashes the MDS servers on the Ceph cluster. The Ceph cluster is all hardware nodes, and the OpenStack VM does not have an admin keyring (although the cephx keyring generated for CephFS does have write permissions to the ec42 pool).
As far as I am aware this shouldn't happen. I will try upgrading as soon as I can, but I didn't see anything like this mentioned in the changelog and am worried this will still exist in 12.2.5. Has anyone seen this before?
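(For anyone who wants to sanity-check what the client is allowed to do: the caps on that CephFS key can be dumped with something like the following. `client.cephfs` is a placeholder for whatever the key is actually called, and the caps shown in the comment are only what such a key roughly looks like, not a dump from my cluster.)

```
# Placeholder key name -- substitute the actual client name
ceph auth get client.cephfs

# A CephFS-only key created via "ceph fs authorize" normally looks roughly like:
#   caps mds = "allow rw"
#   caps mon = "allow r"
#   caps osd = "allow rw pool=ec42"
```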
+-------------------------------------------------------------+
|                   Luminous CephFS Cluster                   |
|                       version 12.2.4                        |
|                        Ubuntu 16.04                         |
|           4.10.0-38-generic (all hardware nodes)            |
+-------------------------------------------------------------+

+--------------------+        +-------------------+--------------------+--------------------+
|                    |        |                   |                    |                    |
|   Openstack VM     |        |  Ceph Monitor A   |  Ceph Monitor B    |  Ceph Monitor C    |
|   Ubuntu 16.04     +------->|  Ceph Mon Server  |  Ceph MDS A        |  Ceph MDS Failover |
|  4.13.0-39-generic |        |  kh08-8           |  kh09-8            |  kh10-8            |
|  CephFS via kernel |        |                   |                    |                    |
+--------------------+        +-------------------+--------------------+--------------------+

+-------------------------------------------------------------+
|                       ec42 16384 PGs                        |
|                      CephFS Data Pool                       |
|               Erasure coded with 4/2 profile                |
+-------------------------------------------------------------+

+-------------------------------------------------------------+
|                  cephfs_metadata 4096 PGs                   |
|                    CephFS Metadata Pool                     |
|                    Replicated pool (n=3)                    |
+-------------------------------------------------------------+
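(For context, a 4+2 erasure-coded CephFS data pool like the one above is normally set up along these lines. This is only a rough reconstruction to show the shape of the configuration, not the exact commands used here; the profile name is a placeholder.)

```
# Rough reconstruction -- profile name is a placeholder
ceph osd erasure-code-profile set ec42-profile k=4 m=2
ceph osd pool create ec42 16384 16384 erasure ec42-profile

# CephFS data on an EC pool needs overwrites enabled (BlueStore OSDs only)
ceph osd pool set ec42 allow_ec_overwrites true

# Replicated metadata pool
ceph osd pool create cephfs_metadata 4096 4096 replicated
```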
On Mon, Apr 30, 2018 at 7:24 PM, Sean Sullivan <lookcrabs@xxxxxxxxx> wrote:
So I think I can reliably reproduce this crash from a Ceph client. Starting from a healthy cluster:
```
root@kh08-8:~# ceph -s
  cluster:
    id:     9f58ee5a-7c5d-4d68-81ee-debe16322544
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
    mgr: kh08-8(active)
    mds: cephfs-1/1/1 up {0=kh09-8=up:active}, 1 up:standby
    osd: 570 osds: 570 up, 570 in
```
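(The client already has CephFS mounted at /cephfs via the kernel driver, along the lines of the following sketch; the monitor address, client name, and secret file path here are placeholders, not my actual values.)

```
# Placeholders: monitor address, client name, secret file path
mkdir -p /cephfs /mnt/aufs /aufs
mount -t ceph 10.10.10.1:6789:/ /cephfs -o name=cephfs,secretfile=/etc/ceph/client.cephfs.secret
```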
Then, from the client, try to mount aufs over CephFS:
```
mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
```
Now watch as your ceph mds servers fail:
```
root@kh08-8:~# ceph -s
  cluster:
    id:     9f58ee5a-7c5d-4d68-81ee-debe16322544
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
    mgr: kh08-8(active)
    mds: cephfs-1/1/1 up {0=kh10-8=up:active(laggy or crashed)}
```
I am now stuck in a degraded state and I can't seem to get the MDS daemons to start again.

On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan <lookcrabs@xxxxxxxxx> wrote:

I had 2 MDS servers (one active, one standby) and both were down. I took a dumb chance and marked the active one as down (it said it was up but laggy), then started the primary again, and now both are back up (see the note further down for roughly what that amounts to in commands). I have never seen this before, and I am also not sure what I just did.

On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan <lookcrabs@xxxxxxxxx> wrote:

I was creating a new user and mount point. On another hardware node I mounted CephFS as admin to mount as root. I created /aufstest and then unmounted. From there it seems that both of my MDS nodes crashed for some reason, and I can't start them any more.
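(Side note on the "marked the active as down and restarted it" step above: the commands involved are roughly the following. The rank and host name match my layout from the status output above; the systemd unit name is what Ubuntu 16.04 packages normally use and may differ elsewhere.)

```
# Tell the monitors to drop the laggy/crashed active MDS (rank 0 here)
ceph mds fail 0

# Restart the MDS daemon on the affected host (unit name assumed for systemd packaging)
systemctl restart ceph-mds@kh10-8

# Confirm it rejoins
ceph mds stat
ceph -s
```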
https://pastebin.com/1ZgkL9fa -- my mds log
I have never had this happen in my tests, and now it is hitting live data. If anyone can lend a hand or point me in the right direction while troubleshooting, that would be a godsend!
I tried cephfs-journal-tool inspect and it reports that the journal should be fine. I am not sure why it's crashing:

/home/lacadmin# cephfs-journal-tool journal inspect
Overall journal integrity: OK
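(Before trying any of the destructive cephfs-journal-tool recovery steps, the usual advice is to take a backup of the journal first; the output path below is just an example.)

```
# Non-destructive: dump the journal to a file before attempting any recovery
cephfs-journal-tool journal export /root/mds0-journal-backup.bin
```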