On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>
> Dear All,
>
> Unfortunately the MDS has crashed on our Mimic cluster...
>
> First symptoms were rsync giving:
> "No space left on device (28)"
> when trying to rename or delete.
>
> This prompted me to try restarting the MDS, as it reported laggy.
>
> Restarting the MDS shows this error in the log before the crash:
>
> elist.h: 39: FAILED assert(!is_on_list())
>
> A full MDS log showing the crash is here:
>
> http://p.ip.fi/iWlz
>
> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>
> The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and 3x
> replication for metadata. We have a single active MDS, with two failover
> MDS daemons.
>
> We have ~2PB of cephfs data here, all of which is currently
> inaccessible; any and all advice gratefully received :)

Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set them to
very large values before starting the MDS. If the MDS does not crash,
restore mds_cache_size and mds_cache_memory_limit to their original values
(via the admin socket) after the MDS has been active for about 10 seconds.
An example ceph.conf snippet and the matching admin-socket commands are
sketched at the end of this mail.

If the MDS still crashes, try compiling ceph-mds with the following patch
(a rough build recipe is also sketched at the end):

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index d3461fba2e..c2731e824c 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
   // clean?
   if (dn->is_dirty())
     dn->mark_clean();
+  if (inode->is_stray())
+    dn->item_stray.remove_myself();

   if (dn->state_test(CDentry::STATE_BOTTOMLRU))
     cache->bottom_lru.lru_remove(dn);

> best regards,
>
> Jake
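
For concreteness, here is a minimal sketch of the cache override and the
later restore via the admin socket. The numbers and the daemon name "mds.a"
are illustrative assumptions only, not values from this thread; "original
values" means whatever your cluster ran with before (the Mimic default for
mds_cache_memory_limit is 1 GiB):

# /etc/ceph/ceph.conf on the MDS host, before starting the MDS
[mds]
    mds_cache_memory_limit = 137438953472   # e.g. 128 GiB, far above normal
    mds_cache_size = 20000000               # e.g. 20M inodes (0 = no count limit)

# ~10 seconds after the MDS becomes active, restore the previous values
# through the admin socket ("mds.a" is a placeholder for your daemon name)
ceph daemon mds.a config set mds_cache_memory_limit 1073741824
ceph daemon mds.a config set mds_cache_size 0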
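
And if it does come to rebuilding ceph-mds, something along these lines
should work against the v13.2.4 source tree (the patch file name and paths
are made up for the example; adjust to your build environment):

git clone --branch v13.2.4 https://github.com/ceph/ceph.git
cd ceph
git submodule update --init --recursive
git apply /path/to/cdir-remove-dentry.patch   # the diff above, saved to a file
./install-deps.sh
./do_cmake.sh
cd build
make -j$(nproc) ceph-mds                      # patched binary lands in build/bin/ceph-mds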