Hi Zheng,

Sorry - I've just re-read your email and saw your instruction to restore
the mds_cache_size and mds_cache_memory_limit to their original values if
the MDS does not crash - I have now done this...

thanks again for your help,

best regards,

Jake

On 2/11/19 12:01 PM, Jake Grimmett wrote:
> Hi Zheng,
>
> Many, many thanks for your help...
>
> Your suggestion of setting large values for mds_cache_size and
> mds_cache_memory_limit stopped our MDS crashing :)
>
> The values in ceph.conf are now:
>
> mds_cache_size = 8589934592
> mds_cache_memory_limit = 17179869184
>
> Should these values be left in our configuration?
>
> again thanks for the assistance,
>
> Jake
>
> On 2/11/19 8:17 AM, Yan, Zheng wrote:
>> On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Dear All,
>>>
>>> Unfortunately the MDS has crashed on our Mimic cluster...
>>>
>>> First symptoms were rsync giving:
>>> "No space left on device (28)"
>>> when trying to rename or delete
>>>
>>> This prompted me to try restarting the MDS, as it was reported laggy.
>>>
>>> Restarting the MDS shows this error in the log before the crash:
>>>
>>> elist.h: 39: FAILED assert(!is_on_list())
>>>
>>> A full MDS log showing the crash is here:
>>>
>>> http://p.ip.fi/iWlz
>>>
>>> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>>>
>>> The cluster has 10 nodes, 254 OSDs, uses EC for the data, 3x
>>> replication for MDS. We have a single active MDS, with two failover MDS
>>>
>>> We have ~2PB of cephfs data here, all of which is currently
>>> inaccessible, all and any advice gratefully received :)
>>>
>>
>> Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
>> them to very large values before starting mds. If mds does not crash,
>> restore the mds_cache_size and mds_cache_memory_limit to their
>> original values (by admin socket) after mds becomes active for 10
>> seconds
>>
>> If mds still crashes, try compiling ceph-mds with the following patch
>>
>> diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
>> index d3461fba2e..c2731e824c 100644
>> --- a/src/mds/CDir.cc
>> +++ b/src/mds/CDir.cc
>> @@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
>>    // clean?
>>    if (dn->is_dirty())
>>      dn->mark_clean();
>> +  if (inode->is_stray())
>> +    dn->item_stray.remove_myself();
>>
>>    if (dn->state_test(CDentry::STATE_BOTTOMLRU))
>>      cache->bottom_lru.lru_remove(dn);
>>
>>
>>> best regards,
>>>
>>> Jake
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
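
For anyone following the same recovery steps, here is a rough sketch of
the two stages described in the thread: the temporary ceph.conf settings,
then the admin-socket commands to put the values back once the MDS has
been active for about 10 seconds. It only prints text and does not touch
a cluster. Several things here are assumptions and not from the thread:
the "[mds]" section placement, the daemon name "a", and the "original"
values, which the thread never states (the sketch uses what I believe are
the Mimic defaults, so check your own configuration first).

```python
# Values Jake reported placing in ceph.conf (taken from the thread).
TEMPORARY = {
    "mds_cache_size": 8589934592,
    "mds_cache_memory_limit": 17179869184,  # 16 GiB, in bytes
}

# ASSUMPTION: the thread never states the original values. These are
# believed to be the Mimic defaults; substitute whatever your cluster
# actually used before the change.
ORIGINAL = {
    "mds_cache_size": 0,
    "mds_cache_memory_limit": 1073741824,  # 1 GiB, in bytes
}


def conf_section(values):
    """Render the options as a ceph.conf fragment.

    ASSUMPTION: the options are placed under [mds]; the thread only
    says they were added to ceph.conf.
    """
    lines = ["[mds]"]
    lines += [f"{key} = {value}" for key, value in values.items()]
    return "\n".join(lines)


def restore_commands(mds_name, values):
    """Build 'ceph daemon ... config set' commands for the admin socket.

    mds_name is a hypothetical daemon id; run the commands on the host
    where that MDS is running.
    """
    return [
        f"ceph daemon mds.{mds_name} config set {key} {value}"
        for key, value in values.items()
    ]


if __name__ == "__main__":
    print("# Step 1: add to ceph.conf before starting the MDS")
    print(conf_section(TEMPORARY))
    print()
    print("# Step 2: once the MDS has been active for ~10 seconds")
    for command in restore_commands("a", ORIGINAL):
        print(command)
```

Values set through the admin socket only last until the daemon restarts,
so the temporary lines also need to come out of ceph.conf if the original
settings are meant to persist.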