MDS consuming large memory and rebooting

Hi All.

 

I came in this morning to find that one of my CephFS file systems was read-only and that the MDS was replaying its log, but the MDS processes kept crashing with out-of-memory errors.

I have had to increase the memory on the VMs hosting the MDS, and the MDS process now grows to ~76GB before it briefly comes online. I also had to set standby_count_wanted to 0 to get the daemon up, but it then promptly crashes again with the errors below. My research suggests I might be hitting this bug: https://github.com/ceph/ceph/pull/25519/files.
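
For reference, the standby change was made roughly like this (the file system name "cephfs" is a placeholder for ours; the daemon name matches the mds.beacon.ceph-b-3 entries in the log below):

# allow the file system to come up without a standby daemon
ceph fs set cephfs standby_count_wanted 0

# watch the MDS while it replays and briefly goes active
ceph fs status
ceph daemon mds.ceph-b-3 status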

 

Any suggestions on how I can recover from this situation?

 

-10001> 2019-07-08 14:13:16.659 7f90df693700  5 -- 10.137.0.134:6800/1608067295 >> 10.120.0.58:0/4242249126 conn(0x563f8a8d6300 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=46 cs=1 l=0). rx client.47532 seq 20 0x564d120b13c0 client_session(request_renewcaps seq 96425)

-10001> 2019-07-08 14:13:17.043 7f90d9687700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

-10001> 2019-07-08 14:13:17.043 7f90d9687700  0 mds.beacon.ceph-b-3 Skipping beacon heartbeat to monitors (last acked 14.5042s ago); MDS internal heartbeat is not healthy!

-10001> 2019-07-08 14:13:17.159 7f90d9e88700 -1 /build/ceph-13.2.6/src/include/elist.h: In function 'elist<T>::item::~item() [with T = CDentry*]' thread 7f90d9e88700 time 2019-07-08 14:13:17.162533

/build/ceph-13.2.6/src/include/elist.h: 39: FAILED assert(!is_on_list())

 

ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f90e4947b5e]

2: (()+0x2c4cb7) [0x7f90e4947cb7]

3: (CDentry::~CDentry()+0x372) [0x563a8cde1ee2]

4: (CDentry::~CDentry()+0x9) [0x563a8cde1f19]

5: (CDir::remove_dentry(CDentry*)+0x165) [0x563a8cdee215]

6: (MDCache::trim_dentry(CDentry*, std::map<int, MCacheExpire*, std::less<int>, std::allocator<std::pair<int const, MCacheExpire*> > >&)+0xfe) [0x563a8cd14bbe]

7: (MDCache::trim_lru(unsigned long, std::map<int, MCacheExpire*, std::less<int>, std::allocator<std::pair<int const, MCacheExpire*> > >&)+0x85d) [0x563a8cd1616d]

8: (MDCache::trim(unsigned long)+0x24a) [0x563a8cd1712a]

9: (MDSRankDispatcher::tick()+0xd9) [0x563a8cc35979]

10: (FunctionContext::finish(int)+0x2c) [0x563a8cc1badc]

11: (Context::complete(int)+0x9) [0x563a8cc19f89]

12: (SafeTimer::timer_thread()+0xf9) [0x7f90e4944329]

13: (SafeTimerThread::entry()+0xd) [0x7f90e4945a3d]

14: (()+0x76db) [0x7f90e41fb6db]

15: (clone()+0x3f) [0x7f90e33e188f]

 

-10001> 2019-07-08 14:13:17.163 7f90d9e88700 -1 *** Caught signal (Aborted) **

in thread 7f90d9e88700 thread_name:safe_timer

 

Regards

Robert Ruge

 


