MDS allocates all memory (>500G) replaying, OOM-killed, repeat

Hello


We are experiencing an issue where our Ceph MDS allocates over 500G of RAM during replay, is killed by the kernel OOM killer, restarts, and repeats the cycle. We have 3 MDS daemons on different machines, and all of them exhibit this behavior. We are running the following versions (from Docker):


  • ceph/daemon:v3.2.1-stable-3.2-luminous-centos-7
  • ceph/daemon:v3.2.1-stable-3.2-luminous-centos-7
  • ceph/daemon:v3.1.0-stable-3.1-luminous-centos-7 (downgraded in last-ditch effort to resolve, didn't help)

The machines hosting the MDS instances have 512G of RAM. We tried adding swap, and the MDS simply started eating into that as well (and became very slow, eventually being kicked out of the cluster for exceeding the mds_beacon_grace of 240 seconds). We have set mds_cache_memory_limit to many values, from 200G down to the default of 1073741824 (1 GiB), and the result of replay is always the same: the MDS keeps allocating memory until the kernel OOM killer stops it (or until the mds_beacon_grace period expires, if swap is enabled).
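
For reference, we applied the limit roughly like this (the 200G value is one of the ones we tried; on luminous the same setting can also be injected at runtime):

    # ceph.conf on the MDS hosts
    [mds]
    mds_cache_memory_limit = 214748364800   # 200G in bytes
    mds_beacon_grace = 240

    # or at runtime, without restarting the daemons
    ceph tell mds.* injectargs '--mds_cache_memory_limit=214748364800'

As noted, the configured value made no visible difference to the replay behavior.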

Before it died, the active MDS reported 1.592 million inodes to Prometheus (ceph_mds_inodes) and 1.493 million caps (ceph_mds_caps).
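
In case it helps anyone reproduce this, those numbers came from queries along these lines (label names will depend on your exporter setup):

    # peak inode and cap counts per MDS over the last day
    max_over_time(ceph_mds_inodes[1d])
    max_over_time(ceph_mds_caps[1d])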


At this point I feel like my best option is to destroy the journal and hope things come back, but while we can probably recover from this, I'd like to prevent it from happening in the future. Any advice?
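
For what it's worth, the journal-removal path I have in mind is the documented cephfs-journal-tool sequence, roughly as follows (assuming a single active rank, and taking a backup first):

    # export the journal before touching anything
    cephfs-journal-tool journal export backup.bin
    # write recoverable events from the journal back into the metadata store
    cephfs-journal-tool event recover_dentries summary
    # then truncate the journal itself
    cephfs-journal-tool journal reset

But again, I'd much rather understand why replay blows up than have to do this regularly.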


Neale Pickett <neale@xxxxxxxx>
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
