+1

Ever since upgrading to 10.2.x I have been seeing a lot of issues with
our ceph cluster: osds going down, and osd servers running out of memory
and killing all ceph-osd processes. Again, this is 10.2.5 on a 4.4.x
kernel. It seems that with every release there are more and more problems
with ceph, which is a shame.

Andrei

----- Original Message -----
> From: "Jim Kilborn" <jim at kilborns.com>
> To: "ceph-users" <ceph-users at lists.ceph.com>
> Sent: Wednesday, 8 February, 2017 19:45:58
> Subject: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4
>
> I have had two ceph monitor nodes generate swap space alerts this week.
> Looking at the memory, I see ceph-mon using a lot of memory and most of
> the swap space. My ceph nodes have 128GB mem, with 2GB swap (I know the
> memory/swap ratio is odd).
>
> When I get the alert, I see the following:
>
> [root@empire-ceph02 ~]# free
>               total        used        free      shared  buff/cache   available
> Mem:      131783876    67618000    13383516       53868    50782360    61599096
> Swap:       2097148     2097092          56
>
> [root@empire-ceph02 ~]# ps -aux | egrep 'ceph-mon|MEM'
> USER       PID  %CPU %MEM      VSZ      RSS TTY STAT START   TIME COMMAND
> ceph    174239   0.3 45.8 62812848 60405112 ?   Ssl  2016  269:08 /usr/bin/ceph-mon -f --cluster ceph --id empire-ceph02 --setuser ceph --setgroup ceph
>
> In the ceph-mon log, I see the following:
>
> Feb 8 09:31:21 empire-ceph02 ceph-mon: 2017-02-08 09:31:21.211268 7f414d974700
> -1 lsb_release_parse - failed to call lsb_release binary with error: (12)
> Cannot allocate memory
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012856 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012900 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214da10 osd.3 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012915 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214d410 osd.5 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012927 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214e490 osd.6 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012934 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e42149a10 osd.7 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:25 empire-ceph02 ceph-osd: 2017-02-08 09:31:25.013038 7f3dcfe94700
> -1 osd.8 345 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:05.013020)
>
> Is this a setting issue? Or maybe a bug?
>
> When I look at the other ceph-mon processes on other nodes, they aren't
> using any swap, and only about 500MB of memory.
>
> When I restart ceph-mon on the server that shows the issue, the swap
> frees up, and the memory for the new ceph-mon is 500MB again.
>
> Any ideas would be appreciated.
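
One way to confirm that the kernel's OOM killer is what is taking out the
ceph-osd processes (rather than the daemons dying on their own) is to check
the kernel log on the affected nodes. A quick check, assuming a stock
CentOS 7-style setup; the log path may differ on other distros:

    dmesg -T | egrep -i 'out of memory|oom-killer'
    # on syslog-based systems the same records usually land here:
    grep -i 'killed process' /var/log/messages

If these show ceph-osd being killed, the osds going down are a side effect
of the memory pressure rather than a separate bug.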
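
As for the growing ceph-mon itself: ceph daemons are normally built against
tcmalloc, which can sit on a large amount of freed-but-not-returned heap. If
that is what is happening here (an assumption, not something the output
above confirms), you can inspect and trim the heap without a restart, using
the mon id from the ps output:

    ceph tell mon.empire-ceph02 heap stats     # tcmalloc heap usage for that mon
    ceph tell mon.empire-ceph02 heap release   # return unused pages to the OS

If the RSS climbs right back, restarting only the affected daemon (Jewel on
systemd uses per-id units) is less disruptive than rebooting the node:

    systemctl restart ceph-mon@empire-ceph02

and the quorum stays intact as long as the other mons are up.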