+1

Ever since upgrading to 10.2.x I have been seeing a lot of issues with
our ceph cluster: osds going down, and osd servers running out of memory
and killing all ceph-osd processes. Again, this is 10.2.5 on a 4.4.x
kernel. It seems that with every release there are more and more problems
with ceph, which is a shame.

Andrei

----- Original Message -----
> From: "Jim Kilborn" <jim at kilborns.com>
> To: "ceph-users" <ceph-users at lists.ceph.com>
> Sent: Wednesday, 8 February, 2017 19:45:58
> Subject: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4
>
> I have had two ceph monitor nodes generate swap space alerts this week.
> Looking at the memory, I see ceph-mon using a lot of memory and most of
> the swap space. My ceph nodes have 128GB mem, with 2GB swap (I know the
> memory/swap ratio is odd).
>
> When I get the alert, I see the following:
>
> [root@empire-ceph02 ~]# free
>               total        used        free      shared  buff/cache   available
> Mem:      131783876    67618000    13383516       53868    50782360    61599096
> Swap:       2097148     2097092          56
>
> [root@empire-ceph02 ~]# ps -aux | egrep 'ceph-mon|MEM'
> USER       PID  %CPU %MEM      VSZ      RSS TTY STAT START   TIME COMMAND
> ceph    174239   0.3 45.8 62812848 60405112 ?   Ssl  2016  269:08 /usr/bin/ceph-mon -f --cluster ceph --id empire-ceph02 --setuser ceph --setgroup ceph
>
> In the ceph-mon log, I see the following:
>
> Feb 8 09:31:21 empire-ceph02 ceph-mon: 2017-02-08 09:31:21.211268 7f414d974700
> -1 lsb_release_parse - failed to call lsb_release binary with error: (12)
> Cannot allocate memory
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012856 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012900 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214da10 osd.3 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012915 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214d410 osd.5 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012927 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214e490 osd.6 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012934 7f3dcfe94700
> -1 osd.8 344 heartbeat_check: no reply from 0x563e42149a10 osd.7 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:04.012854)
> Feb 8 09:31:25 empire-ceph02 ceph-osd: 2017-02-08 09:31:25.013038 7f3dcfe94700
> -1 osd.8 345 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back
> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
> (cutoff 2017-02-08 09:31:05.013020)
>
> Is this a setting issue? Or maybe a bug?
>
> When I look at the other ceph-mon processes on other nodes, they aren't
> using any swap, and only about 500MB of memory.
>
> When I restart ceph-mon on the server that shows the issue, the swap
> frees up, and the memory for the new ceph-mon is 500MB again.
>
> Any ideas would be appreciated.
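
One way to confirm that the kernel's OOM killer is what is taking out the
ceph-osd processes (rather than the daemons dying on their own) is to check
the kernel log on the affected nodes. A quick check, assuming a stock
CentOS 7-style setup; the log path may differ on other distros:

    dmesg -T | egrep -i 'out of memory|oom-killer'
    # on syslog-based systems the same records usually land here:
    grep -i 'killed process' /var/log/messages

If these show ceph-osd being killed, the osds going down are a side effect
of the memory pressure rather than a separate bug.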
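
As for the growing ceph-mon itself: ceph daemons are normally built against
tcmalloc, which can sit on a large amount of freed-but-not-returned heap. If
that is what is happening here (an assumption, not something the output
above confirms), you can inspect and trim the heap without a restart, using
the mon id from the ps output:

    ceph tell mon.empire-ceph02 heap stats     # tcmalloc heap usage for that mon
    ceph tell mon.empire-ceph02 heap release   # return unused pages to the OS

If the RSS climbs right back, restarting only the affected daemon (Jewel on
systemd uses per-id units) is less disruptive than rebooting the node:

    systemctl restart ceph-mon@empire-ceph02

and the quorum stays intact as long as the other mons are up.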