We have alerting on our mons that notifies us when memory usage climbs above 80%, and when it fires we go around and restart the mon services in that cluster. It appears to be a memory leak somewhere in the code, but the problem is so infrequent that it's hard to gather good enough logs to track it down. We only end up restarting the mons in a cluster every couple of months or so, which is infrequent enough that we haven't bothered to chase the leak in the code ourselves. We see this on 0.94.7, but have heard of it on the mailing list from people running 0.94.9 as well as Jewel.
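For reference, the check itself is nothing fancy. Below is a minimal sketch of the sort of thing one could run from cron; the 80% threshold, the alert address, and the restart command are illustrative rather than our exact tooling, and the systemd unit name only applies to systemd-based installs (sysvinit Hammer boxes would use the old init script instead):

#!/bin/bash
# Alert (and optionally restart) when ceph-mon RSS climbs past a threshold.
THRESHOLD=80                                            # percent of total RAM
TOTAL_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)   # total RAM in kB
MON_RSS_KB=$(ps -C ceph-mon -o rss= | awk '{s+=$1} END {print s+0}')
PCT=$(( MON_RSS_KB * 100 / TOTAL_KB ))
if [ "$PCT" -ge "$THRESHOLD" ]; then
    echo "ceph-mon on $(hostname -s) is using ${PCT}% of RAM" \
        | mail -s "ceph-mon memory alert: $(hostname -s)" ops@example.com
    # systemctl restart ceph-mon@$(hostname -s)         # uncomment to restart automatically
fi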
David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943
________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Jim Kilborn [jim@xxxxxxxxxxxx]
Sent: Wednesday, February 08, 2017 12:45 PM
To: ceph-users@xxxxxxxxxxxxxx
Subject: ceph-mon memory issue jewel 10.2.5 kernel 4.4
I have had two ceph monitor nodes generate swap space alerts this week.
Looking at memory, I see ceph-mon using a lot of memory and most of the swap space. My ceph nodes have 128GB of memory, with 2GB of swap (I know the memory/swap ratio is odd).
When I get the alert, I see the following:
[root@empire-ceph02 ~]# free
              total        used        free      shared  buff/cache   available
Mem:      131783876    67618000    13383516       53868    50782360    61599096
Swap:       2097148     2097092          56
[root@empire-ceph02 ~]# ps -aux | egrep 'ceph-mon|MEM'
USER       PID %CPU %MEM    VSZ      RSS   TTY  STAT START   TIME COMMAND
ceph    174239  0.3 45.8 62812848 60405112 ?    Ssl  2016  269:08 /usr/bin/ceph-mon -f --cluster ceph --id empire-ceph02 --setuser ceph --setgroup ceph
In the ceph-mon log, I see the following:
Feb 8 09:31:21 empire-ceph02 ceph-mon: 2017-02-08 09:31:21.211268 7f414d974700 -1 lsb_release_parse - failed to call lsb_release binary with error: (12) Cannot allocate memory
Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012856 7f3dcfe94700 -1 osd.8 344 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012900 7f3dcfe94700 -1 osd.8 344 heartbeat_check: no reply from 0x563e4214da10 osd.3 since back 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012915 7f3dcfe94700 -1 osd.8 344 heartbeat_check: no reply from 0x563e4214d410 osd.5 since back 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012927 7f3dcfe94700 -1 osd.8 344 heartbeat_check: no reply from 0x563e4214e490 osd.6 since back 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb 8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012934 7f3dcfe94700 -1 osd.8 344 heartbeat_check: no reply from 0x563e42149a10 osd.7 since back 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb 8 09:31:25 empire-ceph02 ceph-osd: 2017-02-08 09:31:25.013038 7f3dcfe94700 -1 osd.8 345 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:05.013020)
Is this a settings issue, or maybe a bug?
When I look at the ceph-mon processes on the other nodes, they aren't using any swap and are only using about 500MB of memory.
When I restart ceph-mon on the server that shows the issue, the swap frees up, and the new ceph-mon process is back to about 500MB of memory.
Any ideas would be appreciated.
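For completeness, the restart is just the mon's systemd unit on this Jewel install, and I gather the mon's heap can also be inspected (and trimmed) through the tell interface when the daemon is built against tcmalloc, as the stock packages are; something along these lines, with this host's mon id:

[root@empire-ceph02 ~]# ceph tell mon.empire-ceph02 heap stats      # tcmalloc heap usage
[root@empire-ceph02 ~]# ceph tell mon.empire-ceph02 heap release    # return freed memory to the OS
[root@empire-ceph02 ~]# systemctl restart ceph-mon@empire-ceph02    # full restart if that doesn't help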
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com