Hi Jim,

I've got a few questions for you, as it looks like we have a similar cluster for our Ceph infrastructure.

A quick overview of what we have: we are also running a small cluster of 3 storage nodes (30 OSDs in total) and 5 clients over a 40Gbit/s InfiniBand link (IPoIB). Ever since installing the cluster back in 2013 we have had issues with Ceph stability. Across the upgrade cycles (we have applied practically every stable Ceph release, major and minor), stability has varied from somewhat improved to poor again. The main problem we had, up until the 10.2.x releases, was slow requests and OSDs being marked down due to missed heartbeats. I gave up after spending a great deal of time trying to track down the cause with folks on IRC, who blamed a networking issue. However, I couldn't confirm that, and it doesn't seem to be the case. I ran about a dozen different network tests over several months and none of them showed any degradation in speed, packet loss, etc. I even tested initiating around 1000 TCP connections per second over the course of months and did not see a single packet drop or unusual delay. While those network tests were running, the Ceph cluster was still producing slow requests and marking OSDs down due to heartbeats. The quoted figure of 10K+ per year for support is not an option for us, so we ended up biting the bullet.

After the recent upgrade to the 10.2.x branch, we started facing an additional issue of OSDs either crashing or being killed due to lack of memory; my guess is memory leaks. We are now approaching the limit of our suffering with Ceph and are currently investigating an alternative solution, as Ceph has proved unstable for us and, unfortunately, community support has not resolved our problems over a four-year period.

I was hoping to get some insight into your setup and configuration, on both the client side and the Ceph back end, and also to learn more about the problems you are having, or had in the past and managed to address. Would you be willing to discuss this further?

Many thanks

Andrei
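P.S. For the sake of comparing notes on the network testing: below is a minimal sketch (in Python) of the kind of connection-rate test I mean, not our actual script. The peer host, port, rate and report interval are placeholders. It simply opens TCP connections to a listener on the peer node at a target rate and logs connection failures and the worst-case connect latency for each interval.

#!/usr/bin/env python3
# Minimal sketch of a TCP connect-rate test. PEER_HOST, PEER_PORT, RATE and
# REPORT_EVERY are placeholders; point it at any listening TCP port on the
# node you want to exercise (e.g. an iperf or nc server).
import socket
import time

PEER_HOST = "peer-node"      # placeholder peer hostname
PEER_PORT = 5001             # placeholder listening port
RATE = 1000                  # target connection attempts per second
REPORT_EVERY = 60            # seconds between summary lines

def run_interval(duration):
    """Open connections as fast as the target rate allows; count failures."""
    attempts = failures = 0
    worst_ms = 0.0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            with socket.create_connection((PEER_HOST, PEER_PORT), timeout=1.0):
                pass                       # connect + close is all we measure
        except OSError:
            failures += 1
        elapsed = time.monotonic() - start
        worst_ms = max(worst_ms, elapsed * 1000)
        attempts += 1
        time.sleep(max(0.0, 1.0 / RATE - elapsed))   # crude pacing
    return attempts, failures, worst_ms

if __name__ == "__main__":
    while True:
        a, f, w = run_interval(REPORT_EVERY)
        print(f"{time.ctime()}: {a} connects, {f} failed, worst {w:.1f} ms")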
----- Original Message -----
> From: "Jim Kilborn" <jim at kilborns.com>
> To: "Joao Eduardo Luis" <joao at suse.de>, "ceph-users" <ceph-users at lists.ceph.com>
> Sent: Thursday, 9 February, 2017 13:04:16
> Subject: Re: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4

> Joao,
>
> Here is the information requested. Thanks for taking a look. Note that the below
> is after I restarted the ceph-mon processes yesterday. If this is not
> acceptable, I will have to wait until the issue reappears. This is on a small
> cluster. 4 ceph nodes, and 6 ceph kernel clients running over infiniband.
>
> > [root at empire-ceph02 log]# ceph -s
> >     cluster 62ed97d6-adf4-12e4-8fd5-3d9701b22b87
> >      health HEALTH_OK
> >      monmap e3: 3 mons at {empire-ceph01=192.168.20.241:6789/0,empire-ceph02=192.168.20.242:6789/0,empire-ceph03=192.168.20.243:6789/0}
> >             election epoch 56, quorum 0,1,2 empire-ceph01,empire-ceph02,empire-ceph03
> >       fsmap e526: 1/1/1 up {0=empire-ceph03=up:active}, 1 up:standby
> >      osdmap e361: 32 osds: 32 up, 32 in
> >             flags sortbitwise,require_jewel_osds
> >       pgmap v2427955: 768 pgs, 2 pools, 2370 GB data, 1759 kobjects
> >             7133 GB used, 109 TB / 116 TB avail
> >                  768 active+clean
> >   client io 256 B/s wr, 0 op/s rd, 0 op/s wr
> >
> > [root at empire-ceph02 log]# ceph daemon mon.empire-ceph02 ops
> > {
> >     "ops": [],
> >     "num_ops": 0
> > }
> >
> > [root at empire-ceph02 mon]# du -sh ceph-empire-ceph02
> > 30M     ceph-empire-ceph02
> >
> > [root at empire-ceph02 mon]# ls -lR
> > .:
> > total 0
> > drwxr-xr-x. 3 ceph ceph 46 Dec 6 14:26 ceph-empire-ceph02
> >
> > ./ceph-empire-ceph02:
> > total 8
> > -rw-r--r--. 1 ceph ceph    0 Dec 6 14:26 done
> > -rw-------. 1 ceph ceph   77 Dec 6 14:26 keyring
> > drwxr-xr-x. 2 ceph ceph 4096 Feb 9 06:58 store.db
> >
> > ./ceph-empire-ceph02/store.db:
> > total 30056
> > -rw-r--r--. 1 ceph ceph  396167 Feb 9 06:06 510929.sst
> > -rw-r--r--. 1 ceph ceph  778898 Feb 9 06:56 511298.sst
> > -rw-r--r--. 1 ceph ceph 5177344 Feb 9 07:01 511301.log
> > -rw-r--r--. 1 ceph ceph 1491740 Feb 9 06:58 511305.sst
> > -rw-r--r--. 1 ceph ceph 2162405 Feb 9 06:58 511306.sst
> > -rw-r--r--. 1 ceph ceph 2162047 Feb 9 06:58 511307.sst
> > -rw-r--r--. 1 ceph ceph 2104201 Feb 9 06:58 511308.sst
> > -rw-r--r--. 1 ceph ceph 2146113 Feb 9 06:58 511309.sst
> > -rw-r--r--. 1 ceph ceph 2123659 Feb 9 06:58 511310.sst
> > -rw-r--r--. 1 ceph ceph 2162927 Feb 9 06:58 511311.sst
> > -rw-r--r--. 1 ceph ceph 2129640 Feb 9 06:58 511312.sst
> > -rw-r--r--. 1 ceph ceph 2133590 Feb 9 06:58 511313.sst
> > -rw-r--r--. 1 ceph ceph 2143906 Feb 9 06:58 511314.sst
> > -rw-r--r--. 1 ceph ceph 2158434 Feb 9 06:58 511315.sst
> > -rw-r--r--. 1 ceph ceph 1649589 Feb 9 06:58 511316.sst
> > -rw-r--r--. 1 ceph ceph      16 Feb 8 13:42 CURRENT
> > -rw-r--r--. 1 ceph ceph       0 Dec 6 14:26 LOCK
> > -rw-r--r--. 1 ceph ceph  983040 Feb 9 06:58 MANIFEST-503363
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
> From: Joao Eduardo Luis<mailto:joao at suse.de>
> Sent: Thursday, February 9, 2017 3:06 AM
> To: ceph-users at lists.ceph.com<mailto:ceph-users at lists.ceph.com>
> Subject: Re: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4
>
> Hi Jim,
>
> On 02/08/2017 07:45 PM, Jim Kilborn wrote:
>> I have had two ceph monitor nodes generate swap space alerts this week.
>> Looking at the memory, I see ceph-mon using a lot of memory and most of the
>> swap space. My ceph nodes have 128GB mem, with 2GB swap (I know the
>> memory/swap ratio is odd)
>>
>> When I get the alert, I see the following
> [snip]
>> [root at empire-ceph02 ~]# ps -aux | egrep 'ceph-mon|MEM'
>>
>> USER       PID %CPU %MEM      VSZ      RSS TTY STAT START   TIME COMMAND
>>
>> ceph    174239  0.3 45.8 62812848 60405112 ?   Ssl  2016 269:08 /usr/bin/ceph-mon -f --cluster ceph --id empire-ceph02 --setuser ceph --setgroup ceph
>>
>> [snip]
>>
>> Is this a setting issue? Or Maybe a bug?
>> When I look at the other ceph-mon processes on other nodes, they aren't using
>> any swap, and only about 500MB of memory.
>
> Can you get us the result of `ceph -s`, of `ceph daemon mon.ID ops`, and
> the size of your monitor's data directory? The latter, ideally,
> recursive with the sizes of all the children in the tree (which,
> assuming they're a lot, would likely be better on a pastebin).
>
>   -Joao

_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com