Re: Memory leak in Ceph OSD?

Hi,

I'm also seeing a slow memory increase over time with my BlueStore NVMe OSDs (3.2 TB each), with default ceph.conf settings (Ceph 12.2.2).

Each OSD starts at around 5 GB of memory and goes up to about 8 GB.

Currently I'm restarting them about once a month to free the memory.
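For reference, this is what that looks like on my systemd-managed OSDs. The "heap release" step is optional; it only helps if a tcmalloc build is sitting on already-freed pages, which is an assumption on my part rather than a confirmed cause here:

# ask tcmalloc to hand freed pages back to the OS, without a restart
ceph tell osd.0 heap release

# full restart of a single OSD under systemd
systemctl restart ceph-osd@0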


Here is a dump of osd.0 after one week of running:

ceph     2894538  3.9  9.9 7358564 6553080 ?     Ssl  mars01 303:03 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph


root@ceph4-1:~#  ceph daemon osd.0 dump_mempools 
{
    "bloom_filter": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_alloc": {
        "items": 84070208,
        "bytes": 84070208
    },
    "bluestore_cache_data": {
        "items": 168,
        "bytes": 2908160
    },
    "bluestore_cache_onode": {
        "items": 947820,
        "bytes": 636935040
    },
    "bluestore_cache_other": {
        "items": 101250372,
        "bytes": 2043476720
    },
    "bluestore_fsck": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_txc": {
        "items": 8,
        "bytes": 5760
    },
    "bluestore_writing_deferred": {
        "items": 85,
        "bytes": 1203200
    },
    "bluestore_writing": {
        "items": 7,
        "bytes": 569584
    },
    "bluefs": {
        "items": 1774,
        "bytes": 106360
    },
    "buffer_anon": {
        "items": 68307,
        "bytes": 17188636
    },
    "buffer_meta": {
        "items": 284,
        "bytes": 24992
    },
    "osd": {
        "items": 333,
        "bytes": 4017312
    },
    "osd_mapbl": {
        "items": 0,
        "bytes": 0
    },
    "osd_pglog": {
        "items": 1195884,
        "bytes": 298139520
    },
    "osdmap": {
        "items": 4542,
        "bytes": 384464
    },
    "osdmap_mapping": {
        "items": 0,
        "bytes": 0
    },
    "pgmap": {
        "items": 0,
        "bytes": 0
    },
    "mds_co": {
        "items": 0,
        "bytes": 0
    },
    "unittest_1": {
        "items": 0,
        "bytes": 0
    },
    "unittest_2": {
        "items": 0,
        "bytes": 0
    },
    "total": {
        "items": 187539792,
        "bytes": 3089029956
    }
}



Another OSD, after one month:


USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph     1718009  2.5 11.7 8542012 7725992 ?     Ssl   2017 2463:28 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph

root@ceph4-1:~# ceph daemon osd.5 dump_mempools 
{
    "bloom_filter": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_alloc": {
        "items": 98449088,
        "bytes": 98449088
    },
    "bluestore_cache_data": {
        "items": 759,
        "bytes": 17276928
    },
    "bluestore_cache_onode": {
        "items": 884140,
        "bytes": 594142080
    },
    "bluestore_cache_other": {
        "items": 116375567,
        "bytes": 2072801299
    },
    "bluestore_fsck": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_txc": {
        "items": 6,
        "bytes": 4320
    },
    "bluestore_writing_deferred": {
        "items": 99,
        "bytes": 1190045
    },
    "bluestore_writing": {
        "items": 11,
        "bytes": 4510159
    },
    "bluefs": {
        "items": 1202,
        "bytes": 64136
    },
    "buffer_anon": {
        "items": 76863,
        "bytes": 21327234
    },
    "buffer_meta": {
        "items": 910,
        "bytes": 80080
    },
    "osd": {
        "items": 328,
        "bytes": 3956992
    },
    "osd_mapbl": {
        "items": 0,
        "bytes": 0
    },
    "osd_pglog": {
        "items": 1118050,
        "bytes": 286277600
    },
    "osdmap": {
        "items": 6073,
        "bytes": 551872
    },
    "osdmap_mapping": {
        "items": 0,
        "bytes": 0
    },
    "pgmap": {
        "items": 0,
        "bytes": 0
    },
    "mds_co": {
        "items": 0,
        "bytes": 0
    },
    "unittest_1": {
        "items": 0,
        "bytes": 0
    },
    "unittest_2": {
        "items": 0,
        "bytes": 0
    },
    "total": {
        "items": 216913096,
        "bytes": 3100631833
    }
}
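In case it is useful to others, below is a rough sketch of how I put the mempool totals next to RSS for every OSD on a host. It assumes jq is installed, the admin sockets are in the default location under /var/run/ceph, and the cluster uses the default name "ceph":

for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=${sock##*/ceph-osd.}; id=${id%.asok}                 # OSD id from the socket name
    total=$(ceph daemon "$sock" dump_mempools | jq '.total.bytes')
    pid=$(pgrep -f "ceph-osd -f --cluster ceph --id $id ")  # trailing space so osd.1 does not match osd.10
    rss_kb=$(ps -o rss= -p "$pid")
    echo "osd.$id mempool_bytes=$total rss_kb=$rss_kb"
done

The mempool total is in bytes while ps reports RSS in KiB; whatever the mempools do not track (allocator overhead, fragmentation, and so on) shows up as the difference between the two.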

----- Original Message -----
From: "Kjetil Joergensen" <kjetil@xxxxxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Wednesday, March 7, 2018 01:07:06
Subject: Re: Memory leak in Ceph OSD?

Hi, 
Addendum: we're running 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b). 

The workload is a mix of 3x-replicated & EC-coded pools (rbd, cephfs, rgw). 

-KJ 

On Tue, Mar 6, 2018 at 3:53 PM, Kjetil Joergensen <kjetil@xxxxxxxxxxxx> wrote: 



Hi, 
so.. +1 

We don't run compression as far as I know, so that wouldn't be it. We do actually run a mix of bluestore & filestore, since the rest of the cluster predates a stable bluestore by some amount. 

The interesting part is - the behavior seems to be specific to our bluestore nodes. 

Below: the yellow line is a node with 10 x ~4TB SSDs, the green line a node with 8 x 800GB SSDs. The blue line is the dump_mempools total bytes for all the OSDs running on the yellow-line node. The big dips are forced restarts, after having previously suffered through the after-effects of letting Linux handle it via OOM->SIGKILL. 

[inline graph omitted: per-node memory usage over time]
As a gross extrapolation, "right now" the "memory used" seems close enough to the "sum of RSS of the ceph-osd processes" running on those machines. 
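(For the record, the sum can be produced with something like the one-liner below; a rough sketch that assumes procps ps, which reports RSS in KiB:)

ps -C ceph-osd -o rss= | awk '{sum += $1} END {printf "%.1f GiB\n", sum/1024/1024}'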

-KJ 

On Thu, Mar 1, 2018 at 7:18 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote: 

On Thu, Mar 1, 2018 at 5:37 PM, Subhachandra Chandra 
<schandra@xxxxxxxxxxxx> wrote: 
> Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives 
> filled to around 90%. One thing that does increase memory usage is the 
> number of clients simultaneously sending write requests to a particular 
> primary OSD if the write sizes are large. 

We have not seen a memory increase on Ubuntu 16.04, but I have repeatedly 
observed the following phenomenon: 

When doing a vMotion in ESXi of a large 3TB file (this generates a lot 
of small IO requests) to a Ceph pool with compression set to 
force, after some time the Ceph cluster shows a large number of 
blocked requests, and eventually the timeouts become very large (to the 
point where ESXi aborts the IO due to timeouts). After the abort, the 
blocked/slow request messages disappear. There are no OSD errors. I 
have OSD logs if anyone is interested. 

This does not occur when compression is unset. 
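(For anyone trying to reproduce this: the compression mode is a per-pool setting and can be checked and toggled as shown below; "mypool" is just a placeholder for the pool name.)

ceph osd pool get mypool compression_mode
ceph osd pool set mypool compression_mode force   # none | passive | aggressive | force
ceph osd pool set mypool compression_mode none    # effectively unsets compression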

-- 
Alex Gorbachev 
Storcium 

> 
> Subhachandra 
> 
> On Thu, Mar 1, 2018 at 6:18 AM, David Turner <drakonstein@xxxxxxxxx> wrote: 
>> 
>> With default memory settings, the general rule is 1GB of RAM per 1TB of OSD. If you 
>> have a 4TB OSD, you should plan to have at least 4GB of RAM. This was the 
>> recommendation for filestore OSDs, though it was a bit more memory than they 
>> really needed. From what I've seen, the rule is a little more appropriate with 
>> bluestore now and should still be observed. 
>> 
>> Please note that memory usage in a HEALTH_OK cluster is not the same as the 
>> amount of memory the daemons will use during recovery. I have seen 
>> deployments with 4x the memory usage during recovery. 
>> 
>> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman <stefan@xxxxxx> wrote: 
>>> 
>>> Quoting Caspar Smit (casparsmit@xxxxxxxxxxx): 
>>> > Stefan, 
>>> > 
>>> > How many OSD's and how much RAM are in each server? 
>>> 
>>> Currently 7 OSDs, 128 GB RAM. The max will be 10 OSDs in these servers. 12 
>>> cores (at least one core per OSD). 
>>> 
>>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM 
>>> > right? 
>>> 
>>> Apparently. Sure, they will use more RAM than just the cache to function 
>>> correctly. I figured 3 GB per OSD would be enough ... 
>>> 
>>> > Our bluestore HDD OSDs with bluestore_cache_size at 1G use ~4GB of 
>>> > total RAM. The cache is only part of a bluestore OSD's memory usage. 
>>> 
>>> A factor of 4 is quite high, isn't it? What is all this RAM used for 
>>> besides the cache? RocksDB? 
>>> 
>>> So how should I size the amount of RAM in an OSD server for 10 bluestore 
>>> SSDs in a replicated setup? 
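(Not an authoritative answer, but for reference this is roughly where the per-OSD BlueStore cache gets capped in ceph.conf on 12.2.x; the values below are placeholders rather than a recommendation, and actual RSS will end up noticeably above the cache size:)

[osd]
# BlueStore cache per SSD/NVMe-backed OSD, in bytes (the 12.2.x default is 3 GiB)
bluestore_cache_size_ssd = 3221225472
# or one cap for all OSDs regardless of media
#bluestore_cache_size = 3221225472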
>>> 
>>> Thanks, 
>>> 
>>> Stefan 
>>> 
>>> -- 
>>> | BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351 
>>> | GPG: 0xD14839C6  +31 318 648 688 / info@xxxxxx 






-- 
Kjetil Joergensen <kjetil@xxxxxxxxxxxx> 
SRE, Medallia Inc 

BQ_END




-- 
Kjetil Joergensen <kjetil@xxxxxxxxxxxx> 
SRE, Medallia Inc 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
