> On 22.06.2022, at 14:28, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Wed, Jun 22, 2022 at 11:14 AM Peter Lieven <pl@xxxxxxx> wrote:
>>
>> Sent from my iPhone
>>
>>> On 22.06.2022, at 10:35, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>
>>> On Tue, Jun 21, 2022 at 8:52 PM Peter Lieven <pl@xxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> we noticed that some of our long-running VMs (1 year without migration) seem to have a very slow memory leak. Taking a dump of the leaked memory revealed that it seemed to contain OSD and pool information, so we concluded that it must have something to do with crush map updates. We then wrote a test script in our dev environment that constantly takes OSDs out and kicks them back in as soon as all remappings are done.
>>>
>>> Hi Peter,
>>>
>>> How did you determine what memory is being leaked?
>>
>> I found relatively large allocations in the QEMU smaps and checked the contents. They contained several hundred repetitions of OSD and pool names. We use the default builds on Ubuntu 20.04. Is there a special memory allocator in place that might not clean up properly?
>
> Not really a special allocator, but there is something referred to as
> mempools -- an abstraction created to help with fine-grained memory use
> tracking. It is mostly used on the OSD side (various bluestore caches,
> etc.), but also for osdmaps on the client side.

Is it possible to dump mempool statistics to check whether the number of allocations grows indefinitely or whether this is a fragmentation issue? Can the mempool pool sizes be tweaked, or are they hardcoded?

>
>>
>>>
>>>>
>>>> With that script running, the PSS usage of the QEMU process is constantly increasing (main memory of the VM is in hugetlbfs), on the order of about 5 MB/day for a very small dev cluster with approx. 40 OSDs and 5 pools.
>>>>
>>>> We first observed this issue with Nautilus 14.2.22 and then also tried Octopus 15.2.16, where issues like #38403 should have been fixed.
>>>
>>> With the release of 15.2.17 in a few weeks, Octopus would be going
>>> EOL. Given that this is a dev cluster, can you try something more
>>> recent -- preferably Quincy?
>>
>> Yes, I can, as this is only a client issue. But moving to Quincy is not an option for production.
>
> If the issue exists in Quincy, it will get a lot more attention ;)
> We will certainly consider a backport for the upcoming final Octopus
> release if the issue is identified and fixed in time.

I will try it with Quincy ASAP.

Peter

>
> Thanks,
>
> Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
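
For reference, the mempool statistics in question can be dumped through the Ceph admin socket; the dump_mempools command is registered at the CephContext level, so it should also work against a librbd client such as the QEMU process, not just against OSDs. A minimal sketch, assuming a [client] admin socket is enabled in ceph.conf on the hypervisor -- the socket path and client name below are placeholders, not values from this thread:

    # ceph.conf on the hypervisor: let librbd clients create an admin socket
    [client]
        admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

    # dump per-mempool item and byte counts for the running QEMU/librbd client;
    # on the client side the interesting pool is "osdmap"
    ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.140737353.asok dump_mempools

Re-running the dump while the OSD out/in test script cycles should show whether the osdmap pool's items/bytes grow without bound (old osdmaps being retained) or stay flat, which would point at allocator fragmentation instead. As far as the mempool sizes go, the mempools are an accounting layer rather than sized arenas, so there is no per-pool size to tweak.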