Re: OSDs taking too much memory, for buffer_anon

Hi Mark, Hi all

We are still experiencing issues with our cluster, which has 650 OSDs and runs 14.2.8. Recently we deleted 900M objects from an EC RGW pool, which went pretty smoothly with a self-written script to speed up the deletion process (it took about 10 days; with the radosgw-admin command it would have taken months). This gives us a bit more flexibility for our maintenance work, and cleanup processes won't run for weeks anymore. We have seen slow_bytes DB usage of up to 850 GB, which is insane. After the deletions, we ran an offline compaction on all OSDs with:

    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$OSD compact
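
In case it is useful to anyone, here is a rough sketch of how this can be looped over a whole host (assuming systemd-managed OSDs; each OSD has to be stopped before the offline compaction, and it is probably wise to wait for the cluster to become healthy again between OSDs):

    for OSD in $(ls /var/lib/ceph/osd | sed 's/^ceph-//'); do
        systemctl stop ceph-osd@$OSD
        ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$OSD compact
        systemctl start ceph-osd@$OSD
        # wait until the cluster is healthy again before touching the next OSD
        while ! ceph health | grep -q HEALTH_OK; do sleep 30; done
    done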

That helped a lot, and the DB size decreased to "normal" levels. Nevertheless, we still have the problem that buffer_anon uses too much memory. We regularly run into OOM-killed OSDs, which leads to outages. Normally we are able to bring the OSDs back in, but with a lot of effort. In more detail, I see OSDs with 13 GB of buffer_anon:

ceph daemon osd.31 dump_mempools
{
    "mempool": {
        "by_pool": {
...
            "buffer_anon": {
                "items": 15453,
                "bytes": 13269100576
            },
...
        },
        "total": {
            "items": 5033040,
            "bytes": 13423704543
        }
    }
}
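
For anyone who wants an overview per host, something like this should work (just a quick sketch, assuming jq is installed and the admin sockets are in the default /var/run/ceph location):

    for SOCK in /var/run/ceph/ceph-osd.*.asok; do
        ID=${SOCK##*ceph-osd.}; ID=${ID%.asok}
        BYTES=$(ceph daemon osd.$ID dump_mempools | jq '.mempool.by_pool.buffer_anon.bytes')
        echo "osd.$ID buffer_anon: $BYTES bytes"
    done
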
Currently we have about 120M objects, of which 26.3M are in the EC 8+3 pool, which is nothing compared to the 1 billion objects we had previously. So I don't see a reason for the huge buffer_anon.

Is there any good advice on how we can reduce buffer_anon? We would really appreciate any input, because we are running out of good ideas.
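
As a possible stopgap (not a fix), would it help to ask tcmalloc to give freed pages back to the OS? Something like:

    ceph tell osd.31 heap stats
    ceph tell osd.31 heap release

If I understand it correctly, that only affects memory tcmalloc has already freed internally, so it probably won't shrink buffer_anon itself (those are live allocations), but it might buy some headroom against the OOM killer.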

Thanks in advance! :)

Kind regards,
Michael

On 26.05.20, 17:17, "Mark Nelson" <mnelson@xxxxxxxxxx> wrote:

    Hi Harald,


    Yeah, I suspect your issue is definitely related to what Adam has been 
    investigating. FWIW, we are talking about re-introducing a periodic trim 
    in Adam's PR here:


    https://github.com/ceph/ceph/pull/35171


    That should help on the memory growth side, but if we still have objects 
    using huge amounts of memory for metadata (1MB+) it will thrash the 
    onode cache and make everything slow.  Ultimately we still need to find 
    the root cause of the large per-object buffer_anon memory usage to 
    really fix things.


    Mark


    On 5/25/20 12:25 PM, Harald Staub wrote:
    > Hi Mark
    >
    > Thank you! This is 14.2.8, on Ubuntu Bionic. Some with kernel 4.15, 
    > some with 5.3, but that does not seem to make a difference here. 
    > Transparent Huge Pages are not used according to
    > grep -i AnonHugePages /proc/meminfo
    >
    > Workload is a mix of OpenStack volumes (replicated) and RGW on EC 8+3. 
    > EC pool with 1024 PGs, 900M objects.
    >
    > Around 500 hdd OSDs (4 and 8 TB), 30 ssd OSDs (2 TB). The maximum 
    > number of PGs per OSD is only 123. The hdd OSDs have DB on SSD, but a 
    > bit less than 30 GB unfortunately. I have seen 200 GB and more 
    > slow_bytes, compression of the DB seems to help a lot.
    >
    > No BlueStore compression.
    >
    > I had a look at the related thread:
    > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/JQ72K5LK3YFFETNNL4MX6HHZLF5GBYDT/ 
    >
    >
    > Today I saw a correlation that may match your thoughts. During 1 hour 
    > with a high number of write IOPS (not throughput) on the EC pool, 
    > available memory increased drastically.
    >
    > Cheers
    >  Harry
    >
    > On 20.05.20 15:15, Mark Nelson wrote:
    >> Hi Harald,
    >>
    >>
    >> Thanks!  So you can see from the perf dump that the target bytes are 
    >> a little below 4GB, but the mapped bytes are around 7GB. The priority 
    >> cache manager has reacted by setting the "cache_bytes" to 128MB which 
    >> is the minimum global value and each cache is getting 64MB (the local 
    >> minimum value per cache). Ultimately this means the priority cache 
    >> manager has basically told all of the caches to shrink to their 
    >> smallest possible values so it's doing the right thing.  So the next 
    >> question is why buffer_anon is so huge. Looking at the mempool stats, 
    >> there are not that many items but still a lot of memory used.  On 
    >> average those items in buffer_anon are ~150K.  It can't be just 
    >> buffer anon though, you've got several gigabytes of mapped memory 
    >> being used beyond that and around 4GB of unmapped memory that 
    >> tcmalloc should be freeing every iteration of the priority cache 
    >> manager.
    >>
    >>
    >> So next questions:  What version of Ceph is this, and do you have 
    >> transparent huge pages enabled? We automatically disable it now, but 
    >> if you are running an older version you might want to disable (or at 
    >> least set it to madvise) manually.  Also, what kind of workload is 
    >> hitting the OSDs?  If you can reliably make it grow you could try 
    >> doing a heap profile at the same time the workload is going on and 
    >> see if you can see where the memory is being used.
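    >>
    >> For example, something along these lines (a rough sketch; exact paths 
    >> and profile file names can differ depending on the setup):
    >>
    >>     # check transparent huge pages, and set to madvise (as root)
    >>     cat /sys/kernel/mm/transparent_hugepage/enabled
    >>     echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    >>
    >>     # heap-profile one OSD while the workload is running
    >>     ceph tell osd.<id> heap start_profiler
    >>     # ... let the workload run for a while, then ...
    >>     ceph tell osd.<id> heap dump
    >>     ceph tell osd.<id> heap stop_profiler
    >>     # the resulting .heap files end up in the OSD's log directory and
    >>     # can be inspected with pprof / google-pprof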
    >>
    >>
    >> Mark
    >>
    >>
    >> On 5/20/20 7:36 AM, Harald Staub wrote:
    >>> Hi Mark
    >>>
    >>> Thank you for your explanations! Some numbers from this example OSD below.
    >>>
    >>> Cheers
    >>>  Harry
    >>>
    >>> From dump mempools:
    >>>
    >>>             "buffer_anon": {
    >>>                 "items": 29012,
    >>>                 "bytes": 4584503367
    >>>             },
    >>>
    >>> From perf dump:
    >>>
    >>>     "prioritycache": {
    >>>         "target_bytes": 3758096384,
    >>>         "mapped_bytes": 7146692608,
    >>>         "unmapped_bytes": 3825983488,
    >>>         "heap_bytes": 10972676096,
    >>>         "cache_bytes": 134217728
    >>>     },
    >>>     "prioritycache:data": {
    >>>         "pri0_bytes": 0,
    >>>         "pri1_bytes": 0,
    >>>         "pri2_bytes": 0,
    >>>         "pri3_bytes": 0,
    >>>         "pri4_bytes": 0,
    >>>         "pri5_bytes": 0,
    >>>         "pri6_bytes": 0,
    >>>         "pri7_bytes": 0,
    >>>         "pri8_bytes": 0,
    >>>         "pri9_bytes": 0,
    >>>         "pri10_bytes": 0,
    >>>         "pri11_bytes": 0,
    >>>         "reserved_bytes": 67108864,
    >>>         "committed_bytes": 67108864
    >>>     },
    >>>     "prioritycache:kv": {
    >>>         "pri0_bytes": 0,
    >>>         "pri1_bytes": 0,
    >>>         "pri2_bytes": 0,
    >>>         "pri3_bytes": 0,
    >>>         "pri4_bytes": 0,
    >>>         "pri5_bytes": 0,
    >>>         "pri6_bytes": 0,
    >>>         "pri7_bytes": 0,
    >>>         "pri8_bytes": 0,
    >>>         "pri9_bytes": 0,
    >>>         "pri10_bytes": 0,
    >>>         "pri11_bytes": 0,
    >>>         "reserved_bytes": 67108864,
    >>>         "committed_bytes": 67108864
    >>>     },
    >>>     "prioritycache:meta": {
    >>>         "pri0_bytes": 0,
    >>>         "pri1_bytes": 0,
    >>>         "pri2_bytes": 0,
    >>>         "pri3_bytes": 0,
    >>>         "pri4_bytes": 0,
    >>>         "pri5_bytes": 0,
    >>>         "pri6_bytes": 0,
    >>>         "pri7_bytes": 0,
    >>>         "pri8_bytes": 0,
    >>>         "pri9_bytes": 0,
    >>>         "pri10_bytes": 0,
    >>>         "pri11_bytes": 0,
    >>>         "reserved_bytes": 67108864,
    >>>         "committed_bytes": 67108864
    >>>     },
    >>>
    >>> On 20.05.20 14:05, Mark Nelson wrote:
    >>>> Hi Harald,
    >>>>
    >>>>
    >>>> Any idea what the priority_cache_manager perf counters show? (or you 
    >>>> can also enable debug osd / debug priority_cache_manager) The osd 
    >>>> memory autotuning works by shrinking the bluestore and rocksdb 
    >>>> caches to some target value to try and keep the mapped memory of 
    >>>> the process below the osd_memory_target.  In some cases it's 
    >>>> possible that something other than the caches is using the memory 
    >>>> (usually pglog) or there's tons of pinned stuff in the cache that 
    >>>> for some reason can't be evicted. Knowing the cache tuning stats 
    >>>> might help tell if it's trying to shrink the caches and can't for 
    >>>> some reason or if there's something else going on.
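    >>>>
    >>>> For example, via the admin socket (a quick sketch, assuming jq):
    >>>>
    >>>>     ceph daemon osd.<id> perf dump | \
    >>>>         jq '.prioritycache, ."prioritycache:data", ."prioritycache:kv", ."prioritycache:meta"'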
    >>>>
    >>>>
    >>>> Thanks,
    >>>>
    >>>> Mark
    >>>>
    >>>>
    >>>>
    >>>> On 5/20/20 6:10 AM, Harald Staub wrote:
    >>>>> As a follow-up to our recent memory problems with OSDs (with high 
    >>>>> pglog values: 
    >>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/LJPJZPBSQRJN5EFE632CWWPK3UMGG3VF/#XHIWAIFX4AXZK5VEFOEBPS5TGTH33JZO 
    >>>>> ), we also see high buffer_anon values. E.g. more than 4 GB, with 
    >>>>> "osd memory target" set to 3 GB. Is there a way to restrict it?
    >>>>>
    >>>>> As it is called "anon", I guess that it would first be necessary 
    >>>>> to find out what exactly is behind this?
    >>>>>
    >>>>> Well maybe it is just as Wido said, with lots of small objects, 
    >>>>> there will be several problems.
    >>>>>
    >>>>> Cheers
    >>>>>  Harry

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



