Re: OSDs taking too much memory, for buffer_anon

Hi Mark

Thank you! This is 14.2.8, on Ubuntu Bionic. Some hosts run kernel 4.15, some 5.3, but that does not seem to make a difference here. Transparent Huge Pages are not in use, according to
grep -i AnonHugePages /proc/meminfo
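
For completeness, here is the fuller check, and the manual restriction Mark suggests below, assuming the usual sysfs path:

    # current THP mode (the bracketed value is the active one)
    cat /sys/kernel/mm/transparent_hugepage/enabled
    # restrict THP to madvise, effective until the next reboot
    echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled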

Workload is a mix of OpenStack volumes (replicated) and RGW on EC 8+3. EC pool with 1024 PGs, 900M objects.

Around 500 HDD OSDs (4 and 8 TB) and 30 SSD OSDs (2 TB). The maximum number of PGs per OSD is only 123. The HDD OSDs have their DB on SSD, but unfortunately with a bit less than 30 GB each. I have seen 200 GB and more of slow_bytes; compression of the DB seems to help a lot.
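
In case it is useful to someone else, this is how I check the spillover and trigger a manual compaction, with osd.0 as a stand-in:

    # BlueFS usage, including bytes spilled to the slow device
    ceph daemon osd.0 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
    # compact the OSD's RocksDB via the admin socket
    ceph daemon osd.0 compact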

No BlueStore compression.

I had a look at the related thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/JQ72K5LK3YFFETNNL4MX6HHZLF5GBYDT/

Today I saw a correlation that may match your thoughts: during one hour with a high number of write IOPS (not throughput) on the EC pool, available memory decreased drastically.
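
For reference, I watched it roughly like this (the pool name is a placeholder):

    # per-pool client IO rates, to spot the write-heavy phases
    ceph osd pool stats <ec-pool-name>
    # available memory on an OSD host over the same window
    free -m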

Cheers
 Harry

On 20.05.20 15:15, Mark Nelson wrote:
Hi Harald,


Thanks! So you can see from the perf dump that the target bytes are a little below 4 GB, but the mapped bytes are around 7 GB. The priority cache manager has reacted by setting "cache_bytes" to 128 MB, which is the global minimum, and each cache is getting 64 MB (the local minimum per cache). Ultimately this means the priority cache manager has told all of the caches to shrink to their smallest possible values, so it's doing the right thing. The next question is why buffer_anon is so huge. Looking at the mempool stats, there are not that many items, but still a lot of memory used; on average the items in buffer_anon are ~150 KB each. It can't be just buffer_anon though: you've got several gigabytes of mapped memory in use beyond that, and around 4 GB of unmapped memory that tcmalloc should be freeing on every iteration of the priority cache manager.
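
A quick way to cross-check what tcmalloc itself reports, and to explicitly release the unmapped pages (osd.0 as an example):

    # tcmalloc's view of mapped vs. unmapped heap
    ceph tell osd.0 heap stats
    # ask tcmalloc to return unmapped pages to the OS
    ceph tell osd.0 heap release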


So, next questions: what version of Ceph is this, and do you have transparent huge pages enabled? We automatically disable THP now, but if you are running an older version you might want to disable it (or at least set it to madvise) manually. Also, what kind of workload is hitting the OSDs? If you can reliably make the memory grow, you could run a heap profile while the workload is going on and see where the memory is being used.
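
A rough sketch of the heap profiling steps, with osd.0 as an example (the dump location and the pprof binary name may differ on your distribution):

    # start profiling while the suspect workload is running
    ceph tell osd.0 heap start_profiler
    # after a while, write a dump and stop
    ceph tell osd.0 heap dump
    ceph tell osd.0 heap stop_profiler
    # analyze the dump, typically written to the OSD's log directory
    pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap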


Mark


On 5/20/20 7:36 AM, Harald Staub wrote:
Hi Mark

Thank you for your explanations! Some numbers from this example OSD below.

Cheers
 Harry

From dump mempools:

            "buffer_anon": {
                "items": 29012,
                "bytes": 4584503367
            },

From perf dump:

    "prioritycache": {
        "target_bytes": 3758096384,
        "mapped_bytes": 7146692608,
        "unmapped_bytes": 3825983488,
        "heap_bytes": 10972676096,
        "cache_bytes": 134217728
    },
    "prioritycache:data": {
        "pri0_bytes": 0,
        "pri1_bytes": 0,
        "pri2_bytes": 0,
        "pri3_bytes": 0,
        "pri4_bytes": 0,
        "pri5_bytes": 0,
        "pri6_bytes": 0,
        "pri7_bytes": 0,
        "pri8_bytes": 0,
        "pri9_bytes": 0,
        "pri10_bytes": 0,
        "pri11_bytes": 0,
        "reserved_bytes": 67108864,
        "committed_bytes": 67108864
    },
    "prioritycache:kv": {
        "pri0_bytes": 0,
        "pri1_bytes": 0,
        "pri2_bytes": 0,
        "pri3_bytes": 0,
        "pri4_bytes": 0,
        "pri5_bytes": 0,
        "pri6_bytes": 0,
        "pri7_bytes": 0,
        "pri8_bytes": 0,
        "pri9_bytes": 0,
        "pri10_bytes": 0,
        "pri11_bytes": 0,
        "reserved_bytes": 67108864,
        "committed_bytes": 67108864
    },
    "prioritycache:meta": {
        "pri0_bytes": 0,
        "pri1_bytes": 0,
        "pri2_bytes": 0,
        "pri3_bytes": 0,
        "pri4_bytes": 0,
        "pri5_bytes": 0,
        "pri6_bytes": 0,
        "pri7_bytes": 0,
        "pri8_bytes": 0,
        "pri9_bytes": 0,
        "pri10_bytes": 0,
        "pri11_bytes": 0,
        "reserved_bytes": 67108864,
        "committed_bytes": 67108864
    },

On 20.05.20 14:05, Mark Nelson wrote:
Hi Harald,


Any idea what the priority_cache_manager perf counters show? (Or you can also enable debug osd / debug priority_cache_manager.) The OSD memory autotuning works by shrinking the BlueStore and RocksDB caches toward some target value, to try to keep the mapped memory of the process below the osd_memory_target. In some cases it's possible that something other than the caches is using the memory (usually pglog), or that there is a ton of pinned data in the cache that for some reason can't be evicted. Knowing the cache tuning stats might help tell whether it's trying to shrink the caches and can't for some reason, or whether something else is going on.
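
Something like the following should show both, with osd.0 as an example:

    # the priority cache counters, via the admin socket
    ceph daemon osd.0 perf dump prioritycache
    # the mempool breakdown, e.g. to see whether osd_pglog is the culprit
    ceph daemon osd.0 dump_mempools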


Thanks,

Mark



On 5/20/20 6:10 AM, Harald Staub wrote:
As a follow-up to our recent memory problems with OSDs (with high pglog values: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/LJPJZPBSQRJN5EFE632CWWPK3UMGG3VF/#XHIWAIFX4AXZK5VEFOEBPS5TGTH33JZO ), we also see high buffer_anon values, e.g. more than 4 GB with "osd memory target" set to 3 GB. Is there a way to restrict it?
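
For reference, the knobs involved as far as I can see (osd.0 as an example; 3221225472 bytes = 3 GB):

    # the per-daemon view of the autotuning target
    ceph daemon osd.0 config get osd_memory_target
    # set it cluster-wide via the mon config store
    ceph config set osd osd_memory_target 3221225472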

As it is called "anon", I guess it would first be necessary to find out what exactly is behind it?

Well, maybe it is just as Wido said: with lots of small objects, there will be several problems.

Cheers
 Harry
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




