On Wed, Apr 15, 2020 at 2:31 PM 郭彬 <anole1949@xxxxxxxxx> wrote:
> I'm running a batch of CoreOS boxes; the lsb_release is:
>
> ```
> # cat /etc/lsb-release
> DISTRIB_ID="Container Linux by CoreOS"
> DISTRIB_RELEASE=2303.3.0
> DISTRIB_CODENAME="Rhyolite"
> DISTRIB_DESCRIPTION="Container Linux by CoreOS 2303.3.0 (Rhyolite)"
> ```
>
> ```
> # uname -a
> Linux cloud-worker-25 4.19.86-coreos #1 SMP Mon Dec 2 20:13:38 -00 2019
> x86_64 Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz GenuineIntel GNU/Linux
> ```
>
> Recently, I found my VMs constantly being killed due to OOM, and after
> digging into the problem, I finally realized that the kernel is leaking
> memory.
>
> Here's my slabinfo:
>
> ```
> # slabtop --sort c -o
>  Active / Total Objects (% used)    : 739390584 / 740008326 (99.9%)
>  Active / Total Slabs (% used)      : 11594275 / 11594275 (100.0%)
>  Active / Total Caches (% used)     : 105 / 129 (81.4%)
>  Active / Total Size (% used)       : 47121380.33K / 47376581.93K (99.5%)
>  Minimum / Average / Maximum Object : 0.01K / 0.06K / 8.00K
>
>       OBJS    ACTIVE  USE OBJ SIZE     SLABS OBJ/SLAB CACHE SIZE NAME
>  734506368 734506368 100%    0.06K  11476662       64  45906648K ebitmap_node
> [...]
> ```
>
> You can see that `ebitmap_node` is over 40GB and still growing. The
> only thing I can do is reboot the OS, but there are tens of these boxes
> and lots of workloads running on them, so I can't just reboot whenever I
> want. I've run out of options - any help?

Pasting in relevant comments/questions from [1]:

2. Your kernel seems to be quite behind the current upstream and is
probably maintained by your distribution (it seems to be derived from the
4.19 stable branch). Can you reproduce the issue on a more recent kernel
(at least 5.5+)? If you can't, or if the recent kernel doesn't exhibit the
issue, then you should report this to your distribution.

3. Was this working fine with some earlier kernel? If you can determine
the last working version, it could help us identify the cause and/or the
fix.

On top of that, I realized one more thing - the kernel merges the caches
for objects of the same size, so any cache with an object size of 64 bytes
will be accounted under 'ebitmap_node' here. For example, on my system
there are several caches that all alias to the common 64-byte cache:

# ls -l /sys/kernel/slab/ | grep -- '-> :0000064'
lrwxrwxrwx. 1 root root 0 apr 15 15:26 dmaengine-unmap-2 -> :0000064
lrwxrwxrwx. 1 root root 0 apr 15 15:26 ebitmap_node -> :0000064
lrwxrwxrwx. 1 root root 0 apr 15 15:26 fanotify_event -> :0000064
lrwxrwxrwx. 1 root root 0 apr 15 15:26 io -> :0000064
lrwxrwxrwx. 1 root root 0 apr 15 15:26 iommu_iova -> :0000064
lrwxrwxrwx. 1 root root 0 apr 15 15:26 jbd2_inode -> :0000064
lrwxrwxrwx. 1 root root 0 apr 15 15:26 ksm_rmap_item -> :0000064
lrwxrwxrwx. 1 root root 0 apr 15 15:26 ksm_stable_node -> :0000064
lrwxrwxrwx. 1 root root 0 apr 15 15:26 vmap_area -> :0000064

On your kernel you might get a different list, but any of the caches you
get could be the culprit; ebitmap_node is just one of the possibilities.
You can disable this merging by adding "slab_nomerge" to your kernel boot
command line (one way of doing this is sketched after this message). That
will allow you to identify which cache is really the source of the leak.

[1] https://github.com/SELinuxProject/selinux/issues/220#issuecomment-613944748

--
Ondrej Mosnacek <omosnace at redhat dot com>
Software Engineer, Security Technologies
Red Hat, Inc.
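
For reference, a minimal sketch of applying the "slab_nomerge" suggestion
above. It assumes a conventional GRUB 2 setup with /etc/default/grub;
Container Linux by CoreOS manages kernel arguments differently (through the
GRUB configuration on its OEM partition), so the exact paths and commands
here are assumptions, not a definitive recipe:

```
# Check which parameters the running kernel was booted with.
cat /proc/cmdline

# Assumed GRUB 2 layout: append slab_nomerge to the default kernel command
# line and regenerate the boot configuration.
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 slab_nomerge"/' /etc/default/grub
sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # or `sudo update-grub` on Debian/Ubuntu
sudo reboot

# After the reboot, caches are no longer merged, so the 64-byte caches
# should no longer alias a common cache and the growing one stands out.
grep slab_nomerge /proc/cmdline
ls -l /sys/kernel/slab/ | grep -- '-> :0000064'   # should now print nothing
sudo slabtop --sort c -o | head -20
```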