Prometheus Node Exporter and cAdvisor seem to run into deadlocks (?) since a change in fs/eventpoll.c

Hi there,

Since applying the recent Ubuntu security release for the 5.4.0 kernel
(5.4.0-134), our Node Exporter and cAdvisor processes have started
acting up when collecting metrics for consumption by Prometheus: what
previously took under a second now takes more than a minute.

A small reproducer in Go that queries netclass data (similar to what
Node Exporter does) is available at [1].
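
For convenience, its rough shape is sketched below. This is a minimal
sketch of my own, not the exact code from [1]; it assumes the netclass
collection essentially boils down to concurrently reading every
attribute file under /sys/class/net/<iface>/ and timing the sweep:

// Minimal sketch of a netclass-style reproducer (not the exact code
// from [1]): one goroutine per interface, each reading all attribute
// files under /sys/class/net/<iface>/, timing each full sweep.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// readIface reads every entry directly under one interface directory.
// Errors (e.g. on subdirectories) are ignored; only the wall-clock
// time of the whole sweep matters here.
func readIface(dir string) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return
	}
	for _, e := range entries {
		os.ReadFile(filepath.Join(dir, e.Name()))
	}
}

func main() {
	ifaces, err := filepath.Glob("/sys/class/net/*")
	if err != nil {
		panic(err)
	}
	for {
		start := time.Now()
		var wg sync.WaitGroup
		for _, dir := range ifaces {
			wg.Add(1)
			go func(d string) {
				defer wg.Done()
				readIface(d)
			}(dir)
		}
		wg.Wait()
		// On an affected kernel, a sweep that normally finishes in
		// well under a second can take more than a minute.
		fmt.Printf("sweep took %v\n", time.Since(start))
	}
}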

Bisecting did not really help (due to the non-deterministic nature of
the bug), but an educated guess in the Node Exporter issue #2500
discussion [2] on GitHub brought up the following commit as the
possible culprit:

commit bcf91619e32fe584ecfafa49a3db3d1db4ff70b2
Author: Benjamin Segall <bsegall@xxxxxxxxxx>
Date:   Wed Jun 15 14:24:23 2022 -0700

    epoll: autoremove wakers even more aggressively

    BugLink: https://bugs.launchpad.net/bugs/1990190

    commit a16ceb13961068f7209e34d7984f8e42d2c06159 upstream.

    If a process is killed or otherwise exits while having active network
    connections and many threads waiting on epoll_wait, the threads will all
    be woken immediately, but not removed from ep->wq.  Then when network
    traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will
    not remove the entries from the list.

    This means that the cost of the wakeup attempt is far higher than usual,
    does not decrease, and this also competes with the dying threads trying to
    actually make progress and remove themselves from the wq.

    Handle this by removing visited epoll wq entries unconditionally, rather
    than only when the wakeup succeeds - the structure of ep_poll means that
    the only potential loss is the timed_out->eavail heuristic, which now can
    race and result in a redundant ep_send_events attempt.  (But only when
    incoming data and a timeout actually race, not on every timeout)

    Shakeel added:

    : We are seeing this issue in production with real workloads and it has
    : caused hard lockups.  Particularly network heavy workloads with a lot
    : of threads in epoll_wait() can easily trigger this issue if they get
    : killed (oom-killed in our case).

    Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@xxxxxxxxxx
    Signed-off-by: Ben Segall <bsegall@xxxxxxxxxx>
    Tested-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
    Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
    Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
    Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
    Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
    Cc: Roman Penyaev <rpenyaev@xxxxxxx>
    Cc: Jason Baron <jbaron@xxxxxxxxxx>
    Cc: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
    Cc: Heiher <r@xxxxxx>
    Cc: <stable@xxxxxxxxxx>
    Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Kamal Mostafa <kamal@xxxxxxxxxxxxx>
    Signed-off-by: Stefan Bader <stefan.bader@xxxxxxxxxxxxx>
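
For what it's worth, the scenario the commit message describes can be
approximated from userspace. The following is a minimal sketch of my
own (it is not from the commit or from [1]; the thread count and the
pipe-based "traffic" are arbitrary choices) that parks many OS threads
in epoll_wait on a single epoll instance via golang.org/x/sys/unix;
the idea is to SIGKILL the process while wakeups are being generated:

// Hedged sketch of the commit's scenario: many OS threads blocked in
// epoll_wait on one epoll instance, with traffic arriving, then the
// whole process gets killed while the waiters are blocked.
package main

import (
	"runtime"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		panic(err)
	}

	// Register the read end of a pipe so there is something to wait on.
	var p [2]int
	if err := unix.Pipe(p[:]); err != nil {
		panic(err)
	}
	ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(p[0])}
	if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		panic(err)
	}

	// Park many OS threads in epoll_wait, mirroring "many threads
	// waiting on epoll_wait" from the commit message above.
	for i := 0; i < 64; i++ {
		go func() {
			runtime.LockOSThread() // dedicate an OS thread to this waiter
			events := make([]unix.EpollEvent, 1)
			for {
				unix.EpollWait(epfd, events, -1)
			}
		}()
	}

	// Keep generating wakeups on the pipe (level-triggered, so the
	// never-reading waiters keep getting woken).
	go func() {
		for {
			unix.Write(p[1], []byte{0})
			time.Sleep(time.Millisecond)
		}
	}()

	// The interesting part is what happens when this process is
	// SIGKILLed while the waiters are blocked.
	select {}
}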

I have reverted this commit in my test environment and haven't been
able to reproduce the issue since.

An alternative workaround seems to be reducing Node Exporter's
concurrency, which suggests that some kind of race condition is
involved.
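
Expressed against the reproducer sketch above, the crudest form of
that workaround I can think of (hypothetical; the exact knob discussed
in [2] may differ) is to pin the Go runtime to a single OS thread:

// Hedged illustration of "reduce concurrency": restrict the Go runtime
// to one OS thread, so far fewer threads sit in epoll_wait at the same
// time. Equivalent to running with GOMAXPROCS=1 set in the environment.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	prev := runtime.GOMAXPROCS(1) // returns the previous setting
	fmt.Printf("GOMAXPROCS lowered from %d to 1\n", prev)
	// ... then run the sweep from the reproducer sketch above ...
}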

There are also reports from other distributions, which suggests that
the issue is more widespread.

My Vagrant-based test environment is available at [3].

Any ideas?

Thanks & kind regards,
Thilo

[1] https://github.com/prometheus/node_exporter/issues/2500#issuecomment-1304847221
[2] https://github.com/prometheus/node_exporter/issues/2500#issuecomment-1322491565
[3] https://github.com/tgbyte/stuck-node-exporter


