A monitoring dashboard caught my attention when it displayed weird spikes
in computed interrupt rates. A machine under constant network load shows
about 250k interrupts per second, then every 4-5 hours there is a one-off
spike to 11 billion. Turns out, if you plot the interrupt counter, you get
a graph like this:

  [ASCII graph: the counter climbs steadily, then periodically drops by
   exactly 2^32 and resumes climbing]

While monitoring tools are typically equipped to handle counter
wrap-arounds, they may not be ready to handle dips like this.

What is the impact
------------------

Not much, actually. The counters always decrement by exactly 2^32 (which
is suggestive), so if you mask out the high bits of the counter and
consider only the low 32 bits, the value sequence actually makes sense,
given an appropriate sampling rate.

However, if you don't mask out the value and assume it to be accurate --
well, that assumption is incorrect. Interrupt sums might look correct and
contain some big number, but it could be arbitrarily distant from the
actual number of interrupts serviced since boot.

This concerns only the total value of the "intr" and "softirq" rows:

  intr 14390913189 32 11 0 0 238 0 0 0 0 0 0 0 88 0 [...]
  softirq 14625063745 0 596000256 300149 272619841 0 0 [...]
          ^^^^^^^^^^^ these ones

Why this happens
----------------

The reason for this behavior is that the "total" interrupt counters
presented by /proc/stat are actually computed by adding up per-interrupt
per-CPU counters. Most of these are "unsigned int", while some of them
are "unsigned long", and the accumulator is "u64". What a mess...

Individual counters are monotonically increasing (modulo wrapping);
however, if you add up multiple values with different bit widths, the sum
is *not* guaranteed to be monotonically increasing.

What can be done
----------------

1. Do nothing.
   Userspace can trivially compensate for this "curious" behavior by
   masking out the high bits, observing only the low sizeof(unsigned)
   part, and taking care to handle wrap-arounds. This maintains the
   status quo, but the "issue" of interrupt sums not being quite
   accurate remains.

2. Change the presentation type to the lowest common denominator, that
   is, unsigned int.

   Make the kernel mask out the not-quite-accurate bits from the value
   it reports, and keep it that way until every underlying counter type
   is changed to something wider.

   The benefit here is that users that *are* ready to handle proper
   wrap-arounds will be able to handle them automatically, without
   undocumented hacks (see option 1).

   This changes the observed value and will cause "unexpected"
   wrap-arounds to happen earlier in some use cases, which might upset
   users that are not ready to handle them, or don't want to poll
   /proc/stat more frequently. It's debatable what's better: a narrower
   value that might need to be polled more often, or a wider value that
   is not completely accurate.

3. Change the interrupt counter types to be wider.

   A different take on the issue: instead of narrowing the presentation
   from faux-u64 to unsigned int, widen the interrupt counters from
   unsigned int to... something else:

   - u64: interrupt counters are 64-bit everywhere, period
   - unsigned long: interrupt counters are 64-bit if the platform
     thinks that "long" is longer than "int"

   Whatever type is used, it must be the same for all interrupt
   counters across the kernel, as well as the type used to compute and
   display the sum of all these counters in /proc/stat.

   The advantage here is that 64-bit counters will probably be enough
   for *anything* not to overflow before the heat death of the
   universe, making the wrap-around problem irrelevant.

   The disadvantage here is that some hardware counters are 32-bit, and
   you can't make them wider.
   Some platforms also don't have proper atomic support for 64-bit
   integers, making wider counters problematic to implement
   efficiently.

So what do we do?
-----------------

I suggest wrapping the interrupt counter sum at "unsigned int", the same
type used for (most) individual counters. That makes for the most
predictable behavior.

I have a patch set cooking that does this. Would this be of any interest?
Or do you think changing the behavior of /proc/stat will cause more
trouble than it's worth?

Prior discussion
----------------

This question is by no means new; it has been discussed several times:

2019 - "genirq, proc: Speedup /proc/stat interrupt statistics"

  The issue of overflow and wrap-around was touched upon, with the
  suggestion that userspace should just deal with it. Using u64 for the
  sum was brought up too, but it did not go anywhere.

  https://lore.kernel.org/all/20190208143255.9dec696b15f03bf00f4c60c2@xxxxxxxxxxxxxxxxxxxx/
  https://lore.kernel.org/all/3460540b50784dca813a57ddbbd41656@xxxxxxxxxxxxxxxx/

2014 - "Why do we still have 32 bit counters? Interrupt counters
       overflow within 50 days"

  A discussion of whether it is appropriate to bump the counter width to
  64 bits in order to avoid the overflow issue entirely.

  https://lore.kernel.org/lkml/alpine.DEB.2.11.1410030435260.8324@xxxxxxxxxx/