Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

On 2024-08-12 19:03, Stanislav Fomichev wrote:
On 08/12, Martin Karsten wrote:
On 2024-08-12 16:19, Stanislav Fomichev wrote:
On 08/12, Joe Damato wrote:
Greetings:

Martin Karsten (CC'd) and I have been collaborating on some ideas about
ways of reducing tail latency when using epoll-based busy poll and we'd
love to get feedback from the list on the code in this series. This is
the idea I mentioned at netdev conf, for those who were there. Barring
any major issues, we hope to submit this officially shortly after RFC.

The basic idea for suspending IRQs in this manner was described in an
earlier paper presented at Sigmetrics 2024 [1].

Let me explicitly call out the paper. Very nice analysis!

Thank you!

[snip]

Here's how it is intended to work:
    - An administrator sets the existing sysfs parameters for
      defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.

    - An administrator sets the new sysfs parameter irq_suspend_timeout
      to a larger value than gro_flush_timeout to enable IRQ suspension.
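
As a rough illustration of the two administrator steps above, the settings could be collected by a small Python helper. This is only a sketch under stated assumptions: napi_defer_hard_irqs and gro_flush_timeout are existing files under /sys/class/net/<dev>/, while the name and location of the irq_suspend_timeout file are assumed from this RFC and may differ in the final series. The helper names are hypothetical, and the timeouts are assumed to be in nanoseconds, as gro_flush_timeout is.

```python
from pathlib import Path


def irq_suspend_settings(dev, defer_hard_irqs, gro_flush_timeout_ns,
                         irq_suspend_timeout_ns):
    """Return {sysfs path: value} for IRQ deferral plus IRQ suspension."""
    # Assumption: IRQ deferral is only considered enabled when both
    # existing parameters are nonzero.
    if defer_hard_irqs <= 0 or gro_flush_timeout_ns <= 0:
        raise ValueError("defer_hard_irqs and gro_flush_timeout must be > 0")
    # The suspend timeout is meant to be larger than gro_flush_timeout.
    if irq_suspend_timeout_ns <= gro_flush_timeout_ns:
        raise ValueError("irq_suspend_timeout must exceed gro_flush_timeout")
    base = Path("/sys/class/net") / dev
    return {
        base / "napi_defer_hard_irqs": defer_hard_irqs,
        base / "gro_flush_timeout": gro_flush_timeout_ns,
        # Assumed path: this file only exists with the RFC applied.
        base / "irq_suspend_timeout": irq_suspend_timeout_ns,
    }


def apply_settings(settings):
    """Write each value to its sysfs file (requires root)."""
    for path, value in settings.items():
        path.write_text(f"{value}\n")
```

For example, irq_suspend_settings("eth0", 100, 200_000, 20_000_000) would describe a setup like the suspend200 case discussed later in the thread, with "eth0" standing in for the actual interface.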

Can you expand on what the problem is with the existing gro_flush_timeout?
Is it defer_hard_irqs_count? Or do you want a separate timeout only for the
prefer_busy_poll case (why?)? Because looking at the first two patches,
you essentially replace all usages of gro_flush_timeout with a new variable
and I don't see how it helps.

gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
deferral mechanism and as such, always active when configured. Its static
periodic softirq processing leads to a situation where:

- A long gro-flush-timeout causes high latencies when load is sufficiently
below capacity, or

- a short gro-flush-timeout causes overhead when softirq execution
asynchronously competes with application processing at high load.

The shortcomings of this are documented (to some extent) by our experiments.
See how defer20 works well at low load but has problems at high load,
while defer200 has higher latency at low load.

irq-suspend-timeout is only active when an application uses
prefer-busy-polling and in that case, produces a nice alternating pattern of
application processing and networking processing (similar to what we
describe in the paper). This then works well with both low and high load.

So you only want it for the prefer-busy-polling case, makes sense. I was
a bit confused by the difference between defer200 and suspend200,
but now I see that defer200 does not enable busypoll.

I'm assuming that if you enable busypoll in the defer200 case, the numbers
should be similar to suspend200 (ignoring the potential effect on
non-busypolling queues due to the higher gro_flush_timeout).

defer200 + napi busy poll is essentially what we labelled "busy" and it does not perform as well, since it still suffers interference between application and softirq processing.

Maybe expand more on what code paths we are trying to improve? Existing
busy polling code is not super readable, so it would be nice to simplify
it a bit in the process (if possible) instead of adding one more tunable.

There are essentially three possible loops for network processing:

1) hardirq -> softirq -> napi poll; this is the baseline functionality

2) timer -> softirq -> napi poll; this is the deferred irq processing scheme
with the shortcomings described above

3) epoll -> busy-poll -> napi poll

If a system is configured for 1), not much can be done, as it is difficult
to interject anything into this loop without adding state and side effects.
This is what we tried for the paper, but it ended up being a hack.

If, however, the system is configured for irq deferral, Loops 2) and 3)
"wrestle" with each other for control. Injecting the larger
irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
of Loop 3) and creates the nice pattern described above.
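
To make the "tilting" concrete, here is a toy model in Python of which timeout would arm the 'timer' of Loop 2) after one pass through the busy-poll loop. It is not kernel code, only a reading of the description in this thread: the function name and the single boolean input are hypothetical simplifications, and the actual patches may decide differently in detail.

```python
def next_timer_ns(busy_poll_found_events, gro_flush_timeout_ns,
                  irq_suspend_timeout_ns):
    """Pick the timeout for the 'timer' of Loop 2) after a busy-poll pass."""
    if busy_poll_found_events:
        # Application is busy: keep IRQs suspended and arm only the long
        # timeout, so that Loop 3) (epoll -> busy-poll) stays in control.
        return irq_suspend_timeout_ns
    # Application went idle: fall back to regular IRQ deferral so that
    # Loop 2) (or eventually Loop 1) picks up network processing again.
    return gro_flush_timeout_ns
```

With gro_flush_timeout at 20,000 and irq_suspend_timeout at 20,000,000, a busy application would see the long timeout armed after each poll that returns events, while an idle one returns to the short deferral timeout right away.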

And you hit (2) when the epoll goes to sleep and/or when the userspace
isn't fast enough to keep up with the timer, presumably? I wonder
if we need to use this opportunity and do a proper API, as Joe hints in the
cover letter. Something over netlink to say "I'm gonna busy-poll on
this queue / napi_id and with this timeout". And then we can essentially make
gro_flush_timeout per queue (and avoid
napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
too hacky already :-(

If someone implemented the necessary changes to make these parameters per-napi, this would improve things further, but note that the current proposal gives strong performance across a range of workloads, which is otherwise difficult or impossible to achieve.

Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of an individual queue or application to make sure that IRQ suspension is enabled/disabled right away when the state of the system changes from busy to idle and back.

[snip]

    - suspendX:
      - set defer_hard_irqs to 100
      - set gro_flush_timeout to X,000
      - set irq_suspend_timeout to 20,000,000
      - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
        busy_poll_budget = 64, prefer_busy_poll = true)
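
For reference, the ioctl step above might look roughly like this from Python. This is a sketch under assumptions: the struct layout and request number mirror struct epoll_params and EPIOCSPARAMS as I understand <linux/eventpoll.h> (Linux 6.9 and later), hard-coded here because the Python standard library does not export them. The request number assumes the common _IOW encoding used on x86 and arm64; please verify against your kernel headers. The helper names are hypothetical.

```python
import fcntl
import struct

# struct epoll_params: u32 busy_poll_usecs, u16 busy_poll_budget,
# u8 prefer_busy_poll, u8 padding (8 bytes in total).
EPOLL_PARAMS_FMT = "=IHBB"
EPOLL_IOC_TYPE = 0x8A


def _iow(ioc_type, nr, size):
    """_IOW() as encoded on x86/arm64 (assumption for other architectures)."""
    return (1 << 30) | (size << 16) | (ioc_type << 8) | nr


EPIOCSPARAMS = _iow(EPOLL_IOC_TYPE, 0x01, struct.calcsize(EPOLL_PARAMS_FMT))


def pack_epoll_params(busy_poll_usecs=0, busy_poll_budget=64,
                      prefer_busy_poll=True):
    """Pack the suspendX busy-poll settings into the ioctl argument."""
    return struct.pack(EPOLL_PARAMS_FMT, busy_poll_usecs, busy_poll_budget,
                       int(prefer_busy_poll), 0)


def enable_prefer_busy_poll(epoll_obj):
    """Apply the settings to an epoll instance, e.g. select.epoll()."""
    fcntl.ioctl(epoll_obj.fileno(), EPIOCSPARAMS, pack_epoll_params())
```

Calling enable_prefer_busy_poll(select.epoll()) would fail with an OSError on kernels without this ioctl, so a real application would want to handle that case.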

What's the intention of `busy_poll_usecs = 0` here? Presumably we fall back
to the busy_poll sysctl value?

Before this patch set, ep_poll only calls napi_busy_poll if busy_poll
(sysctl) or busy_poll_usecs is nonzero. However, this might lead to
busy-polling even when the application does not actually need or want it.
Only one iteration through the busy loop is needed to make the new scheme
work. Any additional napi busy polling beyond that is optional.

Ack, thanks, was trying to understand why not stay with
busy_poll_usecs=64 for consistency. But I guess you were just
trying to show that patch 4/5 works.

Right, and we would potentially be wasting CPU cycles by adding more busy-looping.

Thanks,
Martin



