For posterity, I've finally solved this issue. It ended up having
nothing to do with the interrupts/tasklets themselves. The driver uses
ioremap() to map some reserved memory, and it seems that from about
2.6.25 onwards ioremap() defaults to ioremap_nocache(), so our driver
was doing its memory operations in the tasklet on uncacheable pages.
Calling ioremap_cache() explicitly in the driver solved the issue
(nice when you can fix a perf regression of 50-100x with a single-line
fix!). Oprofile was of tremendous help in tracking this down.
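
A minimal sketch of the one-line change described above, assuming the
reserved region is ordinary RAM that is safe to map cacheable. The helper
name map_reserved_mem() and its is_plain_ram flag are hypothetical, not
taken from the driver. ioremap_cache() and ioremap_nocache() are the x86
calls from kernels of this era; the stand-ins in the #else branch exist
only so the selection logic can be compiled and exercised outside a
kernel tree.

```c
/*
 * Sketch: state the cache policy explicitly when mapping a region,
 * instead of relying on what a bare ioremap() does on the running kernel.
 */
#ifdef __KERNEL__
#include <linux/io.h>
#else
/* Userspace stand-ins (assumptions for illustration, not kernel code). */
#include <assert.h>
#include <stddef.h>

typedef unsigned long long resource_size_t; /* stand-in for the kernel type */
#define __iomem                             /* sparse annotation, empty here */

enum { STUB_NONE, STUB_CACHED, STUB_UNCACHED };
static int stub_last_policy = STUB_NONE;    /* records which stub was called */
static unsigned char stub_region[4096];     /* pretend mapping target */

static void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
{
	(void)offset;
	assert(size <= sizeof(stub_region));
	stub_last_policy = STUB_CACHED;
	return stub_region;
}

static void __iomem *ioremap_nocache(resource_size_t offset, unsigned long size)
{
	(void)offset;
	assert(size <= sizeof(stub_region));
	stub_last_policy = STUB_UNCACHED;
	return stub_region;
}
#endif

/*
 * Hypothetical helper: returns a mapping of [phys, phys + len) or NULL.
 * RAM set aside for the driver can usually be mapped cacheable; device
 * registers normally need an uncached mapping.
 */
static void __iomem *map_reserved_mem(resource_size_t phys, unsigned long len,
				      int is_plain_ram)
{
	if (len == 0)
		return NULL;
	return is_plain_ram ? ioremap_cache(phys, len)
			    : ioremap_nocache(phys, len);
}
```

In the driver, the fix then amounts to replacing the bare ioremap(phys, len)
call with an explicit cached mapping such as map_reserved_mem(phys, len, 1),
paired with iounmap() on teardown.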
On 09 Sep 2009, at 10:39 AM, Jason Nymble wrote:
Hi,
Background: We use a custom kernel driver module for our PCIe device
which processes bulk data between the host and the card. The card
issues MSI interrupts at up to 20kHz to the host, and the driver
interrupt routine essentially just calls tasklet_schedule() and
returns IRQ_HANDLED, and the work is performed inside the tasklet
routine. This has worked very well for us for the past several
years, with acceptably low overhead on the processor servicing the
interrupts and running the tasklet, using Linux kernel versions from
about 2.6.13 to 2.6.24.
Recent tests on kernels from 2.6.25 to 2.6.30 indicate a serious
regression, however. The CPU core servicing the interrupts/tasklets
shows 100% si usage in top for ksoftirqd, and the driver can
consequently only handle a very small fraction of what it was able
to handle using kernel <=2.6.24 (slowdown of around 50-100x)... Even
when we scale back our interrupt rate to 1kHz, we still see this
poor behavior, and from what we can tell the time isn't actually
spent in our tasklet code itself (not 100% sure of this).
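
The rates above translate into a per-interrupt time budget, which can be
sketched as below. This is plain userspace arithmetic, not driver code. It
assumes one tasklet run per interrupt and nothing else competing for that
core, and the function names are made up for illustration. At 20kHz the
budget is 50 microseconds per interrupt; at 1kHz it is 1000 microseconds,
so a core that is still saturated at 1kHz implies roughly a millisecond or
more of softirq time per interrupt.

```c
#include <assert.h>

/* Microseconds available per interrupt before one core is fully used. */
static double budget_us(double rate_hz)
{
	assert(rate_hz > 0.0);
	return 1e6 / rate_hz;
}

/*
 * Fraction of one core (0.0 to 1.0, capped at 1.0) consumed when each
 * interrupt costs cost_us microseconds at rate_hz interrupts per second.
 */
static double core_load(double rate_hz, double cost_us)
{
	double load = rate_hz * cost_us / 1e6;

	assert(rate_hz > 0.0 && cost_us >= 0.0);
	return load > 1.0 ? 1.0 : load;
}
```

For example, a hypothetical cost of 5 microseconds per interrupt at 20kHz
gives core_load(20000.0, 5.0), about a tenth of one core, while the same
work slowed down 100x would saturate the core at that rate.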
The question is, does anybody know of something that has changed in
kernels >= 2.6.25 that might cause this behavior? I've pored over
changelogs, lwn.net articles, the lwn.net kernel API change lists,
the kernelnewbies kernel changes pages, etc., and cannot find
anything that could explain this behavior.
Any suggestions for ways to track down where the problem lies? I've
tried running kernels with all the debugging+sanity checks enabled,
and they don't report any badness in the driver. My next step is to
get oprofile going and try to determine exactly where that time is
spent. I would _maybe_ have believed it could perhaps be a Linux
kernel bug (e.g. the softirq that handles tasklets somehow not
ending its loop or something) if it only happened on one kernel
version, but it seems to happen on all kernels from 2.6.25 onwards ...
Thanks in advance