Hi Finn,
On 08/09/21 11:50, Finn Thain wrote:
On Tue, 7 Sep 2021, Michael Schmitz wrote:
Does anyone know what causes this?
Our practice to run interrupt handlers at the IPL of the current
interrupt under service, instead of disabling local interrupts for the
duration of the handler?
Lock contention will happen anyway.
If using spin_trylock() outside of "atomic" context was a bug (really?)
then it can't be used here.
add_interrupt_randomness() relies on interrupts being disabled in
__handle_irq_event_percpu(), so this check assumes atomic context.
Our definition of irqs_disabled() invalidates that assumption.
Perhaps add_interrupt_randomness() should use the lock in irq mode, like
the rest of drivers/char/random.c does.
That might deadlock on SMP when a task reading from the random pool
holding the lock executes on the same CPU as the interrupt.
In the UP case, the spinlock is optimized away, and other users taking
the lock also disable interrupts as you said. That ought to protect
against re-entering _mix_pool_bytes(), unless we're re-entering from
another, higher priority interrupt. We either need to disable interrupts
before entering _mix_pool_bytes() on UP, or treat this spin_trylock() as
real lock operation (not optimized away).
I think this is something that should be discussed with Ted Ts'o.
In a related case, I've managed to swap my 'resume_userspace' format
error for a nice 'illegal instruction' format error apparently caused by
an invalid function pointer in __handle_irq_event_percpu(), just by
disabling all interrupts upon entering the auto_inthandler and
user_inthandler exception handlers. This bug is quite readily reproduced
by running your kernel_coverage.sh script in a loop (panics on the first
stress test on the second pass):
Stress run 2
Logging to stress-ng-20210908-0838.log
./kernel-coverage.sh: line 272: lcov: command not found
running --fork 1 --fork-vm -t 60 --timestamp --no-rand-seed --times
stress-ng: 08:40:08.70 info: [1914] setting to a 60 second run per stressor
stress-ng: 08:40:08.82 info: [1914] dispatching hogs: 1 fork
packet_write_wait: Connection to 10.1.1.4 port 22: Broken pipe
Why disabling interrupts during interrupt processing would make matters
worse doesn't make any sense to me...
Cheers,
Michael