Re: Mainline kernel crashes, was Re: RFC: remove set_fs for m68k

Hi Finn,

On 16/09/21 21:04, Finn Thain wrote:
On Wed, 15 Sep 2021, Michael Schmitz wrote:

On 15/09/21 13:38, Finn Thain wrote:
On Mon, 13 Sep 2021, Michael Schmitz wrote:

Incidentally - have you ever checked whether Al Viro's signal
handling fixes have an impact on these bugs?

I will try that patch series if you think it is related.

Initial tests look promising (but I've said that before).

Here's what I found in recent tests on my Quadra 630.

The usual stress-ng panic can happen without list corruption, even
with local_irq_save/restore() added to do_IRQ().

The panic did not show up at all during stress tests with Al's signal
handling patch series.

I think my results are consistent with yours.

Thanks - that's encouraging to hear. My tests with Christoph's patches
on top of Al's haven't shown any further errors either, but I'll give
that combination some more workout.

Further stress testing here using Al's patches did eventually result in
the same panic that I see using mainline (below).

That's bad - there's another bug lurking in the exception return code, it seems. Not a regression though.



Would you care to add your tested-by for Al's patches?

Sure. I haven't seen any regression, so
Tested-by: Finn Thain <fthain@xxxxxxxxxxxxxx>

---
running --mmap -1 --mmap-osync --mmap-bytes 100% -t 60 --timestamp --no-rand-seed --times
stress-ng: 22:52:11.63 info:  [5491] setting to a 60 second run per stressor
stress-ng: 22:52:11.64 info:  [5491] dispatching hogs: 1 mmap
[ 9858.090000] Kernel panic - not syncing: Aiee, killing interrupt handler!

That one's from do_exit(), right at the start. Can you instrument that to print the hardirq and softirq counts separately?
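To illustrate what I mean by separate counts: a minimal user-space sketch of how the two fields can be picked out of a preempt_count value. The field widths and shifts are my assumption, modelled on include/linux/preempt.h, so please check them against your tree. The names split_preempt_count() and report_irq_counts() are made up for this example. In the kernel itself, printing hardirq_count() and softirq_count() just before the in_interrupt() test should give the same information (unshifted).

```c
#include <assert.h>
#include <stdio.h>

/* Assumed field layout, modelled on include/linux/preempt.h. Verify locally. */
#define PREEMPT_BITS	8
#define SOFTIRQ_BITS	8
#define HARDIRQ_BITS	4

#define PREEMPT_SHIFT	0
#define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
#define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)

#define FIELD_MASK(bits, shift)	(((1UL << (bits)) - 1) << (shift))
#define SOFTIRQ_MASK	FIELD_MASK(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
#define HARDIRQ_MASK	FIELD_MASK(HARDIRQ_BITS, HARDIRQ_SHIFT)

/* Catch a wrong layout assumption at compile time. */
static_assert(SOFTIRQ_MASK == 0x0000ff00UL, "unexpected softirq mask");
static_assert(HARDIRQ_MASK == 0x000f0000UL, "unexpected hardirq mask");

/* Hypothetical helper type: the two fields, shifted down to bit 0. */
struct irq_counts {
	unsigned long hardirq;
	unsigned long softirq;
};

/* Split a raw preempt_count value into its hardirq and softirq fields. */
static struct irq_counts split_preempt_count(unsigned long pc)
{
	struct irq_counts c = {
		.hardirq = (pc & HARDIRQ_MASK) >> HARDIRQ_SHIFT,
		.softirq = (pc & SOFTIRQ_MASK) >> SOFTIRQ_SHIFT,
	};
	return c;
}

/* Print both fields on one line, the way a debug printk might. */
static void report_irq_counts(unsigned long pc)
{
	struct irq_counts c = split_preempt_count(pc);

	printf("preempt_count %08lx: hardirq %lu softirq %lu\n",
	       pc, c.hardirq, c.softirq);
}
```

Seeing which of the two fields is non-zero at the time of the panic would tell us whether we are looking at a leaked hardirq count or a leaked softirq count.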

[ 9858.090000] CPU: 0 PID: 5493 Comm: stress-ng Not tainted 5.14.0-multi-00003-gb2406d5d331a #7
[ 9858.090000] Stack from 00b4bde4:
[ 9858.090000]         00b4bde4 00488d5f 00488d5f 00040000 00b4be00 003f3630 00488d5f 00b4be20
[ 9858.090000]         003f2636 00040000 418004fc 00b4a000 009f8540 00b4a000 00a07440 00b4be5c
[ 9858.090000]         0003171e 00480965 00000009 418004fc 00b4a000 00000000 073f8000 00000009
[ 9858.090000]         00000008 00b4bf38 00a07440 00000006 00000000 00000001 00b4be6c 000318d4
[ 9858.090000]         00000009 01438f30 00b4beb8 0003ac18 00000009 0000000f 0000000e c043c000
[ 9858.090000]         00000000 073f8000 00000003 00b4bf98 eff82944 eff818a8 00039a22 00b4a000
[ 9858.090000] Call Trace: [<00040000>] rcu_free_pwq+0x1c/0x1e
[ 9858.090000]  [<003f3630>] dump_stack+0x10/0x16
[ 9858.090000]  [<003f2636>] panic+0xba/0x2bc
[ 9858.090000]  [<00040000>] rcu_free_pwq+0x1c/0x1e
[ 9858.090000]  [<0003171e>] do_exit+0x87e/0x9d6

That offset into do_exit() does not make sense to me - in my version, that's beyond the end of do_exit(). Does this correspond to the in_interrupt() test in do_exit() in your image?

[ 9858.090000]  [<000318d4>] do_group_exit+0x28/0xb6
[ 9858.090000]  [<0003ac18>] get_signal+0x126/0x720
[ 9858.090000]  [<00039a22>] send_signal+0xde/0x16e
[ 9858.090000]  [<00004f0c>] do_notify_resume+0x38/0x5dc
[ 9858.090000]  [<0003aad2>] force_sig_fault_to_task+0x36/0x3a
[ 9858.090000]  [<0003aaee>] force_sig_fault+0x18/0x1c
[ 9858.090000]  [<00007450>] send_fault_sig+0x44/0xc6
[ 9858.090000]  [<000069be>] buserr_c+0x2c8/0x6a2
[ 9858.090000]  [<00002cd8>] do_signal_return+0x10/0x1a

That address corresponds to RESTORE_SWITCH_STACK in my version. We don't get there in interrupt context unless it's the only interrupt on the kernel stack.

This is after do_notify_resume() which would have called setup_frame() in case there was a signal pending (which we can pretty much assume here, unless you're tracing stress-ng).

I can't see anything in do_signal() and its call chain that would cause our stack pointer to change upon return from do_notify_resume() ...

Could you add code to do_notify_resume() that compares the 'regs' argument upon entry and return, and prints both if there is a mismatch?
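Roughly what I have in mind, as a user-space sketch only. struct fake_regs is a hypothetical stand-in for the m68k struct pt_regs (the real layout differs), and checked_resume() and the two sample handlers are names made up for this example. I am assuming here that the contents are of interest, not just the pointer. Signal delivery is expected to rewrite some fields, so a mismatch is a hint and not proof of corruption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct pt_regs; the real m68k layout differs. */
struct fake_regs {
	unsigned long d0;
	unsigned long sr;
	unsigned long pc;
};

typedef void (*resume_fn)(struct fake_regs *regs);

static void print_regs(const char *when, const struct fake_regs *r)
{
	printf("%s: d0 %08lx sr %08lx pc %08lx\n", when, r->d0, r->sr, r->pc);
}

/*
 * Snapshot the registers on entry, run the handler, compare on return.
 * Prints both copies and returns 1 on a mismatch, 0 otherwise.
 */
static int checked_resume(struct fake_regs *regs, resume_fn fn)
{
	struct fake_regs saved;

	assert(regs != NULL && fn != NULL);
	saved = *regs;			/* copy taken on entry */
	fn(regs);

	/* All members are unsigned long, so there is no padding to confuse memcmp. */
	if (memcmp(&saved, regs, sizeof(saved)) == 0)
		return 0;

	print_regs("entry ", &saved);
	print_regs("return", regs);
	return 1;
}

/* Sample handler that leaves the registers alone. */
static void handler_noop(struct fake_regs *regs)
{
	(void)regs;
}

/* Sample handler that changes the saved pc, as signal setup would. */
static void handler_clobber_pc(struct fake_regs *regs)
{
	regs->pc += 4;
}
```

Calling checked_resume() with handler_noop reports nothing, while handler_clobber_pc makes it print the entry and return copies. In do_notify_resume() the same pattern would be a local copy taken at the top and a comparison before returning.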

I know, grasping at straws again ...

Cheers,

	Michael


[ 9858.090000]  [<0018800e>] ext4_htree_fill_tree+0x154/0x32a
[ 9858.090000]  [<0010800a>] d_path+0x86/0x114
[ 9858.090000]
[ 9858.090000] ---[ end Kernel panic - not syncing: Aiee, killing interrupt handler! ]---



