Re: IPMI related kernel panics since v4.19.286

On Tue, Jun 20, 2023 at 09:56:50AM +0000, Janne Huttunen (Nokia) wrote:
> 
> > It looks like
> > 
> >   b4a34aa6d "ipmi: Fix how the lower layers are told to watch for
> > messages"
> > 
> > was backported to fulfill a dependency for another backport, but
> > there was another change:
> > 
> >   e1891cffd4c4 "ipmi: Make the smi watcher be disabled immediately
> > when not needed"
> > 
> > That is needed to avoid calling a lower layer function with
> > xmit_msgs_lock held.  It doesn't apply completely cleanly because of
> > other changes, but you just need to leave in the free_user_work()
> > function and delete the other function in the conflict.  In addition
> > to that, you will also need:
> > 
> >   383035211c79 "ipmi: move message error checking to avoid deadlock"
> > 
> > to fix a bug in that change.
> > 
> > Can you try this out?
> 
> Yes, sorry for the delay, I had some technical problems testing
> your proposed patches. In the meantime we found out that over
> a dozen of our test servers have had the same crash, some of them
> multiple times since the kernel update.

I don't consider this a delay, it was quite speedy.

> 
> Anyway, with your proposed patches on top of 4.19.286, I couldn't
> trigger the lockdep warning anymore, even on a server that without
> the fixes triggers it very reliably right after boot. I also
> saw on another very similar server (without the fixes) that it
> took almost 17 hours to get even the lockdep warning. Maybe some
> specific BMC behavior affects this or something? Sadly, that kind
> of diminishes the value of the short duration tests, but at least
> there have so far been zero lockdep warnings with the fixes applied.
> The actual lockups are then way too unpredictable to test reliably
> in any kind of short time frame.

It does depend on what you are doing to the driver, but it sounds like
you are running the same software everywhere.  I'm not sure; I've seen
timing do strange things before.

> 
> Anyway, looking at e1891cffd4c4, it touches exactly the code where
> the issue seems to originate, so it makes total sense to me that it
> fixes it. I was already kind of looking at it when you confirmed it.
> Thanks for also pointing out the 383035211c79 patch, it could
> easily have been missed.
> 

Ok, thank you for testing.  I'll prepare a stable kernel request.

-corey


