Re: sched/debug: Dump end of stack when detected corrupted

Hi Adrian, 

On Tue, Sep 03, 2024 at 06:33:55PM +0200, John Paul Adrian Glaubitz wrote:
> Hi Feng,
> 
> > When debugging a kernel hang during suspend/resume, there were random
> > memory corruptions in different places, such as one detected by the
> > scheduler with the error message:
> > 
> >   "Kernel panic - not syncing: corrupted stack end detected inside scheduler"
> > 
> > Dumping the corrupted memory around the stack end gives more direct
> > hints about how the memory was corrupted:
> > 
> >  "
> >  Corrupted Stack: ff11000122770000: ff ff ff ff ff ff 14 91 82 3b 78 e8 08 00 45 00  .........;x...E.
> >  Corrupted Stack: ff11000122770010: 00 1d 2a ff 40 00 40 11 98 c8 0a ef 30 2c 0a ef  ..*.@.@.....0,..
> >  Corrupted Stack: ff11000122770020: 30 ff a2 00 22 3d 00 09 9a 95 2a 00 00 00 00 00  0..."=....*.....
> >  ...
> >  Kernel panic - not syncing: corrupted stack end detected inside scheduler
> >  "
> > 
> > And with it, the culprit was quickly identified to be an ethernet
> > driver with its DMA operations.
> > 
> > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> > ---
> >  kernel/sched/core.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a795e030678c..1280f7012bc5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5949,8 +5949,18 @@ static noinline void __schedule_bug(struct task_struct *prev)
> >  static inline void schedule_debug(struct task_struct *prev, bool preempt)
> >  {
> >  #ifdef CONFIG_SCHED_STACK_END_CHECK
> > -	if (task_stack_end_corrupted(prev))
> > +	if (task_stack_end_corrupted(prev)) {
> > +		unsigned long *ptr = end_of_stack(prev);
> > +
> > +		/* Dump 16 ulong words around the corruption point */
> > +#ifdef CONFIG_STACK_GROWSUP
> > +		ptr -= 15;
> > +#endif
> > +		print_hex_dump(KERN_ERR, "Corrupted Stack: ",
> > +			DUMP_PREFIX_ADDRESS, 16, 1, ptr, 16 * sizeof(*ptr), 1);
> > +
> >  		panic("corrupted stack end detected inside scheduler\n");
> > +	}
> >  
> >  	if (task_scs_end_corrupted(prev))
> >  		panic("corrupted shadow stack detected inside scheduler\n");
> 
> Have you gotten any feedback on this? Would be nice to get this merged as we're
> seeing crashes due to stack corruption on sparc from time to time and having the
> end of the stack dumped in such cases would make debugging here a bit easier.

Thanks for the review and the feedback! So far I haven't received any
response from the maintainers.

Hi Peter and maintainers,

Could you help review this patch? It makes memory corruption issues like
these easier to debug. Thanks!

There is a v2 which applies to the latest linux-next branch:
https://lore.kernel.org/lkml/20240207143523.438816-1-feng.tang@xxxxxxxxx/
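
For anyone reading along, here is a rough user-space sketch of what the
dump does. It is not the kernel code: stack_end_corrupted() and
dump_stack_end() are hypothetical stand-ins for task_stack_end_corrupted()
and print_hex_dump(), and the row layout only approximates the kernel's
output. It assumes a stack that grows down, so the 16 words start at the
canary word.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Same value as the kernel's STACK_END_MAGIC canary. */
#define STACK_END_MAGIC 0x57AC6E9DUL
/* Number of unsigned long words to dump, as in the patch. */
#define DUMP_WORDS 16

/* Hypothetical helper: nonzero if the canary word was overwritten. */
static int stack_end_corrupted(const unsigned long *stack_end)
{
	return *stack_end != STACK_END_MAGIC;
}

/*
 * Hypothetical helper: dump DUMP_WORDS words starting at stack_end,
 * 16 bytes per row, as hex plus printable ASCII.
 * Returns the number of rows written.
 */
static int dump_stack_end(FILE *out, const unsigned long *stack_end)
{
	const unsigned char *bytes = (const unsigned char *)stack_end;
	size_t len = DUMP_WORDS * sizeof(*stack_end);
	int rows = 0;

	/* Holds for 4-byte and 8-byte unsigned long. */
	assert(len % 16 == 0);

	for (size_t off = 0; off < len; off += 16) {
		fprintf(out, "Corrupted Stack: %p:", (const void *)(bytes + off));
		for (size_t i = 0; i < 16; i++)
			fprintf(out, " %02x", bytes[off + i]);
		fputs("  ", out);
		for (size_t i = 0; i < 16; i++) {
			unsigned char c = bytes[off + i];

			fputc(isprint(c) ? c : '.', out);
		}
		fputc('\n', out);
		rows++;
	}
	return rows;
}
```

To try it, call both helpers from a small main() on an array of 16
unsigned longs whose first element is STACK_END_MAGIC, then overwrite
that element and call them again.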

- Feng



