RE: [EXT] Re: [RFC PATCH v3 0/8] Coresight for Kernel panic and watchdog reset

Linu Cherian <lcherian@xxxxxxxxxxx> · Tue, 19 Sep 2023 11:39:27 +0000

Hi James,

> -----Original Message-----
> From: James Clark <james.clark@xxxxxxx>
> Sent: Friday, September 15, 2023 7:20 PM
> To: Linu Cherian <lcherian@xxxxxxxxxxx>; suzuki.poulose@xxxxxxx;
> mike.leach@xxxxxxxxxx; leo.yan@xxxxxxxxxx
> Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; coresight@xxxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; robh+dt@xxxxxxxxxx;
> krzysztof.kozlowski+dt@xxxxxxxxxx; conor+dt@xxxxxxxxxx;
> devicetree@xxxxxxxxxxxxxxx; Sunil Kovvuri Goutham
> <sgoutham@xxxxxxxxxxx>; George Cherian <gcherian@xxxxxxxxxxx>
> Subject: [EXT] Re: [RFC PATCH v3 0/8] Coresight for Kernel panic and
> watchdog reset
> 
> External Email
> 
> ----------------------------------------------------------------------
> 
> 
> On 04/09/2023 06:05, Linu Cherian wrote:
> > This RFC v3 patch series is rebased on v6.5-rc7 and is dependent on
> > the below two patches.
> [...]
> >
> > Steps for reading trace data captured in previous boot
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1. cd /sys/bus/coresight/devices/tmc_etrXX/
> >
> > 2. Change to special mode called, read_prevboot.
> >
> >    #echo 1 > read_prevboot
> >
> > 3. Dump trace buffer data to a file,
> >
> >    #dd if=/dev/tmc_etrXX of=~/cstrace.bin
> 
> Hi Linu,
> 
> I left this comment on V2, but I tested it again and get the same result.
> Instead of linking it I'll just re-paste it here:
> 
> I made a reserved region, but when I run this command I get "Unable to
> handle kernel paging request at virtual address 001f1921ed10ffae".
> 
> Is there an extra step involved if there was no trace captured from a previous
> panic? I thought I'd just be able to read out uninitialised data. Or is it the
> uninitialised metadata that's causing this issue?
> 
> Also that's without KASAN or lockdep turned on. If I have a kernel with either
> of those things I get a different warning for each one. I expect the lockdep
> one would happen even in the working scenario though?

Somehow I missed this comment on V2.

I retried the above steps on my board and I do not see issues either with KASAN OR lockdep enabled configs.
Please see logs below. 

a. Lockdep enabled config
~# cd /sys/bus/coresight/devices/tmc_etr0
tmc_etr0# echo 1 > read_prevboot
tmc_etr0# dd if=/dev/tmc_etr0 of=~/cstrace.bin
12324+1 records in
12324+1 records out
6310032 bytes (6.3 MB, 6.0 MiB) copied, 0.122883 s, 51.3 MB/s

# zcat /proc/config.gz | grep LOCKDEP
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
# CONFIG_DEBUG_LOCKDEP is not set

b. KASAN enabled config
# cd /sys/bus/coresight/devices/tmc_etr0/
tmc_etr0# ls
buf_mode_preferred   connections  power          trigger_cntr
buf_modes_available  enable_sink  read_prevboot  uevent
buffer_size          mgmt         subsystem      waiting_for_supplier
tmc_etr0# echo 1 > read_prevboot
tmc_etr0# dd if=/dev/tmc_etr0 of=~/cstrace.bin
12324+1 records in
12324+1 records out
6310032 bytes (6.3 MB, 6.0 MiB) copied, 0.0940671 s, 67.1 MB/s

~# zcat /proc/config.gz | grep -i kasan
CONFIG_KASAN_SHADOW_OFFSET=0xdfff800000000000
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_SW_TAGS=y
CONFIG_HAVE_ARCH_KASAN_HW_TAGS=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_KASAN_SW_TAGS=y
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
# CONFIG_KASAN_SW_TAGS is not set
# CONFIG_KASAN_HW_TAGS is not set
CONFIG_KASAN_OUTLINE=y
# CONFIG_KASAN_INLINE is not set
CONFIG_KASAN_STACK=y
CONFIG_KASAN_VMALLOC=y
# CONFIG_KASAN_MODULE_TEST is not set

But then I am able to trigger kernel crash with bad metadata(corrupted rwp and rrp) with below stack trace.

[  107.442991]  __arch_copy_to_user+0x180/0x240
[  107.447254]  vfs_read+0xc8/0x2a8
[  107.450476]  ksys_read+0x74/0x110
[  107.453783]  __arm64_sys_read+0x24/0x38
[  107.457611]  invoke_syscall.constprop.0+0x58/0xf8
[  107.462309]  do_el0_svc+0x6c/0x158
[  107.465704]  el0_svc+0x54/0x1c0
[  107.468839]  el0t_64_sync_handler+0x100/0x130
[  107.473188]  el0t_64_sync+0x190/0x198
[  107.476843] Code: d503201f d503201f d503201f d503201f (a8c12027)

Does your stack trace looks similar ? Then its very likely due to bad metadata.
If not, kindly please share yours.

For example, if we have bad values for rwp and rrp, offset can get messed up resulting in above crash.
Will add more validation checks while setting up the prevboot buffer,  so as to avoid processing with bogus metadata values
in the next patch version.

Thanks James for trying this out.

> 
> >
> > 4. Reset back to normal mode
> >
> >    #echo 0 > read_prevboot
> >