Hi Umang, On Thu, 28 Mar 2024 at 16:54, Umang Jain <umang.jain@xxxxxxxxxxxxxxxx> wrote: > > > > On 28/03/24 10:19 pm, Stefan Wahren wrote: > > Hi Umang, > > > > Am 28.03.24 um 06:16 schrieb Umang Jain: > >> It turns out that stopping kthreads on vchiq_shutdown_internal() > >> corrupts the vchiq firmware during probe. > >> > >> [ 11.005324] bcm2835_vchiq 3f00b840.mailbox: sync: error: 0: sf > >> unexpected msgid 0@b18db30a,0 > >> > >> Since the kthreads are created during vchiq_probe(), symmetrically > >> they should be stopped on vchiq_remove(). > >> > >> Also, vchiq_shutdown_internal() is called by vchiq_shutdown() which > >> is a exported function. Hence, modules can take indirectly shut the > >> vchiq interface by stopping the kthreads. > >> > >> Fix it by stopping the kthreads in .remove instead. > >> > >> Fixes: d9c60badccc1 ("staging: vc04_services: vchiq_core: Stop > >> kthreads on shutdown") > >> Reported-by: Stefan Wahren <wahrenst@xxxxxxx> > >> Signed-off-by: Umang Jain <umang.jain@xxxxxxxxxxxxxxxx> > >> --- > >> > >> Reproduced on rpi-3-b 32-bit kernel with multi_v7_defconfig and all > >> vchiq configs options. Patch seems to fix the error mentioned in the > >> commit message. > > i tested the patch on my Raspberry Pi 3B+ with multi_v7_defconfig and > > > > CONFIG_BCM_VIDEOCORE=y > > CONFIG_BCM2835_VCHIQ=m > > CONFIG_VCHIQ_CDEV=y > > > > Now X came up, but if i run > > > > modprobe -r vchiq > > > > but i'm still getting > > > > [ 146.730014] bcm2835_vchiq 3f00b840.mailbox: sync: error: 0: sf > > unexpected msgid 0@10b01b8d,0 > > > > so it seems to me stopping those kthreads isn't that trivial (which i > > never expected). Maybe we need to care about the order or an even more > > complex synchronization mechanism between VideoCore firmware and the > > kthreads. > > Dave, Phil - would you happen to have any opinions on this behavior ? > > > > I won't have the time for further investigations during the eastern > > holidays. Maybe we should revert d9c60badccc1 and take the necessary > > time for proper testing. > > It's hard to be sure, but I would guess that stopping the thread is causing the remote_event_wait in sync_func to terminate early. The following code assumes that the message is valid, when it fact it will have a message type of VCHIQ_MSG_PADDING. If that's the case, you can either find some way to spot that the wait has actually failed, or handle VCHIQ_MSG_PADDING explicitly in the switch. Phil