On Tue, Aug 22, 2023 at 7:10 AM Shreeya Patel <shreeya.patel@xxxxxxxxxxxxx> wrote: > > Hi Robin, > > On 21/08/23 18:09, Robin Murphy wrote: > > On 2023-08-21 12:35, Shreeya Patel wrote: > >> > >> On 19/08/23 01:49, Saravana Kannan wrote: > >>> On Thu, Aug 17, 2023 at 4:13 PM Shreeya Patel > >>> <shreeya.patel@xxxxxxxxxxxxx> wrote: > >>>> Hi Geert, Saravana, > >>>> > >>>> On 18/08/23 00:03, Saravana Kannan wrote: > >>>>> On Thu, Aug 17, 2023 at 4:37 AM Shreeya Patel > >>>>> <shreeya.patel@xxxxxxxxxxxxx> wrote: > >>>>>> Hi Greg, > >>>>>> > >>>>>> On 16/08/23 20:33, Greg Kroah-Hartman wrote: > >>>>>>> On Wed, Aug 16, 2023 at 03:09:27PM +0530, Shreeya Patel wrote: > >>>>>>>> On 13/06/22 15:40, Greg Kroah-Hartman wrote: > >>>>>>>>> From: Saravana Kannan<saravanak@xxxxxxxxxx> > >>>>>>>>> > >>>>>>>>> [ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ] > >>>>>>>>> > >>>>>>>>> Mounting NFS rootfs was timing out when deferred_probe_timeout > >>>>>>>>> was > >>>>>>>>> non-zero [1]. This was because ip_auto_config() initcall > >>>>>>>>> times out > >>>>>>>>> waiting for the network interfaces to show up when > >>>>>>>>> deferred_probe_timeout was non-zero. While ip_auto_config() calls > >>>>>>>>> wait_for_device_probe() to make sure any currently running > >>>>>>>>> deferred > >>>>>>>>> probe work or asynchronous probe finishes, that wasn't > >>>>>>>>> sufficient to > >>>>>>>>> account for devices being deferred until deferred_probe_timeout. > >>>>>>>>> > >>>>>>>>> Commit 35a672363ab3 ("driver core: Ensure > >>>>>>>>> wait_for_device_probe() waits > >>>>>>>>> until the deferred_probe_timeout fires") tried to fix that by > >>>>>>>>> making > >>>>>>>>> sure wait_for_device_probe() waits for deferred_probe_timeout > >>>>>>>>> to expire > >>>>>>>>> before returning. > >>>>>>>>> > >>>>>>>>> However, if wait_for_device_probe() is called from the > >>>>>>>>> kernel_init() > >>>>>>>>> context: > >>>>>>>>> > >>>>>>>>> - Before deferred_probe_initcall() [2], it causes the boot > >>>>>>>>> process to > >>>>>>>>> hang due to a deadlock. > >>>>>>>>> > >>>>>>>>> - After deferred_probe_initcall() [3], it blocks kernel_init() > >>>>>>>>> from > >>>>>>>>> continuing till deferred_probe_timeout expires and beats > >>>>>>>>> the point of > >>>>>>>>> deferred_probe_timeout that's trying to wait for > >>>>>>>>> userspace to load > >>>>>>>>> modules. > >>>>>>>>> > >>>>>>>>> Neither of this is good. So revert the changes to > >>>>>>>>> wait_for_device_probe(). > >>>>>>>>> > >>>>>>>>> [1] > >>>>>>>>> -https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > >>>>>>>>> [2] > >>>>>>>>> -https://lore.kernel.org/lkml/YowHNo4sBjr9ijZr@dev-arch.thelio-3990X/ > >>>>>>>>> > >>>>>>>>> [3] -https://lore.kernel.org/lkml/Yo3WvGnNk3LvLb7R@xxxxxxxxxxxxx/ > >>>>>>>> Hi Saravana, Greg, > >>>>>>>> > >>>>>>>> > >>>>>>>> KernelCI found this patch causes the > >>>>>>>> baseline.bootrr.deferred-probe-empty test to fail on > >>>>>>>> r8a77960-ulcb, > >>>>>>>> see the following details for more information. > >>>>>>>> > >>>>>>>> KernelCI dashboard link: > >>>>>>>> https://linux.kernelci.org/test/plan/id/64d2a6be8c1a8435e535b264/ > >>>>>>>> > >>>>>>>> Error messages from the logs :- > >>>>>>>> > >>>>>>>> + UUID=11236495_1.5.2.4.5 > >>>>>>>> + set +x > >>>>>>>> + export > >>>>>>>> 'PATH=/opt/bootrr/libexec/bootrr/helpers:/lava-11236495/1/../bin:/sbin:/usr/sbin:/bin:/usr/bin' > >>>>>>>> + cd /opt/bootrr/libexec/bootrr > >>>>>>>> + sh helpers/bootrr-auto > >>>>>>>> e6800000.ethernet > >>>>>>>> e6700000.dma-controller > >>>>>>>> e7300000.dma-controller > >>>>>>>> e7310000.dma-controller > >>>>>>>> ec700000.dma-controller > >>>>>>>> ec720000.dma-controller > >>>>>>>> fea20000.vsp > >>>>>>>> feb00000.display > >>>>>>>> fea28000.vsp > >>>>>>>> fea30000.vsp > >>>>>>>> fe9a0000.vsp > >>>>>>>> fe9af000.fcp > >>>>>>>> fea27000.fcp > >>>>>>>> fea2f000.fcp > >>>>>>>> fea37000.fcp > >>>>>>>> sound > >>>>>>>> ee100000.mmc > >>>>>>>> ee140000.mmc > >>>>>>>> ec500000.sound > >>>>>>>> /lava-11236495/1/../bin/lava-test-case > >>>>>>>> <8>[ 17.476741] <LAVA_SIGNAL_TESTCASE > >>>>>>>> TEST_CASE_ID=deferred-probe-empty RESULT=fail> > >>>>>>>> > >>>>>>>> Test case failing :- > >>>>>>>> Baseline Bootrr deferred-probe-empty test > >>>>>>>> -https://github.com/kernelci/bootrr/blob/main/helpers/bootrr-generic-tests > >>>>>>>> > >>>>>>>> Regression Reproduced :- > >>>>>>>> > >>>>>>>> Lava job after reverting the commit 5ee76c256e92 > >>>>>>>> https://lava.collabora.dev/scheduler/job/11292890 > >>>>>>>> > >>>>>>>> > >>>>>>>> Bisection report from KernelCI can be found at the bottom of > >>>>>>>> the email. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Shreeya Patel > >>>>>>>> > >>>>>>>> #regzbot introduced: 5ee76c256e92 > >>>>>>>> #regzbot title: KernelCI: Multiple devices deferring on > >>>>>>>> r8a77960-ulcb > >>>>>>>> > >>>>>>>> --------------------------------------------------------------------------------------------------------------------------------------------------- > >>>>>>>> > >>>>>>>> > >>>>>>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** > >>>>>>>> * If you do send a fix, please include this trailer: * > >>>>>>>> * Reported-by: "kernelci.org bot" <bot@...> * > >>>>>>>> * * > >>>>>>>> * Hope this helps! * > >>>>>>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * > >>>>>>>> > >>>>>>>> stable-rc/linux-5.10.y bisection: > >>>>>>>> baseline.bootrr.deferred-probe-empty on > >>>>>>>> r8a77960-ulcb > >>>>>>> You are testing 5.10.y, yet the subject says 5.17? > >>>>>>> > >>>>>>> Which is it here? > >>>>>> Sorry, I accidentally used the lore link for 5.17 while reporting > >>>>>> this > >>>>>> issue, > >>>>>> but this test does fail on all the stable releases from 5.10 > >>>>>> onwards. > >>>>>> > >>>>>> stable 5.15 :- > >>>>>> https://linux.kernelci.org/test/case/id/64dd156a5ac58d0cf335b1ea/ > >>>>>> mainline :- > >>>>>> https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/ > >>>>>> > >>>>> Shreeya, can you try the patch Geert suggested and let us know if it > >>>>> helps? If not, then I can try to take a closer look. > >>>> I tried to test the kernel with 9be4cbd09da8 but it didn't change the > >>>> result. > >>>> https://lava.collabora.dev/scheduler/job/11311615 > >>>> > >>>> Also, I am not sure if this can change things but just FYI, KernelCI > >>>> adds some kernel parameters when running these tests and one of the > >>>> parameter is deferred_probe_timeout=60. > >>> Ah this is good to know. > >>> > >>>> You can check this in the definition details given in the Lava job. I > >>>> also tried to remove this parameter and rerun the test but again I got > >>>> the same result. > >>> How long does the test wait after boot before checking for the > >>> deferred devices list? > >>> > >> > >> AFAIK, script for running the tests is immediately ran after the boot > >> process is complete so there is no wait time. > > > > Regardless of what the kernel is doing, it seems like a fundamentally > > dumb test to specifically ask deferred probe to wait for up to a > > minute then complain that it hasn't finished after 11 seconds :/ > > > > If anything, it seems plausible that the "regression" might actually > > be the correct behaviour, and it was wrong before. I can't manage to > > pull up a boot log for a pre-5.10 kernel since all the async stuff on > > the KernelCI dashboard always just times out for me with a helpful > > "Error while loading data from the server (error code: 0)", but what > > would be interesting is whether those devices on the list are expected > > to successfully probe anyway - the mainline log below also shows other > > stuff failing to probe and CPUs failing to come online, so it's > > clearly not a very happy platform to begin with. > > > > Sorry about the dashboard issues you are facing, KernelCI team is > working on a new dashboard which will fix all of these issues. But we > need to wait for it be ready for some more time. > > > Your point makes sense and that is why we did a test to add some sleep > time of 60-65 seconds before running the tests and it actually fixed the > problem. There are no more deferred devices as you can see in the > following job. > https://lava.collabora.dev/scheduler/job/11330931 > > This change, to add deferred_probe_timeout=60 as kernel parameter was > recently added in KernelCI since there were number of devices failing to > probe on different platforms, specifically for chromebooks. > Unfortunately, no one realized to change the start of the test time and > hence these issues are seen now on different platforms. > > Thanks for pointing it out, it will help us to eliminate quite many > deferred probe test failures that we are seeing. > > Saravana, thanks to you as well for asking the valid questions about > when the test starts running. Glad we sorted this out. Thanks for maintaining the tests. -Saravana