On 2023-08-21 12:35, Shreeya Patel wrote:
On 19/08/23 01:49, Saravana Kannan wrote:
On Thu, Aug 17, 2023 at 4:13 PM Shreeya Patel
<shreeya.patel@xxxxxxxxxxxxx> wrote:
Hi Geert, Saravana,
On 18/08/23 00:03, Saravana Kannan wrote:
On Thu, Aug 17, 2023 at 4:37 AM Shreeya Patel
<shreeya.patel@xxxxxxxxxxxxx> wrote:
Hi Greg,
On 16/08/23 20:33, Greg Kroah-Hartman wrote:
On Wed, Aug 16, 2023 at 03:09:27PM +0530, Shreeya Patel wrote:
On 13/06/22 15:40, Greg Kroah-Hartman wrote:
From: Saravana Kannan<saravanak@xxxxxxxxxx>
[ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ]
Mounting NFS rootfs was timing out when deferred_probe_timeout was
non-zero [1]. This was because ip_auto_config() initcall times out
waiting for the network interfaces to show up when
deferred_probe_timeout was non-zero. While ip_auto_config() calls
wait_for_device_probe() to make sure any currently running deferred
probe work or asynchronous probe finishes, that wasn't sufficient to
account for devices being deferred until deferred_probe_timeout.
Commit 35a672363ab3 ("driver core: Ensure wait_for_device_probe() waits
until the deferred_probe_timeout fires") tried to fix that by making
sure wait_for_device_probe() waits for deferred_probe_timeout to expire
before returning.
However, if wait_for_device_probe() is called from the kernel_init()
context:
- Before deferred_probe_initcall() [2], it causes the boot process to
  hang due to a deadlock.
- After deferred_probe_initcall() [3], it blocks kernel_init() from
  continuing till deferred_probe_timeout expires and beats the point of
  deferred_probe_timeout that's trying to wait for userspace to load
  modules.
Neither of this is good. So revert the changes to
wait_for_device_probe().
[1] - https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[2] - https://lore.kernel.org/lkml/YowHNo4sBjr9ijZr@dev-arch.thelio-3990X/
[3] - https://lore.kernel.org/lkml/Yo3WvGnNk3LvLb7R@xxxxxxxxxxxxx/
Hi Saravana, Greg,
KernelCI found this patch causes the
baseline.bootrr.deferred-probe-empty test to fail on r8a77960-ulcb,
see the following details for more information.
KernelCI dashboard link:
https://linux.kernelci.org/test/plan/id/64d2a6be8c1a8435e535b264/
Error messages from the logs :-
+ UUID=11236495_1.5.2.4.5
+ set +x
+ export 'PATH=/opt/bootrr/libexec/bootrr/helpers:/lava-11236495/1/../bin:/sbin:/usr/sbin:/bin:/usr/bin'
+ cd /opt/bootrr/libexec/bootrr
+ sh helpers/bootrr-auto
e6800000.ethernet
e6700000.dma-controller
e7300000.dma-controller
e7310000.dma-controller
ec700000.dma-controller
ec720000.dma-controller
fea20000.vsp
feb00000.display
fea28000.vsp
fea30000.vsp
fe9a0000.vsp
fe9af000.fcp
fea27000.fcp
fea2f000.fcp
fea37000.fcp
sound
ee100000.mmc
ee140000.mmc
ec500000.sound
/lava-11236495/1/../bin/lava-test-case
<8>[ 17.476741] <LAVA_SIGNAL_TESTCASE TEST_CASE_ID=deferred-probe-empty RESULT=fail>
Test case failing :-
Baseline Bootrr deferred-probe-empty test
https://github.com/kernelci/bootrr/blob/main/helpers/bootrr-generic-tests
Regression Reproduced :-
Lava job after reverting the commit 5ee76c256e92
https://lava.collabora.dev/scheduler/job/11292890
Bisection report from KernelCI can be found at the bottom of the
email.
Thanks,
Shreeya Patel
#regzbot introduced: 5ee76c256e92
#regzbot title: KernelCI: Multiple devices deferring on r8a77960-ulcb
---------------------------------------------------------------------------------------------------------------------------------------------------
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* If you do send a fix, please include this trailer: *
* Reported-by: "kernelci.org bot" <bot@...> *
* *
* Hope this helps! *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
stable-rc/linux-5.10.y bisection: baseline.bootrr.deferred-probe-empty on r8a77960-ulcb
You are testing 5.10.y, yet the subject says 5.17?
Which is it here?
Sorry, I accidentally used the lore link for 5.17 while reporting this
issue, but this test does fail on all the stable releases from 5.10 onwards.
stable 5.15 :-
https://linux.kernelci.org/test/case/id/64dd156a5ac58d0cf335b1ea/
mainline :-
https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/
Shreeya, can you try the patch Geert suggested and let us know if it
helps? If not, then I can try to take a closer look.
I tried to test the kernel with 9be4cbd09da8 but it didn't change the
result.
https://lava.collabora.dev/scheduler/job/11311615
Also, I am not sure if this can change things but just FYI, KernelCI
adds some kernel parameters when running these tests and one of the
parameters is deferred_probe_timeout=60.
Ah this is good to know.
You can check this in the definition details given in the Lava job. I
also tried to remove this parameter and rerun the test but again I got
the same result.
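For anyone who wants to confirm which timeout a given boot actually used, here is a minimal Python sketch that pulls deferred_probe_timeout out of a kernel command line string. The helper name parse_deferred_probe_timeout is made up for illustration, and treating the last occurrence as the effective one is an assumption about how repeated parameters behave, not something verified in this thread.

```python
from typing import Optional


def parse_deferred_probe_timeout(cmdline: str) -> Optional[int]:
    """Return deferred_probe_timeout (seconds) from a kernel command line.

    Hypothetical helper for illustration. Returns None when the parameter
    is absent, so the caller has to decide what the kernel default is.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == "deferred_probe_timeout" and sep:
            try:
                # Assumption: if the parameter is repeated, the last one wins.
                value = int(raw)
            except ValueError:
                # Ignore malformed values instead of guessing.
                pass
    return value


if __name__ == "__main__":
    example = "console=ttyS0 deferred_probe_timeout=60 root=/dev/nfs"
    print(parse_deferred_probe_timeout(example))
```

On a running target the input string would typically come from /proc/cmdline.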
How long does the test wait after boot before checking for the
deferred devices list?
AFAIK, the script for running the tests is run immediately after the boot
process is complete, so there is no wait time.
Regardless of what the kernel is doing, it seems like a fundamentally
dumb test to specifically ask deferred probe to wait for up to a minute
then complain that it hasn't finished after 11 seconds :/
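To make the timing mismatch concrete, here is a rough Python sketch of a check that only reports a failure once the configured timeout has had a chance to expire. The function name classify_deferred_list and the "inconclusive" result are hypothetical and are not part of bootrr. Using uptime as the reference is also only an approximation, since the timeout does not start counting at the very beginning of boot.

```python
def classify_deferred_list(deferred: str, uptime_s: float, timeout_s: int) -> str:
    """Classify the contents of the deferred devices list.

    deferred  -- text of the deferred list, one device per line
    uptime_s  -- seconds since boot when the list was read
    timeout_s -- deferred_probe_timeout in seconds

    Hypothetical helper for illustration only.
    """
    devices = [line for line in deferred.splitlines() if line.strip()]
    if not devices:
        return "pass"
    if uptime_s < timeout_s:
        # Deferred probing may still be legitimately pending.
        return "inconclusive"
    return "fail"


if __name__ == "__main__":
    # Values loosely based on the log above: the list is non-empty roughly
    # 17 seconds into boot, with deferred_probe_timeout=60.
    sample = "e6800000.ethernet\nsound\nee100000.mmc\n"
    print(classify_deferred_list(sample, 17.5, 60))
```

On a target the list would normally be read from /sys/kernel/debug/devices_deferred (assuming debugfs is mounted there) and the uptime from /proc/uptime.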
If anything, it seems plausible that the "regression" might actually be
the correct behaviour, and it was wrong before. I can't manage to pull
up a boot log for a pre-5.10 kernel since all the async stuff on the
KernelCI dashboard always just times out for me with a helpful "Error
while loading data from the server (error code: 0)", but what would be
interesting is whether those devices on the list are expected to
successfully probe anyway - the mainline log below also shows other
stuff failing to probe and CPUs failing to come online, so it's clearly
not a very happy platform to begin with.
Robin.
I will try to add 9be4cbd09da8 to the mainline kernel and see what results I
get.
Now I'm confused. What do you mean by mainline? Are you saying the tip
of tree of Linus's tree is also hitting this issue?
KernelCI runs tests on different kernel branches and trees, and we also have
this same test running on the mainline tree.
The following is the link to the dashboard for it and, as you can see, it
does fail there too.
https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/
-Saravana