Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

On 02/04/2021 20:02, Luis Chamberlain wrote:
No sorry, I dropped the ball on this but I managed to now spawn up the
virtual guests where I was doing development to reproduce this. Give
me some time and I will zero in on this now.

For now what I have is the following to test this, I next will work
on the userspace part.

I can report that your patch for test_firmware works (applied to current master from Linus, d93a0d4; it applied with some offsets and I had to adjust some whitespace). With that module (and the resume test enabled) I get freezes at resume in the same cases that also cause problems with si2168.

I'm testing this on bare metal with F34 beta and / on btrfs. I'm using the nvidia driver again to make sure the system does not otherwise use any firmware from /usr/lib/firmware (confirmed with the kernel debug messages for firmware_loader). My si2168 is not plugged in.

I tested it with a normal population of files in /usr/lib/firmware, both without test-firmware.bin and with a random 1 MiB file in place. I tested after a reboot so that nothing is cached. With /usr/lib/firmware on a separate ext4 partition I can confirm with dmesg that the test_firmware suspend test actually works (does not freeze). With /usr/lib/firmware on btrfs it fails in both cases (with and without the firmware file). With caching (first suspend with the ext4 partition mounted, then a second suspend without it) it does not freeze even with the firmware on btrfs.
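
For anyone repeating this, a minimal sketch of how such a random 1 MiB test file can be created. The helper name make_test_firmware and passing the target directory as an argument are only for illustration; test-firmware.bin is the file name the test module asks for.

```shell
#!/bin/sh
# Sketch only: make_test_firmware is a hypothetical helper name.
# Writes 1 MiB (1024 blocks of 1024 bytes) of random data
# to DIR/test-firmware.bin, creating DIR if needed.
make_test_firmware() {
    dir=$1
    mkdir -p "$dir" || return 1
    dd if=/dev/urandom of="$dir/test-firmware.bin" bs=1024 count=1024 2>/dev/null
}
```

Run as root, `make_test_firmware /usr/lib/firmware` would put the file where the loader looks on this system. A reboot afterwards should make sure nothing is served from the firmware cache.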


One further thing I noticed which might be problematic in rare cases:
According to the kernel debug messages, the firmware loader does not attempt to cache the firmware during suspend if the previous call to request_firmware() failed (file not present; call made during the previous resume). In my opinion it should attempt to cache the firmware on suspend even in this case (if I remember correctly, firmware_request_cache also works without the file being present). If some low-memory condition has caused the file system cache to lose the information that the file is absent (or the file was written after the initial attempt and is no longer in the file system cache), this might lead to freezes even for well-behaved drivers if they retry request_firmware() on resume.
If the firmware is found during resume, it is cached on further suspends.
Given how long it took me to narrow down this problem in this (for me) reliably reproducible case, something like this happening at random would be almost impossible to debug or locate, and might actually happen frequently in the wild.


On 03/04/2021 00:19, Luis Chamberlain wrote:
Lukas, can you share your /etc/fstab ?

This is the core part (everything else is unmounted); I shortened the UUIDs. The ext4 mount is of course also unmounted when I want it to fail; with it in place it reliably does not fail.

UUID=<1> /                       btrfs   subvol=linux1 0 0
UUID=<2> /usr/lib/firmware       ext4    defaults 1 1
UUID=<1> /home/lukas             btrfs   subvol=linux1-f34-lukas 0 0
tmpfs /home/lukas/.cache tmpfs size=16g,gid=lukas,uid=lukas,mode=700 0 0

Also, how long do you stay in the boot before you try to suspend?

During my reproduction sessions usually only briefly, 1-5 min, but I think I have also seen this after a slightly longer time. I can try to let it sit for longer if you think that is important.

On 03/04/2021 00:58, Luis Chamberlain wrote:
On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:

Lukas, can you share your /etc/fstab ? Also, how long do you stay in
the boot before you try to suspend?

OK I cannot reproduce the issue with the modified patch I sent to
test_firmware, which if you enable config_enable_resume_test will
trigger a request_firmware() on resume, thus trying to mimic the race
you note. To test this you can simply use a loopback filesystem for
your /lib/firmware and create a btrfs filesystem for it, and then run:

echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test

systemctl suspend

Then resume. You should see "resume test" print on dmesg. I keep my
/lib/firmware/ empty and still, nothing.

Did you also try creating a random test-firmware.bin (I used 1M from /dev/urandom) instead of leaving /lib/firmware empty? If the directory is completely empty, it does not freeze for me either. If the directory is empty, any attempt to access its content can likely be served directly from cache, even if the directory itself has never been accessed before, as long as /lib (which is a symlink to /usr/lib on Fedora) has been accessed (which will likely always be true). So I have to add to my previous findings that "firmware directory is not completely empty" is a further prerequisite for it to fail.
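
A minimal sketch of the reproduction setup discussed here: a loopback image formatted as btrfs, mounted over /lib/firmware, with a random 1 MiB test-firmware.bin so that the directory is not empty. The image path, the 512M size and the helper names are assumptions for illustration. The sysfs attribute only exists with the test_firmware patch from this thread applied. By default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints each command;
# set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# setup_btrfs_firmware IMAGE MOUNTPOINT (hypothetical helper)
# Note: mounting over the firmware directory hides its existing contents.
setup_btrfs_firmware() {
    img=$1
    mnt=$2
    run truncate -s 512M "$img" &&
    run mkfs.btrfs "$img" &&
    run mount -o loop "$img" "$mnt" &&
    run dd if=/dev/urandom of="$mnt/test-firmware.bin" bs=1024 count=1024
}

# resume_test (hypothetical helper): needs the patched test_firmware module.
resume_test() {
    run modprobe test_firmware &&
    run sh -c 'echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test' &&
    run systemctl suspend
}
```

For example, `setup_btrfs_firmware /tmp/fw.img /lib/firmware` followed by `resume_test` prints the sequence of commands. Rebooting between the setup and the suspend should keep the firmware cache from hiding the problem.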

Can you provide kernel logs for where you are seeing things get stuck at?

The log does not have any entries from resume. For the attempts where it freezes, the last entry in journalctl is:
systemd-sleep[5050]: Suspending system...

Note that I had mentioned the races on suspend/resume do exist for any journaling
filesystem, but this typically happens if you are doing ongoing
writes. I suppose you are *not* doing writes and your filesystem is
idle.

I of course cannot completely rule out some random write (log files or similar), but there is definitely no heavy writing going on. I think only writes caused by the act of suspending and resuming could cause it this reliably. I have also seen it with a completely isolated btrfs file system for /usr/lib/firmware, where there should not have been any writes. With ext4 (which is also journaling) it works properly.

As such without kernel logs I cannot be sure what the issue is, but at
this point after the initial testing I've done I don't suspect this is
a firmware API issue. You might be better off just reposting your
patches with the respective Reviewed-by tags and pestering your
maintainer.

I will try to be a little more insistent this time. Is "just repost" the usual way to handle patches that are being ignored?

Lukas


