Re: Deadlock under load with Linux 5.9 and other recent kernels

Hi!

This bug description looks very similar to what I'm struggling with on an amd64 platform.

When too much data is sent over USB, it seems that the USB control message is delayed long enough to time out, and the block device then gets unmounted.

I have been working on my related bug for a long time, trying to make it easily reproducible, but without success. The bug is there all the time, though. New hardware is on its way so I can continue my testing.

Maybe you can test the patch I'm using to see if it works better for you?

In the meantime, here is a description of my bug:

I have stress tested the USB system. Seven mechanical hard disks and two SSDs are now connected over USB. Six processes write random data to the disks at the same time. One of them writes to the SSD that I previously could not write to without it failing. The other USB SSD holds my root partition.

Before I applied the patch, my root partition sometimes failed to stay mounted. Since applying it I have not had any crashes.

This is a quick fix for hard disks, but it works. It kept working when I started three VirtualBox guests and let them do work as well. The guests' hard disks are on my USB root partition.

It doesn't work if I also use my USB-to-Ethernet adapter (ID 2001:4a00 D-Link Corp.), although my root partition and two random-write tests survived. Maybe a much larger timeout would help in this case? But I don't consider that a good solution.

The behavior is the same on another (much slower) computer with a different USB hub. I have also tested exactly the same setup as earlier, with no mechanical hard disks, and it works with the patch and not without it.

Best regards,
Patrik

---start of diff---
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 5b768b80d1ee..3c550934815c 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -105,7 +105,7 @@ MODULE_PARM_DESC(use_both_schemes,
 DECLARE_RWSEM(ehci_cf_port_reset_rwsem);
 EXPORT_SYMBOL_GPL(ehci_cf_port_reset_rwsem);

-#define HUB_DEBOUNCE_TIMEOUT    2000
+#define HUB_DEBOUNCE_TIMEOUT    10000
 #define HUB_DEBOUNCE_STEP      25
 #define HUB_DEBOUNCE_STABLE     100

diff --git a/include/linux/usb.h b/include/linux/usb.h
index 20c555db4621..e64d441bb78f 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -1841,8 +1841,8 @@ extern int usb_set_configuration(struct usb_device *dev, int configuration);
  * USB identifies 5 second timeouts, maybe more in a few cases, and a few
  * slow devices (like some MGE Ellipse UPSes) actually push that limit.
  */
-#define USB_CTRL_GET_TIMEOUT    5000
-#define USB_CTRL_SET_TIMEOUT    5000
+#define USB_CTRL_GET_TIMEOUT    10000
+#define USB_CTRL_SET_TIMEOUT    10000


 /**
---end of diff---
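
To illustrate what the first hunk changes, here is a small userspace model of a debounce loop. It is loosely based on my understanding of how the hub driver polls a port (hub_port_debounce in hub.c) and is not the kernel code. The function name debounce_time and the "port flaps for noisy_ms, then stays steady" behaviour are assumptions made for the example. It does not model the USB_CTRL_GET_TIMEOUT/USB_CTRL_SET_TIMEOUT change.

```c
/* Unchanged by the diff: poll interval and required steady time, in ms. */
#define HUB_DEBOUNCE_STEP    25
#define HUB_DEBOUNCE_STABLE  100

/*
 * Simplified model: the connection status keeps changing for the first
 * `noisy_ms` milliseconds and is steady afterwards. The port is polled
 * every HUB_DEBOUNCE_STEP ms.
 *
 * Returns the elapsed time in ms at which the port has been steady for
 * HUB_DEBOUNCE_STABLE ms, or -1 if `timeout_ms` is reached first.
 */
int debounce_time(int noisy_ms, int timeout_ms)
{
    int stable_ms = 0;

    for (int total_ms = 0; ; total_ms += HUB_DEBOUNCE_STEP) {
        if (total_ms >= noisy_ms) {
            /* Same reading as last poll: count it as steady time. */
            stable_ms += HUB_DEBOUNCE_STEP;
            if (stable_ms >= HUB_DEBOUNCE_STABLE)
                return total_ms;
        } else {
            /* Reading changed: start counting steady time again. */
            stable_ms = 0;
        }
        if (total_ms >= timeout_ms)
            return -1;
    }
}
```

In this model a port that needs 3 seconds to settle gives debounce_time(3000, 2000) == -1 with the old limit and debounce_time(3000, 10000) == 3075 with the new one. So the larger value only helps when the port does settle eventually; it just waits longer before giving up.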


On 28/09/2020 03:37, Christian Hewitt wrote:
> On 26 Sep 2020, at 4:28 pm, Christian Hewitt <christianshewitt@xxxxxxxxx> wrote:
>
>> On 26 Sep 2020, at 4:13 pm, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>>> On 9/26/20 5:55 AM, Christian Hewitt wrote:
>>>> On 26 Sep 2020, at 2:51 pm, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>
>>>>> On 9/26/20 1:55 AM, Christian Hewitt wrote:
>>>>>> I am using an ARM SBC device with Amlogic S922X chip (Beelink
>>>>>> GS-King-X, an Android STB) to boot the Kodi mediacentre distro
>>>>>> LibreELEC (which I work on) although the issue is also reproducible
>>>>>> with Manjaro and Armbian on the same hardware, and with the GT-King
>>>>>> and GT-King Pro devices from the same vendor - all three devices are
>>>>>> using a common dtsi:
>>>>>>
>>>>>> https://github.com/chewitt/linux/blob/amlogic-5.9-integ/arch/arm64/boot/dts/amlogic/meson-g12b-gsking-x.dts
>>>>>> https://github.com/chewitt/linux/blob/amlogic-5.9-integ/arch/arm64/boot/dts/amlogic/meson-g12b-gtking-pro.dts
>>>>>> https://github.com/chewitt/linux/blob/amlogic-5.9-integ/arch/arm64/boot/dts/amlogic/meson-g12b-gtking.dts
>>>>>> https://github.com/chewitt/linux/blob/amlogic-5.9-integ/arch/arm64/boot/dts/amlogic/meson-g12b-w400.dtsi
>>>>>>
>>>>>> I have schematics for the devices, but can only share those privately
>>>>>> on request.
>>>>>>
>>>>>> For testing I am booting LibreELEC from SD card. The box has a 4TB
>>>>>> SATA drive internally connected with a USB > SATA bridge, see dmesg:
>>>>>> http://ix.io/2yLh and I connect a USB stick with a 4GB ISO file that I
>>>>>> copy to the internal SATA drive. Within 10-20 seconds of starting the
>>>>>> copy the box deadlocks needing a hard power cycle to recover. The
>>>>>> timing of the deadlock is variable but the device _always_ deadlocks.
>>>>>> Although I am using a simple copy use-case, there are similar reports
>>>>>> in Armbian forums performing tasks like installs/updates that involve
>>>>>> I/O loads.
>>>>>>
>>>>>> Following advice in the #linux-amlogic IRC channel I added
>>>>>> CONFIG_SOFTLOCKUP_DETECTOR and CONFIG_DETECT_HUNG_TASK and was able to
>>>>>> get output on the HDMI screen (it is not possible to connect to UART
>>>>>> pins without destroying the box case). If you advance the following
>>>>>> video frame by frame in VLC you can see the output:
>>>>>>
>>>>>> https://www.dropbox.com/s/klvcizim8cs5lze/lockup_clip.mov?dl=0
>>>>>
>>>>> Try with this patch:
>>>>>
>>>>> https://lore.kernel.org/linux-block/20200925191902.543953-1-shakeelb@xxxxxxxxxx/
>>>>
>>>> It still locks up approx. 25 seconds into the copy operation. Here’s the output in video again (a little blurry):
>>>>
>>>> https://www.dropbox.com/s/3j2czaq509arg6g/lockup_clip2.mov?dl=0
>>>
>>> Can you try and set CONFIG_SLUB in your .config instead of CONFIG_SLAB?
>>
>> CONFIG_SLUB is already set, here’s the full defconfig http://paste.ubuntu.com/p/5BNdZv6J3c/
>>
>> # dmesg | grep -i slub
>> [    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=6, Nodes=1
>>
>>> Also, just take a picture, should be easier to get readable than a video.
>>> And the static trace is all that is needed.
>>
>> This is from a GT-King Pro which someone reminded me has a large RS232 port on the rear:
>>
>> https://pastebin.com/raw/sGtzgreN
>
> from 5.9-rc7 https://pastebin.com/raw/nbHJmrqe
>
> Christian




--
PGP-key fingerprint: 1B30 7F61 AF9E 538A FCD6  2BE7 CED7 B0E4 3BF9 8D6C



