On 08/01/16 16:09, Greg Kroah-Hartman wrote:
On Fri, Jan 08, 2016 at 03:56:56PM +0000, Dean Jenkins wrote:
Hi,
Here is a summary of the kernel ticket that I created at
https://bugzilla.kernel.org/show_bug.cgi?id=110531
Before kernel 3.13, sysfs read_descriptors() was non-blocking and in kernel
3.13 the following commit was added
"232275a USB: fix substandard locking for the sysfs files"
which added usb_lock_device(udev) which can cause blocking to
read_descriptors().
Blocking occurs when sysfs read_descriptors' udev == hub_event's hdev value.
If USB comms have failed in hub_port_init() then the blocking can be up to
30 seconds due to the 6 retry attempts at re-establishing USB comms with a
USB device.
This period of mutex blocking can be too long for applications as the
application's thread for reading sysfs hangs until usb_unlock_device(hdev)
is called in hub_event().
This issue was observed in our embedded ARM based commercial project that
had an application that was reading the descriptors via sysfs to determine
whether the hub and devices supported OTG mode. The system had a check to
see whether the application got to a "normal" system state within 25 seconds
of power-up.
In the failure scenario, a malfunctioning USB stick was connected (present
at power-up) to the ARM system. The USB stick worked OK for 10 seconds but
failed causing usb_events() (ARM was based on kernel 3.14) to cause
hub_init_port() to attempt up to 6 times to try to re-establish USB
communications, this can take up to 30 seconds to complete all attempts.
Whilst the connection attempts are being attempted, hub_events() holds a
device lock referenced by hdev. However, sysfs read_descriptors() can be
blocked on this held lock when udev == hdev of hub_events(). When
read_descriptors() is blocked, this blocking can be up to 30 seconds.
In our project, a blocked read_descriptors() thread causes the sysfs read
thread in the application to hang which prevents the application from
reaching the "normal" state within the 25 second constraint set by the
project and subsequently the management entity triggers a controlled system
reboot. The board cyclically reset as the USB stick failed on each power-up.
For the project, this behaviour is undesirable.
I recommend not using USB devices that break :)
In the real-world of commercial projects it is not possible to prevent
end-users from physically connecting broken devices.
When broken devices are connected it demonstrates weaknesses in the
robustness of the capability of error handling and recovery.
Proposal:
=========
Remove the usb_lock_device(udev) and usb_unlock_device(udev) from
devices/usb/core/sysfs.c in read_descriptors() that was added by the kernel
3.13 commit
"232275a USB: fix substandard locking for the sysfs files"
Any comments to this proposal ?
The locking was added because it was needed, we can't just remove it and
expect things to work properly, don't you agree?
Is there any evidence that the added locking helped sysfs
read_descriptors() ?
At least kernels before 3.13 seemed to work OK with a non-blocking
non-locking read_descriptors() implementation.
read_descriptors() does not touch the USB hardware; read_descriptors()
just reads the contents of allocated memory. The sysfs descriptors vfs
entry cannot exist if the allocated memory does not exist otherwise
kernels before 3.13 would fail with NULL pointer dereferences which did
not happen (as far as I know).
I think you missed my point. The blocking sysfs read_descriptors() call
is NOT for the device that is failing. There should be no reason why the
static descriptors of the root hub should ever cause blocking to occur.
It should be possible to read this data whilst a different device is
failing.
I understand that having a timeout like this is not "nice", but given
the issues involved, I think that's the best solution, your system
shouldn't reset itself due to an artificial constraint you set on it
(the 25 second timeout), when relying on commodity hardware that can
easily fail.
I understand your point but real-world systems cannot restrict end-users
from attaching broken devices. For commercial systems, error detection
and recovery are important.
sorry,
It seems that I will need to prove that removing the lock checking in
sysfs read_descriptors() is safe so that it works like pre 3.13 kernels
of the past. Meaning sysfs read_descriptors() should be non-blocking.
Our workaround was to modify read_descriptors() to try to get the lock
and give up after 10 seconds and return no data so that the system
remained up. Side effects is a concern so a better solution is being
determined.
We also considered returning -EAGAIN or -ETIMEDOUT in read_descriptors()
to prevent any blocking occurring. However, it was unclear whether the
sysfs layer copes with error codes.
greg k-h
I welcome any further discussion on this.
Mentor Graphics is committed to giving back helpful improvements to the
community. In the long run this makes the kernel more robust.
Regards,
Dean
--
Dean Jenkins
Embedded Software Engineer
Linux Transportation Solutions
Mentor Embedded Software Division
Mentor Graphics (UK) Ltd.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html