1) The current Ubuntu kernel is 5.4.0-53. Do you want me to upgrade it
to 5.9 from kernel.org? Or is there an Ubuntu 5.9 package that I can
use? It would be easy to do if there is an Ubuntu package with 5.9,
which I would install and, after the tests, uninstall.
2) Why do you believe that 5.9 would solve the problem? I am asking
because I cannot change the production machine for a test if I cannot go
back to the original state. There is always a risk involved.
3) A single thread deals with all 36 devices. Each device has
its own co-routine (not preemptive), but all co-routines are executed by
that one thread.
4) By network console, do you mean ssh? If so, ssh dies as well when the
machine locks. The screen is the regular GNOME3 screen and nothing can be
seen there. Every time the machine locks they send a picture, and I cannot
see anything meaningful in it. I am thinking about disabling GNOME3, but I
need their blessing for that.
Thanks,
Alberto
On 11/10/20 3:51 PM, Alan Stern wrote:
On Tue, Nov 10, 2020 at 02:20:50PM -0500, Alberto Sentieri wrote:
I’ve seen many kernel locks caused by a particular user-level application.
After the kernel locks, there is no report left on the machine, not even in
the logs. These locks have to do with USB input and output.
The objective of this email is to get guidance about how to collect more
data related to the locks.
A description of the problem follows.
I manage a few remote machines installed at a manufacturing facility, which
run Ubuntu 18.04. For months I have seen unexpected kernel locks, which I
could not explain. By locks I mean that the machine completely dies. The
graphical screen and keyboard freeze. I cannot ping or connect through ssh
during the locks. The only way of making the machine come back is to
pull the plug. After rebooting I cannot find anything meaningful about the
lock in the logs. The machine is a good-quality one with a 6-core Xeon and
32 GB of ECC memory (the application uses about 1 GB). Exactly the same
problem happens on two identical machines, one running kernel
5.0.0-37-generic and the other running kernel 5.3.0-62-generic.
Can you update either machine to a 5.9 kernel?
A few days ago I was able to create a sequence of events that produces the
locks in a couple of minutes. These events have to do with USB 2.0 interrupt
I/O on USB devices connected at 12 Mbit/s and the frequency at which URBs are
submitted and reaped. It is necessary to have at least 36 devices connected
to reproduce the problem easily, which I cannot do from where I am. The
machines are in a country other than the one I live in, and my physical
access to them is not possible due to COVID-19 restrictions.
There are no special USB drivers installed. However, there is an NVIDIA
manufacturer driver installed, which I installed using the regular Ubuntu
tools for non-free software. All USB I/O is done by a regular user opening
/dev/bus/usb/xxx/xxx (the device group is set to the user group by udev).
Each set of 18 USB devices is connected to a hub powered by a 10 A power
supply. Each hub has its own USB 2.0 root: I installed multiple USB 2.0
PCI Express expansion cards, and only one port of each expansion card is
used, with one hub attached to it.
The protocol to talk to any of the 36 devices is pretty simple. It uses USB
interrupt frames. A 64-byte frame is sent to the device (request packet). I
use ioctl (USBDEVFS_SUBMITURB). The file descriptor is monitored by epoll
and when an answer comes back, the response packet (another 64-byte
interrupt packet) is recovered by ioctl (USBDEVFS_REAPURBNDELAY). Then a
64-byte packet (confirmation packet) is sent through USBDEVFS_SUBMITURB.
This sequence happens once every few seconds and the delay between the three
packets is just a couple of milliseconds. All process of dealing with the 36
devices is in a unique thread, under the same epoll loop.
This sentence is ambiguous. Do you mean there is a single unique thread
which talks to all 36 devices? Or do you mean there is a separate
unique thread for each device (so 36 threads)?
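For reference, the submit/epoll/reap sequence described in the quoted message can be sketched with the usbfs ioctls it names. This is only a minimal sketch, not the application's actual code: the helper names (fill_interrupt_urb, submit_urb, reap_urb, watch_device) and the endpoint address passed in are assumptions made for illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <linux/usbdevice_fs.h>

#define PKT_LEN 64  /* request, response and confirmation are all 64 bytes */

/* Prepare an interrupt URB for one 64-byte packet.
 * `endpoint` is an assumed address: bit 7 set (e.g. 0x81) means IN,
 * bit 7 clear (e.g. 0x01) means OUT. */
static void fill_interrupt_urb(struct usbdevfs_urb *urb, unsigned char endpoint,
                               void *buf, void *ctx)
{
    memset(urb, 0, sizeof(*urb));
    urb->type = USBDEVFS_URB_TYPE_INTERRUPT;
    urb->endpoint = endpoint;
    urb->buffer = buf;
    urb->buffer_length = PKT_LEN;
    urb->usercontext = ctx;  /* handed back unchanged when the URB is reaped */
}

/* Queue the URB on an open /dev/bus/usb/BBB/DDD descriptor.
 * Returns 0 on success, -1 with errno set on failure. */
static int submit_urb(int fd, struct usbdevfs_urb *urb)
{
    return ioctl(fd, USBDEVFS_SUBMITURB, urb);
}

/* Non-blocking reap. Returns the completed URB, or NULL on failure
 * (errno is EAGAIN when nothing has completed yet). */
static struct usbdevfs_urb *reap_urb(int fd)
{
    struct usbdevfs_urb *done = NULL;

    if (ioctl(fd, USBDEVFS_REAPURBNDELAY, &done) < 0)
        return NULL;
    return done;
}

/* Register a usbfs descriptor with an epoll instance. usbfs reports
 * completed URBs as writability, so the loop waits for EPOLLOUT. */
static int watch_device(int epfd, int fd, void *dev_state)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLOUT;
    ev.data.ptr = dev_state;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

In such a loop, each epoll wake-up would call reap_urb() on the ready descriptor until it returns NULL, and the usercontext pointer would identify which device and which packet completed.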
So if I synchronize all 36 devices, that is, if I try to talk to all of them
at basically the same time, the kernel will lock in about 2 minutes or less.
By “at the same time” I mean that I submit the URBs for the request packet
around the same time for all of them, and then sit there, waiting for the
proper epoll wake-up to deal with the state machine (response and
confirmation packets).
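The per-device state machine mentioned here could be modelled roughly as follows. All state, event and action names are hypothetical, and the sketch leaves out the actual URB submission; it only shows the request, response, confirmation ordering and what "starting all 36 at once" means.

```c
#include <stddef.h>

#define NUM_DEVICES 36

enum dev_state  { ST_IDLE, ST_WAIT_RESPONSE, ST_WAIT_CONFIRM_DONE };
enum dev_event  { EV_START, EV_RESPONSE_REAPED, EV_CONFIRM_REAPED };
enum dev_action { ACT_NONE, ACT_SUBMIT_REQUEST, ACT_SUBMIT_CONFIRMATION };

/* Advance one device's state machine. *action tells the caller which
 * URB (if any) to submit next. Unexpected events leave the state as is. */
static enum dev_state dev_step(enum dev_state s, enum dev_event e,
                               enum dev_action *action)
{
    *action = ACT_NONE;
    switch (s) {
    case ST_IDLE:
        if (e == EV_START) {
            *action = ACT_SUBMIT_REQUEST;
            return ST_WAIT_RESPONSE;
        }
        break;
    case ST_WAIT_RESPONSE:
        if (e == EV_RESPONSE_REAPED) {
            *action = ACT_SUBMIT_CONFIRMATION;
            return ST_WAIT_CONFIRM_DONE;
        }
        break;
    case ST_WAIT_CONFIRM_DONE:
        if (e == EV_CONFIRM_REAPED)
            return ST_IDLE;
        break;
    }
    return s;
}

/* The "synchronized" case: start every idle device back to back.
 * Returns how many request URBs the caller should now submit. */
static size_t start_all(enum dev_state states[NUM_DEVICES])
{
    size_t started = 0;

    for (size_t i = 0; i < NUM_DEVICES; i++) {
        enum dev_action a;

        states[i] = dev_step(states[i], EV_START, &a);
        if (a == ACT_SUBMIT_REQUEST)
            started++;
    }
    return started;
}
```

With all devices idle, start_all() asks for 36 request URBs in a row, which corresponds to the situation that triggers the lock.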
However, if I lock a semaphore before sending the request packet for one
device, and only unlock it after reaping the URB I used to send the
confirmation packet, the application runs for at least 72 hours without
problems. So, one device at a time (using basically the same software plus
the semaphore) does not cause the kernel lock.
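The message does not say how the semaphore is implemented. One possible single-threaded equivalent, assuming cooperative co-routines that must not block the shared thread, is a gate with a FIFO wait queue. The txn_gate type and its functions below are hypothetical names for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVICES 36

/* Lets exactly one device run its request/response/confirmation
 * transaction at a time; the others queue up in arrival order. */
struct txn_gate {
    int owner;                 /* device index holding the gate, or -1 */
    int waiting[MAX_DEVICES];  /* ring buffer of waiting device indices */
    size_t head, count;
};

static void gate_init(struct txn_gate *g)
{
    g->owner = -1;
    g->head = 0;
    g->count = 0;
}

/* Called before the request packet is submitted. Returns true if `dev`
 * may proceed now; otherwise `dev` is queued and must yield. Assumes
 * each device asks at most once per transaction, so the ring never
 * overflows with at most MAX_DEVICES devices. */
static bool gate_acquire(struct txn_gate *g, int dev)
{
    if (g->owner < 0) {
        g->owner = dev;
        return true;
    }
    if (g->count < MAX_DEVICES) {
        g->waiting[(g->head + g->count) % MAX_DEVICES] = dev;
        g->count++;
    }
    return false;
}

/* Called after the confirmation URB has been reaped. Hands the gate to
 * the longest-waiting device and returns its index, or -1 if none. */
static int gate_release(struct txn_gate *g)
{
    if (g->count == 0) {
        g->owner = -1;
        return -1;
    }
    g->owner = g->waiting[g->head];
    g->head = (g->head + 1) % MAX_DEVICES;
    g->count--;
    return g->owner;
}
```

The device index returned by gate_release() is the co-routine to resume next, so at most one request/response/confirmation exchange is in flight at any moment.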
My point is that simple ioctl calls to USB devices should not break the
kernel. I need help to address the kernel issue. The problem is difficult to
reproduce at my office because it needs many devices connected to it, which
are available only in a place I do not have physical access to, due to
COVID-19 travel restrictions.
My guess is that, for a regular user, this bug rarely manifests itself and
it may have been there for a long time.
I would like to figure out exactly where the problem is and I am looking for
your guidance to get more information about it.
You could try using a network console. Or have someone who is on-site
take a picture of the computer screen when a crash occurs.
Alan Stern