On Tue, Nov 10, 2020 at 02:20:50PM -0500, Alberto Sentieri wrote: > I’ve seen many kernel locks caused by a particular user-level application. > After the kernel locks, there is no report left in the machine, neither in > the logs. These locks have to do with USB input and output. > > The objective of this email is to get guidance about how to collect more > data related to the locks. > > Follows a description of the problem. > > I manage a few remote machines installed at a manufacturing facility, which > run Ubuntu 18.04. For months I had seen unexpected kernel locks, which I > could not explain. By locks I mean that the machine completely dies. The > graphical screen and keyboard freezes. I cannot ping or connect through ssh > during the locks. The only way of making the machine come back is through a > “pull the plug”. After rebooting I cannot find anything meaningful about the > lock in the logs. The machine is a good quality one with a 6-core Xeon, 32 > GB ECC memory (and the application is using about 1GB). Exact the same > problem happens in two identical machines, one running kernel 5.0.0-37 > generic and the other running kernel 5.3.0-62-generic. Can you update either machine to a 5.9 kernel? > A few days ago I was able to create a sequence of events that produce the > locks in a couple of minutes. These events have to do with USB 2.0 interrupt > I/O on USB devices connected at 12 Mbits/s and the frequency URBs are > submitted and reaped . It is necessary to have at least 36 devices connected > to reproduce the problem easily, which I cannot do from where I am. The > machines are in a country other than the one I live, and my physical access > to them is not possible due to COVID-19 restrictions. > > There is no special USB drivers installed. However, there is a NVIDIA > manufacturer driver installed, which I installed using the Ubuntu regular > tools for non-free software. All USB I/O is done by a regular user opening > /dev/bus/usb/xxx/xxx (the device group is set to the user group by udev). > > Each set of 18 USB devices is connected to a 10-Amp.-power-supply powered > HUB. Each hub has its own USB 2.0 root, I mean, I installed multiple USB 2.0 > PCI express expansion cards, and only one port of each expansion card is > used for each HUB. > > The protocol to talk to any of the 36 devices is pretty simple. It uses USB > interrupt frames. A 64-byte frame is sent to the device (request packet). I > use ioctl (USBDEVFS_SUBMITURB). The file descriptor is monitored by epoll > and when an answer comes back, the response packet (another 64-byte > interrupt packet) is recovered by ioctl (USBDEVFS_REAPURBNDELAY). Then a > 64-byte packet (confirmation packet) is sent through USBDEVFS_SUBMITURB. > This sequence happens once every few seconds and the delay between the three > packets is just a couple of milliseconds. All process of dealing with the 36 > devices is in a unique thread, under the same epoll loop. This sentence is ambiguous. Do you mean there is a single unique thread which talks to all 36 devices? Or do you mean there is a separate unique thread for each device (so 36 threads)? > So if I synchronize all 36 devices, I mean, I try to talk to all them > basically at the same time, the kernel will lock in about 2 minutes or less. > By “at the same time” I mean to submit the URBs for the request packet > around the same time for all of them, and then sit there, waiting for the > proper epoll wake-up to deal with the state machine (response and > confirmation packets). > > However, if I lock a semaphore before sending the request packet for one > device, and only unlock after reaping the URB I used to send the > confirmation packet, it ran for ate least 72 hours without problems. So, one > device at a time (using basically the same software plus the semaphore) does > not cause the kernel lock. > > My point is that simple ioctl calls to USB devices should not break the > kernel. I need help to address the kernel issue. The problem is difficult to > reproduce at my office because it needs many devices connected to it, which > are available only in a place I do not have physical access to, due to > COVID-19 travel restrictions. > > My guess is that, for a regular user, this bug rarely manifests itself and > it may be there for a long time. > > I would like to figure out exactly where the problem is and I am looking for > your guidance to get more information about it. You could try using a network console. Or have someone who is on-site take a picture of the computer screen when a crash occurs. Alan Stern