I’ve seen many kernel locks caused by a particular user-level
application. After the kernel locks, there is no report left in the
machine, neither in the logs. These locks have to do with USB input and
output.
The objective of this email is to get guidance about how to collect more
data related to the locks.
Follows a description of the problem.
I manage a few remote machines installed at a manufacturing facility,
which run Ubuntu 18.04. For months I had seen unexpected kernel locks,
which I could not explain. By locks I mean that the machine completely
dies. The graphical screen and keyboard freezes. I cannot ping or
connect through ssh during the locks. The only way of making the machine
come back is through a “pull the plug”. After rebooting I cannot find
anything meaningful about the lock in the logs. The machine is a good
quality one with a 6-core Xeon, 32 GB ECC memory (and the application is
using about 1GB). Exact the same problem happens in two identical
machines, one running kernel 5.0.0-37 generic and the other running
kernel 5.3.0-62-generic.
A few days ago I was able to create a sequence of events that produce
the locks in a couple of minutes. These events have to do with USB 2.0
interrupt I/O on USB devices connected at 12 Mbits/s and the frequency
URBs are submitted and reaped . It is necessary to have at least 36
devices connected to reproduce the problem easily, which I cannot do
from where I am. The machines are in a country other than the one I
live, and my physical access to them is not possible due to COVID-19
restrictions.
There is no special USB drivers installed. However, there is a NVIDIA
manufacturer driver installed, which I installed using the Ubuntu
regular tools for non-free software. All USB I/O is done by a regular
user opening /dev/bus/usb/xxx/xxx (the device group is set to the user
group by udev).
Each set of 18 USB devices is connected to a 10-Amp.-power-supply
powered HUB. Each hub has its own USB 2.0 root, I mean, I installed
multiple USB 2.0 PCI express expansion cards, and only one port of each
expansion card is used for each HUB.
The protocol to talk to any of the 36 devices is pretty simple. It uses
USB interrupt frames. A 64-byte frame is sent to the device (request
packet). I use ioctl (USBDEVFS_SUBMITURB). The file descriptor is
monitored by epoll and when an answer comes back, the response packet
(another 64-byte interrupt packet) is recovered by ioctl
(USBDEVFS_REAPURBNDELAY). Then a 64-byte packet (confirmation packet) is
sent through USBDEVFS_SUBMITURB. This sequence happens once every few
seconds and the delay between the three packets is just a couple of
milliseconds. All process of dealing with the 36 devices is in a unique
thread, under the same epoll loop.
So if I synchronize all 36 devices, I mean, I try to talk to all them
basically at the same time, the kernel will lock in about 2 minutes or
less. By “at the same time” I mean to submit the URBs for the request
packet around the same time for all of them, and then sit there, waiting
for the proper epoll wake-up to deal with the state machine (response
and confirmation packets).
However, if I lock a semaphore before sending the request packet for one
device, and only unlock after reaping the URB I used to send the
confirmation packet, it ran for ate least 72 hours without problems. So,
one device at a time (using basically the same software plus the
semaphore) does not cause the kernel lock.
My point is that simple ioctl calls to USB devices should not break the
kernel. I need help to address the kernel issue. The problem is
difficult to reproduce at my office because it needs many devices
connected to it, which are available only in a place I do not have
physical access to, due to COVID-19 travel restrictions.
My guess is that, for a regular user, this bug rarely manifests itself
and it may be there for a long time.
I would like to figure out exactly where the problem is and I am looking
for your guidance to get more information about it.
Thanks,
Alberto Sentieri
# cat /proc/version
Linux version 5.3.0-62-generic (buildd@lcy01-amd64-004) (gcc version
7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #56~18.04.1-Ubuntu SMP Wed Jun 24
16:17:03 UTC 2020
# free -h (while running the application)
total used free shared buff/cache
available
Mem: 31G 860M 29G 7.1M 1.3G
29G
Swap: 15G 0B 15G
# lspci
00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 04)
00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers
(rev 04)
00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers
(rev 04)
00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers
(rev 04)
00:04.3 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers
(rev 04)
00:04.4 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers
(rev 04)
00:04.5 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers
(rev 04)
00:04.6 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers
(rev 04)
00:04.7 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers
(rev 04)
00:05.0 System peripheral: Intel Corporation Sky Lake-E MM/Vt-d
Configuration Registers (rev 04)
00:05.2 System peripheral: Intel Corporation Device 2025 (rev 04)
00:05.4 PIC: Intel Corporation Device 2026 (rev 04)
00:08.0 System peripheral: Intel Corporation Sky Lake-E Ubox Registers
(rev 04)
00:08.1 Performance counters: Intel Corporation Sky Lake-E Ubox
Registers (rev 04)
00:08.2 System peripheral: Intel Corporation Sky Lake-E Ubox Registers
(rev 04)
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family
USB 3.0 xHCI Controller
00:14.2 Signal processing controller: Intel Corporation 200 Series PCH
Thermal Subsystem
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME
HECI #1
00:17.0 RAID bus controller: Intel Corporation C600/X79 series chipset
SATA RAID Controller
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root
Port #21 (rev f0)
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root
Port #3 (rev f0)
00:1c.4 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root
Port #5 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root
Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation C422 Chipset LPC/eSPI Controller
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset
Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus
Controller
00:1f.5 Non-VGA unclassified device: Intel Corporation 200 Series/Z370
Chipset Family SPI Controller
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2)
I219-LM
01:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI
Reversible Bridge (rev 02)
02:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1
Controller (rev 62)
02:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1
Controller (rev 62)
02:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)
03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
Connection (rev 03)
05:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI
Reversible Bridge (rev 02)
06:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1
Controller (rev 62)
06:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1
Controller (rev 62)
06:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)
14:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A
(rev 04)
14:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
14:05.2 System peripheral: Intel Corporation Sky Lake-E RAS
Configuration Registers (rev 04)
14:05.4 PIC: Intel Corporation Device 2036 (rev 04)
14:08.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:08.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:08.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:08.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:08.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:08.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:08.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:08.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:09.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:09.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0e.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0e.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0e.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0e.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0e.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0e.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0e.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0e.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0f.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:0f.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:1d.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:1d.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:1d.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:1d.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers
(rev 04)
14:1e.0 System peripheral: Intel Corporation Sky Lake-E PCU Registers
(rev 04)
14:1e.1 System peripheral: Intel Corporation Sky Lake-E PCU Registers
(rev 04)
14:1e.2 System peripheral: Intel Corporation Sky Lake-E PCU Registers
(rev 04)
14:1e.3 System peripheral: Intel Corporation Sky Lake-E PCU Registers
(rev 04)
14:1e.4 System peripheral: Intel Corporation Sky Lake-E PCU Registers
(rev 04)
14:1e.5 System peripheral: Intel Corporation Sky Lake-E PCU Registers
(rev 04)
14:1e.6 System peripheral: Intel Corporation Sky Lake-E PCU Registers
(rev 04)
15:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI
Reversible Bridge (rev 02)
16:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1
Controller (rev 62)
16:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1
Controller (rev 62)
16:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)
20:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A
(rev 04)
20:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
20:05.2 System peripheral: Intel Corporation Sky Lake-E RAS
Configuration Registers (rev 04)
20:05.4 PIC: Intel Corporation Device 2036 (rev 04)
20:08.0 System peripheral: Intel Corporation Device 2066 (rev 04)
20:09.0 System peripheral: Intel Corporation Device 2066 (rev 04)
20:0a.0 System peripheral: Intel Corporation Device 2040 (rev 04)
20:0a.1 System peripheral: Intel Corporation Device 2041 (rev 04)
20:0a.2 System peripheral: Intel Corporation Device 2042 (rev 04)
20:0a.3 System peripheral: Intel Corporation Device 2043 (rev 04)
20:0a.4 System peripheral: Intel Corporation Device 2044 (rev 04)
20:0a.5 System peripheral: Intel Corporation Device 2045 (rev 04)
20:0a.6 System peripheral: Intel Corporation Device 2046 (rev 04)
20:0a.7 System peripheral: Intel Corporation Device 2047 (rev 04)
20:0b.0 System peripheral: Intel Corporation Device 2048 (rev 04)
20:0b.1 System peripheral: Intel Corporation Device 2049 (rev 04)
20:0b.2 System peripheral: Intel Corporation Device 204a (rev 04)
20:0b.3 System peripheral: Intel Corporation Device 204b (rev 04)
20:0c.0 System peripheral: Intel Corporation Device 2040 (rev 04)
20:0c.1 System peripheral: Intel Corporation Device 2041 (rev 04)
20:0c.2 System peripheral: Intel Corporation Device 2042 (rev 04)
20:0c.3 System peripheral: Intel Corporation Device 2043 (rev 04)
20:0c.4 System peripheral: Intel Corporation Device 2044 (rev 04)
20:0c.5 System peripheral: Intel Corporation Device 2045 (rev 04)
20:0c.6 System peripheral: Intel Corporation Device 2046 (rev 04)
20:0c.7 System peripheral: Intel Corporation Device 2047 (rev 04)
20:0d.0 System peripheral: Intel Corporation Device 2048 (rev 04)
20:0d.1 System peripheral: Intel Corporation Device 2049 (rev 04)
20:0d.2 System peripheral: Intel Corporation Device 204a (rev 04)
20:0d.3 System peripheral: Intel Corporation Device 204b (rev 04)
21:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro
P400] (rev a1)
21:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio
Controller (rev a1)
2c:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A
(rev 04)
2c:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port C
(rev 04)
2c:03.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port D
(rev 04)
2c:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
2c:05.2 System peripheral: Intel Corporation Sky Lake-E RAS
Configuration Registers (rev 04)
2c:05.4 PIC: Intel Corporation Device 2036 (rev 04)
2c:12.0 Performance counters: Intel Corporation Sky Lake-E M3KTI
Registers (rev 04)
2c:12.1 Performance counters: Intel Corporation Sky Lake-E M3KTI
Registers (rev 04)
2c:12.2 System peripheral: Intel Corporation Sky Lake-E M3KTI Registers
(rev 04)
2c:15.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers
(rev 04)
2c:16.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers
(rev 04)
2c:16.4 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers
(rev 04)
2c:17.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers
(rev 04)
2d:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI
Reversible Bridge (rev 02)
2e:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1
Controller (rev 62)
2e:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1
Controller (rev 62)
2e:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)