The objective of this email is to report the current status of my findings.
I loaded netconsole on both machines I was having problems with. I tried
3 times on the machine with kernel 5.0.0-37 and twice on the
machine with kernel 5.3.0-62. Each attempt consisted of running the
program which locks the kernel until it locked (about 3 minutes after
starting the program). The program had the "semaphore code"
commented out. Nothing was sent to netconsole in any of the 5 attempts I
made when the kernel locked.
Just to be clear about my use of netconsole: before loading the
netconsole kernel module, I ran "dmesg -n 8". When the netconsole module
was loaded, I could clearly see about 9 message lines on the computer
receiving the netconsole messages, telling me that netconsole was loaded
(and how it was configured), so I have no doubts about the netconsole
setup being correct. The "netconsole server" was a machine on the same
local network.
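
For reference, the setup described above can be sketched roughly as
follows. This is only an illustration: the IP addresses, interface name
and MAC address are placeholders (assumptions), not the values from the
actual machines. The ports are the netconsole defaults, and the parameter
format is the one given in the kernel's netconsole documentation.

```shell
#!/bin/sh
# Placeholder values (assumptions, not the real test setup).
SRC_PORT=6665               # netconsole default source port
SRC_IP=192.168.1.10         # hypothetical: the machine that locks up
DEV=eth0                    # hypothetical: its network interface
TGT_PORT=6666               # netconsole default target port
TGT_IP=192.168.1.20         # hypothetical: the receiving machine
TGT_MAC=00:11:22:33:44:55   # hypothetical: MAC of the receiving machine

# Build the module parameter: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$SRC_PORT" "$SRC_IP" "$DEV" "$TGT_PORT" "$TGT_IP" "$TGT_MAC"
}

# The privileged steps run only when APPLY=1 is set (must be root).
if [ "${APPLY:-0}" = "1" ]; then
    dmesg -n 8    # raise the console log level before loading the module
    modprobe netconsole "netconsole=$(netconsole_param)"
fi

# On the receiving machine, a UDP listener on the target port, e.g.:
#   nc -u -l 6666        (some netcat variants need: nc -u -l -p 6666)
```

Run without APPLY set, the script only defines the parameter string, so
it can be checked before anything is loaded on the machine under test.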
My next attempt will be to compile kernel 5.9, as you suggested, and try it.
Thanks,
Alberto Sentieri
On 11/11/20 10:51 AM, Alan Stern wrote:
On Tue, Nov 10, 2020 at 06:42:17PM -0500, Alberto Sentieri wrote:
1) The current Ubuntu Kernel is 5.4.0-53. Do you want me to upgrade it to
5.9, from kernel.org? Or is there an Ubuntu 5.9 package that I can use? It
would be easy to do it if there is an Ubuntu package with 5.9, which I would
install and, after the tests, uninstall.
If you want to know what Ubuntu packages are available, you should ask
on an Ubuntu mailing list instead of the linux-usb mailing list.
2) Why do you believe that 5.9 would solve the problem? I am asking that
because I cannot change the production machine for a test if I cannot go
back to the original state. There is always a risk involved.
We do not believe that 5.9 will solve the problem -- we have no reason
to believe this -- but we could be wrong. In any case it is always
best to test with the most up-to-date software available, and 5.9 is the
version closest to what we are working on now.
3) It is one single thread dealing with all 36 devices. Each device has its
own co-routine (not preemptive), but all co-routines are executed by a
unique thread.
If everything runs within a single thread, how can adding a semaphore
or mutex make any difference?
4) By network console, do you mean ssh? It dies as well when it locks. The
screen is the regular GNOME3 screen and nothing can be seen there. Every
time it locks they send a picture, and I cannot see anything meaningful
there. I am thinking about disabling GNOME3, but I need their blessing for
that.
See https://www.kernel.org/doc/Documentation/networking/netconsole.txt
for instructions on netconsole. And when you use it for testing, be
sure to set the console log level to a high value.
Alan Stern