Hi, I'd like to resurrect this thread (copied below). I have a system
showing this error. It's an HP ML350 server with 2x Xeon 5675 running
Rocky Linux 8.5. It has a Hauppauge HVR5525 card that uses the same
cx23885 kernel module as the quadHD card discussed above. The HVR5525
is a dual DVB-T2/DVB-S2 card.
In other threads I read about the dma_reset_workaround option. That
option did not appear to be in the driver version included in the
standard Rocky 8.5 kernel. I have loaded a 5.4 kernel, compiled the DVB
media modules from git source and set dma_reset_workaround=2 in a
file in modprobe.d. The built module shows version 0.0.4.
Sadly the error remains. The system runs MythTV v.31. The main
symptom is occasional aborted recordings, although the card does
appear to recover without requiring a reboot or cold restart.
I'd appreciate some assistance with this. What information can I
provide to help trace this?
I'm also maintaining a driver which started to show problems on
systems with new CPUs and chipsets quite some time ago, for example on
some Ryzen CPUs. In my case it turned out that the problem was because
my driver accessed memory locations on my PCI card directly via a
pointer.
Looks like the problem occurred because the CPU/chipset "optimized"
and re-ordered the execution of some machine instructions. There are
"barrier" instructions that can be inserted in the source code to
avoid this, but my original code didn't use them because the driver
had been working on many systems for a long time.
Anyway, the low level functions provided by the kernel to access
registers on a peripheral are implemented to use those barriers, so
simply using those primitives (writel, readl and friends) instead of
accessing the registers directly via a pointer (*p = cmd; val = *(p+1))
fixed the problem for my driver.
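As a rough illustration of the difference, here is a minimal userspace
sketch. It is not the kernel's implementation of writel/readl, and the
names fake_regs, reg_write32, reg_read32, issue_cmd_plain and
issue_cmd_ordered are hypothetical. Ordinary memory stands in for the
card's registers, and the fences use the GCC/Clang builtin
__atomic_thread_fence, which is an assumption about the compiler.

```c
#include <stdint.h>

/* Hypothetical two-register device: regs[0] = command, regs[1] = status.
 * Plain memory stands in for what would be an ioremap()'d MMIO window
 * in a real driver. */
static uint32_t fake_regs[2];

/* Fragile pattern: plain pointer accesses. Nothing tells the compiler
 * (or a weakly ordered CPU) that the order of these accesses matters. */
static uint32_t issue_cmd_plain(uint32_t *p, uint32_t cmd)
{
    *p = cmd;
    return *(p + 1);
}

/* Accessor sketch: a full fence followed by a volatile store, so that
 * earlier memory accesses are ordered before the register write. */
static inline void reg_write32(volatile uint32_t *addr, uint32_t val)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    *addr = val;
}

/* Accessor sketch: a volatile load followed by a full fence, so that
 * later memory accesses are not moved ahead of the register read. */
static inline uint32_t reg_read32(const volatile uint32_t *addr)
{
    uint32_t val = *addr;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    return val;
}

/* Same command/status sequence, but routed through the accessors. */
static uint32_t issue_cmd_ordered(uint32_t *p, uint32_t cmd)
{
    reg_write32(p, cmd);
    return reg_read32(p + 1);
}
```

On ordinary memory both variants return the same value. The point is
only that the second form makes the required ordering explicit, which
is what the kernel primitives do for real device registers.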
All the symptoms described here for the cx23885 module make me assume
that the problem is very similar, i.e. due to a missing barrier
instruction somewhere in the source code. Unfortunately I'm not
familiar with the Linux media driver stuff, so I don't know where I
could start to look for a missing barrier instruction.
The only workaround that fixed the problem for me, and that I'm still
using, is to load the cx23885 module with a high debug level, by
putting a line
options cx23885 debug=8
into a file
/etc/modprobe.d/cx23885.conf
This produces a HUGE amount of kernel log messages (dmesg), but with
lower debug levels the driver still didn't work reliably.
To make this stable for a long time, I changed /var/log/ to NOT point
to my SSD but to a real hard disk, and I created a cronjob file in
/etc/cron.d/ with the line
1 0-23 * * * root rm -f /var/log/kern.log*
to periodically remove the huge kernel log files.
This hack has worked for me ever since the problem was discussed on
this ML years ago.
Martin
Thank you Martin and Robert.
I've been doing some testing today. intel_iommu=off and
dma_reset_workaround=2 or dma_reset_workaround=0 didn't change the
symptoms.
This system has journald. I initially set debug=1 to see where the
messages go and I see what you mean about the volume of messages. I need
to work out how to divert this torrent to /dev/null if that option is to
be workable.
I fully understand your comment about out-of-order instructions, Martin.
Looks like this driver may need the same attention as the one you
maintain. One option for me is to move the HVR5525 to a lower power
machine and run that as a slave MythBackend.
Many thanks