On Tue, 11 Nov 2014 19:05:37 +0100
Hans Verkuil <hverkuil@xxxxxxxxx> wrote:

> On 11/11/2014 06:46 PM, Andrey Utkin wrote:
> > At Bluecherry, we have issues with servers which have 3 solo6110
> > cards (and cards have up to 16 analog video cameras connected to
> > them, and being actively read).
> > This is a kernel which I tested with such a server last time. It is
> > based on linux-next of October, 31, with few patches of mine (all
> > are in review for upstream).
> > https://github.com/krieger-od/linux/ . The HEAD commit is
> > 949e18db86ebf45acab91d188b247abd40b6e2a1 at the moment.
> >
> > The problem is the following: after ~1 hour of uptime with working
> > application reading the streams, one card (the same one every time)
> > stops producing interrupts (counter in /proc/interrupts freezes),
> > and all threads reading from that card hang forever in
> > ioctl(VIDIOC_DQBUF). The application uses libavformat (ffmpeg) API
> > to read the corresponding /dev/videoX devices of H264 encoders.
> > Application restart doesn't help, just interrupt counter increases
> > by 64. To help that, we need reboot or programmatic PCI device
> > reset by "echo 1 > /sys/bus/pci/devices/0000\:03\:05.0/reset",
> > which requires unloading app and driver and is not a solution
> > obviously.
> >
> > We had this issue for a long time, even before we used libavformat
> > for reading from such sources.
> > A few days ago, we had standalone ffmpeg processes working stable
> > for several days. The kernel was 3.17, the only probably-relevant
> > change in code over the above mentioned revision is an additional
> > bool variable set in solo_enc_v4l2_isr() and checked in
> > solo_ring_thread() to figure out whether to do or skip
> > solo_handle_ring(). The variable was guarded with
> > spin_lock_irqsave(). I am not sure if it makes any difference, will
> > try it again eventually.
> >
> > Any thoughts, can it be a bug in driver code causing that (please
> > point which areas of code to review/fix)? Or is that desperate
> > hardware issue? How to figure out for sure whether it is the former
> > or the latter?
>
> I would first try to exclude hardware issues: since you say it is
> always the same card, try either replacing it or swapping it with
> another solo card and see if the problem follows the card or not. If
> it does, then it is likely a hardware problem. If it doesn't, then it
> suggests a race condition in the interrupt handling somewhere.
>
> Regards,
>
> 	Hans

CC'ing Curtis, hope you don't mind.

That it is always the same card is just a coincidence. This has been a
long-standing issue, and it only depends on having enough cards. One of
the problems I had in weeding this one out was that I didn't have the
right hardware (only one 16-port card), and my guess is that Andrey is
in the same position.
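For anyone trying to catch the moment a card goes quiet, the frozen
counter in /proc/interrupts mentioned above can be watched from
userspace. What follows is only a rough sketch, not something taken
from the driver or from Bluecherry's application. It assumes the
handler is listed under the name "solo6x10" (check your own
/proc/interrupts), and the helper names irq_counts, stalled and watch
are made up for illustration. Note that on a shared IRQ line the
counter also includes the other devices on that line, so a stalled card
could be masked.

```python
import time


def irq_counts(text, name):
    """Sum the per-CPU counters of every /proc/interrupts line mentioning `name`.

    Returns a dict mapping the IRQ label (e.g. "16") to its total count.
    """
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or name not in rest:
            continue  # header line, or an IRQ that belongs to something else
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # per-CPU counters end where the controller name starts
            total += int(field)
        counts[label.strip()] = total
    return counts


def stalled(before, after):
    """Return the IRQ labels whose total did not change between two samples."""
    return sorted(irq for irq, n in after.items() if before.get(irq) == n)


def watch(name="solo6x10", interval=10.0, path="/proc/interrupts"):
    """Poll forever and report IRQs that saw no interrupts during an interval.

    The default `name` is an assumption about how the driver registers its
    handler; adjust it to whatever your system shows.
    """
    prev = None
    while True:
        with open(path) as f:
            cur = irq_counts(f.read(), name)
        if prev is not None:
            for irq in stalled(prev, cur):
                print(time.strftime("%H:%M:%S"), "IRQ", irq, "stalled at", cur[irq])
        prev = cur
        time.sleep(interval)
```

Calling watch() from a small script while the streams are being read
should give a timestamp for the stall, which can then be lined up with
dmesg output and the application logs.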