Hi everyone, On 18/12/2018 06.45, Mauro Carvalho Chehab wrote: > Em Mon, 17 Dec 2018 21:05:11 -0500 > Alex Deucher <alexdeucher@xxxxxxxxx> escreveu: > >> On Sun, Dec 16, 2018 at 9:23 AM Mauro Carvalho Chehab >> <mchehab@xxxxxxxxxx> wrote: >>> Em Sun, 16 Dec 2018 11:37:02 +0100 >>> Markus Dobel <markus.dobel@xxxxxx> escreveu: >>> >>>> On 06.12.2018 19:01, Mauro Carvalho Chehab wrote: >>>>> Em Thu, 06 Dec 2018 18:18:23 +0100 >>>>> Markus Dobel <markus.dobel@xxxxxx> escreveu: >>>>> >>>>>> Hi everyone, >>>>>> >>>>>> I will try if the hack mentioned fixes the issue for me on the weekend >>>>>> (but I assume, as if effectively removes the function). >>>>> It should, but it keeps a few changes. Just want to be sure that what >>>>> would be left won't cause issues. If this works, the logic that would >>>>> solve Ryzen DMA fixes will be contained into a single point, making >>>>> easier to maintain it. >>>> Hi, >>>> >>>> I wanted to have this setup running stable for a few days before >>>> replying, that's why I am answering only now. >>>> >>>> But yes, as expected, with Mauro's hack, the driver has been stable for >>>> me for about a week, with several >>>> scheduled recordings in tvheadend, none of them missed. >>>> >>>> So, adding a reliable detection for affected chipsets, where the `if >>>> (1)` currently is, should work. >>> Markus, >>> >>> Thanks for testing! >>> >>> Brad/Alex, >>> >>> I guess we should then stick with this patch: >>> https://patchwork.linuxtv.org/patch/53351/ >>> >>> The past approach that we used on cx88, bttv and other old drivers >>> were to patch drivers/pci/quirks.c, making them to "taint" DMA >>> memory controllers that were known to bad affect on media devices, >>> and then some logic at the drivers to check for such "taint". >>> >>> However, that would require to touch another subsystem, with >>> usually cause delays. Also, as Alex pointed, this could well >>> be just a matter of incompatibility between the cx23885 and >>> the Ryzen DMA controller, and may not affect any other drivers. >>> >>> So, let's start with a logic like what I proposed, fine >>> tuning it to the Ryzen DMA controllers with we know have >>> troubles with the driver. >>> >>> We need to list the PCI ID of the memory controllers at the >>> device ID table on that patch, though. At the RFC patch, >>> I just added an IOMMU PCI ID from a randon Ryzen CPU: >>> >>> +static struct { >>> + int vendor, dev; >>> +} const broken_dev_id[] = { >>> + /* According with >>> + * https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci, >>> + * 0x1451 is PCI ID for the IOMMU found on Ryzen 7 >>> + */ >>> + { PCI_VENDOR_ID_AMD, 0x1451 }, >>> +}; >>> + >>> >>> Ideally, the ID for the affected Ryzen DMA engines should be there at >>> include/linux/pci_ids.h, instead of hard-coded inside a driver. >>> >>> Also, we should, instead, add there the PCI IDs of the DMA engines >>> that are known to have problems with the cx23885. >> These aren't really DMA engines. Isn't this just the pcie bridge on the CPU? > Yeah, it is not the DMA engine itself, but the CPU/chipset support for it. > > Let me be a little clearer. The Conexant chipsets for PCI/PCIe engines > have internally a RISC CPU that it is programmed, in runtime, to do > DMA scatter/gather. The actual DMA engine is there. For it to work, the > Northbridge (or the CPU chipset - as nowadays several chipsets integrated > the Northbridge inside an IP block at the CPU) has to do the counter part, > by allowing the board's DMA engine to access the mainboard's main memory, > usually via IOMMU, in a safe way[1]. > > [1] preventing memory corruption if two devices try to do DMA to the > same area, or if the DMA from the board tries to write at the same > time the CPU tries to access it. > > Media PCI boards usually push the DMA logic to unusual conditions, as > a large amount of data is transferred, in a synchronous way, > between the PCIe card and memory. > > If the video stream is recorded, the same physical memory DMA mapped area > where the data is written by the video board could be used on another DMA > transfer via the HD disk controller. > > It is even possible to setup the Conexant's DMA engine to do transfers > directly to the GPU's internal memory, causing a PCI to PCI DMA transfer, > using V4L2 API overlay mode. > > There was a time where it used to be common to have Intel CPUs (or > Intel-compatible CPUs) using non-Intel North Bridges. On such time, > we've seen a lot of troubles with PCI to PCI transfers most of them > when using non-Intel north bridges. > > With some north bridges, having the same block of memory mapped > for two DMA operations (where memory writes come from the video > card and memory reads from the HD disk controller) was also > problematic, as the IOMMU had issues on managing two kinds of > transfer for the same physical memory block. > > The report we have on the 95f408bb commit is: > > "media: cx23885: Ryzen DMA related RiSC engine stall fixes > > This bug affects all of Hauppauge QuadHD boards when used on all Ryzen > platforms and some XEON platforms. On these platforms it is possible to > error out the RiSC engine and cause it to stall, whereafter the only > way to reset the board to a working state is to reboot. > ... > [ 255.663598] cx23885: cx23885[0]: mpeg risc op code error" > > Brad could fill more details here, but I've seen the "risc op code > error" before with bt878 and cx88 chipsets (with use a similar RISC). > We usually get such error when there's a problem with the North Bridge > that was not capable of doing their part at the DMA transfer. > > As far as I know, the Hauppauge QuadHD boards can receive 4 different > HD MPEG-TS streams (either from cable or air transmissions). On cable, > one transponder can have up to ~40 Mbits/second. So, this board will > produce 4 streams of up to 40 Mbps each, happening on different times, > each filled in a synchronous way. As nobody watches 4 channels at > the same time, it is safe to assume that at least 3 channels will > be recorded (if not all 4 channels). So, we're talking about 320 MBps > of traffic that may be competing with other DMA traffic (including > some from the Kernel itself, in order to handle memory swap). > > That can be recording channels for several weeks. > > This usually pushes the North Bridge into their limits, and could > be revealing some North Bridge/IOMMU issues that it would otherwise > be not noticed under normal traffic. Thanks for the detailed description Mauro. What you've said here is pretty much my understanding. I submitted a patch to the list and cc'd you all. I simply took Mauro's patch and added a module option. The option is set to default enable for Ryzen, and also have a force on and force off option. I added a comment in the driver in case someone encounters this hereafter. Regards, Brad > >>> There one thing that still bothers me: could this problem be due to >>> some BIOS setup [1]? If so, are there any ways for dynamically >>> disabling such features inside the driver? >>> >>> [1] like this: https://www.techarp.com/bios-guide/cpu-pci-write-buffer/ >>> >> possibly? It's still not clear to me that this is specific to ryzen >> chips rather than a problem with the DMA setup on the cx board. Is >> there a downside to enabling the workaround in general? > The problem here is that the code with resets the DMA engine (required > for it to work with Ryzen) causes trouble with non-Ryzen North Bridges. > > So, one solution that would fit all doesn't seem to exist. > >> The original commit mentioned that xeon platforms were affected as well. > Xeon uses different chipsets and a different solution for the North > Bridge functionality, with may explain why some Xeon CPUs have the > same issue. > >> Is it possible it's just particular platforms with wonky bioses? > Good point. Yeah, it could be triggered by a wonky bios or a bad setup > (like enabling overclock or activating some chipset-specific feature > that would increase the chance for a DMA transfer to fail). > >> Maybe DMI matching would be better? > Mapping via DMI could work too, but it would be a way harder to map, > as one would need to have a cx23885 board (if possible one with 4 > tuners) and a series of different machines in order to test it. > > Based with previous experiences with bttv and cx88, I suspect that > we'll end by needing to map all machines with the same chipset. > >>> Brad, >>> >>> From your reports about the DMA issues, do you know what generations >>> of the Ryzen are affected? >>> >>> Alex, >>> >>> Do you know if are there any differences at the IP block for the >>> DMA engine used on different Ryzen CPUs? I mean: I suspect that >>> the engine for Ryzen 2nd generation would likely be different than >>> the one at the 1st generation, but, along the same generation, does >>> the Ryzen 3, 5, 7 and Threadripper use the same DMA engine? >> + Suravee. I'm not really familiar with the changes, if any, that are >> in the pcie bridges on various AMD CPUs. Or if there are changes, it >> would be hard to say whether this issue would affect them or not. >> >> Alex > > > Thanks, > Mauro