On Fri, Jul 23, 2021 at 1:24 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jul 22, 2021 at 05:23:51PM -0500, Bjorn Helgaas wrote:
> > Marking both of these as "not applicable" for now because I don't
> > think we really understand what's going on.
> >
> > Apparently a DMA occurs during suspend or resume and triggers an ACS
> > violation.  I don't think such a DMA should occur in the first
> > place.
> >
> > Or maybe, since you say the problem happens right after ACS is enabled
> > during resume, we're doing the ACS enable incorrectly?  Although I
> > would think we should not be doing DMA at the same time we're enabling
> > ACS, either.
> >
> > If this really is a system firmware issue, both HP and Dell should
> > have the knowledge and equipment to figure out what's going on.
>
> DMA on resume sounds really odd.  OTOH the below mentioned case of
> a DMA during suspend seems very likely in some setups.  NVMe has the
> concept of a host memory buffer (HMB) that allows the PCIe device
> to use arbitrary host memory for internal purposes.  Combine this
> with the "Storage D3" misfeature in modern x86 platforms that forces
> a slot into d3cold without consulting the driver first and you'd see
> symptoms like this.  Another case would be the NVMe equivalent of the
> AER which could lead to a completion without host activity.

The issue can also be observed on non-HMB NVMe.

>
> We now have quirks in the ACPI layer and NVMe to fully shut down the
> NVMe controllers on these messed up systems with the "Storage D3"
> misfeature which should avoid such "spurious" DMAs at the cost of
> wearing out the device much faster.

Since the issue is on S3, I think the NVMe always fully shuts down.

Kai-Heng