On Tue, Dec 13, 2022 at 11:03:52AM -0800, Jiaqi Yan wrote:
> On Tue, Dec 13, 2022 at 10:10 AM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
> >
> > > I think that one point not mentioned yet is how the in-kernel scanner finds
> > > a broken page before the page is marked by PG_hwpoison. Some mechanism
> > > similar to mcsafe-memcpy could be used, but maybe memcpy is not necessary
> > > because we just want to check the healthiness of pages. So a core routine
> > > like mcsafe-read would be introduced in the first patchset (or we already
> > > have it)?
> >
> > I don’t think that there is an existing routine to do the mcsafe-read. But it should
> > be easy enough to write one. If an architecture supports a way to do this without
> > evicting other data from caches, that would be a bonus. X86 has a non-temporal
> > read that could be interesting ... but I'm not sure that it would detect poison
> > synchronously. I could be wrong, but I expect that you won’t see a machine check,
> > but you should see the memory controller log a UCNA error reported by a CMCI.
> >
> > -Tony
>
> To Naoya: yes, we will introduce a new scanning routine. It "touches"
> cacheline by cacheline of a page to detect memory error. This "touch"
> is essentially an ANDQ operation of loaded cacheline with 0, to avoid
> leaking user data in the register.
>
> To Tony: thanks. I think you are referring to PREFETCHNTA before ANDQ?
> (which we are using in our scanning routine to minimize cache
> pollution.) We tested the attached scanning draft on Intel Skylake +
> Cascadelake + Icelake CPUs, and the ANDQ instruction does raise a MC
> synchronously when an injected memory error is encountered.
>
> To Yazen and Vilas: We haven't tested on any AMD hardware. Do you have
> any thoughts on PREFETCHNTA + MC?
>

Hi Jiaqi,

I'm not sure of the behavior. I think it'll require some experimentation.
The AMD APM has the following statement in the "PREFETCHlevel" description:
"The operation of this instruction is implementation-dependent."

So it may be the case that the behavior changes between products. Maybe this
procedure should be opt-in and only apply to products that are verified to work?

Thanks,
Yazen
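
For illustration only, here is a rough user-space sketch of the two ideas
discussed above: a per-cacheline "touch" (non-temporal prefetch, then a load
whose result is ANDed with 0) and an opt-in list of verified products. This is
not the scanning draft attached earlier in the thread. The names
scan_verified_cpu, scan_is_opted_in and touch_page are hypothetical, the
allowlist entries are placeholders (the thread gives no model numbers), and
64-byte cachelines, 4 KiB pages and the GCC/Clang __builtin_prefetch builtin
are all assumptions. It also has no machine-check recovery, which a real
in-kernel routine would need.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE_SIZE  64    /* assumption: 64-byte cachelines */
#define PAGE_SIZE_BYTES 4096  /* assumption: 4 KiB pages */

/* Hypothetical allowlist entry: a product verified to report poison
 * synchronously on the touch. */
struct scan_verified_cpu {
	const char *vendor;
	unsigned int family;
	unsigned int model;
};

/* Placeholder entries for illustration; not taken from the thread. */
static const struct scan_verified_cpu verified_cpus[] = {
	{ "GenuineIntel", 6, 0x55 },
	{ "GenuineIntel", 6, 0x6a },
};

/* Opt-in check: scanning is enabled only for products on the list. */
static bool scan_is_opted_in(const char *vendor, unsigned int family,
			     unsigned int model)
{
	size_t n = sizeof(verified_cpus) / sizeof(verified_cpus[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(verified_cpus[i].vendor, vendor) == 0 &&
		    verified_cpus[i].family == family &&
		    verified_cpus[i].model == model)
			return true;
	}
	return false;
}

/* Touch one page cacheline by cacheline; returns the number of lines read. */
static size_t touch_page(const void *page)
{
	size_t touched = 0;

	for (size_t off = 0; off < PAGE_SIZE_BYTES; off += CACHELINE_SIZE) {
		const char *line = (const char *)page + off;
		uint64_t v;

		/* Locality 0 asks for a non-temporal prefetch; compilers
		 * typically emit PREFETCHNTA for this on x86. */
		__builtin_prefetch(line, 0, 0);

		/* The volatile load forces a real read of the line; the
		 * value is then masked to 0 so no user data is kept. */
		v = *(const volatile uint64_t *)line;
		v &= 0;
		(void)v;
		touched++;
	}
	return touched;
}
```

A caller would check scan_is_opted_in() once, for example at init time, and
skip the scan entirely on products that are not listed. That keeps
implementation-dependent PREFETCHlevel behavior from affecting unverified
hardware.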