Re: [RFC] Expose a memory poison detector ioctl to user space.

Jue Wang <juew@xxxxxxxxxx> · Tue, 26 Apr 2022 12:23:12 -0700

On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 4/26/22 10:57, Jue Wang wrote:
> > On Tue, Apr 26, 2022 at 8:40 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> >> From your description, you have me mostly convinced that this is
> >> something that needs to get fixed.  The hardware patrol scrubber(s)
> >> address the same basic problem, but don't seem to be flexible to your
> >> specific needs.
> >>
> >> But, have hardware vendors been receptive at all to making the patrol
> >> scrubbers more tunable?
> >
> > We have discussed the use case in detail with Intel. There are
> > improvements in progress to address some of the issues like the
> > signaling to avoid broadcasted MCEs. But fundamentally, the needed
> > throughput is not quite compatible with the patrol scrubber's design
> > purpose and arch.
>
> This would be great material to cover in the changelog in some more
> detail.
>
> >> On 4/25/22 09:34, Jue Wang wrote:
> >>> /* Could stop and return after the 1st poison is detected */
> >>> #define MCESCAN_IOCTL_SCAN 0
> >>>
> >>> struct SysramRegion {
> >>>   /* input */
> >>>   uint64_t first_byte;   /* first page-aligned physical address to scan */
> >>>   uint64_t length;       /* page-aligned length of memory region to scan */
> >>>   /* output */
> >>>   uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
> >>>   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
> >>> }
> >>
> >> So, the ioctl() caller has to know the physical address layout of the
> >> system?
> >
> > This info is available from /proc/iomem and /proc/zoneinfo already
> > supported / exposed by the kernel.
>
> I don't think they are good enough.
>
> Think of a TDX guest.  It can't touch "unaccepted" memory.  But, that
> information is not present in /proc/iomem.  In a TDX host (not upstream
> yet), it can't touch any guest memory.  That's also not in /proc/iomem.

We will follow up on these topics wrt the interactions with TDX/SEV-SNP.

>
> What if you're in a normal (non-TDX) guest and some of the physical
> address space has been ballooned away?

Accessing to memory that gets ballooned away will cause extra EPT
violations and have the memory faulted in on the host side, which is
transparent to the guest.
>
> What does the kernel do when userspace asks it to poke a non-"System
> RAM" address?

I expect the kernel should reject the request with -EINVAL.
>
> >> While this is a good start at a conversation, I think you might want to
> >> back up a bit.  You alluded to a few requirements that you have, like:
> >>
> >>  * Adjustable detector resource use based on system utilization
> >>  * Adjustable scan rate to ensure issues are found at a deterministic
> >>    rate
> >>  * Detector must be able to find errors in allocated, in-use memory
> >>
> >> What about SEV-SNP or TDX private memory?  It might be unmapped *and*
> >> limited in how it can be accessed.  For instance, TDX hosts can't
> >> practically read guest memory.  SEV-SNP hosts have special page mapping
> >> requirements; the cost can't create arbitrary mappings with arbitrary
> >> mapping sizes.  What would this ioctl() do if asked to scan a TDX guest
> >> private page?
> >
> > Thanks for raising the UPM case for SEV-SNP / TDX private memory. This
> > is what we like to get more feedback and more experts' weigh-ins.
> >
> > Is reading private memory via kernel's direct mapping benign for
> > SEV-SNP and TDX?
>
> No.  It causes machine checks for TDX.
>
> For SEV-SNP, I think reads of private memory read ciphertext.  I'm not
> sure how benign it is or if it has any cache coherency implications.
>
> > Otherwise this feature should be defined as mutually exclusive with
> > incompatible features.
>
> Just as an exercise, I'd suggest going and asking some of your
> colleagues about this.  Surely, you're asking for this functionality
> because Google wants to use it, and use it *widely*.  What would your
> colleagues think if this wasn't available  at all on systems that use or
> might use TDX?
>
> For upstream, making features mutually exclusive is a deal breaker
> unless it's absolutely necessary.

Ack, we will follow up within Google.

Just curious, what could be recommendations from Intel's perspective
to make proactively poison detection work on TDX / SEV-SNP?

>
> > Even in that case, I believe SEV-SNP or TDX may still benefit fro
> > _reactive_ memory poison recovery if the MCE handling and
> > CONFIG_MEMORY_FAILURE still function the same on uncorrectable error
> > raised #MC.
>
> If I remember right, the blast radius for machine checks on systems
> using TDX is substantially bigger than without TDX.  I think there are
> quite a few more cases that are non-recoverable, like poison detected in
> TDX metadata.  TDX systems have a *stronger* requirement to proactively
> find issues than non-TDX systems.