Re: [RFC] Expose a memory poison detector ioctl to user space.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/26/22 10:57, Jue Wang wrote:
> On Tue, Apr 26, 2022 at 8:40 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>> From your description, you have me mostly convinced that this is
>> something that needs to get fixed.  The hardware patrol scrubber(s)
>> address the same basic problem, but don't seem to be flexible to your
>> specific needs.
>>
>> But, have hardware vendors been receptive at all to making the patrol
>> scrubbers more tunable?
> 
> We have discussed the use case in detail with Intel. There are
> improvements in progress to address some of the issues like the
> signaling to avoid broadcasted MCEs. But fundamentally, the needed
> throughput is not quite compatible with the patrol scrubber's design
> purpose and arch.

This would be great material to cover in the changelog in some more
detail.

>> On 4/25/22 09:34, Jue Wang wrote:
>>> /* Could stop and return after the 1st poison is detected */
>>> #define MCESCAN_IOCTL_SCAN 0
>>>
>>> struct SysramRegion {
>>>   /* input */
>>>   uint64_t first_byte;   /* first page-aligned physical address to scan */
>>>   uint64_t length;       /* page-aligned length of memory region to scan */
>>>   /* output */
>>>   uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
>>>   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
>>> }
>>
>> So, the ioctl() caller has to know the physical address layout of the
>> system?
> 
> This info is available from /proc/iomem and /proc/zoneinfo already
> supported / exposed by the kernel.

I don't think they are good enough.

Think of a TDX guest.  It can't touch "unaccepted" memory.  But, that
information is not present in /proc/iomem.  In a TDX host (not upstream
yet), it can't touch any guest memory.  That's also not in /proc/iomem.

What if you're in a normal (non-TDX) guest and some of the physical
address space has been ballooned away?

What does the kernel do when userspace asks it to poke a non-"System
RAM" address?

>> While this is a good start at a conversation, I think you might want to
>> back up a bit.  You alluded to a few requirements that you have, like:
>>
>>  * Adjustable detector resource use based on system utilization
>>  * Adjustable scan rate to ensure issues are found at a deterministic
>>    rate
>>  * Detector must be able to find errors in allocated, in-use memory
>>
>> What about SEV-SNP or TDX private memory?  It might be unmapped *and*
>> limited in how it can be accessed.  For instance, TDX hosts can't
>> practically read guest memory.  SEV-SNP hosts have special page mapping
>> requirements; the cost can't create arbitrary mappings with arbitrary
>> mapping sizes.  What would this ioctl() do if asked to scan a TDX guest
>> private page?
> 
> Thanks for raising the UPM case for SEV-SNP / TDX private memory. This
> is what we like to get more feedback and more experts' weigh-ins.
> 
> Is reading private memory via kernel's direct mapping benign for
> SEV-SNP and TDX? 

No.  It causes machine checks for TDX.

For SEV-SNP, I think reads of private memory read ciphertext.  I'm not
sure how benign it is or if it has any cache coherency implications.

> Otherwise this feature should be defined as mutually exclusive with
> incompatible features.

Just as an exercise, I'd suggest going and asking some of your
colleagues about this.  Surely, you're asking for this functionality
because Google wants to use it, and use it *widely*.  What would your
colleagues think if this wasn't available  at all on systems that use or
might use TDX?

For upstream, making features mutually exclusive is a deal breaker
unless it's absolutely necessary.

> Even in that case, I believe SEV-SNP or TDX may still benefit fro
> _reactive_ memory poison recovery if the MCE handling and
> CONFIG_MEMORY_FAILURE still function the same on uncorrectable error
> raised #MC.

If I remember right, the blast radius for machine checks on systems
using TDX is substantially bigger than without TDX.  I think there are
quite a few more cases that are non-recoverable, like poison detected in
TDX metadata.  TDX systems have a *stronger* requirement to proactively
find issues than non-TDX systems.




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux