Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault

Dave Hansen <dave.hansen@xxxxxxxxx> · Thu, 8 Jul 2021 09:58:43 -0700

On 7/8/21 9:48 AM, Brijesh Singh wrote:
> On 7/8/21 10:30 AM, Dave Hansen wrote:
>>> The reason for iterating through the 2MB region is this: if the
>>> faulting address is not assigned in the RMP table and the page table
>>> walk level is 2MB, then one of the entries within the large page is
>>> the root cause of the fault. Since we don't know which entry, I dump
>>> all the non-zero entries.
>>
>> Logically you can figure this out though, right?  Why throw 511 entries
>> at the console when we *know* they're useless?
> 
> Logically it's going to be tricky to figure out which exact entry caused
> the fault, hence I dump any non-zero entry. I understand it may dump
> some useless entries.

What's tricky about it?

Sure, there's a possibility that more than one entry could contribute to
a fault.  But, you always know *IF* an entry could contribute to a fault.

I'm fine if you run through the logic, don't find a known reason
(specific RMP entry) for the fault, and dump the whole table in that
case.  But, unconditionally polluting the kernel log with noise isn't
very nice for debugging.
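
A minimal sketch of that two-pass approach, for illustration only. The
struct, the bit positions (SKETCH_RMP_ASSIGNED, SKETCH_RMP_IN_USE) and
the function names are hypothetical and do not reflect the real RMP
entry layout or the code in this patch. It also assumes, purely for the
example, that "assigned" or "in-use" is what marks an entry as a
possible cause of the fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit positions, for illustration only. */
#define SKETCH_RMP_ASSIGNED	(1ULL << 0)
#define SKETCH_RMP_IN_USE	(1ULL << 1)
#define SKETCH_PFNS_PER_2M	512u

/* Simplified stand-in for a 16-byte RMP entry. */
struct sketch_rmpentry {
	uint64_t lo;
	uint64_t hi;
};

/* An entry with a known way of contributing to the fault (assumed). */
static int sketch_entry_is_suspect(const struct sketch_rmpentry *e)
{
	return (e->lo & (SKETCH_RMP_ASSIGNED | SKETCH_RMP_IN_USE)) != 0;
}

/*
 * Dump the entries relevant to a fault at @pfn and return how many
 * lines were printed. @is_2m says whether the page table walk ended
 * at a 2MB mapping.
 */
static size_t sketch_dump_rmp(const struct sketch_rmpentry *table,
			      size_t nr_entries, size_t pfn, int is_2m)
{
	size_t start = is_2m ? pfn & ~(size_t)(SKETCH_PFNS_PER_2M - 1) : pfn;
	size_t end = is_2m ? start + SKETCH_PFNS_PER_2M : pfn + 1;
	size_t i, printed = 0;

	assert(table != NULL);
	if (end > nr_entries)
		end = nr_entries;

	/* Pass 1: only entries that could explain the fault. */
	for (i = start; i < end; i++) {
		if (!sketch_entry_is_suspect(&table[i]))
			continue;
		printf("RMP pfn 0x%zx: assigned=%d in_use=%d\n", i,
		       !!(table[i].lo & SKETCH_RMP_ASSIGNED),
		       !!(table[i].lo & SKETCH_RMP_IN_USE));
		printed++;
	}
	if (printed)
		return printed;

	/* Pass 2: no known reason found, fall back to the raw dump. */
	for (i = start; i < end; i++) {
		if (!table[i].lo && !table[i].hi)
			continue;
		printf("RMP pfn 0x%zx: raw 0x%016llx 0x%016llx\n", i,
		       (unsigned long long)table[i].hi,
		       (unsigned long long)table[i].lo);
		printed++;
	}
	return printed;
}
```

With this shape, the raw dump of non-zero entries only reaches the log
when the first pass finds no entry with a known reason.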

>>> There are two cases which we need to consider:
>>>
>>> 1) the faulting page is a guest private page (aka assigned)
>>> 2) the faulting page is a hypervisor page (aka shared)
>>>
>>> We will be primarily seeing #1. In this case, we know it's an assigned
>>> page, and we can decode the fields.
>>>
>>> The #2 will happen in rare conditions,
>>
>> What rare conditions?
> 
> One such condition is when the RMP "in-use" bit is set; see patch 20/40.
> After applying that patch we should not see the "in-use" bit set. If we
> run into similar issues, a full RMP dump will greatly help debugging.

OK... so dump the "in-use" bit here if you see it.

>>> if it happens, one of the undocumented bits in the RMP entry can
>>> provide us some useful information, hence we dump the raw values.
>> You're saying that there are things that can cause RMP faults that
>> aren't documented?  That's rather nasty for your users, don't you think?
> 
> The "in-use" bit in the RMP entry caught me off guard. The AMD APM does
> say that hardware sets the in-use bit, but it *never* explains in
> detail how to check whether a fault was due to the in-use bit in the RMP
> table. As I said, the documentation folks will be updating the RMP entry
> description to document the in-use bit. I hope we will not see any other
> undocumented surprises; I am keeping my fingers crossed :)

Oh, ok.  That sounds fine.  Documentation is out of date all the time.