On 7/8/21 9:48 AM, Brijesh Singh wrote:
> On 7/8/21 10:30 AM, Dave Hansen wrote:
>>> The reason for iterating through 2MB region is; if the faulting address
>>> is not assigned in the RMP table, and page table walk level is 2MB then
>>> one of entry within the large page is the root cause of the fault. Since
>>> we don't know which entry hence I dump all the non-zero entries.
>>
>> Logically you can figure this out though, right? Why throw 511 entries
>> at the console when we *know* they're useless?
>
> Logically its going to be tricky to figure out which exact entry caused
> the fault, hence I dump any non-zero entry. I understand it may dump
> some useless.

What's tricky about it?

Sure, there's a possibility that more than one entry could contribute to
a fault. But, you always know *IF* an entry could contribute to a fault.

I'm fine if you run through the logic, don't find a known reason
(specific RMP entry) for the fault, and dump the whole table in that
case. But, unconditionally polluting the kernel log with noise isn't
very nice for debugging.

>>> There are two cases which we need to consider:
>>>
>>> 1) the faulting page is a guest private (aka assigned)
>>> 2) the faulting page is a hypervisor (aka shared)
>>>
>>> We will be primarily seeing #1. In this case, we know its a assigned
>>> page, and we can decode the fields.
>>>
>>> The #2 will happen in rare conditions,
>>
>> What rare conditions?
>
> One such condition is RMP "in-use" bit is set; see the patch 20/40.
> After applying the patch we should not see "in-use" bit set. If we run
> into similar issues, a full RMP dump will greatly help debug.

OK... so dump the "in-use" bit here if you see it.

>>> if it happens, one of the undocumented bit in the RMP entry can
>>> provide us some useful information hence we dump the raw values.
>> You're saying that there are things that can cause RMP faults that
>> aren't documented? That's rather nasty for your users, don't you think?
>
> The "in-use" bit in the RMP entry caught me off guard. The AMD APM does
> says that hardware sets in-use bit but it *never* explained in the
> detail on how to check if the fault was due to in-use bit in the RMP
> table. As I said, the documentation folks will be updating the RMP entry
> to document the in-use bit. I hope we will not see any other
> undocumented surprises, I am keeping my finger cross :)

Oh, ok. That sounds fine. Documentation is out of date all the time.
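
To make the "run through the logic first, dump everything only as a
fallback" suggestion concrete, here is a rough userspace sketch of the
ordering I have in mind. It is not the patch's code: the struct layout,
the field names (assigned, in_use), the could_cause_fault() rule and
dump_rmp_for_fault() itself are all made up for illustration, and the
real RMP entry format is different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES_PER_2MB 512 /* 4KB pages covered by one 2MB mapping */

/* Hypothetical, simplified RMP entry; NOT the real hardware layout. */
struct rmp_entry_model {
	uint64_t lo, hi;   /* raw entry contents */
	bool assigned;     /* guest private page */
	bool in_use;       /* entry is being modified (assumed field name) */
};

enum dump_kind { DUMP_SINGLE, DUMP_SUSPECTS, DUMP_ALL_NONZERO };

/* Assumed rule: only an assigned or in-use entry can explain the fault. */
static bool could_cause_fault(const struct rmp_entry_model *e)
{
	return e->assigned || e->in_use;
}

static void print_entry(size_t idx, const struct rmp_entry_model *e)
{
	printf("RMP[%zu]: assigned=%d in_use=%d raw=%016llx %016llx\n",
	       idx, e->assigned, e->in_use,
	       (unsigned long long)e->hi, (unsigned long long)e->lo);
}

/*
 * region:       the 512 entries backing the 2MB-aligned range of the fault
 * fault_idx:    index of the faulting 4KB page within that range
 * huge_mapping: true if the page table walk ended at the 2MB level
 * printed:      out parameter, number of entries printed
 */
enum dump_kind dump_rmp_for_fault(const struct rmp_entry_model *region,
				  size_t fault_idx, bool huge_mapping,
				  size_t *printed)
{
	size_t first = huge_mapping ? 0 : fault_idx;
	size_t end = huge_mapping ? ENTRIES_PER_2MB : fault_idx + 1;
	size_t i;

	*printed = 0;

	/* 1) The faulting entry itself explains the fault: print only it. */
	if (could_cause_fault(&region[fault_idx])) {
		print_entry(fault_idx, &region[fault_idx]);
		*printed = 1;
		return DUMP_SINGLE;
	}

	/* 2) Otherwise print only entries that could have contributed. */
	for (i = first; i < end; i++) {
		if (could_cause_fault(&region[i])) {
			print_entry(i, &region[i]);
			(*printed)++;
		}
	}
	if (*printed)
		return DUMP_SUSPECTS;

	/* 3) No known reason found: fall back to the raw non-zero dump. */
	for (i = first; i < end; i++) {
		if (region[i].lo || region[i].hi) {
			print_entry(i, &region[i]);
			(*printed)++;
		}
	}
	return DUMP_ALL_NONZERO;
}
```

The point of the ordering is that the noisy dump in step 3 only happens
when steps 1 and 2 come up empty, and the in-use state is printed
explicitly whenever an entry is shown.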