On 5/30/22 15:41, David Hildenbrand wrote:
On 27.05.22 08:32, zhenwei pi wrote:
On 5/27/22 02:37, Peter Xu wrote:
On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
The hypervisor _must_ emulate poisons identified in guest physical
address space (they could be transported from the source VM) to
prevent silent data corruption in the guest. With a paravirtual
approach like this patch series, the hypervisor can clear some of the
poisoned HVAs knowing for certain that the guest OS has isolated the
poisoned page. I wonder how much value it provides to the guest if the
guest and workload are _not_ in pressing need of the extra KB/MB
worth of memory.
I'm curious about the same thing: how would unpoisoning help here? The
reasoning behind it would be great material to mention in the next cover letter.
Shouldn't we instead consider migrating serious workloads off the host
as soon as there's a sign of more severe hardware issues?
Thanks,
I'm maintaining 1,000,000+ virtual machines; from my experience:
UEs are quite unusual and occur randomly, and I have not hit a UE storm
in the past years. Memory also shows no obvious performance drop after
hitting a UE.
I have hit several CE storms, where the memory performance drops a lot.
But I can't find an obvious relationship between UEs and CEs.
So from my point of view, fixing the corrupted page for the VM seems
good enough. And yes, unpoisoning several pages does not help
significantly, but it is still a chance to make virtualization better.
I'm curious why we should care about resurrecting a handful of poisoned
pages in a VM. The cover letter doesn't touch on that.
IOW, I'm missing the motivation for why we should add additional
code+complexity to unpoison pages at all.
If we're talking about individual 4k pages, it's certainly sub-optimal,
but does it matter in practice? I could understand if we're losing
megabytes of memory. But then, I assume the workload might be seriously
harmed either way already?
Yes, resurrecting a handful of poisoned pages does not help
significantly. And, in some ways, it seems nice to have. :D
A VM uses RAM backed by 2M huge pages. Once an MCE (@HVAy in [HVAx, HVAz))
occurs, the 2M range [HVAx, HVAz) becomes inaccessible to the hypervisor,
but the guest poisons only the 4K page (@GPAy in [GPAx, GPAz)); it may go
on to hit another 511 MCEs ([GPAx, GPAz) except GPAy). This is the worst
case, so I want to add '__le32 corrupted_pages' to struct
virtio_balloon_config. It is used in the next step: reporting the
512 * 4K 'corrupted_pages' to the guest, so the guest has a chance to
isolate the other 511 pages ahead of time. Since the guest actually loses
2M, fixing 512 * 4K seems to help significantly.
I assume that when talking about "the memory performance drops a lot",
you imply that this patch set can mitigate that performance drop?
But why do you see a performance drop? Because we might lose some
possible THP candidates (in the host or the guest) and you want to plug
those holes? I assume you'll see a performance drop simply because
poisoning memory is expensive, including migrating pages around on a CE.
If you have some numbers to share, especially before/after this change,
that would be great.
A CE storm leads to two problems I have seen:
1. The memory bandwidth slows down to 10%~20% of normal, and the CPU's
cycles per instruction increase a lot.
2. The THR interrupts (/proc/interrupts) fire frequently, and the CPU has
to spend a lot of time handling IRQs.
But no corrupted page occurs. Migrating the VM to another healthy host
seems a good choice; this patch does not handle the CE storm case.
--
zhenwei pi