On 27.05.22 08:32, zhenwei pi wrote:
> On 5/27/22 02:37, Peter Xu wrote:
>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>> The hypervisor _must_ emulate poisons identified in guest physical
>>> address space (could be transported from the source VM), this is to
>>> prevent silent data corruption in the guest. With a paravirtual
>>> approach like this patch series, the hypervisor can clear some of the
>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>> poisoned page. I wonder how much value it provides to the guest if the
>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>> worth of memory.
>>
>> I'm curious the same on how unpoisoning could help here. The reasoning
>> behind would be great material to be mentioned in the next cover letter.
>>
>> Shouldn't we consider migrating serious workloads off the host already
>> where there's a sign of more severe hardware issues, instead?
>>
>> Thanks,
>>
>
> I'm maintaining 1000,000+ virtual machines, from my experience:
> UE is quite unusual and occurs randomly, and I did not hit UE storm case
> in the past years. The memory also has no obvious performance drop after
> hitting UE.
>
> I hit several CE storm case, the performance memory drops a lot. But I
> can't find obvious relationship between UE and CE.
>
> So from the point of my view, to fix the corrupted page for VM seems
> good enough. And yes, unpoisoning several pages does not help
> significantly, but it is still a chance to make the virtualization better.
>

I'm curious why we should care about resurrecting a handful of poisoned
pages in a VM. The cover letter doesn't touch on that.

IOW, I'm missing the motivation why we should add additional
code+complexity to unpoison pages at all.

If we're talking about individual 4k pages, it's certainly sub-optimal,
but does it matter in practice? I could understand if we're losing
megabytes of memory.
But then, I assume the workload might be seriously harmed either way
already?

I assume when talking about "the performance memory drops a lot", you
imply that this patch set can mitigate that performance drop?

But why do you see a performance drop? Because we might lose some
possible THP candidates (in the host or the guest) and you want to plug
those holes? I assume you'll see a performance drop simply because
poisoning memory is expensive, including migrating pages around on CE.

If you have some numbers to share, especially before/after this change,
that would be great.

--
Thanks,

David / dhildenb