Hi Huang,
Huang Ying wrote:
UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
where some hardware error such as some memory error can be reported
without PCC (processor context corrupted). To recover from such MCE,
the corresponding memory will be unmapped, and all processes accessing
the memory will be killed via SIGBUS.
For KVM, if QEMU/KVM is killed, all guest processes will be killed
too. So we relay SIGBUS from host OS to guest system via a UCR MCE
injection. Then guest OS can isolate corresponding memory and kill
necessary guest processes only. SIGBUS sent to main thread (not VCPU
threads) will be broadcast to all VCPU threads as UCR MCE.
Signed-off-by: Huang Ying <ying.huang@xxxxxxxxx>
---
qemu-kvm.c | 173 ++++++++++++++++++++++++++++++++++++++++++++++++++----
target-i386/cpu.h | 20 +++++-
2 files changed, 181 insertions(+), 12 deletions(-)
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -27,10 +27,23 @@
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <signal.h>
+#include <sys/signalfd.h>
+#include <sys/prctl.h>
#define false 0
#define true 1
+#ifndef PR_MCE_KILL
+#define PR_MCE_KILL 33
+#endif
+
+#ifndef BUS_MCEERR_AR
+#define BUS_MCEERR_AR 4
+#endif
+#ifndef BUS_MCEERR_AO
+#define BUS_MCEERR_AO 5
+#endif
+
#define EXPECTED_KVM_API_VERSION 12
#if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
@@ -702,6 +715,24 @@ int kvm_get_dirty_pages_range(kvm_contex
return 0;
}
+static int kvm_addr_userspace_to_phys(unsigned long userspace_addr,
+ unsigned long *phys_addr)
+{
+ int i;
+ struct slot_info *slot;
+
+ for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS; ++i) {
+ slot = &slots[i];
+ if (slot->len && slot->userspace_addr <= userspace_addr &&
+ (slot->userspace_addr + slot->len) > userspace_addr) {
+ *phys_addr = userspace_addr - slot->userspace_addr +
+ slot->phys_addr;
+ return 0;
+ }
+ }
+ return -1;
+}
+
The slot mapping is actually a copy of QEMU's ram_blocks structure
(see exec.c). If you base the lookup on that instead of the KVM slots,
it will Just Work for plain QEMU too.
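As a rough illustration of that suggestion, here is a minimal sketch of the same range lookup done over a list of RAM blocks rather than over the KVM slots. The struct and function names below (ram_block_sketch, host_addr_to_ram_offset) are hypothetical stand-ins made up for this example; they are not the real definitions in exec.c, whose layout may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one RAM block entry (NOT the real exec.c type). */
struct ram_block_sketch {
    uint8_t *host;                  /* start of the block in QEMU's address space */
    uint64_t offset;                /* offset of the block within guest RAM */
    uint64_t length;                /* size of the block in bytes */
    struct ram_block_sketch *next;  /* next block, or NULL */
};

/*
 * Translate a host virtual address into an offset within guest RAM.
 * Returns 0 and stores the offset on success, -1 if addr is in no block.
 */
static int host_addr_to_ram_offset(const struct ram_block_sketch *blocks,
                                   const void *addr, uint64_t *ram_offset)
{
    uintptr_t a = (uintptr_t)addr;
    const struct ram_block_sketch *b;

    for (b = blocks; b != NULL; b = b->next) {
        uintptr_t start = (uintptr_t)b->host;

        /* Same half-open range check as the slot-based version above. */
        if (b->length && a >= start && (uint64_t)(a - start) < b->length) {
            *ram_offset = b->offset + (uint64_t)(a - start);
            return 0;
        }
    }
    return -1;
}
```

The point of the sketch is only that the lookup does not need anything KVM-specific: any table that records where each chunk of guest RAM lives in the process address space is enough.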
#ifdef KVM_CAP_IRQCHIP
int kvm_set_irq_level(kvm_context_t kvm, int irq, int level, int *status)
@@ -1515,6 +1546,38 @@ static void sig_ipi_handler(int n)
{
}
+static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx)
+{
+ if (siginfo->ssi_code == BUS_MCEERR_AO) {
+ uint64_t status;
+ unsigned long paddr;
+ CPUState *cenv;
+
+ /* Hope we are lucky for AO MCE */
Even if the error was limited to guest memory, it could have been
triggered by either the kernel or userspace reading guest memory, no?
Does this potentially open a security hole for us? Consider the following:
1) We happen to read guest memory and that causes an MCE. For instance,
say we're in virtio.c and we read the virtio ring.
2) That should trigger the kernel to generate a SIGBUS.
3) We catch the SIGBUS and queue an MCE for delivery.
4) After the SIGBUS handler completes, we're back in virtio.c. What was
the value of the memory operation we just completed?
If the instruction gets skipped, we may be leaking host memory, because
the access never happened and the destination still holds whatever was
there before.
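To make the concern concrete, here is a minimal sketch of a policy that separates the cases in the scenario above. It is not what the patch does. It relies on the two si_code values the patch already defines: the kernel uses BUS_MCEERR_AR ("action required") when the error was consumed by the faulting access, and BUS_MCEERR_AO ("action optional") when the error was detected but not consumed. The enum, the function name classify_sigbus, and the two boolean inputs are hypothetical, and the policy itself is an assumption for discussion.

```c
#include <signal.h>
#include <stdbool.h>

/* Fallbacks for older headers, same values as in the patch. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical outcome of the decision; names made up for this sketch. */
enum sigbus_action_sketch {
    SIGBUS_FORWARD_TO_GUEST,  /* inject a UCR MCE into the guest */
    SIGBUS_FATAL              /* do not resume the interrupted code */
};

/*
 * Assumed policy: only forward the error when resuming cannot hand
 * unread data to QEMU's own device emulation code.
 */
static enum sigbus_action_sketch
classify_sigbus(int si_code, bool addr_is_guest_ram, bool raised_in_vcpu_thread)
{
    if (!addr_is_guest_ram)
        return SIGBUS_FATAL;             /* the error hit QEMU's own memory */
    if (si_code == BUS_MCEERR_AO)
        return SIGBUS_FORWARD_TO_GUEST;  /* not consumed, no access was cut short */
    if (si_code == BUS_MCEERR_AR && raised_in_vcpu_thread)
        return SIGBUS_FORWARD_TO_GUEST;  /* the guest itself consumed the error */
    return SIGBUS_FATAL;                 /* e.g. AR while QEMU read guest memory */
}
```

Under this assumed policy, step 4 of the scenario never resumes in virtio.c with an undefined value: an action-required error raised while QEMU itself was reading guest memory is treated as fatal rather than queued for the guest.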
--
Regards,
Anthony Liguori