Re: [PATCH] KVM: ARM: update the VMID generation logic


 



On Fri, 30 Mar 2018 21:42:04 +0800
Shannon Zhao <zhaoshenglong@xxxxxxxxxx> wrote:

> On 2018/3/30 18:48, Marc Zyngier wrote:
> > On Fri, 30 Mar 2018 17:52:07 +0800
> > Shannon Zhao <zhaoshenglong@xxxxxxxxxx> wrote:
> >   
> >>
> >>
> >> On 2018/3/30 17:01, Marc Zyngier wrote:  
> >>> On Fri, 30 Mar 2018 09:56:10 +0800
> >>> Shannon Zhao <zhaoshenglong@xxxxxxxxxx> wrote:
> >>>  
> >>>> On 2018/3/30 0:48, Marc Zyngier wrote:  
> >>>>> On Thu, 29 Mar 2018 16:27:58 +0100,
> >>>>> Mark Rutland wrote:    
> >>>>>>
> >>>>>> On Thu, Mar 29, 2018 at 11:00:24PM +0800, Shannon Zhao wrote:    
> >>>>>>> From: zhaoshenglong <zhaoshenglong@xxxxxxxxxx>
> >>>>>>>
> >>>>>>> Currently the VMID for a VM is allocated in the VCPU entry/exit
> >>>>>>> path and is reassigned when kvm_next_vmid wraps around. This forces
> >>>>>>> the existing VMs to exit from the guest and flush the TLB and icache.
> >>>>>>>
> >>>>>>> Also, while a platform with an 8-bit VMID only supports 255 VMs, it
> >>>>>>> is still possible to create more than 255. If we create e.g. 256 VMs,
> >>>>>>> some VMs hit page faults because at some point two VMs have the same
> >>>>>>> VMID.
> >>>>>>
> >>>>>> Have you seen this happen?
> >>>>>>    
> >>>> Yes, we've started 256 VMs on D05. We saw kernel page fault in some guests.  
> >>>
> >>> What kind of fault? Kernel configuration? Can you please share some
> >>> traces with us? What is the workload? What happens if all the guests are
> >>> running on the same NUMA node?
> >>>
> >>> We need all the information we can get.
> >>>  
> >> All 256 VMs run without any special workload. The testcase simply
> >> starts 256 VMs and then shuts them down. We found that several VMs
> >> will not shut down because the guest kernel crashes. If we only start
> >> 255 VMs, everything works well.
> >>
> >> We didn't run the testcase that pins all VMs to the same NUMA node. I'll
> >> try.
> >>
> >> The fault is
> >> [ 2204.633871] Unable to handle kernel NULL pointer dereference at
> >> virtual address 00000008
> >> [ 2204.633875] Unable to handle kernel paging request at virtual address
> >> a57f4a9095032
> >>
> >> Please see the attachment for the detailed log.  
> > 
> > Thanks. It looks pretty ugly indeed.
> > 
> > Can you please share your host kernel config (and version number -- I
> > really hope the host is something more recent than the 4.1.44 stuff you
> > run as a guest...)?
> >   
> We do run a 4.1.44 host kernel, but with a more recent KVM module (at
> least 4.14), since we backport upstream KVM ARM patches to our kernel tree.

Can you please reproduce it with a mainline kernel? I'm not going to
even try to reproduce this issue on a kernel that has been that heavily
hacked.

> See the attachment for the kernel config.
> 
> > For the record, I'm currently running 5 concurrent Debian installs,
> > each with 2 vcpus, on a 4 CPU system artificially configured to have
> > only 2 bits of VMID (and thus at most 3 running VMs at any given time),
> > a setup that is quite similar to what you're doing, only on a smaller
> > scale.
> > 
> > It is pretty slow (as you'd expect), but so far I haven't seen any
> > issue.
> >   
> Could you try shutting down all the VMs at the same time? The issue we
> encountered happened during the shutdown step.

Halted the VMs just fine, no issue.

	M.
-- 
Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


