On 03/03/16 14:26, Shanker Donthineni wrote:
>
>
> On 03/03/2016 08:03 AM, Marc Zyngier wrote:
>> On 03/03/16 13:25, Shanker Donthineni wrote:
>>>
>>> On 03/02/2016 11:35 AM, Marc Zyngier wrote:
>>>> On 02/03/16 15:48, Shanker Donthineni wrote:
>>>>
>>>>> We haven't started running heavy workloads in VMs. So far we
>>>>> have noticed this random behavior only during guest
>>>>> kernel boot (at EL1).
>>>>>
>>>>> We didn't see this problem on the 4.3 kernel. Do you think it is
>>>>> related to TLB conflicts?
>>>> I cannot imagine why a DSB would solve a TLB conflict. But the fact that
>>>> you didn't see it crashing on 4.3 is a good indication that something
>>>> else is at play.
>>>>
>>>> In 4.5, we've rewritten a large part of KVM in C, which has changed the
>>>> ordering of the various accesses a lot. It could be that a latent
>>>> problem is now exposed more widely.
>>>>
>>>> Can you try moving this DSB around and find out what is the earliest
>>>> point where it solves this problem? Some sort of bisection?
>>> The furthest I can move 'dsb ishst' up is to the beginning of
>>> __guest_enter(), but not outside of this function.
>>>
>>> I don't understand why the code below fails; the branch
>>> instruction seems to be causing problems.
>>>
>>> /* Jump in the fire! */
>>> + dsb(ishst);
>>> exit_code = __guest_enter(vcpu, host_ctxt);
>>> /* And we're baaack! */
>> That's very worrying. I can't see how the branch can have an influence
>> on the DSB (nor why the DSB has an influence on the rest of the
>> execution, btw).
>>
>> What if you replace the DSB with an ISB? Do you observe a similar
>> behaviour (works if the barrier is in __guest_enter, but not if it is
>> outside)?
> I have already tried with an ISB, without success. I did another
> experiment, flushing the stage-2 TLBs before calling __guest_enter(),
> and it fixed the problem.

I suspected something like that. But it is such a massive hammer that it
will hide any sort of subtle bug (HW *and* SW).
>
>> Another thing worth looking at is what happened just before we decided
>> to get back into the guest. Or to put it differently, what was the
>> reason to exit in the first place. Was it a Stage-2 fault by any chance?
>
> I will collect as much debug data as possible and update you with the
> results. I went through your refactored KVM C code and did not
> find anything suspicious. I am thinking that maybe Qualcomm CPUs
> have very aggressive prefetch logic that is causing the problem.

OK. Please keep me posted about your findings. Also, maybe involving some
HW people would be a good idea (running something in an emulator, for
example...).

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...