On Mon, Jul 17, 2017 at 05:16:17PM +0200, Andrea Arcangeli wrote:
> On Mon, Jul 17, 2017 at 04:45:10PM +0200, Christoffer Dall wrote:
> > I would also very much like to get to the bottom of this, and at the
> > very least try to get a valid explanation as to how a thread can be
> > *running* for a process where there are zero references to the struct
> > mm?
>
> A thread shouldn't be possibly be running if mm->mm_users is zero.
>

ok, good, then I don't have to re-take OS 101.

> > I guess I am asking where this mmput() can happen for a perfectly
> > running thread, which hasn't processes signals or exited itself yet.
>
> mmput runs during exit(), after that point the vcpu can't run the KVM
> ioctl anymore.
>

also very comforting that we agree on this.

> > The dump you reference above seems to indicate that it's happening
> > under memory pressure and trying to unmap memory from the VM to
> > allocate memory to the VM, but all seems to be happening within a VCPU
> > thread, or am I reading this wrong?
>
> In the oops the pgd was none while KVM vcpu ioctl was running, the
> most likely explanation is there were two VM running in parallel in
> the host, and the other one was quitting (mm_count of the other VM was
> zero, while mm_count of the VM that oopsed within the vcpu
> ioctl was > 0). The oops information itself can't tell if there was
> one or two VM running in the host so > 1 VM running is the most
> plausible explanation that doesn't break the above in invariants.

That's very keenly observed, and a really good explanation.

> It'd be nice
> if Alexander can confirm it, if he remembers about that specific setup
> after a couple of months since it happened.

My guess is that this was observed on the SUSE build machines with
arm64, and Alex usually explains that these machines run *lots* of VMs
at the same time, so this sounds very likely.

Alex, can you confirm this was the type of workload?

>
> Even if there was just one VM running in the host, it would more
> likely mean something inside KVM ARM code is clearing the pgd before
> mm_users reaches zero, i.e. before the last mmput.

I don't think we have this.

>
> It's very unlikely mm_users could have been > 0 while the vcpu thread
> was running as many more things would fall apart in such case, not
> just the needed pgd check during mmu notifier post process exit.
>

That was my rationale exactly.

Thanks for confirming!
-Christoffer