Re: WARNING in __mmdrop

Jason Wang <jasowang@xxxxxxxxxx> · Tue, 23 Jul 2019 13:47:04 +0800

On 2019/7/23 1:01 PM, Michael S. Tsirkin wrote:
On Tue, Jul 23, 2019 at 12:01:40PM +0800, Jason Wang wrote:
On 2019/7/22 4:08 PM, Michael S. Tsirkin wrote:
On Mon, Jul 22, 2019 at 01:24:24PM +0800, Jason Wang wrote:
On 2019/7/21 8:18 PM, Michael S. Tsirkin wrote:
On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
syzbot has bisected this bug to:

commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
Author: Jason Wang <jasowang@xxxxxxxxxx>
Date:   Fri May 24 08:12:18 2019 +0000

       vhost: access vq metadata through kernel virtual address

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
start commit:   6d21a41b Add linux-next specific files for 20190718
git tree:       linux-next
final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000

Reported-by: syzbot+e58112d71f77113ddb7b@xxxxxxxxxxxxxxxxxxxxxxxxx
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
address")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection
OK I poked at this for a bit, I see several things that
we need to fix, though I'm not yet sure it's the reason for
the failures:

1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
      That's just a bad hack, in particular I don't think device
      mutex is taken and so poking at two VQs will corrupt
      memory.
      So what to do? How about a per vq notifier?
      Of course we also have synchronize_rcu
      in the notifier which is slow and is now going to be called twice.
      I think call_rcu would be more appropriate here.
      We then need rcu_barrier on module unload.
      OTOH if we make pages linear with map then we are good
      with kfree_rcu which is even nicer.

2. Doesn't map leak after vhost_map_unprefetch?
      And why does it poke at contents of the map?
      No one should use it right?

3. notifier unregister happens last in vhost_dev_cleanup,
      but register happens first. This looks wrong to me.

4. OK so we use the invalidate count to try and detect that
      some invalidate is in progress.
      I am not 100% sure why we care.
      Assuming we do, uaddr can change between start and end
      and then the counter can get negative, or generally
      out of sync.
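
A toy userspace model may make item 4 concrete. This is only a sketch, not
the vhost code: every toy_* name is made up for illustration, and the
assumption is that the counter is bumped in the start callback and dropped
in the end callback only when the invalidated range overlaps the vq's uaddr.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the user address range backing one vq. */
struct toy_uaddr {
	unsigned long uaddr;
	unsigned long size;
};

struct toy_vq {
	struct toy_uaddr uaddr;
	int invalidate_count;
};

/* True if [start, end) intersects [uaddr, uaddr + size). */
static bool toy_range_overlap(const struct toy_uaddr *u,
			      unsigned long start, unsigned long end)
{
	assert(start < end);
	return u->size != 0 && start < u->uaddr + u->size && u->uaddr < end;
}

static void toy_invalidate_start(struct toy_vq *vq, bool check_overlap,
				 unsigned long start, unsigned long end)
{
	if (check_overlap && !toy_range_overlap(&vq->uaddr, start, end))
		return;
	++vq->invalidate_count;
}

static void toy_invalidate_end(struct toy_vq *vq, bool check_overlap,
			       unsigned long start, unsigned long end)
{
	if (check_overlap && !toy_range_overlap(&vq->uaddr, start, end))
		return;
	--vq->invalidate_count;
}

/*
 * Replay one interleaving: an invalidation starts, userspace moves the
 * vq's uaddr, then the same invalidation ends. Returns the final counter.
 */
static int toy_replay(bool check_overlap)
{
	struct toy_vq vq = {
		.uaddr = { .uaddr = 0x1000, .size = 0x1000 },
		.invalidate_count = 0,
	};

	/* Start of invalidating [0x8000, 0x9000): no overlap with the vq yet. */
	toy_invalidate_start(&vq, check_overlap, 0x8000, 0x9000);

	/* uaddr changes between start and end. */
	vq.uaddr.uaddr = 0x8000;

	/* End of the same invalidation: the range now overlaps. */
	toy_invalidate_end(&vq, check_overlap, 0x8000, 0x9000);

	return vq.invalidate_count;
}
```

Under those assumptions toy_replay(true) ends at -1, because the end
callback decrements without a matching increment. toy_replay(false), which
counts every invalidation regardless of uaddr, ends at 0.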

So what to do about all this?
I am inclined to say let's just drop the uaddr optimization
for now. E.g. kvm invalidates unconditionally.
3 should be fixed independently.
Above implements this but is only build-tested.
Jason, pls take a look. If you like the approach feel
free to take it from here.

One thing the below does not have is any kind of rate-limiting.
Given it's so easy to restart I'm thinking it makes sense
to add a generic infrastructure for this.
Can be a separate patch I guess.
I don't get why we must use kfree_rcu() instead of synchronize_rcu() here.
synchronize_rcu has very high latency on busy systems.
It is not something that should be used on a syscall path.
KVM had to switch to SRCU to keep it sane.
Otherwise one guest can trivially slow down another one.

I think you mean synchronize_rcu_expedited()? Having thought about the code
again, the synchronize_rcu() in ioctl() could be removed, since it is
serialized with the memory accessors.

Really let's just use kfree_rcu. It's way cleaner: fire and forget.
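
The "fire and forget" point can be sketched with a toy userspace model of
deferred freeing. It is not the kernel RCU implementation: the toy_* names
are hypothetical, and toy_grace_period_elapsed() simply stands in for "all
earlier readers are done". The caller queues the object and returns at once
instead of blocking the way a synchronize_rcu()-style wait would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct rcu_head: one node in a pending-callback list. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

/* Hypothetical object that embeds the head as its first member. */
struct toy_map {
	struct toy_rcu_head rcu;
	int npages;
};

static struct toy_rcu_head *toy_pending;
static int toy_pending_count;
static int toy_freed_count;

/* Fire and forget: queue the callback and return immediately. */
static void toy_call_rcu(struct toy_rcu_head *head,
			 void (*func)(struct toy_rcu_head *head))
{
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
	toy_pending_count++;
}

/* Run every queued callback; models the point where readers are done. */
static void toy_grace_period_elapsed(void)
{
	struct toy_rcu_head *head = toy_pending;

	toy_pending = NULL;
	toy_pending_count = 0;
	while (head) {
		/* Read next first: the callback frees the node. */
		struct toy_rcu_head *next = head->next;

		head->func(head);
		head = next;
	}
}

static void toy_map_free(struct toy_rcu_head *head)
{
	/* rcu is the first member, so this cast recovers the toy_map. */
	free((struct toy_map *)head);
	toy_freed_count++;
}

/* Retire n maps without waiting; returns how many are still pending. */
static int toy_retire_maps(int n)
{
	for (int i = 0; i < n; i++) {
		struct toy_map *map = calloc(1, sizeof(*map));

		assert(map != NULL);
		toy_call_rcu(&map->rcu, toy_map_free);
	}
	return toy_pending_count;
}
```

Nothing here bounds the pending list: every call queues one more object
until the grace period elapses, which is where the rate-limiting question
comes in.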

It doesn't look that way: don't you need to rate-limit the fire, as you've
figured out? And in fact, the synchronization is not even needed. Would it
help if I left a comment to explain?

Btw, the kvm ioctl path still uses synchronize_rcu() in kvm_vcpu_ioctl()
(it is just a little harder to trigger):

AFAIK these never run in response to guest events.
So they can take very long and guests still won't crash.

What if the guest manages to escape to qemu?

Thanks

     case KVM_RUN: {
...
         if (unlikely(oldpid != task_pid(current))) {
             /* The thread running this VCPU changed. */
             struct pid *newpid;

             r = kvm_arch_vcpu_run_pid_change(vcpu);
             if (r)
                 break;

             newpid = get_task_pid(current, PIDTYPE_PID);
             rcu_assign_pointer(vcpu->pid, newpid);
             if (oldpid)
                 synchronize_rcu();
             put_pid(oldpid);
         }
...
         break;

Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
Let me try to figure out the root cause, then decide whether or not to go
this way.

Thanks
The root cause of the crash is relevant, but we still need
to fix issues 1-4.

More issues (my patch tries to fix them too):

5. page not dirtied when mappings are torn down outside
     of invalidate callback

Yes.

6. potential cross-VM DOS by one guest keeping system busy
     and increasing synchronize_rcu latency to the point where
     another guest starts timing out and crashes

This will be addressed after I remove the synchronize_rcu() from ioctl path.

Thanks