On 2017-12-07 12:13 AM, Jason Wang wrote:
On 2017-12-07 12:42, David Hill wrote:
On 2017-12-06 11:34 PM, David Hill wrote:
On 2017-12-04 02:51 PM, David Hill wrote:
On 2017-12-03 11:08 PM, Jason Wang wrote:
On 2017-12-02 00:38, David Hill wrote:
Finally, I reverted 581fe0ea61584d88072527ae9fb9dcb9d1f2783e too.
It's compiling now and I'll keep you posted.
So I'm still able to reproduce this issue even after reverting
these 3 commits. Do you have other suspect commits?
Thanks for the testing. No, I don't have other suspect commits.
Looks like somebody else is hitting your issue too (see
https://www.spinics.net/lists/netdev/msg468319.html)
But he claims the issue was fixed by using qemu 2.10.1.
So you may:
-try to see if qemu 2.10.1 solves your issue
It didn't solve it for him... it's only harder to reproduce. [1]
-if not, try to see if commit
2ddf71e23cc246e95af72a6deed67b4a50a7b81c ("net: add notifier hooks
for devmap bpf map") is the first bad commit
I'll try to see what I can do here
I'm looking at that commit and, if I'm not mistaken, it was introduced
before v4.13, while this issue appeared between v4.13 and
v4.14-rc1. There are 1352 commits between those two releases.
Is there a way to quickly find out which commits touch vhost-net or
zerocopy?
[ 7496.553044] __schedule+0x2dc/0xbb0
[ 7496.553055] ? trace_hardirqs_on+0xd/0x10
[ 7496.553074] schedule+0x3d/0x90
[ 7496.553087] vhost_net_ubuf_put_and_wait+0x73/0xa0 [vhost_net]
[ 7496.553100] ? finish_wait+0x90/0x90
[ 7496.553115] vhost_net_ioctl+0x542/0x910 [vhost_net]
[ 7496.553144] do_vfs_ioctl+0xa6/0x6c0
[ 7496.553166] SyS_ioctl+0x79/0x90
[ 7496.553182] entry_SYSCALL_64_fastpath+0x1f/0xbe
That vhost_net_ubuf_put_and_wait call was changed in this commit,
with the following comment:
commit 0ad8b480d6ee916aa84324f69acf690142aecd0e
Author: Michael S. Tsirkin <mst@xxxxxxxxxx>
Date: Thu Feb 13 11:42:05 2014 +0200
vhost: fix ref cnt checking deadlock
vhost checked the counter within the refcnt before decrementing. It
really wanted to know that it is the one that has the last reference, as
a way to batch freeing resources a bit more efficiently.
Note: we only let refcount go to 0 on device release.
This works well but we now access the ref counter twice so there's a
race: all users might see a high count and decide to defer freeing
resources.
In the end no one initiates freeing resources until the last reference
is gone (which is on VM shutdown so might happen after a looooong time).
Let's do what we probably should have done straight away:
switch from kref to plain atomic, documenting the
semantics, return the refcount value atomically after decrement,
then use that to avoid the deadlock.
Reported-by: Qin Chuanyu <qinchuanyu@xxxxxxxxxx>
Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
Acked-by: Jason Wang <jasowang@xxxxxxxxxx>
Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
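For anyone following the commit message above, here is a minimal sketch in C11 of the difference between checking the counter before the decrement and using the value that the decrement itself returns. This is not the vhost code: `struct ubuf_ref`, `ubuf_ref_put()`, `racy_would_be_last()` and `put_and_check_last()` are hypothetical names, and the convention that the owner holds one reference while each in-flight buffer holds one more is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted zerocopy buffer tracker.
 * Assumed convention: 1 reference for the owner, plus 1 per in-flight buffer. */
struct ubuf_ref {
    atomic_int refcount;
};

static void ubuf_ref_init(struct ubuf_ref *u)
{
    atomic_init(&u->refcount, 1);          /* owner's reference */
}

static void ubuf_ref_get(struct ubuf_ref *u)
{
    atomic_fetch_add(&u->refcount, 1);
}

/* Drop one reference and return the count that remains.
 * atomic_fetch_sub() returns the previous value, hence the "- 1". */
static int ubuf_ref_put(struct ubuf_ref *u)
{
    int prev = atomic_fetch_sub(&u->refcount, 1);
    assert(prev > 0);                      /* catch an unbalanced put */
    return prev - 1;
}

/* Racy pattern: the counter is read here, and the decrement happens later
 * in a separate call. Another context can read the same value in between. */
static bool racy_would_be_last(struct ubuf_ref *u)
{
    return atomic_load(&u->refcount) <= 2; /* me + owner only? */
}

/* Pattern described in the commit message: decide from the value returned
 * by the decrement, so exactly one caller observes "only the owner is left". */
static bool put_and_check_last(struct ubuf_ref *u)
{
    return ubuf_ref_put(u) <= 1;
}
```

With a count of 3 (owner plus two buffers), two completions that both call `racy_would_be_last()` before either calls `ubuf_ref_put()` both see 3 and both defer, so nobody starts freeing resources. With `put_and_check_last()`, the first caller gets 2 back and the second gets 1, so the second one knows it has to do the cleanup.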
So at this point, are we hitting a deadlock when using
experimental_zcopytx?
Yes. But it is also possible that it is not caused by vhost_net
itself, but by some other place that is holding on to a packet.
Thanks
While bisecting, when I reached commit
46d4b68f891bee5d83a32508bfbd9778be6b1b63, the kernel panicked when I
ran virt-customize:
Message from syslogd@zappa at Dec 8 12:52:06 ...
kernel:[ 350.016376] Kernel panic - not syncing: Fatal exception in interrupt
I marked that commit as bad again. Will continue bisecting!