On 8/20/19, 8:48 PM, "Nadav Amit" <namit@xxxxxxxxxx> wrote:

> Francois reported that VMware balloon gets stuck after a balloon reset,
> when the VMCI doorbell is removed. A similar error can occur when the
> balloon driver is removed with the following splat:
>
> [ 1088.622000] INFO: task modprobe:3565 blocked for more than 120 seconds.
> [ 1088.622035] Tainted: G W 5.2.0 #4
> [ 1088.622087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1088.622205] modprobe D 0 3565 1450 0x00000000
> [ 1088.622210] Call Trace:
> [ 1088.622246] __schedule+0x2a8/0x690
> [ 1088.622248] schedule+0x2d/0x90
> [ 1088.622250] schedule_timeout+0x1d3/0x2f0
> [ 1088.622252] wait_for_completion+0xba/0x140
> [ 1088.622320] ? wake_up_q+0x80/0x80
> [ 1088.622370] vmci_resource_remove+0xb9/0xc0 [vmw_vmci]
> [ 1088.622373] vmci_doorbell_destroy+0x9e/0xd0 [vmw_vmci]
> [ 1088.622379] vmballoon_vmci_cleanup+0x6e/0xf0 [vmw_balloon]
> [ 1088.622381] vmballoon_exit+0x18/0xcc8 [vmw_balloon]
> [ 1088.622394] __x64_sys_delete_module+0x146/0x280
> [ 1088.622408] do_syscall_64+0x5a/0x130
> [ 1088.622410] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1088.622415] RIP: 0033:0x7f54f62791b7
> [ 1088.622421] Code: Bad RIP value.
> [ 1088.622421] RSP: 002b:00007fff2a949008 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [ 1088.622426] RAX: ffffffffffffffda RBX: 000055dff8b55d00 RCX: 00007f54f62791b7
> [ 1088.622426] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055dff8b55d68
> [ 1088.622427] RBP: 000055dff8b55d00 R08: 00007fff2a947fb1 R09: 0000000000000000
> [ 1088.622427] R10: 00007f54f62f5cc0 R11: 0000000000000206 R12: 000055dff8b55d68
> [ 1088.622428] R13: 0000000000000001 R14: 000055dff8b55d68 R15: 00007fff2a94a3f0
>
> The cause for the bug is that when the "delayed" doorbell is invoked, it
> takes a reference on the doorbell entry and schedules work that is
> supposed to run the appropriate code and drop the doorbell entry
> reference. The code ignores the fact that if the work is already queued,
> it will not be scheduled to run one more time. As a result one of the
> references would not be dropped. When the code waits for the reference
> to get to zero, during balloon reset or module removal, it gets stuck.
>
> Fix it. Drop the reference if schedule_work() indicates that the work is
> already queued.
>
> Note that this bug got more apparent (or apparent at all) due to
> commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").
>
> Fixes: 83e2ec765be03 ("VMCI: doorbell implementation.")
> Reported-by: Francois Rigault <rigault.francois@xxxxxxxxx>
> Cc: Jorgen Hansen <jhansen@xxxxxxxxxx>
> Cc: Adit Ranadive <aditr@xxxxxxxxxx>
> Cc: Alexios Zavras <alexios.zavras@xxxxxxxxx>
> Cc: Vishnu DASA <vdasa@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Nadav Amit <namit@xxxxxxxxxx>
> ---
> drivers/misc/vmw_vmci/vmci_doorbell.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)

Thanks for the fix, looks good to me.

Reviewed-by: Vishnu Dasa <vdasa@xxxxxxxxxx>

-- vishnu