Re: [PATCH rdma-next 6/6] IB/core: Fix addr_req list delete corruption due to re-enqueue

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Mar 25, 2018 at 01:40:24PM +0300, Leon Romanovsky wrote:
> From: Parav Pandit <parav@xxxxxxxxxxxx>
> 
> mod_delayed_work() may enqueue the work item even if it is getting
> executed while it is getting rescheduled via mod_delayed_work().
> 
> When process_one_req() is about to get executed, rdma_cancel_addr()
> modifies the delay to 0.
> 
> This results into enqueueing the freed addr_req and work item again.
> Below trace shows that one request is getting executed twice results
> into list del corruption kernel crash.
> 
> Therefore, let the request gets completed whoever sees it first;
> if process_one_req() sees that rdma_addr_cancel() already progressed
> to cancel the request, it ignore the work item processing.
> 
> Cancel request waits for any existing work item to finish execution or
> cancel any past execution. This ensures that cancel routine doesn't
> free the request that workqueue might schedule.
> 
> list_del corruption, 0000000009223578->prev is
> LIST_POISON2 (000000000f85575b)
> WARNING: CPU: 3 PID: 340 at lib/list_debug.c:50
> __list_del_entry_valid+0x9a/0xa0
> CPU: 3 PID: 340 Comm: kworker/u16:4 Not tainted 4.15.0-rc2-bfix+ #26
> Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> Workqueue: ib_addr process_one_req [ib_core]
> task: 00000000e0ed2fa9 task.stack: 00000000e1ddfff5
> RIP: 0010:__list_del_entry_valid+0x9a/0xa0
> RSP: 0018:ffffc9000044fde0 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ffff880f8b82c200 RCX: 0000000000000000
> RDX: 0000000000000001 RSI: ffff88107fccd078 RDI: ffff88107fccd078
> RBP: ffffc9000044fde0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000000 R12: ffff880f8b82c200
> R13: ffff880f8b82c210 R14: ffff880f8b82c340 R15: ffff880f8b82c338
> FS:  0000000000000000(0000) GS:ffff88107fcc0000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffec6001228 CR3: 0000000001c10004 CR4: 00000000003606e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> __list_del_entry+0xd/0x30 [ib_core]
> process_one_req+0x9f/0x140 [ib_core]
> process_one_work+0x1e3/0x5f0
> ? process_one_work+0x156/0x5f0
> worker_thread+0x4d/0x3e0
> kthread+0x14b/0x180
> ? process_one_work+0x5f0/0x5f0
> ? kthread_stop+0x2b0/0x2b0
> ret_from_fork+0x24/0x30
> 
> Fixes: 5fff41e1f89d ("IB/core: Fix race condition in resolving IP to MAC")
> Reviewed-by: Mark Bloch <markb@xxxxxxxxxxxx>
> Signed-off-by: Parav Pandit <parav@xxxxxxxxxxxx>
> Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> ---
>  drivers/infiniband/core/addr.c | 54 ++++++++++++++++++++++++++++++++----------
>  1 file changed, 41 insertions(+), 13 deletions(-)

As already discussed I will use the one line version instead of this.

When I first glanced at this patch I thought it was creating a
sensible fence in cancel, but it turns out it doesn't even do that...

So the two accomplish exactly the same thing and the other patch is
much more understandable.

I have a small series here that tidies this whoel flow up..

https://github.com/jgunthorpe/linux/commits/cma-fix

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux