> -----Original Message----- > From: Ben Greear [mailto:greearb@xxxxxxxxxxxxxxx] > Sent: Friday, July 08, 2011 1:19 PM > To: Myklebust, Trond > Cc: linux-nfs@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx > Subject: Re: [RFC] sunrpc: Fix race between work-queue and > rpc_killall_tasks. > > On 07/06/2011 04:45 PM, Trond Myklebust wrote: > > On Wed, 2011-07-06 at 15:49 -0700, greearb@xxxxxxxxxxxxxxx wrote: > >> From: Ben Greear<greearb@xxxxxxxxxxxxxxx> > >> > >> The rpc_killall_tasks logic is not locked against > >> the work-queue thread, but it still directly modifies > >> function pointers and data in the task objects. > >> > >> This patch changes the killall-tasks logic to set a flag > >> that tells the work-queue thread to terminate the task > >> instead of directly calling the terminate logic. > >> > >> Signed-off-by: Ben Greear<greearb@xxxxxxxxxxxxxxx> > >> --- > >> > >> NOTE: This needs review, as I am still struggling to understand > >> the rpc code, and it's quite possible this patch either doesn't > >> fully fix the problem or actually causes other issues. That said, > >> my nfs stress test seems to run a bit more stable with this patch > applied. > > > > Yes, but I don't see why you are adding a new flag, nor do I see why > we > > want to keep checking for that flag in the rpc_execute() loop. > > rpc_killall_tasks() is not a frequent operation that we want to > optimise > > for. > > > > How about the following instead? > > Ok, I looked at your patch closer. I think it can still cause > bad race conditions. > > For instance: > > Assume that tk_callback is NULL at beginning of while loop in > __rpc_execute, > and tk_action is rpc_exit_task. > > While do_action(task) is being called, tk_action is set to NULL in > rpc_exit_task. > > But, right after tk_action is set to NULL in rpc_exit_task, the > rpc_killall_tasks > method calls rpc_exit, which sets tk_action back to rpc_exit_task. > > I believe this could cause the xprt_release(task) logic to be called in > the > work-queue's execution of rpc_exit_task due to tk_action != NULL when > it should not be. Why would this be a problem? xprt_release() can certainly be called multiple times on an rpc_task. Ditto rpbc_getport_done. The only thing that is not re-entrant there is rpcb_map_release, which should only ever be called once whether or not something calls rpc_killall_tasks. > I have no hard evidence this exact scenario is happening in my case, > but I > believe the code is still racy with your patch. > > For that matter, is it safe to modify the flags in rpc_killall_tasks: > > rovr->tk_flags |= RPC_TASK_KILLED; > > Is that guaranteed to be atomic with any other modification of flags? Task->tk_flags should never change after the rpc_task is set up. The only allowed change is the RPC_TASK_KILLED. We could convert that into an atomic bit in task->tk_runstate, but again, this isn't something that is likely to be responsible for the problem you are seeing. ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥