> -----Original Message----- > From: Mike Snitzer [mailto:snitzer@xxxxxxxxxx] > Sent: Tuesday, October 29, 2013 7:57 PM > To: Merla, ShivaKrishna > Cc: Hannes Reinecke; dm-devel@xxxxxxxxxx; agk@xxxxxxxxxx; Mikulas > Patocka > Subject: Re: [PATCH]dm-mpath: fix for race condition between > multipath_dtr and pg_init_done. > > On Thu, Oct 17 2013 at 6:03pm -0400, > Merla, ShivaKrishna <ShivaKrishna.Merla@xxxxxxxxxx> wrote: > > > > From: Mike Snitzer [mailto:snitzer@xxxxxxxxxx] > > > Sent: Thursday, October 17, 2013 4:10 PM > > > To: Hannes Reinecke > > > Cc: Merla, ShivaKrishna; dm-devel@xxxxxxxxxx; agk@xxxxxxxxxx; > Mikulas > > > Patocka > > > Subject: Re: [PATCH]dm-mpath: fix for race condition between > > > multipath_dtr and pg_init_done. > > > > > > On Thu, Oct 17 2013 at 5:47pm -0400, > > > Hannes Reinecke <hare@xxxxxxx> wrote: > > > > > > > On 10/17/2013 08:53 PM, Mike Snitzer wrote: > > > > >Thanks for reporting this. Much appreciated. More comments below. > > > > > > > > > >On Thu, Oct 17 2013 at 1:31pm -0400, > > > > >Merla, ShivaKrishna <ShivaKrishna.Merla@xxxxxxxxxx> wrote: > > > > > > > > > >>Whenever multipath_dtr is happening, we should prevent queueing > any > > > further path > > > > >>activation work. There was a kernel panic where after pg_init_done() > > > decrements > > > > >>pg_init_in_progress to 0, wait_for_pg_init_completion call assumes > > > there are no > > > > >>more pending path management commands. But if pg_init_required > is > > > set by > > > > >>pg_init_done call due to retriable mode_select errors , then > > > process_queued_ios() > > > > >>will again queue the path activation work. If free_multipath call has > been > > > > >>completed by the time activate_path work is called, kernel panic was > > > seen on > > > > >>accessing multipath members. > > > > > > > > > >Your locking looks suspect to me, see comment inlined below > > > multipath_dtr > > > > > > > > > >But shouldn't we just train multipath_wait_for_pg_init_completion() > to > > > > >look at m->pg_init_required? Have it wait for both pg_init_required > and > > > > >pg_init_in_progress to be zero? We'd also have to audit that > > > > >pg_init_required cannot be set while pg_init_in_progress. > > > > > > > > > Hmm. > > > > > > > > We _could_ try to resolve it by pushing I/O back onto the request > queue > > > > (cf my earlier post 'requeue I/O during pg_init'). > > > > > > > > I was hoping to excite some comments with that, but seems to be my > > > > fate nowadays to send out patches with no reply. > > > > > > patchwork caught it: > > > https://patchwork.kernel.org/patch/2969111/ > > > > > > I've just been distracted with other stuff the past week; but I'll be > > > looking closer at this issue (and your earlier patch) shortly and we'll > > > get a fix queued for 3.13. > > > > > > > Anyway, maybe this will be giving it some more attention. > > > > It definitely would avoid this problem, by virtue of not having to > > > > queue I/O internally during pg_init, so we could easily tear down > > > > the queue. > > > > > > Sounds good. > > > > Thanks for your comments. I agree we should lock while setting > dtr_in_progress, I think I overlooked it as its handled in process_queued_ios > as well. > > We looked into handling this in wait_for_pg_init_completion() but checking > for pg_init_required here will not help as well ( until we prevent setting > pg_init_required while pg_init_in_progress is set ). > > Here due to SCSI_DH_RETRY on mode_select, pg_init_done will set the > pg_init_required as activation needs to be retried under normal > > circumstances. But it completely differs when multipath target is being > destroyed. Apparently I didn't see any pending_ios in our test > > while this is happening. Just path activations are held up since controller > was returning 5/91/36 CC's. With this condition either one of > pg_init_required or pg_init_in_progress flags are set all the time. > > Hannes patch will take care of preventing queueing of IO's when > pg_init_in_progress is set, but currently running activation commands will > not return until controller returns SUCCESS on mode_select. > > So... do you have an updated patch with proper locking that takes > Hannes' patch into consideration? > > I'll be reviewing hannes patch closer (for v3.13) tomorrow so if you'd > like your issue resolved in the near-term I'd appreciate us getting some > closer on proposed solutions. > > Thanks, > Mike I have re-submitted my patch with locking in multipath_dtr. I revisited Hannes patch and I think we still have this issue when multipath is destroyed and path activation work is queued. This is with the scenario of scsi_dh_rdac returning SCSI_DH_RETRY on mode_select due to 5/91/36 CC's. -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel