On Fri, Apr 23, 2021 at 10:52:43AM -0700, Bart Van Assche wrote: > On 4/22/21 8:52 PM, Ming Lei wrote: > > For example, scsi aacraid normal completion vs. reset together with elevator > > switch, aacraid is one single queue HBA, and the request will be completed > > via IPI or softirq asynchronously, that said request isn't really completed > > after blk_mq_complete_request() returns. > > > > 1) interrupt comes, and request A is completed via blk_mq_complete_request() > > from aacraid's interrupt handler via ->scsi_done() > > > > 2) _aac_reset_adapter() comes because of reset event which can be > > triggered by sysfs store or whatever, irq is drained in > > _aac_reset_adpter(), so blk_mq_complete_request(request A) from aacraid irq > > context is done, but request A is just scheduled to be completed via IPI > > or softirq asynchronously, not really done yet. > > > > 3) scsi_host_complete_all_commands() is called from _aac_reset_adapter() for > > failing all pending requests. request A is still visible in > > scsi_host_complete_all_commands, because its tag isn't freed yet. But the > > tag & request A can be completed & freed exactly after scsi_host_complete_all_commands() > > reads ->rqs[bitnr] in bt_tags_iter(), which calls complete_all_cmds_iter() > > -> .scsi_done() -> blk_mq_complete_request(), and same request A is scheduled via > > IPI or softirq, and request A is addded in ipi or softirq list. > > > > 4) meantime request A is freed from normal completion triggered by interrupt, one > > pending elevator switch can move on since request A drops the last reference; and > > bt_tags_iter() returns from reset path, so blk_mq_wait_for_tag_iter() can return > > too, then the whole scheduler request pool is freed now. > > > > 5) request A in ipi/softirq list scheduled from _aac_reset_adapter is read , UAF > > is triggered. > > > > It is supposed that driver covers normal completion vs. error handling, but wrt. > > remove completion, not sure driver is capable of covering that. > > Hi Ming, > > I agree that the scenario above can happen and also that a fix is > needed. However, I do not agree that this should be fixed by modifying > the tag iteration functions. I see scsi_host_complete_all_commands() as > a workaround for broken storage controllers - storage controllers that > do not have a facility for terminating all pending commands. NVMe > controllers can be told to terminate all pending I/O commands by e.g. > deleting all I/O submission queues. Many SCSI controllers also provide a > facility for aborting all pending commands. For SCSI controllers that do > not provide such a facility I propose to fix races like the race > described above in the SCSI LLD instead of modifying the block layer tag > iteration functions. Hi Bart, Terminating all pending commands can't avoid the issue wrt. request UAF, so far blk_mq_tagset_wait_completed_request() is used for making sure that all pending requests are really aborted. However, blk_mq_wait_for_tag_iter() still may return before blk_mq_wait_for_tag_iter() is done because blk_mq_wait_for_tag_iter() supposes all request reference is just done inside bt_tags_iter(), especially .iter_rwsem and read rcu lock is added in bt_tags_iter(). What I really meant is that .iter_rwsem/read rcu added or blk_mq_wait_for_tag_iter can't do what is expected. Thanks, Ming