On Thu, Feb 21, 2019 at 4:55 PM Jason Yan <yanaijie@xxxxxxxxxx> wrote: > > Hi, Christoph > > On 2019/2/20 23:18, Christoph Hellwig wrote: > > [fullquote removed, please follow proper mail etiquette] > > > > On Tue, Feb 19, 2019 at 08:56:28AM -0800, Bart Van Assche wrote: > >> regression in the SCSI sd driver due to the switch from the legacy block > >> layer to scsi-mq. The above patch introduces two atomic operations in the > >> hot path and hence would introduce a performance regression. I think this > >> can be avoided by making sure that sd_uninit_command() gets called before > >> the request tag is freed. What changes would be required to make the block > >> layer core call sd_uninit_command() before the request tag is freed? Would > >> introducing prep_rq_fn and unprep_rq_fn callbacks in struct blk_mq_ops and > >> making sure that the SCSI core sets these callback function pointers > >> appropriately be sufficient? Would such a change allow to simplify the NVMe > >> initiator driver? Are there any alternatives to this approach that are more > >> elegant? > > > > Additional indirect calls in the I/O fast path is something I'd rather > > avoid. But I don't fully understand the problem yet - where do > > we release a disk reference from blk_update_request? > > When userspace close the fd after blk_update_request() and before > scsi_mq_uninit_cmd(), a disk reference will be released. It is not the > blk_update_request() directly released it. > > close > ->sd_release > ->scsi_disk_put > ->scsi_disk_release > ->disk->private_data = NULL; > > The userspace can close the fd because blk_update_request() returned the > last IO , the userspace application does not have to stuck on read() or > write(). The window is very small, but it can be reproduce every day > in our testcases. So I'm very curious why. One possible explanation is > that we enabled kernel preempt(CONFIG_PREEMPT). Another solution is to drain in-flight FS IO in scsi_disk_release(), and one counter is needed for tracking in-flight passthrough IO, so we can use sdev->device_busy - sdev->passthrough_ios to drain inflight FS IO. > > And why can't > > we move that release to __blk_mq_end_request? I think it is doable, then ending bio needs to be moved out of blk_update_request(), such as, add one list of rq->done_bio to track completed bio, then complete all in free request. And for partial completion, the completed bio can be done in blk_update_request(), since the remained bios will cause fs to hold the disk. Thanks, Ming Lei