On 2022/8/29 16:11, Ming Lei wrote: > On Mon, Aug 29, 2022 at 02:13:12PM +0800, Ziyang Zhang wrote: >> On 2022/8/29 13:40, Ming Lei wrote: >>> On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote: >>>> If one rq is handled by io_uring_cmd_complete_in_task(), after a crash >>>> this rq is actually handled by an io_uring fallback wq. We have to >>>> end(abort) this rq since this fallback wq is a task other than the >>>> crashed task. However, current code does not call io_uring_cmd_done() >>>> at the same time but do it in ublk_cancel_queue(). With current design, >>>> this does work because ublk_cancel_queue() is called AFTER del_gendisk(), >>>> which waits for the rq ended(aborted) in fallback wq. This implies that >>>> fallback wq on this rq is scheduled BEFORE calling io_uring_cmd_done() >>>> on the corresponding ioucmd in ublk_cancel_queue(). >>> >>> Right. [1] >>> >>>> >>>> However, while considering recovery feature, we cannot rely on >>>> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all >>>> rqs because we may not want any aborted rq. Besides, io_uring does not >>>> provide "flush fallback" machenism so we cannot trace this ioucmd. >>> >>> Why not? >>> >>> If user recovery is enabled, del_gendisk() can be replaced with >>> blk_mq_quiesce_queue(), then let abort work function do: >>> >>> - cancel all in-flight requests by holding them into requeue list >>> instead of finishing them as before, and this way is safe because >>> abort worker does know the ubq daemon is dying >>> - cancel pending commands as before, because the situation is same >>> with disk deleted or queue frozen >> >> The problem is: we cannot control when fallback wq is scheduled. >> So we are unsafe to call io_uring_cmd_done() in another process. > > What is the other process? Hi Ming. Actually patch 1-5 are all preparations for patch 6-9. So I suggest you may read patch 6-9 at the same time if you are confused on why I say there is a 'problem'. In current ublk_drv we really do not need to consider it(As I have explained in [1] and you replied 'Right'). Answer your question now: in the patchset, it is START_RECOVERY process, which calls ublk_recover_rq(). Please see patch 8. > > It can't be fallback wq since any ublk request is aborted at the beginning > of __ublk_rq_task_work(). With recovery feature enabled, we cannot abort the rq, but just let __ublk_rq_task_work() return. We will requeue the rq soon. > > It shouldn't be the process calling ublk_cancel_dev(), since it is > safe to call io_uring_cmd_done() if ubq->nr_io_ready > 0. > > Or others? It is START_RECOVERY process(or the 'abort_work' you proposed). It is whatever that calls io_uring_cmd_done() on the old ioucmd belonging to the dying ubq_daemon. > >> Otherwise, there is a UAF, just as >> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a >> re-issued request aborted previously to ioucmd's task_work). > > As I mentioned, del_gendisk() can be replaced with > blk_mq_quiesce_queue() in case of user recovery, then no any new > request can be queued after blk_mq_quiesce_queue() returns. For this, +1. > >> >> Yeah I know the answer is very simple: flush the fallback wq. >> But here are two more questions: > > I don't see why we need to flush fallback wq, care to provide some > details? Anyway, here is a case: (1) Assume there is only one tag, only one ubq. (2) The ublk_io is ACTIVE, which means an ioucmd(cmd_1) is sent from ublksrv and ublk_drv has not completed it yet(no io_uring_cmd_done() is called). (3) New rq is coming, and ublk_queue_rq() is called. (4) The ublksrv process crashes(PF_EXITING). (5) io_uring_cmd_complete_in_task(cmd_1) is called in ublk_queue_rq(), and cmd_1 is put into a fallback_list. (6) We want to re-attach a new process and assing a new ioucmd(cmd_2) to ublk_io. (7) We try to complete the old cmd_1 now by io_uring_cmd_done(cmd_1)... (8) Shortly after (7), io_uring exit work is scheduled and it finds that no inflight iocumd exists, so it starts to release some resources (9) Shortly after (8), in fallback wq, a null-deref on cmd_1 or ctx->refs may happen, just like (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a re-issued request aborted previously to ioucmd's task_work). If we flush fallback wq before (7), then everything is OKAY. Why this happens? The root cause is that we do not ALWAYS complete io_uring cmd in ublk_rq_task_work_cb(). The commit: "ublk_drv: do not add a re-issued request aborted previously to ioucmd's task_work" does fix a problem. But I think we really need to refactor ublk_rq_task_work_cb() which focuses on old ioucmd. > >> >> (1) Should ublk_drv rely on the fallback wq machenism? >> IMO, ublk_drv should not know detail of io_uring_cmd_complete_in_task() >> because its implementation may change in the future. >> BTW, I think current ublk_rq_task_work_cb() is not correct because >> it does not always call io_uring_cmd_done() before returning. >> nvme_uring_cmd_end_io() always calls io_uring_cmd_done() for each ioucmd >> no matter the rq succeeds or fails. >> >> (2) Suppose io_uring does export the symbol 'flush_fallback_work', should we call >> it before starting a new process(recovery)? >> What if fallback wq is not scheduled immediately if there are many processes >> running and the system overhead is heavy. In this case the recovery process >> may wait for too long. Really we should not depend on fallback wq and please >> let the fallback wq complete the ioucmd itself. >> >>> >>> With this way, the current abort logic won't be changed much. >>> >>> And user recovery should only be started _after_ ublk device is found >>> as aborted. >> >> START_RECOVERY will check if all ubq_daemons(the process) are PF_EXITING. > > That is different. If START_RECOVERY is only run on aborted device, the > recovery handler could be simplified. But stop_dev could become complicated since with recovery enabled, stop_dev has to do different things. Please see patch 5. Really we do not have to do anything after the crash until user sends START_RECOVERY or STOP_DEV. Regards, Zhang > >> >>> >>>> >>>> The recovery machenism needs to complete all ioucmds of a dying ubq >>>> to avoid leaking io_uring ctx. But as talked above, we are unsafe >>>> to call io_uring_cmd_done() in the recovery task if fallback wq happens >>>> to run simultaneously. This is a UAF case because io_uring ctx may be >>>> freed. Actually a similar case happens in >>>> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a >>>> re-issued request aborted previously to ioucmd's task_work). >>> >>> If you take the above approach, I guess there isn't such problem because >>> abort can handle the case well as before. >> >> Ming, we did think this approach(quiesce, requeue rq/complete ioucmd) >> at the very beginning. But we decided to drop it because we don not want >> rely on 'flush fallback wq' machenism, which >> makes ublk_drv rely on io_uring's internal implementation. > > Then the focus is 'flush fallback wq', please see my above question. > > > Thanks, > Ming