On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote:
> On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote:
> > Currently, ublk_drv associates to each hardware queue (hctx) a unique
> > task (called the queue's ubq_daemon) which is allowed to issue
> > COMMIT_AND_FETCH commands against the hctx. If any other task attempts
> > to do so, the command fails immediately with EINVAL. When considered
> > together with the block layer architecture, the result is that for
> > each CPU C on the system, there is a unique ublk server thread which
> > is allowed to handle I/O submitted on CPU C. This can lead to
> > suboptimal performance under imbalanced load generation. For an
> > extreme example, suppose all the load is generated on CPUs mapping to
> > a single ublk server thread. Then that thread may be fully utilized
> > and become the bottleneck in the system, while other ublk server
> > threads are totally idle.
>
> I am wondering why the problem can't be avoided by setting the ublk
> server's thread affinity manually.

I don't think the ublk server thread CPU affinity has any effect here.
Assuming that the ublk server threads do not pass I/Os between
themselves to balance the load, each ublk server thread must handle all
the I/O issued to its associated hctx, and each thread is limited by how
much CPU time it can get. Since a thread is the unit of parallelism, one
thread can use at most one CPU regardless of its affinity, and that
single CPU can become the bottleneck.

> > be balanced across all ublk server threads by having the threads
> > fetch I/Os for the same QID in a round robin manner. For example, in
> > a system with 4 ublk server threads, 2 hctxs, and a queue depth of 4,
> > the threads could issue fetch requests as follows (where each entry
> > is of the form qid,tag):
> >
> > poller thread:  T0    T1    T2    T3
> >                 0,0   0,1   0,2   0,3
> >                 1,3   1,0   1,1   1,2
>
> How many ublk devices are there? If it is 1, just wondering why you
> use 4 threads? Usually one thread is enough to drive one queue, and
> the actual io command handling can be moved to a new worker thread if
> the queue thread is saturated.

This is just a small example to demonstrate the idea, not necessarily a
realistic one.

> > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq,
> > -					unsigned issue_flags)
> > -{
> > -	struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds);
> > -	struct ublk_rq_data *data, *tmp;
> > -
> > -	io_cmds = llist_reverse_order(io_cmds);
> > -	llist_for_each_entry_safe(data, tmp, io_cmds, node)
> > -		__ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags);
> > -}
> > -
> > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags)
> > -{
> > -	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > -	struct ublk_queue *ubq = pdu->ubq;
> > -
> > -	ublk_forward_io_cmds(ubq, issue_flags);
> > -}
> > -
> >  static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> >  {
> > -	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> > -
> > -	if (llist_add(&data->node, &ubq->io_cmds)) {
> > -		struct ublk_io *io = &ubq->ios[rq->tag];
> > -
> > -		io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb);
> > -	}
> > +	struct ublk_io *io = &ubq->ios[rq->tag];
> > +	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd);
> > +
> > +	pdu->req = rq;
> > +	io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work);
> >  }
>
> It should be fine to convert to io_uring_cmd_complete_in_task() since
> the callback list is re-ordered in io_uring.

Yes, I noticed that task_work has (lockless) internal queueing, so there
shouldn't be a need to maintain our own queue of commands in ublk_drv. I
can factor this change out into its own patch if that is useful.
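
On the callback side, the diff above implies __ublk_rq_task_work's
signature changes to match the callback type that
io_uring_cmd_complete_in_task() expects. Something like the following
untested sketch, where ublk_dispatch_req() is just a placeholder name
for the existing per-request handling:

static void __ublk_rq_task_work(struct io_uring_cmd *cmd,
				unsigned int issue_flags)
{
	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
	struct request *req = pdu->req;

	/* one request per task_work callback; task_work's internal
	 * ordering replaces the old llist_reverse_order() pass */
	ublk_dispatch_req(req, issue_flags);
}
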
> >  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> >  {
> >  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > +	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> >  	unsigned int nr_inflight = 0;
> >  	int i;
> >
> >  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
> > -		if (!ubq->timeout) {
> > -			send_sig(SIGKILL, ubq->ubq_daemon, 0);
> > -			ubq->timeout = true;
> > -		}
> > -
> > +		send_sig(SIGKILL, data->task, 0);
> >  		return BLK_EH_DONE;
> >  	}
> >
> > -	if (!ubq_daemon_is_dying(ubq))
> > +	if (!(data->task->flags & PF_EXITING))
> >  		return BLK_EH_RESET_TIMER;
>
> ->task is only for error handling, but it may not work any more since
> who knows which task is for handling the io command actually.

Yes, you are right - this part is the only reason we need to save/take
a reference to the task. I have a couple of alternative ideas:

1. Don't kill anything if a timeout happens. Instead, synchronize
   against the "normal" completion path (i.e. COMMIT_AND_FETCH): if the
   timeout happens first, normal completion gets an error; if normal
   completion happens first, the timeout does nothing. A rough sketch
   is below.
2. Require that all threads handling I/O are threads of the same
   process (in the kernel, I think this means their
   task_struct::group_leader is the same?). In the normal completion
   path, we replace the check that exists today (equality with
   ubq_daemon) with a check that the current task belongs to that
   process. In the timeout path, we send SIGKILL to the top-level
   process, which should propagate to its threads as well. A sketch of
   this check follows the one for option 1.

Does either of those sound okay?
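
For option 1, the kind of thing I have in mind (illustrative only - the
per-io state field, say an unsigned int added to struct ublk_io, and
its values don't exist today):

enum {
	UBLK_IO_INFLIGHT,
	UBLK_IO_TIMED_OUT,
	UBLK_IO_COMMITTED,
};

static enum blk_eh_timer_return ublk_timeout(struct request *rq)
{
	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
	struct ublk_io *io = &ubq->ios[rq->tag];

	/* whichever path moves the io out of INFLIGHT first owns the
	 * completion of rq */
	if (cmpxchg(&io->state, UBLK_IO_INFLIGHT, UBLK_IO_TIMED_OUT) ==
	    UBLK_IO_INFLIGHT)
		blk_mq_end_request(rq, BLK_STS_TIMEOUT);
	return BLK_EH_DONE;
}

The COMMIT_AND_FETCH path would do the mirror-image transition
(INFLIGHT -> COMMITTED) and return an error if it finds TIMED_OUT
instead, so no task ever needs to be signaled.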
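
And for option 2 (again illustrative - ub->server_leader is a made-up
field that would be recorded from the first task to fetch an I/O):

/* accept COMMIT_AND_FETCH from any thread in the server process */
static inline bool ublk_same_process(const struct ublk_device *ub)
{
	return current->group_leader == ub->server_leader;
}

/* normal completion path: replaces the ubq_daemon equality check */
if (!ublk_same_process(ub))
	return -EINVAL;

/* timeout path: SIGKILL is fatal to the whole thread group, so
 * signaling the group leader takes down all the server threads */
send_sig(SIGKILL, ub->server_leader, 0);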