Re: [PATCH 1/7] SUNRPC: change service idle list to be an llist

On Sat, 19 Aug 2023, Chuck Lever wrote:
> On Fri, Aug 18, 2023 at 11:45:06AM +1000, NeilBrown wrote:
> > With an llist we don't need to take a lock to add a thread to the list,
> > though we still need a lock to remove it.  That will go in the next
> > patch.
> > 
> > Unlike double-linked lists, a thread cannot reliably remove itself from
> > the list.  Only the first thread can be removed, and that can change
> > asynchronously.  So some care is needed.
> > 
> > We already check if there is pending work to do, so we are unlikely to
> > add ourselves to the idle list and then want to remove ourselves again.
> > 
> > If we DO find something needs to be done after adding ourselves to the
> > list, we simply wake up the first thread on the list.  If that was us,
> > we successfully removed ourselves and can continue.  If it was some
> > other thread, they will do the work that needs to be done.  We can
> > safely sleep until woken.
> > 
> > We also remove the test on freezing() from rqst_should_sleep().  Instead
> > we set TASK_FREEZABLE before scheduling.  This makes it safe to
> > schedule() when a freeze is pending.  As we now loop waiting to be
> > removed from the idle queue, this is a cleaner way to handle freezing.
> > 
> > task_state_index() incorrectly identifies a task with
> >    TASK_IDLE | TASK_FREEZABLE
> > as 'D'.  So fix __task_state_index to ignore extra flags on TASK_IDLE
> > just as it ignores extra flags on other states.
> 
> Let's split the task_state_index() change into a separate patch
> that can be sent to the scheduler maintainers for their Review/Ack.
> 

I've sent that to the appropriate people.  It doesn't really matter if it
lands before or after we change sunrpc to use TASK_FREEZABLE.  The only
problem is that nfsd threads look like "D" in a ps listing rather than
"I".  The actual behaviour of the threads isn't changed.

> 
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > ---
> >  include/linux/sched.h      |  4 +--
> >  include/linux/sunrpc/svc.h | 14 ++++++-----
> >  net/sunrpc/svc.c           | 13 +++++-----
> >  net/sunrpc/svc_xprt.c      | 51 +++++++++++++++++++-------------------
> >  4 files changed, 42 insertions(+), 40 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 609bde814cb0..a5f3badcb629 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1666,7 +1666,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
> >  
> >  	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
> >  
> > -	if (tsk_state == TASK_IDLE)
> > +	if ((tsk_state & TASK_IDLE) == TASK_IDLE)
> >  		state = TASK_REPORT_IDLE;
> >  
> >  	/*
> > @@ -1674,7 +1674,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
> >  	 * to userspace, we can make this appear as if the task has gone through
> >  	 * a regular rt_mutex_lock() call.
> >  	 */
> > -	if (tsk_state == TASK_RTLOCK_WAIT)
> > +	if (tsk_state & TASK_RTLOCK_WAIT)
> >  		state = TASK_UNINTERRUPTIBLE;
> >  
> >  	return fls(state);
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 22b3018ebf62..5216f95411e3 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -37,7 +37,7 @@ struct svc_pool {
> >  	struct list_head	sp_sockets;	/* pending sockets */
> >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> >  	struct list_head	sp_all_threads;	/* all server threads */
> > -	struct list_head	sp_idle_threads; /* idle server threads */
> > +	struct llist_head	sp_idle_threads; /* idle server threads */
> >  
> >  	/* statistics on pool operation */
> >  	struct percpu_counter	sp_messages_arrived;
> > @@ -186,7 +186,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >   */
> >  struct svc_rqst {
> >  	struct list_head	rq_all;		/* all threads list */
> > -	struct list_head	rq_idle;	/* On the idle list */
> > +	struct llist_node	rq_idle;	/* On the idle list */
> >  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
> >  	struct svc_xprt *	rq_xprt;	/* transport ptr */
> >  
> > @@ -270,22 +270,24 @@ enum {
> >   * svc_thread_set_busy - mark a thread as busy
> >   * @rqstp: the thread which is now busy
> >   *
> > - * If rq_idle is "empty", the thread must be busy.
> > + * By convention a thread is busy if rq_idle.next points to rq_idle.
> > + * This ensures it is not on the idle list.
> >   */
> >  static inline void svc_thread_set_busy(struct svc_rqst *rqstp)
> >  {
> > -	INIT_LIST_HEAD(&rqstp->rq_idle);
> > +	rqstp->rq_idle.next = &rqstp->rq_idle;
> >  }
> 
> I don't understand the comment "This ensures it is not on the idle
> list." svc_thread_set_busy() is called in two places: The first
> when an svc_rqst is created, and once directly after an
> llist_del_first() has been done. @rqstp is already not on the
> idle list in either case.

"ensures" is the wrong word.  Maybe I meant "assures" or even "records".
I changed it to:
 * This will never be the case for threads on the idle list.

> 
> What really needs an explanation here is that there's no
> existing utility to check whether an llist_node is on a list or
> not.

I guess no-one found the need before...
Though I note that md/raid5.c has a bit flag, STRIPE_ON_RELEASE_LIST,
to track whether a stripe is on an llist.  So it could make use of
such a helper.

I would happily include a patch to add this to llist.h, except that by
the end of the series we don't need it any more...
Maybe I should do it anyway?
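
If it helps, here is a minimal sketch of what such helpers could look
like, built on the same self-pointing convention as
svc_thread_set_busy() (the names are hypothetical - nothing like them
exists in llist.h today):

	/* Mark a node as not being on any llist. */
	static inline void init_llist_node(struct llist_node *node)
	{
		node->next = node;
	}

	/* A node on an llist never points to itself, so nodes
	 * initialised as above can be tested reliably.
	 */
	static inline bool llist_on_list(const struct llist_node *node)
	{
		return node->next != node;
	}

Note that llist_del_first() would not re-initialise the removed node,
so the caller would need to call init_llist_node() on it - just as
svc_pool_wake_idle_thread() calls svc_thread_set_busy() after
llist_del_first() in this patch.
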
> 
> 
> >  /**
> >   * svc_thread_busy - check if a thread as busy
> >   * @rqstp: the thread which might be busy
> >   *
> > - * If rq_idle is "empty", the thread must be busy.
> > + * By convention a thread is busy if rq_idle.next points to rq_idle.
> > + * This ensures it is not on the idle list.
> 
> This function doesn't modify the thread, so it can't ensure it
> is not on the idle list.

I changed that to:

 * This will never be the case for threads on the idle list.

too.

> 
> 
> >   */
> >  static inline bool svc_thread_busy(struct svc_rqst *rqstp)
> 
> const struct svc_rqst *rqstp

ack.

> 
> >  {
> > -	return list_empty(&rqstp->rq_idle);
> > +	return rqstp->rq_idle.next == &rqstp->rq_idle;
> >  }
> >  
> >  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index 051f08c48418..addbf28ea50a 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -510,7 +510,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >  		pool->sp_id = i;
> >  		INIT_LIST_HEAD(&pool->sp_sockets);
> >  		INIT_LIST_HEAD(&pool->sp_all_threads);
> > -		INIT_LIST_HEAD(&pool->sp_idle_threads);
> > +		init_llist_head(&pool->sp_idle_threads);
> >  		spin_lock_init(&pool->sp_lock);
> >  
> >  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > @@ -701,15 +701,16 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  void svc_pool_wake_idle_thread(struct svc_pool *pool)
> >  {
> >  	struct svc_rqst	*rqstp;
> > +	struct llist_node *ln;
> >  
> >  	rcu_read_lock();
> >  	spin_lock_bh(&pool->sp_lock);
> > -	rqstp = list_first_entry_or_null(&pool->sp_idle_threads,
> > -					 struct svc_rqst, rq_idle);
> > -	if (rqstp)
> > -		list_del_init(&rqstp->rq_idle);
> > +	ln = llist_del_first(&pool->sp_idle_threads);
> >  	spin_unlock_bh(&pool->sp_lock);
> > -	if (rqstp) {
> > +	if (ln) {
> > +		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
> > +		svc_thread_set_busy(rqstp);
> > +
> >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> >  		wake_up_process(rqstp->rq_task);
> >  		rcu_read_unlock();
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index fa0d854a5596..81327001e074 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -715,10 +715,6 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >  	if (svc_thread_should_stop(rqstp))
> >  		return false;
> >  
> > -	/* are we freezing? */
> > -	if (freezing(current))
> > -		return false;
> > -
> >  #if defined(CONFIG_SUNRPC_BACKCHANNEL)
> >  	if (svc_is_backchannel(rqstp)) {
> >  		if (!list_empty(&rqstp->rq_server->sv_cb_list))
> > @@ -734,30 +730,32 @@ static void svc_rqst_wait_for_work(struct svc_rqst *rqstp)
> >  	struct svc_pool *pool = rqstp->rq_pool;
> >  
> >  	if (rqst_should_sleep(rqstp)) {
> > -		set_current_state(TASK_IDLE);
> > -		spin_lock_bh(&pool->sp_lock);
> > -		list_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> > -		spin_unlock_bh(&pool->sp_lock);
> > +		set_current_state(TASK_IDLE | TASK_FREEZABLE);
> 
> I'm trying to learn about the semantics of IDLE|FREEZABLE, but there
> isn't another instance of this flag combination in the kernel.

No - which is why no one else noticed that the state reported by ps
was wrong.
Quite a few places use INTERRUPTIBLE|FREEZABLE, which is much the same
thing except that sunrpc threads don't want to be woken by signals.

My interpretation of FREEZABLE is that it avoids the need to
explicitly check for freezing, or to call try_to_freeze() - it is a
higher-level interface.
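
A typical INTERRUPTIBLE|FREEZABLE wait looks something like this (a
generic sketch of the pattern, not taken from any particular caller;
work_available() is a stand-in for the real condition):

	set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
	if (!work_available())
		/* FREEZABLE means the freezer can deal with this
		 * task while it sleeps; no explicit try_to_freeze()
		 * is needed around the schedule().
		 */
		schedule();
	__set_current_state(TASK_RUNNING);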

> 
> 
> > +		llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> > +
> > +		if (unlikely(!rqst_should_sleep(rqstp)))
> > +			/* maybe there were no idle threads when some work
> > +			 * became ready and so nothing was woken.  We've just
> > +			 * become idle so someone can to the work - maybe us.
> > +			 * But we cannot reliably remove ourselves from the
> > +			 * idle list - we can only remove the first task which
> > +			 * might be us, and might not.
> > +			 * So remove and wake it, then schedule().  If it was
> > +			 * us, we won't sleep.  If it is some other thread, they
> > +			 * will do the work.
> > +			 */
> 
> Large block comments suggest that this code doesn't document itself
> very well.

Self-documenting code is best, but when the logic is subtle and
unfamiliar, a bit of text can help.
Of course the logic is quite familiar to me now and doesn't seem so
subtle, so I wonder if any comment is needed.
> 
> At least, some of the textual redundancy can be removed, though this
> comment and the next are removed or replaced in later patches. I'll
> leave it up to you to find an approach that is a bit cleaner.

Is now:
	/* Work just became available.  This thread cannot simply
	 * choose not to sleep as it *must* wait until removed.
	 * So wake the first waiter - whether it is this
	 * thread or some other, it will get the work done.
	 */


> 
> 
> > +			svc_pool_wake_idle_thread(pool);
> >  
> > -		/* Need to check should_sleep() again after
> > -		 * setting task state in case a wakeup happened
> > -		 * between testing and setting.
> > +		/* We mustn't continue while on the idle list, and we
> > +		 * cannot remove outselves reliably.  The only "work"
> > +		 * we can do while on the idle list is to freeze.
> > +		 * So loop until someone removes us
> >  		 */
> 
> Here and above, the use of "we" obscures the explanation. What is
> "we" in this context? I think it might be "this thread" or @rqstp,
> but I can't be certain. For instance:
> 
> 		/* Since a thread cannot remove itself from an llist,
> 		 * schedule until someone else removes @rqstp from
> 		 * the idle list.
> 		 */

Looks good - thanks.

> 
> 
> > -		if (rqst_should_sleep(rqstp)) {
> > +		while (!svc_thread_busy(rqstp)) {
> 
> I need to convince myself that this while() can't possibly result in
> an infinite loop.
> 

It cannot be a busy-loop because we set state and call schedule().
It could loop for a long time if there is no work to do.
But at service shutdown, every thread will be removed from the idle
list and woken, so the loop will exit then.
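
The pairing that makes the woken case reliable is in
svc_pool_wake_idle_thread() (earlier in this patch): the waker marks
the thread busy before calling wake_up_process(), so a thread woken
from this loop always sees svc_thread_busy() return true.  Condensed:

	ln = llist_del_first(&pool->sp_idle_threads);
	if (ln) {
		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
		svc_thread_set_busy(rqstp); /* rq_idle.next = &rq_idle */
		wake_up_process(rqstp->rq_task);
	}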

> 
> >  			schedule();
> > -		} else {
> > -			__set_current_state(TASK_RUNNING);
> > -			cond_resched();
> > -		}
> > -
> > -		/* We *must* be removed from the list before we can continue.
> > -		 * If we were woken, this is already done
> > -		 */
> > -		if (!svc_thread_busy(rqstp)) {
> > -			spin_lock_bh(&pool->sp_lock);
> > -			list_del_init(&rqstp->rq_idle);
> > -			spin_unlock_bh(&pool->sp_lock);
> > +			set_current_state(TASK_IDLE | TASK_FREEZABLE);
> >  		}
> > +		__set_current_state(TASK_RUNNING);
> > 
> >  	} else {
> >  		cond_resched();
> >  	}
> 
> There's a try_to_freeze() call just after this hunk. Is there a
> reason to invoke cond_resched(); before freezing?

try_to_freeze() can compile to {return false;}.  Even when it doesn't,
it usually does no more than test a global variable.  So the presence
of try_to_freeze() has no bearing on the use of cond_resched().

> 
> 
> > @@ -870,9 +868,10 @@ void svc_recv(struct svc_rqst *rqstp)
> >  		struct svc_xprt *xprt = rqstp->rq_xprt;
> >  
> >  		/* Normally we will wait up to 5 seconds for any required
> > -		 * cache information to be provided.
> > +		 * cache information to be provided.  When there are no
> > +		 * idle threads, we reduce the wait time.
> >  		 */
> > -		if (!list_empty(&pool->sp_idle_threads))
> > +		if (pool->sp_idle_threads.first)
> >  			rqstp->rq_chandle.thread_wait = 5 * HZ;
> >  		else
> >  			rqstp->rq_chandle.thread_wait = 1 * HZ;
> > -- 
> > 2.40.1
> > 

I'll send a new batch today, I hope.  I decided to move lwq into lib/
and make all the changes to llist in separate patches.

Thanks,
NeilBrown



