On Tue, Dec 15, 2015 at 6:44 PM, NeilBrown <neilb@xxxxxxxx> wrote: > > Commit: c05eecf63610 ("SUNRPC: Don't allow low priority tasks to pre-empt higher priority ones") > > removed the 'fair scheduling' feature from SUNRPC priority queues. > This feature caused problems for some queues (send queue and session slot queue) > but is still needed for others, particularly the tcp slot queue. > > Without fairness, reads (priority 1) can starve background writes > (priority 0) so a streaming read can cause writeback to block > indefinitely. This is not easy to measure with default settings as > the current slot table size is much larger than the read-ahead size. > However if the slot-table size is reduced (seen when backporting to > older kernels with a limited size) the problem is easily demonstrated. > > This patch conditionally restores fair scheduling. It is now the > default unless rpc_sleep_on_priority() is called directly. Then the > queue switches to strict priority observance. > > As that function is called for both the send queue and the session > slot queue and not for any others, this has exactly the desired > effect. > > The "count" field that was removed by the previous patch is restored. > A value for '255' means "strict priority queuing, no fair queuing". > Any other value is a could of owners to be processed before switching > to a different priority level, just like before. > > Signed-off-by: NeilBrown <neilb@xxxxxxxx> > --- > > It is quite possible that you won't like the overloading of > rpc_sleep_on_priority() to disable fair-scheduling and would prefer an > extra arg to rpc_init_priority_wait_queue(). I can do it that way if > you like. > NeilBrown > > > include/linux/sunrpc/sched.h | 1 + > net/sunrpc/sched.c | 12 +++++++++--- > 2 files changed, 10 insertions(+), 3 deletions(-) > > diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h > index d703f0ef37d8..985efe8d7e26 100644 > --- a/include/linux/sunrpc/sched.h > +++ b/include/linux/sunrpc/sched.h > @@ -184,6 +184,7 @@ struct rpc_wait_queue { > pid_t owner; /* process id of last task serviced */ > unsigned char maxpriority; /* maximum priority (0 if queue is not a priority queue) */ > unsigned char priority; /* current priority */ > + unsigned char count; /* # task groups remaining to be serviced */ > unsigned char nr; /* # tasks remaining for cookie */ > unsigned short qlen; /* total # tasks waiting in queue */ > struct rpc_timer timer_list; > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c > index 73ad57a59989..e8fcd4f098bb 100644 > --- a/net/sunrpc/sched.c > +++ b/net/sunrpc/sched.c > @@ -117,6 +117,8 @@ static void rpc_set_waitqueue_priority(struct rpc_wait_queue *queue, int priorit > rpc_rotate_queue_owner(queue); > queue->priority = priority; > } > + if (queue->count != 255) > + queue->count = 1 << (priority * 2); > } > > static void rpc_set_waitqueue_owner(struct rpc_wait_queue *queue, pid_t pid) > @@ -144,8 +146,10 @@ static void __rpc_add_wait_queue_priority(struct rpc_wait_queue *queue, > INIT_LIST_HEAD(&task->u.tk_wait.links); > if (unlikely(queue_priority > queue->maxpriority)) > queue_priority = queue->maxpriority; > - if (queue_priority > queue->priority) > - rpc_set_waitqueue_priority(queue, queue_priority); > + if (queue->count == 255) { > + if (queue_priority > queue->priority) > + rpc_set_waitqueue_priority(queue, queue_priority); > + } > q = &queue->tasks[queue_priority]; > list_for_each_entry(t, q, u.tk_wait.list) { > if (t->tk_owner == task->tk_owner) { > @@ -401,6 +405,7 @@ void rpc_sleep_on_priority(struct rpc_wait_queue *q, struct rpc_task *task, > * Protect the queue operations. > */ > spin_lock_bh(&q->lock); > + q->count = 255; > __rpc_sleep_on_priority(q, task, action, priority - RPC_PRIORITY_LOW); > spin_unlock_bh(&q->lock); > } > @@ -478,7 +483,8 @@ static struct rpc_task *__rpc_find_next_queued_priority(struct rpc_wait_queue *q > /* > * Check if we need to switch queues. > */ > - goto new_owner; > + if (queue->count == 255 || --queue->count) > + goto new_owner; > } > > /* > Are we sure there is value in keeping FLUSH_LOWPRI for background writes? Cheers Trond -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html