Re: [ISSUE] The time cost of IOSQE_IO_LINK

Oleg Nesterov <oleg@xxxxxxxxxx> · Tue, 18 Feb 2020 15:56:46 +0100

On 02/18, Peter Zijlstra wrote:
>
> But this has me wondering about task_work_run(), as it is it will
> unconditionally take pi_lock,

because spin_unlock_wait() was removed ;) task_work_run() doesn't
really need to take pi_lock at all. 

> would not something like this make sense?

I think yes, but see below.

> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -93,16 +93,20 @@ void task_work_run(void)
>  	struct callback_head *work, *head, *next;
>
>  	for (;;) {
> +		work = READ_ONCE(task->task_work);
> +		if (!work)
> +			break

This is wrong if PF_EXITING is set, in this case we must set
task->task_works = work_exited.

> +
>  		/*
>  		 * work->func() can do task_work_add(), do not set
>  		 * work_exited unless the list is empty.
>  		 */
>  		raw_spin_lock_irq(&task->pi_lock);
>  		do {
> -			work = READ_ONCE(task->task_works);
> -			head = !work && (task->flags & PF_EXITING) ?
> -				&work_exited : NULL;
> -		} while (cmpxchg(&task->task_works, work, head) != work);
> +			head = NULL;
> +			if (unlikely(!work && (task->flags & PF_EXITING)))
> +				head = &work_exited;
> +		} while (!try_cmpxchg(&task->task_works, &work, head));
>  		raw_spin_unlock_irq(&task->pi_lock);
>
>  		if (!work)

otherwise I think this is correct, but how about the patch below?
Then this code can be changed to use try_cmpxchg().

Oleg.

--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -97,17 +97,24 @@ void task_work_run(void)
 		 * work->func() can do task_work_add(), do not set
 		 * work_exited unless the list is empty.
 		 */
-		raw_spin_lock_irq(&task->pi_lock);
 		do {
+			head = NULL;
 			work = READ_ONCE(task->task_works);
-			head = !work && (task->flags & PF_EXITING) ?
-				&work_exited : NULL;
+			if (!work) {
+				if (task->flags & PF_EXITING)
+					head = &work_exited;
+				else
+					break;
+			}
 		} while (cmpxchg(&task->task_works, work, head) != work);
-		raw_spin_unlock_irq(&task->pi_lock);
 
 		if (!work)
 			break;
 
+		// Synchronize with task_work_cancel()
+		raw_spin_lock_irq(&task->pi_lock);
+		raw_spin_unlock_irq(&task->pi_lock);
+
 		do {
 			next = work->next;
 			work->func(work);